STXXL: Standard Template Library for XXL Data Sets

Roman Dementiev

Institute for Theoretical Computer Science, Algorithmics II University of Karlsruhe, Germany

March 19, 2007

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 1/43 Outline

1 Introduction

2 Experimental Parallel Disk System

3 The STXXL Library

4 Engineering Algorithms for Large Graphs

5 Engineering Large Suffix Array Construction

6 Porting Algorithms to External Memory

7 Summary

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 2/43 Large Data Sets

Where they come from Geographic information systems: GoogleEarth, NASA’s World Wind Computer graphics: visualize huge scenes Billing systems: phone calls, traffic Analyze huge networks: Internet, phone call graph Text collections: , , etc.

How to process them Buy a TByte main memory? expensive or impossible Here: how to process very large data sets cost-efficiently

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 3/43 Large Data Sets

Where they come from Geographic information systems: GoogleEarth, NASA’s World Wind Computer graphics: visualize huge scenes Billing systems: phone calls, traffic Analyze huge networks: Internet, phone call graph Text collections: , , etc.

How to process them

Buy a TByte main memory? expensive or impossible Here: how to process very large data sets cost-efficiently

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 3/43 RAM Model vs. Real Computer

A straightforward solution Use small main memory

keep data on cheap disks, “unlimited” virtual memory O(1) registers Theory: should work (von Neumann (RAM) model) ALU 1 word = O(log n) bits I Unit cost memory access freely programmable large memory I No locality of reference Practice: terrible performance 6 I Random hard disk accesses are 10 slower than main memory accesses I Strong locality of reference

⇒ I/O is the bottleneck

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 4/43 The Parallel Disk Model (PDM)

[Vitter&Shriver’1994]

Parameters

 Input size N    CPU  Memory size M  N  Block size B    The number of disks  M   Performance DB Minimize the number of I/O steps In an I/O step try transfer D blocks 1 2... D Minimize the number of CPU instructions

I/O-efficient algorithms ≡ external memory algorithms

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 5/43 Engineering Parallel Disk Systems

2x Xeon 4 Threads 400x64 Mb/s Intel 1 GB E7500 DDR Chipset 128 RAM 2x64x66 Mb/s PCI−Busses 4x2x100 MB/s Controller Channels 8x45 8x80 MB/s GB

Challenges Cheap case for ≥ 8 hard disks Many fast PCI slots for ATA controllers (no bus bottlenecks) Wide Parallel ATA cables worsen airflow (later system use Serial ATA) File system scalability: very large files ⇒ 375 MB/s (≈ 98% of the peak) for about 3000 Euro in 2002 ⇒ Other systems: 10 disks = 640 MB/s, 4 disks = 214 MB/s

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 6/43 The STXXL Library

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 7/43 I/O-Efficient Software Libraries

Advantages Abstract away the technical details of I/O Provide implementation of basic I/O-eff. algorithms and data structures ⇒ algorithm engineering

Existing Libraries TPIE: many (geometric) search data structures LEDA-SM: extension of LEDA (discontinued) + Good demonstrations of the external memory concepts – Do not implement many features that speed up I/O-efficient algorithms

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 8/43 I/O-Efficient Software Libraries

Advantages Abstract away the technical details of I/O Provide implementation of basic I/O-eff. algorithms and data structures ⇒ Boost algorithm engineering

Existing Libraries TPIE: many (geometric) search data structures LEDA-SM: extension of LEDA (discontinued) + Good demonstrations of the external memory concepts – Do not implement many features that speed up I/O-efficient algorithms

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 8/43 The STXXL Library

STL – ++ Standard Template Library, implements basic containers (maps, sets, priority queues, etc.) and algorithms (quicksort, mergesort, selection, etc.) STXXL : Standard Template Library for XXL Data Sets http://stxxl.sourceforge.net containers and algorithms that can process huge volumes of data that only fit on disks (I/O-efficient)

I Compatible with STL I Performance–oriented

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 9/43 STXXL Features

Dementiev, Kettner, Sanders STXXL: The Standard Template Library for Extra Large Data Sets. ESA 2005, 13th Annual European Symposium on Algorithms

    CPU  

    M   Transparent parallel disk support DB 1 2... D                     Handles very large problems (up to petabytes)              SORT SCAN SCAN    Pipelining saves many I/Os    Explicitly overlaps I/O and computation Avoids superfluous copying

I in OS I/O subsystem and the library itself Compatible with STL – C++ Standard Template Library

I Short development times I Reuse of STL code (e.g. selection alg.)

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 10/43 STXXL Design

   Applications           STL−user layer Streaming layer   vector, stack, set  Containers:  priority_queue, map Pipelined sorting,   Algorithms: sort, for_each, merge zero−I/O scanning         Block management (BM) layer   typed block, block manager, buffered streams,   block prefetcher, buffered block writer TXXL S      Asynchronous I/O primitives (AIO) layer    files, I/O requests, disk queues,   completion handlers  

     

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 11/43 STXXL Design: AIO Layer

Hides details of async. I/O Applications

(portability, user-friendly) STL−user layer Streaming layer vector, stack, set Containers: priority_queue, map Pipelined sorting, Implementations for Algorithms: sort, for_each, merge zero−I/O scanning

Linux/MacOSX/BSD/Solaris Block management (BM) layer and Windows systems typed block, block manager, buffered streams, block prefetcher, buffered block writer

  Asynchrony provided by   Asynchronous I/O primitives (AIO) layer    POSIX threads or Boost   files, I/O requests, disk queues,   Threads completion handlers     Unbuffered I/O support: Operating System more control over I/O

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 12/43 STXXL Design: BM Layer

Applications

Block abstraction STL−user layer Streaming layer vector, stack, set Containers: priority_queue, map Pipelined sorting, Parallel disk model Algorithms: sort, for_each, merge zero−I/O scanning      Block management (BM) layer (Randomized) striping and     typed block, block manager, buffered streams,  cycling  block prefetcher, buffered block writer     Parallel disk buffered writing Asynchronous I/O primitives (AIO) layer and optimal prefetching files, I/O requests, disk queues, completion handlers [Hutchinson&Sanders&Vitter01] Operating System

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 13/43 STXXL User Layers

Applications

     STL−user layer Streaming layer    STL-user layer: compatible vector, stack, set   Containers: Pipelined sorting, priority_queue, map   Algorithms: sort, for_each, merge zero−I/O scanning with STL, vector, stack,     queue, deque, priority queue, Block management (BM) layer typed block, block manager, buffered streams, map, sorting, scanning block prefetcher, buffered block writer

Streaming layer: Asynchronous I/O primitives (AIO) layer files, I/O requests, disk queues, programming with pipelining completion handlers

Operating System

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 14/43 Example: Generate Random Graph with STXXL

1 stxxl::vector Edges(10000000000ULL); 2 std::generate(Edges.begin(),Edges.end(),random_edge()); 3 stxxl::sort(Edges.begin(),Edges.end(),edge_cmp(),512*1024*1024); 4 stxxl::vector::iterator NewEnd = std::unique(Edges.begin(),Edges.end()); 5 Edges.resize(NewEnd - Edges.begin());

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 15/43 Streaming Layer and Pipelining

DATA DATA

EM algorithm ⇒ data flow through a DAG    SCAN  Feed output data stream directly to the   consumer algorithm

  SORT A new iterator-like interface for EM algorithms     Basic pipelined implementations (file, sorting

  nodes, etc.) provided by STXXL   SCAN   Saves many I/Os (factor 2–3) in many EM algorithms

DATA

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 16/43 Parallel Disk Sorting: an Important Part of STXXL

Sorting is the core routine of almost every I/O-eff. algorithm.

We engineer a parallel disk sorting algorithm and implementation which guarantees:   1 N N Optimal I/O volume sort(N) = O DB logM/B B 2 Almost perfect overlapping of I/O and computation

Dementiev and Sanders Asynchronous Parallel Disk Sorting. SPAA 2003, 15th ACM Symposium on Parallelism in Algorithms and Architectures

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 17/43 I/O-Efficient Multiway Merge Sort

The Algorithm 1 Sort input chunks of size Θ(M) (aka runs) 2 Θ(M/B)-way merge runs until only one sorted run left  N N  ⇒ O B logM/B M I/Os

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 18/43 I/O-Efficient Multiway Merge Sort

The Challenges 1 Overlapping of I/O and computation: running time ≈ max(IOtime(N),CPUtime(N))

 N N  2 B logM/B M Disk balancing: IOtime(N) ≈ O D

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 19/43 Run Formation: Easy

Two threads cooperate to build k runs of size M/2:

post a read request for runs 1 and 2 thread A: | thread B: for r:=1 to k do | for r:=1 to k-2 do wait until | wait until run r is read | run r is written sort run r | post a read for run r+2 post a write for run r |

control flow in thread A control flow in thread B time

1 2 3 4 k−1 k read ... I/O Stripe data over the disks 1 2 3 4 bound ... k−1 k sort case 1 2 3 4 k−1 k write ... TRF (N) = M N 1 2 3 4 k−1 k max(2Tsort ( 2 ) M ,TIO (N)) + Tstartup read ... compute 1 2 3 4 k−1 k bound ... sort case 1 2 3 4 k−1 k write ...

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 21/43 External Multiway Merging

The smallest element of a block is a trigger

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 22/43 Overlaping I/O and Merging

Prediction of delay between issuing two reads is not easy:

Solution (sketch) 1 Integrate k + 3D overlap buffers 2 Special I/O thread strategy ≥ DB elements in write buffer ⇒ output step < DB elements in write buffer AND D overlap buffers avail. ⇒ input step

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 23/43 Overlaping I/O and Merging

Prediction of delay between issuing two reads is not easy:

Solution (sketch) 1 Integrate k + 3D overlap buffers 2 Special I/O thread strategy ≥ DB elements in write buffer ⇒ output step < DB elements in write buffer AND D overlap buffers avail. ⇒ input step

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 23/43 Overlaping I/O and Merging

Prediction of delay between issuing two reads is not easy:

Solution (sketch) 1 Integrate k + 3D overlap buffers 2 Special I/O thread strategy ≥ DB elements in write buffer ⇒ output step < DB elements in write buffer AND D overlap buffers avail. ⇒ input step

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 23/43 Disk Scheduling

Use randomized striping allocation [Vitter&Hutchinson’2001] Prefetch buffer of m = O(D) blocks: In almost every input step, (1 − O(D/m))D blocks from prefetch sequence δ can be fetched [Dementiev&Sanders’2003] Result For any ε, m = Θ(D/ε) prefetch buffers, merging k sequences with a total N0 elements can be implemented in time 2LN0 0 max( (1−ε)DB ,τN ) + Tstartup, where L is the time to read of write a block and τ is the time to merge an element

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 24/43 Disk Scheduling

Use randomized striping allocation [Vitter&Hutchinson’2001] Prefetch buffer of m = O(D) blocks: In almost every input step, (1 − O(D/m))D blocks from prefetch sequence δ can be fetched [Dementiev&Sanders’2003] Result For any ε, m = Θ(D/ε) prefetch buffers, merging k sequences with a total N0 elements can be implemented in time 2LN0 0 max( (1−ε)DB ,τN ) + Tstartup, where L is the time to read of write a block and τ is the time to merge an element

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 24/43 STXXL Performance: Sorting

2GHz Xeon, 1GByte RAM, 2 GByte input, 32-bit keys runs of size 256 MByte, g++ 3.2

800 LEDA-SM LEDA-SM Soft-RAID TPIE 700 200 TPIE Soft-RAID comparison based Soft-RAID 600

500 150

400 100 sort time [s]

300 sort time [s]

200 50 100

0 0 16 32 64 128 256 512 1024 16 32 64 128 256 512 1024 element size [byte] element size [byte] Single disk Eight disks I/O bandwidth of 45.4 MB/s I/O bandwidth of 315 MB/s = 95% of peak bandwidth of the disk

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 25/43 Sorting Performance: I/O Bottleneck Disappears Changing element size

400 run formation 16 GByte input, 32-bit keys, D = 8, 350 merging I/O wait in merge phase I/O wait in run formation phase runs of size 256 MByte, g++ 3.2 300 elem. size ≥ 64 ⇒ merging is I/O 250

200 bound time [s] 150 elem. size ≥ 128 ⇒ run formation 100 is I/O bound 50 For small elements I/O is not the 0 16 32 64 128 256 512 1024 bottleneck element size [byte]

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 26/43 STXXL Performance

All other STXXL algorithms and data structures are benchmarked in my PhD thesis against LEDA-SM, TPIE and Berkley DB (B-tree) STXXL implementations outperform or compete with the opponents

60 2500 50 700 250 LEDA-SM insert LEDA-SM insert LEDA-SM LEDA-SM Stxxl insert D=1 Stxxl delete D=1 Stxxl intermixed D=1 500 3000 LEDA-SM delete LEDA-SM delete Stxxl Stxxl Stxxl insert D=2 Stxxl delete D=2 Stxxl intermixed D=2 Stxxl insert 50 Stxxl insert 600 Stxxl insert D=4 Stxxl delete D=4 Stxxl intermixed D=4 Stxxl delete Stxxl delete 2000 40 200 2500 500 400 40 1500 30 150 2000 400 300 30 1500 300 1000 20 100 200

ns per operation 20 ns per operation ns per operation ns per operation ns per operation

1000 bytes per operation bytes per operation 200

500 10 50 100 500 10 100

0 0 0 0 0 0 0 220 222 224 226 228 230 232 220 222 224 226 228 230 232 220 222 224 226 228 230 232 220 222 224 226 228 230 232 220 222 224 226 228 230 232 234 220 222 224 226 228 230 232 234 220 222 224 226 228 230 232 234 n n n n n n n

18000 3 BDB btree BDB btree BDB btree 14 BDB btree BDB btree 25000 BDB btree 5 TPIE btree 30000 TPIE btree 16000 TPIE btree TPIE btree stxxl::map TPIE btree stxxl::map stxxl::map stxxl::map stxxl::map 2.5 stxxl::map 14000 12 25000 20000 4 12000 10 2 20000 10000 15000 3 8 1.5 15000 8000

s per locate 6 s per deletion s per insertion µ 10000 overuse factor µ

2 µ s per input record 6000 1 µ

10000 s per scanned record

µ 4 4000 1 5000 5000 0.5 2000 2

0 0 0 0 0 0 223 224 225 226 227 228 229 223 224 225 226 227 228 229 223 224 225 226 227 228 229 223 224 225 226 227 228 229 223 224 225 226 227 228 229 223 224 225 226 227 228 229 n n n n n n

55 170 275 70 60 160 50 150 65 55 250 55 140 45 60 50 225 50 130 55 40 120 STXXL gs2 32 delete 45 200 50 45 110 STXXL gs2 32 insert35 STXXL n 32 delete D=1 gs2 4 delete 40 175 45 STXXL n 32 insert LEDA-SM 32 delete STXXL gs2 4 insert D=1 gs2 4 insert 100 D=1 gs2 32 delete 40 STXXL gs2 4 delete LEDA-SM 32 insert30 TPIE mmap 32 delete STXXL n 4 insert D=1 gs2 32 insert 90 D=2 gs2 4 delete 40 35 STXXL n 4 delete 150 TPIE mmap 32 insert TPIE ufs 32 delete LEDA-SM 4 insert D=2 gs2 4 insert D=2 gs2 32 delete 35 LEDA-SM 4 delete 80 TPIE ufs 32 insert 25 D=2 gs2 32 insert D=4 gs2 4 delete 35 TPIE mmap 4 insert 30 MByte/s MByte/s TPIE mmap 4 delete 125 D=4 gs2 4 insert 70 30 TPIE ufs 4 insert MByte/s D=4 gs2 32 delete MByte/s 30 TPIE ufs 4 delete 20 MByte/s D=4 gs2 32 insert 60 25 100 25 MByte/s 25 50 15 20 20 75 40 20 10 30 15 15 50 15 20 10 5 10 25 10 5 10 0 0 5 0 0 5 0 0

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 27/43 Engineering Algorithms for Large Graphs

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 28/43 Engineering an I/O-efficient MST Algorithm

1 5 2 9 7 2 4 3

Challenge Can we compute Minimum Spanning Trees (Forests) for really huge graphs?

Dementiev, Sanders, Schultes, and Sibeyn Engineering an External Memory Minimum Spanning Tree Algorithm. TCS 2004: 3rd IFIP International Conference on Theoretical Computer Science

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 29/43 A Practical Approach

Sketch of the algorithm sweep line 1 Reduce the node set V merging nodes output and finding some MST edges until relink |V | = O(M) ... u v 2 Run Kruskal’s algorithm keeping forests in internal memory (Union-Find) relink

Two implementation variants Node reduction using stxxl::priority_queue: very simple, only 12 lines of C++/STXXL code, CPU-bound Bucket version: based on stxxl::stacks, linear internal work

Results Computed MSTs for 100 GByte graphs in 8 hours on a PC Only 2–5 times slower than a good internal algorithm

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 30/43 A Practical Approach

Sketch of the algorithm sweep line 1 Reduce the node set V merging nodes output and finding some MST edges until relink |V | = O(M) ... u v 2 Run Kruskal’s algorithm keeping forests in internal memory (Union-Find) relink

Two implementation variants Node reduction using stxxl::priority_queue: very simple, only 12 lines of C++/STXXL code, CPU-bound Bucket version: based on stxxl::stacks, linear internal work

Results Computed MSTs for 100 GByte graphs in 8 hours on a PC Only 2–5 times slower than a good internal algorithm

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 30/43 A Practical Approach

Sketch of the algorithm sweep line 1 Reduce the node set V merging nodes output and finding some MST edges until relink |V | = O(M) ... u v 2 Run Kruskal’s algorithm keeping forests in internal memory (Union-Find) relink

Two implementation variants Node reduction using stxxl::priority_queue: very simple, only 12 lines of C++/STXXL code, CPU-bound Bucket version: based on stxxl::stacks, linear internal work

Results Computed MSTs for 100 GByte graphs in 8 hours on a PC Only 2–5 times slower than a good internal algorithm

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 30/43 I/O-Efficient Breadth First Search

Ajwani, Dementiev and Meyer A Computational Study of External-Memory BFS Algorithms. SODA 2006, ACM Symposium on Discrete Algorithms 4 1 3 5 Study two I/O-efficient BFS algorithms 0 2 4 (MunagalaRanade and MehlhornMeyer) 1 3 Use STXXL pipelining i−1 i 4 Results BFS of a real huge WWW crawl graph (130 · 106 nodes, 1.4 · 109 edges) in about 2 hours on a PC

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 31/43 I/O-Efficient Breadth First Search

Ajwani, Dementiev and Meyer A Computational Study of External-Memory BFS Algorithms. SODA 2006, ACM Symposium on Discrete Algorithms 4 1 3 5 Study two I/O-efficient BFS algorithms 0 2 4 (MunagalaRanade and MehlhornMeyer) 1 3 Use STXXL pipelining i−1 i 4 Results BFS of a real huge WWW crawl graph (130 · 106 nodes, 1.4 · 109 edges) in about 2 hours on a PC

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 31/43 I/O-Efficient Breadth First Search

Ajwani, Dementiev and Meyer A Computational Study of External-Memory BFS Algorithms. SODA 2006, ACM Symposium on Discrete Algorithms 4 1 3 5 Study two I/O-efficient BFS algorithms 0 2 4 (MunagalaRanade and MehlhornMeyer) 1 3 Use STXXL pipelining i−1 i 4 Results BFS of a real huge WWW crawl graph (130 · 106 nodes, 1.4 · 109 edges) in about 2 hours on a PC

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 31/43 Engineering Algorithms for Large Graphs

In my PhD thesis Maximal Independent Set Connected Components and Spanning Trees Listing All Triangles (Heuristics for) Graph Coloring

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 32/43 Engineering Large Suffix Array Construction

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 33/43 Engineering Large Suffix Array Construction

Suffix Array: SA[i] is the starting pos of the i-th smallest suffix of input S

Example: b a S = [b,a,n,a,n,a] a ana n anana SA = [5,3,1,0,4,2] a banana n na Applications: full-text index, compression a nana

Our Work Design, implement, evaluate several new I/O-efficient algorithms Apply pipelining to external memory suffix array construction

Dementiev, Kärkkäinen, Mehnert, Sanders Better External Memory Suffix Array Construction. JEA and ALENEX05: Algorithm Engineering and Experiments

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 34/43 Engineering Large Suffix Array Construction

Suffix Array: SA[i] is the starting pos of the i-th smallest suffix of input S

Example: b a S = [b,a,n,a,n,a] a ana n anana SA = [5,3,1,0,4,2] a banana n na Applications: full-text index, compression a nana

Our Work Design, implement, evaluate several new I/O-efficient algorithms Apply pipelining to external memory suffix array construction

Dementiev, Kärkkäinen, Mehnert, Sanders Better External Memory Suffix Array Construction. JEA and ALENEX05: Algorithm Engineering and Experiments

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 34/43 Engineering Large Suffix Array Construction 2

Considered algorithms Doubling, a-Tupling (quadrupling), doubling+discarding, quadrupling+discarding, difference cover (DC3 aka Skew, I/O-optimal)

Input instances: random and real world Concatenation of a random string (make heuristics look bad) Gutenberg text collection (≈ 3 GBytes) Human genome (small alphabet, ≈ 3 GBytes) HTML (text + tags, ≈ 4 GBytes) (C++) source code (≈ 500 MBytes)

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 35/43 Engineering Large Suffix Array Construction 3

Results Pipelining saves a factor of 3 in I/O volume ⇒ a speedup of 1.9-2.4 for D = 1 Optimal DC3 outperforms all opponents on all inputs Suffix array of a 4 GByte input can be computed in a few hours on a PC with a small main memory ⇒ Very price-efficient

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 36/43 Porting Algorithms to External Memory

Replace few underlying non-I/O-efficient algorithms by corresponding I/O-efficient versions We have applied this technique obtaining algorithms for Bipartiteness test (aka 2-coloring): O(sort(|E| + |V |)) I/Os 5-Coloring Planar Graphs: O(sort(|E| + |V |)) I/Os Finding 1/2-Approximation of Maximum Weighted Matching: O(sort(|E| + |V |)) I/Os Finding Perfect Matchings in Bipartite Multigraphs:

O(sort(|E| + |V |)log2(|E|)) I/Os

⇒ Can be easily implemented with STXXL

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 37/43 Summary

Engineering from the bottom to the top: many disks CPU-bound look at internal algorithms, RAID-0 suboptimal Pipelining to save I/Os, overlap I/O and computation, easy to use library, abstraction, rapid prototyping Controlled unbuffered asynchronous I/O, scalable file systems

Bottleneck-free hardware I/O-subsystem with parallel disks

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 38/43 Summary

Engineering from the bottom to the top: many disks CPU-bound look at internal algorithms, RAID-0 suboptimal Pipelining to save I/Os, overlap I/O and computation, easy to use library, abstraction, rapid prototyping Controlled unbuffered asynchronous I/O, scalable file systems

Bottleneck-free hardware I/O-subsystem with parallel disks

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 38/43 Summary

Engineering from the bottom to the top: many disks CPU-bound look at internal algorithms, RAID-0 suboptimal Pipelining to save I/Os, overlap I/O and computation, easy to use library, abstraction, rapid prototyping Controlled unbuffered asynchronous I/O, scalable file systems

Bottleneck-free hardware I/O-subsystem with parallel disks

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 38/43 Summary

Engineering from the bottom to the top: many disks CPU-bound look at internal algorithms, RAID-0 suboptimal Pipelining to save I/Os, overlap I/O and computation, easy to use library, abstraction, rapid prototyping Controlled unbuffered asynchronous I/O, scalable file systems

Bottleneck-free hardware I/O-subsystem with parallel disks

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 38/43 Conclusion

The STXXL library high-performance (parallel disks, pipelining, overlapping of I/O and computation) easy to use (STL-compatible) STXXL applications: solve very large problem instances externally using a low cost hardware in record time ⇒ Price-efficient

Outlook Possible efficiency improvements:

I pipelining+overlapping I parallel processing (Multi-Core STL) I pipelining+task-based parallelism

Submit to the BOOST libraries

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 39/43 Conclusion

The STXXL library high-performance (parallel disks, pipelining, overlapping of I/O and computation) easy to use (STL-compatible) STXXL applications: solve very large problem instances externally using a low cost hardware in record time ⇒ Price-efficient

Outlook Possible efficiency improvements:

I pipelining+overlapping I parallel processing (Multi-Core STL) I pipelining+task-based parallelism

Submit to the BOOST libraries

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 39/43 Active STXXL Users We Know About

1 University of Karlsruhe, Germany (text processing, graph algorithms, practical courses) 2 Max-Planck-Institut für Informatik, Germany (bio-informatics, graph algorithms) 3 University of Rome “La Sapienza”, Italy (connected components) 4 University of Texas at Austin, USA (Gaussian elimination) 5 Bitplane AG, Switzerland (visualization and analysis of 3D and 4D microscopic images) 6 Philips Research, The Netherlands (differential cryptographic analysis) 7 Dalhousie University, Canada (N-gram extraction) 8 Florida State University, USA (construction of Voronoi diagrams) 9 Montefiore Institute, Belgium (big sparse matrices) 10 University of British Columbia, Canada (topology analysis of large networks) 11 Bayes Forecast, Spain (statistics and time series analysis) 12 Indian Institute of Science in Bangalore, India (suffix array construction) 13 Rensselaer Polytechnic University, USA (suffix array construction) 14 Institut français du pèrole, France (analysis of seismic files) 15 Northumbria University, UK (search trees) 16 University of Trento, Italy (text compression) 17 Norwegian University of Science and Technology in Trondheim, Norway (suffix array construction)

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 40/43 High-Performance System for DIMACS

ema.rutgers.edu = External Memory Algorithms 8GB RAM 2x Dual Core 2.3 GHz Xeon = 4 processor SMP 400 MByte/s disk I/O bandwidth

switch out 2 x 1GBit Ethernet

dimacspc1 dimacspc2 dimacspc3 dimacspc4 dimacspc5 dimacspc6 Each PC: 2 GByte RAM TOTAL DISK WORKING SPACE: Core 2 Duo 2.1 Ghz processor = 2 processor SMP 2 TBytes 100 MByte/s disk I/O bandwidth

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 41/43 High-Performance System for DIMACS: Software

Libraries External Memory I/O: STXXL EMA⇔PCs communication: MPI (Message Passing Interface) Boost libraries

Operating systems EMA: 64-bit Debian PCs: dual boot – 64-bit Debian Linux or Windows XP x64

Access to EMA and PCs from Linux/: ssh, scp from Windows: SSH secure shell client for Windows, Putty

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 42/43 High-Performance System for DIMACS: Plans

Handle graphs with a billion edges Store many interesting huge (weighted) graph instances Design and implement efficient data structures for huge graphs Fast graph query mechanisms available for DIMACS researchers

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 43/43 Final Slide

Thank you for your attention!

Roman Dementiev (University of Karlsruhe) STXXL March 19, 2007 44/43