Databases & External Memory
Total Page:16
File Type:pdf, Size:1020Kb
Databases & External Memory Indexes, Write Optimization, and Crypto-searches Michael A. Bender Martin Farach-Colton Tokutek & Stony Brook Tokutek & Rutgers What’s a Database? DBs are systems for: • Storing data • Querying data. 9 count (*) where a<120; 5 809 99 20 8 202 16 1 14 STOC ’12 Tutorial: Databases and External Memory What’s a Database? DBs are systems for: • Storing data • Querying data. DBs can have SQL interface or something else. • We’ll talk some about so-called NoSQL systems. Data consists of a set of key,value pairs. • Each value can consist of a well defined tuple, where each component is called a field (e.g. relational DBs). Big Data. • Big data = bigger than RAM. STOC ’12 Tutorial: Databases and External Memory What’s this database tutorial? Traditionally: • Fast queries ⇒ sophisticated indexes and slow inserts. • Fast inserts ⇒ slow queries. 9 count (*) where a<120; 5 809 99 20 8 202 16 1 14 We want fast data ingestion + fast queries. STOC ’12 Tutorial: Databases and External Memory What’s this database tutorial? Database tradeoffs: • There is a query/insertion tradeoff. ‣ Algorithmicists mean one thing by this claim. ‣ DB users mean something different. • We’ll look at both. Overview of Tutorial: • Indexes: The DB tool for making queries fast • The difficulty of indexing under heavy data loads • Theory and Practice of write optimization • From data structure to database • Algorithmic challenges from maintaining two indexes STOC ’12 Tutorial: Databases and External Memory Row, Index, and Table a b c Row 100 5 45 • Key,value pair 101 92 2 • key = a, value = b,c 156 56 45 Index 165 6 2 • Ordering of rows by key 198 202 56 • Used to make queries fast 206 23 252 256 56 2 Table 412 43 45 • Set of indexes create table foo (a int, b int, c int, primary key(a)); STOC ’12 Tutorial: Databases and External Memory An index can select needed rows a b c 100 5 45 101 92 2 156 56 45 165 6 2 198 202 56 206 23 252 256 56 2 412 43 45 count (*) where a<120; STOC ’12 Tutorial: Databases and External Memory An index can select needed rows a b c 100 5 45 100 5 45 101 92 2 101 92 2 } 156 56 45 165 6 2 198 202 56 206 23 252 2 256 56 2 412 43 45 count (*) where a<120; STOC ’12 Tutorial: Databases and External Memory No good index means slow table scans a b c 100 5 45 101 92 2 156 56 45 165 6 2 198 202 56 206 23 252 256 56 2 412 43 45 count (*) where b>50 and b<100; STOC ’12 Tutorial: Databases and External Memory No good index means slow table scans a b c 100 5 45 100 5 45 101 92 2 101 92 2 156 56 45 156 56 45 165 6 2 165 6 2 1230 198 202 56 198 202 56 206 23 252 206 23 252 256 56 2 256 56 2 412 43 45 412 43 45 count (*) where b>50 and b<100; STOC ’12 Tutorial: Databases and External Memory You can add an index a b c b a 100 5 45 5 100 101 92 2 6 165 156 56 45 23 206 165 6 2 43 412 198 202 56 56 156 206 23 252 56 256 256 56 2 92 101 412 43 45 202 198 alter table foo add key(b); STOC ’12 Tutorial: Databases and External Memory A selective index speeds up queries a b c b a 100 5 45 5 100 101 92 2 6 165 156 56 45 23 206 165 6 2 43 412 198 202 56 56 156 206 23 252 56 256 256 56 2 92 101 412 43 45 202 198 count (*) where b>50 and b<100; STOC ’12 Tutorial: Databases and External Memory A selective index speeds up queries a b c b a 100 5 45 5 100 101 92 2 6 165 156 56 45 23 206 165 6 2 43 412 56 156 198 202 56 56 156 56 256 206 23 252 56 156256 } 92 101 256 56 2 5692 256101 412 43 45 20292 101198 3 count (*) where b>50 and b<100; STOC ’12 Tutorial: Databases and External Memory Selective indexes can still be slow a b c b a 100 5 45 5 100 101 92 2 6 165 156 56 45 23 206 165 6 2 43 412 198 202 56 56 156 206 23 252 56 256 256 56 2 92 101 412 43 45 202 198 sum(c) where b>50; STOC ’12 Tutorial: Databases and External Memory Selective indexes can still be slow 56 156 56 256 a b c b a 92 101 Selecting 100 5 45 5 100 202 198 fast b: on 101 92 2 6 165 156 56 45 23 206 165 6 2 43 412 198 202 56 56 156 206 23 252 56 256 256 56 2 92 101 412 43 45 202 198 sum(c) where b>50; STOC ’12 Tutorial: Databases and External Memory Selective indexes can still be slow 56 156 56 256 a b c b a 92 101 Selecting 100 5 45 5 100 202 198 fast b: on 101 92 2 6 165 156 56 45 23 206 165 6 2 43 412 198 202 56 56 156 206 23 252 56 256 256 56 2 92 101 summing c: slow c: summing 412 43 45 202 198 for info Fetching sum(c) where b>50; STOC ’12 Tutorial: Databases and External Memory Selective indexes can still be slow 56 156 56 256 a b c b a 92 101 Selecting 100 5 45 5 100 202 198 fast b: on 101 92 2 6 165 156 56 45 23 206 156 56 45 165 6 2 43 412 198 202 56 56 156 206 23 252 56 256 256 56 2 92 101 summing c: slow c: summing 412 43 45 202 198 for info Fetching sum(c) where b>50; STOC ’12 Tutorial: Databases and External Memory Selective indexes can still be slow 56 156 56 256 a b c b a 92 101 Selecting 100 5 45 5 100 202 198 fast b: on 101 92 2 6 165 156 56 45 23 206 156 56 45 165 6 2 43 412 198 202 56 56 156 206 23 252 56 256 256 56 2 92 101 summing c: slow c: summing 412 43 45 202 198 for info Fetching sum(c) where b>50; STOC ’12 Tutorial: Databases and External Memory Selective indexes can still be slow 56 156 56 256 a b c b a 92 101 Selecting 100 5 45 5 100 202 198 fast b: on 101 92 2 6 165 156 56 45 23 206 156 56 45 165 6 2 43 412 256 56 2 198 202 56 56 156 206 23 252 56 256 256 56 2 92 101 summing c: slow c: summing 412 43 45 202 198 for info Fetching sum(c) where b>50; STOC ’12 Tutorial: Databases and External Memory Selective indexes can still be slow 56 156 56 256 a b c b a 92 101 Selecting 100 5 45 5 100 202 198 fast b: on 101 92 2 6 165 156 56 45 23 206 156 56 45 165 6 2 43 412 256 56 2 198 202 56 56 156 101 92 2 Poor data locality data Poor 206 23 252 56 256 198 202 56 256 56 2 92 101 summing c: slow c: summing 412 43 45 202 198 for info Fetching 105 sum(c) where b>50; STOC ’12 Tutorial: Databases and External Memory Covering indexes speed up queries a b c b,c a 100 5 45 5,45 100 101 92 2 6,2 165 156 56 45 23,252 206 165 6 2 43,45 412 198 202 56 56,2 256 206 23 252 56,45 156 256 56 2 92,2 101 412 43 45 202,56 198 alter table foo add key(b,c); sum(c) where b>50; STOC ’12 Tutorial: Databases and External Memory Covering indexes speed up queries 56,56,22 256 a b c b,c a 56,56,4545 156 100 5 45 5,45 100 92,92,22 101 101 92 2 6,2 165 202,202,5656 198 156 56 45 23,252 206 165 6 2 43,45 412 198 202 56 56,2 256 105 206 23 252 56,45 156 256 56 2 92,2 101 412 43 45 202,56 198 alter table foo add key(b,c); sum(c) where b>50; STOC ’12 Tutorial: Databases and External Memory DB Performance and Indexes Read performance depends on having the right indexes for a query workload. • We’ve scratched the surface of index selection. • And there’s interesting query optimization going on. Write performance depends on speed of maintaining those indexes. Next: how to implement indexes and tables. STOC ’12 Tutorial: Databases and External Memory An index is a dictionary Dictionary API: maintain a set S subject to • insert(x): S ← S ∪ {x} • delete(x): S ← S - {x} • search(x): is x ∊ S? • successor(x): return min y > x s.t. y ∊ S • predecessor(y): return max y < x s.t. y ∊ S STOC ’12 Tutorial: Databases and External Memory A table is a set of indexes A table is a set of indexes with operations: • Add index: add key(f1,f2,...); • Drop index: drop key(f1,f2,...); • Add column: adds a field to primary key value.