class 5
column-stores 2.0
prof. Stratos Idreos
HTTP://DASLAB.SEAS.HARVARD.EDU/CLASSES/CS165/

worth thinking about…
what just happened? where is my data?
email, cloud, social media, …
can we design systems that let us know what is going on?
CS165, Fall 2018 Stratos Idreos

cool papers 2.0

The Case for RodentStore: An Adaptive, Declarative Storage System
Philippe Cudré-Mauroux, Eugene Wu, Samuel Madden
In Proc. of the Inter. Conference on Innovative Data Systems Research (CIDR), 2009

Abstraction Without Regret in Database Systems Building: a Manifesto
Christoph Koch
IEEE Data Eng. Bull. 37(1): 70-79 (2014)

dbTouch: Analytics at your Fingertips
Stratos Idreos and Erietta Liarou
In Proc. of the Inter. Conference on Innovative Data Systems Research (CIDR), 2013

design doc
think, design, create
1-2 page PDF doc and ask for feedback
mandatory M1-M2, optional afterwards
do not worry about perfection: fail fast
wrong ideas ok if you eventually find out they are wrong :)
(holds for midterms as well)

registers: my head (~0)
on chip cache: this room (2x, 1 min)
on board cache: this building (10x, 10 min)
memory: New York (100x, 1.5 hours)
disk: Pluto (100Kx, 2 years)

Jim Gray: IBM, Tandem, DEC, Microsoft
ACM Turing award
ACM SIGMOD Edgar F. Codd Innovations Award

the way we store data defines the possible (efficient) access methods

slotted page
free_offset, N, offset1-length1, offset2-length2, …
scan, null, free space, update, var length, …
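
The slotted page header above (free_offset, N, then offset-length pairs) can be sketched in C as follows. This is only an illustration under stated assumptions, not the course's reference layout: the 4096-byte page size, the 16-bit field widths, the choice to grow the slot directory from the front and the records from the back, and the names `page_t`, `slot_t`, `page_init`, `page_insert` and `page_get` are all hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096                                   /* assumed page size */
#define BODY_SIZE (PAGE_SIZE - 2 * sizeof(uint16_t))     /* bytes after the header */

/* One directory entry: offsetK-lengthK on the slide. */
typedef struct {
    uint16_t offset;   /* where the record starts inside body */
    uint16_t length;   /* record size in bytes */
} slot_t;

/* Header (free_offset, N) followed by the page body.
 * Slots fill body from the front, records fill it from the back. */
typedef struct {
    uint16_t free_offset;   /* start of the record area inside body */
    uint16_t n;             /* N: number of slots in use */
    uint8_t  body[BODY_SIZE];
} page_t;

void page_init(page_t *p) {
    p->free_offset = (uint16_t)BODY_SIZE;
    p->n = 0;
}

/* Appends a variable-length record; returns its slot number or -1 if full. */
int page_insert(page_t *p, const void *rec, uint16_t len) {
    size_t dir_end = (size_t)(p->n + 1) * sizeof(slot_t);  /* directory size after insert */
    if (len > p->free_offset || (size_t)(p->free_offset - len) < dir_end) {
        return -1;                                         /* not enough free space */
    }
    p->free_offset = (uint16_t)(p->free_offset - len);
    memcpy(p->body + p->free_offset, rec, len);
    slot_t s = { p->free_offset, len };
    memcpy(p->body + (size_t)p->n * sizeof(slot_t), &s, sizeof s);
    return p->n++;
}

/* Looks a record up by slot number; returns NULL for an invalid slot. */
const void *page_get(const page_t *p, int slot, uint16_t *len) {
    if (slot < 0 || slot >= p->n) {
        return NULL;
    }
    slot_t s;
    memcpy(&s, p->body + (size_t)slot * sizeof(slot_t), sizeof s);
    *len = s.length;
    return p->body + s.offset;
}
```

Because every access goes through the slot directory, a record can be moved inside the page (for example after an update that changes its length) by rewriting only its slot entry. Scans, nulls and free-space reclamation are left out of this sketch.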

virtual ids / positional alignment
columns do not need to have the same width

          A    B    C
tuple 1   a1   b1   c1
tuple 2   a2   b2   c2
tuple 3   a3   b3   c3
tuple 4   a4   b4   c4
tuple 5   a5   b5   c5
tuple 6   a6   b6   c6

fixed-width + dense

positional lookups/joins
A(i) = A + i * width(A)
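
The address computation A(i) = A + i * width(A) can be written out in C as below. This is a minimal sketch assuming each column is one dense array of fixed-width values; the `column_t` descriptor and the `column_at` helper are hypothetical names introduced for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor for a dense, fixed-width column. */
typedef struct {
    const uint8_t *base;   /* A: address of the first value */
    size_t width;          /* width(A): bytes per value */
    size_t count;          /* number of values, one per tuple */
} column_t;

/* A(i) = A + i * width(A); returns NULL when i is out of range. */
const void *column_at(const column_t *col, size_t i) {
    if (i >= col->count) {
        return NULL;
    }
    return col->base + i * col->width;
}
```

Since position i identifies the same tuple in every column, a positional join of two columns with different widths is just `column_at(&A, i)` and `column_at(&C, i)` for the same i, with no stored tuple IDs.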

row-store vs column-store
[figure: attributes A, B, C, D stored together as rows vs stored as separate columns]

today
column-stores 2.0

select min(C) from R where A<10 & B<20
[figure: columns A, B, C, D on disk; in memory: A -> A<10 -> IDs -> B -> B<20 -> IDs -> C -> minC]
query plan = select -> fetch -> select -> fetch -> min
sequential access patterns, max 1 if
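
The plan select -> fetch -> select -> fetch -> min for `select min(C) from R where A<10 & B<20` can be sketched column-at-a-time in C as below. This is an illustration only, assuming `int` columns of equal length and caller-provided scratch buffers; the operator names (`select_lt`, `fetch`, `refine_lt`, `min_of`, `query_min_c`) are hypothetical, and returning `INT_MAX` for an empty result is a choice made for this sketch.

```c
#include <limits.h>
#include <stddef.h>

/* select: scan a column sequentially, emit the positions (IDs) that qualify. */
size_t select_lt(const int *col, size_t n, int bound, size_t *ids_out) {
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        if (col[i] < bound) {          /* the one if in the loop */
            ids_out[k++] = i;
        }
    }
    return k;
}

/* fetch: gather the values of a column at the given positions. */
void fetch(const int *col, const size_t *ids, size_t k, int *vals_out) {
    for (size_t j = 0; j < k; j++) {
        vals_out[j] = col[ids[j]];
    }
}

/* select over fetched values: keep the IDs whose value qualifies.
 * Safe when ids_out == ids because the write index never passes the read index. */
size_t refine_lt(const int *vals, const size_t *ids, size_t k, int bound,
                 size_t *ids_out) {
    size_t m = 0;
    for (size_t j = 0; j < k; j++) {
        if (vals[j] < bound) {
            ids_out[m++] = ids[j];
        }
    }
    return m;
}

/* min: returns INT_MAX when no value is given (assumption of this sketch). */
int min_of(const int *vals, size_t k) {
    int best = INT_MAX;
    for (size_t j = 0; j < k; j++) {
        if (vals[j] < best) {
            best = vals[j];
        }
    }
    return best;
}

/* ids and vals are scratch buffers with room for n entries each. */
int query_min_c(const int *A, const int *B, const int *C, size_t n,
                size_t *ids, int *vals) {
    size_t k = select_lt(A, n, 10, ids);      /* A<10 -> IDs */
    fetch(B, ids, k, vals);                   /* IDs -> B values */
    k = refine_lt(vals, ids, k, 20, ids);     /* B<20 -> IDs */
    fetch(C, ids, k, vals);                   /* IDs -> C values */
    return min_of(vals, k);                   /* minC */
}
```

Each operator runs to completion over a whole column or intermediate before the next one starts, and each inner loop is a tight pass with at most one `if`. Tuples are never assembled; only positions and single-column values move between operators.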

working over fixed width & dense columns
select: for (i=0;i…
with data being memory resident these become significant cost components

A<10 -> IDs -> B -> B<20 -> IDs -> C -> minC
alt1) start with B
alt2) scan A & B independently and merge
alt3) store intermediates as bit vectors - not positions
…
project: basic one + more if you decide to invest in this area
midterm: basic one + 2-3 alternatives

A<10 -> IDs -> B -> B<20 -> IDs -> C -> minC
late tuple reconstruction/materialization
only reconstruct to present results
no need to assemble tuples
minimize memory footprint
minimize data we are moving up the memory hierarchy
but requires new processing engine

[figure: data on disk loaded into memory for option1) a row-store engine or option2) a column-store engine]

A<10 -> IDs -> B -> B<20 -> IDs -> C -> minC
possible data flow patterns
tuple at a time
block/vector at a time
column at a time

select min(C) from R where A<10 & B<20
[figure: the plan A -> A<10 -> IDs -> B -> B<20 -> IDs -> C -> minC shown column-at-a-time and vector-at-a-time]

the beer analogy
Marcin Zukowski, PhD
CEO/Co-founder of Vectorwise (now Actian)
now: “changing the world, one terabyte at a time”
co-founder of Snowflake

[figure: query plan op1 -> op2 -> op3 on the CPU over columns A, B; memory hierarchy from registers, on chip cache, on board cache, memory to disk; faster toward registers, cheaper toward disk]
size of vector

tuple at a time - minimizing memory footprint
bulk processing - minimizing functional overhead
vectorized processing - somewhere in between

history/timeline
~1960s: tuple at a time
1980s: ideas about block processing
2005: vectorwise
>2010: industry adoption

project: column-at-a-time
bonus: vectorized processing

update row7=(A=a,B=b,C=c,D=d)
row-store vs column-store
which is better to update and why?
how much does it cost to update a single row? (think about pages, data movement)
how to update in column-stores? (query plan + algorithms)

[figure: queries and updates over columns A, B, C, D; base data plus pending updates, merged periodically]

fractured mirrors
[figure: the optimizer sends each query to a columns copy or a rows copy of A, B, C, D]
A case for fractured mirrors
Ravishankar Ramamurthy, David J. DeWitt, Qi Su
Very Large Databases Journal, 12(2): 89-101, 2003

Notes to remember
column-stores great for analytics
row-stores great for transactions
still basic concepts are the same
hybrids possible
keep access patterns sequential and simple (min ifs)

reading
Read: The Design and Implementation of Modern Column-store Database Systems (Sections: all -4.6 & 4.8)
by D. Abadi, P. Boncz, S. Harizopoulos, S. Idreos, S. Madden
Read: IEEE Data Engineering Bulletin, 35(1), March 2012
Special Issue on Column-stores (9 short overview papers)

research papers
Read: Database Architecture Optimized for the New Bottleneck: Memory Access
Peter Boncz, Stefan Manegold, Martin Kersten
In Proc. of the Very Large Databases Conference (VLDB), 1999

Browse: MonetDB/X100: Hyper-Pipelining Query Execution
Peter A. Boncz, Marcin Zukowski, Niels Nes
In Proc. of the Inter. Conference on Innovative Data Systems Research (CIDR), 2005

Browse: Materialization Strategies in a Column-Oriented DBMS
Daniel Abadi, Daniel Myers, David DeWitt, Samuel Madden
In Proc. of the Inter. Conference on Data Engineering (ICDE), 2007

Browse: Self-organizing tuple reconstruction in column-stores
Stratos Idreos, Martin Kersten, Stefan Manegold
In Proc. of the ACM SIGMOD Inter. Conference on Management of Data, 2009

class 5
column-stores 2.0
DATA SYSTEMS
prof. Stratos Idreos