class 5
column-stores 2.0
prof. Stratos Idreos

HTTP://DASLAB.SEAS.HARVARD.EDU/CLASSES/CS165/

worth thinking about…

what just happened? is my data?

email, cloud, social media, …

can we design systems that let us know what is going on?

CS165, Fall 2018 Stratos Idreos

cool papers 2.0

The Case for RodentStore: An Adaptive, Declarative Storage System
Philippe Cudré-Mauroux, Eugene Wu, Samuel Madden
In Proc. of the Inter. Conference on Innovative Data Systems Research (CIDR), 2009

Abstraction Without Regret in Systems Building: a Manifesto
Christoph Koch
IEEE Data Eng. Bull. 37(1): 70-79 (2014)

dbTouch: Analytics at your Fingertips
Stratos Idreos and Erietta Liarou
In Proc. of the Inter. Conference on Innovative Data Systems Research (CIDR), 2013

design doc

think, design, create
1-2 page PDF doc and ask for feedback
mandatory M1-M2, optional afterwards

do not worry about perfection: fail fast
wrong ideas ok if you eventually find out they are wrong :)
(holds for midterms as well)

registers: my head, ~0
on chip cache: 2x, this room, 1 min
on board cache: 10x, this building, 10 min
memory: 100x, New York, 1.5 hours
disk: 100Kx, Pluto, 2 years

Jim Gray (IBM, Tandem, DEC, Microsoft)
ACM Turing Award
ACM SIGMOD Edgar F. Codd Innovations Award

the way we store data defines the possible (efficient) access methods

slotted page

free_offset, N, offset1-length1, offset2-length2, …

[figure: page layout with labels: scan, free space, var length …]
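The header layout above (free_offset, N, then one offset-length pair per record) can be sketched in C as follows. This is only a minimal illustration, not the course's reference design: the 4096-byte page size, the names `Page`, `Slot`, `page_init`, `page_insert`, and `page_get`, and the choice to grow records backward from the end of the page are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096                                /* assumed page size */
#define BODY_SIZE (PAGE_SIZE - 2 * sizeof(uint16_t))  /* page minus header fields */

/* One slot directory entry: offset-length pair for one record. */
typedef struct {
    uint16_t offset;  /* record start, relative to body */
    uint16_t length;  /* record length in bytes */
} Slot;

/* Header fields first, then the body. Inside the body the slot directory
 * grows forward from the front and records grow backward from the end;
 * the free space sits in between. */
typedef struct {
    uint16_t free_offset;  /* start of the most recently written record */
    uint16_t n;            /* N: number of slots in use */
    uint8_t  body[BODY_SIZE];
} Page;

void page_init(Page *p) {
    p->free_offset = (uint16_t)BODY_SIZE;
    p->n = 0;
}

/* Append a variable-length record; returns its slot id, or -1 if it does not fit. */
int page_insert(Page *p, const void *rec, uint16_t len) {
    assert(p != NULL && rec != NULL);
    size_t dir_end = ((size_t)p->n + 1) * sizeof(Slot);  /* directory size with the new slot */
    if (dir_end + len > p->free_offset)
        return -1;
    p->free_offset = (uint16_t)(p->free_offset - len);
    memcpy(p->body + p->free_offset, rec, len);
    Slot s = { p->free_offset, len };
    memcpy(p->body + (size_t)p->n * sizeof(Slot), &s, sizeof s);
    return (int)p->n++;
}

/* Look a record up by slot id; returns NULL for an unknown id. */
const uint8_t *page_get(const Page *p, uint16_t id, uint16_t *len) {
    if (id >= p->n)
        return NULL;
    Slot s;
    memcpy(&s, p->body + (size_t)id * sizeof(Slot), sizeof s);
    *len = s.length;
    return p->body + s.offset;
}
```

Because records are reached through the slot directory, a record can move inside the page without its slot id changing.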

virtual ids / positional alignment

columns do not need to have the same width

          A    B    C
tuple 1   a1   b1   c1
tuple 2   a2   b2   c2
tuple 3   a3   b3   c3
tuple 4   a4   b4   c4
tuple 5   a5   b5   c5
tuple 6   a6   b6   c6

fixed-width + dense

positional lookups/joins
A(i) = A + i * width(A)
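The address computation A(i) = A + i * width(A) can be written out directly in C. The sketch below is illustrative only: the `Column` struct, `column_at`, `PairAB`, and `tuple_at` are names invented for this example, and the two column widths (4 and 8 bytes) are arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A fixed-width, dense column: value i lives at base + i * width. */
typedef struct {
    const uint8_t *base;   /* A: start of the column's array */
    size_t         width;  /* width(A): bytes per value */
    size_t         count;  /* number of values */
} Column;

/* A(i) = A + i * width(A); no stored tuple id is needed. */
const void *column_at(const Column *c, size_t i) {
    assert(i < c->count);
    return c->base + i * c->width;
}

/* Positional join: rebuild tuple i from two columns of different widths. */
typedef struct {
    int32_t a;
    int64_t b;
} PairAB;

PairAB tuple_at(const Column *a, const Column *b, size_t i) {
    PairAB t;
    memcpy(&t.a, column_at(a, i), sizeof t.a);
    memcpy(&t.b, column_at(b, i), sizeof t.b);
    return t;
}
```

The position i plays the role of the virtual id: the same i addresses the matching value in every column, even though each column has its own width.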

row-store vs column-store

[figure: the row-store keeps A, B, C, D together for each tuple; the column-store keeps A, B, C and D as separate columns]

today: column-stores 2.0

select min(C) from R where A<10 & B<20

[figure: columns A B C D move from disk to memory; A<10 produces IDs, the IDs fetch from B, B<20 produces IDs, the IDs fetch from C, then minC]

query plan = select -> fetch -> select -> fetch -> min

sequential access patterns, max 1 if
working over fixed width & dense columns

select:
for (i=0; i<…; i++)
  if (… > v)
    inter1[j++] = i;

no auxiliary data, min ifs
easy to prefetch next data values

fetch:
for (i=0; i…
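One way to spell out the whole plan (select -> fetch -> select -> fetch -> min) over fixed-width, dense int columns is sketched below. It is a column-at-a-time illustration under assumptions: every function name, the `int` value type, and the use of `malloc` for intermediates are choices made for this example and are not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* select: scan a dense column, emit the positions that qualify. One if per value. */
size_t select_lt(const int *col, size_t n, int bound, size_t *out_pos) {
    size_t j = 0;
    for (size_t i = 0; i < n; i++)
        if (col[i] < bound)
            out_pos[j++] = i;
    return j;
}

/* fetch: gather the values of another column at the given positions. */
void fetch(const int *col, const size_t *pos, size_t n_pos, int *out_val) {
    for (size_t i = 0; i < n_pos; i++)
        out_val[i] = col[pos[i]];
}

/* select over fetched values; emits the ORIGINAL positions of the survivors. */
size_t select_lt_fetched(const int *val, const size_t *pos, size_t n,
                         int bound, size_t *out_pos) {
    size_t j = 0;
    for (size_t i = 0; i < n; i++)
        if (val[i] < bound)
            out_pos[j++] = pos[i];
    return j;
}

int min_of(const int *val, size_t n) {
    assert(n > 0);
    int m = val[0];
    for (size_t i = 1; i < n; i++)
        if (val[i] < m)
            m = val[i];
    return m;
}

/* select min(C) from R where A < a_bound & B < b_bound.
 * Returns 1 and stores the minimum in *result if any row qualifies, else 0. */
int min_c_where(const int *A, const int *B, const int *C, size_t n,
                int a_bound, int b_bound, int *result) {
    size_t *pos1 = malloc(n * sizeof *pos1);
    size_t *pos2 = malloc(n * sizeof *pos2);
    int *vals = malloc(n * sizeof *vals);
    int found = 0;
    if (pos1 && pos2 && vals) {
        size_t n1 = select_lt(A, n, a_bound, pos1);                       /* A<10 -> IDs  */
        fetch(B, pos1, n1, vals);                                         /* IDs  -> B    */
        size_t n2 = select_lt_fetched(vals, pos1, n1, b_bound, pos2);     /* B<20 -> IDs  */
        fetch(C, pos2, n2, vals);                                         /* IDs  -> C    */
        if (n2 > 0) {
            *result = min_of(vals, n2);                                   /* minC */
            found = 1;
        }
    }
    free(pos1);
    free(pos2);
    free(vals);
    return found;
}
```

Each operator is a tight loop over an array with no tuple assembly. The price is the intermediate position and value arrays, which are written and read back between operators.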

with data being memory resident these become significant cost components

[figure: A<10 -> IDs -> B, B<20 -> IDs -> C -> minC]

alt1) start with B
alt2) scan A & B independently and merge
alt3) store intermediates as bit vectors - not positions
…


project: basic one + more if you decide to invest in this area
midterm: basic one + 2-3 alternatives
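As an illustration of how alt2 and alt3 can combine, the sketch below scans A and B independently, stores each result as a bit vector (one bit per position), and merges the two with a bitwise AND. The function names, the 64-bit word size, and the `int` value type are assumptions for this example only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of 64-bit words needed to hold n bits. */
#define WORDS(n) (((n) + 63) / 64)

/* alt3: the intermediate is a bit vector; bit i is set when col[i] < bound.
 * The loop body has no if: the comparison result is shifted into place. */
void select_lt_bits(const int *col, size_t n, int bound, uint64_t *out) {
    for (size_t w = 0; w < WORDS(n); w++)
        out[w] = 0;
    for (size_t i = 0; i < n; i++)
        out[i / 64] |= (uint64_t)(col[i] < bound) << (i % 64);
}

/* alt2: merge two independently produced selections, 64 positions per AND. */
void and_bits(const uint64_t *x, const uint64_t *y, size_t n, uint64_t *out) {
    for (size_t w = 0; w < WORDS(n); w++)
        out[w] = x[w] & y[w];
}

/* Minimum of col over the positions whose bit is set.
 * Returns 1 and stores the minimum in *result if any bit is set, else 0. */
int min_where_bits(const int *col, size_t n, const uint64_t *bits, int *result) {
    assert(result != NULL);
    int found = 0;
    for (size_t i = 0; i < n; i++) {
        if ((bits[i / 64] >> (i % 64)) & 1u) {
            if (!found || col[i] < *result)
                *result = col[i];
            found = 1;
        }
    }
    return found;
}
```

A bit vector always costs one bit per row regardless of how many rows qualify, while a position list costs one entry per qualifying row, so which intermediate is smaller depends on selectivity.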

[figure: A<10 -> IDs -> B, B<20 -> IDs -> C -> minC]

late tuple reconstruction/materialization
only reconstruct to present results
no need to assemble tuples
minimize memory footprint
minimize data we are moving up the memory hierarchy
but requires new processing engine

[figure: data on disk is loaded into memory; option 1: row-store engine; option 2: column-store engine]

possible data flow patterns
tuple at a time
block/vector at a time
column at a time

select min(C) from R where A<10 & B<20

[figure: the plan A<10 -> IDs -> B, B<20 -> IDs -> C -> minC, shown twice: column-at-a-time and vector-at-a-time]

the beer analogy

Marcin Zukowski, PhD
CEO/Co-founder of Vectorwise (now Actian)
now: “changing the world, one terabyte at a time”, co-founder of Snowflake

[figure: a query plan (op1, op2, op3) runs on the CPU over columns A and B; memory hierarchy from top to bottom: registers, on chip cache, on board cache, memory, disk; faster toward the top, cheaper toward the bottom; annotation: size of vector]

tuple at a time - minimizing memory footprint
bulk processing - minimizing functional overhead
vectorized processing - somewhere in between
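The middle ground can be sketched by running the same query one vector of positions at a time, so that intermediates never exceed a fixed size. This is only an illustration: `VECTOR_SIZE` of 1024 and all function names are assumptions for this example, not values from the slides.

```c
#include <assert.h>
#include <stddef.h>

#define VECTOR_SIZE 1024  /* assumed; meant to keep intermediates small */

/* select over one vector: emits offsets (relative to the vector start). */
static size_t vec_select_lt(const int *col, size_t len, int bound, size_t *sel) {
    size_t j = 0;
    for (size_t i = 0; i < len; i++)
        if (col[i] < bound)
            sel[j++] = i;
    return j;
}

/* refine an existing selection with a predicate on another column. */
static size_t vec_refine_lt(const int *col, const size_t *in, size_t n_in,
                            int bound, size_t *sel) {
    size_t j = 0;
    for (size_t i = 0; i < n_in; i++)
        if (col[in[i]] < bound)
            sel[j++] = in[i];
    return j;
}

/* select min(C) from R where A < a_bound & B < b_bound, one vector at a time.
 * Returns 1 and stores the minimum in *result if any row qualifies, else 0. */
int min_c_where_vectorized(const int *A, const int *B, const int *C, size_t n,
                           int a_bound, int b_bound, int *result) {
    assert(result != NULL);
    size_t sel1[VECTOR_SIZE], sel2[VECTOR_SIZE];  /* fixed-size intermediates */
    int found = 0;
    for (size_t start = 0; start < n; start += VECTOR_SIZE) {
        size_t len = (n - start < VECTOR_SIZE) ? n - start : VECTOR_SIZE;
        size_t n1 = vec_select_lt(A + start, len, a_bound, sel1);
        size_t n2 = vec_refine_lt(B + start, sel1, n1, b_bound, sel2);
        for (size_t i = 0; i < n2; i++) {
            int v = C[start + sel2[i]];
            if (!found || v < *result) {
                *result = v;
                found = 1;
            }
        }
    }
    return found;
}
```

Each operator still runs as a tight loop over an array, as in bulk processing, but the intermediates are bounded by the vector size instead of the column size.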

history/timeline

~1960s
1980s: ideas about block processing
2005: vectorwise
>2010: industry adoption

tuple at a time (label repeated three times along the timeline)

project: column-at-a-time
bonus: vectorized processing

update row7=(A=a,B=b,C=c,D=d)

[figure: row-store vs column-store layouts of A B C D]

which is better to update and why?
how much does it cost to update a single row? (think about pages, data movement)
how to update in column-stores? (query plan + algorithms)

[figure: query and update arrows over two copies of A B C D: base data and pending updates; pending updates are merged into the base data periodically]
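One possible reading of the base data / pending updates picture is sketched below for a single column: updates are appended to a small buffer, reads consult the buffer before the base array, and a merge step applies the buffer to the base data. The struct layout, the names, and the `MAX_PENDING` capacity of 64 are all assumptions for this example; a real design would also have to cover inserts, deletes, and all columns of a row.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PENDING 64  /* assumed buffer capacity */

typedef struct {
    size_t pos;    /* position (virtual id) of the updated value */
    int    value;  /* new value */
} Update;

typedef struct {
    int   *base;                  /* base data: dense, fixed-width column */
    size_t n;                     /* number of values in base */
    Update pending[MAX_PENDING];  /* pending updates, oldest first */
    size_t n_pending;
} Column;

/* The periodic step: apply pending updates in arrival order, then clear them. */
void column_merge(Column *c) {
    for (size_t i = 0; i < c->n_pending; i++)
        c->base[c->pending[i].pos] = c->pending[i].value;
    c->n_pending = 0;
}

/* Updates do not touch the base data; they are appended to the buffer. */
void column_update(Column *c, size_t pos, int value) {
    assert(pos < c->n);
    if (c->n_pending == MAX_PENDING)
        column_merge(c);  /* buffer full: merge early */
    c->pending[c->n_pending++] = (Update){ pos, value };
}

/* Queries check pending updates (newest first), then fall back to base data. */
int column_read(const Column *c, size_t pos) {
    assert(pos < c->n);
    for (size_t i = c->n_pending; i > 0; i--)
        if (c->pending[i - 1].pos == pos)
            return c->pending[i - 1].value;
    return c->base[pos];
}
```

Between merges the base data stays read-only, so scans over it remain sequential; the cost moves to the extra lookup in the pending buffer on every read.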

fractured mirrors

[figure: a query goes to the optimizer, which chooses between a columns copy and a rows copy of A B C D]

A Case for Fractured Mirrors
Ravishankar Ramamurthy, David J. DeWitt, Qi Su
VLDB Journal, 12(2): 89-101, 2003

Notes to remember

column-stores great for analytics
row-stores great for transactions
still basic concepts are the same
hybrids possible
keep access patterns sequential and simple (min ifs)

reading

Read: The Design and Implementation of Modern Column-store Database Systems (Sections: all -4.6 & 4.8)
by D. Abadi, P. Boncz, S. Harizopoulos, S. Idreos, S. Madden

Read: IEEE Data Engineering Bulletin, 35(1), March 2012
Special Issue on Column-stores (9 short overview papers)

research papers

Read: Database Architecture Optimized for the New Bottleneck: Memory Access
Peter Boncz, Stefan Manegold, Martin Kersten
In Proc. of the Very Large Databases Conference (VLDB), 1999

Browse: MonetDB/X100: Hyper-Pipelining Query Execution
Peter A. Boncz, Marcin Zukowski, Niels Nes
In Proc. of the Inter. Conference on Innovative Data Systems Research (CIDR), 2005

Browse: Materialization Strategies in a Column-Oriented DBMS
Daniel Abadi, Daniel Myers, David DeWitt, Samuel Madden
In Proc. of the Inter. Conference on Data Engineering (ICDE), 2007

Browse: Self-organizing tuple reconstruction in column-stores
Stratos Idreos, Martin Kersten, Stefan Manegold
In Proc. of the ACM SIGMOD Inter. Conference on Management of Data, 2009

class 5
column-stores 2.0
DATA SYSTEMS
prof. Stratos Idreos