Column-Oriented

Database Systems Tutorial

Peter Boncz (CWI)

Adapted from VLDB 2009 Tutorial Column-Oriented Systems with Daniel Abadi (Yale) Stavros Harizopuolos (HP Labs)

1 What is a column-store?

row-store column-store

Date Store Product Customer Price Date Store Product Customer Price

+ easy to add/modify a record + only need to read in relevant data

- might read in unnecessary data - tuple writes require multiple accesses

=> suitable for read-mostly, read-intensive, large data repositories

Tutorial Column-Oriented Database Systems 2 Are these two fundamentally different?

 The only fundamental difference is the storage layout  However: we need to look at the big picture

different storage layouts proposed row-stores row-stores++ row-stores++ converge?

‘70s ‘80s ‘90s ‘00s today column-stores new applications new bottleneck in hardware

 How did we get here, and where we are heading Part 1  What are the column-specific optimizations? Part 2  How do we improve CPU efficiency when operating on Cs Part 3 Tutorial Column-Oriented Database Systems 3 Outline

 Part 1: Basic concepts  Introduction to key features  From DSM to column-stores and performance tradeoffs  Column-store architecture overview  Will rows and columns ever converge?

 Part 2: Column-oriented execution

 Part 3: MonetDB/X100 (“VectorWise”) and CPU efficiency

Tutorial Column-Oriented Database Systems 4 Telco Data Warehousing example

 Typical DW installation dimension tables account fact table or RAM  Real-world example usage source

“One Size Fits All? - Part 2: Benchmarking toll Results” Stonebraker et al. CIDR 2007 star schema

QUERY 2 SELECT account.account_number, sum (usage.toll_airtime), sum (usage.toll_price) Column-store Row-store FROM usage, toll, source, account Query 1 2.06 300 WHERE usage.toll_id = toll.toll_id AND usage.source_id = source.source_id Query 2 2.20 300 AND usage.account_id = account.account_id Query 3 0.09 300 AND toll.type_ind in (‘AE’. ‘AA’) AND usage.toll_price > 0 Query 4 5.24 300 AND source.type != ‘CIBER’ Query 5 2.88 300 AND toll.rating_method = ‘IS’ AND usage.invoice_date = 20051013 GROUP BY account.account_number Why? Three main factors (next slides)

Tutorial Column-Oriented Database Systems 5 Telco example explained (1/3): read efficiency

row store column store

read pages containing entire rows read only columns needed one row = 212 columns! in this example: 7 columns is this typical? (it depends) caveats: • “select * ” not any faster • clever disk prefetching What about vertical partitioning? • clever tuple reconstruction (it does not work with ad-hoc queries)

Tutorial Column-Oriented Database Systems 6 Telco example explained (2/3): compression efficiency

 Columns compress better than rows  Typical row-store compression ratio 1 : 3  Column-store 1 : 10

 Why?  Rows contain values from different domains => more entropy, difficult to dense-pack  Columns exhibit significantly less entropy  Examples: Male, Female, Female, Female, Male 1998, 1998, 1999, 1999, 1999, 2000

 Caveat: CPU cost (use lightweight compression)

Tutorial Column-Oriented Database Systems 7 Telco example explained (3/3): sorting & indexing efficiency

 Compression and dense-packing free up space  Use multiple overlapping column collections  Sorted columns compress better  Range queries are faster  Use sparse clustered indexes

What about heavily-indexed row-stores? (works well for single column access, cross-column joins become increasingly expensive)

Tutorial Column-Oriented Database Systems 8 Additional opportunities for column-stores

 Block-tuple / vectorized processing  Easier to build block-tuple operators  Amortizes function-call cost, improves CPU cache performance  Easier to apply vectorized primitives  Software-based: bitwise operations  Hardware-based: SIMD Part 3  Opportunities with compressed columns  Avoid decompression: operate directly on compressed  Delay decompression (and tuple reconstruction) more  Also known as: late materialization in Part 2  Exploit columnar storage in other DBMS components  Physical design (both static and dynamic) See: Database Cracking, from CWI

Tutorial Column-Oriented Database Systems 9 “Column-Stores vs Row-Stores: How Different are They Effect on C-Store performance Really?” Abadi, Hachem, and Madden. SIGMOD 2008.

Average for SSBM queries on C-store

original

C-store Time (sec) Time

enable late column-oriented enable materialization join algorithm compression & operate on compressed

Tutorial Column-Oriented Database Systems 10 Summary of column-store key features

columnar storage Part 1 header/ID elimination  Storage layout compression Part 2 Part 3

multiple sort orders

column operators Part 1 Part 2

avoid decompression  Execution engine Part 2 late materialization

vectorized operations Part 3

 Design tools, optimizer

Tutorial Column-Oriented Database Systems 11 Outline

 Part 1: Basic concepts  Introduction to key features  From DSM to column-stores and performance tradeoffs  Column-store architecture overview  Will rows and columns ever converge?

 Part 2: Column-oriented execution

 Part 3: MonetDB/X100 (“VectorWise”) and CPU efficiency

Tutorial Column-Oriented Database Systems 12 From DSM to Column-stores TOD: Time Oriented Database – Wiederhold et al. 70s -1985: "A Modular, Self-Describing Clinical Databank System," Computers and Biomedical Research, 1975 More 1970s: Transposed files, Lorie, Batory, Svensson. “An overview of cantor: a new system for data analysis” Karasalo, Svensson, SSDBM 1983

1985: DSM paper “A decomposition storage model” Copeland and Khoshafian. SIGMOD 1985. 1990s: Commercialization through SybaseIQ Late 90s – 2000s: Focus on main-memory performance  DSM “on steroids” [1997 – now] CWI: MonetDB  Hybrid DSM/NSM [2001 – 2004] Wisconsin: PAX, Fractured Mirrors Michigan: Data Morphing CMU: Clotho 2005 – : Re-birth of read-optimized DSM as “column-store” MIT: C-Store CWI: X100VectorWise 10+ startups Tutorial Column-Oriented Database Systems 13 “A decomposition storage model” Copeland and Khoshafian. SIGMOD 1985. The original DSM paper

 Proposed as an alternative to NSM  2 indexes: clustered on ID, non-clustered on value  Speeds up queries projecting few columns  Requires more storage ID value 1 2 3 4 .. 0100 0962 1000 ..

Tutorial Column-Oriented Database Systems 14 Memory wall and PAX

 90s: Cache-conscious research “Cache Conscious Algorithms for from: Relational Query Processing.” Shatdal, Kant, Naughton. VLDB 1994. “DBMSs on a modern processor: “Database Architecture Optimized for and: Where does time go?” Ailamaki, to: the New Bottleneck: Memory Access.” DeWitt, Hill, Wood. VLDB 1999. Boncz, Manegold, Kersten. VLDB 1999.

 PAX: Partition Attributes Across  Retains NSM I/O pattern  Optimizes cache-to-RAM communication “Weaving Relations for Cache Performance.” Ailamaki, DeWitt, Hill, Skounakis, VLDB 2001.

Tutorial Column-Oriented Database Systems 15 More hybrid NSM/DSM schemes

 Dynamic PAX: Data Morphing “Data morphing: an adaptive, cache-conscious storage technique.” Hankins, Patel, VLDB 2003.

 Clotho: custom layout using scatter-gather I/O “Clotho: Decoupling Memory Page Layout from Storage Organization.” Shao, Schindler, Schlosser, Ailamaki, and Ganger. VLDB 2004.

 Fractured mirrors  Smart mirroring with both NSM/DSM copies “A Case For Fractured Mirrors.” Ramamurthy, DeWitt, Su, VLDB 2002.

Tutorial Column-Oriented Database Systems 16 MonetDB (more in Part 3)

 Late 1990s, CWI: Boncz, Manegold, and Kersten  Motivation:  Main-memory  Improve computational efficiency by avoiding expression interpreter  DSM with virtual IDs natural choice  Developed new query execution algebra  Initial contributions:  Pointed out memory-wall in DBMSs  Cache-conscious projections and joins  …

Tutorial Column-Oriented Database Systems 17 2005: the (re)birth of column-stores

 New hardware and application realities  Faster CPUs, larger memories, disk bandwidth limit  Multi-terabyte Data Warehouses  New approach: combine several techniques  Read-optimized, fast multi-column access, disk/CPU efficiency, light-weight compression

 C-store paper:  First comprehensive design description of a column-store  MonetDB/X100  “proper” disk-based column store  Explosion of new products

Tutorial Column-Oriented Database Systems 18 Performance tradeoffs: columns vs. rows DSM traditionally was not favored by technology trends How has this changed?

 Optimized DSM in “Fractured Mirrors,” 2002  “Apples-to-apples” comparison “Performance Tradeoffs in Read- Optimized ” Harizopoulos, Liang, Abadi, Madden, VLDB‟06  Follow-up study “Read-Optimized Databases, In- Depth” Holloway, DeWitt, VLDB‟08  Main-memory DSM vs. NSM “DSM vs. NSM: CPU performance tradeoffs in block-oriented query processing” Boncz, Zukowski, Nes, DaMoN‟08

 Flash-disks: a come-back for PAX? “Query Processing Techniques “Fast Scans and Joins Using Flash for Solid State Drives” Drives” Shah, Harizopoulos, Tsirogiannis, Harizopoulos, Shah,

Wiener, Graefe. DaMoN‟08 Wiener, Graefe, SIGMOD‟0919 Fractured mirrors: a closer look

 Store DSM relations inside a B-tree “A Case For Fractured Mirrors” Ramamurthy,  Leaf nodes contain values DeWitt, Su, VLDB 2002.  Eliminate IDs, amortize header overhead  Custom implementation on Shore sparse Tuple TID Column Header Data B-tree on ID 1 a1 3 2 a2 3 a3 1 a1 2 a2 3 a3 4 a4 5 a5 4 a4 1 a1 a2 a3 4 a4 a5 5 a5 Similar: storage density “Efficient columnar comparable storage in B-trees” Graefe. to column stores Sigmod Record 03/2007.

Tutorial Column-Oriented Database Systems 20 Fractured mirrors: performance From PAX paper: column? time row regular DSM column?

columns projected: 1 2 3 4 5

optimized  Chunk-based tuple merging DSM  Read in segments of M pages  Merge segments in memory  Becomes CPU-bound after 5 pages

Tutorial Column-Oriented Database Systems 21 Column-scanner “Performance Tradeoffs in Read- Optimized Databases” implementation Harizopoulos, Liang, Abadi, Madden, VLDB‟06 row scanner column scanner

Joe 45 … … Joe 45 … … SELECT name, age WHERE age > 40 apply S predicate(s) S #POS 45 Joe #POS … Direct I/O Sue … prefetch ~100ms apply 1 Joe 45 worth of data predicate #1 S 2 Sue 37 … … … 45 37 … Tutorial Column-Oriented Database Systems 22 Scan performance

 Large prefetch hides disk seeks in columns  Column-CPU efficiency with lower selectivity

 Row-CPU suffers from memory stalls not shown,  Memory stalls disappear in narrow tuples details in the paper  Compression: similar to narrow

Tutorial Column-Oriented Database Systems 23 Varying prefetch size

no competing disk traffic 40 Column 2 30 Column 8 20 Column 16 Column 48 (x 128KB) 10 time (sec) time Row (any prefetch size) 0 4 8 12 16 20 24 28 32 selected bytes per tuple

 No prefetching hurts columns in single scans

Tutorial Column-Oriented Database Systems 26 Varying prefetch size

no competing with competing disk traffic disk traffic 40 40 Column, 48 40 30 30 Row, 48 30

20 20 20 Column, 8

10 (sec) time 10 10 Row, 8 0 0 0 4 8 12 16 204 1224 2028 2832 4 12 20 28 selected bytes per tuple

 No prefetching hurts columns in single scans  Under competing traffic, columns outperform rows for any prefetch size

Tutorial Column-Oriented Database Systems 27 “DSM vs. NSM: CPU performance tradeoffs in block-oriented query CPU Performance processing” Boncz, Zukowski, Nes, DaMoN‟08  Playing field: in-flight tuples, regardless input storage  How is data stored in temporary in-memory buffers?  Idea: on-the-fly conversion between NSM and DSM  Insert data storage conversion operators

Convert the entire in-flight tuple, or part of it  Effectively hybrid layout  Should be planned by query optimizer “data layout planning” Winners:  DSM: sequential access (block fits in L2), random in L1  NSM: random access, SIMD for grouped Aggregation

VLDB 2009 Tutorial Column-Oriented Database Systems 28 Even faster column scans on flash SSDs 30K Read IOps, 3K Write Iops  New-generation SSDs 250MB/s Read BW, 200MB/s Write  Very fast random reads, slower random writes  Fast sequential RW, comparable to HDD arrays  No expensive seeks across columns  FlashScan and Flashjoin: PAX on SSDs, inside Postgres

“Query Processing Techniques for Solid State Drives” Tsirogiannis, Harizopoulos, Shah, Wiener, Graefe, SIGMOD‟09

mini-pages with no qualified attributes are not accessed

Tutorial Column-Oriented Database Systems 32 Column-scan performance over time regular DSM (2001) column-store (2006) ..to 1.2x slower from 7x slower

..to 2x slower ..to same

and 3x faster!

optimized DSM (2002) SSD Postgres/PAX (2009)

Tutorial Column-Oriented Database Systems 33 Outline

 Part 1: Basic concepts  Introduction to key features  From DSM to column-stores and performance tradeoffs  Column-store architecture overview  Will rows and columns ever converge?

 Part 2: Column-oriented execution

 Part 3: MonetDB/X100 (“VectorWise”) and CPU efficiency

Tutorial Column-Oriented Database Systems 34 Architecture of a column-store storage layout  read-optimized: dense-packed, compressed  organize in extends, batch updates  multiple sort orders  sparse indexes engine  block-tuple operators  new access methods system-level  work on compressed data  system-wide column support  optimized relational operators  loading / updates  scaling through multiple nodes  transactions / redundancy

Tutorial Column-Oriented Database Systems 35 Continuous Load and Query (Vertica)

Hybrid Storage Architecture

> Write Optimized > Read Optimized Store (WOS) Store (ROS) Trickle • On disk Load • Sorted / Compressed TUPLE MOVER A B C Asynchronous • Segmented Data Transfer • Large data loaded direct

. Memory based A B C . Unsorted / Uncompressed . Segmented

. Low latency / Small quick (A B C | A) inserts

Tutorial Column-Oriented Database Systems 38 Differential Updates in Column Stores

 Problem Definition:  A table T is kept ordered on a sortkey SK . table is stored column-wise, compressed  Keep updates (INS,MOD,DEL) in a differential structure . “write store”, in C-Store/Vertica

 Questions:  What data structures/algorithms to use for updates?  How are read-only queries impacted? . Must merge differential structure with base table  Maximal update throughput that can be sustained? . Given constraints on . RAM for differences . IO bandwidth . per-modification memory cost  Currently viable approach for many scenarios

Tutorial Column-Oriented Database Systems 40 Value-based Delta Tables (VDTs)

 How is the row-store organized? SQL: T=inventory SK=[store,prod]  Simple approach:  INS(C1..Cn) table  Keeps full record  DEL(SK1..SKm)  Only stores keys of deleted tuples Physical Relational Algebra:  Modifications are DEL+INS

Problem: overhead on queries  Always a MergeUnion (and MergeDiff) CPU cost  Union/Diff need the SKs  Many queries do not need these by themselves  Now the column-store must do IO for them anyway 

Tutorial Column-Oriented Database Systems 41 Positional Updates

Remember the position of an update rather than its SK  less CPU cost (especially in block-oriented processing)  Scan passes tuples through unmodified until the next modification position  no need to scan SK columns

the SK=(store,prod) so inserts go “in between”!!

TABLEx = state of TABLE at time x

SID(t)=StableID=position of tuple t in columns=RID0(t)  Stable RIDx(t) = RowID of tuple t at time x  HIGHLY VOLATILE!!

Tutorial Column-Oriented Database Systems 42 RIDs and SIDs TABLEx = state of TABLE at time x

SID(t)=StableID=position of tuple t in columns=RID0(t)  Stable RIDx(t) = RowID of tuple t at time x  HIGHLY VOLATILE!!

The difference between RID and SID:

Tutorial Column-Oriented Database Systems 43 Enter Positional Delta Trees

 A counting B-Tree that keeps track of cumulative tuple shifts

 DIFF^ta_tb maintains a monotonic mapping between RIDx(ta) and RIDy(tb)  DIFF = (PDT,VALS), VALS = value space  Merge Operator  used by each query to get up-todate table

Tutorial Column-Oriented Database Systems 44 Enter Positional Delta Trees

 A counting B-Tree that keeps track of cumulative tuple shifts

 DIFF^ta_tb maintains a monotonic mapping between RIDx(ta) and RIDy(tb)  DIFF = (PDT,VALS), VALS = value space  Merge Operator  used by each query to get up-to-date table

Tutorial Column-Oriented Database Systems 45 PDT-based Transactions

 Propagate(PDT^t2_t1,PDT^t3_t2)  Requires Consecutive PDTs

 Serialize(PDT^t2_t0,PDT^t3_t1)  PDT^t3_t2  Makes Overlapping PDTs Aligned  May Fail!!  transactional conflict detection

Tutorial Column-Oriented Database Systems 46 Column-Oriented

Database Systems Tutorial

Compression

“Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE‟06

“Integrating Compression and Execution in Column- Oriented Database Systems” Abadi, Madden, and Ferreira, SIGMOD ‟06

•Query optimization in compressed database systems” Chen, Gehrke, Korn, SIGMOD‟01 Compression

 Trades I/O for CPU  Increased column-store opportunities:  Higher data value locality in column stores  Techniques such as run length encoding far more useful  Can use extra space to store multiple copies of data in different sort orders

Tutorial Column-Oriented Database Systems 57 Run-length Encoding

Quarter Product ID Price Quarter Product ID Price Q1 1 5 (value, start_pos, run_length) (value, start_pos, run_length) (1, 1, 5) 5 Q1 1 7 (Q1, 1, 300) (2, 6, 2) 7 Q1 1 2 (Q2, 301, 350) 2 Q1 1 9 … 9 Q1 1 6 (Q3, 651, 500) (1, 301, 3) 6 Q1 2 8 (Q4, 1151, 600) (2, 304, 1) 8 Q1 2 5 … … … … 5 … Q2 1 3 3 Q2 1 8 8 Q2 1 1 1 Q2 2 4 … … … 4 …

Tutorial Column-Oriented Database Systems 58 “Integrating Compression and Execution in Column- Oriented Database Systems” Abadi et. al, SIGMOD ‟06 Bit-vector Encoding

 For each unique Product ID ID: 1 ID: 2 ID: 3 … value, v, in column c, create bit-vector b 1 1 0 0 0  b[i] = 1 if c[i] = v 1 1 0 0 0  Good for columns 1 1 0 0 0 with few unique 1 1 0 0 0 values 1 1 0 0 0  Each bit-vector can 2 0 1 0 0 be further 2 0 1 0 0 compressed if sparse … … … … … 1 1 0 0 0 1 1 0 0 0 2 0 1 0 0 3 0 0 1 0 … … … … …

Tutorial Column-Oriented Database Systems 59 “Integrating Compression and Execution in Column- Oriented Database Systems” Abadi et. al, SIGMOD ‟06 Dictionary Encoding

 For each unique Quarter Quarter value create 0 Quarter dictionary entry 1 Q1 3  Dictionary can be 0 24 Q2 2 per-block or per- 0 128 Q4 0 column 0 122 Q1 1  Column-stores 3 have the Q3 2 2 + advantage that Q1 OR dictionary entries + may encode Q1 Dictionary Q1 Dictionary multiple values at 24: Q1, Q2, Q4, Q1 once Q2 0: Q1 … Q4 1: Q2 122: Q2, Q4, Q3, Q3 Q3 2: Q3 … Q3 3: Q4 128: Q3, Q1, Q1, Q1 …

Tutorial Column-Oriented Database Systems 60 Frame Of Reference Encoding

Price Price  Encodes values as b bit offset Frame: 50 45 from chosen frame of -5 54 reference 4 48  Special escape code (e.g. all -2 55 4 bits per bits set to 1) indicates a 5 51 value difference larger than can be 1 53 stored in b bits 3 40  After escape code, original ∞ 50 (uncompressed) value is 40 49 Exceptions (see written 0 part 3 for a better 62 way to deal with -1 exceptions) 52 ∞ “Compressing Relations and Indexes 50 … 62 ” Goldstein, Ramakrishnan, Shaft, 2 ICDE‟98 0 … Tutorial Column-Oriented Database Systems 61 Differential Encoding

Time  Encodes values as b bit offset Time from previous value 5:00 5:00  Special escape code (just like frame of reference encoding) 5:02 2 indicates a difference larger than 5:03 1 can be stored in b bits 2 bits per  After escape code, original 5:03 0 value (uncompressed) value is written 5:04 1  Performs well on columns 5:06 2 containing increasing/decreasing sequences 5:07 1  inverted lists 5:08 1  timestamps 5:10 2  object IDs Exception (see 5:15 ∞ X100 for a better  sorted / clustered columns way to deal with 5:16 5:15 exceptions) “Improved Word-Aligned Binary 5:16 1 Compression for Text Indexing” … 0 Ahn, Moffat, TKDE‟06

Tutorial Column-Oriented Database Systems 62 What Compression Scheme To Use?

Does column appear in the sort key?

yes no Is the average Are number of unique run-length > 2 values < ~50000 yes no yes Differential RLE Encoding no Does this column appear frequently in selection predicates?

yes no Is the data numerical and exhibit good locality?

yes no Bit-vector Dictionary Compression Compression Frame of Reference Leave Data Encoding Uncompressed OR Heavyweight Compression

Tutorial Column-Oriented Database Systems 63 “Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE‟06 Heavy-Weight Compression Schemes

 Modern disk arrays can achieve > 1GB/s  1/3 CPU for decompression  3GB/s needed

 Lightweight compression schemes are better

 Even better: operate directly on compressed data

Tutorial Column-Oriented Database Systems 64 “Integrating Compression and Execution in Column- Oriented Database Systems” Abadi et. al, SIGMOD ‟06 Operating Directly on Compressed Data

 I/O - CPU tradeoff is no longer a tradeoff  Reduces memory–CPU bandwidth requirements  Opens up possibility of operating on multiple records at once

Tutorial Column-Oriented Database Systems 65 “Integrating Compression and Execution in Column- Oriented Database Systems” Abadi et. al, SIGMOD ‟06 Operating Directly on Compressed Data

Quarter Product ID 1 2 3 … … … ProductID, COUNT(*)) (Q1, 1, 300) 1 0 0 301-306 0 0 1 (1, 3) (Q2, 301, 6) 0 1 0 (2, 1) (Q3, 307, 500) 1 0 0 1 0 0 (3, 2) (Q4, 807, 600) 0 0 1 0 1 0 SELECT ProductID, Count(*) 1 0 0 FROM table 0 0 1 WHERE (Quarter = Q2) 0 1 0 GROUP BY ProductID Index Lookup + Offset jump 0 0 1 … … …

Tutorial Column-Oriented Database Systems 66 “Integrating Compression and Execution in Column- Oriented Database Systems” Abadi et. al, SIGMOD ‟06 Operating Directly on Compressed Data

SELECT ProductID, Count(*) Block API FROM table WHERE (Quarter = Q2) Data GROUP BY ProductID Aggregation isOneValue() Operator isValueSorted() isPosContiguous() isSparse()

Selection getNext() Operator decompressIntoArray() (Q1, 1, 300) getValueAtPosition(pos) (Q2, 301, 6) getMin() Compression- getMax() (Q3, 307, 500) Aware Scan getSize() (Q4, 807, 600) Operator

Tutorial Column-Oriented Database Systems 67 Column-Oriented

Database Systems Tutorial

Tuple Materialization and Column-Oriented Join Algorithms

“Materialization Strategies in a Column- “Query Processing Techniques for Oriented DBMS” Abadi, Myers, DeWitt, Solid State Drives” Tsirogiannis, and Madden. ICDE 2007. Harizopoulos Shah, Wiener, and Graefe. SIGMOD 2009. “Self-organizing tuple reconstruction in column-stores“, Idreos, Manegold, “Cache-Conscious Radix-Decluster Kersten, SIGMOD 2009 Projections”, Manegold, Boncz, Nes, VLDB 2004 “Column-Stores vs Row-Stores: How Different are They Really?” Abadi, Madden, and Hachem. SIGMOD 2008. When should columns be projected?

 Where should column projection operators be placed in a query plan?  Row-store:  Column projection involves removing unneeded columns from tuples  Generally done as early as possible  Column-store:  Operation is almost completely opposite from a row-store  Column projection involves reading needed columns from storage and extracting values for a listed set of tuples . This process is called “materialization”  Early materialization: project columns at beginning of query plan . Straightforward since there is a one-to-one mapping across columns  Late materialization: wait as long as possible for projecting columns . More complicated since selection and join operators on one column obfuscates mapping to other columns from same table  Most column-stores construct tuples and column projection time . Many database interfaces expect output in regular tuples (rows) . Rest of discussion will focus on this case

Tutorial Column-Oriented Database Systems 69 When should tuples be constructed?

Select + Aggregate QUERY: 4 2 2 7 SELECT custID,SUM(price) FROM table 4 1 3 13 WHERE (prodID = 4) AND 4 3 3 42 (storeID = 1) AND GROUP BY custID 4 1 3 80

Construct  Solution 1: Create rows first (EM). But:  Need to construct ALL tuples (4,1,4) 2 2 7 1 3 13  Need to decompress data 3 3 42  Poor memory bandwidth utilization 1 3 80

prodID storeID custID price

Tutorial Column-Oriented Database Systems 70 Solution 2: Operate on columns

QUERY: 1 0 SELECT custID,SUM(price) 1 1 AGG FROM table 1 0 WHERE (prodID = 4) AND 1 1 (storeID = 1) AND Data Data GROUP BY custID Source Source custID price Data Data Source Source AND 4 2 2 7 4 2 4 1 3 13

Data Data 4 1 4 3 3 42 Source Source 4 3 prodID storeID 4 1 3 80 4 1 prodID storeID custID price prodID storeID

Tutorial Column-Oriented Database Systems 71 Solution 2: Operate on columns

QUERY: 0 SELECT custID,SUM(price) AGG 1 FROM table WHERE (prodID = 4) AND 0 (storeID = 1) AND Data Data 1 GROUP BY custID Source Source custID price AND AND 4 2 2 7 1 0 4 1 3 13 1 1 Data Data 4 3 3 42 Source Source 1 0 prodID storeID 4 1 3 80 1 1 prodID storeID custID price

Tutorial Column-Oriented Database Systems 72 Solution 2: Operate on columns

QUERY: 3 131 SELECT custID,SUM(price) AGG 3 801 FROM table WHERE (prodID = 4) AND (storeID = 1) AND 02 07 Data Data GROUP BY custID Source Source 13 Data Data 131 custID price 03 Source Source 420 13 801 AND custID price 4 2 2 7 0 4 1 3 13 1 Data Data 4 3 3 42 Source Source 0 prodID storeID 4 1 3 80 1 prodID storeID custID price

Tutorial Column-Oriented Database Systems 73 Solution 2: Operate on columns

QUERY: SELECT custID,SUM(price) AGG FROM table WHERE (prodID = 4) AND 13 931 (storeID = 1) AND Data Data GROUP BY custID Source Source custID price

AGG AND 4 2 2 7 4 1 3 13 13 131 Data Data 4 3 3 42 Source Source 13 801 prodID storeID 4 1 3 80 prodID storeID custID price

Tutorial Column-Oriented Database Systems 74 “Materialization Strategies in a Column-Oriented DBMS” Abadi, Myers, DeWitt, and Madden. ICDE 2007. For plans without joins, late materialization is a win

10 QUERY: 9 8 SELECT C1, SUM(C2) 7 FROM table 6 WHERE (C1 < CONST) AND 5 (C < CONST) 4 2 3 GROUP BY C1

Time (seconds) 2 1  Ran on 2 compressed 0 columns from TPC-H Low selectivity Medium High selectivity selectivity scale 10 data

Early Materialization Late Materialization

Tutorial Column-Oriented Database Systems 75 “Materialization Strategies in a Column-Oriented DBMS” Abadi, Myers, DeWitt, and Madden. ICDE 2007. Even on uncompressed data, late materialization is still a win

14 QUERY:

12 SELECT C1, SUM(C2) 10 FROM table

8 WHERE (C1 < CONST) AND 6 (C2 < CONST) GROUP BY C1

4 Time (seconds) 2 0  Materializing late still Low selectivity Medium High selectivity works best selectivity

Early Materialization Late Materialization

Tutorial Column-Oriented Database Systems 76 What about for plans with joins?

Select R1.B, R1.C, R2.E, R2.H, R3.F B, C, E, H, F From R1, R2, R3 Project Where R1.A = R2.D AND R2.G = R3.K

B, C, E, H, F Join F B, C G = K Join 2 G = K G K B, C, E, G, H Early F, K ProjectLate Scan materializationJoin 1 materializationR3 (F, K, L) A = D Scan Join A = D R3 (F, K, L) G E, H A, B, C D, E, G, H A D

Scan Scan Scan Scan

R1 (A, B, C) R2 (D, E, G, H) R1 (A, B, C) R2 (D, E, G, H) Tutorial Column-Oriented Database Systems 77 77 Early Materialization Example

2 7 QUERY: SELECT C.lastName,SUM(F.price) 3 13 1 Green FROM facts AS F, customers AS C 3 42 2 White WHERE F.custID = C.custID GROUP BY C.lastName 3 80 3 Brown

Construct Construct

(4,1,4) 12 6 2 7 1 1 3 13 1 Green 11 2 3 42 2 White 1 1 3 80 3 Brown

prodID storeID quantity custID price custID lastName Facts Customers

Tutorial Column-Oriented Database Systems 78 Early Materialization Example

7 White QUERY: SELECT C.lastName,SUM(F.price) 13 Brown FROM facts AS F, customers AS C 42 Brown WHERE F.custID = C.custID GROUP BY C.lastName 80 Brown

Join

2 7 3 13 1 Green 3 42 2 White 3 80 3 Brown

Tutorial Column-Oriented Database Systems 79 Late Materialization Example

1 2 QUERY: 2 3 SELECT C.lastName,SUM(F.price) 3 1 FROM facts AS F, customers AS C WHERE F.custID = C.custID 4 3 GROUP BY C.lastName

Join Late materialized join causes out of order probing of projected (4,1,4) 12 6 2 7 columns from the inner 1 1 3 13 1 Green relation 11 2 1 42 2 White 1 1 3 80 3 Brown prodID storeID quantity custID price custID lastName Facts Customers

Tutorial Column-Oriented Database Systems 80 Late Materialized Join Performance

 Naïve LM join about 2X slower than EM join on typical queries (due to random I/O)  This number is very dependent on  Amount of memory available  Number of projected attributes  Join cardinality  But we can do better  Invisible Join  Jive/Flash Join  Radix cluster/decluster join

Tutorial Column-Oriented Database Systems 81 “Column-Stores vs Row-Stores: How Different are They Really?” Abadi, Invisible Join Madden, and Hachem. SIGMOD 2008.

 Designed for typical joins when data is modeled using a star schema  One (“fact”) table is joined with multiple dimension tables

 Typical query:

select c_nation, s_nation, d_year, sum(lo_revenue) as revenue from customer, lineorder, supplier, date where lo_custkey = c_custkey and lo_suppkey = s_suppkey and lo_orderdate = d_datekey and c_region = 'ASIA„ and s_region = 'ASIA„ and d_year >= 1992 and d_year <= 1997 group by c_nation, s_nation, d_year order by d_year asc, revenue desc;

Tutorial Column-Oriented Database Systems 82 “Column-Stores vs Row-Stores: How Different are They Really?” Abadi, Madden, and Hachem. SIGMOD 2008. Invisible Join

Apply “region = „Asia‟” On Customer Table custkey region nation … 1 ASIA CHINA … Hash Table (or bit-map) 2 ASIA INDIA … Containing Keys 1, 2 and 3 3 ASIA INDIA … 4 EUROPE FRANCE …

Apply “region = „Asia‟” On Supplier Table suppkey region nation … 1 ASIA RUSSIA … Hash Table (or bit-map) 2 EUROPE SPAIN … Containing Keys 1, 3 3 ASIA JAPAN …

Apply “year in [1992,1997]” On Date Table dateid year … 01011997 1997 … Hash Table Containing Keys 01021997 1997 … 01011997, 01021997, and 01031997 1997 … 01031997

Tutorial Column-Oriented Database Systems 83 “Column-Stores vs Row-Stores: Original Fact Table How Different are They Really?” orderkey custkey suppkey orderdate revenue Abadi et. al. SIGMOD 2008. 1 3 1 01011997 43256 2 3 2 01011997 33333 1 1 1 1 3 4 3 01021997 12121 1 0 1 0 4 1 1 01021997 23233 0 1 1 0 5 4 2 01021997 45456 1 1 1 1 6 1 2 01031997 43251 0 & 0 & 1 = 0 7 3 2 01031997 34235 1 0 1 0 1 0 1 0

Hash Table Hash Table Hash Table Containing Containing Keys Containing Keys 01011997, 1, 2 and 3 Keys 1 and 3 01021997, and 01031997 + + + custkey suppkey orderdate 3 1 1 1 01011997 1 3 1 2 0 01011997 1 4 0 3 1 01021997 1 1 1 1 1 01021997 1 4 0 2 0 01021997 1 1 1 2 0 01031997 1 3 1 2 0 01031997 1

Tutorial Column-Oriented Database Systems 84 custkey 3 1 0 3 CHINA 4 0 3 INDIA 1 1 INDIA CHINA 1 + INDIA + 0 = 4 FRANCE 1 0 3 0 suppkey 1 1 2 0 3 0 1 RUSSIA RUSSIA 1 1 1 SPAIN RUSSIA 0 + = 2 + JAPAN 2 0 2 0 orderdate 01011997 1 01011997 0 01011997 1997 01021997 0 01011997 1997 JOIN 01021997 1997 1997 01021997 + 1 01021997 01031997 1997 = 01021997 0 01031997 0 01031997 0

Tutorial Column-Oriented Database Systems 85 “Column-Stores vs Row-Stores: How Different are They Really?” Abadi, Madden, and Hachem. SIGMOD 2008. Invisible Join

Apply “region = „Asia‟” On Customer Table custkey region nation … 1 ASIA CHINA … Hash Table (or bit-map) 2 ASIA INDIA … Containing Keys 1, 2 and 3 3 ASIA INDIA … 4 EUROPE FRANCE … Range [1-3] (between-predicate rewriting) Apply “region = „Asia‟” On Supplier Table suppkey region nation … 1 ASIA RUSSIA … Hash Table (or bit-map) 2 EUROPE SPAIN … Containing Keys 1, 3 3 ASIA JAPAN …

Apply “year in [1992,1997]” On Date Table dateid year … 01011997 1997 … Hash Table Containing Keys 01021997 1997 … 01011997, 01021997, and 01031997 1997 … 01031997

Tutorial Column-Oriented Database Systems 86 Invisible Join

Bottom Line  Many data warehouses model data using star/snowflake schemes  Joins of one (fact) table with many dimension tables is common  Invisible join takes advantage of this by making sure that the table that can be accessed in position order is the fact table for each join  Position lists from the fact table are then intersected (in position order)  This reduces the amount of data that must be accessed out of order from the dimension tables  “Between-predicate rewriting” trick not relevant for this discussion

Tutorial Column-Oriented Database Systems 87 custkey 3 1 0 3 CHINA 4 0 3 INDIA 1 1 INDIA CHINA 1 + INDIA + 0 = 4 FRANCE 1 0 3 0 suppkey Still accessing table out of order 1 1 2 0 3 0 1 RUSSIA RUSSIA 1 1 1 SPAIN RUSSIA 0 + = 2 + JAPAN 2 0 2 0 orderdate 01011997 1 01011997 0 01011997 1997 01021997 0 01011997 1997 JOIN 01021997 1997 1997 01021997 + 1 01021997 01031997 1997 = 01021997 0 01031997 0 01031997 0

Tutorial Column-Oriented Database Systems 88 Jive/Flash Join

3 CHINA INDIA 1 INDIA CHINA + INDIA = FRANCE “Fast Joins using Join Indices”. Li and Ross, VLDBJ 8:1-24, 1999. Still accessing table out of order

“Query Processing Techniques for Solid State Drives”. Tsirogiannis, Harizopoulos et. al. SIGMOD 2009.

Tutorial Column-Oriented Database Systems 89 Jive/Flash Join

1. Add column with dense ascending integers from 1 1 3 CHINA INDIA 2. Sort new position list by 2 1 INDIA CHINA + INDIA = second column FRANCE 3. Probe projected column in order using new sorted position list, keeping first column from position list around 2 1 CHINA 2 CHINA 1 3 INDIA 1 INDIA 4. Sort new result by first + INDIA = column FRANCE

1 INDIA 2 CHINA

Tutorial Column-Oriented Database Systems 90 Jive/Flash Join

Bottom Line  Instead of probing projected columns from inner table out of order:  Sort join index  Probe projected columns in order  Sort result using an added column  LM vs EM tradeoffs:  LM has the extra sorts (EM accesses all columns in order)  LM only has to fit join columns into memory (EM needs join columns and all projected columns) . Results in big memory and CPU savings (see part 3 for why there is CPU savings)  LM only has to materialize relevant columns  In many cases LM advantages outweigh disadvantages  LM would be a clear winner if not for those pesky sorts … can we do better?

Tutorial Column-Oriented Database Systems 91 Radix Cluster/Decluster

 The full sort from the Jive join is actually overkill  We just want to access the storage blocks in order (we don‟t mind random access within a block)  So do a radix sort and stop early  By stopping early, data within each block is accessed out of order, but in the order specified in the original join index  Use this pseudo-order to accelerate the post-probe sort as well

•“Database Architecture Optimized for the New Bottleneck: Memory Access” “Cache-Conscious Radix-Decluster VLDB‟99 Projections”, Manegold, Boncz, Nes, •“Generic Database Cost Models for VLDB‟04 Hierarchical Memory Systems”, VLDB‟02 (all Manegold, Boncz, Kersten)

Tutorial Column-Oriented Database Systems 92 Radix Cluster/Decluster Bottom line  Both sorts from the Jive join can be significantly reduced in overhead  Only been tested when there is sufficient memory for the entire join index to be stored three times  Technique is likely applicable to larger join indexes, but utility will go down a little  Only works if random access within a storage block  Don‟t want to use radix cluster/decluster if you have variable- width column values or compression schemes that can only be decompressed starting from the beginning of the block

Tutorial Column-Oriented Database Systems 93 LM vs EM joins

 Invisible, Jive, Flash, Cluster, Decluster techniques contain a bag of tricks to improve LM joins  Research papers show that LM joins become 2X faster than EM joins (instead of 2X slower) for a wide array of query types

Tutorial Column-Oriented Database Systems 94 Tuple Construction Heuristics

 For queries with selective predicates, aggregations, or compressed data, use late materialization  For joins:  Research papers:  Always use late materialization  Commercial systems:  Inner table to a join often materialized before join (reduces system complexity):  Some systems will use LM only if columns from inner table can fit entirely in memory

Tutorial Column-Oriented Database Systems 95 Outline

 Part 1: Basic concepts  Introduction to key features  From DSM to column-stores and performance tradeoffs  Column-store architecture overview  Will rows and columns ever converge?  Part 2: Column-oriented execution

 Part 3: MonetDB/X100 (“VectorWise”) and CPU efficiency

Tutorial Column-Oriented Database Systems 96 Outline

 Computational Efficiency of DB on modern hardware  how column-stores can help here  MonetDB & VectorWise in more depth

 CPU efficient column compression  vectorized decompression

 Conclusions  future work

Tutorial Column-Oriented Database Systems 97 Column-Oriented

Database Systems Tutorial

40 years of hardware evolution vs. DBMS computational efficiency CPU Architecture

Elements:  Storage  CPU caches L1/L2/L3  Registers  Execution Unit(s)  Pipelined  SIMD

Tutorial Column-Oriented Database Systems 99 Super-Scalar Execution (pipelining)

Tutorial Column-Oriented Database Systems 102 Hazards

 Data hazards  Control Hazards  Dependencies between instructions  Branch mispredictions  L1 data cache misses  Computed branches (late binding)  L1 instruction cache misses

Result: bubbles in the pipeline

Out-of-order execution addresses data hazards  control hazards typically more expensive

Tutorial Column-Oriented Database Systems 103 SIMD

 Single Instruction Multiple Data  Same operation applied on a vector of values  MMX: 64 bits, SSE: 128bits, AVX: 256bits  SSE, e.g. multiply 8 short integers

Tutorial Column-Oriented Database Systems 104 A Look at the Query Pipeline

102 ivan 350

next() 377 *– 5030 SELECT id, name (age-30)*50 AS bonus 102 ivan 37 7 350 FROM employee PROJECT WHERE age > 30

next() 2237 > 30 ? 102101 ivanalice 3722 FALSETRUE SELECT next() 101102 aliceivan 2237 SCAN

Tutorial Column-Oriented Database Systems 105 A Look at the Query Pipeline

102 ivan 350 next() 377 *– 5030 Operators

PROJECT Iterator interface next() 2237 > 30 ? -open() -next(): tuple SELECT -close() next() SCAN

Tutorial Column-Oriented Database Systems 106 A Look at the Query Pipeline

102 ivan 350 next() 377 *– 5030 Primitives 102 ivan 37 7 350

PROJECT Provide computational next() 2237 > 30 ? functionality 102101 ivanalice 3722 FALSETRUE SELECT All arithmetic allowed in expressions, next() e.g. Multiplication 101102 aliceivan 2237 SCAN 7 * 50 mult(int,int)  int

Tutorial Column-Oriented Database Systems 107 Database Architecture causes Hazards

 Data hazards  Control Hazards  Dependencies between instructions  Branch mispredictions  L1 data cache misses  Computed branches (late binding) SIMD Out-of-order Execution  L1 instruction cache misses

work on one tuple at a time Large Tree/Hash Structures PROJECT Operators Code footprint of all operators in query plan exceeds L1 cache SELECT Iterator interface -open() Data-dependent conditions -next(): tuple Next() late binding SCAN -close() method calls “DBMSs On A Modern Processor: Where Does Time Go? ” Tree, List, Hash traversal Complex NSM record navigation Ailamaki, DeWitt, Hill, Wood, VLDB‟99

Tutorial Column-Oriented Database Systems 108 DBMS Computational Efficiency TPC-H 1GB, query 1  selects 98% of fact table, computes net prices and aggregates all  Results:  C program: ?  MySQL: 26.2s  DBMS “X”: 28.1s

“MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR‟05

Tutorial Column-Oriented Database Systems 109 DBMS Computational Efficiency TPC-H 1GB, query 1  selects 98% of fact table, computes net prices and aggregates all  Results:  C program: 0.2s  MySQL: 26.2s  DBMS “X”: 28.1s

“MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR‟05

Tutorial Column-Oriented Database Systems 110 Column-Oriented

Database Systems Tutorial

MonetDB a column-store

 “save disk I/O when scan-intensive queries need a few columns”  “avoid an expression interpreter to improve computational efficiency”

Tutorial Column-Oriented Database Systems 112 RISC Database Algebra

CPU ? Give it “nice” code !

- few dependencies (control,data) - CPU gets out-of-order execution - compiler can e.g. generate SIMD Simple, hard- One loop for an entire column -batcalc_minus_int(int*no per-tuple interpretation res, coded semantics int* col, - arrays: no record navigation int val, in operators - better instruction cacheint localityn) { MATERIALIZED SELECTfor(i=0;id, name, i 30

Tutorial Column-Oriented Database Systems 113 a column-store

 “save disk I/O when scan-intensive queries need a few columns”

 “avoid an expression interpreter to improve computational efficiency” How?

 RISC query algebra: hard-coded semantics

 Decompose complex expressions in multiple operations

 Operators only handle simple arrays

 No code that handles slotted buffered record layout

 Relational algebra becomes array manipulation language

 Often SIMD for free

 Plus: use of cache-conscious algorithms for Sort/Aggr/Join

Tutorial Column-Oriented Database Systems 114 a Faustian pact

 You want efficiency  Simple hard-coded operators  I take scalability  Result materialization

 C program: 0.2s

 MonetDB: 3.7s

 MySQL: 26.2s

 DBMS “X”: 28.1s

Tutorial Column-Oriented Database Systems 115 SIGMOD 1985

•“MIL Primitives for Querying a Fragmented World”, Boncz, Kersten, VLDBJ‟98 •“Flattening an Object Algebra to Provide Performance” Boncz, Wilschut, Kersten, ICDE‟98 •“MonetDB/XQuery: a fast XQuery processor powered by a relational engine” Boncz, Grust, vanKeulen, Rittinger, Teubner, SIGMOD‟06 •“SW-Store: a vertically partitioned DBMS for Semantic Web data management“ Abadi, Marcus, Madden, Hollenbach, VLDBJ‟09

Tutorial Column-Oriented Database Systems 117 optimizer frontend backend as a research platform •“Database Architecture Optimized for the New  Cache-Conscious Joins Bottleneck: Memory Access” VLDB‟99  Cost Models, Radix-cluster, •“Generic Database Cost Models for Hierarchical Memory Radix-decluster Systems”, VLDB‟02 (all Manegold, Boncz, Kersten) •“Cache-Conscious Radix-Decluster Projections”, Manegold, Boncz, Nes, VLDB‟04  MonetDB/XQuery:  structural joins exploiting “MonetDB/XQuery: a fast XQuery processor powered positional column access by a relational engine” Boncz, Grust, vanKeulen, Rittinger, Teubner, SIGMOD‟06

 Cracking: “Database Cracking”, CIDR‟07  on-the-fly automatic indexing “Updating a cracked database “, SIGMOD‟07 without workload knowledge “Self-organizing tuple reconstruction in column- stores“, SIGMOD‟09 (all Idreos, Manegold, Kersten)

 Recycling: “An architecture for recycling intermediates in a  using materialized intermediates column-store”, Ivanova, Kersten, Nes, Goncalves, SIGMOD‟09

 Run-time Query Optimization: “ROX: run-time optimization of XQueries”,  correlation-aware run-time Abdelkader, Boncz, Manegold, vanKeulen, optimization without cost model SIGMOD‟09 Tutorial Column-Oriented Database Systems 119 Cache-Conscious Joins Radix-Cluster

 Radix-Partitioned Hash Join  create partitions << CPU cache  small partitions  many partitions  many partitions  multiple passes needed

•“Database Architecture Optimized for the New Bottleneck: Memory Access” VLDB‟99 •“Generic Database Cost Models for Hierarchical Memory Systems”, VLDB‟02 (all Manegold, Boncz, Kersten)

VLDB 2009 Tutorial Column-Oriented Database Systems 120 Cache-Conscious Joins Radix-Cluster

 Radix-Partitioned Hash Join  create partitions << CPU cache  small partitions  many partitions  many partitions  multiple passes needed

 Radix-Cluster  Radix-Sort with early stopping  Each pass looks at Bi higher-most radix bits Bi  Splitting each input cluster into 2 output clusters  leaves relation partially ordered

•“Database Architecture Optimized for the New Bottleneck: Memory Access” VLDB‟99 •“Generic Database Cost Models for Hierarchical Memory Systems”, VLDB‟02 (all Manegold, Boncz, Kersten)

VLDB 2009 Tutorial Column-Oriented Database Systems 121 Cache-Conscious Joins Radix-Cluster

 Multiple clustering passes  Limit number of clusters per pass  Avoid cache/TLB trashing  Trade memory cost for CPU cost  Any data type (hashing)

•“Database Architecture Optimized for the New Bottleneck: Memory Access” VLDB‟99 •“Generic Database Cost Models for Hierarchical Memory Systems”, VLDB‟02 (all Manegold, Boncz, Kersten)

VLDB 2009 Tutorial Column-Oriented Database Systems 122 Cache-Conscious Joins Accurate Cache Miss Cost Modeling

VLDB 2009 Tutorial Column-Oriented Database Systems 123 Cache-Conscious Joins Partitioned Hash Join

VLDB 2009 Tutorial Column-Oriented Database Systems 124 Cache-Conscious Joins Radix-Clustered Hash-Join: overall perf

(64,000,000 tuples)

VLDB 2009 Tutorial Column-Oriented Database Systems 125 Cache-Conscious Joins Pre-Projection vs. Post-Projection

VLDB 2009 Tutorial Column-Oriented Database Systems 126 Cache-Conscious Joins Pre-Projection vs. Post-Projection

 Radix-Decluster = cache-conscious post-projection

VLDB 2009 Tutorial Column-Oriented Database Systems 127 Cache-Conscious Joins Post-Projection •“Cache-Conscious Radix-Decluster Projections”, Manegold, Boncz, Nes, VLDB‟04

 Partitioned Hash-Join

Join Index!

“Join Indices” Valduriez, TODS‟87

VLDB 2009 Tutorial Column-Oriented Database Systems 128 Cache-Conscious Joins Post-Projection •“Cache-Conscious Radix-Decluster Projections”, Manegold, Boncz, Nes, VLDB‟04

 Partitioned Hash-Join  Cluster on Left

VLDB 2009 Tutorial Column-Oriented Database Systems 129 Cache-Conscious Joins Post-Projection •“Cache-Conscious Radix-Decluster Projections”, Manegold, Boncz, Nes, VLDB‟04

 Partitioned Hash-Join  Cluster on Left

0 1 2 3 4 . . N-1

VLDB 2009 Tutorial Column-Oriented Database Systems 130 Cache-Conscious Joins Post-Projection •“Cache-Conscious Radix-Decluster Projections”, Manegold, Boncz, Nes, VLDB‟04

 Partitioned Hash-Join  Cluster on Left  Project Left

VLDB 2009 Tutorial Column-Oriented Database Systems 131 Cache-Conscious Joins Post-Projection •“Cache-Conscious Radix-Decluster Projections”, Manegold, Boncz, Nes, VLDB‟04

 Partitioned Hash-Join 2  Cluster on Left 4  Project Left 6 1  Cluster on Right 1 3 7 0 58 85

VLDB 2009 Tutorial Column-Oriented Database Systems 132 Cache-Conscious Joins Post-Projection •“Cache-Conscious Radix-Decluster Projections”, Manegold, Boncz, Nes, VLDB‟04

 Partitioned Hash-Join  Cluster on Left  Project Left  Cluster on Right  Project Right  Radix-Decluster

VLDB 2009 Tutorial Column-Oriented Database Systems 133 Cache-Conscious Joins Radix-Decluster •“Cache-Conscious Radix-Decluster Projections”, Manegold, Boncz, Nes, VLDB‟04 2 4  Red column forms a dense domain  {0,1,2,…,N-1} 6  All subsequences are ordered 1  e.g. 3  Task: merge subsequences into dense sequence 7 0 58 85

VLDB 2009 Tutorial Column-Oriented Database Systems 134 Cache-Conscious Joins Radix-Decluster •“Cache-Conscious Radix-Decluster Projections”, Manegold, Boncz, Nes, VLDB‟04 2 4  Red column forms a dense domain  {0,1,2,…,N-1} 6  All subsequences are ordered 1  e.g. 3  Task: merge subsequences into dense sequence 7 B  Approach 1: merge H=2 lists 0   cost O(N log(H)) 58  many (H) cursors needed cache thrashing 85

VLDB 2009 Tutorial Column-Oriented Database Systems 135 Cache-Conscious Joins Radix-Decluster •“Cache-Conscious Radix-Decluster Projections”, Manegold, Boncz, Nes, VLDB‟04 2 4  Red column forms a dense domain  {0,1,2,…,N-1} 6  All subsequences are ordered 1  e.g. 3  Task: merge subsequences into dense sequence 7 B  Approach 1: merge H=2 lists 0   cost O(N log(H)), 58  many (H) cursors needed cache thrashing 85  Approach 2: insert by position   many (H) sparse passes over the result no cache reuse

VLDB 2009 Tutorial Column-Oriented Database Systems 136 Cache-Conscious Joins Radix-Decluster •“Cache-Conscious Radix-Decluster Projections”, Manegold, Boncz, Nes, VLDB‟04 2 4  Red column forms a dense domain  {0,1,2,…,N-1} 6  All subsequences are ordered 1  e.g. 3  Task: merge subsequences into dense sequence 7 B  Approach 1: merge H=2 lists 0   cost O(N log(H)) 58  many (H) cursors needed cache thrashing 85  Approach 2: insert by position   many (H) sparse passes over the result  no cache reuse  Radix-Decluster: insert by position with sliding window

VLDB 2009 Tutorial Column-Oriented Database Systems 137 Cache-Conscious Joins Radix-Decluster In Action •“Cache-Conscious Radix-Decluster Projections”, Manegold, Boncz, Nes, VLDB‟04

VLDB 2009 Tutorial Column-Oriented Database Systems 138 Cache-Conscious Joins Radix-Decluster In Action •“Cache-Conscious Radix-Decluster Projections”, Manegold, Boncz, Nes, VLDB‟04

3 clusters

VLDB 2009 Tutorial Column-Oriented Database Systems 139 Cache-Conscious Joins Radix-Decluster In Action •“Cache-Conscious Radix-Decluster Projections”, Manegold, Boncz, Nes, VLDB‟04

destination positions

VLDB 2009 Tutorial Column-Oriented Database Systems 140 Cache-Conscious Joins Radix-Decluster In Action •“Cache-Conscious Radix-Decluster Projections”, Manegold, Boncz, Nes, VLDB‟04

Projection column (still) in wrong order

VLDB 2009 Tutorial Column-Oriented Database Systems 141 Cache-Conscious Joins Radix-Decluster In Action •“Cache-Conscious Radix-Decluster Projections”, Manegold, Boncz, Nes, VLDB‟04

Result column to fill insertion window of size 2

VLDB 2009 Tutorial Column-Oriented Database Systems 142 Cache-Conscious Joins Radix-Decluster In Action •“Cache-Conscious Radix-Decluster Projections”, Manegold, Boncz, Nes, VLDB‟04

VLDB 2009 Tutorial Column-Oriented Database Systems 143 Cache-Conscious Joins Radix-Decluster In Action •“Cache-Conscious Radix-Decluster Projections”, Manegold, Boncz, Nes, VLDB‟04

VLDB 2009 Tutorial Column-Oriented Database Systems 144 Cache-Conscious Joins Radix-Decluster In Action •“Cache-Conscious Radix-Decluster Projections”, Manegold, Boncz, Nes, VLDB‟04

VLDB 2009 Tutorial Column-Oriented Database Systems 145 Cache-Conscious Joins Radix-Decluster In Action •“Cache-Conscious Radix-Decluster Projections”, Manegold, Boncz, Nes, VLDB‟04

VLDB 2009 Tutorial Column-Oriented Database Systems 146 Cache-Conscious Joins Radix-Decluster In Action •“Cache-Conscious Radix-Decluster Projections”, Manegold, Boncz, Nes, VLDB‟04

VLDB 2009 Tutorial Column-Oriented Database Systems 147 Cache-Conscious Joins Radix-Decluster In Action •“Cache-Conscious Radix-Decluster Projections”, Manegold, Boncz, Nes, VLDB‟04

VLDB 2009 Tutorial Column-Oriented Database Systems 148 Cache-Conscious Joins Radix-Decluster In Action •“Cache-Conscious Radix-Decluster Projections”, Manegold, Boncz, Nes, VLDB‟04

VLDB 2009 Tutorial Column-Oriented Database Systems 149 Cache-Conscious Joins Radix-Decluster In Action •“Cache-Conscious Radix-Decluster Projections”, Manegold, Boncz, Nes, VLDB‟04

VLDB 2009 Tutorial Column-Oriented Database Systems 150 Cache-Conscious Joins Radix-Decluster In Action •“Cache-Conscious Radix-Decluster Projections”, Manegold, Boncz, Nes, VLDB‟04

VLDB 2009 Tutorial Column-Oriented Database Systems 151 Cache-Conscious Joins Radix-Decluster Memory Access Pattern •“Cache-Conscious Radix-Decluster Projections”, Manegold, Boncz, Nes, VLDB‟04

 Random access only to sliding window(<< cache size)

 Only compulsary misses  cache-conscious

VLDB 2009 Tutorial Column-Oriented Database Systems 152 Cache-Conscious Joins Radix-Decluster Performance Tradeoff

 Radix-Decluster prefers small W (i.e. vertical fragmentation)

Read Also: - Jive Join - Slam Join

“Fast Joins Using Join Indices” Ross, Lei, VLDBJ‟98

VLDB 2009 Tutorial Column-Oriented Database Systems 153 Cache-Conscious Joins Pre-Projection vs. Post-Projection

“Query Processing Techniques for Solid State Drives” Tsirogiannis, Harizopoulos,Shah, Wiener, Graefe, SIGMOD‟09

VLDB 2009 Tutorial Column-Oriented Database Systems 154 VLDB 2009 Column-Oriented Tutorial Database Systems

“MonetDB/X100” vectorized query processing MonetDB spin-off: MonetDB/X100

Materialization vs Pipelining

Tutorial Column-Oriented Database Systems 178 “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR‟05

Observations:“Vectorized In Cache > 30 ? 102 ivan 350 104 peggy 750 Processing” FALSE for(i=0; i x) TRUEnext() - 30 * 50 morevector time = spentarray in of primitives ~100 FALSE less in overhead 102 ivan 37 7 350 104 peggy 45 15 750 processed in a tight loop - 30 PROJECT primitive calls process an array of for(i=0; i 30i++) ? values in a loop: 7 CPU cache Resident 15next() 101res[i]alice =22 (col[i]FALSE - x) CPU Efficiency depends on “nice” code 102 ivan 37 TRUE - out-of-order execution 104 peggy 45 TRUE - few dependencies (control,data) * 50 for(i=0;105 victor i

Tutorial Column-Oriented Database Systems 179 “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR‟05 map_mul_flt_val_flt_col( map_mul_flt_val_flt_col( float *res, float *res, int* sel, int* sel, float val, float val, float *col, int n) Tricksfloat being*col, int played: n) { -{Late materialization - 30 * 50 if (many selected and narrow) next() for(int i=0; i 30 ? for(int i=0; i

Tutorial Column-Oriented Database Systems 180 “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR‟05 MonetDB/X100

 Both efficiency  Vectorized primitives  and scalability..  Pipelined query evaluation

 C program: 0.2s

 MonetDB/X100: 0.6s

 MonetDB: 3.7s

 MySQL: 26.2s

 DBMS “X”: 28.1s

Tutorial Column-Oriented Database Systems 181 “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR‟05 Memory Hierarchy

X100 query engine

CPU cache

RAM ColumnBM (buffer manager)

(raid) Disk(s) Tutorial Column-Oriented Database Systems 182 “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR‟05 Memory Hierarchy

X100 query engine

Vectors are only the in-cache representation

RAM & disk representation might actually be different

CPU (vectorwise uses both PAX & DSM) cache

RAM ColumnBM (buffer manager)

(raid) Disk(s) Tutorial Column-Oriented Database Systems 183 “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR‟05 Varying the Vector size

Less and less iterator.next() and primitive function calls (“interpretation overhead”)

Tutorial Column-Oriented Database Systems 185 “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR‟05 Varying the Vector size

Vectors start to exceed the CPU cache, causing additional memory traffic

Tutorial Column-Oriented Database Systems 186 “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR‟05 Varying the Vector size

The benefit of selection vectors

Tutorial Column-Oriented Database Systems 187 “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR‟05 MonetDB/MIL materializes columns

CPU cache

RAM ColumnBM (buffer manager)

(raid) Disk(s) Tutorial Column-Oriented Database Systems 188 “MonetDB/X100: Hyper-Pipelining Query Execution ” Boncz, Zukowski, Nes, CIDR‟05 Benefits of Vectorized Processing

“Buffering Database Operations for  100x less Function Calls Enhanced Instruction Cache Performance”  iterator.next(), primitives Zhou, Ross, SIGMOD‟04  No Instruction Cache Misses

 High locality in the primitives “Block oriented processing of relational database operations in  Less Data Cache Misses modern computer architectures”  Cache-conscious data placement Padmanabhan, Malkemus, Agarwal,  No Tuple Navigation ICDE‟01

 Primitives are record-oblivious, only see arrays

 Vectorization allows algorithmic optimization

 Move activities out of the loop (“strength reduction”)

 Compiler-friendly function bodies

 Loop-pipelining, automatic SIMD

Tutorial Column-Oriented Database Systems 189 “Balancing Vectorized Query Execution with Bandwidth Optimized Storage ” Zukowski, CWI 2008 Vectorizing Relational Operators

 Project

 Select

 Exploit selectivities, test buffer overflow

 Aggregation

 Ordered, Hashed

 Sort

 Radix-sort nicely vectorizes

 Join

 Merge-join + Hash-join

Tutorial Column-Oriented Database Systems 190 Column-Oriented

Database Systems Tutorial

Conclusion Summary (1/2)

 Columns and Row-Stores: different?  No fundamental differences  Can current row-stores simulate column-stores now?  not efficiently: row-stores need change  On disk layout vs execution layout  actually independent issues, on-the-fly conversion pays off  column favors sequential access, row random  Mixed Layout schemes  Fractured mirrors  PAX, Clotho  Data morphing

Tutorial Column-Oriented Database Systems 209 Summary (2/2)

 Crucial Columnar Techniques  Storage  Lean headers, sparse indices, fast positional access  Compression  Operating on compressed data  Lightweight, vectorized decompression  Late vs Early materialization  Non-join: LM always wins  Naïve/Invisible/Jive/Flash/Radix Join (LM often wins)  Execution  Vectorized in-cache execution  Exploiting SIMD

Tutorial Column-Oriented Database Systems 210