Vectorization vs. Compilation in Query Execution

Juliusz Sompolski¹, VectorWise B.V.
Marcin Zukowski, VectorWise B.V.
Peter Boncz², Vrije Universiteit Amsterdam

ABSTRACT

Compiling queries into executable (sub-) programs provides substantial benefits compared to traditional interpreted execution. Many of these benefits, such as reduced interpretation overhead, better instruction code locality, and providing opportunities to use SIMD instructions, have previously been provided by redesigning query processors to use a vectorized execution model. In this paper, we try to shed light on the question of how state-of-the-art compilation strategies relate to vectorized execution for analytical database workloads on modern CPUs. For this purpose, we carefully investigate the behavior of vectorized and compiled strategies inside the VectorWise database system in three use cases: Project, Select and Hash Join. One of the findings is that compilation should always be combined with block-wise query execution. Another contribution is identifying three cases where "loop-compilation" strategies are inferior to vectorized execution. As such, a careful merging of these two strategies is proposed for optimal performance: either by incorporating vectorized execution principles into compiled query plans or using query compilation to create building blocks for vectorized processing.

¹ This work is part of an MSc thesis being written at Vrije Universiteit Amsterdam.
² The author also remains affiliated with CWI Amsterdam.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Proceedings of the Seventh International Workshop on Data Management on New Hardware (DaMoN 2011), June 13, 2011, Athens, Greece.
Copyright 2011 ACM 978-1-4503-0658-4 ...$10.00.

1. INTRODUCTION

Database systems provide many useful abstractions such as data independence, ACID properties, and the possibility to pose declarative complex ad-hoc queries over large amounts of data. This flexibility implies that a database server has no advance knowledge of the queries until run-time, which has traditionally led most systems to implement their query evaluators using an interpretation engine. Such an engine evaluates plans consisting of algebraic operators, such as Scan, Join, Project, Aggregation and Select. The operators internally include expressions, which can be boolean conditions used in Joins and Select, calculations used to introduce new columns in Project, and functions like MIN, MAX and SUM used in Aggregation. Most query interpreters follow the so-called iterator-model (as described in Volcano [5]), in which each operator implements an API that consists of open(), next() and close() methods. Each next() call produces one new tuple, and query evaluation follows a "pull" model in which next() is called recursively to traverse the operator tree from the root downwards, with the result tuples being pulled upwards.

It has been observed that the tuple-at-a-time model leads to interpretation overhead: the situation that much more time is spent in evaluating the query plan than in actually calculating the query result. Additionally, this tuple-at-a-time interpretation model particularly affects high performance features introduced in modern CPUs [13]. For instance, the fact that units of actual work are hidden in the stream of interpreting code and function calls prevents compilers and modern CPUs from getting the benefits of deep CPU pipelining and SIMD instructions, because for these the work instructions should be adjacent in the instruction stream and independent of each other.

Related Work: Vectorized execution. MonetDB [2] reduced interpretation overhead by using bulk processing, where each operator would fully process its input, and only then invoke the next execution stage. This idea has been further improved in the X100 project [1], later evolving into VectorWise, with vectorized execution. It is a form of block-oriented query processing [8], where the next() method produces a block (typically 100-10000) of tuples rather than a single tuple. In the vectorized model, data is represented as small single-dimensional arrays (vectors), easily accessible for CPUs. The effect is (i) that the percentage of instructions spent in interpretation logic is reduced by a factor equal to the vector-size, and (ii) that the functions that perform work now typically process an array of values in a tight loop. Such tight loops can be optimized well by compilers, e.g. unrolled when beneficial, and enable compilers to generate SIMD instructions automatically. Modern CPUs also do well on such loops, as function calls are eliminated, branches get more predictable, and out-of-order execution in CPUs often takes multiple loop iterations into execution concurrently, exploiting the deeply pipelined resources of modern CPUs. It was shown that vectorized execution can improve data-intensive (OLAP) queries by a factor of 50.

Related Work: Loop-compilation. An alternative strategy for eliminating the ill effects of interpretation is using Just-In-Time (JIT) query compilation. On receiving a query for the first time, the query processor compiles (part of) the query into a routine that gets subsequently executed. In Java engines, this can be done through the generation of new Java classes that are loaded using reflection (and JIT compiled by the virtual machine) [10]. In C or C++, source code text is generated, compiled, dynamically loaded, and executed. System R originally skipped compilation by generating assembly directly, but the non-portability of that approach led to its abandonment [4]. Depending on the compilation strategy, the generated code may either solve

Algorithm 1 Implementation of an example query using vectorized and compiled modes. Map-primitives are statically compiled functions for combinations of operations (OP), types (T) and input formats (col/val). Dynamically compiled primitives, such as c000(), follow the same pattern as pre-generated vectorized primitives, but may take arbitrarily complex expressions as OP.

    // General vectorized primitive pattern
    map_OP_T_col_T_col(idx n, T* res, T* col1, T* col2) {
      for (int i=0; i [...]
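To make the vectorized primitive pattern concrete, here is a minimal C sketch of one instantiation, assuming OP = add and T = int. The names map_add_int_col_int_col and project_add, the returned count, and the driver loop are illustrative assumptions and not VectorWise code. The driver shows why interpretation overhead shrinks with the vector size: the primitive is called once per vector instead of once per tuple.

```c
#include <stddef.h>

typedef size_t idx;

/* Hypothetical instance of the map_OP_T_col_T_col pattern for
 * OP = add and T = int. The loop body has no function calls and no
 * data-dependent branches, so a compiler is free to unroll it or to
 * generate SIMD code. Returns the number of values produced. */
idx map_add_int_col_int_col(idx n, int *res, const int *col1, const int *col2)
{
    for (idx i = 0; i < n; i++)
        res[i] = col1[i] + col2[i];
    return n;
}

/* Illustrative driver: processes `total` values in blocks of at most
 * `vector_size` values (vector_size must be > 0) and returns how many
 * times the primitive was called. */
idx project_add(idx total, idx vector_size,
                int *res, const int *col1, const int *col2)
{
    idx calls = 0;
    for (idx off = 0; off < total; off += vector_size) {
        idx remaining = total - off;
        idx n = remaining < vector_size ? remaining : vector_size;
        map_add_int_col_int_col(n, res + off, col1 + off, col2 + off);
        calls++;
    }
    return calls;
}
```

With a vector size of 1 the driver degenerates into tuple-at-a-time processing (one call per value); with a vector size of 1000 the same work needs a thousand times fewer calls.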

[Figure 1: Project micro-benchmark as a function of vector size. Series: Vectorized and Compiled, each in SIMD, no-SIMD and SIMD-sht variants.]

Algorithm 2 Implementations of < selection primitives. All algorithms return the number of selected items (return j). For mid selectivities, branching instructions lead to branch mispredictions. In a vectorized implementation such branching can be avoided. VectorWise dynamically selects the best method depending on the observed selectivity, but in the micro-benchmark we show the results for both methods.

    // Two vectorized implementations
    // (1.) medium selectivity: non-branching code
    idx sel_lt_T_col_T_val(idx n, T* res, T* col1, T* val2,
                           idx* sel) {
      if (sel == NULL) {
        for (idx i=0, idx j=0; i [...]

Algorithm 3 Four compiled implementations of a conjunctive selection. Branching cannot be avoided in loop-compilation, which combines selection with other operations, without executing these operations eagerly. The four implementations balance between branching and eager computation.

    // (1.) all predicates branching ("lazy")
    idx c0001(idx n, T* res, T* col1, T* col2, T* col3,
              T* v1, T* v2, T* v3) {
      idx i, j=0;
      for (i=0; i<n; i++)
        if (col1[i] < *v1 && col2[i] < *v2 && col3[i] < *v3)
          res[j++] = i;
      [...]
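The trade-off between branching and non-branching selection named in the captions of Algorithms 2 and 3 can be sketched in C as follows. This is a simplified illustration under stated assumptions, not the original listings: T is fixed to int, the input selection vector parameter is omitted, result positions are stored as idx, and all function names are hypothetical.

```c
#include <stddef.h>

typedef size_t idx;

/* Branching variant: cheap when the predicate is almost always true
 * or almost always false, costly when the branch is unpredictable. */
idx sel_lt_int_col_int_val_branching(idx n, idx *res,
                                     const int *col, const int *val)
{
    idx j = 0;
    for (idx i = 0; i < n; i++)
        if (col[i] < *val)
            res[j++] = i;
    return j;
}

/* Non-branching variant: always writes the candidate position and
 * advances the output cursor by the 0/1 predicate result, so there is
 * no data-dependent branch to mispredict. res needs room for n items. */
idx sel_lt_int_col_int_val_nobranch(idx n, idx *res,
                                    const int *col, const int *val)
{
    idx j = 0;
    for (idx i = 0; i < n; i++) {
        res[j] = i;
        j += (col[i] < *val);
    }
    return j;
}

/* Compiled conjunction, "lazy": && short-circuits, so later predicates
 * are evaluated only when the earlier ones hold (up to three branches). */
idx sel_conj_lazy(idx n, idx *res,
                  const int *col1, const int *col2, const int *col3,
                  const int *v1, const int *v2, const int *v3)
{
    idx j = 0;
    for (idx i = 0; i < n; i++)
        if (col1[i] < *v1 && col2[i] < *v2 && col3[i] < *v3)
            res[j++] = i;
    return j;
}

/* Compiled conjunction, "compute all": every predicate is evaluated
 * eagerly and combined with bitwise &, trading extra work for the
 * absence of branches. */
idx sel_conj_compute_all(idx n, idx *res,
                         const int *col1, const int *col2, const int *col3,
                         const int *v1, const int *v2, const int *v3)
{
    idx j = 0;
    for (idx i = 0; i < n; i++) {
        res[j] = i;
        j += (col1[i] < *v1) & (col2[i] < *v2) & (col3[i] < *v3);
    }
    return j;
}
```

Both variants of each pair return identical selection vectors; they differ only in how much work is done per tuple and in how sensitive they are to branch mispredictions.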

[Figure 2: Conjunctive selection micro-benchmark as a function of the selectivity of each condition (0.0 to 1.0). Series: Vectorized, branching; Vectorized, non-branching; Comp., if(1&&2&&3){} (lazy); Comp., if(1&&2){3 non-br}; Comp., if(1){2&3 non-br}; Comp., 1&2&3 non-br (compute all).]

[Algorithm 4: vectorized hash probing. The listing is not recoverable from this extraction.]

[Figure 3 diagram: hash value computation selects a hash bucket in the array of heads B; each head points into the value space V, shown in DSM and NSM layouts with columns next, key1, key2, val1, val2, val3.]

Figure 3: Bucket-chain hash table as used in VectorWise. The value space V presented in the figure is in DSM format, with separate array for each attribute. It can also be implemented in NSM, with data stored tuple-wise.

Algorithm 5 Fully loop-compiled hash probing: for each NSM tuple, read hash bucket from B, loop through a chain in V, fetching results when the check produces a match

    for (idx i=0, j=0; i [...]
        if (key1[i]==V[pos].key1 &&
            key2[i]==V[pos].key2) {
          res1[i] = V[pos].col1;
          res2[i] = V[pos].col2;
          res3[i] = V[pos].col3;
          break; // found match
        }
      } while (pos = V.next[pos]); // next
    }

Vectorized Hash Probing. For space reasons we only discuss the probe phase in Algorithm 4, we show code for the DSM data representation and we focus on the scenario when there is at most one hit for each probe tuple (as is common with relations joined with a foreign-key referential constraint). Probing starts by vectorized computation of a hash number from a key in a column-by-column fashion using map-primitives. A map_hash_T_col primitive first hashes each key of type T onto a lng long integer. If the key is composite, we iteratively refine the hash value using a map_rehash_lng_col_T_col primitive, passing in the previously computed hash values and the next key column. A bitwise-and map-primitive is used to compute a bucket number from the hash values: H&(N-1).

To read the positions of heads of chains for the calculated buckets we use a special primitive ht_lookup_initial. It behaves like a selection primitive, creating a selection vector Match of positions in the bucket number vector H for which a match was found. Also, it fills the Pos vector with positions of the candidate matching tuples in the hash table. If the value (offset) in the bucket is 0, there is no key in the hash table; these tuples store 0 in Pos and are not part of Match.

Having identified the indices of possible matching tuples, the next task is to "check" if the key values actually match. This is implemented using a specialized map primitive that combines fetching a value by offset with testing for non-equality: map_check. Similar to hashing, composite keys are supported using a map_recheck primitive which gets the boolean output of the previous check as an extra first parameter. The resulting booleans mark positions for which the check failed. The positions can then be selected using a select sel_nonzero primitive, overwriting the selection vector Match with positions for which probing should advance to the "next" position in the chain. Advancing is done by a special primitive ht_lookup_next, which for each probe tuple in Match fills Pos with the next position in the bucket-chain of V. It also guards for ends of chain by reducing Match to its subset for which the resulting position in Pos was non-zero.

The loop finishes when the selection vector Match becomes empty, either because of reaching end of chain (element in Pos equals 0, a miss) or because checking succeeded (element in Pos pointing to a position in V, a hit).

Hits can be found by selecting the elements of Pos which ultimately produced a match with a sel_nonzero primitive. Pos with selection vector Hits becomes a pivot vector for fetching. This pivot is subsequently used to fetch (non-key) result values from the build value area into the hash join result vectors, using one fetch primitive for each result column.

[Figure 4 plot. Series: Vectorized DSM, Vectorized NSM, Compiled NSM, Compiled DSM. Panels: cycles per tuple and TLB misses per tuple, both against the number of fetched columns (1 to 10).]

Figure 4: Fetching columns of data from a hash table: cycles per tuple and total TLB misses

Partial Compilation. There are three opportunities to apply compilation to vectorized hashing. The first is to compile the full sequence of hash/rehash/bitwise-and and bucket fetch into a single primitive. The second combines the check and iterative re-check (in case of composite keys) and the select > 0 into a single select-primitive. Since the test for a key in a well-tuned hash table has a selectivity around 75%, we can restrict ourselves to a non-branching implementation. These two opportunities re-iterate the compilation benefits of Project and Select primitives, as discussed in the previous sections, so we omit the code.

The third opportunity is in the fetch code. Here, we can generate a composite fetch primitive that, given a vector of positions, fetches multiple columns. The main benefit of this
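The vectorized probing loop described above (hash, bucket lookup, check, select, advance) can be sketched in C as below. This is a minimal illustration under several assumptions, not the Algorithm 4 listing: keys are single int columns, the hash function is an arbitrary multiplicative hash, hashing and the bitwise-and are fused into one helper, and the hash_table struct and all signatures are hypothetical. Only the primitive names ht_lookup_initial, map_check, sel_nonzero and ht_lookup_next come from the text.

```c
#include <stddef.h>
#include <stdint.h>

typedef size_t idx;

/* Bucket-chain hash table in DSM layout; position 0 means "none". */
typedef struct {
    idx        nbuckets; /* N, assumed to be a power of two         */
    const idx *heads;    /* B: head position of each bucket's chain */
    const int *key;      /* V.key, indexed by position (>= 1)       */
    const idx *next;     /* V.next: next position in the chain      */
} hash_table;

/* Assumed hash function followed by the bucket computation H & (N-1). */
idx hash_bucket(int key, idx nbuckets)
{
    return (idx)((uint32_t)key * 2654435761u) & (nbuckets - 1);
}

static void map_bucket_int_col(idx n, idx *bucket, const int *key, idx nbuckets)
{
    for (idx i = 0; i < n; i++)
        bucket[i] = hash_bucket(key[i], nbuckets);
}

/* Fills pos with chain heads; match keeps tuples with non-empty buckets. */
static idx ht_lookup_initial(idx n, idx *pos, idx *match,
                             const idx *bucket, const hash_table *ht)
{
    idx k = 0;
    for (idx i = 0; i < n; i++) {
        pos[i] = ht->heads[bucket[i]];
        match[k] = i;
        k += (pos[i] != 0);
    }
    return k;
}

/* Marks with 1 the tuples whose candidate key differs from the probe key. */
static void map_check(idx m, const idx *match, unsigned char *differs,
                      const int *key, const idx *pos, const hash_table *ht)
{
    for (idx s = 0; s < m; s++) {
        idx i = match[s];
        differs[i] = (ht->key[pos[i]] != key[i]);
    }
}

/* Shrinks match in place to the tuples whose check failed. */
static idx sel_nonzero(idx m, idx *match, const unsigned char *differs)
{
    idx k = 0;
    for (idx s = 0; s < m; s++) {
        idx i = match[s];
        match[k] = i;
        k += (differs[i] != 0);
    }
    return k;
}

/* Advances along the chain; drops tuples that reached the end (pos 0). */
static idx ht_lookup_next(idx m, idx *match, idx *pos, const hash_table *ht)
{
    idx k = 0;
    for (idx s = 0; s < m; s++) {
        idx i = match[s];
        pos[i] = ht->next[pos[i]];
        match[k] = i;
        k += (pos[i] != 0);
    }
    return k;
}

/* Probes n keys. On return pos[i] is the matching position in V, or 0
 * for a miss. bucket, match and differs are scratch vectors of size n.
 * Returns the number of hits. */
idx ht_probe(idx n, const int *key, idx *pos, idx *bucket, idx *match,
             unsigned char *differs, const hash_table *ht)
{
    map_bucket_int_col(n, bucket, key, ht->nbuckets);
    idx m = ht_lookup_initial(n, pos, match, bucket, ht);
    while (m > 0) {
        map_check(m, match, differs, key, pos, ht);
        m = sel_nonzero(m, match, differs);
        m = ht_lookup_next(m, match, pos, ht);
    }
    idx hits = 0;
    for (idx i = 0; i < n; i++)
        hits += (pos[i] != 0);
    return hits;
}
```

Each helper is a tight loop over independent elements, so the memory accesses of many probe tuples can be in flight at once; a fully fused per-tuple loop would instead follow one chain at a time.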

[Figure 5: Hash probing experiments (cycles per tuple) for the Vectorized, DSM; Vectorized, NSM; Compiled, NSM; Compiled, DSM and partially compiled variants. Left panel: varying hash table size (4K to 64M tuples). Middle panel: varying selectivity of matching (fraction of matched tuples, 0.0 to 1.0) on a hash table with 16M tuples. Right panel: varying number of buckets relative to the hash table size (1x down to 1/32x) on a hash table with 16M tuples.]
effect causes fully compiled hashing to be four times slower than vectorized hashing in the worst case.

Experiments. Figure 5 shows experiments for hash probing using the vectorized, fully compiled and partially compiled approaches, using both DSM and NSM as the hash table (V) representation. We vary hash table size, selectivity (the fraction of probe keys that match something), and bucket chain length, with default values of 16M, 1.0 and 1, respectively. The left part shows that when the hash table size grows, performance deteriorates; it is well understood that cache and TLB misses are to blame, and DSM achieves less locality than NSM. The middle graph shows that with increasing hit rate the cost goes up, which mainly depends on the increasing fetch work for tuple generation. The compiled NSM fetch alternatives perform best, as explained. The right graph shows what happens with increasing chain length. As discussed above, the fully compiled (NSM) variant suffers most, as it gets no parallel memory access. The overall best solution is partially compiled NSM, thanks to its efficient compiled multi-column fetching (and, to a lesser extent, efficient checking/hashing in the case of composite keys) and its parallel memory access during lookup, fetching and chain-following.

5. CONCLUSIONS

For database architects seeking a way to increase the computational performance of a database engine, there might seem to be a choice between vectorizing the expression engine and introducing expression compilation. Vectorization is a form of block-oriented processing, and if a system already has an operator API that is tuple-at-a-time, many changes will be needed beyond expression calculation, notably in all query operators as well as in the storage layer. If high computational performance is the goal, such deep changes cannot be avoided, as we have shown that if one keeps adhering to a tuple-at-a-time operator API, expression compilation alone provides only marginal improvement.

Our main message is that one does not need to choose between compilation and vectorization, as we show that the best results are obtained if the two are combined. As to what this combining entails, we have shown that "loop-compilation" techniques, as have been proposed recently, can be inferior to plain vectorization, due to the latter's better (i) SIMD alignment, (ii) ability to avoid branch mispredictions and (iii) parallel memory accesses. Thus, in such cases, compilation is better split into multiple loops, materializing intermediate vectorized results. Also, we have signaled cases where an interpreted (but vectorized) evaluation strategy provides optimization opportunities which are very hard to exploit with compilation, like dynamic selection of a predicate evaluation method or predicate evaluation order.

Thus, a simple compilation strategy is not enough; state-of-the-art algorithmic methods may use certain complex transformations of the problem at hand, sometimes require run-time adaptivity, and always benefit from careful tuning. To reach the same level of sophistication, compilation-based query engines would require significant added complexity, possibly even higher than that of interpreted engines. Also, it shows that vectorized execution, which is an evolution of the iterator model, further evolves, when enhanced with compilation, into an even more efficient and more flexible solution without making dramatic changes to the DBMS architecture. It obtains very good performance while maintaining clear modularization, simplified testing and easy performance- and quality-tracking, which are key properties of a software product.

6. REFERENCES

[1] P. Boncz, M. Zukowski, and N. Nes. MonetDB/X100: Hyper-Pipelining Query Execution. In Proc. CIDR, Asilomar, CA, USA, 2005.
[2] P. A. Boncz. Monet: A Next-Generation DBMS Kernel For Query-Intensive Applications. Ph.D. thesis, Universiteit van Amsterdam, Amsterdam, The Netherlands, May 2002.
[3] S. Chen, A. Ailamaki, P. B. Gibbons, and T. C. Mowry. Improving hash join performance through prefetching. In Proc. ICDE, Boston, MA, USA, 2004.
[4] D. Chamberlin et al. A history and evaluation of System R. Commun. ACM, 24(10):632–646, 1981.
[5] G. Graefe. Volcano - an extensible and parallel query evaluation system. IEEE TKDE, 6(1):120–135, 1994.
[6] A. Kemper and T. Neumann. HyPer: Hybrid OLTP and OLAP High Performance Database System. Technical report, Technical Univ. Munich, TUM-I1010, May 2010.
[7] K. Krikellas, S. Viglas, and M. Cintra. Generating code for holistic query evaluation. In ICDE, pages 613–624, 2010.
[8] S. Padmanabhan, T. Malkemus, R. Agarwal, and A. Jhingran. Block Oriented Processing of Relational Database Operations in Modern Computer Architectures. In Proc. ICDE, Heidelberg, Germany, 2001.
[9] ParAccel Inc. Whitepaper. The ParAccel Analytical Database: A Technical Overview, Feb 2010. http://www.paraccel.com.
[10] J. Rao, H. Pirahesh, C. Mohan, and G. M. Lohman. Compiled Query Execution Engine using JVM. In Proc. ICDE, Atlanta, GA, USA, 2006.
[11] K. A. Ross. Conjunctive selection conditions in main memory. In Proc. PODS, Washington, DC, USA, 2002.
[12] The LLVM Compiler Infrastructure. http://llvm.org.
[13] M. Zukowski. Balancing Vectorized Query Execution with Bandwidth-Optimized Storage. Ph.D. thesis, Universiteit van Amsterdam, Sep 2009.
[14] M. Zukowski, N. Nes, and P. Boncz. DSM vs. NSM: CPU Performance Tradeoffs in Block-Oriented Query Processing. 2008.
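
The recommendation in Section 5 to split a compiled loop into multiple loops that materialize intermediate vectors can be illustrated for hash probing with a minimal sketch. It is not the VectorWise implementation: the bucket-chained layout and the names `bucket_of`, `build`, `probe_fused` and `probe_split` are hypothetical, and build keys are assumed unique. Python is used only to show the control structure; the effects discussed in the paper (branch mispredictions, parallel memory access) arise in compiled code on real hardware. `probe_fused` handles one probe key at a time, so each chain step depends on the previous one, whereas `probe_split` runs lookup, check/chain-following and fetch as separate loops over the whole vector, so each loop touches many independent positions.

```python
def bucket_of(key, n_buckets):
    # Hypothetical hash function: modulo on integer keys, for illustration only.
    return key % n_buckets


def build(keys, payloads, n_buckets):
    # Bucket-chained table. Position 0 is reserved as the end-of-chain marker,
    # so stored tuples live at positions 1..len(keys).
    first = [0] * n_buckets            # head position of each bucket chain
    nxt = [0] * (len(keys) + 1)        # next position in the same chain
    k = [None] + list(keys)
    p = [None] + list(payloads)
    for pos in range(1, len(k)):
        b = bucket_of(k[pos], n_buckets)
        nxt[pos] = first[b]            # link the old head behind the new tuple
        first[b] = pos
    return first, nxt, k, p


def probe_fused(table, probe_keys):
    # Single loop: lookup, chain-following and fetch are done per tuple.
    first, nxt, k, p = table
    out = []
    for i, key in enumerate(probe_keys):
        pos = first[bucket_of(key, len(first))]
        while pos and k[pos] != key:   # data-dependent chain walk
            pos = nxt[pos]
        if pos:
            out.append((i, p[pos]))
    return out


def probe_split(table, probe_keys):
    # Several loops over the whole vector, with materialized intermediates.
    first, nxt, k, p = table
    n = len(first)
    # Loop 1: hash and look up the chain head for every probe key.
    pos = [first[bucket_of(key, n)] for key in probe_keys]
    match = [0] * len(probe_keys)
    # Selection vector of probe indices that still have a candidate position.
    todo = [i for i in range(len(probe_keys)) if pos[i]]
    while todo:
        # Loop 2: check all candidates, advance the non-matching ones by one step.
        still = []
        for i in todo:
            if k[pos[i]] == probe_keys[i]:
                match[i] = pos[i]
            else:
                pos[i] = nxt[pos[i]]
                if pos[i]:
                    still.append(i)
        todo = still
    # Loop 3: fetch payloads for all matches.
    return [(i, p[match[i]]) for i in range(len(probe_keys)) if match[i]]


if __name__ == "__main__":
    table = build([10, 20, 30, 18], ["a", "b", "c", "d"], 8)
    print(probe_fused(table, [18, 10, 26, 20, 5]))
    print(probe_split(table, [18, 10, 26, 20, 5]))
```

Both functions return the same (probe index, payload) pairs; only the loop structure differs. In the split variant the number of passes of the second loop equals the longest chain visited, which corresponds to the chain-length dimension varied in the experiments.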