+GPU: A CUDA-Powered In-Memory OLAP Server

Tobias Lauer, University of Freiburg, Georges-Köhler-Allee 051, 79110 Freiburg, Germany ([email protected])
Amitava Datta, University of Western Australia, 35 Stirling Highway, Perth, WA 6009, Australia ([email protected])
Zurab Khadikov, Jedox AG, Bismarckallee 7a, 79098 Freiburg, Germany ([email protected])

Online Analytical Processing (OLAP)

OLAP is a core technology in Business Intelligence and Corporate Performance Management, allowing users to navigate and explore corporate data (usually extracted from a data warehouse) and to roll up or drill down along different hierarchical levels. Also, updates to the data must be supported for planning and forecasting. Due to the highly interactive nature of OLAP analysis, query performance is a key issue.

Multidimensional Aggregation

The conceptual model central to OLAP is the Data Cube, which is a view of the data as cells in a multidimensional table ("cube"). Aggregation along dimensional hierarchies is a basic building block involved in most OLAP operations. The main problems to solve in order to compute aggregates efficiently are the sparsity of data in high-dimensional spaces and the dimensional "explosion", since the number of possible aggregates grows exponentially. Our approach aims at speeding up aggregations by using the massively parallel architecture of GPUs.
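As an illustration of how such sparse cube data might be laid out in memory, the following sketch uses a simple coordinate-list representation. The struct, the dimension count, and the names are illustrative assumptions and do not reflect Palo's actual storage format.

// A hypothetical coordinate-list layout for sparse cube cells (not
// Palo's actual storage format): only non-empty cells are materialized,
// each storing its position in every dimension plus its numeric value.

#include <cstdint>
#include <vector>

constexpr int kNumDims = 4;      // e.g. product, region, month, measure

struct CubeCell {
    uint32_t coord[kNumDims];    // element index within each dimension
    float    value;              // the cell's base fact
};

// In high-dimensional cubes this list is far smaller than the full
// Cartesian product of the dimensions, which is what makes in-memory
// (and in-GPU-memory) storage feasible.
using SparseCube = std::vector<CubeCell>;

Aggregating along a hierarchy then amounts to summing the values of all cells whose coordinates roll up to the queried consolidated elements.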

GPU Aggregation Algorithm

Aggregates are computed directly on the device: the cells belonging to the queried area are first filtered in a parallel scan and then reduced to the requested value.

[Figure: example of a single aggregation on the GPU. Each base cell carries its value and a flag (consolidation mark) indicating whether it contributes to the queried aggregate. The scan step filters the flagged cells (10.03, 12.01, 7.90, 11.23, 1.85, 1.11, 4.92, 43.75 in the example), and the reduce step combines them into the result, 92.80.]
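A minimal CUDA sketch of the filter-and-reduce scheme shown in the figure follows. The kernel name, the flag array, and the use of atomicAdd are illustrative assumptions rather than the server's actual implementation; a production kernel would typically perform a tree reduction in shared memory instead of atomic updates to a single accumulator.

// Illustrative CUDA kernel (not Palo's actual implementation): each
// thread inspects one cell and, if the cell is flagged as belonging to
// the queried aggregate, adds its value to a single accumulator.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void filteredSum(const float* values, const unsigned char* flags,
                            int n, float* result)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i]) {
        atomicAdd(result, values[i]);   // naive reduce step
    }
}

int main()
{
    // Cell values and flags taken from the example figure; the flagged
    // cells sum to 92.80.
    const float h_values[] = {10.03f, 31.24f, 5.00f, 12.01f, 7.90f, 11.23f,
                              56.78f, 1.85f, 2.04f, 1.11f, 2.44f, 25.00f,
                              0.97f, 36.90f, 4.92f, 43.75f};
    const unsigned char h_flags[] = {1, 0, 0, 1, 1, 1, 0, 1,
                                     0, 1, 0, 0, 0, 0, 1, 1};
    const int n = sizeof(h_values) / sizeof(h_values[0]);

    float *d_values, *d_result;
    unsigned char* d_flags;
    cudaMalloc(&d_values, n * sizeof(float));
    cudaMalloc(&d_flags, n * sizeof(unsigned char));
    cudaMalloc(&d_result, sizeof(float));
    cudaMemcpy(d_values, h_values, sizeof(h_values), cudaMemcpyHostToDevice);
    cudaMemcpy(d_flags, h_flags, sizeof(h_flags), cudaMemcpyHostToDevice);
    cudaMemset(d_result, 0, sizeof(float));

    filteredSum<<<(n + 255) / 256, 256>>>(d_values, d_flags, n, d_result);

    float h_result = 0.0f;
    cudaMemcpy(&h_result, d_result, sizeof(float), cudaMemcpyDeviceToHost);
    printf("aggregated result: %.2f\n", h_result);   // expected: 92.80

    cudaFree(d_values);
    cudaFree(d_flags);
    cudaFree(d_result);
    return 0;
}

Compiled with nvcc and run on any CUDA device of compute capability 2.0 or higher (required for atomicAdd on floats), this prints the same 92.80 as the figure.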

Performance Evaluation

[Charts: running times in seconds for a single compute-intensive aggregation and for a compute-intensive view (multiple aggregations).]

Data Storage on GPU

Usage of different GPU memory types for storage and processing of data cubes:

[Figure: the cubes (Cube 1, Cube 2, ..., Cube N), each with dimension compression, are stored on the CUDA device. Cube data are split into pages in the device's global and constant memory, while the shared memory of each multiprocessor holds temporary data for query processing; incoming queries are executed against this device-resident data.]

Scalability

The proposed approach scales well to multiple GPUs. All cube data are distributed among all available cards, ensuring that the workload is always divided evenly between the devices. At the end, the individual results are aggregated by the main thread on the host.
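The following sketch combines the storage and scalability ideas above: hypothetical cube pages are distributed across all available devices, a small per-dimension selection table is placed in constant memory, and the partial results are combined by the main thread on the host. Kernel, variable names, and data layout are illustrative assumptions, not Palo's actual code.

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical per-device filter data kept in fast, read-only constant
// memory (e.g., a small lookup marking which elements of one dimension
// belong to the queried consolidated element).
__constant__ unsigned char d_elementSelected[256];

// Each thread adds its cell's value to the device-local partial result
// if the cell's element is selected (illustrative sketch only).
__global__ void partialAggregate(const float* values, const unsigned short* elem,
                                 int n, float* partial)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && d_elementSelected[elem[i]]) {
        atomicAdd(partial, values[i]);
    }
}

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    // Assume the host has already split the cube's cell pages evenly:
    // pages[d] holds the cells assigned to device d (placeholder data here).
    std::vector<std::vector<float>> pages(deviceCount, std::vector<float>(1024, 1.0f));
    std::vector<std::vector<unsigned short>> elems(deviceCount,
                                                   std::vector<unsigned short>(1024, 0));
    unsigned char selected[256] = {0};
    selected[0] = 1;   // select element 0 for this example query

    float total = 0.0f;
    for (int d = 0; d < deviceCount; ++d) {
        cudaSetDevice(d);
        cudaMemcpyToSymbol(d_elementSelected, selected, sizeof(selected));

        int n = (int)pages[d].size();
        float *d_values, *d_partial;
        unsigned short* d_elems;
        cudaMalloc(&d_values, n * sizeof(float));
        cudaMalloc(&d_elems, n * sizeof(unsigned short));
        cudaMalloc(&d_partial, sizeof(float));
        cudaMemcpy(d_values, pages[d].data(), n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_elems, elems[d].data(), n * sizeof(unsigned short), cudaMemcpyHostToDevice);
        cudaMemset(d_partial, 0, sizeof(float));

        partialAggregate<<<(n + 255) / 256, 256>>>(d_values, d_elems, n, d_partial);

        float partial = 0.0f;
        cudaMemcpy(&partial, d_partial, sizeof(float), cudaMemcpyDeviceToHost);
        total += partial;   // the main host thread combines per-device results

        cudaFree(d_values);
        cudaFree(d_elems);
        cudaFree(d_partial);
    }
    printf("total over %d device(s): %.1f\n", deviceCount, total);
    return 0;
}

A real implementation would typically drive the devices concurrently (one host thread or CUDA stream per card) and keep the cube pages resident in GPU memory between queries instead of copying them per query.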

Shared Temporary data for Management solutions in Europe. Jedox’ core product, Palo BI Memory query processing Suite, accommodates the entire range of BI requirements Multiprocessor 1 including planning, reporting and analysis. Multiprocessor 2 . . The multidimensional Palo OLAP Server at the core of the Palo Multiprocessor N BI Suite integrates simply and easily existing MS-Excel solutions and optimizes planning, reporting and analysis.