Evalua on of distributed open source solu ons in CERN database use cases
HEPiX, spring 2015 Kacper Surdy IT-DB-DBF M. Grzybek, D. L. Garcia, Z. Baranowski, L. Canali, E. Grancher
Mo va ons
• propose be er suited solu on for big volumes of read-only data • open new possibili es for data analysis • offload Oracle in data warehouse workloads • our users ask for this!
3 Hadoop evalua on
• One interface (SQL) • Different use cases from CERN • LHC log, Controls data, Atlas jobs manager (PANDA) • Different query engines • Different approaches for storing data
4 Query engines Apache Hive • SQL -> MapReduce jobs Cloudera Impala • Custom SQL planner and executor • Uses Hive metastore Apache Spark SQL • Spark lightweight threading execu on • Can use Hive metastore
5 Data formats Data storage format and compression type make a difference • data volume • IO vs. CPU bound throughput • benefits of columnar storage • par oning
6 Data formats CSV • human readable
7 ACCLOG 8 days - file size comparison 1800 GB
1600 GB
1400 GB 1240 GB 1200 GB
1000 GB
800 GB 649 GB 600 GB
400 GB 238 GB 200 GB 109 GB
0 GB CSV SequenceFile Avro Parquet no compression snappy bzip2 Oracle
8 Data formats CSV • human readable SequenceFile • flat set of binary stored (key, value) pairs
9 ACCLOG 8 days - file size comparison 1800 GB
1600 GB 1545 GB
1400 GB 1240 GB 1200 GB
1000 GB
800 GB 649 GB 600 GB
400 GB 238 GB 265 GB 200 GB 109 GB 117 GB
0 GB CSV SequenceFile Avro Parquet no compression snappy bzip2 Oracle
10 Data formats CSV • human readable SequenceFile • flat set of binary stored (key, value) pairs Apache Avro • binary serializing standard
11 ACCLOG 8 days - file size comparison 1800 GB
1600 GB 1545 GB
1400 GB 1240 GB 1200 GB
1000 GB
800 GB 649 GB 600 GB 542 GB
400 GB 265 GB 238 GB 226 GB 171 GB 200 GB 109 GB 117 GB
0 GB CSV SequenceFile Avro Parquet no compression snappy bzip2 Oracle
12 ACCLOG 8 days - union query execu on me 2000 s 1800 s 1800 s
1600 s
1400 s
1200 s
1000 s
800 s 682 s 572 s 600 s
400 s 216 s 200 s 113 s 118 s
0 s SequenceFile Avro Parquet no compression snappy bzip2
13 Data formats CSV • human readable SequenceFile • flat set of binary stored (key, value) pairs Apache Avro • binary serializing standard Parquet • binary columnar-storage
14 ACCLOG 8 days - file size comparison 1800 GB
1600 GB 1545 GB
1400 GB 1240 GB 1200 GB
1000 GB
800 GB 649 GB 600 GB 542 GB 558 GB
400 GB 265 GB 288 GB 238 GB 226 GB 171 GB 200 GB 109 GB 117 GB
0 GB CSV SequenceFile Avro Parquet no compression snappy bzip2 Oracle
15 ACCLOG 8 days - union query execu on me 2000 s 1800 s 1800 s
1600 s
1400 s
1200 s
1000 s
800 s 682 s 572 s 600 s
400 s 328 s 216 s 200 s 113 s 118 s 117 s
0 s SequenceFile Avro Parquet no compression snappy bzip2
16 Poten al columnar store benefit
Ntuples data exeuc on comparison
500 467.31 450
400
350
300
250
200 Execu on me [s] 150 117 100
50 23.39 0 Parquet Avro Oracle
17 Scalability - Impala
ACCLOG data 300
250
200
150 Measured Ideal
Execu on me [s] 100
50
0 4 8 12 16 Number of nodes
18 Scalability - Hive
ACCLOG data – Hive scalability 600
500
400
300 Measured Ideal
Execu on me [s] 200
100
0 4 8 12 16 Number of nodes
19 Scalability – Spark SQL
ACCLOG data 200 180 160 140
120 100 Measured 80 Ideal 60
Execu on me [s] 40 20 0 4 8 12 16 Number of nodes
20 Summary
• Data format choice is a key • Heterogeneous data in the same place • No ma er what interface -> it scales
21