Classifying, Evaluating and Advancing Big Data Benchmarks

Dissertation for the attainment of the doctoral degree in Natural Sciences

submitted to the Fachbereich 12 Informatik (Department of Computer Science) of the Johann Wolfgang Goethe-Universität in Frankfurt am Main

by Todor Ivanov from Stara Zagora

Frankfurt am Main 2019 (D 30). Accepted as a dissertation by the Fachbereich 12 Informatik of the Johann Wolfgang Goethe-Universität.

Dean: Prof. Dr. Andreas Bernig

Reviewers: Prof. Dott.-Ing. Roberto V. Zicari, Prof. Dr. Carsten Binnig

Date of the disputation: 23.07.2019

Abstract

The main contribution of the thesis is in helping to understand which software systems and parameters most affect the performance of Big Data platforms under realistic workloads. In detail, the main research contributions of the thesis are:

1. Definition of the new concept of heterogeneity for Big Data Architectures (Chapter 2);

2. Investigation of the performance of Big Data systems (e.g. Hadoop) in virtualized environments (Section 3.1);

3. Investigation of the performance of NoSQL databases versus Hadoop distributions (Section 3.2);

4. Execution and evaluation of the TPCx-HS benchmark (Section 3.3);

5. Evaluation and comparison of Hive and Spark SQL engines using benchmark queries (Section 3.4);

6. Evaluation of the impact of compression techniques on SQL-on-Hadoop engine performance (Section 3.5);

7. Extensions of the standardized Big Data benchmark BigBench (TPCx-BB) (Sections 4.1 and 4.3);

8. Definition of a new benchmark, called ABench (Big Data Architecture Stack Benchmark), that takes into account the heterogeneity of Big Data architectures (Section 4.5).

The thesis is an attempt to redefine system benchmarking, taking into account the new requirements posed by Big Data applications. With the explosion of Artificial Intelligence (AI) and new hardware computing power, this is a first step towards a more holistic approach to benchmarking.


Summary

Motivation

In the age of Big Data, often characterized by the so-called 3Vs (Volume, Velocity and Variety) [Lan01; ZE+11], it is essential to use the right tools and best practices when developing Big Data applications. Traditionally, benchmarking tools and methods have been used to compare different technologies both in terms of performance and functionality [Gra92]. With the growing number of open-source and enterprise tools in the Big Data ecosystem [Gre18], the need for standardized Big Data benchmarks that enable a detailed comparison between these new technologies has grown strongly [Che+12]. At the same time, advances in hardware development, such as new hardware accelerators and configurable components [Ozd18] (e.g. NVMs (Non-Volatile Memory) [Bou+18], GPUs (Graphics Processing Unit) [Shi+18], FPGAs (Field Programmable Gate Array) [VF18; VPK18], TPUs (Tensor Processing Units) [Jou+17] and more), point to a complete overhaul of the existing software stack [KFS18]. Such major changes in the backend systems affect both the processing and the storage layer. Artificial Intelligence [BAM19], Machine Learning [PAC18] and Deep Learning [Zha+18] take advantage of the new hardware accelerators [KFS18] and are being developed and put into production at a very high pace. In order to optimize and validate the benefits of a new software stack, suitable and standardized Big Data benchmarks that include Machine Learning are required.

The goal of this thesis is to help software developers and system architects choose the most effective Big Data platform that does the best job for the chosen class of Big Data applications. The main contributions cover various relevant aspects: from understanding the current challenges in Big Data platforms, through selecting the most suitable benchmark for stress testing the chosen Big Data technologies, to tuning the relevant platform components for more efficient processing and storage of data.

Approach

The thesis uses a novel hybrid benchmark methodology (Chapter 3) consisting of a mix of best practices and existing benchmark methodologies, used in combination with popular standardized benchmarks. In particular, the approach is inspired by the TPC benchmark methodologies, but at the same time tries to remain flexible and adaptable to the new types of Big Data benchmarks, which in most cases do not provide a systematic methodology for performing experiments. Figure 0.1 illustrates the four main phases of the hybrid benchmark methodology. Each phase is explained below:

Fig. 0.1.: Generalized Hybrid Benchmark Methodology (Iterative Experimental Approach), used in Chapter 3.

• Phase 1 - Platform Setup: In this phase all hardware and software components are installed and configured. This includes the installation and configuration of the operating system, the network, the programming frameworks, e.g. the Java environment, and the Big Data system under test.

• Phase 2 - Workload Preparation: The benchmark used to stress test the components of the underlying Big Data system is installed and configured. At the same time, the platform components under test are configured and prepared for the planned benchmarking scenario. All workload parameters are defined and the test data is generated by the benchmark data generator. The generated data, together with the defined parameters, is used as input for executing the workload in Phase 3. Additionally, tools for collecting metrics and resource utilization statistics are deployed.

• Phase 3 - Workload Execution: In Phase 3 the benchmark experiments are executed. As a rule, each experiment is repeated three (or more) times to ensure the representativeness of the results and to make sure that there are no cache effects or undesired influences between consecutive test executions. The average execution time over the three runs is recorded as the final value, together with the standard deviation of the three runs. Typically, a higher standard deviation percentage indicates a failure or misconfiguration of a system component, leading to overall unstable system performance. Before each workload experiment, the test data must be reset to its initial state, typically by deleting the data and generating new data in Phase 2. In some cases the platform caches have to be cleared to ensure a consistent state in each experiment.

• Phase 4 - Evaluation: In this phase the benchmark results are validated to guarantee the correctness of the benchmark. Then the benchmark metrics (typically execution time and throughput) and the resource utilization statistics (CPU, network, I/O and memory) are evaluated. Graphical representations (charts) of the various metrics are used for further investigation and analysis of the results.

The hybrid benchmark methodology described above can be executed iteratively, as depicted in Figure 0.1, where each test execution is repeated at least three times (switching between Phase 2 and Phase 4). Apart from the mandatory three test runs, there are further reasons to repeat certain test executions, for example varying the component parameters and configurations, or evaluating the platform performance under different data sizes. Variations of the hybrid benchmark methodology described above were used to perform the different experimental evaluations and comparisons in Chapter 3.

Main research contributions and summary of results

The main contribution of the dissertation is to provide methods and tools for understanding which software systems and which parameters most affect the performance of Big Data platforms under specific workloads.

In detail, the main research contributions of the thesis are as follows:

1. Definition of the new concept of heterogeneity for Big Data architectures (Chapter 2) - This chapter introduces the concept of heterogeneity in Big Data architectures, which can be seen as an internal property of big data systems targeting both vertical and generic workloads, and discusses how this can be linked with the existing Hadoop ecosystem.

2. Investigation of the performance of Big Data systems (e.g. Hadoop) in virtualized environments (Section 3.1) - This study investigates the performance of typical Big Data applications running on a virtualized Hadoop cluster with separated data and computation layers, compared to a standard Hadoop cluster installation. The experiments show that different Hadoop configurations using the same virtualized resources can lead to different performance. Based on the experimental results, three important influencing factors are identified.

3. Investigation of the performance of NoSQL databases versus Hadoop distributions (Section 3.2) - This experimental work compares Hadoop with a representative NoSQL database (Cassandra) that offers a storage interface similar to the Hadoop Distributed File System (HDFS). Both storage technologies were tested with the HiBench benchmark suite. The experimental results show that Cassandra is faster than HDFS when writing files, whereas HDFS is faster than Cassandra when reading files.

4. Execution and evaluation of the TPCx-HS benchmark (Section 3.3) - This study of Hadoop performance uses the TPCx-HS benchmark with the goal of exposing the bottlenecks of the system under test. The results show that the benchmark simulates heavy network- and I/O-intensive scenarios that require very well configured network and cluster environments.

5. Evaluation and comparison of Hive and Spark SQL engines using benchmark queries (Section 3.4) - In this performance study, two SQL-on-Hadoop engines (Hive on MapReduce and Spark SQL) are examined, evaluated and compared using the BigBench benchmark. As part of the work, some of the Hive queries were modified so that they could run on Spark SQL. The experimental results show that Spark SQL is faster than Hive, but still exhibits unstable performance on many queries.

6. Evaluation of the impact of compression techniques on SQL-on-Hadoop engine performance (Section 3.5) - This experiment investigates the influence of popular columnar file formats (Parquet [18j] and ORC [18i]) on the performance of SQL-on-Hadoop engines (Hive and Spark SQL) using the standardized BigBench (TPCx-BB) benchmark. The results show that ORC generally performs better on Hive, whereas Parquet is faster with Spark SQL. They also show that using different compression techniques with the two file formats can influence engine performance.

7. Extensions of the standardized Big Data benchmark BigBench (TPCx-BB) (Sections 4.1 and 4.3) - A new version of the standardized BigBench benchmark, called BigBench V2, is defined. One main contribution is the removal of the complex snowflake-like schema of TPC-DS, which is replaced by a simple star schema, resulting in a more realistic Big Data model. Furthermore, BigBench V2 mandates late binding by requiring query processing to be performed directly on key-value web logs rather than on a pre-parsed form, as in BigBench. A proof-of-concept implementation is presented that confirms the feasibility of the new benchmark. A BigBench V2 streaming component, which stresses the velocity characteristics of Big Data applications, is developed and demonstrated with a proof-of-concept implementation in Section 4.3.

8. Definition of a new benchmark, called ABench (Big Data Architecture Stack Benchmark), that takes into account the heterogeneity of Big Data architectures (Section 4.5) - Motivated by the new features of emerging Big Data technologies and the large gap in the existing relevant Big Data benchmarks, this chapter presents the definition of a new type of benchmark that takes into account the heterogeneity of Big Data architectures, covering multiple Big Data use cases, workloads and technologies. The new benchmark, called Big Data Architecture Stack Benchmark, or ABench for short, is an attempt to define a new research direction in benchmarking.

This dissertation is an attempt to redefine system benchmarking, taking into account the new requirements posed by Big Data applications. With the rapidly growing use of Artificial Intelligence (AI) and new hardware computing power, this is a first step towards a more holistic approach to benchmarking.

Structure of the Dissertation

Chapter 2 discusses the challenges of Big Data and emerging technologies by examining the main Big Data platform layers both on-premise and in the cloud, and introduces the new concept of heterogeneity for Big Data platforms.

Chapter 3 is structured as follows:

• Section 3.1 investigates the performance of Big Data deployments in virtualized (e.g. Hadoop) environments and stresses the management and platform layers.

• Section 3.2 investigates the performance of NoSQL databases compared to Hadoop distributions.

• Section 3.3 executes and evaluates how the TPCx-HS benchmark stresses the hardware, management and platform layers.

• Section 3.4 compares and evaluates the Hive and Spark SQL engines using benchmark queries.

• Section 3.5 investigates and evaluates the influence of compression techniques on SQL-on-Hadoop engine performance.

Figure 0.2 illustrates the tight integration between the hybrid benchmark methodology and the different Big Data architecture layers (blue), which are mapped to the heterogeneity levels defined in Chapter 2.

Chapter 4 is organized as follows:

• Section 4.1 extends the standardized Big Data benchmark BigBench (TPCx-BB) into BigBench V2.

Fig. 0.2.: The hybrid benchmark methodology integrated with the different Big Data architecture layers, which are mapped to the heterogeneity levels.

• Section 4.2 evaluates the use of Hive and Spark SQL with BigBench V2.

• Section 4.3 extends BigBench V2 with a streaming component that stresses the velocity characteristics of Big Data applications.

• Section 4.4 evaluates Spark Structured Streaming using the streaming extension of BigBench V2.

• Section 4.5 defines a new benchmark, called ABench (Big Data Architecture Stack Benchmark), that takes into account the heterogeneity of Big Data architectures.

Figure 0.3 illustrates the relationships between Chapters 3, 4 and 5 of the thesis.

In particular, the experience with the standardized BigBench (TPCx-BB) in evaluating an SQL-on-Hadoop engine inspired the development and implementation of a new version called BigBench V2 (Sections 4.1 and 4.2).

While experimenting with BigBench V2 and exploring the concept of heterogeneity in Big Data platforms, it became clear that there is a need for a more comprehensive end-to-end benchmark for Big Data. In addition, such a benchmark should be easily extensible to cover new types of Big Data workloads such as streaming queries and complex machine learning pipelines. This led to the development of a new benchmark, called Big Data Architecture Stack Benchmark, or ABench for short (Section 4.5).

ABench is the central concept of Chapter 4 (marked in green in Figure 0.3) and builds on the existing BigBench V2, which we also extended with a streaming component (see Section 4.3). Currently, two further extensions of ABench are under development and are presented in Chapter 5 (marked in yellow in Figure 0.3). The first adds a further workload to the machine learning workloads (Section 5.1.2), while the second focuses on building a new flexible platform infrastructure (Section 5.1.1) to seamlessly enable both cloud and on-premise benchmark implementations.

Fig. 0.3.: ABench Roadmap

Finally, Chapter 5 concludes the thesis by outlining the contributions of the dissertation, presenting the current status of the ongoing benchmark projects and the future work.

The seven appendices of the thesis include a novel classification of existing Big Data benchmarks as well as supplementary material such as charts, tables and query code used in the various experimental studies.


Contents

1 Introduction
  1.1 Motivation
  1.2 Approach and methodologies used in this research
    1.2.1 Background
    1.2.2 Methodology
  1.3 Main research contributions and summary of results
  1.4 Dissertation Structure
  1.5 Publications and Contributions
    1.5.1 List of Publications
    1.5.2 List of Project Deliverables
    1.5.3 List of Co-supervised Master Theses
    1.5.4 List of Co-supervised Bachelor Theses

2 The Heterogeneity Paradigm in Big Data Architectures
  2.1 Introduction
  2.2 Background
    2.2.1 Big Data characteristics and Cost Factor
    2.2.2 Cloud Computing and Big Data
  2.3 Challenges in Big Data Architectures
    2.3.1 Heterogeneous Systems
    2.3.2 Emerging Big Data Systems
  2.4 Heterogeneity in Big Data Systems
    2.4.1 Hardware Level
    2.4.2 Management Level
    2.4.3 Platform Level
    2.4.4 Application Level
  2.5 Summary and Future Research Directions

3 Evaluation of Big Data Platforms and Benchmarks
  3.1 Performance Evaluation of Virtualized Hadoop Clusters
    3.1.1 Introduction
    3.1.2 Background
    3.1.3 Experimental Environment
    3.1.4 Benchmarking Methodology
    3.1.5 Experimental Results
    3.1.6 Lessons Learned
  3.2 Performance Evaluation of Enterprise Big Data Platforms with HiBench
    3.2.1 Introduction
    3.2.2 Background
    3.2.3 Related Work
    3.2.4 Setup and Configuration
    3.2.5 Experimental Results
    3.2.6 Lessons Learned
  3.3 Evaluating Hadoop Clusters with TPCx-HS
    3.3.1 Introduction
    3.3.2 Background
    3.3.3 Experimental Setup
    3.3.4 Benchmarking Methodology
    3.3.5 Experimental Results
    3.3.6 Resource Utilization
    3.3.7 Lessons Learned
  3.4 Performance Evaluation of Spark SQL using BigBench
    3.4.1 Introduction
    3.4.2 Towards BigBench on Spark
    3.4.3 Issues and Improvements
    3.4.4 Performance Evaluation
    3.4.5 Query Resource Utilization
    3.4.6 Lessons Learned and Future Work
  3.5 The Influence of Columnar File Formats on SQL-on-Hadoop Engine Performance
    3.5.1 Introduction
    3.5.2 Background and Related Work
    3.5.3 Experimental Setup
    3.5.4 Hive Results
    3.5.5 Spark SQL Results
    3.5.6 In-Depth Query Analysis
    3.5.7 Summary and Lessons Learned
  3.6 Conclusions

4 ABench: Big Data Architecture Stack Benchmark
  4.1 BigBench V2: The New and Improved BigBench
    4.1.1 Introduction
    4.1.2 Related Work
    4.1.3 Data Model
    4.1.4 Data Generation
    4.1.5 Workload
    4.1.6 Proof of Concept
    4.1.7 Conclusions & Future Work
  4.2 Evaluating Hive and Spark SQL with BigBench V2
    4.2.1 Motivation
    4.2.2 Experimental Setup
    4.2.3 Hive Data Size Scaling
    4.2.4 Spark SQL Data Size Scaling
    4.2.5 Conclusion
  4.3 Adding Velocity to BigBench
    4.3.1 Introduction
    4.3.2 Related Work
    4.3.3 BigBench Streaming Extension
    4.3.4 Proof of Concept
    4.3.5 Conclusions and Future Work
  4.4 Exploratory Analysis of Spark Structured Streaming
    4.4.1 Introduction
    4.4.2 Spark Structured Streaming
    4.4.3 Related Work
    4.4.4 Structured Streaming Micro-benchmark
    4.4.5 Exploratory Analysis
    4.4.6 Lessons Learned
  4.5 ABench: Big Data Architecture Stack Benchmark
    4.5.1 Motivation
    4.5.2 Benchmark Overview
    4.5.3 Architecture Benchmark Framework
    4.5.4 Benchmark Use Cases
    4.5.5 Conclusions

5 Conclusions
  5.1 On-going Research Work
    5.1.1 ABench: Flexible Platform Infrastructure Implementation
    5.1.2 ABench: Enriching the Machine Learning Workloads
    5.1.3 DataBench Project
  5.2 Conclusions

Bibliography

Appendices

Appendix A Classification of Big Data Benchmarks
  A.1 Benchmark Organizations
  A.2 Micro-Benchmarks
  A.3 Big Data and SQL-on-Hadoop Benchmarks
  A.4 Streaming Benchmarks
  A.5 Machine and Deep Learning Benchmarks
  A.6 Graph Benchmarks
  A.7 Emerging Benchmarks
  A.8 Benchmark Platforms

Appendix B Cluster Hardware

Appendix C Evaluating Hadoop Clusters with TPCx-HS

Appendix D BigBench V2 - New Queries Implementation
  D.1 BigBench Q05
  D.2 BigBench Q06
  D.3 BigBench Q07
  D.4 BigBench Q09
  D.5 BigBench Q13
  D.6 BigBench Q14
  D.7 BigBench Q16
  D.8 BigBench Q17
  D.9 BigBench Q19
  D.10 BigBench Q20
  D.11 BigBench Q21
  D.12 BigBench Q22
  D.13 BigBench Q23

Appendix E Evaluating Hive and Spark SQL with BigBench V2

Appendix F Performance Evaluation of Spark SQL using BigBench

Appendix G The Influence of Columnar File Formats on SQL-on-Hadoop Engine Performance
  G.1 BigBench Q08 (MapReduce/Python)
  G.2 BigBench Q10 (HiveQL/OpenNLP)
  G.3 BigBench Q12 (Pure HiveQL)
  G.4 BigBench Q25 (HiveQL/SparkMLlib)

1 Introduction

1.1 Motivation

In the age of Big Data, often characterized by the so-called 3Vs (Volume, Velocity and Variety) [Lan01; ZE+11], it is essential to use the right tools and best practices when implementing Big Data applications. Traditionally, benchmarking tools and methodologies have been used to compare different technologies both in terms of performance and functionality [Gra92]. With the growing number of open-source and enterprise tools in the Big Data ecosystem [Gre18], the need for standardized Big Data benchmarks that provide an accurate comparison between these new technologies has become very important [Che+12]. At the same time, advances in hardware development, such as new hardware accelerators and configurable components [Ozd18] (e.g. NVMs (Non-Volatile Memory) [Bou+18], GPUs (Graphics Processing Unit) [Shi+18], FPGAs (Field Programmable Gate Array) [VF18; VPK18], TPUs (Tensor Processing Units) [Jou+17] and more), suggest a complete rewriting of the existing software stack [KFS18]. Such major changes in the backend systems impact both the processing and the storage layers. Artificial Intelligence [BAM19], Machine Learning [PAC18] and Deep Learning [Zha+18] are taking advantage of the new hardware accelerators [KFS18] and are being developed and put into production at a very high pace. In order to optimize and validate the benefits of the new software stack, suitable and standardized Big Data benchmarks that include Machine Learning are necessary.

The goal of this thesis is to help software developers and system architects choose the most effective Big Data platform for the chosen class of Big Data applications. The main thesis contributions cover various relevant aspects: from understanding the current challenges in Big Data platforms, through choosing the most appropriate benchmark to stress test the selected Big Data technologies, to tuning the relevant platform components to process and store data more efficiently.

1.2 Approach and methodologies used in this research

1.2.1 Background

The main focus of the thesis is benchmarking, which is a common term in both technical and business fields [Cam95; AP95; Gra92]. In particular, the work deals with Big Data benchmarks primarily inspired by previous related work in standardized database benchmarks. The meaning of the word benchmark is defined in [AP95] as:

" A predefined position, used as a reference point for taking measures against."

This is a very general definition, valid for both business and technical benchmarks. However, there is no formal definition of Big Data or Analytics benchmarks, which are the main topic of this dissertation.

Technical benchmarking can be seen as a process of applying transparent and common methodologies to compare systems or software technologies. Jim Gray back in 1992 [Gra92] described benchmarking as follows:

"This quantitative comparison starts with the definition of a benchmark or workload. The benchmark is run on several different systems, and the performance and price of each system is measured and recorded. Performance is typically a throughput metric (work/second) and price is typically a five-year cost-of-ownership metric. Together, they give a price/performance ratio."

In short, we define a software benchmark as a program used for comparing software products/tools executing on a pre-configured hardware environment.
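To make Gray's price/performance notion concrete, the short sketch below computes the ratio for two systems; the throughput and cost figures are hypothetical, chosen only to illustrate the metric, and are not taken from the thesis or from any benchmark result.

```python
# Hypothetical illustration of Gray's price/performance metric:
# a throughput metric (work/second) and a five-year cost of ownership
# combine into cost per unit of throughput.
systems = {
    # name: (throughput in operations per second, 5-year cost of ownership in USD)
    "System A": (12_000, 450_000),
    "System B": (9_500, 280_000),
}

for name, (throughput, five_year_cost) in systems.items():
    price_performance = five_year_cost / throughput  # USD per op/s
    print(f"{name}: {throughput} ops/s, ${five_year_cost:,} over 5 years "
          f"-> {price_performance:.2f} USD per op/s")
```

With these made-up numbers, the slower system still wins on price/performance, which is exactly the trade-off the combined metric is designed to expose.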

Why and when do you need a benchmark? Typically, technical benchmarks are used to compare hardware platforms, software technologies or the combination of both. Jim Gray [Gra92] identifies four comparison categories in which benchmarks can be applied:

• To compare different software and hardware systems: The goal is to use the metric reported by the benchmark as a comparable unit for evaluating the performance of different data technologies on different hardware running the same application. This case represents the classical competitive situation between hardware vendors.

• To compare different software on one machine: The goal is to use the benchmark to evaluate the performance of two different software products running on the same hardware environment. This case represents the classical competitive situation between software vendors.

• To compare different machines in a comparable family: The objective is to compare similar hardware environments by running the same software product and application benchmark on each of them. This case represents a comparison between different generations of a vendor's hardware, or a comparison between different hardware vendors.

• To compare different releases of a product on one machine: The objective is to compare different releases of a software product by running benchmark experiments on the same hardware. Ideally, a new release should perform faster (based on the benchmark metric) than its predecessors. This can also be seen as a performance regression test that assures the new release supports all previous system features.

Motivated by the different comparison categories, there are different types of benchmarks: micro-benchmarks, application-level benchmarks and benchmark suites.

Micro-benchmarks are programs or routines that measure and test the performance of a single component or task [Pog18]. They are used to evaluate either individual system components or specific system behaviors (or functions of code) [HJZ18]. Micro-benchmarks report simple and well-defined quantities such as elapsed time, rate of operations, bandwidth, or latency [Pog18]. Typically, they are developed for a specific technology, which reduces their complexity and development overhead. Popular micro-benchmark examples, some of which are shipped with the Hadoop binaries, are WordCount, TestDFSIO, Pi, K-means, HiveBench [And11] and many others.
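As an illustration of the kind of quantities a micro-benchmark reports, the following self-contained sketch times a toy WordCount-style task and derives an elapsed time and an operation rate; it is a simplified stand-in, not the Hadoop WordCount job itself, and its input data is synthetic.

```python
import time
from collections import Counter

def wordcount_microbenchmark(lines):
    """Toy WordCount-style task: counts words and reports the elapsed time
    and processing rate, i.e. the kind of simple, well-defined quantities
    a micro-benchmark typically produces."""
    start = time.perf_counter()
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    elapsed = time.perf_counter() - start
    total_words = sum(counts.values())
    return {
        "elapsed_s": elapsed,
        "words_counted": total_words,
        "words_per_s": total_words / elapsed if elapsed > 0 else float("inf"),
    }

if __name__ == "__main__":
    # Synthetic input; a real micro-benchmark run would read its input
    # from HDFS or local disk instead.
    sample = ["big data benchmarking with hadoop and spark"] * 100_000
    print(wordcount_microbenchmark(sample))
```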

Application-level benchmarks, also known as end-to-end benchmarks, are designed to evaluate the entire system using typical application scenarios, where each scenario corresponds to a collection of related workloads [HJZ18]. Typically, this type of benchmark is more complex and is implemented using multiple technologies, which makes it significantly harder to develop. Examples of application-level Big Data benchmarks are those standardized by the Transaction Processing Performance Council (TPC) [TPC18a], such as TPC-H [18ah], TPC-DS [18af], BigBench (TPCx-BB) [TPC17] and many others.

Benchmark suites are combinations of different micro- and/or end-to-end (application-level) benchmarks, and aim to provide comprehensive benchmarking solutions [HJZ18]. Examples of Big Data benchmark suites are HiBench [Int15], SparkBench [Min15], CloudSuite [Fer+12b], BigDataBench [ICT15], PUMA [Uni12] and many others.

Another important distinction is whether benchmarks are standardized by an official organization (like SPEC [SPE18] or TPC [TPC18a]) or not standardized (typically developed by a vendor or research organization).

Jim Gray [Gra92] defined four important criteria that domain-specific benchmarks must meet:

• Relevant: It must measure the peak performance and price/performance of systems when performing typical operations within that problem domain.

• Portable: It should be easy to implement the benchmark on many different systems and architectures.

• Scalable: The benchmark should apply to small and large computer systems. It should be possible to scale the benchmark up to larger systems, and to parallel computer systems as computer performance and architecture evolve.

• Simple: The benchmark must be understandable/interpretable; otherwise it will lack credibility.

Similarly, Karl Huppler [Hup09] outlines five key characteristics that all "good" benchmarks should have:

• Relevant - A reader of the result believes the benchmark reflects something important.

• Repeatable - There is confidence that the benchmark can be run a second time with the same result.

• Fair - All systems and/or software being compared can participate equally.

• Verifiable - There is confidence that the documented result is real.

• Economical - The test sponsors can afford to run the benchmark.

1.2.2 Methodology

The thesis utilizes a novel hybrid benchmark methodology (Chapter 3) consisting of a mix of best practices and existing benchmark methodologies used in combination with popular standardized benchmarks. In particular, the approach is inspired by the TPC benchmark methodologies, but tries to remain flexible and adaptive to the new types of Big Data benchmarks, which in most cases do not provide a systematic methodology for performing experiments. Figure 1.1 illustrates the four main phases of the hybrid benchmark methodology. Each phase is explained in the text below:

• Phase 1 - Platform Setup: This is the initial phase where all hardware and software components are installed and configured. These include the installation and configuration of the Operating System, Network, Programming Frameworks, e.g. Java Environment, and the Big Data System under test.

• Phase 2 - Workload Preparation: The benchmark that will be used to stress test the components of the underlying Big Data System is installed and configured. At the same time, the platform components under test are configured and prepared for the planned benchmarking scenario. All workload parameters are defined and the test data is generated by the benchmark data generator. The generated data, together with the defined parameters, is then used as input to execute the workload in Phase 3. Additionally, tools for collecting metrics and resource utilization statistics are set up.

Fig. 1.1.: Generalized Hybrid Benchmark Methodology (Iterative Experimental Approach) used in Chapter 3

• Phase 3 - Workload Execution: In Phase 3 the benchmark experiments are executed. Typically, each experiment is repeated three or more times to ensure the representativeness of the results and to make sure that there are no cache effects or undesired influences between the consecutive test executions. The average execution time over the three runs is taken as the final value, together with the standard deviation of the three runs (a short sketch of this aggregation is given after the methodology description below). Typically, a higher standard deviation percentage suggests a failure or misconfiguration in a system component that leads to overall unstable system performance. Before each workload experiment, the test data needs to be reset to its initial state, typically by deleting the data and generating new data in Phase 2. In some cases the platform caches need to be cleared to ensure a consistent state in each experiment.

• Phase 4 - Evaluation: In this phase, the benchmark results are validated to guarantee the correctness of the benchmark run. Then the benchmark metrics (typically execution time and throughput) and the resource utilization statistics (CPU, Network, I/O and Memory) are evaluated. Graphical representations (charts and diagrams) of the various metrics are used for further investigation and analysis of the results.

The hybrid benchmark methodology described above can be executed in an iterative manner, as depicted in Figure 1.1, where each test execution is repeated at least three times (switching between Phase 2 and Phase 4). Apart from the mandatory three test runs, there are various other reasons to repeat certain test executions, for example varying the component parameters and configurations, or evaluating the platform performance under different data sizes. Variations of the hybrid benchmark methodology described above were used to perform the different experimental evaluations and comparisons in Chapter 3.
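The sketch below, referenced in the Phase 3 description, illustrates how the repeated run times could be aggregated into the reported mean and standard deviation and how a suspiciously high deviation might be flagged; the 10% threshold and the helper name are assumptions made for illustration and are not prescribed by the methodology.

```python
import statistics

def aggregate_runs(run_times_s, unstable_threshold_pct=10.0):
    """Aggregate repeated benchmark runs (Phase 3 of the hybrid methodology):
    report the mean execution time, the standard deviation, and whether the
    relative deviation hints at an unstable or misconfigured system.
    The 10% threshold is an illustrative assumption."""
    if len(run_times_s) < 3:
        raise ValueError("the methodology expects at least three runs")
    mean_s = statistics.mean(run_times_s)
    stdev_s = statistics.stdev(run_times_s)  # sample standard deviation
    stdev_pct = 100.0 * stdev_s / mean_s if mean_s else 0.0
    return {
        "mean_s": mean_s,
        "stdev_s": stdev_s,
        "stdev_pct": stdev_pct,
        "possibly_unstable": stdev_pct > unstable_threshold_pct,
    }

# Example with made-up execution times (in seconds) of three repeated runs:
print(aggregate_runs([412.3, 405.9, 498.7]))
```

In practice, the threshold for an acceptable deviation depends on the workload and the platform under test.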

1.3 Main research contributions and summary of results

The main contribution of the thesis is in helping to understand which software systems and which parameters most affect the performance of Big Data platforms under specific workloads.

In detail, the main research contributions of the thesis are as follows:

1. Definition of the new concept of heterogeneity for Big Data Architectures (Chapter 2) - This chapter introduces the concept of heterogeneity in Big Data architectures, which can be seen as an internal property of big data systems targeting both vertical and generic workloads, and discusses how this can be linked with the existing Hadoop ecosystem.

2. Investigation of the performance of Big Data systems (e.g. Hadoop) in virtualized environments (Section 3.1) - This study investigates the performance of typical Big Data applications running on a virtualized Hadoop cluster with separated data and computation layers against a standard Hadoop cluster installation. The experiments showed that different Hadoop configurations utilizing the same virtualized resources can result in different performance, based on which three important factors are identified.

3. Investigation of the performance of NoSQL databases versus Hadoop distributions (Section 3.2) - This experimental work compares Hadoop with a representative NoSQL storage engine (Cassandra) that offers a storage interface similar to the Hadoop Distributed File System (HDFS). Both storage technologies were tested using the HiBench benchmark suite. The experimental results showed that Cassandra is faster than HDFS in writing files, whereas HDFS is faster than Cassandra in reading files.

4. Execution and evaluation of the TPCx-HS benchmark (Section 3.3) - This study of Hadoop performance utilizes the TPCx-HS benchmark with the goal of determining the bottlenecks of the system under test. The results show that the benchmark simulates heavy network- and I/O-intensive scenarios that require very well configured network and cluster environments.

5. Evaluation and comparison of Hive and Spark SQL engines using benchmark queries (Section 3.4) - In this performance study, two SQL-on-Hadoop engines (Hive on MapReduce and Spark SQL) are evaluated and compared using the BigBench benchmark. As part of the research, a number of the Hive queries were modified to be able to run on Spark SQL. The experimental results showed that Spark SQL is faster than Hive, but still with unstable performance on many queries.

6. Evaluation of the impact of compression techniques on SQL-on-Hadoop engine performance (Section 3.5) - This experiment investigates the influence of popular Columnar File Formats (Parquet [18j] and ORC [18i]) on the performance of SQL-on-Hadoop engines (Hive and Spark SQL) using the standardized BigBench (TPCx-BB) benchmark. The results show that ORC generally performs better on Hive, whereas Parquet is faster with Spark SQL. They also show that using different compression techniques on both file formats can influence the engine performance.

7. Extensions of the standardized Big Data benchmark BigBench (TPCx-BB) (Sections 4.1 and 4.3) - We defined a new version of the standardized BigBench benchmark, called BigBench V2. One main contribution was removing the complex snowflake-like schema of TPC-DS and replacing it with a simple star schema representing a real-life Big Data model. Additionally, BigBench V2 mandates late binding by requiring query processing to be done directly on key-value web logs rather than on a pre-parsed form of them, as in BigBench. A proof-of-concept implementation is presented that validates the feasibility of the new benchmark. A BigBench V2 streaming component, stressing the velocity characteristics in Big Data applications, is developed and demonstrated with a proof-of-concept implementation in Section 4.3.

8. Definition of a new benchmark, called ABench (Big Data Architecture Stack Benchmark), that takes into account the heterogeneity of Big Data architectures (Section 4.5) - Motivated by the new features in the emerging Big Data technologies and the big gap in existing relevant Big Data benchmarks, this chapter presents the definition of a new type of benchmark that takes into account the heterogeneity of Big Data architectures, incorporating multiple Big Data use cases, workloads and technologies. The new benchmark, called Big Data Architecture Stack Benchmark, or ABench for short, is an attempt to define a new research direction in benchmarking.

The thesis is an attempt to redefine system benchmarking, taking into account the new requirements posed by Big Data applications. With the explosion of Artificial Intelligence (AI) and new hardware computing power, this is a first step towards a more holistic approach to benchmarking.

1.4 Dissertation Structure

The remainder of the dissertation is organized in four chapters and seven appendices, as illustrated in Figure 1.2. The related work and background context positioning the different research areas are included in the corresponding chapters. In addition to this, the chapters focusing on related work are marked with H, whereas the ones with research contributions are marked with %.

Fig. 1.2.: Thesis Structure (Chapters)

Chapter 2 discusses the Big Data challenges and emerging technologies by investigating the main Big Data platform layers both on-premise and in the cloud, and introducing the new concept of heterogeneity for Big Data Platforms. H%

Chapter 3 covers experiments using different Big Data benchmarks to evaluate various virtualized and "bare-metal" Hadoop cluster configurations, compares the Big Data capabilities of a NoSQL storage engine (Cassandra [Cas19]) with the Hadoop Distributed File System (HDFS), and compares SQL-on-Hadoop engines (Hive [Hiva] and Spark SQL [18ac]) and Columnar File Formats (Parquet [18j] and ORC [18i]). %

Chapter 4 presents the definition and development of BigBench V2 and its streaming extension, and evaluates Spark Structured Streaming. Last but not least, the chapter presents the Big Data Architecture Stack Benchmark, called ABench, and a vision of its design. %

Finally, Chapter 5 concludes the thesis by outlining the thesis contributions, presenting the current status of the on-going benchmark projects and the future work.

The seven Appendices at the end include a novel classification of existing Big Data benchmarks, and supplementary materials such as charts, tables and query code used in the different experimental studies.

1.5 Publications and Contributions

1.5.1 List of Publications

• Todor Ivanov and Matteo Pergolesi, The Influence of Columnar File Formats on SQL-on-Hadoop Engine Performance: A Study on ORC and Parquet, (Submitted to Journal Concurrency and Computation: Practice and Experience, Wiley on 26.09.2018).

• Todor Ivanov, Patrick Bedué, Ahmad Ghazal, Roberto V. Zicari, Adding Velocity to BigBench, in Proceedings of the 7th International Workshop on Testing Database Systems, DBTest@SIGMOD 2018, Houston, TX, USA, June 15, 2018, 6:1–6:6.

• Todor Ivanov and Roberto V. Zicari, Analytics Benchmarks, in Encyclopedia of Big Data Technologies, ed. by Sherif Sakr and Albert Zomaya, Cham: Springer International Publishing, 2018, pp. 1–10.

• Todor Ivanov and Jason Taafe, Exploratory Analysis of Spark Structured Streaming, in Proceedings of the 4th International Workshop on Performance Analysis of Big data Systems (PABS 2018), April 9th, Berlin, Germany, 2018.

• Todor Ivanov and Rekha Singhal, ABench: Big Data Architecture Stack Benchmark, in Proceedings of the 9th ACM/SPEC International Conference on Performance Engineering (ICPE 2018), April 9-13, Berlin, Germany, 2018.

• Luca Felicetti, Mauro Femminella, Todor Ivanov, Pietro Lio, Gianluca Reali, A big-data layered architecture for analyzing molecular communications systems in blood vessels, in Proceedings of the 4th ACM International Conference on Nanoscale Computing and Communication, NANOCOM 2017, Washington, DC, USA, September 27-29, 2017.

• Ahmad Ghazal, Todor Ivanov, Pekka Kostamaa, Alain Crolotte, Ryan Voong, Mohammed Al-Kateb, Waleed Ghazal, Roberto Zicari, BigBench V2: The New and Improved BigBench, in Proceedings of the 33rd IEEE International Conference on Data Engineering (ICDE 2017), April 19-22, 2017, San Diego, California, USA.

• Roberto V. Zicari, Marten Rosselli, Todor Ivanov, Nikolaos Korfiatis, Karsten Tolle, Raik Niemann, Christoph Reichenbach, Setting up a Big Data Project: Challenges, Opportunities, Technologies and Optimization, in Big Data Optimization: Recent Developments and Challenges, (pp. 17-47), Springer Book, 2016.

• Todor Ivanov, Sead Izberovic, Nikolaos Korfiatis, The Heterogeneity Paradigm in Big Data Architectures, in Managing and Processing Big Data in Cloud Computing, (pp. 218-245), IGI Global Handbook of Research, 2016 and in Artificial Intelligence: Concepts, Methodologies, Tools, and Applications, (pp. 485-511), IGI Global, 2017.

• Todor Ivanov, Tilmann Rabl, Meikel Poess, Anna Queralt, John Poelman, Nicolas Poggi, Jeffrey Buell, Big Data Benchmark Compendium, in Proceedings of the 7th TPC Technology Conference on Performance Evaluation & Benchmarking (TPCTC 2015), August 31, 2015, Kohala Coast, Hawaii.

• Todor Ivanov, Raik Niemann, Sead Izberovic, Marten Rosselli, Karsten Tolle, Roberto V. Zicari, Performance Evaluation of Enterprise Big Data Platforms with HiBench, in Proceedings of the 9th IEEE International Conference on Big Data Science and Engineering (IEEE BigDataSE 2015), August 20-22, 2015, Helsinki, Finland.

• Raik Niemann and Todor Ivanov, Evaluating the Energy Efficiency of Data Management Systems, in Proceedings of the 4th IEEE/ACM International Workshop on Green and Sustainable Software (GREENS 2015), Florence, Italy, May 18, 2015.

• Todor Ivanov and Max-Georg Beer, Evaluating Hive and Spark SQL with BigBench, Frankfurt Big Data Lab, Technical Report No. 2015-2, arXiv:1512.08417.

• Todor Ivanov and Max-Georg Beer, Performance Evaluation of Spark SQL using BigBench, in Proceedings of the 6th Workshop on Big Data Benchmarking (6th WBDB), June 16-17, 2015, Toronto, Canada.

• Todor Ivanov and Sead Izberovic, Evaluating Hadoop Clusters with TPCx-HS, Frankfurt Big Data Lab, Technical Report No. 2015-1, arXiv:1509.03486.

• Raik Niemann and Todor Ivanov, Modelling the Performance, Energy Consumption and Efficiency of Data Management Systems, in Proceedings of the Workshop Big Data, Smart Data and Semantic Technologies (BDSDST 2015), September 29, INFORMATIK 2015, Cottbus, Germany.

• Todor Ivanov, Raik Niemann, Sead Izberovic, Marten Rosselli, Karsten Tolle, Roberto V. Zicari, Benchmarking DataStax Enterprise/Cassandra with HiBench, Frankfurt Big Data Lab, Technical Report No. 2014-2, arXiv:1411.4044.

• Todor Ivanov, Roberto V. Zicari, Sead Izberovic, Karsten Tolle, Performance Evaluation of Virtualized Hadoop Clusters, Frankfurt Big Data Lab, Technical Report No.2014-1, arXiv:1411.3811.

• Todor Ivanov, Roberto Zicari, Alejandro Buchmann, Benchmarking Virtualized Hadoop Clusters, in Proceedings of the 5th Workshop on Big Data Benchmarking (WBDB 2014), August 2014, Potsdam, Germany.

• Todor Ivanov, Nikolaos Korfiatis, Roberto V. Zicari, On the inequality of the 3V’s of Big Data Architectural Paradigms: A case for heterogeneity, Frankfurt Big Data Lab, Working Paper, November 2013, arXiv:1311.0805.

1.5.2 List of Project Deliverables

EU Project "Leveraging Big Data to Manage Transport Operations" (LeMO), H2020, Project ID: 770038

• Deliverable 1.3 - Big Data Methodologies, Tools and Infrastructures (2018) - Kim Hee, Todor Ivanov, Roberto V. Zicari, Rut Waldenfels, Hevin Özmen, Naveed Mushtaq, Minsung Hong, Tharsis Teoh, Rajendra Akerkar

EU Project "Evidence Based Big Data Benchmarking to Improve Business Per- formance" (DataBench), H2020, Project ID: 780966

• Deliverable 3.1 - DataBench Architecture (2018) - Tomás Pariente, Iván Martínez, Ricardo Ruiz, Todor Ivanov, Arne Berre, Chiara Francalanci

• Deliverable 1.1 - Industry Requirements with benchmark metrics and KPIs (2018) - Barbara Pernici, Chiara Francalanci, Angela Geronazzo, Lucia Polidori, Gabriella Cattaneo, Helena Schwenk, Marko Grobelnik, Tomás Pariente, Iván Martínez, Todor Ivanov, Arne Berre

1.5.3 List of Co-supervised Master Theses

• Enriching the Machine Learning Workloads of BigBench - Matthias Polag, 2018

• Entwicklung eines parametrisierten Datengenerators als Erweiterung des Yahoo Streaming Benchmarks zur Analyse einer Streaming Data Pipeline - Maximilian Kaprolat, 2018

• Evaluation of and AsterixDB on BigFun Benchmark - Svitlana Rybalko, 2018

• Extending BigBench using Structured Streaming in - Jason Taaffe, 2017

• ABENCH: benchmarking clustering algorithms using a data pipeline with Apache Spark - Ludovica Sacco, 2017

• Operational Decision Management in an Industrie 4.0 Context - Nikolay Mihaylov, 2017

• Analytical Decision Management in an Industrie 4.0 Context - Vladimir Yankov, 2017

• Evaluation of TPC-H on Spark & SparkSQL in ALOJA - Raphael Radowitz, 2017

• Integration und Evaluation von Apache-Drill in Aloja - Sebastian Hamann, 2017

• Implementing real-time stream processing in the BigBench Benchmark - Patrick Bedué, 2017

• SPARQL-Benchmarks automatisiert im Big Data Umfeld ausführen - Max Hofmann and Timo Eichhorn, 2016

• Evaluation of BigBench on Apache Spark Compared to MapReduce - Max- Georg Beer, August 2015

• Vergleich zwischen Datastax Enterprise und Hadoop: Umsetzungsanalyse und Benchmarking - Pavel Safre, February 2015

1.5.4 List of Co-supervised Bachelor Theses

• Implementierung von Streamingdaten in der Musikindustrie mithilfe der InterSystems IRIS Data Platform – Hoang Duc Anh Tran, 2018

• Implementierung von Finanz-Datenströmen mithilfe von InterSystems IRIS Data Platform – Galena Teneva, 2018

• Implementierung von Fitness Tracker Daten in InterSystems IRIS Data Platform – Tuan Minh Do, 2018

• Implementing NBA Player Movement Data using InterSystems IRIS Data Platform – Jasenko Donlagic, 2018

• Evaluation der Benchmarking Plattform Hobbit - Hamid Jalali, 2018

• Interactive Big Data Benchmarking Portal - Mile Kovac, 2017

• Performanzanalyse von MongoDB Storage Engines durchgeführt mit LinkBench - Valeri Penchev, 2017

• Implementierung und Performanzanalyse von TPC-B auf MongoDB - Fadi Kreem, 2017

• Stream Processing im Big Data Umfeld: Fehlertoleranz und Datenintegrität - Michael Czaja, 2016

• Migration von MapReduce Funktionen zu Spark in BigBench Benchmark - Nicklas Velte, November 2015

• Analyse der Performance verschiedener Dateiformate mit Hilfe des BigBench Benchmarks - Thomas Stokowy, December 2015

• A Conceptual Hadoop Cluster Architecture using Docker - David Komijat, October 2015


2 The Heterogeneity Paradigm in Big Data Architectures

This chapter introduces the new concept of heterogeneity as a perspective on the architecture of big data systems targeting both vertical and generic workloads, and discusses how this can be linked with the existing Hadoop ecosystem. The cost factor of a big data solution and its characteristics can influence its architectural patterns and capabilities; to capture this, an extended model based on the 3V paradigm (Extended 3V) is introduced. This is examined on a hierarchical set of four layers (Hardware, Management, Platform and Application). A list of components is provided for each layer, as well as a classification of their role in a big data solution. The chapter is based on the following publications:

• Todor Ivanov, Sead Izberovic, Nikolaos Korfiatis, The Heterogeneity Paradigm in Big Data Architectures, in Managing and Processing Big Data in Cloud Computing, (pp. 218-245), IGI Global Handbook of Research, 2016 and in Artificial Intelligence: Concepts, Methodologies, Tools, and Applications, (pp. 485-511), IGI Global, 2017, [IIK16].

• Todor Ivanov, Nikolaos Korfiatis, Roberto V. Zicari, On the inequality of the 3V’s of Big Data Architectural Paradigms: A case for heterogeneity, Frankfurt Big Data Lab, Working Paper, November 2013, arXiv:1311.0805, [IKZ13].

Keywords: 3Vs, Heterogeneity, Big data platforms, Big data systems architecture

2.1 Introduction

Undoubtedly, the exponential growth of data and its use in supporting business decisions has challenged the processing and storage capabilities of modern information systems, especially in the past decade. The ability to handle and manage large volumes of data has gradually turned into a strategic capability [Chi+13]. Meanwhile, the term "Big Data" [Die12] has rapidly transformed into the new hype, following a path similar to Cloud Computing [Arm+10]. A general challenge for both researchers and practitioners in addressing this issue and meeting tight requirements (e.g. time to process) is what kind of design improvements need to be applied and how the data system in use can "scale". This requirement for system scalability applies both to parallel and to distributed data processing, with major architectural changes and the use of new software technologies like Hadoop [18d] being the current trend.

On the other hand, theoretical definitions of what "Big Data" is and how it can be utilized by organizations and enterprises have been a subject of debate [Jac09]. On that aspect, the 3V framework has gained considerable attention since it was introduced by Laney [Lan01]. In that representation, "Big Data" can be defined by three distinctive characteristics, namely: Volume, Variety and Velocity.

The Volume represents the ever-growing amount of data, which is generated in today's "Internet of Things". On the other hand, the Variety of data produced by the multitude of sources like sensors, smart devices and social media in raw, semi-structured, unstructured and rich media formats further complicates the processing and storage of data. Finally, the Velocity aspect describes how fast the data is retrieved, stored and processed. However, dealing with imprecisely defined data formats, growing data sizes, and requirements with varying processing times represents a new challenge to current systems. From an information processing perspective, the three characteristics together accurately describe what Big Data is. Nonetheless, apart from the 3Vs, which describe the quantitative characteristics of Big Data systems, there are additional qualitative characteristics like Variability and Veracity. The Variability aspect captures the different interpretations that certain data can have when put in different contexts. It focuses on the semantics of the data, instead of its variety in terms of structure or representation. The Veracity aspect defines the data accuracy, or how truthful it is. If the data is corrupted, imprecise or uncertain, this has a direct impact on the quality of the final results. Both variability and veracity have a direct influence on the qualitative value of the processed data. The real value obtained from the data analysis, also called data insights, is another qualitative measure that cannot be defined in a precise and deterministic way. A graphical representation of the extended V-Model is given in Figure 2.1.

While the 3V model, shown in Figure 2.1, provides a simplified framework that is well understood by researchers and practitioners, this representation of the data processes can lead to major architectural pitfalls in the design of Big Data platforms. A particular issue that should be taken into account is the cost factor that derives from the utilization of the 3V model in the context of a business scenario. Since the requirements of business operations are not equal in every vertical market, the influence of the 3Vs on a Big Data implementation process is not the same. Taking this into account, the 3V framework will be used to address particular cases of different requirements and how these can be satisfied on top of an existing infrastructure, considering the cost factors associated with systems operations and maintenance. This chapter introduces an architectural paradigm to address these different requirements for Big Data architectures, referred to as the heterogeneity paradigm. There are different motivations behind the use of heterogeneous platforms, but recently the following have become very relevant: (a) new hardware capabilities – multi-core CPUs, growing size of main memory and storage, different memory and processing accelerator boards such as GPUs, FPGAs or caches; (b) a growing variety of data-intensive workloads sharing the same host platform; (c) complexity of data structures; (d) geographically distributed server locations; and (e) higher requirements in terms of cost, processing and energy efficiency as well as computational speed. Based on the above, a discussion is provided of the current technical solutions using the Hadoop ecosystem.

Fig. 2.1.: Visualization of the Extended V-Model (adopted from (E. G. Caldarola, Sacco, & Terkaj, 2014))

To this end, this chapter is structured as follows. The first section discusses the Big Data characteristics, in particular the inequality of the 3Vs and their influence on the cost factor. A brief overview of Cloud Computing in the context of Big Data is also included. The second section of the chapter focuses on heterogeneity as a feasible architectural paradigm, starting with a brief overview of existing heterogeneous systems and discussing the emerging Big Data platforms. The third section motivates the heterogeneity paradigm in current Big Data architectures and dives deeper into each of the four heterogeneity levels: hardware, system management, platform and application. The chapter concludes with a short summary of perspectives and open issues for future research.

2.2 Background

2.2.1 Big Data characteristics and Cost Factor

Depending on the system architecture, the understanding of the 3Vs can be different, especially in the case of Volume (size) and Velocity (speed). For example, in traditional OLAP (Online Analytical Processing) and OLTP (Online Transactional Processing) systems the growing data sizes and the need for quick results are becoming more important. The Variety (structure) of data is of no concern in such systems, as the data format is known in advance and described very precisely in a pre-defined schema. However, in the case of Big Data the emerging data variety is

starting to transform the system requirements and question the storage efficiency of existing systems [Hua+13]. Therefore, new architectural approaches like NoSQL [Cat10], NewSQL, MapReduce-based [SLF13], Hybrid OLAP-OLTP [Gru+10; KN11], In-memory and Column-based [Pla09] systems are emerging.

Fig. 2.2.: Visualization of the 3V Cube Model

A particular architectural requirement is that a system should be capable of handling increasing data volume, high or dynamically changing velocity and a high variety of data formats. The exact impact of each of the 3Vs can vary depending on the industry-specific requirements. Therefore, the underlying Big Data infrastructure should be able to deal with any combination of the 3Vs. Furthermore, in this context the Cost-Factor is not a new dimension of the framework but a function that combines the other three, so that a change in any component will affect its value:

Cost Factor = f_c(Volume, Variety, Velocity, ...)

The 3Vs combined over the Cost-Factor can be expressed in a 3V Cube Model. A visualization of this model is given in Figure 2.2. The Cost-Factor is a very important metric for every Cloud and Big Data service provider [Gre+09]. It determines how efficient a platform is in terms of price per computation unit and how effective it is in terms of data storage. This metric includes the costs for hardware, maintenance, administration, electricity, cooling and space. In other words, all essential elements for building a Big Data infrastructure are included in the definition of the Cost-Factor characteristics.

For example, the Volume parameter should include the costs for the storage hardware, the electricity used to power this hardware and the costs for hardware maintenance. The Variety parameter should include the costs for the hardware and electricity needed to transform the data into a format that will be used in a specific context. The Velocity parameter can be seen as a multiplication factor: the faster the data has to be processed, the more the system has to be scaled horizontally.

Clearly, the Cost-Factor characteristics are the most complex, as they combine the 3Vs with the additional factors that they influence. Because of this complexity, Big Data platforms are difficult to benchmark and compare in terms of performance and price. The development of a Big Data benchmark is in progress and should be driven by multiple industry cases, as described by Baru et al. [Bar+13b; Bar+13a; Bar+14]. As of 2015, there is an urgent need for standardized test workloads against which all software vendors can test their software.
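
To make the Cost-Factor function above more concrete, the following toy sketch computes a monthly cost estimate in which each V contributes a cost term and Velocity acts as the multiplication factor described above. All coefficients and names are illustrative assumptions, not values from this thesis.

```python
# Toy cost model for the Cost Factor = f_c(Volume, Variety, Velocity, ...) relation.
# All coefficients are made-up illustrations, not measured values.

def cost_factor(volume_tb, variety_formats, velocity_factor,
                storage_cost_per_tb=20.0,        # assumed monthly storage cost per TB
                transform_cost_per_format=50.0,  # assumed transformation cost per format
                base_compute_cost=500.0):        # assumed baseline compute cost
    """Return a rough monthly platform cost estimate."""
    storage = volume_tb * storage_cost_per_tb                     # Volume term
    transformation = variety_formats * transform_cost_per_format  # Variety term
    compute = base_compute_cost * velocity_factor                 # Velocity as a multiplier
    return storage + transformation + compute

# Example: 100 TB of data, 5 data formats, processing twice as fast as the baseline.
print(cost_factor(volume_tb=100, variety_formats=5, velocity_factor=2.0))
```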

2.2.2 Cloud Computing and Big Data

Cloud Computing has emerged as a major paradigm in the last decade and was quickly adopted by both industry organizations and end users. On the one hand, it offers economic advantages such as a flexible billing model (pay-per-use) as well as potential reductions in infrastructure, administration and license costs. On the other hand, it solves multiple technical challenges, for example by offering optimal resource utilization and management, as well as custom, automated setup and configuration of complex enterprise platforms. A generally accepted definition of Cloud Computing is provided by the National Institute of Standards and Technology (NIST) [MG+11] as follows:

Cloud Computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. (p. 2) [MG+11]

In this context, cloud providers offer a multitude of services, starting from the hardware layer and moving up to the application layer. The most widely offered services are Infrastructure as a Service (IaaS), Platform as a Service (PaaS) and Software as a Service (SaaS). The variety of offered cloud services is as wide as their architectural approaches [RCL09]. In a related work [IPB12], the authors surveyed and showed that it is feasible to run data-intensive applications on top of a virtualized cloud environment. Cloud services offering relational database storage are called Database as a Service (DaaS), with Amazon RDS, Google CloudSQL and SQL Azure being some of the major products.

A recent survey [Has+15] looks at the relation between Big Data and Cloud Computing. The two are closely conjoined, with Cloud Computing providing facilities and services for Big Data applications. By leveraging virtualization and using Big Data platforms such as Hadoop for parallel, fault-tolerant data processing, cloud providers are able to address the challenges of the 3Vs characteristics. Intel [Big15a] defines this type of services as Analytics as a Service (AaaS). The report describes

the AaaS type as a mix of services based on IaaS, PaaS and SaaS. Similarly, all major cloud providers started offering Big Data services as summarized in Table 2.1.

Another driver of Big Data services in the cloud [WLS14] is the growing number of sensors and mobile devices, also known as the Internet of Things (IoT), which are unable to process the data locally and have to offload it to a more resourceful environment.

Tab. 2.1.: Big Data Cloud Providers and Services

Google Cloud Platform [Goo15]
• Compute Engine – Offers virtual machines with customizable resources for large-scale workloads hosted on top of Google's infrastructure.
• App Engine – Google's fully-managed Platform-as-a-Service (PaaS) for running customer applications.
• Cloud Datastore – Offers automatically scalable storage for non-relational (schemaless) data with support for transactions and SQL-like queries.
• CloudSQL – Offers relational MySQL database storage, which handles replication, patch management and database management to ensure availability and performance.
• BigQuery – Offers capabilities for analyzing multi-terabyte datasets in real-time by running SQL-like queries on the data.

Amazon [Ana15]
• Amazon Simple Storage Service (Amazon S3) – S3 provides secure, durable, highly-scalable object storage.
• Amazon Elastic Block Store (EBS) – EBS offers consistent, low-latency storage for virtual machines, which can host different big data workloads.
• Amazon DynamoDB – DynamoDB is a fast and flexible NoSQL (key-value & document) database service for consistent, large-scale applications.
• Amazon Redshift – Redshift is a fast, fully managed, petabyte-scale data warehouse for cost-effective and efficient analysis of data.
• Amazon Elastic MapReduce (EMR) – EMR provides an easy-to-use managed service for creating, managing and running clusters on top of highly scalable and secure infrastructure using Amazon EC2.
• Amazon Glacier – Glacier offers cost-effective archival storage for long periods (years or decades).

Pivotal [Fou15]
• Pivotal Cloud Foundry – Offers relational database (MySQL), Hadoop cluster (Pivotal HD), key-value cache/store (Redis), object store (RiakCS) and NoSQL database (MongoDB) services.

Rackspace [Pla15]
• Cloud Big Data Platform – Provides a Hadoop cluster (Hortonworks Data Platform 2.1) including tools like Hive and Pig.

GoGrid [GoG15]
• 1-Button Deploy Solutions – Offers multiple storage solutions like DataStax Enterprise, Cloudera Enterprise clusters, FoundationDB, Hadoop, HBase, MemSQL, MongoDB and Riak.

Microsoft [Had15]
• Hadoop in Azure – HDInsight offers Apache Hadoop cluster services in the cloud.

Infochimps [Big15b]
• Infochimps Cloud – Offers multiple services: 1) Cloud:Streams for streaming data and real-time analytics; 2) Cloud:Queries for NoSQL databases and ad hoc, query-based analytics; and 3) Cloud:Hadoop for elastic Hadoop clusters and batch analytics.

Red Hat [Ser15]
• OpenShift – OpenShift offers Platform-as-a-Service (PaaS) services for developing, hosting and scaling applications.

2.3 Challenges in Big Data Architectures

While cloud architectures provide a solid framework for addressing Big Data challenges, a general issue arises in understanding the problem. There are multiple reasons motivating the need for new platform architectures, but recently the following have become very relevant:

• Many new hardware capabilities – multi-core CPUs, growing size of main memory and storage and different memory and processing accelerator boards such as GPUs, FPGAs and caches;

• Growing variety of data-intensive workloads sharing the same host platform;

• Complexity of data structures;

• Geographically distributed server locations and

• Higher requirements in terms of cost, processing and energy efficiency as well as computational speed.

These challenges are gradually becoming relevant for both private and public cloud platforms of any size. Therefore, in this chapter the heterogeneity paradigm is introduced as a feasible technique to better understand and tackle the complexity and challenges of the emerging Big Data architectures. The following subsections provide an introduction to heterogeneous systems and how these are reflected in the emerging Big Data ecosystem.

2.3.1 Heterogeneous Systems

Heterogeneous systems have been the topic of multiple research studies trying to classify them according to their properties and the workloads for which they are best suited. However, due to the rapidly changing hardware and software system architectures, the concepts of heterogeneity in the platforms have also evolved over time.

For example, a survey by Khokhar et al. [Kho+93] defines Heterogeneous Computing (HC) as:

A well-orchestrated, coordinated effective use of a suite of diverse high-performance machines (including parallel machines) to provide fast processing for computationally demanding tasks that have diverse computing needs.(p. 19) [Kho+93]

In addition, the authors discuss multiple issues and problems stemming from system heterogeneity, among which are three very general ones:

• “the types of machines available and their inherent computing characteristics”;

• “alternate solutions to various sub-problems of the applications” and

• “the cost of performing the communication over the network”.

In another survey on Heterogeneous Computing, Ekmecic et al. [ETM96] discuss heterogeneous workloads as a major factor behind the need for heterogeneous platforms and divide heterogeneous computing into three essential phases:

1. parallelism detection,

2. parallelism characterization and

3. resource allocation.

In the parallelism detection phase, every task in a heterogeneous application is checked for whether parallelization is possible. The computation parameters of the tasks are

estimated in the parallelism characterization phase. The time and place of execution of the tasks are determined in the resource allocation phase. Basically, the three phases describe in a more abstract way today's concept of cloud computing.

In a similar study, Venugopal et al. [VBR06] present a taxonomy of Data Grids and highlight heterogeneity as an essential characteristic of data grid environments and applications. Furthermore, they briefly mention that heterogeneity can be split into multiple levels, such as hardware, system, protocol and representation heterogeneity, which closely resemble the presented Data Grid layered architecture.

The characteristics of today's Big Data platforms, as well as the challenges and problems that they present, are very similar to the ones discussed for Heterogeneous Computing and Data Grid environments. Therefore, it is a logical step to look in more detail at the concept of system heterogeneity and investigate how it is coupled with the Big Data characteristics. Lee et al. [LK11] discuss the importance of heterogeneity in cloud environments by suggesting a new architecture that improves performance and cost-effectiveness. They propose an architecture consisting of 1) long-living core nodes that host both data and computation, and 2) accelerator nodes that are added to the cluster temporarily when additional power is needed. The resource allocation strategy then dynamically adjusts the size of each pool of nodes to reduce cost and improve utilization. Additionally, they present a scheduling scheme, based on job progress as a shared metric, which provides resource fairness and improved performance.

In a different study, Mars et al. [MTH11] investigated micro-architectural heterogeneity in warehouse-scale computer (WSC) platforms. The authors present a new metric, called opportunity factor, which approximates an application's potential performance improvement relative to all other applications, given the particular mix of applications and machine types on which it is running. They also introduce opportunistic mapping, which solves the optimization problem of finding the optimal resource mapping for heterogeneity-sensitive applications. Using this technique, the performance of a real production cluster was reported to improve by 15%, with a potential improvement of up to 70%.

The number of studies related to heterogeneity is growing along with its conceptual relevance and the challenging problems that it brings. The following section investigates the emerging Big Data platforms that are being developed to address exactly these challenges.

2.3.2 Emerging Big Data Systems

Monash [One13] addresses the problem that there is no single data store that can be efficient for all usage patterns. This issue was previously discussed by Stonebraker et al. [Sto+07], who proposed a taxonomy of database technologies. Interestingly enough, one of the described platforms has a MapReduce-style architecture [DG08] and looks very similar to Apache Hadoop. However, the message here is that one general-purpose system that can handle all types of workloads is an illusion and not realistic. The current systems cannot cope with the dynamic changes in application requirements and the 3Vs characteristics, which opens the opportunity

for new kinds of storage systems such as NoSQL, NewSQL, MapReduce-based, Hybrid OLAP-OLTP, In-memory and Column-based systems. Most of the approaches in these new systems are inspired by the inefficiency and complexity of the current storage systems. In addition to that, the advancements in hardware (multi-core processors, faster memory and flash storage devices) brought the prices of enterprise hardware down to the level of commodity machines.

Cattell [Cat10] identified six key features of the NoSQL data stores namely: “(1) the ability to horizontally scale “simple operation” throughput over many servers; (2) the ability to replicate and to distribute (partition) data over many servers; (3) a simple call level interface or protocol (in contrast to a SQL binding); (4) a weaker concurrency model than the ACID transactions of most relational (SQL) database systems; (5) efficient use of distributed indexes and RAM for data storage; and (6) the ability to dynamically add new attributes to data records”.

Similarly, Strauch et al. [SSK11] summarize the motivations behind the emergence of the NoSQL data stores, among which are the avoidance of unneeded complexity and expensive object-relational mapping, higher data throughput, ease of horizontal scalability (without relying on the availability of high-end hardware) and new functionalities that are more suitable for cloud environments than those of relational databases. Additionally, they present an extensive classification and comparison of the NoSQL databases by looking into their internal architectural differences and functional capabilities.

Industry perspectives, such as the one advocated by Fan [CC12], view emerging systems as a transition from traditional relational database systems working with CRUD (Create, Read, Update, Delete) data to systems working with CRAP (Create, Replicate, Append, Process) data. His major argument is that CRUD (structured) data is very different from CRAP (unstructured) data because of the new Big Data characteristics. The new semi-structured and unstructured data is stored and processed in near real-time and not really updated; the incoming data streams are appended. Therefore, CRAP data has very different characteristics and is not well suited to being stored in relational database systems.

Marz [Bi12] discusses the problem of mutability in existing database architectures, which is caused by the Update and Delete operations. They allow human interaction in the system, which changes the data consistency and leads to undesired data corruption and data loss. To avoid this, Marz suggests a new Big Data architecture, called the Lambda Architecture [MW15], whose major principles are human fault-tolerance, data immutability and recomputation. By removing the U and D from CRUD and adding append functionality, similar to Fan [CC12], data immutability is ensured. The raw data is aggregated as it arrives and sorted by timestamp, which greatly restricts the possibility of errors and data loss caused by human faults. The recomputation, or data processing, is done simply by applying a function (query) over the raw data. In addition, the architecture supports both batch and real-time data processing.
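
A minimal sketch of these principles, assuming nothing beyond the description above (the names and data are illustrative, not taken from the Lambda Architecture literature), might look as follows:

```python
# Append-only master dataset and recomputation as a pure function over raw data.
import time

master_dataset = []  # immutable, append-only store of raw events

def append_event(event):
    """Ingest: events are only appended, never updated or deleted."""
    master_dataset.append({"ts": time.time(), **event})

def batch_view(query_fn):
    """Recompute a view by applying a function (query) over all raw data."""
    return query_fn(master_dataset)

append_event({"user": "alice", "clicks": 3})
append_event({"user": "alice", "clicks": 2})
total_clicks = batch_view(lambda events: sum(e["clicks"] for e in events))
print(total_clicks)  # 5 -- derived on demand, never stored as mutable state
```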

From an ecosystem perspective, Hadoop-style systems - inspired by Google's MapReduce paper [DG08] - have been growing in adoption thanks to their scalability, fault-tolerance and distributed parallel processing capabilities. The fact that such

systems can be built on commodity hardware, together with their licensing model, provides an important advantage over commercial vendors. In a similar spirit of innovation, most of the new infrastructure architectures try to solve only a predefined set of problems, bound to specific use-case scenarios, and ignore other general system requirements. Therefore, a typical design approach is to combine two or more system features and build a new hybrid architecture which improves the performance for the targeted use case, but adds additional complexity. HadoopDB [Abo+09] is one such hybrid system, trying to combine the best features of the MapReduce-based systems and the traditional analytical DBMS by integrating PostgreSQL as the database layer, Hadoop as the distributed communication layer and Hive as a translation layer. Other systems iteratively improve an existing platform, like HaLoop [Bu+10; Bu+12] and Hadoop++ [Dit+10], which further improve Hadoop's scheduling and caching mechanisms as well as its indexing and join processing. Starfish [Her+11] extends Hadoop by enabling it to automatically adapt and self-tune depending on the user workload and in this way provide better performance. A comprehensive survey by Sakr et al. [SLF13] on the family of MapReduce frameworks provides an overview of approaches and mechanisms for large-scale data processing.

In a recent work, Qin et al. [Qin+13] identify the MapReduce computing model as a de-facto standard which addresses the challenges stemming from the 3Vs characteristics. Furthermore, the authors divide the enterprise Big Data platforms into three categories: (1) Co-Exist solutions; (2) SQL with MapReduce Support solutions; and (3) MapReduce with SQL Support solutions. In the first category they put the IBM Big Data Platform and Oracle Big Plan, as both offer end-to-end solutions consisting of several data management and processing components. In the second category fall systems integrating Hadoop support, like PolyBase [DeW+13], EMC Greenplum and Teradata Aster Data. In the last category fall Hadoop systems that integrate SQL support using Drill, Hive, Hortonworks Stinger, Cloudera Impala and similar. Having made the case for heterogeneity, the chapter proceeds by describing the different layers of heterogeneity in Big Data systems in the sections that follow.

2.4 Heterogeneity in Big Data Systems

The growing number of new Big Data technologies outlined in the previous section, together with their complexity and specific functionality, makes it difficult to clearly classify and categorize them. Multiple studies, summarized in Table 2.2, have investigated and developed different classifications, categorizations and taxonomies in order to make the Big Data field more understandable. The majority of the studies focus on a particular feature or technical functionality and do not depict the system complexity and variety in a general architectural overview. One of the main reasons for this is that the new systems consist of multiple components, each with specific functionality, which makes such a representation hard and unintuitive to illustrate. At the same time, having such a Big Data architectural overview will help to better understand the different layers and the interconnections between their components. We call this concept the heterogeneity paradigm. In its essence, the idea is to help system architects and developers to better understand

the various challenges caused by the new Big Data characteristics and the inability to define a unified architecture for all prominent use cases. This work focuses on the heterogeneity paradigm in the Hadoop ecosystem.

Tab. 2.2.: Classifications of Big Data Systems

Toward Scalable Systems for Big Data Analytics: A Technology Tutorial [Hu+14] – The authors first present the history and definitions of Big Data, then continue with a map of the Big Data technologies, dividing them into four phases: Generation, Acquisition, Storage and Analytics. They also suggest a Big Data layered architecture consisting of infrastructure, computing and application layers.

The rise of "big data" on cloud computing: Review and open research issues [Has+15] – The authors motivate the classification of the Big Data technologies with the large-scale data in the cloud. They identify five aspects: i) data source, ii) content format, iii) data stores, iv) data staging, and v) data processing. Additionally, they present case studies and discuss Big Data research challenges and open issues.

Deciphering Big Data Stacks: An Overview of Big Data Tools [LSA14] – The authors present a Big Data Application and Libraries classification consisting of six functional categories. Additionally, they provide an abstract Big Data analysis stack consisting of four layers: i) Cloud resources, ii) Processing engines, iii) Applications and iv) Data analysis.

Survey on Large-Scale Data Management Systems for Big Data Applications [WYY15] – The authors present a comprehensive taxonomy based on various aspects of large-scale data management systems, covering the data model, the system architecture and the consistency model.

Towards HPC-ABDS: An Initial High-Performance Big Data Stack [Qiu+14] – The authors give an extensive overview of the Apache Big Data Stack and propose an integration with a High Performance Big Data Stack. The current version of the stack can be found under http://hpc-abds.org/kaleidoscope.

The term heterogeneity is often used in the context of Big Data to represent the variety of data sources and data formats [CPC15; Has+15; Jag+14]. To cope with this challenge, the authors of the BigDawg architecture [Dug+15] propose a multi-storage reference implementation consisting of streaming, array and relational stores. In the same line of thought, the heterogeneity paradigm that we introduce does not only exist on the storage layer, but on each layer of a Big Data system. For example, data, stream or graph processing technologies can be used depending on the use case. They are core components of a Big Data platform and represent the functional heterogeneity in this layer. Based on the concept of heterogeneity, an abstract view of a Big Data architecture was defined and is presented in Table 2.3. The architecture consists of four layers, which are also called levels: hardware, management, platform

and application. This division into levels is not strict, but represents the major features and functionality of the components in a Big Data platform.

Tab. 2.3.: Abstract Big Data Architecture

Application level: Data, Stream & Graph Analytics; Content Analysis; Machine Learning; Procedural Language; Application Framework; Search Engine; SQL-on-Hadoop; Data Modeling; Data Acquisition; Library Collection

Platform level: Data Collection; Data Governance; Data Serialization; Machine Learning Framework; Data Layout; Workflow Scheduling; In-Memory Storage; Execution Framework; Data & Graph Storage; Data, Stream & Graph Processing

Management level: System Interfaces; Cloud Application Deployment; Application Management; Distributed Coordination; Messaging Management; Cluster Monitoring & Management; Virtualization-based & Container-based Resource Management

Hardware level: Memory Type & Size; CPU Type & Number of Cores; Storage Type & Size; Accelerator Modules

The hardware layer represents the server components of the system and the fact that they can vary in storage, memory and processor type and size. The management layer deals with the system resource management and offers services to the applications running on the upper layers. The platform layer represents the main storage and processing services that a Big Data platform provides. Finally, the application layer hosts the variety of Big Data applications running on top of the services provided by the lower layers. The rest of this section looks deeper into each level by listing the respective technology components and their categories.

2.4.1 Hardware Level

Undoubtedly, the processing and storage capabilities of current commodity (off-the-shelf) servers have drastically improved in recent years, while at the same time becoming cheaper [PC06]. This reduces the overall cost of large-scale clusters consisting of thousands of machines and enables vendors to cope with the exponentially growing data volumes, as well as the velocity with which the data

should be processed. However, other components like FPGAs, GPUs, accelerator modules and co-processors have also become part of enterprise-ready servers. They offer numerous new capabilities which can further boost the overall system performance, such as:

• optimal processing of calculation-intensive applications;

• offloading of partial or entire CPU computations to them;

• faster and energy efficient parallel processing capabilities; and

• improved price to processing ratio compared to standard CPUs.

Recently, there have been multiple studies investigating how these emerging components can be successfully integrated into Big Data platforms. In [Sha+10], the authors present a MapReduce framework (FPMR) implemented on an FPGA that achieves a 31.8x speedup compared to a CPU-based software system. [KC14] investigate the performance improvements of using Solid State Drives (SSDs) as an alternative to hard-disk drives and conclude that SSDs can achieve up to 70% higher performance at a 2.5x higher cost-per-performance. Similarly, [Kan+13] show that sorting in Hadoop with SSDs can be more than 3 times faster and drastically reduce the power consumption compared to hard disks.

Diversifying the core platform components motivates the investigation of the concept of heterogeneity on a hardware level and the new challenges that it introduces. Using the right hardware modules for a particular application can be crucial for obtaining the best price-performance ratio.

2.4.2 Management Level

As seen in Table 2.3 the management layer is positioned directly above the hardware level. It is responsible for the management and optimal allocation and usage of the underlying hardware components. There are multiple ways to achieve this:

• directly installing an operating system,

• using a container technology (container-based virtualization),

• using a virtualization technology (hypervisor-based virtualization) and

• utilizing a hybrid solution between OS and virtualization.

Tab. 2.4.: Management Level Components

Virtualization-based Resource Management:
• Serengeti / Big Data Extensions – An open-source project, initiated by VMware, to enable the rapid deployment of Hadoop (HDFS, MapReduce, Pig, Hive, and HBase) on a virtual platform (vSphere). [18y; 14e; 18aj]
• Sahara / Savanna – Aims to provide users with simple means to provision a Hadoop cluster on OpenStack by specifying several parameters like Hadoop version, cluster topology, node hardware details and a few more. [Sah14]

Cluster Resource Management:
• Mesos – A cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks like Hadoop, MPI, Hypertable, Spark, and other applications. [Hin+11]
• YARN – YARN (Yet Another Resource Negotiator / MapReduce 2.0) is a framework for job scheduling and cluster resource management. [Vav+13]

Container-based Resource Management:
• Docker – An open platform for developers and sysadmins to build, ship, and run distributed applications, consisting of Docker Engine, a portable, lightweight runtime and packaging tool, and Docker Hub, a cloud service for sharing applications and automating workflows. [Doc19]
• LXC / Linux Containers – LXC provides operating-system-level virtualization through a virtual environment that has its own process and network space, instead of creating a full-fledged virtual machine. LXC relies on the Linux kernel cgroups functionality that was released in version 2.6.24.
• CoreOS – An open source lightweight operating system based on the Linux kernel and designed for providing infrastructure to clustered deployments, while focusing on automation, ease of application deployment, security, reliability and scalability.

Cluster Monitoring & Management:
• Ambari – Provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs for provisioning, managing, and monitoring Apache Hadoop clusters.
• Helix – A generic cluster management framework used for the automatic management of partitioned, replicated and distributed resources hosted on a cluster of nodes. Helix automates reassignment of resources in the face of node failure and recovery, cluster expansion, and reconfiguration.

Application Management:
• Cloudera Manager – An application management tool for the Cloudera Hadoop Distribution. It automates the administration, installation, configuration and deployment of cluster applications, as well as offering monitoring and diagnostic capabilities.

Cloud Application Deployment:
• Whirr – A set of libraries for running cloud services. It provides a cloud-neutral way to run services and a common service API, and can be used as a command line tool for deploying clusters.
• JClouds – An open source multi-cloud toolkit for the Java platform. It provides functionality to create and control portable applications across clouds using their cloud-specific features.

Distributed Coordination:
• ZooKeeper – A centralized service that enables highly reliable distributed coordination by maintaining configuration information, naming, providing distributed synchronization, and group services. [Hun+10; JR09]

Messaging Management:
• Kafka – A distributed messaging system for collecting and delivering high volumes of log data with low latency. [KNR+11]

System Interfaces:
• Hue – A Web interface for analyzing data with Apache Hadoop. It supports a file and job browser, Hive, Pig, Impala, Spark and Oozie editors, Solr Search dashboards, HBase, Sqoop2, and more.

In recent years, virtualization has become the standard technology for infrastructure management, both for bigger cloud and datacenter providers as well as for smaller private companies [Sta+08]. However, along with the multiple benefits that virtualization brings, there are also new challenges. The co-location of virtual machines hosting different application workloads on the same server makes effective and fair resource allocation problematic. Also, the logical division of virtual machines with similar characteristics is not always possible. In the case of Big Data platforms with changing workloads, it is difficult to meet network and storage I/O guarantees. Therefore, container-based virtualization, which comes at a much smaller overhead as it is directly supported by the operating system, has become a very popular alternative. Virtualization technologies provide better resource sharing and isolation in exchange for a higher overhead, whereas container-based systems achieve near-native performance but offer poorer security and isolation [XNR14].

The Serengeti project [18y; 14e] is one of the first initiatives to automate the management, starting, stopping and pre-configuring of Hadoop clusters on the fly. It is an open source project started by VMware and now integrated into vSphere as the Big Data Extensions, with the goal of easing the management of virtualized Hadoop clusters. By implementing hooks into all major Hadoop modules, it is possible to know the exact cluster topology and make Hadoop aware of the hypervisor layer. This open source module is called Hadoop Virtual Extension (HVE) [14e]. Particularly interesting is the new ability to define nodes (virtual machines) as either compute-only or data-only nodes, which implies that some nodes store the data in HDFS, while others are responsible for the computation of MapReduce jobs. Another very similar project, called Sahara [18v], was developed as part of the OpenStack platform.

At the same time, there is a variety of other technologies which help improve the management of a Big Data environment, such as monitoring, deployment, coordination, messaging and resource scheduling tools. An extensive list of such tools, together with short descriptions, is provided in Table 2.4.

2.4.3 Platform Level

The platform layer represents the actual Big Data platform, which is responsible for providing general data storage and processing capabilities. In the last few years, Apache Hadoop has become the de facto platform for Big Data. It has two core components: HDFS and YARN (MapReduce 2.0). HDFS is responsible for the data storage, whereas YARN handles the processing and the resource allocation between jobs. More recently, Yahoo released the Storm-YARN [Sto13] application, which combines the advantages of both systems: real-time (low-latency) and batch processing. It enables Storm applications to utilize the Hadoop resources managed by YARN, offering new abilities for faster and more optimal data processing. The Spark platform, developed by Zaharia et al. [Zah+10; Zah+12a], is built on top of HDFS and introduces the concept of Resilient Distributed Datasets (RDDs). RDDs are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of MapReduce-like parallel operations, which is well suited for iterative machine learning algorithms and interactive data analytics.
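
The two RDD features named above, explicit in-memory persistence and user-controlled partitioning, look roughly as follows in PySpark (a hedged sketch assuming a local Spark installation; the data and names are illustrative):

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[*]", "rdd-sketch")
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

# Control data placement by hash-partitioning on the key, then keep the
# intermediate result in memory so iterative jobs can reuse it.
grouped = pairs.partitionBy(4).persist(StorageLevel.MEMORY_ONLY)

print(grouped.reduceByKey(lambda x, y: x + y).collect())
print(grouped.count())  # reuses the cached partitions instead of recomputing
sc.stop()
```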

The above are just a few examples of the existing platforms for data storage and processing. The question "How to choose the right framework for a specific use case?" is very important, but one needs sufficient background knowledge in order to answer it, as pointed out by Grover [Pro13]. In his post, he discusses and categorizes the different frameworks which can be run on top of HDFS. This aligns with the chapter's goal of providing an overview of the variety of frameworks in the platform layer. Table 2.5 provides a list of components, grouped by their functionality types. In the upper part are the storage components (Data, Graph and In-memory storage), followed by multiple processing frameworks (Data, Stream and Graph processing) and data tools. In addition, there are execution and machine learning frameworks as well as tools for workflow management.

The list of new frameworks and tools is constantly growing, as are the new application requirements of the upper layer. Therefore, understanding the heterogeneity on the platform level is essential for the successful management and processing of large datasets.

Tab. 2.5.: Platform Level Components

Data Storage:
• HDFS – Apache HDFS (Hadoop Distributed File System) is a distributed file system that provides high-throughput access to application data. [Bor+08]
• HBase – Apache HBase is the Hadoop database, a distributed, scalable, big data store. It is used for random, realtime read/write access to Big Data and is modeled after Google's Bigtable. [Cha+08; Geo11]
• Accumulo – A sorted, distributed key/value store: a robust, scalable, high performance data storage and retrieval system. It is based on Google's Bigtable design and is built on top of Apache Hadoop, ZooKeeper, and Thrift.
• Hypertable – An open source, scalable, distributed key/value store based on Google's Bigtable design, running on top of Hadoop.
• Cassandra – Apache Cassandra's data model offers the convenience of column indexes with the performance of log-structured updates, strong support for denormalization and materialized views, and powerful built-in caching.
• Phoenix – A high performance relational database layer over HBase for low-latency applications.

Graph Storage:
• Titan – A scalable graph database optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster. Titan is a transactional database that can support thousands of concurrent users executing complex graph traversals in real time.

In-Memory Storage:
• Tachyon – An open source, memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks, such as Spark and MapReduce. [Li+13a; Li+14]

Data Governance:
• Cloudera Navigator – Offers comprehensive auditing across a Hadoop cluster by defining and automatically collecting data lifecycle activities such as retention and encryption policies.
• Falcon – Apache Falcon is a data processing and management solution for Hadoop designed for data motion, coordination of data pipelines, lifecycle management, and data discovery. Falcon enables end consumers to quickly onboard their data and its associated processing and management tasks on Hadoop clusters.

Data Collection:
• Chukwa – Apache Chukwa is a data collection system for managing large distributed systems. It also includes a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data. [Bou+08; RK10]

Data Serialization:
• Avro – A data serialization system. It provides: 1) rich data structures; 2) a compact, fast, binary data format; 3) a container file to store persistent data; 4) remote procedure calls (RPC); and 5) simple integration with dynamic languages.

Data Layout:
• Parquet – A columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.

Data Processing:
• MapReduce – A YARN-based system for parallel processing of large data sets. [DG08]
• Spark – Apache Spark is an open source cluster computing system that aims to run programs faster by providing primitives for in-memory cluster computing. Jobs can load data into memory and query it repeatedly much more quickly than with disk-based systems like Hadoop MapReduce. [Zah+10; Zah+12a]

Stream Processing:
• Storm – An open source distributed real-time computation system. Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing. [LES12]
• Storm-YARN – Enables Storm applications to utilize the computational resources in a Hadoop-YARN cluster along with accessing Hadoop storage resources such as HBase and HDFS. [Sto13]
• Samza – A distributed stream processing framework. It uses Kafka for messaging, and Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management.
• S4 – Apache S4 is a general-purpose, distributed, scalable, fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data.
• Spark Streaming – Makes it easy to build scalable fault-tolerant streaming applications by using Spark's language-integrated API, which supports Java, Scala and Python. [Zah+13]

Workflow Scheduling:
• Oozie – A workflow scheduler system to manage Apache Hadoop jobs. [Isl+12a]

Execution Framework:
• REEF – The REEF (Retainable Evaluator Execution Framework) framework builds on top of YARN to provide crucial features (retainability, composability, cost modeling, fault handling and elasticity) to a range of different applications. [Chu+13]

Graph Processing:
• Giraph – An iterative graph processing system built for high scalability. It originated as the open-source counterpart to Pregel [Mal+10], the graph processing architecture developed at Google.
• GraphX – Apache Spark's API for graphs and graph-parallel computation. It unifies ETL, exploratory analysis, and iterative graph computation within a single system. [Gon+14]
• Dato / GraphLab – GraphLab is an open source, graph-based, high performance, distributed computation framework written in C++. [Low+12]
• Pegasus – PEGASUS is a peta-scale graph mining system, fully written in Java. It runs in a parallel, distributed manner on top of Hadoop. [KTF09]

Machine Learning Framework:
• Oryx – The Oryx open source project provides simple, real-time large-scale machine learning / predictive analytics infrastructure. It implements a few classes of algorithms commonly used in business applications: collaborative filtering / recommendation, classification / regression, and clustering. [215]
• MLbase – A platform for implementing and consuming machine learning techniques at scale, consisting of three components: MLlib, MLI and ML Optimizer. MLlib is Spark's scalable machine learning library consisting of common learning algorithms and utilities. [Tal+12; Kra+13]
• H2O – An open source platform offering machine learning algorithms for classification and regression over Big Data. It is extensible and users can build blocks using simple math legos in the core. H2O keeps familiar interfaces like R, Excel & JSON. [H2O15]

2.4.4 Application Level

Satisfying all the Big Data application characteristics requires the platform to support all types of components, from data retrieval, aggregation and processing up to data mining and analytics. Moreover, applications with very different characteristics should be able to run effectively co-located on the same platform, which should further guarantee optimal resource and functionality management, fair scheduling and workload isolation. These requirements outline the importance of understanding the heterogeneity on the application level. To achieve this, the variety of existing technologies and their features should be thoroughly investigated and understood. Table 2.6 summarizes a major part of the tools in the Hadoop ecosystem, grouping them according to their functionality type.

The first category, defined as data acquisition, contains tools used to move and store data into Hadoop; Sqoop [TC13] and Flume are the most widely used tools in this category. The second category, called SQL-on-Hadoop, represents the variety of Data Warehousing, Business Intelligence, ETL (Extract-Transform-Load) [Had13] and reporting capabilities offered by applications on top of Hadoop. Hive [Thu+09; Thu+10] is the most popular application in this category. It is a data warehouse infrastructure on top of Hadoop that provides data summarization and ad-hoc querying in an SQL-like language called HiveQL.
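
As an illustration of such ad-hoc querying, the snippet below submits a HiveQL query from Python. The PyHive client, the host name and the table are assumptions made for the example; the thesis itself does not prescribe a particular client.

```python
# Hedged sketch: issue an ad-hoc HiveQL aggregation through the (assumed) PyHive client.
from pyhive import hive

conn = hive.Connection(host="hive-server", port=10000, database="default")
cursor = conn.cursor()
# HiveQL looks like SQL; Hive translates it into jobs on the underlying engine.
cursor.execute("SELECT category, COUNT(*) AS cnt FROM sales GROUP BY category")
for category, cnt in cursor.fetchall():
    print(category, cnt)
```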

Another important category is the application frameworks, which offer ready-to-use packages, libraries and tools for building custom Big Data applications. The search engine category lists components for enabling full-text search capabilities on top of Hadoop.

The last categories include different analytics types (Data, Graph and Stream analytics), machine learning and content analysis components, implementing specific use-case functionalities.

Tab. 2.6.: Application Level Components

Data Acquisition:
• Sqoop – A tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases. [TC13]
• Flume – A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.

SQL-on-Hadoop:
• Hive – A data warehouse infrastructure that provides data summarization and ad hoc querying. [Thu+09; Thu+10]
• HCatalog – A set of interfaces that open up access to Hive's metastore for tools inside and outside of the Hadoop grid. It is now part of Hive. [CRW12]
• Impala – An open source Massively Parallel Processing (MPP) query engine that runs natively on Hadoop, enabling users to issue low-latency SQL queries to data stored in HDFS and HBase without requiring data movement or transformation. [Kor+15]
• Big SQL (IBM) – A massively parallel processing (MPP) SQL engine that deploys directly on the physical Hadoop Distributed File System (HDFS) cluster. This SQL engine pushes processing down to the same nodes that hold the data.
• SparkSQL (Shark) – A fully Hive-compatible data warehousing system on top of Spark that can run 100x faster than Hive. [Eng+12; Xin+13]
• Drill – An open-source software framework (inspired by Google's Dremel) that supports data-intensive distributed applications for interactive analysis of large-scale datasets. [HN13]
• Tajo – A relational and distributed data warehouse system for Hadoop that is designed for low-latency and scalable ad-hoc queries, online aggregation and ETL on large data sets by leveraging advanced database techniques. [Cho+13a]
• Presto (Facebook) – An open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes. [Pre]
• HAWQ [HAW15; 18f] – A parallel SQL query engine that combines the Pivotal Analytic Database with the scalability and convenience of Hadoop. HAWQ reads data from and writes data to HDFS natively. It delivers performance and linear scalability and provides tools for interacting with petabyte-range data sets. HAWQ provides users with a complete, standards-compliant SQL interface.
• MRQL – Apache MRQL (pronounced "miracle") is a query processing and optimization system for large-scale, distributed data analysis, built on top of Apache Hadoop, Hama, and Spark.
• BlinkDB – A massively parallel, approximate query engine for running interactive SQL queries on large volumes of data. It allows users to trade off query accuracy for response time, enabling interactive queries over massive data by running queries on data samples and presenting results annotated with meaningful error bars. [Aga+12; Aga+13]

Library Collection:
• DataFu – Apache DataFu is a collection of libraries for working with large-scale data in Hadoop. The project was inspired by the need for stable, well-tested libraries for data mining and statistics. [HS13]

Data Modeling:
• Gora – Apache Gora is an open source framework that provides an in-memory data model and persistence for big data. It supports persisting to column stores, key/value stores, document stores and RDBMSs, and analyzing the data with extensive MapReduce support.
• Kite – A high-level data layer for Hadoop. It is an API and a set of tools that speed up development by enabling you to configure how Kite stores your data in Hadoop. [Kit15]

Application Framework:
• Tez – Apache Tez is a general-purpose resource management framework which allows for complex processing of directed acyclic graphs of tasks and is built atop Hadoop YARN. [Tez15]
• Cascading – An open source application development platform for building data applications on Hadoop. It is used to create and execute complex data processing workflows on a Hadoop cluster using any JVM-based language (Java, JRuby, Clojure, etc.), hiding the underlying complexity of MapReduce jobs.
• Flink (Stratosphere) – Features powerful programming abstractions in Java and Scala, a high-performance runtime, and automatic program optimization. It has native support for iterations, incremental iterations, and programs consisting of large DAGs of operations. [Ale+14]
• Crunch – The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.

Search Engine:
• Lucene – An open source, high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. [Bu+10]
• Solr – Highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. It is built on Apache Lucene.
• Nutch – An open source web search engine based on Lucene and Java for the search and index component. It has a highly modular architecture, allowing developers to create plug-ins for media-type parsing, data retrieval, querying and clustering. [Kha+04]
• Elasticsearch – An open source search server based on Lucene. It provides a distributed, multitenant-capable full-text search engine with a RESTful web interface and schema-free JSON documents.

Machine Learning:
• Mahout – A scalable machine learning and data mining library. [OO12]

Data Analytics:
• Hama – An open source project allowing advanced analytics beyond MapReduce.

Stream Analytics:
• SAMOA – Apache SAMOA is a distributed streaming machine learning (ML) framework that contains a programming abstraction for distributed streaming ML algorithms. [Mor13]

Procedural Language:
• Pig – A high-level data-flow language and execution framework for parallel computation. [Gat+09; Ols+08a]

Content Analysis:
• Tika – A toolkit that detects and extracts metadata and structured text content from various documents using existing parser libraries. [MZ11]

Graph Analytics:
• Faunus – A Hadoop-based graph analytics engine for analyzing graphs represented across a multi-machine compute cluster.

2.5 Summary and Future Research Directions

In this chapter, the new concept of heterogeneity was introduced in relation to the design and implementation of Big Data platforms, and it was discussed how the existing tools comprising the Hadoop ecosystem adapt to these challenges. The emergence of new analytical applications opens new Big Data challenges for both researchers and practitioners [Zic14]. These challenges relate not only to Data Characteristics (quality, availability, discovery and comprehensiveness), but also to Data Processing (cleansing, capturing, and modeling) and Data Management (privacy, security and governance). An evaluation framework addressing them should be able to give technology guidelines on how to build the best cost-performance Big Data platform for both vertical and generic data processing workloads. This chapter provided an overview of a generic heterogeneous approach for addressing the above challenges. Nevertheless, such an approach also has its limitations, since the current perspective on the Hadoop ecosystem is subject to continuous development. However, architectural patterns become more relevant once an overall system architecture has been proven to work. This theoretical overview can provide a solid basis, especially for the challenges faced by practitioners in areas where fast and effective processing of data is of strategic importance.


3 Evaluation of Big Data Platforms and Benchmarks

This chapter is structured as follows:

• Section 3.1 investigates the performance of Big Data systems (e.g. Hadoop) in virtualized environments, stress testing the management and platform layers.

• Section 3.2 investigates the performance of NoSQL databases versus Hadoop distributions.

• Section 3.3 executes and evaluates how the TPCx-HS benchmark stresses the hardware, management and platform layers.

• Section 3.4 evaluates and compares Hive and Spark SQL engines using bench- mark queries.

• Section 3.5 investigates and evaluates the impact of compression techniques on SQL-on-Hadoop engine performance.

Figure 3.1 illustrates the tight integration between the hybrid benchmark methodology and the different Big Data Architecture layers (in blue) mapped to the heterogeneity levels, defined in Chapter 2.

Fig. 3.1.: The Hybrid Benchmark Methodology integrated with the different Big Data Architecture Layers mapped to the heterogeneity levels.

3.1 Performance Evaluation of Virtualized Hadoop Clusters

Abstract

This work investigates the performance of Big Data applications in virtualized Hadoop environments, hosted on a single physical node. An evaluation and performance comparison of applications running on a virtualized Hadoop cluster with separated data and computation layers against a standard Hadoop installation is presented. Our experiments show how different Data-Compute Hadoop cluster configurations, utilizing the same virtualized resources, can influence the performance of CPU-bound and I/O-bound workloads. Based on our observations, we identify three important factors that should be considered when configuring and provisioning virtualized Hadoop clusters. The paper is based on the following publications:

• Todor Ivanov, Roberto V. Zicari, Sead Izberovic, Karsten Tolle, Performance Evaluation of Virtualized Hadoop Clusters, Frankfurt Big Data Lab, Technical Report No.2014-1, arXiv:1411.3811, [Iva+14b].

• Todor Ivanov, Roberto Zicari, Alejandro Buchmann, Benchmarking Virtu- alized Hadoop Clusters, in Proceedings of the 5th Workshop on Big Data Benchmarking (WBDB 2014), August 2014, Potsdam, Germany, [IZB14].

Keywords: Big Data, Benchmarking, Hadoop, Virtualization.

3.1.1 Introduction

Apache Hadoop [18d] has emerged as the predominant platform for Big Data applications. Recognizing this potential, Cloud providers have rapidly adopted it as part of their services (IaaS, PaaS and SaaS) [SM13]. For example, Amazon, with its Elastic MapReduce (EMR) [18b] web service, has been one of the pioneers in offering Hadoop-as-a-Service. The main advantages of such cloud services are quick automated deployment and cost-effective management of Hadoop clusters, realized through the pay-per-use model. All these features are made possible by virtualization technology, which is a basic building block of the majority of public and private Cloud infrastructures [RCL09]. However, the benefits of virtualization come at the price of an additional performance overhead. In the case of virtualized Hadoop clusters, the challenges are not only the storage of large data sets, but also the data transfer during processing. Related works comparing the performance of a virtualized Hadoop cluster with a physical one reported virtualization overheads ranging between 2 and 10%, depending on the application type [Bue13], [Mic13]. However, there were also cases where virtualized Hadoop performed better than the physical cluster, because of the better resource utilization achieved with virtualization.

In spite of the overhead caused by the hypervisor, there are multiple advantages of hosting Hadoop in a cloud environment [Bue13], [Mic13], such as improved scalability, failure recovery, efficient resource utilization, multi-tenancy and security, to

name a few. In addition, using a virtualization layer makes it possible to separate the compute and storage layers of Hadoop onto different virtual machines (VMs). Figure 3.2 depicts various combinations to deploy a Hadoop cluster on top of a hypervisor. Option (1) hosts a worker node in a virtual machine running both a TaskTracker and a DataNode service on a single host. Option (2) makes use of the multi-tenancy ability provided by the virtualization layer, hosting two Hadoop worker nodes on the same physical server. Option (3) shows an example of functional separation of compute (MapReduce service) and storage (HDFS service) in separate VMs. In this case, the virtual cluster consists of two compute nodes and one storage node hosted on a single physical server. Finally, option (4) gives an example of two separate clusters running on different hosts. The first cluster consists of one data and one compute node. The second cluster consists of a compute node that accesses the data node of the first cluster. These deployment options are currently supported by Serengeti [18y], a project initiated by VMware, and Sahara [18v], which is part of the OpenStack [18w] cloud platform.

Fig. 3.2.: Options for Virtualized Hadoop Cluster Deployments

In this report we investigate the performance of Hadoop clusters, deployed with separated storage and compute layers (option (3)), on top of a hypervisor managing a single physical host. We have analyzed and evaluated the different Hadoop cluster configurations by running CPU bound and I/O bound workloads.

The report is structured as follows: Section 3.1.2 provides a brief description of the technologies involved in our study. An overview of the experimental platform, setup test and configurations are presented in Section 3.1.3. Our benchmark methodology is defined in Section 3.1.4. The performed experiments together with the evaluation of the results are presented in Section 3.1.5. Finally, Section 3.1.6 concludes with lessons learned.

3.1.2 Background

Big Data has emerged as a new term not only in IT, but also in numerous other industries such as healthcare, manufacturing, transportation, retail and public sector administration [Man+11; Jag+14] where it quickly became relevant. There is still

no single definition which adequately describes all Big Data aspects [Hu+14], but the "V" characteristics (Volume, Variety, Velocity, Veracity and more) are among the most widely used ones. Exactly these new Big Data characteristics challenge the capabilities of traditional data management and analytical systems [Hu+14; IKZ13]. These challenges also motivate researchers and industry to develop new types of systems such as Hadoop and NoSQL databases [Cat10].

Apache Hadoop [18d] is a software framework for distributed storing and processing of large data sets across clusters of computers using the map and reduce programming model. The architecture allows scaling up from a single server to thousands of machines. At the same time, Hadoop delivers high availability by detecting and handling failures at the application layer. The use of data replication guarantees data reliability and fast access. The core Hadoop components are the Hadoop Distributed File System (HDFS) [Bor07; Shv+10] and the MapReduce framework [DG08]. HDFS has a master/slave architecture with a NameNode as the master and multiple DataNodes as slaves. The NameNode is responsible for storing and managing all file structures, metadata, transactional operations and logs of the file system. The DataNodes store the actual data in the form of files. Each file is split into blocks of a pre-configured size. Every block is copied and stored on multiple DataNodes. The number of block copies depends on the Replication Factor.
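To make the block splitting and replication mechanism more concrete, the following minimal Python sketch (our own illustration, not part of HDFS) estimates how many blocks a file is split into and how much physical storage it consumes for the 64MB block size and Replication Factor of 3 that we also use later in our experimental setup (Section 3.1.3):

```python
import math

def hdfs_storage_footprint(file_size_mb, block_size_mb=64, replication_factor=3):
    """Estimate how a file is split and replicated in HDFS.

    Returns the number of blocks the file is split into and the total
    physical storage consumed once every block is copied to
    `replication_factor` DataNodes.
    """
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    total_physical_mb = file_size_mb * replication_factor
    return num_blocks, total_physical_mb

# Example: a 1 GB file with the default settings used in our clusters
blocks, physical_mb = hdfs_storage_footprint(1024)
print(blocks, physical_mb)  # 16 blocks, 3072 MB of physical storage
```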

MapReduce is a software framework that provides general programming interfaces for writing applications that process vast amounts of data in parallel, using a distributed file system running on the cluster nodes. The MapReduce unit of work is called a job and consists of input data and a MapReduce program. Each job is divided into map and reduce tasks. A map task takes a split, which is a part of the input data, and processes it according to the user-defined map function of the MapReduce program. A reduce task gathers the output data of the map tasks and merges it according to the user-defined reduce function. The number of reducers is specified by the user and does not depend on the input splits or the number of map tasks. Parallel application execution is achieved by running map tasks on each node to process the local data and then sending the results to reduce tasks, which produce the final output.
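As a conceptual illustration of the map and reduce functions described above, the following self-contained Python sketch emulates the data flow of a word count job in a single process. It is only a model of the programming paradigm, not code that runs on Hadoop; the function names and the toy input are our own.

```python
from collections import defaultdict

def map_phase(split):
    # User-defined map function: emit a (word, 1) pair for every word in the split.
    for line in split:
        for word in line.split():
            yield (word, 1)

def reduce_phase(key, values):
    # User-defined reduce function: sum all counts emitted for one key.
    return key, sum(values)

def run_job(splits):
    # "Shuffle" step: group intermediate (key, value) pairs by key,
    # then apply the reduce function once per key.
    intermediate = defaultdict(list)
    for split in splits:
        for key, value in map_phase(split):
            intermediate[key].append(value)
    return dict(reduce_phase(k, v) for k, v in intermediate.items())

splits = [["big data big"], ["data platforms"]]
print(run_job(splits))  # {'big': 2, 'data': 2, 'platforms': 1}
```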

Hadoop implements the MapReduce model by using two types of processes: JobTracker and TaskTracker. The JobTracker coordinates all jobs in Hadoop and schedules tasks to the TaskTrackers on every cluster node. The TaskTracker runs the tasks assigned by the JobTracker. Multiple other applications were developed on top of the Hadoop core components, also known as the Hadoop ecosystem, to make it easier to use and applicable to a variety of industries. Examples of such applications are Hive [Thu+10], Pig [Gat+09], Mahout [18g], HBase [Geo11], Sqoop [TC13] and many more.

VMware vSphere [14f] is the leading server virtualization technology for cloud infrastructure, consisting of multiple software components with compute, network, storage, availability, automation, management and security capabilities. It virtualizes and aggregates the underlying physical hardware resources across multiple systems and provides pools of virtual resources to the datacenter.

Serengeti [18y] is an open source project started by VMware and now part of the vSphere Big Data Extension [14e]. The goal of the project is to enable quick configuration and automated deployment of Hadoop in virtualized environments. The major contribution of the project is the Hadoop Virtual Extension (HVE) [18aj], which makes Hadoop aware that it is virtualized. This new layer integrating hypervisor functionality is implemented using hooks that touch all of the Hadoop subcomponents (Common, HDFS and MapReduce) and is called the Node Group layer. Additionally, new data-locality related policies are included: a replica placement/removal policy extension, a replica choosing policy extension and a balancer policy extension. According to the VMware report [18p], the benefits of virtualizing Hadoop are: (i) enabling rapid provisioning; (ii) additional high availability and fault tolerance provided by the hypervisor; (iii) improving datacenter efficiency through higher server consolidation; (iv) efficient resource utilization by guaranteeing virtual machine resources; (v) multi-tenancy, allowing mixed workloads on the same tenant while still preserving the Quality of Service (QoS) and SLAs; (vi) security and isolation between the virtual machines; (vii) time sharing by scheduling jobs to run in periods with low hardware usage; (viii) easy maintenance and movement of the environment; (ix) the ability to run Hadoop-as-a-service in Cloud environments. Another major functionality that Serengeti introduces for the first time is the ability to separate the compute and storage layers of Hadoop on different virtual machines.

3.1.3 Experimental Environment

Platform

An abstract view of the experimental platform we used to perform the tests is shown in Figure 3.3. The platform is organized in four logical layers which are described below.

Fig. 3.3.: Experimental Platform Layers

Hardware: It consists of a standard Dell PowerEdge T420 server equipped with two Intel Xeon E5-2420 (1.9 GHz) CPUs each with six cores, 32 GB of RAM and four 1 TB, Western Digital (SATA, 3.5 in, 7.2K RPM, 64MB Cache) hard drives.

Management (Virtualization): We installed the VMware vSphere 5.1 [14f] plat- form on the physical server, including ESXi and vCenter Servers for automated VM management.

Platform (Hadoop Cluster): Project Serengeti, integrated in the vSphere Big Data Extension (BDE) (version 1.0) [14e] and installed in a separate VM, was used for automatic deployment and management of Hadoop clusters. The hard drives were deployed as separate data stores and used as shared storage resources by BDE. The deployment of both Standard and Data-Compute cluster configurations was done using the default BDE/Serengeti Server options as described in [LM14]. In all experiments we used the Apache Hadoop distribution (version 1.2.1), included in the Serengeti Server VM template (hosting CentOS), with the default parameters: 200MB Java heap size, 64MB HDFS block size and a Replication Factor of 3.

Application (HiBench Benchmark): The HiBench [Hua+10a] benchmark suite was developed by Intel to stress test Hadoop systems. It contains 10 different workloads divided into 4 categories:

1. Micro Benchmarks (Sort, WordCount, TeraSort, Enhanced DFSIO)

2. Web Search (Nutch Indexing, PageRank)

3. Machine Learning (Bayesian Classification, K-means Clustering)

4. Analytical Queries (Hive Join, Hive Aggregation)

For our experiments, we have chosen two representative MapReduce applications from the HiBench micro-benchmarks, namely the WordCount (CPU bound) and the TestDFSIOEnhanced (I/O bound) workloads. One obvious limitation of our experimental environment is that it consists of a single physical server, hosting all VMs, and does not involve any physical network communication between the VM nodes. Additionally, all experiments were performed on the VMware ESXi hypervisor. This means that the reported results may not apply to other hypervisors, as suggested by related work [Li+13b] comparing different hypervisors.

Setup and Configuration

The focus of this report is on analyzing the performance of different virtualized Hadoop cluster configurations, deployed and tested on our platform. Figure 3.4 shows the two types of cluster configurations investigated in this report, namely: Standard and Data-Compute clusters.

Fig. 3.4.: Standard and Data-Compute Hadoop Cluster Configurations

The Standard Hadoop cluster type is a standard Hadoop cluster configuration hosted in a virtualized environment, with each cluster node installed in a separate VM. The cluster consists of one Compute Master VM (running the JobTracker), one Data Master VM (running the NameNode) and multiple Worker VMs. Each Worker VM runs both the TaskTracker and DataNode services. The data exchange takes place between the TaskTracker and DataNode services inside the VM. The Data-Compute Hadoop cluster type, on the other hand, similarly has Compute and Data Master VMs, but two types of Worker nodes: Compute and Data Worker VMs. This means that there are data nodes running only the DataNode service and compute nodes running only the TaskTracker service. The data exchange takes place between the Compute and Data VMs, incurring extra virtual network traffic. The advantage of this configuration is that the number of data and compute nodes in a cluster can be independently and dynamically scaled, adapting to the workload requirements. The first factor that we have to take into account when comparing the configurations is the number of VMs utilized in a cluster. Each additional VM increases the hypervisor overhead and therefore can influence the performance of a particular application, as reported in [Li+13b; Bue13; Ye+12]. At the same time, running more VMs utilizes the hardware resources more efficiently and in many cases leads to improved overall system performance (CPU and I/O throughput) [Bue13]. The second factor is that all cluster configurations should utilize the same amount of hardware resources in order to be comparable. Taking these two factors into account, we specified six different cluster configurations. Two of the cluster configurations are of type Standard Hadoop cluster and the other four are of type Data-Compute Hadoop cluster. Based on the number of virtual nodes utilized in a cluster configuration, we compare Standard1 with Data-Comp1 and Standard2 with Data-Comp3 and Data-Comp4. Additionally, we added Data-Comp2 to compare it with Data-Comp1 and Data-Comp3. The goal is to better understand how the number of data nodes influences the performance of I/O bound applications in a Data-Compute Hadoop cluster.

Table 3.1 shows the worker nodes of each configuration and the resources allocated per VM (vCPUs, vRAM and vDisks). Three additional VMs (Compute Master, Data Master and Client VMs), not listed in Table 3.1, were used in all six cluster configurations. The exact parameters of each cluster configuration are described in a JSON file, which ensures the repeatability of the configured resources and options. As an example, the JSON file of the Data-Comp1 cluster configuration is included in the Appendix. For simplicity, in the rest of the text we will abbreviate Worker Node as WN, Compute Worker Node as CWN and Data Worker Node as DWN.

3.1.4 Benchmarking Methodology

In this section we describe the benchmarking methodology that we defined and used throughout all experiments. The major motivation behind it was to ensure comparability between the measured results. We started by selecting 2 out of the 10 HiBench [Hua+10a] workloads, as listed in Table 3.2. Our goal was to have one representative CPU bound workload and one representative I/O bound workload.

Figure 3.5 briefly illustrates the five phases in our experimental methodology, which we call an Iterative Experimental Approach.

Tab. 3.1.: Six Experimental Hadoop Cluster Configurations

Configuration Name | Worker Nodes
Standard1 (Standard Cluster 1) | 3 Worker Nodes: TaskTracker & DataNode; 4 vCPUs; 4608MB vRAM; 100GB vDisk
Standard2 (Standard Cluster 2) | 6 Worker Nodes: TaskTracker & DataNode; 2 vCPUs; 2304MB vRAM; 50GB vDisk
Data-Comp1 (Data-Compute Cluster 1) | 2 Compute Worker Nodes: TaskTracker; 5 vCPUs; 4608MB vRAM; 50GB vDisk | 1 Data Worker Node: DataNode; 2 vCPUs; 4608MB vRAM; 200GB vDisk
Data-Comp2 (Data-Compute Cluster 2) | 2 Compute Worker Nodes: TaskTracker; 5 vCPUs; 4608MB vRAM; 50GB vDisk | 2 Data Worker Nodes: DataNode; 1 vCPU; 2084MB vRAM; 100GB vDisk
Data-Comp3 (Data-Compute Cluster 3) | 3 Compute Worker Nodes: TaskTracker; 3 vCPUs; 2664MB vRAM; 20GB vDisk | 3 Data Worker Nodes: DataNode; 1 vCPU; 1948MB vRAM; 80GB vDisk
Data-Comp4 (Data-Compute Cluster 4) | 5 Compute Worker Nodes: TaskTracker; 2 vCPUs; 2348MB vRAM; 20GB vDisk | 1 Data Worker Node: DataNode; 2 vCPUs; 2048MB vRAM; 200GB vDisk

Tab. 3.2.: Selected HiBench Workload Characteristics

Workload | Data structure | CPU usage | IO (read) | IO (write)
WordCount | unstructured | high | low | low
Enhanced DFSIO | unstructured | low | high | high

In the initial Phase 1, all software components (VMware vSphere, Big Data Extension and Serengeti Server) are installed and configured. In Phase 2, we set up the Apache Hadoop cluster using the Big Data Extension and Serengeti Server. We choose the cluster type, configure the number of nodes and set the virtualized resources as listed in Table 3.1. Finally, the cluster configuration is created and HiBench is installed in the client VM. Next, in Phase 3 (Workload Prepare), all workload parameters are defined and the test data is generated. The generated data together with the defined parameters is then used as input to execute the workload in Phase 4. As already mentioned, each experiment was repeated 3 times to ensure the representativeness of the results, which means that the data generation (Phase 3) and the Workload Execution (Phase 4) were run 3 consecutive times. Before each workload experiment, the existing data is deleted in the Workload Prepare phase and new data is generated.

Fig. 3.5.: Iterative Experimental Approach

In Phase 4, HiBench reports two types of results: Duration (in seconds) and Throughput (in MB per second). The Throughput is calculated by dividing the input data size by the Duration. These results are then analyzed in Phase 5, called Evaluation, and presented graphically in the next section of our report.
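As a concrete example of this calculation (a small sketch of our own, assuming the data size is converted to MB with a factor of 1024), the fastest 60GB WordCount run reported later in Table 3.6 maps to the throughput range shown in Figure 3.8:

```python
def hibench_throughput_mb_s(input_size_gb, duration_sec):
    """HiBench-style throughput: input data size divided by the duration."""
    return input_size_gb * 1024 / duration_sec

# Standard2, 60 GB WordCount run from Table 3.6 (1173.29 seconds)
print(round(hibench_throughput_mb_s(60, 1173.29), 1))  # ~52.4 MB/s
```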

We call our approach iterative because after each test run (completing Phases 2 to 5) the user can start again at Phase 2, switching to a different HiBench workload, and continue performing new test runs on the same cluster configuration. However, to ensure a consistent data state, a fresh copy of the input data has to be generated before each benchmark run. Similarly, in the case of a new cluster configuration, all existing virtual nodes have to be deleted and replaced with new ones, using a basic virtual machine template. For all experiments, we ran only one cluster configuration at a time on the platform. In this way, we avoided biased results due to an inconsistent system state.

3.1.5 Experimental Results

This section gives a brief overview of the WordCount and Enhanced DFSIO workloads. It also presents the results and analysis of the performed experiments. The results are provided in tables, which consist of multiple columns with the following data:

• Data Size (GB): size of the input data in gigabytes

• Time (Sec): workload execution duration time in seconds

• Data ∆ (%): difference of Data Size (GB) to a given data baseline in percent

• Time ∆ (%): difference of Time (Sec) to a given time baseline in percent

WordCount

WordCount [Hua+10a] is a CPU bound MapReduce job which calculates the num- ber of occurrences of each word in a text file. The input text data is generated by the RandomTextWriter program which is also part of the standard Hadoop distribu- tions.

Preparation: The workload takes three parameters listed in Table 3.3. The DATASIZE parameter is relevant only for the data generation.

Tab. 3.3.: WordCount Parameters

Parameter | Description
NUM_MAPS | Number of map jobs per node
NUM_REDS | Number of reduce jobs per node
DATASIZE | Input (text) data size per node; relevant for the data generator

In the case of a Data-Compute cluster, these parameters are only relevant for the Compute Workers (CWNs running the TaskTracker). Therefore, in order to achieve comparable results between the Standard and Data-Compute Hadoop cluster types, the overall sum of the processed data and the total number of map and reduce tasks should be the same. The total data size is equal to DATASIZE multiplied by the number of CWNs or WNs. For example, to process 60GB of data in the Standard1 (3 WNs) cluster, we configured 20GB input data size, 4 map tasks and 1 reduce task per node, whereas in the Data-Comp1 (2 CWNs & 1 DWN) cluster we configured 30GB input data size, 6 map tasks and 1 reduce task per node. Similarly, we adjusted the input parameters for the remaining four clusters to ensure that the same amount of data was processed. We experimented with three different data sets (60, 120 and 180 GB), which after compression resulted in smaller sets (15.35, 30.7 and 46 GB).
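A small Python sketch (our own helper for illustration, not part of HiBench) shows how the per-node DATASIZE and map task settings can be derived so that every cluster configuration processes the same total amount of data:

```python
def per_node_settings(total_gb, total_maps, num_compute_nodes):
    """Derive per-node HiBench WordCount parameters for a cluster.

    DATASIZE and NUM_MAPS are interpreted per (compute) worker node,
    so the totals are divided by the number of nodes running TaskTrackers.
    """
    return {
        "DATASIZE_GB": total_gb / num_compute_nodes,
        "NUM_MAPS": total_maps // num_compute_nodes,
    }

# 60 GB and 12 map tasks in total
print(per_node_settings(60, 12, 3))  # Standard1:  20 GB and 4 maps per node
print(per_node_settings(60, 12, 2))  # Data-Comp1: 30 GB and 6 maps per node
```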

Results and Evaluation: The following subsections represent different viewpoints of the same experiments and hence are based on the same numbers. In the first subsection we compare the performance of the Standard and Data-Compute cluster configurations. The second subsection evaluates how increasing the data size changes the performance of each cluster configuration.

Comparing Different Cluster Configurations

Figure 3.6 depicts the WordCount completion times for each input data size, normalized with respect to Standard1 as the baseline. Lower values represent faster completion times, while higher values indicate longer completion times.

Fig. 3.6.: Normalized WordCount Completion Times

Tab. 3.4.: WordCount - Equal Number of VMs

Data Size (GB) | Diff. (%) Standard1/Data-Comp1 (3 VMs) | Diff. (%) Standard2/Data-Comp3 (6 VMs) | Diff. (%) Standard2/Data-Comp4 (6 VMs)
60 | 0 | +22 | +13
120 | +2 | +22 | +14
180 | +2 | +23 | +14

Tab. 3.5.: WordCount - Different Number of VMs

Data Size (GB) | Diff. (%) Standard1/Standard2 (3 VMs / 6 VMs) | Diff. (%) Standard1/Data-Comp4 (3 VMs / 6 VMs)
60 | -19 | -3
120 | -18 | -1
180 | -17 | -1

Table 3.4 compares cluster configurations utilizing the same number of VMs. In the first case, Standard1 (3 WNs) performs slightly (up to 2%) better than Data-Comp1 (2 CWNs & 1 DWN). In the second and third case, Standard2 (6 WNs) is around 23% faster than Data-Comp3 (3 CWNs & 3 DWNs) and around 14% faster than Data-Comp4 (5 CWNs & 1 DWN), making it the best choice for CPU bound applications.

In Table 3.5, comparing the configurations with different number of VMs, we observe that Standard2 (6 WNs) is between 17-19% faster than Standard1 (3 WNs), although Standard2 utilizes 6 VMs and Standard1 only 3 VMs. Similarly, Data-Comp4 (5 CWNs & 1 DWNs) achieves between 1-3% faster times than Standard1 (3 WNs).

In both cases, having more VMs makes better use of the underlying hardware resources, which is in line with the conclusions reported in [Bue13].

Another interesting observation, as seen in Figure 3.6, is that the clusters Data-Comp1 (2 CWNs & 1 DWN) and Data-Comp2 (2 CWNs & 2 DWNs) perform alike, although Data-Comp2 utilizes an additional data worker node instance, which causes extra overhead on the hypervisor. However, as the WordCount workload is mostly CPU bound [Hua+10a], all the processing is performed on the compute worker nodes and the extra VM instance does not impact the actual performance. At the same time, if we compare the times of all four Data-Compute cluster configurations in Figure 3.6, we observe that Data-Comp4 (5 CWNs & 1 DWN) performs best. This shows, first, that the allocation of virtualized resources influences the application performance and, second, that for CPU bound applications having more compute nodes is beneficial. Serengeti offers the ability for Compute Workers to use a Network File System (NFS) instead of virtual disk storage, also called TempFS in Serengeti. The goal is to ensure data locality and to increase capacity and flexibility with minimal overhead. A detailed evaluation and experimental results of this approach are presented in the related work [Mag+13]. Using the TempFS storage type, we performed experiments with the Data-Comp1 and Data-Comp4 cluster configurations. The results showed only a very slight improvement of around 1% compared to the default shared virtual disk type that we used in all configurations.

Processing Different Data Sizes

Figure 3.7 depicts the WordCount processing times (in seconds) for the different data sizes of all six cluster configurations. Shorter times indicate better performance and longer times indicate worse performance. Clearly, cluster configuration Standard2 (6 WNs) achieves the fastest times for all three data sizes compared to the other configurations. This is also observed in Figure 3.8, which illustrates the throughput (MB per second) of all six configurations, where configuration Standard2 (6 WNs) achieves the highest throughput of 52-54 MB per second.

Fig. 3.7.: WordCount Time (Seconds)

Fig. 3.8.: WordCount Throughput (MB per second)

Table 3.6 and Table 3.7 summarize the processing times for all Standard and Data-Compute cluster configurations. Additionally, there is a column "Data ∆" representing the data increase in percent compared to the baseline data size, which is 60GB. For example, the ∆ between the baseline (60GB) and 120GB is +100% and for 180GB it is +200%. Also, there are multiple columns "Time ∆", one per cluster configuration, which indicate the time difference in percent compared to the time of Standard1 (3 WNs), which we use as the baseline configuration. For example, comparing the times for processing 60GB of the baseline Standard1 (3 WNs) with the Standard2 (6 WNs) configuration results in a -15.72% time difference. This means that Standard2 finishes in 15.72% less time than the baseline Standard1. Similarly, positive time differences indicate slower completion times in comparison to the baseline.
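The ∆ columns in the following tables are simple relative differences to the baseline. A short Python sketch (ours, purely for illustration) reproduces the calculation, using the 60GB WordCount times from Table 3.6:

```python
def delta_percent(value, baseline):
    """Relative difference to the baseline in percent.

    Negative results mean faster than (or smaller than) the baseline,
    positive results mean slower (or larger).
    """
    return (value - baseline) / baseline * 100.0

baseline_time = 1392.06   # Standard1, 60 GB WordCount (seconds)
standard2_time = 1173.29  # Standard2, 60 GB WordCount (seconds)

print(round(delta_percent(standard2_time, baseline_time), 2))  # -15.72 (Time delta)
print(round(delta_percent(120, 60)))                           # 100 (Data delta)
```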

Tab. 3.6.: WordCount Standard Cluster Results

Data Size (GB) | Data ∆ (%) | Standard1 Baseline (Sec) | Standard2 (Sec) | Time ∆ (%)
60 | baseline | 1392.06 | 1173.29 | -15.72
120 | +100 | 2718.03 | 2304.94 | -15.2
180 | +200 | 4040.5 | 3442.03 | -14.81

Tab. 3.7.: WordCount Data-Compute Cluster Results

Data Size (GB) | Data ∆ (%) | Data-Comp1 (Sec) | Time ∆ (%) | Data-Comp2 (Sec) | Time ∆ (%) | Data-Comp3 (Sec) | Time ∆ (%) | Data-Comp4 (Sec) | Time ∆ (%)
60 | baseline | 1390.74 | -0.09 | 1385.63 | -0.46 | 1497.53 | +7.58 | 1351.87 | -2.89
120 | +100 | 2767.91 | +1.84 | 2752.92 | +1.28 | 2963.72 | +9.04 | 2684.3 | -1.24
180 | +200 | 4125.48 | +2.1 | 4122.59 | +2.03 | 4443.24 | +9.97 | 4013.77 | -0.66

Figure 3.9 depicts the time differences in percent of all cluster configurations normalized with respect to Standard1 (3 WNs) as the baseline. We observe that Standard2 (6 WNs) and Data-Comp4 (5 CWNs & 1 DWN) have negative time differences, which means that they perform faster than Standard1. On the other hand, Data-Comp1 (2 CWNs & 1 DWN), Data-Comp2 (2 CWNs & 2 DWNs) and Data-Comp3 (3 CWNs & 3 DWNs) have positive time differences, which means that they perform slower than Standard1.

Fig. 3.9.: WordCount Time Difference between Standard1 (Baseline) and all Other Configurations in %

Figure 3.10 illustrates how the different cluster configurations scale with the increasing data sets, normalized to the baseline configuration. We observe that all configurations scale nearly linearly with the increase of the data sizes. However, similar to the previous figures, we can clearly distinguish that Standard2 (6 WNs) is the fastest configuration, as its data points lie much lower than those of the other configurations. On the contrary, Data-Comp3 (3 CWNs & 3 DWNs) is the slowest configuration, as its data points are the highest ones for all three data sizes.

Fig. 3.10.: WordCount Data Scaling Behavior of all Cluster Configurations normalized to Standard1

In summary, our experiments showed that for all cluster configurations the CPU bound WordCount workload scales nearly linearly with the increase of the input data sets. We also clearly observed that the Standard2 (6 WNs) configuration performs best, achieving the fastest times, whereas Data-Comp3 (3 CWNs & 3 DWNs) performs worst, achieving the slowest completion times.

Enhanced DFSIO

TestDFSIO [18e] is a HDFS benchmark included in Hadoop distributions. It is designed to stress test the storage I/O (read and write) capabilities of a cluster. In this way performance bottlenecks in the network, hardware, OS or Hadoop setup

can be found and fixed. The benchmark consists of two parts: TestDFSIO-write and TestDFSIO-read. The write program starts multiple map tasks, with each task writing a separate file in HDFS. The read program starts multiple map tasks, with each task sequentially reading the previously written files and measuring the file size and the task execution time. The benchmark uses a single reduce task to measure and compute two performance metrics for each map task: Average I/O Rate and Throughput. Equation 3.1 and Equation 3.2 illustrate how the two metrics are calculated, with N as the total number of map tasks and the index i (1 ≤ i ≤ N) identifying the individual tasks.

\[
\text{Average I/O rate}(N) = \frac{\sum_{i=1}^{N} rate(i)}{N} = \frac{1}{N}\sum_{i=1}^{N} \frac{filesize(i)}{time(i)} \tag{3.1}
\]

\[
\text{Throughput}(N) = \frac{\sum_{i=1}^{N} filesize(i)}{\sum_{i=1}^{N} time(i)} \tag{3.2}
\]
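The two metrics can be reproduced with a few lines of Python (a sketch of the formulas above, not the actual TestDFSIO code). The toy numbers are ours and illustrate how a single delayed map task skews the average I/O rate much more than the aggregate throughput:

```python
def average_io_rate(tasks):
    # tasks: list of (file_size_mb, time_sec) pairs, one per map task
    rates = [size / t for size, t in tasks]
    return sum(rates) / len(rates)

def throughput(tasks):
    total_size = sum(size for size, _ in tasks)
    total_time = sum(t for _, t in tasks)
    return total_size / total_time

tasks = [(100, 10), (100, 10), (100, 40)]  # the third task is delayed
print(round(average_io_rate(tasks), 2))  # 7.5 MB/s
print(round(throughput(tasks), 2))       # 5.0 MB/s
```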

Enhanced DFSIO is an extension of the DFSIO benchmark developed specifically for HiBench [Hua+10a]. The original TestDFSIO benchmark reports the average I/O rate and throughput per map task, which is not representative in cases where map tasks are delayed or retried. Enhanced DFSIO addresses this problem by computing the aggregated I/O bandwidth. This is done by sampling the number of bytes read/written at fixed time intervals in the format (map id, timestamp, total bytes read/written). Aggregating all sample points for each map task allows plotting the exact map task throughput as a linearly interpolated curve. The curve consists of a warm-up phase and a cool-down phase, in which the map tasks are started and shut down, respectively. In between is the steady phase, which is defined by a specified percentage of map tasks (the default is 50%, but it can be configured). When the number of concurrent map tasks in a time slot is above the specified percentage, the slot is considered to be in the steady phase. The Enhanced DFSIO aggregated throughput metric is calculated by averaging the values of all time slots in the steady phase.
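The following Python sketch illustrates the idea of the steady-phase aggregation in a strongly simplified form (our own approximation of the approach, not the HiBench implementation): per time slot we sum the sampled volume of all tasks, keep only the slots where more than the configured percentage of map tasks is active, and average over those slots.

```python
from collections import defaultdict

def aggregated_throughput(samples, num_map_tasks, steady_ratio=0.5):
    """samples: list of (slot, map_id, mb_in_slot) triples sampled at fixed intervals.

    A slot belongs to the steady phase if the number of concurrently active
    map tasks in it is above steady_ratio * num_map_tasks. The metric is the
    average of the summed per-slot volume over all steady slots.
    """
    active = defaultdict(set)    # slot -> ids of map tasks active in that slot
    volume = defaultdict(float)  # slot -> MB read/written in that slot
    for slot, map_id, mb in samples:
        active[slot].add(map_id)
        volume[slot] += mb
    steady = [volume[s] for s in volume
              if len(active[s]) > steady_ratio * num_map_tasks]
    return sum(steady) / len(steady) if steady else 0.0

samples = [(0, "m1", 50), (1, "m1", 60), (1, "m2", 55), (2, "m2", 40)]
# Only slot 1 has more than 50% of the 2 map tasks active -> (60 + 55) = 115.0
print(aggregated_throughput(samples, num_map_tasks=2))
```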

Preparation:

The Enhanced DFSIO benchmark takes four input configuration parameters as described in Table 3.8.

Tab. 3.8.: Enhanced DFSIO Parameters

Parameter | Description
RD_FILE_SIZE | Size of a file to read in MB
RD_NUM_OF_FILES | Number of files to read
WT_FILE_SIZE | Size of a file to write in MB
WT_NUM_OF_FILES | Number of files to write

For the Enhanced DFSIO benchmark, the file sizes (parameters RD_FILE_SIZE and WT_FILE_SIZE), which the workload should read and write, were fixed to

100MB. At the same time, the number of files (parameters RD_NUM_OF_FILES and WT_NUM_OF_FILES) was set to 100, 200 and 500 in order to operate on data sets of 10, 20 and 50 GB. The total data size is the product of the file size and the number of files to be read/written. Three experiments were executed, as listed in Table 3.9.

Tab. 3.9.: Enhanced DFSIO Experiments

Data Size (GB) | RD_FILE_SIZE | RD_NUM_OF_FILES | WT_FILE_SIZE | WT_NUM_OF_FILES
10 | 100 | 100 | 100 | 100
20 | 100 | 200 | 100 | 200
50 | 100 | 500 | 100 | 500

Results and Evaluation

The first subsection compares the performance of the Standard and Data-Compute cluster configurations. In the second subsection we compare and evaluate how increasing the data size changes the performance of each cluster configuration. In both subsections the Enhanced DFSIO-read and Enhanced DFSIO-write parts are presented and discussed separately.

Comparing Different Cluster Configurations

Figure 3.11 depicts the normalized Enhanced DFSIO-read times, with Standard1 (3 WNs) achieving the best times for all test cases. Table 3.10 compares the cluster configurations utilizing the same number of VMs, whereas Table 3.11 compares configurations utilizing a different number of VMs. In the first case, Standard1 (3 WNs) performs up to 73% better than Data-Comp1 (2 CWNs & 1 DWN) because of the different data placement strategies. In Data-Comp1 the data is stored on a single data node and must be read in parallel by the two compute nodes, which is not the case in Standard1, where each node stores the data locally, avoiding any communication conflicts. In the second case, Standard2 (6 WNs) performs between 18-46% slower than Data-Comp3 (3 CWNs & 3 DWNs). Although each node in Standard2 stores a local copy of the data, the resources allocated per VM appear not to be sufficient to run both the TaskTracker and DataNode services, which is not the case in Data-Comp3.

Tab. 3.10.: DFSIO Read - Equal Number of VMs

Data Size (GB) | Diff. (%) Standard1/Data-Comp1 (3 VMs) | Diff. (%) Standard2/Data-Comp3 (6 VMs)
10 | +68 | -18
20 | +71 | -30
50 | +73 | -46

Fig. 3.11.: Normalized DFSIO Read Completion Times

Tab. 3.11.: DFSIO Read - Different Number of VMs

Data Size (GB) | Diff. (%) Data-Comp1/Data-Comp2 (3 VMs / 4 VMs) | Diff. (%) Data-Comp2/Data-Comp3 (4 VMs / 6 VMs)
10 | -104 | +3
20 | -99 | -15
50 | -106 | -39

In Table 3.11 we observe that Data-Comp2 (2 CWNs & 2 DWNs) completion times are two times faster than Data-Comp1 (2 CWNs & 1 DWN). On the other hand, Data-Comp3, which utilizes 3 data nodes, is up to 39% faster than Data-Comp2. This complies with our assumption that using more data nodes improves the read performance.

Fig. 3.12.: Normalized DFSIO Write Completion Times

Figure 3.12 illustrates the Enhanced DFSIO-write [Hua+10a] completion times for the five cluster configurations. Table 3.12 compares the cluster configurations utilizing the same number of VMs. In the first case, Standard1 (3 WNs) performs between 10-24% slower than Data-Comp1 (2 CWNs & 1 DWN). The reason for this is that Data-Comp1 utilizes only one data node and the HDFS pipeline writing process writes all three block copies locally on the node, which of course is against

Tab. 3.12.: DFSIO Write - Equal Number of VMs

Data Size (GB) | Diff. (%) Standard1/Data-Comp1 (3 VMs) | Diff. (%) Standard2/Data-Comp3 (6 VMs)
10 | -10 | +4
20 | -21 | -14
50 | -24 | -1

Tab. 3.13.: DFSIO Write - Different Number of VMs

Data Size (GB) | Diff. (%) Data-Comp1/Data-Comp3 (3 VMs / 6 VMs) | Diff. (%) Standard1/Data-Comp3 (3 VMs / 6 VMs)
10 | -4 | -15
20 | +13 | -6
50 | +19 | -1

the fault tolerance practices in Hadoop. In a similar way, Data-Comp3 (3 CWNs & 3 DWNs) achieves between 1-14% better times than Standard2 (6 WNs). Table 3.13 compares cluster configurations with a different number of VMs. Data-Comp1 (2 CWNs & 1 DWN) achieves up to 19% better times than Data-Comp3 (3 CWNs & 3 DWNs) because of the extra cost of writing to 3 data nodes (enough to guarantee the minimum data fault tolerance) instead of only one data node. Further observations show that although Data-Comp3 utilizes 6 VMs, it achieves up to 15% better times than Standard1, which utilizes only 3 VMs. However, this difference decreases from 15% to 1% with growing data sizes and may completely vanish for larger data sets.

Processing Different Data Sizes

Figure 3.13 shows the Enhanced DFSIO-read processing times (in seconds) for the different data sizes of the five tested cluster configurations. Shorter times indicate better performance and longer times indicate worse performance. Clearly, cluster configuration Standard1 (3 WNs) achieves the fastest times for all three data sizes compared to the other configurations. This is also observed in Figure 3.14, which depicts the throughput (MB per second) of the five configurations, where configuration Standard1 (3 WNs) achieves the highest throughput of 143-147 MB per second.

Fig. 3.13.: DFSIO Read Time (Seconds)

Fig. 3.14.: DFSIO Read Throughput (MB per second)

Table 3.14 and Table 3.15 summarize the processing times for the tested Standard and Data-Compute cluster configurations. Additionally, there is a column "Data ∆" representing the data increase in percent compared to the baseline data size, which is 10GB. For example, the ∆ between the baseline (10GB) and 20GB is +100% and for 50GB it is +400%. Also, there are multiple columns "Time ∆", one per cluster configuration, which indicate the time difference in percent compared to the time of Standard1 (3 WNs), which we use as the baseline configuration. For example, comparing the times for processing 10GB of the baseline Standard1 (3 WNs) with the Standard2 (6 WNs) configuration results in an 83.15% time difference, as shown in Table 3.14. This means that Standard2 needs 83.15% more time than the baseline Standard1 to read the 10GB data. On the contrary, negative time differences indicate faster completion times in comparison to the baseline.

Tab. 3.14.: DFSIO Standard Cluster Read Results

Data Size (GB) | Data ∆ (%) | Standard1 Baseline Read Time (Sec) | Standard2 Read Time (Sec) | Time ∆ (%)
10 | baseline | 89 | 163 | 83.15
20 | +100 | 157 | 303 | 92.99
50 | +400 | 363 | 680 | 87.33

Figure 3.15 illustrates the time differences in percent of all tested cluster configurations normalized with respect to Standard1 (3 WNs) as the baseline. We observe that all time differences are positive, which means that all configurations perform slower than the baseline configuration. The Data-Comp3 (3 CWNs & 3 DWNs) configuration has the smallest time difference, ranging between 28.10% and 55.06%, whereas the worst

Tab. 3.15.: DFSIO Data-Compute Cluster Read Results

Data Size (GB) | Data ∆ (%) | Data-Comp1 Read Time (Sec) | Time ∆ (%) | Data-Comp2 Read Time (Sec) | Time ∆ (%) | Data-Comp3 Read Time (Sec) | Time ∆ (%)
10 | baseline | 274 | 207.87 | 134 | 50.56 | 138 | 55.06
20 | +100 | 533 | 239.49 | 268 | 70.7 | 233 | 48.41
50 | +400 | 1328 | 265.84 | 645 | 77.69 | 465 | 28.1

performing configuration is Data-Comp1 (2 CWNs & 1 DWN) with time differences between 207.87% and 265.84%.

Fig. 3.15.: DFSIO Read Time Difference between Standard1 (Baseline) and all Other Configurations in %

Figure 3.16 illustrates how the different cluster configurations scale with the increasing data sets, normalized to the baseline configuration. We observe that all configurations scale almost linearly with the increase of the data sizes. Looking at the graph, we can clearly distinguish that Standard1 (3 WNs) is the fastest configuration, as its line lies much lower than those of all other configurations. On the contrary, Data-Comp1 (2 CWNs & 1 DWN) is the slowest configuration, as its data points are the highest ones for all three data sizes.

Figure 3.17 shows the Enhanced DFSIO-write processing times (in seconds) for the different data sizes of the five tested cluster configurations. Shorter times indicate better performance and longer times indicate worse performance. Looking more closely at Figure 3.17, we can identify that cluster configuration Data-Comp1 (2 CWNs & 1 DWN) achieves the fastest times for the 20GB and 50GB data sizes. This can also be observed in Figure 3.18, which depicts the throughput (MB per second) of the five configurations, where configuration Data-Comp1 (2 CWNs & 1 DWN) achieves the highest throughput of around 67-68 MB per second.

Fig. 3.16.: DFSIO Read Data Behavior of all Cluster Configurations normalized to Standard1

Fig. 3.17.: DFSIO Write Time (Seconds)

Fig. 3.18.: DFSIO Write Throughput (MB per second)

Table 3.16 and Table 3.17 summarize the processing times for the tested Standard and Data-Compute cluster configurations. Additionally, there is a column "Data ∆" representing the data increase in percent compared to the baseline data size, which is 10GB. For example, the ∆ between the baseline (10GB) and 20GB is +100% and for 50GB it is +400%. Also, there are multiple columns "Time ∆", one per cluster configuration, which indicate the time difference in percent compared to the time of Standard1 (3 WNs), which we use as the baseline configuration. For example, comparing the times for processing 10GB of the baseline Standard1 (3 WNs) with the Standard2 (6 WNs) configuration results in a -15.93% time difference. This means that Standard2 finishes in 15.93% less time than the baseline Standard1. On the contrary, positive time differences indicate slower completion times in comparison to the baseline.

3.1 Performance Evaluation of Virtualized Hadoop Clusters 61 Tab. 3.16.: DFSIO Standard Cluster Write Results

Data Size (GB) | Data ∆ (%) | Standard1 Baseline Write Time (Sec) | Standard2 Write Time (Sec) | Time ∆ (%)
10 | baseline | 226 | 190 | -15.93
20 | +100 | 372 | 400 | +7.53
50 | +400 | 953 | 952 | -0.1

Tab. 3.17.: DFSIO Data-Compute Cluster Write Results

Data Size (GB) | Data ∆ (%) | Data-Comp1 Write Time (Sec) | Time ∆ (%) | Data-Comp2 Write Time (Sec) | Time ∆ (%) | Data-Comp3 Write Time (Sec) | Time ∆ (%)
10 | baseline | 205 | -9.29 | 165 | -26.99 | 197 | -12.83
20 | +100 | 308 | -17.2 | 319 | -14.25 | 352 | -5.38
50 | +400 | 768 | -19.41 | 903 | -5.25 | 943 | -1.05

Figure 3.19 depicts the time differences in percent of all tested cluster configurations normalized with respect to Standard1 (3 WNs) as the baseline. We observe that all time differences, except for the 20GB experiment with the Standard2 configuration, are negative, which means that these configurations perform faster than the baseline configuration. Data-Comp1 (2 CWNs & 1 DWN) and Data-Comp2 (2 CWNs & 2 DWNs) achieve the largest time savings, with time differences ranging between -5.25% and -26.99%, making them the best performing cluster configurations.

Fig. 3.19.: DFSIO Write Time Difference between Standard1 (Baseline) and all Other Configurations in %

Figure 3.20 illustrates how the different cluster configurations scale with the increasing data sets, normalized with respect to the baseline configuration. In this case, we observe that all configurations scale almost linearly with the increase of the data sizes, although the time differences vary. Looking closely at the graph, we can distinguish that Standard1 (3 WNs) is the slowest configuration, as its line lies slightly higher than those of most of the other configurations. On the contrary,

Data-Comp1 (2 CWNs & 1 DWN) and Data-Comp2 (2 CWNs & 2 DWNs) are the fastest configurations, as their data points are the lowest ones for all three data sizes.

Fig. 3.20.: DFSIO Write Data Behavior of all Cluster Configurations normalized to Standard1

Overall, our experiments showed that the Enhanced DFSIO-read and Enhanced DFSIO-write workloads scale nearly linearly with the increase of the data size. We observed that Standard1 (3 WNs) achieves the fastest times for DFSIO-read, whereas Data-Comp1 (2 CWNs & 1 DWN) achieves the fastest DFSIO-write times.

3.1.6 Lessons Learned

Our experiments showed:

• Compute-intensive (i.e. CPU bound WordCount) workloads are more suitable for Standard Hadoop clusters. However, we also observed that adding more compute nodes to a Data-Compute cluster improves the performance of CPU bound applications (see Table 3.4).

• Read-intensive (i.e. read I/O bound DFSIO) workloads perform best when hosted on a Standard Hadoop cluster (Standard1, see Table 3.10). However, adding more data nodes to a Data-Compute Hadoop cluster improved the reading speed by up to 39% (e.g. Data-Comp2/Data-Comp3, see Table 3.11).

• Write-intensive (i.e. write I/O bound DFSIO) workloads were up to 15% faster on a Data-Compute Hadoop cluster than on a Standard Hadoop cluster (e.g. Standard2/Data-Comp3 and Standard1/Data-Comp3, see Table 3.12 and Table 3.13). Our experiments also showed that using fewer data nodes results in better write performance on a Data-Compute Hadoop cluster (e.g. Data-Comp1/Data-Comp3), reducing the overhead of data transfer.

In addition, it must be noted that Data-Compute cluster configurations are more advantageous with respect to node elasticity [Mag+13]. Therefore, the overhead for read- or compute-intensive workloads might be acceptable.

During the benchmarking process, we identified three important factors which should be taken into account when configuring a virtualized Hadoop cluster:

• Choosing the “right” cluster type (Standard or Data-Compute Hadoop cluster) that provides the best performance for the hosted Big Data workload is not a straightforward process. It requires very precise knowledge about the workload type, i.e. whether it is CPU intensive, I/O intensive or mixed, as indicated in Section 3.1.4.

• Determining the number of nodes for each node type (compute and data nodes) in a Data-Compute cluster is crucial for the performance and depends on the specific workload characteristics. The extra network overhead, caused by intensive data transfer between data and compute worker nodes, should be carefully considered, as also reported by Ye et al. [Ye+12].

• The overall number of virtual nodes running in a cluster configuration has a direct influence on the workload performance, which is also confirmed by [Li+13b; Bue13; Ye+12]. Therefore, it is crucial to choose the optimal number of virtual nodes in a cluster, as each additional VM causes extra overhead on the hypervisor. At the same time, we observed cases, e.g. Standard1/Standard2 and Data-Comp1/Data-Comp2, where clusters consisting of more VMs made better use of the underlying hardware resources.

Acknowledgements

This research is supported by the Big Data Lab at the Chair for Databases and Information Systems (DBIS) of the Goethe University Frankfurt. We would like to thank Alejandro Buchmann of Technical University Darmstadt, Nikolaos Korfiatis and Jeffrey Buell of VMware for their helpful comments and support.

3.2 Performance Evaluation of Enterprise Big Data Platforms with HiBench

Abstract

In this paper, we evaluate the performance of DataStax Enterprise (DSE) using the HiBench benchmark suite and compare it with the corresponding results of Cloudera's Distribution of Hadoop (CDH). Both systems, DSE and CDH, were stress tested using CPU-bound (WordCount), I/O-bound (Enhanced DFSIO) and mixed (HiveBench) workloads. The experimental results showed that DSE is better than CDH at writing files, whereas CDH is better than DSE at reading files. Additionally, for DSE the difference between read and write throughput is very minor, whereas for CDH the read throughput is much higher than the write throughput. The results we obtained show that the HiBench benchmark suite, developed specifically for Hadoop, can be successfully executed on top of DataStax Enterprise (DSE). The paper is based on the following publications:

• Todor Ivanov, Raik Niemann, Sead Izberovic, Marten Rosselli, Karsten Tolle, Roberto V. Zicari, Benchmarking DataStax Enterprise/Cassandra with Hi- Bench, Frankfurt Big Data Lab, Technical Report No. 2014-2, arXiv:1411.4044, [Iva+14a].

• Todor Ivanov, Raik Niemann, Sead Izberovic, Marten Rosselli, Karsten Tolle, Roberto V. Zicari, Performance Evaluation of Enterprise Big Data Platforms with HiBench, in Proceedings of the 9th IEEE International Conference on Big Data Science and Engineering (IEEE BigDataSE 2015), August 20-22, 2015, Helsinki, Finland, [Iva+15a].

Keywords: big data; benchmarking; performance evaluation; Hadoop; Cassandra.

3.2.1 Introduction

The emergence of new Big Data applications opens up several challenges in managing Big Data (e.g. due to data volume, variety, velocity, veracity and more) [Aba+14]. The number and variety of Big Data technologies addressing these challenges is steadily growing. A wide spectrum of new architectural approaches such as NoSQL [Cat10] and MapReduce-based [SLF13] systems have emerged to cope with the many Big Data characteristics. Currently, Apache Hadoop [Whi12] has become the de facto standard for processing and storage of large data sets. Multiple vendors have adopted and integrated it as a core part of their enterprise platforms (e.g. Cloudera, Hortonworks, IBM, Microsoft, DataStax, Teradata etc.). At the same time, there is a scarcity of skilled people [Aba+14] such as platform architects, data engineers and data scientists, who need to have a broad knowledge of the data lifecycle and system complexity and should have a deep understanding of each particular platform, which is hard to achieve in today's dynamic Big Data ecosystem. One possible way to address this problem is to evaluate the systems by performing extensive experiments with widely used benchmarks. This paper follows this approach and investigates the performance of DataStax Enterprise (DSE) [Ent15]

through the use of standard Hadoop workloads. In particular, we ran experiments with CPU and I/O-bound micro-benchmarks as well as OLAP-style analytical query workloads.

The performed experiments showed that DSE is capable of successfully executing Hadoop applications without the need to adapt them for the underlying Cassandra distributed storage system [LM10]. Thanks to the Cassandra File System (CFS) [Jak12], which is compatible with the Hadoop Distributed File System API, we were able to seamlessly run Hadoop stack applications on top of DSE. Our main contributions are:

• Defining a Benchmarking methodology for performance evaluation of Big Data platforms using the HiBench micro-benchmark suite [Hua+10b; Hua+10a].

• Adjusting HiBench to run on top of DataStax Enterprise.

• Evaluating the performance results of CPU-bound, I/O-bound and mixed workloads executed on both DataStax Enterprise and Cloudera’s Distribution of Hadoop.

• Applying different workloads than those of the Yahoo! Cloud Serving Bench- mark (see related work for details) to Cassandra in order to investigate how it handles Big Data and OLAP workloads with unstructured data.

This work is part of a series of benchmark experiments [Iva+14b; II15] conducted at the Frankfurt Big Data Lab. The paper is structured as follows: Section 3.2.2 gives some background information on DataStax Enterprise (DSE) and Cloudera’s Distribution of Hadoop. Section 3.2.3 gives an overview of related work. Section 3.2.4 describes the experimental setup and our benchmarking methodology. Section 3.2.5 presents and evaluates the experimental results. Finally, Section 3.2.6 presents the lessons learned.

3.2.2 Background

Cassandra

Apache Cassandra [LM10] is a widely used NoSQL storage system. It has a peer-to-peer distributed ring architecture which can scale to thousands of nodes communicating with each other over a gossip protocol. This makes it capable of storing large data sets, replicated between multiple nodes, with no single point of failure. Cassandra is a key-value store that supports a very simple data model with dynamic control over data layout and format [LM10]. The key is an index in a multi-dimensional map, which represents a Cassandra table, and the value is structured as a column family object. Cassandra has a flexible schema and comes with its own query language called the Cassandra Query Language (CQL) [CQL15].

DataStax Enterprise (DSE)

DataStax Enterprise (DSE) [Ent15] includes the production-certified version of Apache Cassandra with extended features such as in-memory computing capabilities, advanced security, automatic management services as well as analytics and enterprise search on top of the distributed data. DSE also includes the OpsCenter [Ent15] tool, provided for visual management and monitoring of DSE clusters. The Cassandra File System (CFS) [Jak12] is an HDFS-compatible file system that was built on top of Cassandra to enable running Hadoop applications in DSE without any modification. CFS is implemented as a keyspace with two column families. The inode column family replaces the HDFS NameNode daemon that tracks the metadata and block locations of each file. The HDFS DataNode daemon is replaced by the sblocks column family that stores the file blocks with the actual data. By doing this, the HDFS services are entirely substituted by CFS, removing the single point of failure in the Hadoop NameNode and providing Cassandra with support for large files.

Cloudera Hadoop Distribution (CDH)

Cloudera’s Distribution of Hadoop (CDH) [CDH15] is 100% Apache-licensed open source Hadoop distribution offered by Cloudera. It includes the core Apache Hadoop [Whi12] elements - Hadoop Distributed File System (HDFS) and MapReduce (YARN), as well as several additional projects from the Apache Hadoop Ecosystem. All components are tightly integrated to enable ease of use and managed by a central application - Cloudera Manager [CDH15].

3.2.3 Related Work

All related work we are aware of stressed Cassandra with workloads based on the Yahoo! Cloud Serving Benchmark (YCSB). YCSB [Coo+10; Pat+11] was developed to compare emerging cloud serving systems like Cassandra, HBase, MongoDB, Riak and many more, which do not support ACID transactions. YCSB provides a core package of 5 pre-defined workloads A-E, which simulate a cloud OLTP application.

[Rab+12b; Coo+10; Pat+11; KKR14] investigate the performance of Cassandra by stress testing it with the YCSB benchmark [Coo+10; Pat+11] and use YCSB to compare the reading and writing latencies of the systems as well as their scalability and elasticity capabilities.

Kuhlenkamp et al. [KKR14] classify the scalability and elasticity approaches of multiple related works and focus on performance measurements of Cassandra and HBase. It is important to note that HBase is a distributed key-value store modeled after Google's BigTable and running on top of Hadoop and HDFS. The authors of [KKR14] reproduce the experiments of Rabl et al. [Rab+12b] on a virtualized cloud infrastructure in Amazon EC2. Their results confirmed that both Cassandra and HBase scale nearly linearly, and that Cassandra is better than HBase in terms of read performance, whereas HBase is better than Cassandra in terms of write performance.

Another study by Dede et al. [Ded+13] evaluates the performance of Cassandra in conjunction with Hadoop, but again using YCSB. The authors compare Hadoop-native, Hadoop-Cassandra-FS (reads the input from Cassandra, processes the data in Hadoop and writes the output to a file system shared by the workers) and Hadoop-Cassandra-Cassandra (reads the input from Cassandra, processes the data in Hadoop and writes the output back to Cassandra). The results show that for CPU- and memory-intensive loads Hadoop-native performs better than Cassandra, whereas for write-heavy loads Hadoop-Cassandra-FS and Hadoop-Cassandra-Cassandra perform better.

3.2.4 Setup and Configuration

Hardware and software

The experiments presented in this paper were performed using a Fujitsu BX 620 S3 blade center. The blade center's nodes were uniformly set up with the hardware and software characteristics listed in Table 3.18 and Table 3.19. On every used blade center node, the two available hard disks were combined into a RAID-0 native disk array and mounted as one logical volume, resulting in about 280 GB of storage space. All unnecessary services of the operating system were turned off.

Tab. 3.18.: Hardware Characteristics of the used Blade Nodes

CPU | 2x Dual-core AMD Opteron 870 (2.0 GHz)
Main memory | 16 GB DDR2 registered
Mass memory | 2x Seagate ST3146854SS, 146 GB
Network adapter | Broadcom NetExtreme BCM5704S, 1 GBit/s transfer speed

Tab. 3.19.: Software Characteristics of the used Blade Nodes

Operating system | Ubuntu Server 12.04 LTS 64 bit
Java runtime environment | Oracle JRE 1.7.0.60-b19
Cassandra cluster | DataStax Enterprise (DSE) 4.0.2
Hadoop cluster | Cloudera Hadoop Distribution (CDH) 5.0.2
Benchmark suite | Intel HiBench version 2.2

Both DataStax Enterprise version 4.0.2 and Cloudera Hadoop Distribution version 5.0.2 were used in the experiments. In both cases, the latest stable releases were obtained from the platform vendors at the time of setting up the environment. Our goal was to keep the default system parameters, which are automatically set by the platform during the installation process.

DSE was configured and administered using the integrated OpsCenter [Ent15] tool. It takes as parameters the IP addresses of all nodes and automatically performs the cluster setup process by installing the necessary DSE packages and services. It is possible to add new nodes by specifying a pre-generated token value. This value determines the range of the dataset keys that the node will hold. The key

distribution among all incorporated nodes was rebalanced. Additionally, we changed the replication factor to 3, so that it matches that of CDH and guarantees data reliability and fast data access. Similar to DSE, CDH was installed and managed using the Cloudera Manager [CDH15], which offers tight integration with all Hadoop services. More details on the experimental environment can be found in our technical report [Iva+14a].

Benchmarking Methodology

The benchmark methodology for our experiments was designed using the HiBench benchmark suite [Hua+10c; Hua+10b], developed by Intel to stress test Hadoop systems. It consists of 10 different workloads: Micro Benchmarks (Sort, WordCount, TeraSort, Enhanced DFSIO), Web Search (Nutch Indexing, PageRank), Machine Learning (Bayesian Classification, K-means Clustering) and Analytical Queries (Hive Join, Hive Aggregation). Three benchmarks were used in our tests, with the goal of having representative workloads for CPU-bound (WordCount), I/O-bound (Enhanced DFSIO) and mixed CPU- and I/O-bound (HiveBench) behavior. In order to run the workloads on top of DataStax Enterprise, we had to slightly modify the shell scripts provided in the benchmark suite. The modified HiBench code is available on GitHub [Lab15].

Fig. 3.21.: Benchmarking Methodology Process Diagram

To ensure accurate performance measurements, each experiment was repeated 3 times and the average value was taken as the representative result. Additionally, we report the standard deviation between the measured values in order to show that the 3 repetitions yield a representative value. Another rule that we followed during the experiments was to leave approximately 25% of the total storage space assigned to each test cluster for temporary data. In other words, both cluster setups were instrumented to use 1476 GB of the maximum of 8 × 246 GB = 1968 GB.
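To illustrate how the representative values are derived, the following sketch computes the average time, the standard deviation and the resulting throughput from three run times; the three time values are placeholders, not measured results.

    # Three example run times in seconds (placeholders) and the input size in MB
    times="4010.5 4068.2 4125.9"
    data_mb=$((240 * 1024))

    echo "$times" | tr ' ' '\n' | awk -v size="$data_mb" '
      { sum += $1; sumsq += $1 * $1; n++ }
      END {
        avg = sum / n
        std = sqrt(sumsq / n - avg * avg)   # population standard deviation
        printf "avg time: %.2f s, stddev: %.2f s, throughput: %.2f MB/s\n", avg, std, size / avg
      }'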

Figure 3.21 briefly illustrates the different phases in our benchmarking methodology. In the initial Phase 1, all software components (OS, Java, DSE, CDH and HiBench)

are installed and configured. Next, in Phase 2, called Workload Prepare, all workload parameters are defined and the test data is generated. The generated data together with the defined parameters are then used as input to execute the workload in Phase 3. Each experiment was repeated 3 times to ensure the representativeness of the results, which means that the data generation from Phase 2 and the Workload Execution (Phase 3) were run 3 consecutive times. Before each workload experiment in the Workload Prepare phase (Phase 2), the existing data is deleted and new data is generated. In Phase 3, HiBench reports two types of results: Duration (in seconds) and Throughput (in MB per second). The Throughput is calculated by dividing the input data size by the Duration. These results are then analyzed in Phase 4, called Evaluation, and presented graphically in the next section. DSE and CDH present several differences, namely:

• DSE 4.0.2 uses Apache Cassandra 2.0.6 as a storage engine, which offers an HDFS-compatible interface through the Cassandra File System (CFS), whereas CDH 5.0.2 uses Apache HDFS 2.3.0.

• DSE 4.0.2's built-in version of MapReduce [Doc15] is based on Apache Hadoop version 1.0.4, which is much older than the MapReduce version 2.3.0 (called YARN) that is included in CDH 5.0.2. The new design of YARN [Vav+13] offers multiple improvements compared to the older model in terms of scalability, multi-tenancy, serviceability, security, reliability, flexible resource model and backward compatibility, to name a few.

According to DataStax [Dat15], it is possible to integrate a newer version of MapReduce, but no guidance on how to do it was provided at the time of writing this paper. Also, our goal was to install and test the platforms as black boxes, as the majority of users do.

3.2.5 Experimental Results

This section presents the experiments together with our evaluation. The results are shown in tables (Table 3.21, Table 3.22, Table 3.23, Table 3.26 and Table 3.27) with the most relevant table columns described below:

• Data size: size of the workload’s input data in GB

• Time: average time of three workload executions in seconds

• σ: standard deviation of the three execution times in seconds

• Data ∆: difference of the data size to the one of the baseline in percent

• Time ∆: difference of the execution time to the one of the baseline in percent

• CDH/DSE Time ∆: difference of the execution times of DSE and CDH plat- forms in percent

WordCount

WordCount is a MapReduce program which calculates the number of occurrences of each word in a text file. The input text data is generated by the RandomTextWriter program, which takes as a parameter the size of the text file to be generated. Two additional parameters are relevant: the number of mappers and the number of reducers.

Prior to performing the real experiments, we investigated the rules for defining optimal values for the numbers of mappers and reducers for a particular cluster configuration. The best practice [Sam12] is to configure 1.5 mapper and reducer tasks for each physical CPU core. In our setting, we have 4 physical CPU cores per node, which results in 6 tasks in total. In addition, the rule of thumb [Sam12] states that roughly two thirds of the slots should be allocated to map tasks and the remaining one third to reduce tasks. In our case, we have chosen 4 map and 2 reduce tasks. Therefore, the input data size was fixed to 30 GB per node and the number of mappers and reducers was chosen following the best practices. Four different configurations were tested on the DSE cluster for a total data size of 240 GB. The times (in seconds) for each configuration are shown in Table 3.20. We can clearly observe that the configuration with 4 map and 2 reduce tasks achieves the best time and confirms the best practices. The throughput for all of the experiments is almost identical, which can be explained by the fact that the WordCount workload is very CPU-intensive, with light disk and network usage. Therefore, all of our further WordCount experiments were configured with 4 map and 2 reduce tasks.
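The slot calculation implied by these rules of thumb can be written down explicitly; a sketch for one of our 4-core worker nodes:

    cores=4
    total_slots=$(( cores * 3 / 2 ))            # 1.5 tasks per physical core -> 6
    map_slots=$(( total_slots * 2 / 3 ))        # roughly two thirds for map tasks -> 4
    reduce_slots=$(( total_slots - map_slots )) # remaining third for reduce tasks -> 2
    echo "map slots: $map_slots, reduce slots: $reduce_slots"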

Having defined the number of mappers and reducers, we proceeded with the ex- ecution of the WordCount experiments. Tests were performed for three different input data sizes 240, 340 and 440 GB with both the DataStax Enterprise and the Cloudera Hadoop Distribution platforms. Figure 3.22 depicts the time (in seconds) and throughput (in MB per second) for the different data sets and systems. As expected, for both platforms increasing the data size results in a longer processing time. However, the throughput of both platforms remains constant because the workload is heavily CPU-bound.

Tab. 3.20.: WordCount Map/Reduce Experiments (240 GB)

Number of reducers | Number of mappers | Time (seconds) | Throughput (MB/second)
1 | 3 | 4149 | 59.23
2 | 4 | 4068.22 | 60.42
2 | 6 | 4097.79 | 59.97
4 | 12 | 4109.12 | 59.81

Another interesting observation is that for all three data sets CDH achieves on average 17% to 18% (column CDH/DSE Time ∆ in Table 3.21) faster execution times than DSE. Similarly, the throughput of CDH is around 20% to 22% higher than that of DSE. There could be several reasons for this. One reason can be the improved parallel processing capabilities of the newer YARN (MapReduce 2.0) included in

CDH. Another reason can be the better integration between YARN and HDFS in CDH compared to the one of MapReduce and CFS in DSE.

Fig. 3.22.: WordCount – Processing Different Data Sizes

Table 3.21 summarizes all experimental results of the workload. It is interesting to observe that increasing the data size by 100 GB (+42%) increases the workload execution time by around 42% for DSE (column DSE Time ∆) and 39% for CDH (column CDH Time ∆). Respectively, increasing the data from 240 to 440 GB (+83%) increases the processing time by around 84% for DSE and around 81% for CDH. This implies that both platforms scale nearly linearly. Figure 3.23 illustrates this, with DSE matching the exact linear scaling line in green and CDH lying slightly below the stepwise linear scaling line.

Tab. 3.21.: WordCount Results

Data Size (GB) | Data ∆ (%) | DSE Time (Sec) | DSE σ (Sec) | DSE Time ∆ (%) | CDH Time (Sec) | CDH σ (Sec) | CDH Time ∆ (%) | CDH/DSE Time ∆ (%)
240 | baseline | 4068.22 | 57.18 | baseline | 3392.21 | 17.46 | baseline | -16.62
340 | 41.67 | 5785.47 | 17.36 | 42.21 | 4726.6 | 9.04 | 39.34 | -18.3
440 | 83.33 | 7471.23 | 34.55 | 83.65 | 6126.84 | 36.46 | 80.62 | -17.99

In summary, our experiments showed that DSE and CDH are capable of running compute-intensive MapReduce applications, achieving stepwise linear performance with the growing data sizes.

Fig. 3.23.: WordCount – Data Scaling Behavior

Enhanced DFSIO

TestDFSIO [Whi12] is an HDFS benchmark included in all major Hadoop distributions. It is designed to stress test the storage I/O (read and write) capabilities of a Hadoop cluster. In this way performance bottlenecks in the network, hardware, operating system or cluster setup can be found and fixed. Enhanced DFSIO is an extension of the TestDFSIO benchmark developed specifically for HiBench. The benchmark consists of two parts: TestDFSIO-write and TestDFSIO-read. The Enhanced DFSIO benchmark takes four input configuration parameters. The first two define the number of files to be read or written. The other two parameters define the size of those files. The data size is the product of the file size and the number of files to be read or written. To process data sizes of 240 GB, 340 GB and 440 GB, the file size parameters were fixed to 400 MB and the parameters for the number of files to read and write were set to 615, 871 and 1127. The file size and number of files were chosen based on results presented in related work [Del13; Isl+12b]. The experiments for the three data sizes were executed on both the DSE and CDH systems. Note that the Enhanced DFSIO benchmark runs the write and read parts consecutively, which means that the files written by TestDFSIO-write are read by TestDFSIO-read.
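For reference, the underlying TestDFSIO workload is usually invoked as sketched below; HiBench's Enhanced DFSIO wraps this in its own scripts, and the jar file name is an assumption that varies between Hadoop distributions.

    # Write phase: 615 files of 400 MB each (roughly 240 GB in total) -- jar name assumed
    hadoop jar hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO \
      -write -nrFiles 615 -fileSize 400

    # Read phase: reads back the files produced by the write phase
    hadoop jar hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO \
      -read -nrFiles 615 -fileSize 400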

Figure 3.24 depicts the read time (in seconds) and throughput (in MB per second) for both systems. Lower values in the Time graphic represent faster completion times, while higher values in the Throughput graphic indicate better performance. Similar to WordCount, we observed that increasing the data size resulted in longer reading times for both systems. It is also interesting to see that DSE takes between 14% and 32% (column CDH/DSE Time ∆ in Table 3.22) more time to read the data compared to CDH. In the same way, the throughput of DSE is between 15% and 47% lower than that of CDH. As the workload is 100% I/O-bound, the reason for this discrepancy should lie in the different file system designs of the two platforms. The first obvious difference is that HDFS has a default block size

Fig. 3.24.: Enhanced DFSIO – Reading Different Data Sizes

(dfs.blocksize) of 128 MB, whereas the CFS default block size (fs.local.block.size) is 64 MB. The second important fact is that HDFS splits each file into multiple blocks depending on the size and writes or reads them sequentially [Shv+10]. On the contrary, CFS first splits each file into blocks and then each block is further split into sub-blocks and written as a column value [Ent15]. The default sub-block size (fs.local.subblock.size) is 2 MB.
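The HDFS block size mentioned above can be checked on a running CDH cluster, for example with the command below; the corresponding CFS parameters (fs.local.block.size, fs.local.subblock.size) are set in the DSE configuration files instead.

    # Prints the effective HDFS block size in bytes (134217728 bytes = 128 MB)
    hdfs getconf -confKey dfs.blocksize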

Another noticeable trend in Figure 3.24 is that the DSE read throughput declines with the increase of the data size. Similar behavior was described by Kuhlenkamp et al. [KKR14]. They observed that increasing the data load per node significantly decreases the read performance because of the many disk-resident lookups.

Figure 3.25 illustrates the write time (in seconds) and throughput (in MB per second) for both systems. In contrast to the read experiments, the write part of the workload shows that DSE is between 54% and 81% (column CDH/DSE Time ∆ in Table 3.23) faster in writing than CDH. Similarly DSE achieves between 35% and 45% higher throughput compared to the CDH platform.

The reason for the better writing performance lies in the architecture of Cassandra. It morphs all writes to disk into sequential writes [LM10] in an append-only fashion, which ensures the maximum write throughput. In addition to that, an in-memory structure and an index are maintained for fast data access. A process is executed periodically to compact all related data files into one big file. In the case of HDFS, a file is considered written only after all of its three (default replication factor) copies are safely written and acknowledged. This is done by a pipeline process which streams the data between the nodes in order to achieve better throughput [Shv+10]. This process is much more complex and time-consuming than the one implemented in Cassandra.

Fig. 3.25.: Enhanced DFSIO – Writing Different Data Sizes

Tab. 3.22.: Enhanced DFSIO Read Results

Data Size (GB) | Data ∆ (%) | DSE Time (Sec) | DSE σ (Sec) | DSE Time ∆ (%) | CDH Time (Sec) | CDH σ (Sec) | CDH Time ∆ (%) | CDH/DSE Time ∆ (%)
240 | baseline | 915.61 | 95.18 | baseline | 790.8 | 13.28 | baseline | -13.63
340 | 41.67 | 1405.67 | 20.16 | 53.52 | 1085.78 | 21.51 | 37.3 | -22.76
440 | 83.33 | 2050.84 | 147.22 | 123.99 | 1386.55 | 15.85 | 75.33 | -32.39

Table 3.22 summarizes the experimental results for the benchmark's read part. It is interesting to mention that increasing the data size by 100 GB (+42%) increases the reading time by around 54% for DSE (column DSE Time ∆) and around 37% for CDH (column CDH Time ∆). Respectively, increasing the data size from 240 to 440 GB (+83%) increases the reading time by 124% for DSE and around 75% for CDH. The data scaling behavior of both platforms is visualized in Figure 3.26.

Likewise, Table 3.23 summarizes all results of the write experiments. Interestingly, as shown in Figure 3.26, the write Data ∆ to Time ∆ relation (Table 3.23) for both platforms is very similar to the read Data ∆ to Time ∆ relation (Table 3.22). Increasing the data size by 100 GB (+42%) increases the write time by 52% for DSE (column DSE Time ∆) and around 41% for CDH (column CDH Time ∆). Respectively, increasing the data size from 240 to 440 GB (+83%) increases the writing time by around 117% for DSE and around 85% for CDH.

Tab. 3.23.: Enhanced DFSIO Write Results

Data Size (GB) | Data ∆ (%) | DSE Time (Sec) | DSE σ (Sec) | DSE Time ∆ (%) | CDH Time (Sec) | CDH σ (Sec) | CDH Time ∆ (%) | CDH/DSE Time ∆ (%)
240 | baseline | 973.9 | 75 | baseline | 1760.24 | 4.39 | baseline | 80.74
340 | 41.67 | 1477.35 | 135.74 | 51.69 | 2490.22 | 31.33 | 41.47 | 68.56
440 | 83.33 | 2110.06 | 121.88 | 116.66 | 3247.75 | 28.5 | 84.51 | 53.92

Fig. 3.26.: Enhanced DFSIO – Data Scaling Behavior

Our next goal was to compare the difference between read and write execution times and throughput for both systems. Table 3.24 reports the changes in the DSE and CDH read to write ratios for each data set.

Tab. 3.24.: Enhanced DFSIO Read/Write ∆

Data Size (GB) | DSE Read/Write ∆ (%) | CDH Read/Write ∆ (%)
240 | 6.37 | 122.59
340 | 5.1 | 129.35
440 | 2.89 | 134.23

For the DSE platform, we can observe that the difference between reading and writing times is very small (column DSE Read/Write ∆). Comparing the times for 240 GB, we see that the difference is around 6% and gradually decreases to around 3% for the 440 GB data size. In related work, Dede et al. [Ded+13] report similar results in their experiments with an 8-node Cassandra cluster, where reading improved drastically when increasing the number of nodes. At the same time,

increasing the data size from 16 to 32 million records has also improved the reading performance, thus decreasing the gap between reading and writing.

In contrast, the difference between the reading and writing times for the CDH platform is much higher. For all tested data sizes the writing is at least 2.2 times slower than the reading, both in time and throughput (column CDH Read/Write ∆), and the difference grows further with the increase of the data size. Similar behavior is observed in the results presented by Nicholas Wakou [Del13] (around 2.5 times slower writing times) and Islam et al. [Isl+12b].

In summary, our experiments showed that both DSE and CDH platforms provide good read and write capabilities while increasing the data sets. For read-intensive workloads CDH performs better, whereas for write-intensive workloads DSE is much faster.

HiveBench

The OLAP-style analytical queries, called HiveBench, are adapted from Pavlo et al. [Pav+09b] and have the goal to test the performance of Hive [Thu+10] running on top of MapReduce. The workload consists of two queries (Join and Aggregation), which are implemented using two tables, Rankings (default size of 1 GB) and UserVisits (default size of 20 GB). The first query joins the two tables and stores the result in a temporary table. The second query aggregates an attribute from the UserVisits table.
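The exact HiveBench HiveQL statements are not reproduced here; the sketch below only approximates their shape, using table and column names from the Pavlo et al. schema (sourceIP, pageURL, pageRank, destURL, adRevenue), which are assumptions rather than the literal HiBench code.

    # Approximate shape of the join query: join Rankings and UserVisits and
    # materialize the result into a temporary table
    hive -e "
    CREATE TABLE rankings_uservisits_join AS
    SELECT uv.sourceIP, r.pageRank, uv.adRevenue
    FROM rankings r JOIN uservisits uv ON (r.pageURL = uv.destURL);
    "

    # Approximate shape of the aggregation query over the UserVisits table
    hive -e "
    SELECT sourceIP, SUM(adRevenue)
    FROM uservisits
    GROUP BY sourceIP;
    "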

The data generator of HiveBench takes as input four parameters. The first two are the number of map and reduce tasks. The other two parameters set the number of rows for the two tables (Rankings and UserVisits), which defines the size of each table. Table 3.25 summarizes the input parameters for the three tested data sizes (240, 340 and 440 GB).

Tab. 3.25.: HiveBench Data Generation Parameters

Data Size (GB) | PAGES | USERVISITS | Number of maps | Number of reducers
240 | 132 000 000 | 1 100 000 000 | 220 | 110
340 | 204 000 000 | 1 700 000 000 | 340 | 170
440 | 264 000 000 | 2 200 000 000 | 440 | 220

Tab. 3.26.: HiveBench Aggregation Results

Data Size (GB) | Data ∆ (%) | Processed Data (GB) | DSE Time (Sec) | DSE σ (Sec) | DSE Time ∆ (%)
240 | baseline | 180.2 | 1837.43 | 5.52 | baseline
340 | 41.67 | 278.5 | 2699.84 | 10.81 | 46.94
440 | 83.33 | 360.4 | 3471.84 | 20.24 | 88.95

Table 3.26 summarizes the results of the hive-aggregation query for DSE. Unfortunately, the CDH results obtained from HiBench were not accurate and not comparable with those of DSE. The reason for this was that at the time when the experiments were done, the Hadoop commands used to gather the results were deprecated and not supported in the YARN version of MapReduce. For DSE, we observed a near-linear increase of the hive-aggregation times with the increase of the data sets, as illustrated in Figure 3.28. In particular, increasing the data size by 100 GB (+42%) increases the hive-aggregation time by around 47% (column DSE Time ∆). Similarly, expanding the data size from 240 to 440 GB (+83%) increased the processing time by around 89%.

Figure 3.27 depicts the times (in seconds) and throughput (in MB per second) of the hive-join query for both the DSE and CDH systems. Lower values in the Time graphic represent faster completion times, while higher values in the Throughput graphic indicate better performance. It is interesting to observe that for the first case (240 GB), DSE is around 6% (column CDH/DSE Time ∆ in Table 3.27) faster than CDH, both in time and throughput. However, DSE is 4% slower than CDH for 340 GB and respectively 7% slower than CDH for 440 GB, both in time and throughput. The main reason for this result is that the hive-join query is CPU-intensive and reads the entire UserVisits and Rankings tables.

As observed in the WordCount and Enhanced DFSIO experiments, CDH performs better for CPU-bound and read-intensive workloads, which also make up the bigger part of the mixed hive-join workload. Table 3.27 summarizes the HiveBench join results for both platforms. We can observe that increasing the data size by 100 GB (+42%) increases the hive-join time by around 42% for DSE (column DSE Time ∆) and around 28% for CDH (column CDH Time ∆). Respectively, increasing the data size from 240 to 440 GB (+83%) increased the processing time by 78% for DSE and 56% for CDH. The numbers show that DSE scales almost linearly, within a 5% range, with the increase of the data sets, as shown in Figure 3.28. On the contrary, the CDH performance lies clearly below the data scaling line (Data ∆).

In general, our experiments showed that analytical queries can be successfully run on top of DSE. It performs slightly slower (around 6%) than CDH for the hive-join workload.

Tab. 3.27.: HiveBench Join Results

Data Size (GB) | Data ∆ (%) | Processed Data (GB) | DSE Time (Sec) | DSE σ (Sec) | DSE Time ∆ (%) | CDH Time (Sec) | CDH σ (Sec) | CDH Time ∆ (%) | CDH/DSE Time ∆ (%)
240 | baseline | 188.68 | 1486.23 | 25.56 | baseline | 1580.35 | 14.17 | baseline | 6.33
340 | +41.67 | 291.6 | 2106.86 | 10.75 | 41.76 | 2027.28 | 9.49 | 28.28 | -3.78
440 | +83.33 | 377.35 | 2646.29 | 3.01 | 78.05 | 2467.45 | 5.39 | 56.13 | -6.76

Fig. 3.27.: HiveBench Join for Different Data Sizes

Fig. 3.28.: HiveBench – Data Scaling Behavior

3.2.6 Lessons Learned

In this paper we presented results showing that the HiBench benchmark suite, developed specifically for Hadoop, can be successfully executed on top of DataStax Enterprise (DSE). Our experiments stress tested both the DSE and CDH platforms by performing multiple runs with CPU-bound, I/O-bound and analytical MapReduce workloads. Our results showed:

• For CPU-intensive workloads (WordCount) both DSE and CDH scale nearly linearly with the increase of the data size, but DSE performs up to 18% slower than CDH.

• For read-intensive workloads (Enhanced DFSIO-read) DSE performs up to 32% slower than CDH.

• For write-intensive workloads (Enhanced DFSIO-write) DSE achieves up to 81% faster times than CDH.

• For DSE the read and write throughput difference varies only slightly, between 2% and 6%, whereas for CDH the read throughput is at least 2.2 times higher than the write throughput.

• DSE was successfully tested with HiveBench (join and aggregation queries), representing mixed (CPU and I/O-bound) analytical workload, for 240 to 440 GB data sets.

• With respect to HiveBench-join DSE is around 6% (both in time and through- put) faster than CDH for 240 GB data set. However, for 340 and 440 GB, DSE is around 4% to 7% slower than CDH, both in time and throughput.

Our results also confirm the findings of related work based on YCSB workloads.

Acknowledgements

This research is supported by the Frankfurt Big Data Lab (Chair for Databases and Information Systems - DBIS) at the Goethe University Frankfurt, the Institute for Information Systems (IISYS) at Hof University of Applied Sciences and Accenture Germany.

3.3 Evaluating Hadoop Clusters with TPCx-HS

Abstract

In the era of Big Data, with increasingly growing data sizes, system complexity and variety of components, it is very hard to evaluate and compare the functionalities of the existing platforms. There is an urgent need for a standard Big Data benchmark. The newly introduced TPCx-HS benchmark tries to fulfill this need by offering a complete test kit which simulates heavily network- and I/O-intensive scenarios. In this paper we evaluate the benchmark kit by comparing two Hadoop cluster setups using shared and dedicated networks. As expected, the analysis of our results showed that the Hadoop cluster using a dedicated network is multiple times faster than the same setup using a shared network. Additionally, based on the TPCx-HS price-performance metric, we were able to relate the performance gains to the extra costs for the dedicated network. Overall, the TPCx-HS benchmark can be successfully used to measure the influence of the network setup on Hadoop cluster performance. The paper is based on the following publication:

• Todor Ivanov and Sead Izberovic, Evaluating Hadoop Clusters with TPCx- HS, Frankfurt Big Data Lab, Technical Report No.2015-1, arXiv:1509.03486, [II15].

3.3.1 Introduction

The growing complexity and variety of Big Data platforms makes it both difficult and time-consuming for all system users to properly set up and operate the systems. Another challenge is to compare the platforms in order to choose the most appropriate one for a particular application. All these factors motivate the need for a standardized Big Data benchmark that can help the users in the process of platform evaluation. Just recently, TPCx-HS [TPCb; Nam+14] has been released as the first standardized Big Data benchmark designed to stress test a Hadoop cluster.

The goal of this study is to evaluate and compare how the network setup influences the performance of a Hadoop cluster. In particular, experiments were performed using shared and dedicated 1Gbit networks utilized by the same Cloudera Hadoop Distribution (CDH) cluster setup. The TPCx-HS benchmark, which is very network-intensive, was used to stress test and compare both cluster setups. All the presented results are obtained by using the officially available version [TPCb] of the benchmark, but they are not comparable with the officially reported results and are meant as an experimental evaluation, not audited by any external organization. As expected, the dedicated 1Gbit network setup performed much faster than the shared 1Gbit setup. However, what was surprising is the negligible price difference between both cluster setups, which pays off with a multifold performance return.

The rest of the report is structured as follows: Section 3.3.2 provides a brief description of the technologies involved in our study. An overview of the hardware and software setup used for the experiments is given in Section 3.3.3. A brief summary of the TPCx-HS benchmark is presented in Section 3.3.4. The performed experiments

together with the evaluation of the results are presented in Section 3.3.5. Section 3.3.6 depicts the resource utilization of the cluster during the benchmark execution. Finally, Section 3.3.7 concludes with lessons learned.

3.3.2 Background

Big Data has emerged as a new term not only in IT, but also in numerous other industries such as healthcare, manufacturing, transportation, retail and public sector administration [Man+11; Jag+14], where it quickly became relevant. There is still no single definition which adequately describes all Big Data aspects [Hu+14], but the “V” characteristics (Volume, Variety, Velocity, Veracity and more) are among the most widely used ones. Exactly these new Big Data characteristics challenge the capabilities of the traditional data management and analytical systems [Hu+14; IKZ13]. These challenges also motivate researchers and industry to develop new types of systems such as Hadoop and NoSQL databases [Cat10].

Apache Hadoop [18d] is a software framework for distributed storing and processing of large data sets across clusters of computers using the map and reduce programming model. The architecture allows scaling up from a single server to thousands of machines. At the same time Hadoop delivers high availability by detecting and handling failures at the application layer. The use of data replication guarantees data reliability and fast access. The core Hadoop components are the Hadoop Distributed File System (HDFS) [Bor07; Shv+10] and the MapReduce framework [DG08]. HDFS has a master/slave architecture with a NameNode as a master and multiple DataNodes as slaves. The NameNode is responsible for storing and managing all file structures, metadata, transactional operations and logs of the file system. The DataNodes store the actual data in the form of files. Each file is split into blocks of a preconfigured size. Every block is copied and stored on multiple DataNodes. The number of block copies depends on the Replication Factor.

MapReduce is a software framework that provides general programming interfaces for writing applications that process vast amounts of data in parallel, using a distributed file system running on the cluster nodes. The MapReduce unit of work is called a job and consists of input data and a MapReduce program. Each job is divided into map and reduce tasks. A map task takes a split, which is a part of the input data, and processes it according to the user-defined map function of the MapReduce program. A reduce task gathers the output data of the map tasks and merges it according to the user-defined reduce function. The number of reducers is specified by the user and does not depend on the input splits or the number of map tasks. The parallel application execution is achieved by running map tasks on each node to process the local data and then sending the results to reduce tasks which produce the final output.

Hadoop implements the MapReduce model by using two types of processes – JobTracker and TaskTracker. The JobTracker coordinates all jobs in Hadoop and schedules tasks to the TaskTrackers on every cluster node. The TaskTracker runs tasks assigned by the JobTracker. Multiple other applications were developed on top of the Hadoop core components, also known as the Hadoop ecosystem, to make it easier to use and applicable to a variety of industries. Examples of such applications are

Hive [Thu+10], Pig [Gat+09], Mahout [18g], HBase [Geo11], Sqoop [TC13] and many more.

YARN (Yet Another Resource Negotiator) [14a; Vav+13] is the next generation Apache Hadoop platform, which introduces a new architecture by decoupling the programming model from the resource management infrastructure and delegating many scheduling-related functions to per-application components. This new design [Vav+13] offers some improvements over the older platform: (1) Scalability; (2) Multi-tenancy; (3) Serviceability; (4) Locality awareness; (5) High Cluster Utilization; (6) Reliability/Availability; (7) Secure and auditable operation; (8) Support for programming model diversity; (9) Flexible Resource Model; and (10) Backward compatibility.

The major difference is that the functionality of the JobTracker is split into two new daemons – ResourceManager (RM) and ApplicationMaster (AM). The RM is a global service, managing all the resources and jobs in the platform. It consists of a scheduler and an ApplicationManager. The scheduler is responsible for the allocation of resources to the various running applications based on their resource requirements. The ApplicationManager is responsible for accepting job submissions and negotiating resources from the scheduler. The NodeManager (NM) agent runs on each worker node. It is responsible for the allocation and monitoring of node resources (CPU, memory, disk and network) and reports back to the ResourceManager (scheduler). An instance of the ApplicationMaster runs per application and negotiates the appropriate resource containers from the scheduler. It is important to mention that the new MapReduce 2.0 maintains API compatibility with the older stable versions of Hadoop and therefore MapReduce jobs can run unchanged.

Cloudera Hadoop Distribution (CDH) [CDH15; 14b] is a 100% Apache-licensed open-source Hadoop distribution offered by Cloudera. It includes the core Apache Hadoop elements – the Hadoop Distributed File System (HDFS) and MapReduce (YARN) – as well as several additional projects from the Apache Hadoop Ecosystem. All components are tightly integrated to enable ease of use and are managed by a central application, the Cloudera Manager [14c].

3.3.3 Experimental Setup

Hardware

The experiments were performed on a cluster consisting of 4 nodes connected directly through 1GBit Netgear switch, as shown on Figure 3.29. All 4 nodes are Dell PowerEdge T420 servers. The master node is equipped with 2x Intel Xeon E5-2420 (1.9GHz) CPUs each with 6 cores, 32GB of RAM and 1TB (SATA, 3.5 in, 7.2K RPM, 64MB Cache) hard drive. The worker nodes are equipped with 1x Intel Xeon E5-2420 (2.20GHz) CPU with 6 cores, 32GB of RAM and 4x 1TB (SATA, 3.5 in, 7.2K RPM, 64MB Cache) hard drives. More detailed specification of the node servers is provided in the Appendix B (Table B.1 and Table B.2).

Fig. 3.29.: Cluster Setup

Tab. 3.28.: Summary of Total System Resources

Total Nodes: 4 x Dell PowerEdge T420
Total Processors/Cores/Threads: 5 CPUs / 30 Cores / 60 Threads
Total Memory: 128 GB
Total Number of Disks: 13 x 1TB, SATA, 3.5 in, 7.2K RPM, 64MB Cache
Total Storage Capacity: 13 TB
Network: 1 GBit Ethernet

Table 3.28 summarizes the total cluster resources that are used in the calculation of the benchmark ratios in the next sections.

Software

This section describes the software setup of the cluster. The exact software versions that were used are listed in Table 3.29. Ubuntu Server LTS was installed on all 4 nodes, allocating the entire first disk. The number of open files per user was changed from the default value of 1024 to 65000, as suggested by the TPCx-HS benchmark and the Cloudera guidelines [14d]. Additionally, the OS swappiness option was turned permanently off (vm.swappiness = 0). The remaining three disks on each worker node were formatted as ext4 partitions and permanently mounted with the options noatime and nodiratime. Then the partitions were configured to be used by HDFS through the Cloudera Manager. Each 1TB disk provides in total 916.8GB of effective HDFS space, which means that all three workers (9 x 916.8GB = 8251.2GB = 8.0578TB) have in total around 8TB of effective HDFS space.
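The OS-level adjustments described above can be applied roughly as follows; the device name, mount point and limits file are illustrative and depend on the concrete installation.

    # Raise the per-user open-files limit to 65000 (value suggested by the TPCx-HS and Cloudera guidelines)
    echo "* - nofile 65000" >> /etc/security/limits.conf

    # Turn swapping off permanently
    echo "vm.swappiness = 0" >> /etc/sysctl.conf
    sysctl -p

    # Format one of the HDFS data disks and mount it with noatime/nodiratime -- device and path assumed
    mkfs.ext4 /dev/sdb1
    mkdir -p /data/2
    mount -o noatime,nodiratime /dev/sdb1 /data/2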

Cloudera CDH 5.2, with default configurations, was used for all experiments. Table 3.30 summarizes the software services running on each node. Due to the resource limitation (only 3 worker nodes) of our experimental setup, the cluster was configured to work with a replication factor of 2. This means that our cluster can store at most 4TB of data.

Tab. 3.29.: Software Stack of the System under Test

Software | Version
Ubuntu Server 64 Bit | 14.04.1 LTS, Trusty Tahr, Linux 3.13.0-32-generic
Java (TM) SE Runtime Environment | 1.6.0_31-b04, 1.7.0_72-b14
Java HotSpot (TM) 64-Bit Server VM | 20.6-b01, mixed mode; 24.72-b04, mixed mode
OpenJDK Runtime Environment | 7u71-2.5.3-0ubuntu0.14.04.1
OpenJDK 64-Bit Server VM | 24.65-b04, mixed mode
Cloudera Hadoop Distribution | 5.2.0-1.cdh5.2.0.p0.36
TPCx-HS Kit | 1.1.2

Tab. 3.30.: Software Services per Node

Server | Disk Drive | Software Services
Master Node | Disk 1 / sda1 | Operating System, Root, Swap, Cloudera Manager Services, NameNode, Secondary NameNode, Hive Metastore, HiveServer2, Oozie Server, Spark History Server, Sqoop 2 Server, YARN Job History Server, Resource Manager, Zookeeper Server
Worker Nodes 1-3 | Disk 1 / sda1 | Operating System, Root, Swap, DataNode, YARN NodeManager
Worker Nodes 1-3 | Disk 2 / sdb1 | DataNode
Worker Nodes 1-3 | Disk 3 / sdc1 | DataNode
Worker Nodes 1-3 | Disk 4 / sdd1 | DataNode

Network Setups

The initial cluster setup was using the shared 1GBit network available in our lab. However, as expected, it turned out that it does not provide sufficient network speed for network-intensive cluster applications. It was also hard to estimate the actual available bandwidth of the shared network, as it is used by multiple workstation machines, which utilize it in an unpredictable manner. Therefore, it was clear that a dedicated network should be set up for our cluster. This was achieved by using a simple 1GBit commodity switch (Netgear GS108 GE, 8-Port), which connected all four nodes directly in a dedicated 1GBit network. To validate that our cluster setup was properly installed, the network speed was measured using the standard network tool iperf [18t]. Using the provided instructions [18u; 18x], we obtained multiple measurements for our dedicated 1GBit network between the NameNode and two of our DataNodes, reported in Table 3.31. The iperf server was started by

executing the “$iperf -s” command on the NameNode and then running the iperf client twice using the “$iperf -client serverhostname -time 30 -interval 5 -parallel 1 -dualtest” command on the DataNodes. The iperf numbers show a very stable data transfer of around 930 Mbits per second.
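In runnable form, the measurement corresponds roughly to the following commands (iperf 2 short-form options; "namenode" is a placeholder host name):

    # On the NameNode: start the iperf server
    iperf -s

    # On each DataNode: 30-second bidirectional (dual) test, one parallel stream, reporting every 5 seconds
    iperf -c namenode -t 30 -i 5 -P 1 -d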

Tab. 3.31.: Network Speed

Run | Server | Client | Time (sec) | Interval | Parallel | Type | Transfer 1 (GBs) | Speed 1 (Mbs/sec) | Transfer 2 (GBs) | Speed 2 (Mbs/sec)
1 | NameNode | DataNode 1 | 30 | 5 | 1 | dualtest | 3.24 | 929 | 3.25 | 930
2 | NameNode | DataNode 1 | 30 | 5 | 1 | dualtest | 3.24 | 928 | 3.25 | 931
1 | NameNode | DataNode 2 | 30 | 5 | 1 | dualtest | 3.25 | 930 | 3.25 | 931
2 | NameNode | DataNode 2 | 30 | 5 | 1 | dualtest | 3.24 | 928 | 3.25 | 931

The next step is to test the different cluster setups using a network-intensive Big Data benchmark like TPCx-HS in order to get a better idea of the implications of using a shared network versus a dedicated one.

3.3.4 Benchmarking Methodology

This section presents the TPCx-HS benchmark, its methodology and some of its major features as described in the current specification (version 1.3.0 from February 19, 2015) [TPCb; Nam+14].

TPCx-HS was released in July 2014 as the industry's first standard benchmark for Big Data systems. It stresses both the hardware and software components including the Hadoop run-time stack, the Hadoop File System and the MapReduce layers. The benchmark is based on the TeraSort workload [Apa15a], which is part of the Apache Hadoop distribution. Similarly, it consists of four modules: HSGen, HSDataCheck, HSSort and HSValidate. HSGen is a program that generates the data for a particular Scale Factor (see Clause 4.1 of the TPCx-HS specification) and is based on TeraGen, which uses a random data generator. HSDataCheck is a program that checks the compliance of the dataset and its replication. HSSort is a program, based on TeraSort, which sorts the data into a total order. Finally, HSValidate is a program, based on TeraValidate, which validates that the output is correctly sorted.
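Because these modules are based on TeraGen, TeraSort and TeraValidate, their behavior can be approximated with the stock Hadoop example programs, as sketched below for 100 GB (1 billion 100-byte records). The jar name is an assumption that varies by distribution; the actual benchmark is driven by the TPCx-HS-master.sh script instead.

    # Generate 1 billion 100-byte records (~100 GB) -- corresponds to HSGen
    hadoop jar hadoop-mapreduce-examples.jar teragen 1000000000 /benchmark/input

    # Sort the generated data into total order -- corresponds to HSSort
    hadoop jar hadoop-mapreduce-examples.jar terasort /benchmark/input /benchmark/output

    # Validate that the output is globally sorted -- corresponds to HSValidate
    hadoop jar hadoop-mapreduce-examples.jar teravalidate /benchmark/output /benchmark/report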

A valid benchmark execution consists of five separate phases which have to be run sequentially to avoid any phase overlapping, as depicted in Figure 3.30. Additionally,

Table 3.32 provides an exact description of each of the execution phases. The benchmark is started by the TPCx-HS-master.sh script and consists of two consecutive runs, Run1 and Run2, as shown in Figure 3.30. No activities except file system cleanup are allowed between Run1 and Run2. The completion times of each phase/module (HSGen, HSSort and HSValidate) except HSDataCheck are currently reported. An important requirement of the benchmark is to maintain 3-way data replication throughout the entire experiment. In our case this criterion was not fulfilled because of the limited resources of our cluster (only 3 worker nodes). All of our experiments were performed with 2-way data replication.

The benchmark reports the total elapsed time (T) in seconds for both runs. This time is used for the calculation of the TPCx-HS Performance Metric, also abbreviated as HSph@SF. The run that takes more time and results in the lower TPCx-HS Performance Metric is defined as the performance run. On the contrary, the run that takes less time and results in the higher TPCx-HS Performance Metric is defined as the repeatability run. The reported performance metric of the benchmark is the TPCx-HS Performance Metric of the performance run.

Fig. 3.30.: TPCx-HS Execution Phases (version 1.3.0 from February 19, 2015) [TPCb]

The Scale Factor defines the size of the dataset, which is generated by HSGen and used for the benchmark experiments. In TPCx-HS, it follows a stepped size model. Table 3.33 summarizes the supported Scale Factors, together with the corresponding data sizes and number of records. The last column indicates the argument with which to start the TPCx-HS-master script.

Tab. 3.32.: TPCx-HS Phases

Phase | Description as provided in the TPCx-HS specification (version 1.3.0 from February 19, 2015)
1 | “Generation of input data via HSGen. The data generated must be replicated 3-ways and written on a Durable Medium.”
2 | “Dataset (See Clause 4) verification via HSDataCheck. The program is to verify the cardinality, size and replication factor of the generated data. If the HSDataCheck program reports failure then the run is considered invalid.”
3 | “Running the sort using HSSort on the input data. This phase samples the input data and sorts the data. The sorted data must be replicated 3-ways and written on a Durable Medium.”
4 | “Dataset (See Clause 4) verification via HSDataCheck. The program is to verify the cardinality, size and replication factor of the sorted data. If the HSDataCheck program reports failure then the run is considered invalid.”
5 | “Validating the sorted output data via HSValidate. HSValidate validates the sorted data. This phase is not part of the primary metric but reported in the Full Disclosure Report. If the HSValidate program reports that the HSSort did not generate the correct sort order, then the run is considered invalid.”

Tab. 3.33.: TPCx-HS Scale Factors

Data Size | Scale Factor (SF) | Number of Records | Option to Start Run
100 GB | 0.1 | 1 Billion | ./TPCx-HS-master.sh -g 1
300 GB | 0.3 | 3 Billion | ./TPCx-HS-master.sh -g 2
1 TB | 1 | 10 Billion | ./TPCx-HS-master.sh -g 3
3 TB | 3 | 30 Billion | ./TPCx-HS-master.sh -g 4
10 TB | 10 | 100 Billion | ./TPCx-HS-master.sh -g 5
30 TB | 30 | 300 Billion | ./TPCx-HS-master.sh -g 6
100 TB | 100 | 1000 Billion | ./TPCx-HS-master.sh -g 7
300 TB | 300 | 3000 Billion | ./TPCx-HS-master.sh -g 8
1 PB | 1000 | 10 000 Billion | ./TPCx-HS-master.sh -g 9

The TPCx-HS specification defines three major metrics:

• Performance metric (HSph@SF)

• Price-performance metric ($/HSph@SF)

• Power per performance metric (Watts/HSph@SF)

The performance metric (HSph@SF) represents the effective sort throughput of the benchmark and is defined as:

HSph@SF = SF / (T / 3600)    (3.3)

where SF is the Scale Factor (see Clause 4.1 of the TPCx-HS specification) and T is the total elapsed time in seconds for the performance run (3600 seconds equal 1 hour).

The price-performance metric ($/HSph@SF) is defined and calculated as follows:

$/HSph@SF = P / HSph@SF    (3.4)

where P is the total cost of ownership of the tested system. If the price is in a currency other than US dollars, the units can be adjusted to the corresponding currency. The last metric, power per performance, which is not covered in our study, is expressed as Watts/HSph@SF and has to be measured following the TPC-Energy requirements [18ag].
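As a worked example, the two metrics can be recomputed for the dedicated-network run with scale factor 0.3 reported later in Table 3.35 and Table 3.37 (T = 6001 seconds, P = 6760 €):

    SF=0.3
    T=6001    # total elapsed time of the performance run in seconds
    P=6760    # total cost of ownership of the tested system in EUR

    HS=$(echo "scale=2; $SF / ($T / 3600)" | bc -l)   # ~0.18 HSph@SF
    PP=$(echo "scale=2; $P / $HS" | bc -l)            # ~37555 EUR per HSph@SF
    echo "HSph@SF = $HS, price-performance = $PP EUR/HSph@SF"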

3.3.5 Experimental Results

This section presents the results of the performed experiments. The TPCx-HS benchmark was run with three scale factors, 0.1, 0.3 and 1, which generate respectively 100GB, 300GB and 1TB datasets. These three runs were performed for a cluster setup using both shared and dedicated 1Gbit networks. The first part presents and evaluates the metrics reported by the TPCx-HS benchmark for the different experiments. The second part analyzes the utilization of cluster resources with respect to the two network setups.

System Ratios

The system ratios are additional metrics defined in the TPCx-HS specification to better describe the system under test. These are the Data Storage Ratio and the Scale Factor to Memory Ratio. Using the Total Physical Storage (13TB) and the Total Memory (128GB or 0.125TB) reported in Table 3.28, we calculate them as follows:

Data Storage Ratio = Total Physical Storage / Scale Factor    (3.5)

Scale Factor to Memory Ratio = Scale Factor / Total Physical Memory    (3.6)

Table 3.34 reports the two ratios for the three different scale factors used in the experiments.

Tab. 3.34.: TPCx-HS Related Ratios

Scale Factor | Data Size | Data Storage Ratio | Scale Factor to Memory Ratio
0.1 | 100 GB | 130 | 0.8
0.3 | 300 GB | 43.3 | 2.4
1 | 1 TB | 13 | 8
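For example, the row for scale factor 1 can be reproduced directly from the totals in Table 3.28:

    TOTAL_STORAGE_TB=13     # total physical storage
    TOTAL_MEMORY_TB=0.125   # total physical memory (128 GB)
    SF=1

    echo "scale=1; $TOTAL_STORAGE_TB / $SF" | bc -l   # Data Storage Ratio = 13.0
    echo "scale=1; $SF / $TOTAL_MEMORY_TB" | bc -l    # Scale Factor to Memory Ratio = 8.0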

Performance

This section evaluates the results of the performed experiments. The presented results are obtained by executing the TPCx-HS kit provided on the official TPC website [TPCb]. However, the reported times and metrics are experimental, not audited by any authorized organization and therefore not directly comparable with other officially published full disclosure reports.

Figure 3.31 illustrates the times of the two cluster setups (shared and dedicated 1GBit networks) for the three datasets of 100GB, 300GB and 1TB. It can be clearly observed that in all cases the dedicated 1GBit setup performs around 5 times better than the shared setup. Similarly, Figure 3.31 shows that the dedicated setup achieves a 5 to 6 times higher HSph@SF metric than the shared setup.

Fig. 3.31.: TPCx-HS Execution Times and Metric

Table 3.35 summarizes the experimental results, introducing additional statistical comparisons. The Data ∆ column represents the difference in percent of the Data Size to the data baseline, in our case 100GB. In the first case, scale factor 0.3 increases the processed data by 200%, whereas in the second case the data is increased by 900%. The Time (Sec) column shows the average time in seconds of two complete TPCx-HS runs for each of the six test configurations. The following Time Stdv (%) column shows the standard deviation of Time (Sec) in percent between the two runs. Finally, the

Time ∆ (%) column represents the difference in percent of Time (Sec) to the time baseline, in our case scale factor 0.1. Here we observe that for the shared setup the execution takes around five times longer than for the dedicated setup.

Tab. 3.35.: TPCx-HS Results

Scale Factor | Data Size | Data ∆ (%) | Network | Metric (HSph@SF) | Time (Sec) | Time Stdv (%) | Time ∆ (%)
0.1 | 100 GB | baseline | shared | 0.03 | 10721.75 | 0.59 | baseline
0.3 | 300 GB | 200 | shared | 0.03 | 32142.75 | 0.5 | 199.79
1 | 1 TB | 900 | shared | 0.03 | 105483 | 0.41 | 883.82
0.1 | 100 GB | baseline | dedicated | 0.16 | 2234.75 | 0.72 | baseline
0.3 | 300 GB | 200 | dedicated | 0.18 | 6001 | 0.89 | 168.53
1 | 1 TB | 900 | dedicated | 0.17 | 21047.75 | 1.53 | 841.84

Figure 3.32 illustrates the scaling behavior between the two network setups based on different data sizes. The results show that the dedicated setup has a better scaling behavior than the shared setup. Additionally, the increase of data size improves the scaling behavior of both setups.

Fig. 3.32.: TPCx-HS Scaling Behavior (0 on the X and Y-axis is equal to the baseline of SF 0.1/100GB)

The TPCx-HS benchmark consists of five phases, which were explained in Section 3.3.4. Table 3.36 depicts the average times of the three major phases, together with their standard deviations in percent. Clearly the data sorting phase (HSSort) takes the most processing time, followed by the data generation phase (HSGen) and finally the data validation phase (HSValidate).

Tab. 3.36.: TPCx-HS Phase Times

Scale Factor / Data Size | Network | HSGen (Sec) | HSGen Stdv (%) | HSSort (Sec) | HSSort Stdv (%) | HSValidate (Sec) | HSValidate Stdv (%)
0.1 / 100GB | shared | 3031.62 | 2.05 | 7391.83 | 1.18 | 289.02 | 0.9
0.3 / 300GB | shared | 9149.36 | 0.99 | 22501.87 | 0.66 | 482.33 | 2.8
1 / 1TB | shared | 29821.68 | 1.35 | 74432.82 | 1.21 | 1219.52 | 1.39
0.1 / 100GB | dedicated | 543.06 | 0.47 | 1394.6 | 0.97 | 288.19 | 0.01
0.3 / 300GB | dedicated | 1348.98 | 0.5 | 4112.33 | 0.57 | 530.97 | 7.27
1 / 1TB | dedicated | 4273.3 | 0.16 | 15271.96 | 3.72 | 1493.41 | 6.35

Price-Performance

The price-performance metric of the TPCx-HS benchmark, reviewed in Section 3.3.4, divides the total cost of ownership (P) of the system under test by the TPCx-HS metric (HSph@SF) for a particular scale factor. The total cost of our cluster for the dedicated 1Gbit network setup is summarized in Table 3.37. It includes only the hardware prices, as the software used in the setup is available for free. There are no support, administrative or electricity costs included. The total price of the shared network setup is equal to 6730 €, which includes all the components listed in Table 3.37 except the network switch.

Tab. 3.37.: Total System Cost with 1GBit Switch

Hardware Components | Price incl. Taxes (€)
1 x Master Node (Dell PowerEdge T420) | 1803
1 x Switch (Netgear GS108 GE, 8-Port, 1 GBit) | 30
3 x Data Nodes (Dell PowerEdge T420) | 4228
9 x Disks (Western Digital Blue Desktop, 1TB) | 450
Additional Costs (Cables, Mouse & Monitor) | 249
Total Price (€) | 6760

Table 3.38 summarizes the price-performance metric for the tested scale factors in the two network setups. A lower price-performance value indicates better system performance. In our experiments this is the dedicated setup, which has an around 6 times smaller price-performance value than the shared setup.

Tab. 3.38.: Price-Performance Metrics

Scale Factor | Data Size | Network | Metric (HSph@SF) | Price-Performance Metric (€/HSph@SF)
0.1 | 100 GB | shared | 0.03 | 224333.33
0.3 | 300 GB | shared | 0.03 | 224333.33
1 | 1 TB | shared | 0.03 | 224333.33
0.1 | 100 GB | dedicated | 0.16 | 42250
0.3 | 300 GB | dedicated | 0.18 | 37555.56
1 | 1 TB | dedicated | 0.17 | 39764.71

Using the price-performance formula, we can also find the maximal P (maxP) and respectively the highest switch price for which the dedicated setup will still perform better than the shared setup.

€/HSph@SF (shared) = P(shared) / HSph@SF = 6730 / 0.03 = 224333.33

224333.33 > maxP(dedicated) / 0.18
224333.33 * 0.18 > maxP(dedicated)
40380.00 > maxP(dedicated)    (3.7)

The highest total system cost for the dedicated 1Gbit setup should be less than 40 380 € in order for the system to achieve a better price-performance (€/HSph@SF) than the shared setup. This represents a difference of about 33 620 €, which shows how large the gain is in terms of cost when adding the 1Gbit switch. In summary, the dedicated 1Gbit network setup costs around 30 € more than the shared setup (due to the already existing network infrastructure), but it achieves between 5 and 6 times better performance.

3.3.6 Resource Utilization

The following section graphically presents the cluster resource utilization in terms of CPU, memory, network and number of map and reduce jobs. The reported statistics were obtained using the Performance Analysis Tool (PAT) [Too18] while running TPCx-HS with scale factor 0.1 (100GB) for both the shared and the dedicated network setup. The average values obtained in the measurements are reported in Table C.1 and Table C.2 in Appendix C. The graphics represent a complete benchmark run, consisting of Run1 and Run2 as described in Section 3.3.4, for both Master and Worker nodes. The goal is to compare and analyze the cluster resource utilization between the two different network setups.

CPU

Figure 3.33 shows the CPU utilization in percent of the Master node for the dedicated and shared network setups with respect to the elapsed time (in seconds). The System % (in green) represents the CPU utilization that occurred when executing at the kernel (system) level. Respectively, the User % (in red) represents the CPU utilization when executing at the application (user) level. Finally, the IOwait % (in blue) represents the time in which the CPU was idle while waiting for an outstanding disk I/O request. Comparing both graphs, one can observe a slightly higher CPU utilization in the case of the dedicated 1Gbit network. However, in both cases the overall CPU utilization (System and User %) is around 2%, leaving the CPU heavily underutilized.

Fig. 3.33.: Master Node CPU Utilization

Similar to Figure 3.33, Figure 3.34 depicts the CPU utilization for one of the three Worker nodes. Clearly, the dedicated setup utilizes the CPU better at both system and user level. In the same way, the IOwait times for the dedicated setup are much higher than for the shared one. On average, the overall CPU utilization for the shared network is between 12% and 20%, whereas for the dedicated network it is between 56% and 70%. This difference is especially visible in the data generation and sorting phases, which are highly network-intensive.

Figure 3.35 illustrates the number of context switches per second, which measure the rate at which the threads/processes are switched in the CPU. The higher number of context switches indicates that the CPU spends more time on storing and restoring process states instead of doing real work [Tan14]. In both graphics, we observe that the number of context switches per second is stable, on average between 10000 and 11000. This means that in both cases the Master node is equally utilized.

Similarly, Figure 3.36 depicts the context switches per second for one of the Worker nodes. In the dedicated 1Gbit case, we observe the number of context switches

Fig. 3.34.: Worker Node CPU Utilization

Fig. 3.35.: Master Node Context Switches

varies greatly in the different benchmark phases. The average number of context switches per second is around 20788. In the case of shared network, the average number of context switches per second is around 14233 and the variation between the phases is much smaller.

Fig. 3.36.: Worker Node Context Switches

Memory

Figure 3.37 shows the main memory utilization in percent of the Master node for the two network setups. In the dedicated 1Gbit setup the average memory used is around 48%, whereas in the shared setup it is around 91.4%.

Fig. 3.37.: Master Node Memory Utilization

The same trend is observed on Figure 3.38, which depicts the free memory in Kbytes for the Master node. In the case of dedicated network, the average free memory is 16.3GB (17095562Kbytes), which is the remaining 52% of not utilized memory.

Respectively, for the shared network the average free memory is around 2.7GB (2825635Kbytes), which is the remaining 8.6% of not utilized memory.

Fig. 3.38.: Master Node Free Memory

In the same way, Figure 3.39 illustrates the main memory utilization in percent for one of the Worker nodes. For the dedicated 1Gbit case, the average memory used is around 92.3%, whereas for the shared 1Gbit case it is around 92.9%. This explains the great resemblance of the two graphics, indicating that the Worker nodes are heavily utilized in both setups. It would be advantageous to consider adding more memory to our nodes, as it could further improve the performance by enabling more parallel jobs to be executed [Tan14].

Fig. 3.39.: Worker Node Memory Utilization

Figure 3.40 shows the free memory and the amount of cached memory in Kbytes for one of the Worker nodes. For the dedicated 1Gbit setup, the average free memory is around 2.4GB (2528790Kbytes), which corresponds to the 7.7% of non-utilized memory. Similarly, for the shared 1Gbit setup, the average free memory is around 2.2GB (2326411Kbytes), or around 7% non-utilized memory.

Fig. 3.40.: Worker Node Free Memory

Disk

The following graphics represent the number of read and write requests issued to the storage devices per second. Figure 3.41 shows that the Master node has very few read requests for the two network setups. On the other hand, the average number of write requests per second for the dedicated 1Gbit setup is around 3.1 and respectively around 1.7 for the shared network setup.

The following graphics illustrate the I/O latencies (in milliseconds), i.e. the average time from the issuing of an I/O request to its completion by the device, which includes the time spent in the device queue and the time for servicing the request. Figure 3.43 shows the Master node latencies, which on average are around 0.15 milliseconds for the dedicated setup and around 0.17 milliseconds for the shared setup.

Similarly, Figure 3.44 depicts the Worker node latencies. For the dedicated setup, the average I/O latency is around 137 milliseconds and respectively around 70 milliseconds for the shared setup. We also observe that for the shared network setup there are multiple I/O latencies of around 200 to 300 milliseconds, whereas for the dedicated setup such long latencies are much less frequent.

The following figures depict the number of Kbytes read and written on the storage devices per second. Figure 3.45 illustrates this for the Master node. In both graphs, there are no read requests but only write ones. On average, around 31 Kbytes are

Fig. 3.41.: Master Node I/O Requests

Fig. 3.42.: Worker Node I/O Requests

written per second by the dedicated setup and respectively around 25 Kbytes are written per second by the shared setup.

Similarly, Figure 3.46 shows the disk bandwidth for one of the Worker nodes. The average read throughput is around 6.4MB (6532Kbytes) per second for the dedicated setup and around 1.4MB (1438Kbytes) per second for the shared setup. Respectively the average write throughput for the dedicated setup is around 18.6MB (19010Kbytes) per second and 4MB (4087Kbytes) per second for the shared setup. In summary, the dedicated network achieves much better throughput levels, which indicates more efficient data processing and management.

Fig. 3.43.: Master Node I/O Latencies

Fig. 3.44.: Worker Node I/O Latencies

Network

The following figures depict the number of received and transmitted Kbytes per second. For the Master node, the average number of received Kbytes per second is around 53 for the dedicated setup and around 22 for the shared setup. Respectively, the average number of transmitted Kbytes per second is around 42 for the dedicated configuration and around 19 for the shared configuration.

Fig. 3.45.: Master Node Disk Bandwidth

Fig. 3.46.: Worker Node Disk Bandwidth

Analogously, Figure 3.48 shows the network transfer for one of the Worker nodes. In the dedicated setup, on average 32.8MB (33637 Kbytes) per second are received and 30.6MB (31364 Kbytes) per second are transmitted. In the shared setup, on average 7.1MB (7297 Kbytes) per second are received and 6.4MB (6548 Kbytes) per second are transmitted. Clearly, the dedicated 1Gbit network achieves almost 5 times higher network utilization, resulting in faster overall performance.

Fig. 3.47.: Master Node Network I/O

Fig. 3.48.: Worker Node Network I/O

Mappers and Reducers

Figure 3.49 shows the number of active map and reduce jobs in the different benchmark phases for both network setups. The two graphs show very similar behavior, except that the shared setup is around 5 times slower than the dedicated setup.

Fig. 3.49.: Worker Node JVM Count

3.3.7 Lessons Learned

The report presents a performance evaluation of the Cloudera Hadoop Distribution through the use of the TPCx-HS benchmark, which is the first officially standardized Big Data benchmark. In particular, our experiments compare two cluster setups: the first one using a shared 1Gbit network and the second one using a dedicated 1Gbit network. Our results show that the cluster with the dedicated network setup is around 5 times faster than the cluster with the shared network setup in terms of: 1) execution time; 2) the HSph@SF metric; 3) average read and write throughput per second; and 4) network utilization.

On average, the overall CPU utilization for the shared case is between 12% and 20%, whereas for the dedicated network it is between 56% and 70%. The average main memory usage for the dedicated setup is around 92.3%, whereas for the shared setup it is around 92.9%. Furthermore, based on the price-performance formula it can be concluded that the 5 times performance gain of the dedicated 1Gbit setup is worth around 33 620 € when compared to the shared 1Gbit setup.

Overall, our experiments show that the TPCx-HS benchmark is a good choice for stressing a Hadoop cluster network and measuring its performance under network and I/O-intensive load. In the future, we plan to stress test the dedicated setup with other widely used Big Data benchmarks and workloads.

Acknowledgements

This research is supported by the Frankfurt Big Data Lab at Goethe University Frankfurt. The authors would like to thank Naveed Mushtaq, Karsten Tolle, Marten Rosselli and Roberto V. Zicari for their helpful comments and support. We would like to thank Jeffrey Buell of VMware for his valuable feedback and corrections.

3.4 Performance Evaluation of Spark SQL using BigBench

Abstract

In this paper we present the initial results of our work to execute BigBench on Spark. First, we evaluated the scalability behavior of the existing MapReduce implementation of BigBench. Next, we executed the group of 14 pure HiveQL queries on Spark SQL and compared the results with the respective Hive ones. Our experiments show that: (1) for both Hive and Spark SQL, BigBench queries scale on average better than linearly with the increase of the data size and (2) pure HiveQL queries perform faster on Spark SQL than on Hive. The paper is based on the following publications:

• Todor Ivanov and Max-Georg Beer, Evaluating Hive and Spark SQL with BigBench, Frankfurt Big Data Lab, Technical Report No.2015-2, arXiv:1512.08417, [IB15b].

• Todor Ivanov and Max-Georg Beer, Performance Evaluation of Spark SQL using BigBench, in Proceedings of the 6th Workshop on Big Data Benchmarking (6th WBDB), June 16-17, 2015, Toronto, Canada, [IB15c].

• Evaluation of BigBench on Apache Spark Compared to MapReduce - Max-Georg Beer, Master Thesis August 2015, [Bee15].

Keywords: Big Data, Benchmarking, BigBench, Hive, Spark SQL.

3.4.1 Introduction

In recent years, the variety and complexity of Big Data technologies has been steadily growing. Both industry and academia are challenged to understand and apply these technologies in an optimal way. To cope with this problem, there is a need for new standardized Big Data benchmarks that cover the entire Big Data lifecycle, as outlined by multiple studies [Che+12; Car12; CRK12]. The first industry standard Big Data benchmark, called TPCx-HS [Nam+14], was recently released. It is designed to stress test a Hadoop cluster. While TPCx-HS is a micro-benchmark (highly I/O and network bound), there is still a need for an end-to-end application-level benchmark [Bar+13a] that tests the analytical capabilities of a Big Data platform. BigBench [Gha+13b; Bar+14] has been proposed with the specific intention to fulfill these requirements and is currently available for public review as TPCx-BB [TPC18b]. It consists of 30 complex queries. 10 queries were taken from the TPC-DS benchmark [TPCa], whereas the remaining 20 queries are based on the prominent business cases of Big Data analytics identified in the McKinsey report [Man+11]. BigBench's data model, depicted in Figure 3.50, was derived from TPC-DS and extended with unstructured and semi-structured data to fully represent the Big Data Variety characteristic. The data generator is an extension of PDGF [Rab+10a] that allows generating all three data types as well as efficiently scaling the

data for large scale factors. Chowdhury et al. [Cho+13b] presented a BigBench implementation for the Hadoop ecosystem [Int18a]. All queries are implemented using Apache Hadoop, Hive, Mahout and the Natural Language Processing Toolkit (NLTK). Table 3.39 summarizes the number and type of queries of this BigBench implementation.

Fig. 3.50.: BigBench Schema [Cho+13b]

Recently, Apache Spark [Zah+12a] has become a popular alternative to the MapReduce framework, promising faster processing and offering analytical capabilities through Spark SQL [Arm+15a]. BigBench is a technology agnostic Big Data analytical benchmark, which renders it a good candidate to be implemented in Spark and used as a platform evaluation and comparison tool [Bar+14]. Our main objective is to successfully run BigBench on Spark and compare the results with its current MapReduce (MR) implementation [18n]. The first step of our work was to execute the largest group of 14 HiveQL queries on Spark. This was possible due to the fact that Spark SQL [Arm+15a] fully supports the HiveQL syntax.

In this paper we describe our approach to run BigBench on Spark and present an evaluation of the first experimental results. Our main contributions are:

• Scripts to automate the execution and validation of query results.

• Evaluation of the query scalability on the basis of four different scale factors.

• Comparison of the Hive and Spark SQL query performance.

• Resource utilization analysis of a set of seven representative queries.

Tab. 3.39.: BigBench Queries

Query Types | Queries | Number of Queries
Pure HiveQL | Q6, Q7, Q9, Q11, Q12, Q13, Q14, Q15, Q16, Q17, Q21, Q22, Q23, Q24 | 14
Java MapReduce with HiveQL | Q1, Q2 | 2
Python Streaming MR with HiveQL | Q3, Q4, Q8, Q29, Q30 | 5
Mahout (Java MR) with HiveQL | Q5, Q20, Q25, Q26, Q28 | 5
OpenNLP (Java MR) with HiveQL | Q10, Q18, Q19, Q27 | 4

The remainder of the paper is organized as follows: Section 3.4.2 describes the necessary steps to implement BigBench on Spark; Section 3.4.3 discusses the major issues and solutions that we applied during the experiments; Section 3.4.4 presents the experiments and analyzes the results; Section 3.4.5 evaluates the queries' resource utilization. Finally, Section 3.4.6 summarizes the lessons learned and future work.

3.4.2 Towards BigBench on Spark

Spark [Zah+12a] has emerged as a promising general purpose distributed computing framework that extends the MapReduce model by using main memory caching to improve performance. It offers multiple new functionalities such as stream processing (Spark Streaming), machine learning (MLlib), graph processing (GraphX), query processing (Spark SQL) and support for Scala, Java, Python and R programming languages. Due to these new features the BigBench benchmark is a suitable candidate for a Spark implementation, since it consists of queries with very different processing requirements.

Before starting the implementation process, we had to evaluate the different query groups (listed in Table 3.39) of the available BigBench implementation and identify the adjustments that are necessary. We started by analyzing the largest group of 14 pure HiveQL queries. Fortunately, Spark SQL [Arm+15a] supports the HiveQL syntax, which allowed us to run this group of queries without any modifications. In Section 3.4.4, we evaluate the Spark SQL benchmarking results of these queries and compare them with the respective Hive results. It is important to mention that the experimental part of this paper focuses only on Spark SQL and does not evaluate any of the other available Spark components (Spark Streaming, MLlib, GraphX, etc.).
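To illustrate this, the following minimal PySpark sketch (assuming a Spark 1.x build with Hive support and a query file that contains a single SELECT statement; the file name is purely illustrative) submits a pure HiveQL query to Spark SQL without any modification:

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="BigBenchHiveQLOnSpark")
hc = HiveContext(sc)  # accepts HiveQL and reads table metadata from the Hive Metastore

# Load one of the pure HiveQL BigBench queries from a file and run it unchanged.
with open("q06.sql") as f:          # illustrative file name
    query = f.read()

result = hc.sql(query)              # executed by Spark SQL, not by MapReduce
result.show(10)                     # print the first rows of the result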

Based on our analysis, we identified multiple steps and open issues that should be completed in order to successfully run all BigBench queries on Spark:

• Re-implementing the MapReduce jars (in Q1 and Q2) and running the external tools Mahout and OpenNLP with Spark.

• Making sure that external scripts and files are distributed to Spark executors (Q01, Q02, Q03, Q04, Q08, Q10, Q18, Q19, Q27, Q29, Q30).

• Adjusting the different null value expression from Hive (\N) to the respective Spark SQL (null) value (Q3, Q8, Q29, Q30).

• Similar to Hive, new versions of Spark SQL should automatically determine query-specific settings, since manually tuning them is a non-trivial and very time-consuming process.

3.4.3 Issues and Improvements

During the query analysis phase, we had to ensure that both the MapReduce and Spark query results are correct and valid. In order to achieve this, the query results should be deterministic and not empty, so that results from query runs of the same scale factor on different platforms are comparable. Furthermore, having an official reference for the data model and result tables (including row counts and sample values) for the various scale factors, like the one provided by the TPC-DS benchmark [TPCa], can be helpful for both developers and operators. However, this is not the case with the current MapReduce implementation. The major issue that we encountered was empty query results, which we solved by adjusting the query parameters, except for Q1 (MapReduce), which needs additional changes. Using scripts [18n], we collected row counts and sample values for multiple scale factors, which we then used to validate the correctness of the Spark queries. The reference values together with an extended description are provided in our technical report [IB15b].

In spite of our efforts to provide a BigBench reference for result validation, the Mahout queries generate a result text file with varying non-deterministic values. Similarly, the OpenNLP queries generate their results based on randomly generated text attributes, which change with every new data generation. Validating queries with non-deterministic results is hardly possible.

Finally, we integrated all modifications mentioned above in a setup project that includes a modified version of BigBench 1.0, available on GitHub [18n]. In summary, the setup project provides the following benefits: (1) Simplifying commonly used commands like generating and loading data that normally need a lot of unexpressive skip parameters. (2) Running a subset of queries successively. (3) Utilizing the parse-big-bench [15a] tool to gather the query execution times in a spreadsheet file even if only a subset of them is executed. (4) Allowing the validation of executed queries by automatically storing row counts and sample row values for every result and BigBench’s data model table. (5) Improving cleanup of temporary files, i.e. log files created by BigBench.
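As a hedged sketch of the validation step (4) described above, the following PySpark snippet stores the row count and a few sample rows for each result table; the table names and the use of a HiveContext are illustrative and not taken from the actual setup project:

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="BigBenchResultValidation")
hc = HiveContext(sc)

# Illustrative result table names; the real setup project derives them automatically.
result_tables = ["bigbench_q06_result", "bigbench_q09_result"]

for table in result_tables:
    df = hc.table(table)
    row_count = df.count()               # primary correctness check between platforms
    sample_rows = df.limit(5).collect()  # sample values for manual spot checks
    print(table, row_count, sample_rows)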

3.4.4 Performance Evaluation

Experimental Setup

The experiments were performed on a cluster consisting of 4 nodes connected directly through a 1GBit Netgear switch. All 4 nodes are Dell PowerEdge T420 servers. The master node is equipped with 2x Intel Xeon E5-2420 (1.9GHz) CPUs each with 6 cores, 32GB of main memory and 1TB hard drive. The 3 worker nodes

are equipped with 1x Intel Xeon E5-2420 (2.20GHz) CPU with 6 cores, 32GB of RAM and 4x 1TB (SATA, 7.2K RPM, 64MB Cache) hard drives. Ubuntu Server 14.04.1 LTS was installed on all 4 nodes, allocating the entire first disk. Cloudera's Hadoop Distribution (CDH) version 5.2.0 was installed on the 4 nodes with the configuration parameters listed in the next section. 8TB of the total storage capacity of 13TB were used for the HDFS file system. Due to the small number of cluster nodes, the cluster was configured to work with a replication factor of two. The experiments were performed using our modified version of BigBench [18n], Hive version 0.13.1 and Spark version 1.4.0-SNAPSHOT (March 27th 2015). A comprehensive description of the experimental environment is available in our report [IB15b].

Cluster Configuration

Since determining the optimal cluster configuration is very time consuming, our goal was to find a stable one that produces valid query results for the highest tested scale factor (in our case 1000GB). To achieve this, we applied an iterative approach of executing BigBench queries, adjusting the cluster configuration parameters and validating the query results. First, we started by adapting the default CDH configuration to our cluster resources, which resulted in a configuration that we called initial. After performing a set of tests, we applied the best practices published by Sandy Ryza [15b] that were especially relevant for Spark. This resulted in a configuration that we called final, which was used for the actual experiments presented in the next section. Table 3.40 lists the important parameters for all three cluster configurations.

Figure 3.51 depicts the improvements in execution times (in %) between the initial and final cluster configurations for a set of queries executed on Hive and Spark SQL for scale factor 1000, representing a 1TB data size.

Fig. 3.51.: Improvements between Initial and Final Cluster Configuration for 1TB data size

All queries except Q7 benefited from the changes in the final configuration. The Hive queries improved on average by 1.3%, except Q9. The reason Q9 improved by 76% was that we re-enabled the Hive MapJoins (hive.auto.convert.join) and increased the Hive client Java heap size. For the Spark SQL queries, we observed an average improvement of 13.7%, except for Q7, which takes around 32% more time to complete and will be investigated further in our future work.

Tab. 3.40.: Cluster Configuration Parameters

Component | Parameter | Default Configuration | Initial Configuration | Final Configuration
YARN | yarn.nodemanager.resource.memory-mb | 8GB | 28GB | 31GB
YARN | yarn.scheduler.maximum-allocation-mb | 8GB | 28GB | 31GB
YARN | yarn.nodemanager.resource.cpu-vcores | 8 | 8 | 11
Spark | master | local | yarn | yarn
Spark | num-executors | 2 | 12 | 9
Spark | executor-cores | 1 | 2 | 3
Spark | executor-memory | 1GB | 8GB | 9GB
Spark | spark.serializer | org.apache.spark.serializer.JavaSerializer | org.apache.spark.serializer.JavaSerializer | org.apache.spark.serializer.KryoSerializer
MapReduce | mapreduce.map.java.opts.max.heap | 788MB | 2GB | 2GB
MapReduce | mapreduce.reduce.java.opts.max.heap | 788MB | 2GB | 2GB
MapReduce | mapreduce.map.memory.mb | 1GB | 3GB | 3GB
MapReduce | mapreduce.reduce.memory.mb | 1GB | 3GB | 3GB
Hive | hive.auto.convert.join | TRUE | FALSE | TRUE (Q9 only)
Hive | Client Java Heap Size | 256MB | 256MB | 2GB
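For illustration only (this is not the exact submission command used in the experiments), the "final" Spark settings from Table 3.40 could be expressed programmatically with SparkConf as follows; on the command line the same values correspond to the num-executors, executor-cores and executor-memory flags of spark-submit:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("BigBenchFinalConfiguration")
        .setMaster("yarn-client")                              # master = yarn
        .set("spark.executor.instances", "9")                  # num-executors
        .set("spark.executor.cores", "3")                      # executor-cores
        .set("spark.executor.memory", "9g")                    # executor-memory
        .set("spark.serializer",
             "org.apache.spark.serializer.KryoSerializer"))

sc = SparkContext(conf=conf)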

BigBench Data Scalability on MapReduce

In this section we present the experimental results for 4 tested BigBench scale factors (SF): 100GB, 300GB, 600GB and 1000GB (1TB). Our cluster used the final configuration presented in Table 3.40. Utilizing our scripts, each BigBench query was executed 3 times and the average value was taken as a representative result, also listed in Table 3.41. The absolute times for all experiments are available in our technical report [IB15b].

Figure 3.52 shows all BigBench query execution times for the available MapReduce implementation of BigBench. The presented times for 300GB, 600GB and 1TB are normalized with respect to 100GB SF as the baseline. Considering the execution

Fig. 3.52.: BigBench + MapReduce Query Times normalized with respect to 100GB SF

times in relation to the different data sizes, we can see that each query differs in its scaling behavior. Longer normalized times indicate that the execution became slower with the increase of the data size, whereas shorter times indicate better scalability. Q4 has the worst data scaling behavior, taking around 2.1 times longer to process 300GB, 6 times longer to process 600GB and 12 times longer to process 1TB data when compared to the 100GB SF baseline. Q30, Q5 and Q3 show similar scaling behavior. All of them except Q5 (Mahout) are implemented in Python Streaming MR. On the contrary, Q27 remains almost unchanged (within a range of +/-0.3 times) with the increase of the data size. Likewise, Q19 and Q10 (implemented in OpenNLP) and Q23 (pure HiveQL) show only slightly worse scaling behavior.

BigBench Data Scalability on Spark SQL

This section investigates the scaling behavior of the 14 pure HiveQL BigBench queries executed on Spark SQL using 4 different scale factors (100GB, 300GB, 600GB and 1000GB). Similar to Figure 3.52, the presented times for 300GB, 600GB and 1000GB are normalized with respect to 100GB scale factor as baseline and depicted on Figure 3.53. The average values from the three executions are listed in Table 3.42.

Tab. 3.41.: Average query times for the four tested scale factors (100GB, 300GB, 600GB and 1000GB). The column ∆ (%) shows the time difference in % between the baseline 100GB SF and the other three SFs for Hive/MapReduce.

Hive/MapReduce
Query | 100GB min. | 300GB min. | ∆ (%) | 600GB min. | ∆ (%) | 1000GB min. | ∆ (%)
Q1 | 3.75 | 5.52 | 47.2 | 8.11 | 116.27 | 10.48 | 179.47
Q2 | 8.23 | 21.07 | 156.01 | 40.11 | 387.36 | 68.12 | 727.7
Q3 | 9.99 | 26.32 | 163.46 | 53.45 | 435.04 | 90.55 | 806.41
Q4 | 71.37 | 221.3 | 210.1 | 502 | 603.33 | 928.7 | 1201.2
Q5 | 27.7 | 76.56 | 176.39 | 155.7 | 462.02 | 272.5 | 883.86
Q6 | 6.36 | 10.69 | 68.08 | 16.73 | 163.05 | 25.42 | 299.69
Q7 | 9.07 | 16.92 | 86.55 | 29.51 | 225.36 | 46.33 | 410.8
Q8 | 8.59 | 17.74 | 106.52 | 32.46 | 277.88 | 53.67 | 524.8
Q9 | 3.13 | 6.56 | 109.58 | 11.5 | 267.41 | 17.72 | 466.13
Q10 | 15.44 | 19.67 | 27.4 | 24.29 | 57.32 | 22.92 | 48.45
Q11 | 2.88 | 4.61 | 60.07 | 7.46 | 159.03 | 11.24 | 290.28
Q12 | 7.04 | 11.6 | 64.77 | 18.67 | 165.2 | 29.86 | 324.15
Q13 | 8.38 | 13 | 55.13 | 20.23 | 141.41 | 30.18 | 260.14
Q14 | 3.17 | 5.48 | 72.87 | 8.99 | 183.6 | 13.84 | 336.59
Q15 | 2.04 | 3.01 | 47.55 | 4.47 | 119.12 | 6.37 | 212.25
Q16 | 5.78 | 14.83 | 156.57 | 29.13 | 403.98 | 48.85 | 745.16
Q17 | 7.6 | 10.91 | 43.55 | 14.6 | 92.11 | 18.57 | 144.34
Q18 | 8.53 | 11.02 | 29.19 | 14.44 | 69.28 | 27.6 | 223.56
Q19 | 6.56 | 7.22 | 10.06 | 7.58 | 15.55 | 8.18 | 24.7
Q20 | 8.38 | 20.29 | 142.12 | 39.32 | 369.21 | 64.83 | 673.63
Q21 | 4.58 | 6.89 | 50.44 | 10.22 | 123.14 | 14.92 | 225.76
Q22 | 16.64 | 19.43 | 16.77 | 19.82 | 19.11 | 29.84 | 79.33
Q23 | 18.2 | 20.51 | 12.69 | 23.22 | 27.58 | 25.16 | 38.24
Q24 | 4.79 | 7.02 | 46.56 | 10.3 | 115.03 | 14.75 | 207.93
Q25 | 6.23 | 11.21 | 79.94 | 19.99 | 220.87 | 31.65 | 408.03
Q26 | 5.19 | 8.57 | 65.13 | 15.08 | 190.56 | 22.92 | 341.62
Q27 | 0.91 | 0.63 | -30.77 | 0.98 | 7.69 | 0.7 | -23.08
Q28 | 18.36 | 21.24 | 15.69 | 24.77 | 34.91 | 28.87 | 57.24
Q29 | 5.17 | 11.73 | 126.89 | 22.78 | 340.62 | 37.21 | 619.73
Q30 | 19.48 | 57.68 | 196.1 | 119.9 | 515.3 | 201.2 | 932.85
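As a concrete reading of the ∆ (%) column in Tab. 3.41, consider Q4 at the 300GB scale factor, whose execution time grows from 71.37 to 221.3 minutes:

\[ \Delta = \frac{221.3 - 71.37}{71.37} \times 100 \approx 210\% \]

which corresponds to the "around 2.1 times longer" mentioned in the text above.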

Tab. 3.42.: Average query times for the four tested scale factors (100GB, 300GB, 600GB and 1000GB). The column ∆ (%) shows the time difference in % between the baseline 100GB SF and the other three SFs for Spark SQL.

Spark SQL
Query | 100GB min. | 300GB min. | ∆ (%) | 600GB min. | ∆ (%) | 1000GB min. | ∆ (%)
Q6 | 2.54 | 3.52 | 38.58 | 4.83 | 90.16 | 6.7 | 163.78
Q7 | 2.54 | 6.04 | 137.8 | 21.47 | 745.28 | 41.07 | 1516.93
Q9 | 1.24 | 1.71 | 37.9 | 2.31 | 86.29 | 2.82 | 127.42
Q11 | 1.16 | 1.38 | 18.97 | 1.68 | 44.83 | 2.07 | 78.45
Q12 | 1.96 | 3.06 | 56.12 | 4.92 | 151.02 | 7.56 | 285.71
Q13 | 2.43 | 3.59 | 47.74 | 5.57 | 129.22 | 7.98 | 228.4
Q14 | 1.24 | 1.56 | 25.81 | 2.1 | 69.35 | 2.83 | 128.23
Q15 | 1.4 | 1.59 | 13.57 | 1.93 | 37.86 | 2.36 | 68.57
Q16 | 3.41 | 7.88 | 131.09 | 23.32 | 583.87 | 43.65 | 1180.06
Q17 | 1.56 | 2.19 | 40.38 | 2.91 | 86.54 | 3.55 | 127.56
Q21 | 2.68 | 10.64 | 297.01 | 27.18 | 914.18 | 48.08 | 1694.03
Q22 | 36.66 | 60.69 | 65.55 | 88.92 | 142.55 | 122.68 | 234.64
Q23 | 16.68 | 27.02 | 61.99 | 52.11 | 212.41 | 69.01 | 313.73
Q24 | 3.33 | 15.27 | 358.56 | 42.19 | 1166.97 | 77.05 | 2213.81

It is noticeable that Q24 achieves the worst data scalability, taking around 3.6 times longer to process 300GB, 11.7 times longer to process 600GB and 22 times longer to process 1TB data when compared to the 100GB SF baseline. Q21, Q7 and Q16 behave similarly, with only slightly better data scalability. On the contrary, Q15 has the best data scalability, taking around 0.14 times longer for 300GB, 0.4 times longer for 600GB and 0.7 times longer for 1TB data when compared to the 100GB SF baseline. Analogously, Q11, Q9 and Q14 show only slightly worse scalability behavior.

Fig. 3.53.: BigBench + Spark SQL Query Times normalized with respect to 100GB SF

In summary, our experiments showed that with the increase of the data size the BigBench queries scale on average better than linearly for both the Hive and Spark SQL executions. The only exception for MapReduce is Q4, whereas for Spark SQL there are multiple exceptions: Q7, Q16, Q21 and Q24. The reason for this behavior probably lies in the reported join issues [15c] in the utilized Spark SQL version.

Hive and Spark SQL Comparison

In addition to the scalability evaluation we compared the query execution time of the 14 pure HiveQL BigBench queries in Hive and Spark SQL with regard to different scale factors.

Fig. 3.54.: Hive to Spark SQL Query Time Ratio defined as ((HiveTime * 100) / SparkTime) - 100

Figure 3.54 shows the Hive to Spark SQL query time ratio in %, defined as ((HiveTime * 100) / SparkTime) - 100. Positive values indicate faster Spark SQL query execution compared to the Hive ones, whereas negative values indicate slower Spark SQL execution in comparison to Hive. This figure illustrates that for Q6, Q9, Q11, Q14 and Q15 Spark SQL performs between 46% and 528% faster than Hive.
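As a concrete example, using the 1TB averages from Tab. 3.41 and Tab. 3.42 for Q9 (17.72 minutes on Hive and 2.82 minutes on Spark SQL), the ratio evaluates to:

\[ \frac{17.72 \times 100}{2.82} - 100 \approx 528\% \]

which corresponds to the upper end of the reported range.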

It is noticeable that this difference increases with a higher data size. For Q12, Q13 and Q17, we observed that the Spark SQL execution times rise more slowly with the increase of the data size, compared to the previous group of queries. On the contrary, Q7, Q16, Q21, Q22, Q23 and Q24 drastically increase their Spark SQL execution time for the larger data sets. This results in a declining query time ratio. A highly probable reason for this behavior is the reported join issue [15c] in the utilized Spark SQL version.

3.4.5 Query Resource Utilization

This section analyses the resource utilization of a set of representative queries, which are selected based on their behavior presented in the previous section. The first part evaluates the resource utilization of four queries (Q4, Q5, Q18 and Q27) executed on MapReduce, whereas the second compares three HiveQL queries (Q7, Q9 and Q24) executed on both Hive and Spark SQL. The presented metrics (CPU utilization, disk I/O, memory utilization and network I/O) are gathered using Intel's Performance Analysis Tool (PAT) [Too18] while executing the queries with a 1TB data size. A full summary of the measured results is available in our technical report [IB15b].

MapReduce Queries

Queries Q4, Q5, Q18 and Q27 are selected for further analysis based on their scalability behavior and implementation details (Mahout, Python Streaming and OpenNLP).

BigBench's Q4 is chosen for resource evaluation because it is both the slowest of all 30 queries and also shows the worst data scaling behavior on MapReduce. It performs a shopping cart abandonment analysis: for users who added products to their shopping carts but did not check out in the online store, find the average number of pages they visited during their sessions [Rab+14]. The query is implemented in HiveQL and executes additional Python scripts.

Analogously, Q5 was chosen because it is implemented in both HiveQL and Mahout. It builds a model using logistic regression: based on existing users' online activities and demographics, for a visitor to an online store, predict the visitor's likelihood of being interested in a given category [Rab+14].

Next, we selected Q27 as it showed an almost unchanged behavior when executed with different data sizes. It extracts competitor product and model names (if any) from online product reviews for a given product [Rab+14]. The query is implemented in HiveQL and uses the Apache OpenNLP machine learning library for natural language text processing [18h]. In order to ensure that this behavior is not caused by the use of the OpenNLP library, Q18, which also uses text processing, was selected for resource evaluation. It identifies the stores with flat or declining sales in three consecutive months and checks whether there are any negative reviews regarding these stores available online [Rab+14].

The average values of the measured metrics are shown in Table 3.43 for all four MapReduce queries. Additionally, the detailed figures for the CPU (Figure F.1), network (Figure F.2) and disk utilization (Figure F.3) in relation to the execution time for the four evaluated queries are included in the Appendix F.

It can be observed that Q4 has the highest memory utilization (around 96%) and the highest I/O wait time (around 5%), meaning that the CPU is blocked waiting for outstanding disk I/O requests. The query also has the highest average number of context switches per second as well as the highest I/O latency time. Both factors are an indication of memory swapping, which causes massive I/O operations. Taking into account all of the above described metrics, it is no surprise that Q4 is the slowest of all 30 BigBench queries.

Regarding Q5, it has the highest network traffic (around 8-9 MB/sec) and the highest number of read requests per second compared to the other three queries. It also utilizes around 92% of the memory. Interestingly, the Mahout execution starts 259 minutes (15 536 seconds) into the Q5 execution. It takes only around 18 minutes and utilizes very few resources in comparison to the HiveQL part of the query. Similar to Q5, Q18 is also memory bound with around 90% utilization. However, it has the highest CPU usage (around 56%) and the lowest I/O wait time (only around 0.30%) compared to the other three queries.

Tab. 3.43.: Average Resource Utilization of queries Q4, Q5, Q18 and Q27 on Hive/MapReduce for scale factor 1TB.

Metric | Q4 (Python Streaming) | Q5 (Mahout) | Q18 (OpenNLP) | Q27 (OpenNLP)
Average Runtime (minutes) | 928.68 | 272.53 | 27.6 | 0.7
Avg. CPU Utilization % - User | 48.82% | 51.50% | 55.99% | 10.03%
Avg. CPU Utilization % - System | 3.31% | 3.37% | 2.04% | 1.94%
Avg. CPU Utilization % - I/O wait | 4.98% | 3.65% | 0.30% | 1.29%
Memory Utilization % | 95.99% | 91.85% | 90.22% | 27.19%
Avg. Kbytes Transmitted per Second | 7128.3 | 8329.02 | 2302.81 | 1547.15
Avg. Kbytes Received per Second | 7129.75 | 8332.22 | 2303.59 | 1547.14
Avg. Context Switches per Second | 11364.64 | 9859 | 6751.68 | 5952.83
Avg. Kbytes Read per Second | 3487.38 | 3438.94 | 1592.41 | 1692.01
Avg. Kbytes Written per Second | 5607.87 | 5568.18 | 988.08 | 181.19
Avg. Read Requests per Second | 47.81 | 67.41 | 4.86 | 14.25
Avg. Write Requests per Second | 12.88 | 13.12 | 4.66 | 2.36
Avg. I/O Latencies in Milliseconds | 115.24 | 82.12 | 20.68 | 8.89

Finally, Q27 shows that the system remains underutilized with only 10% CPU and 27% memory usage during the entire query execution. Further investigation into the query showed that it operates on a very small data set, which varies only slightly with the increase of the scale factor. This fact, together with the short execution time (just under a minute), renders Q27 inappropriate for testing the data scalability and resource utilization of a Big Data platform. It can be used in cases where functional tests involving the OpenNLP library are required.

Hive and Spark SQL Query Comparison

In this part three HiveQL queries (Q7, Q9 and Q24) are evaluated with the goal to compare the resource utilization of Hive and Spark SQL.

First, we chose BigBench's Q24 because it showed the worst data scaling behavior on Spark SQL. The query measures the effect of competitors' prices on products' in-store and online sales for a given product [Rab+14] (Compute the cross-price elasticity of demand for a given product).

Next, Q7 was selected as it sharply decreased its Hive to Spark SQL ratio with the increase of the data size, as depicted on Figure 3.54. BigBench’s Q7 lists all the stores with at least 10 customers who bought products with the price tag at least 20% higher than the average price of products in the same category during a given month [Rab+14]. It was adopted from query 6 of the TPC-DS benchmark [TPCa].

Finally, Q9 was chosen as it showed the highest Hive to Spark SQL ratio difference with the increase of the data size. BigBench’s Q9 calculates the total sales for different types of customers (e.g. based on marital status, education status), sales price and different combinations of state and sales profit [Rab+14]. It was adopted from query 48 of the TPC-DS benchmark [TPCa].

The average values of the measured metrics are shown in Table 3.44 and Table 3.45 for both Hive and Spark SQL together with a comparison represented in the Ratio (%) column. In addition to this, the figures in the appendix depict the resource utilization metrics (CPU utilization, network I/O, disk bandwidth and I/O latencies) in relation to the query’s runtime for Q7 (Figure F.4), Q9 (Figure F.5) and Q24 (Figure F.6) for both Hive and Spark SQL with 1TB data size.

Analyzing the metrics gathered for Q7, it is observable that the Spark SQL execution is only 13% faster than the Hive one for the 1TB data size, although for 100GB this difference was around 256%. This can be explained by the 3 times lower CPU utilization and the higher I/O wait time (around 21%) of Spark SQL. Also, the average network I/O of Spark SQL (around 3.4 MB/sec) is much smaller than that of Hive (11.6 MB/sec). Interestingly, the standard deviation of the three runs was around 14% for the 600GB data set and around 4% for the 1TB data set, which is an indication that the query behavior is not stable. Overall, the poor scaling and unstable behavior of Q7 can be explained by the join issue [15c] in the utilized Spark SQL version.

On the contrary, Q9 on Spark SQL is 6.3 times faster than on Hive. However, Hive utilizes around 2 times more CPU time and has on average 2.7 times more context switches per second compared with Spark SQL. Both have very similar average network utilization (around 7.5 - 7.67 MB/sec).

Another interesting observation in both queries is that on one hand the average write throughput in Spark SQL is much smaller than its average read throughput. On the other hand, the average write throughput in Hive is much higher than its average read throughput. The reason for this is in the different internal architectures of the engines and in the way they perform I/O operations. It is also important to note that for both queries the average read throughput of Spark SQL is at least 2 times faster than the one of Hive. On the contrary, the average write throughput of Hive is at least 2 times faster than the one of Spark SQL. The reason for this inverse rate lies in the total data sizes that are written and read by both engines.

Tab. 3.44.: Average Resource Utilization of queries Q7, Q9 and Q24 on Hive and Spark SQL for scale factor 1TB. The Ratio column is defined as HiveTime/SparkTime or SparkTime/HiveTime and represents the difference between Hive (MapReduce) and Spark SQL for each metric.

Measured metrics | Q7 Hive | Q7 Spark SQL | Q7 Hive/Spark SQL Ratio | Q9 Hive | Q9 Spark SQL | Q9 Hive/Spark SQL Ratio
Average Runtime (minutes) | 46.33 | 41.07 | 1.13 | 17.72 | 2.82 | 6.28
Avg. CPU Utilization % - User | 56.97% | 16.65% | 3.42 | 60.34% | 27.87% | 2.17
Avg. CPU Utilization % - System | 3.89% | 2.62% | 1.48 | 3.44% | 2.22% | 1.55
Avg. CPU Utilization % - I/O wait | 0.40% | 21.28% | - | 0.38% | 4.09% | -
Memory Utilization % | 94.33% | 93.78% | 1.01 | 78.87% | 61.27% | 1.29
Avg. Kbytes Transmitted per Second | 11650.07 | 3455.03 | 3.37 | 7512.13 | 7690.59 | -
Avg. Kbytes Received per Second | 11654.28 | 3456.24 | 3.37 | 7514.87 | 7691.04 | -
Avg. Context Switches per Second | 10251.24 | 8693.44 | 1.18 | 19757.83 | 7284.11 | 2.71
Avg. Kbytes Read per Second | 2739.21 | 6501.03 | - | 2741.72 | 13174.12 | -
Avg. Kbytes Written per Second | 7190.15 | 3364.6 | 2.14 | 4098.95 | 1043.45 | 3.93
Avg. Read Requests per Second | 40.24 | 66.93 | - | 9.76 | 48.91 | -
Avg. Write Requests per Second | 17.13 | 12.2 | 1.4 | 10.84 | 3.62 | 2.99
Avg. I/O Latencies in Milliseconds | 55.76 | 32.91 | 1.69 | 41.67 | 27.32 | 1.53

Finally, Q24 executed on Spark SQL is around 5.2 times slower than on Hive and represents the HiveQL group of queries with unstable scaling behavior. On Hive, it utilizes on average 49% of the CPU, whereas on Spark SQL the CPU usage is on average 18%. However, for Spark SQL around 11% of the time is spent waiting for outstanding disk I/O requests (I/O wait), which is much higher than the average for both Hive and Spark SQL. The Spark SQL memory utilization is around 2 times higher than that of Hive. Similarly, the average number of context switches and

Tab. 3.45.: Average Resource Utilization of query Q24 on Hive and Spark SQL for scale factor 1TB. The Ratio column is defined as HiveTime/SparkTime or SparkTime/HiveTime and represents the difference between Hive (MapReduce) and Spark SQL for each metric.

Measured metrics | Hive | Spark SQL | Spark SQL/Hive Ratio
Average Runtime (minutes) | 14.75 | 77.05 | 5.22
Avg. CPU Utilization % - User | 48.92% | 17.52% | -
Avg. CPU Utilization % - System | 2.01% | 1.61% | -
Avg. CPU Utilization % - I/O wait | 0.48% | 11.21% | 23.35
Memory Utilization % | 43.60% | 82.84% | 1.9
Avg. Kbytes Transmitted per Second | 3123.24 | 4373.39 | 1.4
Avg. Kbytes Received per Second | 3122.92 | 4374.41 | 1.4
Avg. Context Switches per Second | 7077.1 | 8821.01 | 1.25
Avg. Kbytes Read per Second | 7148.77 | 7810.38 | 1.09
Avg. Kbytes Written per Second | 169.46 | 3762.42 | 22.2
Avg. Read Requests per Second | 22.28 | 64.38 | 2.89
Avg. Write Requests per Second | 4.71 | 8.29 | 1.76
Avg. I/O Latencies in Milliseconds | 21.38 | 27.66 | 1.29

the average I/O latency times of Hive are around 20% - 23% lower than those of the Spark SQL execution. In this case, even the average write throughput of Spark SQL is much higher than that of Hive. Analogously to Q7, the standard deviation of the three runs was around 8.6% for the 600GB data set and around 5% for the 1TB data set, which is a clear sign that the query behavior is not stable. Again, the reason is the aforementioned join issue [15c] in the utilized Spark SQL version.

3.4.6 Lessons Learned and Future Work

This paper presented the first results of our initiative to run BigBench on Spark. We started by evaluating the data scalability behavior of the current MapReduce BigBench implementation. The results revealed that a subset of the OpenNLP (MR) queries (Q19, Q10) scale best with the increase of the data size, whereas a subset of the Python Streaming (MR) queries (Q4, Q30, Q3) show the worst scaling behavior. Then we executed the 14 pure HiveQL queries on Spark SQL and compared their execution times with the respective Hive ones. We observed that both Hive and Spark SQL queries achieve on average better than linear data scaling behavior. Our analysis identified a group of unstable queries (Q7, Q16, Q21, Q22, Q23 and Q24), which were influenced by the join issue [15c] in Spark SQL. For these queries, we observed a much higher standard deviation (4% - 20%) between the three executions even for the larger data sizes.

Our experiments showed that for the stable pure HiveQL queries (Q6, Q9, Q11, Q12, Q13, Q14, Q15 and Q17), Spark SQL performs between 1.5 and 6.3 times faster than Hive. Last but not least, investigating the resource utilization of queries with different scaling behavior showed that the majority of evaluated MapReduce queries (Q4, Q5, Q18, Q7 and Q9) are memory bound. For queries Q7 and Q9, Spark SQL:

• Utilized less CPU, whereas it showed higher I/O wait time than Hive.

• Read more data from disk, whereas it wrote less data than Hive.

• Utilized less memory than Hive.

• Sent less data over the network than Hive.

The next step is to investigate the influence of various data formats (ORC, Parquet, Avro etc.) on the query performance. Another direction to extend the study will be to repeat the experiments on other SQL-on-Hadoop engines.

Acknowledgments

This work has benefited from valuable discussions in the SPEC Research Group’s Big Data Working Group. We would like to thank Tilmann Rabl (University of Toronto), John Poelman (IBM), Bhaskar Gowda (Intel), Yi Yao (Intel), Marten Rosselli, Karsten Tolle, Roberto V. Zicari and Raik Niemann of the Frankfurt Big Data Lab for their valuable feedback. We would like to thank the Fields Institute for supporting our visit to the Sixth Workshop on Big Data Benchmarking at the University of Toronto.

3.5 The Influence of Columnar File Formats on SQL-on-Hadoop Engine Performance: A Study on ORC and Parquet

Abstract

Columnar file formats provide an efficient way to store data to be queried by SQL-on- Hadoop engines. Related works consider the performance of processing engine and file format together, which makes it impossible to predict their individual impact. In this work we propose an alternative approach: by executing each file format on the same processing engine, we compare the different file formats as well as their different parameter settings. We apply our strategy to two processing engines, Hive and SparkSQL, and evaluate the performance of two columnar file formats, ORC and Parquet. We use BigBench (TPCx-BB), a standardized application-level benchmark for Big Data scenarios. Our experiments confirm that the file format selection and its configuration significantly affect the overall performance. We show that ORC generally performs better on Hive, while Parquet achieves best performance with SparkSQL. Using ZLIB compression brings up to 60.2% improvement with ORC, while Parquet achieves up to 7% improvement with Snappy. Exceptions are the queries involving text processing, which do not benefit from using any compression.

This study is joint work between Todor Ivanov and Matteo Pergolesi and was submitted on 26.09.2018 for review in the Journal Concurrency and Computation: Practice and Experience, Wiley.

3.5.1 Introduction

In the last years, Hadoop has become the standard platform for storing and managing Big Data. However, the lack of skilled developers to write MapReduce programs has pushed the adoption of SQL dialects into the Hadoop Ecosystem, in an attempt to benefit from existing relational database skills, especially in the Business Intelligence and Analytics departments. Apache Hive [Thu+09; Thu+10] has emerged as the standard data warehouse engine on top of Hadoop. This adoption by the industry has led the developer community to continuously work on improvements both in the query execution path and in the data storage strategies. For example, [Hua+14] proposes effective query planning, a vectorized query execution model and a new file format called Optimized Record Columnar (ORC). Similarly, Parquet [18m; 18j], inspired by the Google Dremel paper [Mel+10], is another columnar file format.

Columnar file formats store structured data in a column-oriented way. The ad- vantages of the columnar over the row oriented formats [Hua+14; 18m] are the following:

• They make it efficient to scan only a subset of columns. If a query accesses only a few columns of a table, the I/O can be drastically reduced compared to traditional row oriented storage where you are forced to read entire rows.

• By organizing data by columns, chunks of data of the same type are stored sequentially. Encoding and compression algorithms can take advantage of the data type knowledge and homogeneity to achieve better efficiency both in terms of speed and file size.

Many SQL-on-Hadoop engines exist nowadays, along with file formats designed to accelerate data access and maximize storage capacity. Tab. 3.46 summarizes the most popular SQL-on-Hadoop engines together with the data file formats they support. As shown in the column Default File Format, each engine prefers a different default file format with which it achieves its best performance. For example, ORC is favored by Hive [Thu+09; Thu+10] and Presto [Pre], whereas Parquet is the first choice for SparkSQL [Arm+15b] and Impala [Kor+15]. A number of studies [FMÖ14; Wou+15; Che+14; Cos+16] have investigated and compared the performance of file formats by running them on different SQL-on-Hadoop engines. However, because of the different internal engine architectures, these works actually compare the engine together with its file format optimizations. Contrary to this approach, in this work we compare the ORC and Parquet file formats while keeping the processing engine fixed. Our main goal is not to tell which engine is better, but to understand how the overall performance of an engine is influenced by a change in the file format type or by a different parameter configuration.

Fig. 3.55 shows a graphical representation of our research objective: for a fixed processing engine (Hive or Spark SQL), which combination of file format (ORC or Parquet) and configuration (default, no compression, compression) performs best?

Fig. 3.55.: Graphical representation of the proposed benchmarking approach.

This study investigates the performance of the ORC and Parquet file formats first in Hive and then in Spark SQL. It also shows how tuning the file format configurations accordingly can influence the overall performance. We perform a series of experiments using the standard BigBench (TPCx-BB) benchmark [Gha+13b] with a dataset size of 1000 GB, comparing different ORC and Parquet configurations.

The contributions of this work are as follows:

• performance evaluation of ORC and Parquet file formats with their default configuration on both Hive and Spark SQL engines

• performance comparison of ORC and Parquet file formats with two optimized configurations (respectively with and without data compression) in Hive and Spark SQL

Tab. 3.46.: Popular SQL-on-Hadoop Engines

Engines | Default File Format | Other Supported File Formats
Hive [Thu+10; Thu+09] | ORC | Text, Sequence File, RCFile, Parquet, Avro
Spark SQL [Arm+15b] | Parquet | Text, JSON, Sequence File, RCFile, ORC, Avro
Impala [Kor+15] | Parquet | Text, Sequence File, RCFile, Avro
Pig [Ols+08b] | Pig Text File | Sequence File, RCFile, ORC, Parquet, Avro
Drill [HN13] | None | Text, JSON, Sequence File, MapR-DB, Parquet, Avro
Presto [Pre] | ORC | Text, JSON, Sequence File, RCFile, ORC, Parquet
Tajo [Cho+13a] | Text File | JSON, Sequence File, RCFile, ORC, Parquet
HAWQ [Cha+14] | AO, CO, Parquet | PXF, Text, RCFile, Avro, Hbase
IBM BigSQL [18s] | Text File | Sequence File, RCFile, ORC, Parquet, Avro
Phoenix [18k] | CSV, JSON | Hbase, Spark RDD and DataFrames
AsterixDB [Als+14] | ADM, CSV | ADM (super-set of JSON)
Vertica [18ai] | internal raw formats | ORC, Parquet
AWS Athena/Presto [18a] | ORC | ORC, Parquet, CSV, JSON, Avro

• investigate the influence of data compression (Snappy) on the file format performance

• detailed query analysis of representative BigBench queries

Our experiments confirm that the file format selection and its configuration signifi- cantly affect the overall performance. We show that ORC generally performs better on Hive, while Parquet achieves best performance with Spark SQL. Using ZLIB com- pression brings up to 60.2% improvement with ORC, while Parquet achieves up to 7% improvement with Snappy. Exceptions are the queries involving text processing, which do not benefit from using any compression.

The rest of the section is organized as follows: Subsection 3.5.2 gives background information and related work; Subsection 3.5.3 presents the experimental setup and the preparation stages; Subsection 3.5.4 discusses benchmark results for Hive and Subsection 3.5.5 discusses benchmark results for SparkSQL; Subsection 3.5.6 presents an in-depth analysis of a small subset of representative queries. Finally, Subsection 3.5.7 summarizes our results.

3.5.2 Background and Related Work

In this section we briefly introduce the main technologies and terms. We also present a summary of the most relevant studies investigating data file formats in SQL-on-Hadoop systems.

Hive

Apache Hive [Thu+09; Thu+10] is a data warehouse infrastructure built on top of Hadoop. Hive was originally developed by Facebook and supports the analysis of large data sets stored on HDFS by queries in a SQL-like declarative query language, called HiveQL. It does not strictly follow the SQL-92 standard. Additionally, natively calling User Defined Functions (UDF) in HiveQL allows filtering data with custom Java or Python scripts. Plugging in custom scripts in HiveQL makes the implementation of natively unsupported statements possible. Hive consists of two core components: the driver and the Metastore. The driver is responsible for accepting HiveQL statements, submitted through the command-line interface (CLI) or the HiveServer, and translating them into jobs that are submitted to the MapReduce engine [Thu+10]. This allows users to analyze large data sets without actually having to develop MapReduce programs themselves. The Metastore is the central repository for Hive's metadata and stores all information about the available databases, tables, table columns, column data types and more. The Metastore typically uses a traditional RDBMS like MySQL to persist the metadata.
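As a hedged illustration of this extension mechanism (the script, table and column names are made up and not taken from BigBench; the statements are plain HiveQL and are only wrapped in a HiveContext here to keep all code examples in one language, they can equally be submitted through the Hive CLI), a query can stream rows through an external Python script via TRANSFORM:

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="HiveQLCustomScriptExample")
hc = HiveContext(sc)

# Ship a custom Python script to the workers and stream rows through it.
hc.sql("ADD FILE /tmp/sessionize.py")
sessions = hc.sql("""
    SELECT TRANSFORM (user_id, click_time)
    USING 'python sessionize.py'
    AS (user_id, session_id)
    FROM web_clickstreams
""")
sessions.show(5)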

Spark SQL

Apache Spark [Zah+10] is a cluster computing system that is able to run batch and streaming analysis jobs on data distributed across the cluster. Spark SQL [Arm+15b] is one of the many high level tools running on top of a Spark cluster. It facilitates the processing of structured data by offering a SQL-like interface and support for HiveQL and Hive UDFs. To achieve this, it defines the concept of a DataFrame, a collection of structured records, and a declarative API to manipulate it. Spark SQL includes a specific query optimizer, Catalyst, that improves computation performance thanks to the available information about the data structure. Data can be queried from multiple sources, among them the Hive catalog.

Columnar File Formats

Apache ORC [18i; Hua+14] and Apache Parquet [18j] are the most popular and widely used file formats for Big Data analytics and they share many common concepts in their internal design and structure. In this section we will present the main aspects of columnar file formats in general and their purpose in optimizing query execution. Fig. 3.56 will support the explanation.

Fig. 3.56.: General structure of a columnar file.

The upper right part of Fig. 3.56 shows a table example with four columns (A, B, C and D). Each column has a different color to better understand the column-oriented storage pattern later. A table is stored in one or more HDFS files, composed of one or more file system blocks. Table rows are partitioned into groups and stored in the blocks, as shown in the left upper part of Fig. 3.56. Row groups are data units independent from each other and used to split the input data for parallel processing. Data in row groups is not stored row by row, but column by column. Looking at the first row group detail in Fig. 3.56, we can see that data values for Column A (in red) are stored first, next to it come data values for Column B (in purple) and so on. These portions of column data are usually called column chunks and they allow the filtering of unneeded columns by the query while reading the file. Finally, data in column chunks is split into pages which are the indivisible unit for encoding and compression. Because values in a column chunk share the same data type, encoding algorithms can achieve a more efficient representation. The encoded output is then compressed with a generic compression algorithm (like ZLIB [18ak] or Snappy [18ab]).

While the column-oriented storage strategy is used to filter out unnecessary column data for a query, indexes are used to filter out row groups. They are usually placed in front of each row group, so that by just reading the index, an entire row group can be immediately skipped. This is only possible when the query execution engine uses predicate push-down optimization [Bra18]. Indexes are not shown in Fig. 3.56 for the sake of simplicity.
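The following hedged PySpark sketch illustrates the effect: the query touches only columns A and D of the toy table from Fig. 3.56, so a columnar reader can skip the chunks of columns B and C entirely, and with predicate push-down it can also skip row groups whose statistics rule out D > 100. The path and column names are purely illustrative.

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="ColumnPruningExample")
sqlContext = SQLContext(sc)

df = sqlContext.read.parquet("/data/example_table")   # columns A, B, C, D
result = df.where(df["D"] > 100).select("A")          # only columns A and D need to be read
result.show(10)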

Tab. 3.47 and Tab. 3.48 report all the mentioned general concepts and their specific implementation name for each file format. Furthermore, the parameter names and default values for the Hive engine are shown. We do not report parameters for Spark in the tables because our benchmarking platform (BigBench) generates the dataset with Hive and the file format configuration happens in the Hive settings.

Optimized Record Columnar (ORC) File

Apache ORC [18i; Hua+14] is a self-describing (includes metadata), type-aware columnar file format designed initially for Hadoop workloads, but now used as a general purpose storage format. It is optimized for large streaming reads and has many advantages over its predecessor, the RCfile format [He+11].

Tab. 3.47.: ORC design concepts and default configuration.

Concept | Name | Hive Configuration | Default
Group of rows | stripe | hive.exec.orc.default.stripe.size | 67,108,864 Bytes
Index | index | orc.create.index (not available in Hive) | true
Index | index | hive.exec.orc.default.row.index.stride | 10,000 rows
Portion of column | stream | - | -
Page | compression chunk | hive.exec.orc.default.buffer.size | 262,144 Bytes
Encoding | encoding | hive.exec.orc.encoding.strategy | SPEED
Compression | compression | hive.exec.orc.default.compress | ZLIB

Tab. 3.47 shows the ORC implementation of general design concepts we talked about in Section 3.5.2. A group of rows is called a stripe in ORC. The size is configurable and defaults to 64MB. Stripes are explicitly separated from each other by an index (placed in front) and a footer. A stripe contains portions of the table columns, which are called streams and each stream is divided into pages, called compression chunks. Pages are sized 256KB by default.

The index contains row statistics and the minimum and maximum value for each column stream. An index can be stored into more pages and a page contains information about 10,000 rows by default. Also a bloom filter is included for better row filtering. Finally, a top level index is placed in the file footer.

ORC supports the complete set of data types available in Hive, including the complex types: structs, lists, maps, and unions. Numerical columns are encoded using Run-Length Encoding (RLE), and it is possible to select between the SPEED (default) and COMPRESSION encoding strategies. Dictionary encoding is applied to strings when possible. This makes the encoding more lightweight; it is independent of the generic compression algorithm, which is applied on the encoding output. The default compression algorithm is ZLIB [18ak].

The metadata in ORC is stored at the end of the file (after the file footer) using Protocol Buffers [18z], providing the ability to add new fields to the table schema without breaking readers.

Parquet

Apache Parquet [18j] is an open source columnar storage format using complex nested data structures inspired by the Google Dremel paper [Mel+10]. It is a general purpose storage format that can be used or integrated with any data processing framework or engine. Parquet supports efficient compression and encoding schemas on a per-column level. It uses [18l] for the metadata definitions. There are three types of metadata: file metadata, column (chunk) metadata and page header metadata.

Tab. 3.48 shows the Parquet implementation of general design concepts we talked about in Section 3.5.2. The row group concept keeps the same name in Parquet documentation, while it is called block in Hive configuration and has a default size of 128MB. Differently from ORC, row groups are not explicitly separated from each other. Also for column chunks, Parquet keeps using the general name. Pages composing a column chunk are called data pages, to distinguish them from dictionary pages used as indexes. The default size of each data page is 1MB, 4 times larger than ORC compression chunks.

Tab. 3.48.: Parquet design concepts and default configuration.

Concept | Name | Hive Configuration | Default
Group of rows | row group | parquet.block.size | 134,217,728 Bytes
Index | dictionary page | parquet.enable.dictionary | true
Index | dictionary page | parquet.dictionary.page.size | 1,048,576 Bytes
Portion of column | column chunk | - | -
Page | data page | parquet.page.size | 1,048,576 Bytes
Encoding | encoding | parquet.enable.dictionary | true
Compression | compression | parquet.compression | uncompressed

Column chunk data is paired with metadata that includes a dictionary page, a compact representation of column values. The dictionary pages are useful to filter out unnecessary data for the query. Dictionary page size is customizable and it defaults to 1MB, the same default value used for data pages.

At the end of the file, metadata describing the file structure is stored. The file metadata contains references to all of the column chunk metadata start locations to easily access them. Furthermore, this allows immediately filtering out columns not needed by the query.

For encoding, Parquet uses a dictionary encoding when applicable on text data. RLE or bit-packing are used for numerical values. The selection happens automatically when writing data to a Parquet file. A further generic compression algorithm can be applied on encoded data. In the default Hive configuration, the Parquet data is not compressed.
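As a hedged sketch of how the two formats and their compression settings discussed above can be selected (the table names are illustrative; the statements are plain HiveQL wrapped in a HiveContext only to keep all code examples in one language, and they can equally be run from the Hive CLI):

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="ColumnarFormatSelection")
hc = HiveContext(sc)

# ORC with ZLIB compression (the Hive default listed in Tab. 3.47).
hc.sql("SET hive.exec.orc.default.compress=ZLIB")
hc.sql("CREATE TABLE store_sales_orc STORED AS ORC AS SELECT * FROM store_sales")

# Parquet with Snappy compression (overriding the uncompressed default of Tab. 3.48).
hc.sql("SET parquet.compression=SNAPPY")
hc.sql("CREATE TABLE store_sales_parquet STORED AS PARQUET AS SELECT * FROM store_sales")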

Related Work

In recent years, multiple studies evaluated and compared the different SQL-on-Hadoop engines along with the file formats they support. A recent SQL-on-Hadoop tutorial [Aba+15] at VLDB 2015 reviews in detail the architectures of the most popular ones. ORC and Parquet are also listed as the most widely used file formats.

Performance Comparisons. Tab. 3.49 summarizes related work that evaluates SQL-on-Hadoop engines with ORC and Parquet.

Tab. 3.49.: Summary of related work.

Citation | Engines | Benchmark | File Formats | Configurations
Floratou et al. [FMÖ14] | Hive v0.12, Hive-Tez v0.13, Impala v1.2.2 | TPC-H, TPC-DS | ORC, Parquet | Text, Default, Snappy
Chen et al. [Che+14] | Hive v0.10, HiveStinger v0.12, Shark v0.7.0, Impala v1.0.1, Presto v0.54 | TPC-DS | ORC, Parquet | Text, Default
Wouw et al. [Wou+15] | Hive v0.12, Impala v1.2.3, Shark v0.8.1 | CALDA, real world dataset | Sequence file | Snappy
Costea et al. [Cos+16] | Hive v1.2.1, VectorH v5.0, Impala v2.3, HAWQ v1.3.1, Spark SQL v1.5.2 | TPC-H | ORC, Parquet, VectorH | ORC+Snappy (Hive), Parquet+Snappy (Impala, HAWQ, SparkSQL), VectorH+LZ4
Poggi et al. [Pog+16] | Hive v1.2.1, Hive+Tez v1.2.1 | TPC-H | ORC | Default
Poggi et al. [PMC17] | Hive+Tez v1.2-2.1, Spark+SQL+MLlib v1.6-2.1 | BigBench (TPCx-BB) | ORC | Default
Pouria et al. [PCW17] | Hive+Tez v1.2, AsterixDB v0.8.9, Spark v1.5 | TPC-H | ORC (Hive), Parquet (SparkSQL) | Default

Floratou et al. [FMÖ14] compare the performance of ORC and Hive with that of Parquet and Impala using TPC-H [18ah] and TPC-DS [18af] queries. The results show that Impala is 3.3× to 4.4× faster than Hive on MapReduce and 2.1× to 2.8× faster than Hive on Tez for the TPC-H experiments. For the TPC-DS inspired experiments, Impala is 8.2× to 10× faster than Hive on MapReduce and about 4.3× faster than Hive on Tez. The results also show that Parquet skips data more efficiently than the ORC format, which tends to prefetch unnecessary data especially when

a table contains a large number of columns. However, the built-in index in ORC format mitigates that problem when data is sorted.

Similarly, Chen et al. [Che+14] compare multiple SQL-on-Hadoop engines using modified TPC-DS queries on clusters with a varying number of nodes. In terms of storage formats, they use the default ORC and Parquet configuration parameters. The results show that overall Impala and Shark are the fastest, followed by Presto, Hive and Stinger. Also, Shark and Impala perform better on small datasets, and Hive, Stinger and Shark are sensitive to data skewness.

Another work by Wouw et al. [Wou+15] presents a new benchmark with real and synthetic data, and compares Shark, Impala and Hive in terms of processing power, resource utilization and scalability. The results do not show a clear winner in terms of performance, but Impala and Shark behave similarly. In terms of resource consumption, Impala is the most CPU efficient and has slightly less disk I/O than Shark and Hive.

Costea et al. [Cos+16] introduce the VectorH engine as a new SQL-on-Hadoop system on top of Vectorwise and compare it with similar engines using TPC-H. In the experiments they use ORC and Parquet with Snappy compression. The experiments show that VectorH performs 1-3 orders of magnitude faster than Impala (with Parquet), Hive (with ORC), HAWQ (with Parquet) and Spark SQL (with Parquet), thanks to the multiple optimizations introduced in the paper.

A recent work by Poggi et al. [Pog+16] evaluates the Hive on MapReduce and Hive on Tez (with the default ORC format configuration) performance on multiple cloud providers using the TPC-H benchmark. The results show that the price-to-performance ratio for the best cloud configuration is within a 30% cost difference at the 1TB scale. The same team [PMC17] evaluated different cloud providers and their Hive+Tez/Hive+MR as well as Spark offerings using the BigBench (TPCx-BB) benchmark. All experiments were performed using data stored in ORC format and default configurations unless there were specific execution problems. The results showed that Hive-on-Tez performs up to 4x better than Hive-on-MapReduce. Hive-on-Tez is also faster than Spark 2.1 at the lower scale factors, but this difference narrows for the larger data sizes. However, it is not possible to draw any conclusions on whether ORC or Parquet influenced the overall performance.

Last but not least, Pirzadeh et al. [PCW17] compare different SQL-on-Hadoop engines (Hive, Spark SQL, AsterixDB and a commercial parallel relational database), stressing them with TPC-H and storing the underlying data in various file formats. Similar to other studies, the authors compare SparkSQL with text data and Parquet, Hive-on-MR and Hive-on-Tez with ORC, and AsterixDB with normalized and nested data. The results show that using optimized columnar file formats such as ORC and Parquet significantly improves the performance.

Optimizations. Recently, many new techniques for optimizing the performance and efficiency of analytical workloads on top of columnar storage formats have been proposed. Bian et al. [Bia+17] focus on improving the data scan throughput by finding an optimal column layout. By applying efficient column ordering they reduce

the end-to-end query execution time by up to 70%, and by an additional 12% when using column duplication with less than 5% extra storage.

A different approach of reducing data access through data skipping was presented by Sun et al. [Sun+14a; Sun+14b]. The authors introduce a four-step framework (workload analysis, augmentation, reduce and partitioning phases) for data skipping by applying a more effective partitioning scheme that takes the query filters into account. The results show 3-7x improvements in the query response time compared to traditional range partitioning. In their latest work, Sun et al. [Sun+16] present a novel hybrid data skipping framework that optimizes the overall query performance by automatically balancing skipping effectiveness and tuple-reconstruction overhead. It allows both horizontal and vertical partitioning of the data, which maximizes the overall query performance.

At the same time, new file formats utilizing the capabilities of emerging storage components like Non-Volatile Memory Express (NVMe) devices were introduced. Apache Arrow [18c] provides a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. Trivedi et al. [Tri+18] introduced a new high-performance file format optimized for NVMe devices that achieves up to 21.4x performance gains. The authors integrate it with SparkSQL and show up to 3x query accelerations with TPC-DS.

However, none of the above studies investigates the similarities and differences of the two formats using a common SQL-on-Hadoop engine as a baseline. Additionally, most of the above comparisons are based on benchmarks using structured data, such as TPC-H [18ah] and TPC-DS [18af]. In our study, we use BigBench, which operates on structured, semi-structured and unstructured data and has been standardized as TPCx-BB [TPC18b] by the TPC committee.

3.5.3 Experimental Setup

This section describes the hardware and software components, and the different configurations used in our experiments.

Hardware Configuration

The experiments were performed on a cluster consisting of 4 nodes connected directly through a 1GBit Netgear switch. All 4 nodes are Dell PowerEdge T420 servers. The master node is equipped with 2x Intel Xeon E5-2420 (1.9GHz) CPUs, each with 6 cores, 32GB of main memory and a 1TB hard drive. The 3 worker nodes are equipped with 1x Intel Xeon E5-2420 (2.20GHz) CPU with 6 cores, 32GB of RAM and 4x 1TB (SATA, 7.2K RPM, 64MB Cache) hard drives. A more detailed specification of the server nodes is provided in Appendix B (Table B.1 and Table B.2).

Software Configuration

All four nodes in the cluster run Ubuntu Server 14.04.1 LTS as the operating system. On top of that we installed the Cloudera Distribution of Hadoop (CDH) version 5.11.0, which provides Hadoop, HDFS and YARN, all at version 2.6.0, as well as Hive 1.1.0. Separately, we installed Spark 2.3.0 and configured Spark SQL to work with YARN and the Hive catalog. The total storage capacity is 13TB, of which 8TB are effectively available as HDFS space. Due to the resource limitations of our setup (only 3 worker nodes), the cluster was configured with a replication factor of two. Tab. 3.50 shows the relevant cluster parameters and how they were adjusted for the experiments.

Tab. 3.50.: Cluster configuration.

Component | Parameter | Configuration Value
YARN | yarn.nodemanager.resource.memory-mb | 31GB
YARN | yarn.scheduler.maximum-allocation-mb | 31GB
YARN | yarn.nodemanager.resource.cpu-vcores | 11
Spark | master | yarn
Spark | num-executors | 9
Spark | executor-cores | 3
Spark | executor-memory | 9GB
Spark | spark.serializer | org.apache.spark.serializer.KryoSerializer
MapReduce | mapreduce.map.java.opts.max.heap | 2GB
MapReduce | mapreduce.reduce.java.opts.max.heap | 2GB
MapReduce | mapreduce.map.memory.mb | 3GB
MapReduce | mapreduce.reduce.memory.mb | 3GB
Hive | Client Java Heap Size | 2GB
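For reference, the Spark-related values of Tab. 3.50 could also be expressed as SparkSession options, as in the following PySpark sketch. This is only an illustration (the application name is a placeholder); in our experiments the values were provided through the cluster configuration files rather than programmatically.

# Illustrative sketch: the Spark parameters of Tab. 3.50 as SparkSession options.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("bigbench-file-format-experiments")   # placeholder name
         .master("yarn")
         .config("spark.executor.instances", "9")        # num-executors
         .config("spark.executor.cores", "3")
         .config("spark.executor.memory", "9g")
         .config("spark.serializer",
                 "org.apache.spark.serializer.KryoSerializer")
         .enableHiveSupport()                            # use the Hive catalog
         .getOrCreate())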

BigBench

In order to perform an extensive evaluation of the file formats, it was necessary to use a Big Data benchmark that is able to evaluate the 3Vs characteristics and utilizes structured, semi-structured and unstructured data. BigBench [Gha+13b; Bar+14] has been proposed and developed to address exactly these needs. It is an end-to-end analytics, application-level and technology-agnostic Big Data benchmark that was recently adopted by the TPC and released as the TPCx-BB [TPC18b] benchmark. Chowdhury et al. [Cho+13b] presented a BigBench implementation for the Hadoop ecosystem, which is available on GitHub [18n] and was used for our experiments. The data set is generated based on a fictitious product retailer business model, while the workload consists of 30 complex queries. 10 queries were taken from the TPC-DS benchmark [18af], whereas the remaining 20 queries were adapted from the McKinsey report [Man+11]. Tab. 3.51 summarizes the number and type of queries

of this BigBench implementation. All queries are implemented using Apache Hadoop [18d], Hive [Hiva], Mahout [18g] and the Apache OpenNLP toolkit [18h]. Apache Mahout [18g] is a library for quickly creating scalable machine learning applications on top of MapReduce, Spark and similar frameworks. Apache OpenNLP [18h] is a library toolkit for machine learning processing of natural language text.

Tab. 3.51.: BigBench query types.

Query Type | Queries | Number of Queries
Pure HiveQL | 6, 7, 9, 11, 12, 13, 14, 15, 16, 17, 21, 22, 23, 24 | 14
MapReduce/UDTF | 1 | 1
MapReduce/Python | 2, 3, 4, 8, 29, 30 | 6
HiveQL/Spark MLlib | 5, 20, 25, 26, 28 | 5
MapReduce/OpenNLP | 10, 18, 19, 27 | 4

File Format Configurations

One of the goals of this study is to investigate how changing the configuration parameters of ORC and Parquet influences their performance. Therefore, defining the exact configuration parameter values was an essential first step before starting the experiments.

As shown in Section 3.5.2, the two file formats share many concepts in their structure design. Our goal is to set up ORC and Parquet with similar configurations, so that their performance can be compared meaningfully. We define three test configurations, reported in Tab. 3.52, and focus on three parameters: the row group size, the page size and the compression algorithm. All other parameters are set to their default values in all three test configurations. As expected, the use of indexes in ORC and of dictionary pages in Parquet is enabled by default. We set the HDFS block size to 256MB for all of our tests.

The first test configuration is called Default Config and uses the default ORC and Parquet parameters as stated in each file format's documentation. The two formats are configured very differently, especially regarding the compression parameter: ORC uses ZLIB compression while Parquet does not use any compression, which makes the benchmark results not directly comparable. Nevertheless, we decided to keep this setup to show how the file formats behave when used "out-of-the-box" and to highlight the performance change after an optimized parameter configuration.

The two other configurations, named Snappy Config and No Compression Config, use the same row group and page size values, while the compression algorithm changes. We increase the row group size to 256MB to have more sequential reads from disk, at the expense of higher memory usage [Hua+14; Bia+17]. The page size is set to 1MB (the default for Parquet). A larger page size improves the

Tab. 3.52.: File format configurations.

Default Config
Parquet | parquet.block.size | 128MB
Parquet | parquet.page.size | 1MB
Parquet | parquet.compression | uncompressed
ORC | hive.exec.orc.default.stripe.size | 64MB
ORC | hive.exec.orc.default.buffer.size | 256KB
ORC | hive.exec.orc.default.compress | zlib

No Compression Config
Parquet | parquet.block.size | 256MB
Parquet | parquet.page.size | 1MB
Parquet | parquet.compression | uncompressed
ORC | hive.exec.orc.default.stripe.size | 256MB
ORC | hive.exec.orc.default.buffer.size | 1MB
ORC | hive.exec.orc.default.compress | uncompressed

Snappy Config
Parquet | parquet.block.size | 256MB
Parquet | parquet.page.size | 1MB
Parquet | parquet.compression | snappy
ORC | hive.exec.orc.default.stripe.size | 256MB
ORC | hive.exec.orc.default.buffer.size | 1MB
ORC | hive.exec.orc.default.compress | snappy

compression performance and decreases overhead, again at the expense of higher memory usage.

The compression parameter is also very important, as it determines which general-purpose algorithm such as Snappy, ZLIB or LZO is used after the file format encoding [Hua+14]. Using compression, file readers perform fewer I/O operations to get the data from disk, but more CPU cycles are spent decompressing the data. In our configurations we use Snappy compression, which is supported by both ORC and Parquet.

Many parameter combinations are possible and would lead to interesting research questions, for example: how does the performance of the two file formats change if we vary the row group and page size from 64MB to 256MB in steps of 16MB? Due to time constraints, we limit our tests to these three configurations to achieve a meaningful comparison of the two file formats when running queries on a fixed engine.
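For reproducibility, the three configurations of Tab. 3.52 can be captured as plain Hive session settings. The sketch below is our own illustrative helper (not part of the BigBench kit); it collects the parameters in a Python dictionary and emits them as SET statements that could be issued before the tables are (re)created.

# Sketch: the three test configurations of Tab. 3.52 expressed as Hive session
# settings. The value literals follow Tab. 3.52; the exact accepted spellings
# (e.g. NONE vs. uncompressed for ORC) may depend on the Hive version.
CONFIGS = {
    "default": {
        "parquet.block.size": str(128 * 1024 * 1024),
        "parquet.page.size": str(1024 * 1024),
        "parquet.compression": "uncompressed",
        "hive.exec.orc.default.stripe.size": str(64 * 1024 * 1024),
        "hive.exec.orc.default.buffer.size": str(256 * 1024),
        "hive.exec.orc.default.compress": "zlib",
    },
    "no_compression": {
        "parquet.block.size": str(256 * 1024 * 1024),
        "parquet.page.size": str(1024 * 1024),
        "parquet.compression": "uncompressed",
        "hive.exec.orc.default.stripe.size": str(256 * 1024 * 1024),
        "hive.exec.orc.default.buffer.size": str(1024 * 1024),
        "hive.exec.orc.default.compress": "uncompressed",
    },
    "snappy": {
        "parquet.block.size": str(256 * 1024 * 1024),
        "parquet.page.size": str(1024 * 1024),
        "parquet.compression": "snappy",
        "hive.exec.orc.default.stripe.size": str(256 * 1024 * 1024),
        "hive.exec.orc.default.buffer.size": str(1024 * 1024),
        "hive.exec.orc.default.compress": "snappy",
    },
}

def print_hive_settings(name: str) -> None:
    """Print the Hive SET statements for one of the test configurations."""
    for key, value in CONFIGS[name].items():
        print(f"SET {key}={value};")

print_hive_settings("snappy")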

Engine Configuration

To take full advantage of the columnar file format design, it is important to configure some settings in the query execution engine. Note that these parameters do not affect the file generation, but only the query processing.

Code 3.1: SparkSQL query execution settings.
1 spark.sql.parquet.filterPushdown true
2 spark.sql.parquet.recordLevelFilter.enabled true
3 spark.sql.hive.convertMetastoreParquet true
4 spark.sql.orc.filterPushdown true

For SparkSQL, we edit the spark-defaults.conf file by adding the lines shown above [18ae; Blu16]. The first group of parameters is relevant for the Parquet file format: lines 1 and 2 enable full support for predicate push-down optimizations, and the parameter at line 3 enables the use of Parquet's built-in reader and writer for Hive tables instead of the SerDe [18aa]. Finally, line 4 enables predicate push-down also for ORC.
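The same settings can also be applied at runtime on a SparkSession, which may be convenient for ad-hoc experiments; the following sketch is equivalent in spirit to editing spark-defaults.conf and uses only the configuration keys already listed in Code 3.1.

# Sketch: applying the settings of Code 3.1 at runtime instead of via
# spark-defaults.conf.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.conf.set("spark.sql.parquet.filterPushdown", "true")
spark.conf.set("spark.sql.parquet.recordLevelFilter.enabled", "true")
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "true")
spark.conf.set("spark.sql.orc.filterPushdown", "true")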

Code 3.2: Hive query execution settings.
1 set hive.optimize.ppd = true;
2 set hive.optimize.ppd.storage = true;
3 set hive.ppd.recognizetransivity = false;
4 set hive.optimize.index.filter = true;

BigBench uses custom settings for Hive, overriding the defaults [18q]. We set the parameters shown above in the engineSettings.sql file placed in the BigBench Hive subfolder. Lines 1-3 enable the predicate push-down optimization, and line 4 enables the use of format-specific indexes for both ORC and Parquet.

Load Times and Data Sizes

The last step before starting the experiments is to generate data using the BigBench data generator, which is based on PDGF [Rab+10b]. It generates data in text format that needs to be loaded and converted to the specific file format. The loading process can vary greatly in time depending on the compression used by ORC and Parquet. BigBench relies only on Hive for creating the schema in the Metastore and storing the file format data in HDFS. Fig. 3.57 displays the loading times (in minutes) versus the dataset sizes (in GB) for all configurations in Hive version 1.1.0 with scale factor 1000 (1TB). However, even with the No Compression Config the size of the generated data is much smaller than 1TB (around 600GB), due to the columnar file format optimizations applied to the stored data (Section 3.5.2). The detailed summary of all results is available on GitHub [18o] in the file Loading-times.xlsx.
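As an illustration of this load/convert step, the following PySpark sketch rewrites a generated text table as ORC and Parquet with Snappy compression. The paths, delimiter and table name are placeholders of ours; BigBench itself performs the equivalent step through Hive statements rather than through the DataFrame API.

# Sketch of the load/convert step: read generated delimited text data and
# rewrite it as ORC and Parquet with Snappy compression. Paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

raw = (spark.read
       .option("delimiter", "|")                          # assumed text delimiter
       .csv("hdfs:///bigbench/data_raw/store_sales"))     # placeholder path

(raw.write.mode("overwrite")
    .option("compression", "snappy")
    .orc("hdfs:///bigbench/data_orc/store_sales"))        # placeholder path

(raw.write.mode("overwrite")
    .option("compression", "snappy")
    .parquet("hdfs:///bigbench/data_parquet/store_sales"))  # placeholder path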

To better compare the measurements in Fig. 3.57, we divide it into four equal quadrants by drawing a line at 350 GB on the Y-axis and 150 minutes on the X-axis. There is an obvious trade-off between the time taken to generate the data and the size of the generated data. The points in the lower-left quadrant would be the ones with optimal

Fig. 3.57.: Hive Load Time (min.) on the X-axis and Data Size (GB) on the Y-axis for Scale Factor 1000.

performance; in our case this quadrant is empty. Therefore, the ORC Default configuration (using ZLIB compression) in the lower-right quadrant has the best performance in terms of time and data size, followed by the ORC Snappy configuration with the fastest generation time. Parquet achieves its best performance with the Snappy configuration, also in the lower-right quadrant.

Performance Evaluation

Our plan is to run experiments with all 30 BigBench queries at scale factor 1000 (1000 GB of data). All tests are repeated three times and we measure the execution time of each query; the averaged values are taken as the representative numbers. In order to compare the three different configurations defined in Tab. 3.52, we repeat the three runs for each configuration for both ORC and Parquet. To better compare the file formats and understand their support on multiple processing engines, we perform the experiments on two popular engines, Hive and Spark SQL. Tab. 3.53 summarizes all experimental runs to give an idea of the time overhead of our study. The Average Total Execution Time per Run is the time needed to run all 30 BigBench queries; we report this time in hours for Hive and Spark SQL. As stated before, the tests are repeated three times for each combination of file format type, configuration and processing engine. By summing and multiplying, we obtain the Total Execution Time (in hours) needed for each of our configurations, reported in the last line of Tab. 3.53. Summing all total execution times for the experiments gives around 701 hours (≈29.2 days) of testing time.

Tab. 3.53.: Experimental roadmap.

Configuration | Default | No Compression | Snappy
File Format | ORC / Parquet | ORC / Parquet | ORC / Parquet
BigBench Queries | 30 | 30 | 30
Average Total Execution Time per Run on Hive (hours) | 24 / 26 | 24 / 26 | 24 / 25
Average Total Execution Time per Run on SparkSQL (hours) | 12 / 15 | 15 / 14 | 15 / 14
Number of Runs | 3 | 3 | 3
Total Execution Time (hours) | 108 / 124 | 116 / 120 | 116 / 117

We defined two metrics to help us compare the different file format configurations: the first is called Performance Improvement (PI%) and the second Compression Improvement (CI%). When comparing two execution times, we define the higher time HT (worse) as the baseline and compute the time difference to the lower time LT (better) as a percentage. This is the amount of saved time between the two executions, and we define it as the Performance Improvement (PI%):

\[ PI\% = \frac{LT \cdot 100}{HT} - 100 \tag{3.8} \]

Similarly, we define the Compression Improvement (CI%) as follows:

\[ CI\% = \frac{NoCompressionTime \cdot 100}{CompressionTime} - 100 \tag{3.9} \]

The Performance Improvement (PI%) metric reports by how many percent a file format configuration is faster on ORC compared to Parquet, or vice versa. It compares the execution times of the different file formats, whereas the Compression Improvement (CI%) compares the improvement of a particular file format (Parquet or ORC) when using data compression. The CI% can be negative when the baseline configuration without any compression performs faster than the configurations with compression (e.g. Snappy and ZLIB).
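As a small worked example (using the Total Time values that appear later in Tab. 3.54), the two metrics of Equations 3.8 and 3.9 can be computed as follows; the helper names are ours and not part of the benchmark tooling.

# Sketch: the PI% and CI% metrics of Equations 3.8 and 3.9.
def performance_improvement(lower_time: float, higher_time: float) -> float:
    """PI% = LT*100/HT - 100; its magnitude is the share of saved time."""
    return lower_time * 100.0 / higher_time - 100.0

def compression_improvement(no_compression_time: float, compression_time: float) -> float:
    """CI% = NoCompressionTime*100/CompressionTime - 100."""
    return no_compression_time * 100.0 / compression_time - 100.0

# Worked example with the Hive Pure HiveQL totals (minutes) from Tab. 3.54:
# ORC vs. Parquet under the Default configuration, and ORC Snappy vs. ORC
# uncompressed.
print(abs(performance_improvement(312, 366)))  # ~14.8, reported as ~14.8% PI
print(compression_improvement(332, 320))       # ~3.8 with the rounded totals;
                                               # Tab. 3.54 reports ~3.7% CI from the raw times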

3.5.4 Hive Results

In the next subsections we show the results for the Hive processing engine. Each table is dedicated to a query set identified by the query type as described in Tab. 3.51; note that we merged the single MapReduce/UDTF query into the MapReduce/Python group. The first column reports the query number, while the following columns report

execution times for each combination of configuration (Tab. 3.52) and file format type. The Total Time of query execution is shown at the end of the table, while the last two rows report the PI% and CI%. The latter shows the overall improvement achieved by using data compression, while the PI% shows the overall performance difference between the two file formats for each configuration.

Query execution time is reported in minutes and green cells highlight the best time between the file formats within each of our configurations (i.e., Default, No compression, Snappy). Pairs of cells with the same color (white or light gray) are used to show similar performance between the two file formats: we consider query performance comparable/equal if the difference is lower than or equal to 1 minute. The standard deviation between the 3 query executions is under 5% for all Hive queries, which indicates that Hive achieves very stable execution behavior.
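The classification into "better", "worse" and "comparable" cells can be written down as a tiny helper; this is our own formulation of the 1-minute rule used in the Hive tables, not code from the benchmark.

# Sketch of the comparison rule used in the Hive result tables: two execution
# times (in minutes) are considered comparable if they differ by at most 1 minute.
def compare_times(orc_minutes: float, parquet_minutes: float, tolerance: float = 1.0) -> str:
    if abs(orc_minutes - parquet_minutes) <= tolerance:
        return "≈"   # comparable performance
    return "<" if orc_minutes < parquet_minutes else ">"

# Example with Q06 under the Default configuration (Tab. 3.54): ORC 40 min, Parquet 44 min.
print("Q06:", "ORC", compare_times(40, 44), "Parquet")   # prints: Q06: ORC < Parquet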

Furthermore, we discuss the results from the perspective of a practitioner who runs BigBench as a black-box benchmark and analyses the results. This is also the motivation for including the Default file format configurations in our tests: we want to show what kind of performance one gets by using the file formats out-of-the-box and how adjusting the configuration parameters can influence the system behavior.

Pure HiveQL

Tab. 3.54 shows the results for the Pure HiveQL query type. These queries are fully implemented in HiveQL and, unlike the other query types, do not use UDFs (User Defined Functions) or external libraries.

Looking at the Default configuration column, we can observe a clear pattern with ORC generally performing better; only Q15, Q16 and Q22 show similar performance. For a naive user this can be considered expected behavior since, as shown in Tab. 3.46, ORC is the default file format in Hive and we expect better optimization with this engine. However, the file formats are configured very differently, with ORC using ZLIB compression and a bigger block size, while Parquet uses no compression at all (Tab. 3.52). This comparison can therefore lead to misleading conclusions.

To verify the previous results we can observe the No compression and Snappy columns. In these two configurations we set the ORC and Parquet parameters to be as similar as possible, so that the performance comparison is more accurate. Observing the No compression column, we can see that the previous pattern with ORC as the winner is confirmed, although the performance difference between the two file formats is reduced: as shown in Tab. 3.54, from a ∼14.8% PI in the Default configuration we move to a ∼9.5% PI. Q06 and Q11 join the group of queries showing similar performance.

Moving to the Snappy column, we can observe a general performance improvement for both file formats, as shown by comparing the Total Time cells. Blue cells show the baseline and CI for ORC (∼3.7%), while orange cells do the same for Parquet (∼5.6% CI). When using compression, the better performance of ORC is again confirmed. Still, Q09 improves on Parquet, reaching similar

Tab. 3.54.: Hive Results for Pure HiveQL query type.

Query execution time is reported in minutes. Green cells, as well as the < and > symbols, highlight the best time between the file formats within each configuration. Pairs of cells with similar performance are filled with the same color (white or light gray) and divided by the ≈ symbol. PI% and CI% respectively show the performance difference between the formats and between the uncompressed and compressed configurations.

Query | Default (ORC vs Parquet) | No compression (ORC vs Parquet) | Snappy (ORC vs Parquet)
Q06 | 40 < 44 | 43 ≈ 44 | 41 < 43
Q07 | 30 < 34 | 32 < 34 | 30 < 34
Q09 | 16 < 20 | 16 < 21 | 16 ≈ 16
Q11 | 12 < 14 | 14 ≈ 14 | 13 ≈ 14
Q12 | 26 < 35 | 29 < 35 | 27 < 30
Q13 | 27 < 33 | 30 < 33 | 28 < 31
Q14 | 6 < 9 | 7 < 9 | 5 < 9
Q15 | 4 ≈ 5 | 4 ≈ 5 | 4 ≈ 4
Q16 | 62 ≈ 62 | 62 ≈ 62 | 62 ≈ 61
Q17 | 15 < 18 | 15 < 18 | 14 < 17
Q21 | 20 < 25 | 21 < 25 | 20 < 24
Q22 | 28 ≈ 28 | 28 ≈ 28 | 31 > 28
Q23 | 8 < 12 | 9 < 13 | 8 < 11
Q24 | 20 < 26 | 23 < 26 | 21 < 25
Total Time (min.) | 312 < 366 | 332 < 367 | 320 < 347
Performance Improvement (PI) % | 14.8 % ←− | 9.5 % ←− | 7.8 % ←−
Compression Improvement (CI) % | ORC 6.3 %, Parquet baseline | ORC baseline, Parquet baseline | ORC 3.7 %, Parquet 5.6 %

performance to ORC, while Q22 seems to perform worse on ORC when adding Snappy compression to the configuration.

Finally, we can state that ORC is the winning file format for the Pure HiveQL query type on the Hive engine. Furthermore, using Snappy data compression slightly improves the performance of both file formats.

MapReduce/Python

Tab. 3.55 shows results for the MapReduce/Python query type. The queries are implemented in HiveQL and enriched with Python programs to execute complex operations within the query definition.

Tab. 3.55.: Hive Results for MapReduce/Python query type.

Query execution time is reported in minutes. Green cells, as well as the < and > symbols, highlight the best time between the file formats within each configuration. Pairs of cells with similar performance are filled with the same color (white or light gray) and divided by the ≈ symbol. PI% and CI% respectively show the performance difference between the formats and between the uncompressed and compressed configurations.

Query | Default (ORC vs Parquet) | No compression (ORC vs Parquet) | Snappy (ORC vs Parquet)
Q01 | 12 < 18 | 14 < 17 | 13 < 16
Q02 | 170 < 183 | 172 < 184 | 171 < 180
Q03 | 76 < 91 | 79 < 90 | 76 < 85
Q04 | 155 < 171 | 156 < 172 | 155 < 167
Q08 | 41 < 49 | 44 < 49 | 42 < 45
Q29 | 38 < 40 | 40 ≈ 40 | 39 ≈ 40
Q30 | 250 < 259 | 250 < 258 | 249 < 253
Total Time (min.) | 743 < 811 | 755 < 811 | 746 < 786
Performance Improvement (PI) % | 8.4 % ←− | 6.9 % ←− | 5.1 % ←−
Compression Improvement (CI) % | ORC 1.7 %, Parquet baseline | ORC baseline, Parquet baseline | ORC 1.3 %, Parquet 3.1 %

Starting from the Default column, we can observe again that ORC performs better than Parquet for all queries. Only Q29 shows a difference of two minutes, but similar performance in the other configurations. We believe that this small difference is only due to noise in the execution time and conclude that Q29 behaves similarly regardless of the file format configuration. The table shows a ∼8.4% PI between ORC and Parquet with the Default configuration.

The performance behavior is confirmed in the No compression and Snappy columns, where the PI is reduced to ∼6.9% and ∼5.1%, respectively. Both file formats benefit from a small performance improvement thanks to the introduction of compression, as highlighted in the blue and orange cells at the end of Tab. 3.55.

HiveQL/OpenNLP

Tab. 3.56 shows results for the HiveQL/OpenNLP query type. The queries are implemented in HiveQL and enriched with a Java UDF to process natural language data (i.e., product reviews).

Tab. 3.56.: Hive Results for HiveQL/OpenNLP query type.

Query execution time is reported in minutes. Green cells, as well as the < and > symbols, highlight the best time between the file formats within each configuration. Pairs of cells with similar performance are filled with the same color (white or light gray) and divided by the ≈ symbol. PI% and CI% respectively show the performance difference between the formats and between the uncompressed and compressed configurations.

Query | Default (ORC vs Parquet) | No compression (ORC vs Parquet) | Snappy (ORC vs Parquet)
Q10 | 33 > 18 | 23 > 19 | 30 ≈ 30
Q18 | 50 > 45 | 46 ≈ 47 | 48 < 50
Q19 | 13 > 9 | 10 ≈ 9 | 12 ≈ 12
Q27 | 2 ≈ 1 | 2 ≈ 1 | 2 ≈ 2
Total Time (min.) | 97 > 73 | 81 / 77 | 92 / 94
Performance Improvement (PI) % | −→ 24.7 % | −→ 4.9 % | 2.1 % ←−
Compression Improvement (CI) % | ORC −17.3 %, Parquet baseline | ORC baseline, Parquet baseline | ORC −12.0 %, Parquet −18.4 %

Looking at the Default column, we observe an opposite performance pattern with respect to the previous query types: Parquet shows considerably better performance than ORC for Q10, Q18 and Q19. Q27 takes a very short time, making it hard to spot any significant performance difference across the various configurations. The PI for the Default configuration is ∼24.7%.

To understand this unexpected behavior, it is useful to look at the query performance when the two file formats have a similar configuration. With the No compression configuration, the PI resulting from the usage of Parquet decreases to ∼4.9%. All queries show similar performance, with the exception of Q10, which takes a considerably shorter time on Parquet. The results reported in the Snappy column highlight a very similar performance behavior between the two file formats. Q18 seems to perform better on ORC here, but with a very small difference of about two minutes. We can conclude that the performance is not greatly influenced by the

file format because, if configured similarly, ORC and Parquet show a comparable behavior.

It is interesting to compare the performance between No compression and Snappy. For the OpenNLP query type, the introduction of Snappy data compression does not bring any benefit; instead it worsens the performance. As reported in the last line of Tab. 3.56, both ORC (blue cells) and Parquet (orange cells) show a negative and consistent CI. This observation finally explains the unexpected performance pattern reported in the Default column: Parquet uses no compression by default, resulting in better performance compared to ORC, which uses ZLIB in its Default configuration.

HiveQL/Spark MLlib

Tab. 3.57 shows the results for the HiveQL/Spark MLlib query type. HiveQL code is used to extract and prepare the input data for the machine learning algorithms. Usually, the input is stored in a temporary table, processed with the Spark MLlib Java library, and the output is stored in a new, additional table.

Tab. 3.57.: Hive Results for HiveQL/Spark MLlib query type.

Query execution time is reported in minutes. Green cells, as well as the < and > symbols, highlight the best time between the file formats within each configuration. Pairs of cells with similar performance are filled with the same color (white or light gray) and divided by the ≈ symbol. PI% and CI% respectively show the performance difference between the formats and between the uncompressed and compressed configurations.

Query | Default (ORC vs Parquet) | No compression (ORC vs Parquet) | Snappy (ORC vs Parquet)
Q05 | 147 < 153 | 148 < 154 | 146 < 149
Q20 | 65 < 69 | 68 ≈ 69 | 66 < 69
Q25 | 34 < 42 | 38 < 42 | 35 < 41
Q26 | 24 < 30 | 26 < 30 | 25 < 28
Q28 | 6 ≈ 6 | 6 ≈ 6 | 6 ≈ 6
Total Time (min.) | 276 < 300 | 286 < 301 | 279 < 294
Performance Improvement (PI) % | 8.0 % ←− | 5.0 % ←− | 5.1 % ←−
Compression Improvement (CI) % | ORC 3.8 %, Parquet baseline | ORC baseline, Parquet baseline | ORC 2.6 %, Parquet 2.5 %

Looking at the Default column, we can see that all queries except Q28 perform better on ORC than on Parquet. Q28's performance is not affected by any change in the file format configuration. The PI between the two file formats is 8.0%.

Moving to the No compression column, the previous performance pattern is confirmed but the PI decreases to ∼5.0%. The introduction of Snappy data compression improves the performance of both file formats by a similar amount, ∼2.6% for ORC (blue cells) and ∼2.5% for Parquet (orange cells). Indeed, ORC is the winning file format also for the Snappy configuration.

3.5.5 Spark SQL Results

In the next sections we show the results for the Spark SQL processing engine, again divided into the four BigBench query types listed in Tab. 3.51. The first column reports the query number, while the following columns report execution times for each combination of configuration (Tab. 3.52) and file format type. The total sum of query execution times is shown at the end of each table, while the last two rows report the PI% and CI%.

Query execution time is reported in seconds and green cells highlight the best time between the file formats within each of our configurations. Pairs of cells with the same color (white) are used to show similar performance between the two file formats; we consider query performance comparable/equal if the difference is lower than or equal to 5% of the higher execution time. The standard deviation between the 3 query executions varies greatly across the SparkSQL queries. Queries Q02 and Q30 show standard deviations varying between 7% and 16%, which will be explained in Subsection 3.5.5. All other queries have standard deviations around 10%, which indicates that SparkSQL is less stable than Hive, as reported in [IB15a; IB15d]. We believe this is also due to execution noise in the cluster affecting SparkSQL more, since its execution times are generally much shorter than the Hive ones.

Similar to the Hive engine, we first discuss the performance of the Default configuration. Next, we look at the No compression and Snappy configurations to gain insights and better understand the performance behavior. Generally, query performance on Spark showed more variable results, which is also due to the shorter execution times with respect to Hive and therefore the bigger influence of noise. The query source code for the Spark experiments is exactly the same as in Hive and therefore we use the same query groupings in this section.

Pure HiveQL

Tab. 3.58 reports the results for the Pure HiveQL query type. In general, Parquet seems to perform better in all three configurations.

Looking at the Default configuration, Parquet performs better than ORC for many of the queries, even though the former does not use compression and the latter uses ZLIB. Some queries perform similarly on both file formats (i.e., Q07, Q11, Q21), while others achieve considerably lower execution times with ORC (i.e., Q16, Q22, Q24). In particular, Q22 shows a huge improvement of ∼79.3% when run on ORC with the Default configuration compared to the No compression and Snappy configurations.

Tab. 3.58.: Spark Results for Pure HiveQL query type.

Query execution time is reported in seconds. Green cells, as well as the < and > symbols, highlight the best time between the file formats within each configuration. Pairs of cells with similar performance are filled with the same color (white or light gray) and divided by the ≈ symbol. PI% and CI% respectively show the performance difference between the formats and between the uncompressed and compressed configurations.

Query | Default (ORC vs Parquet) | No compression (ORC vs Parquet) | Snappy (ORC vs Parquet)
Q06 | 894 > 467 | 894 > 480 | 884 > 475
Q07 | 114 ≈ 118 | 130 > 123 | 116 ≈ 117
Q09 | 202 > 136 | 199 > 140 | 201 > 135
Q11 | 126 ≈ 121 | 134 > 116 | 117 > 110
Q12 | 331 > 105 | 331 > 118 | 314 > 122
Q13 | 278 > 155 | 274 > 176 | 271 > 150
Q14 | 148 > 138 | 138 ≈ 140 | 143 > 126
Q15 | 141 > 118 | 133 ≈ 132 | 139 > 125
Q16 | 451 < 613 | 677 > 630 | 665 ≈ 636
Q17 | 254 > 163 | 245 > 168 | 246 > 155
Q21 | 270 ≈ 262 | 244 ≈ 232 | 233 > 210
Q22 | 135 < 654 | 680 ≈ 688 | 687 > 635
Q23 | 155 > 145 | 167 ≈ 160 | 170 > 132
Q24 | 164 < 174 | 166 ≈ 160 | 164 > 154
Total Time (sec.) | 3662 > 3369 | 4412 > 3463 | 4349 > 3282
Performance Improvement (PI) % | −→ 8.0 % | −→ 21.5 % | −→ 24.5 %
Compression Improvement (CI) % | ORC 20.5 %, Parquet baseline | ORC baseline, Parquet baseline | ORC 1.5 %, Parquet 5.5 %

Further experiments with ZLIB compression would need to be performed in order to understand this behavior, but this is out of the scope of our work.

Moving to the No compression column, it becomes clearer that when the two file formats are under the same conditions, Parquet is the optimal choice. At the same time, a group of six queries shows similar performance between the two formats (i.e., Q14, Q15, Q21, Q22, Q23, Q24). The PI% is significant, with Parquet taking ∼21.5% less time than ORC to execute the full set of queries.

In the Snappy column we can observe that both file formats benefit from the application of data compression. Parquet gets an improvement of ∼5.5% (orange cells), while ORC gets a more modest improvement of ∼1.5%. Parquet clearly demonstrates its performance advantage over ORC in all queries, with the exception of Q07 and Q16, which perform similarly. The PI% between the two formats is even more significant than in the No compression case, reaching ∼24.5%.

MapReduce/Python

Tab. 3.59 reports results for the MapReduce/Python query type.

Tab. 3.59.: Spark Results for MapReduce/Python query type.

Query execution time is reported in seconds. Green cells, as well as the < and > symbols, highlight the best time between the file formats within each configuration. Pairs of cells with similar performance are filled with the same color (white or light gray) and divided by the ≈ symbol. PI% and CI% respectively show the performance difference between the formats and between the uncompressed and compressed configurations.

Query | Default (ORC vs Parquet) | No compression (ORC vs Parquet) | Snappy (ORC vs Parquet)
Q01 | 137 > 127 | 127 < 146 | 122 ≈ 128
Q02 | 15644 < 19287 | 19471 > 15937 | 19441 > 17580
Q03 | 2709 ≈ 2683 | 2750 ≈ 2737 | 2713 ≈ 2653
Q04 | 6319 < 7062 | 7375 ≈ 7032 | 6727 ≈ 6752
Q08 | 913 > 675 | 911 > 712 | 900 > 639
Q29 | 203 < 504 | 197 < 517 | 199 > 184
Q30 | 6837 < 15253 | 8690 < 11588 | 9368 < 11539
Total Time (sec.) | 32762 < 45592 | 39522 > 38669 | 39471 ≈ 39474
Performance Improvement (PI) % | 28.1 % ←− | −→ 2.1 % | 0 % ←−
Compression Improvement (CI) % | ORC 20.6 %, Parquet baseline | ORC baseline, Parquet baseline | ORC 0.1 %, Parquet −2.0 %

First, we need to point out that Q02 and Q30 are highly unstable on Spark when using the Parquet file format. The two queries randomly fail with "Out of Memory" errors, but in many cases they complete successfully. To obtain the results in Tab. 3.59 we ran Q02 and Q30 more than three times with Parquet until we were able to get 3 successful completions. We report the results for completeness, but we do not consider them trustworthy for our discussion. We did not observe this behavior with any of the ORC configurations.

With the Default configuration, ORC performs better for the majority of the queries, except Q01 and Q08, which perform better on Parquet, and Q03, which shows similar behavior. The PI% shows a ∼28.1% difference between the two formats. One reason for this behavior is that Q30 on Parquet takes more than twice the time needed to execute it on ORC. We observed that Spark uses a lot more memory with Parquet and often fails with the aforementioned errors.

Moving to the No compression column, there is no clear winner between ORC and Parquet. Q08 performs better on Parquet, while Q01 and Q29 perform better on ORC. Q03 and Q04 show similar behavior on both file formats.

For the Snappy configuration, we cannot say which file format performs better. It is interesting to observe that the CI% achieved with the introduction of Snappy compression is negligible for ORC (∼0.1%) and even negative for Parquet (∼−2.0%). We should keep in mind that the total times are highly influenced by the unstable behavior of Q02 and Q30 on Parquet.

Finally, it is hard to conclude anything meaningful for the MapReduce/Python query type on Spark. When changing the file format configuration, many queries show no change in their performance (Q03 and Q04), while others show contradictory behavior (Q29 and Q01). Only Q08 always performs better on Parquet for all configurations.

HiveQL/OpenNLP

Tab. 3.60 reports results for the HiveQL/OpenNLP query type.

In the Default column, execution time is shorter on Parquet because it uses uncompressed data as its default. Q19 shows no change in performance, and this behavior is confirmed in the other columns of the table as well. In general, Q19's performance does not appear to be affected by any change in the file format configuration.

Similar to the Hive behavior in Section 3.5.4, we can observe that this set of queries performs better on non-compressed data for both file formats in Spark SQL. Comparing No compression and Snappy, we can observe how enabling data compression decreases the performance for all queries and for both file formats. In fact, the CI% is negative for both ORC (blue cells) and Parquet (orange cells).

Tab. 3.60.: Spark Results for HiveQL/OpenNLP query type.

Query execution time is reported in seconds. Green cells, as well as the < and > symbols, highlight the best time between the file formats within each configuration. Pairs of cells with similar performance are filled with the same color (white or light gray) and divided by the ≈ symbol. PI% and CI% respectively show the performance difference between the formats and between the uncompressed and compressed configurations.

Query | Default (ORC vs Parquet) | No compression (ORC vs Parquet) | Snappy (ORC vs Parquet)
Q10 | 1994 > 976 | 2026 > 1633 | 2070 < 3075
Q18 | 2711 > 2011 | 2699 > 2519 | 2836 ≈ 2721
Q19 | 521 ≈ 525 | 526 ≈ 530 | 535 ≈ 537
Q27 | 140 > 103 | 138 ≈ 128 | 139 ≈ 145
Total Time (sec.) | 5366 > 3615 | 5388 > 4811 | 5581 < 6477
Performance Improvement (PI) % | −→ 32.6 % | −→ 10.7 % | 13.8 % ←−
Compression Improvement (CI) % | ORC 0.4 %, Parquet baseline | ORC baseline, Parquet baseline | ORC −3.5 %, Parquet −25.7 %

Looking at the No compression column, Q19 and Q27 show comparable performance, while Q10 and Q18 perform better on Parquet. The PI% of ∼10.7% is in favor of Parquet.

As stated above, the performance of both file formats decreases with the Snappy configuration compared to the Default and No Compression ones. It is interesting to observe that Q10 is faster with ORC and Snappy compared to Parquet and Snappy.

HiveQL/Spark MLlib

Tab. 3.61 reports results for the HiveQL/Spark MLlib query type.

Looking at the Default configuration, ORC achieves a better Total Time, with a PI% of ∼36.8%. Q20 and Q25 perform better on Parquet, whereas Q28 does not show performance differences in any of the configurations.

Moving to the No compression column, ORC is still the best option in terms of total time, but the PI% between the two formats becomes small, i.e. they perform similarly. Q05 shows the most unusual behavior: it takes 618 seconds on ORC Default (using ZLIB compression) compared to 1842 seconds on ORC No compression. All other queries do not seem to be affected by the configuration changes.

With Snappy data compression, the two formats show similar performance in all queries, with the exception of Q20 and Q25 that perform better on Parquet. While

Tab. 3.61.: Spark Results for HiveQL/Spark MLlib query type.

Query execution time is reported in seconds. Green cells, as well as the < and > symbols, highlight the best time between the file formats within each configuration. Pairs of cells with similar performance are filled with the same color (white or light gray) and divided by the ≈ symbol. PI% and CI% respectively show the performance difference between the formats and between the uncompressed and compressed configurations.

Query | Default (ORC vs Parquet) | No compression (ORC vs Parquet) | Snappy (ORC vs Parquet)
Q05 | 618 < 1825 | 1842 ≈ 1900 | 1811 ≈ 1787
Q20 | 343 > 298 | 343 > 299 | 345 > 294
Q25 | 462 > 359 | 453 > 363 | 440 > 346
Q26 | 272 < 374 | 266 < 365 | 281 ≈ 290
Q28 | 304 ≈ 309 | 297 < 326 | 308 ≈ 322
Total Time (sec.) | 1998 < 3165 | 3201 ≈ 3252 | 3187 ≈ 3039
Performance Improvement (PI) % | 36.8 % ←− | 1.5 % ←− | −→ 4.6 %
Compression Improvement (CI) % | ORC 60.2 %, Parquet baseline | ORC baseline, Parquet baseline | ORC 0.5 %, Parquet 7.0 %

Parquet gets a significant improvement thanks to data compression (∼7.0%, orange cells), ORC gets a minimal improvement in the Total Time. Only queries Q05 and Q25 perform better when using ORC with Snappy instead of No Compression, whereas all the remaining queries (Q20, Q26, Q28) result in longer execution times.

Summary

Next we summarize the major findings of all experiments reported in this section:

1. Using the "out-of-the-box"/Default file format configuration is not always the optimal choice and depends heavily on the data structure and query type (Subsections 3.5.4 and 3.5.5).

2. ORC generally achieves the best performance with the Hive engine, with a Performance Improvement (PI%) ranging from ∼5% to ∼10%. The HiveQL/OpenNLP query type (Subsection 3.5.4) is an exception, with Parquet performing better with No compression (∼4.9% PI) and similarly with Snappy.

3. Parquet generally achieves best performance with the Spark engine except for HiveQL/Spark MLlib (Subsection 3.5.5) and MapReduce/Python (Subsection 3.5.5) where behavior is unclear.

4. In most cases, using Snappy compression improves the performance of both file formats on both engines, except for the OpenNLP query type, where we observe a negative influence on both engines (Subsections 3.5.4 and 3.5.5). In particular, the Compression Improvement (CI%) for Parquet is negative on both Hive (−18.4%) and Spark (−25.7%).

5. Queries Q02 and Q30 are unstable on Spark (Subsection 3.5.5).

3.5.6 In-Depth Query Analysis

In this section, we perform a deeper query investigation with the goal of identifying the causes of the varying performance behavior reported in the previous sections. Due to time limitations, we select one representative query from each of the four query types in BigBench: Q08 (MapReduce/Python), Q10 (HiveQL/OpenNLP), Q12 (Pure HiveQL) and Q25 (HiveQL/Spark MLlib).

Fig. 3.58.: The picture illustrates the process of collecting the Spark History Server metrics and source code details into a summary table for query Q08 executed on ORC configured with No Compression.

We re-executed the queries on SparkSQL configured with No compression and Snappy for ORC and Parquet. Each of them shows a different behavior with respect to the other queries in the same group. We analyze them with the help of the Spark History Server metrics, the query source code and the resource utilization data. Fig. 3.58 shows all the information we collected about a query (i.e., Q08 with No Compression on ORC) by looking at the Spark History Server and the source code. In particular, we collect the Spark tasks, stages and operations as well as the Input and Output data sizes. A Spark query can be split into a number of jobs; a Spark job is a set of tasks resulting from a Spark action. A task is an individual unit of work that is executed on one executor/container. A Spark stage is a set of tasks in a job that can be executed in parallel. The operations are either actions or transformations executed in a job (e.g. HadoopRDD or FileScanRDD). The Input size is the data read in a stage, whereas the Output size is the data written by a stage. More details about the Spark internals are available in the Spark documentation [18ad].

Using this information we compiled summary tables for each query execution to better understand the unexpected behavior and compare the different file format configurations. The complete tables are available on GitHub [18o] in the file Evaluation-summary.xlsx. In addition, we collected performance metrics using the Performance Analysis Tool (PAT) [Too18] by Intel. The tool collects data from all cluster nodes, introducing minimal overhead on the benchmark workload. At the end, it aggregates them and produces Excel files with visual charts, which are used in our analysis. All generated charts are available on GitHub [18o] in the file selected-queries-resource.pdf. In the next subsections we discuss our findings for each of the selected queries.
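The stage-level numbers discussed below were collected from the History Server UI. An equivalent, scriptable way to retrieve them is the Spark monitoring REST API; the following sketch assumes the History Server on its default port 18080, and the host name and application id are placeholders for our cluster.

# Sketch: fetching per-stage metrics from the Spark History Server REST API
# instead of reading them off the web UI. Host and application id are placeholders.
import requests

HISTORY_SERVER = "http://historyserver:18080/api/v1"   # placeholder host
APP_ID = "application_1520000000000_0042"              # placeholder application id

stages = requests.get(f"{HISTORY_SERVER}/applications/{APP_ID}/stages").json()
for stage in stages:
    print(stage["stageId"],
          stage["status"],
          stage["numTasks"],
          stage["inputBytes"],      # Input size read by the stage
          stage["outputBytes"])     # Output size written by the stage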

BigBench Q08 (MapReduce/Python)

Q08 investigates the effect of customers reading reviews on the sales revenue in a specific time period and for a specific product category. The query can be divided into two distinct phases. In the first phase, three temporary tables are created and stored in plain text format: the first table contains rows representing the dates of the desired time period; the second stores web sessions of users who read product reviews, and is built with a Python program; the third and last table stores the sales in the desired time period. In the second phase, the temporary tables are combined to get the final result. To better understand the file format impact, we focus primarily on the first phase, where the data is retrieved from the file format structure.

Comparing ORC and Parquet As shown in Tab. 3.59, Q08 performs better on Parquet for both No Compression and Snappy configurations. Similarly for both configurations, Parquet reads less input data for the key execution stages compared to ORC.

No Compression Configuration: The Parquet execution (12 min.) is faster than the ORC execution (16 min.). The ORC execution results in 14 Spark stages with total Input of 111.6 GB compared to 18 Spark stages with total Input of 94.5 GB for the Parquet execution. In ORC, stage 3 (5.8 min.) performs the same operations as stage 6 (2.8 min.) in Parquet. Both stages execute 1402 tasks. However, ORC stage

3 takes as Input 95.7 GB and Outputs 15.2 GB, whereas Parquet stage 6 takes 77.8 GB for the same Output.

Fig. 3.59 shows the CPU utilization (blue peaks in the yellow square) for the No Compression configuration. The first block of lines showing high CPU utilization should correspond to the first phase, in which the relevant data for the query is retrieved using the file format indexes and metadata. ORC spends more CPU time on user processes than Parquet; for the latter, the majority of the CPU time in the first phase is spent waiting for I/O (disk) operations. This can be interpreted as Spark being better at retrieving the relevant data from Parquet and immediately triggering the I/O requests.

(a) ORC

(b) Parquet

Fig. 3.59.: CPU utilization for Q08 with No Compression configuration.

Snappy Configuration: The Parquet execution (11 min.) is faster than the ORC execution (15 min.). Both formats execute in 14 Spark stages. In comparison, the total Input for ORC is 73.7 GB, whereas for Parquet it is 61.8 GB. In ORC, stage 3 (5.5 min.) performs the same operations as stage 3 (2 min.) in Parquet. The number of executed tasks is almost equal, with 1402 for ORC and 1401 for Parquet. However, the ORC stage 3 takes 61.6 GB as Input, whereas the Parquet stage takes only 48 GB, with an Output of 15.2 GB for both. With respect to the executed operations, ORC performs one HadoopRDD and 5 MapPartitionsRDD operations, whereas Parquet performs one FileScanRDD and 2 MapPartitionsRDD operations.

ORC Execution with the No Compression configuration takes 16 min., whereas with the Snappy configuration it takes 15 min. Both configurations execute in 14 Spark stages. In comparison, the total Input for the No Compression configuration is 111.6 GB, whereas the total Input for the Snappy configuration is 73.3 GB.

Parquet Execution with the No Compression configuration takes 12 min., whereas with the Snappy configuration it takes 11 min. The No Compression configuration is executed in 18 stages, whereas the Snappy configuration in 14 stages. In comparison, the total Input for the No Compression configuration is 94.5 GB, whereas the total Input for Snappy is 61.8 GB.

(a) No Compression

(b) Snappy

Fig. 3.60.: Disk Requests for Q08 with Parquet.

The use of Snappy compression brings some improvements only for Parquet. Fig. 3.60 shows a huge drop in I/O requests (blue peaks in the yellow square). This is the expected behavior when using compression: to reduce the number of disk accesses while retrieving the same data.

BigBench Q10 (HiveQL/OpenNLP)

Q10 performs sentiment analysis on the product reviews by classifying them as positive or negative. The output also reports the words that lead to the classification. The query does not create any temporary table, but it works directly on data stored

in columnar file formats. The sentiment extraction is realized with a Java UDF embedded in the query code.

Comparing ORC and Parquet This query type is particularly interesting because it generally performs better with the No compression configuration, as shown in Tab. 3.60. This is unexpected behavior, as data compression usually benefits the performance by reducing the I/O accesses. In the specific case of Q10, we get better performance on Parquet when using the uncompressed format. On the other hand, with Snappy compression the query performs better on ORC than on Parquet, but the performance difference is less relevant.

No Compression Configuration: The Parquet execution (29 min.) is faster than the ORC execution (34 min.). The ORC execution results in 3 Spark stages, whereas the Parquet execution results in 4 Spark stages. In terms of total Input data, both take 5 GB. However, ORC stage 0 executes 9 tasks in 16 min., compared to Parquet stage 1, which executes 28 tasks in 14 min. Similarly, ORC stage 1 executes 9 tasks in 18 min., compared to Parquet stage 2, which executes 28 tasks in 14 min. ORC stage 0 performs one HadoopRDD and 7 MapPartitionsRDD operations, whereas Parquet stage 1 performs one FileScanRDD and 5 MapPartitionsRDD operations. ORC stage 1 performs one HadoopRDD and 5 MapPartitionsRDD operations, whereas Parquet stage 2 performs one FileScanRDD and 3 MapPartitionsRDD operations. We can conclude that Parquet takes advantage of parallel execution by performing 28 tasks in parallel compared to 9 tasks for ORC.

Snappy Configuration: The ORC execution (37 min.) is faster than the Parquet execution (51 min.). The ORC execution results in 3 Spark stages, whereas the Parquet execution results in 4 Spark stages. ORC takes 3.4 GB as total Input, whereas Parquet takes 4.7 GB. The biggest difference is that stage 1 in Parquet is not comparable to any of the ORC stages: it takes 15 min. and performs one FileScanRDD, 4 MapPartitionsRDD, one PartitionPruningRDD and one PartitionwiseSampledRDD operation, the latter two of which are not performed in ORC. Both PartitionPruningRDD and PartitionwiseSampledRDD are not typical for this Spark context and lead to an unnecessary increase of the total execution time. Therefore, the cause of this behavior needs further investigation.

As far as Parquet is concerned, we hypothesize that Q10 performs better with the No compression configuration because it is harder to efficiently compress unstructured data. However, this behavior is not confirmed when using ORC. Because of this discrepancy we look into the resource utilization. Fig. 3.61 shows the disk requests for the No compression configuration. We can immediately notice that the disk is underutilized with both file formats. The same happens with the Snappy configurations, and also for the disk bandwidth and CPU utilization, which we do not report here for the sake of space.

In this case, the motivation for the query performance behavior remains unclear. If a benchmark user is specifically interested in this query type, they should further investigate Q18, Q19 and Q27. If no relevant insight is found, the UDF code should be checked for implementation issues.

(a) ORC

(b) Parquet

Fig. 3.61.: Disk requests for Q10 with No Compression configuration.

ORC The Snappy configuration takes 37 min. with a total Input of 3.4 GB, compared to 34 min. and a total Input of 5 GB with the No Compression configuration. Despite the smaller input data, using compression results in worse performance than No Compression.

Parquet The Snappy configuration takes 51 min. with 4.7 GB total Input, compared to 29 min. and 5 GB total Input with the No Compression configuration. Both configurations have 4 Spark stages, but stage 1 with Snappy does not perform the same operations, as already mentioned above. Despite the smaller input data, using compression results in worse performance than No Compression.

BigBench Q12 (Pure HiveQL)

Q12 searches for customers who viewed products online and then bought a product from the same category in a physical store within the next three months. No temporary tables are created to fulfill the query, meaning that all the relevant data is retrieved directly from the columnar file format and the whole query execution is relevant for the analysis.

Comparing ORC with Parquet Tab. 3.58 shows that Q12 always performs better on Parquet compared to ORC. Snappy compression brings a slight improvement for ORC, while it slightly worsens the Parquet performance; it is the only query that does not get a performance improvement from the Snappy configuration with Parquet. For both configurations Parquet reads much less input data compared to ORC, which has a clear impact on the performance.

No Compression Configuration: The Parquet execution (1.7 min.) is faster than the ORC execution (5.4 min.). The ORC execution runs in 8 Spark stages with a total Input of 82.6 GB, whereas the Parquet execution runs in 10 stages with a total Input of 2.9 GB. Both are executed on 1402 Spark tasks. The ORC stage 1 takes 3.8 min., compared to the Parquet stage 4 taking 15 sec.

Snappy Configuration: The Parquet execution (1.4 min.) is faster than the ORC execution (5.2 min.). Both formats execute in 8 Spark stages. The ORC stage 1 takes 3.6 min. with 40.9 GB total Input data compared to the Parquet stage 1 taking 11 sec. with 1172 MB total Input. Similarly, the ORC stage 2 takes 32 sec. and 4.4 GB total Input compared to the Parquet stage 2 taking 4 sec. and 354 MB total Input.

ORC Using Snappy compression takes 5.2 min. and 45.3 GB compared to No Compression with 5.4 min. and 82.6 GB. Overall, using compression with ORC slightly improves the performance.

Parquet Using Snappy compression, the query takes 1.4 min. and 1.5 GB compared to the No Compression configuration with 1.7 min. and 2.9 GB. Overall, using compression with Parquet slightly improves the performance. As the reader can notice, the measurements in the run with PAT show a different result with respect to Tab. 3.58.

To understand the aforementioned discrepancy, we look at the PAT utilization charts. The disk request and bandwidth utilization profiles are very similar between the No Compression and Snappy configurations. We do not report them here since they do not provide any useful insight. Fig. 3.62 shows the CPU utilization for Q12 on the Parquet file format for both configurations. We observe a generally higher utilization with the Snappy configuration. This is expected, since data retrieved from disk must be decompressed, which consumes additional CPU time. As observed for all the other queries in Tab. 3.58, despite the need for extra computation, data compression is beneficial for the performance since it reduces the number of disk accesses. The unexpected behavior of Q12 can be explained by the fact that the query reads only a small amount of data from disk, so the benefit of using compression is not appreciable. We believe that the discrepancy between Tab. 3.58 and the PAT execution is due to noise, and the difference between the No Compression and Snappy configurations should be considered negligible.

BigBench Q25 (HiveQL/Spark MLlib)

Q25 groups customers based on a set of shopping dimensions. To achieve this, it uses a k-means clustering algorithm [HW79] implemented using Spark MLlib.

Fig. 3.62.: CPU utilization for Q12 with Parquet: (a) No compression, (b) Snappy.

The query is split into two phases: the first phase creates a temporary table to prepare data for the clustering algorithm, while the second phase simply executes a Spark job and stores the results. As usual, the temporary table is stored in plain text, making only the first phase relevant for our performance analysis of file formats.

Comparing ORC and Parquet Tab. 3.61 reports that Q25 always performs better on Parquet for both No compression and Snappy configurations. The data compression brings little improvement when using Parquet (∼4.7%), while ORC gets a minimal improvement. Other queries in Tab. 3.61 show worse performance on ORC for Snappy configuration with respect to No compression.

No Compression Configuration: The Parquet execution (3.4 min.) is faster than the ORC execution (5 min.). Both execute in 13 Spark stages with the same number of tasks. The ORC execution takes 18.3 GB as total input, whereas the Parquet execution takes 23.8 GB. However, ORC stages 1 and 5 execute HadoopRDD and 5 MapPartitionsRDD, whereas Parquet stages 1 and 5 execute FileScanRDD and 2 MapPartitionsRDD.

Snappy Configuration: The Parquet execution (3.3 min.) is faster than the ORC execution (5 min.). Both execute in 13 Spark stages with the same number of tasks. The ORC execution takes 10.7 GB as total input, whereas the Parquet execution takes 15.7 GB. Again, ORC stages 1 and 5 execute HadoopRDD and 5 MapPartitionsRDD, whereas Parquet stages 1 and 5 execute FileScanRDD and 2 MapPartitionsRDD.

Fig. 3.63 compares the CPU utilization of ORC and Parquet for the Snappy configuration. As already observed for Q08, more CPU user time (blue peaks in the yellow square) is spent on ORC in the data retrieval phase. It appears that with Parquet it is easier for Spark to locate the relevant data and request it from disk. In fact, the CPU user time is lower with Parquet, while a large part of the total CPU time is spent on I/O wait (indicated by the green color in the yellow squares).

Fig. 3.63.: CPU utilization for Q25 with the Snappy configuration: (a) ORC, (b) Parquet.

ORC ORC performs equally with the Snappy and No Compression configurations, taking 5 min., 13 stages and an equal number of tasks, with 18.3 GB total input for No Compression and 10.3 GB total input for Snappy. Overall, using compression with ORC does not improve the performance.

Parquet Similar to ORC, Parquet also performs equally with the Snappy and No Compression configurations, taking around 3.4 min., 13 stages and an equal number of tasks, with 23.8 GB total input for No Compression and 15.7 GB total input for Snappy.

Summary

1. The most important lesson learned is that changing the file format influences the overall engine behavior. The functions used to retrieve data are different: with ORC, Spark prefers to use HiveTableScan, while with Parquet it prefers to use FileScanRDD. This emphasizes the importance of the file format choice when using a specific engine.

2. In many cases we observed that the two formats differ in the number of Spark stages, tasks and operations they execute for the same HiveQL query code.

3. Generally, the introduction of Snappy compression improves the query performance for both file formats by reading less data from disk. However, the query performance is not only influenced by the amount of input data read from disk: for Q25, Parquet shows better performance than ORC while still reading more data.

4. The HiveQL/OpenNLP Q10 shows abnormal behavior. Cluster resources are underutilized and Snappy compression worsens the performance. While the latter can be caused by the unstructured data type, the cluster under-utilization can be caused by flawed workload code in the benchmark. Further investigation of the Java UDF source code is necessary.

3.5.7 Summary and Lessons Learned

To the best of our knowledge, this is the first study that evaluates the ORC and Parquet file formats on Hive and SparkSQL using the BigBench benchmark as a popular representative of Big Data workloads. From our benchmark results, we observe that it is important to separate the file format evaluation from the engine. Both components have great influence on the workload, and a comparison of their combination is meaningless because we cannot tell which component is really causing a change in performance. In Section 3.5.6 we showed how a different file format selection changes the Spark execution behavior. In particular, different functions are used to retrieve data from ORC and Parquet files, respectively. We believe that our benchmark methodology, in which we keep the processing engine fixed while changing the file format, is correct.

At the same time, the overall performance on the same engine is greatly influenced by the file format parameters. The default configurations of ORC and Parquet are extremely different, and a direct performance comparison of the defaults can lead to misleading results. Therefore, the file format selection cannot be naively based on the engine preference stated in the documentation or on quick-and-dirty benchmarks. A careful understanding of the file format parameters, especially the use of data compression, is necessary to make the optimal choice.
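As an illustration of such file format parameters, the following is a minimal sketch of how Snappy compression can typically be enabled for ORC and Parquet tables in HiveQL. The table names are hypothetical, and the exact property names and defaults can differ across Hive versions; this is not the configuration script used in our experiments.

-- Hypothetical tables; property names may vary across Hive versions.
-- ORC: compression is usually controlled via the orc.compress table property.
CREATE TABLE sales_orc_snappy
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY")
AS SELECT * FROM sales_text;

-- Parquet: newer Hive versions accept the parquet.compression table property;
-- alternatively it can be set as a session parameter before writing the data.
SET parquet.compression=SNAPPY;
CREATE TABLE sales_parquet_snappy
STORED AS PARQUET
AS SELECT * FROM sales_text;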

Once this is set up, a careful understanding of the benchmark workload is necessary. We showed that ORC generally achieves the best performance with the Hive engine (Section 3.5.4), while Parquet generally achieves the best performance with the Spark engine (Section 3.5.5).

However, this is not true for all BigBench query types, with significant exceptions as shown in Subsections 3.5.4 and 3.5.5. Users should identify and select groups of queries in BigBench that best emulate their use case and then evaluate the performance. This will not only reduce the time needed to execute benchmarks, but will also give more valuable results. In this respect, BigBench is extremely flexible: in this work we ran the whole set of queries, but the user can select and run only a subset of them.

Similarly, the use of data compression is not advised for all query types. In most cases using Snappy compression improves the performance on both file formats and both engines, except for the OpenNLP query type, where we observe a negative influence with both engines (Subsections 3.5.4 and 3.5.5).

What are the best practices, methodologies and metrics when comparing different file formats?

Based on our experimental results and the benchmarking methodology that we implemented to compare the file format performance in this study (in particular ORC and Parquet), we assembled a list of steps that can be used as a best-practices guide when testing file formats on a new SQL-on-Hadoop engine:

• Make sure the new engine offers support for the file formats under test (ORC and Parquet).

• Choose a benchmark or test workload representing your use case and suitable for the comparison.

• Set the file format configurations accordingly (with or without compression).

• Generate the data and make sure that it is consistent with the file configuration using the file format (ORC and Parquet) tools.

• Perform the experiments (at least 3 times) and calculate the average execution times.

• Compare the time differences between the two file formats using the PI % and CI % metrics and draw conclusions (see the sketch after this list).

• Select queries compatible with your specific use-case. Execute them while collecting resource utilization data and perform in-depth query evaluation to spot bottlenecks and problems.
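As a minimal sketch of the two measurement steps above, assume the run times are collected in a hypothetical table runs(file_format, query_id, run_id, seconds) and, for illustration only, that PI % denotes the relative difference between the average execution times of the two formats (this working definition is an assumption for the sketch, not the formal metric definition):

-- Average at least 3 runs per format and compute a relative difference
-- (used here as a stand-in for the PI % metric); runs(...) is hypothetical.
SELECT
  query_id,
  AVG(CASE WHEN file_format = 'orc'     THEN seconds END) AS avg_orc_sec,
  AVG(CASE WHEN file_format = 'parquet' THEN seconds END) AS avg_parquet_sec,
  100.0 * (AVG(CASE WHEN file_format = 'orc'     THEN seconds END)
         - AVG(CASE WHEN file_format = 'parquet' THEN seconds END))
        / AVG(CASE WHEN file_format = 'orc'     THEN seconds END) AS pi_percent
FROM runs
GROUP BY query_id;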

One important evaluation factor mentioned above is the benchmark used for stress testing the engines. As reviewed in the related work (Subsection 3.5.2), TPC-H and TPC-DS are the most common choices for benchmarking both engine and file format performance. In this study, we utilize the BigBench benchmark as it is currently the only standard benchmark including structured, unstructured and semi-structured data types as well as machine learning and text processing workloads.

Is BigBench suitable for file format comparisons?

Based on our experimental results (Sections 3.5.4, 3.5.5 and 3.5.6), we can conclude that BigBench is a good choice for comparing file formats on SQL-on-Hadoop engines, mainly for two reasons: (1) structured and unstructured data influence the query performance, particularly in combination with compression (Q10 in Subsection 3.5.6); and (2) BigBench offers a variety of 30 different workloads (use cases) divided into four categories based on implementation type.

However, many other questions around benchmarking file formats still remain open:

• Is there a need for a specialized micro-benchmark to better investigate the file format features?

• If yes, what should this benchmark include in terms of data types and opera- tions?

• What are the file format features that such a benchmark should stress (for example block size, compression etc.)?

As a future work, we plan to investigate these questions. For example, further insights can be obtained by running benchmarks on the same processing engine and file format while changing the file format parameters. This would help to better distinguish the influence of each architecture component on the query performance. New metrics can be added to the in-depth query analysis, like the amount of network traffic exchanged by each node in the cluster on relevant TCP and UDP ports, as shown in [FPR16], to spot bottlenecks and unbalanced workloads.

Resource utilization data collection and the comparison of graphs and execution plans can be standardized and integrated into BigBench for better usability. Data analysis and visualization can then be included in a graphical user interface, like Apache Hue [18r] which is highly customizable.

We also plan to evaluate columnar file formats with other benchmarks and dataset types, leading to their adoption in other applications. For example, ORC and Parquet together with Hive or SparkSQL can be used to store and query time-series data, like data coming from sensors. Software architectures for sensor data analysis can select these technologies for their batch layer to achieve better access performance with respect to simple plain text files on HDFS, or to replace complex NoSQL systems [Bar+18; Raj+18]. However, dedicated benchmarks are needed to ensure the suitability of columnar file formats for range-based queries, that are commonly used to access time-series data.

3.5.8 Conclusions

To the best of our knowledge, this is the first study that evaluates the ORC and Parquet file formats on Hive and SparkSQL using the BigBench benchmark as a popular representative of Big Data workloads. From our benchmark results, we observe that it is important to separate the file format evaluation from the engine. Both components have great influence on the workload, and a comparison of their combination is meaningless because we cannot tell which component is really causing a change in performance. In Section 3.5.6 we showed how a different file format selection changes the Spark execution behavior. In particular, different functions are used to retrieve data from ORC and Parquet files, respectively. We believe that our benchmark methodology, in which we keep the processing engine fixed while changing the file format, is correct.

At the same time, the overall performance on the same engine is greatly influenced by the file format parameters. The default configurations of ORC and Parquet are extremely different, and a direct performance comparison of the defaults can lead to misleading results. Therefore, the file format selection cannot be naively based on the engine preference stated in the documentation or on quick-and-dirty benchmarks. A careful understanding of the file format parameters, especially the use of data compression, is necessary to make the optimal choice.

Once this is set up, a careful understanding of the benchmark workload is necessary. We showed that ORC generally achieves the best performance with the Hive engine (Section 3.5.4), while Parquet generally achieves the best performance with the Spark engine (Section 3.5.5). However, this is not true for all BigBench query types, with significant exceptions as shown in Subsections 3.5.4 and 3.5.5. Users should identify and select groups of queries in BigBench that best emulate their use case and then evaluate the performance. This will not only reduce the time needed to execute benchmarks, but will also give more valuable results. In this respect, BigBench is extremely flexible: in this work we ran the whole set of queries, but the user can select and run only a subset of them.

Similarly, the use of data compression is not advised for all query types. In most cases using Snappy compression improves the performance on both file formats and both engines, except for the OpenNLP query type, where we observe a negative influence with both engines (Subsections 3.5.4 and 3.5.5).

In the future, we plan to perform more experiments with different configurations by modifying other file format parameters and investigating their influence. Furthermore, we plan to understand the cause of the instability of queries Q02 and Q30 on Spark (Subsection 3.5.5), as well as the resource under-utilization found with Q10 (Subsection 3.5.6).

Acknowledgements

This research was supported by the Frankfurt Big Data Lab (Chair for Databases and Information Systems - DBIS) at the Goethe University Frankfurt. Special thanks for the help and valuable feedback to Sead Izberovic, Thomas Stokowy, Karsten Tolle, Roberto V. Zicari (Frankfurt Big Data Lab), Pinar Tözün (IBM Almaden Research Center, San Jose, CA, USA), and Nicolas Poggi (Barcelona Supercomputing Center).


4 ABench: Big Data Architecture Stack Benchmark

This chapter is structured as follows:

• Section 4.1 extends the standardized Big Data benchmark BigBench (TPCx-BB) into BigBench V2.

• Section 4.2 evaluates Hive and Spark SQL using BigBench V2.

• Section 4.3 extends BigBench V2 with a streaming component stressing the velocity characteristic of Big Data applications.

• Section 4.4 evaluates the Spark Structured Streaming using the streaming extension of BigBench V2.

• Section 4.5 defines a new benchmark, called ABench (Big Data Architecture Stack Benchmark), that takes into account the heterogeneity of Big Data archi- tectures.

Figure 4.1 depicts the relations between chapters 3, 4 and 5 in the thesis.

In particular, the experience of using the standardized BigBench (TPCx-BB) for evaluating SQL-on-Hadoop engines inspired the development and implementation of a new version called BigBench V2 (Sections 4.1 and 4.2).

While experimenting with BigBench V2 and exploring the concept of heterogeneity in Big Data platforms, it became clear that there is a need for a more comprehensive end-to-end Big Data benchmark that can be easily extended to cover new types of Big Data workloads, such as streaming queries and complex machine learning pipelines. This led to the vision for a new benchmark, called Big Data Architecture Stack Benchmark, or ABench for short (Section 4.5).

The ABench vision is the central concept of this chapter (marked in green in Figure 4.1) and it builds on top of the existing BigBench V2, which we also extended with a streaming component described in Section 4.3. Currently, two further extensions of ABench, which are presented in Chapter 5 (marked in yellow in Figure 4.1), are under development. The first one extends the machine learning workloads with new ones (Section 5.1.2), while the second one focuses on building a new flexible platform infrastructure (Section 5.1.1) to seamlessly enable both cloud and on-premise benchmark deployments.

Fig. 4.1.: ABench Roadmap

4.1 BigBench V2: The New and Improved BigBench

Abstract

Benchmarking Big Data solutions has been gaining a lot of attention from research and industry. BigBench is one of the most popular benchmarks in this area which was adopted by the TPC as TPCx-BB. BigBench, however, has key shortcomings. The structured component of the data model is the same as the TPC-DS data model which is a complex snowflake-like schema. This is contrary to the simple star schema Big Data models in real life. BigBench also treats the semi-structured web-logs more or less as a structured table. In real life, web-logs are modeled as key-value pairs with unknown schema. Specific keys are captured at query time - a process referred to as late binding. In addition, eleven (out of thirty) of the BigBench queries are TPC-DS queries. These queries are complex SQL applied on the structured part of the data model which again is not typical of Big Data workloads. In this paper1, we present BigBench V2 to address the aforementioned limitations of the original BigBench. BigBench V2 is completely independent of TPC-DS with a new data model and an overhauled workload. The new data model has a simple structured data model. Web-logs are modeled as key-value pairs with a substantial and variable number of keys. BigBench V2 mandates late binding by requiring query processing to be done directly on key-value web-logs rather than a pre-parsed form of it. A new scale factor-based data generator is implemented to produce structured tables, key-value semi-structured web-logs, and unstructured data. We implemented and executed BigBench V2 on Hive. Our proof of concept shows the feasibility of BigBench V2 and outlines different ways of implementing late binding. The paper is based on the following publication:

• Ahmad Ghazal, Todor Ivanov, Pekka Kostamaa, Alain Crolotte, Ryan Voong, Mohammed Al-Kateb, Waleed Ghazal, Roberto Zicari, BigBench V2: The New and Improved BigBench, in Proceedings of the 33rd IEEE International Conference on Data Engineering (ICDE 2017), April 19-22, 2017, San Diego, California, USA, [Gha+17a].

Keywords: Benchmark testing, Data models, Big Data, Generators.

4.1.1 Introduction

The problem of storing and analyzing Big Data in its structured, semi-structured, and non-structured forms continues to be of major interest to academic research and industrial products. Several commercial proprietary (e.g., Aster [Ast] and Cloudera [Clo]) and open source systems (e.g., Presto [Pre] and Spark [Spa17]) were developed in the past few years to tackle the challenges and grasp the opportunities of Big Data.

1Part of this work was done while Ahmad Ghazal was at Oracle, and Pekka Kostamaa and Ryan Voong were at Teradata.

As Big Data systems expand and mature, there is a need for benchmarks that help assess their functionality and performance. So far, BigBench [Gha+13b] is arguably the only benchmark that provides an end-to-end solution for Big Data benchmarking. It was adopted - without major design or architectural changes - by the TPC [TPC18b]. However, BigBench has major shortcomings due to its reliance on TPC-DS [18af] and its simplistic handling of semi-structured data. This paper aims at addressing these shortcomings through a proposal for BigBench V2. BigBench V2 is self-contained, independent of TPC-DS, and more representative of real life Big Data systems. For the rest of this paper, we refer to the original BigBench and its TPC implementation as BigBench, and we use BigBench V2 as the name of the new, improved benchmark. Before describing BigBench V2, we first elaborate on the limitations of BigBench.

BigBench conveniently re-used components of TPC-DS to fill in the data model and data generation of the structured part of the benchmark. Also, eleven TPC-DS queries are used in BigBench to cover some of the retail analytics described in McKinsey’s report [Man+11]. TPC-DS is a benchmark for decision support and has a complex snowflake-like data model. TPC-DS workload is based on complex SQL constructs with lots of joins, aggregations, and sub-queries. The complex data model and queries in TPC-DS are not representative of Big Data systems and applications with simple schemas which also imply fewer joins and sub-queries.

The main limitation of BigBench is in the way it handles web-logs (i.e., semi- structured data). It handles web-logs as a structured table and all queries are processed against a fixed schema. This is contrary to real life applications, in which web-logs consist of a large and unknown set of keys that makes it impractical to parse these web-logs and create a schema out of them upfront. The practical approach in these cases is to extract the keys (i.e., columns) required to satisfy each query at run-time. This technique of looking up the structure of data at run-time is known as late binding [WC14; Liu+16].

BigBench V2 departs from TPC-DS with a simple data model. The new data model still has the variety of structured, semi-structured, and unstructured data of the original BigBench data model. The difference is that the structured part has only six tables that capture necessary information about users (customers), products, web pages, stores, online sales and store sales. We developed a scale factor-based data generator for the new data model. The web-logs are produced as key-value pairs with two sets of keys. The first set is a small set of keys that represent fields from the structured tables, like IDs of users, products, and web pages. The other set of keys is larger and is produced randomly. This set is used to simulate the real-life case of web-logs containing a large number of keys that may not be used in actual queries. Product reviews are produced and linked to users and products as in BigBench, but the review text is produced synthetically, contrary to the Markov chain model [MT09] used in BigBench. We decided to generate product reviews in this way because the Markov chain model requires real data sets, which limits our options for products and makes the generator hard to scale.

For the workload queries, all 11 TPC-DS queries on the complex structured part are removed and replaced by simpler queries, mostly against the key-value web-logs. The new BigBench V2 workload has only 5 queries on the structured part versus 18 in BigBench. This change has no impact on the coverage of the different business categories addressed in BigBench. In addition to the removal of the TPC-DS queries, BigBench V2 mandates late binding [Bro13] but does not impose a specific implementation of it. This requirement means that a system using BigBench V2 can extract the keys and their corresponding values per query at run-time. Other than the changes above, BigBench V2 is the same as BigBench, including the metric definition and computation.

The remainder of this paper is organized as follows. Section 4.1.2 covers work related to Big Data benchmarking. Section 4.1.3 describes the new simplified data model. Our scalable custom-made data generator is discussed in section 4.1.4. BigBench V2 workload queries and late binding requirements are outlined in section 4.1.5. Section 4.1.6 presents our proof of concept for BigBench V2 using Hive. Finally, Section 4.1.7 summarizes the paper and suggests future directions.

4.1.2 Related Work

Quite a few benchmarks have been proposed and developed recently to measure the performance and applicability of Big Data systems and applications [ZHZ16]. Among these different benchmarks, BigBench [Gha+13a; Rab+12a] is arguably the first concrete piece of work towards benchmarking Big Data. BigBench added semi-structured and unstructured data to TPC-DS [TPCa] and provided 30 queries on Big Data retail analytics per the McKinsey report [Man+11]. The work in [Cho+13c] implemented BigBench in Hadoop and developed BigBench queries using HiveQL.

Most of the other Big Data benchmarks focus on particular applications or domains. HiBench [Hua+10d] and SparkBench [Li+15; Agr+15] are micro-benchmark suites developed specifically to stress test the capabilities of Hadoop (both MapReduce and HDFS) and Spark systems using many separate workloads. Likewise, MRBS [SSB12b] provides workloads from five different domains with the focus on evaluating the dependability of MapReduce systems. CloudSuite [Fer+12a] and CloudRank-D [Luo+12a] are benchmark suites tailored for cloud systems. Both consist of multiple scale-out workloads that test a diverse set of cloud system functionality. CloudSuite focuses on identifying processor micro-architecture and memory system inefficiencies, whereas CloudRank-D stresses the data processing capabilities of cloud systems similar to HiBench. LinkBench [Arm+13] is a benchmark, developed by Facebook, using a synthetic social graph to emulate social graph workloads on top of databases such as MySQL. BigFUN [PCW15] is another benchmark that is based on a social network use case with synthetic semi-structured data in JSON format. The benchmark focuses exclusively on the micro-operation level. The benchmark workload consists of queries with various operations such as simple retrieves, range scans, aggregations, joins, as well as inserts and updates.

More generic benchmarks include BigFrame [Big13], PRIMEBALL [Fer+13a], and BigDataBench [Wan+14a]. BigFrame offers the ability to create a benchmark customized to a specific set of data and workload requirements. PRIMEBALL [Fer+13a] includes various use cases involving both queries and batch processing on different types of data. BigDataBench [Wan+14a] is yet another effort that proposes a benchmark suite for a variety of workloads and datasets in order to address a wider range of Big Data applications. While BigDataBench addresses semi-structured data in the data model, it does not take late binding into consideration as a key concept for applications dealing with semi-structured data.

A recent SPEC Big Data Research Group survey [Iva+15b] provided a summary of the existing Big Data benchmarks and those that are currently under development. The study reviewed the aforementioned benchmarks as well as other benchmarks by outlining their characteristics with the goal of helping both researchers and practitioners choose the appropriate benchmark for their needs. Comparing and contrasting these benchmarks to BigBench, we identify the uniqueness and superiority of BigBench [Gha+13a] as follows:

• BigBench is technology agnostic, whereas many of the existing benchmarks are technology or component bound (HiBench [Hua+10d], SparkBench [Li+15], MRBS [SSB12b], CloudSuite [Fer+12a], LinkBench [Arm+13], and PigMix [Apa13c]).

• BigBench addresses the data variety (structured, semi-structured and unstructured data), which is not the case with most current Big Data benchmarks (PigMix [Apa13c], CALDA [Pav+09b], and TPCx-HS [TPCb]).

• BigBench is an end-to-end benchmark with a unified data model that covers all important types of Big Data analytics (30 queries), unlike the micro-benchmark suites that consist of many separate domain specific workloads (HiBench [Hua+10d], SparkBench [Li+15], CloudSuite [Fer+12a] and CloudRank-D [Luo+12a]).

Based on the above advantages of BigBench over other benchmarks, the TPC chose it as a standard for Big Data benchmarking and named it TPCx-BB [TPC17]. However, as discussed in the Introduction section, BigBench has limitations in terms of being dependent on TPC-DS and its simplistic and unrealistic handling of semi-structured data. This paper proposes enhancements of BigBench through BigBench V2 that aims at fixing these limitations. The original TPCx-BB based on BigBench can be enhanced using BigBench V2 as well.

4.1.3 Data Model

BigBench V2 data model is a simple custom-made model representing user activities on an online retail store as shown in Figure 4.2. The data model meets the new workload queries described in section 4.1.5 and covers the variety of the data needed in Big Data.

The structured part of the model consists of six tables with their full schemas shown in Table 4.1. The user table captures data (name, state, country, etc.) for all registered users as well as users who visit the brick and mortar and online stores. Products offered by the retailer are stored in the product table. The product table has the product name, its description, category, class information, price, and the lowest competitor price. The retailer's online pages are described in the webpage table, which has the webpage URL, description, and type. The websale table stores sales information, including the customer/user who purchased a product, which product was sold, the product quantity, and the date and time of the transaction. The storesale table is similar to the websale table with an additional field for the store name. The directed arrows in Figure 4.2 indicate primary-foreign key relationships between the different tables. For example, websale has a many-to-one relationship with the product and the user tables. Note that the six structured tables are covered in BigBench with a bigger and more complex schema. For example, the BigBench structured part (from TPC-DS) has separate tables for date and time, while BigBench V2 just uses a simple timestamp field to represent date and time. Also, BigBench V2 folded product categories into the product table, while BigBench has separate tables for product categories and classes.

Fig. 4.2.: Data model

The semi-structured component is represented by web-logs capturing user clicks, just like in BigBench. However, unlike BigBench, the web-log entries are in the form of key-value pairs with no relational schema. The web-log records the activities (i.e., clicks) of a user while the user visits different webpages, handles shopping carts, or checks out products. These actions generate keys and values related to the user, product and webpage tables. Another set of random keys and their values is added to each web-log entry to represent real-life scenarios where web-logs have a large number of unknown keys. Section 4.1.4 explains how these two sets are generated with our new data generator.

An example of a click (i.e., one entry in web-logs) is shown below. In this example a user user1 at time t1 clicked on a webpage w1 that has information about product p1. The click has additional 100 random keys along with their values to simulate the large number of key-value pairs in real life web-logs.

...

Tab. 4.1.: Schema of the six structured tables

Table       Columns
user        u_user_id, u_name
product     p_id, p_name, p_category_id, p_category_name, p_price
webpage     w_web_page_id, w_web_page_name, w_web_page_type
websale     ws_transaction_id, ws_user_id, ws_product_id, ws_quantity, ws_timestamp
storesale   ss_transaction_id, ss_store_id, ss_user_id, ss_product_id, ss_quantity, ss_timestamp
store       s_store_id, s_store_name

The third component of the data model is the unstructured product review text. Similar to BigBench, product reviews are represented as a table with a wide text field to hold the reviews, as shown in Table 4.2. In addition to the review text, the table captures the user who wrote the review along with the product on which the review was submitted. It also has the user's overall rating of the product.

Tab. 4.2.: Schema of product review table.

Table          Columns
productreview  pr_review_id, pr_product_id, pr_rating, pr_content

4.1.4 Data Generation

Fig. 4.3.: Data generation process flow

The data generator of BigBench V2 is a scalable synthetic data generator that meets the design and requirements of the data model described in Section 4.1.3. The data generator is based on a cardinality scale factor, similar to the way data is generated in TPC benchmarks. The data generation covers all the six relational tables: product, webpage, user, store, storesale, and websale. In addition, the data generator produces the key-value weblog and unstructured productreview in sync with the structured tables.

The cardinalities of user, storesale, websale and weblogs grow linearly with the scale factor. The product table is scaled sublinearly since in real life the number of new products does not grow proportional to the number of users. Data in the webpage table is assumed to be static and does not grow with the scale factor. Table 4.3 shows cardinality for some example scale factors. This data illustrates the linear, sub-linear, and constant growth of the different data sources. The exact cardinality formula for the different tables and web-logs will be included in a detailed and extended report with the full specifications of the benchmark.

Tab. 4.3.: Cardinality for various scaling factors

Data\Factor      1           100          1,000           10,000
webpage          26          26           26              26
product          1,000       1,900        4,063           10,900
user             10,900      109,900      1,009,900       10,009,900
store            100         105          150             600
web sale         143,880     1,450,680    13,330,680      132,130,680
store sale       59,950      604,450      5,554,450       55,054,450
product review   163,863     1,652,163    15,182,163      150,482,163
weblog           23,000,000  236,000,000  2,200,000,000   21,500,000,000

The data generator logic flow is shown in Figure 4.3. Boxes with colored fill represent final data sets (six tables, web-logs, and product reviews). The directed arrows indicate a primary-foreign key relationship between the different data stores. For example, websale has a many-to-one relationship with the product table. The data generator first produces the data for user, product, and webpage tables based on the scale factor. The entries in weblogs are produced in correlation to the data of user, product, and webpage tables. As shown in Section 4.1.5, web-logs analytics examine the user clicks in terms of sessions (i.e., sequence of clicks during a fixed amount of time). Some sessions lead to adding products to shopping carts and some are simply used for browsing. Sessions that involve shopping carts could be abandoned or may eventually lead to an actual sale. The data generator is driven by these different scenarios as illustrated in Figure 4.3. Actual sales are also captured in the websale table. Data generation for the storesale table is done independently but in sync with the data generated for other related tables.

As mentioned before, in real life, clicks typically have large and unknown number of keys. The keys are broken down into two sets. One set includes keys that are normally needed by queries such as price, timestamp, etc. The other set is simply for arbitrary keys with random values to support the large and unknown key requirements of late binding. Our data generator produces the web-logs data in JSON format since it is a commonly used format for key-value data.

In BigBench, the unstructured product reviews text generation is based on a Markov chain process using real life product reviews with limited number of categories. This method is not scalable since it is impossible to find real reviews on demand to support higher values of the scale factor. To address this issue, BigBench V2 data generator produces product reviews synthetically in conjunction with the user and product tables. The cardinality of these reviews grows linearly with the scale factor similar to the user table.

The relationships between the different boxes in Figure 4.3 (e.g., the average number of sessions per user, the average number of clicks per session, the percentage of clicks leading to a shopping cart, etc.) are captured in a configuration file and can be adjusted to control the data generation. The average number of random key-value pairs added to clicks is also captured in the configuration file. We plan to offer public access to the binaries of the data generator along with a detailed data model.

4.1.5 Workload

One of the enhancements in BigBench V2 over BigBench is de-emphasizing the structured part and increasing the share of semi-structured data in the workload. This is accomplished by replacing all 11 complex TPC-DS queries, which involve neither the semi-structured nor the unstructured part of the data model. The new queries are developed to mainly target the semi-structured part of the new model. Queries 20 and 24 are also replaced since the data model does not have sales returns. The remaining 17 queries are superficially rewritten to reflect the new table and column names. The new BigBench V2 queries answer the business questions listed below, and their HiveQL code is included in Appendix D. To simplify referencing and tracking queries, we kept the query numbers of the original 17 BigBench queries as is. The new 13 queries re-use the numbers of the deleted queries, which are: Q5, Q6, Q7, Q9, Q13, Q14, Q16, Q17, Q19, Q20, Q21, Q22, and Q23.

• Q5 : Find the 10 most browsed products.

• Q6 : Find the 5 most browsed products that are not purchased.

• Q7 : List users with more than 10 sessions. A session is defined as a 10-minute window of clicks by a user.

• Q9 : Find the average number of sessions per registered user per month. Display the top ten users.

• Q13 : Find the average amount of time a user spends on the website.

• Q14 : Compare the average number of products purchased by users from one year to the next.

• Q16 : Find the top ten pages visited.

• Q17 : Find the top ten pages visited on a certain day (such as Valentine’s Day).

• Q19 : Find out the days with the highest page views.

• Q20 : Do a user segmentation based on their preferred shopping method (online vs. in-store).

• Q21 : Find the most popular web page paths that lead to a purchase.

• Q22 : Show the number of unique visitors per day.

• Q23 : Show the users with the most visits.

The final 30 BigBench V2 queries answer essentially the same type of business questions covered in the original BigBench. Table 4.4 summarizes the business question breakdown of the queries in BigBench V2 and BigBench side by side. More background information about the source of the business questions in BigBench and BigBench V2 can be found in [Gha+13a] and [Man+11]. The table also shows the technical breakdown in terms of data source and types of queries. In terms of query processing, BigBench V2 puts more emphasis on the combination of procedural and declarative processing than on declarative or procedural processing alone, as is the case in BigBench. The majority of queries (60%) in BigBench were applied on the structured part at the expense of the semi-structured and unstructured data. In the retail business, semi-structured data (capturing online user experiences and interactions) is normally more important than product reviews. On that basis, we applied most of the new queries in BigBench V2 on the web-logs.

Tab. 4.4.: Technical and business query breakdown

Business Category          BigBench (No. of queries / %)   BigBench V2 (No. of queries / %)
Marketing                  18 / 60.0%                      20 / 69.0%
Merchandising              5 / 16.7%                       3 / 10.3%
Operations                 4 / 13.3%                       2 / 6.9%
Supply chain               2 / 6.77%                       1 / 3.3%
New business models        1 / 3.3%                        4 / 13.8%

Query Type                 BigBench (No. of queries / %)   BigBench V2 (No. of queries / %)
Declarative                10 / 33.3%                      7 / 24.1%
Procedural                 7 / 23.3%                       4 / 13.3%
Declarative & Procedural   13 / 43.3%                      19 / 65.6%

Data Source                BigBench (No. of queries / %)   BigBench V2 (No. of queries / %)
Structured                 18 / 60.0%                      5 / 16.7%
Semi-Structured            7 / 23.3%                       20 / 66.7%
Unstructured               5 / 16.7%                       5 / 16.7%

One of the key contributions in BigBench V2 is mandating late binding. Web-logs cannot be accessed as a table by the workload queries and upfront parsing of the web-logs is not allowed. Only at run time, on a query by query basis, the system conducting the benchmark can know the keys needed from the web-logs.

There are different methods for implementing late binding. BigBench V2 does not require any specific one. At a high level, pulling keys at run time can be done through non-streaming and streaming methods. On one hand, non-streaming methods scan all records/entries of the web-logs, extract the keys, and make them available (for instance, through a table) to the rest of the query execution. Streaming methods, on the other hand, perform key extraction one record at a time (buffering can be used as an optimization) and the result is passed to the execution engine as a tuple/row. For example, if the web-logs are involved in a join, then streaming provides one row at a time for the join execution in a data flow fashion. Streaming provides more parallelism and lower memory requirements. However, materialized results from non-streaming methods can be re-used across different queries. Parsing web-logs (streaming or non-streaming) can be done natively by the Big Data software solution or through an external tool. For example, SparkSQL and Drill have native support for JSON and can parse web-logs directly. In contrast, Hive needs an internal or external user-defined function (UDF) to parse web-logs. Section 4.1.6 provides concrete examples of these different options explored in our experiments.
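As a purely illustrative sketch of the non-streaming approach (not part of the benchmark specification), the keys needed by several queries could be extracted once from the JSON web-logs and materialized into a plain Hive table that is then re-used. The statement below follows the web_logs table and json_tuple UDTF used in Section 4.1.6; the particular set of extracted keys is a hypothetical choice.

-- Non-streaming late binding: parse the web-logs once and materialize the
-- extracted keys for re-use across queries (key selection is illustrative).
DROP TABLE IF EXISTS parsed_web_logs;
CREATE TABLE parsed_web_logs AS
SELECT wl_user_id, wl_product_id, wl_webpage_name, wl_timestamp
FROM web_logs
LATERAL VIEW json_tuple(
  web_logs.line,
  'wl_user_id', 'wl_product_id', 'wl_webpage_name', 'wl_timestamp'
) logs AS wl_user_id, wl_product_id, wl_webpage_name, wl_timestamp;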

4.1.6 Proof of Concept

Similar to BigBench, BigBench V2 is technology agnostic and can be implemented on different engines. The official TPCx-BB [TPC17], which is based on BigBench, is implemented using HiveQL with Hive [Hiva] being the most commonly used data warehouse engine on Hadoop. We implemented BigBench V2 in Hive as well and developed queries using HiveQL. The following section describes first our experimental setup and then discusses the implementation of the proof of concept. The actual experiments of the 30 queries and their results are also discussed. Finally, the subsection on other engines shows experiments of 3 queries on SparkSQL and Drill to illustrate different ways late binding can be applied.

Experimental Setup

We performed all BigBench V2 experiments, presented in the following section (4.1.6), on our experimental system. This section gives a brief description of our test cluster.

Hardware: We use a dedicated cluster consisting of 4 nodes connected directly through a 1GBit Netgear switch. All 4 nodes are Dell PowerEdge T420 servers. The master node is equipped with 2x Intel Xeon E5-2420 (1.9GHz) CPUs - each with 6 cores, 32GB of main memory, and 1TB hard drive. The 3 worker nodes are equipped with 1x Intel Xeon E5-2420 (2.20GHz) CPU with 6 cores, 32GB of RAM and, 4x 1TB (SATA, 7.2K RPM, 64MB Cache) hard drives. More detailed specification of the node servers is provided in the Appendix B (Table B.1 and Table B.2).

Software: Ubuntu Server 14.04.1 LTS was installed on all 4 nodes, allocating the entire first disk. The Cloudera Distribution of Hadoop (CDH) version 5.5.1 with Hive 1.1.0 was used in all experiments. The total storage capacity of the cluster is 13TB, of which 8TB are effectively available as HDFS space. Due to the resource limitations (only 3 worker nodes) of our setup, the cluster was configured with a replication factor of two.

Data Generation and Loading: We ran our new BigBench V2 data generator with the scale factor set to 1. The generated data files corresponding to each BigBench V2 table for scale factor (SF) 1 are outlined in Table 4.5. Using a HiveQL script, we created the data model schema and loaded the 6 structured tables, the product reviews and the external web-logs table into Hive. The data loading times per table are also provided in Table 4.5.

Tab. 4.5.: Data size and loading time

Scale Factor 1
Table Name       Data Size   Loading Time (sec.)
user             420 KB      16.633
product          84 KB       13.162
product review   6.44 MB     15.196
web log          19.7 GB     0.13
web page         4 KB        13.051
web sale         10.1 MB     16.104
store sale       10.4 MB     15.031
store            8 KB        13.878
Total:           ~19.8 GB    103.185

Implementation

The 6 structured tables and product reviews are defined as Hive tables. The HiveQL definition for the table user is shown below as an example. The field delimiter defines how the fields are separated in the text file, generated by the data generator. The location attribute describes the physical location of the HDFS file.

Code 4.1: Table User

DROP TABLE IF EXISTS user;
CREATE TABLE user
  (u_user_id bigint,
   u_name string
  )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'hdfsDataPath/user';

The web-logs are produced in JSON format, as mentioned in Section 4.1.4, and we store them in HDFS as a file called clicks.json. We define an external table web_logs with a single text field that holds the JSON data. The definition of the Hive table web_logs is shown below.

Code 4.2: Table Web-Logs

CREATE EXTERNAL TABLE IF NOT EXISTS
  web_logs (line string)
ROW FORMAT DELIMITED LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 'hdfsPath/web_logs/clicks.json';

As mentioned in Section 4.1.5, there are multiple ways to implement late binding, and in Hive this can be done using internal or external UDFs. The purpose of this proof of concept, however, is to show that BigBench V2 queries can be easily implemented in Hive, regardless of which implementation is more efficient. In our Hive implementation we used the internal json_tuple user-defined table function. It accesses the external web_logs table with the help of the lateral view syntax [Hivb] and extracts keys from the JSON records. For example, Q16, defined in Section 4.1.5, uses the json_tuple UDTF to extract only the wl_webpage_name key from each JSON record:

Code 4.3: Q16 HiveQL

select
  wl_webpage_name,
  count(*) as cnt
from
  web_logs
  lateral view
    json_tuple (
      web_logs.line,
      'wl_webpage_name'
    ) logs as wl_webpage_name
where
  wl_webpage_name is not null
group by wl_webpage_name
order by cnt desc
limit 10;

There are alternative internal UDFs, like the get_json_object UDF, that can be used for parsing JSON records in Hive. An external way to access JSON files can be implemented using Hive Streaming in combination with Python scripts.
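For illustration, a minimal sketch of how Q16 could alternatively be expressed with the get_json_object UDF is shown below; it extracts the same wl_webpage_name key as Code 4.3 and is not the implementation used in our measurements.

-- Alternative late binding for Q16 using get_json_object instead of json_tuple.
select
  get_json_object(line, '$.wl_webpage_name') as wl_webpage_name,
  count(*) as cnt
from web_logs
where get_json_object(line, '$.wl_webpage_name') is not null
group by get_json_object(line, '$.wl_webpage_name')
order by cnt desc
limit 10;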

BigBench and TPCx-BB used Python scripts to implement the procedural constructs needed in the workload. The most common procedural constructs are sessionize, which identifies user sessions, and path, which performs path analysis. Using Python scripts is not only inefficient as an external function to Hive, but is also complex since each usage of sessionize or path requires a custom-written script. To avoid this complexity in the new BigBench V2 queries, we implemented sessionize and path as native Hive UDFs. We used these new and general UDFs in all relevant queries (new and old) of BigBench V2.
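As an illustration of what sessionize computes, the following is a minimal sketch using standard HiveQL window functions rather than our native UDF. It assumes a hypothetical pre-parsed clicks table with a numeric Unix timestamp column and that a new session starts after a gap of more than 10 minutes between consecutive clicks of the same user.

-- Sketch of 10-minute sessionization with window functions (not the UDF used
-- in the benchmark); clicks(wl_user_id, wl_timestamp) is a hypothetical table
-- with wl_timestamp in Unix seconds.
SELECT
  wl_user_id,
  wl_timestamp,
  SUM(new_session) OVER (PARTITION BY wl_user_id ORDER BY wl_timestamp
                         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
    AS session_id
FROM (
  SELECT
    wl_user_id,
    wl_timestamp,
    CASE
      WHEN LAG(wl_timestamp) OVER (PARTITION BY wl_user_id
                                   ORDER BY wl_timestamp) IS NULL
        OR wl_timestamp - LAG(wl_timestamp) OVER (PARTITION BY wl_user_id
                                                  ORDER BY wl_timestamp) > 600
      THEN 1 ELSE 0
    END AS new_session
  FROM clicks
) marked;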

Hive Experiments

Apache Hive [Hiva; Thu+09] is the engine in which we implemented all 30 BigBench V2 queries and performed experiments with scale factor (SF) 1. The execution times (in seconds) of all queries are shown in Figure 4.4 with the label Late Binding. Note that this data set is good enough as a proof of concept, and further experiments with bigger scale factors are planned as future work. Overall, the execution times vary, which shows the different complexities of these queries.

Fig. 4.4.: BigBench V2 Hive results for SF1 (Late Binding vs. Pre-parsed Table query runtimes in seconds).

Tab. 4.6.: Late binding vs. pre-parsed table in Hive for SF1

Query  Late binding (sec)  Pre-parsed table (sec)  Overhead (sec)  Overhead %
Q2     309                 213                     96              45
Q3     312                 216                     96              44
Q4     599                 404                     195             48
Q5     205                 164                     41              25
Q6     490                 310                     180             58
Q7     369                 280                     89              32
Q8     349                 248                     101             41
Q9     345                 261                     84              32
Q12    218                 187                     31              17
Q13    295                 202                     93              46
Q14    341                 249                     92              37
Q16    251                 164                     87              53
Q17    260                 166                     94              57
Q19    354                 239                     115             48
Q21    327                 237                     90              38
Q22    302                 197                     105             53
Q23    296                 180                     116             64
Q24    424                 362                     62              17
Q29    295                 200                     95              48
Q30    312                 211                     101             48

To investigate the overhead of late binding, we modified some of the queries to access a table that represents a pre-parsed form of the JSON data. Our data generator has an option to generate a table for web-logs, which we used to produce a Hive table called pre-parsed_web_logs. The execution time of the modified queries for SF1 is labeled Pre-parsed Table and also shown in Figure 4.4. Note that Q1, Q10, Q11, Q15, Q18, Q20, Q25, Q26, Q27, and Q28 do not involve late binding. The execution time of these queries is the same with and without late binding. Additionally, Table 4.6 shows the execution time in seconds for the queries using the late binding approach and the respective modified queries using the pre-parsed web-log table. The difference in execution time illustrates the actual late binding overhead. The overhead ranges between 17% and 64% depending on the query and is on average around 43%. The total execution time of all queries with late binding is around 140 minutes and around 107 minutes with the pre-parsed web-log table, which implies an overall overhead of around 23%.
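As a worked example of how the per-query overhead in Table 4.6 is obtained, the values are consistent with computing the runtime difference relative to the pre-parsed runtime: for Q2, (309 − 213) / 213 ≈ 0.45, i.e., an overhead of about 45%.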

Other Engines

In addition to our proof of concepts on Hive, we looked at other popular Big Data engines in order to study different ways of implementing late binding. For this purpose, we looked at three queries (Q16, Q22, and Q23) in SparkSQL and Drill. Our comparison does not try to find which engine is better. Rather, it shows different alternatives of implementing late binding and handling key-value semi-structured data.

Apache Spark [Spa17; Zah+12b] has become a popular alternative to the MapReduce framework, promising faster processing and offering advanced analytical capabilities through SparkSQL [Arm+15a]. It natively supports HiveQL and can directly access the Hive metastore. This allowed us to execute three Hive queries (Q16, Q22, and Q23) without any modifications. Using the latest Spark version 2.0.0, the queries were run with SF1 on the pre-loaded Hive metastore. Similar to the Hive experiments, we executed both the late binding and the pre-parsed web-log table HiveQL implementations of the three queries. The execution times are provided in Table 4.7 and shown in Figure 4.5. The difference between the implementations is labeled as Overhead. It ranges between 54% and 66%, which turns out to be very similar to the Hive overhead. For Q22 and Q23, both Hive and Spark achieve almost the same late binding overhead. As mentioned in Section 4.1.5, SparkSQL offers native JSON support. It can automatically infer the JSON schema through the use of the org.apache.spark.sql.json library, in which case no UDFs are required. However, for the sake of simplicity we leave this internal SparkSQL comparison for a future study.
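As a hedged sketch of what such a native SparkSQL variant could look like (left for future work above), the JSON web-logs could be exposed directly as a temporary view using the org.apache.spark.sql.json data source, so that keys are resolved at run time without any UDF. The view name is hypothetical, the path is assumed to be the one from Code 4.2, and the exact DDL may vary between Spark versions.

-- Register the raw JSON clicks as a temporary view (Spark 2.x style syntax).
CREATE TEMPORARY VIEW web_logs_json
USING org.apache.spark.sql.json
OPTIONS (path 'hdfsPath/web_logs/clicks.json');

-- Q16 expressed directly against the inferred JSON schema.
SELECT wl_webpage_name, count(*) AS cnt
FROM web_logs_json
WHERE wl_webpage_name IS NOT NULL
GROUP BY wl_webpage_name
ORDER BY cnt DESC
LIMIT 10;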

Apache Drill [Dri19; HN13] is a columnar, schema-free SQL query engine that uses a JSON data model to enable queries on complex and nested data stored in Hadoop, NoSQL, or cloud storage. In comparison to Hive and SparkSQL, which rely on MapReduce and Spark for the data processing, Drill has its own optimizer

Tab. 4.7.: Late binding overhead for other engines

Engine                   SparkSQL              Drill
Query                    Q16    Q22    Q23     Q16    Q22    Q23
Late binding (sec)       232    234    231     223    249    266
Pre-parsed table (sec)   140    152    142     66     67     80
Overhead %               66     54     63      238    272    233

Fig. 4.5.: BigBench V2 engine results for SF1 (Late Binding vs. Pre-parsed Table runtimes in seconds for Q16, Q22 and Q23 on Spark SQL, Hive and Drill).

that automatically restructures a query plan to leverage its internal processing capabilities. Using Drill version 1.7.0 installed on all four cluster nodes, we executed the same three queries. Unlike SparkSQL, Drill does not support HiveQL, and we implemented the queries using Drill's native JSON support (Section 4.1.5). The query implementations were tested using late binding and the pre-parsed Hive table. In fact, the only difference between the two implementations is in the from statement. In the case of late binding, it directly references the JSON file in HDFS. For the pre-parsed Hive table, it references the table in Hive, as shown below for Q16.

Code 4.4: Q16 Drill

select
  wl_webpage_name,
  count(*) as cnt
from
  /* using late binding */
  hdfs.`/hdfs_path/clicks.json`
  /* using pre-parsed Hive table */
  /* hive.bigbench.`pre-parsed_web_logs` */
where
  wl_webpage_name is not null
group by wl_webpage_name
order by cnt desc
limit 10;

The execution times on Drill are shown in Table 4.7 and Figure 4.5. Interestingly, Drill performs very similarly to SparkSQL in the late binding experiments, whereas it is 2-3 times faster than SparkSQL for the pre-parsed table queries. This results in a much greater overhead, between 233% and 272%, caused by the run-time parsing of the JSON file. In other words, the late binding overhead in Drill is almost 4 times bigger than the one observed in Hive and SparkSQL.

In summary, our proof of concept work shows that BigBench V2 is easy to implement as a self-contained benchmark with all required components. The benchmark is executed fully on Hive and partially on SparkSQL and Drill. The experiments on these three systems illustrate different methods of implementing late binding and their corresponding overheads.

4.1.7 Conclusions & Future Work

In this paper, we presented BigBench V2 - a major rework of the BigBench data model and generator. The new data model and its corresponding generator reflect the simple data models of real-life Big Data applications and the late binding requirement. We built a custom-made and scale factor-based data generator for all components of the data model. All 11 TPC-DS queries are removed from BigBench and replaced with new queries in BigBench V2. These new queries answer similar business questions, but focus on analytics on the semi-structured web-logs. We also implemented a rigorous and complete proof of concept on Hive. The proof of concept illustrates the feasibility and self-containment of the benchmark. It also highlights the cost of late binding and how it varies among different engines. We hope these results can be useful for providers to enhance their respective engines to efficiently implement late binding.

We plan to make the data generator and queries available to the public. We also plan to propose enhancing TPCx-BB using BigBench V2 and to work on the necessary changes to the specification and the final queries written in HiveQL. Such an extension of TPCx-BB should be straightforward since TPCx-BB is already based on HiveQL.

4.2 Evaluating Hive and Spark SQL with BigBench V2

This section extends the proof of concept presented in Section 4.1.6. In particular, it focuses on the different techniques to implement late binding using internal functions in Hive and Spark SQL. To better understand the performance advantages and disadvantages of the various implementations, we conduct multiple experiments on a subset of the 30 BigBench V2 queries, namely the ones applying late binding.

4.2.1 Motivation

In the era of Big Data, the growing number of emerging technologies and new application requirements are motivating the need for standardized end-to-end Big Data benchmarks. Arguably, BigBench (TPCx-BB) [Gha+13b; Bar+14] is the first attempt at such a standardized comprehensive Big Data benchmark. However, the benchmark has multiple shortcomings, which are addressed in the recently presented BigBench V2 benchmark [Gha+17a].

BigBench V2 abandons the TPC-DS data model in favour of a simple data model that still has the same variety of structured, semi-structured, and unstructured data as the original BigBench data model. The difference is that the structured part has only six tables that capture necessary information about users (customers), products, web pages, stores, online sales and store sales. We developed a scale factor-based data generator for the new data model. The web-logs are produced as key-value pairs with two sets of keys. The first set is a small set of keys that represent fields from the structured tables, like IDs of users, products, and web pages. The other set of keys is larger and is produced randomly. This set is used to simulate the real-life case of large keys in web-logs that may not be used in actual queries. Product reviews are produced and linked to users and products as in BigBench, but the review text is produced synthetically, in contrast to the Markov chain model [MT09] used in BigBench [Gha+13b]. We decided to generate product reviews in this way because the Markov chain model requires real data sets, which limits our options for products and makes the generator hard to scale.

For the workload queries, all 11 TPC-DS queries on the complex structured part are removed and replaced by simpler queries, mostly against the key-value web-logs. The new BigBench V2 workload has only 5 queries on the structured part versus 18 in BigBench. This change has no impact on the coverage of the different business categories addressed in BigBench. In addition to the removal of the TPC-DS queries, BigBench V2 mandates late binding [Bro13] but does not impose a specific implementation of it. This requirement means that a system using BigBench V2 extracts the keys and their corresponding values per query at run-time. Other than the changes above, BigBench V2 is the same as BigBench, including the metric definition and computation.

This section further investigates the new features of BigBench V2 by evaluating the late binding capabilities of two popular SQL-on-Hadoop engines, namely Hive and

Spark SQL. The experiments show the performance impact of using different JSON implementations on both engines.

4.2.2 Experimental Setup

This section describes all important aspects of setting up the experimental environment, starting with the hardware and software components, going through the query implementation, and finally the data generation and loading times.

Hardware

We use a dedicated cluster consisting of 4 nodes connected directly through a 1GBit Netgear switch. All 4 nodes are Dell PowerEdge T420 servers. The master node is equipped with 2x Intel Xeon E5-2420 (1.9GHz) CPUs, each with 6 cores, 32GB of main memory, and a 1TB hard drive. The 3 worker nodes are equipped with 1x Intel Xeon E5-2420 (2.20GHz) CPU with 6 cores, 32GB of RAM and 4x 1TB (SATA, 7.2K RPM, 64MB Cache) hard drives. A more detailed specification of the node servers is provided in Appendix B (Table B.1 and Table B.2).

Software

Ubuntu Server 14.04.1 LTS was installed on all 4 nodes, allocating the entire first disk. The Cloudera Distribution of Hadoop (CDH) version 5.11.0 with Hive 1.1.0 was used in all experiments. The total storage capacity of the cluster is 13TB, of which 8TB are effectively available as HDFS space. Due to the resource limitations (only 3 worker nodes) of our setup, the cluster was configured to work with a replication factor of two. Apache Spark 2.3.0 (09.05.2017) and Spark SQL were compiled and used throughout the experiments. The important YARN, MapReduce, Hive and Spark configuration parameters are taken from a previous experimental work presented in Section 3.4, where the parameter values are described in Table 3.40.

Query Implementation

The focus of the paper is to investigate the different late binding implementations in Hive and Spark SQL, which means that only 12 of the 30 BigBench V2 queries are relevant. These are Q5, Q6, Q7, Q9, Q12, Q13, Q16, Q17, Q19, Q22, Q23 and Q24.

Since Hive does not provide native parsing of JSON files and records, the HiveQL queries using late binding are implemented using user-defined functions (UDFs). The Hive queries in the experiments presented in Section 4.1.6 are implemented using the internal json_tuple user-defined table function. It accesses the external web_logs table with the help of the lateral view syntax [Hivb] and extracts keys from

the JSON records. An alternative Hive implementation was done using the internal get_json_object user-defined function, which also parses JSON records in Hive.

Spark SQL provides native JSON support and can automatically infer the JSON schema through the use of the org.apache.spark.sql.json library, in which case no UDFs are required. Additionally, all major Hive UDFs are supported in Spark SQL, which means that the exact query implementations using the json_tuple and get_json_object user-defined functions [Hua17] are also executable in Spark SQL.

All of the above described Hive and Spark SQL implementations are available in the BigBench V2 GitHub repository2.
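For readers who want to reproduce the native approach, the following is a minimal Scala sketch (not the code from the repository) of how the web logs JSON file could be registered as a temporary view in Spark SQL so that a query such as Q16 runs without any UDF. The HDFS path and the application name are assumptions made for illustration; only the view name spark_logs is taken from Code 4.7.

import org.apache.spark.sql.SparkSession

object NativeJsonQ16 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("BigBenchV2-NativeJson")   // illustrative name
      .enableHiveSupport()
      .getOrCreate()

    // Spark infers the JSON schema of the web logs automatically (late binding).
    val webLogs = spark.read.json("hdfs:///hdfs_path/clicks.json")
    webLogs.createOrReplaceTempView("spark_logs")

    // Q16 expressed against the native JSON view (same text as Code 4.7).
    spark.sql(
      """select wl_webpage_name, count(*) as cnt
        |from spark_logs
        |where wl_webpage_name is not null
        |group by wl_webpage_name
        |order by cnt desc
        |limit 10""".stripMargin).show()
  }
}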

Data Generation and Loading

To better evaluate the different late binding implementations, the selected BigBench V2 queries are executed on six different data sizes. The data is generated using the scale factor parameter of the BigBench V2 data generator. An overview of the tested scale factors, together with the exact table sizes and their loading times, is presented in Table 4.8.

Tab. 4.8.: Data Sizes and Loading Times for the BigBench V2 Scale Factors

                  Scale Factor 1          Scale Factor 10         Scale Factor 30
Table Name        Data Size  Loading (s)  Data Size  Loading (s)  Data Size  Loading (s)
user              256 KB     16.63        456 KB     19.19        925 KB     20.90
product           39 KB      13.16        46 KB      17.33        55 KB      18.85
product review    6.9 MB     15.20        13 MB      13.40        25 MB      13.12
web log           22 GB      0.02         40 GB      0.04         79 GB      0.05
web page          687 B      13.05        687 B      14.80        687 B      15.07
web sale          9.7 MB     16.10        18 MB      14.63        36 MB      15.10
store sale        10 MB      15.03        19 MB      15.20        38 MB      17.53
store             6.6 KB     13.88        6.2 KB     13.67        6.3 KB     12.92
Total:            22.03 GB   103.08       40.05 GB   108.27       79.1 GB    113.54

                  Scale Factor 60         Scale Factor 100        Scale Factor 200
Table Name        Data Size  Loading (s)  Data Size  Loading (s)  Data Size  Loading (s)
user              1.6 MB     20.53        2.6 MB     21.00        4.9 MB     20.45
product           64 KB      20.31        72 KB      18.68        88 KB      18.68
product review    44 MB      14.13        69 MB      14.32        133 MB     15.40
web log           139 GB     0.07         218 GB     0.07         294 GB     0.07
web page          687 B      14.20        687 B      13.63        687 B      14.25
web sale          64 MB      17.24        101 MB     19.35        196 MB     20.42
store sale        66 MB      18.37        105 MB     20.44        205 MB     27.74
store             6.4 KB     11.92        6.5 KB     12.96        6.8 KB     13.29
Total:            139.2 GB   116.77       218.3 GB   120.45       294.5 GB   130.28

2https://github.com/t-ivanov/BigBenchV2

It is important to mention that in all tested scale factors the web log table, representing the web-log entries from the clicks.json file, has the largest size, ranging from 22GB (for SF1) to 294GB (for SF200). The web-logs are stored as JSON records and are accessed by all queries implementing late binding.

4.2.3 Hive Data Size Scaling

This section investigates the performance of the two BigBench V2 late binding implementations in Hive with respect to data size scaling and compares the performance of the two user-defined functions.

Code 4.5: Q16 HiveQL json_tuple

select
    wl_webpage_name,
    count(*) as cnt
from
    web_logs
    lateral view
    json_tuple (
        web_logs.line,
        'wl_webpage_name'
    ) logs as wl_webpage_name
where
    wl_webpage_name is not null
group by wl_webpage_name
order by cnt desc
limit 10;

As explained in Section 4.2.2, there are two late binding implementations in Hive: the first one uses the json_tuple user-defined table function (UDTF) and the second one uses the get_json_object user-defined function (UDF). Code Listing 4.5 provides an example with Q16 implemented with json_tuple, whereas Code Listing 4.6 shows the get_json_object implementation of the same query.

The goal behind the data scaling evaluation is to understand how the different data sizes influence the two JSON parsing implementations in Hive. Table 4.9 (scale factors (SF) SF1, SF10 and SF30) and Table 4.10 (scale factors (SF) SF60, SF100

and SF200) depict the average execution time (in seconds) over three query executions for all 12 BigBench V2 queries utilizing late binding implemented with get_json_object and json_tuple. In both tables, the columns "json_tuple (seconds)" and "get_json_obj (seconds)" represent the average execution times for each query for the respective user-defined function. The column "∆ %" represents the time difference (in %) between the baseline (SF1) and one of the higher scale factors. For example, the json_tuple implementation of Q5 executes 18% longer for SF10 compared to the baseline SF1. By comparing the query execution times of the two user-defined functions, it is clear that all queries (Q5, Q6, Q7, Q9, Q12, Q13, Q17, Q19, Q22 and Q24) perform faster with the json_tuple UDTF implementation, except Q16 and Q23, which are slightly faster with the get_json_object UDF.
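For clarity, the ∆ % values in Tables 4.9 and 4.10 follow the usual relative-overhead relation; the worked example below only restates the Q5 numbers already given in the text and tables, it is not an additional measurement.

\[
\Delta\,\% = \frac{t_{SF} - t_{SF1}}{t_{SF1}} \times 100,
\qquad \text{e.g. for Q5 with json\_tuple: } \frac{262 - 223}{223} \times 100 \approx 17.5\,\% \approx 18\,\%.
\]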

Code 4.6: Q16 HiveQL get_json_object

select
    wl_webpage_name,
    count(*) as cnt
from
    (
        select
            get_json_object(web_logs.line,
                '$.wl_webpage_name')
            as wl_webpage_name
        from web_logs
    ) logs
where
    logs.wl_webpage_name is not null
group by wl_webpage_name
order by cnt desc
limit 10;

Looking at the queries' scaling behaviour, one can identify two groups of queries with different behaviour patterns. The bigger group of queries (Q6, Q7, Q9, Q13, Q16, Q17, Q19, Q22 and Q23) scales in a similar manner despite the use of different user-defined functions. Queries Q5, Q12 and Q24 scale with changing behaviour. On average, the time difference listed in the "∆ %" columns for the two different implementations ranges within 2% for the scale factors SF10 and SF30, within 10% for SF60 and within 25% for the scale factors SF100 and SF200. This behaviour clearly shows that the input data size (scale factor) has a major impact on the query performance. These observations can be further verified by looking at Figures E.1 and E.2 (in Appendix E), which depict the execution times for all queries and scale factors.

4.2.4 Spark SQL Data Size Scaling

This section investigates the performance of the three BigBench V2 late binding implementations in Spark SQL with respect to data size scaling and compares their performance.

Three different implementations are provided in Spark SQL. The first one uses Spark's native JSON support, as shown for query Q16 in Code Listing 4.7. The other two use the json_tuple UDTF (Code Listing 4.5) and the get_json_object UDF (Code Listing 4.6) with exactly the same Hive implementations.

Tab. 4.9.: Average query times for the three scale factors (SF1, SF10 and SF30). The column ∆ % shows the time difference in % between the baseline SF1 and the other two SFs for the two Hive user-defined function implementations.

                SF1 - baseline           SF10                                    SF30
Query   json_tuple  get_json_obj   json_tuple  ∆ %   get_json_obj  ∆ %    json_tuple  ∆ %   get_json_obj  ∆ %
        (seconds)   (seconds)      (seconds)         (seconds)            (seconds)         (seconds)
Q5      223         222            262         18    278           25     402         80    443           100
Q6      542         552            938         73    945           71     1661        206   1685          205
Q7      470         484            702         49    725           50     1148        144   1193          147
Q9      472         474            684         45    687           45     1130        139   1147          142
Q12     236         241            282         19    306           27     419         77    457           89
Q13     409         415            629         54    647           56     1042        155   1091          163
Q16     276         276            464         68    453           64     810         194   778           182
Q17     290         293            487         68    485           66     867         199   863           194
Q19     384         384            645         68    645           68     1119        191   1130          194
Q22     330         335            559         69    569           70     983         198   998           198
Q23     292         289            491         68    482           67     852         191   832           187
Q24     456         465            596         31    633           36     1055        131   1145          146

Tab. 4.10.: Average query times for the three scale factors (SF60, SF100 and SF200). The column ∆ % shows the time difference in % between the baseline SF1 and the other three SFs for the two Hive user-defined function implementations.

                SF60                        SF100                       SF200
Query   json_tuple  ∆ %   get_json_obj  ∆ %   json_tuple  ∆ %   get_json_obj  ∆ %   json_tuple  ∆ %   get_json_obj  ∆ %
        (seconds)         (seconds)           (seconds)         (seconds)           (seconds)         (seconds)
Q5      603         171   680           206   875         293   978           340   1152        417   1304          488
Q6      2106        288   2189          296   2683        395   2856          417   3283        505   3536          541
Q7      1461        211   1562          223   1829        289   1976          309   2164        360   2369          390
Q9      1444        206   1479          212   1776        276   1840          288   2099        344   2202          365
Q12     621         163   684           184   885         275   990           310   1144        385   1282          431
Q13     1336        226   1398          237   1659        305   1777          328   1979        384   2122          411
Q16     1007        265   990           259   1279        363   1266          359   1553        463   1544          460
Q17     1107        282   1119          282   1414        388   1454          396   1738        500   1790          511
Q19     1420        270   1437          274   1768        360   1800          369   2086        443   2126          454
Q22     1250        279   1291          286   1582        379   1638          390   1902        476   1989          494
Q23     1090        273   1076          272   1410        382   1398          383   1765        504   1764          510
Q24     1665        265   1791          285   2470        442   2684          477   3211        604   3483          649

Code 4.7: Q16 HiveQL spark_native

select
    wl_webpage_name,
    count(*) as cnt
from
    spark_logs
where
    wl_webpage_name is not null
group by wl_webpage_name
order by cnt desc
limit 10;

Similar to the data scaling evaluation with Hive, one can analyze the different late binding implementations on Spark SQL. Table 4.11 (scale factors (SF) SF1, SF10 and SF30) and Table 4.12 (scale factors (SF) SF60, SF100 and SF200) depict the average execution time (in seconds) over three query executions for all 12 BigBench V2 queries utilizing late binding implemented with get_json_object, json_tuple and native_json. In both tables, the columns "json_tuple (sec.)", "get_json_obj (sec.)" and "native_json (sec.)" represent the average execution times for each query for the respective user-defined function or native JSON support. The column "∆ %" represents the time difference (in %) between the baseline (SF1) and one of the higher scale factors. For example, the json_tuple implementation of Q5 executes 19% longer for SF10 compared to the baseline SF1. By comparing the query execution times of the three implementations, it is clear that the native_json support in Spark SQL performs best for all queries (Q5, Q6, Q7, Q9, Q12, Q13, Q17, Q19, Q22 and Q23) except query Q24, which suffers from very long execution times because of internal optimization problems when selecting the right filters in the operations. Furthermore, if we compare the two user-defined functions, it is clear that the json_tuple UDTF implementation performs faster for all queries (in particular Q16 and Q17) except for Q24, which performs slightly faster with the get_json_object UDF.

Looking at the scaling behaviour, it is hard to identify any particular patterns between the different queries and their implementations. On average, the time difference listed in the "∆ %" columns for the three different implementations ranges within 5% for the smaller scale factors (SF10, SF30 and SF60) and within 20% for the higher scale factors (SF100 and SF200). Similar to the Hive executions, the experiments clearly show that the input data size (scale factor) has a major impact on the query performance. These observations can be further verified by looking at Figures E.3 and E.4 (in Appendix E), which depict the execution times for all queries and scale factors.

4.2.5 Conclusion

In this section we evaluated 12 BigBench V2 queries implementing late binding using three different techniques and executed them in Hive and Spark SQL. The main contribution of this work is that it demonstrates the benefits of the new BigBench V2 version by comparing the different late binding implementations. Figure 4.6 depicts the total execution times for the 12 BigBench V2 queries implemented with json_tuple and get_json_object and executed on both Hive and Spark SQL.

Tab. 4.11.: Average query times for the three scale factors (SF1, SF10 and SF30). The column ∆ % shows the time difference in % between the baseline SF1 and the other two SFs for the three Spark SQL implementations.

                SF1 - baseline                        SF10                                                 SF30
Query   json_tuple  get_json_obj  native_json   json_tuple  ∆ %   get_json_obj  ∆ %   native_json  ∆ %   json_tuple  ∆ %   get_json_obj  ∆ %   native_json  ∆ %
        (sec.)      (sec.)        (sec.)        (sec.)            (sec.)              (sec.)              (sec.)            (sec.)              (sec.)
Q5      104         101           107           125         19    121           21    116          9     178         70    173           72    163          52
Q6      126         131           119           163         30    163           24    144          21    275         119   278           112   266          124
Q7      113         115           113           136         21    135           17    138          22    188         66    186           62    184          63
Q9      120         117           118           133         11    131           12    131          11    185         55    187           59    180          52
Q12     112         110           116           130         16    125           13    124          7     187         66    180           63    179          55
Q13     109         107           107           130         20    129           20    121          13    183         68    176           64    175          64
Q16     99          104           101           119         21    121           17    112          11    166         67    178           72    162          60
Q17     106         120           99            122         15    156           30    115          16    175         65    244           103   161          64
Q19     102         104           102           122         20    125           19    124          22    185         81    179           72    172          70
Q22     105         105           104           126         20    120           14    121          17    178         70    175           66    175          69
Q23     98          103           100           117         19    116           13    115          15    166         69    163           59    166          65
Q24     139         143           155           175         26    169           18    202          30    288         106   282           98    462          199

Tab. 4.12.: Average query times for the three scale factors (SF60, SF100 and SF200). The column ∆ % shows the time difference in % between the baseline SF1 and the other three SFs for the three Spark SQL implementations.

                SF60                                              SF100                                             SF200
Query   json_tuple  ∆ %   get_json_obj  ∆ %   native_json  ∆ %   json_tuple  ∆ %   get_json_obj  ∆ %   native_json  ∆ %   json_tuple  ∆ %   get_json_obj  ∆ %   native_json  ∆ %
        (sec.)            (sec.)              (sec.)             (sec.)            (sec.)              (sec.)             (sec.)            (sec.)              (sec.)
Q5      253         142   250           149   244          128   367         252   351           248   342          220   458         339   471           368   431          302
Q6      433         244   437           234   418          252   617         391   638           387   598          404   819         552   828           532   789          565
Q7      268         137   270           136   261          131   371         228   367           220   358          217   468         314   471           311   452          300
Q9      266         122   260           122   258          118   378         216   377           222   356          201   481         302   472           302   462          290
Q12     273         143   254           130   251          117   370         229   349           216   346          199   474         322   448           306   447          287
Q13     267         146   266           148   248          132   368         239   365           240   350          227   468         331   478           345   455          325
Q16     240         143   259           150   234          131   335         238   364           251   334          231   439         344   466           349   417          313
Q17     259         144   373           211   238          141   366         245   538           349   341          246   459         333   704           487   445          351
Q19     289         182   283           171   280          175   404         294   396           280   393          287   526         414   523           401   494          386
Q22     268         156   254           141   243          134   359         243   350           233   350          237   450         330   448           325   448          332
Q23     244         149   247           141   238          138   341         248   345           236   328          227   450         359   443           331   430          328
Q24     440         216   437           206   756          389   648         365   629           341   1145         640   842         504   839           488   1519         882

Clearly, the json_tuple UDTF performs fastest on both SQL-on-Hadoop engines, which becomes even more evident with the growing data sizes (scale factors).

Fig. 4.6.: BigBench V2 Hive and Spark SQL Total Execution Times (in seconds)

Based on our experimental evaluation we can conclude that:

• The BigBench V2 late binding feature can be used successfully to test the JSON support of SQL-on-Hadoop engines. Furthermore, we provide three different query implementations utilizing the Hive json_tuple and get_json_object user-defined functions and the native_json function in Spark SQL.

• Hive performs fastest using the json_tuple user-defined table function.

• Spark SQL performs fastest using the native_json function. Using the benchmark one can identify bottlenecks in the functions, like the one observed in Q24.

• Varying the input data sizes (scale factors) influences the query behaviour for both Hive and Spark SQL. However, the queries demonstrate different data scalability behaviours.

4.3 Adding Velocity to BigBench

Abstract

BigBench standardized as TPCx-BB is a popular application benchmark that targets Big Data storage and processing systems. BigBench V2 addresses some of the BigBench limitations by introducing a new simplified data model, semi-structured web logs in JSON file format and new queries mandating late binding. However, it still covers only batch processing workloads and the Big Data velocity characteristic is not addressed. This work extends the BigBench V2 benchmark with a data streaming component that simulates typical statistical and predictive analytics queries in a retail business scenario. Our approach is to preserve the existing BigBench design and introduce a new streaming component that supports two data streaming modes: active and passive. In active mode, the data stream generation and processing happen in parallel, whereas in passive mode, the data stream is pre-generated in advance before the actual stream processing. The stream workload consists of five queries inspired by the existing 30 BigBench queries. To validate the proposed streaming extension, the two streaming modes were implemented and tested using Kafka and Spark Streaming. The experimental results prove the feasibility of our benchmark design. Finally, we outline design challenges and future plans for improving the proposed BigBench extension. The paper is based on the following publications:

• Todor Ivanov, Patrick Bedué, Ahmad Ghazal, Roberto V. Zicari, Adding Veloc- ity to BigBench, in Proceedings of the 7th International Workshop on Testing Database Systems (DBTest) 2018, Houston, TX, USA, [Iva+18].

• Implementing real-time stream processing in the BigBench Benchmark - Patrick Bedué, Master Thesis 2017, [Bed17].

Keywords: BigBench, Streaming Benchmark, Big Data Benchmarking.

4.3.1 Introduction

In the era of Big Data, stream data processing has become a very important part of the application requirements together with the growing data volume and various data types. Velocity is one of the 3Vs characteristics of Big Data and represents a growing number of use cases (Social Media, Internet of Things, Health Care, Machine Generated Data, etc.) that challenge the current storage and processing systems. Stonebraker et al. [SÇZ05] identify eight properties that large-scale stream processing systems should offer in order to address the requirements of the existing real-time applications. To cover these challenging properties a great number of new distributed streaming engines (Storm [Tos+14], Samza [Nog+17], Millwheel [Aki+13], Flink [Car+15], Spark Streaming [Zah+13], etc.) have recently emerged. They aim to solve one or more of these requirements by utilizing different architectural approaches. Furthermore, new functionalities such as Structured Streaming [Str17c] in Spark SQL and the streaming support in Apache Calcite [Cal18], adopted by many engines such as Flink [Str17a], Drill [Dri19] and Samza SQL [Pat+16], offer the possibility of streaming analytics in SQL-on-Hadoop engines. To better

understand, evaluate and compare these systems, new end-to-end application-level streaming benchmarks that represent real use case scenarios are necessary.

BigBench [Gha+13a] is one of the most popular end-to-end application-level Big Data benchmarks and was adopted by the TPC as TPCx-BB [TPC17]. It consists of 30 analytical workloads and stresses all the 3Vs characteristics of a Big Data platform. The latest BigBench V2 [Gha+17a] offers the following improvements over BigBench: 1) a new simplified data model and data generator; 2) an improved Big Data variety characteristic by utilizing structured, unstructured (text data) and semi-structured data (web logs in JSON); 3) the replacement of all 11 complex TPC-DS queries with new ones mandating late binding.

However, the Big Data velocity characteristic is not really addressed in the current BigBench benchmark. The current implementation includes a periodic refresh model, called concurrent streams [TPC17], which defines the number of concurrent queries that will be executed in parallel during the processing phase. Obviously, this is very different from the execution of workloads on a continuous stream of data that represents the velocity characteristic.

To address this shortcoming, in this work we extend BigBench to periodically run queries on a data stream (web sales and web logs). In this way, we can incorporate techniques such as statistical and predictive analytics, and pattern detection on a subset (window) of the incoming data. Additionally, we introduce two streaming modes: passive and active. The goal of doing this is to make the velocity (streaming) more gradual and configurable for different workload scenarios. For example, the passive mode is suitable for batch processing, where hours of history data are usually streamed. In the case of real-time monitoring and dashboards that need to be refreshed in less than three seconds, it is more suitable to use the active mode, which generates and streams the data while it is being processed.

In the rest of the paper, we refer to BigBench V2 as BigBench. The paper is organized as follows: Section 4.3.2 looks at related work and motivates the need for a streaming extension in BigBench. Section 4.3.3 introduces the concept of streaming in BigBench and presents the methodology and the new architecture components. Section 4.3.4 describes the proof of concept implementation and initial results. Finally, Section 4.3.5 concludes the paper and outlines the future work.

4.3.2 Related Work

There are multiple benchmarks targeting different aspects and functionalities of the streaming systems:

Micro-benchmarks. StreamBench [Lu+14] is one of the first streaming benchmark suites, containing seven micro-benchmark programs (Identity, Sample, Projection, Grep, Wordcount, DistinctCount and Statistics) and four workload suites targeting different technical characteristics of the streaming systems. Recently, the HiBench [Hua+10b] benchmark suite has been extended with four streaming micro-benchmarks implemented on Spark, Storm, Flink and Gearpump. Similarly, SparkBench [Li+15] includes Twitter (real data) and PageView (synthetic data)

workloads for stressing the Spark Streaming engine. All of the micro-benchmarks focus on testing a particular streaming functionality and are not suitable for end-to-end application evaluation of a streaming system.

Application benchmarks. The Linear Road Benchmark [Ara+04] is an application benchmark simulating a toll system for the motor vehicle expressways of a large metropolitan area. It specifies a fictional urban area including such features as accident detection and alerts, traffic congestion measurements, toll calculations and historical queries. The benchmark reports an L-rating metric, which was used by the authors to compare relational database systems with stream data management systems. However, the benchmark is complex and therefore not easy to implement and set up.

The AIM benchmark [Bra+15] is based on a specific Huawei use case and consists of 300 Business Rules, 546 indicators (that means Entity Records of 3 KB) and seven different parameterized real-time analytics queries. The benchmark parameters are the number of Entities (i.e. the size of the main table), the event rate for ingesting new data, the number of real-time analytics clients and the query mix. It was designed to compare the AIM system [Bra+15] with other similar systems, which include stream processing engines like Samza, Flink, Storm and Spark Streaming [Kip+17]. At the same time, a real-world application benchmark simulating an advertisement analytics pipeline has been developed by Yahoo [Chi+16]. It stresses the streaming capabilities of Storm, Flink and Spark. Similarly, RIoTBench [SS16a] is a comprehensive benchmark suite for evaluating Distributed Stream Processing Systems (DSPS) that are hosting IoT applications. It includes 27 common IoT micro-benchmarks and four IoT application benchmarks composed of these tasks and using real data, but is currently available only for Storm. However, none of the above benchmarks integrates an end-to-end real-world scenario implementing a Big Data architecture that combines storage, batch and stream processing components.

4.3.3 BigBench Streaming Extension

In this paper we propose a streaming extension of the BigBench benchmark. The goal is to extend the retail business use case of the benchmark [Gha+13a; Gha+17b] with a real-time processing component, which simulates representative statistical and predictive analytic queries that are executed in repetitive time windows. Currently many emerging workloads have a similar type of behavior. For example, by monitoring logs from webservers in real-time, companies can react to changes in demand or sales within seconds. Measuring the number of unique or returning visitors and the products they bought provides the possibility to immediately take countermeasures. Real-time information could also be used to detect failures of specific websites before sales decrease significantly. However, real-time analysis requires a new type of data processing systems. To help companies choose the best data streaming system for their business, new end-to-end standardized application benchmarks are required. Therefore, we build on top of the standardized BigBench (TPCx-BB [TPC17]) benchmark and preserve its specification by introducing new independent components that can easily be added to the existing ones, as depicted in Figure 4.8.

Streaming Methodology

The BigBench data generator statically creates all tables' data, which serves as a complete data history warehouse. It is important to outline that, by design, the user actions are performed in sessions, which means that the web sales and web logs entries are generated in a session window manner. They are also stored in the order of generation and not sorted according to the event timestamp.

The goal is to use both the web sales and web logs entries and simulate a stream of events, which will be sent in pre-defined groups. In order to achieve this, we first sort the data according to the event timestamp and then create a data window depending on the simulated scenario (fixed or sliding windows). In this way we can easily control the number of events in a window as well as the total number of windows. The other goal is to prepare the window chunks in advance and simulate a real unbounded stream of data in a deterministic way. By doing this we can define the exact stream behavior and also validate and verify the query results produced by the stream processing engine. Therefore, we define two types of simulated windows: fixed window and sliding window. Figure 4.7 illustrates examples for both window strategies using a sample of web log entries.

(a) Fixed Window

(b) Sliding Window (x = 2*y)

Fig. 4.7.: Streaming Window Cases

Fig. 4.8.: Current vs. New BigBench Components

Fixed windows have a static window size aligned to data that occurred during a specific period of time (e.g., hourly windows). Looking at Figure 4.7a, the X-axis represents the data aggregation time (t1 to t8) and the Y-axis the incoming web logs stream for three sample web pages. The example depicts four fixed windows with colored points representing the user clicks on each of the three web pages. Figure 4.7b shows an example of sliding windows, also called hopping windows, processing the same web logs stream. In this case, there are seven windows, illustrated in different colors, with window size (x) and window slide interval (y) (e.g., hourly windows, starting every 30 minutes), so that the window size is twice the size of the window slide (x = 2*y). When the window slide is less than the window size, windows may overlap and the same records are assigned to multiple windows. For example, data occurring between period t1 and t2 is assigned to both window 1 and window 2. Actually, the fixed window is a special case of the sliding window, where the window size is equal to the window slide.
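The two window strategies map directly onto the windowing primitives of typical streaming engines. The following is a small illustrative Scala sketch using Spark SQL's window() function (not part of the benchmark specification); the column name log_time for the event timestamp is an assumption made for the example.

import org.apache.spark.sql.{DataFrame, functions => F}

object WindowExamples {
  // Fixed window: the slide interval equals the window size (e.g. hourly windows),
  // so every record belongs to exactly one window.
  def fixedWindowCounts(webLogs: DataFrame): DataFrame =
    webLogs.groupBy(F.window(F.col("log_time"), "1 hour")).count()

  // Sliding window: size x = 1 hour, slide y = 30 minutes, i.e. x = 2 * y,
  // so every record falls into two overlapping windows.
  def slidingWindowCounts(webLogs: DataFrame): DataFrame =
    webLogs.groupBy(F.window(F.col("log_time"), "1 hour", "30 minutes")).count()
}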

Another important benchmark aspect is that the simulated velocity should be configurable. In practice, the different application domains have different time requirements, ranging from seconds to hours depending on the workload scenario. For example, a real-time dashboard should be refreshed in less than three seconds, whereas for other systems an hour of data aggregation is considered fresh data. To cover a broad range of application scenarios, we define two streaming modes for data ingestion: active and passive mode. The active mode streaming is designed to simulate real-time data streaming (in second ranges), whereas the passive mode simulates data ingestion and transformation with micro-batch processing (in hour ranges).

Design Overview

Figure 4.8 depicts the components of the BigBench benchmark in a technology agnostic manner that allows multiple architectural and implementation variations. The left part of the diagram represents the current standardized components: the Data Generator and the Batch Processing system. Additionally, there are read and write arrows representing read/write operations. For example, the current TPCx-BB kit [kit17] implements the 30 analytical queries in Hive and Spark SQL as Batch Processing systems and HDFS as Persistent Storage Layer.

The right part of Figure 4.8 shows the new streaming components proposed in this paper. Initially, the Data Generator generates the workload and meta data, as defined in BigBench [Gha+17a], and stores it in the Persistent Storage Layer. The Stream Generator extends the Data Generator and enables two execution modes: passive and active. In passive mode, it reads the web logs and web sales data (ordered by time) from the Persistent Storage Layer, prepares the data in pre-defined chunks and writes them back to the Persistent Storage Layer. In active mode, the Stream Generator streams the pre-defined chunks of data to the Fast-access Layer. The Fast-access Layer stores the chunks in-memory and waits for the Stream Processing system to retrieve them. The Stream Processing system gets the data chunks and periodically executes the pre-defined workloads on the data. The workload results are again stored on the Persistent Storage Layer for further validation. The Batch and Stream Processing components can be implemented together using the same technology (for example Spark or Flink). The Batch Processing system executes in parallel the 30 analytical queries as defined in the BigBench specification. The scaling factors and the BigBench metric are also preserved.

Data and Stream Generator

The BigBench data generator generates the initial static data, which is then read by the Stream Generator to create a data stream for both Passive and Active Streaming Modes. These streams are then forwarded to the Stream Processing component via the Fast-access Layer.

Passive Mode: In this mode the data stream is prepared in advance and then it is consumed by the Stream Processing component. The web sales and web logs data files are read from the Persistent Storage Layer. Afterwards all timestamps are analyzed. The maximum and minimum timestamp values are used to calculate the time period covered by all logs within a data file. This information is used to split each file into chunks of records. The total number of chunks is determined by the total number of planned query executions, which is calculated based on the input configuration parameters (window size (x) and window slide (y)). For example, if the total number of executions is set to five and the web log records are read from a ten months period, then each table file will be divided into five pieces with each chunk covering a period of two months. Finally, each individual chunk will correspond to a window in the simulated benchmark stream.
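The chunking step can be sketched in a few lines of Scala. This is only an illustration of the splitting logic described above, under the assumptions that the records carry an epoch-millisecond timestamp and that the type and function names (LogRecord, splitIntoChunks) are hypothetical rather than taken from the benchmark kit.

case class LogRecord(timestampMs: Long, line: String)

object PassiveChunker {
  // Splits a time-ordered log into numChunks chunks of equal time span,
  // mirroring the example of five chunks covering two months each.
  def splitIntoChunks(records: Seq[LogRecord], numChunks: Int): Seq[Seq[LogRecord]] = {
    require(records.nonEmpty && numChunks > 0)
    val minTs = records.map(_.timestampMs).min
    val maxTs = records.map(_.timestampMs).max
    // Width of one chunk; the +1 keeps the maximum timestamp inside the last chunk.
    val span = (maxTs - minTs) / numChunks + 1
    (0 until numChunks).map { i =>
      val from = minTs + i * span
      val until = from + span
      records.filter(r => r.timestampMs >= from && r.timestampMs < until)
    }
  }
}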

Active Mode: The Data Generator creates the weblogs according to a chronological sequence, simulating a user buying a product after visiting a website. Before a stream is created, the Stream Generator looks up the time and date of the next records in the BigBench data, compares their timestamps and decides which record to stream first, ensuring the internal consistency of the data. However, if the data needs to be sorted by the Stream Processing component, the out-of-order property of the data can be preserved and the data streamed as it appears. In this mode the simulated data stream is generated and processed in parallel.

Fast-access Layer

The Fast-access Layer can have different functionality depending on the executed streaming mode. In active mode individual records are appended in a queue and thereafter read by the Stream Processing component during the runtime of the benchmark, whereas in passive mode a complete set of records is buffered (in-memory) before the benchmark execution. It is then read at once whenever the workload is executed by the Stream Processing component.

Stream Processing

This component reads the continuous stream of data and executes the actual query workloads in predefined intervals. The component can be implemented in different technologies and the query details are described in the next section.

Workloads

All 30 BigBench queries remain unchanged and are executed in batch mode independently from the streaming extension. The streaming workload consists of five queries executed periodically on a stream of data. Four of the queries are taken from the existing 30 BigBench queries (Q5, Q6, Q16 and Q22) and remain the same, as they produce meaningful results when executed on a continuous stream of data. We renamed the queries as follows: Q5 to QS1, Q6 to QS2, Q16 to QS3 and Q22 to QS4. Additionally, we added a fifth query, QS5, which is simple and checks how many products of a certain type are sold. Similar to BigBench, all queries are defined in plain text as follows (a sketch of a possible QS1 implementation is given after the list):

• QS1: Find the 10 most browsed products in the last 120 seconds.

• QS2: Find the 5 most browsed products that are not purchased across all users (or specific user) in the last 120 seconds.

• QS3: Find the top ten pages visited by all users (or specific user) in the last 120 seconds.

• QS4: Show the number of unique visitors in the last 120 seconds.

• QS5: Show the sold products (of a certain type or category) in the last 120 seconds.
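As referenced above, the following is a minimal Scala sketch of how QS1 could be expressed, assuming that windowLogs is a DataFrame holding only the web-log records of the current 120-second window and that it exposes a product identifier column named wl_item_id; both the variable and the column name are illustrative assumptions, not part of the benchmark specification.

import org.apache.spark.sql.{DataFrame, functions => F}

object QS1 {
  // Ten most browsed products within the records of the current window.
  def topBrowsedProducts(windowLogs: DataFrame): DataFrame =
    windowLogs
      .filter(F.col("wl_item_id").isNotNull)
      .groupBy("wl_item_id")
      .agg(F.count(F.lit(1)).as("views"))
      .orderBy(F.col("views").desc)
      .limit(10)
}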

With respect to the business behavior, QS1, QS2, QS3 and QS4 perform statistical analytics, whereas the main task of QS5 is monitoring and pattern detection. The term statistical analytics describes the aggregation of data by common mathematical operations like average, count, minimum and maximum. Pattern detection includes market basket analysis such as frequent item set identification.

Metrics

The benchmark metric is the easiest way to evaluate and compare the performance of multiple systems stressed with the same benchmark workload. As mentioned, the BigBench metric [Gha+13a] (BBQpH) remains the same and only the execution of the new stream extension needs to be additionally measured. The easiest way to do this is to report the end-to-end streaming execution time (latency), starting from the Stream Generator and stopping at the point where the data result is produced. It is also possible to create a new combined metric including both batch and stream processing executions, but this is left for future work. This type of metric can be identified as a black-box metric, as it is independent of implementation and technology. At the same time, there are more common quantitative metrics such as throughput and resource utilization (CPU, disk, and memory) reported very often by benchmarks [Hua+10b; SS16a; Li+15; IB15e] that can be identified as white-box metrics. They are dependent on the technology implementation and system architecture.

The validation of the streaming query results can be done in a similar way to the current one in BigBench. Since the data for each scale factor and streaming window is fixed and deterministic, each query result during a benchmark run can be stored persistently and validated when the benchmark execution is finished.

4.3.4 Proof of Concept

This section describes the prototype implementation of the proposed streaming extension. The first part explains the component implementation and the second presents preliminary experiments.

Implementation

As mentioned in the previous section the streaming extension was designed to simulate a continuous data stream. The data stream simulation can be configured to fit different streaming requirements ranging from seconds to hours via the window size, window slide and total runtime parameters. Figure 4.9 depicts the component implementations for both the active and passive streaming modes.

Active Streaming Mode

The Data and Stream Generation in active mode was done as described in the Data and Stream Generator subsection of Section 4.3.3. It simulates a real-time stream in parallel with the actual data processing. The Stream Generator is implemented in Spark, and HDFS is used as the Persistent Storage Layer. The Fast-access Layer was implemented on top of Kafka [Kaf19]. Two topics (web sales and web logs) were created. The Streaming Generator, acting as a producer, continuously sends chunks of records to these topics. The records are persistently stored in the topics until the Stream Processing component, acting as a consumer, retrieves them for processing. The Stream Processing component is

Fig. 4.9.: Active and Passive Streaming Architecture

implemented in Spark Streaming [Str17b] and implements the five analytical queries described in the Workloads subsection above. The workloads are executed against the incoming data stream and the results are processed and stored in HDFS, as shown in Figure 4.9. The application uses a pre-defined window slide interval (y) to retrieve all recent messages that are not yet consumed from the Kafka topic. Additionally, static data (e.g. customer data) stored in HDFS can be read directly and joined with the stream data during the stream processing.
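The consumer side of the active mode can be sketched as follows. This is a minimal, illustrative Scala example using Spark Streaming with the kafka-0-10 connector, not the thesis implementation; the broker address, topic name, group id and batch interval are assumptions chosen for the sketch.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

object ActiveModeConsumer {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("BigBench-ActiveMode")
    // The micro-batch interval here plays the role of the window slide y (120 seconds).
    val ssc = new StreamingContext(conf, Seconds(120))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "bigbench-streaming",
      "auto.offset.reset" -> "earliest")

    // Subscribe to the web logs topic produced by the Stream Generator.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("web_logs"), kafkaParams))

    // Each RDD contains the web-log records of one micro-batch; the five streaming
    // queries would be evaluated here, e.g. after converting the RDD to a DataFrame.
    stream.map(_.value()).foreachRDD { rdd =>
      println(s"received ${rdd.count()} web-log records in this window")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}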

Passive Streaming Mode

Contrary to active mode, where stream generation and processing are in parallel, in passive mode stream generation and processing are done sequentially. The goal is to be less prone to performance bottlenecks and offer deterministic stream behavior, while measuring the performance benchmark metrics.

The Data and Stream Generation was done as described in the Data and Stream Generator subsection of Section 4.3.3. The Streaming Generator generates all chunks of records before starting the actual stream processing and stores them in HDFS directories. In this case, the Fast-access Layer is implemented as an in-memory storage for the pre-generated file chunks, so that they are directly available to the Stream Processing application. The Stream Processing component is implemented again in Spark Streaming and covers all five queries. The only difference is that when initially starting the application, every chunk of web sales and web logs data is read from HDFS and made available in memory (Figure 4.9); afterwards the execution of the five queries starts. This repeats until all file chunks are processed.

Preliminary Experiments

To validate the proof of concept implementation, we performed initial experiments for the fixed and sliding window scenarios on an experimental setup. The system under test is equipped with an Intel i7-5700HQ (2.7 GHz, 4 CPU cores), 8GB memory and a 500GB hard disk. The host runs Windows 10 with VirtualBox 5.1.4 and a virtual machine with 2 CPUs and 4GB memory running the pre-configured Cloudera Quickstart VM 5.8.0 with updated Spark 2.1.0, Hadoop 2.7, Kafka 0.10.1.1 and Scala 2.11. The system under test was set up for both batch and streaming executions of all the BigBench workloads. In the subsections below, experiments with active and passive streaming modes for the fixed and sliding window scenarios are presented.

In order to compare the different architecture modes, we report only the query execution time and exclude the end-to-end streaming time because of space constraints. The query execution time is the time measured between the query start and the query termination against a fixed or sliding window. The average query execution time is the average over the multiple query executions in a complete benchmark run. The throughput is measured by calculating the processed records (messages) per second.
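Written out as formulas (a restatement of the definitions above, not additional metrics; t_i denotes the execution time of the i-th run of a query and n the number of executions in a complete benchmark run):

\[
\bar{t}_{\text{query}} = \frac{1}{n}\sum_{i=1}^{n} t_i,
\qquad
\text{throughput} = \frac{\text{number of processed records}}{\text{elapsed time in seconds}}.
\]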

Tab. 4.13.: Average Execution Times (in seconds)

Window    Mode      QS1      QS2       QS3      QS4      QS5
Fixed     active    0.041    11.085    0.052    0.043    0.023
Fixed     passive   0.040    14.836    0.045    0.039    0.025
Sliding   passive   0.040    22.780    0.073    0.040    0.059

Fixed Window Scenario

The first set of experiments simulates data streaming with a fixed window scenario (shown in Figure 4.7a) for both active and passive modes. The fixed window size (x) was set to 120 seconds for a total runtime of 1 hour, resulting in 30 fixed windows. Due to time and resource constraints, only half of the BigBench web logs (around 10 GB) and web sales data (around 5MB), generated with scale factor 1, were used.
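The reported number of windows follows directly from the configuration stated above (for fixed windows the slide equals the window size of 120 seconds; the same relation with the 120-second slide interval also yields the 30 sliding windows reported in the next subsection):

\[
\#\text{windows} = \frac{\text{total runtime}}{\text{window slide}} = \frac{3600\ \text{s}}{120\ \text{s}} = 30.
\]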

Table 4.13 summarizes the average execution times of all five queries for active and passive mode streaming on a fixed window. QS2 has the longest execution time for both streaming modes as it performs a join between two views, contrary to QS5, which performs a simple lookup and has the fastest execution time. The throughput in active mode was on average 19 records per second, whereas in passive mode it was on average 3002 records per second (170 times faster than active mode).

Tab. 4.14.: Summary of the Active and Passive Mode Features

Features                     Active Mode    Passive Mode
Fast-access Layer            Kafka          In-memory
Throughput                   Low            High
Processing Type              Real-time      Batch
Performance Metrics          Inaccurate     Accurate
Scalability                  Complex        Simple
Streaming and Processing     Parallel       Sequential

Sliding Window Scenario

The second set of experiments simulates a sliding window scenario (shown in Figure 4.7b) for passive mode. The sliding window size (x) was set to 240 seconds and the window slide interval (y) to 120 seconds for a total runtime of 1 hour, resulting in 30 sliding windows. In contrast to the fixed window experiments, the sliding windows overlap and there are duplicate records in two subsequent windows, which increases the size of the transmitted data. The experimental results presented in Table 4.13 show that QS2 takes the longest (twice as slow as in the fixed window case, because of the overlapping data), whereas QS1 and QS4 are the fastest (to be investigated in future work). The throughput in this case was on average 5901 records per second, which is higher, as expected, due to the data duplication in the sliding windows.

4.3.5 Conclusions and Future Work

This paper presents a first concept for a stream processing extension of the BigBench benchmark. Our approach proposes configurable active and passive streaming modes in order to cover the different streaming requirements (ranging from seconds to hours) in the end-to-end application benchmark. Furthermore, the extension supports fixed and sliding window streaming to better address the common data streaming use cases. Our experiments showed that the proof of concept implementation produces feasible results, and the lessons learned are summarized in Table 4.14. However, there are still many aspects in the design and implementation that need to be improved in future work. Support for sliding windows in active mode, out-of-order record processing within and outside of a window, and parallel query execution still need to be implemented. More validation experiments on a large-scale cluster with different active and passive mode architectures and configurations need to be done. The performance gap between late binding and pre-parsed tables in the streaming extension needs to be investigated. The five queries need further evaluation and improvement in order to cover more techniques such as clustering, pattern detection and machine learning to better represent the common streaming workloads. Last but not least, new implementations with streaming technologies such as Flink and Samza will demonstrate the real benefits of the benchmark, especially when comparing integrated data storage and streaming architectures.

Acknowledgements

Special thanks to Karsten Tolle, Matteo Pergolesi and Sead Izberovic from the Frankfurt Big Data Lab for their valuable feedback.

4.4 Exploratory Analysis of Spark Structured Streaming

Abstract

In the Big Data era, stream processing has become a common requirement for many data-intensive applications. This has led to many advances in the development and adoption of large-scale streaming systems. Spark and Flink have become a popular choice for many developers as they combine both batch and streaming capabilities in a single system. However, the introduction of Spark Structured Streaming in version 2.0 opened up completely new features for SparkSQL, which are otherwise only available in Apache Calcite.

This work focuses on the new Spark Structured Streaming and analyses it by diving into its internal functionalities. With the help of a micro-benchmark consisting of streaming queries, we perform initial experiments evaluating the technology. Our results show that Spark Structured Streaming is able to run multiple queries successfully in parallel on data with changing velocity and volume sizes. The paper is based on the following publications:

• Todor Ivanov and Jason Taafe, Exploratory Analysis of Spark Structured Streaming, in Proceedings of the 4th International Workshop on Performance Analysis of Big data Systems (PABS 2018), April 9th, Berlin, Germany, 2018, [IT18].

• Extending BigBench using Structured Streaming in Apache Spark - Jason Taaffe, Master Thesis 2017, [Taa17].

Keywords: Spark Structured Streaming, Spark, Big Data Benchmarking.

4.4.1 Introduction

Big Data is growing in both volume and velocity. The combination of both creates data streams. As is often the case, different disciplines use different definitions, but a shared characteristic among these definitions is the real-time or near real-time nature of data arrival [GÖ03]. The data arrives in a continuous stream, as opposed to arriving at certain intervals, and is unbounded in nature, hence the stream potentially never ends [Lu+14; Bab+02; Aki+15]. With the rise of the Internet of Things more data streams will be created, which will push the boundaries even further. Due to potentially millions of sensors constantly sending data in mere fractions of a second, the torrent of data could reach between 10^2 and 10^5 messages per second. As noted by Shukla et al. [SCS17], Twitter by comparison receives around 6000 tweets per second. Since the analysis takes place in-memory, it is not feasible to store the data on disks or other external data storage devices [Win+16].

Micro-batching is a hybrid concept, with the main idea being ”[...] to treat the stream as a sequence of small batch chunks of data. On small intervals, the

incoming stream is packed to a chunk of data and is delivered to the batch system to be processed.” [Sha14]. Spark Streaming [Str17b] utilises this approach by limiting the batch size to keep latency at an acceptable level [Win+16]. Other hybrid concepts follow similar approaches of combining batch processing with streaming, e.g. the Lambda Architecture [MW15] and the Kappa Architecture [Kre14]. On the contrary, Flink [Fli17] offers native streaming support, which is also used to implement batch operations. An important drawback of both the Spark Streaming and Flink approaches is that the streaming application code has to be implemented and compiled before it can be executed. This restricts the usability of the technology and the development efficiency. However, Spark version 2.0 recently introduced Spark Structured Streaming [Zah16; Gui18], which enhances the SparkSQL [Arm+15b] engine with streaming capabilities. The goal of Structured Streaming is to make it easier to create streaming applications by extending SparkSQL, so that the user no longer has to worry about the details of implementing streaming and can focus on the results, while offering strong fault-tolerance and consistency. A similar approach is followed by Flink Streaming [Str17a], which uses Apache Calcite [Cal18] as an underlying engine providing the SQL streaming capabilities.

This work focuses on evaluating the new features of Spark Structured Streaming by using queries inspired by the BigBench V2 [Gha+17a] benchmark and implemented in a streaming context. Our experiments showed multiple pros and cons of the current Structured Streaming technology:

• The more queries that are run in parallel, the longer they need to complete.

• An increase in file sizes also results in longer completion times, although this relationship was not always consistent in every file bracket.

• Tentative evidence points towards heap memory being the bottleneck, while CPU utilization decreased at larger file sizes.

The remaining paper is structured as follows: Section 4.4.2 describes the main features of Spark Structured Streaming. Section 4.4.3 looks at related benchmarks and studies followed by Section 4.4.4, which presents the micro-benchmarks used in the evaluation. Section 4.4.5 analyses the experimental results and the final Section 4.4.6 concludes the paper.

4.4.2 Spark Structured Streaming

Structured Streaming is a stream processing engine, which was built on top of the Spark SQL engine with fault-tolerance and scalability in mind [DS16]. It was first added in Apache Spark 2.0 to build and simplify the development of continuous and real-time Big Data applications, by combining batch and streaming computation [Gui18]. It uses the already existing Dataset and DataFrame APIs and is intended to complement Spark Streaming in the long-term [Zah16]. One of the main benefits is that users no longer have to concern themselves with the details of streaming as this is handled by the new streaming architecture. ”The key idea in Structured Streaming is to treat a live data stream as a table that is being continuously appended. [...] Every

Fig. 4.10.: Structured Streaming Models [Gui18]

data item that is arriving on the stream is like a row being appended to the Input Table.” [Gui18]. The left part of Figure 4.10 depicts this process in a graphic manner. Depending on a trigger interval set by the user, the input table is updated with new rows on which queries are run. The output of these queries is saved in a results table. This is illustrated in the right part of Figure 4.10. It shows that every second the input table is updated with new data and afterwards a query is run, the results of which are then saved in the result table. Following this, the user has several options for saving the content of the result table to an external storage.
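The input table, trigger and result table concepts can be made concrete with a small illustrative Structured Streaming job (not the benchmark code). The input directory, the schema and the 1-second trigger are assumptions made for the example; the console sink simply prints the result table after every trigger.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.{StringType, StructType}

object StructuredStreamingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("StructuredStreamingSketch").getOrCreate()

    // Schema for incoming JSON web-log records (only one field used here).
    val schema = new StructType().add("wl_webpage_name", StringType)

    // Unbounded input table: every new file in the directory appends rows.
    val logs = spark.readStream.schema(schema).json("hdfs:///streaming/input")

    // Query over the input table; its output forms the result table.
    val counts = logs.groupBy("wl_webpage_name").count()

    counts.writeStream
      .outputMode("complete")                          // rewrite the full result table each trigger
      .trigger(Trigger.ProcessingTime("1 second"))     // trigger interval set by the user
      .format("console")
      .start()
      .awaitTermination()
  }
}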

Structured Streaming addresses certain inherent challenges in streaming systems, such as inconsistency when processing records (which can lead to nonsensical results), fault tolerance both inside and outside the engine, and out-of-order data. Structured Streaming offers a guarantee: ”at any time, the output of the application is equivalent to executing a batch job on a prefix of the data.” [Zah16]. It also supports event time aggregation to enable the processing of out-of-order data, which is very similar to grouped aggregations.

Spark collects a large assortment of metrics once the application is running, of which only a small subset is relevant to the benchmark that we ran. Table 4.15 below lists the name, definition and description of each relevant metric, taken directly from the official Apache Spark Scala source files on GitHub [Cod18].

During the benchmark implementation process, we identified a problem with the processingRate-total file. Due to an error in the MetricsReporter Scala file, which is used to create the CSV files, the files inputRate-total and processingRate-total were reporting the same values. We opened a ticket and submitted a fix, and the bug was resolved in the latest release ([SPARK-22052]).

Figure 4.11 illustrates how the metrics relate to each other and to which point in the streaming process they correspond. The figure assumes processing time is set to 100 seconds to showcase the difference between processing time and trigger execution.

206 Chapter 4 ABench: Big Data Architecture Stack Benchmark Tab. 4.15.: Metrics

Name | Definition | Description
Latency | triggerExecution: the amount of time taken to perform various operations, in milliseconds | Describes the amount of time needed to perform the SQL query operations during one trigger interval.
InputRate-total | numRecords / inputTimeSec | Describes how many rows were loaded per second between the start of the last trigger and the start of the current trigger.
ProcessingRate-total | numRecords / processingTimeSec | Describes how many rows were processed per second between the start and the end of the current trigger.
InputTimeSec | currentTriggerStartTimestamp - lastTriggerStartTimestamp | The period of time between the start of the current trigger and the start of the last trigger.
ProcessingTimeSec | currentTriggerEndTimestamp - currentTriggerStartTimestamp | The period of time between the beginning and the end of a trigger period.
NumInputRows | numRecords | The number of rows in a batch.
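Most of the metrics in Table 4.15 can also be read programmatically at runtime. The following sketch, which is not part of the benchmark code, shows two standard ways of doing this with the public Structured Streaming API: polling the last progress of a running query and registering a StreamingQueryListener.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

val spark = SparkSession.builder().appName("metrics-sketch").getOrCreate()

// Option 1: poll the most recent progress of a running query.
// `query` is assumed to be an already started StreamingQuery
// (e.g. query05 from Code 4.9 below).
// val progress = query.lastProgress
// println(progress.inputRowsPerSecond)      // corresponds to InputRate-total
// println(progress.processedRowsPerSecond)  // corresponds to ProcessingRate-total
// println(progress.numInputRows)            // corresponds to NumInputRows

// Option 2: receive every progress update asynchronously.
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val p = event.progress
    println(s"${p.name}: ${p.numInputRows} rows, " +
      s"${p.processedRowsPerSecond} rows/s processed")
  }
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
})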

Fig. 4.11.: Trigger Process

Changes in the trigger processing time have different effects on the calculation of the metrics. ProcessingTimeSec is defined as currentTriggerEndTimestamp - currentTriggerStartTimestamp, effectively the duration of the trigger execution. Setting the trigger processing time to a specific value therefore does not affect this metric, since the user does not influence the trigger execution time itself but merely alters the amount of waiting time. However, InputTimeSec is affected by a change in the trigger processing time, as it is defined as currentTriggerStartTimestamp - lastTriggerStartTimestamp, so a decrease or increase in waiting time alters this metric. Assuming no specific trigger processing time is set, meaning that as soon as one trigger execution is completed the next one starts, the difference between ProcessingRate and InputRate will be marginal, but not zero.
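The relations described above can be summarized compactly (a restatement of the definitions in Table 4.15, not additional metrics):

\[
ProcessingTimeSec = t^{end}_{curr} - t^{start}_{curr}, \qquad
InputTimeSec = t^{start}_{curr} - t^{start}_{last}
\]
\[
InputRate = \frac{numRecords}{InputTimeSec}, \qquad
ProcessingRate = \frac{numRecords}{ProcessingTimeSec}
\]

where \(t^{start}_{curr}\), \(t^{end}_{curr}\) and \(t^{start}_{last}\) denote the start and end timestamps of the current trigger and the start timestamp of the last trigger.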

4.4.3 Related Work

Spark [Spa17] and Spark Streaming [Str17b] have been adopted by the industry as key technologies for developing big data applications. Therefore, multiple studies investigated how Spark performs under different workloads and benchmarks [Ous+15], including micro-benchmarks such as HiBench [Int18b], SparkBench [Li+17] and the Yahoo Streaming Benchmark [Chi+16]. Other important aspects like efficient memory management [KHB16; KB17; Awa+15; Awa+16] and competitiveness with other frameworks like MapReduce and Flink [Shi+15; Vei+16; Mar+16] were also investigated.

Recently, a new approach for performance clarity, called Monotasks, was presented by Ousterhout et al. [Ous+17b; Ous+17a]. It points out the importance of understanding and visualizing the bottlenecks in today's complex systems and was demonstrated through a Spark prototype. In the same spirit, an improvement of the Spark Streaming concept, called Drizzle [Ven+17], was demonstrated to compensate for the overhead of micro-batching and bring it closer to native streaming.

Another work by Zhang et al. [ZCC17] presented a distributed streaming query engine running on a cluster and compared its performance to other streaming engines, Structured Streaming being one of them. The work found that many streaming operations are unsupported in the current implementation of Structured Streaming, and therefore it is not possible to fully compare it with other stream engines. However, for those operations that did work, the performance of Structured Streaming was worse than that of Spark Streaming. As there are no extensive evaluations of Spark Structured Streaming performance and no benchmarks targeting exactly these types of engines, we address this in the remaining part of our study.

4.4.4 Structured Streaming Micro-benchmark

Before performing the evaluation, it is important to identify the benchmark and methodology used in the analysis. Since, as mentioned in the related work, there are no benchmarks targeting streaming SQL engines, we reuse and extend the BigBench V2 [Gha+17a] benchmark as a basis for our micro-benchmark.

BigBench (TPCx-BB) [Gha+13b] is an end-to-end benchmark used to test Big Data systems, partly based on the TPC-DS benchmark. BigBench V2 [Gha+17a] is the updated version of BigBench and addresses some of its drawbacks. It no longer uses complex queries from TPC-DS and simplifies the schema. In addition, semi-structured data is no longer handled as a structured table with a fixed schema; instead it is treated as key-value pairs. The new BigBench V2 data model retains the variety of structured, semi-structured and unstructured data; however, the structured data component now includes only six tables, and product reviews are generated synthetically. The benchmark consists of 30 queries covering different business workload requirements.

Our experiments consist of two phases. In the first phase, data was generated using the BigBench V2 data generator and then manually split into files stored on disk to simulate a stream of data. In the second phase, the queries were executed on the simulated stream data and produced a set of statistical results reporting performance and resource characteristics. The following subsections describe the query workloads, how they are implemented, and finally the data preparation and execution phases.

Workloads

Out of the 30 queries that are part of the original BigBench, four were selected for testing because they are suitable for real-time analytics and relevant from a business and technical point of view. A fifth query (Qmilk) is very simple and checks how many products of a certain type are sold in a particular time frame. The streaming workload consists of these five queries executed periodically on a stream of data. All queries are defined in plain text as follows:

• Q5 : Find the 10 most browsed products in the last 100 seconds.

• Q6 : Find the 5 most browsed products that were not purchased across all users (or only a specific user) in the last 100 seconds.

• Q16 : Find the top ten pages visited by all users (or a specific user) in the last 10 minutes.

• Q22 : Show the number of unique visitors in the last hour.

• Qmilk : Show the sold products (of a certain product type or category).

Setup

The experiments were performed on a workstation with 8GB of main memory, an Intel Core i5 CPU 760 @3.47GHz x4 and a 1TB hard disk. On top, Ubuntu 16.04 LTS was installed, running Java version 1.8.0.131, Scala version 2.11.2 and Apache Spark

2.3. Spark was used in standalone mode with the default configuration parameters for all experiments.

Data Preparation

For the data generation, we used the data generator in BigBench V2 [Gha+17a] with scale factor 1, in particular the web logs consisting of web clicks in JSON format (around 20GB) and the structured web sales data (around 10MB). Web log files with sizes ranging from 50MB to 2000MB were created to test the system performance at different file sizes. For every file size, 10 files were created to simulate a stream of 10 data files. As the web sales file was only 10MB large, that size was retained, but the file was replicated 10 times so that 10 files could be streamed as well. In total, 31 different combinations of parallel query executions were tested; the web sales file size for the Qmilk query was kept constant at 10MB the entire time. The web log sizes for all other queries (Q5 and Q16) were increased stepwise, starting from 50MB up to 2000MB.
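The splitting step described above could, for instance, be automated as in the following sketch. It is purely illustrative (in our setup the files were split manually); the input path, output name pattern and the default chunk size of 50MB are hypothetical.

import java.io.{File, PrintWriter}
import scala.io.Source

// Split a JSON-lines web log file into chunks of roughly `chunkBytes` each,
// so that the resulting files can be placed into the streaming directory.
def splitLogFile(input: String, outputPrefix: String,
                 chunkBytes: Long = 50L * 1024 * 1024): Unit = {
  val source = Source.fromFile(input)
  var part = 0
  var written = 0L
  var writer = new PrintWriter(new File(s"$outputPrefix-$part.json"))
  try {
    for (line <- source.getLines()) {
      if (written >= chunkBytes) {      // start a new chunk once the size budget is used
        writer.close()
        part += 1
        written = 0L
        writer = new PrintWriter(new File(s"$outputPrefix-$part.json"))
      }
      writer.println(line)
      written += line.getBytes("UTF-8").length + 1
    }
  } finally {
    writer.close()
    source.close()
  }
}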

Implementation

To this end, four existing queries were initially chosen (Q05, Q06, Q16 and Q22) and a new one was created (Qmilk) to test the Structured Streaming environment. Due to technical limitations present in Structured Streaming at the time of writing, it was not possible to run queries Q06 and Q22 because of their specific nature. Query Q06 is the most complex query and performs join operations on two streaming datasets, which is not (yet) supported by Structured Streaming, as pointed out by Zhang et al. [ZCC17]. Performing a join between one streaming dataset and one static dataset is possible, but the SQL statement includes multiple aggregations (invoking the count function), an operation not yet supported on streaming datasets. Query Q22 uses a distinct operation, which is also not supported on streaming datasets. Furthermore, it utilizes sorting operations (Order By) that are only supported after an aggregation. Lastly, the Limit keyword is not supported at all in a streaming context. Hence, it was only possible to implement and test queries Q05, Q16 and Qmilk. The Scala code is available on GitHub [Taa18] together with all the test data and query starting code. Listing 4.8 shows the Scala Structured Streaming implementation of the three queries. The code examples are for Spark running in local mode; several changes are necessary to run them on a cluster.

Code 4.8: Query 5, Query 16 and Qmilk

var web_logs_05 = web_logsDF
  .groupBy("wl_item_id").count()
  .orderBy("count")
  .select("wl_item_id", "count")
  .where("wl_item_id IS NOT NULL")

var web_logs_16 = web_logsDF
  .groupBy("wl_webpage_name").count()
  .orderBy("count")
  .select("wl_webpage_name", "count")
  .where("wl_webpage_name IS NOT NULL")

var web_sales_milk = web_salesDF
  .groupBy("ws_product_id").count()
  .orderBy("ws_product_id")
  .select("ws_product_id", "count")
  .where("ws_product_ID IS NOT NULL")
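The listings above operate on the streaming DataFrames web_logsDF and web_salesDF, whose construction is not shown. A minimal way to create such a streaming source from the generated JSON files could look as follows; the directory path and the schema fields other than wl_item_id and wl_webpage_name are assumptions and not taken from the benchmark code.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StructType, StringType, TimestampType}

val spark = SparkSession.builder()
  .appName("structured-streaming-microbenchmark")
  .master("local[*]")
  .getOrCreate()

// A file-based streaming source requires an explicit schema.
val webLogSchema = new StructType()
  .add("wl_item_id", StringType)
  .add("wl_webpage_name", StringType)
  .add("wl_timestamp", TimestampType)   // hypothetical event-time column

// Every new file placed in the directory becomes part of the stream.
val web_logsDF = spark.readStream
  .schema(webLogSchema)
  .json("/data/streaming/web_logs")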

Listing 4.9 shows how the query execution is triggered for query Q05, which is performed in phase two of the benchmark. The format option determines which output sink is used. As of this writing there are four available options: file sink, foreach sink, console sink and memory sink. The file sink option stores the output in a directory; the text within format is set to parquet, json, csv or a similar file type. The foreach sink runs arbitrary computation on the records. The console and memory sink options are primarily used for debugging purposes or when the file sink is not an option. For our benchmark the console option, shown in Listing 4.9, was chosen to facilitate error detection and because the queries prevented the use of the file sink option: as of this writing, it is not possible to write the output to files if the queries use aggregation, which is the case here.

By setting the optional queryName in the configuration, the query gets an internal name that is used for reporting purposes and for naming the statistical files. The trigger configuration option is not required to run a streaming query. It determines the interval at which Spark adds new rows to the input table. If the line is omitted, Spark will update the input table as soon as possible, depending on the system performance, file complexity and the file size used for the streaming dataset. An additional option is to fire the trigger only once, via trigger(Trigger.Once()), after which the query stops. If the user sets the processing time faster than the system can manage, a warning message with the actual processing time is displayed after the query is executed. The outputMode configuration option in Listing 4.9 (depicted also in the right part of Figure 4.10) determines which mode is selected when writing the output to external storage. As of this writing there are three mode options: complete mode, append mode and the newest, update mode. When complete mode is selected, the entire updated result table is written to the external storage. When choosing append mode, Spark writes only the new rows that were added to the result table since the last trigger to the external storage. The final option, update mode, tells Spark to write only those rows that were updated in the result table since the last trigger; the emphasis here is on existing rows that were updated, not on new ones being added. The different modes are also subject to certain limitations.

Code 4.9: Spark WriteStream

var query05 = web_logs_05.writeStream
  .format("console")
  .queryName("05")
  .trigger(Trigger.ProcessingTime("150 seconds"))
  .outputMode(OutputMode.Complete())
  .start()

Append mode does not work in our situation, as the queries use aggregations based not on event time but on other attributes such as the number of

web pages or product and item IDs. Furthermore, update mode is also not possible with these queries, as they include sorting operations, which are not permitted. Hence, complete mode is the only available option, but at the same time this prevents the file sink from being used as the output sink due to technical restrictions. Finally, .start() tells Spark to initiate the streaming query.

Limitations

This subsection outlines some Structured Streaming limitations and problems that we encountered during our experiments.

File Creation Date: During testing, we discovered strange behavior in Structured Streaming when manually copying and moving the data files to the assigned directories. It appeared that when all the streaming files were copied to the streaming directory and had the same file creation or modification date, Spark considered them to be the same file even if the files had different names (web-logs1, web-logs2, etc.). After starting the program, the results of the SQL queries did not update following the first file. Instead, the results array was either blank or statically displayed the results obtained after the first file was processed. This issue could not be replicated in all cases, but by making sure the file creation or modification dates differed slightly, or by cutting instead of copying the files to the streaming directory, the issue was solved.

Bigger File Sizes: The largest file size used was 2000MB, which is relatively small in the Big Data context and did not test Spark's true capabilities. Future research should use larger files to investigate potential bottlenecks and peak performance. Additionally, as only ten files were streamed in each micro-benchmark, the influence of fluctuations in system performance and other external factors cannot be discounted. To solve this issue, longer streams with more files should be tested to facilitate more robust testing conditions.

Spark Cluster Mode: Finally, as Spark was run in standalone mode and not on a cluster, the advantages of parallel computation were not utilized. Using BigBench as an end-to-end benchmark on a cluster, to reflect actual usage scenarios, should be addressed in future research.

4.4.5 Exploratory Analysis

This section looks at the results obtained after performing a series (around 31 different combinations) of experiments using the three queries implemented with Spark Structured Streaming. Initially, we observed that the first run took much longer than the following runs, as depicted in Figure 4.12. The reason for this is that the system needed to warm up before reaching a consistent level of processing speed. Therefore, the first measured data point out of the total 10 data points (streamed files) was omitted from our analysis.

Fig. 4.12.: Latency Distribution

Fig. 4.13.: Optimal Trigger Processing Time

Figure 4.13 depicts the average execution times, excluding the first run, for all combinations of queries. We can identify three main groups of queries: group 1 consists of single executions of Q5 and Q16; group 2 consists of parallel pair executions of Q5, Q16 and Qmilk; and group 3 consists of triple parallel executions of Q5, Q16 and Qmilk. In general, the more queries are run in parallel and the larger the file sizes used, the longer they take to complete and the more resource-intensive they are.

For example, the median latency of Q5 increased by more than 200% between the 50MB and 2000MB file sizes, and when testing resource utilization across file sizes, the time to completion more than doubled between the smallest and the largest file size. When comparing single-query latency to triple-query latency, it increased two- to three-fold. Additionally, the larger the file sizes, the more fragmented the query groupings became: initially all queries could be assigned to three distinct groups based on their latency distribution and on how many queries were running at the same time, yet at larger file sizes five or six groups were observed.

The resource utilization also showed interesting results, as CPU utilization was highest at the 50MB level and decreased every time the file size was increased. In contrast, the Java heap size increased for larger file sizes. It appears that the memory

size will be the limiting factor when using larger files, but this needs to be confirmed with more extensive tests in future research.

Another point is the gathering of Spark metrics. It needs to be improved and made more user-friendly, as it was often more efficient to extract the metrics via the configuration than through the various sink options in Spark. Additionally, the sink option exports only three streaming metrics, whereas the log4j method offers more metrics for evaluation. The Structured Streaming trigger process itself is not documented, and thorough experimentation was needed to understand the inner workings of the process, as mentioned in Section 4.4.2 (Figure 4.11).

4.4.6 Lessons Learned

Tab. 4.16.: Structured Streaming Pros and Cons

Pros | Cons
1 Simple programming model and streaming API | Undocumented trigger process
2 Several built-in metrics | Query limitations due to streaming API
3 Several extraction and sink options for metrics | Complicated metric extraction process
4 Possible to run queries in parallel |

Based on our exploratory analyses and experiments, we summarize the findings of our study. Structured Streaming exhibits several advantages over the legacy Spark Streaming module, but is also subject to certain weaknesses, listed in Table 4.16. The simple programming model and streaming API are countered by query limitations that restrict queries to the operations supported in the current version of the API. And even though several metrics are already built into Structured Streaming, including various ways of extracting them for analysis, this process is still fairly complicated. Streamlining the metric extraction process would facilitate faster analysis of the data and make Structured Streaming more attractive to companies searching for viable streaming solutions. Finally, the trigger process is not documented in a detailed and intuitive manner. As Structured Streaming is the newest addition to Apache Spark, it still has room for improvement. Future updates could remedy these issues by improving the documentation and extending the metric environment in Apache Spark.

4.5 ABench: Big Data Architecture Stack Benchmark

Abstract

Distributed big data processing and analytics applications demand a comprehensive end-to-end architecture stack consisting of big data technologies. However, there are many possible architecture patterns (e.g. Lambda, Kappa or Pipeline architectures) to choose from when implementing the application requirements. A big data technology in isolation may be best performing for a particular application, but its performance in connection with other technologies depends on the connectors and the environment. Similarly, existing big data benchmarks evaluate the performance of different technologies in isolation, but no work has been done on benchmarking big data architecture stacks as a whole. For example, BigBench (TPCx-BB) may be used to evaluate the performance of Spark, but is it applicable to PySpark or to a Spark with Kafka stack as well? What is the impact of having different programming environments and/or any other technology used together with Spark? This vision paper proposes a new category of benchmark, called ABench, to fill this gap and discusses key aspects necessary for the performance evaluation of different big data architecture stacks. The paper is based on the following publication:

• Todor Ivanov and Rekha Singhal, ABench: Big Data Architecture Stack Benchmark, in Proceedings of the 9th ACM/SPEC International Conference on Performance Engineering (ICPE 2018), April 9-13, Berlin, Germany, 2018, [IS18].

Keywords: ABench, BigBench, Big Data Benchmarking, Big Data.

4.5.1 Motivation

There is a growing number of new applications and use cases that challenge the capabilities of existing systems due to the availability of large-scale and high-speed data. Big data analytics is one of the applications that is going viral both inside and outside the enterprise. The 3Vs characteristics of Big Data are still changing their dimensions, helped by new Vs defined by the growing needs of the industry. On the other hand, the speed with which new technologies and features are emerging makes it very hard for both developers and users to keep track of the best technology on the market. Nowadays, many of these software tools are open source and created by communities, which do not have sufficient time and resources to keep the documentation up-to-date and present the tool in a way comparable to enterprise products. Moreover, the full deployment of an application may require multiple technologies connected to each other; for example, streaming applications may be deployed on Kafka and Storm platforms. Such an architecture should be modular by design, to accommodate mix-and-match configuration options as they arise. Lambda and Kappa architectures can differ in implementation, depending on the technology deployed at each layer of the architecture stack. Exactly these variations in the architectural components lead to the relevance of heterogeneity in big data architectures, a concept outlined in multiple studies [IIK16; PS16; Wu+16]

as an emerging trend that will open many new challenges. Moreover, the size of the application workload and the data size may play a critical role in choosing an architecture. Performance prediction models such as [SS17; SV16] may be extended for architecture benchmarks to predict the performance for different workloads and data sizes, avoiding multiple executions of the benchmark.

There is a need to benchmark the performance of a given solution architecture, which may be deployed using different sets of technologies and hardware. We propose a new type of benchmark to evaluate the performance of big data stacks for different deployment architectures. This benchmark can act as a tool to evaluate the performance of a particular big data technology/architecture stack on the desired hardware. The benchmark can be used by solution architects to compare the features and performance of architecture instances with different technologies specified at each layer. The available benchmarks [TPC18a] (TPC-DS, TPCx-BB, TPCx-HS, etc.) measure the performance of a technology but do not address the performance of the connectors linking two technologies to create an architecture.

The remainder of the paper is organized as follows: Section 4.5.2 gives an overview of the proposed benchmark and presents the main benchmark concepts. Section 4.5.3 presents a general framework for big data architecture benchmarks together with its challenges. Section 4.5.4 presents two use cases based on BigBench and the final Section 4.5.5 concludes the paper.

4.5.2 Benchmark Overview

Historically, the TPC benchmarks [TPC18a] have been used as an industry standard for performance comparisons of hardware and software systems. The TPC benchmark specifications are implemented by the vendors and then audited for transparency and fairness purposes. Today, several new technologies are open source and developed by communities such as the Apache Foundation and not by a single vendor company. Similarly, the primary users of the emerging benchmarks are the data engineers, software developers and architects participating in these open source communities, in contrast to the declining number of enterprises ready to create their own implementation of a TPC benchmark, execute it and finally go through the process of result auditing. For example, the TPC-DS benchmark is often used to stress test SQL-on-Hadoop engines such as Hive, Impala and SparkSQL, but no officially audited results have been published.

At the same time, the variety of new emerging big data technologies (including Data Science, Machine Learning and Deep Learning tools) opens the space and the need for new standardized benchmarks that target exactly these new tools and analytics techniques. The challenge is to find new methodologies and techniques that address this problem and provide a practical approach that closes the benchmarking gap. Our solution is to develop a new type of Big Data Architecture Stack benchmark (ABench) that will first incorporate the best practices of the existing benchmarks and second try to use innovative approaches to solve the challenges posed by the new big data technologies. In contrast to the typical TPC benchmarks, which provide a strict specification, ABench will be more similar to the Java Client/Server benchmark defined by SPEC. The

benchmark framework will cover a broad spectrum of realistic use cases and the common best-practice architectures for implementing them.

Fig. 4.14.: Abstract Big Data Stack

As mentioned, one of the biggest challenges for developers is to deal with the complexity and variety of big data technologies in all layers of the data platform stacks. Figure 4.14 depicts the functional layers of a typical big data platform in an abstract way. Implementing a typical big data use case involves the use of tools from most of the depicted layers in the stack. This requires that the application developer has good knowledge first of the application requirements and second of the available big data technologies at each layer. After identifying the necessary tools, he/she should be able to configure them properly and make sure they can exchange data in an effective way. The platform administration and the optimal configuration are other important points that are key for the application performance. What will be helpful in this situation? As a benchmark framework, ABench will provide a common big data application implementation that can be used for platform testing and as a starting point in the development process. For example, for a streaming application, using a pre-configured Lambda or Kappa architecture with tools already implemented on the different stack layers will immensely reduce the starting overhead for the developer. Meanwhile, setting up the benchmark on their infrastructure will enable developers to directly measure the performance of both a single tool and the entire platform stack, including the overhead for data exchanges between the components. Furthermore, making our benchmark framework open source and accessible to everyone will give users the opportunity to contribute their code and improve the performance of the implemented best practices. Some of the basic principles that will be followed are:

• Open source implementation and extendable design

• Easy to set up and extend

• Include a data generator or public data sets to simulate workloads that stress the architecture

• Reuse of existing benchmarks

Another important aspect in our vision is to define the benchmarking perspectives of ABench. Inspired by Andersen and Pettersen [AP95], who defined four benchmark types (generic, competitive, functional and internal benchmarking) in the context

Fig. 4.15.: Types of Benchmarks

of a company comparison, we adapt these four benchmark types to the context of complex big data architecture stacks as follows:

• Generic Benchmarking checks if the general business requirements and specifications according to which the implementation is done are fulfilled.

• Competitive Benchmarking is a performance comparison between the best tools on the platform layer that offer similar functionality.

• Functional Benchmarking is a functional comparison of the features of the tool against technologies from the same area.

• Internal Benchmarking is on the lowest level, comparing variations of the implementation code done using a particular tool.

Figure 4.15 briefly depicts the four types of benchmarks in quadrants together with examples of popular big data benchmarks. Generic benchmarking is done manually by people and as such is hard to automate in software. However, it is possible to classify big data applications into categories according to industry sector and domain, so that each category is represented by a benchmark that covers all main business requirements and specifications. Currently, many popular big data benchmarks such as HiBench [Hua+10b], YCSB (TPCx-IoT) [Coo+10] and BigBench (TPCx-BB) [Gha+13a; Gha+17b] are typically used for competitive and functional benchmarking to compare common features of different big data technologies. They focus on testing the technical aspects of big data technologies and are built with different goals in mind. For example, BigBench covers competitive (Hive, SparkSQL, etc.) and functional (the same HiveQL code on Hive and SparkSQL) benchmarking, SparkBench targets only the Spark engine and as such covers only functional benchmarking, whereas HiBench covers competitive, functional (Spark, Flink, etc.) and internal benchmarking (Scala, Java, Python).

In short, our goal is to cover all four benchmark types and develop ABench as a multi-purpose benchmark framework that can be used in many big data scenarios. For example, one approach is to extend BigBench to cover more general

application scenarios (e.g. streaming, machine learning, etc. for generic benchmarking requirements), to provide implementations in multiple big data technologies (e.g. Flink, Kafka, Impala, etc. for competitive and functional benchmarking), and to offer implementations in different APIs (e.g. Spark, PySpark, R, etc. for internal benchmarking).

4.5.3 Architecture Benchmark Framework

Our proposal for the ABench benchmark framework shall stress test common application business requirements (e.g. retail analytics, retail operations, etc.), big data technology functionalities and best-practice implementation architectures. The benchmark framework needs to define the following components:

Data Model: The types of data models and the relationships across them. For example, a retail application may have structured, unstructured, semi-structured and stream data types, and a social networking application may have graph data types. The relationships across the different types of data need to be specified, i.e. how one is derived from another. For relational and key-value stores, the data access mechanism needs to be defined: sequential scan or random access through an index.

Data Store: For each type of data model, the possible actions on it, together with their required performance, need to be specified. This may be used to decide the storage type in terms of persistence, partitioning, in-memory storage, duplication or a combination of these.

Data Generation: The framework needs different types of data generators for stream messages, structured data, graphs, unstructured data, documents, etc. For example, a stream data generator may also need to specify the number of streams and the velocity of generation, whereas a graph generator may specify the number of nodes and the edges between them. Depending on the data models in the architecture, the generators shall capture their interdependency.

Workload: The workloads could be business-problem dependent, if only specifications are given, or could be in the form of SQL queries that may access data across different data models. The workloads could include graph queries, operational system workloads, machine learning analytics, continuous queries and stream analytics. Each use case will focus on a different functionality and system architecture, which also means that it will be implemented with different technologies.

Data Consistency and Security: Application-level data consistency requirements such as ACID, CAP, etc. impact the performance of a technology. Moreover, different security constraints in different technologies may result in different performance; for example, mature technologies may ensure strict security compared to recent ones. The workloads shall be able to capture these aspects in the benchmark.

Benchmark Control Knobs: The benchmark shall be able to execute with a varying number of concurrent users per second, whereby each user may click at his/her own speed (including think time), which results in the total number of concurrent sessions supported by the system. For example, in a streaming case each web click will

generate a message, which may also be controlled, since the performance of the messages and connectors depends on the message size moving across the different layers of the architecture stack. The data sizes shall vary for all data types, and an open question is whether there should be control over the skewness in the data sets, especially in web clicks, where a user may click on only one type of product repeatedly.

Performance Data Collection: The benchmark shall have mechanisms to collect technology-specific, component-specific and architecture-specific performance counters.

Benchmark Metric: Defining a benchmark metric (e.g. TPS or BBQpm@SF) is always a challenging task, especially in complex environments. Our benchmark defines functionally independent components (a stack layer implemented in a particular technology), which logically will have different internal metrics and measures. However, for all components the execution time is one of the most important metrics, and we adopt it as the main metric in the benchmark. The total (end-to-end) execution time of an application scenario is the sum of the execution times of the benchmark components, each including the technology and its connector.
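Under this definition, the intended end-to-end metric can be sketched as follows, with n benchmark components:

\[
T_{end\text{-}to\text{-}end} = \sum_{i=1}^{n} \left( T_{technology,\,i} + T_{connector,\,i} \right)
\]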

The other most important performance metrics, especially for distributed big data systems, are scalability and reliability. These may need to be defined and calculated in the context of an architecture, either at the level of each component or for the whole stack. The benchmark could include other metrics such as throughput, energy efficiency and the cost of the solution as well.

Challenges:

Below are the key challenges that one needs to address while designing and/or creating architecture benchmarks.

• An architecture benchmark may be defined as a hybrid of available benchmarks. The challenge is to make those benchmarks work together on the same platform and to build connectors across the chosen benchmarks.

• Benchmark specifications may be provided to capture all functional and non-functional details (e.g. people, time, etc.). The challenge is to give sufficient details for vendors to implement them on any given set of technologies. Also, the validation of their implementations and results will be difficult.

• Express benchmarks (as defined by the TPC) may be provided for a few architecture patterns with multiple technologies. However, this requires a lot of implementations and the standardization of these benchmarks across different deployment systems.

• Where should the data generators reside, especially dynamic data generators (such as stream generators), which may interfere with the benchmark performance?

• Can the benchmark materialize fewer intermediate data sets to improve performance?

4.5.4 Benchmark Use Cases

This section proposes two concrete use cases for architecture benchmarks (as part of ABench) by re-using BigBench [Gha+13a; Gha+17b] as a baseline. The goal is to extend its retail business scenario to cover both stream processing workloads and advanced machine learning techniques.

Stream Processing

A retail business application deployment may need a message platform (e.g. Kafka), a streaming engine (e.g. Spark, Storm, Flink), an in-memory store (e.g. MemSQL, MongoDB) and a persistent data store (HDFS, HBase). A benchmark is needed to evaluate the performance of the whole architecture stack. The benchmark shall have control over the ingestion of workloads in terms of the number of concurrent sessions and the data size. Data streaming and processing is one type of workload that is still not addressed in BigBench and currently attracts huge interest in both research and industry. Moreover, continuous query workloads are not benchmarked with the current BigBench. At the same time, there are multiple system architectures, like the Lambda and Kappa architectures, that can be used as implementation standards for this type of application.

The stream dataset size is a function of the number, size and rate of messages per second. Semi-structured and structured datasets are a function of the data size in the file or table. Some streaming queries in BigBench are listed below (a sketch of how the last of them could be expressed in Structured Streaming follows the list):

• Find the top 10 products (or categories) that are viewed by the most users (viewed by at least 50 customers) in the last 10 minutes from the current date and time.

• Generate an offer if a user has made purchases totalling more than USD 1000 in his/her last 5 transactions.

• Find the top 10 selling products in the last hour.

• Show the number of unique visitors in the last hour.
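As an illustration, the last query could be approximated in Spark Structured Streaming as in the following sketch. It assumes a streaming DataFrame visits with the columns user_id and event_time (hypothetical names); since an exact distinct count is not supported on streaming datasets (see the discussion of Q22 in Section 4.4.4), approx_count_distinct is used instead.

import org.apache.spark.sql.functions.{window, col, approx_count_distinct}

// Approximate number of unique visitors per one-hour event-time window.
val uniqueVisitors = visits
  .withWatermark("event_time", "1 hour")
  .groupBy(window(col("event_time"), "1 hour"))
  .agg(approx_count_distinct("user_id").as("unique_visitors"))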

Machine Learning

Traditional descriptive analytics and business intelligence (BI) have evolved, and companies today rely on various machine learning (ML) techniques to get better and faster business insights. Gartner [Gar17] has defined four types of advanced analytics that businesses adopt: descriptive analytics, diagnostic analytics, predictive analytics and prescriptive analytics.

In the current BigBench [Gha+13a; Gha+17b], five (Q5, Q20, Q25, Q26 and Q28) out of the 30 queries cover common ML algorithms like Clustering (K-Means) or Classification (Logistic Regression and Naive Bayes). A recent evaluation of the

benchmark has proposed to extend the BigBench workload with Collaborative Filtering, using the Matrix Factorization implementation in Spark MLlib via the Alternating Least Squares (ALS) method [Sin16]. The main objective is to extend the existing workloads to cover a wider spectrum of the four advanced analytics types. At the same time, there is a need for new types of ML metrics that allow comparing the scalability and accuracy of different ML frameworks.
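A minimal sketch of such a collaborative filtering workload with the ALS implementation in Spark MLlib is shown below; the input DataFrame ratings and its column names are hypothetical and not part of the BigBench specification.

import org.apache.spark.ml.recommendation.ALS

// Train a matrix factorization model on (user, item, rating) triples.
val als = new ALS()
  .setUserCol("user_id")
  .setItemCol("item_id")
  .setRatingCol("rating")
  .setRank(10)
  .setMaxIter(10)
  .setRegParam(0.1)

val model = als.fit(ratings)

// Produce the top 5 item recommendations for every user.
val recommendations = model.recommendForAllUsers(5)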

4.5.5 Conclusions

In this paper, we have proposed a new type of multi-purpose benchmark framework for big data architecture stacks, called ABench. It can be created by reusing or extending existing big data benchmarks such as HiBench and BigBench. We have outlined a framework for these new benchmarks and proposed streaming and machine learning extensions based on BigBench.

5 Conclusions

This chapter concludes the dissertation. It first gives an overview of on-going research work that extends the research of this thesis, and then it summarizes the thesis contributions.

5.1 On-going Research Work

Two on-going research efforts are tightly related to this thesis: ABench and DataBench.

5.1.1 ABench: Flexible Platform Infrastructure Implementation

ABench [IS18], presented in Chapter 4, is a novel Big Data Architecture Stack Benchmark that takes into account the heterogeneity of Big Data architectures. The focus of this on-going research work is to provide the initial ABench proof-of-concept implementation, which builds upon the extensions of BigBench V2.

The current implementation of ABench uses the flexibility of the Docker [Doc19] container technology and Kubernetes [Kub19a], which helps to deploy and manage the different benchmark components and systems under test in both cloud and on-premise environments. Additional tools like Helm [Hel19] and the Kubernetes Operator [Ope19] are utilized to automate the deployment process. Using these technologies helps us build a standardized benchmark infrastructure that is independent of the system under test and does not require any specific driver modifications in order to be compatible with emerging processing and storage technologies.

The initial ABench design builds on top of BigBench V2, which means that the different Batch and Streaming modes need to be implemented as presented in Section 4.3. For a better understanding, the architectures of the two benchmarking modes are shown separately.

Figure 5.1 depicts the Batch mode implementation of BigBench V2. The right part of the figure shows the abstract components of the Batch mode: the data generator, the Batch processing component and the persistent storage layer. The left part of the figure shows how the abstract components are implemented in the ABench architecture using different technologies. The key infrastructure technology is Kubernetes, which manages the allocation of all hardware resources and their distribution among the Docker containers.

Another important point is the use of persistent storage (Local Persistent Storage service), which the containers access and use to store persistent data. Without this service, the data would not be available after a container failure or restart.

Additionally, there are four services implemented on Docker containers. The Data Generator service is responsible for executing the BigBench data generator to generate a pre-defined data size (scale factor) and store it persistently in the HDFS Persistent Storage service in the form of files. Next, a warehouse schema is created and stored in the Hive Metastore service, after which the data is inserted into the created Hive tables. Finally, the Data Generator service can submit the BigBench V2 queries to the Spark service running on top of Hadoop YARN, where the actual data processing happens, assisted by the HDFS Persistent Storage and Hive Metastore services. The arrows indicate the type of data operations (read, write or both) performed between the services. Typically, each service runs on at least two containers for fault-tolerance and high-availability purposes, except the Hive Metastore and the Data Generator service (Benchmark Driver), for which a single Docker container is sufficient.

Fig. 5.1.: Implementation of the BigBench V2 Batch Architecture in ABench
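To illustrate the Batch mode path, the following sketch shows how a benchmark driver could submit a query to a Spark service backed by the Hive Metastore; a SparkSession with Hive support is assumed, and the table and column names are illustrative rather than the actual BigBench V2 schema.

import org.apache.spark.sql.SparkSession

// Hive support lets Spark resolve table schemas from the Hive Metastore service.
val spark = SparkSession.builder()
  .appName("abench-batch-driver")
  .enableHiveSupport()
  .getOrCreate()

// Example batch query over a table registered in the Hive Metastore;
// the data itself resides in the HDFS Persistent Storage service.
val topItems = spark.sql(
  """SELECT wl_item_id, COUNT(*) AS views
    |FROM web_logs
    |GROUP BY wl_item_id
    |ORDER BY views DESC
    |LIMIT 10""".stripMargin)

topItems.show()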

Figure 5.2 depicts the Streaming mode implementation of BigBench V2 presented in Section 4.3. The right part of the figure illustrates the Streaming components in an abstract manner (Streaming Generator, Intermediate Fast-access Layer, Stream processing and Persistent Storage Layer), whereas the left part depicts the technologies in which the components are implemented. In contrast to the Batch mode, the Streaming mode has an extra Intermediate Fast-access Layer component, which realizes the data streaming between the data source (Streaming Generator) and the Stream processing component. In the ABench architecture, the Intermediate Fast-access Layer is implemented with Kafka [Kaf19], which serves as a data proxy between the Streaming Generator and the Stream processing component implemented as a Spark service. The Streaming Generator service generates the data streams and writes them to the Kafka service. From there, the data chunks are read and processed by the Spark service (using the Spark Structured Streaming [Str17c] library) running on top of Hadoop YARN. The Hive Metastore is used again for keeping and retrieving the table schema. The arrows indicate the type of data operations (read, write or both) performed between the services.
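The Kafka-to-Spark path of the Streaming mode could be wired as in the following sketch, which requires the spark-sql-kafka connector package; the broker address, topic name and JSON schema are placeholders and not part of the current implementation.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{from_json, col}
import org.apache.spark.sql.types.{StructType, StringType, TimestampType}

val spark = SparkSession.builder().appName("abench-streaming-driver").getOrCreate()

// Schema of the JSON messages produced by the Streaming Generator (placeholder).
val clickSchema = new StructType()
  .add("wl_item_id", StringType)
  .add("wl_webpage_name", StringType)
  .add("wl_timestamp", TimestampType)

// Read the stream from the Kafka service acting as the Intermediate Fast-access Layer.
val kafkaStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka:9092")
  .option("subscribe", "web_logs")
  .load()

// Kafka delivers raw bytes; decode the value and parse the JSON payload.
val clicks = kafkaStream
  .selectExpr("CAST(value AS STRING) AS json")
  .select(from_json(col("json"), clickSchema).as("click"))
  .select("click.*")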

The initial ABench implementation is realized in Minikube [Kub19b], which offers a single-node Kubernetes cluster inside a virtual machine (VM). The advantage of this approach is that you can run the VM in your local development environment

and experiment with the different services without any external limitations. In the first implementation phase, all technology services necessary for both benchmark modes are installed and tested: the Spark, HDFS Persistent Storage, Hive Metastore and Kafka services. In the second phase, the focus is on executing a complete run of the benchmark on top of these services.

Fig. 5.2.: Implementation of the BigBench V2 Streaming Architecture in ABench

5.1.2 ABench: Enriching the Machine Learning Workloads

This work is a continuation of the Machine Learning Use Case proposed in the ABench vision paper (Section 4.5.4). While BigBench V2 (Section 4.1) covers basic and commonly used machine learning algorithms, there are many new emerging machine and deep learning algorithms and libraries that are in daily use by both practitioners and researchers.

A recent paper by IBM [Sin16] recommends the addition of recommendation algorithms, which are a fundamental part of modern business practices, especially in the field of online retail [Sar+00]. Motivated by this, the master's thesis of Matthias Polag [Pol18] investigates and explores multiple new machine learning (ML) algorithms and libraries.

Table 5.1 summarizes the algorithm names and types implemented in the different workloads. The algorithms can be partitioned into three categories: Clustering, Classification and Pattern detection. The workloads (queries) Q1, Q26, Q28, Q29 and Q30 are the ones existing in BigBench V2, but were implemented with different algorithms; this is the focus of the first part of the investigation process, whereas in the second part completely new workloads are implemented. The remaining workloads M1, M2 and M3 implement new popular workloads such as frequent pattern mining, sentiment analysis and recommendations.

Tab. 5.1.: Overview of the explored Workloads, Workload Type and Algorithm Name

Workload | Type | Algorithm
Q1 | Pattern detection | Query
Q26 | Clustering | K-means
Q26 | Clustering | Gaussian Mixture Model (GMM)
Q28 | Classification | Naive Bayes

Q28 | Classification | Logistic Regression
Q28 | Classification | Support Vector Machine (SVM)
Q29 | Pattern detection | reducer.py
Q30 | Pattern detection | reducer.py
M1 | Pattern detection | Eclat
M1 | Pattern detection | Frequent pattern growth (FP-growth)
M2 | Topic modeling | Latent Dirichlet Allocation (LDA)
M3 | Classification | Decision Tree
M3 | Classification | Multi Layer Perceptron (MLP)
M3 | Classification | Support Vector Machine (SVM)
M3 | Classification | Naive Bayes
M3 | Classification | Logistic Regression

Table 5.2 shows the different algorithms and the libraries in which they are available. Clearly, MLlib [MLl19] and System ML [ML19] offer support for all or most of the algorithms, which makes them a more favourable choice.

Tab. 5.2.: Overview of available Algorithms and Libraries

Name | Mahout | MLlib | System ML | Scikit Learn
K-means | Yes | Yes | Yes | Yes
Gaussian Mixture Model (GMM) | - | Yes | - | -
Naive Bayes | Yes | Yes | Yes | -
Logistic Regression | - | Yes | Yes | -
Support Vector Machine (SVM) | - | Yes | Yes | -
Decision Tree | - | Yes | Yes | -
Multilayer Perceptron (MLP) | - | Yes | - | Yes
Frequent pattern growth (FP-growth) | - | Yes | - | Yes
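To give an impression of how one of the new workloads could be expressed with MLlib, the sketch below runs FP-growth (as in workload M1) over baskets of items; the DataFrame transactions with an items array column is a hypothetical input, not the actual BigBench V2 data.

import org.apache.spark.ml.fpm.FPGrowth

// Mine frequent item sets and association rules from per-session item baskets.
val fpGrowth = new FPGrowth()
  .setItemsCol("items")      // array<string> column holding the items of one basket
  .setMinSupport(0.01)
  .setMinConfidence(0.5)

val model = fpGrowth.fit(transactions)

model.freqItemsets.show()        // frequent item sets with their counts
model.associationRules.show()    // rules derived from the frequent item sets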

The experimental results are still under evaluation and will be presented in a research report in the near future.

5.1.3 DataBench Project

DataBench (www.databench.eu, Project ID: 780966) is a three-year EU H2020 project (started in January 2018) that investigates existing Big Data benchmarking tools and projects, identifies the main gaps and provides a robust set of metrics to compare technical results coming from those tools.

At the heart of DataBench is the goal of designing a benchmarking process that helps European organizations developing Big Data Technologies to reach for excellence and constantly improve their performance, by measuring their technology development activity against parameters of high business relevance. To achieve this goal, the project is pursuing the following objectives:

1. Providing the Big Data Technology stakeholder communities with a comprehensive framework to integrate business and technical benchmarking approaches for Big Data Technologies.

2. Performing economic and market analysis to assess the "European economic significance" of benchmarking tools and performance parameters.

3. Evaluating the business impacts of Big Data Technology benchmarks of performance parameters of industrial significance.

4. Developing a tool for applying methodologies to determine optimal Big Data Technology benchmarking approaches.

5. Evaluating the DataBench Framework and Toolbox in representative industries, data experimentation/integration initiatives (ICT-14) and Large-Scale Pilots (ICT-15).

6. Liaising closely with the BDVA and the ICT-14 and ICT-15 projects to build consensus and to reach out to key industrial communities, ensuring that benchmarking responds to real needs and problems.

7. Bringing together research, academia and industry, establishing the Big Data Benchmarking Community.

Since the start of the project (January 2018), two related deliverables (D3.1 and D1.1) and two research papers have been completed. Currently, another two deliverables are under development; their main contributions are taken from the Big Data Benchmarks Classification included in Appendix A. The completed deliverables and papers are listed below and can be downloaded from the DataBench project website (www.databench.eu):

• Deliverable 1.1 - Industry Requirements with benchmark metrics and KPIs (2018) - Barbara Pernici, Chiara Francalanci, Angela Geronazzo, Lucia Polidori, Gabriella


Cattaneo, Helena Schwenk, Marko Grobelnik, Tomás Pariente, Iván Martínez, Todor Ivanov, Arne Berre

• Deliverable 3.1 - DataBench Architecture (2018) - Tomás Pariente, Iván Martínez, Ricardo Ruiz, Todor Ivanov, Arne Berre, Chiara Francalanci

• (position paper) DataBench: Evidence Based Big Data Benchmarking to Improve Business Performance, Todor Ivanov, Roberto V. Zicari, Tomás Pariente Lobo, Nuria de Lama Sanchez, Arne Berre, Volker Hoffmann, Richard Stevens, Gabriella Cattaneo, Helena Schwenk, Cristopher Ostberg-Hansen, Cristina Pepato, Barbara Pernici, Chiara Francalanci, Angela Geronazzo, Lucia Polidori, Paolo Giacomazzi, Marko Grobelnik, James Hodson

• (research paper) Relating Big Data Business and Technical Performance Indica- tors, Barbara Pernici, Chiara Francalanci, Angela Geronazzo, Lucia Polidori, Leonardo Riva, Stefano Ray, Arne Berre, Todor Ivanov

5.2 Conclusions

The goal of this thesis is to help practitioners (e.g. software developers, system architects, performance engineers, etc.) to choose the most effective Big Data platform that does the best job for the chosen class of Big Data applications. The main thesis contributions cover various relevant aspects, from understanding the current challenges in Big Data platforms, to choosing the most appropriate benchmark to stress test the selected Big Data technologies, to tuning the relevant platform components to process and store data more efficiently. To tackle the complexity and challenges in setting up and configuring a Big Data platform, this dissertation investigates and evaluates multiple technology components by applying standardized and novel Big Data benchmarks. In short, the main research contributions are:

1. Definition of the new concept of heterogeneity for Big Data Architectures (Chapter 2);

2. Investigation of the performance of Big Data systems (e.g. Hadoop) in virtual- ized environments (Section 3.1);

3. Investigation of the performance of NoSQL databases versus Hadoop distribu- tions (Section 3.2);

4. Execution and evaluation of the TPCx-HS benchmark (Section 3.3);

5. Evaluation and comparison of Hive and Spark SQL engines using benchmark queries (Section 3.4);

6. Evaluation of the impact of compression techniques on SQL-on-Hadoop engine performance (Section 3.5);

7. Extensions of the standardized Big Data benchmark BigBench (TPCx-BB) (Section 4.1 and 4.3);

8. Definition of a new benchmark, called ABench (Big Data Architecture Stack Benchmark), that takes into account the heterogeneity of Big Data architectures (Section 4.5).

The thesis is an attempt to re-define system benchmarking taking into account the new requirements posed by the Big Data applications. With the explosion of Artificial Intelligence (AI) and new hardware computing power, this is a first step towards a more holistic approach to benchmarking.


Bibliography

[14a] Apache Hadoop NextGen MapReduce. 2014. URL: http://hadoop.apache.org/docs/current2/hadoop-yarn/hadoop-yarn-site/YARN.html (cit. on p. 83).

[14b] Cloudera CDH Datasheet. 2014. URL: http://www.cloudera.com/content/cloudera/en/resources/library/datasheet/cdh-datasheet.html (cit. on p. 83).

[14c] Cloudera Manager Datasheet. 2014. URL: http://www.cloudera.com/content/cloudera/en/resources/library/datasheet/cloudera-manager-4-datasheet.html (cit. on p. 83).

[14d] Configuration Parameters: What can you just ignore? 2014. URL: http://blog.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/ (cit. on p. 84).

[14e] VMware vSphere Big Data Extensions. 2014. URL: https://www.vmware.com/solutions/big-data.html (cit. on pp. 28, 31, 45, 46).

[14f] VMware vSphere Platform. 2014. URL: http://www.vmware.com/products/vsphere/ (cit. on pp. 44, 45).

[15a] Parse-big-bench Utility. 2015. URL: https://github.com/BigData-Lab-Frankfurt/Big-Bench-Setup (cit. on p. 107).

[15b] Sandy Ryza - How-to: Tune Your Apache Spark Jobs (Part 2) | Cloudera Engineering Blog. 2015. URL: https://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ (cit. on p. 108).

[15c] Yi Zhou - [SPARK-5791] [Spark SQL] show poor performance when multiple table do join operation. 2015. URL: https://issues.apache.org/jira/browse/SPARK-5791 (cit. on pp. 112, 113, 116, 118).

[18a] Amazon Athena. 2018. URL: https://aws.amazon.com/athena/ (cit. on p. 122).

[18b] Amazon Elastic MapReduce (EMR). 2018. URL: https://aws.amazon.com/emr/ (cit. on p. 42).

[18c] Apache Arrow. 2018. URL: https://arrow.apache.org/ (cit. on p. 129).

[18d] Apache Hadoop. 2018. URL: http://hadoop.apache.org/ (cit. on pp. 15, 42, 44, 82, 131).

[18e] Apache Hadoop DFSIO benchmark. 2018. URL: http://svn.apache.org/repos/asf/hadoop/common/tags/release-0.13.0/src/test/org/apache/hadoop/fs/TestDFSIO.java (cit. on p. 54).

[18f] Apache HAWQ. 2018. URL: http://hawq.incubator.apache.org/ (cit. on p. 37).

[18g] Apache Mahout. 2018. URL: http://mahout.apache.org/ (cit. on pp. 44, 83, 131).

[18h] Apache OpenNLP. 2018. URL: https://opennlp.apache.org/ (cit. on pp. 114, 131).

[18i] Apache ORC. 2018. URL: https://orc.apache.org/ (cit. on pp. viii, 6, 8, 123, 125).

[18j] Apache Parquet. 2018. URL: http://parquet.apache.org/ (cit. on pp. viii, 6, 8, 120, 123, 126).

[18k] Apache Phoenix. 2018. URL: http://phoenix.apache.org/ (cit. on p. 122).

[18l] Apache Thrift. 2018. URL: https://thrift.apache.org (cit. on p. 126).

[18m] Dremel made simple with Parquet. 2018. URL: https://blog.twitter.com/ 2013/dremel-made-simple-with-parquet (cit. on p. 120).

[18n] Frankfurt Big Data Lab: Big-Bench-Setup. 2018. URL: https://github.com/BigData-Lab-Frankfurt/Big-Bench-Setup (cit. on pp. 105, 107, 108, 130).

[18o] Frankfurt Big Data Lab: Detailed-Results. 2018. URL: https://github.com/BigData-Lab-Frankfurt/ColumnarFileFormatsEvaluation (cit. on pp. 133, 148).

[18p] Hadoop Virtualization. 2018. URL: https://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/vmware-hadoop-virtualization.pdf (cit. on p. 45).

[18q] Hive configuration properties. 2018. URL: https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties (cit. on p. 133).

[18r] Hue. 2018. URL: http://gethue.com/ (cit. on p. 158).

[18s] IBM Big SQL. 2018. URL: www.ibm.com/analytics/us/en/technology/big-sql/ (cit. on p. 122).

[18t] iPerf - The TCP, UDP and SCTP network bandwidth measurement tool. 2018. URL: https://iperf.fr (cit. on p. 85).

[18u] Measuring Network Performance: Test Network Throughput, Delay-Latency, Jitter, Transfer Speeds, Packet loss and Reliability. Packet Generation Using Iperf / Jperf. 2018. URL: http://www.firewall.cx/networking-topics/general-networking/970-network-performance-testing.html (cit. on p. 85).

[18v] OpenStack Sahara. 2018. URL: https://wiki.openstack.org/wiki/Sahara (cit. on pp. 31, 43).

[18w] OpenStack: Open Source Cloud Computing Software. 2018. URL: http://www.openstack.org (cit. on p. 43).

[18x] J. Poelman. IBM InfoSphere BigInsights Best Practices: Validating performance of a new cluster. 2018. URL: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W265aa64a4f21_43ee_b236_c42a1c875961/page/Validating%20performance%20of%20a%20new%20cluster (cit. on p. 85).

[18y] Project Serengeti. 2018. URL: https://github.com/vmware-serengeti (cit. on pp. 28, 31, 43, 45).

[18z] Protocol Buffers. 2018. URL: https://developers.google.com/protocolbuffers/ (cit. on p. 125).

[18aa] SerDe - Apache Hive. 2018. URL: https://cwiki.apache.org/confluence/display/Hive/SerDe (cit. on p. 133).

[18ab] Snappy. 2018. URL: https://google.github.io/snappy/ (cit. on p. 124).

[18ac] Spark SQL and DataFrames - Spark 2.3.0 Documentation. 2018. URL: http://spark.apache.org/sql/ (cit. on p. 8).

[18ad] Spark-Internals. 2018. URL: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-DAGScheduler-Stage.html (cit. on p. 148).

[18ae] SparkSQL configuration. 2018. URL: https://spark.apache.org/docs/latest/configuration.html#spark-sql (cit. on p. 133).

[18af] TPC-DS. 2018. URL: www.tpc.org/tpcds/ (cit. on pp. 3, 127, 129, 130, 164, 284).

[18ag] TPC-Energy. 2018. URL: http://www.tpc.org/tpc_energy/default.asp (cit. on p. 89).

[18ah] TPC-H. 2018. URL: www.tpc.org/tpch/ (cit. on pp. 3, 127, 129, 283, 284).

[18ai] Vertica. 2018. URL: https://www.vertica.com/ (cit. on p. 122).

[18aj] Virtualizing Hadoop on VMware vSphere. 2018. URL: https://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/products/vsphere/vmware-hadoop-deployment-guide.pdf (cit. on pp. 28, 45).

[18ak] zlib Home site. 2018. URL: https://zlib.net/ (cit. on pp. 124, 125).

[215] Oryx 2. 2015. URL: https://github.com/OryxProject/oryx (cit. on p. 34).

[Aba+14] Daniel J. Abadi, Rakesh Agrawal, Anastasia Ailamaki, et al. „The Beckman Report on Database Research“. In: SIGMOD Record 43.3 (2014), pp. 61–70 (cit. on p. 65).

[Aba+15] Daniel Abadi, Shivnath Babu, Fatma Özcan, and Ippokratis Pandis. „SQL-on-hadoop Systems: Tutorial“. In: Proc. VLDB Endow. 8.12 (Aug. 2015), pp. 2050–2051 (cit. on p. 127).

[Abo+09] Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel J. Abadi, Alexander Rasin, and Avi Silberschatz. „HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads“. In: PVLDB 2.1 (2009), pp. 922–933 (cit. on p. 25).

[Ado+16] Robert Adolf, Saketh Rama, Brandon Reagen, Gu-Yeon Wei, and David M. Brooks. „Fathom: reference workloads for modern deep learning methods“. In: 2016 IEEE International Symposium on Workload Characterization, IISWC 2016, Providence, RI, USA, September 25-27, 2016. 2016, pp. 148–157 (cit. on p. 294).

[Aga+12] Sameer Agarwal, Aurojit Panda, Barzan Mozafari, et al. „Blink and It’s Done: Interactive Queries on Very Large Data“. In: PVLDB 5.12 (2012), pp. 1902–1905 (cit. on p. 37).

[Aga+13] Sameer Agarwal, Barzan Mozafari, Aurojit Panda, et al. „BlinkDB: queries with bounded errors and bounded response times on very large data“. In: Eighth Eurosys Conference 2013, EuroSys ’13, Prague, Czech Republic, April 14-17, 2013. 2013, pp. 29–42 (cit. on p. 37).

[Agr+15] Dakshi Agrawal, Ali Raza Butt, Kshitij Doshi, et al. „SparkBench - A Spark Performance Testing Suite“. In: Performance Evaluation and Benchmarking: Traditional to Big Data to Internet of Things - 7th TPC Technology Conference, TPCTC 2015, Kohala Coast, HI, USA, August 31 - September 4, 2015. Revised Selected Papers. 2015, pp. 26–44 (cit. on pp. 165, 272).

[Ahm+12] Faraz Ahmad, Seyong Lee, Mithuna Thottethodi, and TN Vijaykumar. „Puma: Purdue mapreduce benchmarks suite“. In: (2012) (cit. on p. 277).

[Aki+13] Tyler Akidau, Alex Balikov, Kaya Bekiroglu, et al. „MillWheel: Fault-Tolerant Stream Processing at Internet Scale“. In: PVLDB 6 (2013) (cit. on p. 192).

[Aki+15] Tyler Akidau, Robert Bradshaw, Craig Chambers, et al. „The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing“. In: PVLDB 8.12 (2015) (cit. on p. 204).

[Ale+14] Alexander Alexandrov, Rico Bergmann, Stephan Ewen, et al. „The Stratosphere platform for big data analytics“. In: VLDB J. 23.6 (2014), pp. 939–964 (cit. on p. 38).

[Als+14] Sattam Alsubaiee, Yasser Altowim, Hotham Altwaijry, et al. „AsterixDB: A Scalable, Open Source BDMS“. In: PVLDB 7.14 (2014), pp. 1905–1916 (cit. on p. 122).

[Alu+14] Güneş Aluç, Olaf Hartig, M. Tamer Özsu, and Khuzaima Daudjee. „Diversified stress testing of RDF data management systems“. In: International Semantic Web Conference. Springer. 2014, pp. 197–212 (cit. on p. 298).

[AMP13] AMP Lab. AMP Lab Big Data Benchmark. 2013. URL: https://amplab.cs.berkeley.edu/benchmark/ (cit. on p. 278).

[Ana15] AWS - Big Data Analytics. 2015. URL: http://aws.amazon.com/big-data/ (cit. on p. 20).

[And11] Andrew Pavlo. Benchmark. 2011. URL: http://database.cs.brown.edu/projects/mapreduce-vs-dbms/ (cit. on pp. 3, 282).

[AP95] Bjørn Andersen and P-G Pettersen. Benchmarking handbook. Chapman & Hall, 1995 (cit. on pp. 2, 217).

[Apa09] Apache Software Foundation. Grep. 2009. URL: http://wiki.apache.org/hadoop/Grep (cit. on p. 271).

[Apa10] Apache Software Foundation. DataGeneratorHadoop. 2010. URL: http://wiki.apache.org/pig/DataGeneratorHadoop (cit. on p. 277).

[Apa12] Apache Software Foundation. Running TPC-H Benchmark on Pig. 2012. URL: https://issues.apache.org/jira/browse/PIG-2397 (cit. on p. 284).

[Apa13a] Apache Software Foundation. GridMix. 2013. URL: https://hadoop.apache.org/docs/stable1/gridmix.html (cit. on p. 273).

[Apa13b] Apache Software Foundation. Hive performance benchmarks. 2013. URL: https://issues.apache.org/jira/browse/HIVE-396 (cit. on p. 284).

[Apa13c] Apache Software Foundation. PigMix. 2013. URL: https://cwiki.apache.org/confluence/display/PIG/PigMix (cit. on pp. 166, 277).

[Apa15a] Apache Hadoop. TPC Express Benchmark HS - Standard Specification. 2015. URL: http://hadoop.apache.org/docs/current/api/org/apache/hadoop/examples/terasort/package-summary.html (cit. on pp. 86, 274).

[Apa15b] Apache Software Foundation. Package hadoop.examples.pi. 2015. URL: http://hadoop.apache.org/docs/r0.23.11/api/org/apache/hadoop/examples/pi/package-summary.html (cit. on p. 271).

[Apa15c] Apache Software Foundation. TPC-H and TPC-DS for Hive. 2015. URL: https://github.com/hortonworks/hive-testbench/tree/hive14 (cit. on pp. 284, 285).

[Ara+04] Arvind Arasu, Mitch Cherniack, Eduardo F. Galvez, et al. „Linear Road: A Stream Data Management Benchmark“. In: (e)Proceedings of the Thirtieth International Conference on Very Large Data Bases, Toronto, Canada, August 31 - September 3 2004. 2004, pp. 480–491 (cit. on pp. 194, 285).

[Arm+10] Michael Armbrust, Armando Fox, Rean Griffith, et al. „A view of cloud computing“. In: Commun. ACM 53.4 (2010), pp. 50–58 (cit. on p. 15).

[Arm+13] Timothy G. Armstrong, Vamsi Ponnekanti, Dhruba Borthakur, and Mark Callaghan. „LinkBench: A Database Benchmark Based on The Facebook Social Graph“. In: SIGMOD. 2013, pp. 1185–1196 (cit. on pp. 165, 166, 285).

[Arm+15a] Michael Armbrust, Reynold S. Xin, Cheng Lian, et al. „Spark SQL: Relational Data Processing in Spark“. In: SIGMOD. 2015, pp. 1383–1394 (cit. on pp. 105, 106, 178).

[Arm+15b] Michael Armbrust, Reynold S. Xin, Cheng Lian, et al. „Spark SQL: Relational Data Processing in Spark“. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, May 31 - June 4, 2015. 2015, pp. 1383–1394 (cit. on pp. 121–123, 205).

[Ast] Aster. URL: www.teradata.com/teradata-aster/ (cit. on p. 163).

[Awa+15] Ahsan Javed Awan, Mats Brorsson, Vladimir Vlassov, and Eduard Ayguadé. „How Data Volume Affects Spark Based Data Analytics on a Scale-up Server“. In: 6th Workshop, BPOE 2015, Kohala, HI, USA, Aug. 31 - Sept. 4, 2015. 2015 (cit. on p. 208).

[Awa+16] Ahsan Javed Awan, Vladimir Vlassov, Mats Brorsson, and Eduard Ayguadé. „Node architecture implications for in-memory data analytics on scale-in clusters“. In: the 3rd IEEE/ACM BDCAT 2016, Shanghai, China, Dec. 6-9, 2016. 2016 (cit. on p. 208).

[Bab+02] Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer Widom. „Models and Issues in Data Stream Systems“. In: the 21st ACM PODS, June 3-5, Madison, Wisconsin, USA. 2002 (cit. on p. 204).

[Bag+17] Guillaume Bagan, Angela Bonifati, Radu Ciucanu, et al. „gMark: Schema-Driven Generation of Graphs and Queries“. In: 33rd IEEE International Conference on Data Engineering, ICDE 2017, San Diego, CA, USA, April 19-22, 2017. 2017, pp. 63–64 (cit. on p. 298).

[BAM19] Tadas Baltrusaitis, Chaitanya Ahuja, and Louis-Philippe Morency. „Multimodal Machine Learning: A Survey and Taxonomy“. In: IEEE Trans. Pattern Anal. Mach. Intell. 41.2 (2019), pp. 423–443 (cit. on pp. v, 1).

[Bar+13a] Chaitan Baru, Milind Bhandarkar, Raghunath Nambiar, Meikel Poess, and Tilmann Rabl. „Setting the Direction for Big Data Benchmark Standards“. In: Selected Topics in Performance Evaluation and Benchmarking. Ed. by Raghunath Nambiar and Meikel Poess. Vol. 7755. Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2013, pp. 197–208 (cit. on pp. 19, 104).

[Bar+13b] Chaitanya Baru, Milind Bhandarkar, Raghunath Nambiar, Meikel Poess, and Tilmann Rabl. „Benchmarking big data systems and the bigdata top100 list“. In: Big Data 1.1 (2013), pp. 60–64 (cit. on p. 19).

[Bar+14] Chaitanya K. Baru, Milind A. Bhandarkar, Carlo Curino, et al. „Discussion of BigBench: A Proposed Industry Standard Performance Benchmark for Big Data“. In: Performance Characterization and Benchmarking. Traditional to Big Data - 6th TPC Technology Conference, TPCTC 2014, Hangzhou, China, September 1-5, 2014. Revised Selected Papers. 2014, pp. 44–63 (cit. on pp. 19, 104, 105, 130, 181, 278).

[Bar+18] Giuseppe Baruffa, Mauro Femminella, Matteo Pergolesi, and Gianluca Reali. „A Big Data architecture for spectrum monitoring in cognitive radio applications“. In: Annals of Telecommunications 73.7-8 (2018), pp. 451–461 (cit. on p. 158).

[Bed17] Patrick Bedué. „Implementing real-time stream processing in the BigBench Benchmark“. Master Thesis. Goethe University Frankfurt, 2017 (cit. on p. 192).

[Bee15] Max-Georg Beer. „Evaluation of BigBench on Apache Spark Compared to MapReduce“. Master Thesis. Goethe University Frankfurt, 2015 (cit. on p. 104).

[Bha16] Milind Bhandarkar. „AdBench: A Complete Benchmark for Modern Data Pipelines“. In: Performance Evaluation and Benchmarking. Traditional - Big Data - Internet of Things - 8th TPC Technology Conference, TPCTC 2016, New Delhi, India, September 5-9, 2016, Revised Selected Papers. 2016, pp. 107–120 (cit. on p. 278).

[Bi12] N. Marz. Runaway complexity in Big Data... and a plan to stop it. 2012. URL: http://www.slideshare.net/nathanmarz/runaway-complexity-in-big-data-and-a-plan-to-stop-it (cit. on p. 24).

[Bia+17] Haoqiong Bian, Ying Yan, Wenbo Tao, et al. „Wide Table Layout Optimization based on Column Ordering and Duplication“. In: Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14-19, 2017. 2017, pp. 299–314 (cit. on pp. 128, 131).

[Big13] BigFrame Team. BigFrame. 2013. URL: https://github.com/bigframeteam/BigFrame/wiki (cit. on pp. 165, 279).

[Big15a] Intel Big Data Cloud: Converging Technologies. 2015. URL: http://www.intel.com/content/www/us/en/big-data/big-data-cloud-technologies-brief.html (cit. on p. 19).

[Big15b] Infochimps Big Data Technology Suite of Cloud Services. 2015. URL: http://www.infochimps.com/infochimps-cloud/overview/ (cit. on p. 21).

[Bis+17] Bernd Bischl, Giuseppe Casalicchio, Matthias Feurer, et al. „OpenML Benchmarking Suites and the OpenML100“. In: CoRR abs/1708.03731 (2017). arXiv: 1708.03731 (cit. on p. 292).

[Blu16] Ryan Blue. Parquet performance tuning: The missing guide. 2016. URL: https://conferences.oreilly.com/strata/strata-ny-2016/public/schedule/detail/52110 (cit. on p. 133).

[Bon+15] Andrew Bond, Douglas Johnson, Greg Kopczynski, and H. Reza Taheri. „Profiling the Performance of Virtualized Databases with the TPCx-V Benchmark“. In: Performance Evaluation and Benchmarking: Traditional to Big Data to Internet of Things - 7th TPC Technology Conference, TPCTC 2015, Kohala Coast, HI, USA, August 31 - September 4, 2015. Revised Selected Papers. 2015, pp. 156–172 (cit. on p. 285).

[Bor+08] Dhruba Borthakur et al. „HDFS architecture guide“. In: Hadoop Apache Project 53 (2008), pp. 1–13 (cit. on p. 32).

[Bor07] Dhruba Borthakur. „The hadoop distributed file system: Architecture and design“. In: Hadoop Project Website 11.2007 (2007), p. 21 (cit. on pp. 44, 82).

[Bou+08] Jerome Boulon, Andy Konwinski, Runping Qi, et al. „Chukwa, a large-scale monitoring system“. In: Proceedings of CCA. Vol. 8. 2008, pp. 1–5 (cit. on p. 33).

[Bou+18] Jalil Boukhobza, Stéphane Rubini, Renhai Chen, and Zili Shao. „Emerging NVM: A Survey on Architectural Integration and Research Challenges“. In: ACM Trans. Design Autom. Electr. Syst. 23.2 (2018), 14:1–14:32 (cit. on pp. v, 1).

[Bra+15] Lucas Braun, Thomas Etter, Georgios Gasparis, et al. „Analytics in Motion: High Performance Event-Processing AND Real-Time Analytics in the Same Database“. In: Proceedings of the SIGMOD 2015, Melbourne, Victoria, Australia, May 31 - June 4, 2015. 2015, pp. 251–264 (cit. on p. 194).

[Bra17] Lucas Victor Braun-Löhrer. „Confidentiality and Performance for Cloud Databases“. PhD thesis. ETH Zurich, 2017 (cit. on p. 289).

[Bra18] Boudewijn Braams. Predicate Pushdown in Parquet and Databricks Spark (Master Thesis). 2018 (cit. on p. 124).

[Bro13] Stephen Brobst. „The Importance of Late Binding for Big Data Analytics“. In: Extremely Large Databases Conference. California, USA, 2013 (cit. on pp. 165, 181).

[BSC14] BSC. ALOJA home page: http://aloja.bsc.es/. 2014 (cit. on p. 299).

[Bu+10] Yingyi Bu, Bill Howe, Magdalena Balazinska, and Michael D. Ernst. „HaLoop: Efficient Iterative Data Processing on Large Clusters“. In: PVLDB 3.1 (2010), pp. 285–296 (cit. on pp. 25, 38).

[Bu+12] Yingyi Bu, Bill Howe, Magdalena Balazinska, and Michael D. Ernst. „The HaLoop approach to large-scale iterative data analysis“. In: VLDB J. 21.2 (2012), pp. 169–190 (cit. on p. 25).

[Bue13] Jeff Buell. „Virtualized Hadoop Performance with VMware vSphere 5.1“. In: VMware, Inc (2013) (cit. on pp. 42, 47, 52, 64).

[CAK12] Yanpei Chen, Sara Alspaugh, and Randy H. Katz. „Interactive Analytical Processing in Big Data Systems: A Cross-Industry Study of MapReduce Workloads“. In: PVLDB 5.12 (2012), pp. 1802–1813 (cit. on p. 277).

[Cal18] Calcite. 2018. URL: calcite.apache.org/ (cit. on pp. 192, 205).

[Cam95] R.C. Camp. Business Process Benchmarking: Finding and Implementing Best Practices. The ASQC Total Quality Management. ASQC Quality Press, 1995 (cit. on p. 2).

[Car+15] Paris Carbone, Asterios Katsifodimos, Stephan Ewen, et al. „Apache Flink™: Stream and Batch Processing in a Single Engine“. In: IEEE Data Eng. Bull. 38.4 (2015), pp. 28–38 (cit. on p. 192).

[Car12] Michael J. Carey. „BDMS Performance Evaluation: Practices, Pitfalls, and Possibilities“. In: Selected Topics in Performance Evaluation and Benchmarking - 4th TPC Technology Conference, TPCTC 2012, Istanbul, Turkey, August 27, 2012, Revised Selected Papers. 2012, pp. 108–123 (cit. on p. 104).

[Cas19] Apache Cassandra. 2019. URL: http://cassandra.apache.org/ (cit. on p. 8).

[Cat10] Rick Cattell. „Scalable SQL and NoSQL data stores“. In: SIGMOD Record 39.4 (2010), pp. 12–27 (cit. on pp. 18, 24, 44, 65, 82).

[CC12] C. Fan. CRAP and CRUD: From Database to Datacloud. 2012. URL: https://blog.dellemc.com/en-us/crap-and-crud-from-database-to-datacloud/ (cit. on p. 24).

[CDH15] Cloudera Hadoop Distribution (CDH). 2015. URL: http://www.cloudera.com (cit. on pp. 67, 69, 83).

[Cha+08] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, et al. „Bigtable: A Distributed Storage System for Structured Data“. In: ACM Trans. Comput. Syst. 26.2 (2008), 4:1–4:26 (cit. on p. 32).

[Cha+14] Lei Chang, Zhanwei Wang, Tao Ma, et al. „HAWQ: a massively parallel processing SQL engine in hadoop“. In: International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, June 22-27, 2014. 2014, pp. 1223–1234 (cit. on p. 122).

[Che+11] Yanpei Chen, Archana Ganapathi, Rean Griffith, and Randy H. Katz. „The Case for Evaluating MapReduce Performance Using Workload Suites“. In: MASCOTS 2011, 19th Annual IEEE/ACM International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, Singapore, 25-27 July, 2011. 2011, pp. 390–399 (cit. on p. 277).

[Che+12] Yanpei Chen et al. „We don’t know enough to make a big data benchmark suite-an academia-industry view“. In: Proc. of WBDB (2012) (cit. on pp. v, 1, 104).

[Che+14] Yueguo Chen, Xiongpai Qin, Haoqiong Bian, et al. „A Study of SQL-on-Hadoop Systems“. In: Big Data Benchmarks, Performance Optimization, and Emerging Hardware - 4th and 5th Workshops, BPOE 2014, Salt Lake City, USA, March 1, 2014 and Hangzhou, China, September 5, 2014, Revised Selected Papers. 2014, pp. 154–166 (cit. on pp. 121, 127, 128).

[Chi+13] Pradeep Chintagunta, Dominique M. Hanssens, John R. Hauser, et al. „Editorial - Marketing Science: A Strategic Review“. In: Marketing Science 32.1 (2013), pp. 4–7 (cit. on p. 15).

[Chi+16] Sanket Chintapalli, Derek Dagit, Bobby Evans, et al. „Benchmarking Streaming Computation Engines: Storm, Flink and Spark Streaming“. In: 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPS Workshops 2016, Chicago, IL, USA, May 23-27, 2016. 2016, pp. 1789–1792 (cit. on pp. 194, 208, 287).

[Chi18] Soumith Chintala. 2018. URL: https://github.com/soumith/convnet-benchmarks/issues/101 (cit. on p. 293).

[Cho+13a] Hyunsik Choi, Jihoon Son, Haemi Yang, et al. „Tajo: A distributed data warehouse system on large clusters“. In: 29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia, April 8-12, 2013. 2013, pp. 1320–1323 (cit. on pp. 36, 122).

[Cho+13b] Badrul Chowdhury, Tilmann Rabl, Pooya Saadatpanah, Jiang Du, and Hans-Arno Jacobsen. „A BigBench Implementation in the Hadoop Ecosystem“. In: Advancing Big Data Benchmarks - Proceedings of the 2013 Workshop Series on Big Data Benchmarking, WBDB.cn, Xi’an, China, July 16-17, 2013 and WBDB.us, San José, CA, USA, October 9-10, 2013 Revised Selected Papers. 2013, pp. 3–18 (cit. on pp. 105, 130).

[Cho+13c] Badrul Chowdhury, Tilmann Rabl, Pooya Saadatpanah, Jiang Du, and Hans-Arno Jacobsen. „A BigBench Implementation in The Hadoop Ecosystem“. In: Proceedings of the 2013 Workshop on Big Data Benchmarking. 2013, pp. 3–18 (cit. on p. 165).

[Chu+13] Byung-Gon Chun, Tyson Condie, Carlo Curino, et al. „REEF: Retainable Evaluator Execution Framework“. In: PVLDB 6.12 (2013), pp. 1370–1373 (cit. on p. 34).

[Clo] Cloudera. URL: www.cloudera.com (cit. on p. 163).

[Cod18] Structured Streaming Code. 2018. URL: https://github.com/apache/spark/tree/fa0092bddf695a757f5ddaed539e55e2dc9fccb7/sql/core/src/main/scala/org/apache/spark/sql/streaming (cit. on p. 206).

[Col+18] Cody Coleman, Daniel Kang, Deepak Narayanan, et al. „Analysis of DAWNBench, a Time-to-Accuracy Machine Learning Performance Benchmark“. In: CoRR abs/1806.01427 (2018). arXiv: 1806.01427 (cit. on p. 295).

[Coo+10] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. „Benchmarking cloud serving systems with YCSB“. In: Proceedings of the 1st ACM Symposium on Cloud Computing, SoCC 2010, Indianapolis, Indiana, USA, June 10-11, 2010. 2010, pp. 143–154 (cit. on pp. 67, 218, 273).

[Cos+16] Andrei Costea, Adrian Ionescu, Bogdan Raducanu, et al. „VectorH: Taking SQL-on-Hadoop to the Next Level“. In: Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016. 2016, pp. 1105–1117 (cit. on pp. 121, 127, 128).

[CPC15] Enrico Giacinto Caldarola, Antonio Picariello, and Daniela Castelluccia. „Modern Enterprises in the Bubble: Why Big Data Matters“. In: ACM SIGSOFT Software Engineering Notes 40.1 (2015), pp. 1–4 (cit. on p. 26).

[CQL15] Cassandra Query Language (CQL). 2015. URL: https://cassandra.apache.org/doc/old/CQL-3.0.html (cit. on p. 66).

[CRK12] Yanpei Chen, Francois Raab, and Randy Katz. „From TPC-C to Big Data Benchmarks: A Functional Workload Model“. In: WBDB. 2012 (cit. on p. 104).

[CRW12] Edward Capriolo, Jason Rutherglen, and Dean Wampler. Programming Hive - Data Warehouse and Query Language for Hadoop. O’Reilly, 2012 (cit. on p. 36).

[Dat15] FAQ: DataStax. 2015. URL: http://www.datastax.com/resources/faq#dse-18 (cit. on p. 70).

[Ded+13] Elif Dede, Bedri Sendir, Pinar Kuzlu, Jessica Hartog, and Madhusudhan Govindaraju. „An Evaluation of Cassandra for Hadoop“. In: 2013 IEEE Sixth International Conference on Cloud Computing, Santa Clara, CA, USA, June 28 - July 3, 2013. 2013, pp. 494–501 (cit. on pp. 68, 76).

[Dee18a] DeepBench. 2018. URL: https://github.com/baidu-research/DeepBench (cit. on p. 294).

[Dee18b] What is Deeplearning according to Nvidia. 2018. URL: https://developer.nvidia.com/deeplearning (cit. on p. 293).

[Del13] N. Wakou. Dell Apache Hadoop Performance Analysis. 2013. URL: http://en.community.dell.com/techcenter/%20extras/m/white_papers/20437989 (cit. on pp. 73, 77).

[DeW+13] David J. DeWitt, Alan Halverson, Rimma V. Nehme, et al. „Split query processing in polybase“. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, New York, NY, USA, June 22-27, 2013. 2013, pp. 1255–1266 (cit. on p. 25).

[DG08] Jeffrey Dean and Sanjay Ghemawat. „MapReduce: Simplified Data Processing on Large Clusters“. In: Communications of the ACM 51.1 (2008), pp. 107–113 (cit. on pp. 23, 24, 33, 44, 82).

[Die12] Francis X Diebold. „On the Origin(s) and Development of the Term ’Big Data’“. In: (2012) (cit. on p. 15).

[Din+17] Tien Tuan Anh Dinh, Ji Wang, Gang Chen, et al. „BLOCKBENCH: A Framework for Analyzing Private Blockchains“. In: Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14-19, 2017. 2017, pp. 1085–1100 (cit. on p. 298).

[Dit+10] Jens Dittrich, Jorge-Arnulfo Quiané-Ruiz, Alekh Jindal, et al. „Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing)“. In: PVLDB 3.1 (2010), pp. 518–529 (cit. on p. 25).

[Doc15] DataStax Enterprise 4.0 Documentation. 2015. URL: https://docs.datastax.com/en/landing_page/doc/landing_page/current.html (cit. on p. 70).

[Doc19] Docker. 2019. URL: https://www.docker.com/ (cit. on pp. 29, 223).

[Dri19] Apache Drill. 2019. URL: drill.apache.org (cit. on pp. 178, 192).

[DS16] Srinivas Duvvuri and Bikramaditya Singhal. Spark for Data Science. Packt Publishing Ltd, 2016 (cit. on p. 205).

[Dug+15] Jennie Duggan, Aaron Elmore, Tim Kraska, et al. „The bigdawg architecture and reference implementation“. In: New England Database Day (2015) (cit. on p. 26).

[Eic+18] Philipp Eichmann, Carsten Binnig, Tim Kraska, and Emanuel Zgraggen. „IDEBench: A Benchmark for Interactive Data Exploration“. In: CoRR abs/1804.02593 (2018). arXiv: 1804.02593 (cit. on p. 298).

[Eng+12] Cliff Engle, Antonio Lupher, Reynold Xin, et al. „Shark: fast data analysis using coarse-grained distributed memory“. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, Scottsdale, AZ, USA, May 20-24, 2012. 2012, pp. 689–692 (cit. on p. 36).

[Ent15] DataStax Enterprise. 2015. URL: http://www.datastax.com (cit. on pp. 65, 67, 68, 74).

[ETM96] Ilija Ekmecic, Igor Tartalja, and Veljko Milutinovic. „A survey of heterogeneous computing: concepts and systems“. In: Proceedings of the IEEE 84.8 (1996), pp. 1127–1144 (cit. on p. 22).

[Fer+12a] Michael Ferdman, Almutaz Adileh, Yusuf Onur Koçberber, et al. „Clearing The Clouds: A Study of Emerging Scale-out Workloads on Modern Hardware“. In: Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS. 2012, pp. 37–48 (cit. on pp. 165, 166).

[Fer+12b] Michael Ferdman, Almutaz Adileh, Yusuf Onur Koçberber, et al. „Clearing the clouds: a study of emerging scale-out workloads on modern hardware“. In: Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2012, London, UK, March 3-7, 2012. 2012, pp. 37–48 (cit. on pp. 3, 280, 281).

[Fer+13a] Jaume Ferrarons, Mulu Adhana, Carlos Colmenares, et al. „PRIMEBALL: A Parallel Processing Framework Benchmark for Big Data Applications in the Cloud“. In: TPCTC. 2013, pp. 109–124 (cit. on p. 165).

[Fer+13b] Jaume Ferrarons, Mulu Adhana, Carlos Colmenares, et al. „PRIMEBALL: A Parallel Processing Framework Benchmark for Big Data Applications in the Cloud“. In: Performance Characterization and Benchmarking - 5th TPC Technology Conference, TPCTC 2013, Trento, Italy, August 26, 2013, Revised Selected Papers. 2013, pp. 109–124 (cit. on p. 282).

[Fli17] Apache Flink. 2017. URL: https://flink.apache.org/ (cit. on p. 205).

[FMÖ14] Avrilia Floratou, Umar Farooq Minhas, and Fatma Özcan. „SQL-on-Hadoop: Full Circle Back to Shared-Nothing Database Architectures“. In: PVLDB 7.12 (2014), pp. 1295–1306 (cit. on pp. 121, 127).

[Fou15] Pivotal Cloud Foundry. 2015. URL: http://www.pivotal.io/platform-as-a-service/pivotal-cloud-foundry (cit. on p. 21).

[FPR16] Mauro Femminella, Matteo Pergolesi, and Gianluca Reali. „Performance evaluation of edge cloud computing system for big data applications“. In: Cloud Networking (Cloudnet), 2016 5th IEEE International Conference on. IEEE, 2016, pp. 170–175 (cit. on p. 158).

[Gao+18] Libo Gao, Lukasz Golab, M. Tamer Özsu, and Günes Aluç. „Stream WatDiv: A Streaming RDF Benchmark“. In: Proceedings of the International Workshop on Semantic Big Data, SBD@SIGMOD 2018, Houston, TX, USA, June 10, 2018. 2018, 3:1–3:6 (cit. on p. 298).

[Gar17] Gartner. Planning Guide for Data and Analytics. 2017. URL: www.gartner.com/doc/3471553/-planning-guide-data-analytics (cit. on p. 221).

[Gat+09] Alan F Gates, Olga Natkovich, Shubham Chopra, et al. „Building a high-level dataflow system on top of Map-Reduce: the Pig experience“. In: Proceedings of the VLDB Endowment 2.2 (2009), pp. 1414–1425 (cit. on pp. 39, 44, 83).

[Geo11] Lars George. HBase - The Definitive Guide: Random Access to Your Planet-Size Data. O’Reilly, 2011 (cit. on pp. 32, 44, 83).

[Gha+13a] Ahmad Ghazal, Tilmann Rabl, Minqing Hu, et al. „BigBench: Towards an Industry Standard Benchmark for Big Data Analytics“. In: SIGMOD. 2013 (cit. on pp. 165, 166, 172, 193, 194, 199, 218, 221).

[Gha+13b] Ahmad Ghazal, Tilmann Rabl, Minqing Hu, et al. „BigBench: towards an industry standard benchmark for big data analytics“. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, New York, NY, USA, June 22-27, 2013. 2013, pp. 1197–1208 (cit. on pp. 104, 121, 130, 164, 181, 209, 268, 278).

[Gha+17a] Ahmad Ghazal, Todor Ivanov, Pekka Kostamaa, et al. „BigBench V2: The New and Improved BigBench“. In: 33rd IEEE International Conference on Data Engineering, ICDE 2017, San Diego, CA, USA, April 19-22, 2017. 2017, pp. 1225–1236 (cit. on pp. 163, 181, 193, 197, 205, 208–210).

[Gha+17b] Ahmad Ghazal, Todor Ivanov, Pekka Kostamaa, et al. „BigBench V2: The New and Improved BigBench“. In: 33rd IEEE ICDE 2017, San Diego, CA, USA, April 19-22, 2017. 2017 (cit. on pp. 194, 218, 221).

[GÖ03] Lukasz Golab and M. Tamer Özsu. „Issues in data stream management“. In: SIGMOD Record 32.2 (2003) (cit. on p. 204).

[GoG15] GoGrid. GoGrid Solutions. 2015. URL: http://www.gogrid.com/solutions (cit. on p. 21).

[Gon+14] Joseph E. Gonzalez, Reynold S. Xin, Ankur Dave, et al. „GraphX: Graph Processing in a Distributed Dataflow Framework“. In: 11th USENIX Symposium on Operating Systems Design and Implementation, OSDI ’14, Broomfield, CO, USA, October 6-8, 2014. 2014, pp. 599–613 (cit. on p. 34).

[Goo15] Google. Google Cloud Platform. 2015. URL: https://cloud.google.com/ (cit. on p. 20).

[Gra92] Jim Gray. Benchmark Handbook: For Database and Transaction Processing Systems. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1992 (cit. on pp. v, 1–3, 268, 270, 290).

[Gre+09] Albert G. Greenberg, James R. Hamilton, David A. Maltz, and Parveen Patel. „The cost of a cloud: research problems in data center networks“. In: Computer Communication Review 39.1 (2009), pp. 68–73 (cit. on p. 18).

[Gre18] Matt Turck. Great Power Great Responsibility: The 2018 Big Data AI Landscape. 2018. URL: http://mattturck.com/bigdata2018/ (cit. on pp. v, 1).

[Gru+10] Martin Grund, Jens Krüger, Hasso Plattner, et al. „HYRISE - A Main Memory Hybrid Storage Engine“. In: PVLDB 4.2 (2010), pp. 105–116 (cit. on p. 18).

[Gui18] Structured Streaming Programming Guide. 2018. URL: spark.apache.org/docs/latest/structured-streaming-programming-guide.html (cit. on pp. 205, 206).

[H2O15] H2O. 2015. URL: https://github.com/h2oai/h2o (cit. on p. 35).

[Had13] T. Baer. Hadoop as your other data warehouse. 2013. URL: http://www.onstrategies.com/blog/2013/05/05/hadoop-as-your-other-data-warehouse/ (cit. on p. 35).

[Had15] Cloud Services - HDInsight (Hadoop). 2015. URL: http://azure.microsoft.com/en-us/services/hdinsight/ (cit. on p. 21).

[Has+15] Ibrahim Abaker Targio Hashem, Ibrar Yaqoob, Nor Badrul Anuar, et al. „The rise of “big data” on cloud computing: Review and open research issues“. In: Information systems 47 (2015), pp. 98–115 (cit. on pp. 19, 26).

[HAW15] Pivotal HAWQ. 2015. URL: http://www.pivotal.io/big-data/hadoop/sql-on-hadoop (cit. on p. 37).

[He+11] Yongqiang He, Rubao Lee, Yin Huai, et al. „RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems“. In: Proceedings of the 27th International Conference on Data Engineering, ICDE 2011, April 11-16, 2011, Hannover, Germany. 2011, pp. 1199–1208 (cit. on p. 125).

[Hel19] Helm. 2019. URL: https://helm.sh/ (cit. on p. 223).

[Her+11] Herodotos Herodotou, Harold Lim, Gang Luo, et al. „Starfish: A Self-tuning System for Big Data Analytics“. In: CIDR 2011, Fifth Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 9-12, 2011, Online Proceedings. 2011, pp. 261–272 (cit. on p. 25).

[Hes+17] Guenter Hesse, Benjamin Reissaus, Christoph Matthies, et al. „Senska - Towards an Enterprise Streaming Benchmark“. In: Performance Evaluation and Benchmarking for the Analytics Era - 9th TPC Technology Conference, TPCTC 2017, Munich, Germany, August 28, 2017, Revised Selected Papers. 2017, pp. 25–40 (cit. on p. 290).

[Hin+11] Benjamin Hindman, Andy Konwinski, Matei Zaharia, et al. „Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center“. In: Proceedings of the 8th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2011, Boston, MA, USA, March 30 - April 1, 2011. 2011 (cit. on p. 29).

[Hiva] Hive. URL: hive.apache.org (cit. on pp. 8, 131, 173, 175).

[Hivb] Hive Lateral View. URL: www.cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView (cit. on pp. 175, 182).

[HJZ18] Rui Han, Lizy Kurian John, and Jianfeng Zhan. „Benchmarking Big Data Systems: A Review“. In: IEEE Trans. Services Computing 11.3 (2018), pp. 580–597 (cit. on p. 3).

[HN13] Michael Hausenblas and Jacques Nadeau. „Apache drill: interactive ad-hoc analysis at scale“. In: Big Data 1.2 (2013), pp. 100–104 (cit. on pp. 36, 122, 178).

[Hoc96] Roger W Hockney. The science of computer benchmarking. SIAM, 1996 (cit. on p. 270).

[Hog09] Trish Hogan. „Overview of TPC Benchmark E: The Next Generation of OLTP Benchmarks“. In: Performance Evaluation and Benchmarking, First TPC Technology Conference, TPCTC 2009, Lyon, France, August 24-28, 2009, Revised Selected Papers. 2009, pp. 84–98 (cit. on p. 268).

[HS13] Matthew Hayes and Sam Shah. „Hourglass: A library for incremental processing on Hadoop“. In: Proceedings of the 2013 IEEE International Conference on Big Data, 6-9 October 2013, Santa Clara, CA, USA. 2013, pp. 742–752 (cit. on p. 37).

[Hu+14] Han Hu, Yonggang Wen, Tat-Seng Chua, and Xuelong Li. „Toward scalable systems for big data analytics: A technology tutorial“. In: IEEE access 2 (2014), pp. 652–687 (cit. on pp. 26, 44, 82).

[Hua+10a] Shengsheng Huang, Jie Huang, Jinquan Dai, Tao Xie, and Bo Huang. „The HiBench benchmark suite: Characterization of the MapReduce-based data analysis“. In: Workshops Proceedings of the 26th International Conference on Data Engineering, ICDE 2010, March 1-6, 2010, Long Beach, California, USA. 2010, pp. 41–51 (cit. on pp. 46, 47, 50, 52, 55, 57, 66, 271, 278).

[Hua+10b] Shengsheng Huang, Jie Huang, Jinquan Dai, Tao Xie, and Bo Huang. „The HiBench benchmark suite: Characterization of the MapReduce-based data analysis“. In: 26th IEEE Data Engineering Workshops (ICDEW), 2010. IEEE. 2010 (cit. on pp. 66, 69, 193, 199, 218).

[Hua+10c] Shengsheng Huang, Jie Huang, Jinquan Dai, Tao Xie, and Bo Huang. „The HiBench benchmark suite: Characterization of the MapReduce-based data analysis“. In: Workshops Proceedings of the 26th International Conference on Data Engineering, ICDE 2010, March 1-6, 2010, Long Beach, California, USA. 2010, pp. 41–51 (cit. on p. 69).

[Hua+10d] Shengsheng Huang, Jie Huang, Jinquan Dai, Tao Xie, and Bo Huang. „The HiBench Benchmark Suite: Characterization of The MapReduce-based Data Analysis“. In: Workshops Proceedings of the 26th IEEE ICDE International Conference on Data Engineering. 2010, pp. 41–51 (cit. on pp. 165, 166).

[Hua+13] Yin Huai, Siyuan Ma, Rubao Lee, Owen O’Malley, and Xiaodong Zhang. „Understanding Insights into the Basic Structure and Essential Issues of Table Placement Methods in Clusters“. In: PVLDB 6.14 (2013), pp. 1750–1761 (cit. on p. 18).

[Hua+14] Yin Huai, Ashutosh Chauhan, Alan Gates, et al. „Major technical advancements in apache hive“. In: International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, June 22-27, 2014. 2014, pp. 1235–1246 (cit. on pp. 120, 123, 125, 131, 132).

[Hua17] Yin Huai. 2017. URL: https://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html (cit. on p. 183).

[Hun+10] Patrick Hunt, Mahadev Konar, Flavio Paiva Junqueira, and Benjamin Reed. „ZooKeeper: Wait-free Coordination for Internet-scale Systems“. In: 2010 USENIX Annual Technical Conference, Boston, MA, USA, June 23-25, 2010. 2010 (cit. on p. 30).

[Hup09] Karl Huppler. „The art of building a good benchmark“. In: Performance Evaluation and Benchmarking. Springer, 2009, pp. 18–30 (cit. on p. 4).

[HW79] John A Hartigan and Manchek A Wong. „Algorithm AS 136: A k-means clustering algorithm“. In: Journal of the Royal Statistical Society. Series C (Applied Statistics) 28.1 (1979), pp. 100–108 (cit. on p. 153).

[IB15a] Todor Ivanov and Max-Georg Beer. „Evaluating Hive and Spark SQL with BigBench“. In: CoRR abs/1512.08417 (2015) (cit. on p. 141).

[IB15b] Todor Ivanov and Max-Georg Beer. „Evaluating Hive and Spark SQL with BigBench (Technical Report)“. In: CoRR abs/1512.08417 (2015). arXiv: 1512.08417 (cit. on pp. 104, 107–109, 113).

[IB15c] Todor Ivanov and Max-Georg Beer. „Performance Evaluation of Spark SQL Using BigBench“. In: Big Data Benchmarking - 6th International Workshop, WBDB 2015, Toronto, ON, Canada, June 16-17, 2015 and 7th International Workshop, WBDB 2015, New Delhi, India, December 14-15, 2015, Revised Selected Papers. 2015, pp. 96–116 (cit. on p. 104).

[IB15d] Todor Ivanov and Max-Georg Beer. „Performance Evaluation of Spark SQL Using BigBench“. In: Big Data Benchmarking - 6th International Workshop, WBDB 2015, Toronto, ON, Canada, June 16-17, 2015 and 7th International Workshop, WBDB 2015, New Delhi, India, December 14-15, 2015, Revised Selected Papers. 2015, pp. 96–116 (cit. on p. 141).

[IB15e] Todor Ivanov and Max-Georg Beer. „Performance evaluation of spark SQL using BigBench“. In: Workshop on Big Data Benchmarks. Springer. 2015, pp. 96–116 (cit. on p. 199).

[IBM18] IBM. 2018. URL: https://github.com/CODAIT/spark-bench (cit. on p. 272).

[ICT13a] ICT, Chinese Academy of Sciences. CloudRank-D. 2013. URL: http://prof.ict.ac.cn/CloudRank/ (cit. on p. 280).

[ICT13b] ICT, Chinese Academy of Sciences. DCBench. 2013. URL: http://prof.ict.ac.cn/DCBench/ (cit. on p. 279).

[ICT15] ICT, Chinese Academy of Sciences. BigDataBench 3.1. 2015. URL: http://prof.ict.ac.cn/BigDataBench/ (cit. on pp. 3, 279).

[II15] Todor Ivanov and Sead Izberovic. „Evaluating Hadoop Clusters with TPCx-HS (Technical Report)“. In: CoRR abs/1509.03486 (2015). arXiv: 1509.03486 (cit. on pp. 66, 81).

[IIK16] Todor Ivanov, Sead Izberovic, and Nikolaos Korfiatis. The Heterogeneity Paradigm in Big Data Architectures. IGI Global, 2016, pp. 218–245 (cit. on pp. 15, 215).

[IKZ13] Todor Ivanov, Nikolaos Korfiatis, and Roberto V. Zicari. „On the inequality of the 3V’s of Big Data Architectural Paradigms: A case for heterogeneity (Technical Report)“. In: CoRR abs/1311.0805 (2013). arXiv: 1311.0805 (cit. on pp. 15, 44, 82).

[Int15] Intel. HiBench Suite. 2015. URL: https://github.com/intel-hadoop/HiBench (cit. on p. 3).

[Int18a] Intel. 2018. URL: https://github.com/intel-hadoop/Big-Data-Benchmark-for-Big-Bench (cit. on pp. 105, 278).

[Int18b] Intel. 2018. URL: https://github.com/intel-hadoop/HiBench (cit. on pp. 208, 271, 278).

[Ios+16] Alexandru Iosup, Tim Hegeman, Wing Lung Ngai, et al. „LDBC Graphalytics: A Benchmark for Large-Scale Graph Analysis on Parallel and Distributed Platforms“. In: PVLDB 9.13 (2016), pp. 1317–1328 (cit. on p. 297).

[IPB12] Todor Ivanov, Ilia Petrov, and Alejandro P. Buchmann. „A Survey on Database Performance in Virtualized Cloud Environments“. In: IJDWM 8.3 (2012), pp. 1–26 (cit. on p. 19).

[IS18] Todor Ivanov and Rekha Singhal. „ABench: Big Data Architecture Stack Benchmark“. In: Companion of the 2018 ACM/SPEC International Conference on Performance Engineering, ICPE 2018, Berlin, Germany, April 09-13, 2018. 2018, pp. 13–16 (cit. on pp. 215, 223).

[Isl+12a] Mohammad Islam, Angelo K. Huang, Mohamed Battisha, et al. „Oozie: towards a scalable workflow management system for Hadoop“. In: Proceedings of the 1st ACM SIGMOD Workshop on Scalable Workflow Execution Engines and Technologies, SWEET@SIGMOD 2012, Scottsdale, AZ, USA, May 20, 2012. 2012, p. 4 (cit. on p. 34).

[Isl+12b] Nusrat Sharmin Islam, Xiaoyi Lu, Md. Wasi-ur-Rahman, Jithin Jose, and Dhabaleswar K. Panda. „A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters“. In: Specifying Big Data Benchmarks - First Workshop, WBDB 2012, San Jose, CA, USA, May 8-9, 2012, and Second Workshop, WBDB 2012, Pune, India, December 17-18, 2012, Revised Selected Papers. 2012, pp. 129–147 (cit. on pp. 73, 77).

[IT18] Todor Ivanov and Jason Taafe. „Exploratory Analysis of Spark Structured Streaming“. In: Companion of the 2018 ACM/SPEC International Conference on Performance Engineering, ICPE 2018, Berlin, Germany, April 09-13, 2018. 2018, pp. 141–146 (cit. on p. 204).

[Iva+14a] Todor Ivanov, Raik Niemann, Sead Izberovic, et al. „Benchmarking DataStax Enterprise/Cassandra with HiBench (Technical Report)“. In: CoRR abs/1411.4044 (2014). arXiv: 1411.4044 (cit. on pp. 65, 69).

[Iva+14b] Todor Ivanov, Roberto V. Zicari, Sead Izberovic, and Karsten Tolle. „Performance Evaluation of Virtualized Hadoop Clusters (Technical Report)“. In: CoRR abs/1411.3811 (2014). arXiv: 1411.3811 (cit. on pp. 42, 66).

[Iva+15a] Todor Ivanov, Raik Niemann, Sead Izberovic, et al. „Performance Evaluation of Enterprise Big Data Platforms with HiBench“. In: 2015 IEEE TrustCom/BigDataSE/ISPA, Helsinki, Finland, August 20-22, 2015, Volume 2. 2015, pp. 120–127 (cit. on p. 65).

[Iva+15b] Todor Ivanov, Tilmann Rabl, Meikel Poess, et al. „Big Data Benchmark Compendium“. In: TPCTC. 2015, pp. 135–155 (cit. on p. 166).

[Iva+15c] Todor Ivanov, Tilmann Rabl, Meikel Poess, et al. „Big Data Benchmark Compendium“. In: Performance Evaluation and Benchmarking: Traditional to Big Data to Internet of Things - 7th TPC Technology Conference, TPCTC 2015, Kohala Coast, HI, USA, August 31 - September 4, 2015. Revised Selected Papers. 2015, pp. 135–155 (cit. on p. 267).

[Iva+18] Todor Ivanov, Patrick Bedué, Ahmad Ghazal, and Roberto V. Zicari. „Adding Velocity to BigBench“. In: Proceedings of the 7th International Workshop on Testing Database Systems, DBTest@SIGMOD 2018, Houston, TX, USA, June 15, 2018. 2018, 6:1–6:6 (cit. on p. 192).

[IZB14] Todor Ivanov, Roberto V. Zicari, and Alejandro P. Buchmann. „Benchmarking Virtualized Hadoop Clusters“. In: Big Data Benchmarking - 5th International Workshop, WBDB 2014, Potsdam, Germany, August 5-6, 2014, Revised Selected Papers. 2014, pp. 87–98 (cit. on p. 42).

[Jac09] Adam Jacobs. „The pathologies of big data“. In: Commun. ACM 52.8 (2009), pp. 36–44 (cit. on p. 16).

[Jag+14] H. V. Jagadish, Johannes Gehrke, Alexandros Labrinidis, et al. „Big data and its technical challenges“. In: Commun. ACM 57.7 (2014), pp. 86–94 (cit. on pp. 26, 43, 82).

[Jak12] Jake Luciani. Cassandra File System Design. 2012. URL: http://www.datastax.com/dev/blog/cassandra-file-system-design (cit. on pp. 66, 67).

[Jep+18] Theo Jepsen, Masoud Moshref, Antonio Carzaniga, Nate Foster, and Robert Soulé. „Life in the Fast Lane: A Line-Rate Linear Road“. In: Proceedings of the Symposium on SDN Research, SOSR 2018, Los Angeles, CA, USA, March 28-29, 2018. 2018, 10:1–10:7 (cit. on p. 286).

[Jou+17] Norman P. Jouppi, Cliff Young, Nishant Patil, et al. „In-Datacenter Performance Analysis of a Tensor Processing Unit“. In: Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA 2017, Toronto, ON, Canada, June 24-28, 2017. 2017, pp. 1–12 (cit. on pp. v, 1).

[JR09] Flavio Paiva Junqueira and Benjamin Reed. „The life and times of a zookeeper“. In: Proceedings of the 28th Annual ACM Symposium on Principles of Distributed Computing, PODC 2009, Calgary, Alberta, Canada, August 10-12, 2009. 2009, p. 4 (cit. on p. 30).

[Kaf19] Apache Kafka. 2019. URL: https://kafka.apache.org/ (cit. on pp. 199, 224).

[Kan+13] Seok-Hoon Kang, Dong-Hyun Koo, Woon-Hak Kang, and Sang-Won Lee. „A case for flash memory ssd in hadoop applications“. In: International Journal of Control and Automation 6.1 (2013), pp. 201–210 (cit. on p. 28).

[KB17] Mayuresh Kunjir and Shivnath Babu. „Thoth in Action: Memory Management in Modern Data Analytics“. In: PVLDB 10.12 (2017) (cit. on p. 208).

[KC14] Karthik Kambatla and Yanpei Chen. „The Truth About MapReduce Performance on SSDs“. In: 28th Large Installation System Administration Conference, LISA ’14, Seattle, WA, USA, November 9-14, 2014. 2014, pp. 109–118 (cit. on p. 28).

[Ker+19] Martin L. Kersten, Stefan Manegold, Ying Zhang, and Panos Kuoutsourakis. „SQALPEL: A database performance platform“. In: CIDR 2019, 9th Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 13-16, 2019, Online Proceedings. 2019 (cit. on p. 300).

[KFS18] C. Kachris, B. Falsafi, and D. Soudris. Hardware Accelerators in Data Centers. SPRINGER INTERNATIONAL PU, 2018 (cit. on pp. v, 1).

[Kha+04] Rohit Khare, Doug Cutting, Kragen Sitaker, and Adam Rifkin. „Nutch: A flexible and scalable open-source web search engine“. In: Oregon State University 1 (2004), pp. 32–32 (cit. on p. 38).

[KHB16] Mayuresh Kunjir, Yuzhang Han, and Shivnath Babu. „Where does Memory Go?: Study of Memory Management in JVM-based Data Analytics“. In: (2016) (cit. on p. 208).

[Kho+93] Ashfaq A. Khokhar, Viktor K. Prasanna, Muhammad E. Shaaban, and Cho-Li Wang. „Heterogeneous Computing: Challenges and Opportunities“. In: IEEE Computer 26.6 (1993), pp. 18–27 (cit. on p. 22).

[Kim+08] Kiyoung Kim, Kyungho Jeon, Hyuck Han, et al. „MRBench: A Benchmark for MapReduce Framework“. In: 14th International Conference on Parallel and Distributed Systems, ICPADS 2008, Melbourne, Victoria, Australia, December 8-10, 2008. 2008, pp. 11–18 (cit. on p. 280).

[Kip+17] Andreas Kipf, Varun Pandey, Jan Böttcher, et al. „Analytics on Fast Data: Main-Memory Database Systems versus Modern Streaming Systems“. In: Proceedings of the 20th International Conference on Extending Database Technology, EDBT 2017, Venice, Italy, March 21-24, 2017. 2017, pp. 49–60 (cit. on pp. 194, 289).

[Kit15] Kite. 2015. URL: http://kitesdk.org/docs/1.0.0/ (cit. on p. 37).

[kit17] TPCx-BB kit. 2017. URL: https://github.com/intel-hadoop/Big-Data-Benchmark-for-Big-Bench (cit. on p. 196).

[KKB14] Mayuresh Kunjir, Prajakta Kalmegh, and Shivnath Babu. „Thoth: Towards Managing a Multi-System Cluster“. In: PVLDB 7.13 (2014), pp. 1689–1692 (cit. on p. 279).

[KKR14] Jörn Kuhlenkamp, Markus Klems, and Oliver Röss. „Benchmarking Scalability and Elasticity of Distributed Database Systems“. In: PVLDB 7.12 (2014), pp. 1219–1230 (cit. on pp. 67, 74).

[Klu+03] Yuval Kluger, Ronen Basri, Joseph T Chang, and Mark Gerstein. „Spectral biclustering of microarray data: coclustering genes and conditions“. In: Genome research 13.4 (2003), pp. 703–716 (cit. on p. 292).

[KN11] Alfons Kemper and Thomas Neumann. „HyPer: A hybrid OLTP&OLAP main memory database system based on virtual memory snapshots“. In: Proceedings of the 27th International Conference on Data Engineering, ICDE 2011, April 11-16, 2011, Hannover, Germany. 2011, pp. 195–206 (cit. on p. 18).

[KNR+11] Jay Kreps, Neha Narkhede, Jun Rao, et al. „Kafka: A distributed messaging system for log processing“. In: Proceedings of the NetDB. 2011, pp. 1–7 (cit. on p. 30).

[Kor+15] Marcel Kornacker, Alexander Behm, Victor Bittorf, et al. „Impala: A Modern, Open-Source SQL Engine for Hadoop“. In: CIDR 2015, Seventh Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 4-7, 2015, Online Proceedings. 2015 (cit. on pp. 36, 121, 122).

[Kra+13] Tim Kraska, Ameet Talwalkar, John C. Duchi, et al. „MLbase: A Distributed Machine-learning System“. In: CIDR 2013, Sixth Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 6-9, 2013, Online Proceedings. 2013 (cit. on p. 35).

[Kre14] Jay Kreps. „Questioning the lambda architecture“. In: Online article, July (2014) (cit. on p. 205).

[KRM18] Jeyhun Karimov, Tilmann Rabl, and Volker Markl. „PolyBench: The First Benchmark for Polystores“. In: Performance Evaluation and Benchmarking for the Era of Artificial Intelligence - 10th TPC Technology Conference, TPCTC 2018, Rio de Janeiro, Brazil, August 27-31, 2018, Revised Selected Papers. 2018, pp. 24–41 (cit. on p. 299).

[KTF09] U Kang, Charalampos E Tsourakakis, and Christos Faloutsos. „Pegasus: A peta-scale graph mining system implementation and observations“. In: Data Mining, 2009. ICDM’09. Ninth IEEE International Conference on. IEEE. 2009, pp. 229–238 (cit. on p. 34).

[Kub19a] Kubernetes. 2019. URL: https://kubernetes.io/ (cit. on p. 223).

[Kub19b] Kubernetes Minikube. 2019. URL: https://kubernetes.io/docs/setup/minikube/ (cit. on p. 224).

[Lab15] Frankfurt Big Data Lab. 2015. URL: https://github.com/BigData-Lab-Frankfurt/HiBench-DSE (cit. on p. 69).

[Lab18] Dream Lab. 2018. URL: https://github.com/dream-lab/riot-bench (cit. on p. 288).

[Lan01] Doug Laney. „3D data management: Controlling data volume, velocity and variety“. In: META group research note 6.70 (2001), p. 1 (cit. on pp. v, 1, 16).

[LDB18] LDBC. 2018. URL: www.ldbcouncil.org (cit. on pp. 270, 271).

[LDB19a] LDBC. 2019. URL: http://ldbcouncil.org/developer/spb (cit. on p. 297).

[LDB19b] LDBC. 2019. URL: http://ldbcouncil.org/benchmarks/snb (cit. on p. 297).

[LES12] Jonathan Leibiusky, Gabriel Eisbruch, and Dario Simonassi. Getting Started with Storm - Continuous Streaming Computation with Twitter’s Cluster Technology. O’Reilly, 2012 (cit. on p. 33).

[Li+13a] Haoyuan Li, Ali Ghodsi, Matei Zaharia, et al. „Tachyon: Memory throughput i/o for cluster computing frameworks“. In: memory 18 (2013), p. 1 (cit. on p. 32).

[Li+13b] Jack Li, Qingyang Wang, Deepal Jayasinghe, et al. „Performance Overhead among Three Hypervisors: An Experimental Study Using Hadoop Benchmarks“. In: IEEE International Congress on Big Data, BigData Congress 2013, Santa Clara, CA, USA, June 27 2013-July 2, 2013. 2013, pp. 9–16 (cit. on pp. 46, 47, 64).

[Li+14] Haoyuan Li, Ali Ghodsi, Matei Zaharia, Scott Shenker, and Ion Stoica. „Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks“. In: Proceedings of the ACM Symposium on Cloud Computing, Seattle, WA, USA, November 3-5, 2014. 2014, 6:1–6:15 (cit. on p. 32).

[Li+15] Min Li, Jian Tan, Yandong Wang, Li Zhang, and Valentina Salapura. „SparkBench: A Comprehensive Benchmarking Suite for In Memory Data Analytic Platform Spark“. In: Proceedings of the 12th ACM International Conference on Computing Frontiers. 2015, 53:1–53:8 (cit. on pp. 165, 166, 193, 199).

[Li+17] Min Li, Jian Tan, Yandong Wang, Li Zhang, and Valentina Salapura. „SparkBench: a spark benchmarking suite characterizing large-scale in-memory data analytics“. In: Cluster Computing 20.3 (2017), pp. 2575–2589 (cit. on pp. 208, 272).

[Liu+16] Zhen Hua Liu, Beda Christoph Hammerschmidt, Doug McMahon, Ying Liu, and Hui Joe Chang. „Closing the Functional and Performance Gap between SQL and NoSQL“. In: SIGMOD. 2016, pp. 227–238 (cit. on p. 164).

[Liu+18] Yu Liu, Hantian Zhang, Luyuan Zeng, Wentao Wu, and Ce Zhang. „MLBench: Benchmarking Machine Learning Services Against Human Experts“. In: PVLDB 11.10 (2018), pp. 1220–1232 (cit. on p. 293).

[LK11] Gunho Lee and Randy H. Katz. „Heterogeneity-Aware Resource Allocation and Scheduling in the Cloud“. In: 3rd USENIX Workshop on Hot Topics in Cloud Computing, HotCloud’11, Portland, OR, USA, June 14-15, 2011. 2011 (cit. on p. 23).

[LM10] Avinash Lakshman and Prashant Malik. „Cassandra: a decentralized structured storage system“. In: Operating Systems Review 44.2 (2010), pp. 35–40 (cit. on pp. 66, 74).

[LM14] X Li and J Murray. „Deploying Virtualized Hadoop Systems with VMWare vSphere Big Data Extensions“. In: Tech. White Pap. VMware Inc (2014) (cit. on p. 46).

[Low+12] Yucheng Low, Danny Bickson, Joseph Gonzalez, et al. „Distributed GraphLab: a framework for machine learning and data mining in the cloud“. In: Proceedings of the VLDB Endowment 5.8 (2012), pp. 716–727 (cit. on p. 34).

[LSA14] Tomislav Lipic, Karolj Skala, and Enis Afgan. „Deciphering big data stacks: An overview of big data tools“. In: Fifth International Workshop on Big Data Analytics: Challenges, and Opportunities (BDAC-14). 2014 (cit. on p. 26).

[Lu+14] Ruirui Lu, Gang Wu, Bin Xie, and Jingtong Hu. „Stream Bench: Towards Benchmarking Modern Distributed Stream Computing Frameworks“. In: Proceedings of the 7th IEEE/ACM International Conference on Utility and Cloud Computing, UCC 2014, London, United Kingdom, December 8-11, 2014. 2014, pp. 69–78 (cit. on pp. 193, 204, 287).

[Luo+12a] Chunjie Luo, Jianfeng Zhan, Zhen Jia, et al. „CloudRank-D: Benchmarking and Ranking Cloud Computing Systems for Data Processing Applications“. In: Frontiers of Computer Science 6.4 (2012), pp. 347–362 (cit. on pp. 165, 166).

[Luo+12b] Chunjie Luo, Jianfeng Zhan, Zhen Jia, et al. „CloudRank-D: benchmarking and ranking cloud computing systems for data processing applications“. In: Frontiers of Computer Science 6.4 (2012), pp. 347–362 (cit. on pp. 280, 281).

[Mag+13] Tariq Magdon-Ismail, M Nelson, R Cheveresan, et al. „Toward an Elastic Elephant–Enabling Hadoop for the Cloud“. In: VMware Technical Journal, Winter (2013) (cit. on pp. 52, 63).

[Mal+10] Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, et al. „Pregel: a system for large-scale graph processing“. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2010, Indianapolis, Indiana, USA, June 6-10, 2010. 2010, pp. 135–146 (cit. on p. 34).

[Man+11] James Manyika, Michael Chui, Brad Brown, et al. „Big data: The next frontier for innovation, competition, and productivity“. In: (2011) (cit. on pp. 43, 82, 104, 130, 164, 165, 172).

[Mar+16] Ovidiu-Cristian Marcu, Alexandru Costan, Gabriel Antoniu, and María S. Pérez-Hernández. „Spark Versus Flink: Understanding Performance in Big Data Analytics Frameworks“. In: the IEEE CLUSTER 2016, Taipei, Taiwan, 2016. 2016 (cit. on p. 208).

[Mel+10] Sergey Melnik, Andrey Gubarev, Jing Jing Long, et al. „Dremel: Interactive Analysis of Web-Scale Datasets“. In: PVLDB 3.1 (2010), pp. 330–339 (cit. on pp. 120, 126).

[MG+11] Peter Mell, Tim Grance, et al. „The NIST definition of cloud computing“. In: (2011) (cit. on p. 19).

[Mic13] Microsoft. „Performance of Hadoop on Windows in Hyper-V Environments“. In: Technical White Paper (2013) (cit. on p. 42).

[Min+13] Zijian Ming, Chunjie Luo, Wanling Gao, et al. „BDGS: A Scalable Big Data Generator Suite in Big Data Benchmarking“. In: Advancing Big Data Benchmarks - Proceedings of the 2013 Workshop Series on Big Data Benchmarking, WBDB.cn, Xi’an, China, July 16-17, 2013 and WBDB.us, San José, CA, USA, October 9-10, 2013 Revised Selected Papers. 2013, pp. 138–154 (cit. on p. 279).

[Min15] Min Li. SparkBench. 2015. URL: https://bitbucket.org/lm0926/sparkbench (cit. on p. 3).

[ML19] System ML. 2019. URL: https://systemml.apache.org/ (cit. on p. 226).

[MLl19] Spark MLlib. 2019. URL: https://spark.apache.org/mllib/ (cit. on p. 226).

[MLP18] MLPerf. 2018. URL: https://mlperf.org/ (cit. on p. 296).

[Mor13] Gianmarco De Francisci Morales. „SAMOA: a platform for mining big data streams“. In: 22nd International World Wide Web Conference, WWW ’13, Rio de Janeiro, Brazil, May 13-17, 2013, Companion Volume. 2013, pp. 777–778 (cit. on p. 39).

[MRB13] MRBS. MRBS. 2013. URL: http://sardes.inrialpes.fr/research/mrbs/index.html (cit. on p. 281).

[MT09] Sean Meyn and Richard L. Tweedie. Markov Chains and Stochastic Stability. 2nd. New York, NY, USA: Cambridge University Press, 2009 (cit. on pp. 164, 181).

[MTH11] Jason Mars, Lingjia Tang, and Robert Hundt. „Heterogeneity in “homogeneous” warehouse-scale computers: A performance opportunity“. In: IEEE Computer Architecture Letters 2 (2011), pp. 29–32 (cit. on p. 23).

[MW15] Nathan Marz and James Warren. Big Data: Principles and best practices of scalable realtime data systems. Manning Publications Co., 2015 (cit. on pp. 24, 205).

[MZ11] Chris Mattmann and Jukka Zitting. Tika in action. Manning Publications Co., 2011 (cit. on p. 39).

[Nam+14] Raghunath Othayoth Nambiar, Meikel Poess, Akon Dey, et al. „Introducing TPCx-HS: The First Industry Standard for Benchmarking Big Data Systems“. In: Performance Characterization and Benchmarking. Traditional to Big Data - 6th TPC Technology Conference, TPCTC 2014, Hangzhou, China, September 1-5, 2014. Revised Selected Papers. 2014, pp. 1–12 (cit. on pp. 81, 86, 104, 274).

[Nam14] Raghunath Nambiar. „Benchmarking Big Data Systems: Introducing TPC Express Benchmark HS“. In: Big Data Benchmarking - 5th International Workshop, WBDB 2014, Potsdam, Germany, August 5-6, 2014, Revised Selected Papers. 2014, pp. 24–28 (cit. on p. 268).

[Nam18] Raghunath Nambiar. 2018. URL: https://blogs.cisco.com/datacenter/ tpc-iot (cit. on p. 276).

[Nog+17] Shadi A. Noghabi, Kartik Paramasivam, Yi Pan, et al. „Stateful Scalable Stream Processing at LinkedIn“. In: PVLDB 10.12 (2017), pp. 1634–1645 (cit. on p. 192).

[NP06] Raghunath Othayoth Nambiar and Meikel Poess. „The Making of TPC-DS“. In: Proceedings of the 32nd International Conference on Very Large Data Bases, Seoul, Korea, September 12-15, 2006. 2006, pp. 1049–1058 (cit. on p. 268).

[NP15] Raghunath Nambiar and Meikel Poess. „Reinventing the TPC: From Traditional to Big Data to Internet of Things“. In: Performance Evaluation and Benchmarking: Traditional to Big Data to Internet of Things - 7th TPC Technology Conference, TPCTC 2015, Kohala Coast, HI, USA, August 31 - September 4, 2015. Revised Selected Papers. 2015, pp. 1–7 (cit. on p. 276).

[NR16] Axel-Cyrille Ngonga Ngomo and Michael Röder. „HOBBIT: Holistic benchmarking for big linked data“. In: ERCIM News 2016.105 (2016) (cit. on p. 300).

[Ols+08a] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. „Pig latin: a not-so-foreign language for data processing“. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008. 2008, pp. 1099– 1110 (cit. on pp. 39, 277).

[Ols+08b] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. „Pig latin: a not-so-foreign language for data processing“. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008. 2008, pp. 1099– 1110 (cit. on p. 122).

[Ols+17] Randal S. Olson, William La Cava, Patryk Orzechowski, Ryan J. Urbanowicz, and Jason H. Moore. „PMLB: a large benchmark suite for machine learning evaluation and comparison“. In: BioData Mining 10.1 (2017), 36:1–36:13 (cit. on p. 292).

[One13] Monash C. One database to rule them all? 2013. URL: http://www.dbms2.com/2013/02/21/one-database-to-rule-them-all/ (cit. on p. 23).

[OO12] Sean Owen et al. „Mahout in action“. In: (2012) (cit. on p. 38).

[Ope19] Operator. 2019. URL: https://coreos.com/operators/ (cit. on p. 223).

[Ous+15] Kay Ousterhout, Ryan Rasti, Sylvia Ratnasamy, Scott Shenker, and Byung-Gon Chun. „Making Sense of Performance in Data Analytics Frameworks“. In: the 12th USENIX NSDI 15, Oakland, CA, USA, May 4-6, 2015. 2015 (cit. on p. 208).

[Ous+17a] Kay Ousterhout, Christopher Canel, Sylvia Ratnasamy, and Scott Shenker. „Monotasks: Architecting for Performance Clarity in Data Analytics Frameworks“. In: 26th SOSP, Shanghai, China, 2017. 2017 (cit. on p. 208).

[Ous+17b] Kay Ousterhout, Christopher Canel, Max Wolffe, Sylvia Ratnasamy, and Scott Shenker. „Performance clarity as a first-class design principle“. In: the 16th Workshop HotOS 2017, Whistler, BC, Canada, May 8-10, 2017. 2017 (cit. on p. 208).

[Ove18] ImageNet Overview. 2018. URL: http://image-net.org/about-overview (cit. on p. 293).

[Ozd18] Muhammet Mustafa Ozdal. „Emerging Accelerator Platforms for Data Centers“. In: IEEE Design & Test 35.1 (2018), pp. 47–54 (cit. on pp. v, 1).

[PAC18] Ivens Portugal, Paulo S. C. Alencar, and Donald D. Cowan. „The use of machine learning algorithms in recommender systems: A systematic review“. In: Expert Syst. Appl. 97 (2018), pp. 205–227 (cit. on pp. v, 1).

[Pat+11] Swapnil Patil, Milo Polte, Kai Ren, et al. „YCSB++: benchmarking and performance debugging advanced features in scalable table stores“. In: ACM Symposium on Cloud Computing in conjunction with SOSP 2011, SOCC ’11, Cascais, Portugal, October 26-28, 2011. 2011, p. 9 (cit. on pp. 67, 273).

[Pat+16] Milinda Pathirage, Julian Hyde, Yi Pan, and Beth Plale. „SamzaSQL: Scalable Fast Data Management with Streaming SQL“. In: 2016 IEEE IPDPS Workshops 2016, Chicago, IL, USA, May 23-27, 2016. 2016 (cit. on p. 192).

[Pav+09a] Andrew Pavlo, Erik Paulson, Alexander Rasin, et al. „A Comparison of Approaches to Large-Scale Data Analysis“. In: SIGMOD. 2009, pp. 165–178 (cit. on p. 282).

[Pav+09b] Andrew Pavlo, Erik Paulson, Alexander Rasin, et al. „A comparison of approaches to large-scale data analysis“. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2009, Providence, Rhode Island, USA, June 29 - July 2, 2009. 2009, pp. 165–178 (cit. on pp. 77, 166, 278).

[PC06] Intel. Increasing Data Center Density While Driving Down Power and Cooling Costs. 2006. URL: https://www.intel.com/content/dam/doc/white-paper/increasing-data-center-density-paper.pdf (cit. on p. 27).

[PCW15] Pouria Pirzadeh, Michael J. Carey, and Till Westmann. „BigFUN: A Performance Study of Big Data Management System Functionality“. In: 2015 IEEE International Conference on Big Data. 2015, pp. 507–514 (cit. on pp. 165, 280).

[PCW17] Pouria Pirzadeh, Michael J. Carey, and Till Westmann. „A performance study of big data analytics platforms“. In: 2017 IEEE International Conference on Big Data, BigData 2017, Boston, MA, USA, December 11-14, 2017. 2017, pp. 2911– 2920 (cit. on p. 127).

[Ped+11] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, et al. „Scikit-learn: Machine Learning in Python“. In: Journal of Machine Learning Research 12 (2011), pp. 2825–2830 (cit. on p. 292).

[PF00] Meikel Pöss and Chris Floyd. „New TPC Benchmarks for Decision Support and Web Commerce“. In: SIGMOD Record 29.4 (2000), pp. 64–71 (cit. on p. 268).

[Pir] Pouria Pirzadeh. URL: https://github.com/pouriapirz/bigFUN (cit. on p. 280).

[Pla09] Hasso Plattner. „A common database approach for OLTP and OLAP using an in-memory column database“. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2009, Providence, Rhode Island, USA, June 29 - July 2, 2009. 2009, pp. 1–2 (cit. on p. 18).

[Pla15] Rackspace Cloud Big Data Platform. 2015. URL: http://www.rackspace.com/cloud/big-data/ (cit. on p. 21).

[PMC17] Nicolás Poggi, Alejandro Montero, and David Carrera. „Characterizing BigBench Queries, Hive, and Spark in Multi-cloud Environments“. In: Performance Evaluation and Benchmarking for the Analytics Era - 9th TPC Technology Conference, TPCTC 2017, Munich, Germany, August 28, 2017, Revised Selected Papers. 2017, pp. 55–74 (cit. on pp. 127, 128).

[PNW07] Meikel Pöss, Raghunath Othayoth Nambiar, and David Walrath. „Why You Should Run TPC-DS: A Workload Analysis“. In: Proceedings of the 33rd International Conference on Very Large Data Bases, University of Vienna, Austria, September 23-27, 2007. 2007, pp. 1138–1149 (cit. on p. 268).

[Poe+14] Meikel Poess, Tilmann Rabl, Hans-Arno Jacobsen, and Brian Caufield. „TPC-DI: The First Industry Benchmark for Data Integration“. In: PVLDB 7.13 (2014), pp. 1367–1378 (cit. on p. 268).

[Pog+14] Nicolás Poggi, David Carrera, Aaron Call, et al. „ALOJA: A systematic study of Hadoop deployment variables to enable automated characterization of cost-effectiveness“. In: 2014 IEEE Intl. Conf. on Big Data, Big Data 2014, Washington, DC, USA, October 27-30, 2014. 2014, pp. 905–913 (cit. on p. 299).

[Pog+16] Nicolás Poggi, Josep Lluis Berral, Thomas Fenech, et al. „The state of SQL-on-Hadoop in the cloud“. In: 2016 IEEE International Conference on Big Data, BigData 2016, Washington DC, USA, December 5-8, 2016. 2016, pp. 1432–1443 (cit. on pp. 127, 128).

[Pog18] Nicolas Poggi. „Microbenchmark“. In: Encyclopedia of Big Data Technologies. Ed. by Sherif Sakr and Albert Zomaya. Cham: Springer International Publishing, 2018, pp. 1–10 (cit. on p. 3).

[Pol18] Matthias Polag. „Enriching the Machine Learning Workloads of BigBench“. Master Thesis. Goethe University Frankfurt, 2018 (cit. on p. 225).

[Pös17] Meikel Pöss. „Methodologies for a Comprehensive Approach to Measuring the Performance of Decision Support Systems“. PhD thesis. Technische Universität München, 2017 (cit. on p. 276).

[Pre] Presto. URL: www.prestodb.io (cit. on pp. 36, 121, 122, 163).

[PRJ17] Meikel Poess, Tilmann Rabl, and Hans-Arno Jacobsen. „Analysis of TPC-DS: the first standard benchmark for SQL-based big data systems“. In: Proceedings of the 2017 Symposium on Cloud Computing, SoCC 2017, Santa Clara, CA, USA, September 24 - 27, 2017. 2017, pp. 573–585 (cit. on pp. 268, 285).

[Pro13] Grover M. Processing frameworks for Hadoop. 2013. URL: http://radar.oreilly.com/2015/02/processing-frameworks-for-hadoop.html (cit. on p. 31).

[Pro18] Huawei-AIM Benchmark Tell Project. 2018. URL: https://github.com/tellproject/aim-benchmark (cit. on pp. 289, 290).

[PS16] Sankaralingam Panneerselvam and Michael Swift. „Rinnegan: Efficient Resource Use in Heterogeneous Architectures“. In: PACT 2016. Haifa, Israel: ACM, 2016, pp. 373–386 (cit. on p. 215).

[Qin+13] Xiongpai Qin, Biao Qin, Xiaoyong Du, and Shan Wang. „Reflection on the Popularity of MapReduce and Observation of Its Position in a Unified Big Data Platform“. In: Web-Age Information Management - WAIM 2013 International Workshops: HardBD, MDSP, BigEM, TMSN, LQPM, BDMS, Beidaihe, China, June 14-16, 2013. Proceedings. 2013, pp. 339–347 (cit. on p. 25).

[Qiu+14] Judy Qiu, Shantenu Jha, Andre Luckow, and Geoffrey C Fox. „Towards HPC-ABDS: an initial high-performance big data stack“. In: Building Robust Big Data Ecosystem ISO/IEC JTC 1 (2014), pp. 18–21 (cit. on p. 26).

[Raa93] Francois Raab. „TPC-C - The Standard Benchmark for Online transaction Processing (OLTP)“. In: The Benchmark Handbook for Database and Transaction Systems (2nd Edition). 1993 (cit. on p. 268).

[Rab+10a] Tilmann Rabl, Michael Frank, Hatem Mousselly Sergieh, and Harald Kosch. „A Data Generator for Cloud-Scale Benchmarking“. In: Performance Evaluation, Measurement and Characterization of Complex Systems - Second TPC Technology Conference, TPCTC 2010, Singapore, September 13-17, 2010. Revised Selected Papers. 2010, pp. 41–56 (cit. on p. 104).

[Rab+10b] Tilmann Rabl, Michael Frank, Hatem Mousselly Sergieh, and Harald Kosch. „A Data Generator for Cloud-Scale Benchmarking“. In: Performance Evaluation, Measurement and Characterization of Complex Systems - Second TPC Technology Conference, TPCTC 2010, Singapore, September 13-17, 2010. Revised Selected Papers. 2010, pp. 41–56 (cit. on p. 133).

[Rab+12a] Tilmann Rabl, Ahmad Ghazal, Minqing Hu, et al. „BigBench Specification V0.1 - BigBench: An Industry Standard Benchmark for Big Data Analytics“. In: Proceedings of the 2012 Workshop on Big Data Benchmarking. 2012, pp. 164–201 (cit. on p. 165).

[Rab+12b] Tilmann Rabl, Mohammad Sadoghi, Hans-Arno Jacobsen, et al. „Solving Big Data Challenges for Enterprise Application Performance Management“. In: PVLDB 5.12 (2012), pp. 1724–1735 (cit. on p. 67).

[Rab+14] Tilmann Rabl, Ahmad Ghazal, Minqing Hu, et al. „BigBench Specification V0.1“. In: Specifying Big Data Benchmarks. Ed. by Tilmann Rabl, Meikel Poess, Chaitanya Baru, and Hans-Arno Jacobsen. Vol. 8163. Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2014, pp. 164–201 (cit. on pp. 114, 116).

[Raj+18] Sreeraj Rajendran, Roberto Calvo-Palomino, Markus Fuchs, et al. „Electrosense: Open and Big Spectrum Data“. In: IEEE Communications Magazine 56.1 (Jan. 2018), pp. 210–217 (cit. on p. 158).

[RCL09] Bhaskar Prasad Rimal, Eunmi Choi, and Ian Lumb. „A Taxonomy and Survey of Cloud Computing Systems“. In: International Conference on Networked Computing and Advanced Information Management, NCM 2009, Fifth International Joint Conference on INC, IMS and IDC: INC 2009: International Conference on Networked Computing, IMS 2009: International Conference on Advanced Information Management and Service, IDC 2009: International Conference on Digital Content, Multimedia Technology and its Applications, Seoul, Korea, August 25-27, 2009. 2009, pp. 44–51 (cit. on pp. 19, 42).

[RK10] Ariel Rabkin and Randy H. Katz. „Chukwa: A System for Reliable Large-Scale Log Collection“. In: Uncovering the Secrets of System Administration: Proceedings of the 24th Large Installation System Administration Conference, LISA 2010, San Jose, CA, USA, November 7-12, 2010. 2010 (cit. on p. 33).

[Sah14] OpenStack Sahara. 2014. URL: https://wiki.openstack.org/wiki/Sahara (cit. on p. 29).

[Sak+15] Sherif Sakr, Amin Shafaat, Fuad Bajaber, et al. „Liquid Benchmarking: A Platform for Democratizing the Performance Evaluation Process“. In: Proceedings of the 18th International Conference on Extending Database Technology, EDBT 2015, Brussels, Belgium, March 23-27, 2015. 2015, pp. 537–540 (cit. on p. 299).

[Sam12] Eric Sammer. Hadoop Operations. O’Reilly Media, Inc., 2012 (cit. on p. 71).

[Sar+00] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. „Analysis of recommendation algorithms for e-commerce“. In: Proceedings of the 2nd ACM conference on Electronic commerce. ACM. 2000, pp. 158–167 (cit. on p. 225).

[SC11] Sherif Sakr and Fabio Casati. „Liquid Benchmarks: Towards an Online Platform for Collaborative Assessment of Computer Science Research Results“. In: Proceedings of the Second TPC Technology Conference on Performance Evaluation, Measurement and Characterization of Complex Systems. TPCTC’10. Singapore: Springer-Verlag, 2011, pp. 10–24 (cit. on p. 299).

[SCS17] Anshu Shukla, Shilpa Chaturvedi, and Yogesh Simmhan. „RIoTBench: A Real- time IoT Benchmark for Distributed Stream Processing Platforms“. In: CoRR abs/1701.08530 (2017). arXiv: 1701.08530 (cit. on pp. 204, 288).

[SÇZ05] Michael Stonebraker, Ugur Çetintemel, and Stanley B. Zdonik. „The 8 requirements of real-time stream processing“. In: SIGMOD Record 34.4 (2005), pp. 42–47 (cit. on p. 192).

[Ser15] Red Hat OpenShift Platform as a Service. 2015. URL: https://www.openshift.com/products/ (cit. on p. 21).

[Sha+10] Yi Shan, Bo Wang, Jing Yan, et al. „FPMR: MapReduce framework on FPGA“. In: Proceedings of the ACM/SIGDA 18th International Symposium on Field Programmable Gate Arrays, FPGA 2010, Monterey, California, USA, February 21-23, 2010. 2010, pp. 93–102 (cit. on p. 28).

[Sha14] Saeed Shahrivari. „Beyond Batch Processing: Towards Real-Time and Streaming Big Data“. In: Computers 3.4 (2014) (cit. on p. 205).

[She15] Sherif Sakr. Liquid benchmarking. 2015. URL: http://wiki.liquidbenchmark.net/doku.php/home (cit. on p. 299).

[Shi+15] Juwei Shi, Yunjie Qiu, Umar Farooq Minhas, et al. „Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics“. In: PVLDB 8.13 (2015) (cit. on p. 208).

[Shi+18] Xuanhua Shi, Zhigao Zheng, Yongluan Zhou, et al. „Graph Processing on GPUs: A Survey“. In: ACM Comput. Surv. 50.6 (2018), 81:1–81:35 (cit. on pp. v, 1).

[Shv+10] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. „The Hadoop Distributed File System“. In: 26th IEEE Symposium on Mass Storage Systems and Technologies. 2010, pp. 1–10 (cit. on pp. 44, 74, 82).

[Sin16] Sweta Singh. „Benchmarking Spark Machine Learning Using BigBench“. In: 8th TPC Technology Conference, TPCTC 2016, New Delhi, India, September 5-9, 2016. 2016 (cit. on pp. 222, 225).

[SLF13] Sherif Sakr, Anna Liu, and Ayman G. Fayoumi. „The family of mapreduce and large-scale data processing systems“. In: ACM Comput. Surv. 46.1 (2013), 11:1–11:44 (cit. on pp. 18, 25, 65).

[SM13] Rainer Schmidt and Michael Möhring. „Strategic Alignment of Cloud-Based Architectures for Big Data“. In: 17th IEEE International Enterprise Distributed Object Computing Conference Workshops, EDOC Workshops, Vancouver, BC, Canada, September 9-13, 2013. 2013, pp. 136–143 (cit. on p. 42).

[Spa17] Apache Spark. 2017. URL: https://spark.apache.org/ (cit. on pp. 163, 178, 208).

[SPE18] SPEC. 2018. URL: www.spec.org (cit. on pp. 3, 269).

[Spo18] Sports-1M-dataset. 2018. URL: https://github.com/gtoderici/sports-1m-dataset/ (cit. on p. 293).

[SS16a] Anshu Shukla and Yogesh Simmhan. „Benchmarking Distributed Stream Processing Platforms for IoT Applications“. In: 8th TPCTC 2016, New Delhi, India, Sept. 5-9, 2016. 2016, pp. 90–106 (cit. on pp. 194, 199).

[SS16b] Anshu Shukla and Yogesh Simmhan. „Benchmarking Distributed Stream Processing Platforms for IoT Applications“. In: Performance Evaluation and Benchmarking. Traditional - Big Data - Internet of Things - 8th TPC Technology Conference, TPCTC 2016, New Delhi, India, September 5-9, 2016, Revised Selected Papers. 2016, pp. 90–106 (cit. on p. 288).

[SS17] Rekha Singhal and Praveen Singh. „Performance Assurance Model for Applications on Spark Platform“. In: 9th TPC Technology Conference 2017. 2017 (cit. on p. 216).

[SSB12a] A. Sangroya, D. Serrano, and S. Bouchenak. MRBS: A Comprehensive MapReduce Benchmark Suite. Tech. rep. LIG Grenoble Fr, 2012 (cit. on p. 281).

[SSB12b] Amit Sangroya, Damián Serrano, and Sara Bouchenak. „MRBS: Towards Dependability Benchmarking for Hadoop MapReduce“. In: Euro-Par: Parallel Processing Workshops. 2012, pp. 3–12 (cit. on pp. 165, 166).

[SSB12c] Amit Sangroya, Damián Serrano, and Sara Bouchenak. „MRBS: Towards Dependability Benchmarking for Hadoop MapReduce“. In: Euro-Par 2012: Parallel Processing Workshops - BDMC, CGWS, HeteroPar, HiBB, OMHI, Paraphrase, PROPER, Resilience, UCHPC, VHPC, Rhodes Islands, Greece, August 27-31, 2012. Revised Selected Papers. 2012, pp. 3–12 (cit. on p. 281).

[SSK11] Christof Strauch, Ultra-Large Scale Sites, and Walter Kriha. „NoSQL databases“. In: Lecture Notes, Stuttgart Media University 20 (2011) (cit. on p. 24).

[ST10] Priya Sethuraman and H. Reza Taheri. „TPC-V: A Benchmark for Evaluating the Performance of Database Applications in Virtual Environments“. In: Performance Evaluation, Measurement and Characterization of Complex Systems - Second TPC Technology Conference, TPCTC 2010, Singapore, September 13-17, 2010. Revised Selected Papers. 2010, pp. 121–135 (cit. on p. 268).

[Sta+08] James Staten, Simon Yates, Frank E Gillett, Walid Saleh, and Rachel A Dines. „Is cloud computing ready for the enterprise“. In: Forrester Research 400 (2008) (cit. on p. 30).

[STA18] STAC. 2018. URL: www.stacresearch.com (cit. on p. 270).

[Sto+07] Michael Stonebraker, Chuck Bear, Ugur Çetintemel, et al. „One Size Fits All? Part 2: Benchmarking Studies“. In: CIDR 2007, Third Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 7-10, 2007, Online Proceedings. 2007, pp. 173–184 (cit. on p. 23).

[Sto+10] Michael Stonebraker, Daniel J. Abadi, David J. DeWitt, et al. „MapReduce and parallel DBMSs: friends or foes?“ In: Commun. ACM 53.1 (2010), pp. 64–71 (cit. on pp. 278, 282).

[Sto13] Storm-YARN. 2013. URL: https://github.com/yahoo/storm-yarn (cit. on pp. 31, 33).

[Str17a] Flink Streaming. 2017. URL: https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/table/streaming.html (cit. on pp. 192, 205).

[Str17b] Spark Streaming. 2017. URL: https://spark.apache.org/streaming/ (cit. on pp. 200, 205, 208).

[Str17c] Spark Structured Streaming. 2017. URL: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html (cit. on pp. 192, 224).

[Sun+14a] Liwen Sun, Michael J. Franklin, Sanjay Krishnan, and Reynold S. Xin. „Fine-grained partitioning for aggressive data skipping“. In: International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, June 22-27, 2014. 2014, pp. 1115–1126 (cit. on p. 129).

[Sun+14b] Liwen Sun, Sanjay Krishnan, Reynold S. Xin, and Michael J. Franklin. „A Partitioning Framework for Aggressive Data Skipping“. In: PVLDB 7.13 (2014), pp. 1617–1620 (cit. on p. 129).

[Sun+16] Liwen Sun, Michael J. Franklin, Jiannan Wang, and Eugene Wu. „Skipping- oriented Partitioning for Columnar Layouts“. In: PVLDB 10.4 (2016), pp. 421– 432 (cit. on p. 129).

[SV16] Rekha Singhal and Abhishek Verma. „Predicting Job Completion Time in Heterogeneous MapReduce Environments“. In: IPDPS Work. 2016, Chicago, USA, May 23-27. 2016 (cit. on p. 216).

[Taa17] Jason Taaffe. „Extending BigBench using Structured Streaming in Apache Spark“. Master Thesis. Goethe University Frankfurt, 2017 (cit. on p. 204).

[Taa18] Jason Taaffe. 2018. URL: https://github.com/Taaffy/Structured-Streaming-Micro-Benchmark (cit. on p. 210).

[Tal+12] Ameet Talwalkar, Tim Kraska, Rean Griffith, et al. „Mlbase: A distributed machine learning wrapper“. In: NIPS Big Learning Workshop. 2012 (cit. on p. 35).

[Tan14] Khaled Tannir. Optimizing Hadoop for MapReduce. Packt Publishing Ltd, 2014 (cit. on pp. 94, 97).

[TC13] Kathleen Ting and Jarek Jarcec Cecho. Apache Sqoop Cookbook: Unlocking Hadoop for Your Relational Database. O’Reilly Media, Inc., 2013 (cit. on pp. 35, 36, 44, 83).

[Tez15] Apache Tez. 2015. URL: http://hortonworks.com/hadoop/tez/ (cit. on p. 37).

[Thu+09] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, et al. „Hive - A Warehousing Solution Over a Map-Reduce Framework“. In: PVLDB 2.2 (2009), pp. 1626– 1629 (cit. on pp. 35, 36, 120–123, 175).

[Thu+10] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, et al. „Hive - a petabyte scale data warehouse using Hadoop“. In: Proceedings of the 26th International Conference on Data Engineering, ICDE 2010, March 1-6, 2010, Long Beach, California, USA. 2010, pp. 996–1005 (cit. on pp. 35, 36, 44, 77, 83, 120–123).

[Too18] Intel: PAT Tool. 2018. URL: https://github.com/intel-hadoop/PAT (cit. on pp. 93, 113, 148).

[Tos+14] Ankit Toshniwal, Siddarth Taneja, Amit Shukla, et al. „Storm@twitter“. In: SIGMOD 2014, Snowbird, UT, USA, June 22-27, 2014. 2014, pp. 147–156 (cit. on p. 192).

[TPCa] TPC-DS. URL: www.tpc.org/tpcds/default.asp (cit. on pp. 104, 107, 116, 165).

[TPCb] TPCx-HS. URL: www.tpc.org/tpcx-hs/default.asp (cit. on pp. 81, 86, 87, 90, 166).

[TPC17] TPCx-BB. 2017. URL: www.tpc.org/tpcx-bb/default.asp (cit. on pp. 3, 166, 173, 193, 194).

[TPC18a] TPC. 2018. URL: www.tpc.org (cit. on pp. 3, 216, 268, 269).

[TPC18b] TPC. 2018. URL: http://www.tpc.org/tpc_documents_current_versions/ pdf/tpcx-bb_v1.2.0.pdf (cit. on pp. 104, 129, 130, 164, 278).

[TPC18c] TPC. 2018. URL: http://www.tpc.org/tpc_documents_current_versions/ pdf/tpcx-iot_v1.0.3.pdf (cit. on p. 276).

[Tra15] Transaction Processing Performance Council. TPC Express Benchmark HS - Standard Specification. Version 1.3.0. 2015 (cit. on p. 275).

[Tri+18] Animesh Trivedi, Patrick Stuedi, Jonas Pfefferle, Adrian Schuepbach, and Bernard Metzler. „Albis: High-Performance File Format for Big Data Systems“. In: 2018 USENIX Annual Technical Conference (USENIX18). USENIX Association. 2018 (cit. on p. 129).

[Uni12] Faraz Ahmad (Purdue University). PUMA Benchmarks. 2012. URL: https://engineering.purdue.edu/~puma/pumabenchmarks.htm (cit. on pp. 3, 277).

[Vav+13] Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, et al. „Apache Hadoop YARN: yet another resource negotiator“. In: ACM Symposium on Cloud Computing, SOCC ’13, Santa Clara, CA, USA, October 1-3, 2013. 2013, 5:1–5:16 (cit. on pp. 29, 70, 83).

[VBR06] Srikumar Venugopal, Rajkumar Buyya, and Kotagiri Ramamohanarao. „A taxonomy of Data Grids for distributed data sharing, management, and processing“. In: ACM Comput. Surv. 38.1 (2006), p. 3 (cit. on p. 23).

[Vei+16] Jorge Veiga, Roberto R. Expósito, Xoan C. Pardo, Guillermo L. Taboada, and Juan Touriño. „Performance evaluation of big data frameworks for large-scale data analytics“. In: 2016 IEEE BigData 2016, Washington DC, USA, Dec. 5-8, 2016. 2016 (cit. on p. 208).

[Ven+17] Shivaram Venkataraman, Aurojit Panda, Kay Ousterhout, et al. „Drizzle: Fast and Adaptable Stream Processing at Scale“. In: 26th SOSP, Shanghai, China, 2017. 2017 (cit. on p. 208).

[VF18] Kizheppatt Vipin and Suhaib A. Fahmy. „FPGA Dynamic and Partial Reconfiguration: A Survey of Architectures, Methods, and Applications“. In: ACM Comput. Surv. 51.4 (2018), 72:1–72:39 (cit. on pp. v, 1).

[VPK18] Anuj Vaishnav, Khoa Dang Pham, and Dirk Koch. „A Survey on FPGA Virtualization“. In: 28th International Conference on Field Programmable Logic and Applications, FPL 2018, Dublin, Ireland, August 27-31, 2018. 2018, pp. 131–138 (cit. on pp. v, 1).

[Wan+14a] Lei Wang, Jianfeng Zhan, Chunjie Luo, et al. „BigDataBench: A Big Data Benchmark Suite from Internet Services“. In: 20th IEEE International Symposium on High Performance Computer Architecture, HPCA 2014. 2014, pp. 488–499 (cit. on p. 165).

[Wan+14b] Lei Wang, Jianfeng Zhan, Chunjie Luo, et al. „BigDataBench: A big data benchmark suite from internet services“. In: 20th IEEE International Symposium on High Performance Computer Architecture, HPCA 2014, Orlando, FL, USA, February 15-19, 2014. 2014, pp. 488–499 (cit. on p. 279).

[WBR17] Alex Watson, Deepigha Shree Vittal Babu, and Suprio Ray. „Sanzu: A data science benchmark“. In: 2017 IEEE International Conference on Big Data, BigData 2017, Boston, MA, USA, December 11-14, 2017. 2017, pp. 263–272 (cit. on p. 291).

[WC14] Brian Wooden and Jack Coates. Building A Common Information Model(CIM) Compliant Technical Add-on (TA). 2014 (cit. on p. 164).

[Whi12] Tom White. Hadoop - The Definitive Guide: Storage and Analysis at Internet Scale (3. ed., revised and updated). O’Reilly, 2012 (cit. on pp. 65, 67, 73).

[Win+16] Wolfram Wingerath, Felix Gessert, Steffen Friedrich, and Norbert Ritter. „Real-time stream processing for Big Data“. In: Information Technology 58.4 (2016) (cit. on pp. 204, 205).

[WLS14] Haoliang Wang, Wei Liu, and Tolga Soyata. „Accessing big data in the cloud using mobile devices“. In: Handbook of Research on Cloud Infrastructures for Big Data Analytics. IGI Global, 2014, pp. 444–470 (cit. on p. 20).

[Wou+15] Stefan van Wouw, José Viña, Alexandru Iosup, and Dick H. J. Epema. „An Empirical Performance Evaluation of Distributed SQL Query Engines“. In: Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering, Austin, TX, USA, January 31 - February 4, 2015. 2015, pp. 123–131 (cit. on pp. 121, 127, 128).

[Wu+16] Dongyao Wu, Liming Zhu, Xiwei Xu, et al. „Building Pipelines for Heterogeneous Execution Environments for Big Data Processing“. In: IEEE Softw. (2016) (cit. on p. 215).

[WYY15] Lengdong Wu, Li-Yan Yuan, and Jia-Huai You. „Survey of Large-Scale Data Management Systems for Big Data Applications“. In: J. Comput. Sci. Technol. 30.1 (2015), pp. 163–183 (cit. on p. 26).

[Xin+13] Reynold S. Xin, Josh Rosen, Matei Zaharia, et al. „Shark: SQL and rich analytics at scale“. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, New York, NY, USA, June 22-27, 2013. 2013, pp. 13–24 (cit. on p. 36).

[Xio+16] Wen Xiong, Zhibin Yu, Lieven Eeckhout, et al. „ShenZhen transportation system (SZTS): a novel big data benchmark suite“. In: The Journal of Supercomputing 72.11 (2016), pp. 4337–4364 (cit. on p. 285).

[XNR14] Miguel Gomes Xavier, Marcelo Veiga Neves, and César Augusto Fonticielha De Rose. „A Performance Comparison of Container-Based Virtualization Systems for MapReduce Clusters“. In: 22nd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2014, Torino, Italy, February 12-14, 2014. 2014, pp. 299–306 (cit. on p. 30).

[Xu+17] Zhen Xu, Xuhao Chen, Jie Shen, et al. „GARDENIA: A Domain-specific Benchmark Suite for Next-generation Accelerators“. In: CoRR abs/1708.04567 (2017). arXiv: 1708.04567 (cit. on p. 297).

[Yah18a] Yahoo. 2018. URL: https://github.com/brianfrankcooper/YCSB (cit. on p. 273).

[Yah18b] Yahoo. 2018. URL: https://github.com/yahoo/streaming-benchmarks (cit. on p. 287).

[Yah18c] Yahoo. 2018. URL: https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at (cit. on p. 287).

[Yan13] Yanpei Chen. Statistical Workload Injector for MapReduce (SWIM). 2013. URL: https://github.com/SWIMProjectUCB/SWIM/wiki (cit. on p. 277).

[Ye+12] Kejiang Ye, Xiaohong Jiang, Yanzhang He, et al. „vHadoop: A Scalable Hadoop Virtual Cluster Platform for MapReduce-Based Parallel Machine Learning with Performance Consideration“. In: 2012 IEEE International Conference on Cluster Computing Workshops, CLUSTER Workshops 2012, Beijing, China, September 24-28, 2012. 2012, pp. 152–160 (cit. on pp. 47, 64).

[Zah+10] Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, and Ion Stoica. „Spark: Cluster computing with working sets.“ In: HotCloud 10.10-10 (2010), p. 95 (cit. on pp. 31, 33, 123).

[Zah+12a] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, et al. „Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing“. In: Proceedings of the 9th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2012, San Jose, CA, USA, April 25-27, 2012. 2012, pp. 15– 28 (cit. on pp. 31, 33, 105, 106).

[Zah+12b] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, et al. „Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing“. In: Proceedings of the 9th USENIX Symposium on Networked Systems Design and Implementation, NSDI. 2012, pp. 15–28 (cit. on p. 178).

[Zah+13] Matei Zaharia, Tathagata Das, Haoyuan Li, et al. „Discretized streams: fault-tolerant streaming computation at scale“. In: ACM SIGOPS 24th SOSP ’13, Farmington, PA, USA, November 3-6, 2013. 2013 (cit. on pp. 34, 192).

[Zah16] Zaharia. 2016. URL: databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html (cit. on pp. 205, 206).

[ZCC17] Yunhao Zhang, Rong Chen, and Haibo Chen. „Sub-millisecond Stateful Stream Querying over Fast-evolving Linked Data“. In: 26th SOSP, Shanghai, China, Oct. 28-31, 2017. 2017 (cit. on pp. 208, 210).

[ZE+11] Paul Zikopoulos, Chris Eaton, et al. Understanding big data: Analytics for enterprise class hadoop and streaming data. McGraw-Hill Osborne Media, 2011 (cit. on pp. v, 1).

[Zha+18] Qingchen Zhang, Laurence T. Yang, Zhikui Chen, and Peng Li. „A survey on deep learning for big data“. In: Information Fusion 42 (2018), pp. 146–157 (cit. on pp. v, 1).

[ZHZ16] Jianfeng Zhan, Rui Han, and Roberto V. Zicari, eds. Big Data Benchmarks, Performance Optimization, and Emerging Hardware. Vol. 9495. Lecture Notes in Computer Science. Springer, 2016 (cit. on p. 165).

[Zic14] Roberto V Zicari. „Big data: Challenges and opportunities“. In: Big data computing 564 (2014) (cit. on p. 39).


Appendices


A Classification of Big Data Benchmarks

This section of the Appendix looks at the current standardized and in-development Big Data benchmarks. It starts with an overview of the major benchmark organizations, followed by a classification of the benchmarks, and finally a short review of the existing Big Data benchmark platforms. The benchmark classification is divided into six categories according to workload, data type and targeted Big Data technologies, as depicted in Figure A.1. These categories are micro-benchmarks, Big Data and SQL-on-Hadoop benchmarks, streaming benchmarks, machine learning and deep learning benchmarks, graph benchmarks and new emerging benchmarks. Additionally, each benchmark review is divided into six subsections: Description, Benchmark type and domain, Workload, Data type and Generation, Metrics, and Implementation and technology stack.

The contributions in this chapter are from the following publications:

• Todor Ivanov and Roberto V. Zicari, Analytics Benchmarks, in Springer Encyclopedia of Big Data Technologies 2018, Living Edition, Editors: Sherif Sakr, Albert Zomaya.

• Todor Ivanov, Tilmann Rabl, Meikel Poess, Anna Queralt, John Poelman, Nicolas Poggi, Jeffrey Buell, Big Data Benchmark Compendium, in Proceedings of the 7th TPC Technology Conference on Performance Evaluation & Benchmarking (TPCTC 2015), August 31, 2015, Kohala Coast, Hawaii, [Iva+15c].

• Raik Niemann and Todor Ivanov, Evaluating the Energy Efficiency of Data Management Systems, in Proceedings of the 4th IEEE/ACM International Workshop on Green and Sustainable Software (GREENS 2015), Florence, Italy, May 18, 2015.

• Deliverable 1.3 - Big Data Methodologies, Tools and Infrastructures (2018) - Kim Hee, Todor Ivanov, Roberto V. Zicari, Rut Waldenfels, Hevin Özmen, Naveed Mushtaq, Minsung Hong, Tharsis Teoh, Rajendra Akerkar - EU Project "Leveraging Big Data to Manage Transport Operations" (LeMO), H2020, Project ID: 770038

• Deliverable 1.1 - Industry Requirements with benchmark metrics and KPIs (2018) - Barbara Pernici, Chiara Francalanci, Angela Geronazzo, Lucia Polidori, Gabriella Cattaneo, Helena Schwenk, Marko Grobelnik, Tomás Pariente, Iván Martínez, Todor Ivanov, Arne Berre - EU Project "Evidence Based Big Data Benchmarking to Improve Business Performance" (DataBench), H2020, Project ID: 780966

Fig. A.1.: Big Data Benchmarks Classification

A.1 Benchmark Organizations

Transaction Processing Performance Council (TPC)

The TPC (Transaction Processing Performance Council) [TPC18a] is a non-profit corporation operating as an industry consortium of vendors that define transaction processing, database and big data system benchmarks. TPC was formed on August 10, 1988 by eight companies convinced by Omri Serlin [TPC18a]. In November 1989, the first standard benchmark, TPC-A, was published with a 42-page specification [Gra92]. By late 1990, there were 35 member companies. As of 2018, TPC has 21 company members and three associate members. There are six obsolete benchmarks (TPC-A, TPC-App, TPC-B, TPC-D, TPC-R and TPC-W), 14 active benchmarks, among them TPC-C [Raa93], TPC-E [Hog09], TPC-H [PF00], TPC-DS [PRJ17; PNW07; NP06], TPC-DI [Poe+14], TPC-V [ST10], TPCx-HS [Nam14] and TPCx-BB [Gha+13b], and two common specifications (Pricing and Energy) used across all benchmarks. Table A.1 lists the active TPC benchmarks grouped by domain.

Tab. A.1.: Active TPC Benchmarks [TPC18a]

Benchmark Domain               Specification Name
Transaction Processing (OLTP)  TPC-C, TPC-E
Decision Support (OLAP)        TPC-H, TPC-DS, TPC-DI
Virtualization                 TPC-VMS, TPCx-V, TPCx-HCI
Big Data                       TPCx-HS V1, TPCx-HS V2, TPCx-BB, TPC-DS V2
IoT                            TPCx-IoT
Common Specifications          TPC-Pricing, TPC-Energy

Tab. A.2.: Active SPEC Benchmarks [SPE18]

Benchmark Domain                                          Specification Name
Cloud                                                     SPEC Cloud IaaS 2016
CPU                                                       SPEC CPU2006, SPEC CPU2017
Graphics and Workstation Performance                      SPECapc for SolidWorks 2015, SPECapc for Siemens NX 9.0 and 10.0, SPECapc for PTC Creo 3.0, SPECapc for 3ds Max 2015, SPECwpc V2.1, SPECviewperf 12.1
High Performance Computing, OpenMP, MPI, OpenACC, OpenCL  SPEC OMP2012, SPEC MPI2007, SPEC ACCEL
Java Client/Server                                        SPECjvm2008, SPECjms2007, SPECjEnterprise2010, SPECjbb2015
Storage                                                   SPEC SFS2014
Power                                                     SPECpower ssj2008
Virtualization                                            SPEC VIRT SC 2013

Standard Performance Evaluation Corporation (SPEC)

The SPEC (Standard Performance Evaluation Corporation) [SPE18] is a non-profit corporation formed to establish, maintain and endorse standardized benchmarks and tools to evaluate performance and energy efficiency for the newest generation of computing systems. It was founded in 1988 by a small number of workstation vendors. SPEC is an umbrella organization that covers four groups (each with their own benchmark suites, rules and dues structure): the Open Systems Group (OSG), the High-Performance Group (HPG), the Graphics and Workstation Performance Group (GWPG) and the SPEC Research Group (RG). As of 2018, there are around 19 active SPEC benchmarks, listed in Table A.2.

Securities Technology Analysis Center (STAC)

The STAC Benchmark Council [STA18] consists of over 300 financial institutions and more than 50 vendor organizations whose purpose is to explore technical challenges and solutions in financial services and to develop technology benchmark standards that are useful to financial organizations. Since 2007, the council has been working on benchmarks targeting Fast Data, Big Data and Big Compute workloads in the finance industry. As of 2018, there are around 11 active benchmarks, listed in Table A.3.

Tab. A.3.: Active STAC Benchmarks [STA18]

Benchmark Domain    Specification Name
Feed handlers       STAC-M1
Data distribution   STAC-M2
Tick analytics      STAC-M3
Event processing    STAC-A1
Risk computation    STAC-A2
Backtesting         STAC-A3
Trade execution     STAC-E
Tick-to-trade       STAC-T1
Time sync           STAC-TS
Big Data            in-development
Network I/O         STAC-N1, STAC-T0

Linked Data Benchmark Council (LDBC)

The Linked Data Benchmark Council (LDBC) [LDB18] is a non-profit organization dedicated to establishing benchmarks, benchmark practices and benchmark results for graph data management software. As of 2018, there are three standardized benchmarks listed with more details in Table A.4 and 9 active member companies and organizations.

Other historical benchmark organizations and consortia are The Perfect Club [Gra92; Hoc96] and the Parkbench Committee [Hoc96].

A.2 Micro-Benchmarks

Hadoop Workload Examples

Since its first version, the Hadoop framework has included several ready-to-use MapReduce sample applications. They are located in the hadoop-examples-version.jar file. These applications are commonly used both to learn and to benchmark Hadoop.

Tab. A.4.: Active LDBC Benchmarks [LDB18]

Benchmarks                           Workload Description
Graphalytics benchmark               breadth-first search, PageRank, weakly connected components, community detection using label propagation, local clustering coefficient, and single-source shortest paths
Semantic Publishing Benchmark (SPB)  testing the performance of RDF engines inspired by the Media/Publishing industry
Social Network Benchmark             Interactive Workload, Business Intelligence Workload, Graph Analytics Workload

The most popular ones include WordCount, Grep, Pi, and TeraSort. The HiBench suite, which is briefly described in the next sub-section, also includes these example workloads.

• Grep Task: Grep [Apa09] is a standard MapReduce program that is included in the major Hadoop distributions. The program extracts strings from text input files, matches regular expressions against those strings and counts their number of occurrences. More precisely, it consists of two MapReduce jobs running in sequence. The first job counts how many times a matching string occurred, and the second job sorts the matching strings by their frequency and stores the output in a single output file (a stand-alone sketch of this two-stage structure follows the list below).

• Pi [Apa15b] is a MapReduce program computing the exact binary digits of the mathematical constant Pi. It uses multiple map tasks to do the computation and a single reducer to gather the results of the mappers. Therefore, the application is more CPU bound and produces very little network and storage I/O.
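The two-job structure of the Grep task can be illustrated independently of Hadoop. The following minimal Scala sketch uses plain collections instead of MapReduce; the input file name and the regular expression are placeholder assumptions, but the count-then-sort-by-frequency pipeline mirrors the two jobs described above.

import scala.io.Source
import scala.util.matching.Regex

// Stand-alone sketch of the two-stage Grep workload described above: stage 1
// counts the occurrences of strings matching a regular expression, stage 2
// sorts the matched strings by frequency. File name and pattern are placeholders.
object GrepSketch {
  def main(args: Array[String]): Unit = {
    val pattern: Regex = "error\\w*".r                      // hypothetical pattern
    val lines = Source.fromFile("input.txt").getLines()     // hypothetical input file

    // Stage 1 (first MapReduce job): count how many times each matching string occurs.
    val counts: Map[String, Int] =
      lines.flatMap(line => pattern.findAllIn(line)).toList
        .groupBy(identity).map { case (s, occ) => (s, occ.size) }

    // Stage 2 (second MapReduce job): sort the matching strings by their frequency.
    counts.toSeq.sortBy { case (_, c) => -c }
      .foreach { case (s, c) => println(s"$c\t$s") }
  }
}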

HiBench

Description: HiBench [Hua+10a; Int18b] is a comprehensive big data benchmark suite for evaluating different big data frameworks. It consists of 19 workloads including both synthetic micro-benchmarks and real-world applications from 6 categories which are micro, ml (machine learning), sql, graph, websearch and streaming.

Benchmark type and domain: Micro-benchmark suite including 6 categories which are micro, ml (machine learning), sql, graph, websearch and streaming.

Workloads:

• Micro Benchmarks: Sort (sort), WordCount (wordcount), TeraSort (terasort), Sleep (sleep), enhanced DFSIO (dfsioe)

• SQL: Scan (scan), Join (join), Aggregate (aggregation)

• Websearch Benchmarks: PageRank (pagerank), Nutch indexing (nutchindexing)

• Graph Benchmark: NWeight (nweight)

• Streaming Benchmarks: Identity (identity), Repartition (repartition), Stateful Wordcount (wordcount), Fixwindow (fixwindow)

Data type and Generation: Most workloads use synthetic data generated from real data samples. The workloads use structured and semi-structured data.

Metric: The measured metrics are execution time (latency), throughput and system resource utilizations (CPU, Memory, etc.).

Implementation and technology stack: HiBench can be executed in Docker containers. It is implemented using the following technologies: (1) Hadoop: Apache Hadoop 2.x, CDH5, HDP; (2) Spark: Spark 1.6.x, Spark 2.0.x, Spark 2.1.x, Spark 2.2.x; (3) Flink: 1.0.3; (4) Storm: 1.0.1; (5) Gearpump: 0.8.1; and (6) Kafka: 0.8.2.2.

SparkBench

Description: SparkBench [Li+17; Agr+15; IBM18] is a flexible system for benchmarking and simulating Spark jobs. It consists of multiple workloads organized in 4 categories.

Benchmark type and domain: SparkBench is a Spark specific benchmarking suite to help developers and researchers to evaluate and analyze the performance of their systems in order to optimize the configurations. It consists of 10 workloads organized in 4 different categories.

Workloads: The atomic unit of organization in SparkBench is the workload. Workloads are standalone Spark jobs that read their input data, if any, from disk and optionally write their output to disk. Workload suites are collections of one or more workloads, and the workloads in a suite can be run serially or in parallel (a minimal sketch of this abstraction follows the category list below). The 4 categories of workloads are:

• Machine Learning: logistic regression (LogRes), support vector machine (SVM) and matrix factorization (MF).

• Graph Computation: PageRank, collaborative filtering model (SVD++) and a fundamental graph analytics algorithm (TriangleCount (TC)).

• SQL Query: select, aggregate and join in HiveQL and RDDRelation.

• Streaming Application: Twitter popular tag and PageView
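As referenced above, the workload-suite abstraction can be sketched in a few lines of Scala. The trait and class names below are hypothetical, not the actual SparkBench API, and the example workload does no real Spark work; the sketch only illustrates how standalone workloads are grouped into suites that run serially or in parallel.

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Hypothetical sketch of the workload/suite abstraction described above
// (not the actual SparkBench API).
object WorkloadSuiteSketch {
  trait Workload {
    def name: String
    def run(): Unit                       // a standalone job: read input, compute, optionally write output
  }

  final case class Suite(name: String, workloads: Seq[Workload], parallel: Boolean) {
    def run(): Unit =
      if (parallel) {                     // run the suite's workloads concurrently ...
        val running = workloads.map(w => Future(w.run()))
        Await.result(Future.sequence(running), Duration.Inf)
      } else {                            // ... or one after another
        workloads.foreach(_.run())
      }
  }

  def main(args: Array[String]): Unit = {
    val sleep = new Workload {
      val name = "sleep"
      def run(): Unit = { println(s"running $name"); Thread.sleep(100) }
    }
    Suite("demo-suite", Seq(sleep, sleep), parallel = false).run()
  }
}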

Data type and Generation: The data type and generation depend on the particular workload. The LogRes and SVM use the Wikipedia data set. The MF, SVD++ and TriangleCount use the Amazon Movie Review data set. The PageRank uses Google Web Graph data and respectively Twitter uses Twitter data. The SQL Queries workloads use E-commerce data. Finally, the PageView uses PageView DataGen to generate synthetic data.

Metrics: SparkBench defines a number of metrics facilitating comparison between various Spark optimizations, configurations and cluster provisioning options: (1) Job Execution Time (s) of each workload; (2) Data Process Rate (MB/second); and (3) Shuffle Data Size.

Implementation and technology stack: SparkBench is currently compiled against the Spark 2.1.1 jars and should work with Spark 2.x. It is written using Scala 2.11.8.

GridMix

GridMix [Apa13a] is a benchmark suite for Hadoop clusters, which consists of a mix of synthetic jobs. The benchmark suite emulates different users sharing the same cluster resources and submitting different types and numbers of jobs. This also includes the emulation of distributed cache loads, compression, decompression, and job configuration in terms of resource usage. In order to run the GridMix benchmark, a trace describing the mix of all MapReduce jobs running in the given cluster has to be recorded.

Yahoo! Cloud Serving Benchmark (YCSB)

Description: The YCSB [Coo+10; Yah18a] framework is designed to evaluate the performance of different “key-value” and “cloud” serving systems, which do not support the ACID properties. The benchmark is open source and available on GitHub. The YCSB++ [Pat+11], an extension of the YCSB framework, includes many additions such as multi-tester coordination for increased load and eventual consistency measurement, multi-phase workloads to quantify the consequences of work deferment and the benefits of anticipatory configuration optimization such as B-tree pre-splitting or bulk loading, and abstract APIs for explicit incorporation of advanced features in benchmark tests.

Benchmark type and domain: The framework is a collection of cloud OLTP-related workloads representing a particular mix of read/write operations, data sizes, request distributions and similar parameters, which can be used to evaluate systems at one particular point in the performance space.

Workload: YCSB provides a core package of 6 pre-defined workloads A-F, which simulate cloud OLTP applications. The workloads are variations of the same basic application type and use a table of records with a predefined size and predefined field types. Each operation against the data store is randomly chosen to be one of:

• Insert: insert a new record.

• Update: update a record by replacing the value of one field.

• Read: read a record, either one randomly chosen field or all fields.

• Scan: scan records in order, starting at a randomly chosen record key. The number of records to scan is randomly chosen.

The YCSB workload consists of random operations defined by one of the several built-in distributions (a selection sketch follows the list below):

• Uniform: choose an item uniformly at random.

• Zipfian: choose an item according to the Zipfian distribution.

• Latest: like the Zipfian distribution, except that the most recently inserted records are in the head of the distribution.

• Multinomial: probabilities for each item can be specified.
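As referenced above, the following Scala sketch (not YCSB code) illustrates how a client can mix operations and draw record keys from a bounded Zipfian distribution. The record count, the Zipfian exponent of 0.99 and the 95%/5% read/update mix are illustrative assumptions resembling, but not taken from, the core workloads.

import scala.util.Random

// Illustrative sketch (not YCSB itself) of choosing operations and record keys
// according to a bounded Zipfian distribution, as described above.
object KeyValueWorkloadSketch {
  // Cumulative distribution of Zipfian(s) probabilities over the keys 1..n.
  def zipfCdf(n: Int, s: Double = 0.99): Array[Double] = {
    val weights = (1 to n).map(i => 1.0 / math.pow(i, s))
    val total   = weights.sum
    weights.map(_ / total).scanLeft(0.0)(_ + _).tail.toArray
  }

  // Draw one key id by inverting the cumulative distribution.
  def sampleKey(cdf: Array[Double], rnd: Random): Int = {
    val u = rnd.nextDouble()
    val i = cdf.indexWhere(_ >= u)
    (if (i >= 0) i else cdf.length - 1) + 1
  }

  def main(args: Array[String]): Unit = {
    val rnd = new Random(42)
    val cdf = zipfCdf(n = 1000)                                   // hypothetical table of 1000 records
    (1 to 10).foreach { _ =>
      val key = sampleKey(cdf, rnd)
      val op  = if (rnd.nextDouble() < 0.95) "READ" else "UPDATE" // assumed 95/5 operation mix
      println(s"$op user$key")
    }
  }
}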

Data type and Generation: The benchmark consists of a workload generator and a generic database interface, which can be easily extended to support other relational or NoSQL databases.

Metrics: The benchmark measures the latency and achieved throughput of the executed operations. At the end of the experiment, it reports total execution time, the average throughput, 95th and 99th percentile latencies, and either a histogram or time series of the latencies.

Implementation and technology stack: Currently, YCSB is implemented and can be run with more than 14 different engines like Cassandra, HBase, MongoDB, Riak, Couchbase, Redis, Memcached, etc. The YCSB Client is a Java program for generating the data to be loaded to the database, and generating the operations which make up the workload.

TPCx-HS v1 & v2

Description: The TPCx-HS was released in July 2014 as the industry’s first standard benchmark for Big Data systems [Nam+14]. The updated TPCx-HS v2 was released in April 2017.

Benchmark type and domain: The benchmark is based on the TeraSort workload [Apa15a], which is part of the Apache Hadoop distribution.

Workload: Similarly to TeraSort, it consists of four modules: HSGen, HSDataCheck, HSSort, and HSValidate. The HSGen is a program that generates the data for a particular Scale Factor (see Clause 4.1 from the TPCx-HS specification) and is based on the TeraGen, which uses a random data generator. The HSDataCheck is a program that checks the compliance of the dataset and replication. The HSSort is a program, based on TeraSort, which sorts the data into a total order. Finally, HSValidate is a program, based on TeraValidate, that validates the output is sorted.

Tab. A.5.: TPCx-HS Phases

Phase  Description as provided in TPCx-HS specification [Tra15]
1      Generation of input data via HSGen. The data generated must be replicated 3-ways and written on a durable medium.
2      Dataset (See Clause 4) verification via HSDataCheck. The program is to verify the cardinality, size, and replication factor of the generated data. If the HSDataCheck program reports failure, then the run is considered invalid.
3      Running the sort using HSSort on the input data. This phase samples the input data and sorts the data. The sorted data must be replicated 3-ways and written on a durable medium.
4      Dataset (See Clause 4) verification via HSDataCheck. The program is to verify the cardinality, size and replication factor of the sorted data. If the HSDataCheck program reports failure, then the run is considered invalid.
5      Validating the sorted output data via HSValidate. HSValidate validates the sorted data. If the HSValidate program reports that the HSSort did not generate the correct sort order, then the run is considered invalid.

A valid benchmark execution consists of five separate phases which have to be run sequentially to avoid any phase overlapping. Additionally, Table A.5 provides the exact description of each of the execution phases. The benchmark is started by the script and consists of two consecutive runs, Run1 and Run2. No activities except file system cleanup are allowed between Run1 and Run2. The completion times of each phase/module (HSGen, HSSort and HSValidate) except HSDataCheck are currently reported.

An important requirement of the benchmark is to maintain 3-way data replication throughout the entire experiment.

Data type and Generation: The scale factor defines the size of the dataset, which is generated by HSGen and used for the benchmark experiments. In TPCx-HS, it follows a stepped size model.

Metric: The benchmark reports the total elapsed time (T) in seconds for both runs. This time is used for the calculation of the TPCx-HS performance metric, abbreviated HSph@SF. The run that takes more time and results in the lower TPCx-HS performance metric is defined as the performance run. On the contrary, the run that takes less time and results in the higher TPCx-HS performance metric is defined as the repeatability run. The reported performance metric is the TPCx-HS performance metric for the performance run.
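A minimal Scala sketch of deriving the reported metric from the two runs, assuming HSph@SF is computed as the scale factor divided by the elapsed time of the performance run expressed in hours; the run times below are made-up example values, not measured results.

// Sketch of the TPCx-HS metric calculation under the assumption
// HSph@SF = SF / (T / 3600), with T in seconds; example values are made up.
object TpcxHsMetricSketch {
  def hsph(scaleFactor: Double, elapsedSeconds: Double): Double =
    scaleFactor / (elapsedSeconds / 3600.0)

  def main(args: Array[String]): Unit = {
    val sf   = 3.0                              // hypothetical scale factor
    val run1 = 5400.0                           // elapsed seconds of Run1 (made up)
    val run2 = 5100.0                           // elapsed seconds of Run2 (made up)
    val performanceRun = math.max(run1, run2)   // the slower run is the performance run
    println(f"HSph@SF = ${hsph(sf, performanceRun)}%.2f")
  }
}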

Implementation and technology stack: It stresses both the hardware and software components including the Hadoop run-time stack, Hadoop File System, and MapReduce layers.

TPCx-IoT

Description: The TPC Benchmark IoT (TPCx-IoT) [NP15; Pös17; Nam18; TPC18c] workload is designed based on the Yahoo Cloud Serving Benchmark (YCSB). It is not comparable to YCSB due to significant changes. The TPCx-IoT workloads consist of data ingestion and concurrent queries simulating workloads on typical IoT Gateway systems. The dataset represents data from sensors from electric power station(s).

Benchmark type and domain: TPCx-IoT was developed to provide the industry with an objective measure of the hardware, operating system, data storage and data management systems for IoT Gateway systems. The TPCx-IoT benchmark models a continuous system availability of 24 hours a day, 7 days a week.

Workload: The System Under Test (SUT) must run a data management platform that is commercially available, and data must be persisted in a non-volatile durable medium with a minimum of two-way replication. The workload represents data injected into the SUT with analytics queries running in the background. The analytic queries retrieve the readings of a randomly selected sensor for two 30 second time intervals, TI1 and TI2. The first time interval TI1 is defined between the timestamp Ts at which the query was started and the timestamp 5 seconds prior to Ts, i.e. TI1 = [Ts-5, Ts]. The second time interval TI2 is a randomly selected 5 seconds time interval within the 1800 seconds time interval prior to the start of the first query, Ts-5; if Ts <= 1810, TI2 is selected within the shorter time interval available prior to Ts-5.
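One possible reading of the interval selection above is sketched below in Scala; the clamping of the selection window for small Ts is an assumption made for this illustration, and the exact boundary handling is defined by the specification, not by this sketch.

import scala.util.Random

// Illustrative reading (not the TPCx-IoT driver) of the two query windows
// described above: TI1 covers the 5 seconds before the query start Ts, and
// TI2 is a random 5-second window inside the 1800 seconds before Ts - 5.
// The clamping for small Ts is an assumption made for this sketch.
object QueryIntervalSketch {
  def intervals(ts: Double, rnd: Random): ((Double, Double), (Double, Double)) = {
    val ti1           = (ts - 5, ts)
    val earliestStart = math.max(0.0, ts - 5 - 1800)
    val latestStart   = ts - 5 - 5                 // TI2 must end no later than Ts - 5
    val start2        = earliestStart + rnd.nextDouble() * (latestStart - earliestStart)
    (ti1, (start2, start2 + 5))
  }

  def main(args: Array[String]): Unit = {
    val (ti1, ti2) = intervals(ts = 2000.0, new Random(7))
    println(s"TI1 = $ti1, TI2 = $ti2")
  }
}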

Data type and Generation: Each record generated consists of driver system id, sensor name, time stamp, sensor reading and padding to a 1 Kbyte size. The driver system id represents a power station. The dataset represents data from 200 different types of sensors.

Metrics: TPCx-IoT was specifically designed to provide verifiable performance, price-performance and availability metrics for commercially available systems that typically ingest massive amounts of data from large numbers of devices. TPCx-IoT defines the following primary metrics: (1) IoTps as the performance metric; (2) $/IoTps as the price-performance metric; and (3) system availability date.

Implementation and technology stack: The benchmark currently supports the HBase 1.2.1 and Couchbase-Server 5.0 NoSQL databases. A guide providing instructions on how to add new databases is also available.

PigMix

PigMix/PigMix2 [Apa13c] is a set of 17 queries specifically created to test the performance of Pig systems, in particular their latency and scalability. The queries, written in Pig Latin [Ols+08a], test different operations like data loading, different types of joins, group by clauses, sort clauses, as well as aggregation operations. The benchmark includes eight data sets, with varying schema attributes and sizes, generated using the DataGeneratorHadoop [Apa10] tool. PigMix/PigMix2 are not considered true benchmarks as they lack some of the main benchmark elements, such as metrics.

PUMA Benchmarks

Description: PUMA [Uni12; Ahm+12] is a benchmark suite which represents a broad range of MapReduce applications exhibiting application characteristics with high/low computation and high/low shuffle volumes. There are a total of 13 micro-benchmarks.

Benchmark type and domain: The benchmark suite focuses on stress testing the MapReduce framework with typical micro applications processing mostly text data.

Workload: There are 13 micro-benchmarks, out of which TeraSort, WordCount, and Grep are from the Hadoop distribution. They are slightly modified to take the number of reduce tasks as user input and to generate final job completion time statistics. The remaining 10 workloads are Inverted-Index, Term-Vector, Self-Join, Adjacency-List, K-Means, Classification, Histogram-Movies, Histogram-Ratings, Sequence-Count and Ranked-Inverted-Index.

Data type and Generation: Each of the workloads use a different input dataset with pre-defined size and specific structure.

Metric: The micro-benchmarks report execution time and relevant MapReduce job statistics.

Implementation and technology stack: PUMA benchmark suite is implemented for Apache Hadoop (MapReduce and Hadoop File System).

Statistical Workload Injector for MapReduce (SWIM)

SWIM [Che+11; CAK12; Yan13] is a benchmark that takes a different approach to the testing process. It consists of a framework that is able to synthesize a representative workload from real MapReduce traces, taking into account the job submit time, input data size, and shuffle/input and output/shuffle data ratios. The result is a synthetic workload which has the exact characteristics of the original workload. Similarly, the benchmark generates artificial data. Then the workload executor runs a script which takes the input data and executes the synthetically generated workload (jobs with the specified data sizes and data ratios, simulating the gaps between job executions; a replay sketch follows below). Additionally, the reproduced workload includes a mix of job submission rates and sequences and a mix of common job types. Currently, the benchmark includes multiple real Facebook traces and the goal is to further extend the repository by including new real workload traces.
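As referenced above, the replay idea can be sketched as follows; the field names and the trace entries are hypothetical, and a real SWIM run would submit actual MapReduce jobs instead of printing them.

// Illustrative sketch (not SWIM itself) of replaying a synthesized trace: each
// entry carries the inter-arrival gap, the input size and the data ratios taken
// from a real trace, and the executor reproduces the submission gaps.
object TraceReplaySketch {
  final case class SyntheticJob(name: String, gapMs: Long, inputMB: Double,
                                shuffleInputRatio: Double, outputShuffleRatio: Double)

  def replay(trace: Seq[SyntheticJob]): Unit =
    trace.foreach { job =>
      Thread.sleep(job.gapMs)                             // reproduce the recorded submission gap
      val shuffleMB = job.inputMB * job.shuffleInputRatio
      val outputMB  = shuffleMB * job.outputShuffleRatio
      println(f"submit ${job.name}: input=${job.inputMB}%.1f MB, " +
              f"shuffle=$shuffleMB%.1f MB, output=$outputMB%.1f MB")
    }

  def main(args: Array[String]): Unit =
    replay(Seq(SyntheticJob("job-1",   0, 1024.0, 0.4, 0.2),    // made-up trace entries
               SyntheticJob("job-2", 500,   64.0, 1.0, 1.0)))
}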

A.3 Big Data and SQL-on-Hadoop Benchmarks

AdBench

AdBench [Bha16] is an end-to-end data pipeline benchmark that combines Ad-Serving, Streaming Analytics on Ad-serving logs, streaming ingestion and updates of various data entities, batch-oriented analytics (e.g. for Billing), Ad-Hoc analytical queries, and Machine learning for Ad targeting. While this benchmark is specific to modern Web or Mobile advertising companies and exchanges, the workload characteristics are found in many verticals, such as Internet of Things (IoT), financial services, retail, and healthcare.

AMP Lab Big Data Benchmark

AMP Lab Benchmark [AMP13] measures the analytical capabilities of data warehousing solutions. This benchmark currently provides quantitative and qualitative comparisons of five data warehouse systems: RedShift, Hive, Stinger/Tez, Shark, and Impala. Based on Pavlo’s Benchmark [Pav+09b; Sto+10] and HiBench [Hua+10a; Int18b], it consists of four queries involving scans, aggregations, joins, and UDFs. It supports different data sizes and scaling to thousands of nodes.

BigBench (TPCx-BB)

Description: BigBench [Gha+13b; Bar+14; Int18a; TPC18b] is an end-to-end big data benchmark that represents a data model simulating the volume, velocity and variety characteristics of a big data system, together with a synthetic data generator for structured, semi-structured and unstructured data. The structured part of the retail data model is adopted from the TPC-DS benchmark and further extended with semi-structured (registered and guest user clicks) and unstructured data (product reviews). In 2016, BigBench was standardized as TPCx-BB by the Transaction Processing Performance Council (TPC).

Benchmark type and domain: BigBench is an end-to-end, technology agnostic, application-level benchmark that tests the analytical capabilities of a Big Data platform. It is based on a fictional product retailer business model.

Workload: The business model and a large portion of the data model’s structured part are derived from the TPC-DS benchmark. The structured part was extended with a table for the prices of the retailer’s competitors, the semi-structured part was added in the form of a table with website logs, and the unstructured part was added in the form of a table containing product reviews. The simulated workload is based on a set of 30 queries covering the different aspects of big data analytics proposed by McKinsey.

Data type and Generation: The data generator can scale the amount of data based on a scale factor. Due to parallel processing of the data generator, it runs efficiently for large scale factors. The benchmark consists of four key steps: (i) System setup; (ii) Data generation; (iii) Data load; and (iv) Execute application workload.

Metrics: TPCx-BB defines the following primary metrics: (1) BBQpm@SF, the performance metric, reflecting the TPCx-BB Queries per minute throughput; where SF is the Scale Factor; (2) $/BBQpm@SF, the price/performance metric; and (3) System Availability Date as defined by the TPC Pricing Specification.

Implementation and technology stack: Since the BigBench specification is general and technology agnostic, it has to be implemented specifically for each Big Data system. The initial implementation of BigBench was made for the Teradata Aster platform. It was written in Aster's SQL-MR syntax and served, in addition to a description in plain English, as the initial specification of BigBench's workloads. Meanwhile, BigBench has been implemented for Hadoop, using the MapReduce engine and other components of the Hadoop ecosystem such as Hive, Mahout, Spark SQL, Spark MLlib and OpenNLP.

BigDataBench

BigDataBench [Wan+14b] is an open source Big Data benchmark suite [ICT15] consisting of 14 data sets and 33 workloads. Six of the 14 data sets are based on real-world data and are generated using the BDGS [Min+13] data generator. The generated data types include text, graph, and table data, and are fully scalable. According to the literature, it is unclear what the upper bound of the data set sizes is. The remaining eight data sets are generated from a small seed of real data and are not yet scalable. The 33 workloads are divided into five common application domains: search engine, social networks, electronic commerce, multimedia analytics, and bioinformatics. BigDataBench has many similarities with DCBench [ICT13b], a benchmark suite developed to test data center workloads. This is a rapidly evolving benchmark; please check the official website for current updates.

BigFrame

BigFrame [KKB14] is a benchmark generator offering a benchmarking-as-a-service solution for Big Data analytics. While the latest version together with documentation is available on GitHub [Big13], changes are still being made to the benchmark generator. The benchmark distinguishes between two different analytics workloads: 1) offline analytics and 2) real-time analytics. It consists of structured data (Sales, Item, Customer and Promotion tables) adapted from the TPC-DS benchmark and semi-structured JSON data containing unstructured text. The current version

of the benchmark provides data models for two types of workloads: historical and continuous query. The data in the historical workflow is processed at typical data warehouse rates, e.g., weekly, whereas the continuous workflow is processed in real-time. This enables real-time decision making based on instant sales and user feedback updates. The development of mixed workloads combining relational, text and graph data is also in progress.

BigFUN

BigFUN [PCW15; Pir] is based on a social network use case with synthetic semi-structured data in JSON format. The benchmark focuses exclusively on the micro-operation level and consists of queries with various operations such as simple retrievals, range scans, aggregations, and joins, as well as inserts and updates. The current implementation supports AsterixDB, MongoDB and Hive.

CloudRank-D

CloudRank-D [Luo+12b; ICT13a] is a benchmark suite for evaluating the performance of cloud computing systems running Big Data applications. The suite consists of 13 representative data analysis tools, which are designed to cover a diverse set of workload data and computation characteristics (i.e., data semantics, data models, data sizes, and the ratio of input data size to output data size). Table A.6 lists the representative applications along with their workload types. The benchmark suite reports two complementary metrics: data processed per second (DPS) and data processed per Joule (DPJ). DPS is defined as the total amount of data input of all jobs divided by the total running time from the submission time of the first job to the end time of the last job. DPJ is defined as the total amount of data input of all jobs divided by the total energy consumed from the submission time of the first job to the end time of the last job.
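
Restated as formulas (the symbol names below are chosen here for readability and do not come from [Luo+12b]):

\[
\mathrm{DPS} = \frac{\sum_{i=1}^{n} \mathrm{InputData}_i}{T_{\mathrm{end}}^{\mathrm{last\,job}} - T_{\mathrm{submit}}^{\mathrm{first\,job}}},
\qquad
\mathrm{DPJ} = \frac{\sum_{i=1}^{n} \mathrm{InputData}_i}{E_{\mathrm{total}}}
\]

where \(E_{\mathrm{total}}\) denotes the energy consumed from the submission of the first job to the end of the last job.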

CloudSuite

CloudSuite [Fer+12b] is a benchmark suite consisting of both emerging scale-out workloads and traditional benchmarks. The goal of the benchmark suite is to analyze and identify key inefficiencies in the processor’s core micro-architecture and memory system organization when running today’s cloud workloads. Table A.7 summarizes the workload categories as well as the applications that were actually benchmarked.

MRBench

MRBench [Kim+08] is a benchmark evaluating the processing of business oriented queries and concurrent data modifications on MapReduce systems. It implements

Basic Operations: (1) Sort, (2) WordCount, (3) Grep
Classification: (4) Naive Bayes, (5) Support Vector Machine
Clustering: (6) K-means
Recommendation: (7) Item-based collaborative filtering
Association rule mining: (8) Frequent pattern growth
Sequence learning: (9) Hidden Markov
Data warehouse operations: (10) Grep select, (11) Ranking select, (12) User-visits aggregation, (13) User-visits ranking join
Tab. A.6.: Representative applications in CloudRank-D; Adopted from [Luo+12b]

Data Serving: Cassandra 0.7.3 with YCSB 0.1.3
MapReduce: Bayesian classification from Mahout 0.4 lib
Media Streaming: Darwin Streaming Server 6.0.3 with Faban Driver
SAT Solver: Klee SAT Solver
Web Frontend: Olio, Nginx and CloudStone
Web Search: Nutch 1.2/Lucene 3.0.1
Web Backend: MySQL 5.5.9
Traditional Benchmarks: PARSEC 2.1, SPEC CINT2006, SPECweb09, TPC-C, TPC-E
Tab. A.7.: Applications in CloudSuite; Adopted from [Fer+12b]

the 22 queries of the TPC-H decision support benchmark directly as map and reduce operations. MRBench supports three configuration options: the database size, the number of map tasks, and the number of reduce tasks.

MapReduce Benchmark Suite (MRBS)

MRBS [SSB12a; SSB12c; MRB13] is a comprehensive benchmark suite for evaluating the performance of MapReduce systems. It covers five application domains, listed in Table A.8. The high-level metrics reported by the benchmark are client request latency, throughput and cost. Additionally, low-level metrics such as the size of read/written data and the throughput of MR jobs and tasks are also reported. MRBS implements a service that provides different types of operations, which can be requested by clients. Two execution modes are supported: interactive mode and batch mode. The

benchmark run consists of three phases dynamically configurable by the end user: a warm-up phase, a run-time phase, and a slow-down phase. The user can specify the number of runs and the different aspects of load: dataload and workload. The dataload is characterized by the size and the nature of the data sets used as inputs for a benchmark, and the workload is characterized by the number of concurrent clients and the distribution of the request types.

Recommendation: Benchmark based on a real movie database
Business Intelligence: TPC-H
Bioinformatics: DNA sequencing
Text Processing: Search patterns, word occurrence and sorting on randomly generated text files
Data Mining: Classifying newsgroup documents into categories, canopy clustering operations
Tab. A.8.: Representative Applications in MRBS

Pavlo’s Benchmark (CALDA)

Pavlo's Benchmark [Pav+09a; Sto+10; And11] consists of five tasks defined as SQL queries, among which is the original MapReduce Grep task, which is representative of most real user MapReduce programs. The benchmark was developed specifically to compare the capabilities of Hadoop with those of commercial parallel Relational Database Management Systems (RDBMS). Although the reported results do not favor the Hadoop platform, the authors remain optimistic that MapReduce systems will coexist with traditional database systems. Table A.9 summarizes all types of tasks in Pavlo's Benchmark and their corresponding SQL statements.

PRIMEBALL

PRIMEBALL [Fer+13b] is a novel and unified benchmark specification for comparing the parallel processing frameworks in the context of Big Data applications hosted in the cloud. It is implementation- and technology-agnostic, using a fictional news hub called New Pork Times, based on a popular real-life news site. Included are various use-case scenarios made of both queries and data-intensive batch processing. The raw data set is fetched by a crawler and consists of both structured XML and binary audio and video files, which can be scaled by a pre-defined scale factor (SF) to 1 PB.

The benchmark specifies two main metrics: throughput and price performance. The throughput metric reports the total time required to execute a particular scenario. The price performance metric is equal to the throughput divided by the price, where the price is defined by the specific cloud provider and depends on multiple factors. Additionally, the benchmark specifies several relevant properties characterizing

(1) General task:
SELECT * FROM Data WHERE field LIKE '%XYZ%';
(2) PageRank/Selection Task:
SELECT pageURL, pageRank FROM Rankings WHERE pageRank > X;
(3) Web Log/Aggregation Task:
SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP;
SELECT SUBSTR(sourceIP,1,7), SUM(adRevenue) FROM UserVisits GROUP BY SUBSTR(sourceIP, 1, 7);
(4) Join Task:
SELECT INTO Temp sourceIP, AVG(pageRank) as avgPageRank, SUM(adRevenue) as totalRevenue FROM Rankings AS R, UserVisits AS UV WHERE R.pageURL = UV.destURL AND UV.visitDate BETWEEN Date('2000-01-15') AND Date('2000-01-22') GROUP BY UV.sourceIP;
SELECT sourceIP, totalRevenue, avgPageRank FROM Temp ORDER BY totalRevenue DESC LIMIT 1;
(5) UDF Aggregation Task:
SELECT INTO Temp F(contents) FROM Documents;
SELECT url, SUM(value) FROM Temp GROUP BY url;
Tab. A.9.: Pavlo's Benchmark Queries

cloud platforms, such as 1) scale-up; 2) elastic speedup; 3) horizontal scalability; 4) latency; 5) durability; 6) consistency and version handling; 7) availability; 8) concurrency and other data and information retrieval properties.

TPC-H

TPC-H [18ah] is the de facto standard benchmark for testing the data warehouse capability of a system. Instead of representing the activity of any particular business segment, TPC-H models any industry that manages, sells, or distributes products worldwide (e.g., car rental, food distribution, parts, suppliers, etc.). The benchmark is technology-agnostic. The purpose of TPC-H is to reduce the diversity of operations found in a typical data warehouse application, while retaining the application's essential performance characteristics, namely the level of system utilization and the complexity of operations. The core of the benchmark is comprised of a set of 22 business queries designed to exercise system functionalities in a manner representative of complex decision support applications. These queries have been given a realistic context, portraying the activity of a wholesale supplier, to help the audience relate intuitively to the components of the benchmark. It also contains two refresh

functions (RF1, RF2) modeling the loading of new sales information (RF1) and the purging of stale or obsolete sales information (RF2) from the database. The exact definition of the workload can be found in the latest specification [18ah]. TPC-H was adopted very early in the development of Hive [Apa13b; Apa15c] and Pig [Apa12], and implementations of the benchmark are available for both. In order to publish a TPC-H compliant performance result, the system needs to support full ACID (Atomicity, Consistency, Isolation, and Durability) properties.

TPC-DS v1

TPC-DS [18af] is a decision support benchmark that models several generally applicable aspects of a decision support system, including queries and data maintenance. It takes the strengths of TPC-H and of the now obsolete TPC-R and fuses them into a modern DSS benchmark. The main focus areas are:

• Multiple snowflake schemas with shared dimensions

• 24 tables with an average of 18 columns

• 99 distinct SQL-99 queries with random substitutions

• More representative skewed database content

• Sub-linear scaling of non-fact tables

• Ad-hoc, reporting, iterative and extraction queries

• ETL-like data maintenance

While TPC-DS may be applied to any industry that must transform operational and external data into business intelligence, the workload has been given a realistic context. It models the decision support tasks of a typical retail product supplier. The goal of selecting a retail business model is to assist the reader in relating intuitively to the components of the benchmark, without tracking that industry segment so tightly as to minimize the relevance of the benchmark. The schema, an aggregate of multiple star schemas, contains essential business information, such as detailed customer, order, and product data for the classic sales channels: store, catalog, and Internet. Wherever possible, real-world data are used to populate each table with common data skews, such as seasonal sales and frequent names. In order to realistically scale the benchmark from small to large datasets, fact tables scale linearly while dimensions scale sub-linearly. The benchmark abstracts the diversity of operations found in an information analysis application, while retaining essential performance characteristics. As it is necessary to execute a great number of queries and data transformations to completely manage any business analysis environment, TPC-DS defines 99 distinct SQL-99 (with OLAP amendment) queries and twelve data maintenance operations covering typical DSS query types such as ad-hoc, reporting, iterative (drill down/up), and extraction queries, as well as periodic refreshes of the database. The metric is constructed in a way that favors systems that can

overlap query execution with updates (trickle updates). As with TPC-H, full ACID characteristics are required. An implementation with more than 50 sample queries is available for Hive [Apa15c].

TPC-DS v2

TPC-DS V2 [PRJ17] is based on TPC-DS V1, but specifically addresses the domain of SQL-based big data systems. The main modifications are in the areas of database load, ACID, incremental data integration, queries, metric and execution rules. Some queries were modified together with the execution rules and metrics to emphasize the performance characteristics of big data systems.

LinkBench

LinkBench [Arm+13] is a benchmark, developed by Facebook, that uses a synthetic social graph to emulate the social graph workload on top of databases such as MySQL and MongoDB.

ShenZhen transportation system (SZTS)

ShenZhen Transportation System (SZTS) [Xio+16] is a big data Hadoop benchmark suite comprised of real-life transportation analysis applications with real-life input data sets from Shenzhen in China. SZTS targets a real-life application domain, unlike other Hadoop benchmark suites (e.g., HiBench or CloudRank-D), which consist of generic algorithms with synthetic inputs. The workloads of SZTS are characterized across several layers, namely the micro-architecture level, the operating system (OS) level, and the job level.

TPCx-V

The TPCx-V benchmark [Bon+15] measures the performance of a server running virtualized databases. It simulates a mix of Online Transaction Processing (OLTP) and Decision Support System (DSS) workloads in a cloud computing environment.

A.4 Streaming Benchmarks

Linear Road

Description: Linear Road [Ara+04] is a simulation of a large metropolitan city which is 100 miles wide and long and consists of 10 parallel expressways. The task of the implemented system is to determine variable tolling against traffic congestion

in urban areas. The idea is to issue higher tolls during peak times or in case of an accident in order to lower traffic congestion [Jep+18].

Benchmark type and domain: The Linear Road Benchmark is an open source benchmark to compare the performance of Stream Data Management Systems (SDMS) between themselves and other systems like relational databases. SDMS process streaming data to generate real-time query results by executing historical and continuous queries.

Workload: The benchmark maintains statistics about the number of vehicles and the average speed on each segment of each expressway for every minute. It executes continuous queries while detecting accidents and notifying the other vehicles about them. At the same time, the dynamic tolls, which depend on segment statistics and nearby accidents, are calculated and assessed, and all assessed tolls are tracked. The toll must be calculated every time a vehicle sends a position report in a new segment or the driver needs a notification. The required response time is 5 seconds between the dispatch of the position report and the time the toll notification is sent. Furthermore, the system must process the historical queries. For account balance queries, it must return the sum of all tolls within a response time of 5 seconds and with an accuracy of 60 seconds prior to the time the request is issued. For daily expenditure queries, it must return the sum of tolls spent on an expressway on a given day within the last 10 weeks.

Data type and Generation: The input data is stored in flat files and generated through a simulated traffic model by the traffic simulator MITSIM. It generates a set of vehicles, each of which completes a vehicle trip with a focus on the downtown area. The input stream data are tuples which are split into position reports and historical query requests. These historical query requests can be for Account Balances, Daily Expenditures and Travel Time Estimations. The position reports are tuples which contain an integer timestamp, the vehicle identifier and some information about the vehicle trip. Each position report has a probability of 1% of additionally containing a historical query request, which is in 50% of the cases an account balance request, in 10% of the cases a daily tolls request and in 40% of the cases a travel time request. Systems must always maintain all assessed tolls in order to answer historical query requests. Furthermore, the historical data generator constructs two files which contain the toll history for the previous 10 weeks. The first file stores tuples which contain information about the vehicle, day, expressway and tolls. The second file is a segment history file which contains information about each segment, such as the number of vehicles, the toll and the average speed. 10 weeks of tolling history must be available.
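
As an illustration of the described tuple mix, the following is a minimal Python sketch of a position-report generator; the field names are assumptions and do not reflect the MITSIM output format.

    import random

    def make_position_report(timestamp, vehicle_id):
        # Position report: integer timestamp, vehicle identifier and trip
        # information (trip details abstracted away in this sketch).
        report = {"type": "position", "time": timestamp, "vid": vehicle_id}
        # 1% of position reports additionally carry a historical query request:
        # 50% account balance, 10% daily expenditure, 40% travel time estimation.
        if random.random() < 0.01:
            r = random.random()
            if r < 0.5:
                report["query"] = "account_balance"
            elif r < 0.6:
                report["query"] = "daily_expenditure"
            else:
                report["query"] = "travel_time"
        return report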

Metrics: The performance is measured through the L-rating, whereby L means L expressways worth of input. It measures the supported query load, represented by the historical and continuous queries, which the stream processing system can process while the constraints of response time and accuracy are still fulfilled. To determine the performance, the benchmark will be run with increasing scale factors, until there is one for which the requirements can no longer be met.

Implementation and technology stack: To implement Linear Road, it is necessary to generate 10 weeks of historical data with the historical data generator and to generate L flat files with the traffic simulator, each containing 3 hours of traffic data and historical query requests from a single expressway. Then the system under test must generate the output files which contain the responses to the queries. Finally, the validation tool is used to check the response times and the accuracy of the generated output.

StreamBench

Description: StreamBench [Lu+14] is a benchmark suite containing seven different micro-benchmark programs: Identity, Sample, Projection, Grep, WordCount, DistinctCount and Statistics. The simulated applications use data from the real-time web log processing and network traffic monitoring domains.

Benchmark type and domain: StreamBench contains seven different micro-benchmark programs and consists of four workload suites targeting different technical charac- teristics of the streaming systems.

Workload: The benchmark consists of four different workload suites. The Performance workload suite uses all seven programs and reads and processes the data from the messaging system as fast as it can. The Multi-recipient performance workload suite also uses the seven benchmarks, but on only one dataset. It defines three different cluster configurations, called reception ability, as the proportion of nodes in the whole cluster that receive input data. The Fault tolerance workload suite also includes the seven micro-benchmarks and considers the failure of a single cluster node, which is intentionally failed in the middle of the execution. The Durability workload suite contains only the WordCount program and two data scale sizes (factors).

Data type and Generation: The benchmark suite uses different data scale sizes generated from two datasets. The AOL Search Data set is a collection of real query log data from real users, whereas the CAIDA Anonymized Internet Traces Dataset consists of statistical information from hour-long internet packet traces. The datasets cover both text and numerical data, but have different numbers of attributes and records.

Metrics: There are different metrics for the different workload suites. The main metrics are throughput (in bytes processed per second) and latency (the average time span from the arrival of a record until the record is processed). The throughput penalty factor (TPF) and latency penalty factor (LPF) are both defined and reported in the fault-tolerance workload suite.

Implementation and technology stack: The benchmark suite is implemented and evaluated with the Apache Storm and Apache Spark Streaming frameworks. Apache Kafka is used as the messaging system.

Yahoo Streaming Benchmark (YSB)

Description: The YSB [Chi+16; Yah18b; Yah18c] is a simple advertisement application benchmark. There are a number of advertising campaigns, and a number of

advertisements for each campaign. The benchmark reads events in JSON format, processes them and stores them in a key-value store. These steps attempt to probe some common operations performed on data streams.

Benchmark type and domain: The Yahoo Streaming Benchmark is a streaming application benchmark simulating an advertisement analytics pipeline.

Workload: The analytics pipeline processes a number of advertising campaigns, and a number of advertisements for each campaign. The job of the benchmark is to read various JSON events from Kafka, identify the relevant events, and store a windowed count of relevant events per campaign into Redis. The benchmark simulates common operations performed on data streams (a simplified sketch follows the list):

1. Read an event from Kafka.

2. Deserialize the JSON string.

3. Filter out irrelevant events (based on event-type field).

4. Take a projection of the relevant fields (ad-id and event-time).

5. Join each event by ad-id with its associated campaign-id. This information is stored in Redis.

6. Take a windowed count of events per campaign and store each window in Redis along with a timestamp of the time the window was last updated in Redis. This step must be able to handle late events.
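
The following framework-agnostic Python sketch walks through steps 2-6; the Kafka consumer and the Redis store are replaced by in-memory stand-ins, the window size is an assumed value, and treating "view" as the relevant event type is an assumption for illustration.

    import json
    from collections import defaultdict

    WINDOW_MS = 10_000                  # window size in ms (assumed, configurable in YSB)
    ad_to_campaign = {}                 # stand-in for the Redis ad-id -> campaign-id mapping
    window_counts = defaultdict(int)    # stand-in for the Redis windowed counts

    def process(raw_event):
        event = json.loads(raw_event)                       # 2. deserialize the JSON string
        if event["event_type"] != "view":                   # 3. filter out irrelevant events
            return
        ad_id = event["ad_id"]                              # 4. project the relevant fields
        event_time = int(event["event_time"])               #    (assumed: epoch milliseconds)
        campaign_id = ad_to_campaign.get(ad_id)             # 5. join ad-id with campaign-id
        if campaign_id is None:
            return
        window_start = event_time - (event_time % WINDOW_MS)
        window_counts[(campaign_id, window_start)] += 1     # 6. windowed count per campaign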

Data type and Generation: The data schema consists of seven attributes and is stored in JSON format: (1) user-id: UUID; (2) page-id: UUID; (3) ad-id: UUID; (4) ad-type: String in banner, modal, sponsored-search, mail, mobile; (5) event-type: String in view, click, purchase; (6) event-time: Timestamp; (7) ip-address: String.

Metrics: The metrics reported by the benchmark are: (1) latency, as window.final-event-latency = (window.last-updated-at - window.timestamp) - window.duration; and (2) aggregate system throughput.

Implementation and technology stack: The YSB benchmark is implemented using Apache Storm, Spark, Flink, Apex, Kafka and Redis.

RIoTBench

RIoTBench [SS16b; SCS17; Lab18] is a Real-time IoT Benchmark suite targeting Distributed Stream Processing Systems (DSPS). It consists of 27 micro-benchmarks and 4 real-workload streaming application benchmarks constructed using the micro-benchmarks. The micro-benchmarks represent common processing and analytics tasks performed over real-time data streams, such as parsing, filtering, statistical analytics, predictive analytics, pattern detection and I/O operations. The four application

benchmarks represent the main IoT categories: extract-transform-load (ETL) and archival, prediction and pattern detection, classification and notification, and summarization and visualization. The main metrics are throughput (messages per second), latency (processing time), jitter (deviation of the output throughput from the ideal throughput), and CPU and memory utilization.

AIM Benchmark

Description: The AIM benchmark [Kip+17] simulates a challenging use case: storing and analyzing the billing data of subscribers while making marketing campaigns immediately available. The task is to process single events, like phone calls or messages, and to perform real-time analytics, represented by seven analytical queries [Bra17]. The workload is well-defined and fits squarely into the class of analytics on fast data.

Benchmark type and domain: The AIM Benchmark, developed by Huawei Technologies Co. Ltd., applies a use case from the telecommunications industry based on a real-life example of analytics on stateful streams.

Workload: The workload of the AIM Benchmark is based on an Analytics Matrix and is divided into two parts. First, the system must process the stateful streaming workload represented by the events which generate sales and marketing information through phone calls. This is called Event Stream Processing (ESP) and is divided into two phases. The ESP should update the Analytics Matrix immediately after an event arrives. By default there are 10,000 events per second, and each consists of a subscriber id (to identify the subscriber) and call-dependent details such as the call's duration, cost and type. In the second phase, the updated Analytics Matrix is made available for analytical queries, and the updated record and event are checked against a set of triggers [Bra17]. There are 7 standard queries, which are continuous queries issued by one or multiple clients and answered using the Analytics Matrix. Each query is executed with the same probability. The state of the Matrix should not be older than 1 second. For example, one query selects all local and long-distance calls per region for the category "eat".

Data type and Generation: The main part of the data for the benchmark is the Analytics Matrix. This Matrix contains aggregated data for each subscriber, identified by the subscriber id. Each row of the matrix represents a subscriber. The columns represent the aggregated data for each combination of an aggregation function (such as sum, min, max), an aggregation window (such as the day), and several event attributes. By default, the Matrix consists of 546 columns and 10 million rows. Furthermore, there are links into the dimension tables through foreign keys. The dimension tables contain information about the structure of the data in the Analytics Matrix, such as Region Info and Subscription Type [Bra17]. There is an open-source AIM schema generator which can be used to generate the aggregate structure [Pro18].
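
A toy Python sketch of the ESP update step described above: each event updates the per-subscriber aggregates of the Analytics Matrix. The event structure and the three aggregate columns shown are simplified assumptions (the real matrix has 546 columns by default).

    from collections import defaultdict

    # One row of the Analytics Matrix per subscriber; only three aggregate
    # columns are sketched here.
    analytics_matrix = defaultdict(lambda: {"calls_today": 0,
                                            "cost_sum_today": 0.0,
                                            "duration_max_today": 0})

    def process_event(subscriber_id, duration, cost):
        row = analytics_matrix[subscriber_id]
        row["calls_today"] += 1                              # count-style aggregate
        row["cost_sum_today"] += cost                        # sum over the day window
        row["duration_max_today"] = max(row["duration_max_today"], duration)  # max aggregate
        # After the update, the row is visible to the analytical queries and is
        # checked against the campaign triggers (not sketched here).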

Metrics: The performance of the implemented system is measured as the query throughput dependent on the available number of threads and can be divided into overall, read, and write performance. The read performance

focuses on the analytical queries, while the write performance focuses on the event processing, measured as the response time, with and without concurrent writes, for the events per second. To test the performance, the number of clients and the number of maintained aggregates can be varied.

Implementation and technology stack: The AIM Benchmark can be implemented on several types of systems, such as main-memory databases (MMDBs) like HyPer or Tell, modern streaming systems like Flink, and hand-crafted systems. There is a Tell implementation available on GitHub [Pro18]. In HyPer, the Analytics Matrix can be implemented as a regular database table and the real-time analytics as SQL queries on this table. The hand-crafted AIM system is designed specifically for the AIM Benchmark, so it achieves the best performance on this workload and can serve as a performance reference for other implementations.

Senska

Description: Senska [Hes+17] is an enterprise streaming benchmark, currently under development, for comparing stream processing architectures in enterprise scenarios. The benchmark was suggested in 2017 by researchers from the Hasso Plattner Institute at the University of Potsdam.

Benchmark type and domain: Senska is limited to its domain, industrial manufacturing; it is focused on one application field and cannot be used for all domains. After the development is completed, Senska should also provide a toolkit in order to offer a complete solution for the domain's problems.

Workload: Senska follows the domain-specific criteria defined by Gray [Gra92]. While keeping relevance, portability, scalability, and simplicity in focus, Senska has three main components: the data feeder, the system under test (SUT) and the result validator. As these components have to interact, there are additional components which are responsible for the communication between the main components. The main queries handle nine main testing aspects: windowing, transformation, merging (union), filtering (selection/projection), sorting/ranking, correlation/enrichment (join), machine learning, and combination with DBMS data. In the first described query set, seven of these nine aspects are already covered within five use cases. Only merging (union) and sorting/ranking are not yet covered by the actual query definitions.

Data type and Generation: Senska takes as input data a CSV file which should contain representative data for a manufacturing context (e.g., sensor data). These data are handled by the data feeder, which puts the data into the SUT via communication channels. Currently, only Apache Kafka can be used as the data feeder; therefore the SUT has to be able to communicate with this message broker. The SUT is split into two parts, the benchmark query implementation and the DBMS. The benchmark query implementation executes the actual benchmark queries, while the DBMS is used to feed the benchmark with historical data whenever needed. This behavior is unique to the Senska benchmark, as the other benchmarks only use sensor data for their results.

Metrics: During the execution of the benchmark queries, the results are sent to the result validator. This component ensures the correctness of the query implementation and calculates the benchmark metrics. Unfortunately, there is no information yet about which metrics the benchmark will report after its release.

Implementation and technology stack: There is no output data available yet. As the developers wanted to have their benchmark as close to real-world scenarios as possible and wanted to implement a benchmark for a combination of streaming and transactional data, they defined a core set of queries which should be able to handle realistic data sets in an enterprise context.

A.5 Machine and Deep Learning Benchmarks

Sanzu

Description: Sanzu [WBR17] is a machine learning benchmark developed by the Big Data System and Analytics Lab at the University of New Brunswick. The benchmark suite is designed to capture the complete data science process, from data collection through data wrangling to model building.

Benchmark type and domain: The benchmark provides a set of Python programs, a synthetic dataset generator and real-world datasets that help evaluate the functionality and scalability of five popular data science platforms, namely Anaconda, R, Dask, PostgreSQL with MADLib, and PySpark.

Workload: It contains a micro benchmark and a macro benchmark. The micro benchmark consists of six workloads: Basic File I/O, Data Wrangling, Descriptive Statistics, Distribution and Inferential Statistics, Time Series, and Machine Learning Analyses. The macro benchmark contains two applications that are modelled on real-world use cases, namely Smart Grid Analytics and Sport Analytics, both of which involve reading data from files, data wrangling and model building.

Data type and Generation: In the micro benchmark, datasets are produced by a synthetic data generator. Each dataset is generated for a given scale factor, ranging from 1 million to 100 million rows. The data types and schema include time series, string, integer and float columns, including sequential time series. Some of the columns are drawn uniformly from a list, while others are drawn from a normal or an exponential distribution. For the macro benchmark, real-world data sources are used.

Metric: The metric used to evaluate the performance and functionality of the five popular data science platforms is the execution time measured for the completion of a set of tasks. The scale factors correspond to 1 million, 10 million, and 100 million data rows per table.

Implementation and technology stack: In order to run the Sanzu benchmark, one should first install the five platforms, namely Anaconda Python, R, Dask, PostgreSQL with MADLib and Spark. The following step is generating the datasets by running a shell

script file (create-dataset.sh). Next, one can run the tasks in the Python console; the results will be stored in the directory under benchmark/benchmark.csv.

Penn Machine Learning Benchmark (PMLB)

Description: The Penn ML Benchmark (PMLB) [Ols+17] is a new, evolving set of benchmark standards for comparing and evaluating datasets from many of the most-used ML benchmark suites.

Benchmark type and domain: PMLB is an ML benchmark that evaluates 13 supervised ML classification methods from Scikit-Learn [Ped+11].

Workload: The main part of the workload is to compare the datasets in PMLB, which are clustered based on their meta-features, and to analyze the datasets based on ML performance, which identifies which datasets can be solved with high or low accuracy.

Data type and Generation: PMLB is initialized with 165 real-world, simulated and toy benchmark datasets, and the performance of 13 standard statistical methods from Scikit-Learn [Ped+11] is evaluated over the full set of PMLB datasets.

Metrics: ML methods are evaluated using balanced accuracy as the scoring metric. This is a normalized version of accuracy that accounts for class imbalance by calculating accuracy on a per-class basis and then averaging the per-class accuracies.
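
A small, self-contained Python sketch of the balanced accuracy computation as described above; scikit-learn also offers an equivalent built-in scorer.

    from collections import defaultdict

    def balanced_accuracy(y_true, y_pred):
        # Per-class accuracy = correctly predicted samples of a class / samples of that class.
        correct = defaultdict(int)
        total = defaultdict(int)
        for t, p in zip(y_true, y_pred):
            total[t] += 1
            if t == p:
                correct[t] += 1
        per_class = [correct[c] / total[c] for c in total]
        return sum(per_class) / len(per_class)   # average of the per-class accuracies

    # Example: class imbalance penalizes a majority-class guesser.
    print(balanced_accuracy([0, 0, 0, 1], [0, 0, 0, 0]))   # 0.5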

Implementation and technology stack: When each ML method is evaluated, the features of the datasets are scaled by subtracting the mean, and the datasets are grouped into 5 clusters based on their meta-features. To find the best parameters for each ML method on each dataset, a comprehensive grid search over each ML method's parameters is performed using 10-fold cross-validation. All clusters are compared in more detail according to the mean values of the dataset meta-features in each cluster. Using the spectral biclustering algorithm of Kluger et al. [Klu+03], the 13 ML models and 165 datasets are biclustered according to the balanced accuracy of the models with their best parameter settings.
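
A hedged sketch of the per-dataset model-selection step using scikit-learn's grid search with 10-fold cross-validation and balanced accuracy as the scorer; the classifier, the parameter grid and the synthetic data are arbitrary examples, not the grids or datasets used in PMLB.

    from sklearn.model_selection import GridSearchCV
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    param_grid = {"n_estimators": [10, 100], "max_depth": [None, 5]}   # illustrative grid
    search = GridSearchCV(RandomForestClassifier(random_state=0),
                          param_grid,
                          scoring="balanced_accuracy",   # scoring metric used in PMLB
                          cv=10)                         # 10-fold cross-validation
    search.fit(X, y)
    print(search.best_params_, search.best_score_)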

OpenML benchmark suites [Bis+17]

Most of the benchmarks (such as PMLB, Keel, etc.) do not provide APIs. In order to enable researchers to upload or download data repositories in standardized formats into popular Machine Learning (ML) libraries and to compare the results, the online platform OpenML is provided. It can be used to share all components of an ML experiment, including code, datasets, tasks, experiment parameters and evaluation results. OpenML works with the concepts of tasks and datasets. Datasets correspond to datasets as defined above. A task consists of a dataset, an ML task such as classification, and an evaluation method such as cross-validation. When researchers plan experiments, they can explore the datasets that are available in OpenML and find a suitable learning task for their experiments. The datasets can be easily extended and

OpenML benchmark suites can be created via the API. OpenML 100 is a standard benchmark suite, providing 100 high-quality datasets selected from the thousands of datasets available on the OpenML web portal. The selection of datasets for this benchmarking suite requires the number of observations to be between 500 and 100,000, at most 5,000 features, at least two classes, and a minority-to-majority class ratio above 0.05. Datasets that do not satisfy these requirements are excluded from OpenML 100.
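
The inclusion criteria for OpenML 100 can be expressed as a simple filter; the function and argument names below are illustrative and do not correspond to the OpenML API.

    def qualifies_for_openml100(n_observations, n_features, class_counts):
        # class_counts: number of instances per class, e.g. {"yes": 700, "no": 300}
        if not (500 <= n_observations <= 100_000):
            return False
        if n_features > 5_000:
            return False
        if len(class_counts) < 2:
            return False
        minority = min(class_counts.values())
        majority = max(class_counts.values())
        return minority / majority > 0.05   # minority/majority class ratio above 0.05

    print(qualifies_for_openml100(1000, 20, {"a": 950, "b": 50}))   # True (ratio ~ 0.053)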

MLbench

The MLBench benchmark [Liu+18], inspired by Kaggle, consists of datasets with a best-effort baseline of both feature engineering and machine learning models. It uses a novel metric based on the notion of "quality tolerance" that measures the performance gap between a given machine learning system and top-ranked Kaggle performers. Currently, 7 binary classification datasets, 5 multi-class classification datasets and 5 regression datasets are available.

DeepMark

Description: DeepMark [Chi18], also known as convnet-benchmarks, is an open-source framework for benchmarking a collection of Convolutional Neural Networks. Convolutional Neural Networks are a special kind of neural network specifically designed for processing data that has a known grid-like topology [Dee18b], for instance time-series data, which can be seen as a simple one-dimensional grid (1-D grid), or image data, which forms a more complex two-dimensional grid (2-D grid) of pixels. For processing such grid-like data, convolutional networks use a mathematical operation called convolution.

Benchmark type and domain: DeepMark is a performance benchmark that belongs to the machine learning domain. It covers four major use cases: image, video, audio and text processing.

Workload: For every use case that DeepMark covers, a different workload is chosen. For image training, the ImageNet data set is used. For video recognition, the Sports-1M data set is used.

Data type and Generation: Most of the data are publicly available data sets, like ImageNet, that use NCHW as the data layout. ImageNet consists of 14,197,122 images [Ove18] and Sports-1M provides 1,133,158 video URLs [Spo18].

Metrics: The time measurement, mostly written in Python or Bash scripts, tracks for each defined network the epoch time, i.e., the round-trip time for a single epoch of training. The maximum batch size is also determined according to the memory consumption of each framework.
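
A minimal Python sketch of the epoch-time measurement; train_one_epoch is a placeholder for the framework-specific training step and is an assumption of this sketch.

    import time

    def train_one_epoch(model, data):
        # Placeholder: one full forward/backward pass over the training set.
        pass

    def epoch_time(model, data):
        start = time.perf_counter()
        train_one_epoch(model, data)         # one full pass over the training data
        return time.perf_counter() - start   # round-trip time for a single epoch, in seconds

    print(f"epoch time: {epoch_time(None, None):.3f} s")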

Implementation and technology stack: Every application of deep learning models has different neural networks attached to it. For example, Recurrent Neural Networks

are well suited for speech recognition, while Convolutional Neural Networks are especially good at image recognition.

DeepBench

Description: DeepBench [Dee18a] is a benchmark that focuses on one simple question: which hardware performs best when running the very basic and fundamental neural network operations that are important for training a deep neural network? The results of this benchmark are by no means a good indicator of how long it will take to train an entire model, but they can give some insight into possible hardware bottlenecks in deep learning training and inference. DeepBench includes a training and an inference benchmark that implement all the basic operations such as GEMM, convolution and RNN.

Benchmark type and domain: DeepBench falls under the performance benchmark category and belongs to the machine learning domain.

Workloads: The workload differs for each benchmarked operation:

1. Dense Matrix Multiply: Matrices with specified sizes

2. Convolution: Data in NCHW format

3. Recurrent Layers: Networks with set hidden units

4. All Reduce: Data with a set number of float numbers

Data type and Generation: No real data is used for the benchmarks. For all benchmarks, random numbers are generated in a fitting format; for Matrix Multiply, for example, this is a matrix filled with random numbers. The used data is fundamentally very basic and small, which results in fast benchmark times. Data is generated at run-time.

Metrics: The benchmark measures time in milliseconds, FLOPS and bandwidth in GB/s.
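
As an illustration of the dense matrix multiply measurement, the following is a NumPy-based Python sketch (DeepBench itself uses C++ with cuDNN); the 2*M*N*K FLOP count is the standard estimate for a dense GEMM and the matrix sizes are arbitrary.

    import time
    import numpy as np

    M, N, K = 1024, 1024, 1024
    A = np.random.rand(M, K).astype(np.float32)   # random inputs generated at run-time
    B = np.random.rand(K, N).astype(np.float32)

    start = time.perf_counter()
    C = A @ B
    elapsed = time.perf_counter() - start

    flops = 2 * M * N * K                          # multiply-adds of a dense GEMM
    print(f"time: {elapsed * 1e3:.2f} ms, {flops / elapsed / 1e9:.1f} GFLOPS")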

Implementation and technology stack: The main libraries used to implement the operations are NVIDIA's cuDNN and OpenMPI. The code itself is written in C++. Communication-type operations are implemented with MPI. Note that not all hardware is supported by this benchmark.

Fathom

Description: The Fathom [Ado+16] benchmark includes eight state-of-the-art deep learning models and investigates their similarities and differences, their execution time, their performance and the effects of parallel scalability. Fathom has three properties: representativeness, diversity and impact.

Benchmark type and domain: It is a deep learning benchmark that compares state-of-the-art models.

Workloads:

1. Sequence-to-Sequence Translation - This is a recurrent neural network for machine translation which uses a single-layer pipeline of long short-term memory (LSTM) neurons to capture the meaning of a sentence and translate it into another language.

2. End-to-End Memory Networks - Memory networks make it possible to separate the state from the structure of a neural network.

3. Deep Speech - The Deep Speech model is a recurrent fully-connected model. It is a pure deep learning algorithm that uses spectrograms directly as inputs and learns to transcribe phonemes.

4. Variational Autoencoder - Autoencoders are used for feature extraction or data generation.

5. Residual Networks - Additional identity connections across every pair of convolutional layers were added, which allows networks to be more than 150 layers deep.

6. VGG-19 - This is an image classifier with more layers of smaller convolutional filters that are easier to train, which results in a reduced number of parameters.

7. AlexNet - This is an older model which forms the basis of newer advanced models. It introduced dropout as a regularization tool and showed the computational power of GPUs.

8. Deep Reinforcement Learning - A system that learned to win dozens of Atari games, even against human experts.

Data type and Generation: The used data are either language or speech, images or even a game (Atari). The databases are mostly open source or available from already published papers.

Metrics: Fathom measures time and accuracy and compares the workloads depend- ing on their relationship between these two metrics.

Implementation and technology stack: The models are currently implemented in TensorFlow.

DAWNBench

Description: DAWNBench [Col+18] is an end-to-end deep learning benchmark which takes into account whether the system as a whole reaches a high-quality

result, not just the processing time of one mini-batch. It reports both the end-to-end training time needed to achieve a state-of-the-art accuracy and the inference performance at that accuracy. The benchmark is open source and currently under development.

Benchmark type and domain: It is a deep learning type of benchmark.

Workload: DAWNBench offers the possibility to define custom values for the choice of optimizer, batch size, multi-GPU training and stochastic depth. The available options are the Adam optimizer, single-node multi-GPU training and stochastic depth. Adam is an adaptive optimizer for gradient descent. With stochastic depth, entire layers are randomly dropped during training in order to prevent co-adaptation. Three different batch sizes are investigated: 32, 256 and 2048. Different ResNet architectures can be chosen as models: ResNet20, ResNet56 and ResNet164. The benchmark can be run on different hardware platforms: GPUs and CPUs with different kernel sizes.

Data type and Generation: It uses the ImageNet and CIFAR10 datasets as well as question answering on SQuAD.

Metrics: Four metrics are available: 1) training time to a specified validation accuracy; 2) cost; 3) average latency of performing inference on a single item (image or question); and 4) average cost of inference for 10,000 items.
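
A schematic Python sketch of the first metric, training time to a specified validation accuracy; train_one_epoch, validation_accuracy and the target value are placeholders assumed for illustration.

    import time

    TARGET_ACCURACY = 0.93           # example target validation accuracy (placeholder value)

    def train_one_epoch(model):      # placeholder for the framework-specific training pass
        pass

    def validation_accuracy(model):  # placeholder for the evaluation pass
        return 0.0

    def time_to_accuracy(model, max_epochs=200):
        start = time.perf_counter()
        for epoch in range(max_epochs):
            train_one_epoch(model)
            if validation_accuracy(model) >= TARGET_ACCURACY:
                return time.perf_counter() - start   # end-to-end training time metric
        return None                                   # target accuracy never reached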

Implementation and technology stack: The current implementations are in PyTorch and TensorFlow.

MLPerf

Description: The MLPerf [MLP18] effort aims to build a common set of benchmarks that enables the machine learning (ML) field to measure system performance for both training and inference from mobile devices to cloud services.

Benchmark type and domain: The main goals of MLPerf are formulated as follows [MLP18]:

• Accelerate progress in ML via fair and useful measurement.

• Enable fair comparison of competing systems yet encourage innovation to improve the state-of-the-art of ML.

• Keep benchmarking effort affordable so all can participate.

• Serve both the commercial and research communities.

• Enforce replicability to ensure reliable results.

Workload: The MLPerf set of benchmarks tries to cover the most important areas of machine learning tasks.

Data type and Generation: It aims to collect publicly available data sets and models for the following problems: image classification, object detection, translation, recommendation, reinforcement learning, speech to text, and sentiment analysis.

Metrics: The main performance metric in MLPerf is the wall clock time needed to train a model on the specified dataset to the specified quality target.

Implementation and technology stack: The results from different contributors are scaled by the training time of an unoptimized reference implementation running on an NVIDIA Pascal P100 GPU, in order to calculate and compare the speedups between the different hardware and software implementations.
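
The normalization described above amounts to a simple ratio (the symbol names are chosen here):

\[
\mathrm{speedup} = \frac{T_{\mathrm{reference}}}{T_{\mathrm{submission}}}
\]

where \(T_{\mathrm{reference}}\) is the training time of the unoptimized reference implementation on the NVIDIA Pascal P100 GPU and \(T_{\mathrm{submission}}\) is the measured training time of the submitted system.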

A.6 Graph Benchmarks

Semantic Publishing Benchmark (SPB)

The Semantic Publishing Benchmark v2.0 (SPB) [LDB19a] is an LDBC benchmark for RDF database engines inspired by the Media/Publishing industry, particularly by the BBC's Dynamic Semantic Publishing approach.

Social Network Benchmark

The Social Network Benchmark (SNB) [LDB19b] consists of a data generator that generates a synthetic social network, used in three workloads: Interactive, Business Intelligence and Graph Analytics.

Graphalytics

Graphalytics [Ios+16] is an industrial-grade benchmark for graph analysis platforms such as Giraph. It consists of six core algorithms, standard datasets, synthetic dataset generators, and reference outputs, enabling the objective comparison of graph analysis platforms.

GARDENIA

Gardenia [Xu+17] is a domain-specific benchmark suite consisting of irregular graph workloads. These workloads mimic actual machine learning and big data applications running on modern datacenter accelerators using state-of-the-art optimization techniques.

WatDiv

WatDiv [Alu+14] measures how an RDF data management system performs across a wide spectrum of SPARQL queries with varying structural characteristics and selectivity classes. It consists of two components: the data generator and the query (and template) generator.

Streaming WatDiv

Stream WatDiv [Gao+18] is an open-source benchmark for streaming RDF data management systems. It extends the existing WatDiv benchmark, and includes a streaming data generator, a query generator that can produce a diverse set of SPARQL queries, and a testbed to monitor correctness and latency.

Gmark

gMark [Bag+17] is a domain- and query language-independent framework targeting highly tunable generation of both graph instances and graph query workloads based on user-defined schemas.

A.7 Emerging Benchmarks

IDEBench

IDEBench [Eic+18] measures the performance of interactive data exploration systems over the course of entire user-centered workflows, where queries are built and refined incrementally and executed with delays (think time) between queries, rather than being processed back-to-back. Each workflow comprises a sequence of interactions performed by users: creating a visualization (i.e., the starting query), filtering/selecting, linking, and discarding a visualization.

Blockbench

BlockBench [Din+17] is the first benchmarking framework for private blockchain systems. It serves as a fair means of comparison for different platforms and enables deeper understanding of different system design choices. It comes with both macro benchmark workloads for evaluating the overall performance and micro benchmark workloads for evaluating performance of individual layers.

PolyBench

PolyBench [KRM18] is the first benchmark for heterogeneous analytics systems, especially for polystores, providing a complete evaluation environment. PolyBench is an application-level benchmark that simulates a banking business model. It focuses on banking, since it features heterogeneous analytics and data types. The benchmark suite consists of three main use-cases and two test scenarios. The use-cases operate with structured, semi-structured, and unstructured data types and support relational, stream, array, and graph data processing paradigms. The benchmark is not tied to a specific polystore technology; rather, it is generic and high level. PolyBench provides a benchmark suite with evaluation metrics and workloads, which will eventually lead to better baselines.

A.8 Benchmark Platforms

Benchmarking platforms are systems and tools that facilitate the different phases of executing benchmarks and evaluating benchmark results. These include benchmark planning, server deployment and configuration, execution and queuing, metrics collection, data and results management, data transformation, error detection, and evaluation of results. The evaluation of results can be done either for individual benchmarks or for groups of benchmarks.

Aloja Benchmarking Platform

The ALOJA research project [Pog+14] is an initiative from the Barcelona Supercomputing Center (BSC) to produce a systematic study of Hadoop configuration and deployment options. The project provides an open source platform for executing Big Data frameworks in an integrated manner facilitating benchmark execution and evaluation of results. ALOJA currently provides tools to deploy, provision, configure, and benchmark Hadoop, as well as providing different evaluations for the analysis of results covering both software and hardware configurations of executions.

The project also hosts the largest public Hadoop benchmark repository with over 42,000 executions from HiBench. The online repository can be used as a first step to understand and select benchmarks to execute in the selected deployment and reduce benchmarking efforts by sharing results from different systems. The repository and the tools can be found online [BSC14].

Liquid Benchmarking Platform

Liquid Benchmarking [SC11; Sak+15; She15] is an online cloud-based platform for democratizing the performance evaluation and benchmarking processes. The goals of the project are to:

• Dramatically reduce the time and effort for conducting performance evaluation processes by facilitating the process of sharing the experimental artifacts (software implementations, datasets, computing resources, and benchmarking tasks) and enabling the users to easily create, mashup, and run the experiments with zero installation or configuration efforts.

• Support for searching, comparing, analyzing, and visualizing (using different built-in visualization tools) the results of previous experiments.

• Enable the users to subscribe for notifications about the results of any new running experiments for the domains/benchmarks of their interest.

• Enable social and collaborative features that can turn the performance evaluation and benchmarking process into a living process where different users can run different experiments and share the results of their experiments with other users.

Hobbit Benchmarking Platform

The HOBBIT evaluation platform [NR16] is a distributed FAIR benchmarking platform for the Linked Data lifecycle. This means that the platform was designed to provide means to: (1) benchmark any step of the Linked Data lifecycle, including generation and acquisition, analytics and processing, storage and curation, as well as visualization and services; (2) ensure that benchmarking results can be found, accessed, integrated and reused easily (FAIR principles); (3) benchmark Big Data platforms by being the first distributed benchmarking platform for Linked Data.

SQALPEL

SQALPEL [Ker+19] is a SaaS solution for developing and archiving performance projects. It steps away from fixed benchmark sets; instead, queries taken from the application are turned into a grammar that describes a much larger query space. The system explores this space using a guided random walk to find the discriminative queries. SQALPEL offers the following contributions:

• extending the state of the art in grammar-based database performance evaluation;

• providing a full-fledged database performance repository to share information easily and publicly;

• bootstrapping the platform with a sizable number of OLAP cases and products.

B Cluster Hardware

The following tables describe the experimental hardware used in multiple performance studies (Sections 3.3, 3.4, 3.5, 4.1 and 4.2).

Tab. B.1.: Master Node Specifications

System Information
  Manufacturer: Dell Inc.
  Product Name: PowerEdge T420
  BIOS: 1.5.1, Release Date: 03/08/2013
Memory
  Total Memory: 32 GB
  DIMMs: 10
  Configured Clock Speed: 1333 MHz
  Part Number: M393B5273CH0-YH9
  Size: 4096 MB
CPU
  Model Name: Intel(R) Xeon(R) CPU E5-2420 0 @ 1.90GHz
  Architecture: x86_64
  CPU(s): 24
  On-line CPU(s) list: 0-23
  Thread(s) per core: 2
  Core(s) per socket: 6
  Socket(s): 2
  CPU MHz: 1200
  L1d cache: 32K, L1i cache: 32K, L2 cache: 256K, L3 cache: 15360K
  NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22
  NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23
NIC
  Settings for em1: Speed: 1000Mb/s
  Ethernet controller: Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet PCIe
Storage
  Storage Controller: LSI Logic / Symbios Logic MegaRAID SAS 2008 [Falcon] (rev 03), 08:00.0 RAID bus controller
  Disk 1 / sda1: 931.5 GB, Western Digital WD1003FBYX RE4 - 1TB, SATA3, 3.5 in, 7200RPM, 64MB Cache

Tab. B.2.: Worker Node Specifications

System Information
  Manufacturer: Dell Inc.
  Product Name: PowerEdge T420
  BIOS: 2.1.2, Release Date: 01/20/2014
Memory
  Total Memory: 32 GB
  DIMMs: 4
  Configured Clock Speed: 1600 MHz
  Part Number: M393B2G70DB0-YK0
  Size: 16384 MB
CPU
  Model Name: Intel(R) Xeon(R) CPU E5-2420 v2 @ 2.20GHz
  Architecture: x86_64
  CPU(s): 12
  On-line CPU(s) list: 0-11
  Thread(s) per core: 2
  Core(s) per socket: 6
  Socket(s): 1
  CPU MHz: 2200
  L1d cache: 32K, L1i cache: 32K, L2 cache: 256K, L3 cache: 15360K
  NUMA node0 CPU(s): 0-11
NIC
  Settings for em1: Speed: 1000Mb/s
  Ethernet controller: Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet PCIe
Storage
  Storage Controller: Intel Corporation C600/X79 series chipset SATA RAID Controller (rev 05), 00:1f.2 RAID bus controller
  Disk 1 / sda1: 931.5 GB, Dell - 1TB, SATA3, 3.5 in, 7200RPM, 64MB Cache
  Disk 2 / sdb1: 931.5 GB, WD Blue Desktop WD10EZEX - 1TB, SATA3, 3.5 in, 7200RPM, 64MB Cache
  Disk 3 / sdc1: -
  Disk 4 / sdd1: -


C Evaluating Hadoop Clusters with TPCx-HS

The following tables summarize the resource utilization of the Master and Worker Nodes during the experiments performed in Section 3.3.
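
The metrics themselves were gathered with the monitoring setup described in Section 3.3. Purely as an illustration of the kind of node-level sampling behind such tables, the sketch below collects comparable CPU, memory, network, disk and context-switch figures; it assumes the third-party Python package psutil and is not the tooling used in the experiments.

import time
import psutil  # third-party package; assumed here only for illustration

def sample(interval_s=1.0, samples=10):
    # Periodically compute per-interval deltas for the metric families reported
    # in Tables C.1 and C.2 (CPU, memory, network, disk, context switches).
    cpu_prev = psutil.cpu_times()
    net_prev = psutil.net_io_counters()
    disk_prev = psutil.disk_io_counters()
    ctx_prev = psutil.cpu_stats().ctx_switches
    rows = []
    for _ in range(samples):
        time.sleep(interval_s)
        cpu, net = psutil.cpu_times(), psutil.net_io_counters()
        disk, ctx = psutil.disk_io_counters(), psutil.cpu_stats().ctx_switches
        delta_cpu = sum(cpu) - sum(cpu_prev)
        delta_cpu = delta_cpu if delta_cpu > 0 else 1.0
        rows.append({
            "cpu_user_pct": 100.0 * (cpu.user - cpu_prev.user) / delta_cpu,
            "cpu_system_pct": 100.0 * (cpu.system - cpu_prev.system) / delta_cpu,
            "cpu_iowait_pct": 100.0 * (getattr(cpu, "iowait", 0.0)
                                       - getattr(cpu_prev, "iowait", 0.0)) / delta_cpu,
            "memory_pct": psutil.virtual_memory().percent,
            "kb_tx_per_s": (net.bytes_sent - net_prev.bytes_sent) / 1024.0 / interval_s,
            "kb_rx_per_s": (net.bytes_recv - net_prev.bytes_recv) / 1024.0 / interval_s,
            "kb_read_per_s": (disk.read_bytes - disk_prev.read_bytes) / 1024.0 / interval_s,
            "kb_written_per_s": (disk.write_bytes - disk_prev.write_bytes) / 1024.0 / interval_s,
            "ctx_switches_per_s": (ctx - ctx_prev) / interval_s,
        })
        cpu_prev, net_prev, disk_prev, ctx_prev = cpu, net, disk, ctx
    return rows

if __name__ == "__main__":
    for row in sample():
        print(row)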

Tab. C.1.: Master Node - Resource Utilization

Master Node
Network Type                              Dedicated 1Gbit   Shared 1Gbit
Scale Factor                              100GB             100GB
Avg. CPU Utilization - User %             1.09              0.75
Avg. CPU Utilization - System %           0.83              0.78
Avg. CPU Utilization - IOwait %           0.02              0.01
Memory Utilization %                      48.04             91.41
Avg. Kbytes Transmitted per Second        42.28             18.96
Avg. Kbytes Received per Second           52.66             21.78
Avg. Context Switches per Second          10756.6           10360.07
Avg. Kbytes Read per Second               0.14              0
Avg. Kbytes Written per Second            31.39             25.23
Avg. Read Requests per Second             0.01              0
Avg. Write Requests per Second            3.12              1.74
Avg. I/O Latencies in Milliseconds        0.15              0.17

Tab. C.2.: Worker Node - Resource Utilization

Worker Node
Network Type                              Dedicated 1Gbit   Shared 1Gbit
Scale Factor                              100GB             100GB
Avg. CPU Utilization - User %             56.04             12.44
Avg. CPU Utilization - System %           9.52              3.71
Avg. CPU Utilization - IOwait %           3.61              1.28
Memory Utilization %                      92.31             92.93
Avg. Kbytes Transmitted per Second        31363.57          6548.16
Avg. Kbytes Received per Second           33636.93          7297.27
Avg. Context Switches per Second          20788.16          14233.08
Avg. Kbytes Read per Second               6532.24           1438.23
Avg. Kbytes Written per Second            19010.01          4087.33
Avg. Read Requests per Second             111.75            24.7
Avg. Write Requests per Second            39.5              9.71
Avg. I/O Latencies in Milliseconds        136.87            69.83

D BigBench V2 - New Queries Implementation

This appendix lists the HiveQL implementations of the new BigBench V2 queries presented in Section 4.1.

D.1 BigBench Q05

select
    i_name,
    count(*) as cnt
from
    web_pages,
    product,
    ( select
          js.wl_user_id,
          js.wl_product_id,
          js.wl_webpage_name
      from web_logs
      lateral view
          json_tuple
          (web_logs.line,
           'wl_user_id', 'wl_product_id',
           'wl_webpage_name'
          ) js as
          wl_user_id, wl_product_id,
          wl_webpage_name
      where
          js.wl_user_id is not null
          and js.wl_product_id is not null
    ) logs
where
    logs.wl_webpage_name = w_web_page_name
    and w_web_page_type = 'product look up'
    and logs.wl_product_id = i_product_id
group by i_name
order by cnt desc
limit 10;

D.2 BigBench Q06

drop view if exists browsed;
create view browsed as
select
    wl_product_id as br_id,
    count(*) as br_count
from
    web_pages,
    ( select
          js.wl_user_id,
          js.wl_product_id,
          js.wl_webpage_name
      from web_logs
      lateral view
          json_tuple
          (web_logs.line,
           'wl_user_id', 'wl_product_id',
           'wl_webpage_name'
          ) js as wl_user_id,
          wl_product_id, wl_webpage_name
      where
          js.wl_user_id is not null
          and js.wl_product_id is not null
    ) logs
where
    wl_webpage_name = w_web_page_name
    and w_web_page_type = 'product look up'
group by wl_product_id;

drop view if exists purchased;
create view purchased as
select
    wl_product_id as pu_id,
    count(*) as pu_count
from
    web_pages,
    ( select
          js.wl_user_id,
          js.wl_product_id,
          js.wl_webpage_name
      from web_logs
      lateral view
          json_tuple
          (web_logs.line,
           'wl_user_id', 'wl_product_id',
           'wl_webpage_name'
          ) js as
          wl_user_id, wl_product_id,
          wl_webpage_name
      where
          js.wl_user_id is not null
          and js.wl_product_id is not null
    ) logs
where
    wl_webpage_name = w_web_page_name
    and w_web_page_type = 'add to cart'
group by wl_product_id;

select
    i_product_id,
    (br_count - pu_count) as cnt
from
    browsed, purchased, product
where
    br_id = pu_id
    and br_id = i_product_id
order by cnt desc
limit 5;

D.3 BigBench Q07

drop view if exists sessions;

create view sessions as
select
    uid, item, wptype, tstamp,
    concat(sessionize.uid,
        concat('_', sum(new_session)
            over (partition by sessionize.uid
                  order by sessionize.tstamp))
    ) as session_id
from (
    select
        logs.wl_user_id as uid,
        logs.wl_item_id as item,
        w.w_web_page_type as wptype,
        unix_timestamp(logs.wl_ts) as tstamp,
        case
            when (unix_timestamp(logs.wl_ts)
                  - lag(unix_timestamp(logs.wl_ts))
                    over (partition by logs.wl_user_id
                          order by logs.wl_ts)
                 ) >= 600
            then 1
            else 0
        end as new_session
    from
        web_pages w,
        ( select
              js.wl_user_id, js.wl_item_id,
              js.wl_webpage_name, js.wl_ts
          from web_logs
          lateral view
              json_tuple
              (web_logs.line,
               'wl_user_id', 'wl_item_id',
               'wl_webpage_name', 'wl_timestamp'
              ) js as
              wl_user_id, wl_item_id,
              wl_webpage_name, wl_ts
          where
              js.wl_user_id is not null
              and js.wl_item_id is not null
        ) logs
    where
        logs.wl_webpage_name = w.w_web_page_name
    cluster by uid
) sessionize
cluster by session_id, uid, tstamp;

select
    c.c_user_id,
    c.c_name,
    count(*) as cnt_se
from
    sessions s,
    user c
where
    c.c_user_id = s.uid
group by c_user_id, c_name
having cnt_se > 10
order by cnt_se desc
limit 50;

D.4 BigBench Q09

drop view if exists sessions;

create view sessions as
select
    uid, tstamp,
    concat(sessionize.uid,
        concat('_', sum(new_session)
            over (partition by sessionize.uid
                  order by sessionize.tstamp))
    ) as session_id
from (
    select
        logs.wl_user_id as uid,
        unix_timestamp(logs.wl_ts) as tstamp,
        case
            when (unix_timestamp(logs.wl_ts)
                  - lag(unix_timestamp(logs.wl_ts))
                    over (partition by logs.wl_user_id
                          order by logs.wl_ts)) >= 600
            then 1
            else 0
        end as new_session
    from
        web_logs
        lateral view
            json_tuple
            (web_logs.line, 'wl_user_id',
             'wl_item_id', 'wl_timestamp'
            ) logs as
            wl_user_id, wl_item_id, wl_ts
    where
        logs.wl_user_id is not null
    cluster by uid
) sessionize
cluster by session_id, uid, tstamp;

select
    c_user_id,
    c_name,
    count(*)/24 as cnt
from
    sessions s,
    user c
where
    s.uid = c.c_user_id
group by c.c_user_id, c.c_name
order by cnt desc
limit 10;

D.5 BigBench Q13

drop view if exists sessions;

create view sessions as
select
    uid,
    session_id,
    min(tstamp) as startTime,
    max(tstamp) as endTime
from (
    select
        uid, tstamp,
        concat(sessionize.uid,
            concat('_', sum(new_session)
                over (partition by sessionize.uid
                      order by sessionize.tstamp))
        ) as session_id
    from (
        select
            logs.wl_user_id as uid,
            unix_timestamp(logs.wl_ts) as tstamp,
            case
                when (unix_timestamp(logs.wl_ts)
                      - lag(unix_timestamp(logs.wl_ts))
                        over (partition by logs.wl_user_id
                              order by logs.wl_ts)) >= 600
                then 1
                else 0
            end as new_session
        from (
            select
                js.wl_user_id, js.wl_item_id,
                js.wl_ts
            from web_logs
            lateral view
                json_tuple
                (web_logs.line, 'wl_user_id',
                 'wl_item_id', 'wl_timestamp'
                ) js as
                wl_user_id, wl_item_id, wl_ts
            where
                js.wl_user_id is not null
                and js.wl_item_id is not null
        ) logs
        cluster by uid
    ) sessionize
    cluster by uid, session_id
) temp
group by uid, session_id;

select
    avg(s.endTime - s.startTime)
from
    sessions s;

D.6 BigBench Q14

select
    purchase_year,
    avg(items_per_user)
from
    ( select
          userid as userid,
          year(to_date(dates[size_dates - 1]))
              as purchase_year,
          sum(cart_items) as items_per_user
      from matchpath
      (on
          ( select
                js.wl_user_id,
                js.wl_product_id,
                js.wl_webpage_name,
                js.wl_ts
            from web_logs
            lateral view
                json_tuple
                (web_logs.line,
                 'wl_user_id', 'wl_product_id',
                 'wl_webpage_name', 'wl_timestamp'
                ) js as
                wl_user_id, wl_product_id,
                wl_webpage_name, wl_ts
            where
                js.wl_user_id is not null
          ) n_logs
          partition by wl_user_id
          order by wl_ts
          arg1('A+.B'), arg2('A'),
          arg3(wl_webpage_name in ('webpage#01',
              'webpage#02','webpage#03','webpage#04',
              'webpage#05','webpage#06','webpage#07',
              'webpage#08','webpage#09','webpage#10',
              'webpage#11','webpage#12','webpage#13',
              'webpage#14','webpage#15','webpage#16',
              'webpage#17','webpage#18','webpage#19',
              'webpage#20')),
          arg4('B'),
          arg5(wl_webpage_name in ('webpage#21',
              'webpage#22','webpage#23','webpage#24',
              'webpage#25')),
          arg6('tpath[0].wl_user_id as userid,
              (size(tpath.wl_product_id)-1)
                  as cart_items,
              tpath.wl_ts as dates,
              size(tpath.wl_ts) as size_dates')
      ) group by userid,
                 cart_items,
                 dates[size_dates - 1]
    ) as t
group by purchase_year
order by purchase_year;

D.7 BigBench Q16

select
    wl_webpage_name,
    count(*) as cnt
from
    web_logs
    lateral view
        json_tuple
        (web_logs.line,
         'wl_webpage_name'
        ) logs as wl_webpage_name
where
    wl_webpage_name is not null
group by wl_webpage_name
order by cnt desc
limit 10;

D.8 BigBench Q17

select
    wl_webpage_name,
    count(*) as cnt
from
    web_logs
    lateral view
        json_tuple
        (web_logs.line,
         'wl_webpage_name', 'wl_timestamp'
        ) logs as wl_webpage_name, wl_timestamp
where
    wl_webpage_name is not null
    and to_date(wl_timestamp) >= '2013-02-14'
    and to_date(wl_timestamp) < '2014-02-15'
group by wl_webpage_name
order by cnt desc
limit 10;

D.9 BigBench Q19

select
    day(to_date(wl_timestamp)) as d,
    month(to_date(wl_timestamp)) as m,
    year(to_date(wl_timestamp)) as y,
    count(*) as PageViews
from
    web_logs
    lateral view
        json_tuple
        (web_logs.line,
         'wl_timestamp'
        ) logs as wl_timestamp
group by wl_timestamp
order by PageViews desc
limit 10;

D.10 BigBench Q20

drop view if exists temp1;

create view temp1 as
select
    c.c_user_id as o_user,
    sum(ws.ws_quantity * i.i_price)
        as online_revenue
from
    web_sales ws,
    user c,
    product i
where ws.ws_user_id = c.c_user_id
    and ws.ws_user_id is not null
    and ws.ws_product_id = i.i_product_id
group by c.c_user_id
order by c.c_user_id asc;

drop view if exists temp2;

create view temp2 as
select
    c.c_user_id as i_user,
    sum(ss.ss_quantity * i.i_price)
        as instore_revenue
from
    store_sales ss,
    user c,
    product i
where ss.ss_user_id = c.c_user_id
    and ss.ss_user_id is not null
    and ss.ss_product_id = i.i_product_id
group by c.c_user_id
order by c.c_user_id asc;

drop table if exists q20_results;

create table q20_results (
    online_segment bigint,
    instore_segment bigint
)
row format
    delimited fields terminated by ','
    lines terminated by '\n'
stored as textfile;

insert into table q20_results
select
    sum( case
             when t1.online_revenue
                  >= t2.instore_revenue
             then 1
             else 0 end) as online_revenue,
    sum( case
             when t1.online_revenue
                  < t2.instore_revenue
             then 1
             else 0 end) as instore_revenue
from user c join temp1 t1
    on c.c_user_id = t1.o_user
join temp2 t2
    on c.c_user_id = t2.i_user;

D.11 BigBench Q21

select
    path_to_purchase,
    count(*) as freq
from matchpath
(on
    ( select
          js.wl_user_id,
          js.wl_product_id,
          js.wl_webpage_name,
          js.wl_ts
      from web_logs
      lateral view
          json_tuple
          (web_logs.line,
           'wl_user_id', 'wl_product_id',
           'wl_webpage_name', 'wl_timestamp'
          ) js as
          wl_user_id, wl_product_id,
          wl_webpage_name, wl_ts
      where
          js.wl_user_id is not null
    ) n_logs
    partition by wl_user_id
    order by wl_ts
    arg1('other+.purchase'),
    arg2('other'),
    arg3(wl_webpage_name not in ('webpage#21',
        'webpage#22','webpage#23','webpage#24',
        'webpage#25')),
    arg4('purchase'),
    arg5(wl_webpage_name in ('webpage#21',
        'webpage#22','webpage#23','webpage#24',
        'webpage#25')),
    arg6('tpath.wl_webpage_name
        as path_to_purchase')
)
group by path_to_purchase
order by freq desc
limit 5;

D.12 BigBench Q22

select
    day(to_date(wl_timestamp)) as d,
    month(to_date(wl_timestamp)) as m,
    year(to_date(wl_timestamp)) as y,
    count(distinct wl_user_id) as uniqueVisitors
from
    web_logs
    lateral view
        json_tuple
        (web_logs.line,
         'wl_user_id', 'wl_timestamp'
        ) l as wl_user_id, wl_timestamp
where
    wl_user_id is not null
group by wl_timestamp
order by uniqueVisitors desc
limit 10;

D.13 BigBench Q23

select
    c_user_id,
    c_name,
    count(*) as visits
from
    ( select
          lg.wl_user_id
      from web_logs wl
      lateral view
          json_tuple
          (wl.line, 'wl_user_id'
          ) lg as wl_user_id
      where
          lg.wl_user_id is not null
    ) l,
    user
where
    l.wl_user_id = c_user_id
group by c_user_id, c_name
order by visits desc
limit 10;

E Evaluating Hive and Spark SQL with BigBench V2

Fig. E.1.: BigBench V2 Hive LateBinding results

Fig. E.2.: BigBench V2 Hive LateBinding Normalized results with respect to SF1

Fig. E.3.: BigBench V2 Hive LateBinding results

Fig. E.4.: BigBench V2 Hive LateBinding Normalized results with respect to SF1

F Performance Evaluation of Spark SQL using BigBench

The following figures depict the CPU, network and disk utilization for queries Q4, Q5, Q7, Q9, Q18, Q24 and Q27, as measured during the experiments in Section 3.4.

Fig. F.1.: CPU Utilization of queries Q4, Q5, Q18 and Q27 on Hive for scale factor 1TB.

Fig. F.2.: Network Utilization of queries Q4, Q5, Q18 and Q27 on Hive for scale factor 1TB.

Fig. F.3.: Disk Utilization of queries Q4, Q5, Q18 and Q27 on Hive for scale factor 1TB.

Fig. F.4.: Resource Utilization of query Q7 on Hive and Spark SQL for scale factor 1TB.

Fig. F.5.: Resource Utilization of query Q9 on Hive and Spark SQL for scale factor 1TB.

Fig. F.6.: Resource Utilization of query Q24 on Hive and Spark SQL for scale factor 1TB.


G The Influence of Columnar File Formats on SQL-on-Hadoop Engine Performance

G.1 BigBench Q08 (MapReduce/Python)

-- For online sales, compare the total sales monetary amount in which customers
-- checked online reviews 3 days before making the purchase and that of sales
-- in which customers did not read reviews.
-- Consider only online sales for a specific category in a given year.

-- Resources
ADD FILE ${hiveconf:QUERY_DIR}/q08_filter_sales_with_reviews_viewed_before.py;
DROP TABLE IF EXISTS ${hiveconf:TEMP_TABLE2};
DROP TABLE IF EXISTS ${hiveconf:TEMP_TABLE3};
DROP TABLE IF EXISTS ${hiveconf:TEMP_TABLE1};

-- Date filter
CREATE TABLE ${hiveconf:TEMP_TABLE1} AS
SELECT d_date_sk
FROM date_dim d
WHERE d.d_date >= '${hiveconf:q08_startDate}'
  AND d.d_date <= '${hiveconf:q08_endDate}';

-- PART 1 --
CREATE TABLE ${hiveconf:TEMP_TABLE2} AS
SELECT DISTINCT wcs_sales_sk
FROM (
  FROM (
    SELECT wcs_user_sk,
      (wcs_click_date_sk * 86400L + wcs_click_time_sk) AS tstamp_inSec,
      wcs_sales_sk,
      wp_type
    FROM web_clickstreams
    LEFT SEMI JOIN ${hiveconf:TEMP_TABLE1} date_filter
      ON (wcs_click_date_sk = date_filter.d_date_sk AND wcs_user_sk IS NOT NULL)
    JOIN web_page w ON wcs_web_page_sk = w.wp_web_page_sk
    DISTRIBUTE BY wcs_user_sk
    SORT BY wcs_user_sk, tstamp_inSec, wcs_sales_sk, wp_type
  ) q08_map_output
  -- input: web_clicks in a given year
  REDUCE wcs_user_sk,
    tstamp_inSec,
    wcs_sales_sk,
    wp_type
  USING 'python q08_filter_sales_with_reviews_viewed_before.py review ${hiveconf:q08_seconds_before_purchase}'
  AS (wcs_sales_sk BIGINT)
) sales_which_read_reviews;

-- PART 2 --
CREATE TABLE IF NOT EXISTS ${hiveconf:TEMP_TABLE3} AS
SELECT ws_net_paid, ws_order_number
FROM web_sales ws
JOIN ${hiveconf:TEMP_TABLE1} d ON (ws.ws_sold_date_sk = d.d_date_sk);

-- PART 3
-- CREATE RESULT TABLE. Store query result externally in output_dir/qXXresult/
DROP TABLE IF EXISTS ${hiveconf:RESULT_TABLE};
CREATE TABLE ${hiveconf:RESULT_TABLE} (
  q08_review_sales_amount BIGINT,
  no_q08_review_sales_amount BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS ${env:BIG_BENCH_hive_default_fileformat_result_table}
LOCATION '${hiveconf:RESULT_DIR}';

-- the real query part --
INSERT INTO TABLE ${hiveconf:RESULT_TABLE}
SELECT
  q08_review_sales.amount AS q08_review_sales_amount,
  q08_all_sales.amount - q08_review_sales.amount AS no_q08_review_sales_amount
FROM (
  SELECT 1 AS id, SUM(ws_net_paid) AS amount
  FROM ${hiveconf:TEMP_TABLE3} allSalesInYear
  LEFT SEMI JOIN ${hiveconf:TEMP_TABLE2} salesWithViewedReviews
    ON allSalesInYear.ws_order_number = salesWithViewedReviews.wcs_sales_sk
) q08_review_sales
JOIN (
  SELECT 1 AS id, SUM(ws_net_paid) AS amount
  FROM ${hiveconf:TEMP_TABLE3} allSalesInYear
) q08_all_sales
ON q08_review_sales.id = q08_all_sales.id;

-- cleanup --
DROP TABLE IF EXISTS ${hiveconf:TEMP_TABLE2};
DROP TABLE IF EXISTS ${hiveconf:TEMP_TABLE3};
DROP TABLE IF EXISTS ${hiveconf:TEMP_TABLE1};

G.2 BigBench Q10 (HiveQL/OpenNLP)

-- For all products, extract sentences from its product reviews that contain
-- positive or negative sentiment and display for each item the sentiment
-- polarity of the extracted sentences (POS or NEG) and the sentence and word
-- in sentence leading to this classification.

-- Resources
ADD JAR ${env:BIG_BENCH_QUERIES_DIR}/Resources/opennlp-maxent-3.0.3.jar;
ADD JAR ${env:BIG_BENCH_QUERIES_DIR}/Resources/opennlp-tools-1.6.0.jar;
ADD JAR ${env:BIG_BENCH_QUERIES_DIR}/Resources/bigbenchqueriesmr.jar;
CREATE TEMPORARY FUNCTION extract_sentiment AS
  'io.bigdatabenchmark.v1.queries.q10.SentimentUDF';

-- CREATE RESULT TABLE. Store query result externally in output_dir/qXXresult/
DROP TABLE IF EXISTS ${hiveconf:RESULT_TABLE};
CREATE TABLE ${hiveconf:RESULT_TABLE} (
  item_sk BIGINT,
  review_sentence STRING,
  sentiment STRING,
  sentiment_word STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS ${env:BIG_BENCH_hive_default_fileformat_result_table}
LOCATION '${hiveconf:RESULT_DIR}';

-- the real query part
INSERT INTO TABLE ${hiveconf:RESULT_TABLE}
SELECT item_sk, review_sentence, sentiment, sentiment_word
FROM (
  SELECT extract_sentiment(pr_item_sk, pr_review_content)
    AS (item_sk, review_sentence, sentiment, sentiment_word)
  FROM product_reviews
) extracted
ORDER BY item_sk, review_sentence, sentiment, sentiment_word;

G.3 BigBench Q12 (Pure HiveQL)

-- Find all customers who viewed items of a given category on the web in a
-- given month and year that was followed by an in-store purchase of an item
-- from the same category in the three consecutive months.

DROP TABLE IF EXISTS ${hiveconf:RESULT_TABLE};
CREATE TABLE ${hiveconf:RESULT_TABLE} (
  u_id BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS ${env:BIG_BENCH_hive_default_fileformat_result_table}
LOCATION '${hiveconf:RESULT_DIR}';

INSERT INTO TABLE ${hiveconf:RESULT_TABLE}
SELECT DISTINCT wcs_user_sk -- Find all customers
FROM
( -- web_clicks viewed items in date range with items from specified categories
  SELECT
    wcs_user_sk,
    wcs_click_date_sk
  FROM web_clickstreams, item
  WHERE wcs_click_date_sk BETWEEN 37134 AND (37134 + 30)
    -- in a given month and year
    AND i_category IN (${hiveconf:q12_i_category_IN})
    -- filter given category
    AND wcs_item_sk = i_item_sk
    AND wcs_user_sk IS NOT NULL
    AND wcs_sales_sk IS NULL -- only views, not purchases
) webInRange,
( -- store sales in date range with items from specified categories
  SELECT
    ss_customer_sk,
    ss_sold_date_sk
  FROM store_sales, item
  WHERE ss_sold_date_sk BETWEEN 37134 AND (37134 + 90)
    -- in the three consecutive months
    AND i_category IN (${hiveconf:q12_i_category_IN})
    -- filter given category
    AND ss_item_sk = i_item_sk
    AND ss_customer_sk IS NOT NULL
) storeInRange -- join web and store
WHERE wcs_user_sk = ss_customer_sk
  AND wcs_click_date_sk < ss_sold_date_sk
  -- buy AFTER viewed on website
ORDER BY wcs_user_sk;

G.4 BigBench Q25 (HiveQL/SparkMLlib)

-- Customer segmentation analysis: customers are separated along the following
-- key shopping dimensions: recency of last visit, frequency of visits and
-- monetary amount. Use the store and online purchase data during a given year
-- to compute. After the model of separation is built, report for the analysed
-- customers to which "group" they were assigned.

DROP TABLE IF EXISTS ${hiveconf:TEMP_TABLE};
CREATE TABLE ${hiveconf:TEMP_TABLE} (
  cid BIGINT,
  frequency BIGINT,
  most_recent_date BIGINT,
  amount decimal(15,2)
);

-- Add store sales data
INSERT INTO TABLE ${hiveconf:TEMP_TABLE}
SELECT
  ss_customer_sk AS cid,
  count(distinct ss_ticket_number) AS frequency,
  max(ss_sold_date_sk) AS most_recent_date,
  SUM(ss_net_paid) AS amount
FROM store_sales ss
JOIN date_dim d ON ss.ss_sold_date_sk = d.d_date_sk
WHERE d.d_date > '${hiveconf:q25_date}'
  AND ss_customer_sk IS NOT NULL
GROUP BY ss_customer_sk;

-- Add web sales data
INSERT INTO TABLE ${hiveconf:TEMP_TABLE}
SELECT
  ws_bill_customer_sk AS cid,
  count(distinct ws_order_number) AS frequency,
  max(ws_sold_date_sk) AS most_recent_date,
  SUM(ws_net_paid) AS amount
FROM web_sales ws
JOIN date_dim d ON ws.ws_sold_date_sk = d.d_date_sk
WHERE d.d_date > '${hiveconf:q25_date}'
  AND ws_bill_customer_sk IS NOT NULL
GROUP BY ws_bill_customer_sk;

-- ML algorithms expect double values as input for their Vectors.
DROP TABLE IF EXISTS ${hiveconf:TEMP_RESULT_TABLE};
CREATE TABLE ${hiveconf:TEMP_RESULT_TABLE} (
  cid BIGINT, -- used as "label"; all following values are used as Vector for the ML algorithm
  recency double,
  frequency double,
  totalspend double
);

INSERT INTO TABLE ${hiveconf:TEMP_RESULT_TABLE}
SELECT
  cid AS cid,
  CASE WHEN 37621 - max(most_recent_date) < 60 THEN 1.0 ELSE 0.0 END
    AS recency, -- 37621 == 2003-01-02
  SUM(frequency) AS frequency, -- total frequency
  SUM(amount) AS totalspend -- total amount
FROM ${hiveconf:TEMP_TABLE}
GROUP BY cid
ORDER BY cid;

-- CLEANUP --
DROP TABLE ${hiveconf:TEMP_TABLE};

List of Figures

0.1 Generalisierte Hybride Benchmark-Methodik (Iterativer experimenteller Ansatz), verwendet im Kapitel 3 ...... vi
0.2 Die Hybrid Benchmark Methodik integriert mit den verschiedenen Big Data Architecture Layern, die auf die Heterogeneity levels abgebildet sind ...... x
0.3 ABench Roadmap ...... xi

1.1 Generalized Hybrid Benchmark Methodology (Iterative Experimental Approach) used in Chapter 3 ...... 5
1.2 Thesis Structure (Chapters) ...... 8

2.1 Visualization of the Extended V-Model (adopted from (E. G. Caldarola, Sacco, & Terkaj, 2014)) ...... 17
2.2 Visualization of the 3V Cube Model ...... 18

3.1 The Hybrid Benchmark Methodology integrated with the different Big Data Architecture Layers mapped to the heterogeneity levels ...... 41
3.2 Options for Virtualized Hadoop Cluster Deployments ...... 43
3.3 Experimental Platform Layers ...... 45
3.4 Standard and Data-Compute Hadoop Cluster Configurations ...... 46
3.5 Iterative Experimental Approach ...... 49
3.6 Normalized WordCount Completion Times ...... 51
3.7 WordCount Time (Seconds) ...... 52
3.8 WordCount Throughput (MBs per second) ...... 53
3.9 WordCount Time Difference between Standard1 (Baseline) and all Other Configurations in % ...... 54
3.10 WordCount Data Scaling Behavior of all Cluster Configurations normalized to Standard1 ...... 54
3.11 Normalized DFSIO Read Completion Times ...... 57
3.12 Normalized DFSIO Write Completion Times ...... 57
3.13 DFSIO Read Time (Seconds) ...... 59
3.14 DFSIO Read Throughput (MBs per second) ...... 59
3.15 DFSIO Read Time Difference between Standard1 (Baseline) and all Other Configurations in % ...... 60
3.16 DFSIO Read Data Behavior of all Cluster Configurations normalized to Standard1 ...... 61
3.17 DFSIO Write Time (Seconds) ...... 61
3.18 DFSIO Write Throughput (MBs per second) ...... 61
3.19 DFSIO Write Time ∆ (%) between Standard1 (Baseline) and all other configurations in % ...... 62
3.20 DFSIO Write Data Behavior of all Cluster Configurations normalized to Standard1 ...... 63
3.21 Benchmarking Methodology Process Diagram ...... 69
3.22 WordCount – Processing Different Data Sizes ...... 72
3.23 WordCount – Data Scaling Behavior ...... 73
3.24 Enhanced DFSIO – Reading Different Data Sizes ...... 74
3.25 Enhanced DFSIO – Writing Different Data Sizes ...... 75
3.26 Enhanced DFSIO – Data Scaling Behavior ...... 76
3.27 HiveBench Join for Different Data Sizes ...... 79
3.28 HiveBench – Data Scaling Behavior ...... 79
3.29 Cluster Setup ...... 84
3.30 TPCx-HS Execution Phases (version 1.3.0 from February 19, 2015) [TPCb] ...... 87
3.31 TPCx-HS Execution Times and Metric ...... 90
3.32 TPCx-HS Scaling Behavior (0 on the X and Y-axis is equal to the baseline of SF 0.1/100GB) ...... 91
3.33 Master Node CPU Utilization ...... 94
3.34 Worker Node CPU Utilization ...... 95
3.35 Master Node Context Switches ...... 95
3.36 Worker Node Context Switches ...... 96
3.37 Master Node Memory Utilization ...... 96
3.38 Master Node Free Memory ...... 97
3.39 Worker Node Memory Utilization ...... 97
3.40 Worker Node Free Memory ...... 98
3.41 Master Node I/O Requests ...... 99
3.42 Worker Node I/O Requests ...... 99
3.43 Master Node I/O Latencies ...... 100
3.44 Worker Node I/O Latencies ...... 100
3.45 Master Node Disk Bandwidth ...... 101
3.46 Worker Node Disk Bandwidth ...... 101
3.47 Master Node Network I/O ...... 102
3.48 Worker Node Network I/O ...... 102
3.49 Worker Node JVM Count ...... 103
3.50 BigBench Schema [Cho+13b] ...... 105
3.51 Improvements between Initial and Final Cluster Configuration for 1TB data size ...... 108
3.52 BigBench + MapReduce Query Times normalized with respect to 100GB SF ...... 110
3.53 BigBench + Spark SQL Query Times normalized with respect to 100GB SF ...... 112
3.54 Hive to Spark SQL Query Time Ratio defined as ((HiveTime*100)/SparkTime)-100) ...... 113
3.55 Graphical representation of the proposed benchmarking approach ...... 121
3.56 General structure of a columnar file ...... 124
3.57 Hive Load Time (min.) on the X-axis and Data Size (GB) on the Y-axis for Scale Factor 1000 ...... 134
3.58 The picture illustrates the process of collecting the Spark History Server metrics and source code details into a summary table for query Q08 executed on ORC configured with No Compression ...... 147
3.59 CPU utilization for Q08 with No Compression configuration ...... 149
3.60 Disk Requests for Q08 with Parquet ...... 150
3.61 Disk requests for Q10 with No Compression configuration ...... 152
3.62 CPU utilization for Q12 with Parquet ...... 154
3.63 CPU utilization for Q25 with Snappy configuration ...... 155

4.1 ABench Roadmap ...... 162
4.2 Data model ...... 167
4.3 Data generation process flow ...... 169
4.4 BigBench V2 Hive results for SF1 ...... 176
4.5 BigBench V2 engines results for SF1 ...... 179
4.6 BigBench V2 Hive and Spark SQL Total Execution Times (in seconds) ...... 191
4.7 Streaming Window Cases ...... 195
4.8 Current vs. New BigBench Components ...... 196
4.9 Active and Passive Streaming Architecture ...... 200
4.10 Structured Streaming Models [Gui18] ...... 206
4.11 Trigger Process ...... 207
4.12 Latency Distribution ...... 213
4.13 Optimal Trigger Processing Time ...... 213
4.14 Abstract Big Data Stack ...... 217
4.15 Types of Benchmarks ...... 218

5.1 Implementation of the BigBench V2 Batch Architecture in ABench ...... 224
5.2 Implementation of the BigBench V2 Streaming Architecture in ABench ...... 225

A.1 Big Data Benchmarks Classification ...... 268

E.1 BigBench V2 Hive LateBinding results ...... 318
E.2 BigBench V2 Hive LateBinding Normalized results with respect to SF1 ...... 318
E.3 BigBench V2 Hive LateBinding results ...... 319
E.4 BigBench V2 Hive LateBinding Normalized results with respect to SF1 ...... 319

F.1 CPU Utilization of queries Q4, Q5, Q18 and Q27 on Hive for scale factor 1TB ...... 321
F.2 Network Utilization of queries Q4, Q5, Q18 and Q27 on Hive for scale factor 1TB ...... 322
F.3 Disk Utilization of queries Q4, Q5, Q18 and Q27 on Hive for scale factor 1TB ...... 322
F.4 Resource Utilization of query Q7 on Hive and Spark SQL for scale factor 1TB ...... 323
F.5 Resource Utilization of query Q9 on Hive and Spark SQL for scale factor 1TB ...... 324
F.6 Resource Utilization of query Q24 on Hive and Spark SQL for scale factor 1TB ...... 325

List of Tables

2.1 Big Data Cloud Providers and Services ...... 20
2.2 Classifications of Big Data Systems ...... 26
2.3 Abstract Big Data Architecture ...... 27
2.4 Management Level Component ...... 28
2.5 Platform Level Components ...... 32
2.6 Application Level Components ...... 36

3.1 Six Experimental Hadoop Cluster Configurations ...... 48
3.2 Selected HiBench Workload Characteristics ...... 48
3.3 WordCount Parameters ...... 50
3.4 WordCount - Equal Number of VMs ...... 51
3.5 WordCount - Different Number of VMs ...... 51
3.6 WordCount Standard Cluster Results ...... 53
3.7 WordCount Data-Compute Cluster Results ...... 53
3.8 Enhanced DFSIO Parameters ...... 55
3.9 Enhanced DFSIO Experiments ...... 56
3.10 DFSIO Read - Equal Number of VMs ...... 56
3.11 DFSIO Read - Different Number of VMs ...... 57
3.12 DFSIO Write - Equal Number of VMs ...... 58
3.13 DFSIO Write - Different Number of VMs ...... 58
3.14 DFSIO Standard Cluster Read Results ...... 59
3.15 DFSIO Data-Compute Cluster Read Results ...... 60
3.16 DFSIO Standard Cluster Write Results ...... 62
3.17 DFSIO Data-Compute Cluster Write Results ...... 62
3.18 Hardware Characteristics of the used Blade Nodes ...... 68
3.19 Software Characteristics of the used Blade Nodes ...... 68
3.20 WordCount Map/Reduce Experiments (240 GB) ...... 71
3.21 WordCount Results ...... 72
3.22 Enhanced DFSIO Read Results ...... 75
3.23 Enhanced DFSIO Write Results ...... 76
3.24 Enhanced DFSIO Read/Write ∆ ...... 76
3.25 HiveBench Data Generation Parameters ...... 77
3.26 HiveBench Aggregation Results ...... 77
3.27 HiveBench Join Results ...... 78
3.28 Summary of Total System Resources ...... 84
3.29 Software Stack of the System under Test ...... 85
3.30 Software Services per Node ...... 85
3.31 Network Speed ...... 86
3.32 TPCx-HS Phases ...... 88
3.33 TPCx-HS Scale Factors ...... 88
3.34 TPCx-HS Related Ratios ...... 90
3.35 TPCx-HS Results ...... 91
3.36 TPCx-HS Phase Times ...... 92
3.37 Total System Cost with 1GBit Switch ...... 92
3.38 Price-Performance Metrics ...... 93
3.39 BigBench Queries ...... 106
3.40 Cluster Configuration Parameters ...... 109
3.41 Average query times for the four tested scale factors (100GB, 300GB, 600GB and 1000GB). The column ∆ (%) shows the time difference in % between the baseline 100GB SF and the other three SFs for Hive/MapReduce ...... 110
3.42 Average query times for the four tested scale factors (100GB, 300GB, 600GB and 1000GB). The column ∆ (%) shows the time difference in % between the baseline 100GB SF and the other three SFs for Spark SQL ...... 111
3.43 Average Resource Utilization of queries Q4, Q5, Q18 and Q27 on Hive/MapReduce for scale factor 1TB ...... 115
3.44 Average Resource Utilization of queries Q7, Q9 and Q24 on Hive and Spark SQL for scale factor 1TB. The Ratio column is defined as HiveTime/SparkTime or SparkTime/HiveTime and represents the difference between Hive (MapReduce) and Spark SQL for each metric ...... 117
3.45 Average Resource Utilization of query Q24 on Hive and Spark SQL for scale factor 1TB. The Ratio column is defined as HiveTime/SparkTime or SparkTime/HiveTime and represents the difference between Hive (MapReduce) and Spark SQL for each metric ...... 118
3.46 Popular SQL-on-Hadoop Engines ...... 122
3.47 ORC design concepts and default configuration ...... 125
3.48 Parquet design concepts and default configuration ...... 126
3.49 Summary of related work ...... 127
3.50 Cluster configuration ...... 130
3.51 BigBench query types ...... 131
3.52 File format configurations ...... 132
3.53 Experimental roadmap ...... 135
3.54 Hive Results for Pure HiveQL query type ...... 137
3.55 Hive Results for MapReduce/Python query type ...... 138
3.56 Hive Results for HiveQL/OpenNLP query type ...... 139
3.57 Hive Results for HiveQL/Spark MLlib query type ...... 140
3.58 Spark Results for Pure HiveQL query type ...... 142
3.59 Spark Results for MapReduce/Python query type ...... 143
3.60 Spark Results for HiveQL/OpenNLP query type ...... 145
3.61 Spark Results for HiveQL/Spark MLlib query type ...... 146

4.1 Schema of the six structured tables ...... 168
4.2 Schema of the product review table ...... 169
4.3 Cardinality for various scaling factors ...... 170
4.4 Technical and business query breakdown ...... 172
4.5 Data size and loading time ...... 174
4.6 Late binding vs. pre-parsed table Hive for SF1 ...... 177
4.7 Late binding overhead for other engines ...... 179
4.8 Data Sizes and Loading Times for the BigBench V2 Scale Factors ...... 183
4.9 Average query times for the three scale factors (SF1, SF10 and SF30). The column ∆ % shows the time difference in % between the baseline SF1 and the other two SFs for the two Hive user-defined function implementations ...... 186
4.10 Average query times for the three scale factors (SF60, SF100 and SF200). The column ∆ % shows the time difference in % between the baseline SF1 and the other three SFs for the two Hive user-defined function implementations ...... 187
4.11 Average query times for the three scale factors (SF1, SF10 and SF30). The column ∆ % shows the time difference in % between the baseline SF1 and the other two SFs for the two Spark SQL user-defined function implementations ...... 189
4.12 Average query times for the three scale factors (SF60, SF100 and SF200). The column ∆ % shows the time difference in % between the baseline SF1 and the other three SFs for the two Spark SQL user-defined function implementations ...... 190
4.13 Average Execution Times (in seconds) ...... 201
4.14 Summary of the Active and Passive Mode Features ...... 202
4.15 Metrics ...... 207
4.16 Structured Streaming Pros and Cons ...... 214

5.1 Overview of the explored Workloads, Workload Type and Algorithm Name ...... 225
5.2 Overview of available Algorithms and Libraries ...... 226

A.1 Active TPC Benchmarks [TPC18a] ...... 269
A.2 Active SPEC Benchmarks [SPE18] ...... 269
A.3 Active STAC Benchmarks [STA18] ...... 270
A.4 Active LDBC Benchmarks [LDB18] ...... 271
A.5 TPCx-HS Phases ...... 275
A.6 Representative applications in CloudRank-D; Adopted from [Luo+12b] ...... 281
A.7 Applications in CloudSuite; Adopted from [Fer+12b] ...... 281
A.8 Representative Applications in MRBS ...... 282
A.9 Pavlo's Benchmark Queries ...... 283

B.1 Master Node Specifications ...... 301
B.2 Master Node Specifications ...... 302

C.1 Master Node - Resource Utilization ...... 305
C.2 Worker Node - Resource Utilization ...... 305
