Classifying, Evaluating and Advancing Big Data Benchmarks


Classifying, Evaluating and Advancing Big Data Benchmarks

Dissertation submitted for the degree of Doctor of Natural Sciences (Dissertation zur Erlangung des Doktorgrades der Naturwissenschaften) to Fachbereich 12 Informatik of the Johann Wolfgang Goethe-Universität in Frankfurt am Main, by Todor Ivanov from Stara Zagora. Frankfurt am Main 2019 (D 30). Accepted as a dissertation by Fachbereich 12 Informatik of the Johann Wolfgang Goethe-Universität. Dean: Prof. Dr. Andreas Bernig. Reviewers: Prof. Dott.-Ing. Roberto V. Zicari, Prof. Dr. Carsten Binnig. Date of the disputation: 23.07.2019.

Abstract

The main contribution of the thesis is in helping to understand which software system parameters most affect the performance of Big Data platforms under realistic workloads. In detail, the main research contributions of the thesis are:

1. Definition of the new concept of heterogeneity for Big Data architectures (Chapter 2);
2. Investigation of the performance of Big Data systems (e.g. Hadoop) in virtualized environments (Section 3.1);
3. Investigation of the performance of NoSQL databases versus Hadoop distributions (Section 3.2);
4. Execution and evaluation of the TPCx-HS benchmark (Section 3.3);
5. Evaluation and comparison of the Hive and Spark SQL engines using benchmark queries (Section 3.4);
6. Evaluation of the impact of compression techniques on SQL-on-Hadoop engine performance (Section 3.5);
7. Extensions of the standardized Big Data benchmark BigBench (TPCx-BB) (Sections 4.1 and 4.3);
8. Definition of a new benchmark, called ABench (Big Data Architecture Stack Benchmark), that takes into account the heterogeneity of Big Data architectures (Section 4.5).

The thesis is an attempt to re-define system benchmarking, taking into account the new requirements posed by Big Data applications. With the explosion of Artificial Intelligence (AI) and new hardware computing power, this is a first step towards a more holistic approach to benchmarking.
Summary (Zusammenfassung)

Motivation

In the age of Big Data, often described by the so-called 3Vs (Volume, Velocity and Variety) [Lan01; ZE+11], it is essential to use the right tools and proven practices when developing Big Data applications. Traditionally, benchmarking tools and methods are used to compare different technologies in terms of both performance and functionality [Gra92]. With the growing number of open-source and enterprise tools in the Big Data ecosystem [Gre18], the need for standardized Big Data benchmarks that enable a detailed comparison between these new technologies has risen sharply [Che+12]. At the same time, advances in hardware development, such as new hardware accelerators and configurable components [Ozd18] (e.g. NVMs (Non-Volatile Memory) [Bou+18], GPUs (Graphics Processing Units) [Shi+18], FPGAs (Field Programmable Gate Arrays) [VF18; VPK18], TPUs (Tensor Processing Units) [Jou+17] and more), point towards a complete overhaul of the existing software stack [KFS18]. Such major changes in the backend systems affect both the processing and the storage layer. Artificial Intelligence [BAM19], Machine Learning [PAC18] and Deep Learning [Zha+18] exploit the new hardware accelerators [KFS18] and are being developed and put into production at a very fast pace. To optimize and validate the benefits of a new software stack, suitable and standardized Big Data benchmarks that include machine learning are required. The goal of this thesis is to help software developers and system architects select the most effective Big Data platform, the one that does the best job for the chosen class of Big Data applications.
The main contributions cover various relevant aspects: from understanding the current challenges in Big Data platforms, through selecting the most suitable benchmark for stress-testing the chosen Big Data technologies, to tuning the relevant platform components for more efficient data processing and storage.

Approach

The thesis uses a novel Hybrid Benchmark Methodology (Chapter 3) consisting of a mix of best practices and existing benchmark methods, used in combination with popular standardized benchmarks. In particular, the approach is inspired by the TPC benchmark methodologies, while at the same time trying to remain flexible and adaptable to the new types of Big Data benchmarks, because in most cases these offer no systematic methodology for conducting experiments. Figure 0.1 illustrates the four main phases of the Hybrid Benchmark Methodology. Each phase is explained below:

Fig. 0.1.: Generalized Hybrid Benchmark Methodology (iterative experimental approach), used in Chapter 3.

• Phase 1 - Platform Setup: In this phase, all hardware and software components are installed and configured. This includes the installation and configuration of the operating system, the network, the programming frameworks (e.g. the Java environment), and the Big Data system under test.

• Phase 2 - Workload Preparation: The benchmark used to stress-test the components of the underlying Big Data system is installed and configured. At the same time, the platform components under test are configured and prepared for the planned benchmarking scenario. All workload parameters are defined, and the test data is generated by the benchmark's data generator. The generated data, together with the defined parameters, serves as input for executing the workload in Phase 3.
In addition, tools for collecting metrics and resource utilization statistics are deployed.

• Phase 3 - Workload Execution: In Phase 3, the benchmark experiments are carried out. As a rule, each experiment is repeated three (or more) times to ensure the representativeness of the results and to rule out cache effects or unwanted interference between consecutive test runs. The average execution time across the three runs, together with the standard deviation of the three runs, is recorded as the final value. Typically, a higher standard deviation percentage indicates an error or misconfiguration in a system component, leading to overall unstable system performance. Before each workload experiment, the test data must be reset to its initial state, typically by deleting the data and generating new data as in Phase 2. In some cases, the platform caches must be cleared to guarantee a consistent state in each experiment.

• Phase 4 - Evaluation: In this phase, the benchmark results are validated to ensure the correctness of the benchmark. Then the benchmark metrics (typically execution time and throughput) and the resource utilization statistics (CPU, network, I/O and memory) are evaluated. Graphical representations (charts) of the various metrics are used for further result exploration and analysis.

The Hybrid Benchmark Methodology described above can, as shown in Figure 0.1, be executed iteratively, with each test execution repeated at least three times (alternating between Phase 2 and Phase 4). Besides the mandatory three test runs, there are further reasons to repeat certain test executions.
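The aggregation rule of Phase 3 (average of three runs plus standard deviation as a stability indicator) can be sketched in a few lines of Python. The timing values and the 5% threshold below are illustrative assumptions, not numbers from the thesis:

```python
import statistics

def summarize_runs(times_sec):
    """Aggregate repeated benchmark runs: mean execution time plus
    sample standard deviation, absolute and as a percentage of the mean."""
    avg = statistics.mean(times_sec)
    stdev = statistics.stdev(times_sec)  # sample standard deviation
    return avg, stdev, 100.0 * stdev / avg

# Hypothetical execution times (seconds) of three runs of one experiment.
runs = [412.0, 420.5, 415.3]
avg, stdev, stdev_pct = summarize_runs(runs)

# A high relative standard deviation hints at a misconfigured or unstable
# system component; 5% is an assumed example threshold, not a TPC rule.
stable = stdev_pct < 5.0
```

In an iterative setup, such a check can decide automatically whether an experiment needs to be re-run before moving on to the Phase 4 evaluation.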
For example, repetitions are used when varying component parameters and configurations, or when evaluating platform performance at different data sizes. Variations of the Hybrid Benchmark Methodology described above were used to carry out the various experimental evaluations and comparisons in Chapter 3.

Main Research Contributions and Summary of Results

The main contribution of the dissertation is to provide methods and tools for understanding which software systems and which parameters most affect the performance of Big Data platforms under particular workloads. In detail, the main research contributions of the thesis are as follows:

1. Definition of the new concept of heterogeneity for Big Data architectures (Chapter 2) - This chapter introduces the concept of heterogeneity in Big Data architectures, which can be seen as an internal property of Big Data systems that target both vertical and generic workloads. It discusses how this can be linked to the existing Hadoop ecosystem.

2. Investigation of the performance of Big Data systems (e.g. Hadoop) in virtualized environments (Section 3.1) - This study investigates the performance of typical Big Data applications running on a virtualized Hadoop cluster with separated data and compute layers, compared to the standard Hadoop cluster installation. The experiments show that different Hadoop configurations using the same virtualized resources can lead to different performance. Based on the experimental results, three important influencing factors are identified.

3. Investigation of the performance of NoSQL databases compared to Hadoop distributions (Section 3.2) - This experimental work compares Hadoop with a representative NoSQL database (Cassandra), which offers a storage interface similar to the
Recommended publications
  • Java Linksammlung
    JAVA LINKSAMMLUNG (Java link collection) — LerneProgrammieren.de, 2020. Learn Java easily (click here). Contents: Build, Caching, CLI, Cluster Management, Code Analysis, Code Generators, Compiler, Configuration, CSV, Data Structures…
  • Declarative Languages for Big Streaming Data a Database Perspective
    Tutorial: Declarative Languages for Big Streaming Data — A Database Perspective. Riccardo Tommasini (University of Tartu), Sherif Sakr (University of Tartu), Emanuele Della Valle (Politecnico di Milano), Hojjat Jafarpour (Confluent Inc.).

    ABSTRACT: The Big Data movement proposes data streaming systems to tame velocity and to enable reactive decision making. However, approaching such systems is still too complex due to the paradigm shift they require, i.e., moving from scalable batch processing to continuous data analysis and pattern detection. Recently, declarative languages are playing a crucial role in fostering the adoption of stream processing solutions. In particular, several key players introduce SQL extensions for stream processing. These new languages are currently playing a central role in fostering the stream processing paradigm shift. In this tutorial, we give an overview of the various languages for declaratively querying big streaming data.

    [From the introduction, partially truncated:] …sources and are pushed asynchronously to servers which are responsible for processing them [13]. To facilitate the adoption, initially, most of the big stream processing systems provided their users with a set of APIs for implementing their applications. However, recently, the need for declarative stream processing languages has emerged to simplify common coding tasks, making code more readable and maintainable, and fostering the development of more complex applications. Thus, Big Data frameworks (e.g., Flink [9], Spark [3], Kafka Streams, and Storm [19]) are starting to develop their own SQL-like approaches (e.g., Flink SQL, Beam SQL, KSQL) to declaratively tame data velocity.
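The kind of computation these SQL-like stream languages declare — for instance, a count per key over tumbling time windows — can be illustrated with a plain-Python sketch. The function name, the event data, and the window size are illustrative assumptions, not part of any engine's API:

```python
from collections import Counter

def tumbling_window_counts(events, window_sec):
    """Count events per key over tumbling windows -- roughly what a
    declarative query of the form
        SELECT key, COUNT(*) FROM stream GROUP BY key, TUMBLE(ts, window)
    asks an engine such as Flink SQL or KSQL to compute continuously."""
    windows = {}
    for ts, key in events:
        bucket = int(ts // window_sec)  # index of the window containing ts
        windows.setdefault(bucket, Counter())[key] += 1
    return windows

# Hypothetical (timestamp, key) click events.
events = [(0.5, "a"), (1.2, "b"), (2.7, "a"), (3.1, "a"), (3.9, "b")]
counts = tumbling_window_counts(events, window_sec=2.0)
# counts[0] covers ts in [0, 2); counts[1] covers ts in [2, 4).
```

A real streaming engine computes this incrementally over unbounded input; the declarative language lets the user state only the grouping and the window, not this bookkeeping.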
  • Apache Apex: Next Gen Big Data Analytics
    Apache Apex: Next Gen Big Data Analytics. Thomas Weise <[email protected]> @thweise, PMC Chair Apache Apex, Architect DataTorrent. Apache Big Data Europe, Sevilla, Nov 14th 2016.

    [Slide diagram: stream data processing pipeline — data sources (events, logs, sensor data, social, databases, CDC on the roadmap) feeding an operator DAG (DAG API, declarative SQL API, Beam, SAMOA operator library) for transform/analytics, data delivery, and real-time visualization.]

    [Slide table: industries and use cases across Financial Services, Ad-Tech, Telecom, Manufacturing, Energy, and IoT — e.g., fraud and risk monitoring, credit risk assessment, click fraud detection, call detail record (CDR) and extended data record (XDR) analysis, real-time customer-facing dashboards on key performance indicators, supply chain planning and optimization, smart meter analytics, preventive and predictive maintenance, and data ingestion and processing.]

    Horizontal use cases: large-scale ingest and distribution; enforcing data quality and data governance requirements; real-time ELTA (Extract Load Transform Analyze); real-time data enrichment with reference data; dimensional computation and aggregation; real-time machine learning model scoring.

    Apache Apex: in-memory, distributed stream processing; application logic broken into components (operators) that execute distributed in a cluster.
  • Horn: a System for Parallel Training and Regularizing of Large-Scale Neural Networks
    Horn: A System for Parallel Training and Regularizing of Large-Scale Neural Networks. Edward J. Yoon, [email protected].

    I am: Edward J. Yoon — Member and Vice President of the Apache Software Foundation; Committer, PMC, and Mentor of Apache Hama, Apache Bigtop, Apache Rya, Apache Horn, and Apache MRQL. Keywords: big data, cloud, machine learning, database.

    What is the Apache Software Foundation? The Apache Software Foundation is a non-profit foundation dedicated to open source software development. This talk covers: 1) what the Apache Software Foundation is, 2) which projects are being developed, 3) what HORN is, and 4) how to contribute to them. The Apache HTTP Server (NCSA HTTPd) powers nearly 500+ million websites (there are 644 million websites on the Internet). And now: 161 top-level projects, 108 subprojects, 39 incubating podlings, 4700+ committers, 550 ASF members, and an unknown number of developers and users. Domain diversity; programming language diversity.

    What's HORN? Accepted as an Apache Incubator project in October 2015 and born from Apache Hama, HORN is a system for deep neural networks: a neuron-level abstraction framework, written in Java :/, that works in distributed environments.

    Apache Hama results: 1. K-means clustering: Hama is 1,000x faster than Apache Mahout (UT Arlington & Oracle, 2013). 2. PageRank on a 10-billion-edge graph: Hama is 3x faster than Facebook's Giraph (Samsung Electronics, Yoon & Kim, 2015). 3. Top-k set similarity joins on Flickr: Hama is clearly faster than Apache Spark (IEEE 2015, University of Melbourne).

    Why we do this: 1. How to parallelize the training of large models? 2. How to avoid overfitting due to the large size of the network, even with large datasets? [Slide diagram: distributed training with parameter servers and parameter swapping across task groups.]
  • Return of Organization Exempt from Income
    Form 990, Return of Organization Exempt From Income Tax (OMB No. 1545-0047), under section 501(c), 527, or 4947(a)(1) of the Internal Revenue Code (except black lung benefit trust or private foundation). Department of the Treasury, Internal Revenue Service; open to public inspection. For the 2011 calendar year, or tax year beginning 5/1/2011 and ending 4/30/2012. Organization: The Apache Software Foundation, EIN 47-0825376, 1901 Munsey Drive, Forest Hill, MD 21050-2747. Telephone: (909) 374-9776. Gross receipts: $554,439. Principal officer: Jim Jagielski, 1901 Munsey Drive, Forest Hill, MD 21050-2747. Group return for affiliates: No. Tax-exempt status: 501(c)(3). Website: http://www.apache.org/. Form of organization: corporation. Year of formation: 1999. State of legal domicile: MD. Part I Summary — 1. Mission or most significant activities: to provide open source software to the public that we sponsor free of charge. 2. (Checkbox if the organization discontinued its operations or disposed of more than 25% of its net assets.)
  • Projects – Other Than Hadoop! Created By:-Samarjit Mahapatra [email protected]
    Projects – other than Hadoop! Created by Samarjit Mahapatra, [email protected].

    Mostly compatible with Hadoop/HDFS:

    • Apache Drill – provides low-latency ad-hoc queries to many different data sources, including nested data. Inspired by Google's Dremel, Drill is designed to scale to 10,000 servers and query petabytes of data in seconds.
    • Apache Hama – a pure BSP (Bulk Synchronous Parallel) computing framework on top of HDFS for massive scientific computations such as matrix, graph and network algorithms.
    • Akka – a toolkit and runtime for building highly concurrent, distributed, and fault-tolerant event-driven applications on the JVM.
    • ML-Hadoop – a Hadoop implementation of machine learning algorithms.
    • Shark – a large-scale data warehouse system for Spark designed to be compatible with Apache Hive. It can execute Hive QL queries up to 100 times faster than Hive without any modification to the existing data or queries. Shark supports Hive's query language, metastore, serialization formats, and user-defined functions, providing seamless integration with existing Hive deployments and a familiar, more powerful option for new ones.
    • Apache Crunch – a Java library that provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.
    • Azkaban – a batch workflow job scheduler created at LinkedIn to run their Hadoop jobs.
    • Apache Mesos – a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. It can run Hadoop, MPI, Hypertable, Spark, and other applications on a dynamically shared pool of nodes.
  • Informatica 10.2 Hotfix 2 Release Notes April 2019
    Informatica 10.2 HotFix 2 Release Notes, April 2019. © Copyright Informatica LLC 1998, 2020.

    Contents:
    • Installation and Upgrade
      – Informatica Upgrade Paths
      – Upgrading from 9.6.1
      – Upgrading from Version 10.0, 10.1, 10.1.1, and 10.1.1 HotFix 1
      – Upgrading from Version 10.1.1 HF2
      – Upgrading from 10.2
      – Related Links
      – Verify the Hadoop Distribution Support
      – Hotfix Installation and Rollback
    • 10.2 HotFix 2 Fixed Limitations and Closed Enhancements
      – Analyst Tool Fixed Limitations and Closed Enhancements (10.2 HotFix 2)
      – Application Service Fixed Limitations and Closed Enhancements (10.2 HotFix 2)
      – Command Line Programs Fixed Limitations and Closed Enhancements (10.2 HotFix 2)
      – Developer Tool Fixed Limitations and Closed Enhancements (10.2 HotFix 2)
      – Informatica Connector Toolkit Fixed Limitations and Closed Enhancements (10.2 HotFix 2)
      – Mappings and Workflows Fixed Limitations (10.2 HotFix 2)
      – Metadata…
  • Optimizing Resource Utilization in Distributed Computing Systems For
    Doctoral thesis of the institution Université Bourgogne Franche-Comté, prepared at the Université de Franche-Comté. Doctoral school no. 37, Engineering Sciences and Microtechnologies (Sciences Pour l'Ingénieur et Microtechniques). Doctorate in Computer Science by ANTHONY NASSAR: Optimizing Resource Utilization in Distributed Computing Systems for Automotive Applications (Optimisation de l'utilisation des ressources dans les systèmes informatiques distribués pour les applications automobiles). Thesis presented and publicly defended in Belfort on 04-02-2021, before a jury composed of: Christophe Cerin, Professor at Université Sorbonne Paris Nord (President); Richard Chbeir, Professor at Université de Pau et des Pays de l'Adour (Reviewer); Salima Benbernou, Professor at Université Paris-Descartes (Reviewer); Ahmed Mostefaoui, Associate Professor (maître de conférences) at Université de Franche-Comté (Thesis Director); François Dessables, Engineer at Groupe PSA (Thesis Co-director).
  • Graft: a Debugging Tool for Apache Giraph
    Graft: A Debugging Tool For Apache Giraph. Semih Salihoglu, Jaeho Shin, Vikesh Khanna, Ba Quan Truong, Jennifer Widom. Stanford University. {semih, jaeho.shin, vikesh, bqtruong, widom}@cs.stanford.edu

    ABSTRACT: We address the problem of debugging programs written for Pregel-like systems. After interviewing Giraph and GPS users, we developed Graft. Graft supports the debugging cycle that users typically go through: (1) Users describe programmatically the set of vertices they are interested in inspecting. During execution, Graft captures the context information of these vertices across supersteps. (2) Using Graft's GUI, users visualize how the values and messages of the captured vertices change from superstep to superstep, narrowing in on suspicious vertices and supersteps. (3) Users replay the exact lines of the vertex.compute() function that executed for the suspicious vertices and supersteps, by copying code that Graft generates into their development environments' line-by-line debuggers.

    [From the introduction, partially truncated:] …an optional master.compute() function is executed by the Master task between supersteps. We have tackled the challenge of debugging programs written for Pregel-like systems. Despite being a core component of programmers' development cycles, very little work has been done on debugging in these systems. We interviewed several Giraph and GPS programmers (hereafter referred to as "users") and studied how they currently debug their vertex.compute() functions. We found that the following three steps were common across users: (1) Users add print statements to their code to capture information about a select set of potentially "buggy" vertices, e.g., vertices that are assigned incorrect values, send incorrect messages, or throw…
  • HDP 3.1.4 Release Notes Date of Publish: 2019-08-26
    HDP 3.1.4 Release Notes. Date of publish: 2019-08-26. https://docs.hortonworks.com

    Contents:
    • HDP 3.1.4 Release Notes
    • Component Versions
    • Descriptions of New Features
    • Deprecation Notices
      – Terminology
      – Removed Components and Product Capabilities
    • Testing Unsupported Features
      – Descriptions of the Latest Technical Preview Features
    • Upgrading to HDP 3.1.4
    • Behavioral Changes
    • Apache Patch Information
      – Accumulo…
  • Apache Calcite: a Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources
    Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources. Edmon Begoli (Oak Ridge National Laboratory (ORNL), Oak Ridge, Tennessee, USA), Jesús Camacho-Rodríguez (Hortonworks Inc., Santa Clara, California, USA), Julian Hyde (Hortonworks Inc., Santa Clara, California, USA), Michael J. Mior (David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada), Daniel Lemire (University of Quebec (TELUQ), Montreal, Quebec, Canada).

    ABSTRACT: Apache Calcite is a foundational software framework that provides query processing, optimization, and query language support to many popular open-source data processing systems such as Apache Hive, Apache Storm, Apache Flink, Druid, and MapD. Calcite's architecture consists of a modular and extensible query optimizer with hundreds of built-in optimization rules, a query processor capable of processing a variety of query languages, an adapter architecture designed for extensibility, and support for heterogeneous data models and stores (relational, semi-structured, streaming, and geospatial). This flexible, embeddable, and extensible architecture is what makes Calcite an attractive choice for adoption in big-data frameworks.

    [From the introduction, partially truncated:] …argued that specialized engines can offer more cost-effective performance and that they would bring the end of the "one size fits all" paradigm. Their vision seems today more relevant than ever. Indeed, many specialized open-source data systems have since become popular, such as Storm [50] and Flink [16] (stream processing), Elasticsearch [15] (text search), Apache Spark [47], Druid [14], etc. As organizations have invested in data processing systems tailored towards their specific needs, two overarching problems have arisen: • The developers of such specialized systems have encountered related problems, such as query optimization [4, 25] or the need to support query languages such as SQL and…
  • A Review on Big Data Analytics in the Field of Agriculture
    International Journal of Latest Transactions in Engineering and Science. A Review on Big Data Analytics in the Field of Agriculture. Harish Kumar M, Department of Computer Science and Engineering, Adhiyamaan College of Engineering, Hosur, Tamilnadu, India. Dr. T Menakadevi, Dept. of Electronics and Communication Engineering, Adhiyamaan College of Engineering, Hosur, Tamilnadu, India.

    Abstract — Big Data Analytics is a data-driven technology useful in generating significant productivity improvements in various industries by collecting, storing, managing, processing and analyzing various kinds of structured and unstructured data. The role of big data in agriculture provides an opportunity to increase the economic gain of farmers through a digital revolution; in this respect we examine precision agriculture schemes deployed in many countries. This paper reviews the applications of big data to support agriculture. In addition, it attempts to identify the tools that support the implementation of big data applications for agriculture services. The review reveals that several opportunities are available for utilizing big data in agriculture; however, there are still many issues and challenges to be addressed to achieve better utilization of this technology. Keywords — Agriculture, Big Data Analytics, Hadoop, HDFS, Farmers.

    I. INTRODUCTION: The technologies employed are exciting, involve analysis of mind-numbing amounts of data, and require fundamental rethinking as to what constitutes data. Big data means collecting raw data that undergoes various phases such as classification, processing and organization into meaningful information. Raw information cannot be consumed directly for any form of analysis. Big data analytics is the process of examining data to uncover patterns, find unknown correlations, and discover useful information that can be adopted for decision-making analysis.