Apache Apex: Next Gen Big Data Analytics

Thomas Weise <[email protected]> @thweise
PMC Chair Apache Apex, Architect DataTorrent
Apache Big Data Europe, Sevilla, Nov 14th 2016

Stream Data Processing

[Diagram: data sources (events, logs, sensor data, social media, databases, CDC on the roadmap) feed a DAG of operators (Oper1 → Oper2 → Oper3) built from the operator library; applications are written against the declarative Java stream API, SQL, Apache Beam, or SAMOA layers on top of the compositional DAG API; results flow to data delivery, transform/analytics, and real-time visualization.]

Industries & Use Cases

Financial Services: fraud and risk monitoring • credit risk assessment • improve turnaround time of trade settlement processes
Ad-Tech: real-time customer-facing dashboards on key performance indicators • click fraud detection • billing optimization
Telecom: call detail record (CDR) & extended data record (XDR) analysis • understanding customer behavior AND context • packaging and selling anonymous customer data
Manufacturing: supply chain planning & optimization • preventive maintenance • product quality & defect tracking
Energy: smart meter analytics • reduce outages & improve resource utilization • asset & workforce management
IoT: data ingestion and processing • predictive analytics • data governance

Horizontal use cases across industries:
• Large scale ingest and distribution
• Enforcing data quality and data governance requirements
• Real-time ELTA (Extract Load Transform Analyze)
• Real-time data enrichment with reference data
• Dimensional computation & aggregation
• Real-time machine learning model scoring

Apache Apex

• In-memory, distributed stream processing
  ᵒ Application logic broken into components (operators) that execute distributed in a cluster
  ᵒ Unobtrusive Java API to express (custom) logic
  ᵒ Maintain state and metrics in member variables
  ᵒ Windowing, event-time processing
• Scalable, high throughput, low latency
  ᵒ Operators can be scaled up or down at runtime according to the load and SLA
  ᵒ Dynamic scaling (elasticity), compute locality
• Fault tolerance & correctness
  ᵒ Automatically recover from node outages without having to reprocess from the beginning
  ᵒ State is preserved: checkpointing, incremental recovery
  ᵒ End-to-end exactly-once
• Operability
  ᵒ System and application metrics, record/visualize data
  ᵒ Dynamic changes and resource allocation, elasticity

Native Hadoop Integration

• YARN is the resource manager
• HDFS for storing persistent state

Application Development Model

• A Stream is a sequence of data tuples
• A typical Operator takes one or more input streams, performs computations & emits one or more output streams
• Each Operator is YOUR custom business logic in Java, or a built-in operator from our open source library
• An Operator has many instances that run in parallel, and each instance is single-threaded
• A Directed Acyclic Graph (DAG) is made up of operators and streams
[Diagram: tuples flow through a DAG of operators connected by streams, e.g. a filtered stream and an enriched stream, ending in an output operator.]

Development Process

Kafka Input → Parser → Filter → Word Counter → JDBC Output
(streams: Lines → Words → Filtered → Counts; the Apex application reads from Kafka and writes to a database)

• Operators from library, or develop for custom logic
• Connect operators to form application
• Configure operator properties
• Configure scaling and other platform attributes
• Test functionality, performance, iterate

Application Specification

• DAG API (compositional)
• Java Stream API (declarative)

Developing Operators
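The code shown on these two slides is not in the extracted text. Below is a minimal sketch of both ideas, assuming the Apex 3.x Java API (com.datatorrent.api / apex-common): a custom input operator, a stateful operator that keeps its state in a member variable, a sink, and an application that wires them with the compositional DAG API. All class names (NumberSource, RunningSum, ConsoleSink, SumApp) are illustrative, not from the deck.

```java
import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.DAG;
import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.api.InputOperator;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.common.util.BaseOperator;

// Source: an InputOperator produces tuples in emitTuples(), called by the engine.
class NumberSource extends BaseOperator implements InputOperator {
  private long next;
  public final transient DefaultOutputPort<Long> output = new DefaultOutputPort<Long>();

  @Override
  public void emitTuples() {
    output.emit(next++);
  }
}

// Custom logic: state lives in a plain member variable; non-transient fields
// are checkpointed automatically, so `sum` survives operator recovery.
class RunningSum extends BaseOperator {
  private long sum;
  public final transient DefaultOutputPort<Long> output = new DefaultOutputPort<Long>();
  public final transient DefaultInputPort<Long> input = new DefaultInputPort<Long>() {
    @Override
    public void process(Long tuple) {
      sum += tuple; // one call per tuple; each operator instance is single-threaded
    }
  };

  @Override
  public void endWindow() {
    output.emit(sum); // emit once per streaming window
  }
}

// Sink: print each windowed result.
class ConsoleSink extends BaseOperator {
  public final transient DefaultInputPort<Long> input = new DefaultInputPort<Long>() {
    @Override
    public void process(Long tuple) {
      System.out.println(tuple);
    }
  };
}

// Compositional DAG API: add operators, then connect their ports with streams.
public class SumApp implements StreamingApplication {
  @Override
  public void populateDAG(DAG dag, Configuration conf) {
    NumberSource source = dag.addOperator("source", new NumberSource());
    RunningSum sum = dag.addOperator("sum", new RunningSum());
    ConsoleSink sink = dag.addOperator("console", new ConsoleSink());
    dag.addStream("numbers", source.output, sum.input);
    dag.addStream("sums", sum.output, sink.input);
  }
}
```

The same pipeline could also be written declaratively with the Java stream API (StreamFactory), as in the code on the windowing slide further below.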
Operator Library

Messaging: Kafka • JMS (ActiveMQ, …) • Kinesis, SQS • Flume, NiFi
NoSQL: Cassandra, HBase • Aerospike, Accumulo • Couchbase/CouchDB • Redis, MongoDB • Geode
RDBMS: JDBC • MySQL • Oracle • MemSQL
File Systems: HDFS/Hive • NFS • S3
Parsers: XML • JSON • CSV • Avro • Parquet
Transformations: Filter, Expression, Enrich • Windowing, Aggregation • Join • Dedup
Analytics: Dimensional aggregations (with state management for historical data + query)
Protocols: HTTP • FTP • WebSocket • MQTT • SMTP
Other: Elastic Search • Script (JavaScript, Python, R) • Solr • Twitter

Stateful Processing with Event Time

[Diagram: five events with keys k=A/k=B and event times between t=4:00 and t=5:59 arrive over processing time (+30s, +60s, +90s); the state accumulates counts overall, per event-time window, and per key and window, ending at (All): 5, t=4:00: 2, t=5:00: 3, k=A t=4:00: 2, k=A t=5:00: 1, k=B t=5:00: 2.]

Windowing - Apache Beam Model

Event-time • Session windows • Watermarks • Triggers • Accumulation • Accumulation mode • Allowed lateness • Keyed or not keyed • Merging streams

```java
ApexStream<String> stream = StreamFactory
    .fromFolder(localFolder)
    .flatMap(new Split())
    .window(new WindowOption.GlobalWindow(),
        new TriggerOption().withEarlyFiringsAtEvery(Duration.millis(1000))
            .accumulatingFiredPanes())
    .countByKey(new ConvertToKeyVal())
    .print();
```

Fault Tolerance

• Operator state is checkpointed to persistent store
  ᵒ Automatically performed by engine, no additional coding needed
  ᵒ Asynchronous and distributed
  ᵒ In case of failure operators are restarted from checkpoint state
• Automatic detection and recovery of failed containers
  ᵒ Heartbeat mechanism
  ᵒ YARN process status notification
• Buffering to enable replay of data from recovered point
  ᵒ Fast, incremental recovery, spike handling
• Application master state checkpointed
  ᵒ Snapshot of physical (and logical) plan
  ᵒ Execution layer change log

Checkpointing State

• Distributed, asynchronous
• No artificial latency
• Periodic callbacks (see the sketch below)
• Pluggable storage
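A minimal sketch of the periodic callbacks, assuming the Apex 3.x Operator.CheckpointListener interface; the class name AuditedCounter is illustrative:

```java
import com.datatorrent.api.Operator;
import com.datatorrent.common.util.BaseOperator;

// Sketch: reacting to checkpoint lifecycle events. The engine serializes all
// non-transient fields asynchronously at checkpoint boundaries; these
// callbacks let the operator coordinate external side effects with that.
public class AuditedCounter extends BaseOperator implements Operator.CheckpointListener {
  private long eventsSeen; // checkpointed automatically with the operator

  @Override
  public void checkpointed(long windowId) {
    // State up to windowId is now in the checkpoint store
    // (HDFS by default; the storage backend is pluggable).
  }

  @Override
  public void committed(long windowId) {
    // All operators have checkpointed past windowId; it is now safe to
    // discard external artifacts kept only for recovery of older windows.
  }
}
```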
Buffer Server & Recovery

• In-memory PubSub
• Stores results until committed
• Backpressure / spillover to disk
• Ordering, idempotency
[Diagram: Operator 1 in Container 1 on Node 1 publishes to a Buffer Server read by Operator 2 in Container 2 on Node 2; pipelines are independent, and downstream operators reset on recovery (can be used for speculative execution).]

Recovery Scenario

[Diagram: four snapshots of a "sum" operator consuming the stream "… EW2, 1, 3, BW2, EW1, 4, 2, 1, BW1" (BW = begin window, EW = end window): the sum is 0 before window 1, 7 once window 1 (1 + 2 + 4) has been checkpointed, 10 partway through window 2, and back to 7 after a failure, when the operator is restored from the window 1 checkpoint and window 2 is replayed.]

Processing Guarantees

At-least-once
• On recovery data will be replayed from a previous checkpoint
  ᵒ No messages lost
  ᵒ Default, suitable for most applications
• Can be used to ensure data is written once to store
  ᵒ Transactions with meta information, rewinding output, feedback from external entity, idempotent operations

At-most-once
• On recovery the latest data is made available to the operator
  ᵒ Useful in use cases where some data loss is acceptable and latest data is sufficient

Exactly-once
• At-least-once + idempotency + transactional mechanisms (operator logic) to achieve end-to-end exactly-once behavior

End-to-End Exactly Once

• Important when writing to external systems
• Data should not be duplicated or lost in the external system in case of application failures
• Common external systems
  ᵒ Databases
  ᵒ Files
  ᵒ Message queues
• Exactly-once = at-least-once + idempotency + consistent state
• Data duplication must be avoided when data is replayed from checkpoint (see the sketch after this list)
  ᵒ Operators implement the logic dependent on the external system
  ᵒ Platform provides checkpointing and repeatable windowing
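One common way to implement the "transactions with meta information" pattern mentioned above: commit the window id in the same transaction as the data, and skip windows that are already committed when they are replayed after recovery. The sketch below uses plain JDBC rather than the Malhar output operators; the table and column names (apex_meta, results) are hypothetical, and the meta row is assumed to be seeded once per operator.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Idempotent windowed writes: data and the window marker commit atomically,
// so a window replayed from checkpoint is detected and skipped.
public class IdempotentJdbcWriter {
  private final Connection con;
  private final String appId;
  private final int operatorId;

  public IdempotentJdbcWriter(String url, String appId, int operatorId) throws Exception {
    this.con = DriverManager.getConnection(url);
    this.con.setAutoCommit(false); // transactions managed explicitly
    this.appId = appId;
    this.operatorId = operatorId;
  }

  /** Returns the last window whose output was committed to the database. */
  public long lastCommittedWindow() throws Exception {
    try (PreparedStatement st = con.prepareStatement(
        "SELECT win_id FROM apex_meta WHERE app_id = ? AND operator_id = ?")) {
      st.setString(1, appId);
      st.setInt(2, operatorId);
      ResultSet rs = st.executeQuery();
      return rs.next() ? rs.getLong(1) : -1L;
    }
  }

  /** Writes one window of results and advances the window marker atomically. */
  public void writeWindow(long windowId, Iterable<String> tuples) throws Exception {
    if (windowId <= lastCommittedWindow()) {
      return; // replayed window after recovery: output is already in the database
    }
    try (PreparedStatement ins = con.prepareStatement(
             "INSERT INTO results (value) VALUES (?)");
         PreparedStatement meta = con.prepareStatement(
             "UPDATE apex_meta SET win_id = ? WHERE app_id = ? AND operator_id = ?")) {
      for (String t : tuples) {
        ins.setString(1, t);
        ins.executeUpdate();
      }
      meta.setLong(1, windowId);
      meta.setString(2, appId);
      meta.setInt(3, operatorId);
      meta.executeUpdate();
      con.commit(); // data + window marker commit together, or not at all
    } catch (Exception e) {
      con.rollback();
      throw e;
    }
  }
}
```

This works because Apex replays windows idempotently: after recovery the same window ids carry the same tuples, so "skip if already committed" is safe.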
Scalability

[Diagrams: a logical DAG 0 → 1 → 2 → 3 and its physical plans. Splitting operator 1 into partitions 1a/1b/1c requires a unifier in front of operator 2; with both operator 1 (1a, 1b, 1c) and operator 2 (2a, 2b) partitioned, a single intermediate unifier becomes a bottleneck, while an NxM partition arrangement with one unifier per downstream partition removes it.]

Advanced Partitioning

[Diagrams: logical plan uopr → dopr. In the execution plan for N = 4, M = 1, partitions uopr1–uopr4 funnel through a single unifier into dopr, concentrating load on one container's NIC. A parallel partition replicates the whole pipeline 1 → 2 → 3 per partition. Cascading unifiers (N = 4, M = 1, K = 2) place intermediate unifiers in separate containers to reduce the fan-in at the final container.]

Dynamic Partitioning

[Diagram: three runtime snapshots; operator 2 scales from two partitions (2a, 2b) to four (2a–2d) and operator 3 later splits into 3a/3b, while operator 1 keeps partitions 1a/1b; unifiers not shown.]

• Partitioning change while application is running
  ᵒ Change number of partitions at runtime based on stats
  ᵒ Determine initial number of partitions dynamically
• Kafka operators scale according to number of Kafka partitions
  ᵒ Supports re-distribution of state when the number of partitions changes
  ᵒ API for custom scaler or partitioner

How Dynamic Partitioning Works

• Partitioning decision (yes/no) by trigger (StatsListener)
  ᵒ Pluggable component, can use any system or custom metric
  ᵒ Externally driven partitioning example: KafkaInputOperator
• Stateful!
  ᵒ Uses checkpointed state
  ᵒ Ability to transfer state from old to new partitions (partitioner, customizable)
  ᵒ Steps:
    • Call partitioner
    • Modify physical plan, rewrite checkpoints as needed
    • Undeploy old partitions from execution layer
    • Release/request container resources
    • Deploy new partitions (from rewritten checkpoint)
  ᵒ No loss of data (buffered)
  ᵒ Incremental operation, partitions that don’t change continue processing
• API: Partitioner interface

Compute Locality

• By default operators are deployed in containers (processes) on different nodes across the Hadoop cluster
• Locality options for streams (see the sketch at the end)
  ᵒ RACK_LOCAL: data does not traverse network switches
  ᵒ NODE_LOCAL: data transfer via loopback interface, frees up network bandwidth
  ᵒ CONTAINER_LOCAL: data transfer via in-memory queues between operators, does not require serialization
  ᵒ THREAD_LOCAL: data passed through the call stack, operators share a thread
• Host locality
  ᵒ Operators can be deployed on specific hosts
• (Anti-)Affinity
  ᵒ Ability to express relative deployment without specifying a host

Performance: Throughput vs. Latency?

https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-
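Stream locality is configured per stream in the DAG API. A minimal sketch, assuming the Apex 3.x DAG.Locality enum and reusing the NumberSource/RunningSum/ConsoleSink classes from the earlier SumApp sketch:

```java
import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.DAG;
import com.datatorrent.api.DAG.Locality;
import com.datatorrent.api.StreamingApplication;

// Sketch: choosing a locality per stream. CONTAINER_LOCAL keeps source and
// sum in one process, so tuples move through in-memory queues without
// serialization; the second stream keeps the default (may cross nodes).
public class LocalityApp implements StreamingApplication {
  @Override
  public void populateDAG(DAG dag, Configuration conf) {
    NumberSource source = dag.addOperator("source", new NumberSource());
    RunningSum sum = dag.addOperator("sum", new RunningSum());
    ConsoleSink sink = dag.addOperator("console", new ConsoleSink());

    dag.addStream("numbers", source.output, sum.input)
       .setLocality(Locality.CONTAINER_LOCAL); // same process, no serialization

    dag.addStream("sums", sum.output, sink.input); // default: no locality constraint
  }
}
```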