Intro to Apache Kudu: Hadoop Storage for Fast Analytics on Fast Data

Mike Percy
Software Engineer at Cloudera
Apache Kudu PMC member

© Cloudera, Inc. All rights reserved.

Apache Kudu
Storage for fast (low latency) analytics on fast (high throughput) data
• Simplifies the architecture for building analytic applications on changing data
• Optimized for fast analytic performance
• Natively integrated with the Hadoop ecosystem of components

[Diagram: platform stack. Data engineering: batch (Spark, Hive, Pig), stream (Spark). Data discovery & analytics: SQL (Impala), search (Solr), model (Spark). Data apps: online (HBase). Unified data services: resource management (YARN), security (Sentry). Data integration & storage: filesystem (HDFS), columnar store (Kudu), NoSQL (HBase). Ingest: Sqoop, Flume, Kafka.]

Why Kudu?

Previous Hadoop storage landscape
HDFS (GFS) excels at:
• Batch ingest only (e.g. hourly)
• Efficiently scanning large amounts of data (analytics)
HBase (BigTable) excels at:
• Efficiently finding and writing individual rows
• Making data mutable
Gaps exist when these properties are needed simultaneously

Kudu design goals
• High throughput for big scans
  Goal: Within 2x of Parquet
• Low latency for short accesses
  Goal: 1 ms read/write on SSD
• Database-like semantics
  Initially, single-row atomicity
• Relational data model
  • SQL queries should be natural and easy
  • Include NoSQL-style scan, insert, and update APIs

Changing hardware landscape
• Spinning disk -> solid state storage
  • NAND Flash: up to 450k read and 250k write IOPS, about 2 GB/sec read and 1.5 GB/sec write throughput, at a price of less than $3/GB and dropping
  • 3D XPoint memory (1000x faster than Flash, cheaper than RAM)
• RAM is cheaper and more abundant:
  • 64 -> 128 -> 256 GB over the last few years
• Takeaway: The next performance bottleneck is CPU, and current storage systems weren’t designed with CPU efficiency in mind

Apache Kudu: Scalable and fast structured storage
Tables
• Represents data in structured tables like a normal database
• Individual record-level access to 100+ billion row tables
Fast
• Millions of read/write operations per second across cluster
• Multiple GB/second read throughput per node
Scalable
• Tested up to 275 nodes (~3 PB cluster)
• Designed to scale to 1000s of nodes and tens of PBs

Storing records in Kudu tables
• A Kudu table has a SQL-like schema
  • And a finite number of columns (unlike HBase/Cassandra)
  • Types: BOOL, INT8, INT16, INT32, INT64, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP
  • Some subset of columns makes up a possibly-composite primary key
  • Fast ALTER TABLE
• Java, Python, and C++ NoSQL-style APIs
  • Insert(), Update(), Delete(), Scan()
• SQL via integrations with Impala and Spark
• Community work in progress / experimental: Drill, Hive

Primary Key
• Every table must have a primary key
• A primary key consists of one or more columns
• Primary key values must be unique
• The columns that make up a primary key may not be
  • Boolean or floating-point typed
  • Nullable
• Kudu does not allow the primary key values of a row to be updated
• Kudu requires primary key fields to be defined as the first fields of the table schema
• Rows within a tablet are stored in primary key sorted order
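
The primary key rules above can be checked mechanically. The following is a minimal sketch in plain Python and does not use the Kudu client API: `Column`, `validate_schema`, and both example schemas are hypothetical names introduced only for illustration. It covers the rules that concern the schema itself (a key must exist, no Boolean or floating-point key columns, no nullable key columns, key columns first); uniqueness and immutability of key values are enforced at write time and are not modeled here.

```python
from dataclasses import dataclass

# Column types that the slide says may not be part of a primary key.
DISALLOWED_KEY_TYPES = {"BOOL", "FLOAT", "DOUBLE"}


@dataclass(frozen=True)
class Column:
    """Hypothetical column description; not a Kudu client class."""
    name: str
    type: str
    nullable: bool = False
    primary_key: bool = False


def validate_schema(columns):
    """Return a list of violations of the primary key rules listed above."""
    errors = []
    key_cols = [c for c in columns if c.primary_key]
    if not key_cols:
        errors.append("table must have a primary key")
    for c in key_cols:
        if c.type in DISALLOWED_KEY_TYPES:
            errors.append(f"{c.name}: {c.type} column cannot be part of the key")
        if c.nullable:
            errors.append(f"{c.name}: key column cannot be nullable")
    # Key columns must be the first columns of the schema.
    if any(not c.primary_key for c in columns[:len(key_cols)]):
        errors.append("primary key columns must be the first columns")
    return errors


# Assumed example schema: a composite key followed by one value column.
metrics = [
    Column("host", "STRING", primary_key=True),
    Column("metric", "STRING", primary_key=True),
    Column("timestamp", "INT64", primary_key=True),
    Column("value", "DOUBLE", nullable=True),
]

# Assumed counter-example: floating-point key column that is not listed first.
bad = [
    Column("value", "DOUBLE", nullable=True),
    Column("ratio", "FLOAT", primary_key=True),
]

print(validate_schema(metrics))  # no violations
print(validate_schema(bad))      # two violations
```

A real deployment would rely on Kudu itself to reject an invalid schema at table creation; the sketch only makes the listed constraints concrete.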

Integrations
Kudu is designed for integrating with higher-level compute frameworks
Integrations exist for:
• Impala
• Spark
• MapReduce
• Flume
• Drill

Use cases

Kudu use cases
Kudu is best for use cases requiring:
• Simultaneous combination of sequential and random reads and writes
• Minimal to zero data latencies
Time series
• Examples: Streaming market data; fraud detection & prevention; network monitoring
• Workload: Inserts, updates, scans, lookups
Online reporting / data warehousing
• Example: Operational Data Store (ODS)
• Workload: Inserts, updates, scans, lookups

“Traditional” real-time analytics in Hadoop
Fraud detection in the real world = storage complexity
Considerations:
• How do I handle failure during this process?
• How often do I reorganize data streaming in into a format appropriate for reporting?
• When reporting, how do I see data that has not yet been reorganized?
• How do I ensure that important jobs aren’t interrupted by maintenance?
[Diagram: storage in HDFS. Labels: Kafka, HBase, "Have we accumulated enough data?", "Reorganize HBase file into Parquet", Historical Data, Most Recent Partition, New Partition, Parquet File, Reporting Request. Steps shown for adding a partition: wait for running operations to complete; define new Impala partition referencing the newly written Parquet file.]

Real-time analytics in Hadoop with Kudu
Improvements:
• One system to operate
• No cron jobs or background processes
• Handle late arrivals or data corrections with ease
• New data available immediately for analytics or operations
[Diagram: storage in Kudu. Labels: Incoming data (e.g. Kafka), Historical and Real-time Data, Reporting Request.]

Xiaomi use case
• World’s 4th largest smartphone maker (most popular in China)
• Gathers important RPC tracing events from mobile app and backend service
• Service monitoring & troubleshooting tool
High write throughput
• >20 billion records/day and growing
Query latest data and quick response
• Identify and resolve issues quickly
Can search for individual records
• Easy for troubleshooting

Xiaomi big data analytics pipeline: Before Kudu
Long pipeline
• High data latency (approx. 1 hour to 1 day)
• Data conversion pains
No ordering
• Log arrival (storage) order is not exactly logical order
• Must read 2 to 3 days of data to get all of the data points for a single day

Xiaomi big data analytics pipeline: Simplified with Kafka and Kudu
ETL pipeline
• 0 to 10 s data latency
• Apps that need to avoid backpressure or need ETL
Direct pipeline (no latency)
• Apps that don’t require ETL or backpressure handling
[Diagram labels: OLAP scan, Side table lookup, Result store.]

JD.com use case
• 2nd largest online retailer in China
• Real-time ingestion via Kafka
  • Click logs
  • Application/Browser tracing
• ~70 columns per row
• 6/18 sale day
  • 15B transactions
  • 10M inserts/sec peak
  • 200 node cluster
• Query via JDBC -> Impala -> Kudu
[Diagram labels: Browser tracing, Web logs, Kafka, Kudu, Impala, JDBC access, Web-app, Developers, Marketing Dept.]

Kudu+Impala vs MPP DWH
Commonalities
✓ Fast analytic queries via SQL, including most commonly used modern features
✓ Ability to insert, update, and delete data
Differences
✓ Faster streaming inserts
✓ Improved Hadoop integration
  • JOIN between HDFS + Kudu tables, run on same cluster
  • Spark, Flume, other integrations
✗ Slower batch inserts
✗ No transactional data loading, multi-row transactions, or indexing

How it works: Replication and fault tolerance

Tables, Tablets, and Tablet Servers
• Each table is horizontally partitioned into tablets
  • Range or hash partitioning
  • PRIMARY KEY (host, metric, timestamp) DISTRIBUTE BY HASH(timestamp) INTO 100 BUCKETS
• Each tablet has N replicas (default = 3) with Raft consensus
  • Automatic fault tolerance
  • MTTR: ~5 seconds
• Tablet servers host tablets on local disk drives
• Master servers manage the cluster’s metadata

Master servers
• Master servers (3 to 5 of them) manage the cluster’s metadata
  • Manage schemas and tables (and the corresponding tablets)
    • CREATE / ALTER / DROP TABLE
  • Track the locations of all of the tablet replicas
  • Detect when tablet replicas fail and initiate data re-replication
• Internally, the “master” metadata is stored in a special type of tablet that only lives on the master servers
• The master tablet uses Raft consensus for replication across the master servers’ local disk drives

How it works: Columnar storage

Columnar storage
Each column is stored separately:
Tweet_id:   {25059873, 22309487, 23059861, 23010982}
User_name:  {newsycbot, RideImpala, fastly, llvmorg}
Created_at: {1442865158, 1442828307, 1442865156, 1442865155}
text:       {Visual exp…, Introducing .., Missing July…, LLVM 3.7….}

Columnar storage: only read 1 column
Column       Size
Tweet_id     1 GB
User_name    2 GB
Created_at   1 GB
text         200 GB
SELECT COUNT(*) FROM tweets WHERE user_name = 'newsycbot';

Columnar compression
• Many columns can compress to a few bits per row!
• Especially:
  • Timestamps
  • Time series values
  • Low-cardinality strings
• Massive space savings and throughput increase!
Created_at: {1442825158, 1442826100, 1442827994, 1442828527}
Created_at      Diff(created_at)
1442825158      n/a
1442826100      942
1442827994      1894
1442828527      533
64 bits each    11 bits each

Representing time series in Kudu

What is time series?
Data that can be usefully partitioned and queried based on time
Examples:
• Web user activity
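
The arithmetic on the columnar compression slide above can be reproduced in a few lines. This is a minimal sketch of plain delta encoding in Python, using the four `Created_at` values from that slide; it is not Kudu's actual on-disk encoding, and the function names are hypothetical. It assumes at least two values in non-decreasing order, so every difference is non-negative.

```python
def delta_encode(values):
    """Return the first value and the list of successive differences."""
    deltas = [b - a for a, b in zip(values, values[1:])]
    return values[0], deltas


def delta_decode(first, deltas):
    """Rebuild the original values from the first value and the differences."""
    out = [first]
    for d in deltas:
        out.append(out[-1] + d)
    return out


def bits_needed(deltas):
    """Bits required to store the largest (non-negative) difference."""
    return max(d.bit_length() for d in deltas)


created_at = [1442825158, 1442826100, 1442827994, 1442828527]
first, deltas = delta_encode(created_at)

print(deltas)               # [942, 1894, 533]
print(bits_needed(deltas))  # 11
print(delta_decode(first, deltas) == created_at)  # True
```

The largest difference, 1894, fits in 11 bits because 1024 <= 1894 < 2048, which matches the "11 bits each" figure against 64 bits for each raw timestamp.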