BIG DATA: HADOOP AND BEYOND

Daryl Heinz
[email protected]

AGENDA

• The Apache Software Foundation
• BigData and the role of Hadoop
• Overview of Hadoop
• The Hadoop Distributed File System (HDFS)
• Yet Another Resource Negotiator (YARN)
• Application types:
  • Data at Rest (Batch)
  • Data in Motion (Streaming)
• A brief look at some “Hadoop Ecosystem” projects
• The Berkeley Data Analytics Stack

BIG DATA

• The 3 V’s and the issue of mutability
• What do you do with your current data infrastructure?
  • Start using it IN CONCERT WITH big data frameworks
• BigData can be any or all of these (and more):
  • Clickstream
  • Geographic
  • Sensor/Machine
  • Sentiment
  • Server Logs
  • Text
• Big Data is poly-structured
• OPEN
  • The ASF provides support for the Apache community of open-source software projects, which provide software products for the public good
• INNOVATION
  • ASF projects are defined by collaborative, consensus-based processes, an open, pragmatic software license, and a desire to create high-quality software that leads the way in its field
• COMMUNITY
  • We consider ourselves not simply a group of projects sharing a server, but rather a community of developers and users
• APACHE PROJECTS
  • The all-volunteer ASF develops, stewards, and incubates more than 350 Open Source projects and initiatives that cover a wide range of technologies
  • [NOTE] This is where the “professional open-source”, “hybrid”, and “proprietary” vendors step in with their “distributions”
• http://www.apache.org/

ASF PROJECTS

• http://www.apache.org/

• The Hadoop project includes these modules:
  • Hadoop Common: the common utilities that support the other Hadoop modules
  • Hadoop Distributed File System (HDFS™): a distributed file system that provides highly available storage of poly-structured data
  • Hadoop YARN: a framework for job scheduling and cluster resource management
  • Hadoop MapReduce: a YARN-based application type for batch parallel processing of large data sets

APACHE HADOOP

• http://www.slideshare.net/hortonworks/apache-hadoop-yarn-enabling-nex

HDFS OVERVIEW

• The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on “commodity” (low-cost) hardware
• Services are fault-tolerant
• Data is replicated
• Increased data capacity is provided by “horizontal” rather than “vertical” scaling
• A “virtual” file system “that looks like *nix” is provided to the user (see the sketch after this list)
• HDFS is not POSIX-compliant

HDFS OVERVIEW: ASSUMPTIONS AND GOALS

• Hardware failure is the norm rather than the exception
  • An HDFS instance may consist of thousands of server machines, each storing part of the file system’s data
  • Fault detection and automatic recovery from faults is a core architectural element of HDFS
• The query application (or YARN application types) specifies the data type; translation to the data type is part of the query (if necessary)
  • Storage format, data type, and query data types are decoupled
  • The three “V”s and the question of immutability are the user’s responsibility
• HDFS is designed for batch processing rather than interactive (or streaming) analysis by users
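To make the “looks like *nix” bullet concrete, here is a minimal sketch in Scala against the standard org.apache.hadoop.fs.FileSystem API. It assumes the Hadoop client libraries and a core-site.xml naming the cluster are on the classpath; /user is just an example path:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsLs {
  def main(args: Array[String]): Unit = {
    // Reads core-site.xml/hdfs-site.xml from the classpath to locate the cluster
    val conf = new Configuration()
    val fs = FileSystem.get(conf)
    // Lists a directory, much like `ls` on a *nix file system
    fs.listStatus(new Path("/user")).foreach { status =>
      println(s"${status.getPath}  replication=${status.getReplication}  len=${status.getLen}")
    }
  }
}

The same code runs against HDFS or a local file system; only the configured fs.defaultFS changes, which is part of what makes the file system “virtual”.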

HDFS OVERVIEW: ASSUMPTIONS AND GOALS (2)

• The emphasis is on high volume of data access rather than low latency of data access
• Moving computation is cheaper than moving data
  • A computation is efficient when executed where the data resides
  • HDFS provides interfaces so applications (YARN) can be moved to where the data is located
• Portability across heterogeneous hardware and software platforms
  • The HDFS services are portable from one platform to another
  • Portability facilitates adoption of HDFS as a viable virtual file system for a large range of applications, including non-ASF projects

HDFS ARCHITECTURE

• https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html

YARN OVERVIEW

• Yet Another Resource Negotiator
• YARN is the computation framework of Hadoop, whereas HDFS is the storage framework of Hadoop
• YARN and its services can support multiple application types, both batch (data at rest) and streaming (data in motion) oriented
• Two important points:
  • The ResourceManager can be configured for HA
  • YARN provides an API to bring legacy and new applications under YARN resource management and application HA (a minimal client sketch follows below)

• Refer to http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html#Writing_a_simple_Client
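As a hedged sketch of that API (not the full client from the guide above), the skeleton below uses org.apache.hadoop.yarn.client.api.YarnClient to ask the ResourceManager for a new application id; building the ApplicationSubmissionContext (AM command line, resources, queue) is elided:

import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

object YarnClientSkeleton {
  def main(args: Array[String]): Unit = {
    val conf = new YarnConfiguration()      // reads yarn-site.xml from the classpath
    val yarn = YarnClient.createYarnClient()
    yarn.init(conf)
    yarn.start()
    // Registers a new application with the ResourceManager
    val app = yarn.createApplication()
    println(s"Got application id: ${app.getNewApplicationResponse.getApplicationId}")
    // A real client would now populate an ApplicationSubmissionContext
    // and call yarn.submitApplication(...)
    yarn.stop()
  }
}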

YARN COMPONENTS

The ResourceManager communicates with the NodeManager(s) (NM) for the status of the application-type tasks running on each NM’s server. The ResourceManager arbitrates resources among all the applications in the system. The ApplicationMaster is an application-type-specific library responsible for negotiating resources from the ResourceManager for its application type and for working with the NodeManager(s) to execute and monitor the application’s tasks.

• http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html

HADOOP ECOSYSTEM

• Many Apache projects considered part of the “Hadoop Ecosystem” may be used without any reference to Hadoop. The following is only a sampling:
  • Avro™: a data serialization system
  • Cassandra™: a scalable multi-master database with no single points of failure
  • Chukwa™: a data collection system for managing large distributed systems
  • HBase™: a scalable, distributed NoSQL key-value database that supports structured data storage for large tables
  • Hive™: a SQL-presenting framework that provides an abstraction over the M/R paradigm
  • Kafka™: a high-throughput distributed messaging system
  • Mahout™: a scalable machine learning and data mining library
  • Mesos: abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to be built easily and run effectively
  • NiFi: successor to Flume; supports scalable directed graphs of data routing, transformation, and system mediation logic
  • Phoenix: a JDBC “skin” around HBase
  • Pig™: a high-level data-flow language and execution framework for parallel computation
  • Samza: a distributed stream-processing framework that uses Apache Kafka for messaging and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management
  • Spark: a fast and general engine for large-scale memory- or disk-resident data processing
  • ZooKeeper™: a high-performance coordination service for distributed applications

PIG

• Pig examples may be found in various locations on the web

HIVE

• Defining a table (see the JDBC sketch after this list):

hive> CREATE TABLE mytable (name STRING, age INT)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      STORED AS TEXTFILE;

• ROW FORMAT is a Hive-specific clause that indicates, here, that each row is comma-delimited text
• HiveQL statements are terminated with a semicolon ';'
• Other table operations:
  • SHOW TABLES
  • CREATE TABLE
  • ALTER TABLE
  • DROP TABLE
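Beyond the hive> shell, HiveQL can be run programmatically. A minimal Scala sketch over JDBC, assuming a HiveServer2 instance at a placeholder localhost:10000 and the Hive JDBC driver on the classpath:

import java.sql.DriverManager

object HiveQlOverJdbc {
  def main(args: Array[String]): Unit = {
    // HiveServer2 speaks the jdbc:hive2:// protocol
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
    val stmt = conn.createStatement()
    val rs = stmt.executeQuery("SELECT name, age FROM mytable WHERE age > 30")
    while (rs.next()) println(s"${rs.getString(1)}\t${rs.getInt(2)}")
    conn.close()
  }
}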

WHAT REALLY HAPPENS WITH HIVE

• Diagram courtesy Hortonworks

NIFI

• NiFi automates system-to-system dataflow
  • dataflow: the automated and managed flow of information between systems
  • dataflow patterns: Gregor Hohpe, Enterprise Integration Patterns
    • http://www.enterpriseintegrationpatterns.com/
• Apache NiFi provides directed graphs of
  • data routing
  • transformation
  • system mediation logic
• Documentation: http://nifi.apache.org/docs.html


NIFI OBJECTIVES (2)

• Web-based UI
• Configurable
  • Flow can be modified at runtime
• Data provenance
  • Track dataflow from beginning to end
• Designed for extension
  • Build your own processors
  • Enables rapid development and effective testing
• Secure
  • SSL, SSH, HTTPS, encrypted content, etc.
  • Pluggable role-based authentication/authorization


PHOENIX

• A relational database layer over HBase, delivered as a client-embedded JDBC driver, targeting low-latency queries over HBase data (see the sketch below)
• Compiles a SQL query into a series of HBase scans
• The running of the scans is orchestrated to produce regular JDBC result sets
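A minimal Scala sketch of the “client-embedded JDBC driver” idea; the ZooKeeper quorum address and the web_log table are placeholders:

import java.sql.DriverManager

object PhoenixQuery {
  def main(args: Array[String]): Unit = {
    // The Phoenix JDBC URL names the HBase cluster's ZooKeeper quorum
    Class.forName("org.apache.phoenix.jdbc.PhoenixDriver")
    val conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181")
    val stmt = conn.createStatement()
    // Phoenix compiles this SQL into HBase scans and returns an ordinary ResultSet
    val rs = stmt.executeQuery("SELECT host, COUNT(*) FROM web_log GROUP BY host")
    while (rs.next()) println(s"${rs.getString(1)} -> ${rs.getLong(2)}")
    conn.close()
  }
}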


PHOENIX ON TOP OF HBASE

• The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions automatically use the correct schema
• Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows

PERFORMANCE AND MEETING YOUR SLA

• It’s in the configuration: over 2,400 Java properties for Hadoop in 4 files
  • About 20 more for Hive
  • About 50 more for Pig
  • About 15 more for Spark
  • 400 more for HBase
  • …
• Each framework also relies on shell environment variables…
• Don’t forget about disk I/O, network issues, serialization…

CONFIGURATION PROPERTIES

• Properties are pervasive throughout the components of the Hadoop Ecosystem
• All components are “shipped” with “default” configuration settings that must be reviewed for applicability to each use case and cluster environment
• “Administrators” must review the properties and decide the following (see the sketch after this list):
  • Which properties are the organization’s “defaults”
  • Which properties must be marked final, so later configuration resources cannot override them
  • Which properties are intended to be superseded, by either the core-site.xml file or by job-specific parameters
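A minimal Scala sketch of that precedence, assuming the Hadoop client libraries on the classpath; dfs.replication is just an example property:

import org.apache.hadoop.conf.Configuration

object ConfigPrecedence {
  def main(args: Array[String]): Unit = {
    // Configuration loads core-default.xml, then core-site.xml, from the classpath;
    // later resources override earlier ones unless a property was marked
    // <final>true</final> in an earlier resource
    val conf = new Configuration()
    conf.addResource("hdfs-site.xml")   // site-specific overrides
    conf.set("dfs.replication", "2")    // a job-specific override from code
    println(conf.get("dfs.replication"))
  }
}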

PERFORMANCE “TUNING” THE HADOOP CLUSTER (2)

• There is NOT a “standard” configuration for optimal performance that can be set at installation time
• Search the web for “Hadoop Performance Tuning” for many URLs to reference on this topic
• Hadoop vendors have UIs that collect metrics and facilitate preliminary performance analysis
  • Using one of these is strongly suggested to start the journey
  • Other tools may be chosen and used; the criteria are usually specific function and user familiarity

HDFS-DEFAULT.XML / HDFS-SITE.XML

• Some of the ~1,200 properties of immediate interest for the NameNode are:

• Some properties of immediate interest for the DataNodes are:

• Images from http://hadoop.apache.org/docs/stable2/hadoop-project-dist/hadoop-common/ClusterSetup.html

HADOOP ENVIRONMENT SCRIPTS

• In addition to ensuring the configuration properties in the .xml files are set appropriately, the environment in which the Hadoop daemons execute must be set
• Values for environment variables that influence the Hadoop daemons are set in the following files:
  • Example hadoop-env.sh:
    • https://github.com/hanborq/hadoop/blob/master/example-confs/conf.secure/hadoop-env.sh
  • Example yarn-env.sh:
    • https://apache.googlesource.com/hadoop-common/+/2942a5bfbafd67655b0859d339a4e95a0b6d5044/hadoop-yarn-project/hadoop-yarn/conf/yarn-env.sh
• Properties specified in either of these files can (with caveats) be superseded by job parameters (via the command line, properties objects, or vendor-specific management consoles)

AMBARI CONSOLE

• http://ambari.apache.org/1.2.0/installing-hadoop-using-ambari/content/

WHEN TO [NOT] USE HADOOP

• When to use Hadoop:
  • Your data sets are really big
  • You celebrate data diversity
  • You have mad programming skills
  • You are building an enterprise data hub for the future
  • You find yourself throwing away perfectly good data
• When not to use Hadoop:
  • You need answers in a hurry
  • Your queries are complex and require extensive optimization
  • You require random, interactive access to data
  • You want to store sensitive data
  • You want to replace your data warehouse

• (http://www.facebook.com/pages/Datanami/124760547631010)

SLOW FRAMEWORK, COMPLEX DATA

• http://news360.com/article/246140284


BERKELEY AMPLAB

• Began January 2011
• 8 faculty, 40 students, 3 software engineers
• Funding from:
  • DARPA (XDATA)
  • NSF (CISE Expedition Grant)
  • Amazon, Google, SAP
• 18 commercial sponsors:
  • Cisco, Ericsson, Facebook, GE, Hortonworks, Intel, Microsoft, Oracle, Splunk, VMware, Yahoo!, and more…


APPROACH TO BDAS GOALS

• Support the combination of batch, streaming, and interactive computations with relative ease
  • A single execution model supports all computation models
• Support interactive and streaming computations via use of memory and parallelism
  • Memory transfer rates far exceed any disk configuration capability
  • RAM/SSD hybrid memories are beginning to appear
• Support the development of algorithms beyond simple M/R or current ML algorithms such as recommendation engines and K-means clustering
  • Provides Python and Scala shells
  • Provides abstractions for graph-based and ML algorithms
• Be compatible with existing Hadoop/HDFS and its ecosystem
  • Interoperates with existing storage and input formats (HDFS, Hive, Flume, etc.)
  • Supports existing execution models (Hive, GraphLab, etc.)

THE BDAS PROJECTS

• https://amplab.cs.berkeley.edu/software/

MESOS

• Apache Mesos abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to be built easily and run effectively
• Runs on every machine and provides applications (e.g., Hadoop, Spark, Kafka, Elasticsearch) with APIs for resource management and scheduling across entire datacenter and cloud environments
• Compatible with current Hadoop-related ASF projects
• Currently supporting 3,500+ servers at Twitter, with scalability to 10,000s

• https://amplab.cs.berkeley.edu/projects/mesos-dynamic-resource-sharing-for-clusters/

SPARK

• Spark is a general engine allowing the combination of multiple types of computations (e.g., SQL queries, text processing, and machine learning) that with Hadoop have required learning different engines (see the sketch below)
• Hadoop is disk-oriented; Spark is memory-oriented
• [Core] Spark is now an ASF project
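A minimal sketch of the engine in use: the classic word count in Scala. The local[*] master and input path are placeholders; on a cluster the same code reads hdfs:// paths:

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount").setMaster("local[*]"))
    val counts = sc.textFile("input.txt")   // any Hadoop-readable path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)                   // aggregation runs in memory
    counts.take(10).foreach(println)
    sc.stop()
  }
}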


SPARK (2)

• Spark as an engine is the basis of Spark SQL, Spark Streaming, MLlib, and GraphX
• Spark offers simple APIs in Python, Java, Scala, and SQL, and built-in libraries
• Spark can run in Hadoop clusters and access any Hadoop data source



SPARK CORE

• Spark Core provides the basic capabilities of Spark:
  • Task scheduling
  • Memory management
  • Fault recovery
  • Storage system usage
• Spark Core also provides the API that defines Resilient Distributed Datasets (RDDs)
  • RDDs represent a collection of items distributed across many compute nodes that can be manipulated in parallel (see the sketch below)
  • Spark Core provides many APIs for building and manipulating these collections
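A short Scala sketch of the RDD abstraction; the data is synthetic and local[*] is a placeholder master:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object RddBasics {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RddBasics").setMaster("local[*]"))
    // An RDD is a partitioned collection; each partition lives on some executor
    val nums = sc.parallelize(1 to 1000000, numSlices = 8)
    val squares = nums.map(n => n.toLong * n)
    // Keep computed partitions in memory for reuse; a lost partition is
    // recomputed from its lineage (fault recovery)
    squares.persist(StorageLevel.MEMORY_ONLY)
    println(squares.sum())
    sc.stop()
  }
}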



SPARK SQL

• Spark SQL provides a SQL interface to Spark that represents database tables as Spark RDDs and translates SQL queries into Spark operations
• Spark SQL allows developers to intermix SQL queries with the programmatic data manipulations supported by RDDs in Python, Java, and Scala, all within a single application (see the sketch below)
• Spark SQL was added to Spark in version 1.0
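A minimal Scala sketch of that intermixing, using the Spark 1.x-era SQLContext API; the Person case class and data are synthetic:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SparkSqlExample {
  case class Person(name: String, age: Int)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SparkSqlExample").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
    // Programmatic side: build a DataFrame from an RDD of case classes
    val people = sc.parallelize(Seq(Person("Ann", 34), Person("Bob", 29))).toDF()
    people.registerTempTable("people")
    // SQL side: query the same data, all in one application
    sqlContext.sql("SELECT name FROM people WHERE age > 30").collect().foreach(println)
    sc.stop()
  }
}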


SHARK

• Shark is a project out of UC Berkeley that pre-dates Spark SQL and is being ported to work on top of Spark SQL
• Shark provides additional functionality so that Spark can act as a drop-in replacement for Apache Hive
  • This includes a HiveQL shell, as well as a JDBC server that makes it easy to connect external graphing and data exploration tools



SPARK STREAMING

• Spark Streaming provides an API for manipulating data streams that closely resembles Spark Core’s RDD API (see the sketch below)
• Programmers already familiar with RDDs can easily learn the project and move between applications that manipulate data stored in memory, on disk, or arriving in real time
• Examples of data streams:
  • Log files generated by production web servers
  • Queues of messages containing status updates posted by users of a web service
• Spark Streaming provides the same degree of fault tolerance, throughput, and scalability that Spark Core provides
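A minimal Scala sketch showing how closely the DStream API mirrors the RDD API; the socket source (e.g. fed by `nc -lk 9999`) and the 5-second batch interval are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))   // 5-second micro-batches
    val lines = ssc.socketTextStream("localhost", 9999)
    // The same flatMap/map/reduceByKey chain used on RDDs, applied per batch
    val counts = lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}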



MLLIB

• Spark comes with a library containing common machine learning (ML) functionality called MLlib
• MLlib provides multiple types of machine learning algorithms, including:
  • Binary classification
  • Regression
  • Clustering
  • Collaborative filtering
  • Model evaluation
  • Data import
• MLlib also provides lower-level ML primitives, including a generic gradient descent optimization algorithm (a K-means sketch follows below)
• All of these methods are designed to scale out across a cluster
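A minimal Scala sketch of one of those algorithms, K-means clustering over toy 2-D points; a real job would load the RDD from HDFS:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object KMeansExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("KMeansExample").setMaster("local[*]"))
    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.2)))
    // Cluster into k = 2 groups, at most 20 iterations
    val model = KMeans.train(points, 2, 20)
    model.clusterCenters.foreach(println)
    sc.stop()
  }
}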



GRAPHX

• GraphX is a library added in Spark 0.9 that provides an API for manipulating graphs (e.g., a social network’s friend graph) and performing graph-parallel computations
• GraphX extends the Spark RDD API, allowing creation of a directed graph with arbitrary properties attached to each vertex and edge (see the sketch below)
• GraphX also provides a set of operators for manipulating graphs (e.g., subgraph and mapVertices) and a library of common graph algorithms (e.g., PageRank)
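A minimal Scala sketch: a tiny property graph plus one library algorithm. The vertices, edges, and tolerance are synthetic:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object GraphXExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("GraphXExample").setMaster("local[*]"))
    // Vertices carry arbitrary properties (here, user names)
    val users = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    // Directed edges carry properties too (here, a relationship label)
    val follows = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(3L, 2L, "follows")))
    val graph = Graph(users, follows)
    // A common graph algorithm from the library: PageRank to tolerance 0.001
    graph.pageRank(0.001).vertices.collect().foreach(println)
    sc.stop()
  }
}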

SPARKR

• An R package that provides a light-weight frontend to use Apache Spark from R
• Exposes the Spark API through the RDD class and allows users to interactively run jobs from the R shell on a cluster
• As of April 2015, SparkR has been officially merged into Apache Spark

TACHYON

• High-throughput, fault-tolerant in-memory storage
• Compatible with HDFS
• Supports Spark and Hadoop
• Succinct (requires Tachyon): queries on compressed RDDs


BLINKDB

• A massively parallel, approximate query engine for running interactive SQL queries on large volumes of data
• Allows users to trade off query accuracy for response time, enabling interactive queries over massive data by running queries on data samples and presenting results annotated with meaningful error bars
• BlinkDB was demonstrated live at VLDB 2012 on a 100-node Amazon EC2 cluster, answering a range of queries on 17 TB of data in less than 2 seconds (over 200x faster than Hive), within an error of 2–10%
• Deployed at Facebook

CLUSTER MANAGEMENT / RESOURCE NEGOTIATION


BDAS/HADOOP COMPATIBILITY

• Supports existing interfaces


BDAS/HADOOP COMPATIBILITY (2)

• Uses existing interfaces

SUMMARY

• The Apache Software Foundation
• BigData and the role of Hadoop
• Overview of Hadoop
• The Hadoop Distributed File System (HDFS)
• Yet Another Resource Negotiator (YARN)
• Application types:
  • Data at Rest (Batch)
  • Data in Motion (Streaming)
• A brief look at some “Hadoop Ecosystem” projects
• The Berkeley Data Analytics Stack
• Conclusion: they are just a bunch of processes, some designed to exploit disk, some designed to exploit memory. They can co-exist on the same servers.

THANK YOU!

Q & A