Hadoop and Beyond
BIG DATA: HADOOP AND BEYOND
Daryl Heinz
[email protected]

AGENDA
• The Apache Software Foundation
• Big Data and the role of Hadoop
• Overview of Hadoop
• The Hadoop Distributed File System (HDFS)
• Yet Another Resource Negotiator (YARN)
• Application types:
  • Data at Rest (Batch)
  • Data in Motion (Streaming)
• A brief look at some "Hadoop Ecosystem" projects
• The Berkeley Data Analytics Stack

BIG DATA
• The 3 V's and the issue of mutability
• What do you do with your current data infrastructure?
  • Start using it IN CONCERT WITH big data frameworks
• Big Data can be any or all of these (and more):
  • Clickstream
  • Geographic
  • Sensor/Machine
  • Sentiment
  • Server Logs
  • Text
• Big Data is poly-structured

THE APACHE SOFTWARE FOUNDATION
• OPEN
  • The ASF provides support for the Apache community of open-source software projects, which provide software products for the public good
• INNOVATION
  • ASF projects are defined by collaborative, consensus-based processes, an open, pragmatic software license, and a desire to create high-quality software that leads the way in its field
• COMMUNITY
  • We consider ourselves not simply a group of projects sharing a server, but rather a community of developers and users
• APACHE PROJECTS
  • The all-volunteer ASF develops, stewards, and incubates more than 350 open-source projects and initiatives that cover a wide range of technologies
  • [NOTE] This is where the "professional open-source", "hybrid", and "proprietary" vendors step in with their "distributions"
• http://www.apache.org/

ASF PROJECTS
• http://www.apache.org/

APACHE HADOOP
• The Hadoop project includes these modules:
  • Hadoop Common: the common utilities that support the other Hadoop modules
  • Hadoop Distributed File System (HDFS™): a distributed file system that provides high availability (HA) of poly-structured data
  • Hadoop YARN: a framework for job scheduling and cluster resource management
  • Hadoop MapReduce: a YARN-based application type for batch parallel processing of large data sets
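The MapReduce paradigm mentioned above can be illustrated with a toy, single-process sketch in Python — this mimics the map/shuffle/reduce flow of a word count, not the actual Hadoop API:

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit a (word, 1) pair for every word in every input record
    for record in records:
        for word in record.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle/reduce: group pairs by key and sum the counts per word
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big hadoop", "hadoop yarn"]
print(reduce_phase(map_phase(lines)))
# → {'big': 2, 'data': 1, 'hadoop': 2, 'yarn': 1}
```

In real Hadoop MapReduce the map and reduce phases run as parallel tasks on many nodes, with YARN managing their resources; the data flow, however, is the same.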
APACHE HADOOP
• http://www.slideshare.net/hortonworks/apache-hadoop-yarn-enabling-nex

HDFS OVERVIEW
• The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on "commodity" or low-cost hardware
• Services are fault-tolerant
• Data is replicated
• Increased data capacity is provided by "horizontal" rather than "vertical" scaling
• A "virtual" file system "that looks like *nix" is presented to the user
• HDFS is not POSIX-compliant

HDFS OVERVIEW: ASSUMPTIONS AND GOALS
• Hardware failure is the norm rather than the exception
  • An HDFS instance may consist of thousands of server machines, each storing part of the file system's data
  • Fault detection and automatic recovery from faults is a core architectural element of HDFS
• The query application (or YARN application type) specifies the data type; translation to that data type is part of the query (if necessary)
  • Storage format, data type, and query data types are decoupled
• The three "V"s and the question of immutability are the user's responsibility
• HDFS is designed for batch processing rather than interactive (or streaming) analysis by users
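The replication and horizontal-scaling points above have a simple arithmetic consequence worth seeing concretely. A small sketch (assuming HDFS's common defaults of a 128 MB block size and a replication factor of 3) shows how a file maps onto blocks and how much raw capacity it consumes:

```python
def split_into_blocks(file_size, block_size=128 * 1024 * 1024):
    # HDFS stores a file as a sequence of fixed-size blocks;
    # only the last block may be smaller than block_size
    blocks = []
    offset = 0
    while offset < file_size:
        blocks.append(min(block_size, file_size - offset))
        offset += block_size
    return blocks

def raw_storage_needed(file_size, replication=3):
    # Every block is replicated, so raw capacity = logical size * replication
    return sum(split_into_blocks(file_size)) * replication

one_gb = 1024 ** 3
print(len(split_into_blocks(one_gb)))            # → 8 (blocks of 128 MB)
print(raw_storage_needed(one_gb) == 3 * one_gb)  # → True
```

This is why adding data nodes ("horizontal" scaling) directly adds capacity: each new node contributes block storage, and the NameNode simply spreads replicas across more machines.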
HDFS OVERVIEW: ASSUMPTIONS AND GOALS (2)
• The emphasis is on high volume of data accessed rather than low latency of data access
• Moving computation is cheaper than moving data
  • A computation is efficient when executed where the data resides
  • HDFS provides interfaces for applications (YARN) so they can be moved to where the data is located
• Portability across heterogeneous hardware and software platforms
  • The HDFS services are portable from one platform to another
  • Portability facilitates adoption of HDFS as a viable virtual file system for a large range of applications, including non-ASF projects

HDFS ARCHITECTURE
• https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html

YARN OVERVIEW
• Yet Another Resource Negotiator
• YARN is the computation framework of Hadoop, whereas HDFS is the storage framework of Hadoop
• YARN and its services can support multiple application types, both batch (data at rest) and streaming (data in motion) oriented
• Two important points:
  • The ResourceManager can be configured for HA
  • YARN provides an API to bring legacy and new applications under YARN resource management and application HA
• Refer to http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html#Writing_a_simple_Client

YARN COMPONENTS
• The ResourceManager communicates with the NodeManager(s) (NM) for the status of tasks associated with the application type running on each server
• The ResourceManager arbitrates resources among all the applications in the system
• The ApplicationMaster is an application-type-specific library responsible for negotiating resources from the ResourceManager for its application type, and for working with the NodeManager(s) to execute and monitor the applications, or tasks
• http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html

HADOOP ECOSYSTEM
• Many Apache projects considered part of the "Hadoop Ecosystem" may be used without any reference to Hadoop
• The following is only a sampling:
  • Avro™: a data serialization system
  • Cassandra™: a scalable multi-master database with no single point of failure
  • Chukwa™: a data collection system for managing large distributed systems
  • HBase™: a scalable, distributed NoSQL key-value database that supports structured data storage for large tables
  • Hive™: a SQL-presenting framework that provides abstraction over the M/R paradigm
  • Kafka™: a high-throughput distributed messaging system
  • Mahout™: a scalable machine learning and data mining library
  • Mesos: abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to be built easily and run effectively
  • NiFi: successor to Flume; supports scalable directed graphs of data routing, transformation, and system mediation logic
  • Phoenix: a JDBC "skin" around HBase
  • Pig™: a high-level data-flow language and execution framework for parallel computation
  • Samza: a distributed stream processing framework that uses Apache Kafka for messaging and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management
  • Spark: a fast and general engine for large-scale memory- or disk-resident data processing
  • ZooKeeper™: a high-performance coordination service for distributed applications
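Several of the projects above (MapReduce, Spark, Samza) run on YARN and benefit from the earlier principle that "moving computation is cheaper than moving data". A toy, illustrative scheduler sketch (the node names and load numbers are hypothetical, and real YARN scheduling is far more involved) shows the core idea of preferring a node that already holds a replica of the data block:

```python
def schedule_task(block_locations, node_load):
    # Prefer a node that already holds a replica of the block (data locality);
    # among local candidates pick the least loaded; if no replica host is
    # available in the cluster, fall back to the least-loaded node overall.
    candidates = [n for n in block_locations if n in node_load]
    if candidates:
        return min(candidates, key=lambda n: node_load[n])
    return min(node_load, key=node_load.get)

node_load = {"node1": 2, "node2": 0, "node3": 5}
print(schedule_task(["node1", "node3"], node_load))  # → node1 (local, less loaded)
print(schedule_task(["node9"], node_load))           # → node2 (no local replica)
```

In real clusters the ResourceManager's scheduler weighs locality against fairness, queue capacity, and container sizes, but the locality preference sketched here is the reason data-heavy tasks usually run on the node storing their input block.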
PIG
• Found in various locations on the web

HIVE
• Defining a table:
  hive> CREATE TABLE mytable (name STRING, age INT)
        ROW FORMAT DELIMITED
        FIELDS TERMINATED BY ','
        STORED AS TEXTFILE;
• ROW FORMAT is a Hive-specific clause that indicates each row is comma-delimited text
• HiveQL statements are terminated with a semicolon ';'
• Other table operations:
  • SHOW TABLES
  • CREATE TABLE
  • ALTER TABLE
  • DROP TABLE
• Courtesy Hortonworks

WHAT REALLY HAPPENS WITH HIVE
• Courtesy Hortonworks

NIFI
• NiFi automates system-to-system dataflow
  • dataflow: the automated and managed flow of information between systems
  • dataflow patterns: Gregor Hohpe, Enterprise Integration Patterns
  • http://www.enterpriseintegrationpatterns.com/
• Apache NiFi provides directed graphs of:
  • data routing
  • transformation
  • system mediation logic
• Documentation: http://nifi.apache.org/docs.html
• 2/3/2016

NIFI OBJECTIVES (2)
• Web-based UI
• Configurable
  • Flow can be modified at runtime
• Data provenance
  • Track dataflow from beginning to end
• Designed for extension
  • Build your own processors
  • Enables rapid development and effective testing
• Secure
  • SSL, SSH, HTTPS, encrypted content, etc.
  • Pluggable role-based authentication/authorization

PHOENIX
• A relational database layer over HBase, delivered as a client-embedded JDBC driver targeting low-latency queries over HBase data
• Compiles a SQL query into a series of HBase scans
• The running of the scans is orchestrated to produce regular JDBC result sets

PHOENIX ON TOP OF HBASE
• The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema
• Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows
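The "compiles a SQL query into a series of HBase scans" point is the key to Phoenix's low latency: because HBase stores rows sorted by rowkey, a range predicate in SQL can become a bounded scan instead of a full-table read. A minimal, illustrative sketch (this simulates the idea with a sorted Python list, not the actual Phoenix planner or HBase API):

```python
import bisect

def scan_range(sorted_rowkeys, start_key, stop_key):
    # A Phoenix-style planner turns "WHERE key >= :start AND key < :stop"
    # into one HBase scan over the rowkey interval [start_key, stop_key);
    # since rowkeys are stored sorted, the scan touches only that slice.
    lo = bisect.bisect_left(sorted_rowkeys, start_key)
    hi = bisect.bisect_left(sorted_rowkeys, stop_key)
    return sorted_rowkeys[lo:hi]

keys = ["a01", "a02", "b01", "b02", "c01"]
print(scan_range(keys, "a02", "b02"))  # → ['a02', 'b01']
```

In the real system, Phoenix additionally parallelizes scans across regions and pushes filters and aggregates into coprocessors, which is where the "milliseconds for small queries" figure above comes from.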
PERFORMANCE AND MEETING YOUR SLA
• It's in the configuration: over 2,400 Java properties for Hadoop in 4 files
  • About 20 more for Hive
  • About 50 more for Pig
  • About 15 more for Spark
  • 400 more for HBase
  • …
• Each framework is also reliant on shell environment variables…
• And don't forget about the disk I/O, network issues, serialization…

CONFIGURATION PROPERTIES
• Properties are pervasive throughout the components of the Hadoop Ecosystem
• All components "ship" with "default" configuration settings that must be reviewed for applicability to each use case and cluster environment
• "Administrators" must review the properties and decide the following:
  • Which properties are the organization's defaults
  • Which properties must be marked with the <final> attribute
  • Which properties are intended to be superseded, either by the core-site.xml file or by job-specific parameters

PERFORMANCE "TUNING" THE HADOOP CLUSTER (2)
• There is NOT a "standard" configuration for optimal performance that can be set at installation time
• Search the web for "Hadoop Performance Tuning" for many references on this topic
• Hadoop vendors have UIs to collect and facilitate preliminary performance analysis
• Apache Ambari is strongly suggested to start the journey
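The `<final>` attribute mentioned above is how an administrator locks a site-wide property so job-specific parameters cannot override it. A small sketch of that merge behavior, using Hadoop's `*-site.xml` property format (the property names here are real Hadoop properties, but the values are made up for illustration):

```python
import xml.etree.ElementTree as ET

SITE_XML = """<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
    <final>true</final>
  </property>
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>1024</value>
  </property>
</configuration>"""

def load_properties(xml_text):
    # Parse a Hadoop-style site file into {name: value} plus the set
    # of names an administrator marked <final>true</final>
    props, finals = {}, set()
    for prop in ET.fromstring(xml_text).iter("property"):
        name = prop.findtext("name")
        props[name] = prop.findtext("value")
        if prop.findtext("final") == "true":
            finals.add(name)
    return props, finals

def apply_job_overrides(props, finals, overrides):
    # Job-specific parameters may supersede site defaults,
    # except properties locked with <final>
    for name, value in overrides.items():
        if name not in finals:
            props[name] = value
    return props

props, finals = load_properties(SITE_XML)
merged = apply_job_overrides(props, finals,
                             {"dfs.replication": "1",
                              "mapreduce.map.memory.mb": "2048"})
print(merged["dfs.replication"])          # → '3'  (final: override ignored)
print(merged["mapreduce.map.memory.mb"])  # → '2048'
```

This is the decision the slide describes: which properties become the organization's locked defaults, and which remain open to per-job tuning.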