Hadoop Programming Options
"Web Age Speaks!" Webinar Series Hadoop Programming Options Introduction Mikhail Vladimirov Director, Curriculum Architecture [email protected] Web Age Solutions Providing a broad spectrum of regular and customized training classes in programming, system administration and architecture to our clients across the world for over ten years ©WebAgeSolutions.com 2 Overview of Talk Hadoop Overview Hadoop Analytics Systems HDFS and MapReduce v1 & v2 (YARN) Hive Sqoop ©WebAgeSolutions.com 3 Hadoop Programming Options Hadoop Ecosystem Hadoop Hadoop is a distributed fault-tolerant computing platform written in Java Modeled after shared-nothing, massively parallel processing (MPP) system design Hadoop's design was influenced by ideas published in Google File System (GFS) and MapReduce white papers Hadoop can be used as a data hub, data warehouse or an analytic platform ©WebAgeSolutions.com 5 Hadoop Core Components The Hadoop project is made up of three main components: Common • Contains Hadoop infrastructure elements (interfaces with HDFS, system libraries, RPC connectors, Hadoop admin scripts, etc.) Hadoop Distributed File System • Hadoop Distributed File System (HDFS) running on clusters of commodity hardware built around the concept: load once and read many times MapReduce • A distributed data processing framework used as data analysis system ©WebAgeSolutions.com 6 Hadoop Simple Definition In a nutshell, Hadoop is a distributed computing framework that consists of: Reliable data storage (provided via HDFS) Analysis system (provided by MapReduce) ©WebAgeSolutions.com 7 High-Level Hadoop Architecture ©WebAgeSolutions.com 8 Hadoop-centric Analytics Systems The MapReduce (MR) engine The core data analytics component of the Hadoop project Apache Pig High-level analytics platform (auto-generates MR jobs) Apache Hive Data warehouse infrastructure platform; uses SQL-like programming language constructs that are transparently translated into MR jobs Apache Spark By-passes MR Cloudera Impala By-passes MR ©WebAgeSolutions.com 9 Other Hadoop-centric Systems and Tools Apache HBase Fast random access data store Apache Oozie Workflow management of interdependent jobs Apache Sqoop Hadoop-RDBMS import/export bridge Apache Flume Streaming data collection and HDFS ingestion framework Apache ZooKeeper Coordination, synchronization and naming registry service for distributed applications ©WebAgeSolutions.com 10 Main Hadoop Distributions Cloudera (http://www.cloudera.com) Open Source Hadoop Platform which consists of Apache Hadoop and additional key open source projects You have an option of an Express (free and unsupported) edition and Enterprise (paid supported) edition Hortonworks (http://hortonworks.com) Open Source Hadoop Platform with additional open source projects Runs on Linux, the Microsoft Windows and Azure cloud (through partnership with Microsoft) MapR (https://www.mapr.com) Offers a distribution of Hadoop with proprietary(and more efficient) components MapR was selected by Amazon Web Services (AWS) for their Elastic Map Reduce (EMR) service (a premium option) MapR is Google's technology partner ©WebAgeSolutions.com 11 Hadoop Programming Options The Hadoop Distributed File System ( HDFS ) Hadoop Distributed File System (HDFS) HDFS is a distributed, scalable, fault-tolerant and portable file system written in Java for the Hadoop framework HDFS's architecture is based on the master/slave design pattern An HDFS cluster consists of a NameNode (metadata server) holding directory information about files on 
  - A number of DataNodes, usually one per machine in the cluster
- HDFS's design is "rack-aware" to minimize network latency
- Data reliability is achieved by replicating data blocks across multiple machines

HDFS HA
- Until Hadoop 2.0, released in May 2012, the NameNode server had been a Single Point of Failure (SPoF)
- Hadoop included (and still does) a server called the Secondary NameNode, which does not provide failover capability for the NameNode
- The HDFS HA feature for a Hadoop cluster, when enabled, addresses the SPoF issue by providing the option of running two NameNodes in an Active/Passive configuration
- These two NameNodes are referred to as the Active and (hot) Standby servers
- This HA arrangement allows a fast failover to the Standby node when the Active NameNode fails or when an administrator initiates a failover

HDFS Security
- Authentication is configured at Hadoop's level by turning on service-level authentication
- By default, Hadoop runs in non-secure mode with no authentication required: this basically helps honest developers avoid making bad mistakes, but it won't stop hackers from wreaking havoc with the system
- When Hadoop is configured to run in secure mode, each user and service is required to be authenticated by Kerberos
- Service-level authorization, when turned on, checks user/service permissions for accessing a given service
- HDFS implements Unix-like file permissions: files have an owner, a group, and read/write permissions for local system users

HDFS Operational Orientation
- HDFS is designed for efficient implementation of the following data processing pattern:
  - Write once (normally, just load the data set onto the file system)
  - Read many times (for data analysis, etc.)
- HDFS functionality (and that of Hadoop) is geared towards batch-oriented rather than real-time scenarios
- Processing of data-intensive jobs can be done in parallel

Working with HDFS
- HDFS is NOT a Portable Operating System Interface (POSIX) compliant file system
- File-system commands (copy, move, delete, etc.) against HDFS can be performed in a number of ways:
  - Through the FS shell (which supports Unix-like commands: cat, chown, ls, mkdir, mv, etc.)
  - Using the Java API for HDFS (a sketch follows the command examples below)
  - Via the C-language wrapper around the Java API

Examples of HDFS Commands
- List HDFS recursively: hadoop fs -ls -R /
- Upload a file into HDFS: hadoop fs -put file.dat
- Get a file from HDFS: hadoop fs -get file.dat
- Remove a file from HDFS, skipping the .Trash bin: hadoop fs -rm -skipTrash file.dat
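As an illustration of the Java API route mentioned under "Working with HDFS", here is a minimal sketch of performing comparable operations programmatically; the NameNode address (hdfs://namenode:8020), the paths and the class name are illustrative assumptions.

// Minimal HDFS Java API sketch; the cluster address and paths are hypothetical
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsApiSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Normally read from core-site.xml; set explicitly here for clarity
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);

        // Roughly equivalent to: hadoop fs -put file.dat
        fs.copyFromLocalFile(new Path("file.dat"), new Path("/user/webage/file.dat"));

        // Roughly equivalent to: hadoop fs -ls /user/webage
        for (FileStatus status : fs.listStatus(new Path("/user/webage"))) {
            System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
        }

        // Roughly equivalent to: hadoop fs -rm (the boolean flag controls recursive deletion)
        fs.delete(new Path("/user/webage/file.dat"), false);

        fs.close();
    }
}

Compiled against the Hadoop client libraries, such a class would typically be run with the hadoop jar command so the cluster configuration is picked up from the classpath.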
MapReduce

The Client–Server Processing Pattern
- Good for small-to-medium data set sizes
- Fetching 1 TB worth of data might take longer than 1 hour
- We need a new approach … distributed computing!

Distributed Computing Challenges
- Process (software) / hardware failures
- Data consistency
- System (high) availability
- Distributed processing optimizations

MapReduce Programming Model
- MapReduce (one word) is a programming model for processing distributed [Big] data sets
- There are different definitions of what MapReduce is: a programming model, a parallel processing framework, a computational paradigm, a batch query processor
- MapReduce was first introduced by Google back in 2004
- Google applied for and was granted US Patent 7,650,331 on MapReduce, called "System and method for efficient large-scale data processing"

MapReduce's Programming Model
- The technique was influenced by functional programming languages that have the map and reduce functions
- MapReduce works by breaking the data processing into two phases: the map phase and the reduce phase
- The map and reduce phases are executed sequentially, with the output of the map operation piped (emitted) as input into the reduce operation

MapReduce Design Elements
- MapReduce's design is based on the shared-nothing architecture, which leads to efficiencies in distributed computing environments: the workload is split into units of work (independent tasks) that are intelligently distributed across the available Worker nodes for parallel processing
- MapReduce is a linearly scalable programming model

Fault-Tolerance of MapReduce
- MapReduce operations have a degree of fault-tolerance and built-in recoverability from certain types of run-time failures, which is leveraged in production-ready systems (e.g. Hadoop)
- MapReduce enjoys this quality of service due to its architecture based on process parallelism
- Failed units of work are rescheduled and resubmitted for execution should some mappers or reducers fail (provided the source data is still available)
- Note: High Performance (e.g. Grid) Computing systems that use Message Passing Interface communication for check-points are more difficult to program for failure recovery

MapReduce on Hadoop
- Hadoop's MapReduce engine runs against files stored on HDFS
- The MapReduce processing model is abstracted by Hadoop's Java API; Hadoop-related file I/O, networking and synchronization operations are hidden away from developers (see the word-count sketch at the end of this section)
- Each Mapper process is normally assigned a single HDFS block
- Intermediate data is written to local disk
- The shuffle-and-sort phase combines and transfers values with the same key to the same Reducer
- The developer specifies the number of needed Reducers
- Distributed data processing in Hadoop can be done using either MapReduce version 1 (the "Classic" MapReduce, or MRv1) or the newer MapReduce version 2 running on YARN

MapReduce v1 ("Classic MapReduce")
- Hadoop's MapReduce v1 (MRv1) framework consists of a single master JobTracker and multiple slave TaskTrackers, one TaskTracker per node
- A MapReduce job is launched by the JobTracker (the Classic MapReduce Master), which enlists a number of TaskTracker processes
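To make the map and reduce phases described above concrete, here is a minimal word-count sketch written against Hadoop's Java MapReduce API; the class names, the choice of two reducers and the input/output paths are illustrative assumptions.

// Word count: the map phase emits (word, 1) pairs, the shuffle/sort groups them
// by word, and the reduce phase sums the counts for each word.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: one mapper per HDFS block; emits (word, 1) for every word in its split
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: shuffle/sort delivers all counts for a given word to the same reducer
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver: the developer wires up the mapper, the reducer and the number of reducers
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(2);                                // reducer count chosen by the developer
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory (must not exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a JAR, such a job would typically be submitted with hadoop jar wordcount.jar WordCount <hdfs-input-dir> <hdfs-output-dir>, where the output directory must not already exist.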