Your Expert Guide to Hadoop Big Data Platforms
E-guide
Hadoop Big Data Platforms Buyer’s Guide – part 3
Your expert guide to Hadoop big data platforms

In this e-guide
A look at Amazon Elastic MapReduce cloud-based Hadoop
Learn more about the Cloudera Hadoop distribution
Inside the Hortonworks open enterprise Hadoop distribution
Inside the IBM BigInsights platform for big data management
Inside the MapR Hadoop distribution for managing big data
Inside the Microsoft Azure HDInsight cloud infrastructure

A look at Amazon Elastic MapReduce cloud-based Hadoop

Abie Reifer, DecisionWorx

The Amazon Elastic MapReduce Web service offers a managed Hadoop framework that enables users to distribute and process big data across dynamically scalable Amazon EC2 instances.

Amazon Elastic MapReduce (EMR) provides users access to a cloud-based Hadoop implementation for analyzing and processing large amounts of data. Built on top of Amazon's cloud services, EMR leverages Amazon's Elastic Compute Cloud and Simple Storage services, enabling users to provision a Hadoop cluster quickly. Amazon's cloud elasticity and setup tools also give users a way to temporarily scale up a cloud-based Hadoop cluster for short-term increased computing capacity.

Amazon EMR lets users focus on the design of their workflow without the distractions of configuring a Hadoop cluster. As with other Amazon cloud services, users pay for only what they use.

Amazon Elastic MapReduce features

The current version of Amazon EMR, 4.3.0, bundles several open source applications, a set of components for users to monitor and manage cluster resources, and components that enable application and cluster interoperability with other services.
The following open source applications come bundled as part of Amazon EMR:

Apache Hadoop 2.7.1
Apache Hive 1.0.0
Apache Mahout 0.11.0
Apache Pig 0.14.0
Apache Spark
Hue
Ganglia 3.7.2

AWS Elastic MapReduce also provides users with the option of using MapR's Hadoop distribution in place of Apache Hadoop.

The EMR Web service supports several file system storage options used for data processing. These include the Hadoop Distributed File System (HDFS) for local and remote file systems, and S3 buckets accessed through the EMR File System, as well as other Amazon data services. Amazon EMR also integrates with several data services, including Amazon DynamoDB, a fast NoSQL database; Amazon Relational Database Service; Amazon Glacier; Amazon Redshift, a petabyte-scale data warehouse service; and AWS Data Pipeline, a service used to move data between AWS services.

Other AWS Elastic MapReduce features enable users to perform the following tasks:

Provision an EMR cluster. An EMR management console helps users quickly navigate through the process of spinning up and autoconfiguring an EMR instance. Through the console, users select the applications from the EMR bundle to install, the types of server instances to use for the cluster nodes, and the security access policies and controls for the cluster.

Load data into the cluster.
Users with typical data volumes can transfer data to an Amazon S3 bucket, where it is available to the cluster for processing. Users with petabyte-scale needs may opt to use AWS Snowball, a secure, high-speed appliance that's shipped to the user, or AWS Direct Connect, an established high-speed data connection between AWS and the user's data center.

Monitor and manage. Amazon EMR collects metrics that are used to track progress and measure the health of a cluster. These metrics can be accessed through the command line interface, software development kits or APIs, and they can also be viewed through the EMR management console. Additionally, Amazon CloudWatch can be used along with Ganglia to monitor the cluster and set alarms on events triggered by these metrics.

AWS Elastic MapReduce pricing

Amazon's EMR pricing model is based on the company's approach to pricing for its other Web services. Users pay according to the amount of time and the types of instance servers used. Spot instances can also be used for some or all of the nodes in a cluster, providing users with a level of elasticity that can be changed based on their dynamic computing needs.

Amazon provides developers with a wide range of online technical documentation, guides, tutorials and sample code.
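
To make the pricing model above concrete, the following is a minimal sketch of how a buyer might estimate the cost of a cluster run as instance count multiplied by hourly rate multiplied by hours, summed across node groups. The `NodeGroup` class, the `estimate_cluster_cost` function and all hourly rates are hypothetical and chosen only for illustration; they are not actual AWS prices, and the sketch does not model billing granularity or fluctuating spot prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeGroup:
    """A group of identical cluster nodes billed at one hourly rate."""
    name: str
    instance_count: int
    hourly_rate_usd: float  # hypothetical rate, NOT a real AWS price
    is_spot: bool = False   # informational flag only; not used in the math


def estimate_cluster_cost(groups, hours):
    """Sum instance_count * hourly_rate_usd * hours over all node groups."""
    if hours < 0:
        raise ValueError("hours must be non-negative")
    return sum(g.instance_count * g.hourly_rate_usd * hours for g in groups)


# Hypothetical cluster: on-demand master and core nodes,
# plus spot task nodes added for short-term extra capacity.
groups = [
    NodeGroup("master", 1, 0.50),
    NodeGroup("core", 4, 0.50),
    NodeGroup("task", 10, 0.25, is_spot=True),
]

baseline = estimate_cluster_cost(groups[:2], hours=8)
with_burst = estimate_cluster_cost(groups, hours=8)
print(f"baseline: ${baseline:.2f}, with spot task nodes: ${with_burst:.2f}")
```

Under these assumed rates, adding the ten spot task nodes for the same eight hours doubles the estimate from $20.00 to $40.00, which illustrates how the bill tracks the number and type of instances in use.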
Learn more about the Cloudera Hadoop distribution

Abie Reifer, DecisionWorx

Cloudera's distribution, including Apache Hadoop, provides an analytics platform and the latest open source technologies to store, process, discover, model and serve large amounts of data.

CDH, the Cloudera Hadoop distribution, includes several related open source projects, such as Impala and Search. It also provides security and integration with several hardware and software products.

The Impala framework in CDH allows users to execute interactive SQL queries directly against data stored in Hadoop Distributed File System (HDFS), Apache HBase or the Amazon Simple Storage Service. Impala uses several technologies and components from Hive, including SQL syntax (Hive SQL), the Open Database Connectivity (ODBC) driver and the query UI (Hue, which is also used by Hive).

As part of CDH, Cloudera Search incorporates Apache Solr, a data indexing and search platform based on Lucene. The integration of this technology as part of CDH provides users with near real-time indexing of and access to data directly stored in Hadoop and HBase. Solr indexing and search technology enables users to perform complex textual searches while requiring little or no SQL or programming skills.
Solr also allows queries to be performed directly against the Hadoop data store, removing the need to move large data sets to perform complex queries.

Other related open source projects included in CDH from Apache are Flume, HBase, Hive, Hue, Oozie, Spark, Sqoop and Sentry (incubating).

Editions of the Cloudera Hadoop distribution

Cloudera offers several implementation editions of CDH that provide differing levels of cluster and service management capabilities as well as different levels of support:

Cloudera Express is free to use and includes CDH, as well as core features of Cloudera Manager.

Cloudera Manager provides CDH administrators with an intuitive Web-based management console to deploy, manage, monitor and diagnose issues with CDH deployments. The tool also includes an API that can be used to programmatically configure the system and collect metric and health information about a CDH cluster.

Cloudera Enterprise is a licensed edition that provides extended capabilities to CDH with the inclusion of additional advanced features from Cloudera Manager and Navigator. Technical support options are also available to customers that have purchased an enterprise license. Cloudera Enterprise is available in three editions, each offering varying levels of service management capabilities:

The Basic edition provides management capabilities to support a cluster running core CDH services that include HDFS, Hive, Hue, MapReduce, Oozie, Sqoop, Yet Another Resource Negotiator (YARN) and ZooKeeper.
The Flex edition supports the management of a cluster running core CDH services plus one of the following: Accumulo, HBase, Impala, Navigator, Solr or Spark.

The Data Hub edition supports the management of a cluster running core CDH services plus any of the following: Accumulo, HBase, Impala, Navigator, Solr or Spark.

Cloudera Manager Advanced Features add the following to the core