Big Data Analysis Using Hadoop Lecture 4 Hadoop Ecosystem
Overview
• Hive
• HBase
• Sqoop
• Pig
• Mahout / Spark / Flink / Storm
• Hadoop & Data Management Architectures

Hive

• Data warehousing solution built on top of Hadoop
• Provides a SQL-like query language named HiveQL
• Minimal learning curve for people with SQL expertise
• Data analysts are the target audience
• Ability to bring structure to various data formats
• Simple interface for ad hoc querying, analyzing and summarizing large amounts of data
• Access to files on various data stores such as HDFS, etc.
Website: http://hive.apache.org/
Download: http://hive.apache.org/downloads.html
Documentation: https://cwiki.apache.org/confluence/display/Hive/LanguageManual

Hive
• Hive does NOT provide low-latency or real-time queries
• Even querying small amounts of data may take minutes
• Designed for scalability and ease of use rather than low-latency responses
• Translates HiveQL statements into a set of MapReduce jobs which are then executed on a Hadoop cluster
• This is changing

Hive Concepts
• Re-used from relational databases
  • Database: set of tables, used for name-conflict resolution
  • Table: set of rows that have the same schema (same columns)
  • Row: a single record; a set of columns
  • Column: provides value and type for a single value
• Tables can be divided up based on
  • Partitions
  • Buckets

Hive – Let's work through a simple example
1. Create a Table
2. Load Data into a Table
3. Query Data
4. Drop a Table

Hive – 1. Create a table

hive> !cat data/user-posts.txt;
user1,Funny Story,1343182026191
user2,Cool Deal,1343182133839
user4,Interesting Post,1343182154633
user5,Yet Another Blog,13431839394

Values are separated by ',' and each row represents a record; the first value is the user name, the second is the post content and the third is the timestamp.

hive> CREATE TABLE posts (user STRING, post STRING, time BIGINT)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > STORED AS TEXTFILE;
OK
Time taken: 10.606 seconds

1st line: creates a table with 3 columns
2nd and 3rd lines: how the underlying file should be parsed
4th line: how to store the data

hive> show tables;
OK
posts
Time taken: 0.221 seconds

hive> describe posts;
OK
user    string
post    string
time    bigint
Time taken: 0.212 seconds

Hive – 2. Load Data into a Table

hive> LOAD DATA LOCAL INPATH 'data/user-posts.txt'
    > OVERWRITE INTO TABLE posts;
Copying data from file:/home/hadoop/Training/play_area/data/user-posts.txt
Copying file: file:/home/hadoop/Training/play_area/data/user-posts.txt
Loading data to table default.posts
Deleted /user/hive/warehouse/posts
OK
Time taken: 5.818 seconds

Existing records in the posts table are deleted; the data in user-posts.txt is loaded into Hive's posts table.

$ hdfs dfs -cat /user/hive/warehouse/posts/user-posts.txt
user1,Funny Story,1343182026191
user2,Cool Deal,1343182133839
user4,Interesting Post,1343182154633
user5,Yet Another Blog,13431839394
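Once the data is loaded, ad hoc summaries like the one below are possible before moving on to the count query in step 3. This is a minimal sketch that is not part of the original walkthrough; it assumes the posts table loaded above and counts how many posts each user has made:

hive> SELECT user, count(1) AS num_posts
    > FROM posts
    > GROUP BY user;

Like every other HiveQL statement here, this query is translated into a MapReduce job before it runs.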
Hive – 3. Query Data

hive> select count(1) from posts;
Total MapReduce jobs = 1
Launching Job 1 out of 1
...
Starting Job = job_1343957512459_0004, Tracking URL = http://localhost:8088/proxy/application_1343957512459_0004/
Kill Command = hadoop job -Dmapred.job.tracker=localhost:10040 -kill job_1343957512459_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-08-02 22:37:24,962 Stage-1 map = 0%, reduce = 0%
2016-08-02 22:37:30,497 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.87 sec
2016-08-02 22:37:31,577 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.87 sec
2016-08-02 22:37:32,664 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.64 sec
MapReduce Total cumulative CPU time: 2 seconds 640 msec
Ended Job = job_1343957512459_0004
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 2.64 sec HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 640 msec
OK
4
Time taken: 14.204 seconds

Counts the number of records in the posts table. The HiveQL statement is transformed into one MapReduce job; the result is 4 records. How long did it take to run?

Hive – 3. Query Data

hive> select * from posts where user="user2";
...
OK
user2   Cool Deal          1343182133839
Time taken: 12.184 seconds

Selects the records for "user2".

hive> select * from posts where time<=1343182133839 limit 2;
...
OK
user1   Funny Story        1343182026191
user2   Cool Deal          1343182133839
Time taken: 12.003 seconds

Selects the records whose timestamp is less than or equal to the provided value. Usually there are too many results to display, so the LIMIT clause can be used to bound how many rows are shown.

Hive – 4. Drop a Table

hive> DROP TABLE posts;
OK
Time taken: 2.182 seconds
hive> exit;

$ hdfs dfs -ls /user/hive/warehouse/

If Hive was managing the underlying file, it is removed along with the table.

Partitions and Buckets

• Partitions divide data by grouping similar types of data together based on a column or partition key. Each table can have one or more partition keys that identify a particular partition. Partitions allow faster queries on slices of the data.
• Buckets divide each partition (or the unpartitioned table) into N buckets based on the hash of a column (or columns) in the table.

Partitions and Buckets

CREATE TABLE page_view(
    viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;

Partitions and Buckets

CREATE TABLE page_view(
    viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
STORED AS SEQUENCEFILE;

The table is clustered by a hash function of userid into 32 buckets. Within each bucket the data is sorted in increasing order of viewTime.
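As a quick illustration of how the partitions defined above are used (a sketch, not from the original slides; the input path, date and country values are hypothetical), data is loaded into one partition and queries that filter on the partition keys only read that slice:

hive> LOAD DATA INPATH '/data/page_views/2018-01-16-US'
    > INTO TABLE page_view
    > PARTITION (dt='2018-01-16', country='US');

hive> SELECT page_url, count(1) AS views
    > FROM page_view
    > WHERE dt='2018-01-16' AND country='US'
    > GROUP BY page_url;

Because dt and country are partition keys, the WHERE clause lets Hive scan only the matching partition instead of the whole table.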
Old vs New Hive
• Old Hive = MapReduce
• New Hive = MapReduce, Spark, Tez
• Tez = an application framework which allows a complex directed acyclic graph of tasks for processing data

Hive – Word Count

CREATE TABLE docs (line STRING);

LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs;

CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
    (SELECT explode(split(line, '\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;

HBase

• HBase is a distributed, column-oriented database built on top of the Hadoop file system.
• HBase is a data model, similar to Google's BigTable, designed to provide quick random access to huge amounts of unstructured data.
• Column-oriented database
• The components of the HBase data model are tables, rows, column families, columns, cells and versions.
• Tables are logical collections of rows stored in separate partitions.
• Data in a row are grouped together as column families.
• Each column family has one or more columns, and the columns in a family are stored together.
Website: http://hbase.apache.org/
Download: http://www.apache.org/dyn/closer.cgi/hbase/
Documentation: http://hbase.apache.org/book.html

HBase
• Non-ACID-compliant database
• HBase shell – with a limited range of commands
• Libraries for Java and many other languages
• MapReduce is supported

HBase
• Good at
  • Single random selects and range scans
  • Querying one or a small subset of columns
  • Data compaction, as nulls are ignored
• Not so good at
  • Transactions
  • Joins, group-bys, arbitrary WHERE filtering
  • Only one index (the row key) per table

Pig

• "is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs."
• Pig is an abstraction on top of Hadoop
• Provides a high-level programming language designed for data processing
• Converted into MapReduce and executed on Hadoop clusters
• MapReduce requires programmers
  • Must think in terms of map and reduce functions
  • More than likely will require Java programmers
• Pig provides a high-level language that can be used by
  • Analysts, data scientists, statisticians, etc.
  • A different type of user compared to those who write MR functions
Website: http://pig.apache.org/
Download: http://pig.apache.org/releases.html
Documentation: http://pig.apache.org/docs/r0.16.0/

Pig
• Pig's infrastructure layer consists of
  • a compiler that produces sequences of MapReduce programs
• Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:
  • Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks composed of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
  • Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
  • Extensibility. Users can create their own functions to do special-purpose processing.
• Pig Latin
  • Command-based language
  • Data flow language rather than procedural or declarative
  • Designed specifically for data transformation and flow expression
• Pig compiler converts Pig Latin to MapReduce
  • The compiler strives to optimize execution
  • You automatically get optimization improvements with Pig updates
• Provides common operations like join, group, filter, sort

Pig – Examples – Aggregation
• Let's count the number of times each user appears in the data set.

log  = LOAD 'excite-small.log' AS (user, timestamp, query);
grpd = GROUP log BY user;
cntd = FOREACH grpd GENERATE group, COUNT(log) AS cnt;
STORE cntd INTO 'output';

Results:
002BB5A52580A8ED  18
005BD9CD3AC6BB38  18
...
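To illustrate the filter and sort operations mentioned above, the aggregation example can be extended. This is a sketch, not from the original slides; the input file and field names follow the example above, and the threshold of 10 is an assumed value for illustration:

log    = LOAD 'excite-small.log' AS (user, timestamp, query);
grpd   = GROUP log BY user;
cntd   = FOREACH grpd GENERATE group AS user, COUNT(log) AS cnt;
active = FILTER cntd BY cnt > 10;
srtd   = ORDER active BY cnt DESC;
STORE srtd INTO 'active-users';

Each statement defines a new relation in the data flow; the Pig compiler turns the whole script into one or more MapReduce jobs once STORE (or DUMP) is reached.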