Hadoop and Tools
• Various Linux Hadoop clusters around
  – http://hadoop.apache.org
  – Amazon EC2
• Windows and other platforms
  – The NetBeans plugin simulates Hadoop
  – The workflow view works on Windows
• Hadoop-based tools
  – NetBeans plugin for developing in Java
  – HBase: a distributed data store organized as a large table
  – Hive: a data warehouse with SQL-like queries
  – Pig: a high-level data processing script language
  – Mahout: machine learning algorithms on Hadoop

Installing Hadoop
http://hadoop.apache.org/
Supported Platforms
• GNU/Linux is supported as a development and production platform. Hadoop has been demonstrated on GNU/Linux clusters with 2000 nodes.
• Windows is also a supported platform.

Required Software
• Required software for Linux and Windows includes:
  – Java™ 1.6.x, preferably from Sun, must be installed.
  – ssh must be installed and sshd must be running to use the Hadoop scripts.
HDFS – Hadoop Distributed File System, modeled on Google GFS.
Hadoop MapReduce – Similar to Google MapReduce
HBase – Similar to Google Bigtable. Data is divided into tables.
A table is composed of columns, and columns are grouped into column families.

Example
Multi-dimensional map
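The multi-dimensional map view can be pictured as a sparse, nested map: (row key, column family, column qualifier, timestamp) → cell value. A minimal Python sketch of that shape (illustration only, not the HBase API; the row and column names follow the classic Bigtable-paper example):

```python
# HBase conceptually: a sparse, sorted map keyed by
# (row key, column family, column qualifier, timestamp) -> cell value.
# Plain nested dicts are enough to illustrate the shape.

table = {
    "com.cnn.www": {                      # row key
        "contents": {                     # column family
            "html": {                     # column qualifier
                6: "<html>v3...</html>",  # timestamp -> value
                5: "<html>v2...</html>",
            }
        },
        "anchor": {
            "cnnsi.com": {9: "CNN"},
            "my.look.ca": {8: "CNN.com"},
        },
    }
}

def get(row, family, qualifier, timestamp=None):
    """Return a cell value; with no timestamp, return the newest version."""
    versions = table[row][family][qualifier]
    ts = timestamp if timestamp is not None else max(versions)
    return versions[ts]

print(get("com.cnn.www", "anchor", "cnnsi.com"))    # CNN
print(get("com.cnn.www", "contents", "html", 5))    # <html>v2...</html>
```

Note how versioning falls out of the innermost timestamp map: reads default to the latest timestamp, while older versions stay addressable.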
Physical view
Problem with MapReduce
Hadoop supports data-intensive distributed applications using MapReduce.
However...
– MapReduce is hard to program (users already know SQL/bash/Python).
– No schema.

What is Hive?
A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
– ETL.
– Structure.
– Access to different storage.
– Query execution via MapReduce.

Key Building Principles:
– SQL is a familiar language.
– Extensibility – types, functions, formats, scripts.
– Performance.

Data Units
• Databases
• Tables
• Partitions
• Buckets (or Clusters)

Type System
Primitive types
– Integers: TINYINT, SMALLINT, INT, BIGINT.
– Boolean: BOOLEAN.
– Floating point numbers: FLOAT, DOUBLE.
– String: STRING.

Complex types
– Structs: {a INT; b INT}.
– Maps: M['group'] returns the value stored under key 'group'.
– Arrays: for A = ['a', 'b', 'c'], A[1] returns 'b' (indexing is zero-based).

Examples – DDL Operations
CREATE TABLE sample (foo INT, bar STRING) PARTITIONED BY (ds STRING);
SHOW TABLES '.*s';
DESCRIBE sample;
ALTER TABLE sample ADD COLUMNS (new_col INT);
DROP TABLE sample;

Examples – DML Operations
LOAD DATA LOCAL INPATH './sample.txt' OVERWRITE INTO TABLE sample PARTITION (ds='2012-02-24');
LOAD DATA INPATH '/user/falvariz/hive/sample.txt' OVERWRITE INTO TABLE sample PARTITION (ds='2012-02-24');

SELECTS and FILTERS
SELECT foo FROM sample WHERE ds='2012-02-24';
INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT * FROM sample WHERE ds='2012-02-24';
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hive-sample-out' SELECT * FROM sample;

Aggregations and Groups
SELECT MAX(foo) FROM sample;
SELECT ds, COUNT(*), SUM(foo) FROM sample GROUP BY ds;
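As a sketch of what a GROUP BY like the one above computes, here is the same per-ds count and sum in plain Python (the sample rows are made up for illustration):

```python
from collections import defaultdict

# Hypothetical rows of the sample table: (foo, bar, ds)
rows = [
    (3, "a", "2012-02-24"),
    (5, "b", "2012-02-24"),
    (2, "c", "2012-02-25"),
]

# SELECT ds, COUNT(*), SUM(foo) FROM sample GROUP BY ds;
groups = defaultdict(lambda: [0, 0])   # ds -> [count, sum(foo)]
for foo, bar, ds in rows:
    groups[ds][0] += 1
    groups[ds][1] += foo

for ds in sorted(groups):
    count, total = groups[ds]
    print(ds, count, total)   # 2012-02-24 2 8 / 2012-02-25 1 2
```

Hive compiles such a query into a MapReduce job: the map phase emits ds as the key, and the reduce phase performs exactly this per-key counting and summing.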
FROM sample s INSERT OVERWRITE TABLE bar SELECT s.bar, count(*) WHERE s.foo > 0 GROUP BY s.bar;

Join
CREATE TABLE customer (id INT, name STRING, address STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '#';
CREATE TABLE order_cust (id INT, cus_id INT, prod_id INT, price INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
SELECT * FROM customer c JOIN order_cust o ON (c.id = o.cus_id);

SELECT c.id, c.name, c.address, ce.exp FROM customer c JOIN (SELECT cus_id, sum(price) AS exp FROM order_cust GROUP BY cus_id) ce ON (c.id = ce.cus_id);

Multi-table insert – Dynamic partition insert
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
  SELECT pvs.viewTime, ... WHERE pvs.country = 'US'
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='CA')
  SELECT pvs.viewTime, ... WHERE pvs.country = 'CA'
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='UK')
  SELECT pvs.viewTime, ... WHERE pvs.country = 'UK';
With dynamic partition insert, the country partition value is taken from the data itself:

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
  SELECT pvs.viewTime, ...

Apache Pig
MapReduce Not Good Enough?
• Restricted programming model
  – Only two phases
  – Single job chain for long data flows
• Put the logic at the right phase
  – Programmers are responsible for this
• Too many lines of code, even for simple logic
  – How many lines do you have for word count?
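For comparison, here is word count expressed as explicit map, shuffle, and reduce steps in a single-machine Python sketch (illustrative only, not the Hadoop API); the equivalent Java MapReduce program typically runs to dozens of lines of boilerplate:

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit (word, 1) for every word in the line.
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    # Reduce phase: sum the counts for one word.
    return word, sum(counts)

def word_count(lines):
    # Shuffle phase: group all emitted values by key between map and reduce.
    shuffled = defaultdict(list)
    for line in lines:
        for word, count in mapper(line):
            shuffled[word].append(count)
    return dict(reducer(w, c) for w, c in shuffled.items())

print(word_count(["to be or", "not to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Even in this compressed form, the programmer must write the logic for all three phases by hand; Pig's value is expressing the same computation as a few dataflow operations.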
Pig to the Rescue
• High-level dataflow language (Pig Latin)
• Much simpler than Java
  – Simplifies the data processing
  – Puts the operations at the appropriate phases
• Chains multiple MapReduce jobs
Motivation by Example
Suppose we have user data in one file and website data in another file.
We need to find the top 5 most visited pages by users aged 18-25.
In MapReduce
In Pig
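The Pig script shown on the slide is a figure and is not reproduced here, but the dataflow it expresses (load, filter users by age, join with visits, count visits per page, take the top 5) can be sketched in Python with made-up sample data:

```python
from collections import Counter

# Hypothetical inputs: users(name, age) and page visits(user, url)
users = [("alice", 20), ("bob", 40), ("carol", 19)]
visits = [("alice", "/a"), ("alice", "/b"), ("carol", "/a"),
          ("bob", "/c"), ("carol", "/b"), ("alice", "/a")]

# FILTER users BY age >= 18 AND age <= 25
young = {name for name, age in users if 18 <= age <= 25}

# JOIN visits with the filtered users, then GROUP BY url and COUNT
counts = Counter(url for user, url in visits if user in young)

# ORDER BY count DESC, LIMIT 5
top5 = counts.most_common(5)
print(top5)   # [('/a', 3), ('/b', 2)]
```

In Pig each commented step corresponds to one operation (FILTER, JOIN, GROUP, ORDER, LIMIT), and Pig compiles the whole chain into the necessary sequence of MapReduce jobs.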
Pig runs over Hadoop
Pig
• Data flow language
  – Users specify a sequence of operations to process data
  – More control over the process, compared with a declarative language
• Various data types supported
• Schema supported
• User-defined functions supported
Machine Learning
• “Machine Learning is programming computers to optimize a performance criterion using example data or past experience”
• A subset of Artificial Intelligence

Types
• Supervised
  – Using labeled training data, create a function that predicts the output for unseen inputs
• Unsupervised
  – Using unlabeled data, find structure or groupings in the data
• Semi-Supervised
  – Uses both labeled and unlabeled data

Example: Clustering
• Unsupervised
• Find natural groupings
  – Documents
  – Search results
  – People
  – Genetic traits in groups
  – Many, many more uses

Example: Collaborative Filtering
• Unsupervised
• Recommend people and products
  – User-User
    » User likes X, you might too
  – Item-Item
    » People who bought X also bought Y (e.g., Amazon.com)

Example: Classification/Categorization
• Many, many types
• Spam filtering
• Named Entity Recognition (NER)
• Phrase identification
• Sentiment analysis
• Classification into a taxonomy
Example: Information Retrieval
• Learning Ranking Functions
• Learning Spelling Corrections
• User click analysis and tracking

Other
• Image analysis
• Robotics
• Games
• Higher-level natural language processing
• Many, many others

What is Apache Mahout?
• A mahout is an elephant trainer/driver/keeper, hence…
Hadoop (and other distributed techniques) + Machine Learning = Mahout
Goal: scalable machine learning algorithms under the Apache License.

What?
• Hadoop brings:
  – Map/Reduce API
  – HDFS
  – In other words, scalability and fault-tolerance
• Mahout brings:
  – A library of machine learning algorithms
  – Examples

Why Mahout?
• Many open source ML libraries either:
  – Lack community
  – Lack documentation and examples
  – Lack scalability
  – Lack the Apache License ;-)
  – Or are research-oriented

Current Status
• What’s in Mahout:
  – Simple Matrix/Vector library
  – Taste Collaborative Filtering
  – Clustering
    » Canopy / K-Means / Fuzzy K-Means / Mean-Shift / Dirichlet
  – Classifiers
    » Naïve Bayes
    » Complementary NB
  – Evolutionary
    » Integration with Watchmaker for fitness functions
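Mahout’s clustering offerings include K-Means. As a minimal single-machine sketch of the K-Means loop (this is not Mahout’s distributed implementation; the sample points are made up):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain K-Means on tuples of floats: repeatedly assign each point
    to its nearest centroid, then move each centroid to the mean of
    its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assignment step: nearest centroid by squared distance.
            nearest = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[nearest].append(p)
        # Update step: each centroid moves to its cluster mean
        # (an empty cluster keeps its old centroid).
        centroids = [
            tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids

pts = [(0.0, 0.0), (0.1, 0.2), (9.0, 9.0), (9.1, 8.8)]
print(sorted(kmeans(pts, 2)))  # two centroids, near (0.05, 0.1) and (9.05, 8.9)
```

Mahout distributes exactly these two steps: the assignment step becomes the map phase and the centroid recomputation becomes the reduce phase, with one MapReduce job per iteration.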