Hadoop and Tools

• Various Hadoop clusters around
  – http://hadoop.apache.org
  – Amazon EC2
• Windows and other platforms
  – The NetBeans plugin simulates Hadoop
  – The workflow view works on Windows
• Hadoop-based tools
  – For developing in Java, NetBeans plugin
• HBase, a distributed data store organized as a large table
• Hive, a data warehouse with SQL
• Pig, a SQL-like high-level data processing script language
• Mahout, machine learning algorithms on Hadoop

Installing Hadoop

http://hadoop.apache.org/

Supported Platforms

• GNU/Linux is supported as a development and production platform. Hadoop has been demonstrated on GNU/Linux clusters with 2000 nodes.
• Windows is also a supported platform.

Required Software

• Required software for Linux and Windows includes:
  – Java™ 1.6.x, preferably from Sun, must be installed.
  – ssh must be installed and sshd must be running to use the Hadoop scripts that manage remote Hadoop daemons.

• HDFS – Hadoop Distributed File System, modeled on Google GFS
• Hadoop MapReduce – Similar to Google MapReduce
• HBase – Similar to Google Bigtable
• Data is divided into various tables
• A table is composed of columns; columns are grouped into column families

Example

Multi-dimensional map
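As a rough illustration of the multi-dimensional map idea, the sketch below models a table as nested Python dictionaries keyed by row key, then column (written family:qualifier), then timestamp. The `put` and `get` helpers, the row keys, and the column names are hypothetical and are not the HBase client API.

```python
from collections import defaultdict

# table[row_key][column][timestamp] -> value
# A column is written "family:qualifier"; all names below are made up.
table = defaultdict(lambda: defaultdict(dict))


def put(row, column, timestamp, value):
    """Store one version of one cell."""
    table[row][column][timestamp] = value


def get(row, column, timestamp=None):
    """Return the newest version at or before `timestamp` (newest overall if None)."""
    versions = table.get(row, {}).get(column, {})
    candidates = [t for t in versions if timestamp is None or t <= timestamp]
    if not candidates:
        return None
    return versions[max(candidates)]


put("com.example.www", "contents:html", 5, "<html>v1</html>")
put("com.example.www", "contents:html", 9, "<html>v2</html>")
put("com.example.www", "anchor:home", 9, "Example")

print(get("com.example.www", "contents:html"))     # newest version
print(get("com.example.www", "contents:html", 6))  # version as of timestamp 6
print(get("com.example.org", "contents:html"))     # missing row gives None
```

Cells that were never written take no space in this structure, which is the sense in which such a table is sparse.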

Physical view

Problem with MapReduce

Hadoop supports data-intensive distributed applications using MapReduce.

However...
– MapReduce is hard to program (users know SQL/bash/Python).
– No schema.

What is Hive?

A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
– ETL.
– Structure.
– Access to different storage.
– Query execution via MapReduce.

Key Building Principles:
– SQL is a familiar language
– Extensibility – Types, Functions, Formats, Scripts
– Performance

Data Units

– Databases.
– Tables.
– Partitions.
– Buckets (or Clusters).

Type System

Primitive types
– Integers: TINYINT, SMALLINT, INT, BIGINT.
– Boolean: BOOLEAN.
– Floating point numbers: FLOAT, DOUBLE.
– String: STRING.

Complex types
– Structs: {a INT; b INT}.
– Maps: M['group'].
– Arrays: ['a', 'b', 'c'], A[1] returns 'b'.

Examples – DDL Operations

CREATE TABLE sample (foo INT, bar STRING) PARTITIONED BY (ds STRING);
SHOW TABLES '.*s';
DESCRIBE sample;
ALTER TABLE sample ADD COLUMNS (new_col INT);
DROP TABLE sample;

Examples – DML Operations

LOAD DATA LOCAL INPATH './sample.txt' OVERWRITE INTO TABLE sample PARTITION (ds='2012-02-24');

LOAD DATA INPATH '/user/falvariz/hive/sample.txt' OVERWRITE INTO TABLE sample PARTITION (ds='2012-02-24');

SELECTS and FILTERS

SELECT foo FROM sample WHERE ds='2012-02-24';

INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT * FROM sample WHERE ds='2012-02-24';

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hive-sample-out' SELECT * FROM sample;

Aggregations and Groups

SELECT MAX(foo) FROM sample;

SELECT ds, COUNT(*), SUM(foo) FROM sample GROUP BY ds;

FROM sample s
INSERT OVERWRITE TABLE bar
SELECT s.bar, count(*) WHERE s.foo > 0 GROUP BY s.bar;

Join

CREATE TABLE customer (id INT, name STRING, address STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '#';

CREATE TABLE order_cust (id INT, cus_id INT, prod_id INT, price INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

SELECT * FROM customer c JOIN order_cust o ON (c.id=o.cus_id);

SELECT c.id, c.name, c.address, ce.exp
FROM customer c
JOIN (SELECT cus_id, sum(price) AS exp FROM order_cust GROUP BY cus_id) ce
ON (c.id=ce.cus_id);

Multi table insert - Dynamic partition insert

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
SELECT pvs.viewTime, ... WHERE pvs.country = 'US'
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='CA')
SELECT pvs.viewTime, ... WHERE pvs.country = 'CA'
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='UK')
SELECT pvs.viewTime, ... WHERE pvs.country = 'UK';

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
SELECT pvs.viewTime, ...
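The difference between the two statements above can be pictured as follows: with static partitions each INSERT names its target partition, whereas with a dynamic partition column the target is derived from each row. The Python sketch below only mimics that routing and does not call Hive. The sample rows and the function name `dynamic_partition_insert` are hypothetical.

```python
from collections import defaultdict


def dynamic_partition_insert(rows, dt, partition_column="country"):
    """Group rows into partitions keyed by (dt, value of partition_column)."""
    partitions = defaultdict(list)
    for row in rows:
        # The partition value is taken from the row itself, so it is not
        # repeated among the row's ordinary columns.
        data = {k: v for k, v in row.items() if k != partition_column}
        partitions[(dt, row[partition_column])].append(data)
    return dict(partitions)


# Made-up staging rows standing in for page_view_stg.
staged = [
    {"viewTime": 100, "country": "US"},
    {"viewTime": 101, "country": "CA"},
    {"viewTime": 102, "country": "US"},
]
result = dynamic_partition_insert(staged, dt="2008-06-08")
for key in sorted(result):
    print(key, result[key])
```

A country that appears in the staging data but was not anticipated still gets its own partition here, which is the practical advantage over writing one INSERT clause per country.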

MapReduce not Good Enough?

• Restricted programming model
  – Only two phases
  – A long data flow needs a chain of jobs
• Put the logic at the right phase
  – Programmers are responsible for this
• Too many lines of code even for simple logic
  – How many lines do you have for word count?
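For reference, the logic of word count itself is small once the framework plumbing is left out. The sketch below imitates the map, shuffle, and reduce phases inside a single Python process. It is not Hadoop code, and the function names are illustrative.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run the three phases in order and return {word: count}."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())


print(word_count(["the quick brown fox", "the lazy dog", "The end"]))
```

A Java MapReduce version carries the same two functions plus job configuration, input and output types, and a driver class, which is where most of the extra lines come from.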

Pig to the Rescue

• High-level dataflow language (Pig Latin)
• Much simpler than Java
• Simplifies data processing
• Puts the operations at the appropriate phases
• Chains multiple MapReduce jobs

Motivation by Example

• Suppose we have user data in one file and website data in another file.
• We need to find the top 5 most visited pages by users aged 18-25.
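One way to picture the required steps (filter users by age, join with page visits, group by page, count, sort, keep five) is the plain Python sketch below. The record layouts `(name, age)` and `(user, url)`, the sample data, and the assumption that user names are unique are all made up for illustration, since the slides do not specify the file formats.

```python
from collections import Counter


def top_pages(users, visits, min_age=18, max_age=25, n=5):
    """Return the n most visited URLs among users aged min_age..max_age."""
    # Filter: keep only the names of users in the age range.
    selected = {name for name, age in users if min_age <= age <= max_age}
    # Join + group + count: count visits whose user is in the filtered set.
    clicks = Counter(url for user, url in visits if user in selected)
    # Order by count (descending) and keep the first n.
    return clicks.most_common(n)


# Hypothetical sample records.
users = [("amy", 22), ("bob", 31), ("cat", 19)]
visits = [
    ("amy", "a.com"), ("cat", "a.com"), ("bob", "b.com"),
    ("amy", "c.com"), ("cat", "c.com"), ("cat", "a.com"),
]
print(top_pages(users, visits))
```

Each commented step corresponds to one operation in a dataflow; on a cluster the join and the grouping would each require data to be redistributed across machines.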

In MapReduce

In Pig

Pig runs over Hadoop

Pig

• Data flow language
  – Users specify a sequence of operations to process data
  – More control over the process, compared with a declarative language
• Supports various data types
• Supports schemas
• Supports user-defined functions

Machine Learning

• “Machine Learning is programming computers to optimize a performance criterion using example data or past experience”
• Subset of Artificial Intelligence

Types

• Supervised
  – Using labeled training data, create a function that predicts the output of unseen inputs
• Unsupervised
  – Using unlabeled data, create a function that predicts output
• Semi-Supervised
  – Uses labeled and unlabeled data

Example: Clustering

• Unsupervised
• Find Natural Groupings
  – Documents
  – Search Results
  – People
  – Genetic traits in groups
  – Many, many more uses

Example: Collaborative Filtering

• Unsupervised
• Recommend people and products
  – User-User
    » User likes X, you might too
  – Item-Item
    » People who bought X also bought Y

Amazon.com

Example: Classification/Categorization

• Many, many types
• Spam Filtering
• Named Entity Recognition (NER)
• Phrase Identification
• Sentiment Analysis
• Classification into a Taxonomy

Example: Information Retrieval

• Learning Ranking Functions
• Learning Spelling Corrections
• User Click Analysis and Tracking

Other

• Image Analysis
• Robotics
• Games
• Higher level natural language processing
• Many, many others

What is Mahout?

• A Mahout is an elephant trainer/driver/keeper, hence…

Hadoop (and other distributed techniques) + Machine Learning = Mahout

Goal:
– Scalable Machine Learning algorithms with

What?

• Hadoop brings:
  – Map/Reduce API
  – HDFS
  – In other words, scalability and fault-tolerance
• Mahout brings:
  – Library of machine learning algorithms
  – Examples

Why Mahout?

• Many Open Source ML libraries either:
  – Lack Community
  – Lack Documentation and Examples
  – Lack Scalability
  – Lack the Apache License ;-)
  – Or are research-oriented

Current Status

• What’s in Mahout:
  – Simple Matrix/Vector library
  – Taste Collaborative Filtering
  – Clustering
    » Canopy/K-Means/Fuzzy K-Means/Mean-shift/Dirichlet
  – Classifiers
    » Naïve Bayes
    » Complementary NB
  – Evolutionary
    » Integration with Watchmaker for fitness function
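To give a feel for one of the clustering algorithms listed above, here is a minimal single-machine sketch of k-means (Lloyd's iteration) in plain Python. It is not Mahout's implementation or API. The function name, the choice of the first k points as initial centers, the restriction to 2-D points, and the fixed iteration count are simplifying assumptions.

```python
def kmeans(points, k, iterations=10):
    """Cluster 2-D points; returns (centers, assignment list)."""
    centers = [points[i] for i in range(k)]  # naive init: first k points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        for i, (x, y) in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: (x - centers[c][0]) ** 2 + (y - centers[c][1]) ** 2,
            )
        # Update step: move each center to the mean of its points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centers, assignment


# Made-up data: two visually separate groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (8.0, 9.0), (2.0, 1.0), (9.0, 8.0)]
centers, assignment = kmeans(points, k=2)
print(centers)
print(assignment)
```

The two steps suggest why the algorithm suits MapReduce: assigning points to centers can be done independently per point in mappers, and averaging the points of each cluster is a per-key aggregation in reducers, with one job per iteration.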