Hadoop and Tools
• Various Linux Hadoop clusters around
  – http://hadoop.apache.org
  – Amazon EC2
• Windows and other platforms
  – The NetBeans plugin simulates Hadoop
  – The workflow view works on Windows
• Hadoop-based tools
  – NetBeans plugin for developing in Java
  – HBase: a distributed data store organized as a large table
  – Hive: a data warehouse with SQL-like queries
  – Pig: a high-level data processing script language
  – Mahout: machine learning algorithms on Hadoop

Installing Hadoop
http://hadoop.apache.org/
Supported Platforms
• GNU/Linux is supported as a development and production platform. Hadoop has been demonstrated on GNU/Linux clusters with 2000 nodes.
• Windows is also a supported platform.

Required Software
• Required software for Linux and Windows includes:
  – Java™ 1.6.x, preferably from Sun, must be installed.
  – ssh must be installed and sshd must be running to use the Hadoop scripts.
HDFS – Hadoop Distributed File System, modeled on Google GFS.
Hadoop MapReduce – Similar to Google MapReduce
HBase – Similar to Google Bigtable. Data is divided into tables.
A table is composed of columns, and columns are grouped into column families.

Example
Multi-dimensional map
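The multi-dimensional map view can be pictured as a sparse, nested map: (row key, column family, column qualifier, timestamp) → cell value. A minimal Python sketch of that shape (illustration only, not the HBase API; the row and column names follow the classic Bigtable-paper example):

```python
# HBase conceptually: a sparse, sorted map keyed by
# (row key, column family, column qualifier, timestamp) -> cell value.
# Plain nested dicts are enough to illustrate the shape.

table = {
    "com.cnn.www": {                      # row key
        "contents": {                     # column family
            "html": {                     # column qualifier
                6: "<html>v3...</html>",  # timestamp -> value
                5: "<html>v2...</html>",
            }
        },
        "anchor": {
            "cnnsi.com": {9: "CNN"},
            "my.look.ca": {8: "CNN.com"},
        },
    }
}

def get(row, family, qualifier, timestamp=None):
    """Return a cell value; with no timestamp, return the newest version."""
    versions = table[row][family][qualifier]
    ts = timestamp if timestamp is not None else max(versions)
    return versions[ts]

print(get("com.cnn.www", "anchor", "cnnsi.com"))    # CNN
print(get("com.cnn.www", "contents", "html", 5))    # <html>v2...</html>
```

Note how versioning falls out of the innermost timestamp map: reads default to the latest timestamp, while older versions stay addressable.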
Physical view
Problem with MapReduce
Hadoop supports data-intensive distributed applications using MapReduce.
However...
– MapReduce is hard to program (users already know SQL/bash/Python).
– No schema.

What is Hive?
A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
– ETL.
– Structure.
– Access to different storage.
– Query execution via MapReduce.

Key Building Principles:
– SQL is a familiar language.
– Extensibility – types, functions, formats, scripts.
– Performance.

Data Units
• Databases
• Tables
• Partitions
• Buckets (or Clusters)

Type System
Primitive types
– Integers: TINYINT, SMALLINT, INT, BIGINT.
– Boolean: BOOLEAN.
– Floating point numbers: FLOAT, DOUBLE.
– String: STRING.

Complex types
– Structs: {a INT; b INT}.
– Maps: M['group'] returns the value stored under key 'group'.
– Arrays: for A = ['a', 'b', 'c'], A[1] returns 'b' (indexing is zero-based).

Examples – DDL Operations
CREATE TABLE sample (foo INT, bar STRING) PARTITIONED BY (ds STRING);
SHOW TABLES '.*s';
DESCRIBE sample;
ALTER TABLE sample ADD COLUMNS (new_col INT);
DROP TABLE sample;

Examples – DML Operations
LOAD DATA LOCAL INPATH './sample.txt' OVERWRITE INTO TABLE sample PARTITION (ds='2012-02-24');
LOAD DATA INPATH '/user/falvariz/hive/sample.txt' OVERWRITE INTO TABLE sample PARTITION (ds='2012-02-24');

SELECTS and FILTERS
SELECT foo FROM sample WHERE ds='2012-02-24';
INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT * FROM sample WHERE ds='2012-02-24';
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hive-sample-out' SELECT * FROM sample;

Aggregations and Groups
SELECT MAX(foo) FROM sample;
SELECT ds, COUNT(*), SUM(foo) FROM sample GROUP BY ds;
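As a sketch of what a GROUP BY like the one above computes, here is the same per-ds count and sum in plain Python (the sample rows are made up for illustration):

```python
from collections import defaultdict

# Hypothetical rows of the sample table: (foo, bar, ds)
rows = [
    (3, "a", "2012-02-24"),
    (5, "b", "2012-02-24"),
    (2, "c", "2012-02-25"),
]

# SELECT ds, COUNT(*), SUM(foo) FROM sample GROUP BY ds;
groups = defaultdict(lambda: [0, 0])   # ds -> [count, sum(foo)]
for foo, bar, ds in rows:
    groups[ds][0] += 1
    groups[ds][1] += foo

for ds in sorted(groups):
    count, total = groups[ds]
    print(ds, count, total)   # 2012-02-24 2 8 / 2012-02-25 1 2
```

Hive compiles such a query into a MapReduce job: the map phase emits ds as the key, and the reduce phase performs exactly this per-key counting and summing.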
FROM sample s INSERT OVERWRITE TABLE bar SELECT s.bar, count(*) WHERE s.foo > 0 GROUP BY s.bar;

Join
CREATE TABLE customer (id INT, name STRING, address STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '#';
CREATE TABLE order_cust (id INT, cus_id INT, prod_id INT, price INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
SELECT * FROM customer c JOIN order_cust o ON (c.id = o.cus_id);

SELECT c.id, c.name, c.address, ce.exp FROM customer c JOIN (SELECT cus_id, sum(price) AS exp FROM order_cust GROUP BY cus_id) ce ON (c.id = ce.cus_id);

Multi-table insert – Dynamic partition insert
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
  SELECT pvs.viewTime, ... WHERE pvs.country = 'US'
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='CA')
  SELECT pvs.viewTime, ... WHERE pvs.country = 'CA'
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='UK')
  SELECT pvs.viewTime, ... WHERE pvs.country = 'UK';
With dynamic partition insert, the country partition value is taken from the data itself:

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
  SELECT pvs.viewTime, ...

Apache Pig
MapReduce Not Good Enough?
• Restricted programming model
  – Only two phases
  – Single job chain for long data flows
• Put the logic at the right phase
  – Programmers are responsible for this
• Too many lines of code, even for simple logic
  – How many lines do you have for word count?
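For comparison, here is word count expressed as explicit map, shuffle, and reduce steps in a single-machine Python sketch (illustrative only, not the Hadoop API); the equivalent Java MapReduce program typically runs to dozens of lines of boilerplate:

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit (word, 1) for every word in the line.
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    # Reduce phase: sum the counts for one word.
    return word, sum(counts)

def word_count(lines):
    # Shuffle phase: group all emitted values by key between map and reduce.
    shuffled = defaultdict(list)
    for line in lines:
        for word, count in mapper(line):
            shuffled[word].append(count)
    return dict(reducer(w, c) for w, c in shuffled.items())

print(word_count(["to be or", "not to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Even in this compressed form, the programmer must write the logic for all three phases by hand; Pig's value is expressing the same computation as a few dataflow operations.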
Pig to the Rescue
• High-level dataflow language (Pig Latin)
• Much simpler than Java
  – Simplifies the data processing
  – Puts the operations at the appropriate phases
• Chains multiple MapReduce jobs
Motivation by Example
Suppose we have user data in one file and website data in another file.
We need to find the top 5 most visited pages by users aged 18-25.
In MapReduce
In Pig
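The Pig script shown on the slide is a figure and is not reproduced here, but the dataflow it expresses (load, filter users by age, join with visits, count visits per page, take the top 5) can be sketched in Python with made-up sample data:

```python
from collections import Counter

# Hypothetical inputs: users(name, age) and page visits(user, url)
users = [("alice", 20), ("bob", 40), ("carol", 19)]
visits = [("alice", "/a"), ("alice", "/b"), ("carol", "/a"),
          ("bob", "/c"), ("carol", "/b"), ("alice", "/a")]

# FILTER users BY age >= 18 AND age <= 25
young = {name for name, age in users if 18 <= age <= 25}

# JOIN visits with the filtered users, then GROUP BY url and COUNT
counts = Counter(url for user, url in visits if user in young)

# ORDER BY count DESC, LIMIT 5
top5 = counts.most_common(5)
print(top5)   # [('/a', 3), ('/b', 2)]
```

In Pig each commented step corresponds to one operation (FILTER, JOIN, GROUP, ORDER, LIMIT), and Pig compiles the whole chain into the necessary sequence of MapReduce jobs.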
Pig runs over Hadoop
Pig
• Data flow language
  – Users specify a sequence of operations to process data
  – More control over the process, compared with a declarative language
• Various data types supported
• Schema supported
• User-defined functions supported
Machine Learning
• “Machine Learning is programming computers to optimize a performance criterion using example data or past experience”
• A subset of Artificial Intelligence

Types
• Supervised
  – Using labeled training data, create a function that predicts the output for unseen inputs
• Unsupervised
  – Using unlabeled data, find structure or groupings in the data
• Semi-Supervised
  – Uses both labeled and unlabeled data

Example: Clustering
• Unsupervised
• Find natural groupings
  – Documents
  – Search results
  – People
  – Genetic traits in groups
  – Many, many more uses

Example: Collaborative Filtering
• Unsupervised
• Recommend people and products
  – User-User
    » User likes X, you might too
  – Item-Item
    » People who bought X also bought Y (e.g., Amazon.com)

Example: Classification/Categorization
• Many, many types
• Spam filtering
• Named Entity Recognition (NER)
• Phrase identification
• Sentiment analysis
• Classification into a taxonomy
Example: Information Retrieval
• Learning Ranking Functions
• Learning Spelling Corrections
• User click analysis and tracking

Other
• Image analysis
• Robotics
• Games
• Higher-level natural language processing
• Many, many others

What is Apache Mahout?
• A mahout is an elephant trainer/driver/keeper, hence…
Hadoop (and other distributed techniques) + Machine Learning = Mahout
Goal: scalable machine learning algorithms under the Apache License.

What?
• Hadoop brings:
  – Map/Reduce API
  – HDFS
  – In other words, scalability and fault-tolerance
• Mahout brings:
  – A library of machine learning algorithms
  – Examples

Why Mahout?
• Many open source ML libraries either:
  – Lack community
  – Lack documentation and examples
  – Lack scalability
  – Lack the Apache License ;-)
  – Or are research-oriented

Current Status
• What’s in Mahout:
  – Simple Matrix/Vector library
  – Taste Collaborative Filtering
  – Clustering
    » Canopy / K-Means / Fuzzy K-Means / Mean-Shift / Dirichlet
  – Classifiers
    » Naïve Bayes
    » Complementary NB
  – Evolutionary
    » Integration with Watchmaker for fitness functions
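Mahout’s clustering offerings include K-Means. As a minimal single-machine sketch of the K-Means loop (this is not Mahout’s distributed implementation; the sample points are made up):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain K-Means on tuples of floats: repeatedly assign each point
    to its nearest centroid, then move each centroid to the mean of
    its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assignment step: nearest centroid by squared distance.
            nearest = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[nearest].append(p)
        # Update step: each centroid moves to its cluster mean
        # (an empty cluster keeps its old centroid).
        centroids = [
            tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids

pts = [(0.0, 0.0), (0.1, 0.2), (9.0, 9.0), (9.1, 8.8)]
print(sorted(kmeans(pts, 2)))  # two centroids, near (0.05, 0.1) and (9.05, 8.9)
```

Mahout distributes exactly these two steps: the assignment step becomes the map phase and the centroid recomputation becomes the reduce phase, with one MapReduce job per iteration.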