Getting Started Development with Mahout
By Reeshu Patel

What Is Apache Mahout?
Apache Mahout provides machine learning algorithms for big data sets. Mahout's core algorithms for clustering, classification, and batch collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm. However, contributions are not restricted to Hadoop-based implementations: contributions that run on a single node or on a non-Hadoop cluster are also accepted. The core libraries are highly optimized to give good performance for non-distributed algorithms as well.

Installing Mahout
Mahout is a collection of highly scalable machine learning algorithms for very big data sets. Although the real strength of Mahout shows only on big HDFS data, Mahout also supports running algorithms on local filesystem data, which can help you get a feel for how to run Mahout algorithms. Before you can run any Mahout algorithm you need a Mahout installation ready on your Linux machine, which can be set up as described below.

Step 1: Download mahout-distribution-0.x.tar.gz from the Apache Download Mirrors and extract the contents next to your Hadoop package, or to any location of your choice; my Hadoop is at /usr/local/hadoop, so I used /usr/local. Make sure to change the owner of all the files to the hduser user and hadoop group, for example:
cd /usr/local
1. $ sudo tar xzf mahout-distribution-0.x.tar.gz
2. $ sudo mv mahout-distribution-0.x mahout
3. $ sudo chown -R hduser:hadoop mahout
This results in a folder named /usr/local/mahout. If you want, you can now run any of the algorithms using the script bin/mahout in the extracted folder to check your installation.

Step 2: Set the path in the .bashrc file:
1. export MAHOUT_HOME=/usr/local/mahout
2. export PATH=$PATH:$MAHOUT_HOME/bin
Create a directory where you would want to check out the Mahout code; here we'll use /app/mahout:
1. $ sudo mkdir -p /app/mahout
2. $ sudo chown hduser:hadoop /app/mahout
3. # ...and if you want to tighten up security, chmod from 755 to 750...
4. $ sudo chmod 750 /app/mahout

Step 3: Set the Mahout library path in Hadoop's hadoop-env.sh (for example by adding it to HADOOP_CLASSPATH):
/usr/local/mahout/lib/*

Step 4: Install Maven:
1. $ sudo tar xzf apache-maven-2.2.1-bin.tar.gz
2. $ sudo mv apache-maven-2.2.1 maven
3. $ sudo chown -R hduser:hadoop maven
Now set the Maven path in the .bashrc file:
1. export M2_HOME=/usr/local/maven
2. export M2=$M2_HOME/bin
3. export PATH=$PATH:$JAVA_HOME/bin:$M2
Now build Mahout with Maven. Go to the Mahout home directory and run:
/usr/local/mahout$ mvn install
You should see the build complete successfully. Your Maven build of Mahout is now finished, and you will find the .m2 repository in your home directory.
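You can also sanity-check the installation from Java. The sketch below is my own illustration, not part of the original steps; it only assumes that the mahout-math jar from the distribution is on the classpath, and the class name MahoutSmokeTest is hypothetical. If it compiles and prints 14.0, the Mahout jars are usable:

import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

// A tiny smoke test for the Mahout jars: create a vector and take a dot product.
public class MahoutSmokeTest {
    public static void main(String[] args) {
        Vector v = new DenseVector(new double[] { 1.0, 2.0, 3.0 });
        // 1*1 + 2*2 + 3*3 = 14.0
        System.out.println("v . v = " + v.dot(v));
    }
}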
Hadoop Configuration with Eclipse
My inspiration for writing about Hadoop configuration and running a MapReduce task with Eclipse is so that you can easily start development with Mahout. Often we complete the setup steps but do not know how they work or how to do them ourselves, which causes problems later. This configuration will take you through everything you need to run a MapReduce application from Eclipse. You are probably familiar with the WordCount example; if not, don't worry, you will get to know everything step by step. This is one of the simplest MapReduce applications you will ever come across: in this example we are going to find the size of each word and count the words of the same size. After performing this example you can use Mahout and start running the Mahout examples.

Here I am using Ubuntu 12.04, Hadoop 1.1.2, and Eclipse Juno; if you want, you can use newer versions of them. For a working Hadoop environment you first need to set up a Hadoop cluster. Follow this link to set it up: http://www.attuneuniversity.com/blog/apache-hadoop-installation-with-single-node-cluster-setup.html After setting up your Hadoop cluster, you need to start it.

Step 1: Start your single-node Hadoop cluster:
1. root@reeshupatel-desktop:~#/usr/local/hadoop$ bin/start-all.sh

Step 2: Copy the Hadoop Eclipse plugin into the plugins directory of Eclipse:
1. root@reeshupatel-desktop:~# sudo cp /home/attune/Desktop/hadoop-eclipse-plugin-1.1.2.jar /opt/eclipse/plugins
2. Here my Hadoop version is 1.1.2; that is why I am using hadoop-eclipse-plugin-1.1.2.

Step 3: Start the Eclipse IDE.
1. You can see the "Map/Reduce" perspective at the top-right corner of the IDE; select it. Now look at the bottom of the IDE: you can see the "Map/Reduce Locations" view.
2. Create a "New Hadoop Location" and set the ports for MapReduce and DFS.

Step 4: Create a MapReduce project.

Step 5: After creating the project, select "MapReduce driver" for the application.
1. Give your application an appropriate name and start programming.

Step 6: Define the Hadoop location and set its fields one by one:
1. Location name: anything you like
2. Map/Reduce master host: according to your mapred-site.xml file, for example reeshu
3. Map/Reduce master port: the same as the mapred-site.xml port, 54311
4. DFS master port: 54310, according to fs.default.name in your core-site.xml file, for example hdfs://reeshu:54310
5. User name: hduser, for example, if your Hadoop runs as hduser
6. If you are working as hduser, you will have to give it permission on the Eclipse folder: $ sudo chown -R hduser:hadoop <eclipse folder path>
7. Set the MapReduce classpath to the Hadoop folder.
8. Add the hadoop-core jar file to the Java build path.
9. Here my input files contain unstructured text data.

MapReduce Example with Hadoop
If you want to check your Hadoop setup, you can follow this example. Download the files below:
1. The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson
2. The Notebooks of Leonardo Da Vinci
3. Ulysses by James Joyce
Copy the data from the local directory to the Hadoop distributed file system:
4. root@reeshupatel-desktop:~#/usr/local/hadoop$ bin/hadoop fs -copyFromLocal /home/attune/Desktop/attune_1.txt /home/reeshupate/attune_text_input/attune_1.txt
5. root@reeshupatel-desktop:~#/usr/local/hadoop$ bin/hadoop fs -copyFromLocal /home/attune/Desktop/attune_2.txt /home/reeshupate/attune_text_input/attune_2.txt
6. root@reeshupatel-desktop:~#/usr/local/hadoop$ bin/hadoop fs -copyFromLocal /home/attune/Desktop/attune_3.txt /home/reeshupate/attune_text_input/attune_3.txt
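The same copy can be done from Java through Hadoop's FileSystem API, which is handy once you are working inside Eclipse anyway. This is a minimal sketch under the assumptions of this tutorial (the attune_*.txt file names and the HDFS paths shown above); the class name CopyExampleInput is my own:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Copies the three example text files into HDFS, mirroring the
// bin/hadoop fs -copyFromLocal commands above.
public class CopyExampleInput {
    public static void main(String[] args) throws IOException {
        // Picks up fs.default.name (e.g. hdfs://reeshu:54310) from core-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        for (int i = 1; i <= 3; i++) {
            Path local = new Path("/home/attune/Desktop/attune_" + i + ".txt");
            Path remote = new Path("/home/reeshupate/attune_text_input/attune_" + i + ".txt");
            fs.copyFromLocalFile(local, remote);
        }
    }
}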
As I said before, this application finds the size of each word and counts the words having the same size. I used two classes: a mapper class and a reducer class. They contain two functions: the map function and the reduce function. The map function splits the text into tokens, takes each word's length, and emits it as the key. The reduce function then counts the values for each particular key, that is, the words having the same size.

package com.attuneinfocom.size.count;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class Sizecounting {

    public static class MapClass extends MapReduceBase
            implements Mapper<LongWritable, Text, IntWritable, Text> {

        public void map(LongWritable key, Text value,
                OutputCollector<IntWritable, Text> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            StringTokenizer st = new StringTokenizer(line, " ");
            while (st.hasMoreTokens()) {
                String word = st.nextToken();
                int size = word.length();
                // Emit the word length as the key; the reducer only counts the values
                output.collect(new IntWritable(size), new Text(word));
            }
        }
    }

    public static class ReduceClass extends MapReduceBase
            implements Reducer<IntWritable, Text, IntWritable, IntWritable> {

        public void reduce(IntWritable key, Iterator<Text> values,
                OutputCollector<IntWritable, IntWritable> output, Reporter reporter)
                throws IOException {
            // Count how many words share this length
            int sum = 0;
            while (values.hasNext()) {
                values.next();
                sum++;
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws IOException {
        JobClient client = new JobClient();
        // The JobConf identifies the jar that contains the job classes
        JobConf conf = new JobConf(com.attuneinfocom.size.count.Sizecounting.class);
        // Specify key and value classes for the mapper output
        conf.setMapOutputKeyClass(IntWritable.class);
        conf.setMapOutputValueClass(Text.class);
        // Specify the final output types
        conf.setOutputKeyClass(IntWritable.class);
        conf.setOutputValueClass(IntWritable.class);
        // Wire up the mapper, the reducer, and the input/output formats
        conf.setMapperClass(MapClass.class);
        conf.setReducerClass(ReduceClass.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        // Input and output paths are taken from the command line
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        client.setConf(conf);
        JobClient.runJob(conf);
    }
}
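To see what the job actually computes, here is a small hand trace on an invented input line (my own example, not taken from the books above):

Input line:    the quick brown fox
Map output:    (3, the) (5, quick) (5, brown) (3, fox)
Reduce input:  3 -> [the, fox]    5 -> [quick, brown]
Final output:  3    2
               5    2

Each output row is a word size followed by the number of words of that size.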