Summer Internship
Report On Design & Implementation of Classification & Clustering Algorithm for Mobile Phones
At
IDRBT, Hyderabad
6th May 2014 to 5th July 2014
Submitted By:
Ratnesh Chandak
B. Tech- CSE (2nd Year)
Roll Number: CS12B1030
Indian Institute of Technology, Hyderabad
Guided By:
Dr. V.N. Sastry
Designation: Professor
IDRBT, Hyderabad
Date of Submission: 4th July 2014
Abstract
This project is broadly divided into two parts. In the first part, we find the intersection of common objects across different sets, taking the example of finding common colored balls in different containers.
In the second part, we describe the K-Means Clustering Algorithm for grouping similar numerical data, and we experimentally find the optimal value of "K" for the clustering algorithm using the Elbow Method. Both parts are illustrated with mobile-phone examples and implemented in Java.
Certificate
Certified that this is a bonafide record of summer internship project work entitled
Design & Implementation of Classification & Clustering Algorithm for Mobile Phones
Done By
Ratnesh Chandak
B. Tech - CSE (2nd Year)
Indian Institute of Technology, Hyderabad
at IDRBT, Hyderabad during 6th May to 5th July 2014
Prof. V N Sastry
(Project Guide)
IDRBT, Hyderabad
Acknowledgement
We completed this project as a summer internship at IDRBT (Institute for Development and Research in Banking Technology), Hyderabad, under the guidance of Dr. V.N. Sastry. We would like to thank our mentor and all our friends for their great support in completing this project.
Ratnesh Chandak Date:
(Project Trainee)
Contents
Chapter 1 Introduction
1.1 Project Objectives
1.2 Classification
1.3 Clustering
Chapter 2 Container-Ball Algorithm
2.1 Algorithm
2.2 Application
2.3 Remarks
Chapter 3 Cluster Analysis of Mobile Phones
3.1 Assigning weights to Mobile Phones
3.2 K-Means Clustering Algorithm
3.3 Finding Optimal K for Clustering Algorithm
3.4 Observations
Chapter 4 Conclusion and Future Work
Appendix A: Program of Weight calculation & K-Means Clustering Algorithm
References
Chapter 1: Introduction
1.1 Project Objectives
1. To find the intersection of same-colored balls of the same sizes from a set of containers.
2. To perform cluster analysis of the mobile phones available in the market.
1.2 Classification
Classification refers to the task of predicting a class label for given unlabeled objects in the dataset.
Example:
1. A bank loan officer analyses past data to learn which loan applicants are safe and which are risky, and sanctions loans accordingly.
2. A marketing manager at a company analyses customer data to predict whether a customer with a given profile will buy a new computer.
How does Classification work?
Classification is a two-step process:
1. Training Phase (Learning Step): Using past history (training data), a classification algorithm builds the classification rules for the classifier.
2. Classification Phase (Labelling Step): The classifier applies the classification rules to assign a label (class) to any new object.
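The two phases above can be sketched with a deliberately tiny example. The class name, data, and the one-rule "income threshold" learner below are hypothetical illustrations of the loan-officer scenario, not part of this project's implementation:

```java
// Hypothetical illustration of the two-phase process: a one-rule
// "classifier" learned from past loan records (training phase), then
// used to label a new applicant (classification phase).
public class LoanRuleSketch {
    // Training phase: learn an income threshold separating
    // "safe" from "risky" applicants in the labelled history.
    static double train(double[] incomes, boolean[] safe) {
        double bestThreshold = 0;
        int bestCorrect = -1;
        for (double t : incomes) {               // try each income as a cut-off
            int correct = 0;
            for (int i = 0; i < incomes.length; i++)
                if ((incomes[i] >= t) == safe[i]) correct++;
            if (correct > bestCorrect) { bestCorrect = correct; bestThreshold = t; }
        }
        return bestThreshold;
    }

    // Classification phase: apply the learned rule to a new applicant.
    static String classify(double threshold, double income) {
        return income >= threshold ? "safe" : "risky";
    }

    public static void main(String[] args) {
        double[] incomes = {20, 35, 50, 80};          // past applicants (hypothetical)
        boolean[] safe   = {false, false, true, true};
        double t = train(incomes, safe);              // learns threshold 50
        System.out.println(classify(t, 60));          // prints "safe"
        System.out.println(classify(t, 25));          // prints "risky"
    }
}
```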
1.3 Clustering
Clustering refers to grouping a set of objects in such a way that objects in the same cluster (group) are more similar to each other (with respect to one or more of their features) than to those in other clusters.
Example:
1. Clustering helps in classifying documents on the web for information discovery.
2. It is used in earthquake studies, city planning, market research, pattern recognition, data analysis, image processing, etc.
There are different types of clustering algorithms used for cluster analysis, mainly:
1. Hierarchical Clustering
2. Centroid-Based Clustering
3. Distribution-Based Clustering
4. Density-Based Clustering
Chapter 2: Container Ball Algorithm
2.1 Objective: Finding the common number of same-colored balls from different sets of containers.
User Input:
1. Number of containers (M)
2. Number of colors (N)
The input can be given from the console or from an input file.
Assumption: The number of containers should be greater than one.
Algorithm:
1. Sort the containers in increasing order of the number of balls using Quicksort.
2. Check the number of containers (M):
   A. If M is even, divide the M containers into groups of two consecutive containers. Find the intersection of common colored balls of the two containers in each of the M/2 groups; delete the parent containers and form child containers whose composition is the common colored balls of their parents. Set M -> M/2.
   B. If M is odd, divide the first M-1 containers into groups of two consecutive containers. Find the intersection of common colored balls of the two containers in each of the (M-1)/2 groups; delete the parent containers and form child containers whose composition is the common colored balls of their parents, carrying the last container over unchanged. Set M -> (M-1)/2 + 1.
3. Repeat steps (1) and (2) until only one container remains; its composition is the common colored balls.
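The pairing-and-intersection rounds above can be sketched compactly in Java. This is a simplified illustration, not the project's full implementation: each container is represented only as an array of ball counts per color (one fixed size), so intersecting two containers is an element-wise minimum:

```java
import java.util.Arrays;

// Sketch of the container-ball algorithm on hypothetical data: each
// container is an array of ball counts per color, and "combine" keeps,
// for every color, the count common to both containers (the minimum).
public class ContainerSketch {
    // Intersection of two containers: common balls of each color.
    static int[] combine(int[] a, int[] b) {
        int[] common = new int[a.length];
        for (int c = 0; c < a.length; c++)
            common[c] = Math.min(a[c], b[c]);
        return common;
    }

    // Repeatedly pair consecutive containers until one remains;
    // when the count is odd, the last container is carried over.
    static int[] intersectAll(int[][] containers) {
        while (containers.length > 1) {
            int pairs = containers.length / 2;         // number of full pairs
            int[][] next = new int[(containers.length + 1) / 2][];
            for (int i = 0; i < pairs; i++)
                next[i] = combine(containers[2 * i], containers[2 * i + 1]);
            if (containers.length % 2 == 1)            // odd: carry the last one
                next[next.length - 1] = containers[containers.length - 1];
            containers = next;
        }
        return containers[0];
    }

    public static void main(String[] args) {
        // Size-1 columns of the three example containers from Section 2.1
        int[][] containers = { {4, 8, 3, 12}, {45, 35, 32, 11}, {55, 23, 32, 21} };
        System.out.println(Arrays.toString(intersectAll(containers)));
        // prints [4, 8, 3, 11]
    }
}
```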
Pseudo Code:
A. Main(Container[] M)
1.  QUICKSORT(M)
2.  while M.length ≠ 1
3.      if M.length % 2 == 0
4.          Container[] temp = new Container[M.length / 2]
5.          int i = 0, j = 0
6.          while i < M.length
7.              temp[j] = combine(M[i], M[i+1])
8.              i = i + 2
9.              j = j + 1
10.         M = temp
11.     else
12.         Container[] temp = new Container[(M.length - 1) / 2 + 1]
13.         int i = 0, j = 0
14.         while i < M.length - 1
15.             temp[j] = combine(M[i], M[i+1])
16.             i = i + 2
17.             j = j + 1
18.         temp[j] = M[M.length - 1]
19.         M = temp
B. Container Class
1.  String name
2.  String[] colorarray = new String[]
3.  int[] freqarray = new int[]
4.  public Container(String name)
5.      this.name = name
6.  public void defineColor(String name2, int b, int c)
7.      Color2 temp2 = new Color2(name2)
8.      temp2.freq(b)
9.      colorarray[c] = temp2.name
10.     freqarray[c] = temp2.frequency
C. Color2 Class
1.  String name
2.  int frequency
3.  public Color2(String name)
4.      this.name = name
5.  public void freq(int a)
6.      frequency = a
D. combine(Container A, Container B)
1.  Container commondata = new Container("tempcontainer")
2.  int c = 0
3.  for k = 0 to A.colorarray.length - 1
4.      for l = 0 to B.colorarray.length - 1
5.          if A.colorarray[k].compareTo(B.colorarray[l]) == 0
6.              if A.freqarray[k] <= B.freqarray[l]
7.                  commondata.defineColor(A.colorarray[k], A.freqarray[k], c)
8.              else
9.                  commondata.defineColor(A.colorarray[k], B.freqarray[l], c)
10.             c = c + 1
11. return commondata

Figure 2.1 M containers, each containing N colors of balls of different sizes.

Example:
Input:

            Container 1             Container 2             Container 3
         Size 1 Size 2 Size 3    Size 1 Size 2 Size 3    Size 1 Size 2 Size 3
Color 1     4     42     13        45     12      6        55     29      1
Color 2     8     12      7        35     17      7        23      5      5
Color 3     3     10      2        32     29     31        32      3     34
Color 4    12     11     23        11     27     71        21      0      2

Output table of common colored balls of similar sizes:

         Size 1 Size 2 Size 3
Color 1     4     12      1
Color 2     8      5      5
Color 3     3      3      2
Color 4    11      0      2

2.2 Related Applications
1. The Truecaller mobile application.
2. The "People You May Know" feature in Facebook.

2.3 Remarks
1. This application can be used by the marketing manager of a mobile-application company who wants to estimate the number of mobile phones compatible with a new application, by intersecting the application's required features across different sets of mobiles.
2. It can also be used as a weekly "Diet Check", to find which food components one has eaten throughout a week by intersecting the proteins, vitamins, etc. consumed each day. Hospitals could even use this to regulate the nutrients given to patients.

Chapter 3: Cluster Analysis of Mobile Phones
In this chapter we present a centroid-based clustering algorithm, namely the K-Means Clustering Algorithm.
In this project we take the example of mobile phones in the market, group them into appropriate clusters, and, for any newly launched mobile, apply the same method to find its appropriate cluster. Before moving to the K-Means Clustering Algorithm, note its limitation: it can be applied only to numerical datasets. To overcome this in the case of mobile phones, we assign each mobile a weight.

3.1 Assigning Weights to Mobile Phones
To assign a weight to a mobile phone, follow the steps below:
1. List all the features on which the phones are to be compared.
2. Assign a priority order to the features.
3. Give an (X, Y) coordinate to each feature as follows:
   a. Take some increasing function X = F(Z) and give Z the values 1, 2, 3, ..., n (n is the number of features).
   b. Take an increasing function of X for the Y-coordinate.
4. For any given mobile, calculate the distances from the origin (0, 0) of all the features present in that mobile.
5. Calculate the standard deviation of those distances; this is the mobile's weight.

In this project we considered the following features; the number before each feature is its priority order:

1. 2G                       12. Document viewer       23. Accelerometer            34. 2 GB RAM
2. Bluetooth version 2.0    13. 3G                    24. Screen protection        35. 4G
3. Radio                    14. GPRS                  25. Geo-tagging              36. GPU (graphics card)
4. MP3 player               15. EDGE                  26. Face detection           37. Compass
5. Photo viewer/editor      16. Wi-Fi                 27. Dual-core processor      38. Proximity sensor
6. Camera 0.3-2.0 MP        17. GPS                   28. 512 MB RAM               39. Gyroscope
7. MMS support              18. 256 MB RAM            29. Wi-Fi hotspot            40. NFC
8. Bluetooth version 3.0    19. Camera 8.0-13.0 MP    30. 1 GB RAM                 41. HDMI port
9. Touchscreen              20. HD recording          31. Bluetooth version 4.0    42. Barometer
10. Camera 2.0-7.9 MP       21. Java support          32. Camera above 8.0 MP      43. 3 GB RAM
11. USB                     22. MP4 player            33. Quad-core processor      44. Temperature sensor

Figure 3.1 Table representing the priority order of mobile features.

Functions and formulae used for calculating the weight of a mobile:
1. In this project we used X = 3Z + 1 and Y = X.
2. Distance from origin: d = √(X² + Y²)
3. Mean = (∑ dᵢ) / n, where the dᵢ are the distances of the features present and n is their count
4. Variance = (∑ (dᵢ − Mean)²) / n
5. Standard Deviation = √(Variance)

Figure 3.2 Price versus weight of the 100-mobile dataset used in our study (price 0-60000 on the Y-axis, weight 0-60 on the X-axis), showing a sparse distribution of mobile phones. Clusters are later formed from these numerical weights.

For the Java implementation of the weight calculation, see Appendix A, Section 5.1.1.

3.2 K-Means Clustering Algorithm
Input:
1. Numerical dataset of mobile phones (weights)
2. Desired number of clusters (K)
Idea: Each cluster has its own centroid; the elements are grouped according to their distances from the different centroids.
Algorithm:
1. Randomly choose K weights as the initial centroids.
2. For each weight, find the nearest centroid and assign the weight to the cluster associated with that centroid.
3. Update the centroid of each cluster based on the weights in that cluster; typically the new centroid is the mean of all weights in the cluster.
4. Repeat steps 2 and 3 until no weight switches cluster.

Screenshots of the implementation of the K-Means Clustering Algorithm:
Figure 3.3 Java GUI for entering the features of each mobile phone.
Figure 3.4 Java GUI asking the user for the cluster size.
Figure 3.5 Output of mobile phones similar to the one the user entered.

For the Java implementation of the K-Means Clustering Algorithm, see Appendix A, Section 5.1.2.

3.3 Finding the Optimal K for the Clustering Algorithm
3.3.1 Thumb rule: K = greatest integer of √(n/2), where n is the number of data points. In this project we made a dataset of 100 mobile phones, so the optimal K by the thumb rule is ⌊√50⌋ = 7.
3.3.2 Elbow Method:
1. For each K, calculate the sum of squared errors (SSE): the squared distance between each cluster member and its cluster centroid, summed over all members.
2. Plot a graph of SSE against the number of clusters.
3. As K increases, the SSE decreases, since more clusters mean smaller distances to the centroids; but each successive increase in K gives a smaller drop. At some point the marginal gain falls sharply, producing an angle in the graph.
4. That value of K is called the elbow, and it is the optimal value of K.

Figure 3.6 Experimental graph of SSE versus number of clusters (SSE 0-10000 on the Y-axis, number of clusters 2-98 on the X-axis).

Optimal K from the above graph = (5 + 11)/2 = 8.

After finding the optimal K = 8 by the Elbow Method, we set the cluster size to 8 and obtained the following clusters in our output file:

Set 1 = { Canvas Turbo, HTC Desire, Apple iPad 5, LG Optimus F7, Motorola Moto G, Sony Tablet P, Toshiba Thrive, Spice Mi-530, Nokia Lumia 610, Samsung ATIV, Continuum I400, Nokia N9, Micromax A76 }
Set 2 = { HTC Flyer, LG P505, Motorola Fire, Nokia N8, Nokia X7-00, HP Veer, Ascend Y530, Acer Liquid Z5, Nexus 7, Alcatel OT-916, Xolo A600, Micromax A089, Micromax A121, Micromax A111, Nokia E7, Nokia T7, Samsung Epic, Geeksphone Peak }
Set 3 = { HTC E8, Toshiba Excite, Nokia 808, Samsung P8510, Galaxy Nexus, Alcatel Idol X+ }
Set 4 = { Asha 230, Apple iPhone, BlackBerry Curve, Samsung Gravity, Huawei U8300, Acer E210, Alcatel T10, Vodafone 845, Xolo X500, Micromax A47, Micromax Ninja A91, Micromax A52, Spice M-6900, Spice Mi-437, Spice Mi-506, Micromax P600, i8510 INNOV8, I8000 Omnia II, Nokia N900, ZTE Open }
Set 5 = { Nokia 225, HTC Viva, BlackBerry 7290, LG C360, LG CF360, Motorola A1010, Motorola C336, Nokia 3210, Nokia C2-02, Nokia E61i, Samsung D428, Samsung E740, Samsung S5630C, HP iPAQ 6515, Huawei T552, Huawei G6620, Huawei G6310, Acer M900, Asus P526, Asus P750, Vodafone 553, Vodafone 1231, Toshiba TS32, Micromax Q1, Micromax Q80, Spice M-5390, Spice QT-58, Spice QT-50, Nokia N78, Nokia 7510, Nokia X3, Nokia E63, Nokia C2-03, Samsung G600 }
Set 6 = { Xperia Z, Lumia 930, BlackBerry Q5, LG G3 }
Set 7 = { Galaxy Ace, BlackBerry Z3, Acer Zenfone, Alcatel Fire S }
Set 8 = { Galaxy S5, Xperia Z2 }

3.4 Observations
1. Our method removes the K-Means Clustering Algorithm's limitation to numerical datasets: assigning each mobile a weight and then applying the clustering algorithm solves the problem.
2. In our study we found K = 8 to be the optimal cluster size for our dataset of 100 mobile phones.

Chapter 4: Conclusion and Future Work
1. With the first application, one can find the intersection between different sets.
2. With the second application, one can group the different types of mobiles in the market into appropriate clusters; the same idea of assigning weights and then clustering can be applied to other objects.
3. In future, the above problem can be implemented with other clustering algorithms, and the algorithms compared for accuracy, user-friendliness and time complexity.
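As a bridge to the full listing in Appendix A, the weight-then-cluster pipeline of Chapter 3 can be illustrated with a compact, self-contained sketch. The class name, the one-dimensional "weights", and the first-k-points initialisation are illustrative assumptions, not the project's exact implementation:

```java
import java.util.Arrays;

// Compact sketch of the Chapter 3 pipeline on hypothetical 1-D
// "weights": run K-Means for a given K and report the SSE used by
// the Elbow Method. Assumes w.length >= k.
public class KMeansSketch {
    // One run of K-Means on 1-D data; returns the final centroids.
    static double[] kMeans(double[] w, int k) {
        double[] c = Arrays.copyOf(w, k);       // first k weights as initial centroids
        int[] label = new int[w.length];
        boolean changed = true;
        while (changed) {
            changed = false;
            for (int i = 0; i < w.length; i++) {    // assignment step
                int best = 0;
                for (int j = 1; j < k; j++)
                    if (Math.abs(w[i] - c[j]) < Math.abs(w[i] - c[best])) best = j;
                if (label[i] != best) { label[i] = best; changed = true; }
            }
            for (int j = 0; j < k; j++) {           // update step: mean of members
                double sum = 0; int count = 0;
                for (int i = 0; i < w.length; i++)
                    if (label[i] == j) { sum += w[i]; count++; }
                if (count > 0) c[j] = sum / count;
            }
        }
        return c;
    }

    // Sum of squared distances of each weight to its nearest centroid.
    static double sse(double[] w, double[] c) {
        double total = 0;
        for (double x : w) {
            double best = Double.MAX_VALUE;
            for (double m : c) best = Math.min(best, (x - m) * (x - m));
            total += best;
        }
        return total;
    }

    public static void main(String[] args) {
        double[] weights = {1.0, 1.2, 5.0, 5.3, 9.8, 10.1}; // hypothetical mobile weights
        for (int k = 1; k <= 3; k++)                        // SSE shrinks as K grows
            System.out.println("K=" + k + "  SSE=" + sse(weights, kMeans(weights, k)));
    }
}
```

Plotting the printed SSE values against K and looking for the sharpest bend is exactly the Elbow Method of Section 3.3.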
Appendix A: Program of Weight Calculation & K-Means Clustering Algorithm

5.1 Cluster Analysis of Mobile Phones

5.1.1 Calculating the Weight of Mobile Phones

(The loop bodies of this listing were damaged in extraction; they are reconstructed here from the steps of Section 3.1, assuming b[z] = 1 when feature z+1 is present in the phone and 0 otherwise.)

public static double calculate_weight(int[] b){
    double[] distance = new double[b.length];
    int N = 0;                                   // number of features present
    double sum = 0.0;
    // distance of each present feature from the origin, with X = 3Z+1 and Y = X
    for(int z = 0; z < b.length; z++){
        if(b[z] == 1){
            double x = 3 * (z + 1) + 1;
            double y = x;
            distance[N] = Math.sqrt(x * x + y * y);
            sum += distance[N];
            N++;
        }
    }
    // calculating mean
    double mean = sum / N;
    // calculating variance
    double variance = 0.0;
    for(int z = 0; z < N; z++)
        variance += (distance[z] - mean) * (distance[z] - mean);
    variance = variance / N;
    // the weight of the mobile is the standard deviation of the distances
    double deviation = Math.sqrt(variance);
    return deviation;
}

5.1.2 K-Means Algorithm

(Large parts of this listing were lost in extraction; loop headers are completed from context, and bodies that could not be recovered are marked. The method follows the algorithm of Section 3.2.)

public static void K_Means(mobile[] a){
    Scanner user_input = new Scanner(System.in);
    System.out.println("Enter Number of clusters you want to give:");
    int k = user_input.nextInt();
    int n = a.length;
    double[] m = new double[k];
    // initializing initial centres of clusters
    for(int i = 0; i < k; i++)
        m[i] = a[i].weight;
    mobile[][] m_set = new mobile[k][n];
    double[][] fset = new double[k][n];
    double group_variance = 0.0;
    double total_variance = 0.0;
    double variance_ratio = 0.0;
    // [assignment and update steps lost in extraction: each mobile is
    //  assigned to the nearest centroid via difference(), each centroid
    //  is recomputed as the mean of its cluster, and the two steps
    //  repeat until no mobile switches cluster]

    // printing the clusters of weights to the file output_clusters.txt
    try{
        File file3 = new File("output_clusters.txt");
        BufferedWriter output2 = new BufferedWriter(new FileWriter(file3.getAbsoluteFile()));
        // [writing loops lost in extraction]
        output2.close();
    }catch(IOException e){
        System.out.println("There was a problem: " + e);
    }

    // printing the clusters of mobile names to output_name_clusters.txt,
    // and the mobiles similar to the last entered mobile to set.txt
    try{
        File file4 = new File("output_name_clusters.txt");
        BufferedWriter output3 = new BufferedWriter(new FileWriter(file4.getAbsoluteFile()));
        // [writing loops partially lost in extraction: for each cluster i the
        //  names m_set[i][j] are written; if a cluster contains the last
        //  entered mobile (last_name), a File "set.txt" is also opened and
        //  "the similar mobiles to last entry is:" followed by that
        //  cluster's members is written to it]
        output3.close();
    }catch(IOException e){
        System.out.println("There was a problem: " + e);
    }
}

// index of the centroid nearest to weight a (completion of the damaged original)
public static int difference(double a, double[] m){
    int k = m.length;
    double[] diff = new double[k];
    for(int i = 0; i < k; i++)
        diff[i] = Math.abs(a - m[i]);
    int best = 0;
    for(int i = 1; i < k; i++)
        if(diff[i] < diff[best]) best = i;
    return best;
}

// finding the length of each set; each value of a set is assigned -1 initially
public static int length(double[][] k, int a){
    int count = 0;
    for(int i = 0; i < k[a].length; i++)
        if(k[a][i] != -1) count++;
    return count;
}

References
1. Wikipedia, http://en.wikipedia.org/
2. TutorialsPoint, Data Mining tutorial, http://www.tutorialspoint.com/data_mining/
3. GSMArena, http://www.gsmarena.com/
4. http://saravananthirumuruganathan.wordpress.com/2010/01/27/k-means-clustering-algorithm/
5. http://nbviewer.ipython.org/github/nborwankar/LearnDataScience/blob/master/notebooks/D3.%20K-Means%20Clustering%20Analysis.ipynb
6. Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", Second Edition.
7. Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Clifford Stein, "Introduction to Algorithms".