Summer Internship

Report On Design & Implementation of Classification & Clustering Algorithm for Mobile Phones

At

IDRBT, Hyderabad

6th May 2014 to 5th July 2014

Submitted By:

Ratnesh Chandak

B. Tech- CSE (2nd Year)

Roll Number: CS12B1030

Indian Institute of Technology, Hyderabad

Guided By:

Dr. V.N. Sastry

Designation: Professor

IDRBT, Hyderabad

Date of Submission: 4th July 2014


Abstract

This project is broadly divided into two parts. In the first part we find the intersection of common objects from different sets, taking the example of finding common colored balls across different sets of containers.

In the second part we describe the K-Means Clustering Algorithm for grouping similar numerical sets, and we experimentally find the optimal value of "K" for the clustering algorithm using the Elbow Method. Both parts are explained with the example of mobile phones and implemented in Java.


Certificate

Certified that this is a bonafide record of summer internship project work entitled

Design & Implementation of Classification & Clustering Algorithm for Mobile Phones

Done By

Ratnesh Chandak

B. Tech - CSE (2nd Year)

Indian Institute of Technology, Hyderabad

at IDRBT, Hyderabad during 6th May to 5th July 2014

Prof. V N Sastry

(Project Guide)

IDRBT, Hyderabad


Acknowledgement

We completed this project as a summer internship at IDRBT (Institute for Development and Research in Banking Technology), Hyderabad under the guidance of Dr. V.N. Sastry. We would like to thank all our friends and our mentor for their great support in completing this project.

Ratnesh Chandak Date:

(Project Trainee)


Contents

Chapter 1 Introduction

1.1 Project Objectives

1.2 Classification

1.3 Clustering

Chapter 2 Container-Ball Algorithm

2.1 Algorithm

2.2 Application

2.3 Remarks

Chapter 3 Cluster Analysis of Mobile Phones

3.1 Assigning weights to Mobile Phones

3.2 K-Means Clustering Algorithm

3.3 Finding Optimal K for Clustering Algorithm

3.4 Observations

Chapter 4 Conclusion and Future Work

Appendix A: Program of Weight calculation & K-Means Clustering Algorithm

References


Chapter 1: Introduction

1.1 Project Objectives

1. To find the intersection of same colored balls of the same sizes from a set of containers.

2. To do a cluster analysis of the mobile phones available in the market.

1.2 Classification

Classification refers to the task of predicting a class label for given unlabeled objects in the dataset.

Example:

1. A bank loan officer analyzes past data to learn which loan applicants are safe and which are risky, and grants loans accordingly.

2. A marketing manager at a company analyzes customer profiles to predict whether a customer with a given profile will buy a new computer.

How does Classification work?

Classification is a two-step process:

1. Training Phase (Learning step): Using past history (training data), a classification algorithm is built, which defines the classification rules for the Classifier.

2. Classification Phase (Labeling step): The Classifier, according to the classification rules, gives a label (class) to any new object.
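The two phases above can be sketched with a deliberately tiny, hypothetical example (not from the report): a one-rule classifier for the loan scenario. The training phase learns an income threshold that best separates "safe" from "risky" applicants on the training data; the classification phase then labels any new applicant with that rule. The class name, field, and data are all illustrative assumptions.

```java
public class LoanClassifier {
    double threshold;  // the single classification rule learned in training

    // Training Phase: try each training income as a candidate threshold and
    // keep the one that mislabels the fewest training applicants.
    public void train(double[] incomes, String[] labels) {
        double best = incomes[0];
        int bestErrors = Integer.MAX_VALUE;
        for (double t : incomes) {
            int errors = 0;
            for (int i = 0; i < incomes.length; i++) {
                String predicted = incomes[i] >= t ? "safe" : "risky";
                if (!predicted.equals(labels[i])) errors++;
            }
            if (errors < bestErrors) { bestErrors = errors; best = t; }
        }
        threshold = best;
    }

    // Classification Phase: apply the learned rule to label a new applicant.
    public String classify(double income) {
        return income >= threshold ? "safe" : "risky";
    }
}
```

A real classifier would learn rules over many attributes, but the two-phase structure (learn rules, then label new objects) is the same.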

1.3 Clustering

Clustering refers to grouping a set of objects in such a way that objects in the same cluster (group) are more similar to each other, with respect to some chosen features, than to objects in other clusters.

Example:

1. Clustering helps in classifying documents on the web for information discovery.

2. It can be used in earthquake studies, city planning, market research, pattern recognition, data analysis, image processing, etc.


There are different types of clustering algorithms used for cluster analysis, mainly:

1. Hierarchical Clustering
2. Centroid-Based Clustering
3. Distribution-Based Clustering
4. Density-Based Clustering


Chapter 2: Container-Ball Algorithm

2.1 Objective: To find the number of common same-colored balls across different sets of containers.

User Input:

1. Number of Containers (M)
2. Number of Colors (N)

The input can be given from the console or from an input file.

Assumption: The number of containers must be greater than one.

Algorithm:

1. Sort the containers in increasing order of the number of balls using Quicksort.
2. Check the number of containers (M):
   A. If M is even, divide the M containers into M/2 groups of two consecutive containers. Find the intersection of common colored balls of the two containers in each group, delete the two parent containers, and form a child container holding the common colored balls of its parents. Then set M -> M/2.
   B. If M is odd, divide the first M-1 containers into (M-1)/2 groups of two consecutive containers and carry out the same intersection step, keeping the last container unchanged. Then set M -> (M-1)/2 + 1.
3. Repeat steps 1 and 2 until only one container remains; its contents are the common colored balls.

Pseudo Code:

A. Main(Container[] M)
1. QUICKSORT(M)
2. while M.length ≠ 1
3.     if M.length % 2 == 0
4.         Container[] temp = new Container[M.length/2]
5.         int j = 0, i = 0
6.         while i < M.length
7.             temp[j] = combine(M[i], M[i+1])
8.             i = i + 2
9.             j = j + 1
10.        M = temp
11.    else
12.        Container[] temp = new Container[(M.length-1)/2 + 1]
13.        int j = 0, i = 0
14.        while i < M.length - 1
15.            temp[j] = combine(M[i], M[i+1])
16.            i = i + 2
17.            j = j + 1
18.        temp[j] = M[M.length - 1]
19.        M = temp

B. Container Class
1. String name
2. String[] colorarray = new String[N]
3. int[] freqarray = new int[N]
4. public Container(String name)
5.     this.name = name
6. public void defineColor(String name2, int b, int c)
7.     Color2 temp2 = new Color2(name2)
8.     temp2.freq(b)
9.     colorarray[c] = temp2.name
10.    freqarray[c] = temp2.frequency

C. Color Class
1. String name
2. int frequency
3. public Color2(String name)
4.     this.name = name
5. public void freq(int a)
6.     frequency = a

D. combine(Container firstone, Container secondone)
1. Container commondata = new Container("tempcontainer")
2. for each color index k in firstone.colorarray
3.     for each color index l in secondone.colorarray
4.         if firstone.colorarray[k].compareTo(secondone.colorarray[l]) == 0
5.             if firstone.freqarray[k] <= secondone.freqarray[l]
6.                 commondata.defineColor(firstone.colorarray[k], firstone.freqarray[k], k)
7.             else
8.                 commondata.defineColor(firstone.colorarray[k], secondone.freqarray[l], k)
9. return commondata

Figure 2.1 M containers, each containing N colored balls of different sizes.

Example:

Input:

          Container 1           Container 2           Container 3
          Size1 Size2 Size3     Size1 Size2 Size3     Size1 Size2 Size3
Color 1     4    42    13        45    12     6        55    29     1
Color 2     8    12     7        35    17     7        23     5     5
Color 3     3    10     2        32    29    31        32     3    34
Color 4    12    11    23        11    27    71        21     0     2

Output table containing the common colored balls of each size:

          Size1 Size2 Size3
Color 1     4    12     1
Color 2     8     5     5
Color 3     3     3     2
Color 4    11     0     2
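The example above can be checked with a short sketch. Because taking the per-cell minimum is associative, the pairwise halving procedure of the algorithm gives the same result as folding the minimum across all containers. The class and method names below are illustrative assumptions; each container is represented as a colors-by-sizes count matrix, as in the example tables.

```java
public class CommonBalls {
    // Intersection of two containers: per-color, per-size minimum count
    public static int[][] combine(int[][] a, int[][] b) {
        int[][] out = new int[a.length][a[0].length];
        for (int c = 0; c < a.length; c++)
            for (int s = 0; s < a[0].length; s++)
                out[c][s] = Math.min(a[c][s], b[c][s]);
        return out;
    }

    // The halving scheme: repeatedly pair up consecutive containers,
    // carrying the last one forward when their number is odd,
    // until a single container of common balls remains.
    public static int[][] commonBalls(int[][][] containers) {
        int m = containers.length;
        while (m > 1) {
            int half = m / 2;
            int[][][] next = new int[(m + 1) / 2][][];
            for (int i = 0; i < half; i++)
                next[i] = combine(containers[2 * i], containers[2 * i + 1]);
            if (m % 2 == 1) next[half] = containers[m - 1]; // odd case
            containers = next;
            m = next.length;
        }
        return containers[0];
    }
}
```

Running this on the three example containers reproduces the output table above.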

2.2 Related Applications

1. True Caller mobile application.
2. Facebook's "People You May Know" feature.


2.3 Remarks

1. This application can be used by the marketing manager of a mobile application company who wants to estimate the number of mobile phones compatible with a new application, by finding the intersection of the application's required features across different sets of mobiles.

2. This application can also be used as a weekly "Diet Check": finding the intersection of proteins, vitamins, etc. consumed each day shows which food components a person ate every day of the week. Hospitals could use the same idea to regulate the nutrients given to patients.


Chapter 3: Cluster Analysis of Mobile Phones

In this project we present a centroid-based clustering algorithm, namely the K-Means Clustering Algorithm. We take the mobile phones in the market as our example, group them into appropriate clusters, and for any newly launched mobile we can apply the same method to find its appropriate cluster. K-Means has one limitation: it can be applied only to numerical datasets. To overcome this in the case of mobile phones, we assign each mobile a numerical weight.

3.1 Assigning weights to mobile phones

To assign a weight to a mobile phone, follow these steps:

1. List all the features on which the phones are to be compared.
2. Assign a priority order to the features.
3. Give an (X, Y) coordinate to each feature as follows:
   a. Take some increasing function X = F(Z) and let Z = 1, 2, 3, ..., n, where n is the number of features.
   b. Take an increasing function of X for the second coordinate, Y = G(X).
4. For a given mobile, calculate the distance from the origin (0, 0) of each feature present in that mobile.
5. The standard deviation of these distances is the mobile's weight.


In this project, we have considered the following features and the number before each feature corresponds to its priority order.

1. 2G
2. Bluetooth version 2.0
3. Radio
4. Mp3 Player
5. Photo viewer/editor
6. Camera MP 0.3-2.0
7. MMS support
8. Bluetooth version 3.0
9. Touchscreen
10. Camera MP 2.0-7.9
11. USB
12. Document Viewer
13. 3G
14. GPRS
15. EDGE
16. Wi-Fi
17. GPS
18. 256 MB RAM
19. Camera MP 8.0-13.0
20. HD Recording
21. Java support
22. Mp4 Player
23. Accelerometer
24. Screen Protection
25. Geo-Tagging
26. Face Detection
27. Dual core processor
28. 512 MB RAM
29. Wi-Fi hotspot
30. 1 GB RAM
31. Bluetooth version 4.0
32. Camera MP above 8.0
33. Quad core processor
34. 2 GB RAM
35. 4G
36. GPU (Graphic Card)
37. Compass
38. Proximity sensor
39. Gyroscope
40. NFC
41. HDMI port
42. Barometer
43. 3 GB RAM
44. Temperature sensor

Figure 3.1 Table representing the priority order of mobile features.

Functions and formulas used for calculating the weight of a mobile:

1. In this project we used X = 3Z + 1 and Y = X.
2. Distance from origin: d = sqrt(X^2 + Y^2)
3. Mean: m = (1/n) * Σ d_i
4. Variance: v = (1/n) * Σ (d_i - m)^2
5. Standard Deviation (the weight): σ = sqrt(v)
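Putting the formulas together, a minimal sketch of the weight computation (the class name and example priorities are hypothetical; population variance, i.e. division by n, is assumed):

```java
public class WeightSketch {
    // Weight of a mobile = standard deviation of its features' distances
    // from the origin, using X = 3Z + 1 and Y = X.
    public static double weight(int[] priorities) {
        double[] d = new double[priorities.length];
        for (int i = 0; i < priorities.length; i++) {
            double x = 3 * priorities[i] + 1;     // X = 3Z + 1
            double y = x;                         // Y = X
            d[i] = Math.sqrt(x * x + y * y);      // distance from origin
        }
        double mean = 0;
        for (double v : d) mean += v;
        mean /= d.length;
        double var = 0;                           // population variance assumed
        for (double v : d) var += (v - mean) * (v - mean);
        var /= d.length;
        return Math.sqrt(var);
    }
}
```

For example, a phone having only features of priority 1 and 2 gives distances 4*sqrt(2) and 7*sqrt(2), and hence a weight of 1.5*sqrt(2).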


Figure 3.2 Price versus Weight analysis of the 100-mobile dataset used in our study (Price 0-60000 on the y-axis, Weight 0-60 on the x-axis), showing the sparse distribution of mobile phones. Later we form clusters from these numerical weights.

For the Java implementation of the weight calculation, see Appendix A 5.1.1.

3.2 K-Means Clustering Algorithm

Input:
1. Numerical dataset of mobile phones (weights)
2. Desired number of clusters (K)

Idea: Each cluster has its own centroid; the elements are grouped according to their distances from the different centroids.

Algorithm:
1. Randomly choose K weights as the initial centroids.
2. For each weight, find the nearest centroid and assign the weight to the cluster associated with that centroid.
3. Update the centroid of each cluster based on the weights in that cluster; typically the new centroid is the mean of all weights in the cluster.
4. Repeat steps 2 and 3 until no point switches cluster.
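The four steps above can be sketched for one-dimensional weights. This is a minimal illustration, not the report's implementation: the class name is hypothetical, and the centroids are seeded with the first K weights rather than randomly, for reproducibility.

```java
import java.util.Arrays;

public class KMeans1D {
    // Returns the final centroids of k clusters over 1-D weights.
    public static double[] cluster(double[] weights, int k) {
        double[] centroids = Arrays.copyOf(weights, k); // step 1 (deterministic seed)
        int[] assign = new int[weights.length];
        boolean changed = true;
        while (changed) {
            changed = false;
            // step 2: assign each weight to its nearest centroid
            for (int i = 0; i < weights.length; i++) {
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (Math.abs(weights[i] - centroids[c]) < Math.abs(weights[i] - centroids[best]))
                        best = c;
                if (assign[i] != best) { assign[i] = best; changed = true; }
            }
            // step 3: move each centroid to the mean of its cluster
            for (int c = 0; c < k; c++) {
                double sum = 0; int n = 0;
                for (int i = 0; i < weights.length; i++)
                    if (assign[i] == c) { sum += weights[i]; n++; }
                if (n > 0) centroids[c] = sum / n;
            }
            // step 4: repeat until no point switches cluster
        }
        return centroids;
    }
}
```

On the toy weights {1, 2, 10, 11} with k = 2, the centroids settle at 1.5 and 10.5.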


Screenshots of implementation of K-Means Clustering Algorithm:

Figure 3.3 Java GUI for entering feature for each mobile phone by user.

Figure 3.4 Java GUI asking the cluster size from user.


Figure 3.5 Output of similar mobile phones which the user has entered.

For the Java implementation of the K-Means Clustering Algorithm, see Appendix A 5.1.2.

3.3 Finding Optimal K for clustering algorithm

3.3.1 Thumb rule: K = greatest integer of sqrt(n/2), where n is the number of data points.

In this project, our dataset has 100 mobile phones, so the optimal K by the thumb rule = greatest integer of sqrt(50) = 7.

3.3.2 Elbow Method:

1. Calculate the sum of squared error (SSE): the sum of squared distances between each cluster member and its cluster centroid.
2. Plot a graph of SSE against the number of clusters.
3. As K increases, SSE decreases, since more clusters mean smaller distances to the centroids. But each successive increase in K does not give the same drop; at some point the marginal improvement falls sharply, producing an angle in the graph.
4. That value of K is called the elbow, and it is the optimal value of K.
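The SSE computation behind the elbow plot can be sketched as follows (a minimal illustration with a hypothetical class name; the data and cluster assignments are toy values, not the report's dataset):

```java
public class ElbowSketch {
    // SSE: sum of squared distances from each member to its cluster centroid.
    // assign[i] gives the cluster index of weights[i]; k is the cluster count.
    public static double sse(double[] weights, int[] assign, int k) {
        double[] centroid = new double[k];
        int[] count = new int[k];
        for (int i = 0; i < weights.length; i++) {
            centroid[assign[i]] += weights[i];
            count[assign[i]]++;
        }
        for (int c = 0; c < k; c++)
            if (count[c] > 0) centroid[c] /= count[c];   // centroid = cluster mean
        double sum = 0;
        for (int i = 0; i < weights.length; i++) {
            double d = weights[i] - centroid[assign[i]];
            sum += d * d;
        }
        return sum;
    }
}
```

For the weights {1, 2, 10, 11}, one cluster gives SSE = 82, while two well-chosen clusters give SSE = 1: exactly the kind of sharp drop the elbow method looks for.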


Figure 3.6 Experimental graph of SSE (0-10000) versus the number of clusters (2-98). Optimal K from the graph = (5 + 11)/2 = 8.

Having found K = 8 by the Elbow Method, we set the cluster size to 8 and obtain the following clusters in our output file:

Set1= { Canvas Turbo, HTC Desire, Apple ipad 5, LG Optimus F7, Tablet P, Thrive, Spice Mi-530, Lumia 610, ATIV, Continuum I400, micromax a76 }

Set2= { HTC Flyer, LG P505, Motorola Fire, Nokia N8, Nokia X7-00, HP Veer, Ascend Y530, Acer Liquid Z5, Alcatel OT-916, Xolo A600, Micromax A089, Micromax A121, Micromax A111, Nokia E7, Nokia T7, Samsung Epic, Geeksphone Peak }

Set3= { HTC E8, Toshiba Excite, Nokia 808, Samsung P8510, Alcatel Idol X+ }

Set4= { Asha 230, Apple iPhone, BlackBerry Curve, Samsung Gravity, U8300, Acer E210, Alcatel T10, Vodafone 845, Xolo X500, Micromax A47, Micromax Ninja A91, Micromax A52, Spice M-6900, Spice Mi-437, Spice Mi-506, Micromax P600, i8510 INNOV8, I8000 Omnia II, ZTE open }

Set5= { Nokia 225, HTC Viva, BlackBerry 7290, LG C360, LG CF360, Motorola A1010, Motorola C336, Nokia 3210, -02, Nokia E61i, Samsung D428, Samsung E740, Samsung S5630C, HP iPAQ 6515, Huawei T552, Huawei G6620, Huawei G6310, Acer M900, P526, Asus P750, Vodafone 553, Vodafone 1231, Toshiba TS32, Micromax Q1, Micromax Q80, Spice M-5390, Spice QT-58, Spice QT-50, Nokia N78, Nokia 7510, Nokia X3, Nokia E63, Nokia C2-03, Samsung G600 }

Set6= { Xperia Z, Lumia 930, BlackBerry Q5, LG G3 }

Set7= { GalaxyAce, BlackBerry Z3, Acer Zenfone, Alcatel Fire S }

Set8= { Galaxy S5, Xperia Z2 }

3.4 Observations

1. Our method removes the K-Means Clustering Algorithm's restriction to numerical datasets: assigning each mobile a weight and then applying the clustering algorithm solves the problem.
2. In our study we found K = 8 to be the optimal cluster size for our dataset of 100 mobile phones.


Chapter 4: Conclusion and Future Work

1. With the first application, one can find the intersection between different sets.

2. With the second application, one can group the different types of mobiles in the market into appropriate clusters, and the same idea of assigning weights and then clustering can be applied to other kinds of objects.

3. In future, one can implement the above problem with other clustering algorithms and compare the algorithms on accuracy, user-friendliness, and time-complexity.


Appendix A: Program of Weight calculation & K-Means Clustering Algorithm

5.1 Cluster Analysis of Mobile phones

5.1.1 Calculating weight of Mobile Phones

public static double calculate_weight(int[] b){
    double weight = 0.0;
    double x, y;
    Double[] distance = new Double[b.length];
    for(int z = 0; z < b.length; z++){
        x = 3 * b[z] + 1;                       // X = 3Z + 1
        y = x;                                  // Y = X
        distance[z] = Math.sqrt(x*x + y*y);     // distance from origin
    }

    //calculating mean
    double mean = 0.0;
    for(int z = 0; z < b.length; z++)
        mean = mean + distance[z];
    mean = mean / b.length;

    //calculating variance
    double variance = 0.0;
    for(int z = 0; z < b.length; z++)
        variance = variance + (distance[z] - mean) * (distance[z] - mean);
    variance = variance / b.length;

    double deviation = Math.sqrt(variance);
    weight = deviation;
    return weight;
}

5.1.2 K-Means Algorithm

public static void K_Means(mobile[] a){

    Scanner user_input = new Scanner(System.in);
    System.out.println("Enter Number of clusters you want to give:");
    int k = user_input.nextInt();
    int n = a.length;
    double[] m = new double[k];

    //initializing initial centres of clusters with the first k weights
    for(int i = 0; i < k; i++)
        m[i] = a[i].weight;

    mobile[][] m_set = new mobile[k][n];
    double[][] fset = new double[k][n];

    boolean moved = true;
    while(moved){
        //empty slots are marked -1 (see length())
        for(int i = 0; i < k; i++)
            for(int j = 0; j < n; j++){
                fset[i][j] = -1;
                m_set[i][j] = null;
            }
        //assign each mobile to the cluster of its nearest centroid
        int[] size = new int[k];
        for(int j = 0; j < n; j++){
            int c = difference(a[j].weight, m);
            m_set[c][size[c]] = a[j];
            fset[c][size[c]] = a[j].weight;
            size[c]++;
        }
        //update centroids; stop when they no longer move
        double[] updated = cal_mean(fset);
        moved = false;
        for(int i = 0; i < k; i++){
            if(m[i] != updated[i]) moved = true;
            m[i] = updated[i];
        }
    }

    // printing the weights of each cluster in file output_clusters.txt
    try{
        File file3 = new File("output_clusters.txt");
        BufferedWriter output2 = new BufferedWriter(new FileWriter(file3.getAbsoluteFile()));
        for(int i = 0; i < k; i++){
            for(int j = 0; j < length(fset, i); j++)
                output2.write(fset[i][j] + " ");
            output2.newLine();
        }
        output2.close();
    }catch(IOException e){
        System.out.println("There was a problem: "+ e);
    }

    // printing the names of each cluster in file output_name_clusters.txt
    try{
        File file4 = new File("output_name_clusters.txt");
        BufferedWriter output3 = new BufferedWriter(new FileWriter(file4.getAbsoluteFile()));
        String last_name = a[n-1].name;
        for(int i = 0; i < k; i++){
            output3.write("Set" + (i+1) + "= { ");
            int mark = 0;
            for(int j = 0; j < length(fset, i); j++){
                output3.write(m_set[i][j].name + ", ");
                if((m_set[i][j].name).equals(last_name))
                    mark = 1;
            }
            //if the last entered mobile falls in this cluster,
            //list its similar mobiles in set.txt
            if(mark == 1){
                try{
                    File file5 = new File("set.txt");
                    BufferedWriter output4 = new BufferedWriter(new FileWriter(file5.getAbsoluteFile()));
                    output4.write("\t the similar mobiles to last entry is:");
                    output4.newLine();
                    for(int z = 0; z < length(fset, i); z++)
                        output4.write(m_set[i][z].name + ", ");
                    output4.close();
                }
                catch(IOException e){
                    System.out.println("There was a problem: "+ e);
                }
            }
            output3.write("\t } ");
            output3.newLine();
        }
        output3.close();
    }catch(IOException e){
        System.out.println("There was a problem: "+ e);
    }
}

public static int difference(double a, double[] m){
    int k = m.length;
    double[] diff = new double[k];
    for(int i = 0; i < k; i++){
        if(a >= m[i]) diff[i] = a - m[i];
        else diff[i] = m[i] - a;
    }
    double temp = diff[0];
    int marker = 0;
    for(int i = 0; i < k; i++){
        if(temp > diff[i]){
            temp = diff[i];
            marker = i;
        }
    }
    return marker;
}

//calculating updated centroid for each set
public static double[] cal_mean(double[][] k){
    double[] mean = new double[k.length];
    for(int i = 0; i < k.length; i++){
        double sum = 0.0;
        int count = length(k, i);
        for(int j = 0; j < count; j++)
            sum = sum + k[i][j];
        if(count > 0) mean[i] = sum / count;
    }
    return mean;
}


//finding length of each set; unused slots were initialised to -1
public static int length(double[][] k, int a){
    int count = 0;
    for(int i = 0; i < k[a].length; i++)
        if(k[a][i] != -1) count++;
    return count;
}


References

1. http://en.wikipedia.org/
2. http://www.tutorialspoint.com/data_mining/
3. http://www.gsmarena.com/
4. http://saravananthirumuruganathan.wordpress.com/2010/01/27/k-means-clustering-algorithm/
5. http://nbviewer.ipython.org/github/nborwankar/LearnDataScience/blob/master/notebooks/D3.%20K-Means%20Clustering%20Analysis.ipynb
6. "Data Mining: Concepts and Techniques", second edition, by Jiawei Han and Micheline Kamber
7. "Introduction to Algorithms" by Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein
