
International Journal of Mechanical Engineering and Technology (IJMET) Volume 9, Issue 1, January 2018, pp. 482–489, Article ID: IJMET_09_01_052 Available online at http://iaeme.com/Home/issue/IJMET?Volume=9&Issue=1 ISSN Print: 0976-6340 and ISSN Online: 0976-6359

© IAEME Publication Scopus Indexed

ANALYSIS AND PREDICTION OF SHOW POPULARITY RATING USING INCREMENTAL K-MEANS ALGORITHM

D. Anand Asst. Professor, KL University

A.V.Satyavani, B.Raveena and M.Poojitha KL University

ABSTRACT Television reality shows are becoming more numerous by the day. There are many ways to determine the Television Rating Point (TRP). First, raw data is collected from the People’s Meter and the number of views is counted from it. The whole data set, which consists of channels, is then divided into clusters by channel. A particular channel is selected and its view count is taken. The channel or show is rated according to its number of views; for example, a show with a view count above 10,000 is given a rating of 10. Ordinarily, adding new data in the middle of the process means the whole process starts again. With the proposed algorithm, entries can be updated and new entries added in the middle of the process. Each television show is rated according to its number of views. The TRP can be compared across shows and viewed as bar graphs, pie charts or histograms. We use the K-Means and Incremental K-Means algorithms to compare TRP, and the difference between the two algorithms shows clearly in histograms. This is a simple way to predict TV show popularity. Inaccurate input data may lead to faulty values.

Keywords: K-Means, Incremental K-Means, Clustering, Data Object

Cite this Article: D. Anand, A.V.Satyavani, B.Raveena and M.Poojitha, Analysis and Prediction of Popularity Rating using Incremental K-Means Algorithm, International Journal of Mechanical Engineering and Technology 9(1), 2018. pp. 482–489. http://iaeme.com/Home/issue/IJMET?Volume=9&Issue=1

http://iaeme.com/Home/journal/IJMET 482 [email protected] D. Anand, A.V.Satyavani, B.Raveena and M.Poojitha

1. INTRODUCTION: Television has become a part of everyone’s life. Many shows are telecast on different channels, and the number of viewers watching them is increasing day by day. There are many ways to find the view count. We find which show has the highest TRP and consider it the most viewed show. For television rating we consider the People’s Meter, clustering, and the K-Means and Incremental K-Means clustering algorithms.

The People’s Meter is a ‘box’ that is hooked up to each television and is accompanied by a remote control unit. It records the number of viewers and details such as name, age and gender to identify which show they are viewing.

Clustering means dividing the data into subclasses called clusters. Every clustering model has its own advantages and disadvantages, so the algorithm has to be chosen to suit the given data. We use two algorithms, K-Means and Incremental K-Means. K-Means classifies data very effectively, but it depends mainly on the clusters selected. Single clusters are divided into sub-clusters; one of them is selected and the clustering operations are performed on it. The output depends mainly on the K value selected.

Incremental K-Means clustering is an extension of K-Means. New data arriving in the middle of the process is added to the existing database and is always grouped with the previous data. This strategy optimizes the process and adapts to applications. Of the two algorithms, Incremental K-Means is the more efficient. The results are plotted with the number of shows on the X-axis and the rating on the Y-axis. Different channels have different rating analyses. The graph contains histograms, and the channel with the highest popularity rating is clearly visible in it.

2. LITERATURE SURVEY Hartigan studied how to determine threshold values for a new cluster based on the existing clusters. His Leader clustering algorithm, which uses these threshold values, drew the research community’s attention to incremental clustering. The algorithm splits the data set into groups by distance. The first data point is selected as a leader object, and every object lying within some distance T of a leader joins that leader’s group. An object that lies farther than T from every leader becomes the leader of a new group. The data is processed only once. The Leader algorithm handles only static databases, but it became a basis for incremental clustering.

Charikar studied clustering algorithms that also handle dynamic databases, and later research extended the incremental approach and its variants to dynamic databases. That paper analyses several natural greedy algorithms and proves that they perform rather poorly in the dynamic setting.

BIRCH, proposed by Zhang, is especially suitable for large numbers of data items. It can be used for balancing in many clustering techniques, and iterations can also be performed with it. A data structure called the cluster feature tree is used to split the data points and to insert them incrementally in sorted order, which reduces memory usage. BIRCH is also described as the first clustering algorithm in the database area to handle noise effectively.


Ester et al. proposed Incremental DBSCAN, a data mining approach suited to databases that receive very frequent updates. When insertions and deletions are performed on the database, the clusters need to be updated. However, an insertion or deletion affects only a small part of the clustering, so the algorithm checks which part of the space is affected by the new data. This makes the algorithm very efficient for such updates.

Steinbach, Karypis and Kumar surveyed hierarchical clustering algorithms and K-Means. The main aim of their paper is to provide a detailed and comprehensive description of important clustering algorithms. It covers application areas such as computer science and machine learning, proposes several theories based on the clusters and techniques, and discusses approaches for clustering sequential data.

Jain and Dubes described many clustering tasks based on general clustering techniques. The main goal of their work is to identify the hardest parts of clustering data. The whole process is divided into two stages. In the first, appropriate processing steps such as pre-processing and feature extraction are selected, and values are measured using the chosen steps. Analyzing this stage requires good knowledge of the basics of data analysis and of the application domain. The second stage deals with the exact patterns present in the data sets; approximate values can be calculated using simple distance functions.

Nagy proposed hierarchical clustering algorithms. Their time and space complexity is described as very efficient compared with typical k-means algorithms. Hybrid algorithms can be developed from them, and they also offer simplicity and speed. The results may vary on other data sets and algorithms. It was concluded that variance results are not appropriate to the particular problem.

Ball and Hall developed a technique called ISODATA, in which the k value plays a major role. Based on thresholds, ISODATA can merge clusters and also split them. The procedure runs in a loop, and the k value, which determines the number of clusters, is updated each time an iteration completes. This iteration does not guarantee an optimum, although further developments in this area may achieve it. Many researchers have developed new operators to improve efficiency and optimization.

Stephen J. Phillips proposed two modifications. The first avoids unnecessary comparisons between data points, and the second relates to how the algorithm sorts the means. These modifications helped K-Means and other classification algorithms on a wide range of data sets.

Hirtoshi and Haruno proposed several techniques for training data, applying distributional clustering as the main feature for text classification. The difference between two features is measured as the difference between their distributions. Similar features form a cluster and take part in the same classification process. Using these distributions reduces the accuracy.
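Of the methods surveyed above, the Leader algorithm is simple enough to sketch in a few lines. The following is a minimal illustration, not the original authors' implementation: the function name `leader_clustering`, the Euclidean distance and the "first leader within T" rule are assumptions made for the example.

```python
import math


def leader_clustering(points, threshold):
    """Single-pass Leader clustering.

    Returns (leaders, assignments): one leader point per cluster and,
    for every input point, the index of the cluster it joined.
    """
    leaders = []
    assignments = []
    for p in points:
        # Join the first existing leader that lies within the threshold T.
        for i, leader in enumerate(leaders):
            if math.dist(p, leader) <= threshold:
                assignments.append(i)
                break
        else:
            # No leader is close enough: p becomes the leader of a new cluster.
            leaders.append(p)
            assignments.append(len(leaders) - 1)
    return leaders, assignments


# Hypothetical 2-D points; each point is visited exactly once.
sample = [(0, 0), (1, 0), (10, 10), (10, 11), (0, 1)]
leaders, assignments = leader_clustering(sample, threshold=2)
```

Because each point is compared only against the current leaders, the result depends on the order in which the data arrives and on the chosen value of T.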

3. K-MEANS ALGORITHM FOR CLUSTERING:

Clustering: A cluster is a group of similar objects. Clustering is the process of forming such clusters, that is, of grouping similar objects together. In many fields, clustering is a main task of statistical data analysis in data mining.


Figure 1.1 Clustering Model

K-Means Clustering: The main merits of K-Means are its simplicity, memory efficiency and speed, which allow it to run on large data sets. This learning technique is helpful for many clustering problems. It has some limitations: the number of clusters must be defined in advance even though it is usually unknown, and the algorithm is very sensitive to outliers and to the selection of initial seeds. K-Means is used to find groups that have not been explicitly labelled in the data. In both large-scale and small-scale applications, customers can be clustered by their needs and their preferred channels. Because it can produce any such grouping, the algorithm is adaptable to tasks such as behavioral segmentation and anomaly detection. In addition, if a data point is tracked over time, its switches between groups can be used to detect meaningful changes in the data. The algorithm obtains its results by looping: it alternates between assigning data points to the nearest centroid and recomputing the centroids.
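The assign-and-recompute loop described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the paper: the function name `k_means`, the random seeding from the data, the Euclidean distance and the sample points are all hypothetical.

```python
import math
import random


def k_means(points, k, max_iter=100, seed=0):
    """Basic K-Means on a list of equal-length tuples.

    Returns (centroids, labels), where labels[i] is the cluster of points[i].
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial seeds drawn from the data
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the loop has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


# Hypothetical data: two well-separated groups of 2-D points.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = k_means(data, k=2)
```

The sensitivity to initial seeds mentioned above shows up in the `seed` argument: on less clearly separated data, different seeds can give different final clusters.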

Choosing K-Value: The algorithm described above finds the clusters and data set labels for a particular pre-chosen K. To find the number of clusters in the data, the user needs to run the K-Means clustering algorithm for a range of K values and compare the results. The choice of K is often ambiguous and depends on the shape, scale and distribution of the data being clustered. One metric commonly used to compare results across different values of K is the mean distance between data points and their cluster centroid. K random points are chosen as initial centroids and the objects are assigned to them. It is important to keep in mind that K-Means clustering may not perform well if the data contains heavily overlapping groups. K-Means is related to vector quantization, a method often used in data mining. A nearest-neighbor classifier can be built on the output of K-Means to classify new data against existing data. The K-nearest-neighbor method is a related but distinct technique: it is a classification method rather than a clustering method, and instead of computing cluster means it decides by a vote among the K nearest neighbors.
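The comparison across K values can be sketched as below. This is an illustrative example only: it uses a small one-dimensional K-Means with deterministic seeding so that the block is self-contained, and the helper names (`assign`, `k_means_1d`, `mean_distance_to_centroid`) and the view counts are assumptions, not values from the paper.

```python
def assign(data, centroids):
    """Group each value with its nearest centroid."""
    groups = [[] for _ in centroids]
    for v in data:
        nearest = min(range(len(centroids)), key=lambda j: abs(v - centroids[j]))
        groups[nearest].append(v)
    return groups


def k_means_1d(values, k, max_iter=100):
    """K-Means on plain numbers, seeded by spreading centroids over the sorted data."""
    data = sorted(values)
    centroids = [float(data[(2 * j + 1) * len(data) // (2 * k)]) for j in range(k)]
    for _ in range(max_iter):
        groups = assign(data, centroids)
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(data, centroids)


def mean_distance_to_centroid(values, k):
    """Mean distance between each value and the centroid of its cluster."""
    centroids, groups = k_means_1d(values, k)
    total = sum(abs(v - c) for c, g in zip(centroids, groups) for v in g)
    return total / len(values)


# Hypothetical view counts for nine shows.
views = [100, 120, 130, 5000, 5200, 5100, 12000, 12500, 11800]
scores = {k: mean_distance_to_centroid(views, k) for k in range(1, 5)}
```

The mean distance generally shrinks as K grows, so the usual practice is to look for the K after which the improvement levels off rather than to pick the smallest value outright.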

4. COMPARING K-MEANS WITH DIFFERENT CLUSTERING TECHNIQUES:-

Hierarchical clustering: Both K-Means and hierarchical clustering can deal with huge amounts of data in data mining. However, the time complexity is linear for K-Means and quadratic for hierarchical clustering. K-Means is also more flexible than hierarchical clustering. The results of K-Means may vary from run to run, whereas the results of hierarchical clustering are constant. In our comparison, K-Means clustering gave better results than the hierarchical clustering algorithm.


Parallel clustering: An average value can be obtained from the similarity among clusters. K-Means executes as a single-threaded process, whereas parallel clustering executes as a multi-threaded process. Because the distance calculations are intensive, K-Means clustering is preferred over parallel clustering when an accurate result is required.

Gaussian mixture model: Mixture models are used to determine the subsets of a distribution whose parameters are unknown. The data sets with random variables are chosen according to K-dimensional distributions. In K-Means clustering the number of subsets is defined by the K value. The subsets are parameterized in K-Means, while they are undefined in the mixture model. For clusters of high dimensionality, the K-Means clustering model is chosen instead of the mixture model.

Incremental K-Means Clustering: Every clustering algorithm has disadvantages, and those of K-Means can be overcome with Incremental K-Means, which handles data in an existing database very effectively. General K-Means clustering is first performed on the static database and then on the new data. An existing K-Means clustering implementation, developed in Java, is used, and its results are stored in a database. With incremental clustering, new data arriving in the middle of the process is inserted directly into the existing clusters. Finally, the results of the two approaches are compared, and the performance and the correct threshold value are evaluated. When K-Means is used for incremental data, it must always be re-run on the whole data set. Incremental K-Means does not need to be re-run on the whole database; it is re-run only for the outside points that do not fit into the existing clusters. The proposed algorithm is therefore used to reduce computation time and give better accuracy. Incremental clustering is a generalized approach: clustering is performed on the database initially, and after new data is added the process continues from that point.
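The insertion step described above can be sketched as follows. This is one plausible reading under stated assumptions, not the paper's Java implementation: the function name `add_points`, the running-mean centroid update and the Euclidean distance threshold are hypothetical choices made for the example.

```python
import math


def add_points(centroids, counts, new_points, threshold):
    """Fold new points into existing clusters without re-clustering old data.

    centroids: list of tuples, updated in place.
    counts:    number of points already in each cluster, updated in place.
    Returns the "outside" points that are farther than `threshold` from
    every centroid and therefore still need clustering of their own.
    """
    outside = []
    for p in new_points:
        # Find the nearest existing centroid.
        j = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        if math.dist(p, centroids[j]) <= threshold:
            counts[j] += 1
            # Running-mean update: shift the centroid toward p by 1/n.
            centroids[j] = tuple(
                c + (x - c) / counts[j] for c, x in zip(centroids[j], p)
            )
        else:
            outside.append(p)
    return outside


# Hypothetical state left behind by an earlier K-Means run.
centroids = [(0.0, 0.0), (10.0, 10.0)]
counts = [2, 2]
outside = add_points(centroids, counts, [(3.0, 0.0), (50.0, 50.0)], threshold=5.0)
```

Only the points returned in `outside` would then be passed to a fresh clustering run, which is where the saving over re-running K-Means on the whole data set comes from.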

Predicting Television Popularity Rating using Incremental K-Means:- To predict the rating of a television show, we start from raw data for the channels available in the database. Using the Incremental K-Means algorithm, we select the required channel, which generates clusters of different shows such as reality shows, comedy shows and sports shows. We then select the required show and calculate its view count. Here the new data is combined with the existing data, as in Fig 1.2. The Incremental K-Means algorithm then predicts the rating of the shows from the view counts in the updated database. Once the database is updated, several possibilities are available. Data can be added directly to the database, even in the middle of the process. If a new cluster is required, it can be formed easily with Incremental K-Means. An updated cluster can also be merged with an existing cluster, which is not possible in the other clustering techniques. With these possibilities the new data can be modified accordingly.
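The final step, turning view counts into ratings, can be illustrated with a short sketch. The abstract gives only one rule (a view count above 10,000 earns a rating of 10); the bands below that threshold, the function names and the sample view counts are assumptions made for this example.

```python
def rating_from_views(view_count, top_views=10_000, top_rating=10):
    """Map a view count to a rating."""
    if view_count > top_views:
        return top_rating  # rule stated in the paper: more than 10,000 views -> 10
    # Assumption: one rating point per full 1,000 views, capped just below the top.
    return min(top_rating - 1, view_count // 1_000)


def top_rated_show(view_counts):
    """Return (show, rating) for the show with the highest view count."""
    show = max(view_counts, key=view_counts.get)
    return show, rating_from_views(view_counts[show])


# Hypothetical view counts for three show clusters on one channel.
views = {"Reality show": 12_400, "Comedy show": 7_300, "Sports show": 2_500}
ratings = {show: rating_from_views(count) for show, count in views.items()}
best_show, best_rating = top_rated_show(views)
```

When new view data arrives, only the affected entries in `views` need to be updated before the ratings are recomputed.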


Figure 1.2 Incremental K-Means

Comparison between Non-incremental and Incremental Clustering: Technology is growing day by day, and databases have become very dynamic in nature. New data is continually added to the clusters. With non-incremental clustering, adding new data decreases efficiency. Incremental clustering overcomes this: it improves efficiency and helps in grouping the new data.

User Clustering and Data sets: User patterns can be analyzed by presenting the identified clusters, which helps to increase stability. In a given data set, each user can be treated as a point and evaluated. After the Incremental K-Means algorithm is applied, the stability of that data increases gradually. For TRP prediction, the clusters are the different TV programs telecast. Categories such as comedy, drama, reality and educational are each given a priority and are treated as clusters. The view count of a show is calculated from the number of viewers watching it.

5. EXPERIMENTAL RESULTS:- For the first classifier, K-Means, the output depends entirely on the K value, as shown in Fig 1.3. This classifier generated the largest output values. The second classifier is Incremental K-Means, and its output is more efficient than that of the first classifier. The histogram of the number of shows telecast clearly shows an increase in the given attributes. The number of shows with a rating less than 1000 is much smaller.


Figure 1.3 Comparison of K-Means and Incremental K-Means

For the analysis of television show popularity rating, the most important factors are the number of shows and the view count. After classification, we achieve a highest accuracy of 97% using the Incremental K-Means algorithm, which also handles incremental data very efficiently. In the histogram, the tallest bars shown in the figure belong to K-Means. Our classifier predicts the rating of television shows in a simple way. We could not include many factors, because some attributes are not available for some shows and some data is inaccurate. Even so, one algorithm (Incremental K-Means) is more accurate and efficient than the other.

6. CONCLUSION: The proposed research aims to predict TV show popularity ratings. The two algorithms used for the analysis are K-Means and Incremental K-Means. Among the many algorithms studied, these two offered favorable complexity and efficiency. After performing classification and clustering, we found that our best results, at 97%, were achieved with the Incremental clustering algorithm. The attributes that contribute the most information are the number of shows and the view count for each channel. The channel and show with the highest TRP rating is identified as the most viewed show overall. More importantly, our research points to scope for further development in this area. With these clustering techniques, the TRP rating can be calculated easily instead of with the traditional methods, gaining efficiency and saving time.

REFERENCES:

[1] G. Adomavicius and E. Tuzhilin. Towards the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 2005.
[2] K. Ali and W. Van Stam. TiVo: making show recommendations using a distributed collaborative filtering architecture. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, 2004.
[3] M. Anderson, M. Ball, H. Boley, S. Greene, N. Howse, D. Lemire, and S. McGrath. Racofi: A rule-applying collaborative filtering system. Proceedings of Collaboration Agents Conference (COLA), 2003.


[4] M. Balabanovic and Y. Shoham. Fab: content-based, collaborative recommendation. Communications of the ACM, (3), 1997.
[5] J. Bar-Ilan, K. Keenoy, E. Yaari, and M. Levene. User rankings of search engine results. Journal of the American Society for Information Science and Technology, (9), 2007.
[6] D. Billsus, M.J. Pazzani, and J. Chen. A learning agent for wireless news access. In Proceedings of the 5th international conference on intelligent user interfaces, 2000.
[7] Han, J. and Kamber, M. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers: San Francisco, 2001.
[8] Neurosoft S.A. Neurosoft Envisioner, 1999.
[9] Labovitz, M. L. What is data mining and what are its uses, 2003.
[10] Jain, A. K. and Dubes, R. C. Algorithms for Clustering Data, 1988.
[11] Karger, D. and Pedersen. A Clustering-Based Approach to Large Document Collections, 1992.
[12] Rastogi, R. and Shim, K. Clustering algorithms for categorical attributes, 1999.
[13] Larsen and Aone. Fast and Effective Text Mining Using Linear-time Document Clustering. California, 1999.
[14] Anderberg. Clustering Analysis for Applications. New York: Academic Press, 1973.
[15] Everitt. Cluster Analysis, Second Edition.
[16] Pavel Berkhin. Survey of Clustering Data Algorithms.
[17] Keogh, E. and Chu, M. 2001b. A new approach to indexing large databases.
[18] Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning.
[19] Huang, Z. 1998. Extensions to the k-means algorithm for clustering large data sets with categorical values.
[20] Karypis, G., Han and Kumar, V. 1999a. Chameleon: A hierarchical clustering algorithm using dynamic modeling.
