International Journal of Pure and Applied Mathematics Volume 116 No. 12 2017, 57-65 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu doi: 10.12732/ijpam.v116i12.7 Special Issue ijpam.eu

PSTREE BASED ASSOCIATIVE CLASSIFICATION OF DATA STREAM MINING

S.P. Siddique Ibrahim1, M. Pavithra2 and M. Sivabalakrishnan3
1Research Scholar, Department of Computing Science and Engineering, VIT University, Chennai, India [email protected]
2PG Student, Department of Computer Science and Engineering, Kumaraguru College of Technology, Coimbatore-641049 [email protected]
3Associate Professor, School of Computing Science and Engineering, VIT University, Chennai. [email protected]

Abstract

Data stream mining offers a modern technique for examining the problems of continuous data. It is the process of extracting knowledge structures from continuous, rapid data records [1]. An important goal in data stream mining is to generate a compact representation of the data. The main aim of the proposed work is to build efficient classifiers and to improve performance by aligning the datasets with a stream. This method is also useful in reducing the space needed for the subsequent decision making process. In this paper, a new scheme called Prefix Stream (PST) tree, or PSTree, is proposed for associative classification; it maintains a compact structure of data streams. The PSTree is built in a single scan and discovers the exact set of frequent itemsets from that single scan.

Keywords: Data Streams, Data Stream Mining, Associative Classification.

1 Introduction


Data mining is the process of exploring large data sets and extracting hidden, previously unknown patterns from different data types. The discovery process is automatic or semi-automatic [1]. In decision making, data mining is also called the KDD (Knowledge Discovery) process. The main steps of KDD are data selection, data preprocessing, transformation, data mining, and evaluation. Data mining tasks include classification, clustering, association rule discovery, pattern recognition, regression, etc. [2].

Two learning models are available in data mining: supervised and unsupervised. In supervised learning, the training data contains the class label. For example, given a patient showing one or two symptoms, the task is to identify whether or not the patient suffers from a particular disease. On the other hand, learning from a training data set that has no class attribute is called unsupervised learning.

2 Proposed Work

A. Data Stream

Data stream mining is the process of extracting knowledge structures from continuous, rapid data records. A data stream is an ordered sequence of instances that, in many applications of data stream mining, can be read only once or a small number of times using limited computing and storage capabilities. The main aim of data stream mining is to predict the correct class value of the data. Machine learning techniques can be used to learn this prediction task from labeled examples in an automated fashion. Data streams have the following characteristics.

B. Windowing Model

One of the important issues in stream data mining is to find a model that suits the derivation of frequent itemsets from the streaming data. A transaction data stream is a sequence of incoming transactions, and a selection of the stream is called a window. A window, W, can be either time-based or count-based, and it is also either a landmark window or a sliding window. If W is time-based, it consists of a sequence of fixed-length time units, where a varying number of transactions may appear within each time unit. If W is count-based, it is composed of a sequence of batches, where each batch consists of an equal number of transactions. According to the data stream processing model, windows can be divided into three categories, as shown in Fig.1:

• Landmark-window based mining [5]
• Damped-window based mining [6]
• Sliding-window based mining [7]

For the window based approach, two naive methods can be used:

1) Whenever a new transaction enters the window or an old transaction leaves the window, the frequent itemsets are regenerated.

2) Store the frequent and non-frequent itemsets in a traditional data structure such as a prefix tree, and update their support counts whenever a transaction enters or leaves the window.
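The difference between a count-based landmark window and a count-based sliding window can be illustrated with a minimal Python sketch. The function names, the `window_batches` parameter and the example stream are assumptions made for illustration only; they are not taken from the paper.

```python
from collections import deque
from typing import Iterable, Iterator, List

Transaction = List[int]


def batches(stream: Iterable[Transaction], batch_size: int) -> Iterator[List[Transaction]]:
    """Group a transaction stream into count-based batches of equal size."""
    batch: List[Transaction] = []
    for transaction in stream:
        batch.append(transaction)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # a trailing partial batch, if the stream ends early
        yield batch


def landmark_windows(stream, batch_size):
    """After each batch, the window holds everything since the start."""
    window = []
    for batch in batches(stream, batch_size):
        window.extend(batch)
        yield list(window)


def sliding_windows(stream, batch_size, window_batches):
    """After each batch, the window holds only the most recent batches."""
    recent = deque(maxlen=window_batches)  # old batches fall out automatically
    for batch in batches(stream, batch_size):
        recent.append(batch)
        yield [t for b in recent for t in b]


# Hypothetical stream of four transactions, processed two at a time.
stream = [[3, 4, 6], [2, 5, 7], [3, 4, 6], [2, 5, 7]]
landmark = list(landmark_windows(stream, batch_size=2))
sliding = list(sliding_windows(stream, batch_size=2, window_batches=1))
```

In this sketch the landmark window keeps growing as batches arrive, while the sliding window stays at a fixed number of batches.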

Fig.1 Landmark Window and Sliding Window

Different data models have been proposed owing to the nature of data streams [8], [9], [10]. This paper deals with the landmark window model. In this model, all data from the starting time up to the current time are considered for mining. As the stream arrives, it is appended continuously as time grows.


The second data model is based on a sliding window. In this model, only recent data that fall within the window are considered for mining. Our work uses the landmark window data model over data streams for the prediction of heart disease.

3 STRUCTURE OF PSTREE

A prefix tree [12] is an ordered tree that represents the transactions of the stream in a highly compressed form. Each transaction read is inserted into the tree as a path. Since different transactions form different branches but may have several items in common, paths in the tree overlap. The more the paths overlap with one another, the more compact the tree becomes. The PSTree is based on this prefix tree schema and is a compact representation of data streams. As the window W slides, the tree is updated. Each time, the window W contains an equal number of transactions. The window slides batch by batch.
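How overlapping paths make a prefix tree compact can be seen in a minimal Python sketch. This is a plain prefix tree, not the full PSTree: the `Node` class, the `insert` and `count_nodes` helpers and the example transactions are assumptions made for illustration.

```python
class Node:
    """One prefix-tree node: an item plus the number of transactions passing through it."""

    def __init__(self, item=None):
        self.item = item      # None marks the "null" root
        self.count = 0        # support of the path ending at this node
        self.children = {}    # item -> child Node


def insert(root, transaction):
    """Insert one transaction as a path, reusing any existing shared prefix."""
    node = root
    for item in transaction:
        if item not in node.children:
            node.children[item] = Node(item)
        node = node.children[item]
        node.count += 1


def count_nodes(node):
    """Count this node and all of its descendants."""
    return 1 + sum(count_nodes(child) for child in node.children.values())


root = Node()
transactions = [[3, 4, 6], [2, 5, 7], [3, 4, 6], [2, 5, 7]]
for t in transactions:
    insert(root, t)
total_nodes = count_nodes(root)
```

Here the four transactions contain twelve items in total, but repeated transactions share a path, so the tree needs only six item nodes plus the root.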

A. Structure of the Prefix tree

The PSTree is built from nodes. The first node of the tree is the root node, which is initialized as "null". Each subsequent node is an ordinary node, which represents an itemset together with the total number of passes (i.e., support) for that itemset in the path from the current window. The leaf nodes of the tree contain the support, class label and batch counter. Two types of nodes are thus maintained in the tree.

B. Phases in construction of the tree

The tree is constructed in two phases: Insertion and Restructuring. The Insertion phase captures the stream contents into the tree following the item order of the I-list. The Restructuring phase restructures the tree in descending order of support according to the I-list. Restructuring is done after a batch of transactions has been inserted in the Insertion phase. The two phases are executed dynamically, one after the other.

C. Construction of tree


The batch size of the window is chosen from the table; based on this, the first two transactions of the dataset are selected. Initially the tree is empty, and the support counts are calculated as shown in the I-list. After this, the insertion phase and the restructuring phase are carried out. Support is used as a measure of the significance of a rule, and confidence is used as a measure of the strength of a rule. The steps involved in using the Pstree for prediction are:

STEP 1: Construction of the PSTree using streams flowing through the landmark window.
STEP 2: Restructuring the PSTree for compact representation.
STEP 3: Extracting frequent itemsets from the tree whenever there is a need for mining. This extraction consists of all items which satisfy the user defined threshold called minimum support. The frequent itemsets are represented as: [List of frequent item sets] support
STEP 4: Generate rules from the frequent itemsets which satisfy minimum confidence. The rules are represented as: [List of frequent item sets] [Class label] confidence
STEP 5: Prune the rules and build the classifier.
STEP 6: Use the classifier to predict the test data.
STEP 7: Find the accuracy of prediction.

Table 1.1 Isort List For Insertion Phase

Isort
3:1
4:1
6:1
2:1
5:1
7:1

Table 1.2 Isort List For Restructuring Phase

ISort Reconstruction
3:1
4:1
6:1
2:1
5:1
7:1

Table III Sample Dataset

A1    A2    A3    Class Label
3     4     6     1
2     5     7     2
3     4     6     1
2     5     7     2
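The I-lists for the two phases can be reproduced with a short Python sketch. It assumes that the insertion-phase I-list records support counts in order of first appearance and that the restructured I-list is sorted by descending support with ties left in place; the function names and the extra third transaction are hypothetical and are used only to show a case where the order changes.

```python
def build_ilist(transactions):
    """Support counts per item, in order of first appearance (insertion-phase I-list)."""
    ilist = {}
    for items, _label in transactions:
        for item in items:
            ilist[item] = ilist.get(item, 0) + 1
    return ilist


def restructured_ilist(ilist):
    """I-list sorted by descending support; the stable sort keeps ties in place."""
    return dict(sorted(ilist.items(), key=lambda kv: -kv[1]))


def reorder(items, ilist):
    """Sort the items of one transaction by their position in the given I-list."""
    rank = {item: pos for pos, item in enumerate(ilist)}
    return sorted(items, key=rank.__getitem__)


# First batch: the first two rows of the sample dataset, as (items, class label).
batch = [([3, 4, 6], 1), ([2, 5, 7], 2)]
ilist = build_ilist(batch)
sorted_ilist = restructured_ilist(ilist)

# Hypothetical extra transaction (not in the paper) so that supports differ.
ilist2 = build_ilist(batch + [([2, 4, 7], 2)])
sorted_ilist2 = restructured_ilist(ilist2)
```

For the first batch every item has support 1, so both I-lists keep the same order, which agrees with the two tables. With the hypothetical third transaction, items 4, 2 and 7 move to the front of the restructured list.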

Fig 2. Insertion phase

Fig 3. Restructuring phase
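Steps 3 to 6 of the procedure can be sketched in Python as follows. This is a brute-force illustration that enumerates itemsets directly from the transactions rather than from the PSTree, so it is not the authors' algorithm; the function names, thresholds and prediction rule are assumptions. Confidence is taken as the support of the itemset together with the class divided by the support of the itemset, and each itemset keeps only its best-supported class.

```python
from collections import Counter, defaultdict
from itertools import combinations


def class_association_rules(transactions, min_support, min_confidence):
    """Return {itemset: (class label, support, confidence)} for qualifying rules."""
    itemset_count = Counter()
    class_count = defaultdict(Counter)
    for items, label in transactions:
        for size in range(1, len(items) + 1):
            for itemset in combinations(sorted(items), size):
                itemset_count[itemset] += 1
                class_count[itemset][label] += 1

    rules = {}
    for itemset, support in itemset_count.items():
        if support < min_support:
            continue  # not a frequent itemset
        # Keep only the class with the largest support count for this itemset.
        label, label_support = class_count[itemset].most_common(1)[0]
        confidence = label_support / support
        if confidence >= min_confidence:
            rules[itemset] = (label, support, confidence)
    return rules


def predict(rules, items):
    """Pick the matching rule with the highest confidence, then the longest itemset."""
    matches = [
        (confidence, len(itemset), label)
        for itemset, (label, _support, confidence) in rules.items()
        if set(itemset) <= set(items)
    ]
    return max(matches)[2] if matches else None


# The four rows of the sample dataset, as (items, class label).
data = [([3, 4, 6], 1), ([2, 5, 7], 2), ([3, 4, 6], 1), ([2, 5, 7], 2)]
rules = class_association_rules(data, min_support=2, min_confidence=0.8)
```

A call such as `predict(rules, [3, 4, 6])` then returns the class label of the best matching rule, or `None` when no rule applies.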

The tree is refreshed continually, so that exact information about the frequent itemsets, along with the rules, is available for the current window. In cases where a rule item is associated with multiple classes, only the class with the largest support count is considered. The restructuring phase of the tree can be carried out using either the branch sorting or the path adjusting method proposed in [10].

4 Experimental Results And Performance Study

To evaluate the accuracy, efficiency, and scalability of Pstree, we performed an extensive performance study. In this section, we report our experimental results comparing Pstree against a popular classification method, CBA [13]. The results show that Pstree outperforms CBA [13] in terms of average accuracy, efficiency, and scalability.

Accuracy Graph (accuracy per dataset, CBA versus Pstree)

Dataset     CBA      Pstree
diabetes    93.75    96.6
heart       95.23    98.3
cancer      96.9     97
irish       98.6     98.8
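The average accuracy referred to in this section can be checked from the reported per-dataset values with a few lines of Python. The dictionary layout and variable names are assumptions; the numbers are the ones given in the accuracy graph.

```python
# Reported accuracy (in percent) per dataset, as read from the accuracy graph.
accuracy = {
    "CBA": {"diabetes": 93.75, "heart": 95.23, "cancer": 96.9, "irish": 98.6},
    "Pstree": {"diabetes": 96.6, "heart": 98.3, "cancer": 97.0, "irish": 98.8},
}

# Unweighted mean over the four datasets for each method.
average = {
    method: sum(per_dataset.values()) / len(per_dataset)
    for method, per_dataset in accuracy.items()
}

# Per-dataset gain of Pstree over CBA, in percentage points.
gain = {
    dataset: accuracy["Pstree"][dataset] - accuracy["CBA"][dataset]
    for dataset in accuracy["CBA"]
}
```

On these figures the unweighted averages come to about 96.12 for CBA and 97.68 for Pstree, and Pstree is ahead on each of the four datasets.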

5 CONCLUSIONS

The PSTree is proposed using the concept of the prefix tree, restructured to handle stream data. The constructed tree is compact, which reduces memory consumption. It also helps in finding the exact set of recent frequent itemsets and in predicting the class label for the requested tuple, and it reduces rule generation, which improves performance.


REFERENCES

[1] Suzan Wedyan, "Review and comparison of associative classification data mining approaches", International Journal of Computer, Control, Quantum and Information Engineering, Vol. 8, No. 1, 2014.

[2] U. Fayyad and K. Irani, "Multi-interval discretization of continuous-valued attributes for classification learning", Proceedings of IJCAI, pp. 1022-1027, 1993.

[3] K. Prasanna Lakshmi and C.R.K. Reddy, "A Survey on Different Trends in Data Streams", In Proc. of 2010 IEEE International Conference on Networking and Information Technology (ICNIT'10), pp. 451-455, 2010. ISBN: 978-1-4244-7577-3.

[4] G.S. Manku and R. Motwani, "Approximate frequency counts over data streams", In Proceedings of the 28th International Conference on Very Large Data Bases, pp. 346-357, 2002.

[5] J. H. Chang and W. S. Lee, "Finding Recent Frequent Itemsets Adaptively over Online Data Streams", In Proc. of KDD, 2003.

[6] J. H. Chang and W. S. Lee, "A Sliding Window Method for Finding Recently Frequent Itemsets over Online Data Streams", Journal of Information Science and Engineering, Vol. 20, No. 4, July 2004.

[7] G.S. Manku and R. Motwani, "Approximate frequency counts over data streams", In Proceedings of the 28th International Conference on Very Large Data Bases, pp. 346-357, 2002.

[8] J. H. Chang and W. S. Lee, "Finding Recent Frequent Itemsets Adaptively over Online Data Streams", In Proc. of KDD, 2003.
