Interactive Event-Driven Knowledge Discovery from Data Streams

Total Pages: 16

File Type: pdf, Size: 1020 KB

UC Irvine Electronic Theses and Dissertations

Title: Interactive Event-driven Knowledge Discovery from Data Streams
Permalink: https://escholarship.org/uc/item/8bc5k0j3
Author: Jalali, Laleh
Publication Date: 2016
License: CC BY 4.0, https://creativecommons.org/licenses/by/4.0/
Peer reviewed | Thesis/dissertation
eScholarship.org, powered by the California Digital Library, University of California

UNIVERSITY OF CALIFORNIA, IRVINE
Interactive Event-driven Knowledge Discovery from Data Streams
DISSERTATION submitted in partial satisfaction of the requirements for the degree of DOCTOR OF PHILOSOPHY in Computer Science
by Laleh Jalali
Dissertation Committee: Professor Ramesh Jain (Chair), Professor Gopi Meenakshisundaram, Professor Nalini Venkatasubramanian
© 2016 Laleh Jalali

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
LIST OF ALGORITHMS
ACKNOWLEDGMENTS
CURRICULUM VITAE
ABSTRACT OF THE DISSERTATION

1 Introduction
  1.1 Data-driven vs. Hypothesis-driven
  1.2 Knowledge Discovery from Temporal Data
  1.3 Contributions
  1.4 Thesis Outline

2 Understanding the Knowledge Discovery Process
  2.1 Knowledge Discovery Definitions
  2.2 New Paradigm: Actionable Knowledge Extraction
  2.3 Data Mining and Knowledge Discovery Software Tools
  2.4 Knowledge Discovery in Healthcare Applications
  2.5 Design Requirements for an Interactive Knowledge Discovery Framework

3 Literature Review
  3.1 Temporal Data Mining
    3.1.1 Definitions and Concepts
    3.1.2 Pattern Discovery
    3.1.3 Temporal Association Rules
    3.1.4 Time Series Data Mining
    3.1.5 Temporal Classification and Clustering
  3.2 Temporal Reasoning
    3.2.1 Interval-based Temporal Logic
    3.2.2 Event Calculus
    3.2.3 Situation Calculus
  3.3 Qualitative Models and Qualitative Reasoning
    3.3.1 Qualitative Models
    3.3.2 Qualitative Simulation
    3.3.3 Qualitative Data Mining

4 Data Model and Pattern Operators
  4.1 Physical World vs. Cyber World
    4.1.1 Event Perception in Humans
    4.1.2 Events in the Cyber World
    4.1.3 Bridging the Semantic Gap
  4.2 Time Model
  4.3 Event Model
  4.4 Hypothesis-driven Pattern Operators
    4.4.1 Selection Operation (ρ.P)
    4.4.2 Sequence Operation (ρ1; ρ2)
    4.4.3 Conditional Sequence Operation (ρ1 !∆t1 ρ2)
    4.4.4 Concurrency Operation (ρ1 ∥ ρ2)
    4.4.5 Alternation (ρ1 | ρ2)
    4.4.6 Time (!∆t ρ)
  4.5 Data-driven Operators
    4.5.1 Sequential Co-occurrence SEQ_CO[∆t](ES, ES′)
    4.5.2 Concurrent Co-occurrence CON_CO(ES, ES′)

5 Overall Framework
  5.1 Interactive Knowledge Discovery and Data Mining Process
  5.2 Re-visiting Design Principles
    5.2.1 Human-centered Analysis
    5.2.2 Expressiveness of the Pattern Query Language
    5.2.3 Interactive Modeling Approach
    5.2.4 Extensibility
    5.2.5 Result Interpretation
  5.3 General System Architecture
  5.4 Pattern Formulation and Query Language
    5.4.1 Automata Model for Pattern Formulation
  5.5 Graphical User Interface

6 Significant Pattern Extraction
  6.1 Co-occurrence Patterns
  6.2 Processing Algorithms
    6.2.1 Sequential Pattern Mining
    6.2.2 Conditional Sequential Pattern Mining
    6.2.3 Concurrent Pattern Mining
  6.3 Visual Analytics Process
  6.4 Simulation Results

7 Objective Self
  7.1 Introduction
  7.2 Toward Objective Self
    7.2.1 Anecdotal Self
    7.2.2 Diarizing Self
    7.2.3 Quantified Self
    7.2.4 Objective Self Has Arrived
  7.3 An Architecture for Objective Self
  7.4 Life Events
    7.4.1 Life Log
    7.4.2 Life Event Recognition
    7.4.3 Formal Concept Analysis
  7.5 Frequent Behavior Pattern Extraction
    7.5.1 Co-occurrence Behavior Patterns
    7.5.2 Processing Co-occurrence Patterns
  7.6 Evaluation
    7.6.1 Data Collection
    7.6.2 Sequential Co-occurrence: Commute Behavior and Activity Trends
    7.6.3 Concurrent Co-occurrence: Multitasking Behavior
    7.6.4 Patterns Across a Group of Users
    7.6.5 The Effect of Environmental Factors on Behavior

8 Asthma Risk Management
  8.1 Introduction
  8.2 Motivation
  8.3 Related Work in Asthma Risk Factor Prediction
  8.4 Approach
  8.5 Data Pre-processing
    8.5.1 Topic Modeling
    8.5.2 Environmental Event Stream Modeling
  8.6 Experiments
    8.6.1 Data-driven Risk Factor Recognition
    8.6.2 Interactive Asthma Risk Factor Assessment

9 Conclusion and Future Work

Bibliography

LIST OF FIGURES

1.1 From data to abstractions with the model building process. Models are used not only for prediction but for understanding and explaining.
1.2 The cycle of knowledge discovery.
2.1 The process of Knowledge Discovery in Databases.
2.2 Interactive Knowledge Discovery (IKDD) process.
3.1 Categorization of input data types and temporal data mining algorithms.
3.2 Taxonomy of temporal data mining.
3.3 Qualitative reasoning in action.
3.4 A qualitative tree induced from a set of examples for the function z = x² − y². The rightmost leaf, applying when attributes x and y are positive, says that z is strictly increasing in its dependence on x and strictly decreasing in its dependence on y [128].
3.5 The graphs present the data and the Q2Q-learned regression functions based on two different qualitative explanations of the data. Left, the case of a three-leaf qualitative tree; right, the case of a single-leaf qualitative tree saying y = M−(x) [129].
4.1 Interaction between the physical world and the cyber world. Sensors act as the interface between these two worlds. Objects and events are recognized in the cyber world, and effective models are built by understanding relations between cyber events.
4.2 Sample event media JSON for jogging and meeting events. Although events have general temporal, spatial, informational, structural, and experiential facets, their informational properties vary across different events.
4.3 Allen's interval relations between the intervals X and Y.
4.4 Eleven semi-interval relationships. Question marks (?) in the pictorial illustration stand for either the symbol denoting the event depicted in the same line (X or Y) or for a blank. The number of question marks reflects the number of qualitatively alternative implementations of the given relation [46].
4.5 (a) Example encoding of a sequence of events. E1+ and E1− represent the start and end times of event E1, respectively. Relational operators are used to indicate the ordered relations between start/end times. (b) Example of encoding a multi-event stream from two sequences of events.
4.6 Sample event streams ES(1), ES(2), and ES(3) and their corresponding event types. Patterns 1 and 2 are conditional sequential patterns, each one with two occurrences.
5.1 Interactive Knowledge Discovery/Data Mining Process.
5.2 High-level architecture of the framework.
5.3 Basic building blocks of FSA in a high-level pattern formulation and query language.
5.4 The automaton corresponding to pattern ρ1 with 3 event components. It demonstrates 3 ordinary states, 2 time states, and the EVALUATE() and SET() functions associated with each state. ..
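The hypothesis-driven operators listed under Chapter 4 combine event patterns with temporal constraints over event intervals. As a rough sketch of their semantics, here is a minimal Python illustration (the Event tuple and every function name are assumptions made for this summary, not the dissertation's implementation): selection filters an event stream by a predicate, sequence orders two sub-patterns in time, and conditional sequence bounds the gap between them by ∆t.

    from collections import namedtuple

    # Hypothetical interval-event representation: a type plus start/end times.
    Event = namedtuple("Event", ["etype", "start", "end"])

    def select(stream, predicate):
        """Selection (rho.P): keep the events that satisfy predicate P."""
        return [e for e in stream if predicate(e)]

    def sequence(s1, s2):
        """Sequence (rho1; rho2): an s1 event ends before an s2 event starts."""
        return [(a, b) for a in s1 for b in s2 if a.end < b.start]

    def conditional_sequence(s1, s2, dt):
        """Conditional sequence: a sequence whose time gap is at most dt."""
        return [(a, b) for a in s1 for b in s2 if a.end < b.start <= a.end + dt]

    def concurrency(s1, s2):
        """Concurrency: pairs of events whose intervals overlap."""
        return [(a, b) for a in s1 for b in s2 if a.start < b.end and b.start < a.end]

    # Toy event stream
    es = [Event("jogging", 1, 3), Event("meeting", 4, 6), Event("email", 5, 7)]
    jog = select(es, lambda e: e.etype == "jogging")
    meet = select(es, lambda e: e.etype == "meeting")
    print(conditional_sequence(jog, meet, dt=2))  # one (jogging, meeting) pair
    print(concurrency(meet, select(es, lambda e: e.etype == "email")))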
Recommended publications
  • Bottleneck Discovery and Overlay Management in Network Coded Peer-To-Peer Systems
    Mahdi Jafarisiavoshani (EPFL, Switzerland, mahdi.jafari@epfl.ch), Christina Fragouli (EPFL, Switzerland, christina.fragouli@epfl.ch), Suhas Diggavi (EPFL, Switzerland, suhas.diggavi@epfl.ch), Christos Gkantsidis (Microsoft Research, United Kingdom, [email protected]).

    ABSTRACT: The performance of peer-to-peer (P2P) networks depends critically on the good connectivity of the overlay topology. In this paper we study P2P networks for content distribution (such as Avalanche) that use randomized network coding techniques. The basic idea of such systems is that peers randomly combine and exchange linear combinations of the source packets. A header appended to each packet specifies the linear combination that the packet carries. In this paper we show that the linear combinations a node receives from its neighbors reveal structural information about the network. We propose algorithms to utilize this observation for topology management to avoid bottlenecks and clustering in network-coded P2P systems.

    1. INTRODUCTION: Peer-to-peer (P2P) networks have proved a very successful distributed architecture for content distribution. The design philosophy of such systems is to delegate the distribution task to the participating nodes (peers) themselves, rather than concentrating it to a low number of servers with limited resources. Therefore, such a P2P non-hierarchical approach is inherently scalable, since it exploits the computing power and bandwidth of all the participants. Having addressed the problem of ensuring sufficient network resources, P2P systems still face the challenge of how to efficiently utilize these resources while maintaining a decentralized operation. Central to this is the challenging management problem of connect-
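    The random linear combining described in the abstract can be sketched in a few lines. Here is a hedged Python illustration over GF(2), i.e. XOR (real systems use larger fields, and all names here are assumptions, not the paper's code): each packet carries a coefficient header identifying the linear combination its payload contains, which is exactly the structural side channel the paper exploits.

        import random

        def combine_gf2(packets):
            # Form one random linear combination over GF(2) of (coeffs, payload)
            # packets; XOR implements addition in GF(2).
            k = len(packets[0][0])          # number of original source packets
            n = len(packets[0][1])          # payload length in bytes
            coeffs, payload = [0] * k, [0] * n
            for cvec, data in packets:
                if random.random() < 0.5:   # include this packet w.p. 1/2
                    coeffs = [a ^ b for a, b in zip(coeffs, cvec)]
                    payload = [a ^ b for a, b in zip(payload, data)]
            return coeffs, payload

        # Two source packets; unit coefficient vectors mark the originals.
        p1 = ([1, 0], [0x10, 0x20])
        p2 = ([0, 1], [0x0A, 0x0B])
        print(combine_gf2([p1, p2]))  # e.g. ([1, 1], [26, 43]) if both included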
  • Data Mining – Intro
    Data Warehouse & Data Mining, UNIT-3 Syllabus. Classification: introduction; decision tree; tree induction algorithm (split algorithm based on information theory, split algorithm based on Gini index); naïve Bayes method; estimating predictive accuracy of a classification method; classification software; software for association rule mining; case study: KDD Insurance Risk Assessment.

    What is Data Mining? Data mining is (1) the efficient discovery of previously unknown, valid, potentially useful, understandable patterns in large datasets; (2) the analysis step of the "knowledge discovery in databases" (KDD) process; (3) the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.

    Examples of large datasets: government (IRS, NGA, …); large corporations (WALMART: 20M transactions per day; MOBIL: 100 TB geological databases; AT&T: 300M calls per day); credit card companies; scientific (NASA EOS project: 50 GB per hour; environmental datasets).

    The Knowledge Discovery in Databases (KDD) process is commonly defined with the stages: (1) Selection, (2) Pre-processing, (3) Transformation, (4) Data Mining, (5) Interpretation/Evaluation.

    Data mining methods: 1. Decision tree classifiers: used for modeling, classification. 2. Association rules: used to find associations between sets of attributes. 3. Sequential patterns: used to find temporal associations in time series. 4. Hierarchical
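    The two split criteria named in the syllabus are easy to make concrete. A minimal Python sketch (the helper names and toy labels are illustrative, not from the course material) computing the information-theory and Gini split scores for a candidate split:

        from collections import Counter
        from math import log2

        def entropy(labels):
            # Shannon entropy of a class-label list (information-theory criterion).
            n = len(labels)
            return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

        def gini(labels):
            # Gini index of a class-label list (CART split criterion).
            n = len(labels)
            return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

        def split_gain(parent, children, impurity):
            # Impurity reduction achieved by splitting `parent` into `children`.
            n = len(parent)
            return impurity(parent) - sum(len(ch) / n * impurity(ch) for ch in children)

        labels = ["yes"] * 9 + ["no"] * 5
        left, right = labels[:7], labels[7:]   # a candidate split
        print(split_gain(labels, [left, right], entropy))  # information gain
        print(split_gain(labels, [left, right], gini))     # Gini gain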
  • Developer Condemns City's Attitude Aican Appeal on Hold Avalanche
    Developer condemns city's attitude

    TERRACE -- Although city council says it favours development, it's not willing to put its money where its mouth is, says a local developer. And that, adds Stan Shapitka, has prompted him to drop plans for what would have been the city's largest residential sub-division project in many years. In August of last year

    under way this spring. The problem, he explained, was sanitary sewer lines within the sub-division had to be hooked up to existing city lines. The nearest was at Mountain Vista Drive, approximately 850 ft. from the southwest corner of the development property. Shapitka said he asked the city to build that line and to pave

    law when owners of lots adjacent to the sewer line and road developed their properties. Council, however, refused both requests. "It just seems the city isn't too interested in lending any type of assistance whatsoever," Shapitka said, adding council appeared to want the estimated $500,000 increased tax base the sub-division would bring but

    ment," said. However, alderman Danny Sheridan maintained, that is not the case. The issue, he said, was whether the city had subsidized developers in the past -- "I'm pretty sure it hasn't" -- and whether it was going to do so in this case. "Council didn't seem willing to do that." While conceding Shapitka

    Pointing out the development could still proceed if Shapitka paid the road and sewer connection costs, Sheridan said other developers had done so in the past. That included the city itself when it had developed properties it owned on the Birch Ave. bench and deJong Crescent.
  • Towards a Situation-Aware Architecture for the Wisdom Web of Things
    Chapter 4: Towards a Situation-Aware Architecture for the Wisdom Web of Things. Akihiro Eguchi, Hung Nguyen, Craig Thompson, and Wesley Deneke.

    Abstract: Computers are getting smaller, cheaper, and faster, with lower power requirements, more memory capacity, better connectivity, and are increasingly distributed. Accordingly, smartphones have become more of a commodity worldwide, and the use of smartphones as a platform for ubiquitous computing is promising. Nevertheless, we still lack much of the architecture and service infrastructure we will need to transition computers to become situation-aware to a similar extent that humans are. Our Everything is Alive (EiA) project illustrates an integrated approach to fill in the void with a broad scope of work encompassing Ubiquitous Intelligence (RFID, spatial searchbot, etc.), Cyber-Individual (virtual world, 3D modeling, etc.), Brain Informatics (psychological experiments, computational neuroscience, etc.), and Web Intelligence (ontology, workflow, etc.). In this paper, we describe the vision and architecture for a future where smart real-world objects dynamically discover and interact with other real or virtual objects, humans, or virtual humans. We also discuss how the vision in EiA fits into a seamless data cycle like the one proposed in the Wisdom Web of Things (W2T), where data circulate through things, data, information, knowledge, wisdom, services, and humans. Various open research issues related to the internal computer representations needed to model real or virtual worlds are identified, and the challenges of using those representations to generate visualizations in a virtual world and of "parsing" the real world to recognize and record these data structures are also discussed.
  • A Privacy-Aware and Secure System for Human Memory Augmentation
    A Privacy-aware and Secure System for Human Memory Augmentation. Doctoral Dissertation submitted to the Faculty of Informatics of the Università della Svizzera italiana in partial fulfillment of the requirements for the degree of Doctor of Philosophy, presented by Agon Bexheti under the supervision of Prof. Marc Langheinrich, September 2019. Dissertation Committee: Prof. Antonio Carzaniga (Università della Svizzera italiana, Switzerland), Prof. Fernando Pedone (Università della Svizzera italiana, Switzerland), Prof. Cecilia Mascolo (University of Cambridge, United Kingdom), Prof. Claudio Bettini (Università degli Studi di Milano, Italy). Dissertation accepted on 06 September 2019. Research Advisor: Prof. Marc Langheinrich. PhD Program Directors: Prof. Walter Binder and Prof. Silvia Santini.

    I certify that except where due acknowledgement has been given, the work presented in this thesis is that of the author alone; the work has not been submitted previously, in whole or in part, to qualify for any other academic award; and the content of the thesis is the result of work which has been carried out since the official commencement date of the approved research program. Agon Bexheti, Lugano, 06 September 2019.

    Abstract: The ubiquity of digital sensors embedded in today's mobile and wearable devices (e.g., smartphones, wearable cameras, wristbands) has made technology more intertwined with our lives. Among many other things, this allows us to seamlessly log our daily experiences in increasing numbers and quality, a process known as "lifelogging". This practice produces a great amount of pictures and videos that can potentially improve human memory. Consider how a single photograph can bring back distant childhood memories, or how a song can help us reminisce about our last vacation.
  • Survey of Verification and Validation Techniques for Small Satellite Software Development
    Stephen A. Jacklin, NASA Ames Research Center. Presented at the 2015 Space Tech Expo Conference, May 19-21, Long Beach, CA.

    Summary: The purpose of this paper is to provide an overview of the current trends and practices in small-satellite software verification and validation. This document is not intended to promote a specific software assurance method. Rather, it seeks to present an unbiased survey of software assurance methods used to verify and validate small satellite software and to make mention of the benefits and value of each approach. These methods include simulation and testing, verification and validation with model-based design, formal methods, and fault-tolerant software design with run-time monitoring. Although the literature reveals that simulation and testing has by far the longest legacy, model-based design methods are proving to be useful for software verification and validation. Some work in formal methods, though not widely used for any satellites, may offer new ways to improve small satellite software verification and validation. These methods need to be further advanced to deal with the state explosion problem and to make them more usable by small-satellite software engineers to be regularly applied to software verification. Last, it is explained how run-time monitoring, combined with fault-tolerant software design methods, provides an important means to detect and correct software errors that escape the verification process or those errors that are produced after launch through the effects of ionizing radiation.

    Introduction: While the space industry has developed very good methods for verifying and validating software for large communication satellites over the last 50 years, such methods are also very expensive and require large development budgets.
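    As a concrete illustration of that last point, a run-time monitor can be as simple as a set of invariants checked every control cycle, with a recovery action on violation. A minimal Python sketch (invariant names, thresholds, and the telemetry format are all invented for illustration; flight code would use its own language and framework):

        def battery_ok(state):
            return 3.0 <= state["bus_voltage"] <= 4.2

        def spin_rate_ok(state):
            return abs(state["spin_rate_dps"]) < 10.0

        INVARIANTS = [("battery", battery_ok), ("spin", spin_rate_ok)]

        def monitor_step(state):
            # Check every invariant once per control cycle.
            for name, check in INVARIANTS:
                if not check(state):
                    print(f"invariant '{name}' violated -> entering safe mode")
                    state["mode"] = "SAFE"   # recovery action
                    return False
            return True

        telemetry = {"bus_voltage": 2.7, "spin_rate_dps": 1.2, "mode": "NOMINAL"}
        monitor_step(telemetry)   # battery invariant fails, mode becomes SAFE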
  • The Fourth Paradigm
    ABOUT THE FOURTH PARADIGM: This book presents the first broad look at the rapidly emerging field of data-intensive science, with the goal of influencing the worldwide scientific and computing research communities and inspiring the next generation of scientists. Increasingly, scientific breakthroughs will be powered by advanced computing capabilities that help researchers manipulate and explore massive datasets. The speed at which any given scientific discipline advances will depend on how well its researchers collaborate with one another, and with technologists, in areas of eScience such as databases, workflow management, visualization, and cloud-computing technologies. This collection of essays expands on the vision of pioneering computer scientist Jim Gray for a new, fourth paradigm of discovery based on data-intensive science and offers insights into how it can be fully realized.

    "The impact of Jim Gray's thinking is continuing to get people to think in a new way about how data and software are redefining what it means to do science." —Bill Gates

    "I often tell people working in eScience that they aren't in this field because they are visionaries or super-intelligent—it's because they care about science and they are alive now. It is about technology changing the world, and science taking advantage of it, to do more and do better." —Rhys Francis, Australian eResearch Infrastructure Council

    "One of the greatest challenges for 21st-century science is how we respond to this new era of data-intensive
  • The Methbot Operation
    The Methbot Operation, December 20, 2016. White Ops has exposed the largest and most profitable ad fraud operation to strike digital advertising to date.

    Russian cybercriminals are siphoning millions of advertising dollars per day away from U.S. media companies and the biggest U.S. brand name advertisers in the single most profitable bot operation discovered to date. Dubbed "Methbot" because of references to "meth" in its code, this operation produces massive volumes of fraudulent video advertising impressions by commandeering critical parts of Internet infrastructure and targeting the premium video advertising space. Using an army of automated web browsers run from fraudulently acquired IP addresses, the Methbot operation is "watching" as many as 300 million video ads per day on falsified websites designed to look like premium publisher inventory. More than 6,000 premium domains were targeted and spoofed, enabling the operation to attract millions in real advertising dollars.

    At this point the Methbot operation has become so embedded in the layers of the advertising ecosystem, the only way to shut it down is to make the details public to help affected parties take action. Therefore, White Ops is releasing results from our research with that objective in mind.

    Information available for download:

    • IP addresses known to belong to Methbot, for advertisers and their agencies and platforms to block. This is the fastest way to shut down the operation's ability to monetize.

    • Falsified domain list and full URL list to show the magnitude of impact this operation had on the publishing
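    The first list item implies a concrete mitigation: filter ad requests against the published IP list. A minimal Python sketch (the CIDR below is a documentation-range stand-in and the file name is invented for illustration; this is not White Ops' tooling):

        import ipaddress

        def load_blocklist(path):
            # One IP or CIDR per line, e.g. from a published blocklist file.
            with open(path) as f:
                return [ipaddress.ip_network(line.strip()) for line in f if line.strip()]

        def is_blocked(ip, blocklist):
            addr = ipaddress.ip_address(ip)
            return any(addr in net for net in blocklist)

        # Stand-in list; a real deployment might call load_blocklist("blocked_ips.txt").
        blocklist = [ipaddress.ip_network("192.0.2.0/24")]
        for request_ip in ["192.0.2.17", "198.51.100.5"]:
            print(request_ip, "blocked" if is_blocked(request_ip, blocklist) else "allowed")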
  • Frequent Item Set Mining Using INC_MINE in Massive Online Analysis Framework
    Available online at www.sciencedirect.com, ScienceDirect, Procedia Computer Science 45 (2015) 133–142. International Conference on Advanced Computing Technologies and Applications (ICACTA-2015). P.K. Srimani (Former Chairman and Director, R & D, Bangalore University, Karnataka, India) and Malini M. Patil (Assistant Professor, Dept. of ISE, J.S.S. Academy of Technical Education, Bangalore-560060, Karnataka, India; Research Scholar, Bharathiar University, Coimbatore, Tamilnadu).

    Abstract: Frequent pattern mining is one of the major data mining techniques, which has been exhaustively studied in the past decade. Technological advancements have resulted in huge data generation, with an increased rate of data distribution. The generated data is called a 'data stream'. Data streams can be mined only by using sophisticated techniques. The paper aims at carrying out frequent pattern mining on data streams. Stream mining has great challenges due to high memory usage and computational costs. The Massive Online Analysis framework is a software environment used to perform frequent pattern mining using the INC_MINE algorithm. The algorithm uses the method of closed frequent mining. The data sets used in the analysis are the Electricity data set and the Airline data set. The authors also generated their own data set, OUR-GENERATOR, for the purpose of analysis, and the results are found interesting. In the experiments five samples of instance sizes (10000, 15000, 25000, 35000, 50000) are used with varying minimum support and window sizes for determining frequent closed itemsets and semi-frequent closed itemsets respectively. The present work establishes that association rule mining can be performed even in the case of data stream mining by the INC_MINE algorithm by generating closed frequent itemsets, which is the first of its kind in the literature.
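    INC_MINE itself maintains closed frequent itemsets incrementally inside MOA (a Java framework); as a simplified stand-in for the windowed idea, here is a Python sketch (all names and the toy stream are illustrative, and no closedness check is attempted): keep the most recent transactions in a sliding window and report itemsets meeting a minimum support.

        from collections import Counter, deque
        from itertools import combinations

        def frequent_itemsets(window_txns, min_support, max_size=2):
            # Count all itemsets up to max_size over the windowed transactions.
            counts = Counter()
            for txn in window_txns:
                for k in range(1, max_size + 1):
                    for items in combinations(sorted(txn), k):
                        counts[items] += 1
            return {s: c for s, c in counts.items() if c >= min_support}

        window = deque(maxlen=4)  # sliding window over the stream
        stream = [{"a", "b"}, {"a", "c"}, {"a", "b"}, {"b", "c"}, {"a", "b"}]
        for txn in stream:        # the oldest transaction falls out automatically
            window.append(txn)
        print(frequent_itemsets(window, min_support=3))  # {('a',): 3, ('b',): 3}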
  • Declarative Consciousness for Reconstruction
    Journal of Artificial General Intelligence 4(3) 89-129, 2014. Submitted 2013-7-31; Accepted 2014-1-6. DOI: 10.2478/jagi-2013-0007. Leslie G. Seymour ([email protected]), PersInVitro, LLC, San Jose, CA, USA. Editors: Randal Koene, Diana Deca.

    Abstract: Existing information technology tools are harnessed and integrated to provide digital specification of the human consciousness of individual persons. An incremental compilation technology is proposed as a transformation of LifeLog-derived persona specifications into a canonical representation of the neocortex architecture of the human brain. The primary purpose is to gain an understanding of the semantic allocation of the neocortex capacity. Novel neocortex content allocation simulators with browsers are proposed to experiment with various approaches to relieving the brain from overload conditions. An IT model of the neocortex is maintained, which is then updated each time new stimuli are received from the LifeLog data stream, new information is gained from brain signal measurements, and new functional dependencies are discovered between live persona consumed/produced signals.

    Keywords: consciousness, LifeLog, mind extension, brain preservation, brain machine interface

    1. Introduction: Within the last 10-15 years, research in two different disciplines has paid special attention to the challenge of computationally formalizing and capturing the information content in the human brain. Computational neuroscience launched the "whole brain emulation" (WBE) effort (Sandberg and Bostrom, 2008; Koene, 2013a and 2013b), while computer science launched a number of LifeLog R&D programs (Bell, 2004; IPTO, 2003). While the WBE approach focused on attempts to accurately model neurons and neural systems by exploring their internal neural structures, LifeLog collected and computationally interpreted digital logs of sensory input and response data to/from the human body.
  • The Failure of Mandated Disclosure
    ARTICLE: THE FAILURE OF MANDATED DISCLOSURE. Omri Ben-Shahar† and Carl E. Schneider††

    This Article explores the spectacular prevalence, and failure, of the single most common technique for protecting personal autonomy in modern society: mandated disclosure. The Article has four Parts: (1) a comprehensive summary of the recurring use of mandated disclosures, in many forms and circumstances, in the areas of consumer and borrower protection, patient informed consent, contract formation, and constitutional rights; (2) a survey of the empirical literature documenting the failure of the mandated disclosure regime in informing people and in improving their decisions; (3) an account of the multitude of reasons mandated disclosures fail, focusing on the political dynamics underlying the enactments of these mandates, the incentives of disclosers to carry them out, and, most importantly, on the ability of disclosees to use them; and (4) an argument that mandated disclosure not only fails to achieve its stated goal but also leads to unintended consequences that often harm the very people it intends to serve.

    INTRODUCTION: A. The Argument; B. The Method; C. The Style. I. THE DISCLOSURE EMPIRE: THE PERVASIVENESS OF MANDATED DISCLOSURE

    † Frank & Bernice J. Greenberg Professor of Law, University of Chicago. †† Chauncey Stillman Professor of Law & Professor of Internal Medicine, University of Michigan. Helpful comments were provided by workshop participants at The University of Pennsylvania, Georgetown University, The University of Michigan, Tel-Aviv University, and the Federal Reserve Banks of Chicago and Cleveland.
  • Performance Analysis of Hoeffding Trees in Data Streams by Using Massive Online Analysis Framework
    Int. J. Data Mining, Modelling and Management, Vol. 7, No. 4, 2015. P.K. Srimani (R & D Division, Bangalore University, Jnana Bharathi, Mysore Road, Bangalore-560056, Karnataka, India; Email: [email protected]) and Malini M. Patil (Department of Information Science and Engineering, J.S.S. Academy of Technical Education, Uttaralli-Kengeri Main Road, Mylasandra, Bangalore-560060, Karnataka, India; Email: [email protected]; corresponding author).

    Abstract: The present work is mainly concerned with understanding the problem of classification, from the data stream perspective, on evolving streams using the massive online analysis framework with regard to different Hoeffding trees. Advancement of technology in both hardware and software has led to the rapid storage of data in huge volumes. Such data is referred to as a data stream. Traditional data mining methods are not capable of handling data streams because of their ubiquitous nature. The challenging task is how to store, analyse and visualise such large volumes of data. Massive data mining is a solution for these challenges. In the present analysis five different Hoeffding trees are used on the available eight dataset generators of the massive online analysis framework, and the results predict that the stagger generator happens to be the best performer for different classifiers.

    Keywords: data mining; data streams; static streams; evolving streams; Hoeffding trees; classification; supervised learning; massive online analysis; MOA; framework; massive data mining; MDM; dataset generators.
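    The trees in question grow by applying the Hoeffding bound: with probability 1 − δ, the true mean of a variable with range R lies within ε = sqrt(R² ln(1/δ) / 2n) of the mean observed over n samples, so a node can split as soon as the gain gap between the two best attributes exceeds ε. A small Python sketch (the gain numbers are invented for illustration):

        from math import log, sqrt

        def hoeffding_bound(value_range, delta, n):
            # epsilon = sqrt(R^2 * ln(1/delta) / (2n))
            return sqrt(value_range ** 2 * log(1.0 / delta) / (2.0 * n))

        g_best, g_second = 0.42, 0.30   # heuristic (e.g. info-gain) estimates
        eps = hoeffding_bound(value_range=1.0, delta=1e-7, n=600)
        print(f"epsilon = {eps:.4f}, split now: {g_best - g_second > eps}")  # True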