Database and Knowledge-Base Systems: Data Mining Martin Ester

Simon Fraser University, School of Computing Science
Graduate Course, Spring 2006
CMPT 843, SFU, Martin Ester, 1-06

Introduction
[Fayyad, Piatetsky-Shapiro & Smyth 1996]
Knowledge discovery in databases (KDD) is the process of (semi-)automatic extraction of knowledge from databases which is
• valid
• previously unknown
• and potentially useful.
Remarks
• (semi-)automatic: distinction from manual analysis / OLAP. Typically, some user interaction necessary.
• valid: in the statistical sense.
• previously unknown: not explicit, no "common sense knowledge".
• potentially useful: for some given application.

Introduction
Statistics [Hand, Mannila & Smyth 2001]
• representation of uncertainty
• model-based inferences
• focus on numeric data
Machine Learning [Mitchell 1997]
• knowledge representation
• search strategies
• focus on symbolic data
Database Systems [Han & Kamber 2000]
• data management
• integration of data mining with DBS
• scalability for large databases

Introduction
KDD Process [Han & Kamber 2000]
Figure: Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge
KDD Process [Fayyad, Piatetsky-Shapiro & Smyth 1996]
Figure: Database → Focussing → Preprocessing → Transformation → Data Mining → Pattern → Evaluation → Knowledge

Data Mining
Definition [Fayyad, Piatetsky-Shapiro & Smyth 1996]
• Data Mining is the application of efficient algorithms to determine the patterns contained in some database.
Data-Mining Tasks
Figure: four panels illustrating clustering, classification, association rules (A and B → C), and generalisation
other tasks: regression, outlier detection, ...
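
As a small illustration of the association-rule task (rules of the form A and B → C), the following Python sketch computes the support and confidence of one rule over a toy set of transactions. The transactions, the item names, and the helper `support_confidence` are hypothetical and not taken from the course material. Support and confidence are the usual rule-quality measures; the slide itself does not define them.

```python
from typing import FrozenSet, Iterable, List, Tuple


def support_confidence(
    transactions: List[FrozenSet[str]],
    antecedent: Iterable[str],
    consequent: Iterable[str],
) -> Tuple[float, float]:
    """Return (support, confidence) of the rule antecedent -> consequent."""
    lhs = frozenset(antecedent)
    both = lhs | frozenset(consequent)
    # Count transactions that contain the antecedent, and the whole rule.
    n_lhs = sum(1 for t in transactions if lhs <= t)
    n_both = sum(1 for t in transactions if both <= t)
    support = n_both / len(transactions) if transactions else 0.0
    confidence = n_both / n_lhs if n_lhs else 0.0
    return support, confidence


# Hypothetical toy database of five transactions.
transactions = [
    frozenset({"A", "B", "C"}),
    frozenset({"A", "B"}),
    frozenset({"A", "C"}),
    frozenset({"A", "B", "C"}),
    frozenset({"B", "C"}),
]

support, confidence = support_confidence(transactions, {"A", "B"}, {"C"})
print(f"support={support:.2f}, confidence={confidence:.2f}")
# prints: support=0.40, confidence=0.67
```

Here two of the five transactions contain A, B and C (support 0.4), and two of the three transactions containing A and B also contain C (confidence 2/3).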

Trends in KDD Research
KDD 2000 Conference
• New Data Mining Algorithms
• Efficiency and Scalability of Data Mining Algorithms
• Interactive Data Exploration
• Visualization
• Constraints and Evaluation in the KDD Process

Trends in KDD Research
KDD 2002 Conference
• Statistical Methods
• Frequent Patterns
• Streams and Time Series
• Visualization
• Web Search and Navigation
• Text and Web Page Classification
• Intrusion and Privacy
• Applications

Trends in KDD Research
KDD 2004 Conference
• Frequent Patterns / Association Rules
• Clustering
• Mining Spatio-Temporal Data
• Mining Data Streams
• Dimensionality Reduction
• Privacy-Preserving Data Mining
• Mining Biological Data
• Applications (Web, biological data, security, ...)

Trends in KDD Research
KDD 2005 Conference
• Clustering
• Privacy
• Mining Spatio-Temporal Data
• Mining Data Streams
• SVMs
• Text and Web Mining
• Mining (Social) Networks
• Graph Mining (best paper on graphs over time)

Trends in KDD Research
Increasing Importance
• Mining data streams
• Clustering high-dimensional data
• Mining spatio-temporal data
• Privacy-preserving data mining
• Network analysis
• Graph mining
• Multi-relational data mining

Overview of this Course
Prerequisites
• Basics in database systems and statistics
• Introductory graduate data mining course
Objectives
• Introduction to some hot topics of data mining research
• Introduction to some ongoing research projects of our DDM Lab
• General research methodology
• Presentation skills
Start thesis work after this class!

Overview of this Course
Topics
• Clustering high-dimensional data
• Mining data streams
• Spatio-temporal data mining
• Multi-relational data mining
• Graph mining

Overview of this Course
Format
• Tutorial surveys
• Research paper presentations (and discussions)
• Small research projects
Grading
• Paper presentation
• Project presentation
• Project report
originality, technical quality, presentation quality

Clustering High-Dimensional Data
Applications
Biological Data
• Micro-array data: rows = genes, columns = conditions / experiments, value measures the expression level of a gene under a given condition
• Often: thousands of columns
• Co-regulated genes: similar expression levels in a subset of all conditions
Text / Web Data
• Text / web document: attributes = term frequencies
• Typically, >> 1000 relevant terms
• Document clusters: document sets that share some important terms

Clustering High-Dimensional Data
Curse of Dimensionality
• The more dimensions, the larger the (average) pairwise distances
• Clusters only in lower-dimensional subspaces
Figure: clusters only in the 1-dimensional subspace "salary"

Clustering High-Dimensional Data
Approaches
• Approach 1: a cluster is a densely connected region in data space. Find interesting subspaces, then clusters within these subspaces
  o density threshold hard to determine (should be different)
  o clusters highly overlapping
• Approach 2: start with a full-dimensional clustering and iteratively refine the clusters and the relevant cluster dimensions
  o result ill-defined
  o number of clusters / cluster dimensions hard to determine

Mining Data Streams
Applications
• Telecommunications
  o Telecommunications providers collect call records (from, to, when, how long, ...)
  o Want to use the data not only for billing, but also for analysis (monitor trends in usage, customer segmentation, campaign design, ...)
• Sensor networks
  o Network of distributed sensors measuring several parameters such as precipitation, temperature, amount of traffic, blood pressure, ...
  o Data need to be monitored and analyzed on-line (immediate response)

Mining Data Streams
Challenges
• Characteristics of data streams
  o Massive volumes of data
  o Records arrive at a rapid rate
• Requirements
  o Main memory too small to store all records
  o Each record is examined at most once
  o Real-time response, i.e. very efficient processing

Mining Data Streams
Approach
Figure: Data Stream 1, ..., Data Stream m → Stream Processing Engine (with a synopsis in main memory) → (Approximate) Answer
• Summarize using samples, histograms or novel methods such as CF-trees
How to maximize the approximation accuracy?
How to exploit the temporal dimension (aging of data)?

Spatio-Temporal Data Mining
Applications
• Geo-marketing
  Purchasing patterns for particular geographical areas (e.g., for choice of store location)
• Health care data analysis
  Analysis of the spread of diseases
  Interventions by Public Health Authorities
Data referencing the earth surface (spatial) and the time (temporal)

Spatio-Temporal Data Mining
Challenges
• Independence assumption no longer valid
  Attribute values of neighboring objects are typically correlated
• Operations on spatial data are very expensive
  Spatial objects are complex (lines, polygons, 3D surfaces, ...), which makes the corresponding operations very expensive
• Temporal dimension
  Blows up the pattern search space
What patterns do we really want to find in a spatio-temporal DB?
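
The correlation between attribute values of neighboring objects can be quantified. The sketch below uses Moran's I, a standard spatial auto-correlation statistic. This choice is an assumption, since the slides do not name a specific measure. The helper names `morans_i` and `chain_neighbors`, the binary neighbor weights, and the toy data (five locations along a line) are hypothetical.

```python
from typing import Dict, List, Sequence


def morans_i(values: Sequence[float], neighbors: Dict[int, List[int]]) -> float:
    """Moran's I with binary weights: w_ij = 1 if j is a neighbor of i, else 0.

    Assumes the values are not all equal and at least one neighbor pair exists.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    # Total weight W = number of (i, j) neighbor pairs.
    total_weight = sum(len(js) for js in neighbors.values())
    # Sum of products of deviations over all neighbor pairs.
    cross = sum(dev[i] * dev[j] for i, js in neighbors.items() for j in js)
    squared = sum(d * d for d in dev)
    return (n / total_weight) * (cross / squared)


def chain_neighbors(n: int) -> Dict[int, List[int]]:
    """Hypothetical neighborhood: locations on a line, adjacent ones are neighbors."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}


nbrs = chain_neighbors(5)
smooth = [1.0, 2.0, 3.0, 4.0, 5.0]        # neighbors have similar values
alternating = [1.0, 5.0, 1.0, 5.0, 1.0]   # neighbors have dissimilar values

print(morans_i(smooth, nbrs))       # positive: neighboring values are correlated
print(morans_i(alternating, nbrs))  # negative: neighboring values alternate
```

A value clearly above zero indicates that neighboring objects tend to have similar attribute values, which is the situation in which the independence assumption fails.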

Spatio-Temporal Data Mining
Approaches
• Consider spatial auto-correlation
  Find only patterns that deviate from what is expected according to spatial auto-correlation
• Efficient support by the DBMS
  Indexes, basic operations, ...
• Models for spatio-temporal data mining
  Definition of new pattern types such as spatio-temporal trends

Multi-Relational Data Mining
Applications
• Mining biological data
  o Molecular biologists collect data on genes, proteins, gene expression, metabolic pathways, ...
  o Want to learn, e.g., about the process of gene regulation
• Text mining
  o Using information extraction methods, entities (companies, persons, genes, ...) and their relationships (directs, married, regulates, ...) can be extracted from a text document
  o Can be used as input for true text mining: finding knowledge rather than documents

Multi-Relational Data Mining
Limitations of Existing Methods
• Emerging applications are inherently multi-relational
  o Input: multiple tables (entity sets) and their relationships
  o Record characteristics: own attributes, related records from other tables and the attributes of these related records
• Existing data mining methods are single-relational
  o Input: a single table (relation); output: refers to attributes of a single table
  o Data representation as a universal relation (single table) is possible, but may lose a lot of information
propositional logic

Multi-Relational Data Mining
Approaches
• Inductive Logic Programming
  o Logic program: facts (records) and deduction rules (background knowledge)
  o Task: find (first-order) logic rules with some target predicate in the conclusion
  o Restrict the search space by user-specified (syntactic) constraints
huge search space
syntactic constraints are hard to define
only for classification tasks

Multi-Relational Data Mining
Approaches
• First-order versions of standard data mining algorithms
  o Multi-relational decision trees
  o Multi-relational association rules
  What rule format / semantics (in particular, aggregation operations)?
• Multi-relational distances
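
To make the question about aggregation operations more concrete, here is a minimal Python sketch of one way to represent multi-relational data in a single table: each target record is extended with aggregates over its related records. The tables, attribute names, the helper `flatten`, and the choice of aggregates (count and mean) are all hypothetical. This is an illustration under those assumptions, not a method prescribed by the course.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical target table: customer id -> own attribute (age).
customers: Dict[int, int] = {1: 34, 2: 51, 3: 27}

# Hypothetical related table (one-to-many): (customer id, purchase amount).
purchases: List[Tuple[int, float]] = [
    (1, 20.0), (1, 35.0), (2, 5.0), (1, 5.0), (2, 15.0),
]


def flatten(
    customers: Dict[int, int], purchases: List[Tuple[int, float]]
) -> Dict[int, Dict[str, float]]:
    """Build one row per customer with aggregates over the related purchases."""
    by_customer: Dict[int, List[float]] = defaultdict(list)
    for cid, amount in purchases:
        by_customer[cid].append(amount)

    table: Dict[int, Dict[str, float]] = {}
    for cid, age in customers.items():
        amounts = by_customer.get(cid, [])
        table[cid] = {
            "age": age,
            "n_purchases": len(amounts),
            # Customers without purchases get 0.0 (an arbitrary convention).
            "avg_amount": mean(amounts) if amounts else 0.0,
        }
    return table


flat = flatten(customers, purchases)
for cid, row in sorted(flat.items()):
    print(cid, row)
```

The resulting single table can be fed to a single-relational algorithm, but the individual purchase amounts are no longer recoverable from the count and the mean. This is one way a single-table representation loses information.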
