Database and Knowledge-Base Systems: Data Mining
Martin Ester
Simon Fraser University School of Computing Science
Graduate Course Spring 2006
CMPT 843, SFU, Martin Ester, 1-06 1 Introduction
[Fayyad, Piatetsky-Shapiro & Smyth 96]
Knowledge discovery in databases (KDD) is the process of (semi-)automatic extraction of knowledge from databases which is
• valid
• previously unknown
• and potentially useful.
Remarks
• (semi-)automatic: distinction from manual analysis / OLAP. Typically, some user interaction is necessary.
• valid: in the statistical sense.
• previously unknown: not explicit, no "common sense knowledge".
• potentially useful: for some given application.
Introduction
Statistics [Hand, Mannila & Smyth 2001]
• representation of uncertainty
• model-based inferences
• focus on numeric data
Machine Learning [Mitchell 1997]
• knowledge representation
• search strategies
• focus on symbolic data
Database Systems [Han & Kamber 2000]
• data management
• integration of data mining with DBS
• scalability for large databases
Introduction
KDD Process [Han & Kamber 2000]
(Figure: Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge)
KDD Process [Fayyad, Piatetsky-Shapiro & Smyth 1996]
(Figure: Database → Focussing → Preprocessing → Transformation → Data Mining → Patterns → Evaluation → Knowledge)
Data Mining
Definition [Fayyad, Piatetsky-Shapiro, Smyth 1996]
• Data Mining is the application of efficient algorithms to determine the patterns contained in some database.
Data-Mining Tasks
• clustering
• classification (figure: points labelled a and b)
• association rules (figure: A and B → C)
• generalisation
other tasks: regression, outlier detection . . .
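As a small illustration of the association-rule task, the sketch below evaluates the rule "A and B → C" on a handful of made-up transactions. The slides do not define quality measures; the standard support and confidence measures are used here, and the transaction data and function names are assumptions for illustration only.

```python
# Toy transactions (hypothetical data, not from the course).
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "B", "C"},
    {"B", "C"},
    {"A", "C"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of antecedent and consequent together, relative to the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


rule_support = support({"A", "B", "C"}, transactions)
rule_confidence = confidence({"A", "B"}, {"C"}, transactions)
print(f"A and B -> C: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

Here two of the five transactions contain A, B and C, and three contain A and B, so the rule holds in two thirds of the cases where its left-hand side applies.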
Trends in KDD Research
KDD 2000 Conference
• New Data Mining Algorithms
• Efficiency and Scalability of Data Mining Algorithms
• Interactive Data Exploration
• Visualization
• Constraints and Evaluation in the KDD Process
Trends in KDD Research
KDD 2002 Conference
• Statistical Methods
• Frequent Patterns
• Streams and Time Series
• Visualization
• Web Search and Navigation
• Text and Web Page Classification
• Intrusion and Privacy
• Applications
Trends in KDD Research
KDD 2004 Conference
• Frequent Patterns / Association Rules
• Clustering
• Mining Spatio-Temporal Data
• Mining Data Streams
• Dimensionality Reduction
• Privacy-Preserving Data Mining
• Mining Biological Data
• Applications (Web, biological data, security, . . .)
Trends in KDD Research
KDD 2005 Conference
• Clustering
• Privacy
• Mining Spatio-Temporal Data
• Mining Data Streams
• SVMs
• Text and Web Mining
• Mining (Social) Networks
• Graph Mining (best paper on graphs over time)
Trends in KDD Research
Increasing Importance
• Mining data streams
• Clustering high-dimensional data
• Mining spatio-temporal data
• Privacy-preserving data mining
• Network analysis
• Graph mining
• Multi-relational data mining
Overview of this Course
Prerequisites
• Basics in database systems and statistics
• Introductory graduate data mining course
Objectives
• Introduction to some hot topics of data mining research
• Introduction to some ongoing research projects of our DDM Lab
• General research methodology
• Presentation skills
Start thesis work after this class!
Overview of this Course
Topics
• Clustering high-dimensional data
• Mining data streams
• Spatio-temporal data mining
• Multi-relational data mining
• Graph mining
Overview of this Course
Format
• Tutorial surveys
• Research paper presentations (and discussions)
• Small research projects
Grading
• Paper presentation
• Project presentation
• Project report
originality, technical quality, presentation quality
Clustering High-Dimensional Data
Applications
Biological Data
• Micro-array data: rows = genes, columns = conditions / experiments; each value measures the expression level of a gene under a given condition
• Often: thousands of columns
• Co-regulated genes: similar expression levels in a subset of all conditions
Text / Web Data
• Text / web document: attributes = term frequencies
• Typically, >> 1000 relevant terms
• Document clusters: document sets that share some important terms
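To make "attributes = term frequencies" concrete, the following minimal sketch turns two toy documents into term-frequency vectors over a shared vocabulary. The documents, the whitespace tokenization and all names are assumptions for illustration; real text collections yield far more than the handful of terms shown here.

```python
from collections import Counter

# Two toy documents (hypothetical data).
docs = {
    "d1": "data mining finds patterns in data",
    "d2": "web mining mines web pages",
}


def term_frequencies(text):
    """Count how often each lower-cased, whitespace-separated term occurs."""
    return Counter(text.lower().split())


# One attribute (dimension) per distinct term in the collection.
vocabulary = sorted({term for text in docs.values() for term in text.lower().split()})

# Each document becomes a vector of term counts; a Counter returns 0 for absent terms.
vectors = {
    name: [term_frequencies(text)[term] for term in vocabulary]
    for name, text in docs.items()
}

for name, vector in vectors.items():
    print(name, dict(zip(vocabulary, vector)))
```

Even in this tiny example most entries are zero, and the two documents overlap only in the term "mining", which is the kind of shared-term structure document clustering looks for.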
Clustering High-Dimensional Data
Curse of Dimensionality
• The more dimensions, the larger the (average) pairwise distances
• Clusters only in lower-dimensional subspaces
(Figure: clusters only in the 1-dimensional subspace "salary")
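The first claim, that average pairwise distances grow with the number of dimensions, can be checked empirically. The sketch below assumes points drawn uniformly at random from the unit hypercube and Euclidean distance; the sample size, the seed and the chosen dimensionalities are arbitrary choices for illustration.

```python
import math
import random


def mean_pairwise_distance(n_points, n_dims, rng):
    """Average Euclidean distance over all pairs of uniformly random points."""
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    total = 0.0
    pairs = 0
    for i in range(n_points):
        for j in range(i + 1, n_points):
            total += math.dist(points[i], points[j])
            pairs += 1
    return total / pairs


rng = random.Random(42)  # fixed seed so repeated runs give the same numbers
averages = {d: mean_pairwise_distance(100, d, rng) for d in (2, 10, 100)}

for d, avg in averages.items():
    print(f"{d:>3} dimensions: mean pairwise distance {avg:.2f}")
```

With more dimensions every point moves away from every other point, so a density notion that works in two dimensions finds almost nothing in a hundred.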
Clustering High-Dimensional Data
Approaches
• Approach 1: a cluster is a densely connected region in data space
  Find interesting subspaces, then clusters within these subspaces
  Problems: density threshold hard to determine (should be different); clusters highly overlapping
• Approach 2: start with a full-dimensional clustering and iteratively refine the clusters and the relevant cluster dimensions
  Problems: result ill-defined; number of clusters / cluster dimensions hard to determine
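The first step of approach 1, finding interesting subspaces by density, can be sketched in a very simplified form. This is not a specific published algorithm: it only inspects one-dimensional subspaces, uses an equal-width grid over [0, 1), and takes a fixed, hand-picked density threshold, all of which are assumptions made for illustration.

```python
import random


def dense_bins(values, n_bins, threshold):
    """Indices of equal-width bins over [0, 1) that hold at least `threshold` values."""
    counts = [0] * n_bins
    for v in values:
        counts[min(int(v * n_bins), n_bins - 1)] += 1
    return [i for i, count in enumerate(counts) if count >= threshold]


def interesting_dimensions(points, n_bins=10, threshold=30):
    """Map each dimension that contains a dense bin to the list of its dense bins."""
    result = {}
    for d in range(len(points[0])):
        bins = dense_bins([p[d] for p in points], n_bins, threshold)
        if bins:
            result[d] = bins
    return result


rng = random.Random(0)
# Hypothetical data: dimension 0 is concentrated in [0.5, 0.6], dimensions 1 and 2 are uniform noise.
points = [(rng.uniform(0.5, 0.6), rng.random(), rng.random()) for _ in range(100)]

found = interesting_dimensions(points)
print(found)
```

The example also shows why the threshold is hard to choose: 30 points per bin separates the concentrated dimension from the noise here, but the same value would be far too high once bins are formed over two or more dimensions jointly.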
Mining Data Streams
Applications
• Telecommunications
  o Telecommunications providers collect call records (from, to, when, how long, . . .)
  o Want to use the data not only for billing, but also for analysis (monitor trends in usage, customer segmentation, campaign design, . . .)
• Sensor networks
  o Network of distributed sensors measuring several parameters such as precipitation, temperature, amount of traffic, blood pressure, . . .
  o Data need to be monitored and analyzed on-line (immediate response)
Mining Data Streams
Challenges
• Characteristics of data streams
  o Massive volumes of data
  o Records arrive at a rapid rate
• Requirements
  o Main memory too small to store all records
  o Each record is examined at most once
  o Real-time response, i.e. very efficient processing
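The requirements above (bounded memory, one look at each record, cheap per-record work) can be illustrated with a running mean and variance. The sketch uses Welford's online update, which is a choice made here rather than a method named in the slides; the class name and the call durations are hypothetical.

```python
class RunningStats:
    """Mean and population variance of a stream, kept in constant memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        """Fold one record into the summary; the record itself is not stored."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / self.n if self.n else 0.0


def call_durations():
    """Stand-in for a stream of call records (durations in seconds)."""
    yield from [60, 120, 30, 90, 300]


stats = RunningStats()
for duration in call_durations():  # each record is seen exactly once
    stats.update(duration)

print(f"n={stats.n}, mean={stats.mean:.1f}, variance={stats.variance:.1f}")
```

The state is three numbers regardless of how many records arrive, so the memory requirement holds for a stream of any length.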
Mining Data Streams
Approach
(Figure: Data Stream 1, . . ., Data Stream m → Stream Processing Engine, which keeps a Synopsis in Main Memory → (Approximate) Answer)
• Summarize using samples, histograms or novel methods such as CF-trees
How to maximize the approximation accuracy?
How to exploit the temporal dimension (aging of data)?
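Of the synopsis types mentioned, a sample is the simplest to sketch. The code below uses reservoir sampling to keep a fixed-size uniform sample of a stream of unknown length and then answers a query approximately from that sample. The choice of reservoir sampling, the stream contents and all names are assumptions for illustration; the sketch does not address aging of data.

```python
import random


def reservoir_sample(stream, k, rng):
    """Keep a uniform random sample of k records while reading the stream once."""
    reservoir = []
    for i, record in enumerate(stream):
        if i < k:
            reservoir.append(record)  # fill the reservoir with the first k records
        else:
            j = rng.randint(0, i)  # inclusive bounds: one of the i + 1 records seen so far
            if j < k:
                reservoir[j] = record  # replace a slot with probability k / (i + 1)
    return reservoir


rng = random.Random(7)
stream = iter(range(10_000))  # stand-in for a data stream too large to store
sample = reservoir_sample(stream, k=500, rng=rng)

# Approximate answer computed from the synopsis instead of the full stream.
estimated_mean = sum(sample) / len(sample)
print(f"sample size: {len(sample)}, estimated mean: {estimated_mean:.1f}")
```

A larger reservoir gives a more accurate answer at the cost of more main memory, which is the accuracy trade-off the first question above refers to.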
Spatio-Temporal Data Mining
Applications
• Geo-marketing
  Purchasing patterns for particular geographical areas (e.g., for choice of store location)
• Health care data analysis
  Analysis of the spread of diseases
  Interventions by Public Health Authorities