
Database and Knowledge-Base Systems: Data Mining

Martin Ester

Simon Fraser University School of Computing Science

Graduate Course Spring 2006

CMPT 843, SFU, Martin Ester, 1-06

Introduction

[Fayyad, Piatetsky-Shapiro & Smyth 96]

Knowledge discovery in databases (KDD) is the process of (semi-)automatic extraction of knowledge from databases which is
• valid
• previously unknown
• and potentially useful.

Remarks
• (semi-)automatic: distinction from manual analysis / OLAP. Typically, some user interaction is necessary.
• valid: in the statistical sense.
• previously unknown: not explicit, no "common sense knowledge".
• potentially useful: for some given application.

Introduction

Statistics [Hand, Mannila & Smyth 2001]
• representation of uncertainty
• model-based
• focus on numeric data

Machine Learning [Mitchell 1997]
• knowledge representation
• search strategies
• focus on symbolic data

Database Systems [Han & Kamber 2000]
• data management
• integration of data mining with DBS
• scalability for large databases

Introduction

KDD Process [Han & Kamber 2000]

[Figure: Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge]

KDD Process [Fayyad, Piatetsky-Shapiro & Smyth 1996]

[Figure: Database → Focussing → Preprocessing → Transformation → Data Mining → Pattern → Evaluation → Knowledge]

Data Mining

Definition [Fayyad, Piatetsky-Shapiro, Smyth 1996]
• Data Mining is the application of efficient algorithms to determine the patterns contained in some database.

Data-Mining Tasks

[Figure: four panels illustrating clustering, classification, association rules (A and B → C) and generalisation]

other tasks: regression, outlier detection, . . .

Trends in KDD Research

KDD 2000 Conference

• New Data Mining Algorithms
• Efficiency and Scalability of Data Mining Algorithms
• Interactive Data Exploration
• Visualization
• Constraints and Evaluation in the KDD Process

Trends in KDD Research

KDD 2002 Conference

• Statistical Methods
• Frequent Patterns
• Streams and Time Series
• Visualization
• Web Search and Navigation
• Text and Web Page Classification
• Intrusion and Privacy
• Applications

Trends in KDD Research

KDD 2004 Conference

• Frequent Patterns / Association Rules
• Clustering
• Mining Spatio-Temporal Data
• Mining Data Streams
• Dimensionality Reduction
• Privacy-Preserving Data Mining
• Mining Biological Data
• Applications (Web, biological data, security, . . .)

Trends in KDD Research

KDD 2005 Conference

• Clustering
• Privacy
• Mining Spatio-Temporal Data
• Mining Data Streams
• SVMs
• Text and Web Mining
• Mining (Social) Networks
• Graph Mining (best paper on graphs over time)

Trends in KDD Research

Increasing Importance

• Mining data streams
• Clustering high-dimensional data
• Mining spatio-temporal data
• Privacy-preserving data mining
• Network analysis
• Graph mining
• Multi-relational data mining

Overview of this Course

Prerequisites
• Basics in database systems and statistics
• Introductory graduate data mining course

Objectives
• Introduction to some hot topics of data mining research
• Introduction to some ongoing research projects of our DDM Lab
• General research methodology
• Presentation skills

→ Start thesis work after this class!

Overview of this Course

Topics

• Clustering high-dimensional data
• Mining data streams
• Spatio-temporal data mining
• Multi-relational data mining
• Graph mining

Overview of this Course

Format
• Tutorial surveys
• Research paper presentations (and discussions)
• Small research projects

Grading
• Paper presentation
• Project presentation
• Project report

→ originality, technical quality, presentation quality

Clustering High-Dimensional Data

Applications

Biological Data
• Micro-array data: rows = genes, columns = conditions / experiments; each value measures the expression level of a gene under a given condition
• Often: thousands of columns
• Co-regulated genes: similar expression levels in a subset of all conditions

Text / Web Data
• Text / web document: attributes = term frequencies
• Typically, >> 1000 relevant terms
• Document clusters: document sets that share some important terms

Clustering High-Dimensional Data

Curse of Dimensionality
• The more dimensions, the larger the (average) pairwise distances
• Clusters only in lower-dimensional subspaces

[Figure: clusters only in 1-dimensional subspace "salary"]
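The first effect can be checked numerically. The sketch below is an illustration only, assuming points drawn uniformly at random from the unit hypercube (the slides do not specify a data distribution); the function name `mean_pairwise_distance` is hypothetical.

```python
import math
import random


def mean_pairwise_distance(n_points, n_dims, seed=0):
    """Average Euclidean distance over all pairs of uniform random points.

    Assumption: points are sampled uniformly from [0, 1]^n_dims.
    """
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    total, pairs = 0.0, 0
    for i in range(n_points):
        for j in range(i + 1, n_points):
            total += math.dist(points[i], points[j])
            pairs += 1
    return total / pairs


if __name__ == "__main__":
    # The average distance grows as dimensions are added.
    for d in (1, 2, 10, 100, 1000):
        print(d, round(mean_pairwise_distance(100, d), 3))
```

As the average distance grows, all points tend to look similarly far apart in the full space, which is one reason to look for clusters in lower-dimensional subspaces instead.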

Clustering High-Dimensional Data

Approaches
• Approach 1: cluster = dense connected region in data space
  o Find interesting subspaces, then clusters within these subspaces
  o Drawback: density threshold hard to determine (should be different)
  o Drawback: clusters highly overlapping
• Approach 2: start with full-dimensional clustering and iteratively refine the clusters and relevant cluster dimensions
  o Drawback: result ill-defined
  o Drawback: number of clusters / cluster dimensions hard to determine

Mining Data Streams

Applications

• Telecommunications
  o Telecommunications providers collect call records (from, to, when, how long, . . .)
  o Want to use the data not only for billing, but also for analysis (monitor trends in usage, customer segmentation, campaign design, . . .)
• Sensor networks
  o Network of distributed sensors measuring several parameters such as precipitation, temperature, amount of traffic, blood pressure, . . .
  o Data need to be monitored and analyzed on-line (immediate response)

Mining Data Streams

Challenges

• Characteristics of data streams
  o Massive volumes of data
  o Records arrive at a rapid rate
• Requirements
  o Main memory too small to store all records
  o Each record is examined at most once
  o Real-time response, i.e. very efficient processing

Mining Data Streams

Approach

[Figure: Data Stream 1, . . ., Data Stream m → Stream Processing Engine (with Synopsis in Main Memory) → (Approximate) Answer]

• Summarize using samples, histograms or novel methods such as CF-trees
  o How to maximize the approximation accuracy?
  o How to exploit the temporal dimension (aging of data)?
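As an illustration of a sample-based synopsis, here is a minimal sketch of reservoir sampling (Algorithm R). The slides do not name a specific sampling method, so this choice is an assumption; the function name `reservoir_sample` is hypothetical. It keeps a fixed-size uniform sample in main memory and looks at each record exactly once.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k records from a stream of unknown length.

    Memory use is O(k), and every record is examined exactly once.
    """
    rng = random.Random(seed)
    reservoir = []
    for n, record in enumerate(stream, start=1):
        if n <= k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Replace a random slot with probability k / n.
            j = rng.randrange(n)
            if j < k:
                reservoir[j] = record
    return reservoir


if __name__ == "__main__":
    call_durations = (i % 600 for i in range(1_000_000))  # stand-in for a stream
    print(reservoir_sample(call_durations, 5, seed=42))
```

This plain version weights old and new records equally, so it does not address the aging question raised above.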

Spatio-Temporal Data Mining

Applications

• Geo-marketing
  o Purchasing patterns for particular geographical areas (e.g., for choice of store location)
• Health care data analysis
  o Analysis of the spread of diseases
  o Interventions by Public Health Authorities

Data referencing the earth surface (spatial) and the time (temporal)

Spatio-Temporal Data Mining

Challenges

• Independence assumption no longer valid
  o Attribute values of neighboring objects are typically correlated
• Operations on spatial data are very expensive
  o Spatial objects are complex (lines, polygons, 3D surfaces, . . .), which makes the corresponding operations very expensive
• Temporal dimension
  o Blows up the pattern search space
  o What patterns do we really want to find in spatio-temporal DB?

Spatio-Temporal Data Mining

Approaches

• Consider spatial auto-correlation
  o Find only patterns that deviate from what is expected according to spatial auto-correlation
• Efficient support by the DBMS
  o Indexes, basic operations, . . .
• Models for spatio-temporal data mining
  o Definition of new pattern types such as spatio-temporal trends
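To make "spatial auto-correlation" concrete, the sketch below computes Moran's I, one common statistic for it. The slides do not name a particular measure, so this choice is an assumption; the helper names `morans_i` and `chain_neighbors` and the binary neighbor weights are hypothetical.

```python
def morans_i(values, neighbors):
    """Moran's I with binary weights.

    values:    list of attribute values, one per spatial object
    neighbors: dict mapping an object index to the indices of its neighbors
    Values above 0 suggest that neighbors tend to be similar,
    values below 0 that they tend to differ.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    if denom == 0:
        raise ValueError("all values are identical; Moran's I is undefined")
    weight_sum = 0
    cross = 0.0
    for i, nbrs in neighbors.items():
        for j in nbrs:
            weight_sum += 1
            cross += dev[i] * dev[j]
    return (n / weight_sum) * (cross / denom)


def chain_neighbors(n):
    """Neighborhood of n objects on a line: each object borders the next one."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}


if __name__ == "__main__":
    nbrs = chain_neighbors(6)
    print(morans_i([1, 2, 3, 4, 5, 6], nbrs))  # smooth trend
    print(morans_i([1, 0, 1, 0, 1, 0], nbrs))  # alternating pattern
```

A pattern is only worth reporting, in the sense of the first bullet, if it departs from the similarity that such a baseline already predicts.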

Multi-Relational Data Mining

Applications

• Mining biological data
  o Molecular biologists collect data on genes, proteins, gene expression, metabolic pathways, . . .
  o Want to learn, e.g., about the process of gene regulation
• Text mining
  o Using information extraction methods, entities (companies, persons, genes, . . .) and their relationships (directs, married, regulates, . . .) can be extracted from a text document
  o Can be used as input for true text mining: finding knowledge rather than documents

Multi-Relational Data Mining

Limitations of Existing Methods

• Emerging applications are inherently multi-relational
  o Input: multiple tables (entity sets) and their relationships
  o Record characteristics: own attributes, related records from other tables and the attributes of these related records
• Existing data mining methods are single-relational (propositional)
  o Input: a single table (relation); output: refers to attributes of a single table
  o Data representation as a universal relation (single table) is possible, but may lose a lot of information
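A small sketch of why flattening loses information. The `customers` and `orders` tables and both helper functions are made up for illustration and do not come from the slides. Joining the tables into one universal relation repeats each customer once per order, while aggregating to one row per customer hides the individual orders.

```python
# Hypothetical toy tables: one entity set and one related table.
customers = [{"cid": 1, "age": 34}, {"cid": 2, "age": 51}]
orders = [
    {"cid": 1, "amount": 20.0},
    {"cid": 1, "amount": 35.0},
    {"cid": 1, "amount": 5.0},
    {"cid": 2, "amount": 100.0},
]


def universal_relation(customers, orders):
    """Join both tables into a single table: one row per (customer, order)."""
    return [
        {**c, "amount": o["amount"]}
        for c in customers
        for o in orders
        if o["cid"] == c["cid"]
    ]


def aggregate_per_customer(customers, orders):
    """One row per customer; the related orders are reduced to summaries."""
    rows = []
    for c in customers:
        amounts = [o["amount"] for o in orders if o["cid"] == c["cid"]]
        rows.append({**c, "n_orders": len(amounts), "total": sum(amounts)})
    return rows


if __name__ == "__main__":
    joined = universal_relation(customers, orders)
    # Customer 1 now appears three times, which skews per-row statistics.
    print(sum(r["age"] for r in joined) / len(joined))
    print(sum(c["age"] for c in customers) / len(customers))
    print(aggregate_per_customer(customers, orders))
```

In the joined table a single-relational method can no longer tell that three rows describe the same customer; in the aggregated table it can no longer see the individual order amounts.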

Multi-Relational Data Mining

Approaches

• Inductive Logic Programming
  o Logic program: facts (records) and deduction rules (background knowledge)
  o Task: find (first order) logic rules with some target predicate in the conclusion
  o Restrict search space by user-specified (syntactic) constraints
  o Drawback: huge search space
  o Drawback: syntactic constraints are hard to define
  o Drawback: only for classification tasks

Multi-Relational Data Mining

Approaches

• First-order versions of standard data mining algorithms
  o Multi-relational decision trees
  o Multi-relational association rules
  o Open question: what rule format / semantics (in particular, aggregation operations)?
• Multi-relational distances
  o Family of distance functions with different depths, taking into account attributes of related records up to the given depth
  o Standard methods can be applied, e.g. k-means or k-NN classification
  o Drawback: (global) distance function loses a lot of information

Graph Mining

Applications

• Analysis of the web
  o What are the most important web pages?
  o What will the internet / web look like next year?
• Social network analysis
  o What customers should be targeted to maximize the profit of a marketing campaign?
  o Whom to immunize in order to stop the spread of some virus?
  o Find abnormal subgraphs (e.g., criminal rings).
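One well-known way to approach the "most important web pages" question is PageRank. The slides do not name a method, so the sketch below is an assumed illustration; the function name `pagerank`, the damping value 0.85 and the toy link graph are all hypothetical choices. It uses plain power iteration on a dictionary of outgoing links.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on a small link graph.

    links: dict mapping a page to the list of pages it links to.
    Returns a dict page -> score; the scores sum to 1.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the "random jump" share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page without outgoing links: spread its score evenly.
                share = damping * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    toy_web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    for page, score in sorted(pagerank(toy_web).items(), key=lambda x: -x[1]):
        print(page, round(score, 3))
```

For real web graphs the dictionary would not fit this simple form; the scalability concerns raised for graph mining apply directly.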

Graph Mining

Challenges

• Definition of new types of patterns
  o Certain subgraphs . . .
  o Which ones are interesting in a given application?
• Complexity
  o Many graph problems are NP-complete.
  o Real graphs tend to be extremely large.
  o Need efficient algorithms
• Dynamics
  o Many networks evolve rapidly.

References

Text Books

• Han J., Kamber M., "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.

• Hand D., Mannila H., Smyth P., "Principles of Data Mining", MIT Press, 2001.

• Mitchell T. M., "Machine Learning", McGraw-Hill, 1997.
