Database and Knowledge-Base Systems: Data Mining Martin Ester

Simon Fraser University
School of Computing Science
Graduate Course, Spring 2006

CMPT 843, SFU, Martin Ester, 1-06

Introduction

[Fayyad, Piatetsky-Shapiro & Smyth 96]
Knowledge discovery in databases (KDD) is the process of (semi-)automatic extraction of knowledge from databases which is
• valid
• previously unknown
• and potentially useful.

Remarks
• (semi-)automatic: distinction from manual analysis / OLAP. Typically, some user interaction is necessary.
• valid: in the statistical sense.
• previously unknown: not explicit, no "common sense knowledge".
• potentially useful: for some given application.

Introduction

Statistics [Hand, Mannila & Smyth 2001]
• representation of uncertainty
• model-based inferences
• focus on numeric data

Machine Learning [Mitchell 1997]
• knowledge representation
• search strategies
• focus on symbolic data

Database Systems [Han & Kamber 2000]
• data management
• integration of data mining with DBS
• scalability for large databases

Introduction

KDD Process [Han & Kamber 2000]
[Figure: Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge]

KDD Process [Fayyad, Piatetsky-Shapiro & Smyth 1996]
[Figure: Database → Focussing → Preprocessing → Transformation → Data Mining → Pattern → Evaluation → Knowledge]

Data Mining

Definition [Fayyad, Piatetsky-Shapiro, Smyth 1996]
• Data Mining is the application of efficient algorithms to determine the patterns contained in some database.

Data-Mining Tasks
[Figure: four panels illustrating clustering, classification, association rules (A and B → C) and generalisation]
Other tasks: regression, outlier detection.

Trends in KDD Research

KDD 2000 Conference
• New Data Mining Algorithms
• Efficiency and Scalability of Data Mining Algorithms
• Interactive Data Exploration
• Visualization
• Constraints and Evaluation in the KDD Process

KDD 2002 Conference
• Statistical Methods
• Frequent Patterns
• Streams and Time Series
• Visualization
• Web Search and Navigation
• Text and Web Page Classification
• Intrusion and Privacy
• Applications

KDD 2004 Conference
• Frequent Patterns / Association Rules
• Clustering
• Mining Spatio-Temporal Data
• Mining Data Streams
• Dimensionality Reduction
• Privacy-Preserving Data Mining
• Mining Biological Data
• Applications (Web, biological data, security, ...)

KDD 2005 Conference
• Clustering
• Privacy
• Mining Spatio-Temporal Data
• Mining Data Streams
• SVMs
• Text and Web Mining
• Mining (Social) Networks
• Graph Mining (best paper on graphs over time)

Increasing Importance
• Mining data streams
• Clustering high-dimensional data
• Mining spatio-temporal data
• Privacy-preserving data mining
• Network analysis
• Graph mining
• Multi-relational data mining

Overview of this Course

Prerequisites
• Basics in database systems and statistics
• Introductory graduate data mining course

Objectives
• Introduction to some hot topics of data mining research
• Introduction to some ongoing research projects of our DDM Lab
• General research methodology
• Presentation skills
Start thesis work after this class!

Overview of this Course

Topics
• Clustering high-dimensional data
• Mining data streams
• Spatio-temporal data mining
• Multi-relational data mining
• Graph mining

Format
• Tutorial surveys
• Research paper presentations (and discussions)
• Small research projects

Grading
• Paper presentation
• Project presentation
• Project report
  originality, technical quality, presentation quality

Clustering High-Dimensional Data

Applications

Biological Data
• Micro-array data: rows = genes, columns = conditions / experiments; each value measures the expression level of a gene under a given condition
• Often: thousands of columns
• Co-regulated genes: similar expression levels in a subset of all conditions

Text / Web Data
• Text / web document: attributes = term frequencies
• Typically, >> 1000 relevant terms
• Document clusters: document sets that share some important terms

Curse of Dimensionality
• The more dimensions, the larger the (average) pairwise distances
• Clusters exist only in lower-dimensional subspaces
[Figure: clusters only in the 1-dimensional subspace "salary"]

Approaches
• Approach 1: a cluster is a dense connected region in data space
  Find interesting subspaces, then clusters within these subspaces
  density threshold hard to determine (should be different)
  clusters highly overlapping
• Approach 2: start with a full-dimensional clustering and iteratively refine the clusters and the relevant cluster dimensions
  result ill-defined
  number of clusters / cluster dimensions hard to determine

Mining Data Streams

Applications
• Telecommunications
o Telecommunications providers collect call records (from, to, when, how long, ...)
o Want to use the data not only for billing, but also for analysis (monitor trends in usage, customer segmentation, campaign design, ...)
• Sensor networks
o Network of distributed sensors measuring several parameters such as precipitation, temperature, amount of traffic, blood pressure, ...
o Data need to be monitored and analyzed on-line (immediate response)

Mining Data Streams

Challenges
• Characteristics of data streams
o Massive volumes of data
o Records arrive at a rapid rate
• Requirements
o Main memory too small to store all records
o Each record is examined at most once
o Real-time response, i.e. very efficient processing

Approach
[Figure: Data Stream 1 ... Data Stream m → Stream Processing Engine, which keeps a Synopsis in Main Memory → (Approximate) Answer]
• Summarize using samples, histograms or novel methods such as CF-trees
How to maximize the approximation accuracy?
How to exploit the temporal dimension (aging of data)?

Spatio-Temporal Data Mining

Applications
• Geo-marketing
  Purchasing patterns for particular geographical areas (e.g., for choice of store location)
• Health care data analysis
  Analysis of the spread of diseases
  Interventions by Public Health Authorities
Data referencing the earth surface (spatial) and the time (temporal)

Challenges
• Independence assumption no longer valid
  Attribute values of neighboring objects are typically correlated
• Operations on spatial data are very expensive
  Spatial objects are complex (lines, polygons, 3D surfaces, ...), which makes the corresponding operations very expensive
• Temporal dimension
  Blows up the pattern search space
What patterns do we really want to find in a spatio-temporal DB?
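
The correlation between neighboring objects mentioned in the challenges above can be made concrete with a small computation. The sketch below uses Moran's I, a standard spatial auto-correlation statistic that the slides themselves do not name, so treat it as one possible illustration rather than the method used in the course. The function names, the binary neighbor weights and the toy one-dimensional "chain" of locations are all assumptions made for this example.

```python
def morans_i(values, neighbors):
    """Moran's I with binary weights (assumed: w_ij = 1 if j is a neighbor of i).

    values:    list of attribute values, one per spatial object
    neighbors: neighbors[i] is the list of indices adjacent to object i
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    # Sum of all weights, i.e. the number of directed neighbor pairs.
    weight_sum = sum(len(nbrs) for nbrs in neighbors)
    # Cross-products of deviations over all neighboring pairs.
    cross = sum(dev[i] * dev[j] for i, nbrs in enumerate(neighbors) for j in nbrs)
    squared = sum(d * d for d in dev)
    return (n / weight_sum) * (cross / squared)


def chain_neighbors(n):
    """Hypothetical layout: objects on a line, each adjacent to its left/right neighbor."""
    return [[j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)]


nbrs = chain_neighbors(6)
smooth = [1, 2, 3, 4, 5, 6]        # neighbors have similar values
alternating = [1, 6, 1, 6, 1, 6]   # neighbors have dissimilar values

i_smooth = morans_i(smooth, nbrs)
i_alternating = morans_i(alternating, nbrs)
print(i_smooth, i_alternating)
```

A clearly positive value indicates that neighboring objects tend to have similar attribute values, which is the situation in which treating records as independent samples is questionable; a negative value indicates that neighbors tend to differ.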

Spatio-Temporal Data Mining

Approaches
• Consider spatial auto-correlation
  Find only patterns that deviate from what is expected according to spatial auto-correlation
• Efficient support by the DBMS
  Indexes, basic operations, ...
• Models for spatio-temporal data mining
  Definition of new pattern types such as spatio-temporal trends

Multi-Relational Data Mining

Applications
• Mining biological data
o Molecular biologists collect data on genes, proteins, gene expression, metabolic pathways, ...
o Want to learn, e.g., about the process of gene regulation
• Text mining
o Using information extraction methods, entities (companies, persons, genes, ...) and their relationships (directs, married, regulates, ...) can be extracted from a text document
o Can be used as input for true text mining: finding knowledge rather than documents

Limitations of Existing Methods
• Emerging applications are inherently multi-relational
o Input: multiple tables (entity sets) and their relationships
o Record characteristics: own attributes, related records from other tables and the attributes of these related records
• Existing data mining methods are single-relational
o Input: a single table (relation); output: refers to attributes of a single table
o Data representation as a universal relation (single table) is possible, but may lose a lot of information
  propositional logic

Approaches
• Inductive Logic Programming
o Logic program: facts (records) and deduction rules (background knowledge)
o Task: find (first-order) logic rules with some target predicate in the conclusion
o Restrict search space by user-specified (syntactic) constraints
  huge search space
  syntactic constraints are hard to define
  only for classification tasks

Multi-Relational Data Mining

Approaches
• First-order versions of standard data mining algorithms
o Multi-relational decision trees
o Multi-relational association rules
  What rule format / semantics (in particular, aggregation operations)?
• Multi-relational distances
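
To illustrate why aggregation operations come up when multi-relational data is used with single-table methods, here is a minimal sketch of flattening a one-to-many relationship into a single table. It is only an illustration under stated assumptions: the `genes` and `measurements` tables, their attribute names, the chosen aggregates (count, mean, max) and the values are all hypothetical and made up for this example, not taken from the course.

```python
from collections import defaultdict

# Hypothetical entity table: one record per gene.
genes = [
    {"gene_id": "g1", "length": 1200},
    {"gene_id": "g2", "length": 800},
    {"gene_id": "g3", "length": 1500},
]

# Hypothetical related table: many expression measurements per gene.
measurements = [
    {"gene_id": "g1", "level": 2.0},
    {"gene_id": "g1", "level": 4.0},
    {"gene_id": "g2", "level": 1.0},
]


def flatten_with_aggregates(genes, measurements):
    """Build one row per gene, summarizing its related records by aggregates."""
    levels_by_gene = defaultdict(list)
    for m in measurements:
        levels_by_gene[m["gene_id"]].append(m["level"])

    rows = []
    for g in genes:
        levels = levels_by_gene.get(g["gene_id"], [])
        rows.append({
            "gene_id": g["gene_id"],
            "length": g["length"],
            "n_measurements": len(levels),
            # Aggregates are undefined for genes without related records.
            "mean_level": sum(levels) / len(levels) if levels else None,
            "max_level": max(levels) if levels else None,
        })
    return rows


rows = flatten_with_aggregates(genes, measurements)
for row in rows:
    print(row)
```

Each output row fits a single-relational method, but the individual related records are replaced by a few summary values, which is one way information can be lost in a single-table representation. Which aggregates to use, and what a rule over them should mean, is exactly the open question raised above.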
