Temporal Data Mining
© 2010 by Taylor and Francis Group, LLC

Chapman & Hall/CRC Data Mining and Knowledge Discovery Series

SERIES EDITOR
Vipin Kumar
University of Minnesota
Department of Computer Science and Engineering
Minneapolis, Minnesota, U.S.A.

AIMS AND SCOPE
This series aims to capture new developments and applications in data mining and knowledge discovery, while summarizing the computational tools and techniques useful in data analysis. This series encourages the integration of mathematical, statistical, and computational methods and techniques through the publication of a broad range of textbooks, reference works, and handbooks. The inclusion of concrete examples and applications is highly encouraged. The scope of the series includes, but is not limited to, titles in the areas of data mining and knowledge discovery methods and applications, modeling, algorithms, theory and foundations, data and knowledge visualization, data mining systems and tools, and privacy and security issues.

PUBLISHED TITLES
UNDERSTANDING COMPLEX DATASETS: Data Mining with Matrix Decompositions
  David Skillicorn
COMPUTATIONAL METHODS OF FEATURE SELECTION
  Huan Liu and Hiroshi Motoda
CONSTRAINED CLUSTERING: Advances in Algorithms, Theory, and Applications
  Sugato Basu, Ian Davidson, and Kiri L. Wagstaff
KNOWLEDGE DISCOVERY FOR COUNTERTERRORISM AND LAW ENFORCEMENT
  David Skillicorn
MULTIMEDIA DATA MINING: A Systematic Introduction to Concepts and Theory
  Zhongfei Zhang and Ruofei Zhang
NEXT GENERATION OF DATA MINING
  Hillol Kargupta, Jiawei Han, Philip S. Yu, Rajeev Motwani, and Vipin Kumar
DATA MINING FOR DESIGN AND MARKETING
  Yukio Ohsawa and Katsutoshi Yada
THE TOP TEN ALGORITHMS IN DATA MINING
  Xindong Wu and Vipin Kumar
GEOGRAPHIC DATA MINING AND KNOWLEDGE DISCOVERY, Second Edition
  Harvey J. Miller and Jiawei Han
TEXT MINING: CLASSIFICATION, CLUSTERING, AND APPLICATIONS
  Ashok N. Srivastava and Mehran Sahami
BIOLOGICAL DATA MINING
  Jake Y. Chen and Stefano Lonardi
INFORMATION DISCOVERY ON ELECTRONIC HEALTH RECORDS
  Vagelis Hristidis
TEMPORAL DATA MINING
  Theophano Mitsa

Temporal Data Mining
Theophano Mitsa

MATLAB® is a trademark of The MathWorks, Inc. and is used with permission. The MathWorks does not warrant the accuracy of the text or exercises in this book. This book’s use or discussion of MATLAB® software or related products does not constitute endorsement or sponsorship by The MathWorks of a particular pedagogical approach or particular use of the MATLAB® software.

Chapman & Hall/CRC
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2010 by Taylor and Francis Group, LLC
Chapman & Hall/CRC is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed in the United States of America on acid-free paper
10 9 8 7 6 5 4 3 2 1

International Standard Book Number: 978-1-4200-8976-9 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data

Mitsa, Theophano.
Temporal data mining / Theophano Mitsa.
p. cm. -- (Chapman & Hall/CRC data mining and knowledge discovery series)
Includes bibliographical references and index.
ISBN 978-1-4200-8976-9 (hardcover : alk. paper)
1. Data mining. 2. Temporal databases. I. Title. II. Series.
QA76.9.D343M593 2010
005.75’3--dc22
2009048856

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com
and the CRC Press Web site at http://www.crcpress.com

To my parents, who taught me to spend every moment wisely, and to the Eternal One, who taught me that every moment is infinitely important.

Table of Contents

Preface

CHAPTER 1 ▪ Temporal Databases and Mediators
1.1 TIME IN DATABASES
  1.1.1 Database Concepts
  1.1.2 Temporal Databases
  1.1.3 Time Representation in SQL
  1.1.4 Time in Data Warehouses
  1.1.5 Temporal Constraints and Temporal Relations
  1.1.6 Requirements for a Temporal Knowledge-Based Management System
  1.1.7 Using XML for Temporal Data
  1.1.8 Temporal Entity Relationship Models
1.2 DATABASE MEDIATORS
  1.2.1 Temporal Relation Discovery
  1.2.2 Semantic Queries on Temporal Data
1.3 ADDITIONAL BIBLIOGRAPHY
  1.3.1 Additional Bibliography on Temporal Primitives
  1.3.2 Additional Bibliography on Temporal Constraints and Logic
  1.3.3 Additional Bibliography on Temporal Languages and Frameworks
REFERENCES

CHAPTER 2 ▪ Temporal Data Similarity Computation, Representation, and Summarization
2.1 TEMPORAL DATA TYPES AND PREPROCESSING
  2.1.1 Temporal Data Types
  2.1.2 Temporal Data Preprocessing
    2.1.2.1 Data Cleaning
    2.1.2.2 Data Normalization
2.2 TIME SERIES SIMILARITY MEASURES
  2.2.1 Distance-Based Similarity
    2.2.1.1 Euclidean Distance
    2.2.1.2 Absolute Difference
    2.2.1.3 Maximum Distance Metric
  2.2.2 Dynamic Time Warping
  2.2.3 The Longest Common Subsequence
  2.2.4 Other Time Series Similarity Metrics
2.3 TIME SERIES REPRESENTATION
  2.3.1 Nonadaptive Representation Methods
    2.3.1.1 Discrete Fourier Transform
    2.3.1.2 Discrete Wavelet Transform
    2.3.1.3 Piecewise Aggregate Composition
  2.3.2 Data-Adaptive Representation Methods
    2.3.2.1 Singular Value Decomposition of Time Sequences
    2.3.2.2 Shape Definition Language and CAPSUL
    2.3.2.3 Landmark-Based Representation
    2.3.2.4 Symbolic Aggregate Approximation (SAX) and iSAX
    2.3.2.5 Adaptive Piecewise Constant Approximation (APCA)
    2.3.2.6 Piecewise Linear Representation (PLA)
  2.3.3 Model-Based Representation Methods
    2.3.3.1 Markov Models for Representation and Analysis of Time Series
  2.3.4 Data Dictated Representation Methods
    2.3.4.1 Clipping
  2.3.5 Comparison of Representation Schemes and Distance Measures
  2.3.6 Need for Time Series Data Mining Benchmarks
2.4 TIME SERIES SUMMARIZATION METHODS
  2.4.1 Statistics-Based Summarization
    2.4.1.1 Mean
    2.4.1.2 Median
    2.4.1.3 Mode
    2.4.1.4 Variance
  2.4.2 Fractal Dimension–Based Summarization
  2.4.3 Run-Length–Based Signature
    2.4.3.1 Short Run-Length Emphasis
    2.4.3.2 Long Run-Length Emphasis
  2.4.4 Histogram-Based Signature and Statistical Measures
  2.4.5 Local Trend-Based Summarization
2.5 TEMPORAL EVENT REPRESENTATION
  2.5.1 Event Representation Using Markov Models
  2.5.2 A Formalism for Temporal Objects and Repetitions
2.6 SIMILARITY COMPUTATION OF SEMANTIC TEMPORAL OBJECTS
2.7 TEMPORAL KNOWLEDGE REPRESENTATION IN CASE-BASED REASONING SYSTEMS
2.8 ADDITIONAL BIBLIOGRAPHY
  2.8.1 Similarity Measures
  2.8.2 Dimensionality Reduction
  2.8.3 Representation and Summarization Techniques
  2.8.4 Similarity and Query of Data Streams
REFERENCES

CHAPTER 3 ▪ Temporal Data Classification and Clustering
3.1 CLASSIFICATION TECHNIQUES
  3.1.1 Distance-Based Classifiers
    3.1.1.1 K–Nearest Neighbors
    3.1.1.2 Exemplar-Based Nearest Neighbor
  3.1.2 Bayes Classifier
  3.1.3 Decision Trees
  3.1.4 Support Vector Machines in Classification
  3.1.5 Neural Networks in Classification
  3.1.6 Classification Issues
    3.1.6.1 Classification Error Types
    3.1.6.2 Classifier Success Measures
    3.1.6.3 Generation of the Testing and Training Sets
    3.1.6.4 Comparison of Classification Approaches
    3.1.6.5 Feature Processing
    3.1.6.6 Feature Selection
3.2 CLUSTERING
  3.2.1