Data-Centric Systems and Applications

Series editors: M.J. Carey, S. Ceri

Editorial Board: A. Ailamaki, S. Babu, P. Bernstein, J.C. Freytag, A. Halevy, J. Han, D. Kossmann, I. Manolescu, G. Weikum, K.-Y. Whang, J.X. Yu

More information about this series at http://www.springer.com/series/5258

Minos Garofalakis, Johannes Gehrke, Rajeev Rastogi (Editors)

Data Stream Management

Processing High-Speed Data Streams

Editors:
Minos Garofalakis, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece
Johannes Gehrke, Microsoft Corporation, Redmond, WA, USA
Rajeev Rastogi, Amazon India, Bangalore, India

ISSN 2197-9723, ISSN 2197-974X (electronic)
Data-Centric Systems and Applications
ISBN 978-3-540-28607-3, ISBN 978-3-540-28608-0 (eBook)
DOI 10.1007/978-3-540-28608-0

Library of Congress Control Number: 2016946344

Springer Heidelberg New York Dordrecht London

© Springer-Verlag Berlin Heidelberg 2016. The fourth chapter in part 4 is published with kind permission of © 2004 Association for Computing Machinery, Inc. All rights reserved.

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)

Contents

Data Stream Management: A Brave New World ...... 1 Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi

Part I Foundations and Basic Stream Synopses

Data-Stream Sampling: Basic Techniques and Results ...... 13 Peter J. Haas

Quantiles and Equi-depth Histograms over Streams ...... 45 Michael B. Greenwald and Sanjeev Khanna

Join Sizes, Frequency Moments, and Applications ...... 87 Graham Cormode and Minos Garofalakis

Top-k Frequent Item Maintenance over Streams ...... 103 Moses Charikar

Distinct-Values Estimation over Data Streams ...... 121 Phillip B. Gibbons

The Sliding-Window Computation Model and Results ...... 149 Mayur Datar and Rajeev Motwani

Part II Mining Data Streams

Clustering Data Streams ...... 169 Sudipto Guha and Nina Mishra

Mining Decision Trees from Streams ...... 189 Geoff Hulten and Pedro Domingos

Frequent Itemset Mining over Data Streams ...... 209 Gurmeet Singh Manku


Temporal Dynamics of On-Line Information Streams ...... 221 Jon Kleinberg

Part III Advanced Topics

Sketch-Based Multi-Query Processing over Data Streams ...... 241 Alin Dobra, Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi

Approximate Histogram and Wavelet Summaries of Streaming Data ...... 263 S. Muthukrishnan and Martin Strauss

Stable Distributions in Streaming Computations ...... 283 Graham Cormode and Piotr Indyk

Tracking Queries over Distributed Streams ...... 301 Minos Garofalakis

Part IV System Architectures and Languages

STREAM: The Stanford Data Stream Management System ...... 317 Arvind Arasu, Brian Babcock, Shivnath Babu, John Cieslewicz, Mayur Datar, Keith Ito, Rajeev Motwani, Utkarsh Srivastava, and Jennifer Widom

The Aurora and Borealis Stream Processing Engines ...... 337 Uğur Çetintemel, Daniel Abadi, Yanif Ahmad, Hari Balakrishnan, Magdalena Balazinska, Mitch Cherniack, Jeong-Hyon Hwang, Samuel Madden, Anurag Maskey, Alexander Rasin, Esther Ryvkina, Mike Stonebraker, Nesime Tatbul, Ying Xing, and Stan Zdonik

Extending Relational Query Languages for Data Streams ...... 361 N. Laptev, B. Mozafari, H. Mousavi, H. Thakkar, H. Wang, K. Zeng, and Carlo Zaniolo

Hancock: A Language for Analyzing Transactional Data Streams ...... 387 Corinna Cortes, Kathleen Fisher, Daryl Pregibon, Anne Rogers, and Frederick Smith

Sensor Network Integration with Streaming Database Systems ...... 409 Daniel Abadi, Samuel Madden, and Wolfgang Lindner

Part V Applications

Stream Processing Techniques for Network Management ...... 431 Charles D. Cranor, Theodore Johnson, and Oliver Spatscheck

High-Performance XML Message Brokering ...... 451 Yanlei Diao and Michael J. Franklin

Fast Methods for Statistical Arbitrage ...... 473 Eleftherios Soulas and Dennis Shasha

Adaptive, Automatic Stream Mining ...... 499 Spiros Papadimitriou, Anthony Brockwell, and Christos Faloutsos

Conclusions and Looking Forward ...... 529 Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi

Data Stream Management: A Brave New World

Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi

1 Introduction

Traditional data-management systems software is built on the concept of persistent data sets that are stored reliably in stable storage and queried/updated several times throughout their lifetime. For several emerging application domains, however, data arrives and needs to be processed on a continuous (24 × 7) basis, without the benefit of several passes over a static, persistent data image. Such continuous data streams arise naturally, for example, in the network installations of large Telecom and Internet service providers, where detailed usage information (Call-Detail-Records (CDRs), SNMP/RMON packet-flow data, etc.) from different parts of the underlying network needs to be continuously collected and analyzed for interesting trends. Other applications that generate rapid, continuous and large volumes of stream data include transactions in retail chains, ATM and credit card operations in banks, financial tickers, Web server log records, etc. In most such applications, the data stream is actually accumulated and archived in a database-management system of a (perhaps, off-site) data warehouse, often making access to the archived data prohibitively expensive. Further, the ability to make decisions and infer interesting patterns on-line (i.e., as the data stream arrives) is crucial for several mission-critical tasks that can have significant dollar value for a large corporation (e.g., telecom fraud detection). As a result, recent years have witnessed an increasing interest in designing data-processing algorithms that work over continuous data streams, i.e., algorithms that provide results to user queries while looking at the relevant data items only once and in a fixed order (determined by the stream-arrival pattern).

M. Garofalakis (B) School of Electrical and Computer Engineering, Technical University of Crete, University Campus—Kounoupidiana, Chania 73100, Greece e-mail: [email protected]

J. Gehrke Microsoft Corporation, One Microsoft Way, Redmond, WA 98052-6399, USA e-mail: [email protected]

R. Rastogi Amazon India, Brigade Gateway, Malleshwaram (W), Bangalore 560055, India e-mail: [email protected]


Fig. 1 ISP network monitoring data streams

Example 1 (Application: ISP Network Monitoring) To effectively manage the operation of their IP-network services, large Internet Service Providers (ISPs), like AT&T and Sprint, continuously monitor the operation of their networking infrastructure at dedicated Network Operations Centers (NOCs). This is truly a large-scale monitoring task that relies on continuously collecting streams of usage information from hundreds of routers, thousands of links and interfaces, and blisteringly fast sets of events at different layers of the network infrastructure (ranging from fiber-cable utilizations to packet forwarding at routers, to VPNs and higher-level transport constructs). These data streams can be generated through a variety of network-monitoring tools (e.g., Cisco's NetFlow [10] or AT&T's GigaScope probe [5] for monitoring IP-packet flows). For instance, Fig. 1 depicts an example ISP monitoring setup, with an NOC tracking NetFlow measurement streams from four edge routers R1–R4 in the network. The figure also depicts a small fragment of the streaming data tables retrieved from routers R1 and R2, containing simple summary information for IP sessions. In real life, such streams are truly massive, comprising hundreds of attributes and billions of records—for instance, AT&T collects over one terabyte of NetFlow measurement data from its production network each day! Typically, this measurement data is periodically shipped off to a backend data warehouse for off-line analysis (e.g., at the end of the day). Unfortunately, such off-line analyses are painfully inadequate when it comes to critical network-management tasks, where reaction in (near) real-time is absolutely essential. Such tasks include, for instance, detecting malicious/fraudulent users, DDoS attacks, or Service-Level Agreement (SLA) violations, as well as real-time traffic engineering to avoid congestion and improve the utilization of critical network resources. Thus, it is crucial to process and analyze these continuous network-measurement streams in real time and in a single pass over the data (as it is streaming into the NOC), while, of course, remaining within the resource (e.g., CPU and memory) constraints of the NOC. (Recall that these data streams are truly massive, and there may be hundreds or thousands of analysis queries to be executed over them.)

This volume focuses on the theory and practice of data stream management, and the difficult, novel challenges this emerging domain introduces for data-management systems. The collection of chapters (contributed by authorities in the field) offers a comprehensive introduction to both the algorithmic/theoretical foundations of data streams and the streaming systems/applications built in different domains. In the remainder of this introductory chapter, we provide a brief summary of some basic data streaming concepts and models, and discuss the key elements of a generic stream query-processing architecture. We then give a short overview of the contents of this volume.

2 Basic Stream Processing Models

When dealing with structured, tuple-based data streams (as in Example 1), the streaming data can essentially be seen as rendering massive relational table(s) through a continuous stream of updates (that, in general, can comprise both insertions and deletions). Thus, the processing operations users would want to perform over continuous data streams naturally parallel those in conventional database, OLAP, and data-mining systems. Such operations include, for instance, relational selections, projections, and joins, GROUP-BY aggregates and multi-dimensional data analyses, and various pattern discovery and analysis techniques. For several of these data manipulations, the high-volume and continuous (potentially, unbounded) nature of real-life data streams introduces novel, difficult challenges which are not addressed in current data-management architectures. And, of course, such challenges are further exacerbated by the typical user/application requirements for continuous, near real-time results for stream operations. As a concrete example, consider some example queries that a network administrator may want to support over the ISP monitoring architecture depicted in Fig. 1.

• To analyze frequent traffic patterns and detect potential Denial-of-Service (DoS) attacks, an example analysis query could be: Q1: "What are the top-100 most frequent IP (source, destination) pairs observed at router R1 over the past week?". This is an instance of a top-k (or, "heavy-hitters") query—viewing R1 as a (dynamic) relational table, it can be expressed using the standard SQL query language as follows:

Q1: SELECT ip_source, ip_dest, COUNT(*) AS frequency
    FROM R1
    GROUP BY ip_source, ip_dest
    ORDER BY COUNT(*) DESC
    LIMIT 100

• To correlate traffic patterns across different routers (e.g., for the purpose of dynamic packet routing or traffic load balancing), example queries might include: Q2: "How many distinct IP (source, destination) pairs have been seen by both R1 and R2, but not R3?", and Q3: "Count the number of session pairs in R1 and R2 where the source-IP in R1 is the same as the destination-IP in R2." Q2 and Q3 are examples of (multi-table) set-expression and join-aggregate queries, respectively; again, they can both be expressed in standard SQL terms over the R1–R3 tables:

Q2: SELECT COUNT(*)
    FROM ((SELECT DISTINCT ip_source, ip_dest FROM R1
           INTERSECT
           SELECT DISTINCT ip_source, ip_dest FROM R2)
          EXCEPT
          SELECT DISTINCT ip_source, ip_dest FROM R3)

Q3: SELECT COUNT(*)
    FROM R1, R2
    WHERE R1.ip_source = R2.ip_dest

A data-stream processing engine turns the paradigm of conventional database systems on its head: Databases typically have to deal with a stream of queries over a static, bounded data set; instead, a stream processing engine has to effectively process a static set of queries over continuous streams of data. Such stream queries can be (i) continuous, implying the need for continuous, real-time monitoring of the query answer over the changing stream, or (ii) ad-hoc query-processing requests interspersed with the updates to the stream. The high data rates of streaming data might outstrip processing resources (both CPU and memory) on a steady or intermittent (i.e., bursty) basis; in addition, coupled with the requirement for near real-time results, they typically render access to secondary (disk) storage completely infeasible. In the remainder of this section, we briefly outline some key data-stream management concepts and discuss basic stream-processing models.

2.1 Data Streaming Models

An equivalent view of a relational data stream is that of a massive, dynamic, one-dimensional vector A[1...N]—this is essentially obtained by linearizing the (multi-dimensional) domain of the underlying relation using standard techniques (e.g., row- or column-major ordering). As a concrete example, Fig. 2 depicts the stream vector A for the problem of monitoring active IP network connections between source/destination IP addresses. The specific dynamic vector has 2^64 entries capturing the up-to-date frequencies for specific (source, destination) pairs observed in IP connections that are currently active. The size N of the streaming A vector is defined as the product of the attribute domain size(s), which can easily grow very large, especially for multi-attribute relations.¹ The dynamic vector A is rendered through a continuous stream of updates, where the jth update has the general form ⟨k, c[j]⟩ and effectively modifies the kth entry of A via the operation A[k] ← A[k] + c[j].

¹Note that streaming algorithms typically do not require a priori knowledge of N.

Fig. 2 Example dynamic vector modeling streaming network data

We can define three generic data streaming models [9] based on the nature of these updates:

• Time-Series Model. In this model, the jth update is ⟨j, A[j]⟩ and updates arrive in increasing order of j; in other words, we observe the entries of the streaming vector A by increasing index. This naturally models time-series data streams, such as the series of measurements from a temperature sensor or the volume of NASDAQ stock trades over time. Note that this model poses a severe limitation on the update stream, essentially prohibiting updates from changing past (lower-index) entries in A.

• Cash-Register Model. Here, the only restriction we impose on the jth update ⟨k, c[j]⟩ is that c[j] ≥ 0; in other words, we only allow increments to the entries of A but, unlike the Time-Series model, multiple updates can increment a given entry A[k] over the stream. This is a natural model for streams where data is just inserted/accumulated over time, such as streams monitoring the total packets exchanged between two IP addresses or the collection of IP addresses accessing a web server. In the relational case, a Cash-Register stream naturally captures the case of an append-only relational table, which is quite common in practice (e.g., the fact table in a data warehouse [1]).

• Turnstile Model. In this, most general, streaming model, no restriction is imposed on the jth update ⟨k, c[j]⟩, so that c[j] can be either positive or negative; thus, we have a fully dynamic situation, where items can be continuously inserted and deleted from the stream. For instance, note that our example stream for monitoring active IP network connections (Fig. 2) is a Turnstile stream, as connections can be initiated or terminated between any pair of addresses at any point in the stream. (A technical constraint often imposed in this case is that A[k] ≥ 0 always holds—this is referred to as the strict Turnstile model [9].)

The above streaming models are obviously given in increasing order of generality: Ideally, we seek algorithms and techniques that work in the most general, Turnstile model (and, thus, are also applicable in the other two models). On the other hand, the weaker streaming models rely on assumptions that can be valid in certain application scenarios, and often allow for more efficient algorithmic solutions in cases where Turnstile solutions are inefficient and/or provably hard. Our generic goal in designing data-stream processing algorithms is to compute functions (or, queries) on the vector A at different points during the lifetime of the stream (continuous or ad-hoc). For instance, it is not difficult to see that the example queries Q1–Q3 mentioned earlier in this section can be trivially computed over stream vectors similar to that depicted in Fig. 2, assuming that the complete vector(s) are available; similarly, other types of processing (e.g., data mining) can be easily carried out over the full frequency vector(s) using existing algorithms. This, however, is an unrealistic assumption in the data-streaming setting: The main challenge in the streaming model of query computation is that the size of the stream vector, N, is typically huge, making it impractical (or, even infeasible) to store or make multiple passes over the entire stream.
The typical requirement for such stream processing algorithms is that they operate in small space and small time, where "space" refers to the working space (or, state) maintained by the algorithm, and "time" refers to both the processing time per update (e.g., to appropriately modify the state of the algorithm) and the query-processing time (to compute the current query answer). Furthermore, "small" is understood to mean a quantity significantly smaller than Θ(N) (typically, poly-logarithmic in N).
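To make the vector-based view concrete, the following small Python sketch (our illustration; the class and variable names are ours, not from the literature) maintains the frequency vector A exactly under a stream of ⟨k, c[j]⟩ updates and enforces the restrictions of the three models. Materializing A exactly is, of course, precisely what a streaming algorithm cannot afford when N is huge; the sketch only pins down the update semantics.

from collections import defaultdict

class ExactStreamVector:
    """Reference model of the streaming vector A under <k, c> updates.

    This uses Theta(N) space in the worst case, which is exactly what the
    small-space synopses discussed in this volume are designed to avoid.
    """

    def __init__(self, model="turnstile"):
        assert model in ("time-series", "cash-register", "turnstile")
        self.model = model
        self.A = defaultdict(int)   # sparse view of A[1..N]
        self.last_index = None      # used only by the time-series model

    def update(self, k, c):
        if self.model == "time-series":
            if self.last_index is not None and k <= self.last_index:
                raise ValueError("time-series updates must arrive in increasing index order")
        if self.model == "cash-register" and c < 0:
            raise ValueError("cash-register updates must have c >= 0")
        self.A[k] += c
        if self.model == "turnstile" and self.A[k] < 0:
            raise ValueError("strict turnstile model requires A[k] >= 0")
        self.last_index = k

# Example: the active-connections stream of Fig. 2 is a turnstile stream.
v = ExactStreamVector("turnstile")
v.update(("10.0.0.1", "10.0.0.2"), +1)   # connection opened
v.update(("10.0.0.1", "10.0.0.2"), -1)   # connection closed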

2.2 Incorporating Recency: Time-Decayed and Windowed Streams

Streaming data naturally carries a temporal dimension and a notion of "time". The conventional data streaming model discussed thus far (often referred to as landmark streams) assumes that the streaming computation begins at a well-defined starting point t0 (at which the streaming vector is initialized to all zeros), and at any time t takes into account all streaming updates between t0 and t. In many applications, however, it is important to be able to downgrade the importance (or, weight) of older items in the streaming computation. For instance, in the statistical analysis of trends or patterns over financial data streams, data that is more than a few weeks old might naturally be considered "stale" and irrelevant. Various time-decay models have been proposed for streaming data, with the key differentiation lying in the relationship between an update's weight and its age (e.g., exponential or polynomial decay [3]). The sliding-window model [6] is one of the most prominent and intuitive time-decay models; it essentially considers only a window of the most recent updates seen in the stream thus far—updates outside the window are automatically "aged out" (e.g., given a weight of zero). The definition of the window itself can be either time-based (e.g., updates seen over the last W time units) or count-based (e.g., the last W updates). The key limiting factor in this streaming model is, naturally, the size of the window W: the goal is to design query-processing techniques that have space/time requirements significantly sublinear (typically, poly-logarithmic) in W [6].
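As a minimal illustration of these two decay styles (our own sketch, with arbitrary names and no attempt at sublinear space), the first class below buffers the last W updates of a count-based sliding window, while the second applies a simple exponential time-decay weight to a running count; the space-efficient alternatives are the subject of later chapters.

from collections import deque

class SlidingWindowSum:
    """Exact sum over the last W updates (count-based sliding window).

    Buffering the window costs Theta(W) space; sliding-window synopses aim
    for space poly-logarithmic in W instead.
    """
    def __init__(self, W):
        self.W = W
        self.buf = deque()
        self.total = 0

    def add(self, value):
        self.buf.append(value)
        self.total += value
        if len(self.buf) > self.W:        # the oldest update "ages out"
            self.total -= self.buf.popleft()

class ExponentiallyDecayedCount:
    """Time-decayed count: an update of age a contributes decay**a."""
    def __init__(self, decay=0.99):
        self.decay = decay
        self.count = 0.0

    def tick(self):                        # advance time by one unit
        self.count *= self.decay

    def add(self, weight=1.0):             # arrival at the current time
        self.count += weight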

Fig. 3 General stream query processing architecture

3 Querying Data Streams: Synopses and Approximation

A generic query-processing architecture for streaming data is depicted in Fig. 3. In contrast to conventional database query processors, the assumption here is that a stream query-processing engine is allowed to see the data tuples in relations only once and in the fixed order of their arrival as they stream in from their respective source(s). Backtracking over a stream and explicit access to past tuples are impossible; furthermore, the order of tuple arrival for each streaming relation is arbitrary, and duplicate tuples can occur anywhere over the duration of the stream. Finally, in the most general turnstile model, the stream rendering each relation can comprise tuple deletions as well as insertions. Consider a (possibly, complex) aggregate query Q over the input streams and let N denote an upper bound on the total size of the streams (i.e., the size of the complete stream vector(s)). Our data-stream processing engine is allowed a certain amount of memory, typically orders of magnitude smaller than the total size of its inputs. This memory is used to continuously maintain concise synopses/summaries of the streaming data (Fig. 3). The two key constraints imposed on such stream synopses are:

(1) Single Pass—the synopses are easily maintained, during a single pass over the streaming tuples in the (arbitrary) order of their arrival; and,
(2) Small Space/Time—the memory footprint as well as the time required to update and query the synopses is "small" (e.g., poly-logarithmic in N).

In addition, two highly desirable properties for stream synopses are:

(3) Delete-proof—the synopses can handle both insertions and deletions in the update stream (i.e., general turnstile streams); and,
(4) Composable—the synopses can be built independently on different parts of the stream and composed/merged in a simple (and, ideally, lossless) fashion to obtain a synopsis of the entire stream (an important feature in distributed system settings).

A small code sketch illustrating these four properties is given below.
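As a concrete (if simplified) illustration, the Python sketch below implements a bare-bones count-min-style sketch; this particular structure is not introduced until later in the volume and the code is ours, but it shows how a synopsis can be maintained in one pass, in space independent of N, under insertions and deletions, and merged entry-wise across distributed sub-streams (assuming the same width, depth, and hash seeds on every site).

import random

class CountMinSketch:
    """Minimal count-min-style synopsis: single-pass, small-space,
    delete-proof (turnstile updates), and composable by entry-wise addition.

    A real implementation would use explicit pairwise-independent hash
    functions; Python's built-in hash() salted with per-row seeds is used
    here only to keep the sketch short (and is stable only within a single
    process run).
    """
    def __init__(self, width=2048, depth=5, seed=42):
        rng = random.Random(seed)
        self.width, self.depth = width, depth
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.counts = [[0] * width for _ in range(depth)]

    def _cell(self, row, key):
        return hash((self.salts[row], key)) % self.width

    def update(self, key, c=1):             # c < 0 encodes a deletion
        for r in range(self.depth):
            self.counts[r][self._cell(r, key)] += c

    def estimate(self, key):                # never underestimates A[key]
        return min(self.counts[r][self._cell(r, key)] for r in range(self.depth))

    def merge(self, other):                 # composability
        assert (self.width, self.depth, self.salts) == (other.width, other.depth, other.salts)
        for r in range(self.depth):
            for c in range(self.width):
                self.counts[r][c] += other.counts[r][c]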

At any point in time, the engine can process the maintained synopses in order to obtain an estimate of the query result (in a continuous or ad-hoc fashion). Given that synopsis construction is an inherently lossy compression process, excluding very simple queries, these estimates are necessarily approximate—ideally, with some guarantees on the approximation error. These guarantees can be either deterministic (e.g., the estimate is always guaranteed to be within ε relative/absolute error of the accurate answer) or probabilistic (e.g., the estimate is within ε error of the accurate answer except for some small failure probability δ). The properties of such ε- or (ε, δ)-estimates are typically demonstrated through rigorous analyses using known algorithmic and mathematical tools (including sampling theory [2, 11], tail inequalities [7, 8], and so on). Such analyses typically establish a formal tradeoff between the space and time requirements of the underlying synopses and estimation algorithms, and their corresponding approximation guarantees. Several classes of stream synopses are studied in the chapters that follow, along with a number of different practical application scenarios. An important point to note here is that there really is no "universal" synopsis solution for data stream processing: to ensure good performance, synopses are typically purpose-built for the specific query task at hand. For instance, we will see different classes of stream synopses with different characteristics (e.g., random samples and AMS sketches) for supporting queries that rely on multiset/bag semantics (i.e., the full frequency distribution), such as range/join aggregates, heavy-hitters, and frequency moments (e.g., example queries Q1 and Q3 above). On the other hand, stream queries that rely on set semantics, such as estimating the number of distinct values (i.e., set cardinality) in a stream or a set expression over a stream (e.g., query Q2 above), can be more effectively supported by other classes of synopses (e.g., FM sketches and distinct samples). A comprehensive overview of synopsis structures and algorithms for massive data sets can be found in the recent survey of Cormode et al. [4].
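For intuition on the set-semantics side, here is a toy, Flajolet–Martin-style distinct-count register in Python (our own simplification): each item is hashed, the maximum number of trailing zero bits ever seen is recorded, and 2 raised to that maximum tracks the number of distinct items only up to a constant factor and with high variance; practical FM-type estimators combine many such registers and apply bias corrections.

import hashlib

class ToyFMRegister:
    """Toy Flajolet-Martin-style register for distinct-value counting.

    Duplicates hash to the same value, so re-insertions never change the
    register: the estimate depends only on the set of distinct items seen.
    """
    def __init__(self):
        self.max_trailing_zeros = 0

    @staticmethod
    def _hash64(item):
        digest = hashlib.blake2b(repr(item).encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    @staticmethod
    def _trailing_zeros(x):
        return ((x & -x).bit_length() - 1) if x else 64

    def add(self, item):
        tz = self._trailing_zeros(self._hash64(item))
        self.max_trailing_zeros = max(self.max_trailing_zeros, tz)

    def rough_estimate(self):               # correct only up to a constant factor
        return 2 ** self.max_trailing_zeros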

4 This Volume: An Overview

The collection of chapters in this volume (contributed by authorities in the field) offers a comprehensive introduction to both the algorithmic/theoretical foundations of data streams and the streaming systems/applications built in different domains. The authors have also taken special care to ensure that each chapter is, for the most part, self-contained, so that readers wishing to focus on specific streaming techniques and aspects of data-stream processing, or read about particular streaming systems/applications, can move directly to the relevant chapter(s). Part I focuses on basic algorithms and stream synopses (such as random samples and different sketching structures) for landmark and sliding-window streams, and some key stream processing tasks (including the estimation of quantiles, norms, join-aggregates, top-k values, and the number of distinct values). The chapters in Part II survey existing techniques for basic stream mining tasks, such as clustering, decision-tree classification, and the discovery of frequent itemsets and temporal dynamics. Part III discusses a number of advanced stream processing topics, including algorithms and synopses for more complex queries and analytics, and techniques for querying distributed streams. The chapters in Part IV focus on the system and language aspects of data stream processing through comprehensive surveys of existing system prototypes and language designs. Part V then presents some representative applications of streaming techniques in different domains, including network management, financial analytics, time-series analysis, and publish/subscribe systems. Finally, we conclude this volume with an overview of current data streaming products and novel application domains (e.g., cloud computing, big data analytics, and complex event processing), and discuss some future directions in the field.

References

1. S. Chaudhuri, U. Dayal, An overview of data warehousing and OLAP technology. ACM SIGMOD Record 26(1) (1997)
2. W.G. Cochran, Sampling Techniques, 3rd edn. (Wiley, New York, 1977)
3. E. Cohen, M.J. Strauss, Maintaining time-decaying stream aggregates. J. Algorithms 59(1), 19–36 (2006)
4. G. Cormode, M. Garofalakis, P.J. Haas, C. Jermaine, Synopses for massive data: samples, histograms, wavelets, sketches. Found. Trends Databases 4(1–3) (2012)
5. C. Cranor, T. Johnson, O. Spatscheck, V. Shkapenyuk, GigaScope: a stream database for network applications, in Proc. of the 2003 ACM SIGMOD Intl. Conference on Management of Data, San Diego, California (2003)
6. M. Datar, A. Gionis, P. Indyk, R. Motwani, Maintaining stream statistics over sliding windows. SIAM J. Comput. 31(6), 1794–1813 (2002)
7. M. Mitzenmacher, E. Upfal, Probability and Computing: Randomized Algorithms and Probabilistic Analysis (Cambridge University Press, Cambridge, 2005)
8. R. Motwani, P. Raghavan, Randomized Algorithms (Cambridge University Press, Cambridge, 1995)
9. S. Muthukrishnan, Data streams: algorithms and applications. Found. Trends Theor. Comput. Sci. 1(2) (2005)
10. NetFlow services and applications. Cisco Systems white paper (1999). http://www.cisco.com/
11. C.-E. Särndal, B. Swensson, J. Wretman, Model Assisted Survey Sampling (Springer, New York, 1992). Springer Series in Statistics

Part I Foundations and Basic Stream Synopses

Data-Stream Sampling: Basic Techniques and Results

Peter J. Haas

1 Introduction

Perhaps the most basic synopsis of a data stream is a sample of elements from the stream. A key benefit of such a sample is its flexibility: the sample can serve as input to a wide variety of analytical procedures and can be reduced further to provide many additional data synopses. If, in particular, the sample is collected using random sampling techniques, then the sample can form a basis for statistical inference about the contents of the stream. This chapter surveys some basic sampling and inference techniques for data streams. We focus on general methods for materializing a sample; later chapters provide specialized sampling methods for specific analytic tasks. To place the results of this chapter in context and to help orient readers having a limited background in statistics, we first give a brief overview of finite-population sampling and its relationship to database sampling. We then outline the specific data-stream sampling problems that are the subject of subsequent sections.

1.1 Finite-Population Sampling

Database sampling techniques have their roots in classical statistical methods for "finite-population sampling" (also called "survey sampling"). These latter methods are concerned with the problem of drawing inferences about a large finite population from a small random sample of population elements; see [1–5] for comprehensive discussions.

P.J. Haas (B) IBM Almaden Research Center, San Jose, CA, USA e-mail: [email protected]

The inferences usually take the form either of testing some hypothesis about the population—e.g., that a disproportionate number of smokers in the population suffer from emphysema—or estimating some parameters of the population—e.g., total income or average height. We focus primarily on the use of sampling for estimation of population parameters.

The simplest and most common sampling and estimation schemes require that the elements in a sample be "representative" of the elements in the population. The notion of simple random sampling (SRS) is one way of making this concept precise. To obtain an SRS of size k from a population of size n, a sample element is selected randomly and uniformly from among the n population elements, removed from the population, and added to the sample. This sampling step is repeated until k sample elements are obtained. The key property of an SRS scheme is that each of the $\binom{n}{k}$ possible subsets of k population elements is equally likely to be produced. Other "representative" sampling schemes besides SRS are possible. An important example is simple random sampling with replacement (SRSWR).¹ The SRSWR scheme is almost identical to SRS, except that each sampled element is returned to the population prior to the next random selection; thus a given population element can appear multiple times in the sample. When the sample size is very small with respect to the population size, the SRS and SRSWR schemes are almost indistinguishable, since the probability of sampling a given population element more than once is negligible. The mathematical theory of SRSWR is a bit simpler than that of SRS, so the former scheme is sometimes used as an approximation to the latter when analyzing estimation algorithms based on SRS. Other representative sampling schemes besides SRS and SRSWR include the "stratified" and "Bernoulli" schemes discussed in Sect. 2. As will become clear in the sequel, certain non-representative sampling methods are also useful in the data-stream setting.

Of equal importance to sampling methods are techniques for estimating population parameters from sample data. We discuss this topic in Sect. 4, and content ourselves here with a simple example to illustrate some of the basic issues involved. Suppose we wish to estimate the total income θ of a population of size n based on an SRS of size k, where k is much smaller than n. For this simple example, a natural estimator is obtained by scaling up the total income s of the individuals in the sample, θ̂ = (n/k)s; e.g., if the sample comprises 1 % of the population, then scale up the total income of the sample by a factor of 100. For more complicated population parameters, such as the number of distinct ZIP codes in a population of magazine subscribers, the scale-up formula may be much less obvious. In general, the choice of estimation method is tightly coupled to the method used to obtain the underlying sample. Even for our simple example, it is important to realize that our estimate is random, since it depends on the particular sample obtained. For example, suppose (rather unrealistically) that our population consists of three individuals, say Smith, Abbas, and Raman, whose respective incomes are $10,000, $50,000, and $1,000,000.

¹Sometimes, to help distinguish between the two schemes more clearly, SRS is called simple random sampling without replacement.

Table 1 Possible scenarios, along with probabilities, for a sampling and estimation exercise

Sample            Sample income    Est. pop. income    Scenario probability
{Smith, Abbas}    $60,000          $90,000             1/3
{Smith, Raman}    $1,010,000       $1,515,000          1/3
{Abbas, Raman}    $1,050,000       $1,575,000          1/3

The total income for this population is $1,060,000. If we take an SRS of size k = 2—and hence estimate the income for the population as 1.5 times the income for the sampled individuals—then the outcome of our sampling and estimation exercise would follow one of the scenarios given in Table 1. Each of the scenarios is equally likely, and the expected value (also called the "mean value") of our estimate is computed as

expected value = (1/3) · (90,000) + (1/3) · (1,515,000) + (1/3) · (1,575,000) = 1,060,000,

which is equal to the true answer. In general, it is important to evaluate the accuracy (degree of systematic error) and precision (degree of variability) of a sampling and estimation scheme. The bias, i.e., expected error, is a common measure of accuracy, and, for estimators with low bias, the standard error is a common measure of precision. The bias of our income estimator is 0 and the standard error is computed as the square root of the variance (expected squared deviation from the mean) of our estimator:

SE = [(1/3) · (90,000 − 1,060,000)² + (1/3) · (1,515,000 − 1,060,000)²
      + (1/3) · (1,575,000 − 1,060,000)²]^{1/2} ≈ 687,000.
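The arithmetic above is easy to verify mechanically; the following short Python script (ours, purely illustrative) enumerates the three equally likely size-2 samples, applies the scale-up estimator θ̂ = (n/k)s, and recomputes the mean and standard error.

from itertools import combinations
from math import sqrt

incomes = {"Smith": 10_000, "Abbas": 50_000, "Raman": 1_000_000}
n, k = len(incomes), 2

# Every size-k subset is an equally likely SRS outcome.
samples = list(combinations(incomes.values(), k))
estimates = [(n / k) * sum(s) for s in samples]          # scale-up estimator

mean = sum(estimates) / len(estimates)
se = sqrt(sum((e - mean) ** 2 for e in estimates) / len(estimates))
print(mean)   # 1060000.0 -- unbiased: equals the true total income
print(se)     # ~686331   -- the value quoted above, up to rounding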

For more complicated population parameters and their estimators, there are often no simple formulas for gauging accuracy and precision. In these cases, one can sometimes resort to techniques based on subsampling, that is, taking one or more random samples from the initial population sample. Well-known subsampling techniques for estimating bias and standard error include the "jackknife" and "bootstrap" methods; see [6]. In general, the accuracy and precision of a well-designed sampling-based estimator should increase as the sample size increases. We discuss these issues further in Sect. 4.

1.2 Database Sampling

Although database sampling overlaps heavily with classical finite-population sampling, the former setting differs from the latter in a number of important respects.

• Scarce versus ubiquitous data. In the classical setting, samples are usually expensive to obtain and data is hard to come by, and so sample sizes tend to be small. In database sampling, the population size can be enormous (terabytes of data), and samples are relatively easy to collect, so that sample sizes can be relatively large [7, 8]. The emphasis in the database setting is on the sample as a flexible, lossy, compressed synopsis of the data that can be used to obtain quick approximate answers to user queries.

• Different sampling schemes. As a consequence of the complex storage formats and retrieval mechanisms that are characteristic of modern database systems, many sampling schemes that were unknown or of marginal interest in the classical setting are central to database sampling. For example, the classical literature pays relatively little attention to Bernoulli sampling schemes (described in Sect. 2.1 below), but such schemes are very important for database sampling because they can be easily parallelized across data partitions [9, 10]. As another example, tuples in a relational database are typically retrieved from disk in units of pages or extents. This fact strongly influences the choice of sampling and estimation schemes, and indeed has led to the introduction of several novel methods [11–13]. As a final example, estimates of the answer to an aggregation query involving select–project–join operations are often based on samples drawn individually from the input base relations [14, 15], a situation that does not arise in the classical setting.

• No domain expertise. In the classical setting, sampling and estimation are often carried out by an expert statistician who has prior knowledge about the population being sampled. As a result, the classical literature is rife with sampling schemes that explicitly incorporate auxiliary information about the population, as well as "model-based" schemes [4, Chap. 5] in which the population is assumed to be a sample from a hypothesized "super-population" distribution. In contrast, database systems typically must view the population (i.e., the database) as a black box, and so cannot exploit these specialized techniques.

• Auxiliary synopses. In contrast to a classical statistician, a database designer often has the opportunity to scan each population element as it enters the system, and therefore has the opportunity to maintain auxiliary data synopses, such as an index of "outlier" values or other data summaries, which can be used to increase the precision of sampling and estimation algorithms. If available, knowledge of the query workload can be used to guide synopsis creation; see [16–23] for examples of the use of workloads and synopses to increase precision.

Early papers on database sampling [24–29] focused on methods for obtaining samples from various kinds of data structures, as well as on the maintenance of sample views and the use of sampling to provide approximate query answers within specified time constraints. A number of authors subsequently investigated the use of sampling in query optimization, primarily in the context of estimating the size of select–join queries [22, 30–37]. Attention then shifted to the use of sampling to construct data synopses for providing quick approximate answers to decision-support queries [16–19, 21, 23]. The work in [15, 38] on online aggregation can be viewed as a precursor to modern data-stream sampling techniques.
Online-aggregation algorithms take, as input, streams of data generated by random scans of one or more (finite) relations, and produce continually refined estimates of answers to aggregation queries over the relations, along with precision measures. The user aborts the query as soon as the running estimates are sufficiently precise; although the data stream is finite, query processing usually terminates long before the end of the stream is reached. Recent work on database sampling includes extensions of online-aggregation methodology [39–42], application of bootstrapping ideas to facilitate approximate answering of very complex aggregation queries [43], and development of techniques for sampling-based discovery of correlations, functional dependencies, and other data relationships for purposes of query optimization and data integration [9, 44–46].

Collective experience has shown that sampling can be a very powerful tool, provided that it is applied judiciously. In general, sampling is well suited to very quickly identifying pervasive patterns and properties of the data when a rough approximation suffices; for example, industrial-strength sampling-enhanced query engines can speed up some common decision-support queries by orders of magnitude [10]. On the other hand, sampling is poorly suited for finding "needles in haystacks" or for producing highly precise estimates. The needle-in-haystack phenomenon appears in numerous guises. For example, precisely estimating the selectivity of a join that returns very few tuples is an extremely difficult task, since a random sample from the base relations will likely contain almost no elements of the join result [16, 31].² As another example, sampling can perform poorly when data values are highly skewed. For example, suppose we wish to estimate the average of the values in a data set that consists of 10^6 values equal to 1 and five values equal to 10^8. The five outlier values are the needles in the haystack: if, as is likely, these values are not included in the sample, then the sampling-based estimate of the average value will be low by orders of magnitude. Even when the data is relatively well behaved, some population parameters are inherently hard to estimate from a sample. One notoriously difficult parameter is the number of distinct values in a population [47, 48]. Problems arise both when there is skew in the data-value frequencies and when there are many data values, each appearing a small number of times. In the former scenario, those values that appear few times in the database are the needles in the haystack; in the latter scenario, the sample is likely to contain no duplicate values, in which case accurate assessment of a scale-up factor is impossible. Other challenging population parameters include the minimum or maximum data value; see [49]. Researchers continue to develop new methods to deal with these problems, typically by exploiting auxiliary data synopses and workload information.

²Fortunately, for query optimization purposes it often suffices to know that a join result is "small" without knowing exactly how small.
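The skewed-average example above is easy to reproduce; the snippet below (ours, with an arbitrarily chosen sample size of 1,000) shows how an SRS almost always misses all five outliers and therefore underestimates the true average by more than two orders of magnitude.

import random

data = [1] * 10**6 + [10**8] * 5
true_avg = sum(data) / len(data)          # roughly 501

random.seed(1)
sample = random.sample(data, 1000)        # SRS of size 1,000
est_avg = sum(sample) / len(sample)       # almost certainly 1.0: each outlier
                                          # is sampled with probability only ~0.1%
print(true_avg, est_avg)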

1.3 Sampling from Data Streams

Data-stream sampling problems require the application of many ideas and techniques from traditional database sampling, but also need significant new innovations, especially to handle queries over infinite-length streams. Indeed, the unbounded nature of streaming data represents a major departure from the traditional setting. We give a brief overview of the various stream-sampling techniques considered in this chapter. Our discussion centers around the problem of obtaining a sample from a window, i.e., a subinterval of the data stream, where the desired sample size is much smaller than the number of elements in the window. We draw an important distinction between a stationary window, whose endpoints are specified times or specified positions in the stream sequence, and a sliding window, whose endpoints move forward as time progresses. Examples of the latter type of window include "the most recent n elements in the stream" and "elements that have arrived within the past hour." Sampling from a finite stream is a special case of sampling from a stationary window in which the window boundaries correspond to the first and last stream elements. When dealing with a stationary window, many traditional tools and techniques for database sampling can be directly brought to bear. In general, sampling from a sliding window is a much harder problem than sampling from a stationary window: in the former case, elements must be removed from the sample as they expire, and maintaining a sample of adequate size can be difficult. We also consider "generalized" windows in which the stream consists of a sequence of transactions that insert and delete items into the window; a sliding window corresponds to the special case in which items are deleted in the same order that they are inserted. Much attention has focused on SRS schemes because of the large body of existing theory and methods for inference from an SRS; we therefore discuss such schemes in detail. We also consider Bernoulli sampling schemes, as well as stratified schemes in which the window is divided into equal disjoint segments (the strata) and an SRS of fixed size is drawn from each stratum. As discussed in Sect. 2.3 below, stratified sampling can be advantageous when the data stream exhibits significant autocorrelation, so that elements close together in the stream tend to have similar values. The foregoing schemes fall into the category of equal-probability sampling because each window element is equally likely to be included in the sample. For some applications it may be desirable to bias a sample toward more recent elements. In the following sections, we discuss both equal-probability and biased sampling schemes.

2 Sampling from a Stationary Window

We consider a stationary window containing n elements e1, e2, ..., en, enumerated in arrival order. If the endpoints of the window are defined in terms of time points t1 and t2, then the number n of elements in the window is possibly random; this fact does not materially affect our discussion, provided that n is large enough so that sampling from the window is worthwhile. We briefly discuss Bernoulli sampling schemes in which the size of the sample is random, but devote most of our attention to sampling techniques that produce a sample of a specified size.

2.1 Bernoulli Sampling

A Bernoulli sampling scheme with sampling rate q ∈ (0, 1) includes each element in the sample with probability q and excludes the element with probability 1 − q, independently of the other elements. This type of sampling is also called "binomial" sampling because the sample size is binomially distributed, so that the probability that the sample contains exactly k elements is equal to $\binom{n}{k} q^k (1 − q)^{n−k}$. The expected size of the sample is nq. It follows from the central limit theorem for independent and identically distributed random variables [50, Sect. 27] that, for example, when n is reasonably large and q is not vanishingly small, the deviation from the expected size is within ±100ε % with probability close to 98 %, where ε = 2√((1 − q)/(nq)). For example, if the window contains 10,000 elements and we draw a 1 % Bernoulli sample, then the true sample size will be between 80 and 120 with probability close to 98 %. Even though the size of a Bernoulli sample is random, Bernoulli sampling, like SRS and SRSWR, is a uniform sampling scheme, in that any two samples of the same size are equally likely to be produced.

Bernoulli sampling is appealingly easy to implement, given a pseudorandom number generator [51, Chap. 7]. A naive implementation generates for each element ei a pseudorandom number Ui uniformly distributed on [0, 1]; element ei is included in the sample if and only if Ui ≤ q. A more efficient implementation uses the fact that the number of elements that are skipped between successive inclusions has a geometric distribution: if Δi is the number of elements skipped after ei is included, then Pr{Δi = j} = q(1 − q)^j for j ≥ 0. To save CPU time, these random skips can be generated directly. Specifically, if Ui is a random number distributed uniformly on [0, 1], then Δi = ⌊log Ui / log(1 − q)⌋ has the foregoing geometric distribution, where ⌊x⌋ denotes the largest integer less than or equal to x; see [51, p. 465]. Figure 1 displays the pseudocode for the resulting algorithm, which is executed whenever a new element ei arrives. Lines 1–4 represent an initialization step that is executed upon the arrival of the first element (i.e., when m = 0 and i = 1). Observe that the algorithm usually does almost nothing. The "expensive" calls to the pseudorandom number generator and the log() function occur only at element-inclusion times. As mentioned previously, another key advantage of the foregoing algorithm is that it is easily parallelizable over data partitions.

A generalization of the Bernoulli sampling scheme uses a different inclusion probability for each element, including element ei in the sample with probability qi. This scheme is known as Poisson sampling. One motivation for Poisson sampling might be a desire to bias the sample in favor of recently arrived elements. In general, Poisson sampling is harder to implement efficiently than Bernoulli sampling because generation of the random skips is nontrivial.

// q is the Bernoulli sampling rate
// ei is the element that has just arrived (i ≥ 1)
// m is the index of the next element to be included (static variable initialized to 0)
// B is the Bernoulli sample of stream elements (initialized to ∅)
// Δ is the size of the skip
// random() returns a uniform[0,1] pseudorandom number

1  if m = 0 then                      // generate initial skip
2      U ← random()
3      Δ ← ⌊log U / log(1 − q)⌋
4      m ← Δ + 1                      // compute index of first element to insert
5  if i = m then                      // insert element into sample and generate skip
6      B ← B ∪ {ei}
7      U ← random()
8      Δ ← ⌊log U / log(1 − q)⌋
9      m ← m + Δ + 1                  // update index of next element to insert

Fig. 1 An algorithm for Bernoulli sampling
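For readers who prefer runnable code, the following is our own Python transcription of the skip-based scheme of Fig. 1, packaged as a generator over an arbitrary (finite or infinite) iterable rather than as a per-arrival routine.

import math
import random

def bernoulli_sample(stream, q, rng=random):
    """Yield a Bernoulli(q) sample of `stream` using geometric skips.

    Each element is included independently with probability q; drawing the
    skip length Delta = floor(log U / log(1 - q)) directly means that the
    pseudorandom-number and log() work happens only at inclusion times.
    """
    assert 0.0 < q < 1.0

    def skip():
        u = 1.0 - rng.random()              # in (0, 1], avoids log(0)
        return math.floor(math.log(u) / math.log(1.0 - q))

    remaining = skip()
    for element in stream:
        if remaining > 0:
            remaining -= 1                  # silently pass over this element
        else:
            yield element                   # include it ...
            remaining = skip()              # ... and draw the next skip

# Example: roughly a 1 % sample of 100,000 stream elements.
sample = list(bernoulli_sample(range(100_000), q=0.01))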

The main drawback of both Bernoulli and Poisson sampling is the uncontrollable variability of the sample size, which can become especially problematic when the desired sample size is small. In the remainder of this section, we focus on sampling schemes in which the final sample size is deterministic.

2.2 Reservoir Sampling

The reservoir sampling algorithm of Waterman [52, pp. 123–124] and McLeod and Bellhouse [53] produces an SRS of k elements from a window of length n, where k is specified a priori. The idea is to initialize a "reservoir" of k elements by inserting elements e1, e2, ..., ek. Then, for i = k + 1, k + 2, ..., n, element ei is inserted in the reservoir with a specified probability pi and ignored with probability 1 − pi; an inserted element overwrites a "victim" that is chosen randomly and uniformly from the k elements currently in the reservoir. We denote by Sj the set of elements in the reservoir just after element ej has been processed. By convention, we take p1 = p2 = ··· = pk = 1. If we can choose the pi's so that, for each j, the set Sj is an SRS from Uj = {e1, e2, ..., ej}, then clearly Sn will be the desired final sample. The probability that ei is included in an SRS from Ui equals k/i, and so a plausible choice³ for the inclusion probabilities is given by pi = k/(i ∨ k) for 1 ≤ i ≤ n. The following theorem asserts that the resulting algorithm indeed produces an SRS.

Theorem 1 (McLeod and Bellhouse [53]) In the reservoir sampling algorithm with pi = k/(i ∨ k) for 1 ≤ i ≤ n, the set Sj is a simple random sample of size j ∧ k from Uj = {e1, e2, ..., ej} for each 1 ≤ j ≤ n.

³Throughout, we denote by x ∨ y (resp., x ∧ y) the maximum (resp., minimum) of x and y.

Proof The proof is by induction on j. The assertion of the theorem is obvious for 1 ≤ j ≤ k. Assume for induction that Sj−1 is an SRS of size k from Uj−1, where j ≥ k + 1. Fix a subset A ⊂ Uj containing k elements and first suppose that ej ∉ A. Then

$$\Pr\{S_j = A\} = \Pr\{S_{j-1} = A \text{ and } e_j \text{ not inserted}\} = \binom{j-1}{k}^{-1}\,\frac{j-k}{j} = \binom{j}{k}^{-1},$$

where the second equality follows from the induction hypothesis and the independence of the two given events. Now suppose that ej ∈ A. For er ∈ Uj−1 − A, let Ar be the set obtained from A by removing ej and inserting er; there are j − k such sets. Then

$$\Pr\{S_j = A\} = \sum_{e_r \in U_{j-1} - A} \Pr\{S_{j-1} = A_r,\ e_j \text{ inserted, and } e_r \text{ deleted}\} = \sum_{e_r \in U_{j-1} - A} \binom{j-1}{k}^{-1}\frac{k}{j}\frac{1}{k} = \binom{j-1}{k}^{-1}\frac{j-k}{j} = \binom{j}{k}^{-1}.$$

Thus Pr{Sj = A} = 1/$\binom{j}{k}$ for any subset A ⊂ Uj of size k, and the desired result follows. □
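A direct, unoptimized Python rendering of this basic reservoir scheme (our own code, flipping a coin for every arriving element rather than generating skips as in Vitter's algorithm discussed next) is shown below.

import random

def reservoir_sample(stream, k, rng=random):
    """Return an SRS of size min(k, n) from `stream` (basic reservoir scheme).

    Element e_i (1-based) is inserted with probability p_i = k / max(i, k);
    an inserted element overwrites a victim chosen uniformly at random.
    """
    reservoir = []
    for i, element in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(element)                  # p_1 = ... = p_k = 1
        elif rng.random() < k / i:                     # insert with probability k/i
            reservoir[rng.randrange(k)] = element      # overwrite a random victim
    return reservoir

# Example: a uniform sample of 10 elements from a million-element stream.
print(reservoir_sample(range(1_000_000), 10))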

Efficient implementation of reservoir sampling is more complicated than that of Bernoulli sampling because of the more complicated probability distribution of the number of skips between successive inclusions. Specifically, denoting by Δi the number of skips before the next inclusion, given that element ei has just been included, we have

$$f_i(m) \overset{\mathrm{def}}{=} \Pr\{\Delta_i = m\} = \frac{k}{i-k}\,\frac{(i-k)^{\overline{m+1}}}{(i+1)^{\overline{m+1}}}$$

and

$$F_i(m) \overset{\mathrm{def}}{=} \Pr\{\Delta_i \le m\} = 1 - \frac{(i+1-k)^{\overline{m+1}}}{(i+1)^{\overline{m+1}}},$$

where $x^{\overline{n}}$ denotes the rising power x(x + 1) ··· (x + n − 1). Vitter [54] gives an efficient algorithm for generating samples from the above distribution. For small values of i, the fastest way to generate a skip is to use the method of inversion: if $F_i^{-1}(x) = \min\{m : F_i(m) \ge x\}$ and U is a random variable uniformly distributed on [0, 1], then it is not hard to show that the random variable $X = F_i^{-1}(U)$ has the desired distribution function Fi, as does $X' = F_i^{-1}(1-U)$; see [51, Sect. 8.2.1]. For larger values of i, Vitter uses an acceptance–rejection method [51, Sect. 8.2.4]. For this method, there must exist a probability density function gi from which it is easy to generate sample values, along with a constant ci—greater than 1 but as close to 1 as possible—such that $f_i(\lfloor x\rfloor) \le c_i g_i(x)$ for all x ≥ 0. If X is a random variable with density function gi and U is a uniform random variable independent of X, then $\Pr\{\lfloor X\rfloor \le x \mid U \le f_i(\lfloor X\rfloor)/(c_i g_i(X))\} = F_i(x)$. That is, if we generate pairs (X, U) until the relation $U \le f_i(\lfloor X\rfloor)/(c_i g_i(X))$ holds, then the final random variable X, after truncation to the nearest integer, has the desired distribution function Fi. It can be shown that, on average, ci pairs (X, U) need to be generated to produce a sample from Fi. As a further refinement, we can reduce the number of expensive evaluations of the function fi by finding a function hi "close" to fi such that hi is inexpensive to evaluate and hi(x) ≤ fi(x) for x ≥ 0. Then, to test whether $U \le f_i(\lfloor X\rfloor)/(c_i g_i(X))$, we first test (inexpensively) whether $U \le h_i(\lfloor X\rfloor)/(c_i g_i(X))$. Only in the rare event that this first test fails do we need to apply the expensive original test. This trick is sometimes called the "squeeze" method. Vitter shows that an appropriate choice for ci is ci = (i + 1)/(i − k + 1), with corresponding choices

$$g_i(x) = \frac{k}{i+x}\left(\frac{i}{i+x}\right)^{k} \qquad\text{and}\qquad h_i(m) = \frac{k}{i+1}\left(\frac{i-k+1}{i+m-k+1}\right)^{k+1}.$$

Note that

$$G_i(x) = \int_0^x g_i(u)\,du = 1 - \left(\frac{i}{i+x}\right)^{k},$$

so that, if V is a uniform random variable, then $G_i^{-1}(1-V) = i(V^{-1/k} - 1)$ has density function gi. Thus it is indeed easy to generate sample values from gi. Figure 2 displays the pseudocode for the overall algorithm; see [54] for a performance analysis and some further optimizations.⁴ As with the algorithm in Fig. 1, the algorithm in Fig. 2 is executed whenever a new element ei arrives. Observe that the insertion probability pi = k/(i ∨ k) decreases as i increases, so that it becomes increasingly difficult to insert an element into the reservoir. On the other hand, the number of opportunities for an inserted element ei to be subsequently displaced from the sample by an arriving element also decreases as i increases. These two opposing trends precisely balance each other at all times, so that the probability of being in the final sample is the same for all of the elements in the window. Note that the reservoir sampling algorithm does not require prior knowledge of n, the size of the window—the algorithm can be terminated after any arbitrary number of elements have arrived, and the contents of the reservoir are guaranteed to be an SRS of these elements. If the window size is known in advance, then a variation of reservoir sampling, called sequential sampling, can be used to obtain the desired SRS of size k more efficiently. Specifically, reservoir sampling has a time complexity of O(k + k log(n/k)) whereas sequential sampling has a complexity of O(k).

⁴We do not recommend the optimization given in Eq. (6.1) of [54], however, because of a potential bad interaction with the pseudorandom number generator.
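To make the inversion step concrete, the small helper below (our own sketch, using a plain sequential search rather than Vitter's optimized procedure) accumulates the rising-power ratio in F_i term by term and returns the smallest m with F_i(m) ≥ U; the acceptance–rejection path with the densities g_i and h_i above is what replaces this linear search when i is large.

import random

def reservoir_skip_by_inversion(i, k, rng=random):
    """Skip length Delta_i after element e_i enters a size-k reservoir (requires i >= k).

    Uses F_i(m) = 1 - (i+1-k)^{rising m+1} / (i+1)^{rising m+1} and returns
    the smallest m with F_i(m) >= U, i.e., with the ratio <= 1 - U.
    """
    u = rng.random()
    m = 0
    ratio = (i + 1 - k) / (i + 1)            # rising-power ratio for m = 0
    while ratio > 1.0 - u:                   # while F_i(m) < u
        m += 1
        ratio *= (i + 1 - k + m) / (i + 1 + m)
    return m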

// k is the size of the reservoir and n is the number of elements in the window
// ei is the element that has just arrived (i ≥ 1)
// m is the index of the next element ≥ ek to be included (static variable initialized to k)
// r is an array of length k containing the reservoir elements
// Δ is the size of the skip
// α is a parameter of the algorithm, typically equal to ≈ 22k
// random() returns a uniform[0,1] pseudorandom number

[Pseudocode body of the figure not recovered in this copy; see [54] for the full listing.]

Fig. 2 Vitter’s algorithm for reservoir sampling sequential-sampling algorithm, due to Vitter [55], is similar in spirit to reservoir sampling, and is based on the observation that

$$\tilde F_{ij}(m) \;\stackrel{\mathrm{def}}{=}\; \Pr\{\tilde\Delta_{ij} \le m\} = 1 - \frac{(j-i)^{\underline{m+1}}}{j^{\underline{m+1}}},$$
where $\tilde\Delta_{ij}$ is the number of skips before the next inclusion, given that element $e_{n-j}$ has just been included in the sample and that the sample size just after the inclusion of $e_{n-j}$ is $|S| = k - i$. Here $x^{\underline{n}}$ denotes the falling power $x(x-1)\cdots(x-n+1)$. The sequential-sampling algorithm initially sets i ← k and j ← n; as above, i represents the number of sample elements that remain to be selected and j represents the number of window elements that remain to be processed. The algorithm then (i) generates $\tilde\Delta_{ij}$, (ii) skips the next $\tilde\Delta_{ij}$ arriving elements, (iii) includes the next

arriving element into the sample, and (iv) sets i ← i − 1 and j ← j − $\tilde\Delta_{ij}$ − 1. Steps (i)–(iv) are repeated until i = 0. At each execution of Step (i), the specific method used to generate $\tilde\Delta_{ij}$ depends upon the current values of i and j, as well as algorithmic parameters α and β. Specifically, if i ≥ αj, then the algorithm generates $\tilde\Delta_{ij}$ by inversion, similarly to lines 13–15 in Fig. 2. Otherwise, the algorithm generates $\tilde\Delta_{ij}$ using acceptance–rejection and squeezing, exactly as in lines 17–23 in Fig. 2, but using either

$$c_1 = \frac{j}{j-i+1},\qquad g_1(x) = \begin{cases}\frac{i}{j}\bigl(1-\frac{x}{j}\bigr)^{i-1} & \text{if } 0 \le x \le j;\\ 0 & \text{otherwise},\end{cases}\qquad h_1(m) = \begin{cases}\frac{i}{j}\bigl(1-\frac{m}{j-i+1}\bigr)^{i-1} & \text{if } 0 \le m \le j-i;\\ 0 & \text{otherwise},\end{cases}$$

or

$$c_2 = \frac{i}{i-1}\cdot\frac{j-1}{j},\qquad g_2(x) = \frac{i-1}{j-1}\Bigl(1-\frac{i-1}{j-1}\Bigr)^{x},\qquad h_2(m) = \begin{cases}\frac{i}{j}\bigl(1-\frac{i-1}{j-m}\bigr)^{m} & \text{if } 0 \le m \le j-i;\\ 0 & \text{otherwise}.\end{cases}$$

The algorithm uses $(c_1, g_1, h_1)$ or $(c_2, g_2, h_2)$ according to whether $i^2/j \le \beta$ or $i^2/j > \beta$, respectively. The values of α and β are implementation dependent; Vitter found α = 0.07 and β = 50 optimal for his experiments, but also noted that setting β ≈ 1 minimizes the average number of random numbers generated by the algorithm. See [55] for further details and optimizations.5

2.3 Other Sampling Schemes

We briefly mention several other sampling schemes, some of which build upon or incorporate the reservoir algorithm of Sect. 2.2.

Stratified Sampling

As mentioned before, a stratified sampling scheme divides the window into disjoint intervals, or strata, and takes a sample of specified size from each stratum. The

5 As with the reservoir sampling algorithm in [54], we do not recommend the optimization in Sect. 5.3 of [55].

Fig. 3 (a) A realization of reservoir sampling (sample size = 6). (b) A realization of stratified sampling (sample size = 6) simplest scheme specifies strata of approximately equal length and takes a fixed size random sample from each stratum using reservoir sampling; the random samples are of equal size. When elements close together in the stream tend to have similar values, then the values within each stratum tend to be homogeneous so that a small sample from a stratum contains a large amount of information about all of the elements in the stra- tum. Figures 3(a) and 3(b) provide another way to view the potential benefit of strat- ified sampling. The window comprises 15 real-valued elements, and circled points correspond to sampled elements. Figure 3(a) depicts an unfortunate realization of an SRS: by sheer bad luck, the early, low-valued elements are disproportionately represented in the sample. This would lead, for example, to an underestimate of the average value of the elements in the window. Stratified sampling avoids this bad sit- uation: a typical realization of a stratified sample (with three strata of length 5 each) might look as in Fig. 3(b). Observe that elements from all parts of the window are well represented. Such a sample would lead, e.g., to a better estimate of the average value.

Deterministic and Semi-Deterministic Schemes

Of course, the simplest scheme for producing a sample of size k inserts every mth element in the window into the sample, where m = n/k. There are two disadvantages to this approach. First, it is not possible to draw statistical inferences about the entire window from the sample because the necessary probabilistic context is not present. In addition, if the data in the window are periodic with a frequency that matches the sampling rate, then the sampled data will be unrepresentative of the window as a whole. For example, if there are strong weekly periodicities in the data and we sample the data every Monday, then we will have a distorted picture of the data values that appear throughout the week. One way to ameliorate the former problem is to use systematic sampling [1, Chap. 8]. To effect this scheme, generate a random number L between 1 and m. Then insert elements $e_L, e_{L+m}, e_{L+2m}, \ldots, e_{n-m+L}$ into the sample. Statistical inference is now possible, but the periodicity issue still remains: in the presence of periodicity, estimators based on systematic sampling can have large standard errors. On the other hand, if the data are not periodic but exhibit a strong trend, then systematic sampling can perform very well because, like stratified sampling, systematic sampling ensures that the sampled elements are spread relatively evenly throughout the window. Indeed, systematic sampling can be viewed as a type of stratified sampling where the ith stratum comprises elements $e_{(i-1)m+1}, e_{(i-1)m+2}, \ldots, e_{im}$ and we sample one element from each stratum; the sampling mechanisms for the different strata are completely synchronized, however, rather than independent as in standard stratified sampling.
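A minimal Python sketch of systematic sampling over a buffered stationary window (ours; it assumes that n is a multiple of k):

import random

def systematic_sample(window, k):
    """Systematic sample of size k: pick a random offset L in {1, ..., m}
    and take every m-th element, where m = n/k."""
    n = len(window)
    m = n // k
    L = random.randrange(1, m + 1)           # random 1-based start
    return [window[i] for i in range(L - 1, n, m)]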

Biased Reservoir Sampling

Consider a generalized reservoir scheme in which the sequence of inclusion probabilities {p_i : 1 ≤ i ≤ n} either is nondecreasing or does not decrease as quickly as the sequence {k/(i ∨ k): 1 ≤ i ≤ n}. This version of reservoir sampling favors inclusion of recently arrived elements over elements that arrived earlier in the stream. As illustrated in Sect. 4.4 below, it can be useful to compute the marginal probability that a specified element e_i belongs to the final sample S. The probability that e_i is selected for insertion is, of course, equal to p_i. For j > i ∨ k, the probability θ_ij that e_i is not displaced from the sample when element e_j arrives equals the probability that e_j is not selected for insertion plus the probability that e_j is selected but does not displace e_i. If j ≤ k, then the processing of e_j cannot result in the removal of e_i from the reservoir. Thus

$$\theta_{ij} = (1 - p_j) + p_j\,\frac{k-1}{k} = \frac{k - p_j}{k}$$

if j > k, and θ_ij = 1 otherwise. Because the random decisions made at the successive steps of the reservoir sampling scheme are mutually independent, it follows that the probability that e_i is included in S is the product of the foregoing probabilities:

$$\Pr\{e_i \in S\} = p_i \prod_{j=(i\vee k)+1}^{n} \frac{k - p_j}{k}. \tag{1}$$

Similar arguments, based on the products $\alpha_{i,j} = \prod_{l=i}^{j}(k-p_l)/k$ and $\beta_{i,j} = \prod_{l=i}^{j}(k-2p_l)/k$, lead to an analogous expression (2) for the joint inclusion probabilities $\Pr\{e_i \in S,\ e_j \in S\}$, $i < j$.

If, for example, we set $p_i \equiv p$ for some $p \in (0, 1)$, then, from (1),

$$\Pr\{e_i \in S\} = p\left(\frac{k-p}{k}\right)^{n-(i\vee k)}.$$

Thus the probability that element ei is in the final sample decreases geometrically as i decreases; the larger the value of p, the faster the rate of decrease. Chao [56] has extended the basic reservoir sampling algorithm to handle arbitrary sampling probabilities. Specifically, just after the processing of element ei , Chao’s scheme ensures that the inclusion probabilities satisfy Pr{ej ∈ S}∝rj for 1 ≤ j ≤ i, where {rj : j ≥ 1} is a prespecified sequence of positive numbers. The analysis of this scheme is rather complicated, and so we refer the reader to [56] for a complete discussion.
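To illustrate, the following Python sketch (ours) runs a generalized reservoir sampler with an arbitrary inclusion-probability function p(i), and also evaluates the marginal inclusion probability of Eq. (1); it assumes that p_i = 1 during the initial reservoir-filling phase (i ≤ k).

import random

def biased_reservoir(stream, k, p):
    """Generalized reservoir sampling: for i > k, element e_i is inserted
    with probability p(i) and displaces a uniformly chosen victim.
    The first k elements always enter the reservoir (assumed p_i = 1)."""
    reservoir = []
    for i, x in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(x)
        elif random.random() < p(i):
            reservoir[random.randrange(k)] = x
    return reservoir

def inclusion_probability(i, n, k, p):
    """Marginal probability that e_i is in the final sample, per Eq. (1):
    p_i * prod_{j = (i v k) + 1}^{n} (k - p_j)/k."""
    prob = 1.0 if i <= k else p(i)           # assumption: p_i = 1 for i <= k
    for j in range(max(i, k) + 1, n + 1):
        prob *= (k - p(j)) / k
    return prob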

Biased Sampling by Halving

Another way to obtain a biased sample of size k is to divide the window into L strata of m = n/L elements each, denoted Λ1,Λ2,...,ΛL, and maintain a running sample S of size k as follows. The sample is initialized as an SRS of size k from Λ1; (unbiased) reservoir sampling or sequential sampling may be used for this purpose. At the jth subsequent step, k/2 randomly-selected elements of S are overwritten by the elements of an SRS of size k/2 from Λj+1 (so that half of the elements in S are purged). For an element ei ∈ Λj , we have, after the procedure has terminated,

$$\Pr\{e_i \in S\} = \frac{k}{m}\left(\frac{1}{2}\right)^{L-(j\vee 2)+1}.$$

As with biased reservoir sampling, the halving scheme ensures that the probability that e_i is in the final sample falls geometrically as i decreases. Brönnimann et al. [57] describe a related scheme for the case in which each stream element is a d-vector of 0–1 data that represents, e.g., the presence or absence in a transaction of each of d items. In this setting, the goal of each halving step is to create a subsample in which the relative occurrence frequencies of the items are as close as possible to the corresponding frequencies over all of the transactions in the original sample. The scheme uses a deterministic halving method called "epsilon approximation" to achieve this goal. The relative item frequencies in subsamples produced by this latter method tend to be closer to the relative frequencies in the original sample than are those in subsamples obtained by SRS.

3 Sampling from a Sliding Window

We now restrict attention to infinite data streams and consider methods for sampling from a sliding window that contains the most recent data elements. As mentioned previously, this task is substantially harder than sampling from a stationary win- dow. The difficulty arises because elements must be removed from the sample as they expire so that maintaining a sample of a specified size is nontrivial. Following [58], we distinguish between sequence-based windows and timestamp-based win- dows. A sequence-based window of length n contains the n most recent elements, 28 P.J. Haas whereas a timestamp-based window of length t contains all elements that arrived within the past t time units. Because a sliding window inherently favors recently ar- rived elements, we focus on techniques for equal-probability sampling from within the window itself. For completeness, we also provide a brief discussion of general- ized windows in which elements need not leave the window in arrival order.

3.1 Sequence-Based Windows

We consider windows {Wj : j ≥ 1}, each of length n, where Wj ={ej ,ej+1, ...,ej+n−1}. A number of algorithms have been proposed for producing, for each window Wj ,anSRS Sj of k elements from Wj . The major difference between the algorithms lies in the tradeoff between the amount of memory required and the de- gree of dependence between the successive Sj ’s.

Complete Resampling

At one end of the spectrum, a “complete resampling” algorithm takes an indepen- dent sample from each Wj . To do this, the set of elements in the current window is buffered in memory and updated incrementally, i.e., Wj+1 is obtained from Wj by deleting ej and inserting ej+n. Reservoir sampling (or, more efficiently, sequential sampling) can then be used to extract Sj from Wj .TheSj ’s produced by this algo- rithm have the desirable property of being mutually independent. This algorithm is impractical, however, because it has memory and CPU requirements of O(n), and n is assumed to be very large.

A Passive Algorithm

At the other end of the spectrum, the "passive" algorithm described in [58] obtains an SRS of size k from the first n elements using reservoir sampling. Thereafter, the sample is updated only when the arrival of an element coincides with the expiration of an element in the sample, in which case the expired element is removed and the new element is inserted. An argument similar to the proof of Theorem 1 shows that each S_j is an SRS from W_j. Moreover, the memory requirement is O(k), the same as for the stationary-window algorithms. In contrast to complete resampling, however, the passive algorithm produces S_j's that are highly correlated. For example, S_j and S_{j+1} are identical or almost identical for each j. Indeed, if the data elements are periodic with period n, then every S_j is identical to S_1; this assertion follows from the fact that if element e_i is in the sample, then so is e_{i+jn} for j ≥ 1. Thus if S_1 is not representative, e.g., the sampled elements are clustered within W_1 as in Fig. 3(a), then each subsequent sample will suffer from the same defect.
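A minimal Python sketch of the passive algorithm (ours): reservoir sampling fills the sample from the first n elements, and thereafter an arriving element replaces an expiring sample element whenever their indices coincide modulo n.

import random

def passive_window_sample(stream, n, k):
    """Yield, after each arrival, a size-k sample of the current
    sequence-based window of length n (passive algorithm)."""
    sample = {}                        # maps index i -> element e_i in sample
    for i, x in enumerate(stream, start=1):
        if i <= n:
            # reservoir sampling over the first window
            if len(sample) < k:
                sample[i] = x
            elif random.random() < k / i:
                del sample[random.choice(list(sample))]
                sample[i] = x
        elif (i - n) in sample:
            # the expiring element e_{i-n} was sampled; replace it with e_i
            del sample[i - n]
            sample[i] = x
        yield list(sample.values())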

Subsampling from a Bernoulli Sample

Babcock et al. [58] provide two algorithms intermediate to those discussed above. The first algorithm inserts elements into a set B using a Bernoulli sampling scheme; elements are removed from B when, and only when, they expire. The algorithm tries to ensure that the size of B exceeds k at all times by using an inflated Bernoulli sampling rate of q = (2ck log n)/n, where c is a fixed constant. Each final sample S_j is then obtained as a simple random subsample of size k from B. An argument using Chernoff bounds (see, e.g., [59]) shows that the size of B lies between k and 4ck log n with a probability that exceeds 1 − O(n^{−c}). The S_j's are less dependent than in the passive algorithm, but the expected memory requirement is O(k log n). Also observe that if B_j is the size of B after j elements have been processed and if γ(i) denotes the index of the ith step at which the sample size either increases or decreases by 1, then Pr{B_{γ(i+1)} = B_{γ(i)} + 1} = Pr{B_{γ(i+1)} = B_{γ(i)} − 1} = 1/2. That is, the process {B_{γ(i)}: i ≥ 0} behaves like a symmetric random walk. It follows that, with probability 1, the size of the Bernoulli sample will fall below k infinitely often, which can be problematic if sampling is performed over a very long period of time.
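The following Python sketch (ours; the constant c is left as a parameter) maintains such an inflated-rate Bernoulli sample over a sequence-based window and extracts a size-k subsample on demand.

import math
import random
from collections import deque

def bernoulli_window_sampler(stream, n, k, c=2):
    """Maintain a Bernoulli sample B of the length-n window with inflated
    rate q = 2*c*k*ln(n)/n; after each arrival, yield a size-k subsample
    of B (or all of B if it currently holds fewer than k elements)."""
    q = min(1.0, 2 * c * k * math.log(n) / n)
    B = deque()                              # (index, element) pairs, oldest first
    for i, x in enumerate(stream, start=1):
        while B and B[0][0] <= i - n:        # drop expired elements
            B.popleft()
        if random.random() < q:              # Bernoulli insertion
            B.append((i, x))
        elems = [e for _, e in B]
        yield random.sample(elems, k) if len(elems) >= k else elems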

Chain Sampling

The second algorithm, called chain sampling, retains the improved independence properties of the Sj ’s relative to the passive algorithm, but reduces the expected memory requirement to O(k). The basic algorithm maintains a sample of size 1, and a sample of size k is obtained by running k independent chain-samplers in parallel. Observe that the overall sample is therefore a simple random sample with replacement—we discuss this issue after we describe the algorithm. To maintain a sample S of size 1, the algorithm initially inserts each newly ar- rived element ei into the sample (i.e., sets the sample equal to S ={ei}) with prob- ability 1/i for 1 ≤ i ≤ n. Thus the algorithm behaves initially as a reservoir sam- pler so that, after the nth element has been observed, S is an SRS of size 1 from {e1,e2,...,en}. Subsequently, whenever element ei arrives and, just prior to arrival, the sample is S ={ej } with i = j + n (so that the sample element ej expires), an el- ement randomly and uniformly selected from among ej+1,ej+2,...,ej+n becomes the new sample element. Observe that the algorithm does not need to store all of the elements in the window in order to replace expiring sample elements—it suffices to store a “chain” of elements associated with the sample, where the first element of the chain is the sample itself; see Fig. 4. In more detail, whenever an element ei is added to the chain, the algorithm randomly selects the index K of the ele- ment eK that will replace ei upon expiration. Index K is uniformly distributed on i + 1,i + 2,...,i + n, the indexes of the elements that will be in the window just after ei expires. When element eK arrives, the algorithm stores eK in memory and randomly selects the index M of the element that will replace eK upon expiration. To further reduce memory requirements and increase the degree of independence between successive samples, the foregoing chaining method is enhanced with a 30 P.J. Haas

Fig. 4 Chain sampling (sample size = 1). Arrows point to the elements of the current chain, the circled element represents the current sample, and elements within squares represent those elements of the chain currently stored in memory

reservoir sampling mechanism. Specifically, suppose that element e_i arrives and, just prior to arrival, the sample is S = {e_j} with i < j + n, so that the sample element has not yet expired. Then, with probability 1/(i ∧ n), the arriving element e_i becomes the new sample element; in this case the current chain is discarded and a new chain is started (see lines 9–16 of Fig. 5). Otherwise, the sample and its chain are left unchanged.

6 See [58] for an alternative analysis. Whenever an arriving element ei is added to the chain and then immediately becomes the new sample element, we count this element as the first element of a new chain. Data-Stream Sampling: Basic Techniques and Results 31

// n is the number of elements in the window
// ei is the element that has just arrived (i ≥ 1)
// L is a linked list (static) of chained elements (excluding sample) of the form (e, l)
// S is the sample (static, contains exactly one element)
// J is the index of the element in the sample (static, initialized to 0)
// K is the index of the next element to be added to the chain (static, initialized to 0)
// random() returns a uniform[0,1] pseudorandom number

1  if i = K                      // add ei to chain
2      add(ei, i, L)             // insert (ei, i) at tail of list
3      V ← random()
4      K ← i + ⌊nV⌋ + 1          // K is uniform on i + 1, ..., i + n
5  if i = J + n                  // current sample element is expiring
6      (e, l) ← pop(L)           // remove element at head of list...
7      S ← {e}                   // ...to become the new sample element
8      J ← l
9  else                          // sample element is not expiring
10     U ← random()
11     if U ≤ 1/(i ∧ n) then     // insert ei into sample
12         S ← {ei}
13         J ← i
14         purge(L)              // start new chain
15         V ← random()
16         K ← i + ⌊nV⌋ + 1

Fig. 5 Chain-sampling algorithm (sample size = 1)

To bound the memory consumption, let M denote the number of chain elements stored by a single chain-sampler, and, given that a chain has been started (M ≥ 1), let X denote the number of arrivals until the designated replacement of the chain's initial sample; X is uniformly distributed on {1, 2, ..., n}. A second element is stored (M ≥ 2) only if, after the initial sample, none of the next X arriving elements become the new sample element. Thus Pr{M ≥ 2 | M ≥ 1, X = j} ≤ (1 − n^{−1})^j for 1 ≤ j ≤ n. Unconditioning on X, we have

$$\Pr\{M \ge 2 \mid M \ge 1\} \le \sum_{j=1}^{n} \frac{1}{n}\Bigl(1 - \frac{1}{n}\Bigr)^{j} \le 1 - \Bigl(1 - \frac{1}{n}\Bigr)^{n+1} \;\stackrel{\mathrm{def}}{=}\; \beta.$$

The same argument also shows that Pr{M ≥ j + 1 | M ≥ j} ≤ β for j ≥ 2, so that Pr{M ≥ j} ≤ β^{j−1} for j ≥ 1. An upper bound on the expected memory consumption is therefore given by

$$\mathrm{E}[M] = \sum_{j=1}^{\infty} \Pr\{M \ge j\} \le \frac{1}{1-\beta} \approx e.$$

Moreover, for j = α ln n with α a fixed positive constant, we have

$$\Pr\{M \ge j+1\} \le e^{j \ln \beta} = n^{-c},$$

where c = −α ln β ≈ −α ln(1 − e^{−1}). Thus the expected memory consumption for k independent samplers is O(k) and, with probability 1 − O(n^{−c}), the memory consumption does not exceed O(k log n).

Fig. 6 Stratified sampling for a sliding window (n = 12, m = 4, k = 2). The circled elements lying within the window represent the members of the current sample, and circled elements lying to the left of the window represent former members of the sample that have expired

As mentioned previously, chain sampling produces an SRSWR rather than an SRS. One way of dealing with this issue is to increase the size of the initial SRSWR sample S to |S| = k + α, where α is large enough so that, after removal of duplicates, the size of the final SRS will equal or exceed k with high probability. Subsampling can then be used, if desired, to ensure that the final sample size equals k exactly. Using results on "occupancy distributions" [60, p. 102] it can be shown that

$$\Pr\bigl\{|S| < k\bigr\} \le \cdots \tag{3}$$
[The explicit occupancy bound (3), and the resulting choices α1 and α2 of the padding α, are not recovered in this copy; see footnote 7.]

Stratified Sampling

The stratified sampling scheme for a stationary window can be adapted to obtain a stratified sample from a sliding window. The simplest scheme divides the stream into strata of length m, where m divides the window length n; see Fig. 6. Reservoir sampling is used to obtain an SRS of size k from each stratum.

7 We derive α1 by directly bounding each term in (3). We derive α2 by stochastically bounding |S| from below by the number of successes in a sequence of k + α Bernoulli trials with success probability (n − k)/n and then using a Chernoff bound.

The sample size fluctuates, but always lies between k(l − 1) and kl, where l denotes the number of strata that intersect the current window. This sampling technique therefore not only retains the advantages of the stationary stratified sampling scheme but also, unlike the other sliding-window algorithms, ensures that the sample size always exceeds a specified threshold.

3.2 Timestamp-Based Windows

Relatively little is currently known about sampling from timestamp-based windows. The methods for sequence-based windows do not apply because the number of elements in the window changes over time. Babcock et al. [58] propose an algorithm called priority sampling. As with chain sampling, the basic algorithm maintains an SRS of size 1, and an SRSWR of size k is obtained by running k priority-samplers in parallel. The basic algorithm for a sample size of 1 assigns to each arriving element a random priority uniformly distributed between 0 and 1. The current sample is then taken as the element in the current window having the highest priority; since each element in the window is equally likely to have the highest priority, the sample is clearly an SRS. The only elements that need to be stored in memory are those elements in the window for which there is no element with both a higher timestamp and a higher priority, because only these elements can ever become the sample element. In one simple implementation, the stored elements (including the sample) are maintained as a linked list, in order of decreasing priority (and, automatically, of increasing timestamp). Each arriving element e_i is inserted into the appropriate place in the list, and all list elements having a priority smaller than that of e_i are purged, leaving e_i as the last element in the list. Elements are removed from the head of the list as they expire. To determine the memory consumption M of the algorithm at a fixed but arbitrary time point, suppose that the window contains n elements $e_{m+1}, e_{m+2}, \ldots, e_{m+n}$ for some m ≥ 0. Denote by P_i the priority of $e_{m+i}$, and set Φ_i = 1 if $e_{m+i}$ is currently stored in memory and Φ_i = 0 otherwise. Ignore zero-probability events in which there are ties among the priorities and observe for each i that Φ_i = 1 if and only if P_i > P_j for j = i + 1, i + 2, ..., n. Because priorities are assigned randomly and uniformly, each of the n − i + 1 elements $e_{m+i}, e_{m+i+1}, \ldots, e_{m+n}$ is equally likely to be the one with the highest priority, and hence E[Φ_i] = Pr{Φ_i = 1} = 1/(n − i + 1). It follows that the expected number of elements stored in memory is

$$\mathrm{E}[M] = \mathrm{E}\Bigl[\sum_{i=1}^{n}\Phi_i\Bigr] = \sum_{i=1}^{n}\mathrm{E}[\Phi_i] = H(n) = O(\ln n),$$

where H(n) is the nth harmonic number. We can also obtain a probabilistic bound on M as follows. Denote by X_i the number of the i most recent arrivals in the window that have been inserted into the linked list: $X_i = \sum_{j=n-i+1}^{n}\Phi_j$. Observe that if X_i = m for some m ≥ 0, then either X_{i+1} = m or X_{i+1} = m + 1. Moreover, it follows from our previous analysis that Pr{X_1 = 1} = 1 and

$$\Pr\{X_{i+1} = m_i + 1 \mid X_i = m_i,\, X_{i-1} = m_{i-1},\, \ldots,\, X_1 = m_1\} = \Pr\{\Phi_{n-i} = 1\} = \frac{1}{i+1}$$

for all 1 ≤ i < n, so that the increments of the process {X_i} are mutually independent indicator variables. A Chernoff-type bound for such sums then shows that Pr{M > 2(1 + c) ln n} is polynomially small in n for any fixed constant c > 0.
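A small Python sketch of a single priority sampler (ours), following the list-based implementation described above; running k independent copies yields the SRSWR of size k.

import random

class PrioritySampler:
    """Uniform sample of size 1 over a timestamp-based window of length t.
    The stored entries are exactly those window elements with no later
    arrival of higher priority."""
    def __init__(self, window_length):
        self.t = window_length
        self.entries = []                  # (priority, timestamp, element)

    def insert(self, timestamp, element):
        p = random.random()
        # purge entries dominated by the new arrival (lower priority, older)
        self.entries = [e for e in self.entries if e[0] > p]
        self.entries.append((p, timestamp, element))

    def sample(self, now):
        # drop expired entries, then return the highest-priority survivor
        self.entries = [e for e in self.entries if e[1] > now - self.t]
        return max(self.entries)[2] if self.entries else None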

3.3 Generalized Windows

In the case of both sequence-based and timestamp-based sliding windows, elements leave the window in the same order that they arrive. In this section, we briefly consider a generalized setting in which elements can be deleted from a window W in arbitrary order. More precisely, we consider a set T = {t_1, t_2, ...} of unique, distinguishable items, together with an infinite sequence of transactions γ = (γ_1, γ_2, ...). Each transaction γ_i is either of the form +t_k, which corresponds to the insertion of item t_k into W, or of the form −t_k, which corresponds to the deletion of item t_k from W. We restrict attention to sequences such that, at any time point, an item appears at most once in the window, so that the window is a true set and not a multiset. To avoid trivialities, we also require that γ_n = −t_k only if item t_k is in the window just prior to the processing of the nth transaction. Finally, we assume throughout that the rate of insertions approximately equals the rate of deletions, so that the number of elements in the window remains roughly constant over time. The authors in [61] provide a "random pairing" (RP) algorithm for maintaining a bounded uniform sample of W. The RP algorithm generalizes the reservoir sampling algorithm of Sect. 2.2 to handle deletions, and reduces to the passive algorithm of Sect. 3.1 when the number of elements in the window is constant over time and items are deleted in insertion order (so that W is a sequence-based sliding window). In the RP scheme, every deletion from the window is eventually "compensated" by a subsequent insertion. At any given time, there are 0 or more "uncompensated" deletions. The RP algorithm maintains a counter cb that records the number of "bad" uncompensated deletions in which the deleted item was also in the sample so that the sample size was decremented by 1. The RP algorithm also maintains a counter cg that records the number of "good" uncompensated deletions in which the deleted item was not in the sample so that the sample size was not affected. Clearly, d = cb + cg is the total number of uncompensated deletions.

// cb is the # of uncompensated deletions that have been in the sample
// cg is the # of uncompensated deletions that have not been in the sample
// γi is the transaction that has just arrived (i ≥ 1)
// M is the upper bound on sample size
// W and S are the window and sample size, respectively
// random() returns a uniform[0,1] pseudorandom number

1  if γi = +t then               // an insertion
2      if cb + cg = 0            // execute reservoir-sampling step
3          if |S| ...
   [remaining lines of the pseudocode not recovered in this copy; see [61] for the full listing]

Fig. 7 Random-pairing algorithm (simple version)

The algorithm works as follows. Deletion of an item is handled by removing the item from the sample, if present, and by incrementing the value of cb or cg, as appropriate. If d = 0, i.e., there are no uncompensated deletions, then insertions are processed as in standard RS. If d > 0, then we flip a coin at each insertion step, and include the incoming insertion into the sample with probability cb/(cb + cg); otherwise, we exclude the item from the sample. We then decrease either cb or cg, depending on whether the insertion has been included into the sample or not. Conceptually, whenever an item is inserted and d > 0, the item is paired with a randomly selected uncompensated deletion, called the "partner" deletion. The inserted item is included into the sample if its partner was in the sample at the time of its deletion, and excluded otherwise. The probability that the partner was in the sample is cb/(cb + cg). For purposes of sample maintenance, it is not necessary to keep track of the precise identity of the random partner; it suffices to maintain the counters cb and cg. Figure 7 displays the pseudocode for the simplest version of the RP algorithm, which is executed whenever a new transaction γi arrives. As with classic reservoir sampling, the basic algorithm of Fig. 7 can be speeded up by directly generating random skip values; see [61] for such optimizations, as well as a correctness proof and a technique for merging samples.
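Based on the description above, here is a minimal Python sketch of the random-pairing scheme (ours, without the skip-based speedups of [61]); the exact acceptance probability used in the reservoir-sampling step when d = 0 is our simplification.

import random

class RandomPairing:
    """Bounded uniform sample under arbitrary insertions and deletions."""
    def __init__(self, M):
        self.M = M                   # upper bound on sample size
        self.sample = set()
        self.window_size = 0
        self.c_bad = 0               # uncompensated deletions that were sampled
        self.c_good = 0              # uncompensated deletions that were not

    def delete(self, item):
        self.window_size -= 1
        if item in self.sample:
            self.sample.discard(item)
            self.c_bad += 1
        else:
            self.c_good += 1

    def insert(self, item):
        self.window_size += 1
        d = self.c_bad + self.c_good
        if d == 0:
            # no uncompensated deletions: standard reservoir-sampling step
            if len(self.sample) < self.M:
                self.sample.add(item)
            elif random.random() < self.M / self.window_size:
                self.sample.discard(random.choice(list(self.sample)))
                self.sample.add(item)
        else:
            # pair the insertion with a random uncompensated deletion
            if random.random() < self.c_bad / d:
                self.sample.add(item)
                self.c_bad -= 1
            else:
                self.c_good -= 1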

Note that, if boundedness of the sample size is not a concern, then the following simple Bernoulli sampling scheme can be used to maintain S. First fix a sampling rate q ∈ (0, 1). For an insertion transaction γi =+tk, include tk in S with proba- bility q and exclude tk with probability 1 − q. For a deletion transaction γi =−tk, simply remove tk from S, if present. In a variant of the above setting, multiple copies of each item may occur in both the window W and the sample S, so that both W and S are multisets (i.e., bags). When sampling from multisets, relatively sophisticated techniques are required to handle deletion of items. An extension of Bernoulli sampling to multisets is given in [62], and the authors in [63–65] provide techniques for maintaining a uniform sample D of the distinct items in a multiset window W and, for each item in D,an exact value (or high-precision estimate) of the frequency of the item in W .

4 Inference from a Sample

This section concerns techniques for drawing inferences about the contents of a win- dow from a sample of window elements. As discussed in Sect. 1, such techniques belong to the domain of finite-population sampling. A complete discussion of this topic is well beyond the scope of this chapter, and so we cover only the most basic results; see [1–5] for further discussion. Our emphasis is on methods for estimating population sums and functions of such sums—these fundamental population param- eters occur frequently in practice and are well understood. The “population” is, of course, the set of all elements in the window.

4.1 Estimation of Population Sums and Functions of Sums

We first describe techniques for estimating quantities of the form $\theta = \sum_{e_i\in W} h(e_i)$, where W is the window of interest, assumed to be of length n, and h is a real-valued function. We then discuss estimation methods for window characteristics of the form $\alpha = g(\theta_1, \theta_2, \ldots, \theta_d)$, where d ≥ 1. Here $g: \mathbb{R}^d \to \mathbb{R}$ is a specified "smooth" function and each $\theta_j$ is a population sum, i.e., $\theta_j = \sum_{e_i\in W} h_j(e_i)$ for some function $h_j$. Some examples of these estimands are as follows, where each element e_i is a sales-transaction record and v(e_i) is the dollar value of the transaction.

1. Let h(e_i) = v(e_i). Then θ is the sum of sales over the transactions in the window.
2. Let h be a predicate function such that h(e_i) = 1 if v(e_i) > $1000 and h(e_i) = 0 otherwise. Then θ is the total number of transactions in the window that exceed $1000.
3. Let θ be as in example 1 above, and let g(θ) = θ/n. Then α = g(θ) is the average sales amount per transaction in the window. If θ is as in example 2 above, then α is the fraction of transactions that exceed $1000.

4. Suppose that an element not only records the dollar value of the transaction, but also enough information to compute the relative position number of the element within the window (the element in relative position 1 being the first to have arrived). Let 0 ≤ w_1 ≤ w_2 ≤ ··· ≤ w_n be a sequence of nondecreasing weights. If h(e_i) = w_i v(e_i), then θ is a weighted sum of sales that favors more recent arrivals.
5. Let h_1(e_i) = v(e_i) if v(e_i) > $1000 and h_1(e_i) = 0 otherwise, and let h_2 be as in example 2 above. Also let θ_1 and θ_2 be the population sums that correspond to h_1 and h_2. If g(x, y) = x/y, then α = g(θ_1, θ_2) is the average value of those transactions in the window that exceed $1000.
6. Let h_1(e_i) = v(e_i) and h_2(e_i) = v²(e_i). Also let g(x, y) = (y/n) − (x/n)². Then α = g(θ_1, θ_2) is the variance of the sales amounts for the transactions in the window.

As discussed in Sect. 1.1, any estimate θ̂ of a quantity such as θ should be supplemented with an assessment of the estimate's accuracy and precision. Typically, θ̂ will be (at least approximately) unbiased, in that E[θ̂] = θ, i.e., if the sampling experiment were to be repeated multiple times, the estimator θ̂ would equal θ on average. In this case, the usual measure of precision is the standard error,8 defined as the standard deviation, or square root of the variance, of θ̂: SE[θ̂] = E^{1/2}[(θ̂ − E[θ̂])²]. If the distribution of θ̂ is approximately normal and we can compute from the sample an estimator SÊ[θ̂] of SE[θ̂], then we can go further and make probabilistic statements of the form "with probability approximately 100p %, the unknown value θ lies in the (random) interval [θ̂ − z_p SÊ[θ̂], θ̂ + z_p SÊ[θ̂]]," where z_p is the (p + 1)/2 quantile of a standard (mean 0, variance 1) normal distribution. That is, we can compute confidence intervals for θ. A number of estimators θ̂, including the SRS-based expansion estimator discussed below, have approximately a normal distribution under mild regularity conditions. These conditions require, roughly speaking, that (i) the window length n be large, (ii) the sample size k be much smaller than n but reasonably large in absolute terms, say, 50 or larger, and (iii) the population sum not be significantly influenced by any one value or small group of values. Formal statements of these results take the form of "finite-population central limit theorems" [4, Sects. 3.4 and 3.5].

4.2 SRS and Bernoulli Sampling

For an SRS S of size k, the standard estimator of a population sum is the obvious one, namely the expansion estimator $\hat\theta = (n/k)\sum_{e_i\in S} h(e_i)$; cf. the example in Sect. 1.1.

8 For a biased estimator θ̂, the usual precision measure is the root mean squared error, defined as RMSE[θ̂] = E^{1/2}[(θ̂ − θ)²]. It is not hard to show that RMSE = (bias² + SE²)^{1/2}, so that RMSE and SE coincide for an unbiased estimator.

Similarly, the estimator of the corresponding population average is the sample average $\hat\alpha = \hat\theta/n = (1/k)\sum_{e_i\in S} h(e_i)$. Both of these estimators are unbiased and consistent, in the sense that they converge to their true values as the sample size increases. For Bernoulli sampling with sampling rate q, the sample size k is random, and there are two possible estimators of the population sum θ: the expansion estimator (with k equal to the observed sample size) and the estimator $\theta^* = (1/q)\sum_{e_i\in S} h(e_i)$. The estimator θ* is unbiased; the estimator θ̂ is slightly biased,9 but often has significantly lower variance than θ*, and is usually preferred in practice. The variance of θ̂ is given by Var[θ̂] = (n²/k)(1 − f)σ² under SRS (exactly) and under Bernoulli sampling (approximately), where f = k/n is the sampling fraction and

$$\sigma^2 = \frac{\sum_{e_i\in W}\bigl(h(e_i) - (\theta/n)\bigr)^2}{n-1} \tag{4}$$

is the variance of the numbers {h(e_i): e_i ∈ W}.10 An unbiased estimator of Var[θ̂], based on the values in the sample, is $\widehat{\mathrm{Var}}[\hat\theta] = (n^2/k)(1-f)\hat\sigma^2$, where $\hat\sigma^2 = (1/(k-1))\sum_{e_i\in S}\bigl(h(e_i) - (\hat\theta/n)\bigr)^2$. To obtain corresponding variance formulas for the sample average α̂, simply multiply by n^{−2}, i.e., Var[α̂] = (σ²/k)(1 − f) and $\widehat{\mathrm{Var}}[\hat\alpha] = (\hat\sigma^2/k)(1-f)$. To obtain expressions for standard errors and their estimators, take the square root of the corresponding expressions for variances.
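For concreteness, a small Python sketch (ours) that computes the expansion estimate of a population sum from an SRS, together with its estimated standard error and the normal-approximation confidence interval described in Sect. 4.1:

import math
import random
from statistics import NormalDist

def expansion_estimate(window, k, h, p=0.95):
    """Return (theta_hat, se_hat, ci): the expansion estimate of
    theta = sum_{e in W} h(e), its estimated standard error, and an
    approximate 100p% confidence interval, based on an SRS of size k."""
    n = len(window)
    S = random.sample(window, k)                 # simple random sample
    theta_hat = (n / k) * sum(h(e) for e in S)
    f = k / n                                    # sampling fraction
    s2 = sum((h(e) - theta_hat / n) ** 2 for e in S) / (k - 1)
    se_hat = math.sqrt((n * n / k) * (1 - f) * s2)
    z = NormalDist().inv_cdf((1 + p) / 2)        # z_p of the normal approximation
    return theta_hat, se_hat, (theta_hat - z * se_hat, theta_hat + z * se_hat)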

4.3 Stratified Sampling

Here we consider the simple stratified sampling scheme for a stationary window discussed in Sect. 2.3. Our results also apply to the stratified sampling scheme for sliding windows in Sect. 3.1, at those times when the window boundaries align with the strata boundaries; the modifications required to deal with arbitrary time points can be derived, e.g., from the results in [1, Chap. 5]. Suppose that our goal is to estimate an unknown population sum θ. Also suppose that there are L equal-sized strata, with m = n/L elements per stratum, and that we obtain an SRS of r = k/L elements from each stratum. Denote by Λ_j the set of elements in the jth stratum and by S_j the elements of Λ_j that are in the final sample. The usual expansion estimator θ̂ is unbiased for θ, and

$$\mathrm{Var}[\hat\theta] = m\Bigl(\frac{n}{k} - 1\Bigr)\sum_{j=1}^{L}\sigma_j^2,$$

9 The bias arises from the fact that the sample can be empty, albeit typically with low probability.
10 We follow the survey-sampling literature, which usually takes the denominator in (4) as n − 1 instead of n because this convention leads to simpler formulas.

where $\sigma_j^2 = (1/(m-1))\sum_{e_i\in\Lambda_j}\bigl(h(e_i) - (\theta_j/m)\bigr)^2$ and $\theta_j = \sum_{e_i\in\Lambda_j} h(e_i)$. An unbiased estimator of Var[θ̂] is

$$\widehat{\mathrm{Var}}[\hat\theta] = m\Bigl(\frac{n}{k} - 1\Bigr)\sum_{j=1}^{L}\hat\sigma_j^2,$$

where $\hat\sigma_j^2 = (1/(r-1))\sum_{e_i\in S_j}\bigl(h(e_i) - (\hat\theta_j/m)\bigr)^2$ and $\hat\theta_j = (m/r)\sum_{e_i\in S_j} h(e_i)$. Observe that if the strata are highly homogeneous, then each $\sigma_j^2$ is very small, so that Var[θ̂] is very small. Indeed, it can be shown [1, Sect. 5.6] that if

$$m\sum_{j=1}^{L}\bigl((\theta_j/m) - (\theta/n)\bigr)^2 \ge \bigl(1 - (m/n)\bigr)\sum_{j=1}^{L}\sigma_j^2,$$

then the variance of θ̂ under stratified sampling is less than or equal to the variance under simple random sampling. This condition holds except when the stratum means are almost equal, i.e., if the data are even slightly stratified, then stratified sampling yields more precise results than SRS.

4.4 Biased Sampling

When using a biased sampling scheme, we can, in principle, recover an unbiased estimate of a population sum by using a Horvitz–Thompson (HT) estimator; see, for example, [3], where these estimators are called π-estimators. The general form of an HT-estimator for a population sum of the form $\theta = \sum_{i\in W} h(e_i)$ based on a sample S ⊆ W is $\hat\theta_{HT} = \sum_{i\in S} h(e_i)/\pi_i$, where π_i is the probability that element e_i is included in S. Assume that π_i > 0 for each e_i, and let Φ_i = 1 if e_i ∈ S and Φ_i = 0 otherwise, so that E[Φ_i] = π_i. Observe that

$$\mathrm{E}[\hat\theta] = \mathrm{E}\Bigl[\sum_{i\in W} h(e_i)\Phi_i/\pi_i\Bigr] = \sum_{i\in W} h(e_i)\,\mathrm{E}[\Phi_i]/\pi_i = \theta, \tag{5}$$

so that HT-estimators are indeed unbiased. Similar calculations [3, Result 2.8.1] show that the variance of θ̂ is given by

$$\mathrm{Var}[\hat\theta] = \sum_{i\in W}\sum_{j\in W}\Bigl(\frac{\pi_{ij}}{\pi_i\pi_j} - 1\Bigr)h(e_i)h(e_j),$$

where π_ij is the probability that elements e_i and e_j are both included in S. (Take π_ii = π_i for i ∈ W.) When using biased reservoir sampling, for example, the probabilities π_i and π_ij can be determined from (1) and (2). (In this case, and in others arising in practice, it can be expensive to compute the π_i's and π_ij's.) Provided that π_ij > 0 for all i, j, an unbiased estimator of Var[θ̂] is given by

$$\widehat{\mathrm{Var}}[\hat\theta] = \sum_{i\in S}\sum_{j\in S}\Bigl(\frac{1}{\pi_i\pi_j} - \frac{1}{\pi_{ij}}\Bigr)h(e_i)h(e_j).$$

It can be shown [3, Result 2.8.2] that if the sampling scheme is such that the final sample size is deterministic and each π_ij is positive, then alternative forms for the variance and variance estimator are given by

$$\mathrm{Var}[\hat\theta] = \frac12\sum_{i\in W}\sum_{j\in W}(\pi_i\pi_j - \pi_{ij})\Bigl(\frac{h(e_i)}{\pi_i} - \frac{h(e_j)}{\pi_j}\Bigr)^2$$

and

$$\widehat{\mathrm{Var}}[\hat\theta] = \frac12\sum_{i\in S}\sum_{j\in S}\Bigl(\frac{\pi_i\pi_j}{\pi_{ij}} - 1\Bigr)\Bigl(\frac{h(e_i)}{\pi_i} - \frac{h(e_j)}{\pi_j}\Bigr)^2.$$

The latter variance estimator is known as the Yates–Grundy–Sen estimator; unlike the previous variance estimator, it has the advantage of always being nonnegative if each term (π_iπ_j/π_ij) − 1 is positive (but is only guaranteed to be unbiased for fixed-size sampling schemes). Most of the estimators discussed previously can be viewed as HT-estimators, for example, the SRS-based expansion estimator: from (1) and (2), we have π_i = k/n and π_ij = k(k − 1)/(n(n − 1)) for all i, j. In general, quantifying the effects of biased sampling on the outcome of a subsequent data analysis can be difficult. When estimating a population sum, however, the foregoing results lead to a clear understanding of the consequences of biased sampling schemes. Specifically, observe that, by (5), the sample sum $\hat\theta = \sum_{e_i\in S} h(e_i)$ is, in fact, an unbiased estimator of the weighted sum $\theta = \sum_{e_i\in W}\pi_i h(e_i)$.

4.5 Functions of Population Sums

Suppose that we wish to estimate α = g(θ), where θ = (θ_1, θ_2, ..., θ_d) is a vector of population sums corresponding to real-valued functions h_1, h_2, ..., h_d. We assume that there exists an unbiased and consistent estimator θ̂_j for each θ_j, and we write θ̂ = (θ̂_1, θ̂_2, ..., θ̂_d). We also assume that g is continuous and differentiable in a neighborhood of θ and write ∇g = (∇_1g, ∇_2g, ..., ∇_dg) for the gradient of g. A straightforward estimate of α is α̂ = g(θ̂), i.e., we simply replace each population sum θ_j by its estimate and then apply the function g. The estimator α̂ will in general be biased if g is a nonlinear function. For example, Jensen's inequality [50, Sect. 21] implies that E[α̂] ≥ α if g is convex and E[α̂] ≤ α if g is concave. The bias decreases, however, as the sample size increases and, moreover, α̂ is consistent for α since g is continuous. The variance of α̂ is difficult to obtain precisely. If the sample size is large, however, so that with high probability θ̂ is close to θ, then we can approximate Var[α̂] by the variance of a linearized version of α̂ obtained by taking a Taylor expansion around the point θ. That is,

$$\mathrm{Var}[\hat\alpha] \approx \mathrm{Var}\Bigl[g(\theta) + \sum_{j=1}^{d} a_j(\hat\theta_j - \theta_j)\Bigr] = \sum_{i=1}^{d}\sum_{j=1}^{d} a_i a_j\,\mathrm{Cov}[\hat\theta_i, \hat\theta_j],$$

where a_i = ∇_ig(θ) and Cov[θ̂_i, θ̂_j] denotes the covariance of θ̂_i and θ̂_j. We estimate Var[α̂] by $\widehat{\mathrm{Var}}[\hat\alpha] = \sum_{i=1}^{d}\sum_{j=1}^{d}\hat a_i\hat a_j\,\widehat{\mathrm{Cov}}[\hat\theta_i, \hat\theta_j]$, where â_i = ∇_ig(θ̂) and $\widehat{\mathrm{Cov}}[\hat\theta_i, \hat\theta_j]$ is an estimate of Cov[θ̂_i, θ̂_j]. The exact formula for the covariance estimator depends on the specific sampling scheme and population-sum estimator. Typically, assuming that θ̂_1, θ̂_2, ..., θ̂_d are computed from the same sample, such formulas are directly analogous to those for the variance. For example, for the expansion estimator under SRS, we have

$$\widehat{\mathrm{Cov}}[\hat\theta_i, \hat\theta_j] = \frac{n^2(1-f)}{k}\cdot\frac{1}{k-1}\sum_{e_l\in S}\bigl(h_i(e_l) - (\hat\theta_i/n)\bigr)\bigl(h_j(e_l) - (\hat\theta_j/n)\bigr),$$

where f = k/n and $\hat\theta_j = (n/k)\sum_{e_i\in S} h_j(e_i)$ for each j. For stratified sampling with L strata of length m,

$$\widehat{\mathrm{Cov}}[\hat\theta_i, \hat\theta_j] = m\Bigl(\frac{n}{k} - 1\Bigr)\sum_{q=1}^{L}\frac{1}{r-1}\sum_{e_l\in S_q}\bigl(h_i(e_l) - (\hat\theta_{i,q}/m)\bigr)\bigl(h_j(e_l) - (\hat\theta_{j,q}/m)\bigr),$$

where, as before, each stratum comprises m = n/L elements, S_q is the SRS of size r = k/L from the qth stratum, and $\hat\theta_{s,q} = (m/r)\sum_{e_l\in S_q} h_s(e_l)$ for each s and q. In general, we note that for any sampling scheme (possibly biased) such that the size of the sample S is deterministic and each π_ij is positive, the linearization approach leads to the following Yates–Grundy–Sen variance estimator:

$$\widehat{\mathrm{Var}}[\hat\alpha] = \frac12\sum_{i\in S}\sum_{j\in S}\Bigl(\frac{\pi_i\pi_j}{\pi_{ij}} - 1\Bigr)\Bigl(\frac{z_i}{\pi_i} - \frac{z_j}{\pi_j}\Bigr)^2,$$

where $z_l = \sum_{j=1}^{d}\nabla_j g(\hat\theta)\,h_j(e_l)$ for each e_l ∈ S; see [4, Sect. 4.2.1].
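As a worked illustration of the linearization recipe (our own numerical sketch, not from the text), the following Python code estimates the ratio α = θ₁/θ₂ of example 5 from an SRS and forms the linearized variance estimate using the SRS covariance formula above.

import random

def ratio_estimate_with_variance(window, k, h1, h2):
    """Estimate alpha = theta1/theta2 from an SRS of size k and return
    (alpha_hat, linearized variance estimate)."""
    n = len(window)
    S = random.sample(window, k)
    f = k / n
    t1 = (n / k) * sum(h1(e) for e in S)     # expansion estimate of theta1
    t2 = (n / k) * sum(h2(e) for e in S)     # expansion estimate of theta2
    alpha = t1 / t2
    a1, a2 = 1.0 / t2, -t1 / (t2 * t2)       # gradient of g(x, y) = x/y at (t1, t2)
    def cov(hi, hj, ti, tj):
        s = sum((hi(e) - ti / n) * (hj(e) - tj / n) for e in S)
        return (n * n * (1 - f) / k) * s / (k - 1)
    var = (a1 * a1 * cov(h1, h1, t1, t1)
           + 2 * a1 * a2 * cov(h1, h2, t1, t2)
           + a2 * a2 * cov(h2, h2, t2, t2))
    return alpha, var

# Example 5: average value of the transactions exceeding $1000
# values = [...]  # dollar values of the transactions in the window
# h1 = lambda v: v if v > 1000 else 0.0
# h2 = lambda v: 1.0 if v > 1000 else 0.0
# alpha_hat, var_hat = ratio_estimate_with_variance(values, 100, h1, h2)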

Acknowledgements The author would like to thank P. Brown, M. Datar, and Rainer Gemulla for several helpful discussions. D. Sivakumar suggested the idea underlying the bound α2 given in Sect. 3.1.

References

1. W.G. Cochran, Sampling Techniques, 3rd edn. (Wiley, New York, 1977) 2. L. Kish, Survey Sampling (Wiley, New York, 1965) 3. C.E. Särndal, B. Swensson, J. Wretman, Model Assisted Survey Sampling (Springer, New York, 1992) 4. M.E. Thompson, Theory of Sample Surveys (Chapman & Hall, London, 1997) 5. S.K. Thompson, Sampling (Wiley, New York, 2002) 6. B. Efron, R.J. Tibshirani, An Introduction to the Bootstrap (Chapman & Hall, New York, 1993) 7. R. Gemulla, W. Lehner, Deferred maintenance of disk-based random samples, in Proc. EDBT. Lecture Notes in Computer Science (Springer, Berlin, 2006), pp. 423–441

8. A. Pol, C.M. Jermaine, S. Arumugam, Maintaining very large random samples using the geo- metric file. VLDB J. 17(5), 997–1018 (2008) 9. P.G. Brown, P.J. Haas, Techniques for warehousing of sample data, in Proc. 22nd ICDE (2006) 10. P.J. Haas, The need for speed: speeding up DB2 using sampling. IDUG Solut. J. 10, 32–34 (2003) 11. S. Chaudhuri, R. Motwani, V.R. Narasayya, Random sampling for histogram construction: how much is enough? in Proc. ACM SIGMOD (1998), pp. 436–447 12. D. DeWitt, J.F. Naughton, D.A. Schneider, S. Seshadri, Practical skew handling algorithms for parallel joins, in Proc. 19th VLDB (1992), pp. 27–40 13. P.J. Haas, C. König, A bi-level Bernoulli scheme for database sampling, in Proc. ACM SIG- MOD (2004), pp. 275–286 14. W. Hou, G. Ozsoyoglu, B. Taneja, Statistical estimators for relational algebra expressions, in Proc. Seventh PODS (1988), pp. 276–287 15. P.J. Haas, J.M. Hellerstein, Ripple joins for online aggregation, in Proc. ACM SIGMOD (1999), pp. 287–298 16. S. Acharya, P. Gibbons, V. Poosala, S. Ramaswamy, Join synopses for approximate query answering, in Proc. ACM SIGMOD (1999), pp. 275–286 17. S. Acharya, P. Gibbons, V. Poosala, Congressional samples for approximate answering of group-by queries, in Proc. ACM SIGMOD (2000), pp. 487–498 18. S. Chaudhuri, G. Das, M. Datar, R. Motwani, V.R. Narasayya, Overcoming limitations of sampling for aggregation queries, in Proc. Seventeenth ICDE (2001), pp. 534–542 19. S. Chaudhuri, R. Motwani, V.R. Narasayya, On random sampling over joins, in Proc. ACM SIGMOD (1999), pp. 263–274 20. S. Ganguly, P.B. Gibbons, Y. Matias, A. Silberschatz, Bifocal sampling for skew-resistant join size estimation, in Proc. ACM SIGMOD (1996), pp. 271–281 21. V. Ganti, M.L. Lee, R. Ramakrishnan, ICICLES: self-tuning samples for approximate query answering, in Proc. 26th VLDB (2000), pp. 176–187 22. P.J. Haas, A.N. Swami, Sampling-based selectivity estimation using augmented frequent value statistics, in Proc. Eleventh ICDE (1995), pp. 522–531 23. C. Jermaine, Robust estimation with sampling and approximate pre-aggregation, in Proc. 29th VLDB (2003), pp. 886–897 24. W. Hou, G. Ozsoyoglu, B. Taneja, Processing aggregate relational queries with hard time constraints, in Proc. ACM SIGMOD (1989), pp. 68–77 25. F. Olken, D. Rotem, Simple random sampling from relational databases, in Proc. 12th VLDB (1986), pp. 160–169 26. F. Olken, D. Rotem, Random sampling from B+ trees, in Proc. 15th VLDB (1989), pp. 269– 277 27. F. Olken, D. Rotem, Maintenance of materialized views of sampling queries, in Proc. Eighth ICDE (1992), pp. 632–641 28. F. Olken, D. Rotem, Sampling from spatial databases, in Proc. Ninth ICDE (1993), pp. 199– 208 29. F. Olken, D. Rotem, P. Xu, Random sampling from hash files, in Proc. ACM SIGMOD (1990), pp. 375–386 30. P.J. Haas, J.F. Naughton, S. Seshadri, A.N. Swami, Selectivity and cost estimation for joins based on random sampling. J. Comput. Syst. Sci. 52, 550–569 (1996) 31. P.J. Haas, J.F. Naughton, A.N. Swami, On the relative cost of sampling for join selectivity estimation, in Proc. Thirteenth PODS (1994), pp. 14–24 32. P.J. Haas, A.N. Swami, Sequential sampling procedures for query size estimation, in Proc. ACM SIGMOD (1992), pp. 1–11 33. W. Hou, G. Ozsoyoglu, E. Dogdu, Error-constrained COUNT query evaluation in relational databases, in Proc. ACM SIGMOD (1991), pp. 278–287 34. R.J. Lipton, J.F. Naughton, Query size estimation by adaptive sampling, in Proc. Ninth PODS (1990), pp. 
40–46

35. R.J. Lipton, J.F. Naughton, D.A. Schneider, Practical selectivity estimation through adaptive sampling, in Proc. ACM SIGMOD (1990), pp. 1–11 36. R.J. Lipton, J.F. Naughton, D.A. Schneider, S. Seshadri, Efficient sampling strategies for rela- tional database operations. Theor. Comput. Sci. 116, 195–226 (1993) 37. K.D. Seppi, J.W. Barnes, C.N. Morris, A Bayesian approach to database query optimization. ORSA J. Comput. 5, 410–419 (1993) 38. J.M. Hellerstein, P.J. Haas, H.J. Wang, Online aggregation, in Proc. ACM SIGMOD (1997), pp. 171–182 39. C. Jermaine, A. Dobra, S. Arumugam, S. Joshi, A. Pol, A disk-based join with probabilistic guarantees, in Proc. ACM SIGMOD (2005) 40. C.M. Jermaine, S. Arumugam, A. Pol, A. Dobra, Scalable approximate query processing with the DBO engine, in Proc. ACM SIGMOD (2007), pp. 725–736 41. C. Jermaine, A. Dobra, A. Pol, S. Joshi, Online estimation for subset-based SQL queries, in Proc. 31st VLDB (2005), pp. 745–756 42. G. Luo, C. Ellman, P.J. Haas, J.F. Naughton, A scalable hash ripple join algorithm, in Proc. ACM SIGMOD (2002), pp. 252–262 43. A. Pol, C. Jermaine, Relational confidence bounds are easy with the bootstrap, in Proc. ACM SIGMOD (2005) 44. P.G. Brown, P.J. Haas, BHUNT: automatic discovery of fuzzy algebraic constraints in rela- tional data, in Proc. 29th VLDB (2003), pp. 668–679 45. I.F. Ilyas, V. Markl, P.J. Haas, P.G. Brown, A. Aboulnaga, CORDS: automatic discovery of correlations and soft functional dependencies, in Proc. ACM SIGMOD (2004), pp. 647–658 46. P. Brown, P. Haas, J. Myllymaki, H. Pirahesh, B. Reinwald, Y. Sismanis, Toward automated large-scale information integration and discovery, in Data Management in a Connected World, ed. by T. Härder, W. Lehner (Springer, New York, 2005) 47. M. Charikar, S. Chaudhuri, R. Motwani, V.R. Narasayya, Towards estimation error guarantees for distinct values, in Proc. Nineteenth PODS (2000), pp. 268–279 48. P.J. Haas, L. Stokes, Estimating the number of classes in a finite population. J. Am. Stat. Assoc. 93, 1475–1487 (1998) 49. M. Wu, C. Jermaine, A Bayesian method for guessing the extreme values in a data set, in Proc. 33rd VLDB (2007), pp. 471–482 50. P. Billingsley, Probability and Measure, 2nd edn. (Wiley, New York, 1986) 51. A.M. Law, Simulation Modeling and Analysis, 4th edn. (McGraw-Hill, New York, 2007) 52. D.E. Knuth, The Art of Computer Programming, vol. 2: Seminumerical Algorithms (Addison- Wesley, Reading, 1969) 53. A.I. McLeod, D.R. Bellhouse, A convenient algorithm for drawing a simple random sample. Appl. Stat. 32, 182–184 (1983) 54. J.S. Vitter, Random sampling with a reservoir. ACM Trans. Math. Softw. 11, 37–57 (1985) 55. J.S. Vitter, Faster methods for random sampling. Commun. ACM 27, 703–718 (1984) 56. M.T. Chao, A general purpose unequal probability sampling plan. Biometrika 69, 653–656 (1982) 57. H. Brönnimann, B. Chen, M. Dash, P.J. Haas, Y. Qiao, P. Scheuermann, Efficient data reduc- tion methods for on-line association rule discovery, in Data Mining: Next Generation Chal- lenges and Future Directions, ed. by H. Kargupta, A. Joshi, K. Sivakumar, Y. Yesha (AAAI Press, Menlo Park, 2004) 58. B. Babcock, M. Datar, R. Motwani, Sampling from a moving window over streaming data, in Proc. 13th SODA (2002), pp. 633–634 59. T. Hagerup, C. Rub, A guided tour of Chernoff bounds. Inf. Process. Lett. 33, 305–308 (1990) 60. W. Feller, An Introduction to Probability Theory and Its Applications, 3rd edn., vol. 1 (Wiley, New York, 1968) 61. R. Gemulla, W. Lehner, P.J. 
Haas, Maintaining bounded-size sample synopses of evolving datasets. VLDB J. 17(2), 173–202 (2008) 62. R. Gemulla, W. Lehner, P.J. Haas, Maintaining Bernoulli samples over evolving multisets, in Proc. Twenty Sixth PODS (2007), pp. 93–102

63. G. Cormode, S. Muthukrishnan, I. Rozenbaum, Summarizing and mining inverse distributions on data streams via dynamic inverse sampling, in Proc. 31st VLDB (2005), pp. 25–36 64. G. Frahling, P. Indyk, C. Sohler, Sampling in dynamic data streams and applications, in Proc. 21st ACM Symp. Comput. Geom. (2005), pp. 142–149 65. P.B. Gibbons, Distinct sampling for highly-accurate answers to distinct values queries and event reports, in Proc. 27th VLDB (2001), pp. 541–550

Quantiles and Equi-depth Histograms over Streams

Michael B. Greenwald and Sanjeev Khanna

1 Introduction

A quantile query over a set S of size n takes as input a quantile φ,0 <φ≤ 1, and returns a value v ∈ S whose rank in the sorted S is φn. Computing the median, the 99-percentile, or the quartiles of a set are examples of quantile queries. Many database optimization problems involve approximate quantile computations over large data sets. Query optimizers use quantile estimates to estimate the size of inter- mediate results and choose an efficient plan among a set of competing plans. Load balancing in parallel databases can be done by using quantile estimates. Above all, quantile estimates can give a meaningful summary of a large data set using a very small memory footprint. For instance, given any data set, one can create a data struc- ture containing 50 observations that can answer any quantile query to within 1 % precision in rank. Based on the underlying application domain, a number of desirable properties can be identified for quantile computation. In this survey, we will focus on the fol- lowing three properties: (a) space used by the algorithm; (b) guaranteed accuracy to within a pre-specified precision; and (c) number of passes made. It is desirable to compute quantiles using the smallest memory footprint possible. We can achieve this by dynamically storing, at any point in time, only a summary of the data seen so far, and not the entire data set. The size and form of such summaries

M.B. Greenwald Arista Networks, Inc., 5453 Great America Parkway, Santa Clara, CA 95054, USA e-mail: [email protected]

S. Khanna (B) Dept. of Computer and Info. Science, University of Pennsylvania, 3330 Walnut Street, Philadelphia, PA 19104, USA e-mail: [email protected]

are determined by our a priori knowledge of the types of quantile queries we expect to be able to answer. We may know in advance that the client intends to ask for a single, specific quantile. Such a single quantile summary is parameterized in advance by the quantile, φ, and a desired precision, ε. For any 0 < φ ≤ 1, and 0 ≤ ε ≤ 1, an ε-approximate φ-quantile on a data set of size n is any value v whose rank, r*(v), is guaranteed to lie between n(φ − ε) and n(φ + ε). For example, a 0.01-approximate 0.5-quantile is any value whose rank is within 1 % of the median. Alternatively, we may know that the client is interested in a range of equally-spaced quantile queries. In such cases we summarize the data by an equi-depth histogram. An equi-depth histogram is parameterized by a bucket size, φ, and a precision, ε. The client may request any, or all, φ-quantiles, that is, elements of ranks φn, 2φn, ..., n. We say that H(φ, ε) is an ε-approximate equi-depth histogram with bucket width φ if for any i = 1 to 1/φ, it returns a value v_{iφ} for the iφ quantile, where n(iφ − ε) ≤ r*(v_{iφ}) ≤ n(iφ + ε). Finally, we may have no prior knowledge of the anticipated queries. Such ε-approximate quantile summaries are parameterized only by a desired precision, ε. We say that a quantile summary Q(ε) is ε-approximate if it can be used to answer any quantile query to within a precision of εn. There is a close relation between equi-depth histograms and quantile summaries. Q(ε) can serve as an equi-depth histogram H(φ, ε) for any 0 < φ ≤ 1. Conversely, an equi-depth histogram H(φ, ε) is a φ-approximate quantile summary, provided only that ε ≤ φ/2. The rest of this chapter is organized as follows. In Sect. 2, we formally introduce the notion of an approximate quantile summary, and some simple operations that we will use to describe various algorithms for maintaining quantile summaries. Section 3 describes deterministic algorithms for exact selection and for computing approximate quantile summaries. These algorithms give worst-case deterministic guarantees on the accuracy of the quantile summary. In contrast, Sect. 4 describes algorithms with probabilistic guarantees on the accuracy of the summary.

2 Preliminaries

We will assume throughout that the data is presented on a read-only tape where the tape head moves to the right after each unit of time. Each move of the tape head reveals the next observation (element) in the sequence stored on the tape. For convenience, we will simply say that a new observation arrives after each unit of time. We will use n to denote both the number of observations (elements of the data sequence) that have been seen so far as well as the current time. Almost all results presented here concern algorithms that make a single pass on the data sequence. In a multi-pass algorithm, we assume that at the beginning of each pass, the tape head is reset to the left-most cell on the tape. We assume that our algorithms operate in a RAM model of computation. The space s(n) used by an algorithm is measured in terms of the maximum number of words used by an algorithm while processing an input sequence of length n. This model assumes that a single word can store max{n, |v*|}, where v* is the observation with largest absolute value that appears in the data sequence. The set-up as described above concerns an "insertion-only" model that assumes that an observation once presented is not removed at a later time from the data sequence. This is referred to as the cash register model in the literature [7]. A more general setting is the turnstile model [17] that also allows for deletion of observations. In Sect. 5, we will consider algorithms for this more general setting as well. We note here that a simple modification of the model above can be used to capture the turnstile case: the ith cell on the read-only tape contains both the ith element in the data sequence and an additional bit that indicates whether the element is being inserted or deleted.

An order-statistic query over a data set S takes as input an integer r ∈ [1..|S|] and outputs an element of rank r in S. We say that the order-statistic query is answered with ε-accuracy if the output element is guaranteed to have rank within r ± εn. For simplicity, we will assume throughout that εn is an integer. If 1/ε is an integer, then this is easily enforced by batching observations 1/ε at a time. If 1/ε is not an integer, then let i be an integer such that 1/2^{i+1} < ε < 1/2^i. We can then replace the ε-accuracy requirement by ε′ = 1/2^{i+1}, which is within a factor of two of the original requirement.

2.1 Quantile Summary

Following [10], we define a quantile summary for a set S to be an ordered set Q = {q_1, q_2, ..., q_ℓ} along with two functions rmin_Q and rmax_Q such that

(i) q_1 ≤ q_2 ≤ ··· ≤ q_ℓ and q_i ∈ S for 1 ≤ i ≤ ℓ.
(ii) For 1 ≤ i ≤ ℓ, each q_i has rank at least rmin_Q(q_i), and at most rmax_Q(q_i) in S.
(iii) Finally, q_1 and q_ℓ are the smallest and the largest elements, respectively, in the set S, that is, rmin_Q(q_1) = rmax_Q(q_1) = 1, and rmin_Q(q_ℓ) = rmax_Q(q_ℓ) = |S|.

We will say that Q is a relaxed quantile summary if it satisfies properties (i) and (ii) above, and the following relaxation of property (iii): rmax_Q(q_1) ≤ ε|S| and rmin_Q(q_ℓ) ≥ (1 − ε)|S|. We say that a summary Q is an ε-approximate quantile summary for a set S if it can be used to answer any order statistic query over S with ε-accuracy. That is, it can be used to compute the desired order-statistic within a rank error of at most ε|S|. The proposition below describes a sufficient condition on the functions rmin_Q and rmax_Q to ensure an ε-approximate summary.

Proposition 1 ([10]) Let Q be a relaxed quantile summary such that it satisfies the condition max_{1≤i<ℓ}(rmax_Q(q_{i+1}) − rmin_Q(q_i)) ≤ 2ε|S|. Then Q is an ε-approximate summary.

Proof Let r = ⌈φ|S|⌉. We will identify an index i such that r − ε|S| ≤ rmin_Q(q_i) and rmax_Q(q_i) ≤ r + ε|S|. Clearly, such a value q_i approximates the φ-quantile to within the claimed error bounds. We now argue that such an index i must always exist. Let e = max_i(rmax_Q(q_{i+1}) − rmin_Q(q_i))/2. Consider first the case r ≥ |S| − e. We have rmin_Q(q_ℓ) ≥ (1 − ε)|S|, and therefore i = ℓ has the desired property. We now focus on the case r < |S| − e, and start by choosing the smallest index j such that rmax_Q(q_j) > r + e. If j = 1, then j is the desired index since r + e < rmax_Q(q_1) ≤ ε|S|. Otherwise, j ≥ 2, and it follows that r − e ≤ rmin_Q(q_{j−1}): if r − e > rmin_Q(q_{j−1}), then rmax_Q(q_j) − rmin_Q(q_{j−1}) > 2e, a contradiction since e = max_i(rmax_Q(q_{i+1}) − rmin_Q(q_i))/2. By our choice of j, we have rmax_Q(q_{j−1}) ≤ r + e. Thus i = j − 1 is an index with the above described property. □

In what follows, whenever we refer to a (relaxed) quantile summary as ε-approximate, we assume that it satisfies the conditions of Proposition 1.
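To make the role of rmin_Q and rmax_Q concrete, the following minimal Python sketch (not part of the chapter; names are illustrative) stores a summary as sorted (value, rmin, rmax) triples and answers an order-statistic query by the same search used in the proof of Proposition 1.

    import math

    def query(entries, phi, n, eps):
        """entries: list of (value, rmin, rmax), sorted by value, covering n items.
        Returns a value whose rank is within eps*n of r = ceil(phi*n)."""
        r = math.ceil(phi * n)
        for value, rmin, rmax in entries:
            if r - eps * n <= rmin and rmax <= r + eps * n:
                return value
        # By Proposition 1 such an entry exists whenever the summary is eps-approximate.
        return entries[-1][0]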

2.2 Operations

We now describe two operations that produce new quantile summaries from existing summaries, and compute bounds on the precision of the resulting summaries.

The Combine Operation

Let Q′ = {x_1, x_2, ..., x_a} and Q″ = {y_1, y_2, ..., y_b} be two quantile summaries. The operation combine(Q′, Q″) produces a new quantile summary Q = {z_1, z_2, ..., z_{a+b}} by simply sorting the union of the elements in the two summaries, and defining new rank functions for each element as follows. W.l.o.g. assume that z_i corresponds to some element x_r in Q′. Let y_s be the largest element in Q″ that is not larger than x_r (y_s is undefined if there is no such element), and let y_t be the smallest element in Q″ that is not smaller than x_r (y_t is undefined if there is no such element). Then

rmin_Q(z_i) = rmin_{Q′}(x_r) if y_s is undefined, and rmin_{Q′}(x_r) + rmin_{Q″}(y_s) otherwise;

rmax_Q(z_i) = rmax_{Q′}(x_r) + rmax_{Q″}(y_s) if y_t is undefined, and rmax_{Q′}(x_r) + rmax_{Q″}(y_t) − 1 otherwise.
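A direct transcription of these rank rules into code may help; the sketch below (Python, illustrative names, not the authors' implementation) merges two summaries stored as sorted (value, rmin, rmax) triples.

    import bisect

    def combine(qa, qb):
        """qa, qb: quantile summaries as lists of (value, rmin, rmax), each sorted
        by value.  Returns the merged summary following the rank rules above."""
        def merged_from(own, other):
            other_vals = [v for v, _, _ in other]
            result = []
            for v, rmin, rmax in own:
                s = bisect.bisect_right(other_vals, v) - 1   # largest y_s <= v
                t = bisect.bisect_left(other_vals, v)        # smallest y_t >= v
                new_rmin = rmin + (other[s][1] if s >= 0 else 0)
                if t < len(other):
                    new_rmax = rmax + other[t][2] - 1
                elif s >= 0:
                    new_rmax = rmax + other[s][2]
                else:
                    new_rmax = rmax                           # other summary is empty
                result.append((v, new_rmin, new_rmax))
            return result
        return sorted(merged_from(qa, qb) + merged_from(qb, qa), key=lambda e: e[0])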

Lemma 1 Let Q′ be an ε′-approximate quantile summary for a multiset S′, and let Q″ be an ε″-approximate quantile summary for a multiset S″. Then combine(Q′, Q″) produces an ε-approximate quantile summary Q for the multiset S = S′ ∪ S″, where ε = (n′ε′ + n″ε″)/(n′ + n″) ≤ max{ε′, ε″}. Moreover, the number of elements in the combined summary is equal to the sum of the number of elements in Q′ and Q″.

Proof Let n′ and n″ respectively denote the number of observations covered by Q′ and Q″. Consider any two consecutive elements z_i, z_{i+1} in Q. By Proposition 1, it suffices to show that rmax_Q(z_{i+1}) − rmin_Q(z_i) ≤ 2(ε′n′ + ε″n″). We analyze two cases. First, z_i, z_{i+1} both come from a single summary, say elements x_r, x_{r+1} in Q′. Let y_s be the largest element in Q″ that is smaller than x_r and let y_t be the smallest element in Q″ that is larger than x_{r+1}. Observe that if y_s and y_t are both defined, then they must be consecutive elements in Q″:

rmax_Q(z_{i+1}) − rmin_Q(z_i) ≤ (rmax_{Q′}(x_{r+1}) + rmax_{Q″}(y_t) − 1) − (rmin_{Q′}(x_r) + rmin_{Q″}(y_s))
  ≤ (rmax_{Q′}(x_{r+1}) − rmin_{Q′}(x_r)) + (rmax_{Q″}(y_t) − rmin_{Q″}(y_s)) − 1
  ≤ 2ε′n′ + 2ε″n″ = 2(ε′n′ + ε″n″).

Otherwise, if only y_s is defined, then it must be the largest element in Q″; or if only y_t is defined, it must be the smallest element in Q″. A similar analysis can be applied to both these cases as well. Next we consider the case when z_i and z_{i+1} come from different summaries, say, z_i corresponds to x_r in Q′ and z_{i+1} corresponds to y_t in Q″. Then observe that x_r is the largest element smaller than y_t in Q′ and that y_t is the smallest element larger than x_r in Q″. Moreover, x_{r+1} is the smallest element in Q′ that is larger than y_t, and y_{t−1} is the largest element in Q″ that is smaller than x_r. Using these observations, we get

rmax_Q(z_{i+1}) − rmin_Q(z_i) ≤ (rmax_{Q″}(y_t) + rmax_{Q′}(x_{r+1}) − 1) − (rmin_{Q′}(x_r) + rmin_{Q″}(y_{t−1}))
  ≤ (rmax_{Q″}(y_t) − rmin_{Q″}(y_{t−1})) + (rmax_{Q′}(x_{r+1}) − rmin_{Q′}(x_r)) − 1
  ≤ 2ε′n′ + 2ε″n″ = 2(ε′n′ + ε″n″). □

Corollary 1 Let Q be a quantile summary produced by repeatedly applying the combine operation to an initial set of summaries {Q_1, Q_2, ..., Q_q} such that Q_i is an ε_i-approximate summary. Then regardless of the sequence in which combine operations are applied, the resulting summary Q is guaranteed to be (max_{i=1}^{q} ε_i)-approximate.

Proof By induction on q. The base case of q = 2 follows from Lemma 1. Otherwise, q > 2, and we can partition the set of indices I = {1, 2, ..., q} into two disjoint sets I_1 and I_2 such that Q is the result of the combine operation applied to a summary Q′ resulting from a repeated application of combine to {Q_i | i ∈ I_1}, and a summary Q″ resulting from a repeated application of combine to {Q_i | i ∈ I_2}. By the induction hypothesis, Q′ is max_{i∈I_1} ε_i-approximate and Q″ is max_{i∈I_2} ε_i-approximate. By Lemma 1, Q must then be max_{i∈I_1∪I_2} ε_i = max_{i∈I} ε_i-approximate. □

The Prune Operation

The prune operation takes as input an ε-approximate quantile summary Q′ and a parameter K, and returns a new summary Q of size at most K + 1 such that Q is an (ε + 1/(2K))-approximate quantile summary for S. Thus prune trades off slightly on accuracy for potentially much reduced space. We generate Q by querying Q′ for elements of rank 1, |S|/K, 2|S|/K, ..., |S|, and for each element q_i ∈ Q, we define rmin_Q(q_i) = rmin_{Q′}(q_i) and rmax_Q(q_i) = rmax_{Q′}(q_i).

Lemma 2 Let Q′ be an ε-approximate quantile summary for a multiset S. Then prune(Q′, K) produces an (ε + 1/(2K))-approximate quantile summary Q for S containing at most K + 1 elements.

Proof For any pair of consecutive elements q_i, q_{i+1} in Q, rmax_Q(q_{i+1}) − rmin_Q(q_i) ≤ (1/K + 2ε)|S|. By Proposition 1, it follows that Q must be (ε + 1/(2K))-approximate. □
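The following sketch implements prune over the same (value, rmin, rmax) representation used above; it assumes the summary's precision eps is known, so that rank queries can be answered as in Proposition 1. It is an illustration of Lemma 2, not reference code.

    import math

    def prune(q, K, n, eps):
        """q: an eps-approximate summary as sorted (value, rmin, rmax) triples over
        n items.  Returns at most K+1 entries; by Lemma 2 the result is
        (eps + 1/(2K))-approximate."""
        def query_rank(r):
            # same search as in the Proposition 1 sketch above
            for e in q:
                if r - eps * n <= e[1] and e[2] <= r + eps * n:
                    return e
            return q[-1]
        out = []
        for i in range(K + 1):
            r = 1 if i == 0 else math.ceil(i * n / K)
            e = query_rank(r)
            if not out or out[-1] is not e:     # skip duplicates of the same entry
                out.append(e)
        return out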

3 Deterministic Algorithms

In this section, we will develop a unified framework that captures many of the known deterministic algorithms for computing approximate quantile summaries. This framework appeared in the work of Manku, Rajagopalan, and Lindsay [14], and we refer to it as the MRL framework. We show that various earlier approaches for computing approximate quantile summaries are all captured by this framework, and that the best-possible algorithm in this framework computes an ε-approximate quantile summary using O(log²(εn)/ε) space. We then present an algorithm due to Greenwald and Khanna [10] that deviates from this framework and reduces the space needed to O(log(εn)/ε). This is the current best known bound on the space needed for computing an ε-approximate quantile summary. We start with some classical results on exact algorithms for selection.

3.1 Exact Selection

In a natural but a restricted model of computation, Munro and Patterson [16] estab- lished almost tight bounds on deterministic selection with bounded space. In their model, the only operation that is allowed on the underlying elements is a pairwise comparison. At any time, the summary stores a subset of the elements in the data stream. They considered multi-pass algorithms and showed that any comparison- based algorithm that solves the selection problem in p passes requires Ω(n1/p) space. Moreover, there is a simple algorithm that can solve the selection problem in p passes using only O(n1/p(log n)2−2/p) space. We will sketch here the proofs of both these results. We start with the lower bound result. We focus on the problem of determining the median element using space s.Fix any deterministic algorithm and let us consider the first pass made by the algorithm. Without any loss of generality, we may assume that the first s elements seen by the algorithm get stored in the summary Q. Now each time an element x is brought into Q, some element y is evicted from the summary Q.LetU(y) denote the set of elements that were evicted from Q to make room for element y directly or indirectly. An element z is indirectly evicted by an element y in Q if the element y evicted to make room for y directly or indirectly evicted the element z. Clearly, x will never get compared to any elements in U(y)or the element y.WesetU(x)= U(y)∪{y}. The adversary now ensures that x is indistinguishable from any element in U(x) with respect to the elements seen so far. Let z1,z2,...,zs be the elements in Q s | |= − after the first n/2 elements have been seen. Then i=1 U(zi) n/2 s, and by the pigeonhole principle, there exists an element zj ∈ Q such that |zj ∪ U(zj )|≥ n/(2s). The adversary now adjusts the values of the remaining n/2 elements so as to ensure that the median element for the entire sequence is the median element of the set U(zj ) ∪{zj }. Thus after one pass, the problem size reduces by a factor of 2s at most. For the algorithm to succeed in p passes, we must have (2s)p ≥ n, that is, s = Ω(n1/p).

Theorem 1 ([16]) Any p-pass comparison-based algorithm to solve the selection problem on a stream of n elements requires Ω(n1/p) space.

Recently, Guha and McGregor [12], using techniques from communication com- plexity, have shown that any p-pass algorithm to solve the selection problem on a stream of n elements requires Ω(n1/p/p6) bits of space. The Munro and Patterson algorithm that almost achieves the space bound given in Theorem 1 proceeds as follows. The algorithm maintains at all times a left and a right “filter” such that the desired element is guaranteed to lie between them. At the beginning, the left filter is assumed to be −∞ and the right filter is assumed to be +∞. Starting with an initial bound of n candidate elements contained between the left and the right filters, the algorithm in each pass gradually tightens the gap between the filters until the final pass where it is guaranteed to be less than s.The final pass is then used to determine the exact rank of the filters and retain all candi- dates in between to output the appropriate answer to the selection problem. The key property that is at the core of their algorithm is as follows. 52 M.B. Greenwald and S. Khanna

Lemma 3 If at the beginning of a pass, there are at most k elements that can lie between the left and the right filters, then at the end of the pass, this number reduces to O((k log2 k)/s).

Thus each pass of the algorithm may be viewed as an approximate selection step, with each step refining the range of the approximation achieved by the preceding step. We describe the precise algorithmic procedure to achieve this in the next subsection. Assuming the lemma, it is easy to see that by choosing s = Θ(n^{1/p}(log n)^{2−2/p}), we can ensure that after the ith pass, the number of candidate elements between the filters reduces to at most n^{(p−i)/p} log^{2i/p} n. Setting i = p − 1 ensures that the number of candidate elements in the pth pass is at most n^{1/p}(log n)^{2−2/p}.

3.2 MRL Framework for -Approximate Quantile Summaries

A natural way to construct quantile summaries of large quantities of data is by merg- ing several summaries of smaller quantities of data. Manku, Rajagopalan, and Lind- say [14] noted that all one-pass approximate quantile algorithms prior to their work (most notably, [1, 16]) fit this pattern. They defined a framework, referred from here on as the MRL framework, in which all algorithms can be expressed in terms of two basic operations on existing quantile summaries, NEW and COLLAPSE. Each algorithm in the framework builds a quantile summary by applying these opera- tions to members of a set of smaller, fixed size, quantile summaries. These fixed size summaries are referred to as buffers. A buffer is a quantile summary of size k that summarizes a certain number of observations. When a buffer summarizes k  k  observations we define the weight of the buffer to be k . NEW fills a buffer with k new observations from the input stream (we assume that n is always an integer multiple of k). COLLAPSE takes a set of buffers as input and returns a single buffer summarizing all the input buffers. Each algorithm in the framework is parameterized by b, the total number of buffers, and k, the number of entries per buffer, needed to summarize a sample of size n to precision ,aswellasapolicy that determines when to apply NEW and COLLAPSE. Further, the authors of [14] proposed a new algorithm that improved upon the space bk needed to summarize n observations to a given precision .In light of more recent work, it is illuminating to recast the MRL framework in terms of rminQ and rmaxQ for each entry in the buffer. A buffer of weight w in the MRL framework is a quantile summary of kw ob- servations, where k is the number of (sorted) entries in the buffer. Each entry in the buffer consists of a single value that represents w observations in the original data stream. We associate a level, l, with each buffer. Buffers created by NEW have a  level of 0, and buffers created by COLLAPSE have a level, l , that is one greater than the maximum l of the constituent buffers. Quantiles and Equi-depth Histograms over Streams 53

NEW takes the next k observations from the input stream, sorts them in ascending order, and stores them in a buffer, setting the weight of this new buffer to 1. This buffer can reproduce the entire sequence of k observations and therefore rmin and rmax of the ith element both equal i, and the buffer has precision that can satisfy even ε = 0. COLLAPSE summarizes a set of α buffers, B_1, B_2, ..., B_α, with a single buffer B, by first calling combine(B_1, B_2, ..., B_α), and then prune(B, k − 1). The weight of this new buffer is Σ_j w_j.

Lemma 4 Let B be a buffer created by invoking COLLAPSE on a set of α buffers B_1, B_2, ..., B_α, where each buffer B_i has weight w_i and precision ε_i. Then B has precision ε ≤ 1/(2k − 2) + max_i{ε_i}.

Proof By Corollary 1, repeated application of combine creates a temporary summary Q with precision max_i{ε_i} and αk entries. By Lemma 2, B = prune(Q, k − 1) produces a summary with an ε that is 1/(2k − 2) more than the precision of Q. □
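Putting the two operations together, a COLLAPSE step in this framework can be sketched as follows, reusing the combine and prune sketches above. The buffer representation (a dict with entries, weight, level, and count) and the helper eps_of, which reports each buffer's current precision, are assumptions made for illustration only.

    from functools import reduce

    def collapse(buffers, k, eps_of):
        """buffers: list of dicts {'entries': [(v, rmin, rmax), ...], 'weight': w,
        'level': l, 'count': k*w}.  Returns a single buffer summarizing them all."""
        merged = reduce(combine, (b['entries'] for b in buffers))
        total = sum(b['count'] for b in buffers)
        eps = max(eps_of(b) for b in buffers)        # Corollary 1
        entries = prune(merged, k - 1, total, eps)   # adds 1/(2(k-1)) to the precision
        return {'entries': entries,
                'weight': sum(b['weight'] for b in buffers),
                'level': 1 + max(b['level'] for b in buffers),
                'count': total}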

We can view the execution of an algorithm in this framework as a tree. Each node represents the creation of a new buffer of size k: leaves represent NEW operations and internal nodes represent COLLAPSE. Figure 1 represents an example of such a tree. The number next to each node specifies the weight of the resulting buffer. The level, l, of a buffer represents its height in this tree. Lemma 4 shows that each COLLAPSE operation adds at most 1/(2k) to the pre- cision of the buffer. It follows from repeated application of Lemma 4 that a buffer of level l has a precision of l/2k. Similarly, we can relate the precision of the final summary to the height of the tree.

Corollary 2 Let h(n) denote the maximum height of the algorithm tree on an in- put stream of n elements. Then the final summary produced by the algorithm is (h(n)/(2k))-approximate.

We now apply the lemmas to several different algorithms, in order to compute the space requirements for a given precision and a given number of observations we wish to summarize.

The MunroÐPatterson Algorithm

The Munro–Patterson algorithm [16] initially allocates b empty buffers. After k new observations arrive, if an empty buffer exists, then NEW is invoked. If no empty buffer exists, then it creates an empty buffer by calling COLLAPSE on two buffers of

equal weight. Figure 1 represents the Munro–Patterson algorithm for small values of b. Let h = ⌈log(n/k)⌉. Since the algorithm merges at each step buffers of equal weight, it follows that the resulting tree is a balanced binary tree of height h where the leaves represent k observations each, and each internal node corresponds to k2^i observations for some integer 1 ≤ i ≤ h. The number of available buffers b must satisfy the constraint b ≥ h, since if n = k(2^h − 1), the resulting summary requires h buffers with distinct weights of 1, 2, 4, etc. By Corollary 2, the resulting summary is guaranteed to be h/(2k)-approximate. Given a desired precision ε, we need to satisfy h/(2k) ≤ ε. It is easy to verify that choosing k = ⌈log(2εn)/(2ε)⌉ satisfies the precision requirement. Thus the total space used by this algorithm is bk = O(log²(εn)/ε).

Fig. 1 Tree representations of the Munro–Patterson algorithm for b = 3 and 5. Note that the final state of the algorithm under a root consists of the pair of buffers that are the children of the root. The shape and height of a tree depend only on b, and are independent of k, but the precision, ε, and the number of observations summarized, n, are both functions of k

¹ The prune phase of COLLAPSE in the MRL paper differs very slightly from our prune.

The AlsabtiÐRankaÐSingh Algorithm

The Alsabti–Ranka–Singh algorithm [1] allocates b buffers and divides them equally into two classes. The first b/2 buffers are reserved for the leaves of the tree. Each group of kb/2 observations is collected into b/2 buffers using NEW, and then COLLAPSEd into a single buffer from the second class. This process is repeated b/2 times, resulting in b/2 buffers of weight b/2 as children of the root. After the last such operation, the b/2 leaf buffers are discarded. The depth of the Alsabti–Ranka–Singh tree (see Fig. 2) is always 2, so by Corollary 2, k ≥ 1/ε. We need k(b/2)² ≥ n to cover all the observations. Given that b increases coverage quadratically and k only linearly, it is most efficient to choose the largest b and smallest k that satisfy the above constraints if we wish to minimize bk. The smallest k is 1/ε, hence (b/2)² ≥ εn, so b ≥ 2√(εn), and bk = O(√(n/ε)).

The MankuÐRajagopalanÐLindsay Algorithm

It is natural to try to devise the best algorithm possible within the MRL framework. It is easy to see that, for a given b and k, the more leaves an algorithm tree has,

the more observations it summarizes. Also, from Corollary 2, the shallower the tree, the more precise the summary is. Clearly, for a fixed b it is best to construct the shallowest and widest tree possible, in order to summarize the most observations with the finest precision. However, both algorithms presented above are inefficient in this light. For example, Alsabti–Ranka–Singh is not as wide as possible. After the algorithm fills the first b/2 buffers, it invokes COLLAPSE, leaving all buffers empty except for one buffer with precision 1/(2k) summarizing bk/2 observations. However, there is no need for it to call COLLAPSE at that point—there are b/2 empty buffers remaining. If it deferred calling COLLAPSE until after filling all b buffers, the result would again be all buffers empty except for one buffer with precision 1/(2k), but this time summarizing bk observations. Even worse, after b/2 calls to COLLAPSE, Alsabti–Ranka–Singh discards the b/2 "leaf buffers", although if it kept those buffers and continued collecting, it could keep on collecting, roughly, a factor of b/2 times as many observations with no loss of precision.

Fig. 2 Tree representation of the Alsabti–Ranka–Singh algorithm. For any specific choice of b1 and b2, with b1 ≠ b2, the tree for b = b1 is not a subtree of the tree for b = b2. The precision, ε, and the number of observations summarized, n, are both functions of k

The Munro–Patterson algorithm does use empty buffers greedily. However, it is not as shallow as possible. Munro–Patterson requires a tree of height log β to combine β buffers because it only COLLAPSEs pairs of buffers at a time instead of combining the entire set at once. Had Munro–Patterson called COLLAPSE on the entire set in a single operation, it would end with a buffer with log β/(2k) higher precision (there is a loss of precision of 1/(2k) for each call to COLLAPSE).

The new Manku–Rajagopalan–Lindsay (MRL) algorithm [14] aims to use the buffers as efficiently as possible—to build the shallowest, widest tree it can for a fixed b. The MRL algorithm never discards buffers; it uses any buffers that are available to record new observations. The basic approach taken by MRL is to keep the algorithm tree as wide as possible. It achieves this by labeling each buffer B_j with a level L_j, which denotes its height (see Fig. 3). Let l denote the smallest value of L_j over all existing, full, buffers. The MRL policy is to allocate new buffers at level 0 until the buffer pool is exhausted, and then to call COLLAPSE on all buffers of level l. More specifically, MRL considers two cases:

• Empty buffers exist. Call NEW on each and assign level 0 to them. • No empty buffers exist. Call COLLAPSE on all buffers of level l and assign the output buffer a level L of l + 1. 56 M.B. Greenwald and S. Khanna

Fig. 3 Tree representation of Manku–Rajagopalan–Lindsay algorithm

If level l contains only 2 buffers, then COLLAPSE frees only a single buffer which NEW assigns level 0. When that buffer is filled, it is the only buffer at level 0. Calling COLLAPSE on a single buffer merely increments the level without modifying the buffer. This will continue until the buffer is promoted to level l, where other buffers exist. Thus MRL treats a third case specially:

• Precisely one empty buffer exists. Call NEW on it and assign it level l.

Proposition 2 In the tree representing the COLLAPSEs and NEWs in an MRL algorithm with b buffers, the number of leaves in a subtree of height h, L(b, h), is C(b + h − 2, h − 1).

Proof We will prove by induction on h that L(b, h) = C((b − 1) + (h − 1), h − 1). For h = 1 the tree is a single node, a leaf. So, for all b, L(b, 1) = 1 = C(b − 1, 0). Assume that for all h′ < h and all b, L(b, h′) = C((b − 1) + (h′ − 1), h′ − 1).

The leaf buffers must include all n observations, so by Proposition 2, b, k, and h must be chosen such that kL(b, h) = kC(b + h − 2, h − 1) ≥ n. The summary must be ε-approximate, so by Corollary 2, h/(2k) ≤ ε. We must choose b, k, and h to satisfy these constraints while minimizing bk. Increasing h (up to the point where we would violate the precision requirement) increases space-efficiency because larger h covers more observations without increasing the memory footprint, bk. The largest h that bounds the precision of the summary to be within ε is h = 2εk. Now that h can be computed as a function of k and ε, we can focus on choosing the best values of b and k to minimize the space bk required.

We first show that if a pair b, k is space-efficient—meaning that no other pair b′, k′ could cover more observations in the same space bk—then k = O(b/ε). The number of observations covered by the MRL algorithm for a pair b, k is kL(b, 2εk). L(b, h) is symmetric in b and h (e.g., L(b, h) = L(h, b)). The symmetry of L implies that k ≥ b/(2ε): else we could have used our space more efficiently by swapping b and 2εk (taking b′ = 2εk and k′ = b/(2ε)), yielding the same value of L(b, 2εk). Then bk = b′k′, implying that our space requirements were equivalent, but the larger value of k′ would mean that we cover more observations (as long as ε < 0.5). Consequently, we would never choose k < b/(2ε). Conversely, suppose k > 15b/ε. But, if so, we could more efficiently choose k′ = k/2 and b′ = 2b, using the same space but covering more observations. b′ and k′ would cover more observations because (k/2)L(2b, εk) > kL(b, 2εk) when k > 15b/ε. It follows that if a pair b, k is space-efficient, then k is bounded by b/(2ε) ≤ k ≤ 15b/ε.

If k = O(b/ε), then we need only find the minimum b such that (b/(2ε))C(2b − 2, b − 1) ≈ n. Roughly, b = O(log(2εn)), and k = O((1/(2ε)) log(2εn)), so bk = O((1/(2ε)) log²(2εn)).

3.3 The GK Algorithm

The new MRL algorithm was designed to be the best possible algorithm within the MRL framework. Nevertheless, it still suffers from some inefficiencies. For a given memory requirement, bk, the precision is reduced (that is, ε is increased) by three factors. First, each time COLLAPSE is called, combining the α buffers together increases the gap between rmin and rmax of elements in that buffer by the sum of the gaps between the individual rmin and rmax in all the buffers—because the algorithms do not maintain any information that can allow them to recover how the deleted entries in each buffer may have been interleaved. Second, each COLLAPSE invokes the prune operation, which increases ε by 1/(2k). Finally, as is true for all algorithms in the MRL framework, MRL keeps no per-entry information about rmin and rmax for individual entries. Rather, we must assume that every entry in a buffer has the worst rmax − rmin. We next describe an algorithm due to Greenwald and Khanna [10] that overcomes some of these drawbacks by not using combine and prune. It yields an ε-approximate quantile summary using only O((log εn)/ε) space.

The GK Summary Data Structure

At any point in time n, GK maintains a summary data structure QGK(n) that consists of an ordered sequence of tuples which correspond to a subset of the observations seen thus far. For each observation v in QGK, we maintain implicit bounds on the minimum and the maximum possible rank of the observation v among the first n observations. Let rminGK(v) and rmaxGK(v) denote respectively the lower and upper bounds on the rank of v among the observations seen so far. Specifically,

Q_GK consists of tuples t_0, t_1, ..., t_{s−1}, where each tuple t_i = (v_i, g_i, Δ_i) consists of three components: (i) a value v_i that corresponds to one of the elements in the data sequence seen thus far, (ii) the value g_i, which equals rmin_GK(v_i) − rmin_GK(v_{i−1}) (for i = 0, g_i = 0), and (iii) Δ_i, which equals rmax_GK(v_i) − rmin_GK(v_i). Note that v_0 ≤ v_1 ≤ ··· ≤ v_{s−1}. We ensure that, at all times, the maximum and the minimum values are part of the summary. In other words, v_0 and v_{s−1} always correspond to the minimum and the maximum elements seen so far. It is easy to see that rmin_GK(v_i) = Σ_{j≤i} g_j and rmax_GK(v_i) = Σ_{j≤i} g_j + Δ_i. Thus g_i + Δ_i − 1 is an upper bound on the total number of observations that may have fallen between v_{i−1} and v_i. Finally, observe that Σ_i g_i equals n, the total number of observations seen so far.
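The tuple list and the derived rank bounds can be kept with very little code; the sketch below (Python, hypothetical class name) only fixes the bookkeeping and is not the authors' implementation.

    class GKSummary:
        """GK tuple list: each tuple is (v, g, delta), with
        rmin(v_i) = sum_{j<=i} g_j and rmax(v_i) = rmin(v_i) + delta_i."""
        def __init__(self):
            self.tuples = []   # [(value, g, delta)], kept sorted by value
            self.n = 0         # number of observations seen so far

        def ranks(self):
            """Yield (value, rmin, rmax) for every tuple, via a prefix sum of g."""
            rmin = 0
            for v, g, delta in self.tuples:
                rmin += g
                yield v, rmin, rmin + delta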

Answering Quantile Queries

A summary of the above form can be used in a straightforward manner to provide -approximate answers to quantile queries. Proposition 1 forms the basis of our approach, and the following is an immediate corollary.

Corollary 3 If at any time n, the summary Q_GK(n) satisfies the property that max_i(g_i + Δ_i) ≤ 2εn, then we can answer any φ-quantile query to within an εn precision.
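Assuming the GKSummary sketch above, a quantile query can then be answered exactly as Corollary 3 suggests; the code below is illustrative only.

    import math

    def gk_quantile(summary, phi, eps):
        """Answer a phi-quantile from a GKSummary, assuming the invariant
        max_i(g_i + delta_i) <= 2*eps*n of Corollary 3 holds."""
        n = summary.n
        r = math.ceil(phi * (n - 1))
        for v, rmin, rmax in summary.ranks():
            if r - rmin <= eps * n and rmax - r <= eps * n:
                return v
        raise AssertionError("summary invariant violated")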

Overview

At a high level, our algorithm for maintaining the quantile summary proceeds as follows. Whenever the algorithm sees a new observation, it inserts in the summary a tuple corresponding to this observation. Periodically, the algorithm performs a sweep over the summary to “merge” some of the tuples into their neighbors so as to free up space. The heart of the algorithm is in this merge phase where we maintain several conditions that allow us to bound the space used by QGK at any time. We next develop some basic concepts that are needed to precisely describe these conditions. Quantiles and Equi-depth Histograms over Streams 59

Tuple Capacities

When a new tuple t_i is added to the summary at time n, we set its g_i value to be 1 and its Δ_i value to be ⌊2εn⌋ − 1.² All summary operations maintain the property that the Δ_i value never changes. By Corollary 3, it suffices to ensure that at all times greater than 1/(2ε), max_i(g_i + Δ_i) ≤ 2εn. (For times earlier than 1/(2ε) the summary preserves every observation, the error is always zero, and Δ_i = 0 for all i.) Motivated by this consideration, we define the capacity of a tuple t_i at any time n′, denoted by cap(t_i, n′), to be ⌊2εn′⌋ − Δ_i. Thus the capacity of a tuple increases over time. An individual tuple is said to be full at time n′ if g_i + Δ_i = ⌊2εn′⌋. The capacity of an individual tuple is, therefore, the maximum number of observations that can be counted by g_i before the tuple becomes full.

Bands

Tuples with higher capacity correspond to values whose ranks are known with higher precision. Intuitively, high capacity tuples are more valuable than lower capacity tuples, and our merge rules will favor elimination of lower capacity tuples by merging them into larger capacity ones. However, we will find it convenient to not differentiate among tuples whose capacities are within a small multiplicative factor of one another. We thus group tuples into geometric classes referred to as bands where, roughly speaking, a tuple t_i is in band α if cap(t_i, n) ≈ 2^α. Since capacities increase over time, the band of a tuple increases over time. We will find it convenient to ensure the following stability property in assigning bands: if at some time n we have band(t_i, n) = band(t_j, n), then for all times n′ ≥ n, we have band(t_i, n′) = band(t_j, n′). We thus use a slightly more technical definition of bands. Let p = ⌊2εn⌋ and α̂ = ⌈log₂ p⌉. Then we say band(t_i, n) is α if

2^{α−1} + (p mod 2^{α−1}) ≤ cap(t_i, n) < 2^α + (p mod 2^α).

If the Δ value of tuple t_i is p, then we say band(t_i, n) is 0. It follows from the definition of band that at all times the first 1/(2ε) observations, with Δ = 0, are alone in band α̂. We will denote by band(t_i, n) the band of tuple t_i at time n, and by band_α(n) all tuples (or equivalently, the capacities associated with these tuples) that have a band value of α. The terms (p mod 2^{α−1}) and (p mod 2^α) above ensure the following stability property: once a pair of tuples is in the same band, they stay together from there on even as band boundaries are modified. At the same time, these terms do not change the underlying geometric grouping in any fundamental manner; band_α(n) contains either 2^{α−1} or 2^α distinct capacity values.
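Because the band computation is easy to get wrong off by one, a direct transcription of the definition may be useful; the sketch below is illustration code only, with the final fallback justified by the property above that Δ = 0 tuples sit alone in band α̂.

    import math

    def band(delta, n, eps):
        """Band of a tuple with error term 'delta' after n observations."""
        p = int(2 * eps * n)                 # p = floor(2*eps*n)
        if delta == p:
            return 0
        alpha_hat = math.ceil(math.log2(p)) if p >= 1 else 0
        cap = p - delta                      # capacity of the tuple
        for alpha in range(1, alpha_hat + 1):
            lo = 2 ** (alpha - 1) + (p % 2 ** (alpha - 1))
            hi = 2 ** alpha + (p % 2 ** alpha)
            if lo <= cap < hi:
                return alpha
        # remaining tuples (those with delta == 0) belong to band alpha_hat
        return alpha_hat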

² In practice, we set Δ more tightly by inserting (v, 1, g_i + Δ_i − 1) as the tuple immediately preceding t_{i+1}. It is easy to see that if Corollary 3 is satisfied before insertion, it remains true after insertion, and that ⌊2εn⌋ − 1 is an upper bound on the value of Δ_i. For the purpose of our analysis, we always assume the worst-case insertion value of Δ_i = ⌊2εn⌋ − 1.

Proposition 3
1. At any point in time n and for any α, with α̂ > α ≥ 1, the number of distinct capacity values that can belong to band_α(n) is either 2^{α−1} or 2^α.
2. If at some time n, any two tuples t_i, t_j are in the same band, then for all times n′ ≥ n, this holds true.
3. At any point in time n, and for any α, with α̂ ≥ α ≥ 0, the number of distinct capacity values that can belong to band_α(n) is ≤ 2^α.

Proof To see claim 1, if p mod 2α < 2α−1, then p mod 2α = p mod 2α−1, and α α−1 α−1 α bandα(n) contains 2 − 2 = 2 distinct values of capacity. If p mod 2 ≥ α−1 α α−1 α−1 α 2 , then p mod 2 = 2 + (p mod 2 ), and bandα(n) contains 2 distinct capacity values. To see claim 2, consider any bandα(n). Each time p increases by 1, if p mod 2α ∈{/ 0, 2α−1}, then both p mod 2α−1 and p mod 2α increase by 1 and thus the range of bandα(n) shifts by 1. At the same time, capacity of each tuple changes by 1 and thus the set of tuples that belong to bandα(n) stays unchanged. Now suppose when p increases by 1, p mod 2α = 2α−1. It is easy to verify that for any value of p, there is at most one value of α that satisfies the equation p mod 2α = 2α−1. Then the left boundary of the band α decreases by 2α−1 − 1 while the right boundary increases by 1. The resulting band now captures all tuples that belonged to old bands (α − 1) and α. Also, for all bands β where 1 ≤ β<α(and hence p mod 2β = 0), the range of the band changes from [2β − 1, 2β+1 − 1) to [2β−1, 2β ). Thus all tuples in the old band β now belong together to the new band β + 1. Finally, all bands γ>αsatisfy p mod 2γ ∈{/ 0, 2γ −1}, and hence continue to capture the same set of tuples as observed above. Thus once a pair of tuples is present in the same band, their bands never diverge again. Claim 3 holds for both band0 and bandαˆ —they both contain precisely one distinct capacity value. For all other values of α, claim 3 follows directly from claim 1. 

A Tree Representation

In order to decide on how tuples are merged in order to compress the summary, we create an ordered tree structure, referred to as the quantile tree, whose nodes correspond to the tuples in the summary. The tree structure creates a hierarchy based on tuple capacities and proximity. Specifically, the quantile tree T(n) at time n is created from the summary QGK(n) =t0,t1,...,ts−1 as follows. T(n) contains a node Vi for each ti along with a special root node R. The parent of a node Vi is the node Vj such that j is the least index greater than i with band(tj ,n)>band(ti,n). If no such index exists, then the node R is set to be the parent. The children of each node are ordered as they appear in the summary. All children (and all descendants) of a given node Vi correspond to tuples that have capacities smaller than that of Quantiles and Equi-depth Histograms over Streams 61

ε = 0.001, N = 7000, ⌊2εN⌋ = 14

Δ-range   Capacity   Band
0         14         3
1–8       6–13       2
9–12      2–5        1
13        1          0

Fig. 4 (a) Tuple representation. (b) Tuples labeled only with band numbers. (c) Corresponding tree representation

tuple ti . The relationship between QGK(n) and T(n) is represented pictorially in Fig. 4. We next highlight two useful properties of T(n).

Proposition 4 Each node Vi in a quantile tree T(n) satisfies the following two properties:

1. The children of each node Vi in T(n) are always arranged in non-increasing order of band in QGK(n). 2. The tuples corresponding to Vi and the set of all descendants of Vi in T(n) form a contiguous segment in QGK(n).

 Proof To see property 1, consider any two children Vj and Vj  of Vi with jband(tj ,n), Vj  and not Vi would be the parent of node Vj in T(n). We establish property 2 by contradiction. Consider a node Vi that violates the property. Let k be the largest integer such that Vk is a descendant of Vi , and let j be the largest index less than k such that Vj is a descendant of Vi while Vj+1 is 62 M.B. Greenwald and S. Khanna not. Also, let j<must be the parent of Vj . If band(t,n) band(V,n), it must be that Vy is the parent of Vk and not Vi . A contradiction! 

Operations

We now describe the various operations that we perform on our summary data struc- ture. We start with a description of external operations:

External Operations

QUANTILE(φ) To compute an ε-approximate φ-quantile from the summary

Q_GK(n) after n observations, compute the rank r = ⌈φ(n − 1)⌉. Find i such that both r − rmin_GK(v_i) ≤ εn and rmax_GK(v_i) − r ≤ εn, and return v_i.

INSERT(v) Find the smallest i such that v_{i−1} ≤ v < v_i, and insert the tuple (v, 1, ⌊2εn⌋ − 1) between t_{i−1} and t_i; increment s. As a special case, if v is the new minimum or maximum observation seen, insert the tuple (v, 1, 0).

INSERT(v) maintains correct relationships between g_i, Δ_i, rmin_GK(v_i) and rmax_GK(v_i). Consider that if v is inserted before v_i, the value of rmin_GK(v) may be as small as rmin_GK(v_{i−1}) + 1, and hence g_i = 1. Similarly, rmax_GK(v) may be as large as the current rmax_GK(v_i), which in turn is bounded to be within 2εn of rmin_GK(v_{i−1}). Note that rmin_GK(v_i) and rmax_GK(v_i) increase by 1 after insertion.
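For completeness, INSERT over the GKSummary sketch above can be written as follows; the handling of ties and of the very first observations is simplified for illustration.

    import bisect
    import math

    def gk_insert(summary, v, eps):
        """INSERT(v): place (v, 1, floor(2*eps*n) - 1) before the first tuple
        with value >= v; new extrema are inserted with delta = 0."""
        n = summary.n
        values = [t[0] for t in summary.tuples]
        i = bisect.bisect_left(values, v)
        if i == 0 or i == len(values):          # new minimum or new maximum
            summary.tuples.insert(i, (v, 1, 0))
        else:
            delta = max(math.floor(2 * eps * n) - 1, 0)
            summary.tuples.insert(i, (v, 1, delta))
        summary.n = n + 1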

Internal Operations

COMPRESS() The operation COMPRESS repeatedly attempts to find a suitable

segment of adjacent tuples in the summary Q_GK(n) and merges them into the neighboring tuple (i.e., the tuple that succeeds them in the summary). We first describe how the summary is updated when such a segment is merged into the succeeding tuple: the g value of the surviving tuple is increased by the sum of the g values of the merged tuples, which are then removed from the summary (see Fig. 5).

COMPRESS()
  for i from s − 2 to 0 do
    if (band(t_i, n) ≤ band(t_{i+1}, n)) and (g*_i + g_{i+1} + Δ_{i+1} < 2εn) then
      g_{i+1} = g_{i+1} + g*_i;
      Remove t_i and all its descendants;
    end if
  end for
end COMPRESS

Fig. 5 Pseudocode for COMPRESS

Initial State: Q_GK ← ∅; s = 0; n = 0.
Algorithm: To add the (n + 1)st observation, v, to summary Q_GK(n):
  if (n ≡ 0 mod 1/(2ε)) then
    COMPRESS();
  end if
  INSERT(v);
  n = n + 1;

Fig. 6 Pseudocode for the algorithm

COMPRESS chooses the segments to be merged in a specific manner, namely, it considers only those segments that correspond to a tuple t_i and all its descendants in the tree T(n). By Proposition 4, we know that t_i and all its descendants correspond to a segment of adjacent tuples t_{i−a}, t_{i−a+1}, ..., t_i in Q_GK(n). Let g*_i denote the sum of the g-values of the tuple t_i and all its descendants in T(n), that is, g*_i = Σ_{j=i−a}^{i} g_j. To maintain the ε-approximate guarantee for the summary, we ensure that a merge is done only if g*_i + g_{i+1} + Δ_{i+1} ≤ 2εn. Finally, COMPRESS ensures that we always merge into tuples of comparable or better capacity. The operation COMPRESS terminates if and only if there are no segments that satisfy the conditions above. Figure 5 describes an efficient implementation of the COMPRESS operation, and Fig. 6 depicts the pseudocode for the overall algorithm. Note that since COMPRESS never alters the Δ value of surviving tuples, it follows that Δ_i of any quantile entry remains unchanged once it has been inserted. COMPRESS inspects tuples from right (highest index) to left. Therefore, it first combines children (and their entire subtree of descendants) into parents. It combines siblings only when no more children can be combined into the parent.

Analysis

The INSERT as well as COMPRESS operations always ensure that g_i + Δ_i ≤ 2εn. As n always increases, it is easy to see that the data structure above maintains an ε-approximate quantile summary at each point in time. We will now establish that the total number of tuples in the summary Q_GK after n observations have been seen is bounded by (11/(2ε)) log(2εn). We start by defining a notion of coverage. We say that a tuple t_i in the quantile summary Q_GK covers an observation v at any time n if either the tuple for v has been directly merged into t_i or a tuple t that covered v has been merged into t_i. Moreover, a tuple always covers itself. It is easy to see that the total number of observations covered by t_i is exactly given by g_i = g_i(n). The lemmas below highlight some useful properties concerning coverage of observations by various tuples.

Lemma 5 At no point in time does a tuple t_i with a band value of α cover a tuple t which, if it were alive, would have a band value strictly greater than α.

Proof Note that the band of a tuple at any time n is completely determined by its Δ value. Since the Δ value never changes once a tuple is created, the notion of the band value of a tuple is well-defined even if the tuple no longer exists in the summary. Now suppose at some time n, the event described in the lemma occurs. The COMPRESS subroutine never merges a tuple t_i into an adjacent tuple t_{i+1} if the band of t_i is greater than the band of t_{i+1}. Thus the only way in which this event can occur is if at some earlier time m < n we had band(t_i, m) ≤ band(t_{i+1}, m), but at time n, band(t_i, n) > band(t_{i+1}, n). Consider first the case when band(t_i, m) = band(t_{i+1}, m). By Proposition 3, it cannot be the case that at some later time n, band(t_i, n) ≠ band(t_{i+1}, n). Now consider the case when band(t_i, m) < band(t_{i+1}, m). Then Δ_i > Δ_{i+1}, and hence band(t_i, n) ≤ band(t_{i+1}, n) for all n. □

Lemma 6 At any point in time n, and for any integer α, the total number of observations covered cumulatively by all tuples with band values in [0..α] is bounded by 2^α/ε.

Proof By Proposition 3, each band_β(n) contains at most 2^β distinct values of Δ. There are no more than 1/(2ε) observations with any given Δ, so at most 2^β/(2ε) observations were inserted with Δ ∈ band_β. By Lemma 5, no observations from bands > α will be covered by a node from band α. Therefore, the nodes in question can cover, at most, the total number of observations from all bands ≤ α. Summing over all β ≤ α yields an upper bound of 2^{α+1}/(2ε) = 2^α/ε. □

The next lemma shows that for any given band value α, only a small number of nodes can have a child with that band value.

Lemma 7 At any time n and for any given α, there are at most 3/(2ε) nodes in T(n) that have a child with band value α. In other words, there are at most 3/(2ε) parents of nodes from band_α(n).

Proof Let m_min and m_max respectively denote the earliest and the latest times at which an observation in band_α(n) could be seen. It is easy to verify that

m_min = (2εn − 2^α − (2εn mod 2^α)) / (2ε)   and

m_max = (2εn − 2^{α−1} − (2εn mod 2^{α−1})) / (2ε).

Thus, any parent of a node in band_α(n) must have Δ_i < 2εm_min. Fix a parent node V_i with at least one child in band_α(n), and let V_j be the rightmost such child. Denote by m_j the time at which the observation corresponding to V_j was seen. We will show that at least a (2ε/3)-fraction of all observations that arrived after time m_min can be uniquely mapped to the pair (V_i, V_j). This in turn implies that no more than 3/(2ε) such V_i's can exist, thus establishing the lemma. The main idea underlying our proof is that the fact that COMPRESS did not merge V_j into V_i implies there must be a large number of observations that can be associated with the parent–child pair (V_i, V_j).

We first claim that g*_j(n) + Σ_{k=j+1}^{i−1} g_k(n) ≥ g*_{i−1}(n). If j = i − 1, it is trivially true. Otherwise, the tuple t_{i−1} is distinct from t_j, and since V_j is a child of V_i (and not V_{i−1}), we know that band(t_{i−1}, n) ≤ band(t_j, n). Thus no tuple among t_1, t_2, ..., t_j could be a descendant of t_{i−1}. Therefore, Σ_{k=j+1}^{i−1} g_k(n) ≥ g*_{i−1}(n), and the claim follows.

Now since COMPRESS did not merge V_j into V_i, it must be the case that g*_{i−1}(n) + g_i(n) + Δ_i > 2εn. Using the claim above, we can conclude that g*_j(n) + Σ_{k=j+1}^{i−1} g_k(n) + g_i(n) + Δ_i > 2εn. Also, at time m_j, we had g_i(m_j) + Δ_i < 2εm_j. Since m_j is at most m_max, it must be that

g*_j(n) + Σ_{k=j+1}^{i−1} g_k(n) + g_i(n) − g_i(m_j) > 2ε(n − m_max).

Finally, observe that for any other such parent–child pair, V_{i′} and V_{j′}, the observations counted above by (V_j, V_i) and (V_{j′}, V_{i′}) are distinct. Since there are at most n − m_min total observations that arrived after m_min, we can bound the total number of such pairs by

(n − m_min) / (2ε(n − m_max)),

which can be verified to be at most 3/(2ε). □

We say that adjacent tuples (t_{i−1}, t_i) constitute a full pair of tuples at time n′ if g_{i−1} + g_i + Δ_i > 2εn′. Given such a full pair of tuples, we say that the tuple t_{i−1} is a left partner and t_i is a right partner in this full pair.

Lemma 8 At any time n and for any given α, there are at most 4/ε tuples from band_α(n) that are right partners in a full pair of tuples.

Proof Let X be the set of tuples in band_α(n) that participate as a right partner in some full pair. We first consider the case when the tuples in X form a single contiguous segment in Q_GK(n). Let t_i, ..., t_{i+p−1} be a maximal contiguous segment of band_α(n) tuples in Q_GK(n). Since these tuples are alive in Q_GK(n), it must be the case that

g*_{j−1} + g_j + Δ_j > 2εn   for i ≤ j < i + p.

Adding these p inequalities gives

Σ_{j=i}^{i+p−1} g*_{j−1} + Σ_{j=i}^{i+p−1} g_j + Σ_{j=i}^{i+p−1} Δ_j > 2pεn.

In particular, we can conclude that

2 Σ_{j=i−1}^{i+p−1} g*_j + Σ_{j=i}^{i+p−1} Δ_j > 2pεn.

The first term in the LHS of the above inequality counts twice the number of observations covered by nodes in band_α(n) or by one of their descendants in the tree T(n). Using Lemma 6, this sum can be bounded by 2(2^α/ε). The second term can be bounded by p(2εn − 2^{α−1}), since the largest possible Δ value for a tuple with a band value of α or less is 2εn − 2^{α−1}. Substituting these bounds, we get

2^{α+1}/ε + p(2εn − 2^{α−1}) > 2pεn.

Simplifying the above, we get p < 4/ε, as claimed by the lemma. Finally, the same argument applies when the nodes in X induce multiple segments in Q_GK(n); we simply consider the above summation over all such segments. □

Lemma 9 At any time n and for any given α, the maximum number of tuples possible from each band_α(n) is 11/(2ε).

Proof By Lemma 8, we know that the number of band_α(n) nodes that are right partners in some full pair can be bounded by 4/ε. Any other band_α(n) node either does not participate in any full pair or occurs only as a left partner. We first claim that each parent of a band_α(n) node can have at most one such node in band_α(n). To see this, observe that if a pair of non-full adjacent tuples t_i, t_{i+1}, where t_{i+1} ∈ band_α(n), is not merged, then it must be because band(t_i, n) is greater than α. But Proposition 4 tells us that this event can occur only once for any α, and therefore V_{i+1} must be the unique band_α(n) child of its parent that does not participate in a full pair. It is also easy to verify that for each parent node, at most one band_α(n) tuple can participate only as a left partner in a full pair. Finally, observe that only one of the above two events can occur for each parent node. By Lemma 7, there are at most 3/(2ε) parents of such nodes, and thus the total number of band_α(n) nodes can be bounded by 11/(2ε). □

Theorem 2 At any time n, the total number of tuples stored in Q_GK(n) is at most (11/(2ε)) log(2εn).

Proof There are at most 1 + ⌈log 2εn⌉ bands at time n. There can be at most 3/(2ε) total tuples in Q_GK(n) from bands 0 and 1. For the remaining bands, Lemma 9 bounds the maximum number of tuples in each band. The result follows. □

4 Randomized Algorithms

We present here two distinct approaches for using randomization to reduce the space needed. The first approach essentially samples the input elements, and presents the sample as input to a deterministic algorithm. The second approach uses hashing to randomly cluster the input elements, thus reducing the number of distinct input el- ements seen by the quantile summary. The sampling-based approaches presented here work only for the cash-register model. The hashing-based approach, on the other hand, works in the more general turnstile model that allows for deletion of el- ements. However, this latter approach requires that we know the number of elements in the input sequence in advance.

4.1 Sampling-Based Approaches

Sampling of elements offers a simple and effective way to reduce the space needed to create quantile summaries. In particular, the space needed can be made indepen- dent of the size of the data stream, if we are willing to settle for a probabilistic guarantee on the precision of the summary generated. The idea is to draw a random sample from the input, and run a deterministic algorithm on the sample to gener- ate the quantile summary. The size of the sample depends only on the probabilistic guarantee and the desired precision for the summary. The following lemma from Manku et al. [14] serves as a basis for this approach.

Lemma 10 ([14]) Let ε, δ ∈ (0, 1), and let S be a set of n elements. There exists an integer p = Θ((1/ε²) log(1/(εδ))) such that if we sample p elements from S uniformly at random, and create an (ε/2)-approximate quantile summary Q on the sample, then Q is an ε-approximate summary for S with probability at least 1 − δ.

If the length of the input sequence is known in advance, we can easily draw a sample of size p as required above, and maintain an (ε/2)-approximate quantile summary Q on the sample. The total space used by this approach is O((1/ε) log(εp)), giving us the following theorem.

Theorem 3 For any ε, δ ∈ (0, 1), we can compute with probability at least 1 − δ an ε-approximate quantile summary for a sequence of n elements using O((1/ε) log(1/ε) + (1/ε) log log(1/(εδ))) space, assuming the sequence size n is known in advance.
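A sketch of this known-length sampling approach is given below; the constant in the sample size and the build_summary hook (any deterministic (ε/2)-approximate summary constructor, for instance the GK sketch above) are illustrative assumptions, not the constants of [14].

    import math
    import random

    def sampled_summary(stream, n, eps, delta, build_summary):
        """Draw a uniform sample of p = O(log(1/(eps*delta)) / eps^2) positions
        from a stream of known length n, then summarize the sample at
        precision eps/2, as in Lemma 10."""
        p = min(n, int(4 / eps ** 2 * math.log(1 / (eps * delta)) + 1))
        keep = set(random.sample(range(n), p))      # positions to sample
        sample = [x for i, x in enumerate(stream) if i in keep]
        return build_summary(sample, eps / 2)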

When the length of the input sequence is not known a priori, one approach is to use a technique called reservoir sampling (discussed in detail in another chapter in this handbook) that maintains a uniform sample at all times. However, the elements in the sample are constantly being replaced as the length of the input sequence increases, and thus the quantile summary cannot be constructed incrementally. The sample must be stored explicitly and the observations can be fed to the deterministic algorithm only when we stop, and are certain the elements in the sample will not be replaced. Since the sample size dominates the space needed in this case, we get the following theorem.

Theorem 4 For any ε, δ ∈ (0, 1), we can compute with probability at least 1 − δ an ε-approximate quantile summary for a sequence of n elements using O((1/ε²) log(1/(εδ))) space.

Manku et al. [15] used a non-uniform sampling approach to get around the large space requirements imposed by the reservoir sampling. We state their main result below and refer the reader to the paper for more details.

Theorem 5 ([15]) For any ε, δ ∈ (0, 1), we can compute with probability at least 1 − δ an ε-approximate quantile summary for a sequence of n elements using O((1/ε) log²(1/ε) + (1/ε) log²(log(1/δ))) space.

4.2 The Count-Min Algorithm

The Count-Min (CM) algorithm [4] is a randomized approach for maintaining quantiles when the universe size is known in advance. Suppose all elements are drawn from a universe U = {1, 2, ..., M} of size M. The CM algorithm uses O((1/ε)(log² M)(log((log M)/δ))) space to answer any quantile query with ε-accuracy with probability at least 1 − δ. This is in contrast to the O((1/ε) log(εn)) space used by the deterministic GK algorithm. The two space bounds are incomparable in the sense that their relative quality depends on the relation between n and M. In addition to being quite simple, the main strength of the CM algorithm is that it works in the more general turnstile model, provided all element counts are non-negative throughout its execution. We start with a description of the basic data structure maintained by the CM algorithm and then describe how the data structure is adapted to handle quantile queries. A key concept underlying the CM data structure is that of universal hash families.

Universal Hash Families

For any positive integer m, a family H = {h_1, ..., h_k} of hash functions, where each h_i : U → [1..m], is a universal hash family if for any two distinct elements x, y ∈ U, we have

Pr_{h_i ∈ H}[h_i(x) = h_i(y)] ≤ 1/m.

The set of all possible hash functions h : U → [1..m] is easily seen to be a universal hash family, but the number of hash functions in this family is exponentially large. A beautiful result of Carter and Wegman [3] shows that there exist universal hash families with only O(M²) hash functions that can be constructed in polynomial time. Moreover, any function in the family can be described completely using O(log M) bits. We are now ready to describe the basic CM data structure.

Basic CM Data Structure

An (ε₀, δ₀) CM data structure consists of a p × q table T, where p = ⌈ln(1/δ₀)⌉ and q = ⌈e/ε₀⌉, and a universal hash family H such that each h ∈ H is a function h : U → [1..q]. We associate with each row i ∈ [1..p] a hash function h_i chosen uniformly at random from the hash family H. The table is initialized to all zeroes at the beginning. Whenever an update (x, c_x) arrives for some x ∈ U, we modify, for each 1 ≤ i ≤ p:

T[i, h_i(x)] = T[i, h_i(x)] + c_x.

At any point in time t, let C(t) = (C_1, ..., C_M), where C_x denotes the sum Σ{c_x | (x, c_x) arrived before time t}. When the time t is clear from context, we will simply use C. Given a query for C_x, the CM data structure outputs the estimate Ĉ_x = min_{1≤i≤p} T[i, h_i(x)]. The lemma below gives useful properties of the estimate Ĉ_x.
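The basic structure and its two operations are short enough to sketch directly; the hash family below is a standard modular construction chosen for illustration, not one prescribed by the chapter, and the class and parameter names are hypothetical.

    import math
    import random

    class CountMin:
        """Basic (eps0, delta0) Count-Min table: p = ceil(ln(1/delta0)) rows of
        q = ceil(e/eps0) counters, one randomly chosen hash per row."""
        def __init__(self, eps0, delta0, prime=2_147_483_647):
            self.q = math.ceil(math.e / eps0)
            self.p = math.ceil(math.log(1 / delta0))
            self.prime = prime
            self.rows = [[0] * self.q for _ in range(self.p)]
            self.hashes = [(random.randrange(1, prime), random.randrange(prime))
                           for _ in range(self.p)]

        def _h(self, i, x):
            a, b = self.hashes[i]
            return ((a * x + b) % self.prime) % self.q

        def update(self, x, c):          # arrival of update (x, c); c may be negative
            for i in range(self.p):
                self.rows[i][self._h(i, x)] += c

        def estimate(self, x):           # C_hat(x) = min_i T[i, h_i(x)]
            return min(self.rows[i][self._h(i, x)] for i in range(self.p))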

Lemma 11 Let x ∈ U be any fixed element. Then at any time t, with probability at least 1 − δ₀:

C_x ≤ Ĉ_x ≤ C_x + ε₀‖C‖₁.

Proof Recall that by assumption, C_y ≥ 0 for all y ∈ U at all times t. It is then easy to see that Ĉ_x ≥ C_x at all times, since each update (x, c_x) leads to an increment of T[i, h_i(x)] by c_x for each 1 ≤ i ≤ p. In addition, any element y such that h_i(y) = h_i(x) may contribute to T[i, h_i(x)] as well, but this contribution is guaranteed to be non-negative by our assumption. We now bound the probability that Ĉ_x > C_x + ε₀‖C‖₁ at any time t. Fix an element x ∈ U and an i ∈ [1..p]. We start by analyzing the probability of the event that T[i, h_i(x)] > C_x + ε₀‖C‖₁. Let Z_i(x) be the random variable defined as Z_i(x) = Σ_{y∈U\{x}: h_i(y)=h_i(x)} C_y. Since h_i is drawn uniformly at random from a universal hash family, we have

E[Z_i(x)] = Σ_{y∈U\{x}} Pr[h_i(y) = h_i(x)] C_y ≤ (Σ_{y∈U} C_y)/q ≤ ε₀‖C‖₁/e.

Then by Markov’s inequality, we have that

Pr[T[i, h_i(x)] > C_x + ε₀‖C‖₁] ≤ Pr[Z_i(x) > ε₀‖C‖₁] ≤ 1/e.

Thus

Pr[min_{1≤i≤p} T[i, h_i(x)] > C_x + ε₀‖C‖₁] ≤ (1/e)^{ln(1/δ₀)} ≤ δ₀. □

CM Data Structure for Quantile Queries

In order to support quantile queries, we need to modify the basic CM data structure to support range queries. A range query R[ℓ, r] specifies two elements ℓ, r ∈ U and asks for Σ_{ℓ≤x≤r} C_x. Suppose we are given a data structure that can answer, with probability at least 1 − δ, every range query to within an additive error of at most ε‖C‖₁. Then this data structure can be used to answer any φ-quantile query with ε-accuracy with probability at least 1 − δ. The idea is to perform a binary search for the smallest element r ∈ U such that R̂[1, r] ≥ φ‖C‖₁. We output the element r as the φ-quantile. Clearly, it is an ε-accurate φ-quantile with probability at least 1 − δ.

We now describe how the basic CM data structure can be modified to support range queries. For clarity of exposition, we will assume without any loss of generality that M = 2^u for some integer u. We will define a collection of CM data structures, say, CM₀, CM₁, ..., CM_u, such that CM_i can answer any range query of

the form R[j2^i + 1, (j + 1)2^i] with an additive error of at most ε‖C‖₁/(u + 1). Then to answer a range query R[1, r] for any r ∈ [1..M], we consider the binary representation of r. Let i₁ > i₂ > ··· > i_b denote the bit positions with a 1 in the representation. In response to the query R[1, r], we return

R̂[1, r] = R̂[1, 2^{i₁}] + R̂[2^{i₁} + 1, 2^{i₁} + 2^{i₂}] + ··· + R̂[2^{i₁} + ··· + 2^{i_{b−1}} + 1, 2^{i₁} + ··· + 2^{i_{b−1}} + 2^{i_b}]

where R̂[2^{i₁} + ··· + 2^{i_{j−1}} + 1, 2^{i₁} + ··· + 2^{i_{j−1}} + 2^{i_j}] is the value returned by CM_{i_j} in response to the query R[2^{i₁} + ··· + 2^{i_{j−1}} + 1, 2^{i₁} + ··· + 2^{i_{j−1}} + 2^{i_j}]. Since each term in the RHS has an additive error of at most ε‖C‖₁/(u + 1), we know that

R[1, r] ≤ R̂[1, r] ≤ R[1, r] + ε‖C‖₁.
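The decomposition of a prefix range into at most u + 1 dyadic pieces, one per set bit of r, can be computed as in the following sketch (an illustrative helper, not part of the chapter).

    def dyadic_prefix(r):
        """Decompose [1, r] into dyadic intervals [j*2^i + 1, (j+1)*2^i], one per
        set bit of r.  Returns (i, lo, hi) triples; CM_i is queried with each."""
        pieces, start = [], 0
        for i in reversed(range(r.bit_length())):
            if r & (1 << i):
                pieces.append((i, start + 1, start + (1 << i)))
                start += 1 << i
        return pieces

    # Example: dyadic_prefix(13) -> [(3, 1, 8), (2, 9, 12), (0, 13, 13)]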

The Data Structure CMi

It now remains to describe the data structure CM_i for 0 ≤ i ≤ u. Fix an i ∈ [0..u], and let u_i = u − i. Define U_i = {x_{i,1}, ..., x_{i,2^{u_i}}} to be the universe underlying the data structure CM_i. The element x_{i,j} ∈ U_i serves as the unique representative for all elements in U that lie in the range [j2^i + 1, (j + 1)2^i]. Thus each element in U is covered by a unique element in U_i. The data structure CM_i is an (ε₀, δ₀) CM data structure over U_i, where ε₀ = ε/(u + 1) and δ₀ = δ/(u + 1). Whenever an update (x, c_x) arrives for x ∈ U, we simply add c_x to the unique representative x_{i,j} ∈ U_i that covers x. The following is an immediate corollary of Lemma 11.

Corollary 4 Let j ∈ [1..2^{u−i}) be a fixed integer. The data structure CM_i can be used to answer the query R[j2^i + 1, (j + 1)2^i] within an additive error of ε₀‖C‖₁ with probability at least 1 − δ₀.

To answer a range query R[1..r], we aggregate answers from up to (u + 1) queries (to the data structures CM₀, CM₁, ..., CM_u), each with an additive error of ε₀‖C‖₁ with probability at least 1 − δ₀. Using union bounds, we can thus conclude that with probability at least 1 − (u + 1)δ₀ = 1 − δ, the total error is bounded by (u + 1)ε₀‖C‖₁ = ε‖C‖₁.

Application to Quantile Queries

In order to answer every φ-quantile query to within ε-accuracy, it suffices to be able to answer φ-quantile queries for φ-values restricted to be in the set {ε/2, ε, 3ε/2, ...} with (ε/2)-accuracy. Given any arbitrary φ-quantile query, we can answer it by querying for a φ′-quantile and returning the answer, where

φ′ = ⌈φ/(ε/2)⌉ · (ε/2).

It is easy to see that any (ε/2)-accurate answer to the φ′-quantile is an ε-accurate answer to the φ-quantile query. In order to answer every quantile query to within an additive error of ε‖C‖₁ with probability at least 1 − δ, each CM_i data structure is created with suitably chosen parameters ε₀ and δ₀. Since any single range query requires aggregating together at most (u + 1) answers, and there are 2/ε quantile queries overall, it suffices to set

ε₀ = ε/(2(u + 1))   and   δ₀ = (ε/2) · δ/(u + 1).
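Given a routine that aggregates the CM_i answers for a prefix range, the quantile search described above is a plain binary search; the wrapper below is a sketch in which prefix_estimate and total (an estimate of ‖C‖₁) are assumed to be supplied by the surrounding data structures.

    def cm_quantile(phi, M, prefix_estimate, total):
        """Binary search for the smallest r with R_hat[1, r] >= phi * total,
        where prefix_estimate(r) aggregates the CM_i answers for [1, r]."""
        lo, hi = 1, M
        while lo < hi:
            mid = (lo + hi) // 2
            if prefix_estimate(mid) >= phi * total:
                hi = mid
            else:
                lo = mid + 1
        return lo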

The space used by each CM_i data structure for this choice of parameters is O((1/ε)(log M)(log((log M)/δ))). Hence the overall space used by this approach is O((1/ε)(log² M)(log((log M)/δ))). The CM algorithm strongly utilizes the knowledge of the universe size. In the absence of deletions, stronger space bounds can be obtained by exploiting the knowledge of the universe size. For instance, the q-digest summary of Shrivastava et al. [18] described in Sect. 5.3 is an ε-approximate quantile summary that uses only O((1/ε) log M) space. Moreover, the precision guarantee of a q-digest is deterministic.

However, the strength of the CM algorithm is in its ability to handle deletions. To see another interesting example of an algorithm that uses randomization to handle deletions, the reader is referred to the RSS algorithm [8, 9].

5OtherModels

So far we have considered deterministic algorithms with absolute guarantees and randomized algorithms with probabilistic guarantees mainly in the setting of the cash register model [7]. However, quantile computations can be considered under different streaming models. We have seen in Sect. 4.2 that the cash register model can be extended to the turnstile model [17], in which the stream can include both insertions and deletions of observations. We can also consider settings in which the complete dataset is accessible, at a cost, allowing us to perform multiple (expensive) passes. Settings in which other features of this model have been varied have been studied as well. For example, one may know the types of queries in advance [15], or exploit prior knowledge of the precision and range of the data values [4, 8, 18]. While algorithms from two different settings cannot be directly compared, it is still worth understanding how they may be related. In this section, we will briefly consider a small sample of alternative models where the ideas presented in this chapter are directly applicable—either used as black box components in other algo- rithms, or adapted to a new setting with relatively minor modifications—and com- pare the modified algorithms to other algorithms in the literature. In each setting, we first briefly present an algorithm that follows naturally from the ideas presented in this chapter, then present an algorithm from the literature specifically designed for the new setting. Some of these models will be covered in more depth in later chapters in this book.

5.1 Deletions

In many database applications, a summary is stored with large data sets. For queries in which an approximate answer is sufficient, the query can be cheaply executed over the summary rather than over the entire, large, dataset. In such cases, we need to maintain the summary in the face of operations on the underlying data set. This setting differs from our earlier model in an important way: both insertion and deletion operations may be performed on the underlying set. We focus here on well-formed inputs where each deletion corresponds uniquely to an earlier insertion. When deletions are possible, the size of the dataset can grow and shrink. The parameter n can no longer denote both the number of observations that have been seen so far and the current time. In the turnstile model we will, instead, denote the current time by t and let n = n(t) denote the current number of elements in the data set, and m = m(t) will denote max_{1≤i≤t} n(i).

The difficulty in guaranteeing precise responses to quantile queries in the turnstile model lies in recovering information once it has been discarded from the summary. In particular, when there are only insertions, the error allowed in the ranks of elements in the summary grows monotonically. But when deletions occur, we may need to greatly refine the rank information for existing elements in the summary. For instance, if we insert n elements in a set, then the allowed error in the rank of any observation is εn. But now if we delete all but 1/ε of the observations, then the ε-approximate property now requires us to know the rank and value of each remaining element in the set exactly! However, we can deal with similarly useful, but more tractable, problems by investigating slightly more relaxed settings. Gibbons, Matias, and Poosala [5, 6] were the first to present an algorithm to maintain a form of quantile summary in the face of deletions in the case where multiple passes over the data set are possible, although expensive. In situations where a second pass is impossible, we relax the requirement that we return a value v that is (currently) in S. In this latter setting, we can gain some further traction by weakening the guarantees we offer. Deterministic algorithms can temper their precision guarantees as a function of the input pattern (performing better on "easy" input patterns, and worse on "hard" patterns), and randomized algorithms can offer probabilistic, rather than absolute, guarantees.

Deterministic Algorithms with Input Dependent Guarantees

We first extend the GK [10] algorithm in a natural way to support a DELETE(v) operation. We will see that for certain input sequences we can maintain a guarantee of ε-precision in our responses to quantile queries—even in the face of deletions.

DELETE(v) Find the smallest i such that v_{i−1} ≤ v < v_i, and decrement g_i. As a result, rmin_GK(v_j) and rmax_GK(v_j) are reduced by 1 for every j ≥ i. Further, for all j > i we know that rmax_GK(v_{j−1}) ≤ rmax_GK(v_j) continues to hold, as required. The pseudocode in Fig. 7 that decrements Δ_{i−1}, and possibly the Δ values of earlier tuples, maintains the invariant that rmax_GK(v_{j−1}) ≤ rmax_GK(v_j) across the boundary at i as well.

Because we do not delete v_i until all observations covered by the tuple are also deleted, it is no longer the case that all v_i ∈ Q_GK are members of the underlying set. At time t, a GK summary Q_GK(ε) will never delete a tuple if the resulting gap would exceed 2εn(t). By Proposition 1, the precision of the resulting summary is the maximum, over all i, of (rmax_{Q_GK(ε)}(v_{i+1}) − rmin_{Q_GK(ε)}(v_i))/(2n(t)). In the simple setting without deletions, the actual precision of Q_GK(ε) is always ≤ ε. Unfortunately, it is impossible to bound this precision in the face of arbitrary input in

DELETE(v)
  g_i = g_i − 1
  if (g_i = 0) and ((i = s − 1) or (i = 0)) then
    Remove t_i from Q
  end if
  for j = (i − 1) downto 0
    if (Δ_j ≥ (g_{j+1} + Δ_{j+1})) then
      Δ_j = Δ_j − 1
    else
      break
    end if
  end for

Fig. 7 Pseudocode for DELETE

the setting of the turnstile model. In particular, after deletions, Q_GK(ε) may not have precision ε in the turnstile model. However, in many cases, application-specific behavior can allow us to bound the maximum rmax_GK(v_{i+1}) − rmin_GK(v_i) and hence construct summaries with guaranteed precision.

Example: Bounded Deletions

Perhaps the simplest example of application-specific behavior is when we know, in advance, some bound on the impact of deletions on the size of the data set. Let α < 1 denote a known fixed lower bound, such that for any time t, n(t) > αm(t). If Q_GK(ε′) is an ε′-approximate quantile summary, then COMPRESS will never delete a tuple if the resulting gap would exceed 2ε′n(t). Given that at all times t, n(t) ≤ m(t), we know that for all tuples rmax_GK(v_{i+1}) − rmin_GK(v_i) ≤ 2ε′m(t), and the precision at time t is then bounded above by ε′m(t)/n(t). By assumption, n(t) > αm(t), and therefore the precision ε′m(t)/n(t) ≤ ε′/α. Consequently, if we choose ε′ = εα, then a summary Q_GK(ε′) has precision ε even after deletions. This loose specification is too broad to derive specific bounds on the size of our data structures. We proceed now to a concrete example in which analysis of the application-specific behavior allows us to demonstrate stronger limits on the size and precision of the resulting summary.

Example: Session Data

The AT&T network monitors the distribution of the duration of active calls over time [8] through the collection of Call Detail Records (CDRs). The start-time of each call is inserted into the quantile summary when the call begins. When the call ends it is no longer active and the start-time is deleted from the summary. At any time there is one observation in the set for each active call, and the duration of the call is simply the difference between the current time and the stored start time of the call. (The median duration is the current time minus the start time of the median observation in the set.) We first show that the size of the GK summary structure will be constant. Session data arrives at the summary in order of start time. Consequently, new observations are monotonically increasing (although deletions may occur in arbitrary order). Each new observation is a new maximum, and we know its rank exactly.

Proposition 5 If the input sequence is monotonically increasing, the data structure Q_GK(ε) uses only O(1/ε) space.

Proof The exact rank of each newly arriving observation is always known. Hence, the Δ value of each tuple in the summary is always 0. It then follows that all tuples are siblings in the tree representation. Consequently, after we run COMPRESS, each tuple except the leftmost tuple is a right partner of a full tuple pair, and g_{i−1} + g_i + Δ_i > 2εn(t) in Q_GK(ε). Thus if there are (k + 1) tuples in Q_GK(ε), then summing over the k full tuple pairs we get Σ_{i=1}^{k} (g_{i−1} + g_i + Δ_i) ≥ k(2εn(t)). Since Δ_i = 0 and Σ_i g_i ≤ n(t), we get 2n(t) ≥ k(2εn(t)), and k ≤ 1/ε. Since we run COMPRESS after 2/ε new observations are added, Q_GK(ε) contains O(1/ε) tuples at all times t, regardless of the size of n(t). □

Proposition 5 tells us only about the size of Q_GK. By Proposition 1, Q_GK will have precision ε at time t if the difference in rank between any two consecutive stored tuples, rmax_GK(v_{i+1}) − rmin_GK(v_i), is ≤ 2εn(t). In our example, the expected difference in rank follows from the observation that phone calls (for example, the trace data from [8]) are typically modeled as having exponentially distributed lifetimes.

Proposition 6 For sessions with exponentially distributed lifetimes and monotonically increasing arrivals, for a given ε, the expected difference in rank between consecutive tuples in Q_GK(ε) at time t is ≤ 2εE(n(t)), where E(n(t)) is the expected number of elements in S at time t.

Proof Fix any i. Let t′ denote the most recent time at which COMPRESS deleted any tuples that lay between v_i and v_{i+1}. Let r denote the difference (rmax_GK(v_{i+1}) − rmin_GK(v_i)) at time t′. r must have been ≤ 2εn(t′). Let d = d(t′, t) denote the total number of deletions that occurred between times t′ and t. Then n(t) ≥ n(t′) − d. The exponential distribution is "memoryless"—all calls are equally likely to terminate within a given interval. Therefore, if the probability that a call terminates within the interval [t′, t] is p, then the expected value of the total number of deletions, d, during [t′, t] is pn(t′). Similarly, the expected value of the number of deletions falling between v_i and v_{i+1} during [t′, t] is pr. At time t the expected value of (rmax_GK(v_{i+1}) − rmin_GK(v_i)) is (1 − p)r. The expected value of n(t) is ≥ (1 − p)n(t′). Given that r ≤ 2εn(t′), we have (1 − p)r ≤ 2ε(1 − p)n(t′) ≤ 2εE(n(t)).

By Proposition 5, the size of Q_GK(ε) will be O(1/ε). □

To be more concrete, in the CDR trace data described in [8], the RSS summary has ε = 0.1 precision, and has a maximum memory footprint of 11K bytes. Assuming that we store 12 bytes per tuple, the GK algorithm, in the same memory footprint, should be able to store more than 900 tuples. We expect the typical gap between stored tuples to be commensurate with a precision of roughly 0.0011—about 100 times more accurate than the RSS summary. This demonstrates the potential payoff of domain-specific analysis, but it is important to recall that this analysis provides no guarantees in the general case.

Probabilistic Guarantees Across All Inputs

The GK algorithm above exploited structure that was specific to the given example. However, more adversarial cases are easy to imagine. An adversary, for example, can delete all observations except for those that lie between two consecutive tuples in the summary. In such cases either there can be no precision guarantee (we will not be able to return even a single observation from the data set) or else the original "summary" must store every observation—providing no reduction in space. Thus, when details of the application are not known, algorithms such as GK are unsatisfactory. Even when the expected behavior of an application is known, there may be a chance that the input sequence is unexpectedly adversarial. Fortunately, there is a much better approach than aiming for absolute guarantees in the face of deletions. Randomized algorithms that summarize the number of observations within a range of values (cf. RSS [8] or CM [4]) can give probabilistic guarantees on precision and upper bounds on space for any application, without requiring case by case analysis. The Count-Min algorithm is described in Sect. 4.2. Assuming that observations can take on any one of M values, CM summarizes a sample in space O((1/ε)(log² M)(log((2 log M)/(δε)))). This summary is ε-approximate with probability at least 1 − δ.

5.2 Sliding Window Model

In some applications, we need to compute order statistics over the W most recent observations, rather than over the entire stream. When W is small enough to fit into memory, it suffices to store the last W observations in a circular buffer and compute the statistics exactly. However, when W is itself very large, then we must summarize a sliding window of the most recent W elements of the stream. A sliding window is a case where observations are being deleted in a systematic fashion based on the time of arrival. The chief difficulty in such sliding window quantile summaries compared to the summaries over entire streams is that, as the window slides, we need to remove observations from the summary—as in the turnstile model. However, unlike the strict turnstile model where the deletions are delivered to the summary from an external source (and, we presume, are properly paired to an undeleted input), we do not have a complete ordered record of observations, and hence do not know what value needs to be deleted at any given time. On the other hand, deletions in the sliding window model are not arbitrarily distributed; they have a nice structure that can be exploited. Sliding windows may be either fixed or variable size. In a fixed sliding window of size W, each newly arrived observation (after the first W arrivals) is paired with a deletion of the oldest observation in the window. Thus, in the steady state, the window always covers precisely W elements. Variable sized sliding windows decouple the arrivals from deletions—at any point in time either a new observation arrives or the oldest observation is deleted. A long string of arrivals increases the size of the window; a long string of deletions can reduce the size of the window. (We may sometimes exploit an upper bound on the size of the variable window, if such a bound is known in advance.)

Fixed Size Windows

We can implement a trivial fixed sliding window summary with precision ε by dividing the input stream into blocks of εW/2 consecutive observations. We summarize each block with an ε/2 precision quantile summary. The block summarizing the most recent data is under construction. We add each new arrival to it until it contains εW/2 observations, and is considered complete. Once the block is complete, it is no longer modified.³ We store only the summaries of the last 2/ε blocks. When a single observation in a block exits the window (it is "deleted"), we mark that block expired, and do not include it in our combined summary. These block summaries cover at most the last W, and at least the last W − εW/2, observations. By Corollary 1, the combined summary has a precision of ε/2. If the most recent block has only just been started, and covers a very few observations, then our combined summary is missing up to εW/2 observations. In the worst case, all the missing observations have values that lie between our current estimate and the true φ quantile of interest. Even so, they could increase the error in rank by at most εW/2, keeping the total error below εW, ensuring that our combined sliding window summary has precision ε. The summary for each individual block uses O((1/ε) log(εW)) space, and the aggregate uses O((1/ε²) log(εW)) space. Arasu and Manku [2] employ the same basic structure of maintaining a set of summaries over fixed size windows in the input stream, but use a more sophisticated approach that improves upon the space bound achieved by the simple algorithm above. Their algorithm suffers a blowup of only a factor of O(log(1/ε)) for maintaining a summary over a window of size W, compared to a blowup of Ω(1/ε) in the simple implementation. When ε is small (say 0.001), this improvement is significant. Arasu and Manku use a data structure with L + 1 levels where each level covers the stream by blocks of geometrically increasing sizes. At each point in time, exactly

³Lin et al. [13] use a variant of this simple approach, but construct εW/4 size blocks, and call prune on completed blocks.

Fig. 8 Graphical representation of levels in the Arasu–Manku algorithm

one block in each of the L + 1 levels is under construction. Each block is constructed using the GK algorithm until it is complete. It may seem that summarizing the entire window in L + 1 different ways, with L + 1 sets of blocks, would increase the required space, but, in fact, the total space requirement is reduced due to two basic observations. First, once a block is complete, we can call prune to reduce the required space to O(1/ε). Second, for a given precision, larger blocks summarize the data stream more efficiently than smaller blocks—each stored tuple covers more observations. On the other hand, small blocks result in fewer lost observations when we discard the oldest block. The Arasu–Manku algorithm summarizes the window by combining non-overlapping blocks of different sizes. It summarizes most of the window, efficiently, using large blocks with fine precision (saving space due to the large block size). We can summarize the tail end of the window using coarser precision smaller blocks (saving space by the coarse precision), and bounding the number of lost observations at the very end of the window to the size of the smallest block we use. Specifically, [2] divides the input stream into blocks in L + 1 different ways, where L = log₂(4/ε). Each decomposition is called a level, and the levels are labeled 0 through L. As we go up each level the block size, denoted by N_ℓ, doubles to 2^ℓ εW/4, and the precision, denoted by ε_ℓ, is halved to 2^{L−ℓ} ε/(2(L + 1)). This structure is graphically depicted in Fig. 8.
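
As a quick sanity check on the level parameters just described (and assuming, as the analysis does, that 4/ε is a power of two so that 2^L = 4/ε), the short computation below tabulates N_ℓ and ε_ℓ and confirms that their product is the same at every level; the function and variable names are ours.

```python
# Illustration of the Arasu-Manku level parameters described above:
# L = log2(4/eps), block size N_l = 2**l * eps * W / 4,
# precision eps_l = 2**(L - l) * eps / (2 * (L + 1)).
import math

def arasu_manku_levels(eps, W):
    L = round(math.log2(4.0 / eps))           # the analysis assumes 2**L == 4/eps
    levels = []
    for l in range(L + 1):
        N_l = (2 ** l) * eps * W / 4.0
        eps_l = (2 ** (L - l)) * eps / (2.0 * (L + 1))
        levels.append((l, N_l, eps_l, eps_l * N_l))
    return L, levels

if __name__ == "__main__":
    eps, W = 1 / 16, 2 ** 16                   # 4/eps = 64 is a power of two
    L, levels = arasu_manku_levels(eps, W)
    print(f"L = {L}; eps*W/(2*(L+1)) = {eps * W / (2 * (L + 1)):.1f}")
    for l, N_l, eps_l, gap in levels:
        # eps_l * N_l is identical at every level: eps*W / (2*(L+1))
        print(f"level {l}: block size {N_l:8.0f}  precision {eps_l:.5f}  eps_l*N_l {gap:.1f}")
```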

Proposition 7 The Arasu–Manku window algorithm can summarize a fixed window of the W most recent observations, using at most L + 1 non-overlapping blocks from its summary structure.

Proof The most recent observations are covered by the level L block currently under construction. For any 0 ≤ ℓ ≤ L, let η_ℓ denote the number of observations that need to be "covered" by blocks from level 0 through ℓ. We start by defining η_{L−1} to be the number of observations not covered by a level L block; clearly, η_{L−1} < W.

Lemma 12 The Arasu–Manku window algorithm implements an ε-approximate quantile summary of a fixed window of the W most recent observations.

Proof Let A denote the combined summary of non-overlapping blocks. From the analysis in the proof of Lemma 1, the maximum gap between consecutive stored observations in A is Σ_{ℓ=0}^{L} î_ℓ ε_ℓ N_ℓ, where î_ℓ is 1 if a block from level ℓ is present in A, and 0 if it is not. But

ε_ℓ N_ℓ = (2^ℓ εW/4) · (2^{L−ℓ} ε/(2(L + 1))) = 2^L ε²W/(4 · 2(L + 1)),

a value that is independent of ℓ. Since 2^L = 4/ε, this can be simplified to εW/(2(L + 1)). Consequently, the maximum gap is simply εW/(2(L + 1)) times the number of blocks used in A. By Proposition 7, there are at most L + 1 blocks in A. Therefore, the maximum gap is at most (L + 1) · εW/(2(L + 1)), or εW/2. An upper bound on the number of unsummarized observations in the window is just the smallest block size in the summary, namely N_0 = εW/4. Combining this with the precision of A allows us to answer order-statistic queries with precision 3ε/4 (which is less than ε). □

Lemma 13 The Arasu–Manku algorithm can answer quantile queries with precision ε over a fixed window of W elements in O((1/ε) log(1/ε) log W) space.

Proof There are at most 2^{L−ℓ} complete blocks at each level ℓ, and those are pruned to O(1/ε_ℓ) space. There are L = O(log(1/ε)) levels; the complete blocks therefore use

Σ_{ℓ=1}^{L} 2^{L−ℓ} · O(1/ε_ℓ) = Σ_{ℓ=1}^{L} 2^{L−ℓ} · 2^{ℓ−L} · (2(L + 1)/ε) = 2L(L + 1)/ε = O(L²/ε).

So the aggregate space used by the completed blocks at all levels is O((1/ε) log²(1/ε)). At each level ℓ, there is also at most one block that is still under construction. At its largest, just before it is completed, that block uses O((1/ε_ℓ) log(ε_ℓ N_ℓ)) space in the worst case. Expanding ε_ℓ (also, as noted above, ε_ℓ N_ℓ = εW/(2(L + 1))) yields that the space used by a level ℓ block in construction is O(2^{ℓ−L} · (2(L + 1)/ε) · log(εW/(2(L + 1)))). Summing over all levels ℓ gives us a geometric series whose sum is

O((2(L + 1)/ε) log(εW/(2(L + 1)))) = O((2(log(4/ε) + 1)/ε) log(εW/(2(L + 1)))) = O((1/ε) log(1/ε) log W).

The combined space is just the sum of the completed blocks and the blocks under construction, namely O((1/ε) log(1/ε) log W + (1/ε) log²(1/ε)), which can be simplified to O((1/ε) log(1/ε) log W) since the case when W < 1/ε can be trivially solved using O(1/ε) space. □

Variable Size Windows

Although the algorithms described above assume that W is fixed, they can be extended in a straightforward way to handle variable sized windows. At any point in time we assume that the maximum window size is some W. If the actual window size differs from W by more than a factor of 2, then we alter our assumed window size to 2W or W/2, and update our data structures accordingly. We first consider the case of a sliding window whose size has grown. We increase our assumed window size to 2W. That this is generally possible follows from the calculation of the maximum gap between the minimum and maximum rank of two consecutive stored tuples in the combined summary. If we have an ε-approximate summary of a window of size W, then the maximum such gap is of size 2εW. If the window size is increased to 2W, then that data structure can answer queries to a precision of ε/2. In particular, for the trivial fixed window algorithm described earlier, as the window size grows, we do not alter the block size or the precision per block. We continue to add enough blocks of size εW/2 to summarize our entire window. When the number of blocks increases to 4/ε (corresponding to a window size of 2W), we simply merge every adjacent pair of blocks into a single block of size εW. Extending the Arasu–Manku summary to accommodate a window that has grown beyond W is just as easy. Recall that the number of levels, L, is a function of ε and not W. So as W changes, the number of levels remains constant. However, a block at level ℓ is now twice the size as before. We already have blocks of the required size of the new level ℓ in the old level ℓ + 1 summary. Fortunately, those blocks are twice as precise as needed. To proceed with a new W of twice the size, then, we simply discard the level 0 blocks, and rename each level ℓ as level ℓ − 1. To construct level L, the highest level, we simply duplicate the old level L. Although the nominal size of those blocks was W, the expired block is of no consequence, and the block that is under construction is truncated to the current window size—which is guaranteed to be less than or equal to W − 1. We now consider the case of a sliding window whose size has shrunk significantly. It is easy to see that we can accommodate any size window smaller than W by accepting a space blowup of at most a factor of O(log W). We can treat any

fixed window summary as a black box, and simultaneously maintain summaries for window sizes W, W/2, W/4, and so on. Each time the actual window size halves, we can discard the summary over the largest window. Lin et al. [13] consider another variant of the sliding window model. In their "n of N" model, we once again summarize a fixed window consisting of the W latest observations. However, a query for the φ-quantile can be qualified by any integer w ≤ W such that the returned value must be the observation with rank wφ of the most recent w observations. It should be clear that the variable window extension to the Arasu–Manku algorithm must also be able to answer such queries with precision ε. If not, the algorithm would not be able to accurately answer subsequent quantile queries after the window shrank by W − w consecutive deletions.

5.3 Distributed Quantile Summaries

The summary algorithms for streaming data described so far consider a setting in which the data can be observed at a centralized location. In such a setting, the GK algorithm is more space-efficient than algorithms in the combine and prune family. However, in some settings the input stream is not observable at a single location. The aggregate quantile summary over a data set must then be computed by merging summaries over each of the subsets comprising that total set. For example, it may be desirable to split an input stream across different nodes in a large cluster, in order to process the input stream in parallel. Alternatively, in sensor networks whose nodes are organized into a tree, nodes may summarize the data in their subtree in order to avoid the communication costs of sending individual sensor readings to the root. In such cases, each site produces a summary of a subset of the stream and then passes the summary to another location where it is merged with summaries of other subsets until the entire stream is summarized. Algorithms that are built out of combine and prune can be extended naturally to such settings. Nodes in sensor networks are typically resource-scarce. In particular, they have very limited memory and must conserve power. Summary algorithms over sensor networks must therefore optimize transmission cost between nodes (communication costs are the dominant drain on power). Nodes in sensor networks typically send their sensor readings up to the root through a tree-shaped topology. In order to reduce the transmission cost, each node may aggregate the data from their children and summarize it before passing it on. In this book, Chap. VI.1 ("Sensor Networks", by Madden) describes sensor networks in more detail. Section 1.3 of [11] discusses the relation between streaming data and sensor networks. We briefly discuss here two summary algorithms over sensor networks (more details are available in the corresponding papers). Both Greenwald and Khanna [11] and Shrivastava et al. [18] support efficient ε-approximate quantile queries over sensor networks. Both can be characterized as applications of combine. One (see [11]) uses prune to manage communication costs, while the other (see [18]) uses a variant of COMPRESS.

A General Algorithm Using combine and prune

A simple algorithm to build quantile summaries over sensor networks collects the summaries of all the children of a node, and applies combine to produce a single summary that is sent to the parent. The operation prune is then applied to reduce the size of the data transmitted up the tree. Unfortunately, each application of prune to reduce the input to a buffer of size K can result in a loss of precision of up to 1/(2K). If the network topology is a tree of large depth, then either the aggregate loss of precision is too high, or else K must be very large and little reduction of communication costs can be achieved. Consequently, a slightly more complex strategy is required. The general approach of the algorithms in [11] is to decouple the combining tree from the sensor network routing tree. The parent of a node in the combining tree may not be the immediate parent in the routing tree—rather the parent may be any ancestor at an arbitrary height up the tree. Physical nodes in the sensor network pass all summaries up through their parents in the routing tree until they hit the node that is considered the parent in the combining tree. The combining tree, and not the routing topology, controls the number of combine and prune operations executed by the algorithm. Intuitively, the goal of this algorithm is to build the widest, shallowest, such tree subject to minimizing the worst case amount of data sent from a child to its parent in the underlying physical topology. The paper shows that, regardless of topology, an ε-approximate quantile summary can be constructed over a sensor network with n nodes using a maximum per-node transmission cost of O(log² n/ε). Let h, the "height" of the sensor network, denote the number of hops from the root of the network to the farthest leaf node. If h is known, then an embedding with a maximum per-node transmission cost of O((log n log h)/ε) is achievable. Finally, if h is known to be smaller than log n, then the maximum per-node transmission cost can be bounded by O(h/ε).

Optimized Algorithm When the Range of Values Is Known

Shrivastava et al. [18] study a slightly different setting in which observations can only take on integer values in the range [1..M], where M is known in advance. Further, there is no requirement that the value v, returned by a quantile query, must be an element of S. Their algorithm is therefore able to pursue a different strategy. Rather than calling prune, which may lose precision on every call, it calls a variant of COMPRESS, which reduces the size of the summary as much as it can, while still preserving the precision. Thus their COMPRESS′ operation, which we will describe shortly, can be performed after each call to combine, at each node in the network. The basic strategy is to construct a summary Q, called a q-digest, that consists of a sparse binary tree over the data range 1..M. Each node b in the tree, referred to as a "bucket", maintains a count b.g that represents observations that fell between the minimum and maximum value of the bucket. Each bucket has two children, covering the lower and upper halves of its range. To keep the summary small, COMPRESS′

Fig. 9 An example q-digest, Q. Each non-empty bucket is labeled by g, the count of observations within that bucket. Q has M = 16, n = 32, and ε = 0.25. It follows from these values that if the aggregate count in two siblings and their parent is ≤ (32 × 0.25)/4 = 2, then COMPRESS′ will delete the two children and merge them into their parent. If n were 48, then the allowed aggregate count would be 3. In such a case, [1..4] could be merged into [1..8], and both [13..14] and [15..16] could be merged together into [13..16]. [9..10] and [11..12] would be merged into [9..12]. This would open up the subtree under [9..12]. After the next insertion (assuming it did not lie in the range [9..12]), then the 2 observations in [9] would be moved up into [9..10], and both [11] and [12] merged into [11..12]

moves observations in underpopulated buckets up the tree, to levels where larger ranges can be covered less precisely by fewer buckets. Buckets with a count of 0 are elided. The combine operation when applied to a pair of such summaries simply sums the counts in corresponding buckets. Insertion of a new sensor value v into a summary Q is implemented by combine(Q, Q′), where Q′ is a new summary consisting of only the single bucket [v] with count = 1. COMPRESS′ is called after each combine operation. The precise behavior of COMPRESS′ is a depth-first tree-walk starting at the leaves. It checks to see whether a bucket can be merged into its parent. When visiting a bucket (other than the root, which has no sibling or parent), it sums up the count of observations in the bucket, its sibling bucket, and its parent's bucket. If the bucket has a sum ≤ εn/(log M), then COMPRESS′ deletes the bucket, and adds the count to its parent.
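
A compact Python sketch of a q-digest along these lines is shown below. The bucket bookkeeping (a bottom-up sweep over sibling pairs rather than the exact depth-first COMPRESS′ traversal of [18]) is our own simplification, and the class is meant only to illustrate the mechanics described above.

```python
# A toy q-digest over values 1..M (M a power of two), per the description above.
import math

class QDigest:
    def __init__(self, M, eps):
        self.M, self.eps, self.n = M, eps, 0
        self.logM = int(math.log2(M))
        self.counts = {}                          # (level, index) -> bucket count g

    def _range(self, level, index):
        size = 1 << level                         # bucket (level, index) covers values
        lo = index * size + 1                     # [lo, lo + size - 1]
        return lo, lo + size - 1

    def insert(self, v):
        key = (0, v - 1)                          # leaf bucket for value v
        self.counts[key] = self.counts.get(key, 0) + 1
        self.n += 1
        self.compress()

    def compress(self):
        threshold = self.eps * self.n / self.logM
        for level in range(self.logM):            # simplified bottom-up sweep
            for p in {idx // 2 for (lvl, idx) in self.counts if lvl == level}:
                left, right, parent = (level, 2 * p), (level, 2 * p + 1), (level + 1, p)
                total = (self.counts.get(left, 0) + self.counts.get(right, 0)
                         + self.counts.get(parent, 0))
                if total <= threshold:            # merge the two children into their parent
                    self.counts.pop(left, None)
                    self.counts.pop(right, None)
                    self.counts[parent] = total

    def min_rank(self, v):
        # 1 + observations known to lie strictly below v
        return 1 + sum(g for (lvl, idx), g in self.counts.items()
                       if self._range(lvl, idx)[1] < v)

    def max_rank(self, v):
        # min_rank plus observations in non-leaf buckets straddling v
        extra = sum(g for (lvl, idx), g in self.counts.items()
                    if lvl > 0 and self._range(lvl, idx)[0] < v <= self._range(lvl, idx)[1])
        return self.min_rank(v) + extra
```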

Proposition 8 The maximum count in any non-leaf node in Q is εn/(log M).

Proof Non-leaf nodes only cover new observations by either COMPRESS′ deleting their children, or by combine taking two q-digests and merging two corresponding buckets. The first case trivially maintains the bound because COMPRESS′ will never delete a pair of children if the sum of the children and their parent exceeds εn/(log M). In the second case, when combine combines two q-digests containing n₁ and n₂ observations, respectively, the combined digest contains n = n₁ + n₂ observations. Prior to combine the two corresponding buckets contained fewer than εn₁/log M and εn₂/log M observations, respectively. After summing they contain fewer than ε(n₁ + n₂)/log M = εn/log M observations, and the proposition holds. □

We can use Q to answer quantile queries over S. The minimum rank of a value v in the data set S is computed by adding 1 to the cumulative counts in all of the buckets that contain only values less than v. The maximum rank of v is computed by adding to the minimum rank, the sum of the counts in all non-leaf buckets whose minimum value is less than v, but whose maximum value is greater than or equal to v. For example, in Fig. 9, the minimum rank of the value 5 is 12, because 5 could be the first value after the 6 instances of 3, the 4 instances of 4, and the single observation that lies somewhere within [1,4]. The two observations within [1,8] and the single observation within [1,16] may be values that are greater than or equal to 5, and hence cannot be counted in the minimum possible rank. On the other hand, there may be values that precede 5, and hence could increase the rank of 5 to be as high as 15.

Theorem 6 A q-digest Q(ε) summarizing a dataset S is an ε-approximate quantile summary using at most 3 log(M)/ε buckets, regardless of the size of |S|.

We first establish the upper bound on the size of Q. Let k denote the number of surviving non-zero buckets in Q. For b ∈ Q, let b.g be the count of observations in b, and let s(b) denote the sibling of bucket b in Q, and p(b) denote its parent in Q. Every non-empty bucket b obeys b.g + s(b).g + p(b).g > εn/(log M). Each bucket can appear at most once as a left sibling, once as a right sibling, and at most once as a parent. Summing this inequality over all k surviving buckets, we get 3n ≥ Σ_{b∈Q}(b.g + s(b).g + p(b).g) > kεn/(log M). Therefore, 3 log(M)/ε > k. We next show that Q(ε) is an ε-approximate quantile summary. We observe that at most one bucket at each level of the tree can overlap the leaf bucket containing v. The height of the tree (excluding the leaves) is log M, and by Proposition 8, each non-leaf bucket contains at most εn/(log M) observations, so the cumulative gap between minimum and maximum rank is at most εn. The biggest gap between the minimum rank of a stored value v in Q and the maximum rank of its successor, v′, is therefore 2εn, and by Proposition 1, Q is an ε-approximate quantile summary.

6 Concluding Remarks

We presented here a broad range of algorithmic ideas for computing quantile summaries of data streams using small space. We highlighted connections among these ideas, and how techniques developed for one setting sometimes naturally lend themselves to a seemingly different setting. While the past decade has seen significant advances in space-efficient computation of quantile summaries, some fundamental questions remain unresolved. For instance, in the cash-register model, it is not known if the space bound of O(log(εn)/ε) achieved by the GK algorithm [10] on a stream of length n is the best possible for any deterministic algorithm. When the elements are known to be in the range of [1..M] for some positive integer M, is the O(log(M)/ε) bound achieved by the q-digest algorithm [18] optimal? In either setting, only a trivial lower bound of Ω(1/ε) on space is known. Similarly, when randomization is allowed, what is the best possible dependence of space needed on ε and δ? It appears that progress on these questions would require significant new ideas that may help advance our understanding of space-bounded computation as a whole.

References

1. K. Alsabti, S. Ranka, V. Singh, A one-pass algorithm for accurately estimating quantiles for disk-resident data, in Proceedings of the Twenty-Third International Conference on Very Large Data Bases, ed. by M. Jarke et al., Los Altos, CA 94022, USA (Morgan Kaufmann, San Mateo, 1997), pp. 346–355
2. A. Arasu, G.S. Manku, Approximate counts and quantiles over sliding windows, in Proceedings of the 23rd ACM Symposium on Principles of Database Systems (PODS 2004), Paris, France (2004), pp. 286–296
3. J.L. Carter, M.N. Wegman, Universal classes of hash functions, in Proceedings of the Ninth Annual ACM Symposium on Theory of Computing (ACM, Colorado, 1977), pp. 106–112
4. G. Cormode, S. Muthukrishnan, An improved data stream summary: the Count-Min sketch and its applications, in Proceedings of Latin American Theoretical Informatics (LATIN'04) (2004)
5. P.B. Gibbons, Y. Matias, V. Poosala, Fast incremental maintenance of approximate histograms, in Proc. 23rd Int. Conf. Very Large Data Bases, VLDB, ed. by M. Jarke, M.J. Carey, K.R. Dittrich, F.H. Lochovsky, P. Loucopoulos, M.A. Jeusfeld (Morgan Kaufmann, San Mateo, 1997), pp. 466–475
6. P.B. Gibbons, Y. Matias, V. Poosala, Fast incremental maintenance of approximate histograms. ACM Trans. Database Syst. 27(3), 261–298 (2003)
7. A.C. Gilbert, Y. Kotidis, S. Muthukrishnan, M.J. Strauss, Surfing wavelets on streams: one-pass summaries for approximate aggregate queries, in Proceedings of the 27th Intl. Conf. Very Large Data Bases, VLDB, ed. by P.M.G. Apers, P. Atzeni, S. Ceri, S. Paraboschi, K. Ramamohanarao, R.T. Snodgrass, Rome, Italy (2001), pp. 79–88
8. A.C. Gilbert, Y. Kotidis, S. Muthukrishnan, M.J. Strauss, How to summarize the universe: dynamic maintenance of quantiles, in Proceedings of the 28th Intl. Conf. Very Large Data Bases, VLDB, Hong Kong, China (2002), pp. 454–465
9. A.C. Gilbert, Y. Kotidis, S. Muthukrishnan, M.J. Strauss, Domain-driven data synopses for dynamic quantiles. IEEE Trans. Knowl. Data Eng. 17(7), 927–938 (2005)
10. M. Greenwald, S. Khanna, Space-efficient online computation of quantile summaries, in Proceedings of the 2001 ACM SIGMOD Intl. Conference on Management of Data (2001), pp. 58–66
11. M.B. Greenwald, S. Khanna, Power-conserving computation of order-statistics over sensor networks, in Proceedings of the 23rd ACM Symposium on Principles of Database Systems (PODS 2004), Paris, France (2004), pp. 275–285
12. S. Guha, A. McGregor, Lower bounds for quantile estimation in random-order and multi-pass streaming, in International Colloquium on Automata, Languages and Programming (2007)
13. X. Lin, H. Lu, J. Xu, J.X. Yu, Continuously maintaining quantile summaries of the most recent n elements over a data stream, in Proceedings of the 20th International Conference on Data Engineering (ICDE04), Boston, MA (IEEE Comput. Soc., Los Alamitos, 2004), pp. 362–374
14. G. Singh Manku, S. Rajagopalan, B.G. Lindsay, Approximate medians and other quantiles in one pass and with limited memory. SIGMOD Rec. 27(2), 426–435 (1998), SIGMOD'98, Seattle, WA
15. G. Singh Manku, S. Rajagopalan, B.G. Lindsay, Random sampling techniques for space efficient online computation of order statistics of large datasets. SIGMOD Rec. 28(2), 251–262 (1999), SIGMOD'99, Philadelphia, PA
16. J.I. Munro, M.S. Paterson, Selection and sorting with limited storage. Theor. Comput. Sci. 12, 315–323 (1980)

17. S. Muthukrishnan, Data streams: algorithms and applications (2003). Unpublished report (invited talk at SODA03), available at http://athos.ruthers.edu/~muthu/stream-1-1.ps
18. N. Shrivastava, C. Buragohain, D. Agrawal, S. Suri, Medians and beyond: new aggregation techniques for sensor networks, in Proceedings of the 2nd ACM Conference on Embedded Network Sensor Systems (SenSys'04), Baltimore, MD (2004), pp. 239–249

Join Sizes, Frequency Moments, and Applications

Graham Cormode and Minos Garofalakis

1 Introduction

In this chapter, we focus on a set of problems chiefly inspired by the problem of estimating the size of the (equi-)join between two relational data streams. This problem is at the heart of a wide variety of other problems, both in databases/data streams and beyond. Given two relations, R and S, and an attribute a common to both relations, the equi-join between the two relations, R ⋈ S, consists of one tuple for each r ∈ R and s ∈ S pair such that r.a = s.a. Estimating the join size between pairs of relations is a key component in designing an efficient execution plan for a complex SQL query that may contain arbitrary combinations of selection, projection, and join tasks; as such, it forms a critical part of database query optimization [15]. The results can also be employed to provide fast approximate answers to user queries, to allow, for instance, interactive exploration of massive data sets [12]. The ideas discussed in this chapter are directly applicable in a streaming context, and, additionally, within traditional database management systems where: (i) relational tables are dynamic, and queries must be tracked over the stream of updates generated by tuple insertions and deletions; or, (ii) relations are truly massive, and single-pass algorithms are the only viable option for efficient query processing. Knowing the join size is also an important problem at the heart of a variety of other problems, such as building histogram and wavelet representations of data, and can be applied

G. Cormode (B) Department of Computer Science, University of Warwick, Coventry CV4 7AL, UK e-mail: [email protected]

M. Garofalakis School of Electrical and Computer Engineering, Technical University of Crete, University Campus—Kounoupidiana, Chania 73100, Greece e-mail: [email protected]

to problems such as finding frequent items and quantiles, and modeling spatial and high-dimensional data. Many problems over streams of data can be modeled as problems over (implicit) vectors which are defined incrementally by a continuous stream of data updates. For example, given a stream of communications between pairs of people (such as records of telephone calls between a caller and callee), we can capture information about the stream in a vector, indexed by the (number of the) calling party, and recording, say, the number of calls made by that caller. Each new call causes us to update one entry in this vector (i.e., incrementing the corresponding counter). Queries about specific calling patterns can often be rendered into queries over this vector representation: For instance, to find the number of distinct callers active on the network, we must count the number of non-zero entries in the vector. Similarly, problems such as computing the number of calls made by a particular number (or, range of numbers), the number of callers who have made more than 50 calls, the median number of calls made and so on, can all be posed as function computations over this vector. In an even more dynamic setting, such streaming vectors could be used to track the number of active TCP connections in a large Internet Service Provider (ISP) network (say, per source IP address); thus, a TCP-connection open (close) message would have to increment (respectively, decrement) the appropriate counter in the vector. Translating the join-size problem into the vector setting, we observe that each relation can essentially be represented by a frequency-distribution vector, whose ith component counts the number of occurrences of join-attribute value i in the relation. (Without loss of generality, we assume the join attributes to range over an integer domain [U] = {1, ..., U}, for some large U.¹) The join size |R ⋈ S| between two streaming relations R and S corresponds to computing the inner-product of their frequency-distribution vectors, that is, the sum of the product of the counts of i in R and S over all i ∈ [U]. A special case of this query is the self-join size |R ⋈ R|: the size of the join between a relation R and itself. This is the sum of the squares of the entries in the corresponding frequency-distribution vector, or, equivalently, the square of its Euclidean (i.e., L2) norm. The challenge for designing effective and scalable streaming solutions for such problems is that it is not practical to materialize this vector representation; both the size of the input data stream and the size of the "universe" from which items in the stream are drawn can grow to be very large indeed. Thus, naive solutions that employ O(U) space or time over the update stream are not feasible. Instead, we adopt a paradigm based on approximate, randomized estimation algorithms that can produce answers to such queries with provable guarantees on the accuracy of the approximation while using only small space and time per streaming update. Beyond join-size approximations, efficiently estimating such vector inner-products and norms has a wide variety of applications in streaming computation problems, including approximating range-query aggregates, quantiles, and heavy-hitter elements, and building approximate histograms and wavelet representations.

¹While our development here assumes a known upper bound on the attribute universe size U, this is not required: U can be learned adaptively over the stream using standard tricks (see, e.g., [13]).

We briefly touch upon some of these applications later in this chapter, while later chapters also provide more detailed treatments of specific applications of the techniques. Our discussion in this chapter focuses on efficient, sketch-based streaming algorithms for join-size and self-join-size estimation problems, based on two influential papers by Alon, Matias, and Szegedy [3], and Alon, Gibbons, Matias, and Szegedy [2]. The remainder of the chapter is structured as follows.

2 Preliminaries and Problem Formulation

Let x be a (large) vector being defined incrementally by a stream of data updates. We adopt the most general (so-called "turnstile" [20]) model of streaming data, where each update in the stream is a pair (i, c), where i ∈ [U] is the index of the entry being updated and c is a positive or negative number denoting the magnitude of the update; in other words, the (i, c) update sets x[i] = x[i] + c. Allowing c values to be either positive or negative implies that x vector entries can decrease as well as increase. Thus, this model allows us to easily represent the "departure" of items (e.g., closed TCP connections) as well as their arrival.

Definition 1 (Join Size) The (equi-)join size of two streaming relations with frequency-distribution vectors x and y (over universe [U]) is exactly the inner product of x and y, defined as x · y = Σ_{i=1}^{U} x[i]y[i].

(We use the terms "join size" and "inner product" interchangeably in the remainder of this chapter.) The special case of self-join size (i.e., inner-product of a vector with itself) is closely related to the notion of frequency moments of a data distribution, which can be defined using the same streaming model and concepts. More specifically, let x denote the frequency-distribution vector for a stream R of items (from domain [U]); that is, x[i] denotes the number of occurrences of item i in the R stream. Then, the pth frequency moment of stream R is given by F_p(x) = Σ_{i=1}^{U} x[i]^p. Observe that F₁ is simply the length of the stream R (i.e., the total number of observed items), and can be easily computed with a single counter. F₀ is the number of distinct items in the R stream (the size of the set of items that appear), and is the focus of other chapters in this volume. F₂, the second frequency moment, is also known as the repeat rate of the sequence, or as "Gini's index of homogeneity"—it forms the basis of a variety of data analysis tasks, and can be used to compute the surprise index [14] and the self-correlation of the stream. We extend the definition of frequency moments to the arrival vector model. Now, F_p(x) = Σ_{i=1}^{U} x[i]^p; thus, F_p(x) = ‖x‖_p^p, where ‖·‖_p denotes the L_p vector norm (see also the chapter of Cormode and Indyk later in this volume). In particular, F₂(x) is the self-join size of a relation whose characteristic vector is x, and √F₂(x) is the L₂, or Euclidean, norm of the vector x.
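
For reference, the exact (non-streaming) computations that the sketches below approximate can be written directly over materialized frequency vectors; the helper names here are illustrative.

```python
# Exact reference computations for the quantities defined above, using plain
# dictionaries as the frequency-distribution vectors; the sketches discussed next
# approximate these in small space.
from collections import defaultdict

def apply_updates(updates):
    """Build a frequency vector from a turnstile stream of (i, c) updates."""
    x = defaultdict(int)
    for i, c in updates:
        x[i] += c
    return x

def frequency_moment(x, p):
    return sum(v ** p for v in x.values())             # F_p(x)

def join_size(x, y):
    return sum(c * y.get(i, 0) for i, c in x.items())  # x . y  =  |R join S|

R = apply_updates([(1, 2), (3, 1), (7, 4)])   # value 1 appears twice, 3 once, 7 four times
S = apply_updates([(1, 1), (7, 2), (9, 5)])
print(frequency_moment(R, 2))   # self-join size |R join R| = 4 + 1 + 16 = 21
print(join_size(R, S))          # |R join S| = 2*1 + 4*2 = 10
```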

We give the definition of other relevant functions:

Definition 2 The vector distance between two vectors x and y, both of dimensionality U, is given by ‖x − y‖₂ = √(F₂(x − y)).

Definition 3 An (ε, δ)-relative approximation of a value X returns an answer x such that, with probability at least 1 − δ,

(1 − ε)X ≤ x ≤ (1 + ε)X.

Definition 4 (k-Wise Independent Hash Functions) A family of hash functions H mapping items from X onto a set Y is said to be k-wise independent if, for any k distinct items x₁, ..., x_k ∈ X and any values y₁, ..., y_k ∈ Y, over random choices of h ∈ H we have Pr[h(x₁) = y₁ ∧ h(x₂) = y₂ ∧ ··· ∧ h(x_k) = y_k] = 1/|Y|^k.

In other words, for up to k items, we can treat the results of the hash function as independent random events, and reason about them correspondingly. Here, we will make use of families of 2-wise (pairwise) independent hash functions, and 4-wise independent hash functions. Such hash functions are easy to implement: the family H₂ = {h(x) = ((ax + b) mod P) mod |Y|}, where P is a prime and a and b are picked uniformly at random from {0, ..., P − 1}, is pairwise independent onto {0, ..., |Y| − 1} [5, 19]. More generally, the family H_k = {h(x) = ((Σ_{i=0}^{k−1} c_i x^i) mod P) mod |Y|} is k-wise independent for c_i's picked uniformly from {0, ..., P − 1} [23]. Various efficient implementations of such functions have been given, especially for the case of k = 2 and k = 4 [21, 22].
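
The hash-family constructions just described translate directly into code; in the sketch below, the prime P and the helper names are our own choices.

```python
# The pairwise- and k-wise-independent hash constructions described above, with
# coefficients drawn from Z_P for a large prime P (a Mersenne prime, chosen only
# for convenience).
import random

P = (1 << 61) - 1   # a large prime

def pairwise_hash(range_size, rng=random):
    a = rng.randrange(1, P)
    b = rng.randrange(P)
    return lambda x: ((a * x + b) % P) % range_size

def kwise_hash(k, range_size, rng=random):
    # h(x) = ((c_0 + c_1 x + ... + c_{k-1} x^{k-1}) mod P) mod range_size
    coeffs = [rng.randrange(P) for _ in range(k)]
    def h(x):
        acc = 0
        for c in reversed(coeffs):     # Horner's rule
            acc = (acc * x + c) % P
        return acc % range_size
    return h

def fourwise_sign_hash(rng=random):
    # maps items onto {-1, +1}, as needed by the AMS sketch in the next section
    h = kwise_hash(4, 2, rng)
    return lambda x: 2 * h(x) - 1
```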

3 AMS Sketches

In their 1996 paper, Alon, Matias and Szegedy gave an algorithm to compute an (ε, δ)-approximation of the self-join size. The algorithm computes a data structure, where each entry in the data structure is computed through an identical procedure but with a different 4-wise independent hash function for each entry. Each entry can be used to find an estimate of the self-join size that is correct in expectation, but can be far from the correct value. Carefully combining all estimates gives a result that is an (ε, δ)-approximation as required. The resulting data structure is often called an AMS (or, "tug-of-war") sketch, since the data structure concisely summarizes, or 'sketches', a much larger amount of information.

UPDATE(i, c, z)
Input: item i, count c, sketch z
1: for j = 1 to w do
2:   for k = 1 to d do
3:     z[j][k] += h_{j,k}(i) ∗ c

ESTIMATEJS(x, y)
Input: sketch x, sketch y
Output: estimate of x · y
1: for k = 1 to d do
2:   avg[k] = 0;
3:   for j = 1 to w do
4:     avg[k] += x[j][k] ∗ y[j][k]/w;
5: Return(median(avg))

ESTIMATEF2(z)
Input: sketch z
1: Return ESTIMATEJS(z, z)

Fig. 1 AMS algorithm for estimating join and self-join size
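
A self-contained Python rendering of Fig. 1 might look as follows; it assumes a simple polynomial 4-wise family for the ±1 hashes (as in the previous section), and it is a sketch rather than a tuned implementation. Sketches of x and y must be built with the same hash functions (here, the same random seed) for ESTIMATEJS to be meaningful.

```python
# A w x d array of counters, each maintained with its own 4-wise independent +/-1
# hash; estimates average over the w copies and take the median of the d averages.
import random
from statistics import median

P = (1 << 61) - 1

def fourwise_sign(rng):
    c = [rng.randrange(P) for _ in range(4)]
    def h(x):
        acc = 0
        for ci in reversed(c):
            acc = (acc * x + ci) % P
        return 1 if acc % 2 else -1
    return h

class AMSSketch:
    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)          # same seed => same hash functions,
        self.w, self.d = w, d              # so sketches of x and y are comparable
        self.h = [[fourwise_sign(rng) for _ in range(d)] for _ in range(w)]
        self.z = [[0] * d for _ in range(w)]

    def update(self, i, c):                # UPDATE(i, c, z)
        for j in range(self.w):
            for k in range(self.d):
                self.z[j][k] += self.h[j][k](i) * c

def estimate_js(x, y):                     # ESTIMATEJS(x, y)
    avgs = []
    for k in range(x.d):
        avgs.append(sum(x.z[j][k] * y.z[j][k] for j in range(x.w)) / x.w)
    return median(avgs)

def estimate_f2(z):                        # ESTIMATEF2(z)
    return estimate_js(z, z)
```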

To build one element of the sketch, the algorithm takes a 4-wise independent hash function h : [1..U] → {−1, +1} and computes Z = Σ_{i=1}^{U} h(i)x[i]. Note that this is easy to maintain under the turnstile streaming model: initialize Z = 0, and for every update in the stream (i, c) set Z = Z + c ∗ h(i). This algorithm is given in pseudocode as UPDATE in Fig. 1.

3.1 Second Frequency Moment Estimation

To estimate the self-join size, we compute Z².

Lemma 1 E(Z²) = F₂(x).

Proof

E(Z²) = E[(Σ_{i=1}^{U} h(i)x[i])²] = Σ_{i=1}^{U} E[h(i)²x[i]²] + Σ_{1≤i<j≤U} E[2h(i)h(j)x[i]x[j]] = Σ_{i=1}^{U} x[i]² + 0 = F₂(x).

The proof relies critically on the properties of h: h(i)² = 1 for all i, but since h is 4-wise independent then the outcomes h(i) = h(j) and h(i) = −h(j) are equally likely (for j ≠ i) and so in expectation h(i)h(j) is zero. □

Lemma 2 Var(Z²) ≤ 2F₂(x)².

Proof

Var(Z²) = E(Z⁴) − E(Z²)²
  = E[(Σ_{i=1}^{U} h(i)x[i])⁴] − (Σ_{i=1}^{U} x[i]²)²
  = E[Σ_{i=1}^{U} h(i)⁴x[i]⁴ + Σ_{1≤i<j≤U} 6h(i)²h(j)²x[i]²x[j]²] − (Σ_{i=1}^{U} x[i]²)²
  = Σ_{i=1}^{U} x[i]⁴ + 6Σ_{1≤i<j≤U} x[i]²x[j]² − (Σ_{i=1}^{U} x[i]⁴ + 2Σ_{1≤i<j≤U} x[i]²x[j]²)
  = 4Σ_{1≤i<j≤U} x[i]²x[j]² ≤ 2(Σ_{i=1}^{U} x[i]²)² = 2F₂(x)². □

This shows that each estimate is correct in expectation and has bounded variance. Again, in expectation many of the cross-terms (e.g., h(i)h(j)h(k)h(l)) are zero, by the 4-wise independence of the hash function h. In order to give tight guarantees about the accuracy of this procedure, we make use of a few statistical results about the average and median of random variables.

Fact 1 (Variance Reduction) Let X_i be independent and identically distributed random variables. Then

Var((1/w) Σ_{i=1}^{w} X_i) = (1/w) Var(X₁).

In other words, taking the average of w copies of an estimator reduces the variance by a factor of w.

Fact 2 (The Chebyshev Inequality) Given a random variable X,

Pr[|X − E(X)| ≥ k] ≤ Var(X)/k².

Fact 3 (Application of Chernoff Bounds) Let R be a range of values R = [R_min..R_max], and let Y_i be d = 4 log(1/δ) independent and identically distributed random variables such that Pr[Y_i ∉ R] ≤ 1/8. Then

Pr[median_{i=1}^{d} Y_i ∉ R] ≤ δ,

that is, if there is constant probability that each Y_i falls within the desired range R, then taking the median of O(log 1/δ) copies of Y_i reduces the failure probability to δ.

For details of these facts, see a standard text such as [19]. For the final fact, observe that we can define an indicator variable for each Y_i that is 0 if Y_i falls within the range R and is 1 otherwise. The expectation of the sum of these indicator variables is (1/2) log(1/δ). However, if the median of the Y_i's is not within range, then at least half the Y_i's must have fallen outside the range; hence the sum of the indicator variables must be at least 2 log(1/δ). Applying Chernoff bounds gives the derived result. We can now apply these facts to show the accuracy of the estimation procedure for F₂:
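
The effect of the median step in Fact 3 is easy to see empirically; the small simulation below (with natural logarithms and our own helper names) draws d = 4 ln(1/δ) indicator outcomes that each fail with probability 1/8 and measures how often at least half of them fail.

```python
# A quick, purely illustrative simulation of the median trick in Fact 3.
import math
import random

def failure_rate(delta, trials=20000, seed=1):
    rng = random.Random(seed)
    d = max(1, math.ceil(4 * math.log(1 / delta)))   # number of repetitions
    fails = 0
    for _ in range(trials):
        # each basic estimate lands outside the target range with probability 1/8
        outside = sum(1 for _ in range(d) if rng.random() < 1 / 8)
        if 2 * outside >= d:        # the median is outside R only if at least half are
            fails += 1
    return fails / trials

print(failure_rate(0.01))   # typically far below delta = 0.01
```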

Theorem 1 An (ε, δ)-approximation of F₂, the self-join size, can be computed in space O((1/ε²) log(1/δ)) machine words in the streaming model. Each update takes time O((1/ε²) log(1/δ)).

Proof Applying the Chebyshev inequality to the average of w = 16/ε² copies of the estimate Z² generates a new estimate Y such that

Pr[|Y − F₂| ≥ εF₂] ≤ Var(Y)/(εF₂)² = Var(Z²)/(w(εF₂)²) ≤ 2F₂²/((16/ε²)ε²F₂²) = 1/8.

Hence, applying the Chernoff bound result from Fact 3 to the median of 4 log(1/δ) copies of the average Y gives the probability of the result being outside the range of ±εF₂ from F₂ as δ. The space required is that to maintain O((1/ε²) log(1/δ)) copies of the original estimate. Each of these requires a counter and a 4-wise independent hash function, both of which can be represented with a constant number of machine words under the standard RAM model. □

In summary, this shows that an (ε, δ)-approximation of F₂ can be computed using space that is essentially independent of the size of the stream or the dimensionality of the vector. The complete algorithm is given in Fig. 1.

3.2 Vector Difference Estimation

The results for self-join size estimation can be applied to the problem of measuring the distance between two vectors. The result follows almost as an immediate corollary of the previous theorem, combined with the structure of the sketch. Observe that the difference between x and y as given in Definition 2 can be thought of as the self-join size of a single vector whose ith entry is x[i] − y[i]. The sketch of this vector is given by Σ_i h(i)(x[i] − y[i]). This can be rewritten as Σ_i h(i)x[i] − Σ_i h(i)y[i]. In other words, the sketch of the difference is the difference of the sketches. Thus, by subtracting the sketches and then applying the estimation procedure we can get an (ε, δ)-approximation of the difference between the vectors. This relies on the linearity of the sketching operation: any linear transformation (scaling, addition, subtraction, etc.) to the original vector can be applied on the sketch and the result is the sketch of the modified vector. This was used implicitly to show that the sketch can be updated dynamically under streaming updates. Such linearity properties have also been used in a variety of techniques for streaming data based on sketches. See the discussion in Sect. 4 for some examples.
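
The linearity property is easy to check directly; the toy example below fixes one ±1 sign per item (standing in for h) and verifies that the sketch of x − y equals the difference of the two sketches.

```python
# Linearity of the sketch, as used above: with the same sign hash, the sketch of
# x - y equals the difference of the sketches of x and y.
import random

rng = random.Random(7)
U = 16
h = [rng.choice((-1, 1)) for _ in range(U)]            # one fixed sign per item
x = [rng.randint(0, 5) for _ in range(U)]
y = [rng.randint(0, 5) for _ in range(U)]

def sketch(v):
    return sum(h[i] * v[i] for i in range(U))

assert sketch([x[i] - y[i] for i in range(U)]) == sketch(x) - sketch(y)
print("sketch(x - y) == sketch(x) - sketch(y)")
```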

3.3 Join Size Estimation

Lemma 3 Let Z_x be an entry of a sketch computed for the vector x, and let Z_y be an entry of a sketch computed for y using the same hash function. The estimate is correct in expectation, i.e., E(Z_x ∗ Z_y) = x · y.

Proof

E(Z_x ∗ Z_y) = E[Σ_{i=1}^{U} h(i)²x[i]y[i] + Σ_{1≤i≠j≤U} h(i)h(j)x[i]y[j]] = Σ_{i=1}^{U} x[i]y[i] + 0 = x · y. □

Lemma 4 Var(Z_x ∗ Z_y) ≤ 2F₂(x)F₂(y).

Proof

Var(Z_x ∗ Z_y) = E[Z_x²Z_y²] − E(Z_xZ_y)²
  = E[Σ_{i=1}^{U} h(i)⁴x[i]²y[i]² + Σ_{1≤i≠j≤U} h(i)²h(j)²x[i]²y[j]² + Σ_{1≤i<j≤U} 4h(i)²h(j)²x[i]y[i]x[j]y[j]] − (x · y)²
  = Σ_{i=1}^{U} x[i]²y[i]² + Σ_{1≤i≠j≤U} x[i]²y[j]² + 4Σ_{1≤i<j≤U} x[i]y[i]x[j]y[j] − (x · y)²
  = F₂(x)F₂(y) + (x · y)² − 2Σ_{i=1}^{U} x[i]²y[i]²
  ≤ 2F₂(x)F₂(y),

where the last step uses the Cauchy–Schwarz inequality (x · y)² ≤ F₂(x)F₂(y). □

Applying the Chebyshev Inequality to the average of w = 16/ε² copies of this estimate, and then taking the median of d = 4 log(1/δ) such averages, as in the proof of Theorem 1, allows us to state the following theorem:

Theorem 2 Using O((1/ε²) log(1/δ)) space we can output an estimate est of x · y so that

Pr[|(x · y) − est| ≤ ε√(F₂(x)F₂(y))] ≥ 1 − δ.

Note that for the special case when x = y, the above Theorem 2 reduces to Theorem 1. However, this is not an (ε, δ)-approximation since in general √(F₂(x)F₂(y)) > (x · y). In order to get such an approximation, we need to increase w by a factor of F₂(x)F₂(y)/(x · y)². This may be possible if we have a priori bounds on these quantities, but since we are trying to approximate x · y, we cannot know this quantity exactly in advance. In general, we cannot hope for much stronger results due to the following negative result:

Theorem 3 Guaranteeing an (ε, δ)-approximation of (x · y) requires Ω(U) space in the worst case.

Proof We reduce from the problem of testing whether two sets have any element in common, and use the communication complexity of this problem to argue a space bound for the streaming problem. Consider two arbitrary sets X and Y, both of which are subsets of [1..U]. There are two people who wish to collaborate to compute a function of X and Y: X is held by one party and Y by the other. A well-known result from communication complexity states that determining whether there exists i such that i ∈ X ∧ i ∈ Y requires communication between the two parties that is linear in U, even under a probabilistic model [18]. This is known as the disjointness problem, since the answer is either that the sets are disjoint (i.e., X ∩ Y = ∅) or not disjoint. First, we show that if we can approximate join size, then we can answer disjointness queries. Let x[i] = 1 ⟺ i ∈ X, and zero otherwise; similarly, let y[j] = 1 ⟺ j ∈ Y, otherwise y[j] = 0. Now observe that (X ∩ Y = ∅) ⟺ (x · y = 0). Hence, computing the join size exactly means that we can answer disjointness queries. More strongly, any approximation of x · y also allows us to answer disjointness queries, since to approximate x · y = 0 we must output '0', and if x · y ≠ 0, then no approximation can correctly output '0'. Thus any algorithm that approximates x · y must use Ω(U) bits of communication if x is held by one party and y by the other. Now, we show how this bound applies to the streaming context. Suppose all of x arrives in the stream first, and then y arrives in the stream next. Consider the state (memory contents) held by any algorithm to approximate x · y after x has been seen. Imagine sending all the state to another copy of the algorithm which then receives the stream y. If the algorithm correctly approximates x · y, then the size of the data communicated must be Ω(U) bits. Hence, the space used by the algorithm must be at least Ω(U) bits. □

Nevertheless, the results obtained by this estimation procedure in practice often give very good estimates of the join size between relations. In particular, it tends to significantly outperform solutions based on sampling, which do not give the guarantee of being correct in expectation.

4 Applications and Extensions

The generality of the join size aggregate, and the simplicity and flexibility of the sketching technique, mean that the sketching method has been used as the basis of a wide variety of other streaming algorithms. Rather than attempt to give a comprehensive survey of such techniques, we outline a few examples to illustrate the ways that this data structure has been applied and modified.

4.1 Point Estimation, Range Queries and Wavelets

The problem of point estimation is, given a vector x specified as a data stream, to accept queries which specify a particular entry, i, and to return an estimate of x[i]. Clearly, one cannot give exact answers, or guarantee very fine accuracy, since to do so would allow one to recover the whole vector in the worst case. However, point queries can be well approximated as a corollary of the previous theorem.

Corollary 1 Point queries can be answered using the same sketch structure in space O((1/ε²) log 1/δ) with error less than ε√(F2(x)) with probability at least 1 − δ.

Proof Observe that a point query can be specified as a join size query, x · Ii, where Ii[i] = 1 and Ii[j] = 0 for j ≠ i. Applying Theorem 2, we find that we can answer point-estimation queries with error at most ε√(F2(x)) with probability at least 1 − δ.  □

This is quite a strong guarantee, since typically we may require accuracy in terms of F1(x), and for any x, √(F2(x)) ≤ F1(x). Point estimation is at the heart of many algorithms for finding "heavy hitters": items in the data stream which occur very frequently. (This is explored further in a chapter by Charikar later in this volume.) In a similar way, one can answer arbitrary range queries of the form R(i,j) = Σ_{k=i}^{j} x[k] by reducing this to an inner product with a vector Ii^j, where Ii^j[k] = 1 ⟺ i ≤ k ≤ j, and zero elsewhere. However, the error increases linearly with |j − i + 1|, making the accuracy too weak for large ranges. A standard approach is then to decompose any arbitrary range into at most O(log U) dyadic ranges, each of which has length a power of two, and begins at a multiple of its own length. Each dyadic range query can be treated as a point query, and hence arbitrary range queries can be answered to the same accuracy as point queries, with a blow-up in space polynomial in log U (see, for example, [13]). Computing approximate (Haar) wavelet coefficients and approximate histogram summaries [9] can also make use of point, range and related queries, and hence techniques to approximate the wavelet coefficients of a signal presented in a streaming fashion make extensive use of sketch data structures. (More details of the methods used and the results obtained are given in a chapter by Muthukrishnan and Strauss later in this volume.)
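The dyadic decomposition itself is easy to compute. The helper below (our own illustrative Python, with names of our choosing) splits an arbitrary integer range into at most O(log U) canonical dyadic ranges, each of which would then be posed as a single point query on the sketch kept for that dyadic level.

    def dyadic_decompose(lo, hi):
        """Split the integer range [lo, hi] (inclusive) into dyadic ranges.

        Each returned pair (start, length) has length a power of two and
        start a multiple of that length, so it can be answered as one point
        query on the sketch built for the corresponding dyadic level.
        """
        ranges = []
        while lo <= hi:
            size = 1
            # Grow the block while it stays aligned to lo and fits inside [lo, hi].
            while lo % (2 * size) == 0 and lo + 2 * size - 1 <= hi:
                size *= 2
            ranges.append((lo, size))
            lo += size
        return ranges

    # Example: [3, 12] decomposes into [3,3], [4,7], [8,11], [12,12].
    print(dyadic_decompose(3, 12))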

4.2 Faster Implementations Using Hashing

One potential disadvantage of the sketching scheme is the time cost to process each update. For w = O(1/ε²) and d = O(log 1/δ) copies of the estimate, we must update wd entries in the sketch for every update. For small values of ε this can be a large overhead, and means that such an approach does not easily scale to very high speed data stream environments without special purpose hardware. However, a simple hashing trick has been applied to increase the speed significantly [6–8, 22]. Instead of keeping w copies of the estimate and taking the average of their estimates, the key idea is to use a second hash function f which maps each item i onto [1..w], and only updating the estimate f(i). To produce an estimate of the join or self-join size, the sum of the estimates is used instead of the average. Mathematically this gives the same expectation and variance as the original method, but requires only O(d) estimates to be modified for each update instead of O(wd). This means that the cost is essentially independent of the accuracy parameter ε, and depends only on log 1/δ, which is typically quite small in practice.
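To make the cost difference concrete, here is a minimal illustration of the hashed update (our own Python sketch; Python's built-in tuple hash stands in for the pairwise independent families): each update now touches only d counters rather than all w × d.

    w, d = 1000, 5
    counts = [[0] * w for _ in range(d)]

    def f(j, i):
        # Bucket hash for row j (a stand-in for a pairwise independent family).
        return hash((j, i, 0)) % w

    def s(j, i):
        # +/-1 hash for row j.
        return 1 if hash((j, i, 1)) & 1 else -1

    def update(item, delta=1):
        # Only d counters are touched per update, independent of w (and hence of epsilon).
        for j in range(d):
            counts[j][f(j, item)] += s(j, item) * delta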

4.3 Multidimensional and Spatial Data

We have so far phrased the discussion in terms of join sizes and large vectors. However, one can model other structures with sketches. By appropriate linearization, one can make a sketch of a large matrix of values, and similar higher dimensional structures. Such modeling is necessary in order to estimate the size of multi-way joins (see the chapter of Dobra et al. in this volume). More generally, one can also apply the sketching technique to summarize spatial data. Now, the input consists of (a stream of) objects in low dimensional space (say, two or three dimensions). The notion of a spatial join is to count the number of object pairs that fall within a certain (specified as part of the query) distance of each other. Other natural queries are to ask for the approximate number of points falling within a given range, e.g., a rectangular or cuboid region. Sketch data structures have been applied to these problems [10]. The key technique is to build sketches from the query specifications in such a way that the approximate join size between the query sketch and data sketch is an unbiased estimator for the answer to the query. For example, suppose the input consists of a set of one-dimensional ranges [a...b]. To answer the query of how many intervals contain a given point c, we can simply pose the point query Ic. To count how many ranges overlap with a query range, we can compute the sum of how many of the ranges contain each end point of the query range, and how many of the end points of the ranges are contained within the query range. Assuming that the query end points do not coincide with points of the input (this assumption can be removed with some careful manipulation; see [10]), this query returns exactly twice the number of intersections, and so can be approximated correctly in expectation. In order to bound the error from using sketches, and ensure that updates are fast to process, ranges are not represented directly, but using the "dyadic range decomposition" described in Sect. 4.1. This approach can be generalized from one-dimensional data to points and rectangles in the plane and in three-dimensional space, etc. This shows some nice features of the sketching approach: it is sufficiently general that it

can be applied to a variety of different streaming scenarios. Using techniques such as the dyadic range decomposition, these ideas can be implemented efficiently and with bounded error per update.

4.4 Further Extensions and Applications

We outline some of the many other applications in data streams that sketch-based techniques have found:
• Vector L1 Difference. For some applications, rather than the L2 distance between two vectors, √(F2(x − y)), it is necessary to compute the L1 difference, Σ_{i=1}^U |x[i] − y[i]|. This can be reduced to F2 by representing the vectors in unary notation, since L1(x − y) = F2(x − y) if x and y are binary (zero/one) vectors. However, in order to compute this transformation efficiently, new methods are needed to quickly compute the sum of 4-wise hash functions, rather than explicitly creating the unary representation of large vector entries [11].
• Triangle Counting in Graphs. Many data streams represent graphs, presented as streams of edges, and the streaming challenge is to compute properties of the induced graphs. The number of triangles, also known as the clustering coefficient, occurs in a variety of applications, but seems challenging to compute when the edges forming each triangle can be arbitrarily interleaved in the stream. However, by a careful transformation, the number of triangles can be expressed as a function of appropriately defined frequency moments F0, F1 and F2 [4]. New techniques are required to efficiently update the sketches, as each edge requires a large number of updates to be applied, but these updates can be described concisely as a range of values.
• Change Detection. In many large scale monitoring applications, the fundamental question is whether the current observations are in line with predicted behavior, or whether they appear to be at odds with what was expected. Such applications are generally known as "change detection". A general approach was suggested in [17]: build sketches of recent data, and then apply various standard modeling techniques to combine these into a sketch of the predicted behavior. The new data can then be observed, and tested against the prediction: either in terms of individual item counts or by comparing the whole stream. This approach relies crucially on the linearity properties of the sketch transformation, so the predicted sketch can be obtained by applying the prediction model to the historical sketches. Thus, the whole method can be carried out in small space and at high speeds in the streaming model.
• Quantiles under insertions and deletions. The problem of tracking quantiles over a stream of input items drawn from [1..U] is, given φ, return an item whose rank is (approximately) φN. Various techniques have been proposed for this problem when the stream consists of insertions of items only. However, when the input stream may also contain deletions of items that have previously appeared, these techniques do not apply. Observe that the quantiles query can be restated

as a range query: if we can estimate how many items fall in the range [1..R], then we can binary search for the value of R whose range contains φN points. This range query can be answered using the techniques of Sect. 4.1. The reduction to dyadic ranges makes updates and queries reasonably fast, and the error is bounded by ε log U √(F2(x)). Because deletions can be processed as negative updates to sketches, it is easy to see that deletion operations are handled correctly. (A short code sketch of this binary search appears after this list.) Full details of this approach to finding quantiles using sketches are in [7, 13].
• Tracking Queries over Distributed Streams. Large-scale stream processing applications rely on continuous, event-driven monitoring, and are often inherently distributed, with several remote monitor sites observing their local, high-speed data streams and exchanging information through a communication network. This distribution of the data naturally implies critical communication constraints that typically prohibit continuously centralizing all the streaming updates, due to the volume and speed of the streams, which can easily overwhelm the underlying network. Monitoring queries over such distributed streams raises a host of new challenges. A property of AMS sketches that makes them particularly interesting in this setting is that, due to their linear nature, they are naturally composable through simple vector addition. In other words, given two "parallel" AMS sketches (built using the same 4-wise hash functions) over two different streams, the sketch of the combined stream (i.e., the union of the two streams) is simply the component-wise summation of their sketches. More details on the distributed streaming model and results can be found in a chapter by Garofalakis later in this volume.
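The quantile query from the list above can be illustrated compactly (our own hedged Python; range_count stands in for any sketch-based estimator of how many stream items fall in [1..R], for example one built from the dyadic decomposition of Sect. 4.1, and is not a function defined in this chapter).

    def approximate_quantile(range_count, phi, n, universe_size):
        """Binary search for the smallest R whose prefix [1..R] holds about phi*n items.

        range_count(R) is assumed to return an (approximate) count of items in
        [1..R]; with sketch-based counts the result is an approximate quantile
        whose rank error follows the range-query error bound.
        """
        target = phi * n
        lo, hi = 1, universe_size
        while lo < hi:
            mid = (lo + hi) // 2
            if range_count(mid) < target:
                lo = mid + 1
            else:
                hi = mid
        return lo

    # Example with exact counts over a small multiset, for illustration only.
    data = [1, 2, 2, 3, 5, 5, 5, 8]
    exact = lambda R: sum(1 for v in data if v <= R)
    print(approximate_quantile(exact, 0.5, len(data), 8))  # -> 3 (the median)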

5 Concluding Remarks

The original paper describing the sketching technique discussed here was published in 1996 [3], and showed the F2 application. A subsequent paper in 1999 [2] extended the results to join and self-join size. The original "AMS" paper considers a broad range of problems based on the frequency moments, and has come to be viewed as one of the foundational works on data streams, even though this term is not explicitly used by the authors. In addition to the results on F2, the authors also give space-efficient algorithms for all frequency moments Fk, k ∈ N, and lower bounds for the problems showing that, for k ≥ 6, space polynomial in n is required. This led to a sequence of papers in theoretical computer science which has focussed on improving the upper and lower bounds for the frequency moments problem for k ≥ 3, culminating in recent results showing essentially tight upper and lower bounds for these problems. The result can also be thought of in terms of embedding vectors into lower dimensional spaces. The Johnson–Lindenstrauss lemma [16] proved that there exist embeddings of vectors in Euclidean space into a Euclidean space of (smaller) dimension O((1/ε²) log(1/δ)) which preserve distances up to a (1 ± ε) factor. We can view the sketch technique here as an explicit embedding into a Euclidean-like space (the operations of averaging and median finding mean that we cannot treat the sketches as vectors in Euclidean space), which is computable in a data stream setting with small space and limited (four-wise) randomness. The results of Achlioptas [1] show that, if we assume full randomness, then we can also operate in Euclidean space. Lastly, several efficient implementations of sketch data structures have been made and published on the Internet (e.g., http://www.cs.rutgers.edu/~muthu/massdall-code-index.html). These can be freely modified and used as the basis of more complex data stream algorithms.

References

1. D. Achlioptas, Database-friendly random projections, in Proceedings of ACM Principles of Database Systems (2001), pp. 274–281
2. N. Alon, P. Gibbons, Y. Matias, M. Szegedy, Tracking join and self-join sizes in limited storage, in Proceedings of the Eighteenth ACM Symposium on Principles of Database Systems (1999), pp. 10–20
3. N. Alon, Y. Matias, M. Szegedy, The space complexity of approximating the frequency moments, in Proceedings of the Twenty-Eighth Annual ACM Symposium on the Theory of Computing (1996), pp. 20–29. Journal version in: J. Comput. Syst. Sci. 58, 137–147 (1999)
4. Z. Bar-Yossef, R. Kumar, D. Sivakumar, Reductions in streaming algorithms, with an application to counting triangles in graphs, in Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms (2002), pp. 623–632
5. J.L. Carter, M.N. Wegman, Universal classes of hash functions. J. Comput. Syst. Sci. 18(2), 143–154 (1979)
6. M. Charikar, K. Chen, M. Farach-Colton, Finding frequent items in data streams, in Proceedings of the International Colloquium on Automata, Languages and Programming (ICALP) (2002), pp. 693–703
7. G. Cormode, S. Muthukrishnan, An improved data stream summary: the count-min sketch and its applications, in Latin American Informatics (2004), pp. 29–38
8. G. Cormode, M. Garofalakis, Approximate continuous querying over distributed streams. ACM Trans. Database Syst. 33(2) (2008)
9. G. Cormode, M. Garofalakis, P.J. Haas, C. Jermaine, Synopses for massive data: samples, histograms, wavelets, sketches. Found. Trends® Databases 4(1–3) (2012)
10. A. Das, J. Gehrke, M. Riedewald, Approximation techniques for spatial data, in Proc. of the 2004 ACM SIGMOD Intl. Conference on Management of Data (2004)
11. J. Feigenbaum, S. Kannan, M. Strauss, M. Viswanathan, An approximate L1-difference algorithm for massive data streams, in Proceedings of the 40th Annual Symposium on Foundations of Computer Science (1999), pp. 501–511
12. M. Garofalakis, P.B. Gibbons, Approximate query processing: taming the terabytes, in 27th Intl. Conf. on Very Large Data Bases, Rome, Italy (2001). Tutorial
13. A.C. Gilbert, Y. Kotidis, S. Muthukrishnan, M. Strauss, How to summarize the universe: dynamic maintenance of quantiles, in Proceedings of the International Conference on Very Large Data Bases (2002), pp. 454–465
14. I.J. Good, Surprise indexes and p-values. J. Stat. Comput. Simul. 32, 90–92 (1989)
15. Y.E. Ioannidis, S. Christodoulakis, Optimal histograms for limiting worst-case error propagation in the size of join results. ACM Trans. Database Syst. 18(4), 709–748 (1993)
16. W.B. Johnson, J. Lindenstrauss, Extensions of Lipschitz mapping into Hilbert space. Contemp. Math. 26, 189–206 (1984)
17. B. Krishnamurthy, S. Sen, Y. Zhang, Y. Chen, Sketch-based change detection: methods, evaluation and applications, in Proceedings of the ACM SIGCOMM Conference on Internet Measurement (2003), pp. 234–247

18. E. Kushilevitz, N. Nisan, Communication Complexity (Cambridge University Press, Cambridge, 1997)
19. R. Motwani, P. Raghavan, Randomized Algorithms (Cambridge University Press, Cambridge, 1995)
20. S. Muthukrishnan, Data streams: algorithms and applications, in Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms (2003)
21. M. Thorup, Even strongly universal hashing is pretty fast, in Proceedings of the 11th Annual ACM-SIAM Symposium on Discrete Algorithms (2000), pp. 496–497
22. M. Thorup, Y. Zhang, Tabulation based 4-universal hashing with applications to second moment estimation, in Proceedings of the 15th Annual ACM-SIAM Symposium on Discrete Algorithms (2004), pp. 615–624
23. M.N. Wegman, J.L. Carter, New hash functions and their use in authentication and set equality. J. Comput. Syst. Sci. 22(3), 265–279 (1981)

Top-k Frequent Item Maintenance over Streams

Moses Charikar

1 Introduction

In this chapter, we consider the problem of finding the most frequent items in a data stream. Given a data stream a1, a2, ..., an, where each ai ∈ {1, ..., m}, we would like to identify the items that occur most frequently in one pass over the data stream using a small amount of storage space. Such problems arise in a variety of settings. For example, a search engine might be interested in gathering statistics about its query stream and in particular, identifying the most popular queries. Another application is detecting network anomalies by monitoring network traffic. Throughout this chapter, for ease of exposition, we will assume that the elements of the data stream are integers in [1, m]. In general, the items in the data stream could be more complex objects (e.g., queries to a search engine). However, such complex objects can always be mapped to integers by hashing or other means. Hence, without loss of generality, we may assume that the data stream elements are in fact integers. The brute-force approach to finding frequent elements would involve maintaining counters for every distinct element seen so far. However, this requires far too much storage—linear in the number of distinct elements. In fact, if we insist on determining the most frequent element exactly, using linear storage is unavoidable. Think of the problem of determining the most frequent element in a data stream where all elements are distinct except for one. Determining the one element that repeats requires linear storage. We will prove this formally later. However, if we allow the algorithm some slack and accept some error in the results produced, then ingenious solutions that require limited space are possible. We will discuss several algorithmic approaches developed to solve this problem. In addition to allowing some errors in the output, we will also consider randomized algorithms. Such algorithms will also

M. Charikar (B) Computer Science Department, , Stanford, CA 94305, USA e-mail: [email protected]

have a small failure probability δ, which can usually be made arbitrarily small by increasing the storage space. We will guarantee that the randomized algorithm produces an output within the guaranteed error bounds with probability 1 − δ. First we introduce some notation we will use throughout this chapter. For item i, let mi denote the number of occurrences of item i in the data stream. Let n be the total size of the data stream, n = Σ_i mi. We define the frequency of item i to be the fraction mi/n. We consider different variants of the frequent items problem:
1. Hot items. Given an integer k, we would like to identify all items that occupy more than a 1/k fraction of the data stream. In general, the algorithms we describe for this problem return all hot items, but also return certain items that are not hot. These could be eliminated in a second pass over the data, or it might be the case that such false positives do not affect the application adversely.
2. Top k items with error. Given an error bound, we would like to determine the top k items with a small error in the item counts. The goal of the algorithm is to return a subset F of k items such that for any pair of items i ∈ F, j ∉ F, mj − mi is less than a certain error bound. Here the error bound might be of the form εN, or may be a more complicated function of the distribution of the mi's.
The algorithms we describe to solve such problems will typically maintain data structures which can be used to estimate the item counts with small error. These estimated counts can be used to maintain the approximate top k items in one pass over the data stream. One desirable property that some of the algorithms we describe have is that they can handle item insertions as well as deletions. In our exposition, we mainly focus on the insertions-only setting, but we will point out possible extensions to handle deletions. We describe a variety of different approaches that have been proposed to solve these problems. Our goal is to give a flavor of the various techniques that have been used in this area. In Sect. 2, we describe random sampling based approaches. In Sect. 3, we look at some deterministic approaches. In Sect. 4, we describe approaches that maintain approximate frequency counts for all items by maintaining multiple counters. Finally, in Sect. 5, we show lower bounds on the space requirements for streaming algorithms that identify frequent items.

2 Random Sampling Methods

Given a random sample of size O(1/ε² log(1/δ)), one can determine item frequencies up to an additive error of an ε fraction of the total count of all items with probability 1 − δ. Consider a random sample of size S = O(1/ε² log(1/δ)). For item i, let p̂i denote the fraction of the sample occupied by item i. Let pi denote E[p̂i]. Then pi = mi/n, and |p̂i − pi| ≤ ε with probability 1 − δ. This is an easy consequence of Chernoff bounds. In fact, a random sample of the data stream can be maintained in one pass over the data using reservoir sampling. So this gives a simple randomized method to identify frequent items. In the following sections, we will improve on this basic technique.
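For concreteness, here is a small Python illustration (our own, not from the chapter) of maintaining a reservoir sample of fixed size S in one pass and reading off sample-based frequency estimates; the constants and the synthetic stream are chosen arbitrarily for the example.

    import random
    from collections import Counter

    def reservoir_sample(stream, S, rng=random.Random(0)):
        """Keep a uniform random sample of S elements from a stream in one pass."""
        sample = []
        for t, item in enumerate(stream, start=1):
            if t <= S:
                sample.append(item)
            else:
                j = rng.randrange(t)        # position in [0, t)
                if j < S:
                    sample[j] = item        # replace with probability S/t
        return sample

    gen = random.Random(1)
    stream = [gen.choice([1, 1, 1, 2, 3, 4, 5]) for _ in range(10000)]
    sample = reservoir_sample(stream, S=500)
    counts = Counter(sample)
    # Estimated frequency of each sampled item is its fraction of the sample.
    print({item: c / len(sample) for item, c in counts.most_common(3)})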

2.1 Sticky Sampling

Manku and Motwani [8] proposed a method called sticky sampling to identify frequent items. This improves on the space requirements of the basic random sampling approach. Their method guarantees that with high probability all items with counts at least n/k are identified correctly, no items with counts less than (1/k − ε)n are reported, and further all estimated frequencies are less than the true frequencies by at most εn. If an algorithm has this guarantee on its output, it is said to maintain an ε-deficient synopsis. We give a brief overview of the ideas behind Sticky Sampling, but do not go into details since the performance of this algorithm is superseded by the deterministic methods we will describe next. The idea behind their approach is to maintain a random sample of the data stream, where the sampling rate is adapted to the length of the stream seen so far. Sampling at rate r refers to placing elements in the random sample with probability 1/r. Multiple copies of the same item in the sample are represented using a counter for the item. Let t = (1/ε) log(k/δ). The first 2t elements are sampled at rate r = 1, the next 2t are sampled at rate r = 2, the next 4t elements are sampled at rate r = 4, and so on. As the sampling rate is changed, the old sample is updated to reflect the new sampling rate. This may result in reducing item counter values—if a counter is reduced to 0, the corresponding item is removed from the sample. Manku and Motwani show that the space requirement for this algorithm is (2/ε) log(k/δ) in expectation, and that it computes an ε-deficient synopsis with probability 1 − δ.

3 Deterministic Methods

Misra and Gries [9] proposed a method to find all items of frequency greater than 1/k in one pass. This is a generalization of a well-known linear time algorithm to find a majority element in an array. It is interesting to note that Misra and Gries studied this problem and proposed this algorithm several years before the notion of streaming algorithms was developed. Much later, their result was rediscovered by two sets of authors: Karp, Papadimitriou and Shenker [7] and Demaine, López-Ortiz and Munro [5], who both also gave a more efficient implementation than the one proposed by Misra and Gries. The idea behind the algorithm is to maintain a table of elements and associated counters and update this as the stream is read. Every element occurs at most once in the table and at any point of time, there are at most k distinct elements in the table.

In the end, all elements with frequency strictly larger than 1/k are guaranteed to be present in the table. Additional elements may be present as well. These could be identified and eliminated by performing a second pass over the data. Initially, the algorithm starts with an empty table T . When a new element x of the stream is read, we first check to see if x is present in the table T . If so, we increment its counter by 1. If not, we add the element counter pair (x, 1) to T . Finally, if there are k distinct elements in T , we subtract 1 from each of their associated counters and any elements with counter value 0 are eliminated from the table.
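The following Python sketch is our own rendering of the procedure just described (not the authors' code): it maintains the table as a dictionary, and whenever the table grows to k distinct elements it decrements every counter and drops the ones that reach zero.

    def misra_gries(stream, k):
        """One pass, at most k counters; every item with frequency > 1/k survives."""
        table = {}
        for x in stream:
            if x in table:
                table[x] += 1
            else:
                table[x] = 1
            if len(table) == k:
                # Decrement every counter and drop the ones that reach zero.
                for key in list(table):
                    table[key] -= 1
                    if table[key] == 0:
                        del table[key]
        return table

    # Example: 4 occurs in more than 1/3 of the stream, so it must appear in the table.
    stream = [4, 1, 4, 2, 4, 3, 4, 5, 4, 1, 4]
    print(misra_gries(stream, k=3))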

Claim Any element with frequency strictly larger than 1/k will be present in the table T at the end of the data stream.

Proof We can view the algorithm as maintaining a multiset of elements as follows. Every element of the data stream is added to the current multiset. If there are k distinct elements present currently, one copy of each of the k distinct elements is removed from the multiset. Suppose x is an element which occurs strictly more than n/k times in the data stream. For each of these occurrences of x, a copy of x is added to the multiset maintained by the algorithm and some of these copies might be deleted later. We claim that at most n/k copies are deleted. Note that whenever x is deleted from the multiset, one copy is deleted at a time and this is accompanied by the deletion of k − 1 elements other than x. Since the stream has n elements in all, at most n/k copies of x can be deleted. Thus, at least one copy of x survives in the multiset maintained by the algorithm. 

We mention that this algorithm can easily be adapted to produce an ε-deficient synopsis (mentioned in the previous section) using a table size of 1/ε. Any item i that occurs more than mi = fi × n times in the data stream will have at least (fi − ε)n copies in the table T constructed thus. The modified algorithm reports all items with more than (1/k − ε)n copies in T. This includes items with actual counts greater than n/k, and no items with count less than (1/k − ε)n. We also mention that another deterministic approach called Lossy Counting was proposed by Manku and Motwani [8]. In fact, the algorithm is similar in several aspects to another algorithm proposed by Misra and Gries [9] for identifying items of frequency greater than 1/k. However, Misra and Gries were unable to analyze the space requirements of this algorithm, while Manku and Motwani showed that their Lossy Counting algorithm requires space k log(N/k). Although the worst-case space requirements of this algorithm are worse than the deterministic algorithm presented earlier, Manku and Motwani point out that Lossy Counting may perform better in practice depending on the distribution of items in the data stream. We omit the details of the Lossy Counting algorithm and refer the interested reader to [8].

4 Randomized Multiple Counter Methods

Several methods have been proposed where the algorithm maintains a collection of counters incrementally as it scans the data stream. Every element of the data stream causes an update to several counters. Collectively, the set of counters can be used to estimate the number of times a data item has been seen so far. In other words, the algorithm maintains a data structure that estimates the number of occurrences so far for any data item. These estimates have additive errors which can be bounded as a function of the characteristics of the data stream. Ideally, if we had exact counts as we scanned the data stream, these could be used to maintain an exact list of the k most frequent items in the data stream. Similarly, the noisy estimates from the approximate counting data structure can be used to maintain an approximate list of the most frequent elements. The guarantee provided by such an algorithm is that the frequency of any of the k elements reported by the algorithm is at least the frequency of the kth most frequent element minus the error guarantee of the count data structure. In addition to the additive errors, the data structures also have a failure probability δ for a single query, i.e., with probability δ, we cannot say anything meaningful about the estimate produced. In using such an approximate counting data structure for estimating frequent items, we will need to query the data structure for every element of the data stream. In order to be conservative, if any of these queries fails, we will say that the frequent item estimation has failed. If we want failure probability δ′ overall, we set the failure probability for an individual query to be δ = δ′/n. By the union bound, the probability that any one of the n queries to the data structure fails is at most nδ = δ′. In the subsequent analysis, the time and space requirements of the data structures are expressed in terms of the failure probability δ of an individual query. Typically, the dependence on δ is of the form log(1/δ). In order to translate this into a failure probability δ′ for the overall frequent items computation, we need to substitute δ = δ′/n. This usually gives a dependence of the form log(n/δ′) = log n + log(1/δ′). The first two data structures we present are the Count-Min sketch and the Count Sketch. The Count-Min sketch was actually developed after the Count Sketch, but is simpler to explain and analyze. It also has some common features with the Count Sketch, so we present it first. All the algorithms we present have the property that they maintain synopsis data structures consisting of a collection of counters that are additive. Thus element deletions can be handled by performing the reverse operation to that performed when adding an element—decreasing a counter instead of increasing it. While the first two data structures (the Count-Min sketch and the Count Sketch) permit updates to the data structure to reflect deletions, the data structure itself does not contain information that allows the identification of frequent items. In other words, approximate counts can be obtained, but the determination of frequent items requires some additional information to be stored. In one pass over the data, frequent items can be identified while performing updates to the data structure in an insertions-only scenario. This can be done by maintaining a heap containing the top k elements seen so far and their estimated counts from the synopsis data structure.
When a new element of the stream is encountered, the synopsis data structure is updated. If the new element is already in the heap, its count is incremented. Otherwise, we obtain an estimate of its count from the synopsis data structure. If this estimate exceeds the smallest estimated count in the heap, the new element is added to the heap and the element with the smallest count is evicted.
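As an illustration of this maintenance step (our own hedged Python; sketch.update and sketch.estimate are assumed methods of whichever counting synopsis is in use, such as the Count-Min sketch described below), one pass over an insert-only stream can track an approximate top-k set:

    def approximate_top_k(stream, sketch, k):
        """Maintain an approximate top-k set in one pass (insertions only).

        sketch.update(x) and sketch.estimate(x) are assumed methods of the
        counting synopsis; a production version would keep `counts` in a
        min-heap rather than scanning for the minimum.
        """
        counts = {}                               # item -> estimated count
        for x in stream:
            sketch.update(x)
            if x in counts:
                counts[x] += 1
            else:
                est = sketch.estimate(x)
                if len(counts) < k:
                    counts[x] = est
                elif est > min(counts.values()):
                    evicted = min(counts, key=counts.get)
                    del counts[evicted]
                    counts[x] = est
        return counts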

Fig. 1 The array of counters used by the Count-Min sketch data structure, and its maintenance algorithms. Shaded cells indicate the counter values incremented for an item i

With both inserts and deletes, a separate pass would be required to actually identify the frequent items. The last approach we describe produces a data structure that can be updated with insertions and deletions of elements, but also allows identification of candidate frequent elements.

4.1 Count-Min Sketch

The Count-Min Sketch was introduced by Cormode and Muthukrishnan [4]. The data structure (see Fig. 1) consists of a two-dimensional array of counters with width w and depth d, i.e., there are w × d counters count[i, j] for 1 ≤ i ≤ w and 1 ≤ j ≤ d. Each of the counters is initially set to zero. The data structure takes two parameters ε and δ that control the error guarantee, with w = ⌈2/ε⌉ and d = ⌈log 1/δ⌉. In addition, d hash functions h1, ..., hd : {1, ..., m} → {1, ..., w} are chosen uniformly at random from a pairwise independent family. When a new item i is seen, one counter in each row is incremented, where the counter in row j is determined by hj. In other words, count[j, hj(i)] is incremented by 1. At any point of time, the estimated count for item i is given by m̂i = min_j count[j, hj(i)]. The following theorem gives guarantees on the estimates m̂i.
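A compact Python rendering of this structure follows (our own illustrative code; Python's tuple hash stands in for the pairwise independent hash family, and the class and method names are our own). It also fits the assumed sketch interface used in the heap-maintenance snippet above.

    import math

    class CountMinSketch:
        def __init__(self, eps, delta):
            self.w = math.ceil(2 / eps)
            self.d = math.ceil(math.log2(1 / delta))
            self.count = [[0] * self.w for _ in range(self.d)]

        def _h(self, j, i):
            # Stand-in for the j-th pairwise independent hash function.
            return hash((j, i)) % self.w

        def update(self, i, delta=1):
            for j in range(self.d):
                self.count[j][self._h(j, i)] += delta   # deletions: delta = -1

        def estimate(self, i):
            # m_i <= estimate, and estimate <= m_i + eps*n with probability 1 - delta.
            return min(self.count[j][self._h(j, i)] for j in range(self.d))

    cms = CountMinSketch(eps=0.01, delta=0.01)
    for x in [3, 1, 3, 2, 3, 3]:
        cms.update(x)
    print(cms.estimate(3))   # at least 4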

Theorem 1 Let mi be the number of times item i occurs in the data stream and let n denote the total number of items seen so far. Then mi ≤ m̂i, and with probability at least 1 − δ, m̂i ≤ mi + εn.

Proof Consider the indicator variable Ii,j,k which is 1 if and only if i ≠ k and hj(i) = hj(k), and 0 otherwise. By the pairwise independence of the hash functions hj,

    E[Ii,j,k] = Pr[hj(i) = hj(k)] ≤ ε/2.

The last inequality follows from the fact that the range of hj has size at least 2/ε. Define the random variable Xi,j = Σ_{k=1}^m Ii,j,k mk. Note that count[j, hj(i)] = mi + Xi,j. Clearly, m̂i = min_j count[j, hj(i)] ≥ mi. For the other direction, note that

    E[Xi,j] = E[ Σ_{k=1}^m Ii,j,k mk ] ≤ Σ_{k=1}^m mk E[Ii,j,k] ≤ (ε/2) n

by linearity of expectation. By the Markov inequality,

    Pr[m̂i > mi + εn] = Pr[∀j: count[j, hj(i)] > mi + εn]
                      = Pr[∀j: mi + Xi,j > mi + εn]
                      ≤ Pr[∀j: Xi,j > 2 E[Xi,j]] < 2^{−d} ≤ δ.  □

4.2 Count Sketch

The Count Sketch was introduced by Charikar, Chen and Farach-Colton [2]. These techniques are particularly suited to heavy-tailed distributions and significantly outperform other methods discussed for such distributions, e.g., for Zipfian distributions with exponent less than 1. Before we go into the details of the data structure and its analysis, we give a simpler scheme for estimating the most frequent element. This will serve as a motivation for the full construction later.

Estimating the Most Frequent Element

Consider the following scheme for estimating the count of the most frequent element. We maintain a single counter c that is initially zero. We pick a hash function s : {1, ..., m} → {+1, −1} from a pairwise independent family. When item i is seen, we add s(i) to the counter. At any point of time, the counter value is c = Σ_{i=1}^m mi s(i), where mi is the number of occurrences of item i so far. The estimate m̂i of the count of i is simply s(i) × c.

Lemma 1 E[m̂i] = mi and Var[m̂i] = (Σ_{j=1}^m mj²) − mi².

Proof Note that E[s(i)s(j)] = 0 for i ≠ j by the pairwise independence of s. This is the crucial fact underlying the analysis.

    m̂i = s(i) Σ_{j=1}^m mj s(j),
    E[m̂i] = Σ_{j=1}^m mj E[s(i)s(j)] = mi.

The last equality follows from the pairwise independence of s. Now, we analyze the variance of m̂i:

    (m̂i)² = ( Σ_{j=1}^m mj s(j) )²    [since s(i)² = 1],
    E[(m̂i)²] = Σ_{j,k} mj mk E[s(j)s(k)]
             = Σ_j mj² + Σ_{j≠k} mj mk E[s(j)s(k)] = Σ_j mj²,
    Var[m̂i] = E[(m̂i)²] − (E[m̂i])² = Σ_j mj² − mi².  □

In general, the variance of this estimator may be too high. This is easily addressed by taking multiple independent copies of the estimator and averaging them. Suppose we take d independent copies of this estimator and return the average of their values. Then the following lemma shows that the estimate obtained is tightly concentrated around the mean, where the concentration can be made tighter by increasing d.

Lemma 2 Let m̂i^(d) be the estimator obtained by averaging d independent copies of the estimator above. If the variance of each individual estimator is V, then E[m̂i^(d)] = mi, Var[m̂i^(d)] = V/d, and further, with probability 1 − δ, |m̂i^(d) − mi| < √((V/d) log(1/δ)).

Notice that the error guarantee depends on the sum of the squares of the counts of the items. This makes this technique especially attractive for heavy-tailed distributions, where the most frequent element occupies a relatively small fraction of the data stream. Consider, for example, a Zipfian distribution with exponent 1/2. In this case, the most frequent element occupies an O(1/√m) fraction of the data stream. In order to detect this, the other methods we discussed would require Ω(√m) space. However, the method discussed here would identify the most frequent element with high probability using only O(log m) space.

Estimating the Top k Elements

We now adapt the ideas presented in the previous section to identifying the top k elements. The main obstacle in doing this is that the error of the estimator presented previously might be too high for identifying the lowest frequency element we are interested in—the kth most frequent element. The problem is that the variance is affected by the sum of squares of the counts of the top k items. To address this issue, we replace each counter by a row of counters, producing a two-dimensional array of counters. Each item will now update exactly one counter in every row, chosen by a hash function. By doing this, we ensure that high frequency elements typically update different counters in a row.

Fig. 2 The array of counters used by the Count Sketch data structure, and its maintenance algorithms. Shaded cells indicate the counter values modified for an item i

Consider the kth high frequency element that was potentially problematic before. The counter it is assigned to in every row is typically not affected by any other high frequency element, i.e., none of the other top k items are assigned to the same counter. Thus, the variance associated with this counter is usually not high. Note that we cannot guarantee this, but we can pick the number of counters in each row to be large enough so that this happens with high probability. However, there is still a small probability that the counter in a row is tainted by other high frequency elements. If we average counter values in each row as before, a small number of tainted counters could affect the final estimate adversely. To get around this problem, we take the median of the counters assigned to an element across the rows. As long as less than half the rows have tainted counters, the median estimator will be good. The data structure (see Fig. 2) consists of a two-dimensional array of counters with width w and depth d, i.e., there are w × d counters count[i, j] for 1 ≤ i ≤ w and 1 ≤ j ≤ d. Here d = O(log(1/δ)), where δ is a parameter that specifies the desired error probability, and w controls the additive error in the frequency estimates; the dependence is somewhat complex, and w must be set large enough to make the additive error acceptably small. Each of the counters is initially set to zero. d hash functions h1, ..., hd : {1, ..., m} → {1, ..., w} are chosen uniformly at random from a pairwise independent family. In addition, d hash functions s1, ..., sd : {1, ..., m} → {+1, −1} are chosen uniformly at random from a pairwise independent family. When a new item i is seen, one counter in each row is incremented or decremented, where the counter in row j is determined by hj. The value sj(i) is added to counter count[j, hj(i)]. At any point of time, the estimated count for item i is computed as follows. The estimated count of item i for row j is given by m̂i(j) = sj(i) × count[j, hj(i)]. The overall estimated count for item i is given by m̂i = median_j m̂i(j). It will be convenient to express the error guarantee in terms of the following parameter γ:

    γ = √( Σ_{i > w/8} (mi)² / w ).                                   (1)
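A short Python sketch of this data structure follows (our own illustrative rendering; Python's tuple hash stands in for the pairwise independent families hj and sj, and the names are our own):

    from statistics import median

    class CountSketch:
        def __init__(self, w, d):
            self.w, self.d = w, d
            self.count = [[0] * w for _ in range(d)]

        def _h(self, j, i):                      # bucket hash for row j
            return hash((j, i, 0)) % self.w

        def _s(self, j, i):                      # +/-1 hash for row j
            return 1 if hash((j, i, 1)) & 1 else -1

        def update(self, i, delta=1):
            # One counter per row is incremented or decremented.
            for j in range(self.d):
                self.count[j][self._h(j, i)] += self._s(j, i) * delta

        def estimate(self, i):
            # Median over rows of s_j(i) * count[j, h_j(i)].
            return median(self._s(j, i) * self.count[j][self._h(j, i)]
                          for j in range(self.d))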

The following theorem gives guarantees on the estimates mˆi .

Theorem 2 With probability 1 − δ, |m̂i − mi| ≤ 8γ.

Proof We will first prove that with constant probability, the counter in any single row provides a good estimate of mi. Then, we will use this to show that the median of counter values over all rows is a good estimator. Let Aj(i) be the set of items that map to the same counter that item i maps to in the jth row of the data structure, i.e., Aj(i) = {i′ : i′ ≠ i, hj(i′) = hj(i)}. Let A>j(i) be the elements of Aj(i) other than the w/8 most frequent elements. Let vj(i) = Σ_{i′∈Aj(i)} (mi′)²; v>j(i) is defined analogously for A>j(i). We first claim that E[m̂i(j)] = mi and Var[m̂i(j)] = vj(i):

    count[j, hj(i)] = mi sj(i) + Σ_{i′∈Aj(i)} mi′ sj(i′),
    m̂i(j) = sj(i) × count[j, hj(i)] = mi + Σ_{i′∈Aj(i)} mi′ sj(i′) sj(i),
    E[m̂i(j)] = mi + Σ_{i′∈Aj(i)} mi′ E[sj(i′) sj(i)] = mi,
    Var[m̂i(j)] = E[(m̂i(j) − mi)²] = E[ Σ_{i1,i2∈Aj(i)} mi1 mi2 sj(i1) sj(i2) ]
               = Σ_{i′∈Aj(i)} (mi′)² + Σ_{i1≠i2∈Aj(i)} mi1 mi2 E[sj(i1) sj(i2)]
               = Σ_{i′∈Aj(i)} (mi′)² = vj(i).

Further, we claim that

    E[v>j(i)] ≤ Σ_{i′>w/8} (mi′)² / w.

Note that

    v>j(i) = Σ_{i′∈Aj(i), i′>w/8} (mi′)²,
    E[v>j(i)] = Σ_{i′>w/8, i′≠i} (mi′)² Pr[i′ ∈ Aj(i)] ≤ Σ_{i′>w/8} (mi′)² × (1/w),

where the last inequality follows from the pairwise independence of hj. We will say that m̂i(j) is good if |m̂i(j) − mi| ≤ 8γ. Now we show that this happens with probability strictly greater than 1/2. To do this, we define three bad events and show that each of them occurs with small probability. If none of them occur, then the estimator m̂i(j) is good. Since the probability that any element i′ is present in Aj(i) is 1/w (by pairwise independence of hj), the probability that Aj(i) contains any of the top w/8 elements is at most 1/8 (by the union bound). Call this event COLLISIONSj(i). Note that if COLLISIONSj(i) does not occur, then A>j(i) = Aj(i) and v>j(i) = vj(i). Since E[v>j(i)] ≤ Σ_{i′>w/8}(mi′)²/w, the probability that v>j(i) is greater than 8 Σ_{i′>w/8}(mi′)²/w is less than 1/8 by Markov's inequality. Call this event HIGH-VARIANCEj(i). Note that if COLLISIONSj(i) and HIGH-VARIANCEj(i) both do not occur, then Var[m̂i(j)] = vj(i) = v>j(i) ≤ 8 Σ_{i′>w/8}(mi′)²/w. By Chebyshev's inequality,

    Pr[ |m̂i(j) − mi| > √(8 Var[m̂i(j)]) ] < 1/8.

Call this last event LARGE-DEVIATIONj(i). The probability of each of the three bad events we defined is at most 1/8. By the union bound, with probability at least 1 − 1/8 − 1/8 − 1/8 = 5/8, none of the events COLLISIONSj(i), HIGH-VARIANCEj(i), LARGE-DEVIATIONj(i) occur. In this case,

    |m̂i(j) − mi| ≤ √( 64 Σ_{i′>w/8} (mi′)² / w ) = 8γ.

To finish the proof, we need to show that |m̂i − mi| ≤ 8γ with probability at least 1 − δ. Recall that m̂i = median_j m̂i(j). If an index j ∈ [1, d] satisfies the property that m̂i(j) is good, then we call j a good index. If we can show that more than d/2 of the indices j ∈ [1, d] are good with high probability, this would imply the required error bound for m̂i. This follows easily from Chernoff bounds, since the expected number of good indices is at least 5d/8 and d = Ω(log(1/δ)).  □

4.3 Group-Testing Approaches

Cormode and Muthukrishnan [3] gave an algorithm for identifying frequent items based on group-testing ideas. The goal is to identify all items which occur with frequency larger than 1/k. The techniques presented here are applicable to data streams with both insert and delete operations. To motivate the technique, we first describe a scheme to find a majority element, i.e., an element with frequency more than 1/2. Consider the elements 1, ..., m represented in binary. Let r = ⌈log₂ m⌉. For each bit position j ∈ {1, ..., r}, we consider the subset of items Sj consisting of all elements that have a 1 in bit position j. Also define S0 to be the set of all items. We maintain r + 1 counters C0, ..., Cr, initially zero, where Cj is the total count of all elements in set Sj. Clearly these counters can be maintained incrementally in one scan of the data stream, by incrementing appropriate counters when an element is added and decrementing counters when an element is deleted. (We will assume that the total number of deletions of an element never exceeds the number of insertions.) How do we identify a majority element using these counters? Suppose there is a majority element x. Each bit of the binary representation of this element can be deduced from the set of counters maintained. Note that C0 is the total count of all elements in S0, i.e., all the elements. Consider the jth bit of the majority element x. If this bit is 1, then x ∈ Sj, and the value of Cj exceeds C0/2. If this bit is 0, then x ∉ Sj, and the value of Cj is less than C0/2. Thus comparing the value Cj to C0/2 enables us to identify the jth bit of the majority element. Note that this scheme operates correctly if there is indeed a majority element. If no such element exists, it will still report a majority element, i.e., it comes up with false positives. Now we adapt this scheme to identify elements whose frequency exceeds 1/k. Note that there are fewer than k such elements—call them hot items. The main idea is to divide the elements into a small number of (overlapping) subsets and run a modification of the majority identification scheme for each subset. The collection of subsets is constructed in such a way that for every hot item x, there is a subset for which x is the majority element. To achieve this goal, for every hot item x, we ensure that there is a good set S containing x such that the total count of items (other than x) in S is less than a 1/k fraction of the total count of all elements. If such a good set S exists, x will indeed be the majority element for that set S and will be identified. In order to control false positives, we would like to ensure that items that are reported by this scheme have high frequency. Ideally, we would like to guarantee that all items reported have frequency at least 1/k. We relax this slightly to say that for each item reported by this scheme, with high probability, it has frequency at least (1 − ε)/k. In order to ensure this, before reporting x as a hot item, we check that the count of each set containing x is at least a 1/k fraction of the total count. Clearly, this condition will be satisfied if x is a hot item. Suppose x has frequency less than (1 − ε)/k. Consider a subset S containing x. We say that S is light for x if the total count of all items (other than x) in S is less than an ε/k fraction of the total count. Now if x has frequency less than (1 − ε)/k and a set S is light for x, then x will not be reported as a hot item (because the count of elements in S is less than a 1/k fraction of the total count).
To control false positives, we would like to ensure that for any item x, there is at least one light set for x. (Note that requiring the existence of a light set is a stronger condition than requiring the existence of a good set.) How do we construct such a collection of subsets? In fact, the subsets are constructed by a randomized algorithm which ensures the above properties with high probability. In other words, for any set of at most k hot items, the randomized construction of subsets ensures that with probability 1 − δ, every hot item x has a good subset S containing it. In addition, for any item x, we ensure that with high probability, there is a light set for x. This implies that for each item reported by this scheme, with high probability, it has frequency at least (1 − ε)/k. Roughly speaking, the subsets are constructed at random by placing elements in them with probability ε/2k. This has the effect that the expected count of all elements in the set is an ε/2k fraction of the total count. More precisely, the subsets are constructed using the universal hash functions given by Carter and Wegman. We fix a prime p ∈ [m, 2m], draw a uniformly at random from [1, p − 1] and b uniformly at random from [0, p − 1]. Let W = 2k/ε for ε ≤ 1 (assume that ε is chosen so that W is an integer). Consider the hash function ha,b(x) = ((ax + b mod p) mod W). This hash function satisfies the property that

Proposition 1 Over all choices of a and b, for x ≠ y, Pr[ha,b(x) = ha,b(y)] ≤ ε/2k.

We use this hash function family to define the subsets Sa,b,i = {x | ha,b(x) = i}. Note that there are 2k/ε such sets for every choice of a and b, and every element belongs to one of the 2k/ε sets. For every pair of items, for randomly chosen a and b, the probability that both items are in the same set Sa,b,i is at most ε/2k. In fact, we choose log(k/δ) such pairs a, b and construct 2k/ε sets corresponding to each of them. This gives a total of (2k/ε) log(k/δ) sets. We show that this construction has the desired properties with high probability.

Lemma 3 With probability at least 1 − δ, every hot item belongs to at least one good subset.

Proof Consider a hot item x. For a random choice of a, b, suppose x ∈ Sa,b,i. The expected total count of other items that land in the same set Sa,b,i is at most an ε/2k fraction of the total count of all items. Now, with probability at most ε/2 ≤ 1/2, the total count of all such elements is greater than a 1/k fraction of the total count—this follows from Markov's inequality. In other words, with probability at least 1/2, the total count of all elements other than x that fall into Sa,b,i is at most a 1/k fraction of the total count. Thus, the set Sa,b,i is good for x with probability at least 1/2 (over the random choice of a, b). Since we pick log(k/δ) independent random a, b pairs, the probability that none of them yields a good set for x is at most (1/2)^{log(k/δ)} = δ/k. We consider this a failure for the hot item x. Now, there are at most k hot items. By the union bound, the probability that there is a failure for any of the hot items is at most k × (δ/k) = δ. Hence, with probability at least 1 − δ, every hot item belongs to at least one good subset.  □

Lemma 4 For any item x, with probability 1 − δ/k, there is a light set for x.

Proof Consider an item x. We will show that there is no light set for x with probability at most δ/k. The proof is very similar to the previous one. For a random choice of a, b, suppose x ∈ Sa,b,i. The expected total count of other items that land in the same set Sa,b,i is at most an ε/2k fraction of the total count of all items. Now, with probability at most 1/2, the total count of all such elements is greater than an ε/k fraction of the total count—this follows from Markov's inequality. Thus, the set Sa,b,i is light for x with probability at least 1/2 (over the random choice of a, b). Since we pick log(k/δ) independent random a, b pairs, the probability that none of them is light for x is at most (1/2)^{log(k/δ)} = δ/k.  □

We now put all the pieces together and describe the operation of the final algorithm. We pick T = log(k/δ) random pairs a, b as described before. These implicitly define (2k/ε) log(k/δ) subsets of items. For each such subset, we maintain 1 + ⌈log m⌉ counters, similar to the majority identification scheme discussed earlier. For each subset, we have one counter corresponding to each bit position and one corresponding to all the items in the subset. In addition, we have one common counter C for the total count of all the items 1, ..., m. The purpose of this common counter is to help reduce false positives. When an item x is encountered, we determine the T sets it belongs to by evaluating the hash function ha,b(x) for the T pairs a, b. For each such set, we update the 1 + ⌈log m⌉ counters for that set to reflect an insertion or a deletion of x, as the case may be. In addition, we update the common counter C for all items. As explained before, the subset construction ensures that with high probability, every hot item is contained in a good subset. If indeed a hot item x is contained in a good subset, it is the majority element in that subset and will be detected correctly by the majority detection scheme for that subset. The potential problem is that several items other than the hot items may also be reported. The subset construction ensures that for every item x, with high probability, there is a light set for x. Recall that before reporting a hot item, we check that the count for each set containing the item is at least a 1/k fraction of the total count. This ensures that for every item reported, with high probability, the frequency is at least (1 − ε)/k. In order to further alleviate false positives, we can perform some simple tests to eliminate some of the sets which are not good. For every subset S, recall that we maintain a set of counters C0, ..., Cr. In addition, we have a common counter C. If S is a good set, we claim that for every j ∈ [1, r], exactly one of Cj and C0 − Cj must be greater than C/k. We first explain why the failure of this condition implies that S is not a good set. If neither of Cj and C0 − Cj is greater than C/k, then clearly S does not contain a hot item. Further, if both of them are greater than C/k, then either S contains two hot items, or S contains only one hot item but the sum of counts of the remaining items is greater than a 1/k fraction of the total count. In all these cases, S is not a good set.
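A hedged Python sketch of this group-testing structure is given below (our own illustration with concrete small parameters; a full implementation along the lines of [3] would additionally verify each decoded candidate against every set it hashes to, and apply the Cj versus C0 − Cj pruning test described above).

    import math, random

    class GroupTester:
        def __init__(self, m, k, eps, delta, seed=0):
            rng = random.Random(seed)
            self.m, self.k = m, k
            self.W = int(2 * k / eps)                          # sets per (a, b) pair
            self.T = max(1, math.ceil(math.log2(k / delta)))   # number of (a, b) pairs
            self.r = max(1, math.ceil(math.log2(m)))           # bits per item
            self.p = self._prime_at_least(m)
            self.ab = [(rng.randrange(1, self.p), rng.randrange(self.p))
                       for _ in range(self.T)]
            # counters[t][s] = [C0, C1, ..., Cr] for set s of pair t
            self.counters = [[[0] * (self.r + 1) for _ in range(self.W)]
                             for _ in range(self.T)]
            self.C = 0                                          # common counter

        def _prime_at_least(self, m):
            def is_prime(n):
                return n > 1 and all(n % q for q in range(2, int(n ** 0.5) + 1))
            n = m
            while not is_prime(n):
                n += 1
            return n

        def update(self, x, delta=1):
            # Works for insertions (delta=+1) and deletions (delta=-1).
            self.C += delta
            for t, (a, b) in enumerate(self.ab):
                s = ((a * x + b) % self.p) % self.W
                cnt = self.counters[t][s]
                cnt[0] += delta                                 # count of the whole set
                for j in range(1, self.r + 1):
                    if (x >> (j - 1)) & 1:
                        cnt[j] += delta                         # bit-position counters

        def hot_candidates(self):
            out = set()
            for t in range(self.T):
                for s in range(self.W):
                    cnt = self.counters[t][s]
                    if cnt[0] * self.k <= self.C:
                        continue                                # set too light to hold a hot item
                    # Decode the candidate bit by bit, as in the majority scheme.
                    x = 0
                    for j in range(1, self.r + 1):
                        if cnt[j] * self.k > self.C:
                            x |= 1 << (j - 1)
                    out.add(x)
            return out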

5 Lower Bounds

In this section, we describe lower bounds on the space requirements for computing frequent elements exactly. These bounds were obtained by Alon, Matias and Szegedy [1] using a theorem from communication complexity. Informally, communication complexity studies a setting where two parties, A with input x and B with input y, want to jointly compute a function f(x,y). Communication complexity measures the minimum number of bits that need to be exchanged by A and B in order to compute f(x,y). This is a commonly used paradigm to show lower bounds on the space requirements of streaming algorithms. Imagine the input stream to be broken into two parts—the first half x and the second half y. Then the final output of the streaming algorithm is a function of both x and y—call this f(x,y). Consider two players A and B as follows: A runs the algorithm on x (the first half of the input) and B runs the algorithm on y (the second half of the input) using the output produced by A. The size of the storage space controls the communication from A to B (there is no communication in the other direction). Hence, a lower bound on the communication complexity of f(x,y) is a lower bound on the storage space required by the algorithm. We first give a simple counting argument to show that a deterministic algorithm that outputs the most frequent element (exactly) requires Ω(m) space. This illustrates the basic idea underlying the lower bound for randomized algorithms.

Theorem 3 Any deterministic algorithm to produce the most frequent element exactly requires Ω(m) bits of storage.

Proof Suppose we have a deterministic algorithm to determine the most frequent element using s bits. We will show that s = Ω(m). Consider the input stream to be constructed as follows: let S be a set of m/2 elements from [1, m], and let x1, x2 be a pair of elements from [1, m] such that exactly one of x1, x2 is in S. The input consists of the elements of S in arbitrary order followed by the elements x1, x2. The input is constructed so that one of x1 or x2 is the most frequent element in the data stream (depending on which of x1, x2 ∈ S), since this element occurs twice and all other elements occur at most once. We claim that any deterministic algorithm that correctly identifies the repeated element for all inputs of the form (S, x1, x2) must use Ω(m) bits. Consider the state of the storage after the m/2 elements of S have been seen. We claim that the state of the storage must be different for every possible set S. This would imply that 2^s ≥ (m choose m/2), and hence s = Ω(m). Why should the state of the storage be different for every set S? Suppose the state is the same for two distinct sets S1, S2. Then consider x1 ∈ S1 − S2 and x2 ∈ S2 − S1. The input (S1, x1, x2) has x1 as the most frequent element, while the input (S2, x1, x2) has x2 as the most frequent element. Consider the operation of the algorithm on the two inputs (S1, x1, x2) and (S2, x1, x2). By our assumption, after either S1 or S2 is processed initially, the state of the storage is the same. Beyond this point, the algorithm sees the same input (x1, x2). Hence the final output of the algorithm is the same for both inputs. This means that the algorithm makes a mistake on one of the inputs.  □

The basic idea in the above proof was to construct the input from two sets: a set S of m/2 elements and a two-element set {x1, x2} which intersects S in exactly one element. Determining the most frequent element for this data stream amounts to computing the intersection of these two sets. A lower bound on the storage space can be derived by considering a communication complexity setting with two players, A and B, where player A has the first set, player B has the second set, and the goal is to compute the intersection of the two sets. We showed that computing this intersection correctly for all such inputs requires Ω(m) bits of communication from A to B. This in turn means that a deterministic algorithm for frequent elements needs Ω(m) bits of storage space. How about randomized algorithms for computing frequent elements? The same proof idea works: one can construct inputs by concatenating the elements of two sets such that estimating the most frequent element for those inputs amounts to computing the intersection of the two sets. A theorem of Kalyanasundaram and Schnitger [6] shows that computing such an intersection requires Ω(m) bits of communication.1 Alon, Matias and Szegedy [1] used the result of [6] to show the following bound for randomized algorithms.

Theorem 4 Any randomized algorithm to produce the most frequent element correctly with high probability requires Ω(m) bits of storage.

We also describe an argument along similar lines given by Karp, Shenker and Papadimitriou [7], which shows that any streaming algorithm that computes hot items exactly, i.e., identifies exactly the set of items with counts exceeding n/k, must use Ω(m log(n/m)) space. In this proof, we assume that n > 4m > 16k.

Theorem 5 Any streaming algorithm that identifies exactly the set of items with counts exceeding n/k needs Ω(m log(n/m)) bits of storage in the worst case.

Proof We will construct a large set K of sequences of length n containing m distinct items such that no item occupies more than a 1/k fraction of the sequence. Consider the operation of the streaming algorithm on the sequences in K. We claim that the streaming algorithm must reach a distinct memory configuration for each of the sequences in K. Suppose to the contrary that there are two distinct sequences P and Q that result in the same configuration. Then there must be some element x that occurs a different number of times in P and Q. It is then possible to construct a sequence R that can be appended to P and Q such that exactly one of the sequences PR and QR has x as a hot item, i.e., an item that occupies more than a 1/k fraction of the sequence. Since the algorithm reached the same memory configuration on processing P and Q, it must produce the same output for the sequences PR and QR, and hence is incorrect on some input. This argument implies that the storage space used by the algorithm is Ω(log |K|). In fact, such a set K can be constructed with |K| ≥ ⌊n/(2m)⌋^{m−1}, which implies a lower bound of Ω(m log(n/m)) on the storage space. □

References

1. N. Alon, Y. Matias, M. Szegedy, The space complexity of approximating the frequency moments. J. Comput. Syst. Sci. 58(1), 137–147 (1999)

1 Strictly speaking, their lower bound applies to the decision problem of checking whether the two sets intersect or not. However, it is easy to transform pairs of sets into an input for the frequent items algorithm such that the most frequent item is different depending on whether the sets intersect or not.

2. M. Charikar, K. Chen, M. Farach-Colton, Finding frequent items in data streams. Theor. Comput. Sci. 312(1), 3–15 (2004)
3. G. Cormode, S. Muthukrishnan, What's hot and what's not: tracking most frequent items dynamically, in Proceedings of PODS (2003), pp. 296–306
4. G. Cormode, S. Muthukrishnan, An improved data stream summary: the count-min sketch and its applications, in Proceedings of LATIN (2004), pp. 29–38
5. E.D. Demaine, A. López-Ortiz, J.I. Munro, Frequency estimation of internet packet streams with limited space, in Proceedings of ESA (2002), pp. 348–360
6. B. Kalyanasundaram, G. Schnitger, The probabilistic communication complexity of set intersection. SIAM J. Discrete Math. 5(4), 545–557 (1992)
7. R.M. Karp, S. Shenker, C.H. Papadimitriou, A simple algorithm for finding frequent elements in streams and bags. ACM Trans. Database Syst. 28, 51–55 (2003)
8. G.S. Manku, R. Motwani, Approximate frequency counts over data streams, in Proceedings of VLDB (2002), pp. 346–357
9. J. Misra, D. Gries, Finding repeated elements. Sci. Comput. Program. 2, 143–152 (1982)

Distinct-Values Estimation over Data Streams

Phillip B. Gibbons

1 Introduction

Estimating the number of distinct values in a data set is a well-studied problem with many applications [1–34]. The statistics literature refers to this as the problem of estimating the number of species or classes in a population (see [4] for a survey). The problem has been extensively studied in the database literature, for a variety of uses. For example, estimates of the number of distinct values for an attribute in a database table are used in query optimizers to select good query plans. In addition, histograms within the query optimizer often store the number of distinct values in each bucket, to improve their estimation accuracy [29, 30]. Distinct-values estimates are also useful for network resource monitoring, in order to estimate the number of distinct destination IP addresses, source-destination pairs, requested urls, etc. In network security monitoring, determining sources that send to many distinct destinations can help detect fast-spreading worms [12, 32].

Distinct-values estimation can also be used as a general tool for duplicate-insensitive counting: Each item to be counted views its unique id as its "value", so that the number of distinct values equals the number of items to be counted. Duplicate-insensitive counting is useful in mobile computing to avoid double-counting nodes that are in motion [31]. It can also be used to compute the number of distinct neighborhoods at a given hop-count from a node [27] and the size of the transitive closure of a graph [7]. In a sensor network, duplicate-insensitive counting together with multi-path in-network aggregation enables robust and energy-efficient answers to count queries [8, 24]. Moreover, duplicate-insensitive counting is a building block for duplicate-insensitive computation of other aggregates, such as sum and average.


stream S1: C,D,B,B,Z,B,B,R,T,S,X,R,D,U,E,B,R,T,Y,L,M,A,T,W
stream S2: T,B,B,R,W,B,B,T,T,E,T,R,R,T,E,M,W,T,R,M,M,W,B,W

Fig. 1 Two example data streams of N = 24 items from a universe {A,B,...,Z} of size n = 26. S1 has 15 distinct values while S2 has 6

More formally, consider a data set, S, of N items, where each item is from a universe of n possible values. Because multiple items may have the same value, S is a multi-set. The number of distinct values in S, called the zeroth frequency moment F0, is the number of values from the universe that occur at least once in S. In the context of this chapter, we focus on the standard data stream scenario where the items in S arrive as an ordered sequence, i.e., as a data stream, and the goal is to estimate F0 using only one pass through the sequence and limited working space memory. Figure 1 depicts two example streams, S1 and S2, with 15 and 6 distinct values, respectively.

The data structure maintained in the working space by the estimation algorithm is called a synopsis. We seek an estimation algorithm that outputs an estimate F̂0, a function of the synopsis, that is guaranteed to be close to the true F0 for the stream. We focus on the following well-studied error metrics and approximation scheme:

• relative error metric: the relative error of an estimate F̂0 is |F̂0 − F0|/F0.
• ratio error metric: the ratio error of an estimate F̂0 is max(F̂0/F0, F0/F̂0).
• standard error metric: the standard error of an estimator Y with standard deviation σY(F0) is σY(F0)/F0.
• (ε, δ)-approximation scheme: an (ε, δ)-approximation scheme for F0 is a randomized procedure that, given any positive ε < 1 and δ < 1, outputs an estimate F̂0 that is within a relative error of ε with probability at least 1 − δ.

Note that the standard error σ provides a means for an (ε, δ) trade-off: Under distributional assumptions (e.g., Gaussian approximation), the estimate is within ε = σ, 2σ, 3σ relative error with probability 1 − δ = 65 %, 95 %, 99 %, respectively. In contrast, the (ε, δ)-approximation schemes in this chapter do not rely on any distributional assumptions.

In this chapter, we survey the literature on distinct-values estimation. Section 2 discusses previous approaches based on sampling or using large space synopses. Section 3 presents the pioneering logarithmic space algorithm developed by Flajolet and Martin [13], as well as related extensions [1, 11], and discusses practical issues in using these algorithms. Section 4 presents an algorithm that provides arbitrary precision [17], and a variant that improves on the space bound [2]. Section 5 gives lower bounds on the space needed to estimate F0 for a data stream. Finally, Sect. 6 considers distinct-values estimation in a variety of important scenarios beyond the basic data stream set-up, including scenarios with selection predicates, deletions, sliding windows, distributed streams, and sensor networks. Table 1 summarizes the main algorithms presented in this chapter.

Table 1 Summary of the main algorithms presented in this chapter for distinct-values estimation over a data stream of values

Algorithm                        Comment/Features
Linear counting [33]             linear space, very low standard error
FM [13], PCSA [13]               log space, good in practice, standard error
AMS [1]                          realistic hash functions, constant ratio error
LogLog [11], Super-LogLog [11]   reduces PCSA synopsis space, standard error
Coordinated sampling [17]        (ε, δ)-approximation scheme
BJKST [2]                        improved space bound, (ε, δ)-approximation scheme

2 Preliminary Approaches and Difficulties

In this section, we consider several previously studied approaches to distinct-values estimation and the difficulties with these approaches. We begin with previous algorithms based on random sampling.

2.1 Sampling-Based Algorithms

A common approach for distinct-values estimation from the statistics literature (as well as much of the early work in the database literature until the mid-1990s) is to collect a sample of the data and then apply sophisticated estimators on the distribution of the values in the sample [4, 5, 19–22, 25, 26]. This extended research focus on sampling-based estimators is due in part to three factors. First, in the applied statistics literature, the option of collecting data on more than a small sample of the population is generally not considered because of its prohibitive expense. For example, collecting data on every person in the world or on every animal of a certain species is not feasible. Second, in the database literature, where scanning an entire large data set is feasible, using samples to collect (approximate) statistics has proven to be a fast and effective approach for a variety of statistics [6]. Third, and most fundamental to the data streams context, the failings of existing sampling-based algorithms to accurately estimate F0 (all known sampling-based estimators provide unsatisfactory results on some data sets of interest [5]) have spurred an ongoing focus on devising more accurate algorithms.

F0 is a particularly difficult statistic to estimate from a sample. To gain intuition as to why this is the case, consider the following 33 % sample of a data stream of 24 items: B,B,T,R,E,T,M,W. Given this 33 % sample (with its 6 distinct values), does the entire stream have 6 distinct values, 18 distinct values (i.e., the 33 % sample has 33 % of the distinct values), or something in between? Note that this particular sample can be obtained by taking every third item of either S1 (where F0 = 15) or S2 (where F0 = 6) from Fig. 1.

for i := 0, ..., s − 1 do M[i] := 0
foreach (stream item with value v) do M[h(v)] := 1
let z := |{i : M[i] = 0}|
return s ln(s/z)

Fig. 2 The linear counting algorithm [33]
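A minimal Python rendering of Fig. 2 may help make the estimator concrete. Python's salted built-in hash reduced modulo s is only a stand-in for the uniform hash h(), and the function name is an invented convenience.

import math
import random

def linear_counting(stream, s, salt=None):
    # Bitmap of s positions; estimate is s * ln(s / z), where z counts the 0 positions.
    salt = random.getrandbits(64) if salt is None else salt
    M = [0] * s
    for v in stream:
        M[hash((salt, v)) % s] = 1        # stand-in for the uniform hash h()
    z = M.count(0)
    if z == 0:
        raise ValueError("bitmap saturated; choose a larger s")
    return s * math.log(s / z)

# Example: stream S1 of Fig. 1 has 15 distinct values.
S1 = list("CDBBZBBRTSXRDUEBRTYLMATW")
print(round(linear_counting(S1, 256)))    # typically prints a value very close to 15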

Thus, despite sampling a large (33 %) percentage of the data, estimating F0 remains challenging, because the sample can be viewed as fairly representative of either S1 or S2, two streams with very different F0's. In fact, in the worst case, all sampling-based F0 estimators are provably inaccurate: Charikar et al. [5] proved that estimating F0 to within a small constant factor (with probability > 1/2) requires (in the worst case) that nearly the entire data set be sampled (see Sect. 5). Thus, any approach based on a uniform sample of say 1 % of the data (or otherwise reading just 1 % of the data) is unable to provide good guaranteed error estimates in either the worst case or in practice. Highly-accurate answers are possible only if (nearly) the entire data set is read. This motivates the need for an effective streaming algorithm.

2.2 Streaming Approaches

Clearly, F0 can be computed exactly in one pass through the entire data set, by keeping track of all the unique values observed in the stream. If the universe is [1..n], a bit vector of size n can be used as the synopsis, initialized to 0, where bit i is set to 1 upon observing an item with value i. However, in many cases the universe is quite large (e.g., the universe of IP addresses is n = 2^32), making the synopsis much larger than the stream size N! Alternatively, we can maintain a set of all the unique values seen, which takes F0 log2 n bits because each value is log2 n bits. However, F0 can be as large as N, so again the synopsis is quite large.

A common approach for reducing the synopsis by (roughly) a factor of log2 n is to use a hash function h() mapping values into a Θ(n) size range, and then adapting the aforementioned bit vector approach. Namely, bit h(i) is set upon observing an item with value i. Whang et al. [33], for example, proposed an algorithm, called linear counting, depicted in Fig. 2. The hash function h() in the figure maps each value v uniformly at random to a number in [0, s − 1]. They show that using a load factor F0/s = 12 provides estimates to within a 1 % standard error. There are two limitations of this algorithm. First, we need good a priori knowledge of F0 in order to set the hash table size s. Second, the space is proportional to F0, which can be as large as n. Note that the approximation error arises from collisions in the hash table, i.e., distinct values in the stream mapping to the same bit position. To help alleviate this source of error, Bloom filters (with their multiple hash functions) have been used effectively [23], although the space remains Θ(n) in the worst case.

for i := 0, ..., L − 1 do M[i] := 0
foreach (stream item with value v) do {
  b := the largest i ≥ 0 such that the i rightmost bits in h(v) are all 0
  M[b] := 1
}
let Z := min{i : M[i] = 0}   // i.e., the least significant 0-bit in M
return 2^Z / 0.77351

Fig. 3 The FM algorithm [13], using only a single hash function and bit vector

The challenge in estimating F0 using o(n) space is that because there is insufficient space to keep track of all the unique values seen, it is impossible to determine whether or not an arriving stream value increases the number of distinct values seen thus far.

3 Flajolet and Martin’s Algorithm

In this section, we present Flajolet and Martin's (FM) pioneering algorithm for distinct-values estimation. We also describe a related algorithm by Alon, Matias, and Szegedy. Finally, we discuss practical issues and optimizations for these algorithms.

3.1 The Basic FM Algorithm

Over two decades ago, Flajolet and Martin [13] presented the first small space distinct-values estimation algorithm. Their algorithm, which we will refer to as the FM algorithm, is depicted in Fig. 3. In this algorithm, the L = Θ(log(min(N, n))) space synopsis consists of a bit vector M initialized to all 0, where N is the number of items and n is the size of the universe. The main idea of the algorithm is to let each item in the data set select at random a bit in M and set it to 1, with (quasi-)geometric distribution, i.e., M[i] is selected with probability (≈) 2^{−(i+1)}. This selection is done using a hash function h that maps each value v uniformly at random to an integer in [0, 2^L − 1], and then determining the largest b such that the b rightmost bits in h(v) are all 0. In this way, each distinct value gets mapped to b = i with probability 2^{−(i+1)}. For the vector length L, it suffices to take any number > log2(F0) + 4 [13]. As F0 is unknown, we can conservatively use L = log2(min(N, n)) + 5.

The intuition behind the FM algorithm is as follows. Using a hash function ensures that all items with the same value will make the same selection; thus the final bit vector M is independent of any duplications among the item values. For each distinct value, the bit b is selected with probability 2^{−(b+1)}.

Item value v   Hash Function 1     Hash Function 2     Hash Function 3
               b    M[·]           b    M[·]           b    M[·]
15             1    00000010       1    00000010       0    00000001
36             0    00000011       1    00000010       0    00000001
4              0    00000011       0    00000011       0    00000001
29             0    00000011       2    00000111       1    00000011
9              3    00001011       0    00000111       0    00000011
36             0    00001011       1    00000111       0    00000011
14             1    00001011       0    00000111       1    00000011
4              0    00001011       0    00000111       0    00000011
               Z = 2               Z = 3               Z = 2

Estimate F̂0 = ⌊2^{(2+3+2)/3}/0.77351⌋ = 6

Fig. 4 Example run of the FM algorithm on a stream of 8 items, using three hash functions. Each M[·] is depicted as an L = 8 bit binary number, with M[0] being the rightmost bit shown. The estimate, 6, matches the number of distinct values in the stream

Accordingly, we expect M[b] to be set if there are at least 2^{b+1} distinct values. Because bit Z − 1 is set but not bit Z, there are likely greater than 2^Z but fewer than 2^{Z+1} distinct values. Flajolet and Martin's analysis shows that E[Z] ≈ log2(0.77351 · F0), so that 2^Z/0.77351 is a good estimate in that range. To reduce the variance in the estimator, Flajolet and Martin take the average over tens of applications of this procedure (with different hash functions). Specifically, they take the average, Z̄, of the Z's for the different hash functions and then compute 2^{Z̄}/0.77351. An example is given in Fig. 4.
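A compact Python sketch of the basic FM estimator with averaging over k bit vectors follows. The salted built-in hash standing in for the k idealized hash functions, and the function name, are illustrative assumptions; the structure (tracking the least significant 0-bit Z per vector, averaging the Z's, and dividing by 0.77351) follows Fig. 3 and the discussion above.

import random

def fm_estimate(stream, k=64, L=32):
    # k bit vectors, one (stand-in) hash function each; estimate 2**Zbar / 0.77351.
    salts = [random.getrandbits(64) for _ in range(k)]
    M = [[0] * L for _ in range(k)]
    for v in stream:
        for j in range(k):
            h = hash((salts[j], v)) & ((1 << L) - 1)
            b = (h & -h).bit_length() - 1 if h else L - 1   # number of rightmost 0 bits
            M[j][b] = 1
    Zs = [next((i for i in range(L) if row[i] == 0), L) for row in M]
    return 2 ** (sum(Zs) / k) / 0.77351

S1 = list("CDBBZBBRTSXRDUEBRTYLMATW")
print(fm_estimate(S1))   # in the right ballpark of S1's 15 distinct values; noisy for tiny streams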

The error guarantee, space bound, and time bound are summarized in the following theorem.

Theorem 1 ([13]) The FM algorithm with k (idealized) hash functions produces an estimator with standard error O(1/√k), using k · L memory bits for the bit vectors, for any L > log2(min(N, n)) + 4. For each item, the algorithm performs O(k) operations on L-bit memory words.

The space bound does not include the space for representing the hash functions. The time bound assumes that computing h and b are constant time operations. Sections 3.3 and 3.4 will present optimizations that significantly reduce both the space for the bit vectors and the time per item, without increasing the standard error.

3.2 The AMS Algorithm

Flajolet and Martin [13] analyze the error guarantees of their algorithm assuming the use of an explicit family of hash functions with ideal random properties (namely, that h maps each value v uniformly at random to an integer in the given range). Alon, Matias, and Szegedy [1] adapted the FM algorithm to use (more realistic) linear hash functions.

Consider a universe U = {1, 2, ..., n}. Let d be the smallest integer so that 2^d > n
Consider the members of U as elements of the finite field F = GF(2^d), which are represented by binary vectors of length d
Let a and b be two random members of F, chosen uniformly and independently
Define h(v) := a · v + b, where the product and addition are computed in the field F

R := 0
foreach (stream item with value v) do {
  b := the largest i ≥ 0 such that the i rightmost bits in h(v) are all 0
  R := max(R, b)
}
return 2^R

Fig. 5 The AMS algorithm [1], using only a single hash function

Their algorithm, which we will call the AMS algorithm, produces an estimate with provable guarantees on the ratio error. We discuss the AMS algorithm in this section.

First, note that in general, the final bit vector M returned by the FM algorithm in Sect. 3.1 consists of three parts, where the first part is all 0's, the third part is all 1's, and the second part (called the fringe in [13]) is a mix of 0's and 1's that starts with the most significant 1 and ends with the least significant 0. Let R (Z) be the position of the most (least) significant bit that is set to 1 (0, respectively). For example, for M[·] = 000101111, R = 5 and Z = 4. Whereas the FM algorithm uses Z in its estimate, the AMS algorithm uses R. Namely, the estimate is 2^R. Figure 5 presents the AMS algorithm. Note that unlike the FM algorithm, the AMS algorithm does not maintain a bit vector but instead directly keeps track of the most significant bit position set to 1.

The space bound and error guarantees for the AMS algorithm are summarized in the following theorem.

Theorem 2 ([1]) For every r>2, the ratio error of the estimate returned by the AMS algorithm is at most r with probability at least 1 − 2/r. The algorithm uses Θ(log n) memory bits.

Proof Let b(v) be the value of b computed for h(v). By the construction of h(), we have that for every fixed v, h(v) is uniformly distributed over F. Thus the probability that b(v) ≥ i is precisely 1/2^i. Moreover, for every fixed distinct v1 and v2, the probability that b(v1) ≥ i and b(v2) ≥ i is precisely 1/2^{2i}. Fix an i. For each element v ∈ U that appears at least once in the stream, let Wv,i be the indicator random variable whose value is 1 if and only if b(v) ≥ i. Let Zi = Σ_v Wv,i, where v ranges over all the F0 elements v that appear in the stream. By linearity of expectation and because the expectation of each Wv,i is 1/2^i, the expectation E[Zi] of Zi is F0/2^i. By pairwise independence, the variance of Zi is F0 · (1/2^i)(1 − 1/2^i) < F0/2^i. Therefore, by Markov's inequality, if 2^i > rF0 then Pr[Zi > 0] < 1/r, since E[Zi] = F0/2^i < 1/r. Similarly, by Chebyshev's inequality, if r · 2^i < F0 then Pr[Zi = 0] < 1/r:

indeed, Pr[Zi = 0] ≤ Var[Zi]/(E[Zi]²) < 1/E[Zi] = 2^i/F0 < 1/r. Because the algorithm outputs F̂0 = 2^R, where R is the maximum i for which Zi > 0, the two inequalities above show that the probability that the ratio between F̂0 and F0 is not between 1/r and r is smaller than 2/r, as needed.

As for the space bound, note that the algorithm uses d = O(log n) bits representing an irreducible polynomial needed in order to perform operations in F, O(log n) bits representing a and b, and O(log log n) bits representing the current maximum R value. □

The probability of a given ratio error can be reduced by using multiple hash functions, at the cost of a linear increase in space and time.

Corollary 1 With k hash functions, the AMS algorithm uses O(k log n) memory bits and, for each item, performs O(k) operations on O(log n)-bit memory words.

The space bound includes the space for representing the hash functions. The time bound assumes that computing h(v) and b(v) are constant time operations.
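For illustration, the sketch below implements the AMS idea with k independent hash functions. One hedge: instead of arithmetic in GF(2^d) as in Fig. 5, it uses the familiar pairwise-independent family h(v) = (a · v + b) mod p for a prime p, which keeps the spirit of the construction but is not literally the hash family of [1]; all names are invented for the example.

import random

P = (1 << 61) - 1                            # a Mersenne prime; assumes values fit below it

def ams_estimates(stream, k=20):
    # For each of k linear hash functions, track the maximum number R of
    # rightmost zero bits seen, and report the individual estimates 2**R.
    coeffs = [(random.randrange(1, P), random.randrange(P)) for _ in range(k)]
    R = [0] * k
    for v in stream:
        x = hash(v) % P                      # map arbitrary values into [0, P)
        for j, (a, b) in enumerate(coeffs):
            h = (a * x + b) % P
            tz = (h & -h).bit_length() - 1 if h else 0
            R[j] = max(R[j], tz)
    return [2 ** r for r in R]               # each is within ratio r of F0 w.p. >= 1 - 2/r

print(sorted(ams_estimates(list("CDBBZBBRTSXRDUEBRTYLMATW"))))  # estimates scatter around 15

Each individual 2^R carries only the ratio-error guarantee of Theorem 2; keeping several copies (Corollary 1) boosts the probability, which is why the practical comparison with FM in the next section matters.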

3.3 Practical Issues

Although the AMS algorithm has stronger error guarantees than the FM algorithm, the FM algorithm is somewhat more accurate in practice, for a given synopsis size [13]. In this section we argue why this is the case, and discuss other important practical issues.

First, the most significant 1-bit (as is used in the AMS algorithm) can be set by a single outlier hashed value. For example, suppose a hashed value sets the ith bit whereas all other hashed values set bits ≤ i − 3, so that M[·] is of the form 0001001111. In such cases, it is clear from M that 2^i is a significant overestimate of F0: it is not supported by the other bits (in particular, both bits i − 1 and i − 2 are 0's not 1's). On the other hand, the least significant 0-bit (as is used in the FM algorithm) is supported by all the bits to its right, which must be all 1's. Thus FM is more robust against single outliers.

Second, although the AMS algorithm requires only ≈ log log n bits to represent the current R, while the FM algorithm needs ≈ log n bits to represent the current M[·], this space savings is insignificant compared to the O(log n) bits the AMS algorithm uses for the hash function computations. In practice, the same class of hash functions (i.e., h(v) = a · v + b) can be used in the FM algorithm, so that it too uses O(log n) bits per hash function. Note that there are common scenarios where the space for the hash functions is not as important as the space for the accumulated synopsis (i.e., R or M[·]). For example, in the distributed streams scenario (discussed in Sect. 6), an accumulated synopsis is computed for each stream locally. In order to estimate the total number of distinct values across all streams, the current accumulated synopses are collected.

Mj[·]'s:      00001111, 00001011, 00000111, 00010111, 00011111, 00001111
Interleaved:  000000000000000000000110110011101111111111111111
End-encoded:  11011001110, 16

Fig. 6 Example FM synopsis compression, for k = 6 and L = 8

Thus the message size depends only on the accumulated synopsis size. Similarly, when distinct-values estimation techniques are used for duplicate-insensitive aggregation in sensor networks (see Sect. 6), the energy used in sending messages depends on the accumulated synopsis size and not the hash function size. In such scenarios, the (log log n)-bit AMS synopses are preferable to the (log n)-bit FM synopses.

Note, however, that the size of the FM accumulated synopsis can be significantly reduced, using the following simple compression technique [8, 13, 27]. Recall that at any point during the processing of a stream, M[·] consists of three parts, where the first part is all 0's, the second part (the fringe) is a mix of 0's and 1's, and the third part is all 1's. Recall as well that multiple M[·] are constructed for a given stream, each using a distinct hash function. These M[·] are likely to share many bits in common, because each M[·] is constructed using the same algorithm on the same data. Suppose we are using k hash functions, and for j = 1, ..., k, let Mj[·] be the bit vector created using the jth hash function. The naive approach of concatenating M1[·], M2[·], ..., Mk[·] uses k · L bits (recall that each Mj[·] is L bits), which is ≈ k log2 n bits. Instead, we will interleave the bits, as follows:

M1[L−1], M2[L−1], ..., Mk[L−1], M1[L−2], ..., Mk[L−2], ..., M1[0], ..., Mk[0],

and then run-length encode the resulting bit vector's prefix and suffix. From the above arguments, the prefix of the interleaved bit vector is all 0's and the suffix is all 1's. We use a simple encoding that ignores the all-0's prefix, consisting of (i) the interleaved bits between the all-0's prefix and the all-1's suffix and (ii) a count of the length of the all-1's suffix. An example is given in Fig. 6. Note that the interleaved fringe starts with the maximum bit set among the k fringes and ends with their minimum unset bit. Each fringe is likely to be tightly centered around the current log2 F0; e.g., for a given hash function and c > 0, the probability that no bit larger than log2 F0 + c is set is (1 − 2^{−(log2 F0 + c)})^{F0} ≈ e^{−1/2^c}. At any point in the processing of the stream, a similar argument shows that the interleaved fringe is expected to be fewer than O(k log k) bits. Thus our simple encoding is expected to use fewer than log2 log2 n + O(k log k) bits. In comparison, the AMS algorithm uses k log2 log2 n bits for its accumulated synopsis, although this too can be reduced through compression techniques.

The number of bit vectors, k, in the FM algorithm is a tunable parameter trading off space for accuracy. The relative error as a function of k has been studied empirically in [8, 13, 24, 27] and elsewhere, with <15 % relative error reported for k = 20 and <10 % relative error reported for k = 64, on a variety of data sets.

for j := 1, ..., k and i := 0, ..., L − 1 do Mj[i] := 0
foreach (stream item with value v) do {
  x := h(v) mod k   // Note: k is a power of 2
  b := the largest i ≥ 0 such that the i rightmost bits in h(v)/k are all 0
  Mx[b] := 1
}
Z̄ := (1/k) · Σ_{j=1..k} min{i : Mj[i] = 0}
return k · 2^{Z̄} / 0.77351

Fig. 7 The PCSA algorithm [13]

These studies show a strong diminishing return for increases in k. Theorem 1 shows that the standard error is O(1/√k). Thus reducing the standard error from 10 % to 1 % requires increasing k by a factor of 100! In general, to obtain a standard error of at most ε we need k = Θ(1/ε²). Estan, Varghese and Fisk [12] present a number of techniques for further improving the constants in the space vs. error trade-off, including using multi-resolution and adaptive bit vectors.

Both the FM and AMS algorithms use the largest i such that the i rightmost bits in h(v) are all 0, in order to create an exponential distribution onto an integer range. A related, alternative approach by Cohen [7] is to (i) use a hash function that maps uniformly to the interval [0, 1], (ii) maintain the minimum hashed value seen thus far in the stream S (i.e., x = min_{v∈S}{h(v)}), and then (iii) return 1/x − 1 as the estimate for F0. The intuition is that if there are F0 distinct values mapped uniformly at random to [0, 1], then we may expect them to divide the interval into F0 + 1 relatively evenly-spaced subintervals, i.e., subintervals of size 1/(F0 + 1). As [0, x] is the first such subinterval, x = 1/(F0 + 1), and hence 1/x − 1 is used as the estimate for F0. As with the FM and AMS algorithms, the error guarantee can be improved by taking multiple hash functions and averaging. Empirically, this approach is not as accurate as the FM algorithm for a given synopsis size [27].
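A few lines of Python illustrate Cohen's minimum-hash-value idea from the preceding paragraph; mapping values to [0, 1) via a salted built-in hash and averaging the k minima before inverting are illustrative choices for this sketch, not the exact construction of [7].

import random

def cohen_estimate(stream, k=32):
    # Track the minimum hashed value in [0, 1) under k (stand-in) hash functions,
    # average the minima, and return 1/average - 1 as the estimate of F0.
    salts = [random.getrandbits(64) for _ in range(k)]
    mins = [1.0] * k
    scale = 2.0 ** 64
    for v in stream:
        for j, s in enumerate(salts):
            x = (hash((s, v)) & ((1 << 64) - 1)) / scale
            if x < mins[j]:
                mins[j] = x
    return 1.0 / (sum(mins) / k) - 1.0

print(cohen_estimate(list("CDBBZBBRTSXRDUEBRTYLMATW")))  # typically lands near 15, with visible variance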

3.4 Improving the Per-Item Processing Time

The FM algorithm as presented in Sect. 3.1 performs O(k) operations on memory words of ≈ log2 n bits (recall Theorem 1) for each stream item. This is because a different hash function is used for each of the k bit vectors. To reduce the processing time per item from O(k) to O(1), Flajolet and Martin present the following variant on their algorithm, called Probabilistic Counting with Stochastic Averaging (PCSA). In the PCSA algorithm (see Fig. 7), k bit vectors are used (for k a power of 2) but only a single hash function h(). For each stream item with value v, the log2 k least significant bits of h(v) are used to select a bit vector. Then the remaining L − log2 k bits of h(v) are used to select a position within that bit vector, according to an exponential distribution (as in the basic FM algorithm). To compute an estimate, PCSA averages over the positions of the least significant 0-bits, computes 2 to the power of that average, and divides by the bias factor 0.77351. To compensate for the fact that each bit vector has seen only 1/k'th of the distinct items on average, the estimate is multiplied by k.

for j := 1, ..., k do Rj := 0
foreach (stream item with value v) do {
  x := h(v) mod k   // Note: k is a power of 2
  b := the largest i ≥ 0 such that the i rightmost bits in h(v)/k are all 0
  Rx := max(Rx, b)
}
Z̄ := (1/k) · Σ_{j=1..k} Rj
return (0.79402k − 0.84249) · 2^{Z̄}

Fig. 8 The LogLog algorithm [11]

The error guarantee, space bound, and time bound are summarized in the following theorem.

Theorem 3 ([13]) The PCSA algorithm with k bit vectors and an (idealized) hash function produces an estimator with standard error 0.78/√k, using k · L memory bits for the bit vectors, for any L > log2(min(N, n)/k) + 4. For each item, the algorithm performs O(1) operations on (log2 n)-bit memory words.

The space bound does not include the space for representing the hash function. The time bound assumes that computing h and b are constant time operations. Although some of the operations use only L-bit words, the word size is dominated by the log2 n bits for the item value v.

Durand and Flajolet [11] recently presented a variant of PCSA, called the LogLog algorithm, that reduces the size of the accumulating synopsis from log n to log log n. (As in FM and PCSA, the size of the hash function is not accounted for in the space bound.) The algorithm, depicted in Fig. 8, differs from PCSA by maintaining the maximum bit set (as in AMS) and using a different function for computing the estimate. The analysis in [11] gives the bias correction factor (0.79402k − 0.84249), where k is the number of maximums (i.e., Rj's) maintained. With k maximums, the standard error is shown to be 1.30/√k (assuming idealized hash functions). This is higher than with the PCSA algorithm (supporting our earlier argument that the least-significant 0-bit is more accurate than the most-significant 1-bit), but the synopsis size is smaller (when ignoring the hash function space and not using the compression tricks in Sect. 3.3).

Durand and Flajolet [11] also present a further improvement, called Super-LogLog, that differs from LogLog in two aspects. First, it discards the largest 30 % of the estimates, in order to decrease the variance. As discussed in Sect. 3.3, using the maximum bit set is subject to overestimation caused by outlier bits being set. By discarding the largest estimates, these outliers are discarded. Note that a different correction factor is needed in order to compensate for this additional source of bias [11]. Second, it represents each maximum Rj using only

L = log2(log2(n/k) + 3) bits, and again corrects for the additional bias. The error guarantee, space bound, and time bound are summarized in the following theorem.

Theorem 4 ([11]) The Super-LogLog algorithm with k maximums and an (idealized) hash function produces an estimator with standard error 1.05/√k, using k · L memory bits for the maximums, where L = ⌈log2 log2(min(N, n)/k)⌉ + 3. For each item, the algorithm performs O(1) operations on (log2 n)-bit memory words.

The space bound does not include the space for representing the hash function. The time bound assumes that computing h and b are constant time operations. Although some of the operations use only L-bit words, the word size is dominated by the log2 n bits for the stream value v. The standard error 1.05/√k for Super-LogLog is higher than the 0.78/√k error for PCSA. Comparing the synopsis sizes (ignoring the hash functions), Super-LogLog uses a fixed ≈ k log2 log2(n/k) bits, whereas PCSA using the compression tricks of Sect. 3.3 uses an expected ≈ log2 log2 n + O(k log k) bits.
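The stochastic-averaging idea shared by PCSA and LogLog is easy to see in code. The sketch below follows the LogLog variant of Fig. 8 (one register of maxima per bucket and the bias factor 0.79402k − 0.84249); the salted built-in hash and the function name are assumptions made for the example, and none of Super-LogLog's truncation or pruning is included.

import random

def loglog_estimate(stream, k=256):
    # A single hash per item: the low log2(k) bits pick a register, the remaining
    # bits supply a geometric rank (number of rightmost zeros); registers keep the max.
    assert k & (k - 1) == 0, "k must be a power of 2"
    salt = random.getrandbits(64)
    R = [0] * k
    for v in stream:
        h = hash((salt, v)) & ((1 << 64) - 1)
        x = h % k
        rest = h // k
        b = (rest & -rest).bit_length() - 1 if rest else 0
        R[x] = max(R[x], b)
    zbar = sum(R) / k
    return (0.79402 * k - 0.84249) * 2 ** zbar

# LogLog needs F0 well above k to be meaningful; try a larger synthetic stream.
vals = [random.randrange(10**6) for _ in range(20000)]     # roughly 19,800 distinct values
print(loglog_estimate(vals))                               # typically within about 10 % of the truth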

4 (ε, δ)-Approximation Schemes

None of the algorithms presented thus far provides the strong guarantees of an (ε, δ)-approximation scheme. In this section, we present two such algorithms: the Coordinated Sampling algorithm of Gibbons and Tirthapura [17], and an improvement by Bar-Yossef et al. [2] that achieves a nearly optimal space bound.

4.1 Coordinated Sampling

Gibbons and Tirthapura [17] gave the first (ε, δ)-approximation scheme for F0. Their algorithm, called Coordinated Sampling, is depicted in Fig. 9. In the algorithm, there are k = Θ(log(1/δ)) instances of the same procedure, differing only in their use of different hash functions hj(). For each instance, the hash function is used to assign each potential stream value to a "level", such that half the values are assigned to level 0, a quarter to level 1, etc. The algorithm maintains a set of the ≈ τ distinct stream values that have the highest levels among those observed thus far. More specifically, it keeps track of the minimum level ℓj such that there are at most τ distinct stream values with level at least ℓj, as well as the set, Sj, of these stream values.

As in the AMS algorithm (Sect. 3.2), any uniform pairwise independent hash function can be used for hj(); for example, linear hash functions can be used. Let bj(v) be the value of b computed in the algorithm for hj(v). Following the argument in Sect. 3.2, we have that Pr{bj(v) = ℓ} = 1/2^{ℓ+1} and Pr{bj(v) ≥ ℓ} = 1/2^ℓ for ℓ = 0, ..., log n − 1. Hence, Sj is always a uniform random sample of the distinct stream values observed thus far, where each value is in the sample with probability 2^{−ℓj}.

for j := 1, ..., k do { ℓj := 0, Sj := ∅ }
foreach (stream item with value v) do {
  for j := 1, ..., k do {
    b := the largest i ≥ 0 such that the i rightmost bits in hj(v) are all 0
    if b ≥ ℓj and (v, b) ∉ Sj do {
      Sj := Sj ∪ {(v, b)}
      // if Sj is too large, discard the level ℓj sample points from Sj
      while |Sj| > τ do {
        Sj := Sj − {(v', b') : b' = ℓj}
        ℓj := ℓj + 1
      }
    }
  }
}
return median_{j=1,...,k}(|Sj| · 2^{ℓj})

Fig. 9 The Coordinated Sampling algorithm [17]. The values of k and τ depend on the desired ε and δ: k = 36 log2(1/δ) and τ = 36/ε², where the constant 36 is determined by the worst case analysis and can be much smaller in practice

Thus, Coordinated Sampling uses |Sj| · 2^{ℓj} as the estimate for the number of distinct values in the stream. To ensure that the estimate is within ε with probability 1 − δ, it computes the median over Θ(log(1/δ)) such estimates.

Each step of the algorithm within the "for" loop can be done in constant (expected) time by maintaining the appropriate data structures, assuming (as we have for the previous algorithms) that computing hash functions and determining b are constant time operations. For example, each Sj can be stored in a hash table Tj of 2τ entries, where the pair (v, b) is the hash key. This enables both tests for whether a given (v, b) is in Sj and insertions of a new (v, b) into Sj to be done in constant expected time. We can enable constant time tracking of the size of Sj by maintaining an array of log n + 1 "level" counters, one per possible level, which keep track of the number of pairs in Sj for each level. We also maintain a running count of the size of Sj. This counter is incremented by 1 upon insertion into Sj and decremented by the corresponding level counter upon deleting all pairs in a level. In the latter case, in order to quickly delete from Sj all such pairs, we leave these deleted pairs in place, removing them lazily as they are encountered in subsequent visits to Tj. (We need not explicitly mark them as deleted because subsequent visits see that their level numbers are too small and treat them as deleted.)

The error guarantee, space bound, and time bound are summarized in the following theorem. The space bound includes the space for representing the hash functions. The time bound assumes that computing hash functions and b are constant expected time operations.

Theorem 5 ([17]) The Coordinated Sampling algorithm provides an (ε, δ)-approximation scheme, using O(log n · log(1/δ)/ε²) memory bits. For each item, the algorithm performs an expected O(log(1/δ)) operations on (log2 n)-bit memory words.

Proof We have argued above about the time bound. The space bound is O(k · τ) memory words, i.e., O(log n · log(1/δ)/ε²) memory bits.

In what follows, we sketch the proof that Coordinated Sampling is indeed an (ε, δ)-approximation scheme. A difficulty in the proof is that the algorithm decides when to stop changing levels based on the outcome of random trials, and hence may stop at an incorrect level and make correspondingly bad estimates. We will argue that the probability of stopping at a "bad" level is small, and can be accounted for in the desired error bound.

Accordingly, consider the jth instance of the algorithm. For ℓ ∈ {0, ..., log n} and v ∈ {1, ..., n}, we define the random variables Xℓ,v such that Xℓ,v = 1 if v's level is at least ℓ and 0 otherwise. For the stream S, we define Xℓ = Σ_{v∈S} Xℓ,v for every level ℓ. Note that after processing S, the value of ℓj is the lowest numbered level f such that Xf ≤ τ. The algorithm uses the estimate 2^f · Xf.

For every level ℓ ∈ {0, ..., log n}, we define Bℓ such that Bℓ = 1 if 2^ℓ Xℓ ∉ [(1 − ε)F0, (1 + ε)F0] and 0 otherwise. Level ℓ is "bad" if Bℓ = 1, and "good" otherwise. Let Eℓ denote the event that the final value of ℓj is ℓ, i.e., that f equals ℓ. The heart of the proof is to show the following:

Pr{the given instance produces an estimate not in [(1 − ε)F0, (1 + ε)F0]} < 1/3.   (1)

Let P be the probability in Eq. (1). Let ℓ* denote the first level such that E[Xℓ*] ≤ (2/3)τ. The instance produces an estimate not within the target range if Bf is true for the level f such that Ef is true. Thus,

P = Σ_{i=0}^{log n} Pr{Ei ∧ Bi} < Σ_{i=0}^{ℓ*} Pr{Bi} + Σ_{i=ℓ*+1}^{log n} Pr{Ei}.   (2)

The idea behind using the inequality to separate the Bi terms from the Ei terms is that the lower levels (up to ℓ*) are likely to have good estimates and the algorithm is unlikely to keep going beyond level ℓ*. As in the proof for the AMS algorithm (Theorem 2), we have that for ℓ = 0, ..., log n − 1,

E[Xℓ] = F0/2^ℓ,   (3)

and

Var[Xℓ] < F0/2^ℓ.   (4)

We will now show that

Σ_{i=0}^{ℓ*} Pr{Bi} < 6/(ε²τ).   (5)

To see this, we first express Pr{Bi} in terms of Eq. (3): Pr{Bi} = Pr{|Xi − F0/2^i| ≥ ε F0/2^i}. Then, from Eq. (4) and using Chebyshev's inequality, we have Pr{Bi} < 2^i/(ε² F0).

Hence, Σ_{i=0}^{ℓ*} Pr{Bi} < Σ_{i=0}^{ℓ*} 2^i/(ε²F0) < 2^{ℓ*+1}/(ε²F0). Now, because ℓ* is the first level such that F0/2^{ℓ*} ≤ (2/3)τ, we have that F0 > 2^{ℓ*−1} · (2/3)τ. Thus, Σ_{i=0}^{ℓ*} Pr{Bi} < 2^{ℓ*+1}/(ε² · 2^{ℓ*−1} · (2/3)τ) = 6/(ε²τ), establishing Eq. (5).

Next, we will show that

Σ_{i=ℓ*+1}^{log n} Pr{Ei} < 6/τ.   (6)

To see this, we first observe that Σ_{i=ℓ*+1}^{log n} Pr{Ei} = Pr{Xℓ* > τ}, because the Ei's are mutually exclusive. Because E[Xℓ*] < (2/3)τ, we have Pr{Xℓ* > τ} < Pr{Xℓ* − E[Xℓ*] > τ/3}. By Chebyshev's inequality and Eq. (4), this latter probability is less than (9/τ²) · F0/2^{ℓ*}. Plugging in the fact that (2/3)τ ≥ E[Xℓ*] = F0/2^{ℓ*}, we obtain Σ_{i=ℓ*+1}^{log n} Pr{Ei} < (9/τ²) · (2τ/3) = 6/τ, establishing Eq. (6).

Plugging into Eq. (2) the results from Eqs. (5) and (6), and setting τ = 36/ε², we have P < 1/6 + ε²/6 < 1/3. Thus, Eq. (1) is established. Finally, the median fails to be an (ε, δ)-estimator of F0 if at least k/2 instances of the algorithm fail. By Eq. (1), we expect fewer than k/3 instances to fail; a standard Chernoff-bound argument then shows that, with k = Θ(log(1/δ)) instances, the probability that at least k/2 fail is at most δ. □
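A single instance of Coordinated Sampling (Fig. 9) is easy to sketch in Python; k instances combined by a median, and a genuinely pairwise-independent hash in place of the salted built-in hash used here, would complete the scheme. Class and method names are invented for the example.

import random

class CoordinatedSample:
    # One instance of the Fig. 9 synopsis: keep (value, level) pairs whose level is
    # at least the current threshold; raise the threshold when more than tau remain.
    def __init__(self, tau):
        self.tau = tau
        self.level = 0                # current minimum retained level
        self.sample = {}              # value -> its level b(v)
        self.salt = random.getrandbits(64)

    def _level_of(self, v):
        h = hash((self.salt, v)) & ((1 << 64) - 1)
        return (h & -h).bit_length() - 1 if h else 63     # rightmost zero bits of h(v)

    def add(self, v):
        b = self._level_of(v)
        if b >= self.level and v not in self.sample:
            self.sample[v] = b
            while len(self.sample) > self.tau:
                self.sample = {u: lb for u, lb in self.sample.items() if lb > self.level}
                self.level += 1       # discard the lowest level and move up

    def estimate(self):
        return len(self.sample) * (2 ** self.level)

cs = CoordinatedSample(tau=36)        # tau = 36/eps**2 with eps = 1, for illustration
for v in "CDBBZBBRTSXRDUEBRTYLMATW":
    cs.add(v)
print(cs.estimate())                  # tau exceeds F0 here, so the sample is exact: 15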

4.2 Improving the Space Bound

Bar-Yossef et al. [2] showed how to adapt the Coordinated Sampling algorithm in order to improve the space bound. Specifically, their algorithm, which we call the BJKST algorithm, stores the elements in Sj using less space, as follows. Instead of storing the pair (v, b), as in Coordinated Sampling, the BJKST algorithm stores g(v), for a suitably chosen hash function g(). Namely, g() is a (randomly chosen) uniform pairwise independent hash function that maps values from [0..n − 1] to the range [0..R − 1], where R = 3((log n + 1)τ)². Thus only O(log log n + log(1/ε)) bits are needed to store g(v). The level b for v is represented implicitly by storing the hashed values as a collection of balanced binary search trees, one tree for each level.

The key observation is that for any given instance of the algorithm, g() is applied to at most (log n + 1) · τ distinct values. Thus, the choice of R ensures that with probability at least 5/6, g() is injective on these values. If g() is indeed injective, then using g() does not alter the basic progression of the instance. The alternative occurs with probability at most 1/6. To compensate, the BJKST algorithm uses a larger τ, namely, τ = 576/ε², such that the probability of a bad estimate can be bounded by 1/6. Because 1/6 + 1/6 = 1/3, a result akin to Eq. (1) can be established. Finally, taking the median over k = 36 log(1/δ) instances results in an (ε, δ)-approximation.

The error guarantee, space bound, and time bound are summarized in the following theorem. The space bound includes the space for representing the hash functions. The time bound assumes that computing hash functions and b are constant expected time operations.

Theorem 6 ([2]) The BJKST algorithm provides an (ε, δ)-approximation scheme, using O(((1/ε²)(log(1/ε) + log log n) + log n) · log(1/δ)) memory bits. For each item, the algorithm performs O(log(1/δ)) operations on (log2 n)-bit words plus at most O(log(1/δ)/ε²) operations on (log2(1/ε) + log2 log2 n)-bit words.

5 Lower Bounds

This section presents five key lower bound results for distinct-values estimation. The first lower bound shows that observing (nearly) the entire stream is essential for obtaining good estimation error guarantees for all input streams.

Theorem 7 ([5]) Consider any (possibly adaptive and randomized) estimator for the number of distinct values F0 that examines at most r items in a stream of N items. Then, for any γ > e^{−r}, there exists a worst case input stream such that with probability at least γ, the ratio error of the estimate F̂0 output by the estimator is at least √(((N − r)/(2r)) ln(1/γ)).

Thus when r = o(N), the ratio error is non-constant with high probability. Even when 1 % of the input is examined, the ratio error is at least 5 with probability > 1/2. The second lower bound shows that randomization is essential for obtaining low estimation error guarantees for all input streams, if we hope to use sublinear space. For this lower bound, we also provide the proof, as a representative example of how such lower bounds are proved.

Theorem 8 ([1]) Any deterministic algorithm that outputs, given one pass through a data stream of N = n/2 elements of U ={1, 2,...,n}, an estimate with at most 10 % relative error requires Ω(n) memory bits.

Proof Let G be a family of t = 2^{Ω(n)} subsets of U, each of cardinality n/4, so that any two distinct members of G have at most n/8 elements in common. (The existence of such a G follows from standard results in coding theory, and can be proved by a simple counting argument.) Fix a deterministic algorithm that approximates F0. For every two members G1 and G2 of G, let A(G1, G2) be the stream of length n/2 starting with the n/4 members of G1 (in a sorted order) and ending with the n/4 members of G2 (in a sorted order).

When the algorithm runs, given a stream of the form A(G1, G2), the memory configuration after it reads the first n/4 elements of the stream depends only on G1. By the pigeonhole principle, if the memory has less than log2 t bits, then there are two distinct sets G1 and G2 in G so that the content of the memory after reading the elements of G1 is equal to that content after reading the elements of G2. This means that the algorithm must give the same final output for the two streams A(G1, G1) and A(G2, G1). This, however, contradicts the assumption, because F0 = n/4 for A(G1, G1) and F0 ≥ 3n/8 for A(G2, G1). Therefore, the answer of the algorithm makes a relative error that exceeds 0.1 for at least one of these two streams. It follows that the space used by the algorithm must be at least log2 t = Ω(n), completing the proof. □

The third lower bound shows that approximation is essential for obtaining low estimation error guarantees for all input streams, if we hope to use sublinear space.

Theorem 9 ([1]) Any randomized algorithm that outputs, given one pass through a data stream of at most N = 2n items of U ={1, 2,...,n}, a number Y such that Y = F0 with probability at least 1 − δ, for some fixed δ<1/2, requires Ω(n) memory bits.

The fourth lower bound shows that Ω(log n) memory bits are required for obtaining low estimation error.

Theorem 10 ([1]) Any randomized algorithm that outputs, given one pass through a data stream of items from U ={1, 2,...,n}, an estimate with at most a 10 % relative error with probability at least 3/4 must use at least Ω(log n) memory bits.

The final lower bound shows that Ω(1/ε²) memory bits are required in order to obtain an (ε, δ)-approximation scheme (even for constant δ).

Theorem 11 ([34]) For any δ independent of n and any ε, any randomized algorithm that outputs, given one pass through a data stream of items from U = {1, 2, ..., n}, an estimate with at most an ε relative error with probability at least 1 − δ must use at least Ω(min(n, 1/ε²)) memory bits.

Thus we have an Ω(1/ε² + log n) lower bound for obtaining an arbitrary relative error ε for constant δ and, by the BJKST algorithm, a nearly matching upper bound of O((1/ε²)(log(1/ε) + log log n) + log n).

6 Extensions

In this section, we consider distinct-values estimation in a variety of important scenarios beyond the basic data stream set-up. In Sects. 6.1–6.5, we focus on sampling, sliding windows, update streams, distributed streams, and sensor networks (ODI), respectively, as summarized in Table 2. Finally, Sect. 6.6 highlights three additional settings studied in the literature.

Table 2 Scenarios handled by the main algorithms discussed in this section

Algorithm              Cite & section    Sampling distinct   Sliding windows   Update streams   Distributed streams   ODI
FM                     [13]; Sect. 3.1   no                  no                no               yes                   yes
PCSA                   [13]; Sect. 3.4   no                  no                no               yes                   yes
FM with log² space     Sects. 6.2, 6.3   no                  yes               yes              yes                   no
AMS                    [1]; Sect. 3.2    no                  no                no               yes                   yes
Cohen                  [7]; Sect. 3.3    no                  no                no               yes                   yes
LogLog                 [11]; Sect. 3.4   no                  no                no               yes                   yes
Coordinated sampling   [17]; Sect. 4.1   yes                 no                no               yes                   yes
BJKST                  [2]; Sect. 4.2    no                  no                no               yes                   yes
Distinct sampling      [16]; Sect. 6.1   yes                 no                no               yes                   yes
Randomized wave        [18]; Sect. 6.2   yes                 yes               no               yes                   yes
l0 sketch              [9]; Sect. 6.3    no                  no                yes              yes                   no
Ganguly                [14]; Sect. 6.3   yes                 no                yes              yes                   no
CLKB                   [8]; Sect. 6.5    no                  no                no               yes                   yes

6.1 Sampling Distinct

In addition to providing an estimate of the number of distinct values in the stream, several algorithms provide a uniform sample of the distinct values in the stream. Such a sample can be used for a variety of sampling-based estimation procedures, such as estimating the mean, the variance, and the quantiles over the distinct values. Algorithms that retain only hashed values, such as FM, PCSA, AMS, Cohen, LogLog, BJKST, l0 Sketch (Sect. 6.3) and CLKB (Sect. 6.5), do not provide such samples. In some cases, such as Cohen, the algorithm can be readily adapted to produce a uniform sample (with replacement): For each instance (i.e., each hash function) of the algorithm, maintain not just the current minimum hashed value but also the original value associated with this minimum hashed value. As long as two different values do not hash to the same minimum value for a given hash function, each parallel instance produces one sample point. In contrast, Coordinated Sampling, Randomized Wave (Sect. 6.2) and Ganguly's algorithm (Sect. 6.3) all directly provide a uniform sample of the distinct values.

Gibbons [16] extended the sampling goal to a multidimensional data setting that arises in a class of common database queries. Here, the goal is to extract a uniform sample of the distinct values in a primary dimension, as before, but instead of retaining only the randomly selected values V, the algorithm retains a "same-value" sample for each value in V. Specifically, for each v ∈ V, the algorithm maintains a uniform random sample chosen from all the stream items with value v. A user-specified parameter t determines the size of each of these same-value samples; if there are fewer than t stream items with a particular value, the algorithm retains them all. The algorithm, called Distinct Sampling, is similar to Coordinated Sampling (Fig. 9) in having log n levels, maintaining all values whose levels are above a current threshold, and incrementing the level threshold whenever a space bound is reached.

(a) select count(distinct target-attr)
    from Table
    where P

(b) select count(distinct o_custkey)
    from orders
    where o_orderdate ≥ '2006-01-01'

Fig. 10 (a) Distinct values query template; (b) Example query

However, instead of retaining one (v, b) pair for the value v, it starts by retaining each of the first t items with value v in the primary dimension, as well as a count, nv, of the number of items in the stream with value v (including the current item). Then, upon observing any subsequent items with value v, it maintains a uniform same-value sample for v by adding the new item to the sample with probability t/nv, making room by discarding a random item among the t items currently in the sample for v. Figure 10 gives an example of the type of SQL query that can be well estimated by the Distinct Sampling algorithm, where target-attr in Fig. 10(a) is the primary dimension and the predicate P is typically on one or more of the other dimensions, as in Fig. 10(b). The estimate is obtained by first applying the predicate to the same-value samples, in order to estimate what fraction of the values in V would be eliminated by the predicate, and then outputting the overall query estimate based on the number of remaining values.

6.2 Sliding Windows

The sliding windows setting is motivated by the desire to estimate the number of distinct values over only the most recent stream items. Specifically, we are given a window size W, and the problem is to estimate the number of distinct values over a sliding window of the W most recent items. The goal is to use space that is logarithmic in W (linear in W would be trivial).

Datar et al. [10] observed that the FM algorithm can be extended to solve the sliding windows problem, by keeping track of the stream position of the most recent item that set each FM bit. Then, when estimating the number of distinct values within the current sliding window, only those FM bits whose associated positions are within the window are considered to be set. This increases the space needed for the FM algorithm by a logarithmic factor.

Gibbons and Tirthapura [18] developed an (ε, δ)-approximation scheme for the sliding windows scenario. Their algorithm, called Randomized Wave, is depicted in Fig. 11. In the algorithm, there are k = Θ(log(1/δ)) instances of the same procedure, differing only in their use of different hash functions hj(). Any uniform, pairwise independent hash function can be used for hj(). Let bj(v) be the value of b computed in the algorithm for hj(v).

Whereas Coordinated Sampling maintained a single uniform sample of the distinct values, Randomized Wave maintains ≈ log W uniform samples of the distinct values. Each of these "level" samples corresponds to a different sampling probability, and retains only the τ = Θ(1/ε²) most recent distinct values sampled into the associated level.

// Note: All additions and comparisons in this algorithm are done modulo W′
W′ := the smallest power of 2 greater than or equal to 2W
pos := 0
for j := 1, ..., k do {
  initialize Vj to be an empty list                                   // value list
  for ℓ := 0, ..., log(W′) do initialize Lj(ℓ) to be an empty list    // level lists
}
// Process the stream items
foreach (stream item with value v) do {
  pos := pos + 1
  for j := 1, ..., k do {
    if the tail (v', p') of Vj has expired (i.e., p' = pos − W′)
      discard (v', p') from Vj and from any level list Lj(ℓ)
    b := the largest i ≥ 0 such that the i rightmost bits in hj(v) are all 0
    for ℓ := 0, ..., b do {
      if v is already in Lj(ℓ)
        remove current entry for v and insert (v, pos) at the head of Lj(ℓ)
      else do {
        if |Lj(ℓ)| = τ then discard the pair at the tail of Lj(ℓ)
        insert (v, pos) at the head of Lj(ℓ)
      }
    }
    if v is in Vj remove current entry for v from Vj
    insert (v, pos) at the head of Vj
  }
}
// Compute an estimate for a sliding window of size w ≤ W
s := max(0, pos − w + 1)    // [s, pos] is the desired window
for j := 1, ..., k do {
  ℓj := min level such that the tail of Lj(ℓj) contains a position p ≤ s
  cj := number of pairs in Lj(ℓj) with p ≥ s
}
return median_{j=1,...,k}(cj · 2^{ℓj})

Fig. 11 The Randomized Wave algorithm [18]. The values of k and τ depend on the desired ε and δ: k = 36 log2(1/δ) and τ = 36/ε², where the constant 36 is determined by the worst case analysis and can be much smaller in practice

(In the figure, Lj(ℓ) is the level sample for level ℓ of instance j.) An item with value v is selected into levels 0, ..., bj(v), and stored as the pair (v, pos), where pos is the stream position when v most recently occurred.

Each level sample Lj(ℓ) can be maintained as a doubly linked list. The algorithm also maintains a (doubly linked) list Vj of all the values in any of the level samples Lj(ℓ), together with the position of their most recent occurrences, ordered by position. This list enables fast discarding of items no longer within the sliding window. Finally, there is a hash table Hj (not shown in the figure) that holds triples (v, Vptr, Lptr), where v is a value in Vj, Vptr is a pointer to the entry for v in Vj, and Lptr is a doubly linked list of pointers to each of the occurrences of v in the level samples Lj(ℓ). These triples are stored in Hj hashed by their value v.

Consider an instance j. For each stream item, we first check the oldest value v' in Vj to see if its position is now outside of the window, and if so, we discard it. We use the triple in Hj(v') to locate all occurrences of v' in the data structures; these occurrences are spliced out of their respective doubly linked lists. Second, we update the level samples Lj for each of the levels 0 ... bj(v), where v is the value of the stream item. There are two cases. If v is not in the level sample, we insert it, along with its position pos, at the head of the level sample. Otherwise, we perform a move-to-front: splicing out v's current entry and inserting (v, pos) at the head of the level sample. In the former case, if inserting the new element would make the level sample exceed τ elements, we discard the oldest element to make room. Finally, we insert (v, pos) at the head of Vj, and if v was already in Vj, we splice out the old entry for v. Because the expected value of bj(v) is less than 2, v occurs in an expected constant number (in this case, 2) of levels. Thus, all of the above operations can be done in constant expected time.

Let (v', p') denote the pair at the tail of a level sample Lj(ℓ). Then Lj(ℓ) contains all the distinct values with stream positions in the interval [p', pos] whose bj()'s are at least ℓ. Thus, similar to Coordinated Sampling, an estimate of the number of distinct values within a window can be obtained by taking the number of elements in Lj(ℓ) in this interval and multiplying by 2^ℓ, the inverse of the sampling probability for the level.

The error guarantee, space bound, and time bound are summarized in the following theorem. The space bound includes the space for representing the hash functions. The time bound assumes that computing hash functions and b are constant expected time operations.

Theorem 12 ([18]) The Randomized Wave algorithm provides an (ε, δ)-approximation scheme for estimating the number of distinct values in any sliding window of size w ≤ W, using O((log n log W log(1/δ))/ε²) memory bits, where the values are in [0..n). For each item, the algorithm performs an expected O(log(1/δ)) operations on max(log n, 2 + log W)-bit memory words.

Note that by setting W to be N, the length of the stream, the algorithm provides an (ε, δ)-approximation scheme for all possible window sizes.
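To make the preceding description concrete, the following is a minimal Python sketch of a single Randomized Wave instance, written under some simplifying assumptions: Python OrderedDicts stand in for the doubly linked level lists L_j(ℓ); the value list V_j and hash table H_j, which only make expiry constant time, are replaced by a per-level scan; and the class name, seed parameter, and blake2b-based hash are illustrative choices rather than part of the original algorithm.

import hashlib
from collections import OrderedDict

class WaveInstance:
    # One of the k instances of the Randomized Wave synopsis (a sketch, not
    # the original implementation). Each level ell keeps at most tau pairs
    # (value -> most recent position), oldest entry first.

    def __init__(self, max_window, tau, seed=0):
        self.W, self.tau, self.seed = max_window, tau, seed
        self.pos = 0
        self.levels = [OrderedDict() for _ in range(max_window.bit_length() + 1)]

    def _b(self, v):
        # b_j(v): number of trailing zero bits of a hash of v
        h = int.from_bytes(hashlib.blake2b(repr((self.seed, v)).encode(),
                                           digest_size=8).digest(), "big")
        h |= 1 << 62                       # guard against an all-zero hash
        return (h & -h).bit_length() - 1

    def insert(self, v):
        self.pos += 1
        for lvl in range(min(self._b(v), len(self.levels) - 1) + 1):
            L = self.levels[lvl]
            if v in L:
                del L[v]                   # move-to-front on a repeated value
            elif len(L) == self.tau:
                L.popitem(last=False)      # evict the oldest pair at this level
            L[v] = self.pos
        for L in self.levels:              # expire pairs with position <= pos - W
            while L and next(iter(L.values())) <= self.pos - self.W:
                L.popitem(last=False)

    def estimate(self, w):
        # number of distinct values in the window [s, pos], for w <= W
        s = max(1, self.pos - w + 1)
        for lvl, L in enumerate(self.levels):
            oldest = next(iter(L.values())) if L else None
            if len(L) < self.tau or (oldest is not None and oldest <= s):
                # this level sample covers the whole window; scale by 2^lvl
                return sum(1 for p in L.values() if p >= s) * (2 ** lvl)
        return 0

A full estimator would run k such instances with independent seeds and return the median of their per-window estimates, with k and τ chosen from ε and δ as in the caption of Fig. 11.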

6.3 Update Streams

Another important scenario is where the stream contains both new items and the deletion of previous items. Examples include estimating the current number of distinct network connections, phone connections or IP flows, where the stream contains both the start and the end of each connection or flow. Most of the distinct-values algorithms discussed thus far are not designed to handle deletions. For example, algorithms that retain only the maximum of some quantity, such as AMS, Cohen and LogLog, or even the top few highest priority items, such as Coordinated Sampling and BJKST, are unable to properly account for the deletion of the current maximum or a high priority item. Similarly, once a bit i is set in the FM algorithm, the subsequent deletion of an item mapped to bit i does not mean the bit can be unset: there are likely to have been other un-deleted stream items that also mapped to bit i. In the case of FM, deletions can be handled by replacing each bit with a running counter that is incremented on insertions and decremented on deletions, at a cost of increasing the space needed by a logarithmic factor.

Update streams generalize the insertion and deletion scenario by having each stream item be a pair (v, Δ), where Δ > 0 specifies Δ insertions of the value v and Δ < 0 specifies |Δ| deletions of the value v. The resulting frequency f_v = Σ_{(v,Δ)∈S} Δ of value v is assumed to be nonnegative. The metric F_0 is the number of distinct values v with f_v > 0. The above variant of FM with counters instead of bits readily handles update streams. Cormode et al. [9] devised an (ε, δ)-approximation scheme, called the l_0 sketch, for distinct-values estimation over update streams. Unlike any of the approaches discussed thus far, the l_0 sketch algorithm uses properties of stable distributions, and requires floating point arithmetic. The algorithm uses O((1/ε²) log(1/δ)) floating point numbers and O((1/ε²) log(1/δ)) floating point operations per stream item.

Recently, Ganguly [14] devised two (ε, δ)-approximation schemes for update streams. One uses O((1/ε²)(log n + log N) log N log(1/δ)) memory bits and O(log(1/ε) · log(1/δ)) operations to process each stream update. The other uses a factor of (log(1/ε) + log(1/δ)) times more space but reduces the number of operations to only O(log(1/ε) + log(1/δ)). Both algorithms return a uniform sampling of the distinct values, as well as an estimate.
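As an illustration of the FM-with-counters idea just described, here is a hedged Python sketch in which each bit of an FM bitmap is replaced by a signed counter so that pairs (v, Δ) with negative Δ can be processed. The class name, hash construction, and single-bitmap setup are assumptions made for brevity; this is not the exact construction of [13] or [9].

import hashlib

class FMCounterSketch:
    # FM-style synopsis with one running counter per bit position, so that
    # deletions can be processed (a position counts as "set" while its
    # counter is nonzero).
    PHI = 0.77351

    def __init__(self, num_positions=32, seed=0):
        self.counters = [0] * num_positions
        self.seed = seed

    def _position(self, v):
        # index of the least-significant 1 bit of a hash of v
        h = int.from_bytes(hashlib.blake2b(repr((self.seed, v)).encode(),
                                           digest_size=8).digest(), "big")
        h |= 1 << 62                                   # guard against an all-zero hash
        return min((h & -h).bit_length() - 1, len(self.counters) - 1)

    def update(self, v, delta):
        # process a stream pair (v, delta): delta insertions, or |delta| deletions
        self.counters[self._position(v)] += delta

    def estimate(self):
        # Z: least-significant position whose counter is zero, as in FM
        z = next((i for i, c in enumerate(self.counters) if c == 0),
                 len(self.counters))
        return (2 ** z) / self.PHI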

6.4 Distributed Streams

In a number of the motivating scenarios, the goal is to estimate the number of distinct values over a collection of distributed streams. For example, in network monitoring, each router observes a stream of packets and the goal is to estimate the number of distinct "values" (e.g., destination IP addresses, source–destination pairs, requested URLs, etc.) across all the streams. Formally, we have t ≥ 2 data streams, S_1, S_2, ..., S_t, of items, where each item is from a universe of n possible values. Each stream S_i is observed and processed by a party, P_i, independently of the other streams, in one pass and with limited working space memory. The working space can be initialized (prior to observing any stream data) with data shared by all parties, so that, for example, all parties can use the same random hash function(s). The goal is to estimate the number of distinct values in the multi-set arising from concatenating all t streams. For example, in the t = 2 streams in Fig. 1, there are 15 distinct values in the two streams altogether.

In response to a request to produce an estimate, each party sends a message (containing its current synopsis or some function of it) to a Referee, who outputs the estimate based on these messages. Note that the parties do not communicate directly with one another, and the Referee does not directly observe any stream data. We are primarily interested in minimizing: (i) the workspace used by each party, and (ii) the time taken by a party to process a data item.

As shown in Table 2, each of the algorithms discussed in this chapter can be readily adapted to handle the distributed streams setting. For example, the FM algorithm (Fig. 3) can be applied to each stream independently, using the exact same hash function across all streams, to generate a bit vector M[·] for each stream. These bit vectors are sent to the Referee. Because the same hash function was used by all parties, the bit-wise OR of these t bit vectors yields exactly the bit vector that would have been produced by running the FM algorithm on the concatenation of the t streams (or any other interleaving of the stream data). Thus, the Referee computes this bit-wise OR, and then computes Z, the least significant 0-bit in the result. As in the original FM algorithm, we reduce the variance in the estimator by using k hash functions instead of just 1, where all parties use the same k hash functions. The Referee computes the Z corresponding to each hash function, then computes the average, Z̄, of these Z's, and finally, outputs 2^Z̄/0.77351. The error guarantees of this distributed streams algorithm match the error guarantees in Theorem 1 for the single-stream algorithm. Moreover, the per-party space bound and the per-item time bound also match the space and time bounds in Theorem 1. Similarly, PCSA, FM with log2 space, AMS, Cohen, and LogLog can be adapted to the distributed streams setting in a straightforward manner, preserving the error guarantees, per-party space bounds, and per-item time bounds of the single-stream algorithm.

A bit less obvious, but still relatively straightforward, is adapting algorithms that use dynamic thresholds on what to keep and what to discard, where the threshold adjusts to the locally-observed data distribution. The key observation for why these algorithms do not pose a problem is that we can match the error guarantees of the single-stream algorithm by having the Referee use the strictest threshold among all the local thresholds. (Here, "strictest" means that the smallest fraction of the data universe has its items kept.) Namely, if ℓ is the strictest threshold, the Referee "subsamples" the synopses from all the parties by applying the threshold ℓ to the synopses. This unifies all the synopses to the same threshold, and hence the Referee can safely combine these synopses and compute an estimate.

Consider, for example, the Coordinated Sampling algorithm (Fig. 9). Each party sends its sets S_1,...,S_k and levels ℓ_1,...,ℓ_k to the Referee. For j = 1,...,k, the Referee computes ℓ*_j, the maximum value of the ℓ_j's from all the parties. Then, for each j, the Referee subsamples each of the S_j from the t parties, by discarding from S_j all pairs (v′, b′) such that b′ < ℓ*_j. Next, for each j, the Referee determines the union, S*_j, of all the subsampled S_j's. Finally, the Referee outputs the median over all j of |S*_j| · 2^{ℓ*_j}. The error guarantees, per-party space bound, and per-item time bound match those in Theorem 5 for the single-stream algorithm [17].

The error guarantees follow because (i) S*_j contains all pairs (v, b) with b ≥ ℓ*_j across all t streams, and (ii) the size of each S*_j is at least as big as the size of the S_j at a party with level ℓ_j = ℓ*_j (i.e., at a party with no subsampling), and this latter size was already sufficient to get a good estimate in the single-stream setting.
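The following small sketch illustrates the Referee's merging step for the FM-based distributed scheme described above: per-party bit vectors built with the same hash function are OR-ed together, and the usual FM estimate is computed from the result. The function name and the single-hash-function setting are illustrative; with k hash functions the Referee would average the resulting Z values before exponentiating, as in the text.

def referee_estimate(bit_vectors, phi=0.77351):
    # bit_vectors: one list of 0/1 per party, all built with the same hash
    # function; OR them and apply the single-stream FM estimate
    combined = [0] * len(bit_vectors[0])
    for bv in bit_vectors:
        combined = [a | b for a, b in zip(combined, bv)]
    z = next((i for i, bit in enumerate(combined) if bit == 0), len(combined))
    return (2 ** z) / phi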

6.5 Order- and Duplicate-Insensitive (ODI)

Another interesting setting for distinct-values estimation algorithms arises in robust aggregation in wireless sensor networks. In sensor network aggregation, an aggregate function (e.g., count, sum, average) of the sensors' readings is often computed by having the wireless sensor nodes organize themselves into a tree (with the base station as the root). The aggregate is computed bottom-up starting at the leaves of the tree: each internal node in the tree combines its own reading with the partial results from its children, and sends the result to its parent. For a sum, for example, the node's reading is added to the sum of its children's respective partial sums. This conserves energy because each sensor node sends only one short message, in contrast to the naive approach of having all readings forwarded hop-by-hop to the base station.

Aggregating along a tree, however, is not robust against message loss, which is common in sensor networks, because each dropped message loses a subtree's worth of readings. Thus, Considine et al. [8] and Nath et al. [24] proposed using multi-path routing for more robust aggregation. In one scheme, the nodes organize themselves into "rings" around the base station, where ring i consists of all nodes that are i hops from the base station. As in the tree, aggregation is done bottom-up starting with the nodes in the ring furthest from the base station (the "leaf" nodes). In contrast to the tree, however, when a node sends its partial result, there is no designated parent. Instead, all nodes in the next closest ring that overhear the partial result incorporate it into their accumulating partial results. Because of the added redundancy, the aggregation is highly robust to message loss, yet the energy consumption is similar to the (non-robust) tree because each sensor node sends only one short message.

On the other hand, because of the redundancy, partial results are accounted for multiple times. Thus, the aggregation must be done in a duplicate-insensitive fashion. This is where distinct-values estimation algorithms come in. First, if the goal is to count the number of distinct "values" (e.g., the number of distinct temperature readings), then a distinct-values estimation algorithm can be used, as long as the algorithm works for distributed streams and is insensitive to the duplication and observation re-ordering that arises in the scheme. An aggregation algorithm with the combined properties of order- and duplicate-insensitivity is called ODI-correct [24]. Second, if the goal is to count the number of sensor nodes whose readings satisfy a given boolean predicate (e.g., nodes with temperature readings below freezing), then again a distinct-values estimation algorithm can be used, as follows. Each sensor node whose reading satisfies the predicate uses its unique sensor id as its "value". Then the number of distinct values in the sensor network is precisely the desired count. Thus any distributed, ODI-correct distinct-values estimation algorithm can be used.

As shown in Table 2, most of the algorithms discussed in this chapter are ODI-correct. For example, the FM algorithm (Fig. 3) is insensitive to both re-ordering and duplication: the bits that are set in an FM bit vector are independent of both the order in which stream items are processed and any duplication of "partial-result" bit vectors (i.e., bit vectors corresponding to a subset of the stream items). Moreover, Considine et al.
[8] showed how the FM algorithm can be effectively adapted to use only O(log log n) bit messages in this setting. Similarly, most of the other algorithms are ODI-correct, as can be proved formally using the approach described in [24].

6.6 Additional Settings

We conclude this chapter by briefly mentioning three additional important settings considered in the literature.

The first setting is distinct-values estimation when each value is unique. This setting occurs, for example, in distributed census taking over mobile objects (e.g., [31]). Here, there are a large number of objects, each with a unique id. The goal is to estimate how many objects there are despite the constant motion of the objects, while minimizing the communication. Clearly, any of the distributed distinct-values estimation algorithms discussed in this chapter can be used. Note, however, that the setting enables a practical optimization: hash functions are not needed to map values to bit positions or levels. Instead, independent coin tosses can be used at each object; the desired exponential distribution can be obtained by flipping a fair coin until the first heads and counting the number of tails observed prior to the first head. (The unique id is not even used; a small sketch of this coin-tossing trick follows at the end of this section.) Obviating the need for hash functions eliminates their space and time overhead. Thus, for example, only O(log log n)-bit synopses are needed for the AMS algorithm. Note that hash functions were needed before to ensure that the multiple occurrences of the same value all map to the same bit position or level; this feature is not needed in the setting with unique values.

A second setting, studied by Bar-Yossef et al. [3] and Pavan and Tirthapura [28], seeks to estimate the number of distinct values where each stream item is a range of integers. For example, in the 4-item stream [2, 5], [10, 12], [4, 8], [6, 7], there are 10 distinct values, namely, 2, 3, 4, 5, 6, 7, 8, 10, 11, and 12. Pavan and Tirthapura present an (ε, δ)-approximation scheme that uses O((log n log(1/δ))/ε²) memory bits, and performs an amortized O(log(1/δ) log(n/ε)) operations per stream item. Note that although a single stream item introduces up to n distinct values into the stream, the space and time bounds have only a logarithmic (and not a linear) dependence on n.

Finally, a third important setting generalizes the distributed streams setting from just the union (concatenation) of the data streams to arbitrary set-expressions among the streams (including intersections and set differences). In this setting the number of distinct values corresponds to the cardinality of the resulting set. Ganguly et al. [15] showed how techniques for distinct values estimation can be generalized to handle this much richer setting.
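Returning to the unique-values setting above, the coin-tossing replacement for a hash function amounts to drawing a geometric level independently at each object. A minimal sketch (the function name and rng parameter are illustrative) is:

import random

def coin_flip_level(rng=random):
    # Flip a fair coin until the first heads; the number of tails seen gives
    # the same geometric level distribution that a hash function would
    # otherwise provide, with no dependence on the object's id.
    level = 0
    while rng.random() < 0.5:
        level += 1
    return level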

References

1. N. Alon, Y. Matias, M. Szegedy, The space complexity of approximating the frequency moments. J. Comput. Syst. Sci. 58, 137–147 (1999)
2. Z. Bar-Yossef, T.S. Jayram, R. Kumar, D. Sivakumar, L. Trevisan, Counting distinct elements in a data stream, in Proc. 6th International Workshop on Randomization and Approximation Techniques (2002), pp. 1–10
3. Z. Bar-Yossef, R. Kumar, D. Sivakumar, Reductions in streaming algorithms, with an application to counting triangles in graphs, in Proc. 13th ACM-SIAM Symposium on Discrete Algorithms (SODA) (2002)
4. J. Bunge, M. Fitzpatrick, Estimating the number of species: a review. J. Am. Stat. Assoc. 88, 364–373 (1993)
5. M. Charikar, S. Chaudhuri, R. Motwani, V. Narasayya, Towards estimation error guarantees for distinct values, in Proc. 19th ACM Symp. on Principles of Database Systems (2000), pp. 268–279
6. S. Chaudhuri, R. Motwani, V. Narasayya, Random sampling for histogram construction: how much is enough? in Proc. ACM SIGMOD International Conf. on Management of Data (1998), pp. 436–447
7. E. Cohen, Size-estimation framework with applications to transitive closure and reachability. J. Comput. Syst. Sci. 55, 441–453 (1997)
8. J. Considine, F. Li, G. Kollios, J. Byers, Approximate aggregation techniques for sensor databases, in Proc. 20th International Conf. on Data Engineering (2004), pp. 449–460
9. G. Cormode, M. Datar, P. Indyk, S. Muthukrishnan, Comparing data streams using Hamming norms (how to zero in), in Proc. 28th International Conf. on Very Large Data Bases (2002), pp. 335–345
10. M. Datar, A. Gionis, P. Indyk, R. Motwani, Maintaining stream statistics over sliding windows. SIAM J. Comput. 31, 1794–1813 (2002)
11. M. Durand, P. Flajolet, Loglog counting of large cardinalities, in Proc. 11th European Symp. on Algorithms (2003), pp. 605–617
12. C. Estan, G. Varghese, M. Fisk, Bitmap algorithms for counting active flows on high speed links, in Proc. 3rd ACM SIGCOMM Conf. on Internet Measurement (2003), pp. 153–166
13. P. Flajolet, G.N. Martin, Probabilistic counting algorithms for data base applications. J. Comput. Syst. Sci. 31, 182–209 (1985)
14. S. Ganguly, Counting distinct items over update streams, in Proc. 16th International Symp. on Algorithms and Computation (2005), pp. 505–514
15. S. Ganguly, M. Garofalakis, R. Rastogi, Tracking set-expression cardinalities over continuous update streams. VLDB J. 13, 354–369 (2004)
16. P.B. Gibbons, Distinct sampling for highly-accurate answers to distinct values queries and event reports, in Proc. 27th International Conf. on Very Large Data Bases (2001), pp. 541–550
17. P.B. Gibbons, S. Tirthapura, Estimating simple functions on the union of data streams, in Proc. 13th ACM Symp. on Parallel Algorithms and Architectures (2001), pp. 281–291
18. P.B. Gibbons, S. Tirthapura, Distributed streams algorithms for sliding windows, in Proc. 14th ACM Symp. on Parallel Algorithms and Architectures (2002), pp. 63–72
19. P.J. Haas, J.F. Naughton, S. Seshadri, L. Stokes, Sampling-based estimation of the number of distinct values of an attribute, in Proc. 21st International Conf. on Very Large Data Bases (1995), pp. 311–322
20. P.J. Haas, L. Stokes, Estimating the number of classes in a finite population. J. Am. Stat. Assoc. 93, 1475–1487 (1998)
21. W.C. Hou, G. Özsoyoğlu, B.K. Taneja, Statistical estimators for relational algebra expressions, in Proc. 7th ACM Symp. on Principles of Database Systems (1988), pp. 276–287
22. W.C. Hou, G. Özsoyoğlu, B.K. Taneja, Processing aggregate relational queries with hard time constraints, in Proc. ACM SIGMOD International Conf. on Management of Data (1989), pp. 68–77
23. A. Kumar, J. Xu, J. Wang, O. Spatscheck, L. Li, Space-code bloom filter for efficient per-flow traffic measurement, in Proc. IEEE INFOCOM (2004)
24. S. Nath, P.B. Gibbons, S. Seshan, Z. Anderson, Synopsis diffusion for robust aggregation in sensor networks, in Proc. 2nd ACM International Conf. on Embedded Networked Sensor Systems (2004), pp. 250–262
25. J.F. Naughton, S. Seshadri, On estimating the size of projections, in Proc. 3rd International Conf. on Database Theory (1990), pp. 499–513
26. F. Olken, Random sampling from databases. PhD thesis, Computer Science, UC Berkeley (1993)
27. C.R. Palmer, P.B. Gibbons, C. Faloutsos, ANF: a fast and scalable tool for data mining in massive graphs, in Proc. 8th ACM SIGKDD International Conf. on Knowledge Discovery and Data Mining (2002), pp. 81–90
28. A. Pavan, S. Tirthapura, Range-efficient computation of F0 over massive data streams, in Proc. 21st IEEE International Conf. on Data Engineering (2005), pp. 32–43
29. V. Poosala, Histogram-based estimation techniques in databases. PhD thesis, Univ. of Wisconsin–Madison (1997)
30. V. Poosala, Y.E. Ioannidis, P.J. Haas, E.J. Shekita, Improved histograms for selectivity estimation of range predicates, in Proc. ACM SIGMOD International Conf. on Management of Data (1996), pp. 294–305
31. Y. Tao, G. Kollios, J. Considine, F. Li, D. Papadias, Spatio-temporal aggregation using sketches, in Proc. 20th International Conf. on Data Engineering (2004), pp. 214–225
32. S. Venkataraman, D. Song, P.B. Gibbons, A. Blum, New streaming algorithms for high speed network monitoring and Internet attacks detection, in Proc. 12th ISOC Network and Distributed Security Symp. (2005)
33. K.Y. Whang, B.T. Vander-Zanden, H.M. Taylor, A linear-time probabilistic counting algorithm for database applications. ACM Trans. Database Syst. 15, 208–229 (1990)
34. D. Woodruff, Optimal space lower bounds for all frequency moments, in Proc. 15th ACM-SIAM Symp. on Discrete Algorithms (2004), pp. 167–175

The Sliding-Window Computation Model and Results

Mayur Datar and Rajeev Motwani

1 Sliding-Window Model: Motivation

In this chapter, we present some results related to small space computation over sliding windows in the data-stream model. Most research in the data-stream model (see, e.g., [1, 10, 11, 13–15, 19]), including results presented in some of the other chapters, assumes that all data elements seen so far in the stream are equally important and that synopses, statistics or models that are built should reflect the entire data set. However, for many applications this assumption is not true, particularly those that ascribe more importance to recent data items. One way to discount old data items and only consider recent ones for analysis is the sliding-window model: Data elements arrive at every instant; each data element expires after exactly N time steps; and, the portion of data that is relevant to gathering statistics or answering queries is the set of the last N elements to arrive. The sliding window refers to the window of active data elements at a given time instant and window size refers to N.

1.1 Motivation and Road Map

Our aim is to develop algorithms for maintaining statistics and models that use space sublinear in the window size N. The following example motivates why we may not

M. Datar (B) Inc., 1600 Amphitheatre Parkway, Mountain View, CA, USA e-mail: [email protected]

R. Motwani Department of Computer Science, Stanford University, Stanford, CA, USA e-mail: [email protected]

be ready to tolerate memory usage that is linear in the size of the window. Consider the following network-traffic engineering scenario: a high speed router working at 40 gigabits per second line speed. For every packet that flows through this router we do a prefix match to check if it originates from the stanford.edu domain. At every instant, we would like to know how many packets, of the last 10^10 packets, belonged to the stanford.edu domain. The above question can be rephrased as the following simple problem:

Problem 1 (BASICCOUNTING) Given a stream of data elements, consisting of 0’s and 1’s, maintain at every time instant the count of the number of 1’s in the last N elements.

A data element equals one if it corresponds to a packet from the stanford.edu domain and is zero otherwise. A trivial solution¹ exists for this problem that requires N bits of space. However, in such a scenario as the high-speed router, where on-chip memory is expensive and limited, and particularly when we would like to ask multiple (thousands of) such continuous queries, it is prohibitive to use even N = 10^10 (window size) bits of memory for each query. Unfortunately, it is easy to see that the trivial solution is the best we can do in terms of memory usage, unless we are ready to settle for approximate answers, i.e., an exact solution to BASICCOUNTING requires Θ(N) bits of memory. We will present a solution to the problem that uses no more than O((1/ε) log² N) bits of memory (i.e., O((1/ε) log N) words of memory) and provides an answer at each instant that is accurate within a factor of 1 ± ε. Thus, for ε = 0.1 (10 % accuracy) our solution will use about 300 words of memory for a window size of 10^10.

Given our concern with limited space, it is natural to ask "Is this the best we can do with respect to memory utilization?" We answer this question by demonstrating a matching space lower bound, i.e., we show that any approximation algorithm (deterministic or randomized) for BASICCOUNTING with relative error ε must use Ω((1/ε) log² N) bits of memory. The lower bound proves that the above mentioned algorithm is optimal, to within constant factors, in terms of memory usage.

Besides maintaining simple statistics like a bit count, as in BASICCOUNTING, there are various applications where we would like to maintain more complex statistics. Consider the following motivating example: A fundamental operation in database systems is a join between two or more relations. Analogously, one can define a join between multiple streams, which is primarily useful for correlating events across multiple data sources. However, since the input streams are unbounded, producing join results requires unbounded memory. Moreover, in most cases, we are only interested in those join results where the joining tuples exhibit temporal locality. Consequently, in most data-stream applications, a relevant notion of joins that is often employed is sliding-window joins, where tuples from each stream only join with tuples that belong to a sliding window over

the other stream. The semantics of such a join are clear to the user and also such joins can be processed in a non-blocking manner using limited memory. As a result, sliding-window joins are quite popular in most stream applications.

¹Maintain a FIFO queue and update a counter.

In order to improve join processing, database systems maintain "join statistics" for the relations participating in the join. Similarly, in order to efficiently process sliding-window joins, we would like to maintain statistics over the sliding windows, for the streams participating in the join. Besides being useful for the exact computation of sliding-window joins, such statistics could also be used to approximate them. Sliding-window join approximations have been studied by Das, Gehrke and Riedwald [6] and Kang, Naughton and Viglas [16]. This further motivates the need to maintain various statistics over sliding windows, using small space and update time.

This chapter presents a general technique, called the Exponential Histogram (EH) technique, that can be used to solve a wide variety of problems in the sliding-window model; typically problems that require us to maintain statistics. We will showcase this technique through solutions to two problems: the BASICCOUNTING problem above and the SUM problem that we will define shortly. However, our aim is not solely to present solutions to these problems, but rather to explain the EH technique itself, such that the reader can appropriately modify it to solve more complex problems that may arise in various applications. Already, the technique has been applied to various other problems, of which we will present a summary in Sect. 5.

The road map for this chapter is as follows: After presenting an algorithm for the BASICCOUNTING problem and the associated space lower bound in Sects. 2 and 3, respectively, we present a modified version of the algorithm in Sect. 4 that solves the following generalization of the BASICCOUNTING problem:

Problem 2 (SUM) Given a stream of data elements that are positive integers in the range [0..R], maintain at every time instant the sum of the last N elements.

A summary of other results in the sliding-window model is given in Sect. 5, before concluding in Sect. 6.

2 A Solution to the BASICCOUNTING Problem

It is instructive to observe why naive schemes do not suffice for producing approximate answers with a low memory requirement. For instance, it is natural to consider random sampling as a solution technique for solving the problem. However, maintaining a uniform random sample of the window elements will result in poor accuracy in the case where the 1's are relatively sparse.

Another approach is to maintain histograms. While the algorithm that we present follows this approach, it is important to note why previously known histogram techniques from databases are not effective for this problem. A histogram technique is characterized by the policy used to maintain the bucket boundaries. We would like to build time-based histograms in which every bucket summarizes a contiguous time interval and stores the number of 1's that arrived in that interval.

Fig. 1 Sliding window model notation

As with all histogram techniques, when a query is presented we may have to interpolate in some bucket to estimate the answer, because some of the bucket's elements may have expired. Let us consider some schemes of bucketizing and see why they will not work. The first scheme that we consider is that of dividing into k equi-width (width of time interval) buckets. The problem is that the distribution of 1's in the buckets may be nonuniform. We will incur large error when the interpolation takes place in buckets with a majority of the 1's. This observation suggests another scheme where we use buckets of nonuniform width, so as to ensure that each bucket has a near-uniform number of 1's. The problem is that the total number of 1's in the sliding window could change dramatically with time, and current buckets may turn out to have more or less than their fair shares of 1's as the window slides forward. The solution we present is a form of histogram that avoids these problems by using a set of well-structured and nonuniform bucket sizes. It is called the Exponential Histogram (EH) for reasons that will be clear later.

Before getting into the details of the solution we introduce some notation. We follow the conventions illustrated in Fig. 1. In particular, we assume that new data elements are coming from the right and the elements at the left are ones already seen. Note that each data element has an arrival time which increments by one at each arrival, with the leftmost element considered to have arrived at time 1. But, in addition, we employ the notion of a timestamp which corresponds to the position of an active data element in the current window. We timestamp the active data elements from right to left, with the most recent element being at position 1. Clearly, the timestamps change with every new arrival and we do not wish to make explicit updates. A simple solution is to record the arrival times in a wraparound counter of log N bits and then the timestamp can be extracted by comparison with the counter value of the current arrival.

As mentioned earlier, we concentrate on the 1's in the data stream. When we refer to the kth 1, we mean the kth most recent 1 encountered in the data stream. For an illustration of this notation, consider the situation presented in Fig. 1. The current time instant is 49 and the most recent arrival is a zero. The element with arrival time 48 is the most recent 1 and has timestamp 2 since it is the second most recent arrival in the current window. The element with arrival time 44 is the second most recent 1 and has timestamp 6.

We will maintain histograms for the active 1's in the data stream. For every bucket in the histogram, we keep the timestamp of the most recent 1 (called the timestamp for the bucket), and the number of 1's (called the bucket size). For example, in our figure, a bucket with timestamp 2 and size 2 represents a bucket that contains the two most recent 1's with timestamps 2 and 6. Note that the timestamp of a bucket increases as new elements arrive. When the timestamp of a bucket expires (reaches N + 1), we are no longer interested in data elements contained in it, so we drop that bucket and reclaim its memory. If a bucket is still active, we are guaranteed that it contains at least a single 1 that has not expired.
Thus, at any instant there is at most one bucket (the last bucket) containing 1's that may have expired. At any time instant we may produce an estimate of the number of active 1's as follows. For all but the last bucket, we add the number of 1's that are in them. For the last bucket, let C be the count of the number of 1's in that bucket. The actual number of active 1's in this bucket could be anywhere between 1 and C, so we estimate it to be C/2. We obtain the following:

Fact 1 The absolute error in our estimate is at most C/2, where C is the size of the last bucket.

Note that, for this approach, the window size does not have to be fixed a-priori at N. Given a window size S (S ≤ N), we do the same thing as before except that the last bucket is the bucket with the largest timestamp less than S.

2.1 The Approximation Scheme

We now define the Exponential Histograms and present a technique to maintain them, so as to guarantee count estimates with relative error at most ε, for any ε > 0. Define k = ⌈1/ε⌉, and assume that k/2 is an integer; if k/2 is not an integer, we can replace k/2 by ⌈k/2⌉ without affecting the basic results. As per Fact 1, the absolute error in the estimate is C/2, where C is the size of the last bucket. Let the buckets be numbered from right to left with the most recent bucket being numbered 1. Let m denote the number of buckets and C_i denote the size of the ith bucket. We know that the true count is at least 1 + Σ_{i=1}^{m−1} C_i, since the last bucket contains at least one unexpired 1 and the remaining buckets contribute exactly their size to the total count. Thus, the relative estimation error is at most (C_m/2)/(1 + Σ_{i=1}^{m−1} C_i). We will ensure that the relative error is at most 1/k by maintaining the following invariant:

Invariant 1 At all times, the bucket sizes C_1,...,C_m are such that for all j ≤ m, we have C_j / (2(1 + Σ_{i=1}^{j−1} C_i)) ≤ 1/k.

Let N′ ≤ N be the number of 1's that are active at any instant. Then the bucket sizes must satisfy Σ_{i=1}^{m} C_i ≥ N′. Our goal is to satisfy this property and Invariant 1 with as few buckets as possible. In order to achieve this goal we maintain buckets with exponentially increasing sizes so as to satisfy the following second invariant.

Invariant 2 At all times the bucket sizes are nondecreasing, i.e., C_1 ≤ C_2 ≤ ··· ≤ C_{m−1} ≤ C_m. Further, bucket sizes are constrained to the following: {1, 2, 4, ..., 2^{m′}}, for some m′ ≤ m and m′ ≤ log(2N/k) + 1. For every bucket size other than the size of the first and last bucket, there are at most k/2 + 1 and at least k/2 buckets of that size. For the size of the first bucket, which is equal to one, there are at most k + 1 and at least k buckets of that size. There are at most k/2 buckets with size equal to the size of the last bucket.

Let C_j = 2^r (r > 0) be the size of the jth bucket. If the size of the last bucket is 1 then there is no error in estimation since there is only one data element in that bucket, for which we know the timestamp exactly. If Invariant 2 is satisfied, then we are guaranteed that there are at least k/2 buckets each of sizes 2, 4, ..., 2^{r−1} and at least k buckets of size 1, which have indexes less than j. Consequently, C_j < (2/k)(1 + Σ_{i=1}^{j−1} C_i). It follows that if Invariant 2 is satisfied then Invariant 1 is automatically satisfied, at least with respect to buckets that have sizes greater than 1. If we maintain Invariant 2, it is easy to see that to cover all the active 1's, we would require no more than m ≤ (k/2 + 1)(log(2N/k) + 2) buckets. Associated with each bucket is its size and a timestamp. The bucket size takes at most log N distinct values, and hence we can maintain them using log log N bits. Since a timestamp requires log N bits, the total memory requirement of each bucket is log N + log log N bits. Therefore, the total memory requirement (in bits) for an EH is O((1/ε) log² N). It is implied that by maintaining Invariant 2, we are guaranteed the desired relative error and memory bounds.

The query time for an EH can be made O(1) by maintaining two counters, one for the size of the last bucket (LAST) and one for the sum of the sizes of all buckets (TOTAL). The estimate itself is TOTAL minus half of LAST. Both counters can be updated in O(1) time for every data element. See the box below for a detailed description of the update algorithm.

Example 1 We illustrate the execution of the algorithm for 10 steps, where at each step the new data element is 1. The numbers indicate the bucket sizes from left to right, and we assume that k/2 = 1.

32, 32, 16, 8, 8, 4, 4, 2, 1, 1
32, 32, 16, 8, 8, 4, 4, 2, 1, 1, 1 (new 1 arrived)
32, 32, 16, 8, 8, 4, 4, 2, 1, 1, 1, 1 (new 1 arrived)
32, 32, 16, 8, 8, 4, 4, 2, 2, 1, 1 (merged the older 1's)
32, 32, 16, 8, 8, 4, 4, 2, 2, 1, 1, 1 (new 1 arrived)
32, 32, 16, 8, 8, 4, 4, 2, 2, 1, 1, 1, 1 (new 1 arrived)
32, 32, 16, 8, 8, 4, 4, 2, 2, 2, 1, 1 (merged the older 1's)
32, 32, 16, 8, 8, 4, 4, 4, 2, 1, 1 (merged the older 2's)
32, 32, 16, 8, 8, 8, 4, 2, 1, 1 (merged the older 4's)
32, 32, 16, 16, 8, 4, 2, 1, 1 (merged the older 8's)

Algorithm (Insert):

1. When a new data element arrives, calculate the new expiry time. If the timestamp of the last bucket indicates expiry, delete that bucket and update the counter LAST containing the size of the last bucket and the counter TOTAL containing the total size of the buckets.
2. If the new data element is 0, ignore it; else, create a new bucket with size 1 and the current timestamp, and increment the counter TOTAL.
3. Traverse the list of buckets in order of increasing sizes. If there are k/2 + 2 buckets of the same size (k + 2 buckets if the bucket size equals 1), merge the oldest two of these buckets into a single bucket of double the size. (A merger of buckets of size 2^r may cause the number of buckets of size 2^{r+1} to exceed k/2 + 1, leading to a cascade of such mergers.) Update the counter LAST if the last bucket is the result of a new merger.
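A compact Python sketch of the EH for BASICCOUNTING, following the Insert algorithm in the box above, is given below. Buckets are kept newest-first as (arrival time, size) pairs, k is taken as ⌈1/ε⌉ as in Sect. 2.1, and the class and variable names are illustrative choices, not part of the original algorithm.

import math
from collections import deque

class ExponentialHistogram:
    # Sketch of the EH for BASICCOUNTING. Buckets: (arrival_time, size),
    # newest at the left of the deque, oldest at the right.

    def __init__(self, window_size, eps):
        self.N = window_size
        self.k = math.ceil(1.0 / eps)
        self.buckets = deque()
        self.time = 0
        self.total = 0             # TOTAL: sum of all bucket sizes
        self.last = 0              # LAST: size of the oldest bucket

    def _expire(self):
        # Step 1: drop the oldest bucket once its timestamp leaves the window
        if self.buckets and self.buckets[-1][0] <= self.time - self.N:
            _, size = self.buckets.pop()
            self.total -= size
            self.last = self.buckets[-1][1] if self.buckets else 0

    def insert(self, bit):
        self.time += 1
        self._expire()
        if bit == 0:               # Step 2: only 1's create buckets
            return
        self.buckets.appendleft((self.time, 1))
        self.total += 1
        if len(self.buckets) == 1:
            self.last = 1
        self._merge()

    def _merge(self):
        # Step 3: if a size occurs k/2 + 2 times (k + 2 times for size 1),
        # merge the oldest two buckets of that size, possibly cascading.
        buckets = list(self.buckets)
        i = 0
        while i < len(buckets):
            size = buckets[i][1]
            j = i
            while j < len(buckets) and buckets[j][1] == size:
                j += 1
            limit = self.k + 2 if size == 1 else math.ceil(self.k / 2) + 2
            if j - i >= limit:
                ts = buckets[j - 2][0]            # keep the newer timestamp
                buckets[j - 2:j] = [(ts, 2 * size)]
            else:
                i = j
        self.buckets = deque(buckets)
        self.last = self.buckets[-1][1] if self.buckets else 0

    def estimate(self):
        # TOTAL minus half of LAST, as in the text
        return self.total - self.last / 2.0 if self.buckets else 0.0

For a window of size 10^6 and ε = 0.1, one would construct ExponentialHistogram(10**6, 0.1), call insert(bit) once per stream element, and read estimate() at any time; the merge rule is what the analysis below relies on to bound the relative error and the number of buckets.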

Merging two buckets corresponds to creating a new bucket whose size is equal to the sum of the sizes of the two buckets and whose timestamp is the timestamp of the more recent of the two buckets, i.e., the timestamp of the bucket that is to the right. A merger requires O(1) time. Moreover, while cascading may require Θ(log(2N/k)) mergers upon the arrival of a single new element, a simple argument, presented in the next proof, allows us to argue that the amortized cost of mergers is O(1) per new data element. It is easy to see that the above algorithm maintains Invariant 2. We obtain the following theorem:

Theorem 1 The EH algorithm maintains a data structure that gives an estimate for the BASICCOUNTING problem with relative error at most ε using at most (k/2 + 1)(log(2N/k) + 2) buckets, where k = ⌈1/ε⌉. The memory requirement is log N + log log N bits per bucket. The arrival of each new element can be processed in O(1) amortized time and O(log N) worst-case time. At each time instant, the data structure provides a count estimate in O(1) time.

Proof The EH algorithm above, by its very design, maintains Invariant 2. As noted earlier, an algorithm that maintains Invariant 2 requires no more than (k/2 + 1)(log(2N/k) + 2) buckets to cover all the active 1's. Furthermore, the invariant also guarantees that our estimation procedure has a relative error no more than 1/k ≤ ε.

Each bucket maintains a timestamp and the size for that bucket. Since we maintain timestamps using wraparound arrival times, they require no more than log N bits of memory. As per Invariant 2, bucket sizes can take only one of the log(2N/k) + 1 unique values, and can be represented using log log N bits. Thus, the total memory requirement of each bucket is no more than log N + log log N bits.

On the arrival of a new element, we may perform a cascading merge of buckets, which takes time proportional to the number of buckets. Since there are O(log N) buckets, this gives a worst case update time of O(log N). Whenever two buckets are merged, the size of the merged bucket is double the size of those that are merged.

The cost of the merging can be amortized among all the 1's that fall in the merged bucket. Thus, an element that belongs to a bucket of size 2^p pays an amortized cost 1 + 1/2 + 1/4 + ··· + 1/2^p ≤ 2. This is because, whenever it gets charged, the size of the bucket it belongs to doubles and consequently the charge it incurs halves. Thus, we get that the amortized cost of merging buckets is O(1) per new element, in fact, O(1) per new element that has value 1. We maintain counters TOTAL and LAST, which can be updated in O(1) time for each new element, and which enable us to give a count estimate in O(1) time whenever a query is asked.

If instead of maintaining a timestamp for every bucket, we maintain a timestamp for the most recent bucket and maintain the difference between the timestamps for successive buckets, then we can reduce the total memory requirement to O(k log²(N/k)).

3 Space Lower Bound for BASICCOUNTING Problem

We provide a lower bound which verifies that the algorithm is optimal in its memory requirement. We start with a deterministic lower bound of Ω(k log²(N/k)). We omit proofs for lack of space, and refer the reader to [8].

Theorem 2 Any deterministic algorithm that provides an estimate for the BASICCOUNTING problem at every time instant with relative error less than 1/k for some integer k ≤ 4√N requires at least (k/16) log²(N/k) bits of memory.

The proof argument goes as follows: At any time instant, the space utilized by any algorithm is used to summarize the contents of the current active window. For a window of size N, we can show that there are a large number of possible input instances, i.e., arrangements of 0's and 1's, such that any deterministic algorithm which provides estimates with small relative error (i.e., less than 1/k) has to differentiate between every pair of these arrangements. The number of memory bits required by such an algorithm must therefore exceed the logarithm of the number of arrangements. The above argument is formalized by the following lemma.

Lemma 1 For k/4 ≤ B ≤ N, there exist L = (B choose k/4)^{log(N/B)} arrangements of 0's and 1's of length N such that any deterministic algorithm for BASICCOUNTING with relative error less than 1/k must differentiate between any two of the arrangements.

To prove Theorem 2, observe that if we choose B = √(Nk) in the lemma above then log L ≥ (k/16) log²(N/k). While the lower bound above is for a deterministic algorithm, a standard technique for establishing lower bounds for randomized algorithms, called the minimax principle [18], lets us extend this lower bound on the space complexity to randomized algorithms.

As a reminder, a Las Vegas algorithm is a randomized algorithm that always produces the correct answer, although the running time or space requirement of the algorithm may vary with the different random choices that the algorithm makes. On the other hand, a Monte Carlo algorithm is a randomized algorithm that sometimes produces an incorrect solution. We obtain the following lower bounds for these two classes of algorithms.

Theorem 3 Any randomized Las Vegas algorithm for BASICCOUNTING with relative error less than 1/k, for some integer k ≤ 4√N, requires an expected memory of at least (k/16) log²(N/k) bits.

Theorem 4 Any randomized Monte Carlo algorithm for the BASICCOUNTING problem with relative error less than 1/k, for some integer k ≤ 4√N, with probability at least 1 − δ (for δ < 1/2), requires an expected memory of at least (k/32) log²(N/k) + (1/2) log(1 − 2δ) bits.

4 Beyond 0’s and 1’s

The BASICCOUNTING problem, discussed in the last two sections, is one of the basic operations that one can define over sliding windows. While the problem in its original form has various applications, it is natural to ask "What are the other problems, in the sliding-window model, that can be solved using small space and small update time?". For instance, instead of the data elements being binary values, namely 0 and 1, what if they were positive integers in the range [0..R]? Could we efficiently maintain the sum of these numbers in the sliding-window model? We have already defined this problem, in Sect. 1, as the SUM problem. We will now present a modification of the algorithm from Sect. 2 that solves the SUM problem. In doing so, we intend to highlight the characteristic elements of the solution technique, so that readers may find it easy to adapt the technique to other problems. Already, the underlying technique has been successfully applied to many problems, some of which will be listed in the following section.

One way to solve the SUM problem would be to maintain separately a sliding window sum for each of the log R bit positions using an EH from Sect. 2.1. As before, let k = ⌈1/ε⌉. The memory requirement for this approach would be O(k log² N log R) bits. We will present a more direct approach that uses less memory. In the process we demonstrate how the EH technique introduced in Sect. 2 can be generalized to solve a bigger class of problems.

Typically, a problem in the sliding-window model requires us to maintain a function f defined over the elements in the sliding window. Let f(B) denote the function value restricted to the elements in a bucket B. For example, in the case of the SUM problem, the function f equals the sum of the positive integers that fall inside the sliding window. In the case of BASICCOUNTING the function f is simply the number of 1's that fall inside the sliding window. We note the following central ingredients of the EH technique from Sect. 2 and adapt them for the SUM problem:

1. Size of a Bucket. The size of each bucket is defined as the value of the function f that we are estimating (over the sliding window), restricted to that bucket B, i.e., f(B). In the earlier case the size was simply the count of 1's falling inside the bucket. For SUM, we define size analogously as the sum of the integers falling inside the bucket.
2. Procedure to Merge Buckets or Combination Rule. Whenever the algorithm decides to merge two adjacent buckets, a new bucket is created with timestamp equal to that of the newer bucket, or the bucket to the right. The size of this new bucket is computed using the sizes of the individual buckets (i.e., using the f(B)'s for the buckets that are merged) and any additional information that may be stored with the buckets.² Clearly, for the problem of maintaining the sum of data elements, which are either 0's and 1's or positive integers, no additional information is required. By definition, the size of the new merged bucket is simply the sum of the sizes of the buckets being merged.
3. Estimation. Whenever a query is asked, we need to estimate the answer at that moment based on the sizes of all the buckets and any additional information that we may have kept. In order to estimate the answer, we may be required to "interpolate" over the last bucket that is part inside and part outside the sliding window, i.e., the "straddling" bucket. Typically, this is done by computing the function value f over all buckets other than the last bucket. In order to do this, we use the same procedure as in the Merge step. To this value we may add the interpolated value of the function f from the last bucket. Again, for the problem of maintaining the sum of positive integers this task is relatively straightforward. We simply add up the sizes of all the buckets that are completely inside the sliding window. To this we add the "interpolated" value from the last bucket, which is simply half the size of the last bucket.
4. Deleting the Oldest Bucket. In order to reclaim memory, the algorithm deletes the oldest bucket when its timestamp reaches N + 1. This step is the same irrespective of the function f we are estimating.

The technique differs for different problems in the particulars of how the steps above are executed and in the rules for when to merge old buckets and create new ones, as new data elements get inserted. The goal is to maintain as few buckets as possible, i.e., merge buckets whenever possible, while at the same time making sure that the error due to the estimation procedure, which interpolates for the last bucket, is bounded. Typically, this goal is achieved by maintaining that the bucket sizes grow exponentially from right to left (new to old), and hence the name Exponential Histograms (EH). It is shown in [8] that the technique can be used to estimate a general class of functions f, called weakly-additive functions, over sliding windows. In the following section, we list different problems over sliding windows that can be solved using the EH technique.

²There are problems for which just knowing the sizes of the buckets that are merged is not sufficient to compute the size of the new merged bucket. For example, if the function f is the variance of numbers, in addition to knowing the variance of the buckets that are merged, we also need to know the number of elements in each bucket and the mean value of the elements from each bucket, in order to compute the variance for the merged bucket; see [4] for details.

Fig. 2 An illustration of an Exponential Histogram (EH)

We need some more notation to demonstrate the EH technique for the SUM problem. Let the buckets in the histogram be numbered B_1, B_2, ..., B_m, starting from most recent (B_1) to oldest (B_m); further, t_1, t_2, ..., t_m denote the bucket timestamps; see Fig. 2 for an illustration. In addition to the buckets maintained by the algorithm, we define another set of suffix buckets, denoted B_{1*}, ..., B_{j*}, that represent suffixes of the data stream. Bucket B_{i*} represents all elements in the data stream that arrived after the elements of bucket B_i, that is, B_{i*} = ∪_{l=1}^{i−1} B_l. We do not explicitly maintain the suffix buckets. Let S_i denote the size of bucket B_i. Similarly, let S_{i*} denote the size of the suffix bucket B_{i*}. Note, for the SUM problem S_{i*} = Σ_{l=1}^{i−1} S_l. Let B_{i,i−1} denote the bucket that would be formed by merging buckets i and i − 1, and S_{i,i−1} (S_{i,i−1} = S_i + S_{i−1}) denote the size of this bucket. We maintain the following two invariants that guarantee a small relative error in estimation and a small number of buckets:

Invariant 3 For every bucket B_i, (1/2)S_i ≤ ε S_{i*}.

Invariant 4 For each i > 1, for every bucket B_i, (1/2)S_{i,i−1} > ε S_{(i−1)*}.

It follows from Invariant 3 that the relative error in estimation is no more than S_m/(2S_{m*}) ≤ ε. Invariant 4 guarantees that the number of buckets is no more than O((1/ε)(log N + log R)). It is easy to see the proof of this claim. Since S_{i*} = Σ_{l=1}^{i−1} S_l, i.e., the bucket sizes are additive, after every 1/ε buckets (rather, 1/(2ε) pairs of buckets) the value of S_{i*} doubles. As a result, after O((1/ε)(log N + log R)) buckets the value of S_{i*} exceeds NR, which is the maximum value that can be achieved by S_{i*}. We now present a simple insert algorithm that maintains the two invariants above as new elements arrive.

Note, Invariant 3 holds for buckets that have been formed as a result of the merging of two or more buckets because the merging condition assures that it holds for the merged bucket. Addition of new elements in the future does not violate the invariant, since the right-hand side of the invariant can only increase by the addition of new elements.

Algorithm (Insert): x_t denotes the most recent element.

1. If x_t = 0, then do nothing. Otherwise, create a new bucket for x_t. The new bucket becomes B_1 with S_1 = x_t. Every old bucket B_i becomes B_{i+1}.
2. If the oldest bucket B_m has timestamp greater than N, delete the bucket. Bucket B_{m−1} becomes the new oldest bucket.
3. Merge step. Let k = 1/(2ε). While there exists an index i > 2 such that k·S_{i,i−1} ≤ S_{(i−1)*}, find the smallest such i and combine buckets B_i and B_{i−1} using the combination rule described earlier. Note that the S_{i*} values can be computed incrementally by adding S_{i−1} and S_{(i−1)*} as we make the sweep.
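A corresponding Python sketch for the SUM problem, following the Insert algorithm above and Invariants 3 and 4, is given below. Buckets are kept newest-first as [arrival time, size] pairs, the merge sweep accumulates the suffix sizes S_{i*} incrementally, and k = 1/(2ε) mirrors the merge step; the class name and other minor details are illustrative simplifications.

class SumEH:
    # Sketch of the EH for SUM: bucket size = sum of the integers it covers.

    def __init__(self, window_size, eps):
        self.N = window_size
        self.k = 1.0 / (2.0 * eps)         # k = 1/(2*eps), as in the merge step
        self.buckets = []                  # newest first: [arrival_time, size]
        self.time = 0

    def insert(self, x):
        self.time += 1
        # Step 2: delete the oldest bucket once its timestamp leaves the window
        if self.buckets and self.buckets[-1][0] <= self.time - self.N:
            self.buckets.pop()
        if x == 0:                         # Step 1: only nonzero elements open a bucket
            return
        self.buckets.insert(0, [self.time, x])
        # Step 3: merge sweep; `suffix` carries S_{i*} incrementally
        i, suffix = 1, self.buckets[0][1]
        while i + 1 < len(self.buckets):
            pair = self.buckets[i][1] + self.buckets[i + 1][1]   # S_{i,i-1}
            if self.k * pair <= suffix:
                self.buckets[i][1] = pair         # keep the newer bucket's timestamp
                del self.buckets[i + 1]           # re-check the same position
            else:
                suffix += self.buckets[i][1]
                i += 1

    def estimate(self):
        if not self.buckets:
            return 0.0
        total = sum(size for _, size in self.buckets)
        return total - self.buckets[-1][1] / 2.0  # interpolate half the last bucket

This sketch performs the merge sweep on every insertion; the batching trick described below (running the sweep only every Θ((1/ε)(log N + log R)) arrivals) is what brings the amortized update cost down to O(1).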

However, the invariant may not hold for a bucket that contains a singleton nonzero element and was never merged. The fact that the invariant does not hold for such a bucket does not affect the error bound for the estimation procedure because, if such a bucket were to become the last bucket, we know the exact timestamp for the only nonzero element in the bucket. As a result there is no interpolation error in that case.

Analogously to the variables TOTAL and LAST in Sect. 2.1, we can maintain S_m + S_{m*} and S_m, which enable us to answer queries in O(1) time. The algorithm for insertion requires O((1/ε)(log N + log R)) time per new element. Most of the time is spent in Step 3, where we make the sweep to combine buckets. This time is proportional to the number of buckets (O((1/ε)(log N + log R))). A simple trick, to skip Step 3 until we have seen Θ((1/ε)(log N + log R)) data points, ensures that the running time of the algorithm is amortized O(1). While we may violate Invariant 4 temporarily, we restore it after seeing Θ((1/ε)(log N + log R)) data points, by executing Step 3, which ensures that the number of buckets is O((1/ε)(log N + log R)). The space requirement for each bucket (memory needed to maintain the timestamp and size) is log N + log R bits. If we assume that a word is at least log N + log R bits long, equivalently the size required to count up to NR, which is the maximum value of the answer, we get that the total memory requirement is O((1/ε)(log N + log R)) words or O((1/ε)(log N + log R)²) bits. Please refer to [8] for a more complex procedure that has similar time requirements and a space requirement of O((1/ε) log N (log N + log R)) bits. To summarize, we get the following theorem:

Theorem 5 The sum of positive integers in the range [0..R] can be estimated over sliding windows with relative error at most ε using O((1/ε)(log N + log R)) words of memory. The time to update the underlying EH is worst case O((1/ε)(log N + log R)) and amortized O(1).

Similar to the space lower bound that we presented in Sect. 3, one can show a space lower bound of Ω((1/ε)(log N + log R)(log N)) bits for the SUM problem; see [8] for details. This is asymptotically equal to the upper bound for the algorithm in [8] that we mentioned earlier.

It is natural to ask the question: What happens if we do not restrict data elements to positive integers and are interested in estimating the sum over sliding windows? We show that even if we restrict the set of unique data elements to {1, 0, −1}, approximating the sum within a constant factor requires Ω(N) bits of memory. Moreover, it is easy to maintain the sum by storing the last N integers, which requires O(N) bits of memory. We assume that the storage required for every integer is a constant independent of the window size N. With this assumption, we have that the complexity of the problem in the general case (allowing positive and negative integers) is Θ(N).

We now argue the lower bound of Ω(N). Consider an algorithm A that provides a constant-factor approximation to the problem of maintaining the general sum. Given a bit vector of size N/2, we present the algorithm A with the pair (−1, 1) for every 1 in the bit vector and the pair (1, −1) for every 0. Consider the state (time instance) after we have presented all the N/2 pairs to the algorithm. We claim that we can completely recover the original bit vector by presenting a sequence of 0's henceforth and querying the algorithm on every odd time instance. If the current time instance is T (after having presented the N/2 pairs), then it is easy to see that the correct answer at time instance T + 2i − 1 (1 ≤ i ≤ N/2) is 1 iff the ith bit was 1 and −1 iff the ith bit was 0. Since the algorithm A gives a constant factor approximation, its estimate would be positive if the correct answer is 1 and negative if the correct answer is −1. Since the state of the algorithm after feeding the N/2 pairs enables us to recover the bit vector exactly, for any arbitrary bit vector, it must be using at least N/2 bits of memory to encode it. This proves the lower bound. We can state the following theorem:

Theorem 6 The space complexity of any algorithm that gives a constant factor approximation, at every instant, to the problem of maintaining the sum of the last N integers (positive or negative) that appear as a stream of data elements is equal to Θ(N).
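The following small self-contained demonstration illustrates the encoding argument above: it feeds the (−1, 1)/(1, −1) pairs followed by zeros to an exact sliding-window sum, which stands in for the hypothetical constant-factor algorithm A (whose answers would have the same signs), and recovers the bit vector from the signs of the answers at the odd offsets. The function names are illustrative.

from collections import deque

def window_sum_oracle(stream, N):
    # exact sliding-window sum over the last N elements (stand-in for A)
    window, s, answers = deque(), 0, []
    for x in stream:
        window.append(x); s += x
        if len(window) > N:
            s -= window.popleft()
        answers.append(s)
    return answers

def recover_bits(bits):
    # encode a bit vector of length N/2 as (-1,1)/(1,-1) pairs, pad with 0's,
    # then read the bits back from the signs of the window sums at odd offsets
    N = 2 * len(bits)
    stream = []
    for b in bits:
        stream += [-1, 1] if b == 1 else [1, -1]
    stream += [0] * N
    sums = window_sum_oracle(stream, N)
    T = N                                  # time just after the N/2 pairs
    # the answer at time T + 2i - 1 is +1 iff bit i was 1, -1 iff it was 0
    return [1 if sums[T + 2 * i - 2] > 0 else 0 for i in range(1, len(bits) + 1)]

assert recover_bits([1, 0, 0, 1, 1, 0]) == [1, 0, 0, 1, 1, 0]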

5 References and Related Work

The EH technique that we demonstrate through solutions to the BASICCOUNTING and SUM problems is by Datar, Gionis, Indyk and Motwani [8]. The space lower bounds presented above are also from that paper. In the same paper, the authors characterize a general class of weakly additive functions that can be efficiently estimated over sliding windows, using the EH technique. Also see Datar's PhD thesis [7] for more details. As we have seen in other chapters from this book, it is often the case that input data streams are best visualized as a high dimensional vector. A standard operation is to compute the l_p norm, for 0 < p ≤ 2, of this vector.

³The space required to hold a single data point, which in this case is a point from some metric space, is assumed to be O(1) words.

We may desire a richer class of decay functions, e.g., polynomially decaying weight functions instead of exponential decay. Cohen and Strauss [5] show how to maintain statistics efficiently for a general class of time-decaying functions. Their solutions use the EH technique as a building block or subroutine, thereby demonstrating the applicability of the EH technique to a wider class of models that allow for time decay, besides the sliding-window model that we have considered.

See [7] for solutions to other problems in the sliding-window model that do not rely on the EH technique. These problems include maintaining a uniform random sample (see also [3]), maintaining the min/max of real numbers, estimating the ratio of rare⁴ elements to the number of distinct elements (see also [9]), and estimating the similarity between two data streams measured according to the Jaccard coefficient for set similarity between two sets A, B: |A ∩ B|/|A ∪ B| (see also [9]).

Maintaining approximate counts of high frequency elements and maintaining approximate quantiles are important problems that have been studied in database research as maintaining end-biased histograms and maintaining equi-depth histograms. These problems are particularly useful for sliding-window join processing; they provide the necessary join statistics and can also be used for approximate computation of joins. A solution to these problems, in the sliding-window model, is presented by Arasu and Manku [2] and Lin et al. [17].

6 Conclusion

In this chapter, we have studied algorithms for two simple problems, BASICCOUNTING and SUM, in the sliding-window model; a natural model to discount stale data that only considers the last N elements as being pertinent. Our aim was to showcase the Exponential Histogram (EH) technique that has been used to efficiently solve various problems over sliding windows. We also presented space lower bounds for the two problems above. See Table 1 for a summary of results in the sliding-window model. Note that, for this summary, we measure memory in words, where a word is assumed large enough to hold the answer or one unit of answer. For example, in the case of BASICCOUNTING a word is assumed to be log N bits long, for SUM a word is assumed to be log N + log R bits long, for lp norm sketches we assume that sketches can fit in a word, for clustering a word is assumed large enough to be able to hold a single point from the metric space, and so on. Similarly, we assume it is possible to do a single word operation in one unit of time while measuring time requirements.

4 An element is termed rare if it occurs only once (or a small number of times) in the sliding window.

Table 1 Summary of results for the sliding-window model: space requirements, space lower bounds, and amortized and worst-case update times (in words and word operations) for the problems BASICCOUNTING, SUM, Variance, lp norm sketches, k-median clustering, Min/Max, Similarity, Rarity, Approximate counts, and Quantiles.

References

1. N. Alon, Y. Matias, M. Szegedy, The space complexity of approximating the frequency moments, in Proc. of the 1996 Annual ACM Symp. on Theory of Computing (1996), pp. 20–29
2. A. Arasu, G. Manku, Approximate counts and quantiles over sliding windows. Technical report, Stanford University, Stanford, California (2004)
3. B. Babcock, M. Datar, R. Motwani, Sampling from a moving window over streaming data, in Proc. of the 2002 Annual ACM-SIAM Symp. on Discrete Algorithms (2002), pp. 633–634
4. B. Babcock, M. Datar, R. Motwani, L. O'Callaghan, Maintaining variance and k-medians over data stream windows, in Proc. of the 2003 ACM Symp. on Principles of Database Systems (2003), pp. 234–243
5. E. Cohen, M. Strauss, Maintaining time-decaying stream aggregates, in Proc. of the 2003 ACM Symp. on Principles of Database Systems (2003), pp. 223–233
6. A. Das, J. Gehrke, M. Riedwald, Approximate join processing over data streams, in Proc. of the 2003 ACM SIGMOD Intl. Conf. on Management of Data (2003), pp. 40–51
7. M. Datar, Algorithms for data stream systems. PhD thesis, Stanford University, Stanford, CA, USA (2003)
8. M. Datar, A. Gionis, P. Indyk, R. Motwani, Maintaining stream statistics over sliding windows. SIAM J. Comput. 31(6), 1794–1813 (2002)
9. M. Datar, S. Muthukrishnan, Estimating rarity and similarity over data stream windows, in Proc. of the 2002 Annual European Symp. on Algorithms (2002), pp. 323–334
10. J. Feigenbaum, S. Kannan, M. Strauss, M. Viswanathan, An approximate l1-difference algorithm for massive data streams, in Proc. of the 1999 Annual IEEE Symp. on Foundations of Computer Science (1999), pp. 501–511
11. A. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan, M. Strauss, Fast, small-space algorithms for approximate histogram maintenance, in Proc. of the 2002 Annual ACM Symp. on Theory of Computing (2002)
12. A. Gilbert, Y. Kotidis, S. Muthukrishnan, M. Strauss, Surfing wavelets on streams: one-pass summaries for approximate aggregate queries, in Proc. of the 2001 Intl. Conf. on Very Large Data Bases (2001), pp. 79–88
13. M. Greenwald, S. Khanna, Space-efficient online computation of quantile summaries, in Proc. of the 2001 ACM SIGMOD Intl. Conf. on Management of Data (2001), pp. 58–66
14. S. Guha, N. Mishra, R. Motwani, L. O'Callaghan, Clustering data streams, in Proc. of the 2000 Annual IEEE Symp. on Foundations of Computer Science (2000), pp. 359–366
15. P. Indyk, Stable distributions, pseudorandom generators, embeddings and data stream computation, in Proc. of the 2000 Annual IEEE Symp. on Foundations of Computer Science (2000), pp. 189–197
16. J. Kang, J.F. Naughton, S. Viglas, Evaluating window joins over unbounded streams, in Proc. of the 2003 Intl. Conf. on Data Engineering (2003)
17. X. Lin, H. Lu, J. Xu, J.X. Yu, Continuously maintaining quantile summaries of the most recent n elements over a data stream, in Proc. of the 2004 Intl. Conf. on Data Engineering (2004)
18. R. Motwani, P. Raghavan, Randomized Algorithms (Cambridge University Press, Cambridge, 1995)
19. J.S. Vitter, Random sampling with a reservoir. ACM Trans. Math. Softw. 11(1), 37–57 (1985)

Part II Mining Data Streams

Clustering Data Streams

Sudipto Guha and Nina Mishra

1 Introduction

Clustering is a useful and ubiquitous tool in data analysis. Broadly speaking, clustering is the problem of grouping a data set into several groups such that, under some definition of "similarity," similar items are in the same group and dissimilar items are in different groups. In this chapter, we focus on clustering in a streaming scenario where a small number of data items are presented at a time and we cannot store all the data points. Thus, our algorithms are restricted to a single pass. The space restriction is typically sublinear, o(n), where the number of input points is n.

An important aspect of clustering is the definition of what constitutes a "cluster" and what makes one clustering better than another. Certainly, we seek the best possible clustering of the data, but what defines the "best"? Typically, given a clustering of the data we compute a function that maps the clustering into a number which denotes the quality of the clustering, referred to as the objective function. And thus the clustering problem becomes an optimization problem: find a clustering of the best quality. This mapping of the clustering to an objective function is an important issue—from the perspective of clustering in a streaming scenario where we have incomplete information about the evolving dataset, we have to impose a restriction that the problem remains meaningful. For example, if the cluster quality depends on all the points that are assigned to a cluster then unless there is some way of

S. Guha (B) Department of Computer Information Sciences, University of Pennsylvania, Philadelphia, PA 19104, USA e-mail: [email protected]

N. Mishra Computer Science Department, Stanford University, Stanford, CA 94305, USA e-mail: [email protected]

representing that information in small space, no stream clustering is feasible. More specifically, we must be guaranteed that the clustering of the initial part of the stream is consistent with the latter part. Towards this end, we assume that we are interested in clustering data that arises from (semi) metric spaces, that is, there is a "distance" function (defined formally later) which encodes the relationship between pairs of points. The core property of "distance" is that once we compute the relationship between two points, no subsequent information affects that relationship—although subsequent information can determine if two points are to be in the same or different clusters.

We will assume that the objects to be clustered are presented one at a time and in their entirety. We also assume the existence of an "oracle" or a "black-box" distance function. The oracle distance allows us to simply store these points and compute the distances between a pair of them on demand. There is no unique "correct" distance function and which function is best suited in any given scenario is debated vigorously. Since the thrust of this chapter is the investigation of general techniques for clustering, we will not get into the issue of which distance function is better. Another advantage of assuming a black-box distance function is that the techniques will be relevant to a wide spectrum of data, with categorical, numerical, string or binary attributes.

Due to the algorithm's constraint on storage space, we also assume that the number of clusters is bounded by some given quantity k. There is considerable discussion regarding whether it is reasonable to assume the parameter k. Some clustering objectives avoid the assumption by trading off inter-cluster distance with intra-cluster distance. Others introduce a penalty term into the function such that the quality of the clustering grows worse as the number of clusters increases.

For many applications, it is desirable to represent each cluster by a single object (which is commonly referred to as a center, centroid or medoid). The quality of a single cluster is some function of the similarity/dissimilarity of the objects to the corresponding center. Center-based models are natural in a streaming scenario where it is imperative that we store an implicit representation of each cluster since we cannot store all of the objects in a cluster. In this case it is also natural that every object is implicitly assigned to the nearest representative, thereby defining a partition. The list of representatives specifies the clustering completely and is therefore a small-space description of the clustering. For the streaming algorithms described in this chapter, only this implicit representation of the clustering is produced. That is, the centers are output (and a complete partitioning of the stream is not output, but could be computed from the centers).

Approximation Guarantees

In this chapter, we will focus on algorithms that have guaranteed performance bounds. The specific avenue we pursue is approximation algorithms, where we devise an algorithm with a proof that the algorithm performs provably close to the optimum. A formal definition can be found in the next section. The fundamental aspect of this approach is that approximation algorithms provide a guarantee, or a certificate, of the quality of a clustering. Contrast this with a heuristic that works "well" but can have arbitrarily worse performance. In this chapter, we seek algorithms with a provable guarantee.

The goal of approximation is not only to understand the underlying mathematical structure of the optimization, but also to provide a well-grounded starting point. Every domain will have its own skew and it goes without saying that we should target our heuristics towards exploiting the properties of the domain, but building these heuristics on top of a basic structure given by an approximation algorithm provides a solid foundation. Further, a 3-approximation guarantee implies that the solution is no worse than 3 times the optimum—in practical situations, solutions may be near indistinguishable from the optimal.

Overview and Organization

The challenge of the streaming scenario is to compute in small space and in a single pass, if possible. Clustering is already a non-trivial problem without these restrictions and their presence narrows down the choices in designing algorithms. It is interesting to note that clustering itself is a way of summarizing data, and the algorithms we will see in this chapter use different approaches to maintain the summary.

We begin by reviewing some basic definitions and preliminaries. We discuss the k-center problem in Sect. 3 and the k-median problem in Sect. 4. The two chosen algorithms demonstrate different techniques used in the context of streaming algorithms. The first demonstrates a clustering algorithm similar to quantile estimation on data streams where at all points we maintain a "proof" that the centers we compute are near optimum. In this sense the algorithm is an online algorithm. In the case of the k-median problem, we focus on a divide-and-conquer approach to streaming algorithms. Another tool that is very effective at clustering streams is to simply maintain a small random sample of the stream that is clustered whenever a clustering of the entire stream is desired. Sampling is discussed in Sect. 5. Also, many references to relevant papers are described in that section.

2 Preliminaries and Definitions

To define the types of data sets we consider in this chapter, we introduce the notion of a (semi-)metric space.

Definition 1 For a set of points X and a distance measure D, we say that (X, D) is a semi-metric if the following criteria are satisfied:
1. D(x, x) = 0 for all x ∈ X;
2. (Symmetry) D(x, y) = D(y, x) for all x, y ∈ X;
3. (Triangle Inequality) D(x, z) ≤ D(x, y) + D(y, z) for all x, y, z ∈ X.
If we also have [D(x, y) = 0] ⇒ [x = y], then (X, D) defines a metric space.

In reality, a semi-metric space can be viewed as a metric space with possible duplicates. One common example of a metric space is the Euclidean distance for points in Rd. We denote this metric space by (X, L2), where L2(x, y) = (Σ_{i=1}^{d} (x_i − y_i)²)^{1/2} for X ⊂ Rd. All data sets considered in this chapter are assumed to be metric spaces.

We describe two clustering objective functions. These objectives serve as good illustrations for how streams can be clustered. This chapter does not provide an exhaustive list of objectives. If we view the clusters as "balls", the quality of the clustering can be thought of as the largest radius of the "ball" required to cover all the objects of a cluster. As a natural extension, the overall clustering objective is to choose k centers and find the minimum radius r such that every point belongs to a "ball" of radius at most r around these centers. If the centers are restricted to be points in the input data set, the clustering problem is known as the k-center problem. More formally,

Definition 2 (The k-Center Problem) Given an integer k and a set of points S from a metric space (X, D) where |S| = n, the k-center problem seeks to find k representative points C = {c1, ..., ck} ⊆ S such that

max_{x∈S} min_{ci∈C} D(ci, x)

is minimized.

Observe that if we specify the centers or representatives, then each point should be assigned to the cluster corresponding to the nearest center (the inner minimization achieves this). The outer maximization identifies the largest radius required. In a geometric setting, the definition can be relaxed to include any point in the space as a possible center.

The k-center objective depends on only one number, the farthest distance from any point to its nearest center, alternatively the radius of the "fattest" cluster. The objective ignores the distance from all other points to their nearest center. Consequently, if one outlier point is very far from the other points, that outlier will determine the quality of the clustering (so that it does not really matter how the other points are clustered). An alternate objective, that is less sensitive to outliers, is to minimize the sum of distances from points to nearest centers, known as the k-median problem. The objective seeks to minimize the "average" radius of a ball required to cover all of the points.

Definition 3 (The k-Median Problem) Given an integer k and a set S of n points from a metric space (X, D), the k-median problem seeks to find k representative points C = {c1, ..., ck} such that

Σ_{x∈S} min_{ci∈C} D(ci, x)

is minimized.

The k-median objective is closely related to the squared error distortion or k-median-squared objective

Σ_{x∈S} min_{ci∈C} D(ci, x)²

where the centers ci are typically the mean of all the points in cluster i. In the case that the objects to be clustered are points in Rd, the k-Means algorithm is often used in practice to identify a local optimum solution.

As mentioned earlier, for most natural clustering objective functions, the optimization problems turn out to be NP-hard. Therefore, we seek to design algorithms that guarantee a solution whose objective function value is within a fixed factor of the value of the optimal solution. We use the following definition of approximation algorithms.

Definition 4 A ρ-approximation algorithm for a minimization problem is an algorithm that produces a solution whose objective value is at most ρ times that of an optimal (minimum-cost) solution. ρ is referred to as the approximation ratio or approximation factor of the algorithm.

In fact, finding a 1.3-approximation algorithm for k-median is also NP-hard. The provable approximation ratios for k-median are small constants. In the context of streaming, given the one-pass, small-space restrictions, the approximation ratio tends to be larger than that of the best known polynomial-time algorithms that do not have such restrictions.
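To make these definitions concrete, the following sketch (not from the chapter) treats the distance as a black-box oracle—Euclidean L2 distance here, purely as an example—and evaluates the k-center, k-median, and k-median-squared objectives for a given set of centers; all function names are illustrative.

```python
import math

def l2(x, y):
    """Black-box distance oracle; any (semi-)metric could be plugged in."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def k_center_cost(points, centers, dist=l2):
    # largest distance from any point to its nearest center (Definition 2)
    return max(min(dist(c, x) for c in centers) for x in points)

def k_median_cost(points, centers, dist=l2):
    # sum of distances to the nearest center (Definition 3)
    return sum(min(dist(c, x) for c in centers) for x in points)

def k_median_squared_cost(points, centers, dist=l2):
    # sum of squared distances to the nearest center
    return sum(min(dist(c, x) for c in centers) ** 2 for x in points)

pts = [(0, 0), (1, 0), (10, 0), (11, 1)]
ctrs = [(0, 0), (10, 0)]
print(k_center_cost(pts, ctrs), k_median_cost(pts, ctrs))
```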

3 The k-Center Clustering on a Stream

In order to appreciate the difficulty that arises in clustering a data stream, we describe the non-streaming algorithms first. We begin with the FARTHEST POINT clustering algorithm, which is given in Fig. 1. The algorithm produces a collection of k centers and a 2-approximation to the optimal clustering, i.e., finds a set of k centers such that the maximum distance from a point to its nearest center is at most twice as large as in the optimal k-center clustering. The algorithm finds the collection via an iterative process where the first center is chosen arbitrarily and subsequent centers are selected by finding the point in the data set farthest from its closest center. Recall again that a point here refers to an object in an arbitrary metric space, and not just points in Euclidean space.

Suppose the algorithm had been run one more step. Suppose it would have chosen ck+1 and the distance from ck+1 to the other centers would have been t. Then we would have k + 1 points, c1, ..., ck, ck+1, such that the distance between any pair of them is at least t (otherwise ck+1 should have preceded ck). Now by the pigeonhole principle, at least two of these k + 1 points must belong to the same cluster. The radius of the cluster in the optimal solution which contains two of the ci's is at

FARTHEST POINT ALGORITHM
1. Input: Set of points (objects) S, distance metric (X, D), number of centers k
2. Output: Set of k centers
3. c1 ← any point in S
4. for i = 2 to k

(a) ci ← point in S farthest from c1,...,ci−1

5. Output c1,...,ck

Fig. 1 The FARTHEST POINT algorithm for k-center

least t/2, otherwise the two points would be at distance less than t apart. All that remains is to observe that every point is within distance t from the chosen k centers, namely c1, ..., ck. Thus we have a 2-approximation.

Theorem 1 The FARTHEST POINT algorithm produces k centers that are a 2-approximation to the optimal k-center clustering.

An interesting and useful observation that follows from the argument just given is that if the algorithm is run with k = n we get a canonical reordering of the objects such that the first k points define an approximate k-center solution for every 1 ≤ k ≤ n.

A 2-approximation algorithm is provably the best possible in the case of the k-center problem. As we explain next, the problem of finding a (2 − ε)-approximation for 0 < ε < 1 is NP-hard, since the Dominating Set problem reduces to this problem. Given a graph G = (V, E), a subset of vertices S is a dominating set if each vertex v in V is either in S or adjacent to some vertex in S. The Dominating Set problem is, given a graph G = (V, E) and an integer k, to determine if there exists a dominating set of size k. We can transform a graph G into a metric space by making the distance between adjacent vertices 1 and the distance between non-adjacent vertices 2. Observe that if G has a dominating set of size k, then there is a k-center clustering with cost 1, and if G does not have a dominating set of size k, then the best k-center clustering has cost 2. Thus, if there were a (2 − ε)-approximation algorithm for k-center, we could use it to determine whether the graph has a dominating set of size k. Since the dominating set problem is NP-hard, so is the problem of finding a (2 − ε)-approximation to k-center.

Theorem 2 The problem of finding a (2 − ε)-approximation to the k-center clustering problem is NP-hard for any 0 < ε < 1.
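For reference, here is a minimal (non-streaming) sketch of the FARTHEST POINT algorithm of Fig. 1, assuming an in-memory list of points and a black-box distance function; the names are illustrative.

```python
import math

def l2(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def farthest_point(points, k, dist=l2):
    """Greedy 2-approximation for k-center: repeatedly add the point
    farthest from the centers chosen so far."""
    centers = [points[0]]                      # c1: an arbitrary point
    # distance from each point to its current nearest center
    d = [dist(points[0], p) for p in points]
    for _ in range(1, k):
        i = max(range(len(points)), key=lambda j: d[j])
        centers.append(points[i])
        d = [min(d[j], dist(points[i], points[j])) for j in range(len(points))]
    return centers

pts = [(0, 0), (1, 0), (0, 1), (10, 10), (10, 11), (-5, -5)]
print(farthest_point(pts, k=3))
```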

The FARTHEST POINT algorithm is not suitable for clustering a stream since it is not possible to determine which point is farthest from one of the current centers without reading and maintaining information about the entire stream. However, the algorithm does demonstrate that the k centers discovered plus the next center that would be chosen (had the algorithm been run one more step) serve as a witness to a

HIERARCHICAL AGGLOMERATIVE CLUSTERING (HAC)
1. Input: Set of points (objects) S, distance metric (X, D), number of centers k
2. Output: Set of k centers
3. Initially each point is a cluster (of size 1), and they are said to represent their cluster.
4. for i = 1 to n
   (a) Pick the two closest clusters, e.g., the pair of representatives that are closest.
   (b) Merge the two clusters and choose an appropriate representative of the combined cluster.
5. Output the tree of clusters/dendrogram.

Fig. 2 The structure of a typical hierarchical agglomerative algorithm

clustering—if we were to find k + 1 objects separated by distance at least t by some other method, we would still have an approximation algorithm for k-center.

One simple idea for a streaming k-center method is to always maintain k cluster representatives, assign each incoming point to its nearest representative, and then somehow update the cluster representative. The idea comes from one of the earliest and frequently used methods for clustering, Hierarchical Agglomerative Clustering (HAC), which we describe next. HAC is a bottom-up algorithm that repeatedly merges the two closest clusters (see Fig. 2). The algorithm produces a tree of clusters where each node is a cluster consisting of subclusters defined by its children. The leaves of the tree correspond to the individual points. The tree of clusters is often referred to as a dendrogram.

One naive, greedy way that HAC can be extended to handle a data stream is as follows. Always maintain k centers, and when a new point arrives, treat that point as a new center and make some greedy decision as to how best to reduce the k + 1 centers down to k. In particular, the two closest centers could be merged and the new center of the merged cluster is the old center with the larger radius. Regrettably, this greedy algorithm does not have a good approximation ratio—we can construct examples where the approximation ratio of the algorithm is 2k − 1. Consequently, we will have to be more clever than greedy if we wish to find a streaming k-center algorithm.

By combining the ideas behind the FARTHEST POINT algorithm and HAC, it is possible to obtain an algorithm with a good approximation ratio. The basic intuition comes from the following question: Suppose k = 2. If we find three representatives equidistant from each other, which two do we merge? A natural answer is that all three clusters should be merged together. But what if the distances are not the same? The main intuition is to merge the closest pair of centers, but to extend the radius by a factor to account for the distances that are close. This is shown in Fig. 3 where the centers a, b are the closest, and we proceed to merge a, b, c and the new point together. Now, because we have increased the radius of the clusters significantly, it stands to reason that we should merge the centers d and e.

The overall algorithm is given in Fig. 4. The idea is to maintain a lower bound r on the optimal solution of the k-center problem, then arbitrarily select any center and merge all centers within distance r, repeating until all centers are covered. This

Fig. 3 The intuition behind the streaming k-center algorithm

DOUBLING ALGORITHM
1. Input: Sequence of points (objects) S from a metric space (X, D) arriving one at a time
2. Output: Set of k centers
3. Initially, the first k points are chosen as centers and r ← 0.
4. for every new point i

(a) Suppose the current centers are c1, ..., cℓ.
(b) If i is within distance 4r from any center, assign i to that cluster (choose arbitrarily if there is more than one).
(c) Otherwise,

i. if ℓ < k, then cℓ+1 ← i, i.e., i starts a new cluster;
ii. else (there are now k + 1 centers, counting i): let t be the minimum pairwise distance among the k + 1 centers and set r ← t/2; then repeatedly pick an arbitrary remaining center, merge into its cluster every other center within distance 4r of it, and continue until the remaining centers are pairwise more than 4r apart.

Fig. 4 Stream k-center clustering, the DOUBLING ALGORITHM

parameter r will grow monotonically as we see more points. It can be shown that the algorithm will have at most k centers at any point. Interestingly, the number of centers may drop below k (there may even be just 1 center), but in such a case, it can be shown that the algorithm's performance is still close to optimal.

The reduction from k + 1 centers to at most k centers in Step 4(c)ii balances between the greedy streaming HAC algorithm described above and the farthest point algorithm. The step is reminiscent of center greedy since it greedily decides which centers to merge. On the other hand, the lower bound on the radius given by the farthest point algorithm is cleverly used to decide which centers to merge.

It can be shown that at any point of time, if the centers are {c1, ..., cℓ}, where ℓ ≤ k, then the minimum pairwise distance between centers is greater than 4r. Thus, when we get to the situation where we have k + 1 centers, the minimum pairwise distance, t, is greater than 4r. Thus, when we set r ← t/2, the value of r at least doubles.

Now suppose the maximum radius of a cluster is R(r′) when the lower bound has a particular value r′ during the run of the algorithm. When we set r ← t/2, suppose the lower bound changes from r′ to r. The recurrence that describes the new maximum radius is

R(r) ≤ R(r′) + 4r.

The equation follows from the fact that the center of one of the merged clusters (say with center u) is at a distance of at most 4r from the new center. Any point in that cluster of u can be at most R(r′) away from u. Thus the bound on the maximum radius of the new cluster follows from the triangle inequality. The only guarantee we have is that r′ ≤ r/2, and thus any function R(·) satisfying R(r) ≤ R(r/2) + 4r will upper bound the maximum radius of a cluster. The solution to the recurrence is R(r) ≤ 8r. By inspection we get the following:

Theorem 3 The STREAM k-CENTER CLUSTERING algorithm given in Fig. 4 finds an 8-approximation to the k-center streaming problem using O(k) space and O(τnk) running time, where the cost of evaluating the distance between two objects is τ.
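Unrolling the recurrence R(r) ≤ R(r/2) + 4r from the argument above makes the constant 8 explicit; this is only the geometric series implicit in the text:

```latex
R(r) \;\le\; 4r + 4\cdot\frac{r}{2} + 4\cdot\frac{r}{4} + \cdots
     \;=\; 4r\sum_{j\ge 0} 2^{-j} \;=\; 8r.
```

Since r is maintained as a lower bound on the optimal k-center radius, every point lies within distance 8r ≤ 8·OPT of its assigned center, which is the guarantee stated in Theorem 3.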

One way of viewing the above algorithm is that when we decide to merge clusters, we increase the radius to be a bit more than the minimum required to reduce the number of centers to k. In this process, we may have fewer than k centers. The free centers are used to start new clusters if a new object is very far from the current set of objects. In a sense, the algorithm provisions for the future, and incurs a cost in the current solution. This is a very useful paradigm in streaming algorithms.
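A compact sketch of the DOUBLING ALGORITHM of Fig. 4, under the reading of Step 4(c) given above (a far-away point either opens a cluster or triggers the merge phase with r ← t/2); this is an illustrative reconstruction, not the authors' code.

```python
import itertools, math

def l2(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def doubling_k_center(stream, k, dist=l2):
    """Streaming k-center in O(k) space; returns the final centers and r."""
    stream = iter(stream)
    centers = list(itertools.islice(stream, k))    # the first k points
    r = 0.0
    for p in stream:
        if min(dist(p, c) for c in centers) <= 4 * r:
            continue                               # p is covered by some center
        centers.append(p)
        if len(centers) <= k:
            continue                               # p simply opens a new cluster
        # merge phase: raise the lower bound and thin out the centers
        t = min(dist(a, b) for a, b in itertools.combinations(centers, 2))
        r = t / 2.0
        kept = []
        for c in centers:
            if all(dist(c, q) > 4 * r for q in kept):
                kept.append(c)                     # keep centers pairwise > 4r apart
        centers = kept
    return centers, r
```

Because the minimum pairwise distance t equals 2r after the update, at least one center is always discarded in the merge phase, so at most k centers survive, and only O(k) objects are ever stored.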

4 The k-Median Divide-and-Conquer Approach

Recall that the k-median objective is, given a set of objects S, to find a set C of medians (centers) such that C ⊆ S, |C| ≤ k, and Σ_{x∈S} min_{c∈C} D(x, c) is minimized. This objective is sometimes referred to as the discrete k-median problem since the centers are restricted to be points in the data set. In the event that the objects come from a continuous space, say (Rd, L2), it is natural to ask whether it is possible to identify centers C ⊂ Rd such that the average distance from a point to its nearest center is minimized. This problem is known as the continuous k-median problem. As it turns out, the best discrete solution is at most twice as large as the best continuous solution. Consequently, we henceforth address the discrete version of the clustering problem.

1-SWAP ALGORITHM
1. Input: Set of points S, distance metric d, number of centers k
2. Output: Set of k centers
3. Let C be a set of k centers from S.
4. While there exists a center x ∈ C and a non-center y ∈ S \ C such that Σ_{z∈S} min_{c∈C\{x}∪{y}} D(z, c) is smaller than Σ_{z∈S} min_{c∈C} D(z, c) by a factor of (1 + ε)
   (a) C ← C \ {x} ∪ {y}
5. Output C

Fig. 5 The local search based SWAP algorithm
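A minimal sketch of the 1-SWAP local search of Fig. 5 (discussed further below), assuming the points fit in memory; the (1 + ε) improvement test mirrors the figure, but the initialization and names are illustrative only.

```python
import math

def l2(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cost(points, centers, dist=l2):
    return sum(min(dist(c, x) for c in centers) for x in points)

def one_swap(points, k, eps=0.05, dist=l2):
    """Local search for discrete k-median: swap a center with a non-center
    while the cost drops by at least a (1 + eps) factor."""
    centers = list(points[:k])   # any start works; a FARTHEST POINT start
                                 # keeps the number of swaps polynomial
    best = cost(points, centers, dist)
    improved = True
    while improved:
        improved = False
        for x in list(centers):
            for y in points:
                if y in centers:
                    continue
                trial = [c for c in centers if c != x] + [y]
                c_trial = cost(points, trial, dist)
                if c_trial * (1 + eps) < best:
                    centers, best, improved = trial, c_trial, True
                    break
            if improved:
                break
    return centers
```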

Theorem 4 For any collection C of k centers that is a continuous k-median solution of cost T, there exists a discrete k-median solution of cost at most 2T.

Proof Imagine a particular continuous cluster consisting of points x1, ..., xp with continuous center y. If we replace the continuous center y with the closest discrete center, say xi, then the distance from any point to its nearest center at most doubles, since D(xj, xi) ≤ D(xj, y) + D(y, xi) ≤ 2D(xj, y), where the second inequality uses the fact that xi is the point closest to y, so D(y, xi) ≤ D(y, xj). Thus, summing over all points (and all clusters), the total cost at most doubles. □

Also closely related to the discrete k-median objective is the k-median-squared objective, where the goal is to minimize the average squared distance from a point to its nearest center. We delve more into this objective in Sect. 4.2.

As before, we begin by describing a non-streaming solution to the k-median problem. The algorithm starts with an initial set of k centers and then repeatedly swaps one of the current centers with a non-center if the clustering quality improves (see Fig. 5). There are two important issues with respect to the Local Swap Algorithm: running time and clustering quality. If the initial set of centers is chosen via the FARTHEST POINT algorithm, then it can be shown that the algorithm's running time is polynomial in n and k. Note that we did not specify how the initial centers are chosen in Step 3 of the Local Swap algorithm because any method may be used. However, to ensure that the algorithm runs in polynomial time, one needs to start with a solution that is not arbitrarily far from the optimum. If the FARTHEST POINT algorithm is used as a starting point, then one can show that such a set of initial centers is no more than a factor of n away from the optimal solution. This fact is helpful in bounding the algorithm's running time. In terms of final clustering quality, it can be shown that the algorithm finds a 5-approximation (ignoring ε-factors). If the swap in Step 4 is performed with p points at a time, then the algorithm can be shown to be a (3 + 2/p)-approximation (again, ignoring ε-factors).

The algorithm does not lend itself to streaming data—multiple passes are essential to evaluate the cost of the solution, requiring that all points be stored. But maintaining all points in main memory is not feasible in a streaming scenario. We next describe a simple divide-and-conquer approach that yields a good k-median clustering of a data stream. The basic idea is to break the stream into pieces and cluster each piece. For each piece, the cluster centers and a count of the

k-MEDIAN STREAM ALGORITHM

1. Input: Data Stream S = x1, ..., xn, distance metric d, number of centers k
2. Partition the stream into consecutive pieces P1, ..., Pm
3. For each piece Pi

(a) Cluster Pi
(b) In memory, maintain the centers of Pi and the number of points (weight) assigned to (nearest to) each center.

4. To output a clustering after any piece Pi , cluster the weighted centers of P1,...,Pi .

Fig. 6 The STREAM algorithm

Fig. 7 An illustration of the algorithm STREAM, m = 3 pieces and k = 5 clusters

number of points assigned to each center is the only information retained. In other words, the actual objects of the data stream that belong to the piece can be permanently discarded. If at any point we need to produce a final clustering, we consider the union of the (weighted) cluster centers collected from all the pieces seen so far. We cluster this set of weighted points to obtain a good clustering.

Observe that the k-Median Stream Algorithm given in Fig. 6 is really just a general framework, in that the clustering in Steps 3(a) and 4 can be performed using any clustering algorithm. Note that it is possible that the weighted centers maintained in memory themselves exceed the size of memory. In such a case, we can cluster the weighted centers and use those k weighted centers as a snapshot of the stream.

We illustrate the process in Fig. 7, where for the sake of simplicity the entire dataset is partitioned into three pieces. The shape of a point (dot, circle or triangle) indicates the piece that the point belongs to—shown in Fig. 7(a). Figure 7(b) shows the 5-clustering of the first piece and Fig. 7(c) of the

second piece. (The third piece alone is not shown.) The union of all the cluster centers from the different pieces is shown in Fig. 7(d)—although the figure does not show the weights, the union of centers mimics the distribution of points in the original dataset quite well.

The importance of the natural divide-and-conquer algorithm is twofold. First, we can bound the quality of the clustering achieved. Second, the space required by the above algorithm is sublinear, i.e., o(n). The latter can be derived as follows. Suppose we partition the stream into m equal pieces. While clustering Pi we only need to store the piece Pi and the set of centers from the previous pieces. Thus the space required is O(mk + n/m). Setting m = √(n/k), the space requirement is O(√(nk)), which is o(n) since typically k ≪ n. The benefit carries over to the running time as well. Most clustering algorithms with approximation guarantees require at least quadratic time. The running time of the streaming algorithm, using such a quadratic-time clustering subroutine, is O(m(n/m)²), which is O(n^1.5 k^0.5).

The quality of the clustering algorithm is given in the next theorem. The constants in the theorem can be improved (see the Notes section at the end of the chapter). We prove a weaker result that brings out the key ideas.

Fig. 8 A single cluster and a particular piece

Theorem 5 If the stream is a sequence of points S and if an optimal algorithm is used to identify the cluster centers in Steps 3(a) and 4 of the k-Median Stream Algorithm, then for any i, the cluster centers output for P1, ..., Pi are an 8-approximation to the optimal clustering. Moreover, if we use an α-approximation algorithm for the clustering substeps, then the final solution is a (2α(2α + 1) + 2α)-approximation.

Proof Sketch To see how the result is derived, consider a single cluster as in Fig. 8(a). Consider all the objects in a particular piece Pi and the center of the cluster as in Fig. 8(b); note that the center c of the cluster may not belong to this piece. But if we were to choose the point u, which is closest to c among all points of the piece, then the distance D(u, v) for any other point v can be bounded by the sum D(u, c) + D(c, v), which is at most 2D(c, v). Thus, adding over all v and over all clusters (equivalent to adding over all points in Pi), we can say that there is a clustering that costs at most twice the cost of the particular piece Pi in the optimal

solution. This technique is known as a shifting argument—we show that there exists a good solution if we consider some hypothetical point, and "shift" to the actual point closest to the hypothetical one.

But we cannot control the clustering of the piece Pi, and the actual clustering may merge parts of two clusters which contain points from Pi. What we can guarantee is that the overall cost of the clustering substep, over all pieces, is at most 2OPT. If we use an α-approximation algorithm the cost will be 2αOPT.

Now consider putting together the cluster centers from the pieces. Suppose our clustering chose v as one of the centers. We will show that there exist k hypothetical centers such that the cost of clustering the weighted centers is bounded by βOPT; thus, by the shifting argument, we guarantee a cost of 2βOPT. Consider the center σ(v) in the optimal solution closest to v; this object may not have been chosen as a center while clustering the pieces. Thus, σ(v) may not exist as a weighted center and in that sense is "hypothetical". Let σ(u) denote the center to which u is assigned in the optimal solution. The situation is represented in Fig. 9.

Fig. 9 The overall analysis

Now the distance D(v, σ(v)) can be bounded above by the sum D(v, u) + D(u, σ(u)):

D(v, σ(v)) ≤ D(v, σ(u)) ≤ D(v, u) + D(u, σ(u)).

Assume that the number of objects in the cluster of v is wv. If we sum the above equation over all u in the cluster of v, we get:

wv · D(v, σ(v)) ≤ Σ_{u∈Cluster(v)} (D(v, u) + D(u, σ(u))).

Now if we were to sum over all v, then

Σ_v wv · D(v, σ(v)) ≤ Σ_v Σ_{u∈Cluster(v)} D(v, u) + Σ_u D(u, σ(u)).

The left-hand side is the cost of clustering the weighted centers. The last term on the right-hand side is the cost of the optimal solution! The first term is the cost of clustering the pieces, which is at most 2αOPT. Therefore, there exists a hypothetical solution of cost at most (2α + 1)OPT. This implies, by the shifting argument, that there exists a solution of cost 2(2α + 1)OPT using actual weighted centers. But we cannot guarantee that we achieve that solution, and we suffer a further α factor due to the approximation.

The total cost is thus the 2αOPT we paid to cluster the pieces, plus the cost of clustering the weighted centers, which is at most 2α(2α + 1)OPT. One way to view it is that if the point u was looking for a center to get assigned to, we send u to v, accruing a total cost of 2αOPT over all points, and then from v to the final center, accruing the cost of clustering the weighted centers. □
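Writing the accounting of the proof as one chain makes the constant in Theorem 5 explicit:

```latex
\text{total cost}
  \;\le\; \underbrace{2\alpha\,\mathrm{OPT}}_{\text{clustering the pieces}}
      \;+\; \underbrace{\alpha\cdot 2(2\alpha+1)\,\mathrm{OPT}}_{\text{clustering the weighted centers}}
  \;=\; \bigl(2\alpha(2\alpha+1)+2\alpha\bigr)\,\mathrm{OPT}.
```

With an optimal subroutine (α = 1) this gives (2·3 + 2)·OPT = 8·OPT, matching the first part of the theorem.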

By a recursive application of the above theorem, it also follows that with each additional level of clustering, i.e., if the pieces themselves are split using the two-level algorithm, the algorithm gives up an extra multiplicative factor for each step of the process. The constants can be improved significantly by using a different local search algorithm; see [9], in Step 3(a) of the STREAM algorithm. The recursive algorithm gives an O(nk log n) time algorithm that uses O(n^ε) space and gives a 2^O(1/ε)-approximation in a single pass over the data. A full proof of the above requires development of ideas from facility location problems, and thus is omitted. Note that the space requirement can be further improved to (k log n)^O(1) while still achieving a constant-factor approximation; see the Notes section.
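The following sketch (illustrative, not the authors' implementation) instantiates the framework of Fig. 6. The clustering subroutine is deliberately left as a parameter, cluster_fn(points, weights, k), reflecting the remark above that any (weighted) clustering algorithm can be used in Steps 3(a) and 4; a weighted variant of the 1-SWAP sketch shown earlier is one possible choice.

```python
import math

def l2(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def stream_k_median(stream, k, piece_size, cluster_fn, dist=l2):
    """Fig. 6 framework: cluster each piece with cluster_fn, retain only
    (center, weight) summaries, then cluster the weighted centers."""
    kept = []                                 # list of (center, weight) pairs
    piece = []
    for x in stream:
        piece.append(x)
        if len(piece) == piece_size:
            kept.extend(summarize(piece, k, cluster_fn, dist))
            piece = []                        # the raw points are discarded
    if piece:
        kept.extend(summarize(piece, k, cluster_fn, dist))
    pts = [c for c, _ in kept]
    wts = [w for _, w in kept]
    return cluster_fn(pts, wts, k)            # Step 4: cluster the weighted centers

def summarize(piece, k, cluster_fn, dist):
    """Steps 3(a)-(b): cluster one piece and count the points per center."""
    centers = cluster_fn(piece, [1] * len(piece), k)
    counts = [0] * len(centers)
    for x in piece:
        counts[min(range(len(centers)), key=lambda j: dist(centers[j], x))] += 1
    return list(zip(centers, counts))
```

Choosing piece_size on the order of √(nk) reproduces the O(√(nk)) space bound derived above.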

4.1 Performance of the Divide-and-Conquer Approach

The above algorithm is a natural algorithm, and similar ideas of partially clustering data have been used in practice. In one such algorithm, BIRCH [38], a cluster feature tree (CFTree) preclusters the data. The tree is a space-partitioning tree and every input point follows a path along the tree based on the membership of the point to the two sides of the partition corresponding to a node. The tree is adjusted dynamically and balanced so that roughly the same number of points are allocated to the leaves. The leaf nodes maintain a summary and these summaries are the preclusters. BIRCH then uses a standard clustering algorithm, HAC, k-Means (see Sect. 4.2), etc., to complete the clustering.

Most known preclustering algorithms are similar top-down algorithms, i.e., points are classified into bins based on a decision tree. The divide-and-conquer algorithm yields a bottom-up preclustering. From the point of view of the amount of information used by the algorithm, a bottom-up clustering has the advantage that the decision to group points in a cluster is taken after a larger number of points have been seen in the stream. It is quite natural to expect that the above bottom-up process yields a more robust (pre)clustering at the expense of taking more time, compared to a top-down process.

4.2 The k-Median-Squared Objective and the k-Means Algorithm

The k-median-squared objective function seeks to minimize the function Σ_{x∈X} min_{c∈C} D²(x, c) where |C| = k. This objective has a rich history dating back to the beginning of Location Theory put forth by Weber in 1909 [37]. Many known constant-factor approximation algorithms for the k-median problem yield a constant-factor approximation for the sum-of-squares objective as well, using a "relaxed triangle inequality". This relaxed triangle inequality states that for all x, y, z, D²(x, z) ≤ 2(D²(x, y) + D²(y, z)). The reason this inequality holds is that D²(x, z) ≤ (D(x, y) + D(y, z))² ≤ 2(D²(x, y) + D²(y, z)). Note that there is no known generic method for converting an approximation bound for k-median to an approximation bound for k-median-squared. Rather, the relaxed triangle inequality can sometimes be invoked in steps of the proof where the triangle inequality is used.

For historical reasons and because the algorithm is widely used, we describe the k-Means algorithm, or Lloyd's algorithm, which attempts to solve the continuous version of the k-median-squared objective. The algorithm starts with k tentative centers. Each of the data objects is assigned to its nearest center. After all points are assigned, the mean of all the points assigned to a center forms the new center of the cluster. These k centers now serve as the new set of tentative centers. The process repeats until the sum of squares does not go down any further (or much further). The final k centers define the clustering.

The above is an iterative improvement algorithm, but it is not guaranteed to converge to a good solution on arbitrary data. In fact, for poor placement of initial centers, there exist situations where the algorithm provably finds solutions that are arbitrarily worse than the optimum. For example, suppose that there are four two-dimensional points (−M, −1), (−M, 1), (+M, −1), (+M, 1) to be 2-clustered and the initial centers are (0, 1) and (0, −1). In this case, the k-Means algorithm will output (0, 1) and (0, −1) as the final centers. The cost of this solution is 4M² as opposed to the optimal 4. Thus by pushing the points farther apart, i.e., increasing M, the ratio of the k-means solution to the optimal solution becomes arbitrarily bad.

However, if the k-Means algorithm is properly seeded with a good collection of initial centers, then the clustering quality is not arbitrarily worse than the optimal. For a specific randomized method of selecting initial centers [4], it can be shown that the expected cost of the initial centers (without running k-Means) is an O(log k)-approximation. If k-Means is subsequently run with this choice of initial centers, then the final clustering cost only improves.
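A bare-bones sketch of Lloyd's iteration as described above, for points in Rd; the seeding here simply takes the first k points and is purely illustrative, whereas the randomized seeding of [4] (choosing each new center with probability proportional to squared distance) gives the O(log k) guarantee mentioned in the text.

```python
def lloyd(points, k, iters=100):
    """Lloyd's (k-Means) iteration: assign to nearest center, recompute means."""
    d = len(points[0])
    centers = [list(p) for p in points[:k]]          # naive seeding (illustrative)
    for _ in range(iters):
        # assignment step
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda j: sum((p[t] - centers[j][t]) ** 2
                                                for t in range(d)))
            clusters[j].append(p)
        # update step: each center becomes the mean of its cluster
        new_centers = []
        for j in range(k):
            if clusters[j]:
                new_centers.append([sum(p[t] for p in clusters[j]) / len(clusters[j])
                                    for t in range(d)])
            else:
                new_centers.append(centers[j])       # keep an empty cluster's center
        if new_centers == centers:
            break                                    # sum of squares cannot improve further
        centers = new_centers
    return centers

pts = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
print(lloyd(pts, k=2))
```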

5 Notes

The first approximation algorithm for the k-center clustering problem was provided in [23]; the farthest point algorithm is presented in [18]. The problem is hard to approximate up to a (2 − ε)-factor for general metric spaces, and this does not improve significantly in the case of geometric spaces. In [17], it is shown that k-center is hard to approximate up to a factor of 1.82 in the plane.

The k-center stream clustering algorithm given in Sect. 3 is presented in [8]. While that algorithm gives an 8-approximation, a randomized version yields an improved 2e-approximation. The authors also show that greedy algorithms for maintaining a collection of k centers perform poorly. Finally, lower bounds demonstrate that no deterministic stream k-center clustering algorithm can achieve a bound better than (1 + √2) for arbitrary metric spaces.

For the k-median problem several approximation algorithms were known [10, 25, 32]—the analysis of 1-swap (or p-swap) is due to [5]. The swap-based algorithms were introduced in [27] as 'Partition Around Medoids' (PAM) in the context of the k-median-squared objective. Bicriterion approximation algorithms, i.e., algorithms using slightly more than k medians, were provided in [9, 29, 30]. Several of the above algorithms extend naturally to the sum-of-squared-distances objective function. Approximation algorithms specifically tuned for the squared error objective were studied in [26] for arbitrary metric spaces. The running time of the approximation algorithms has been an issue—see the discussion on sampling-based approaches below. In the Euclidean case, [28] gave a near linear time approximation scheme for the k-median problem, based on a previous result by [3]. Using core-sets introduced in [1], the running time has been further reduced to linear in n (plus a (k log n)^O(1) term and a function dependent on k, ε, d) in [21]. The algorithm of [21] extends to a similar result for the sum-of-squared objective as well. The authors of [15] demonstrate that the Singular Value Decomposition of the n × d matrix representing n points in Rd can be used to find a 2-approximation to the optimum for the sum-of-squared objective.

The algorithm given in Sect. 4 is a slight variation of the first k-median streaming algorithm presented in [19]. Their paper gives a 5-approximation for a simple 2-level algorithm, and in general, describes a one-pass 2^O(1/ε)-approximation algorithm that uses O(n^ε) space. Subsequently, in [12] a constant-factor, one-pass, randomized stream algorithm using (k log n)^O(1) space is provided. In the sliding-window model, a 2^O(1/τ)-approximate k-median solution for the last N points using O((k/τ⁴)N^{2τ} log² N) space, where the parameter τ < 1/2 enables a tradeoff between space and clustering quality, is given.

While we only discussed single-pass algorithms, there is certainly strong motivation for multi-pass algorithms. For instance, if data is stored in some slow but large repository, a multi-pass algorithm may be desirable. In such a scenario, it may be important to trade off resources required for preprocessing the data with resources required for clustering the data. The boundary between "streaming" and "offline" computation is then somewhat blurred.

The Sampling Approach

Given that random sampling is a powerful tool for stream processing and given that a random sample can be maintained using Vitter's reservoir sampling technique (refer to Chap. 1), a natural question is: can one effectively cluster a stream by simply maintaining a random sample? The general answer is that a small random sample can be shown to be descriptive of all the data.

For the k-center objective, if clustering is based on a random sample, then observe that it is not possible to obtain a constant-factor approximation of the entire stream, since there may be one point in the stream that is very far from every other point in the stream. The probability that such a faraway point is in a small random sample is very small. However, if the algorithm is allowed to ignore a small fraction of the points, then constant-factor approximations can be obtained via random sampling [2] in a single pass. Clustering algorithms that discount a small fraction of points are considered in [11] in greater detail.

For the k-median objective, in [24] a bicriteria approximation, which identifies 2k centers that are a constant-factor approximation to the best k-median clustering, was provided. The two-pass algorithm requires two samples of size Õ(√(nk)). In [34], it is shown that the simple single-pass algorithm of drawing a random sample and then running an approximation algorithm on the sample yields a good approximation to the best clustering of all the data. The sample size was shown to depend polynomially on the ratio of the diameter of the data to the desired error and depend logarithmically on n. Further sampling results can be found in [6, 14, 33], and the references therein. Also, fast approximation algorithms for the k-median problem have been devised through recursive sampling [31, 35].

As mentioned earlier, all of the above assume that the distances form a semi-metric, i.e., satisfy the triangle inequality and symmetry. Without the triangle inequality assumption, it is NP-hard to find any constant-factor approximation to most clustering objectives, including the k-center and k-median objectives [22, 36]. The same holds for the k-center problem if the distances are asymmetric (but satisfy the triangle inequality) [13].

There are also many heuristics for clustering streams. BIRCH [38] compresses a large data set into a smaller one via a CFTree (clustering feature tree) as we discussed earlier. CURE [20] uses sampling and multi-centroid representations to cluster the data. Both of the above algorithms output the cluster centers in one pass. Several papers, e.g., [7, 16], propose to repeatedly take k weighted centers (initially chosen randomly with weight 1) and as much data as can fit in main memory, and compute a k-clustering. The new k centers so obtained are then weighted by the number of points assigned, the data in memory is discarded, and the process repeats on the remaining data. This algorithm places higher significance on points later in the data stream.

References

1. P.K. Agarwal, S. Har-Peled, K.R. Varadarajan, Approximating extent measures of points. J. ACM 51(4), 606–635 (2004)
2. N. Alon, S. Dar, M. Parnas, D. Ron, Testing of clustering. SIAM J. Discrete Math. 16(3), 393–417 (2003)
3. S. Arora, P. Raghavan, S. Rao, Approximation schemes for Euclidean k-medians and related problems, in Proc. of STOC (1998), pp. 106–113
4. D. Arthur, S. Vassilvitskii, k-Means++: the advantages of careful seeding, in Proc. of SODA (2007)
5. V. Arya, N. Garg, R. Khandekar, A. Meyerson, K. Munagala, V. Pandit, Local search heuristic for k-median and facility location problems. SIAM J. Comput. 33(3), 544–562 (2004)

6. S. Ben-David, A framework for statistical clustering with constant time approximation algorithms for k-median clustering, in Proc. of COLT (2004), pp. 415–426
7. P.S. Bradley, U.M. Fayyad, C. Reina, Scaling clustering algorithms to large databases, in Proc. of KDD (1998), pp. 9–15
8. M. Charikar, C. Chekuri, T. Feder, R. Motwani, Incremental clustering and dynamic information retrieval. SIAM J. Comput., 1417–1440 (2004)
9. M. Charikar, S. Guha, Improved combinatorial algorithms for the facility location and k-median problems, in Proc. of FOCS (1999), pp. 378–388
10. M. Charikar, S. Guha, É. Tardos, D.B. Shmoys, A constant factor approximation algorithm for the k-median problem. J. Comput. Syst. Sci. 65(1), 129–149 (2002)
11. M. Charikar, S. Khuller, D.M. Mount, G. Narasimhan, Algorithms for facility location problems with outliers, in Proc. of SODA (2001), pp. 642–651
12. M. Charikar, L. O'Callaghan, R. Panigrahy, Better streaming algorithms for clustering problems, in Proc. of STOC (2003), pp. 30–39
13. J. Chuzhoy, S. Guha, E. Halperin, S. Khanna, G. Kortsarz, R. Krauthgamer, J. Naor, Asymmetric k-center is Ω(log∗ n) hard to approximate, in Proc. of STOC (2004), pp. 21–27
14. A. Czumaj, C. Sohler, Sublinear-time approximation for clustering via random sampling, in Proc. of ICALP (2004), pp. 396–407
15. P. Drineas, A. Frieze, R. Kannan, S. Vempala, V. Vinay, Clustering large graphs via the singular value decomposition. Mach. Learn. 56, 9–33 (2004)
16. F. Farnstrom, J. Lewis, C. Elkan, True scalability for clustering algorithms, in SIGKDD Explorations (2000)
17. T. Feder, D.H. Greene, Optimal algorithms for approximate clustering, in Proc. of STOC (1988), pp. 434–444
18. T.F. Gonzalez, Clustering to minimize the maximum intercluster distance. Theor. Comput. Sci. 38(2–3), 293–306 (1985)
19. S. Guha, N. Mishra, R. Motwani, L. O'Callaghan, Clustering data streams, in Proc. of FOCS (2000), pp. 359–366
20. S. Guha, R. Rastogi, K. Shim, CURE: an efficient clustering algorithm for large databases, in Proc. of SIGMOD (1998), pp. 73–84
21. S. Har-Peled, S. Mazumdar, On coresets for k-means and k-median clustering, in Proc. of STOC (2004), pp. 291–300
22. D.S. Hochbaum (ed.), Approximation Algorithms for NP-Hard Problems (Brooks/Cole, Pacific Grove, 1996)
23. D.S. Hochbaum, D.B. Shmoys, A unified approach to approximate algorithms for bottleneck problems. J. ACM 33(3), 533–550 (1986)
24. P. Indyk, Sublinear time algorithms for metric space problems, in Proc. STOC (1999)
25. K. Jain, V. Vazirani, Approximation algorithms for metric facility location and k-median problems using the primal–dual schema and Lagrangian relaxation. J. ACM 48(2), 274–296 (2001)
26. T. Kanungo, D.M. Mount, N.S. Netanyahu, C.D. Piatko, R. Silverman, A.Y. Wu, A local search approximation algorithm for k-means clustering, in Proc. of SoCG (2002), pp. 10–18
27. L. Kaufmann, P.J. Rousseeuw, Clustering by means of medoids, in Statistical Data Analysis Based on the L1 Norm and Related Methods (Elsevier Science, Amsterdam, 1987), pp. 405–416
28. S. Kolliopoulos, S. Rao, A nearly linear-time approximation scheme for the Euclidean k-median problem, in Proc. of ESA (1999), pp. 378–389
29. M.R. Korupolu, C.G. Plaxton, R. Rajaraman, Analysis of a local search heuristic for facility location problems. J. Algorithms 37(1), 146–188 (2000)
30. J.H. Lin, J.S. Vitter, Approximation algorithms for geometric median problems. Inf. Process. Lett. 44, 245–249 (1992)
31. R. Mettu, C.G. Plaxton, Optimal time bounds for approximate clustering, in Proc. of UAI (2002), pp. 344–351
32. R. Mettu, C.G. Plaxton, The online median problem. SIAM J. Comput. 32(3) (2003)

33. A. Meyerson, L. O'Callaghan, S. Plotkin, A k-median algorithm with running time independent of data size. Mach. Learn. 56, 61–87 (2004)
34. N. Mishra, D. Oblinger, L. Pitt, Sublinear time approximate clustering, in Proc. of SODA (2001)
35. M. Thorup, Quick k-median, k-center, and facility location for sparse graphs, in Proc. of ICALP (2001), pp. 249–260
36. V. Vazirani, Approximation Algorithms (Springer, Berlin, 2001)
37. A. Weber, Über den Standort der Industrien (Theory of the Location of Industries) (University of Chicago Press, Chicago, 1929). Translated in 1929 by Carl J. Friedrich from Weber's 1909 text
38. T. Zhang, R. Ramakrishnan, M. Livny, BIRCH: an efficient data clustering method for very large databases, in Proc. of SIGMOD (1996), pp. 103–114

Mining Decision Trees from Streams

Geoff Hulten and Pedro Domingos

1 Introduction

The massive data streams that occur in many domains today are a tremendous opportunity for data mining, but also pose a difficult challenge. Open-ended, high-volume streams potentially allow us to build far richer and more detailed models than before. On the other hand, a system capable of mining streaming data as fast as it arrives must meet a number of stringent design criteria:
• It must require small constant time per record, otherwise it will inevitably fall behind the data, sooner or later.
• It must use only a fixed amount of main memory, irrespective of the total number of records it has seen.
• It must be able to build a model using at most one sequential scan of the data, since it may not have time to revisit old records, or disk space to store them.
• It must make a usable model available at any point in time, as opposed to only when it is done processing the data, since it may never be done processing.
• Ideally, it should produce a model that is equivalent (or nearly identical) to the one that would be obtained by the corresponding ordinary database mining algorithm, operating without the above constraints.
We have developed a general framework for transforming data mining algorithms into stream mining ones that satisfy these criteria [12]. The basic idea in our approach is to, at each step of the algorithm, use the minimum number of examples

G. Hulten Microsoft Research, Redmond, WA, USA e-mail: [email protected]

P. Domingos (B) Department of Computer Science and Engineering, University of Washington, Seattle, WA, USA e-mail: [email protected]

from the stream that (with high probability) produces approximately the same results as would be obtained with infinite data. We have applied this framework to learning Bayesian network structure [12], k-means clustering [4], the EM algorithm for mixtures of Gaussians [5], learning relational models [14], and learning in time-changing domains [15].

In this chapter, we illustrate the use of our framework by applying it to what is perhaps the most widely used form of data mining: decision tree induction. In Sect. 2, we present VFDT (Very Fast Decision Tree learner), our algorithm for learning decision trees from massive data streams, and give theoretical bounds on the quality of the trees it produces. In Sect. 3, we empirically evaluate VFDT on a large collection of synthetic data sets. In Sect. 4 we apply VFDT to the task of improving Web page caching. We conclude with a brief discussion of related work.

2 The VFDT System

The classification problem is generally defined as follows. A set of N training examples of the form (x, y) is given, where y is a discrete class label and x is a vector of d attributes, each of which may be symbolic or numeric. The goal is to produce from these examples a model y = f(x) that will predict the classes y of future examples x with high accuracy. For example, x could be a description of a client's recent purchases, and y the decision to send that customer a catalog or not; or x could be a record of a cellular-telephone call, and y the decision whether it is fraudulent or not.

Decision trees address this problem as follows. Each internal node in the tree contains a test on an attribute, each branch from a node corresponds to a possible outcome of the test, and each leaf contains a class prediction. The label y = DT(x) for an example x is obtained by passing the example down from the root to a leaf, testing the appropriate attribute at each node and following the branch corresponding to the attribute's value in the example.

A decision tree is learned by recursively replacing leaves by test nodes, starting at the root. The attribute to test at a node is chosen by comparing all the available attributes and choosing the best one according to some heuristic measure. Classic decision tree learners like ID3, C4.5 and CART [1, 19] assume that all training examples can be stored simultaneously in main memory, and are thus severely limited in the number of examples they can learn from. Disk-based decision tree learners like SLIQ [17] and SPRINT [21] assume the examples are stored on disk, and learn by repeatedly reading them in sequentially (effectively once per level in the tree). While this greatly increases the size of usable training sets, it can become prohibitively expensive when learning complex trees (i.e., trees with many levels), and fails when data sets are too large to fit in the available disk space or when data is arriving on an open-ended stream.

VFDT learns from massive open-ended data streams by noting with Catlett [2] and others [8, 18] that, in order to find the best attribute to test at a given node, it may be sufficient to consider only a small subset of the training examples that are relevant to that node. Given a stream of examples, the first ones will be used to choose the root test; once the root attribute is chosen, the succeeding examples will be passed down to the corresponding leaves and used to choose the appropriate tests there, and so on recursively. We solve the difficult problem of deciding exactly how many examples are necessary at each node by using a statistical result known as the Hoeffding bound (or additive Chernoff bound) [11]. Consider a real-valued random variable r whose range is R. Suppose we have made n independent observations of this variable, and computed their mean r̄. One form of Hoeffding bound states that, with probability 1 − δ, the true mean of the variable is at least r̄ − ε, where

ε = √(R² ln(1/δ) / (2n)).    (1)

The Hoeffding bound has the very attractive property that it is independent of the probability distribution generating the observations. The price of this generality is that the bound is more conservative than the distribution-dependent ones (i.e., it will take more observations to reach the same δ and ε). Let G(·) be the heuristic measure used to choose attributes (e.g., the measure could be information gain as in C4.5, or the Gini index as in CART).
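To make Eq. (1) concrete, the bound can be computed in a few lines. The sketch below is our own illustration, not part of VFDT; the function name and the example values are ours, and the choice R = 1 assumes a two-class information-gain measure.

```python
import math

def hoeffding_epsilon(value_range: float, delta: float, n: int) -> float:
    """Hoeffding bound of Eq. (1): with probability 1 - delta, the true mean of a
    random variable with range `value_range` lies within epsilon of the mean
    observed over n independent observations."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

# Example: for a two-class problem, information gain lies in [0, 1] (R = 1);
# after 1000 examples and with delta = 1e-7, epsilon is roughly 0.09.
print(hoeffding_epsilon(1.0, 1e-7, 1000))
```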
Our goal is to ensure that, with high probability, the attribute chosen using n examples (where n is as small as possible) is the same that would be chosen using infinite examples. Assume G is to be maximized, let Xa be the attribute with highest observed G after seeing n examples, and Xb be the second-best. Let ΔG = G(Xa) − G(Xb) ≥ 0 be the difference between their observed heuristic values. Then, given a desired δ, the Hoeffding bound guarantees that Xa is the correct choice with probability 1 − δ if n examples have been seen at this node and ΔG > ε. Thus a node needs to accumulate examples from the stream until ε becomes smaller than ΔG. At this point the node can be split using the current best attribute, and succeeding examples passed to the new leaves. The only case in which this procedure does not yield a decision in finite time occurs when ΔG is zero (or very close to it). In this case there is no n that will suffice to make ΔG > ε. However, in this case we do not care which attribute is selected because their performance on the metric we are interested in is the same. If we stipulate a minimum difference τ below which we are indifferent, the procedure above is guaranteed to terminate after seeing at most n = (1/2)(R/τ)² ln(1/δ) examples. In other words, the time required to choose an attribute is constant, independent of the size of the data stream.

This leads to the VFDT algorithm, shown in pseudocode in Table 1. The pseudocode shown is only for discrete attributes; it can be extended to numeric ones with the usual methods. The sequence of examples T may be infinite, in which case the procedure never terminates, and at any point in time a parallel procedure can use the current tree DT to make class predictions. The most significant part of the time cost per example is recomputing G. VFDT can reduce this cost by only checking for a winning attribute once for every n examples that arrive at a leaf. As long as VFDT processes examples faster than they arrive, which will be the case in all but the most demanding applications, the sole obstacle to learning arbitrarily complex models will be the finite RAM available.

Table 1 The VFDT algorithm

Inputs:
  T = {X1, X2, ..., X∞}, a stream of iid samples,
  X, a set of discrete attributes,
  G(·), a split evaluation function,
  δ, one minus the desired probability of choosing the correct attribute at any given node,
  τ, a user-specified tie threshold,
  f, a bound function (we use the Hoeffding bound, see Eq. (1)).
Output: DT, a decision tree.

Procedure VFDT(T, X, G, δ, τ, f)
  /* Initialize */
  Let DT be a tree with a single leaf l1 (the root).
  For each class yk
    For each value xij of each attribute Xi ∈ X
      Let nijk(l1) = 0. /* The sufficient statistics needed to calculate G at l1 */
  /* Process the examples */
  For each example (x, y) in T
    Sort (x, y) into a leaf l using DT.
    For each xij in x such that Xi ∈ Xl
      Increment nijk(l).
    Label l with the majority class among the examples seen so far at l.
    If the examples seen so far at l are not all of the same class, then
      Compute G(Xi) for each attribute Xi ∈ Xl using the counts nijk(l).
      Let Xa be the attribute with highest G.
      Let Xb be the attribute with the second highest G.
      Compute ε using f(·).
      If G(Xa) − G(Xb) > ε or ε < τ, then
        Replace l by an internal node that splits on Xa.
        For each branch of the split
          Add a new leaf lm, and let Xm = X − {Xa}.
          For each class yk and each value xij of each attribute Xi ∈ Xm
            Let nijk(lm) = 0.
  Return DT.
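The split test in the inner loop of Table 1 can be sketched in Python as follows. This is our simplified rendering with hypothetical names (gains, should_split); it omits the nijk bookkeeping and the once-every-n-examples check frequency, and the default parameter values are illustrative only.

```python
import math

def hoeffding_epsilon(R, delta, n):
    """Eq. (1): epsilon for range R, confidence parameter delta, n observations."""
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

def should_split(gains, n, R=1.0, delta=1e-7, tau=0.05):
    """Decide whether a leaf that has seen n examples should split.
    `gains` maps each candidate attribute to its observed heuristic value G.
    Returns the attribute to split on, or None to keep accumulating examples."""
    if len(gains) < 2:
        return None
    ranked = sorted(gains, key=gains.get, reverse=True)
    best, second = ranked[0], ranked[1]
    delta_g = gains[best] - gains[second]
    eps = hoeffding_epsilon(R, delta, n)
    # Split if the best attribute is ahead by more than epsilon, or if the race
    # is effectively a tie (epsilon has fallen below the user threshold tau).
    if delta_g > eps or eps < tau:
        return best
    return None

# Example: with these observed gains and 2000 examples, X1 wins the race.
print(should_split({"X1": 0.30, "X2": 0.12, "X3": 0.05}, n=2000))
```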

VFDT’s memory use is dominated by the memory required to keep sufficient statis- tics for all growing leaves. If d is the number of attributes, v is the maximum num- ber of values per attribute, and c is the number of classes, VFDT requires O(dvc) memory to store the necessary counts at each leaf. If l is the number of leaves in the tree, the total memory required is O(ldvc). This is independent of the number of examples seen, if the size of the tree depends only on the “true” concept and is inde- pendent of the size of the training set. If VFDT ever exceeds the RAM available, it temporarily deactivates learning at some leaves. In particular, if pl is the probability that an arbitrary example will fall into leaf l, and el is the observed error rate at that leaf on the training data that reaches it, then plel is an estimate of the maximum error reduction achievable by refining the leaf. When available RAM is exhausted, we disable the leaves with the lowest values for Plel. When a leaf is deactivated, the memory it was using for its sufficient statistics is freed. A leaf can then be re- Mining Decision Trees from Streams 193 activated if it becomes more promising than a currently active leaf. VFDT caches training examples in RAM in order to make best use of all available memory in the early part of a run (before RAM is filled by sufficient statistics). This is similar to what batch learners do, and it allows VFDT to reuse training examples for decisions on several levels of the induced tree.

2.1 Quality Guarantees

A key property of the VFDT algorithm is that it is possible to guarantee under realistic assumptions that the trees produced by the algorithm (as long as no decisions are made by reaching the 'indifference threshold' n = (1/2)(R/τ)² ln(1/δ)) are asymptotically arbitrarily close to the ones produced by a batch learner (i.e., a learner that uses all the examples to choose a test at each node). In other words, the incremental nature of the VFDT algorithm does not significantly affect the quality of the trees it produces. In order to make this statement precise, we need to define the notion of disagreement between two decision trees. Let P(x) be the probability that the attribute vector (loosely, example) x will be observed, and let I(·) be the indicator function, which returns 1 if its argument is true and 0 otherwise.

Definition 1 The extensional disagreement Δe between two decision trees DT1 and DT2 is the probability that they will produce different class predictions for an example,

Δe(DT1, DT2) = Σ_x P(x) I[DT1(x) ≠ DT2(x)].

Consider that two internal nodes are different if they contain different tests, two leaves are different if they contain different class predictions, and an internal node is different from a leaf. Consider also that two paths through trees are different if they differ in length or in at least one node.

Definition 2 The intensional disagreement Δi between two decision trees DT1 and DT2 is the probability that the path of an example through DT1 will differ from its path through DT2,

Δi(DT1, DT2) = Σ_x P(x) I[Path1(x) ≠ Path2(x)],

where Pathi(x) is the path of example x through tree DTi.

Two decision trees agree intensionally on an example iff they are indistinguishable for that example: the example is passed down exactly the same sequence of nodes, and receives an identical class prediction. Intensional disagreement is a stronger notion than extensional disagreement, in the sense that ∀ DT1, DT2: Δi(DT1, DT2) ≥ Δe(DT1, DT2).

Let pl be the probability that an example that reaches level l in a decision tree falls into a leaf at that level. To simplify, we will assume that this probability is constant, i.e., ∀l, pl = p, where p will be termed the leaf probability. This is a realistic assumption, in the sense that it is typically approximately true for the decision trees that are generated in practice. Let VFDTδ be the tree produced by the VFDT algorithm with desired probability δ given an infinite sequence of examples T, and DT∗ be the asymptotic batch decision tree induced by choosing at each node the attribute with true greatest G (i.e., by using infinite examples at each node). Let E[Δi(VFDTδ, DT∗)] be the expected value of Δi(VFDTδ, DT∗), taken over all possible infinite training sequences. We can then state the following result.

Theorem 1 If VFDTδ is the tree produced by the VFDT algorithm with desired probability δ given infinite examples (Table 1), DT∗ is the asymptotic batch tree, and p is the leaf probability, then E[Δi(VFDTδ, DT∗)] ≤ δ/p.

Proof For brevity, we will refer to intensional disagreement simply as disagreement. Consider an example x that falls into a leaf at level lv in VFDTδ, and into a leaf at level ld in DT∗. Let l = min{lv, ld}. Let PathV(x) = (N_1^V(x), N_2^V(x), ..., N_l^V(x)) be x's path through VFDTδ up to level l, where N_i^V(x) is the node that x goes through at level i in VFDTδ, and similarly for PathD(x), x's path through DT∗. If l = lv then N_l^V(x) is a leaf with a class prediction, and similarly for N_l^D(x) if l = ld. Let Ii represent the proposition "PathV(x) = PathD(x) up to and including level i," with I0 = True. Notice that P(lv ≠ ld) is included in P(N_l^V(x) ≠ N_l^D(x) | I_{l−1}), because if the two paths have different lengths then one tree must have a leaf where the other has an internal node. Then, omitting the dependency of the nodes on x for brevity,

P(PathV(x) ≠ PathD(x))
  = P(N_1^V ≠ N_1^D ∨ N_2^V ≠ N_2^D ∨ ··· ∨ N_l^V ≠ N_l^D)
  = P(N_1^V ≠ N_1^D | I0) + P(N_2^V ≠ N_2^D | I1) + ··· + P(N_l^V ≠ N_l^D | I_{l−1})
  = Σ_{i=1}^{l} P(N_i^V ≠ N_i^D | I_{i−1}) ≤ Σ_{i=1}^{l} δ = δl.    (2)

Let VFDTδ(T) be the VFDT tree generated from training sequence T. Then E[Δi(VFDTδ, DT∗)] is the average over all infinite training sequences T of the probability that an example's path through VFDTδ(T) will differ from its path through DT∗:

E[Δi(VFDTδ, DT∗)] = Σ_T P(T) Σ_x P(x) I[PathV(x) ≠ PathD(x)]
  = Σ_x P(x) P(PathV(x) ≠ PathD(x))
  = Σ_{i=1}^{∞} Σ_{x∈Li} P(x) P(PathV(x) ≠ PathD(x)),    (3)

where Li is the set of examples that fall into a leaf of DT∗ at level i. According to Eq. (2), the probability that an example's path through VFDTδ(T) will differ from its path through DT∗, given that the latter is of length i, is at most δi (since i ≥ l). Thus

E[Δi(VFDTδ, DT∗)] ≤ Σ_{i=1}^{∞} Σ_{x∈Li} P(x)(δi) = Σ_{i=1}^{∞} (δi) Σ_{x∈Li} P(x).    (4)

The sum Σ_{x∈Li} P(x) is the probability that an example x will fall into a leaf of DT∗ at level i, and is equal to (1 − p)^{i−1} p, where p is the leaf probability. Therefore,

E[Δi(VFDTδ, DT∗)]
  ≤ Σ_{i=1}^{∞} (δi)(1 − p)^{i−1} p = δp Σ_{i=1}^{∞} i(1 − p)^{i−1}
  = δp [ Σ_{i=1}^{∞} (1 − p)^{i−1} + Σ_{i=2}^{∞} (1 − p)^{i−1} + ··· + Σ_{i=k}^{∞} (1 − p)^{i−1} + ··· ]
  = δp [ 1/p + (1 − p)/p + ··· + (1 − p)^{k−1}/p + ··· ]
  = δ [ 1 + (1 − p) + ··· + (1 − p)^{k−1} + ··· ] = δ Σ_{i=0}^{∞} (1 − p)^{i} = δ/p.    (5)

This completes the demonstration of Theorem 1. □

An immediate corollary of Theorem 1 is that the expected extensional disagreement between VFDTδ and DT∗ is also asymptotically at most δ/p (although in this case the bound is much looser). Another corollary is that there exists a subtree of the asymptotic batch tree such that the expected disagreement between it and the tree learned by VFDT on finite data is at most δ/p. In other words, if δ/p is small then VFDT's tree learned on finite data is very similar to a subtree of the asymptotic batch tree. A useful application of Theorem 1 is that, instead of δ, users can now specify as input to the VFDT algorithm the maximum expected disagreement they are willing to accept, given enough examples for the tree to settle. The latter is much more meaningful, and can be intuitively specified without understanding the workings of the algorithm or the Hoeffding bound. The algorithm will also need an estimate of p, which can easily be obtained (for example) by running a conventional decision tree learner on a manageable subset of the data.
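For instance, if a user will tolerate an expected intensional disagreement of at most Δ and a pilot run yields an estimate of the leaf probability p, Theorem 1 suggests running VFDT with δ = Δ·p. A tiny sketch (ours, not part of the original system; it ignores splits decided via the tie threshold τ):

```python
def delta_for_disagreement(max_disagreement: float, leaf_probability: float) -> float:
    """By Theorem 1, E[disagreement] <= delta / p, so delta = max_disagreement * p
    achieves the requested bound asymptotically."""
    return max_disagreement * leaf_probability

# Tolerating 1% expected disagreement with estimated leaf probability 0.01 gives delta = 1e-4.
print(delta_for_disagreement(0.01, 0.01))
```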

3 Empirical Evaluation of VFDT on Synthetic Data

A system like VFDT is only useful if it is able to learn more accurate trees than a conventional system, given similar computational resources. It should be able to use the examples that are beyond a conventional system's ability to process to learn better models. In this section, we test this empirically by comparing VFDT with C4.5 release 8 [19], the incremental decision tree learner ITI [23], and our implementation of SPRINT [21] on a series of synthetic data streams. Using synthetic data streams allows us to freely vary the relevant parameters of the learning process and more fully explore the capabilities of our system. In Sect. 4, we describe an evaluation on a real-world data set.

We first describe the data streams used for our experiments. They were created by randomly generating decision trees and then using these trees to assign classes to randomly generated examples. We produced the random decision trees by starting from a tree with a single leaf node (the root) and repeatedly replacing leaf nodes with nodes that tested randomly selected attributes. After the first three levels of the tree, each selected leaf had a probability of f of being pre-pruned instead of replaced by a split. Any branch that reached a depth of 18 was pruned at that depth. Class labels were assigned to leaves randomly (with uniform probability). We then generated a stream of 50 million training examples for each tree by sampling uniformly from the instance space, assigning classes to the examples according to the corresponding synthetic tree, and adding class and attribute noise by flipping each with probability n. We also generated 50,000 separate testing examples for each concept using the same procedure. We used nine parameter settings to generate trees: every combination of f in {12 %, 17 %, 23 %} and n in {5 %, 10 %, 15 %}. We also used six different random seeds for each of these settings to produce a total of 54 data streams. On average the trees we generated using this method contained about 100,000 nodes. Every domain we experimented with had two classes and 100 binary attributes.

We ran our experiments on a cluster of five Pentium III/1 GHz machines with 1 GB of RAM, all running Linux. We reserved nine of the data streams, one from each of the parameter settings above, to tune the algorithms' parameters. We will now briefly describe the learning algorithms used in our experiments:

• C4.5—The standard program for learning decision trees from data that fits in RAM. We used the reserved data streams to tune C4.5's '-c' (prune confidence level) parameter, and found 5 % to achieve the best results.
• ITI—An incremental decision tree learner for data sets that fit in RAM. We tuned ITI's '-P' parameter, which controls the complexity of the trees ITI learns. This parameter specifies a minimum number of examples of the second most common class that must be present in every leaf. We found 50 to give the best results.
• SPRINT—An algorithm designed to learn decision trees from data sets that can fit on disk, but are too large to load into RAM. We used our own implementation of SPRINT for these experiments, which shares code with our VFDT implementation. SPRINT pruned its trees using the same code as VFDT's post-prune option, and both algorithms used the same pruning criteria: reserving 5 % of training data for pruning. Once pruning was fixed, our implementation of SPRINT had no tunable parameters.
• VFDT—We tried three forms of pruning: no-pruning, pre-pruning (no split is made unless it improves G(·) by at least 0.005), and post-pruning (each training sample from the stream was reserved with 5 % probability for pruning until a maximum of 2 million examples were reserved; a copy of the tree was pruned with this data using standard reduced error pruning before generating each point in the following graphs). We used an n (the number of examples to accumulate before checking for a winner) of 1,000 for all experiments. We used the reserved data streams to tune VFDT's two remaining parameters, τ and δ. We found the best setting for post-pruning to be τ = 0.05, δ = 10−7; the best for no-pruning to be τ = 0.1, δ = 10−9; and the best for pre-pruning to be τ = 0.1, δ = 10−5.

The incremental systems, VFDT and ITI, were allowed to learn from the streams directly. We ran ITI on the first 100,000 examples in each stream, because it was too slow to learn from more. We ran VFDT on the entire streams and limited its dynamic memory allocations to 600 MB (the rest of the memory on our computers was reserved for the data generating process and other system processes). For the batch systems, C4.5 and SPRINT, we sampled several data sets of increasing size from the data streams and ran the learners on each of them in turn. C4.5 was run on data sets up to 800 MB in size, which contained 2 million samples. We allowed SPRINT to use up to 1.6 GB of data, or 4 million samples (about as much data as it can process in the time it takes VFDT to process the entire 50 million samples).

Fig. 1 Average runtimes (zoom)

Figures 1 and 2 show the runtime of the algorithms averaged over all 45 of the data streams. The runtimes of all the systems seem to increase linearly with additional data. ITI and SPRINT were the slower of the systems by far; they were approximately two orders of magnitude slower than C4.5 and VFDT. ITI was slower because of the cost of incrementally keeping its model up-to-date as every example arrives, which in the worst case requires rebuilding the entire tree. This suggests that

such incremental maintenance is not viable for massive data streams.

Fig. 2 Average runtimes

SPRINT was slower because it needed to do a great deal of additional I/O to scan the entire training set once per level of the tree it learned. VFDT was approximately 3 times slower than C4.5 (up to the point where C4.5 was not able to run on more data because it ran out of RAM). VFDT used this additional time to: track its memory allocations; swap searches into and out of memory; and manage example caches. Remember, however, that C4.5 needed to be run once for every test point, while VFDT was able to incrementally incorporate examples, and produce all test points with a single pass over each of the data streams. Surprisingly, VFDT-pre was the slowest of the VFDT algorithms. Looking deeper, we found that it learned smaller trees, usually had a larger fraction of its searches active in RAM at a time, and thus spent more time updating sufficient statistics and checking for winning searches than the other VFDT variants.

Next, we compared the error rates of the models learned by the systems; Fig. 3 contains the results. The final error rate achieved by each of the three VFDT systems was better than the final error rate achieved by any of the non-VFDT systems. This demonstrates that VFDT is able to take advantage of examples past those that could be used by the other systems to learn better models. It is also interesting to note that one of the VFDT systems had the best (or nearly the best) average performance after any number of examples: VFDT-Pre with less than 100,000 examples, VFDT-None from there to approximately 5 million examples, and VFDT-Post from there until the end at 50 million examples. We also compared the systems on their ability to quickly learn models with low error rate.

Fig. 3 Average test set error rate vs. number of training examples

Fig. 4 Average test set error rate vs. time

Figure 4 shows the error rate the algorithms achieved on the y-axis and the time the algorithms took to achieve it on the x-axis; it thus shows which algorithm is able to most quickly produce high quality results. Notice that one of the VFDT settings had the best error rate after nearly any amount of time, and that ITI and SPRINT were substantially less efficient than VFDT according to this metric.

Fig. 5 Average number of nodes induced by the algorithms

Figure 5 shows the average number of nodes in the models learned by each of the algorithms. The number of nodes learned seemed to scale approximately linearly with additional training examples for all of the systems. C4.5 learned the most nodes of any of the systems by several orders of magnitude, and VFDT-Post learned the fewest nodes by an order of magnitude. Notice that SPRINT learned many more nodes than VFDT-Post even though they used the same pruning implementation. This occurred because SPRINT's model contained every split that had gain on the training data, while the splits in VFDT-Post's tree were chosen, with high confidence, by our method. VFDT's simultaneous advantage in error rate and concept size is additional evidence that our method for selecting splits is effective.

Figures 6 and 7 show how the algorithms' performance varied with changes to the amount of noise added (n) and to the size of the target concepts (f). Each data point represents the average best error rate achieved by the respective algorithm on all the runs with the indicated n and f. VFDT-Post had the best performance in every setting, followed closely by VFDT-None. VFDT-Pre's error rate was slightly higher than SPRINT's on the runs with the highest noise level and the runs with the largest concepts. These results suggest that VFDT's performance scales gracefully as concept size or noise increases.

Fig. 6 Average performance by noise level

Fig. 7 Average performance by concept size

We calculated the rate at which VFDT is able to incorporate training examples to determine how large a data stream it is able to keep up with in real time. VFDT-None was able to learn from an average of 807 examples per second. Extrapolating from this (807 examples/s × 86,400 s/day), we estimate it is able to learn from approximately 70 million examples per day. Remember that these experiments were conducted on 1 GHz Pentium III processors—hardware that is already modest by current standards. Further, if faced with a faster data stream VFDT can be accelerated by increasing n or by more aggressively disabling learning at unpromising leaves. For example, using a version of VFDT that did not cache examples or reactivate disabled leaves, we were able to mine a stream of one billion examples in about one day.

In summary, our experiments demonstrate the utility of a tool like VFDT in particular and of our method for mining high-speed data streams in general: VFDT achieves the same results as the traditional system, in an order of magnitude less time; VFDT is then able to use additional time and data to learn more accurate models, while traditional systems' memory requirements keep them from learning on more data; finally, VFDT's incremental nature allows it to smoothly refine its model as more data arrives, while batch systems must be rerun from scratch to incorporate the latest data.

4 Application of VFDT to Web Data

We further evaluated VFDT by using it to mine the stream of Web page requests emanating from the whole University of Washington main campus. The nature of the data is described in detail in [25]. In our experiments, we used a one-week anonymized trace of all the external Web accesses made from the university campus in May 1999. There were 23,000 active clients during this one-week trace period, and the entire university population is estimated at 50,000 people (students, faculty and staff). The trace contains 82.8 million requests, which arrive at a peak rate of 17,400 per minute. The size of the compressed trace file is about 20 GB. Each request in the log is tagged with an anonymized organization ID that associates the request with one of the 170 organizations (colleges, departments, etc.) within the university.

One purpose this data can be used for is to improve Web caching. The key to this is predicting as accurately as possible which hosts and pages will be requested in the near future, given recent requests. We applied decision tree learning to this problem in the following manner. We split the campus-wide request log into a series of equal time slices T0, T1, ..., Tt, .... We used time slices of 30 seconds, 1 minute, 5 minutes, and 15 minutes for our experiments. Let NT be the number of time slices in a day. For each organization O1, O2, ..., Oi, ..., O170 and each of the 244k hosts appearing in the logs H1, ..., Hj, ..., H244k, we maintain a count of how many times the organization accessed the host in the time slice, Cijt. Then for each time slice and host accessed in that time slice (Tt, Hj) we generate an example with attributes C1,j,t, ..., Ci,j,t, ..., C170,j,t, t mod NT, and class 1 if any request is made to Hj in time slice Tt+1 and 0 if not.
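A minimal offline sketch of this example-construction scheme follows. The (timestamp, org_id, host_id) tuple format is our assumption about the anonymized trace, and unlike the real-time variant described below the table, this version buffers the whole log in memory.

```python
from collections import defaultdict

def make_examples(requests, slice_seconds=60, n_orgs=170):
    """Build labeled examples from a Web request log as described above.
    `requests` is an iterable of (timestamp_seconds, org_id, host_id) tuples,
    with org_id in [0, n_orgs). For each time slice T_t and each host H_j
    accessed in it, the example has attributes C_{1,j,t}, ..., C_{n_orgs,j,t}
    and t mod N_T, and class 1 iff H_j is requested again in slice T_{t+1}."""
    slices_per_day = (24 * 3600) // slice_seconds          # N_T
    by_slice = defaultdict(lambda: defaultdict(lambda: [0] * n_orgs))
    for ts, org, host in requests:
        t = int(ts) // slice_seconds
        by_slice[t][host][org] += 1                          # C_{org, host, t}
    for t in sorted(by_slice):
        next_hosts = set(by_slice.get(t + 1, {}))            # hosts seen in T_{t+1}
        for host, org_counts in by_slice[t].items():
            attributes = org_counts + [t % slices_per_day]
            label = 1 if host in next_hosts else 0
            yield attributes, label
```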

Table 2 Results of the Web experiments. ‘Match Time’ is the time it took VFDT to match C4.5’s best accuracy. ‘VFDT Time’ is the time it took VFDT to do six complete scans of the training set

Data set   C4.5 Error (%)   VFDT Error (%)   C4.5 Time (s)   Match Time (s)   VFDT Time (s)
30 s       37.70            36.95            60k             5k               247k
1 min      37.30            36.67            71k             6k               160k
5 min      33.59            33.23            61k             15k              72k

This can be carried out in real time using modest resources by keeping statistics on the last and current time slices Ct−1 and Ct in memory, only keeping counts for hosts that actually appear in a time slice (we never needed more than 30k counts), and outputting the examples for Ct−1 as soon as Ct is complete. Testing was carried out on the examples from the last day in each of the data sets. The '30 s' data set had 10,850k training examples, 1,944k testing examples and 46.0 % were class 0. The '1 min' data set had 7,921k training examples, 1,419k testing examples and 45.4 % were class 0. The '5 min' data set had 3,880k training examples, 696k testing examples and 52.5 % were class 0. The '15 min' data set had 2,555k training examples, 454k testing examples, and 58.1 % were class 0.

We ran C4.5 and VFDT on these data sets, on a 1 GHz machine with 1 GB of RAM running Linux. We tested C4.5 on data sets up to 1.2M examples (roughly 800 MB; the remaining RAM was reserved for C4.5's learned tree and system processes). C4.5's data sets were created by random sampling from the available training data. Notice that the 1.2M examples C4.5 was able to process is less than a day's worth for the '30 s' and '1 min' data sets and less than 50 % of the training data in the remaining cases. We allowed VFDT to do six passes over the training data (and thus incorporate each training example in its tree at six different places), and we allowed it to use 400 MB of RAM. We tuned the algorithms' parameters on the '15 min' data set. C4.5's best prune confidence was 5 %, and VFDT's best parameters were: no-pruning, δ = 10−7, τ = 0.1, and n = 300. We learned from numeric attributes by maintaining a sorted list of the values seen and considering a split between each pair of adjacent values with different classes. There were very few distinct values for each attribute, so this method did not use much time or RAM.

Table 2 summarizes the results of these experiments. VFDT was able to learn a more accurate model than C4.5 on every data set. VFDT's improvement over C4.5 was largest on the '30 s' data set and was smallest on the '5 min' data set. Notice that C4.5 was able to use about 11 % of the '30 s' data set and about 31 % of the '5 min' data set. This supports the notion that VFDT's performance improvement over a traditional system will depend on the amount of additional data that it is able to process past the point where the traditional system runs into a resource limit. VFDT was also much faster than C4.5 in the following sense. We noted the accuracy of the best model learned by C4.5 and the time it took to learn it. We then examined VFDT's results to determine how long it took it to learn a model with that same accuracy. We found that VFDT achieved C4.5's best accuracy an order of magnitude faster on the '30 s' and '1 min' data sets and 4 times faster on the '5 min' data set.

Figures 8, 9, and 10 contain a more detailed picture of VFDT's simultaneous speed and accuracy advantage; they show the amount of learning time on a logarithmic scale on the x-axis and the error rate of the model learned in the indicated amount of time on the y-axis. VFDT has a more accurate model than C4.5 after nearly any amount of learning time (the region where C4.5 has an advantage in Fig. 8 is only about 1.6 % of the entire run). We attribute this to VFDT's sampling method, which allows it to use just the right amount of data to make each decision, while C4.5 must use all data for every decision.

Fig. 8 Time vs. Error Rate on the ‘30 s’ data set

Fig. 9 Time vs. Error Rate on the '1 min' data set

Fig. 10 Time vs. Error Rate on the '5 min' data set

5 Related Work

There has been a great deal of work on scaling up learning algorithms to very large data sets. Perhaps the most relevant falls into the general framework of sequential analysis [24]. Examples of relevant machine learning algorithms that use these ideas include PALO [10], Hoeffding races [16], the sequential induction model [9], AdaSelect [3], and the sequential sampling algorithm [20]. The core idea in these is that training samples should be accumulated (roughly) one at a time until some quality criterion is met, and no longer. Our work extends these in several ways, including providing stronger guarantees and finer-grained bounds, addressing memory issues, caching training examples, etc. Further, we have conducted extensive empirical tests of our framework on synthetic and real-world data sets, and we have applied it to a wide variety of learning algorithms.

Our method bears an interesting relationship to recent work in computational learning theory that uses algorithm-specific and run-time information to obtain better loss bounds (e.g., [6, 22]). A first key difference is that we attempt to bound a learner's loss relative to the same learner running on infinite data, instead of relative to the best possible model (or the best possible from a given class). A second key difference is that we make more extensive use of run-time information in our bounds. These two changes make possible realistic bounds for widely-used learners, as exemplified in this chapter.

BOAT [8] is a scalable, incremental decision tree induction algorithm that uses a sampling strategy that allows it to learn several levels of a decision tree in a single pass over training data. Perhaps the most closely related system to our approach is the combination of BOAT with the DEMON framework [7]. DEMON was designed to help adapt any incremental learning algorithm to work effectively with data streams. DEMON assumes data arrives periodically in large blocks. Roughly, DEMON builds a model on every interesting suffix of the data blocks that are in a sliding window. For example, if the window contains n data blocks B1, ..., Bn, DEMON will have a model for blocks B1, ..., Bn; a model for blocks B2, ..., Bn; a model for B3, ..., Bn; etc. When a new block of data (block Bn+1) arrives, DEMON incrementally refines the model for B2, ..., Bn to quickly produce the n-block model for the new window. It then uses offline processing time to add Bn+1 to all of the other models it is maintaining. DEMON's advantage is its simplicity: it can easily be applied to any existing incremental learning algorithm. The disadvantages are the granularity at which it incorporates data (in periodic blocks, compared to every example for VFDT) and the overhead required to store (and update) all of the required models.

6 Conclusion

In this chapter, we described VFDT, an incremental any-time decision tree induction algorithm capable of learning from massive data streams within stringent performance criteria. The core of the algorithm samples from the data stream exactly as much data as needed to make each induction decision. We evaluated VFDT with extensive studies on synthetic data sets, and with an application to a massive real-world data stream. We found that VFDT was able to outperform traditional systems by running faster and by taking advantage of additional data to learn higher quality models. An implementation of VFDT is publicly available as part of the VFML library [13].

References

1. L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, Classification and Regression Trees (Wadsworth, Belmont, 1984)

2. J. Catlett, Megainduction: machine learning on very large databases. PhD thesis, Basser Department of Computer Science, University of Sydney, Sydney, Australia (1991)
3. C. Domingo, R. Gavalda, O. Watanabe, Adaptive sampling methods for scaling up knowledge discovery algorithms. Data Min. Knowl. Discov. 6, 131–152 (2002)
4. P. Domingos, G. Hulten, A general method for scaling up machine learning algorithms and its application to clustering, in Proceedings of the Eighteenth International Conference on Machine Learning, Williamstown, MA (Morgan Kaufmann, San Mateo, 2001), pp. 106–113
5. P. Domingos, G. Hulten, Learning from infinite data in finite time, in Advances in Neural Information Processing Systems, vol. 14, ed. by T.G. Dietterich, S. Becker, Z. Ghahramani (MIT Press, Cambridge, 2002), pp. 673–680
6. Y. Freund, Self bounding learning algorithms, in Proceedings of the Eleventh Annual Conference on Computational Learning Theory, Madison, WI (Morgan Kaufmann, San Mateo, 1998)
7. V. Ganti, J. Gehrke, R. Ramakrishnan, DEMON: mining and monitoring evolving data, in Proceedings of the Sixteenth International Conference on Data Engineering, San Diego, CA (2000), pp. 439–448
8. J. Gehrke, V. Ganti, R. Ramakrishnan, W.-L. Loh, BOAT: optimistic decision tree construction, in Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia, PA (ACM, New York, 1999), pp. 169–180
9. J. Gratch, Sequential inductive learning, in Proceedings of the Thirteenth National Conference on Artificial Intelligence, Portland, OR (AAAI Press, Menlo Park, 1996), pp. 779–786
10. R. Greiner, PALO: a probabilistic hill-climbing algorithm. Artif. Intell. 84, 177–208 (1996)
11. W. Hoeffding, Probability inequalities for sums of bounded random variables. J. Am. Stat. Assoc. 58, 13–30 (1963)
12. G. Hulten, P. Domingos, Mining complex models from arbitrarily large databases in constant time, in Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Canada (ACM Press, New York, 2002), pp. 525–531
13. G. Hulten, P. Domingos, VFML—a toolkit for mining high-speed time-changing data streams (2003). http://www.cs.washington.edu/dm/vfml/
14. G. Hulten, P. Domingos, Y. Abe, Mining massive relational databases, in IJCAI 2003 Workshop on Learning Statistical Models from Relational Data, Acapulco, Mexico (2003)
15. G. Hulten, L. Spencer, P. Domingos, Mining time-changing data streams, in Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA (ACM, New York, 2001), pp. 97–106
16. O. Maron, A. Moore, Hoeffding races: accelerating model selection search for classification and function approximation, in Advances in Neural Information Processing Systems 6, ed. by J.D. Cowan, G. Tesauro, J. Alspector (Morgan Kaufmann, San Mateo, 1994)
17. M. Mehta, A. Agrawal, J. Rissanen, SLIQ: a fast scalable classifier for data mining, in Proceedings of the Fifth International Conference on Extending Database Technology, Avignon, France (Springer, Berlin, 1996), pp. 18–32
18. R. Musick, J. Catlett, S. Russell, Decision theoretic subsampling for induction on large databases, in Proceedings of the Tenth International Conference on Machine Learning, Amherst, MA (Morgan Kaufmann, San Mateo, 1993), pp. 212–219
19. J.R. Quinlan, C4.5: Programs for Machine Learning (Morgan Kaufmann, San Mateo, 1993)
20. T. Scheffer, S. Wrobel, Incremental maximization of non-instance-averaging utility functions with applications to knowledge discovery problems, in Proceedings of the Eighteenth International Conference on Machine Learning, Williamstown, MA (Morgan Kaufmann, San Mateo, 2001), pp. 481–488
21. J.C. Shafer, R. Agrawal, M. Mehta, SPRINT: a scalable parallel classifier for data mining, in Proceedings of the Twenty-Second International Conference on Very Large Databases, Bombay, India (Morgan Kaufmann, San Mateo, 1996), pp. 544–555
22. J. Shawe-Taylor, P.L. Bartlett, R.C. Williamson, M. Anthony, Structural risk minimization over data-dependent hierarchies. Technical report NC-TR-96-053, Department of Computer Science, Royal Holloway, University of London, Egham, UK (1996)

23. P.E. Utgoff, An improved algorithm for incremental induction of decision trees, in Proceedings of the Eleventh International Conference on Machine Learning, New Brunswick, NJ (Morgan Kaufmann, San Mateo, 1994), pp. 318–325
24. A. Wald, Sequential Analysis (Wiley, New York, 1947)
25. A. Wolman, G. Voelker, N. Sharma, N. Cardwell, M. Brown, T. Landray, D. Pinnel, A. Karlin, H. Levy, Organization-based analysis of Web-object sharing and caching, in Proceedings of the Second USENIX Conference on Internet Technologies and Systems, Boulder, CO (1999), pp. 25–36

Frequent Itemset Mining over Data Streams

Gurmeet Singh Manku

1 Problem Definition

We study the problem of computing frequent elements in a data stream. Given support threshold s ∈ [0, 1], an element is said to be frequent if it occurs more than sN times, where N denotes the current length of the stream. If we maintain a list of counters of the form ⟨element, count⟩, one counter per unique element encountered, we need N counters in the worst case. Many distributions are heavy-tailed in practice, so we would need far fewer than N counters. However, the number would still exceed 1/s, which is the maximum possible number of frequent elements. If we insist on identifying exact frequency counts, then Ω(N) space is necessary. This observation motivates the definition of ε-approximate frequency counts: given support threshold s ∈ [0, 1] and an error parameter ε ∈ (0, s), the goal is to produce a list of elements along with their estimated frequencies, such that three properties are satisfied:

I. Estimated frequencies are less than the true frequencies by at most εN.
II. All elements whose true frequency exceeds sN are output.
III. No element whose true frequency is less than (s − ε)N is output.

Example Imagine a statistician who wishes to identify elements whose frequency is at least 0.1 % of the entire stream seen so far. Then the support threshold is s = 0.1 %. The statistician is free to set ε ∈ (0, s) to whatever she feels is a comfortable margin of error. Let us assume she chooses ε = 0.01 % (one-tenth of s). As per Property I, estimated frequencies are less than their true frequencies by at most 0.01 %.

G.S. Manku (B) Google Inc., 1600 Amphitheatre Parkway, Mountain View, CA, USA e-mail: [email protected]

As per Property II, all elements with frequency exceeding s = 0.1 % will be output; there are no false negatives. As per Property III, no element with frequency below 0.09 % will be output. This leaves elements with frequencies between 0.09 % and 0.1 %; these might or might not form part of the output. On the whole, the approximation has two aspects: high frequency false positives, and small errors in individual frequencies. Both kinds of errors are tolerable in real-world applications.

2 One-Pass Algorithms

We present three algorithms for computing ε-approximate frequency counts. To avoid floors and ceilings, we will assume that 1/ε is an integer (if not, we can scale ε down to 2^(−r), where r is an integer satisfying 2^(−r) < ε < 2^(−r+1)). The data structure for all three algorithms is a list of counters of the form ⟨element, count⟩, initially empty. At any time, ε-approximate frequency counts can be retrieved by identifying those elements whose associated count exceeds (s − ε)N. The algorithms differ in terms of the rules employed for creating, incrementing, decrementing and deleting counters.

MISRA–GRIES ALGORITHM ([7]). Let e denote a newly-arrived element. If a counter for e already exists, it is incremented. Otherwise, if there already exist 1/ε counters, we repeatedly diminish all counters by 1 until some counter drops to zero. We then delete all counters with count zero, and create a new counter of the form ⟨e, 1⟩.
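A direct transcription of this description in Python (our unoptimized illustration; a production version would avoid the repeated full scans over the counters):

```python
def misra_gries(stream, epsilon):
    """MISRA-GRIES sketch as described above: at most 1/epsilon counters are
    kept, and each surviving count underestimates the true frequency by at
    most epsilon * N (Theorem 1)."""
    k = int(1.0 / epsilon)
    counters = {}
    for e in stream:
        if e in counters:
            counters[e] += 1
            continue
        if len(counters) >= k:
            # Repeatedly diminish all counters by one until some counter
            # drops to zero, then discard the zeroed counters.
            while min(counters.values()) > 0:
                for key in counters:
                    counters[key] -= 1
            counters = {key: c for key, c in counters.items() if c > 0}
        counters[e] = 1          # create the counter <e, 1>
    return counters

# Retrieval: report elements whose count exceeds (s - epsilon) * N.
```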

LOSSY COUNTING ([6]). Let e denote a newly-arrived element. If a counter for e already exists, it is incremented. Otherwise, we create a new counter of the form ⟨e, 1⟩. Whenever N, the current size of the stream, equals i/ε for some integer i, all counters are decremented by one—we discard any counter that drops to zero.
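LOSSY COUNTING as just described can be sketched similarly (again our illustrative code, not the authors' implementation):

```python
def lossy_counting(stream, epsilon):
    """LOSSY COUNTING: all counters are decremented at every bucket boundary
    (N = i / epsilon), so each surviving count underestimates the true
    frequency by at most epsilon * N (Theorem 2)."""
    width = int(1.0 / epsilon)      # bucket width w = 1/epsilon
    counters = {}
    n = 0
    for e in stream:
        n += 1
        counters[e] = counters.get(e, 0) + 1
        if n % width == 0:          # bucket boundary: N = i / epsilon
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# As before, elements with count > (s - epsilon) * N are reported as frequent.
```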

STICKY SAMPLING ([6]). The algorithm is randomized, with δ denoting the probability of failure. The algorithm maintains r, the sampling rate, which varies over the lifetime of the stream. Initially, r = 1. Let e denote the newly-arrived element. If a counter for e exists, it is incremented. Otherwise, we toss a coin with probability of success r. If the coin toss succeeds, we create an entry of the form ⟨e, 1⟩; otherwise, we ignore e.

The sampling rate r varies as follows: Let t = (1/ε) log(s^−1 δ^−1). The first 2t elements are sampled at rate r = 1, the next 2t elements are sampled at rate r = 1/2, the next 4t elements are sampled at rate r = 1/4, and so on. Whenever the sampling rate changes, we update existing counters as follows: For each counter, we repeatedly toss an unbiased coin until the coin toss is successful, decrementing the counter for every unsuccessful outcome; if the counter drops to zero during this process, we delete the counter. Effectively, the new list of counters is identical to exactly the list that would have emerged, had we been sampling with the new rate from the very beginning.

Theorem 1 MISRA–GRIES ALGORITHM allows retrieval of ε-approximate frequency counts using at most 1/ε counters.

Proof Consider a fixed element e. Whenever a counter corresponding to e is diminished by 1, 1/ε − 1 other counters are also diminished. Clearly, when N elements have been seen, a counter for e could not have been diminished by more than εN. □

Theorem 2 LOSSY COUNTING allows retrieval of ε-approximate frequency counts using at most (1/ε) log(εN) counters.

Proof Imagine splitting the stream into buckets of size w = 1/ε each. Let N = Bw, where B denotes the total number of buckets that we have seen. For each i ∈ [1, B], let di denote the number of counters which were created when bucket B − i + 1 was active, i.e., when the length of the stream was in the range [(B − i)w + 1, (B − i + 1)w]. The element corresponding to such a counter must occur at least i times in buckets B − i + 1 through B; otherwise, the counter would have been deleted. Since the size of each bucket is w, we get the following constraints:

Σ_{i=1}^{j} i·di ≤ jw   for j = 1, 2, ..., B.    (1)

We prove the following set of inequalities by induction:

Σ_{i=1}^{j} di ≤ Σ_{i=1}^{j} w/i   for j = 1, 2, ..., B.    (2)

The base case (j = 1) follows from (1) directly. Assume that (2) is true for j = 1, 2, ..., p − 1. We will show that it is true for j = p as well. Adding p − 1 inequalities of type (2) (one inequality each for j varying from 1 to p − 1) to an inequality of type (1) (with j = p) yields

Σ_{i=1}^{p} i·di + Σ_{i=1}^{1} di + Σ_{i=1}^{2} di + ··· + Σ_{i=1}^{p−1} di ≤ pw + Σ_{i=1}^{1} w/i + Σ_{i=1}^{2} w/i + ··· + Σ_{i=1}^{p−1} w/i.

Upon rearrangement, we get p Σ_{i=1}^{p} di ≤ pw + Σ_{i=1}^{p−1} (p − i)w/i, which readily simplifies to (2) for j = p. This completes the induction step. The maximum number of counters is Σ_{i=1}^{B} di ≤ Σ_{i=1}^{B} w/i ≤ (1/ε) log B = (1/ε) log(εN). □

Theorem 3 STICKY SAMPLING computes ε-approximate frequency counts, with probability at least 1 − δ, using at most (2/ε) log(s^−1 δ^−1) counters in expectation.

Proof The expected number of counters is 2t = (2/ε) log(s^−1 δ^−1). When r ≤ 1/2, we have N = (t + t′)/r for some t′ ∈ [1, t). It follows that r ≥ t/N. Any error in the estimated frequency of an element e corresponds to a sequence of unsuccessful coin tosses during the first few occurrences of e. The probability that this error exceeds εN is at most (1 − r)^{εN} ≤ (1 − t/N)^{εN} ≤ e^{−εt}.

There are at most 1/s elements whose true frequency exceeds sN. The probability that the estimated frequency of any of them is deficient by εN is at most e^{−εt}/s. Since t ≥ (1/ε) log(s^−1 δ^−1), this probability is at most δ. □

Fig. 1 Number of counters for support threshold s = 1 %, error parameter ε = 0.1 %, and probability of failure δ = 10−4. Zipf denotes a Zipfian distribution with parameter 1.25. Uniq denotes a stream with no duplicates. The bottom figure magnifies a section of the barely-visible lines in the graph above

Each of the algorithms has its strengths that make it useful in different contexts. The MISRA–GRIES ALGORITHM has optimal worst-case space complexity and O(1) amortized cost of update per element. The update cost has been improved to O(1) worst-case by Karp et al. [5]. STICKY SAMPLING is useful for identifying ε-approximate frequency counts over sliding windows (see Arasu and Manku [2]). LOSSY COUNTING is useful when the input stream has duplicates and when the input distribution is heavy-tailed, as borne out by Fig. 1. The kinks in the curve for STICKY SAMPLING correspond to re-sampling. They are log10 2 units apart on the X-axis. The kinks for LOSSY COUNTING correspond to N = i/ε (when deletions occur). STICKY SAMPLING performs worse because of its tendency to remember every unique element that gets sampled. LOSSY COUNTING, on the other hand, is good at pruning low frequency elements quickly; only high frequency elements survive. For skewed distributions, both algorithms require much less space than their worst-case bounds in Theorems 2 and 3. LOSSY COUNTING is superior to MISRA–GRIES ALGORITHM for skewed data. For example, with ε = 0.01 %, roughly 2000 entries suffice, which is only 20 % of 1/ε.

3 Frequent Itemset Mining

Let I denote the universe of all items. Consider a stream of transactions, where each transaction is a subset of I. An itemset X ⊆ I is said to have support s if X is a subset of at least sN transactions, where N denotes the length of the stream, i.e., the number of transactions seen so far. The frequent itemsets problem seeks to identify all itemsets whose support exceeds a user-specified support threshold s. Frequent itemsets are useful for identifying association rules (see Agrawal and Srikant [1] for a seminal paper that popularized the problem). Considerable work in frequent itemsets has focused on devising data structures for compactly representing frequent itemsets, and showing how the data structure can be constructed in a few passes over a large disk-resident dataset. The best-known algorithms take two passes.

Identification of frequent itemsets over data streams is useful in a data-warehousing environment where bulk updates occur at regular intervals of time, e.g., daily, weekly or monthly. Summary data structures that store aggregates like frequent itemsets should be maintained incrementally, because a complete rescan of the entire warehouse database per bulk update is prohibitively costly. The summary data structure should be significantly smaller than the warehouse database but need not fit in main memory.

In a data stream scenario, where only one pass is possible, we relax the problem definition to compute ε-approximate frequent itemsets: Given support threshold s ∈ (0, 1) and error parameter ε ∈ (0, s), the goal is to produce itemsets, along with their estimated frequencies, satisfying three properties:

I. Estimated frequencies are less than the true frequencies by at most εN.
II. All itemsets whose true frequency exceeds sN are output.
III. No itemset whose true frequency is less than (s − ε)N is output.

We now develop an algorithm based upon LOSSY COUNTING for tackling the ε-approximate frequent itemsets problem. We begin by describing some modifications to LOSSY COUNTING.

3.1 Modifications to LOSSY COUNTING

We divide the stream into buckets of size 1/ε each. Buckets are numbered sequentially, starting with 1. We maintain counters of the form ⟨element, count, bucket_id⟩, where bucket_id denotes the ID of the bucket that was active when this counter was created. At bucket boundaries, we check whether count + bucket_id ≤ εN. If so, the counter is deleted. This is equivalent to our earlier approach of decrementing counters at bucket boundaries and deleting those counters that drop to zero. The maximum possible error in the estimated frequency of an element is given by its bucket_id (which might be much less than εN). Furthermore, the algorithm continues to be correct even if we do not check the counters at each and every bucket boundary. However, the longer we defer the checks, the larger the space requirements due to the presence of noise (low-frequency elements in recent buckets).
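The bookkeeping described in this section might look as follows in Python. The ⟨element, count, bucket_id⟩ entries are kept in a plain dict, and the deferral knob (check_every_buckets) is our own illustrative addition.

```python
def lossy_counting_deferred(stream, epsilon, check_every_buckets=1):
    """Modified LOSSY COUNTING: each counter carries the id of the bucket in
    which it was created, and deletion is decided (possibly only at some bucket
    boundaries) by testing count + bucket_id <= epsilon * N."""
    width = int(1.0 / epsilon)
    counters = {}                        # element -> (count, bucket_id)
    n = 0
    for e in stream:
        n += 1
        bucket = (n - 1) // width + 1    # current bucket id, starting at 1
        if e in counters:
            count, created = counters[e]
            counters[e] = (count + 1, created)
        else:
            counters[e] = (1, bucket)
        if n % width == 0 and bucket % check_every_buckets == 0:
            limit = epsilon * n          # at a boundary, this equals the bucket id
            counters = {k: v for k, v in counters.items() if v[0] + v[1] > limit}
    return counters
```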

3.2 Frequent Itemsets Algorithm

The input to the algorithm is a stream of transactions. The user specifies two parameters, support threshold, s, and error parameter, ε. We denote the current length of the stream by N. We maintain a data structure D consisting of a set of entries of the form (set, f, Δ), where set is an itemset (subset of I), f is an integer representing the estimated frequency of set, and Δ is the maximum possible error in f. Initially, D is empty. The stream is divided into buckets consisting of w = 1/ε transactions each. Buckets are labeled with bucket ids, starting from 1. We denote the current bucket id by bcurrent. We do not process the stream transaction by transaction. Instead, we fill available main memory with as many transactions as possible, and then process the resulting batch of transactions together. Let β denote the number of buckets in main memory in the current batch being processed. We update D as follows:

• UPDATE_SET. For each entry (set, f, Δ) ∈ D, update f by counting the occurrences of set in the current batch. If the updated entry satisfies f + Δ ≤ bcurrent, we delete this entry.
• NEW_SET. If a set set has frequency f ≥ β in the current batch and set does not occur in D, we create a new entry (set, f, bcurrent − β).

A set set whose true frequency fset ≥ εN has an entry in D. Also, if an entry (set, f, Δ) ∈ D, then the true frequency fset satisfies the inequality f ≤ fset ≤ f + Δ. When a user requests a list of itemsets with threshold s, we output those entries in D where f ≥ (s − ε)N. It is important that β be a large number. The reason is that any subset of I that occurs β + 1 times or more contributes an entry to D. For small values of β, D is polluted by noise (subsets whose overall frequency is very low, but which occur at least β + 1 times in the last β buckets).
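A naive rendering of UPDATE_SET and NEW_SET over one batch is sketched below (our code, with a plain dict for D). Subsets are enumerated exhaustively purely for illustration; the TRIE, BUFFER and SETGEN machinery described next exists precisely to avoid this exponential enumeration.

```python
from itertools import combinations

def process_batch(D, batch, beta, b_current):
    """Apply UPDATE_SET and NEW_SET for a batch of `beta` buckets ending at
    bucket `b_current`. D maps frozenset -> (f, delta)."""
    # Count every subset of every transaction in the batch (small transactions only!).
    batch_counts = {}
    for transaction in batch:
        items = sorted(transaction)
        for size in range(1, len(items) + 1):
            for subset in combinations(items, size):
                key = frozenset(subset)
                batch_counts[key] = batch_counts.get(key, 0) + 1

    # UPDATE_SET: refresh existing entries; delete those with f + delta <= b_current.
    for itemset, (f, delta) in list(D.items()):
        f += batch_counts.get(itemset, 0)
        if f + delta <= b_current:
            del D[itemset]
        else:
            D[itemset] = (f, delta)

    # NEW_SET: admit itemsets seen at least beta times in this batch.
    for itemset, f in batch_counts.items():
        if f >= beta and itemset not in D:
            D[itemset] = (f, b_current - beta)
    return D
```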

Two design problems emerge: What is an efficient representation of D? What is an efficient algorithm to implement UPDATE_SET and NEW_SET?

3.3 Data Structure and Algorithm Design

We have three modules: TRIE, BUFFER, and SETGEN. TRIE is an efficient implementation of D. BUFFER repeatedly reads in batches of transactions into available main memory and carries out some pre-processing. SETGEN then operates on the current batch of transactions in BUFFER. It enumerates subsets of these transactions along with their frequencies, limiting the enumeration using some pruning rules. Effectively, SETGEN implements the UPDATE_SET and NEW_SET operations to update TRIE. The challenge, it turns out, lies in designing a space-efficient TRIE and a time-efficient SETGEN.

TRIE. This module maintains the data structure D outlined in Sect. 3.2. Conceptually, it is a forest (a set of trees) consisting of labeled nodes. Labels are of the form ⟨item_id, f, Δ, level⟩, where item_id is an item-id, f is its estimated frequency, Δ is the maximum possible error in f, and level is the distance of this node from the root of the tree it belongs to. The root nodes have level 0. The level of any other node is one more than that of its parent. The children of any node are ordered by their item-id's. The root nodes in the forest are also ordered by item-id's. A node in the tree represents an itemset consisting of the item-id's in that node and all its ancestors. There is a 1-to-1 mapping between entries in D and nodes in TRIE.

To make the TRIE compact, we maintain an array of entries of the form ⟨item_id, f, Δ, level⟩ corresponding to the pre-order traversal of the underlying trees. This is equivalent to a lexicographic ordering of all the subsets encoded by the trees. There are no pointers from any node to its children or its siblings. The levels compactly encode the underlying tree structure. Such a representation suffices because tries are always scanned sequentially, as we show later. Tries are used by several Association Rules algorithms, hash tries [1] being a popular choice. Popular implementations of tries require pointers and variable-sized memory segments (because the number of children of a node changes over time). Our TRIE is quite different.
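To illustrate why the flat pre-order array needs no pointers, the sketch below (ours; the example forest and counts are made up) reconstructs every encoded itemset from such an array during a single sequential scan, using only the level fields.

```python
def itemsets_from_trie(entries):
    """Recover the itemsets encoded by the flat TRIE representation.
    `entries` is a list of (item_id, f, delta, level) tuples in pre-order; the
    level field alone reconstructs the tree shape during a sequential scan."""
    path = []                               # item-ids on the path from a root
    for item_id, f, delta, level in entries:
        del path[level:]                    # pop back up to this node's depth
        path.append(item_id)
        yield tuple(path), f, delta         # the itemset this node represents

# A forest with roots 1 and 2, encoding {1}, {1,3}, {1,4}, {2}, {2,3}:
trie = [(1, 90, 0, 0), (3, 40, 1, 1), (4, 35, 1, 1), (2, 70, 0, 0), (3, 25, 2, 1)]
for itemset, f, delta in itemsets_from_trie(trie):
    print(itemset, f, delta)
```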

BUFFER. This module repeatedly fills available main memory with a batch of transactions. Each transaction is a set of item-id's. Transactions are laid out one after the other in a big array. A bitmap is used to remember transaction boundaries. A bit per item-id denotes whether this item-id is the last member of some transaction or not. After reading in a batch, BUFFER sorts each transaction by its item-id's.

SETGEN. This module generates subsets of item-id's along with their frequencies in the current batch of transactions in lexicographic order. It is important that not all possible subsets be generated. A glance at the description of UPDATE_SET and NEW_SET operations reveals that a subset must be enumerated iff either it occurs in TRIE or its frequency in the current batch exceeds β. SETGEN uses the following pruning rule:

If a subset S does not make its way into TRIE after application of both UPDATE_SET and NEW_SET, then no supersets of S should be considered.

This is similar to the Apriori pruning rule [1]. We describe an efficient implementation of SETGEN in greater detail later.

Overall Algorithm

BUFFER repeatedly fills available main memory with a batch of transactions, and sorts them. SETGEN operates on BUFFER to generate itemsets along with their frequency counts in lexicographic order. It limits the number of subsets using the pruning rule. Together, TRIE and SETGEN implement the UPDATE_SET and NEW_SET operations.

3.4 Efficient Implementations

In this section, we outline important design decisions that contribute to an efficient implementation.

BUFFER. If item-id's are successive integers from 1 thru |I|, and if I is small enough (say, less than 1 million), we maintain exact frequency counts for all items. For example, if |I| = 10^5, an array of size 0.4 MB suffices. If exact frequency counts are available, BUFFER first prunes away those item-id's whose frequency is less than εN, and then sorts the transactions, where N is the length of the stream up to and including the current batch of transactions.
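A sketch of this pre-pruning step under the assumptions just stated (item-ids are small integers, item_count holds their exact frequencies, and eps stands for ε):

# Illustrative pruning of globally infrequent items before sorting.
def prune_and_sort(transactions, item_count, eps, N):
    threshold = eps * N
    batch = []
    for t in transactions:
        kept = sorted(i for i in t if item_count[i] >= threshold)
        if kept:
            batch.append(kept)
    return batch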

TRIE. As SETGEN generates its sequence of sets and associated frequencies, TRIE needs to be updated. Adding or deleting TRIE nodes in situ is made difficult by the fact that TRIE is a compact array. However, we take advantage of the fact that the sets produced by SETGEN (and therefore, the sequence of additions and deletions) are lexicographically ordered. Since our compact TRIE also stores its constituent subsets in their lexicographic order, the two modules, SETGEN and TRIE, work hand in hand. We maintain TRIE not as one huge array, but as a collection of fairly large-sized chunks of memory. Instead of modifying the original trie in place, we create a new TRIE afresh. Chunks belonging to the old TRIE are freed as soon as they are not required. Thus, the overhead of maintaining two TRIEs is not significant. By the time SETGEN finishes, the chunks belonging to the old trie have been completely discarded. For finite streams, an important TRIE optimization pertains to the last batch of transactions, when the value of β, the number of buckets in BUFFER, could be small. Instead of applying the rules in Sect. 3.2, we prune nodes in the trie more aggressively by setting the threshold for deletion to sN instead of b_current ≈ εN. This is because the lower frequency nodes do not contribute to the final output.

SETGEN. This module is the bottleneck in terms of time. Therefore, it merits careful design and run-time optimizations. SETGEN employs a priority queue called Heap which initially contains pointers to the smallest item-id's of all transactions in BUFFER. Duplicate members (pointers pointing to the same item-id) are maintained together and they constitute a single entry in Heap. In fact, we chain all the pointers together, deriving the space for this chain from BUFFER itself. When an item-id in BUFFER is inserted into Heap, the 4-byte integer used to represent an item-id is converted into a 4-byte pointer. When a heap entry is removed, the pointers are restored back to item-id's. SETGEN repeatedly processes the smallest item-id in Heap to generate singleton sets. If this singleton belongs to TRIE after the UPDATE_SET and NEW_SET rules have been applied, we try to generate the next set in lexicographic sequence by extending the current singleton set. This is done by invoking SETGEN recursively with a new heap created out of successors of the pointers to item-id's just removed and processed. The successor of an item-id is the item-id following it in its transaction. Last item-id's of transactions have no successors. When the recursive call returns, the smallest entry in Heap is removed and all successors of the currently smallest item-id are added to Heap by following the chain of pointers described earlier.
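The pointer-chaining machinery is intricate, so the sketch below conveys only the enumeration order and the pruning behavior, using explicit occurrence lists instead of in-place pointer chains (it is therefore illustrative, not the space-efficient implementation described above). The callback survives(itemset, count) stands for the combined outcome of UPDATE_SET and NEW_SET, i.e., whether the itemset is present in TRIE after both rules have been applied.

# Simplified SETGEN: enumerate subsets of the sorted transactions in
# lexicographic order, never extending a subset that did not make it into TRIE.
def setgen(transactions, survives, prefix=(), rows=None):
    if rows is None:
        rows = [(t, 0) for t in transactions]        # (transaction, start index)
    # Group the possible one-item extensions of `prefix` by item-id.
    by_item = {}
    for t, start in rows:
        for pos in range(start, len(t)):
            by_item.setdefault(t[pos], []).append((t, pos + 1))
    for item in sorted(by_item):                     # lexicographic order
        subset = prefix + (item,)
        support = len(by_item[item])                 # frequency in this batch
        if survives(subset, support):                # UPDATE_SET / NEW_SET outcome
            yield subset, support
            # Pruning rule: only subsets that entered TRIE are extended.
            yield from setgen(transactions, survives, subset, by_item[item])

Calling list(setgen(batch, survives)) yields exactly the subsets that pass the pruning rule, in the order in which the compact TRIE is rewritten.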

3.5 System Issues and Optimizations

BUFFER scans the incoming stream by memory mapping the input file. This saves time by getting rid of double copying of file blocks. The UNIX system call for memory mapping files is mmap(). The accompanying madvise() interface allows a process to inform the operating system of its intent to read the file sequentially. We used the standard qsort() to sort transactions. The time taken to read and sort transactions pales in comparison with the time taken by SETGEN, obviating the need for a custom sort routine. Threading SETGEN and BUFFER would not help because SETGEN is significantly slower. Tries are written and read sequentially. They are operational when BUFFER is being processed by SETGEN. At this time, the disk is idle. Further, the rate at which tries are scanned (read/written) is much smaller than the rate at which sequential disk I/O can be done. It is indeed possible to maintain TRIE on disk without any loss in performance. This has two important advantages: (a) The size of a trie is not limited by the size of main memory available. This means that the algorithm can function even when the amount of main memory available is quite small. (b) Since most available memory can be devoted to BUFFER, we can work with tiny values of ε. This is a big win. Memory requirements for Heap are modest. Available main memory is consumed primarily by BUFFER, assuming TRIEs are on disk. Our implementation allows the user to specify the size of BUFFER.
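For illustration, the same sequential-read hint can be expressed in Python (the chapter's implementation uses mmap(), madvise(), and qsort() from C directly); the file name is hypothetical, and madvise()/MADV_SEQUENTIAL are only exposed on platforms that support them.

import mmap

# Illustrative memory-mapped, sequential scan of the transaction file.
with open("transactions.dat", "rb") as fh:
    mm = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
    if hasattr(mm, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
        mm.madvise(mmap.MADV_SEQUENTIAL)   # hint: the file is read front to back
    # ... feed mm into BUFFER one batch at a time ...
    mm.close()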

On the whole, the algorithm has two unique features: there is no candidate generation phase, which is typical of Apriori-style algorithms. Further, the idea of using compact disk-based tries is novel. It allows us to compute frequent itemsets under low memory conditions. It also enables our algorithm to handle smaller values of support threshold than previously possible. Experimental evaluation over a variety of datasets is available in [6].

4 Applications and Related Work

Frequency counts and frequent itemsets arise in a variety of applications. We describe two of these below.

4.1 Iceberg Queries

The idea behind Iceberg Queries [4] is to identify aggregates in a GROUP BY of a SQL query that exceed a user-specified threshold τ. A prototypical query on a relation R(c1, c2, ..., ck, rest) with threshold τ is

SELECT c1, c2, ..., ck, COUNT(rest)
FROM R
GROUP BY c1, c2, ..., ck
HAVING COUNT(rest) ≥ τ

The parameter τ is equivalent to s|R| where s is a percentage and |R| is the size of R. The frequent itemset algorithm developed in Sect. 3 runs in only one pass, and out-performs the highly-tuned algorithm in [4] that uses repeated hashing over multiple passes.

4.2 Network Flow Identification

Measurement and monitoring of network traffic is required for management of complex Internet backbones. In this context, identifying flows in network traffic is an important problem. A flow is defined as a sequence of transport layer (TCP/UDP) packets that share the same source+destination addresses. Estan and Varghese [3] recently proposed algorithms for identifying flows that exceed a certain threshold, say 1 %. Their algorithms are a combination of repeated hashing and sampling, similar to those by Fang et al. [4] for Iceberg Queries.

4.3 Algorithms for Sliding Windows

Algorithms for computing approximate frequency counts over sliding windows have been developed by Arasu and Manku [2]. In a fixed-size sliding window, the size of the window remains unchanged. In a variable-sized sliding window, at each time-step, an adversary can either insert a new element, or delete the oldest element in the window. When the size of the window is W, the space bounds for a randomized algorithm (based upon STICKY SAMPLING) are O((1/ε) log(1/δ)) and O((1/ε) log(1/δ) log W) for fixed-size and variable-size windows, respectively. The corresponding bounds for a deterministic algorithm (based upon the MISRA–GRIES ALGORITHM) are O((1/ε) log²(1/ε)) and O((1/ε) log²(1/ε) log W), respectively. It would be interesting to see if any of these algorithms can be adapted to compute ε-approximate frequent itemsets in a data stream.

References

1. R. Agrawal, R. Srikant, Fast algorithms for mining association rules, in Proc. of 20th Intl. Conf. on Very Large Data Bases (1994), pp. 487–499
2. A. Arasu, G.S. Manku, Approximate counts and quantiles over sliding windows, in Proc. ACM Symposium on Principles of Database Systems (2004)
3. C. Estan, G. Varghese, New directions in traffic measurement and accounting: focusing on the elephants, ignoring the mice. ACM Trans. Comput. Syst. 21(3), 270–313 (2003)
4. M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, J. Ullman, Computing iceberg queries efficiently, in Proc. of 24th Intl. Conf. on Very Large Data Bases (1998), pp. 299–310
5. R.M. Karp, C.H. Papadimitriou, S. Shenker, A simple algorithm for finding frequent elements in streams and bags. ACM Trans. Database Syst. 28, 51–55 (2003)
6. G.S. Manku, R. Motwani, Approximate frequency counts over data streams, in Proc. 28th VLDB (2002), pp. 356–357
7. J. Misra, D. Gries, Finding repeated elements. Sci. Comput. Program. 2(2), 143–152 (1982)

Temporal Dynamics of On-Line Information Streams

Jon Kleinberg

1 Introduction

A number of recent computing applications involve information arriving continuously over time in the form of a data stream, and this has led to new ways of thinking about traditional problems in a variety of areas. In some cases, the rate and overall volume of data in the stream may be so great that it cannot all be stored for processing, and this leads to new requirements for efficiency and scalability. In other cases, the quantities of information may still be manageable, but the data stream perspective takes what has generally been a static view of a problem and adds a strong temporal dimension to it.

Our focus here is on some of the challenges that this latter issue raises in the settings of text mining, on-line information, and information retrieval. Many information sources have a stream-like structure, in which the way content arrives over time carries an essential part of its meaning. News coverage is a basic example; understanding the pattern of a developing news story requires considering not just the content of the relevant articles but also how they evolve over time. Some of the other basic corpora of interest in information retrieval—for example, scientific papers and patents—show similar temporal evolution over time-scales that can last years and decades. And the proliferation of on-line information sources and on-line forms of communication has led to numerous other examples: e-mail, chat, discussion boards, and weblogs (or "blogs") all represent personal information streams with intricate topic modulations over time.

This survey was written in 2004 and circulated on-line as a preprint prior to its appearance in this volume. J. Kleinberg (B) Department of Computer Science, Cornell University, 4105B Upson Hall, Ithaca, NY 14853, USA e-mail: [email protected]


Indeed, all these information sources co-exist on-line—news, e-mail, discussion, commentary, and the collective output of professional and research communities; they form much of the raw material through which Internet users navigate and search. They have also served to make the "time axis" of information increasingly visible. One could argue that these developments have led to a shift in our working metaphor for Internet and Web information, from a relatively static one, a "universal encyclopedia," to a much more dynamic one, a current awareness medium characterized by the complex development of topics over time-scales ranging from minutes to years.

What are the right techniques for dealing with the temporal dynamics of information streams? Some of the complexities inherent in such streams can be seen in the paradigmatic examples of news and e-mail. Each provides a reader with a sequence of documents exhibiting a braided and episodic character: braided, in the sense that many parallel, topically coherent streams are merged and interwoven into the single stream that the reader sees; and episodic, in the sense that topics generally grow in intensity over a temporally coherent period, and then fade again. Despite these twin sources of complexity, however, these information streams in their raw forms lack any explicit organizing structure beyond the granularity of individual articles or messages. Thus, a first step in working with such information streams is to define appropriate structures that abstract their intertwined topics, their multiple bursty episodes, and their long-term trends.

In this survey we discuss a number of approaches that have been proposed in recent years for working with the temporal properties of information streams. In Sect. 2, we give an overview of some of the basic techniques; we begin with one of the earliest systematic efforts on these problems, the Topic Detection and Tracking initiative, and then go on to discuss three subsequent approaches—threshold-based methods, state-based methods, and trend-based methods—that search for different types of patterns in information streams. In Sect. 3, we discuss some recent applications of these techniques to the analysis of weblogs, queries to Web search engines, and usage data at high-traffic Web sites. Finally, we conclude in Sect. 4 with some thoughts on directions for further research.

Analyzing the temporal properties of these types of information streams is part of the broader area of sequential pattern mining within the field of data mining, and can be viewed as an application of time-series analysis, a fundamental area in statistics. In order to keep the scope of this survey manageable, we have not attempted to cover these more general topics; we refer the reader to the papers of Agrawal and Srikant and of Mannila et al. [2, 28] for some of the foundational work on sequential pattern mining, and to the text by Hand et al. [18] for further results in this area.

2 Techniques: Thresholds, State Transitions, and Trends

Topic Detection and Tracking

The Topic Detection and Tracking (TDT) research initiative represented perhaps the first systematic effort to deal with the issues raised by text information streams.

Observing that it is tricky to pin down the underlying set of problems that need to be solved, it sought to define a concrete set of tasks based on the general problem of identifying coherent topics in a stream of news stories.

Allan et al. [5] and Papka [30] describe some of the initial considerations that went into the formulation of these tasks. To begin with, the TDT researchers distinguished between the notion of a topic—a staple of information retrieval research—and an event, the latter referring to a unique occurrence that happened at a specific time. For example, "baseball" would be considered a topic, whereas "Game 6 of the 1975 World Series" would be considered an event. This kind of distinction is natural in the context of news coverage, although of course the conceptual boundary between topics and events is quite flexible. Further refinements of these definitions added other notions, such as activities.

Within this framework, one imagines a TDT system operating roughly as follows. Given a stream of news articles, the system should automatically recognize those stories that discuss an event it has not seen before, and should begin tracking each of these events so as to identify the sub-stream of further stories that discuss it. Implicit in this description is a pair of basic tasks that can be evaluated separately: new event detection, in which the goal is to recognize the first story to discuss an event, and event tracking, in which the goal is to group stories that discuss the same event. Other TDT tasks have been defined as well, including the segmentation of a continuous stream of news text into distinct stories; this is a necessary prerequisite for many of the other tasks, and crucial for transcriptions of audio broadcasts, where the boundaries between stories are not generally made explicit in a uniform way. Finally, a distinction is made between the retrospective versions of these tasks, in which the full corpus of news stories is available for analysis, and the on-line versions, in which decisions must be made in real-time as news stories appear.

One challenge inherent in the detection and tracking tasks is that events can have very different temporal "shapes": an unexpected event like a natural disaster comes on with a sharp spike, while an expected event like an election has a slow build-up and potentially a quicker decay after the event is over. A range of different techniques from information retrieval have been shown to be effective on the TDT tasks, including clustering methods for event tracking that trade off between content similarity and temporal locality. The volume edited by Allan [4] covers much of the state of the art in this area.

The general framework developed for the TDT project has proved useful in a number of subsequent efforts that have not directly used the TDT task descriptions. For example, recent work on the Newsjunkie system [15] sought techniques for determining the novelty in individual news stories, relative to previously seen stories concerned with the same general topic or event; quantifying novelty in this sense can be viewed as a relaxation of the task of new event detection.

Information Visualization

At roughly the same time as the TDT project, the information visualization community also began investigating some of the issues inherent in text information streams

[19, 29, 37]. The goal was to find visual metaphors by which users could navigate large collections of documents with an explicit temporal dimension. The ThemeRiver system [19] depicts a text collection that evolves over time, such as a corpus of news stories, using a “river” metaphor: differently colored currents in the river indicate different topics in the collection, and the currents vary in width to indicate news stories that have a large volume of relevant articles at a particular point in time. Note that this visualization approach nicely captures the view, discussed above, of text information streams as a braided, episodic medium: the individual “braids” appear in ThemeRiver as the colored currents, while the episodes stand out as gradual or sudden widenings of a particular current.

Timelines and Threshold-Based Methods

Timelines are a common means of representing temporal data, appearing in computational applications (see, e.g., [6, 31]) and familiar from more traditional print media as well. Swan, Allan, and Jensen [33–35] considered the problem of creating timelines for document streams, again focusing on collections of news articles. They framed the problem as a selection task: the construction of a timeline should involve retrospectively identifying the small set of most significant episodes from a large stream of documents, associating a descriptive tag with each one, and displaying these tagged episodes in their natural temporal order. Formalizing this notion requires an implementable definition of episodes in a text stream, together with a way of ranking episodes by significance.

Swan et al. base their definition of episodes on time-varying features in the text—words, named entities, and noun features are all examples of possible features for this purpose. Now, each feature has an average rate at which it appears in the corpus; for a news stream, this would be the total number of occurrences of the feature divided by the number of days represented in the corpus. An episode is then associated with a contiguous interval of time during which one of these features exceeds its average rate by a specific threshold; Swan et al. determine this threshold on a per-feature basis using a χ² test, and group together consecutive days over which a given feature exceeds its threshold. Thus, for example, if we were following news articles early in a US Presidential election year, we might find that for a number of days in a row, the word "Iowa" appears a significant factor more frequently than it standardly does in the news, reflecting coverage of the Iowa caucuses. The most significant episodes computed this way can then be included in the timeline, each with a start and end time, and each associated with a specific word or phrase.
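A minimal sketch of this style of episode construction, with a fixed multiplicative threshold standing in for the per-feature χ² test of Swan et al. (the factor and the toy counts below are our own):

# daily_counts: occurrence counts of one feature, one entry per day.
def episodes(daily_counts, factor=1.5):
    avg = sum(daily_counts) / len(daily_counts)
    threshold = factor * avg                # stand-in for the chi-squared test
    found, start = [], None
    for day, c in enumerate(daily_counts):
        if c > threshold and start is None:
            start = day                     # an episode opens
        elif c <= threshold and start is not None:
            found.append((start, day - 1))  # the episode closes
            start = None
    if start is not None:
        found.append((start, len(daily_counts) - 1))
    return found

print(episodes([1, 0, 2, 9, 11, 8, 1, 0, 12, 1]))   # [(3, 5), (8, 8)]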

State-Based Methods

The number of occurrences of a given feature can be quite noisy, varying widely from one day to the next even in the middle of an event in which the feature figures prominently. Swan et al. observe that this poses difficulties in the construction of intervals to represent episodes on a timeline; a feature being tracked may oscillate above and below the threshold, turning something that intuitively seems like a single long interval into a sequence of shorter ones, interrupted by the below-threshold days. Thus, to continue our example from above, there may not be a single interval associated with the word "Iowa" early in election coverage, but rather a set of short ones separated by small gaps. This was also an observation in the development of ThemeRiver, that Sundays and other days with lighter news coverage show up as recurring "pinch points" in the representation of the stream. Swan and Allan proposed heuristics to deal with this problem [34]; for example, one can merge two intervals corresponding to the same feature if they are separated by only a single day on which the feature was below threshold.

In [23], the author proposed a method for defining episodes in a stream of documents that deals with the underlying noise by explicitly positing a source model for the words in the documents. The motivation comes from queueing theory, where a bursty source of network traffic is often modeled as a probabilistic automaton: at any given point in time, the automaton can be in one of several different states, the rate at which traffic is generated is determined by this state, and transitions between states are determined probabilistically [7, 13, 21]. In the case of documents, one can imagine each word to be a type of "traffic" generated by its own source, modeled as an automaton. Some words are highly "bursty" in the sense that their frequency spikes when a particular event is in the news; in the probabilistic model, this corresponds to a contiguous interval of time during which the automaton is in a state corresponding to a high rate of traffic. Such word bursts are the intervals that correspond to discrete episodes in this model; one can identify them using a Viterbi-style algorithm [32], taking the times at which a given word occurs as input and computing the most likely sequence of states of the automaton that is assumed to be generating it. Note the distinction with the threshold-based methods described earlier: rather than associating episodes with intervals when the observed rate of the word is consistently high, it associates them with intervals in which the state of the underlying automaton has a high rate. Thus, by modeling state transitions as relatively low-probability events, the bursts that are computed by the algorithm tend to persist through periods of noise: remaining in a single state through such a noisy period is viewed as more probable than performing many transitions.

Moreover, an automaton with a large set of states corresponding to increasingly rapid rates can expose the natural nested structure of word bursts: in the midst of an elevated use of a particular term, the rate of use may increase further, producing a "burst within a burst." This nesting can in principle be iterated arbitrarily, yielding a natural hierarchy.
To continue our example of US presidential election coverage in the news, one could imagine over a multi-year period seeing bursts for the word "convention" in the summers of presidential election years, with two inner, stronger bursts corresponding to the Democratic and Republican conventions.

A timeline for a document stream can be naturally constructed using a two-state automaton associated with each word—a slower base state corresponds to the average rate of appearance of the word, while a burst state corresponds to a faster "burst rate." Using just two states does not expose any nested or hierarchical structure, but it makes it very simple to interpret the burst intervals. For each word, one computes the intervals during which the automaton is in the burst state. Now, by definition, in each such interval the automaton is more likely to be in the burst state than the base state; we define the weight of the interval to be the factor by which the probability of the burst state exceeds the probability of the base state over the course of the interval. Essentially, the weight thus represents our "confidence" that the automaton is indeed in the burst state. A timeline with a given number of items can simply be defined to consist of the intervals with the highest weight.

This approach is considered in the context of several different kinds of document streams in [23], including e-mail and titles of scientific papers. To give a sense for the kind of timelines that can be constructed this way, Table 1 shows an example from [23], the results of this method applied to the document stream consisting of titles of all papers from the database conferences SIGMOD and VLDB, 1975–2001. A two-state automaton is associated with each word, in which the burst state has twice the rate of the base state, and the 30 burst intervals of highest weight are depicted. Note that all words are included in the analysis, but, matching our intuition, we see that stop-words do not occur on the list because they do not tend to be very bursty. On the other hand, certain words appearing on the list do reflect trends in language use rather than in technical content; for example, the bursts for 'data,' 'base,' and 'bases' in the years 1975–1981 in Table 1 arise in large part from the fact that the term 'database' was written as two words in a significant number of the paper titles during this period.

Table 1 is primarily just an illustrative example; Mane and Börner have used this burst detection approach in a similar but much more extended way as part of a large-scale scientometric analysis of topics in the Proceedings of the National Academy of Sciences over the years 1982–2001 [27]. The bursty topics can be viewed as providing a small set of natural "entry points" into a much larger collection of documents; for example, they provide natural starting points for issuing queries to the collection, as well as the raw material for a clustering analysis of the content. As an instance of the latter application, Mane and Börner computed a two-dimensional representation of term co-occurrences for a collection of words and phrases selected to maximize a combination of burstiness and frequency, and this representation was then evaluated by domain experts.
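As a deliberately simplified illustration of the two-state model (our own toy rendering, not the exact cost function of [23]), the sketch below runs a Viterbi-style dynamic program over a 0/1 sequence recording whether a given word occurs in each document of the stream: the burst state is given twice the base rate, every change of state pays a fixed log-penalty standing in for the low transition probability, and each maximal burst interval is reported with the weight defined above.

import math

def burst_intervals(obs, p0, trans_cost=3.0):
    """obs: 1 if the word occurs in document t, else 0.  p0: base rate;
    the burst state uses rate 2*p0.  trans_cost: log-penalty per state change."""
    if not obs:
        return []
    rate = (p0, min(2.0 * p0, 0.99))

    def cost(s, x):                        # negative log-likelihood in state s
        p = rate[s]
        return -math.log(p if x else 1.0 - p)

    n = len(obs)
    best = [cost(0, obs[0]), trans_cost + cost(1, obs[0])]
    back = [[0, 0]]                        # back[t][s] = chosen state at time t-1
    for t in range(1, n):
        new, ptr = [0.0, 0.0], [0, 0]
        for s in (0, 1):
            stay, switch = best[s], best[1 - s] + trans_cost
            ptr[s] = s if stay <= switch else 1 - s
            new[s] = min(stay, switch) + cost(s, obs[t])
        best, back = new, back + [ptr]
    s = 0 if best[0] <= best[1] else 1     # backtrack the optimal state sequence
    states = [0] * n
    for t in range(n - 1, -1, -1):
        states[t] = s
        s = back[t][s]
    # Maximal runs of the burst state; weight = log-likelihood advantage of
    # the burst state over the base state on the interval.
    out, start = [], None
    for t, s in enumerate(states + [0]):
        if s == 1 and start is None:
            start = t
        elif s == 0 and start is not None:
            w = sum(cost(0, obs[i]) - cost(1, obs[i]) for i in range(start, t))
            out.append((start, t - 1, w))
            start = None
    return out

With a long, mostly-zero sequence containing one dense stretch of ones, the function returns that stretch as a single interval even if it contains a few isolated zeros—precisely the smoothing effect discussed next.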
Another observation based on the example in Table 1 is that the state-based model has the effect of producing longer bursts than a corresponding use of thresholds; although the burst state had twice the rate of the base state, most of the terms did not maintain a rate of twice their overall average throughout the entire interval. Indeed, only five terms in these paper titles appeared continuously at twice their average rate for a period of more than three years. David Jensen makes the observation that this kind of state-based smoothing effect may be more crucial for some kinds of text streams than for others: using just the titles of papers introduces a lot of noise that needs to be dealt with, while news articles tend to be written so that even readers joining the coverage many days into the event will still be able to follow the context [20].

Table 1  The 30 bursts of highest weight using titles of all papers from the database conferences SIGMOD and VLDB, 1975–2001

Word             Interval of burst
data             1975 SIGMOD—1979 SIGMOD
base             1975 SIGMOD—1981 VLDB
application      1975 SIGMOD—1982 SIGMOD
bases            1975 SIGMOD—1982 VLDB
design           1975 SIGMOD—1985 VLDB
relational       1975 SIGMOD—1989 VLDB
model            1975 SIGMOD—1992 VLDB
large            1975 VLDB—1977 VLDB
schema           1975 VLDB—1980 VLDB
theory           1977 VLDB—1984 SIGMOD
distributed      1977 VLDB—1985 SIGMOD
data             1980 VLDB—1981 VLDB
statistical      1981 VLDB—1984 VLDB
database         1982 SIGMOD—1987 VLDB
nested           1984 VLDB—1991 VLDB
deductive        1985 VLDB—1994 VLDB
transaction      1987 SIGMOD—1992 SIGMOD
objects          1987 VLDB—1992 SIGMOD
object-oriented  1987 SIGMOD—1994 VLDB
parallel         1989 VLDB—1996 VLDB
object           1990 SIGMOD—1996 VLDB
mining           1995 VLDB—
server           1996 SIGMOD—2000 VLDB
sql              1996 VLDB—2000 VLDB
warehouse        1996 VLDB—
similarity       1997 SIGMOD—
approximate      1997 VLDB—
web              1998 SIGMOD—
indexing         1999 SIGMOD—
xml              1999 VLDB—

In a sense, one can view a state-based approach as defining a more general, "relaxed" notion of a threshold; the optimization criterion inherent in determining a maximum-likelihood state sequence is implicitly determining how far above the base rate, and for how long, the rate must be (in an amortized sense) in order for the automaton to enter the burst state. Zhu and Shasha considered a more direct way of generalizing the threshold approach [38]. For each possible length k, they allow a user-defined threshold t(k); any interval of length k containing at least t(k) occurrences of the feature is declared to be a burst. Note that this means that a single feature can have overlapping burst intervals of different lengths. They then go on to develop fast algorithms for enumerating all the bursts associated with a particular feature.

Trend-Based Methods

Thus far we have been discussing models for episodes as regions of increased density—words fluctuate in their patterns of occurrence, and we have been seeking short intervals in which a word occurs an unusual number of times. Liebscher and Belew propose a different structure of interest in a stream of documents: the sets of words that exhibit the most pronounced rising and falling trends over the entire length of the corpus [26]. For streams that are grouped over time into relatively few bins—such as yearly tabulations of research papers like we considered in Table 1—they apply linear regression to the set of frequencies of each term (first averaging each value with its two neighbors). They also argue that less sparse data would be amenable to more complex time-series analysis. Features with the most positive slopes represent the strongest rising terms, while those with the most negative slopes represent the strongest falling terms.

In addition to summarizing long-range trends in the stream, Liebscher and Belew also propose an application to temporal term weighting: if a user issues a query for a particular term that rises significantly over time, early documents containing the term can be weighted more heavily. The argument is that such early documents may capture more fundamental aspects of a topic that later grew substantially in popularity; without temporal weighting, it would be hard to favor such documents in the results of a search. A similar approach can be applied to terms that decrease significantly over time.
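A small sketch of this ranking under the binning assumption above: smooth each term's per-bin frequencies with their immediate neighbors, fit a least-squares line, and order terms by slope (the function names are ours).

def slope(ys):
    """Least-squares slope of ys against x = 0, 1, ..., len(ys) - 1."""
    n = len(ys)
    mx, my = (n - 1) / 2.0, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in enumerate(ys))
    den = sum((x - mx) ** 2 for x in range(n))
    return num / den

def trend_ranking(term_freqs):
    """term_freqs: term -> list of per-bin relative frequencies."""
    score = {}
    for term, ys in term_freqs.items():
        smoothed = [sum(ys[max(0, i - 1):i + 2]) / len(ys[max(0, i - 1):i + 2])
                    for i in range(len(ys))]       # average with neighbors
        score[term] = slope(smoothed)
    return sorted(score, key=score.get)            # strongest fallers first, risers last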

Two-Point Trends

A number of search engines have recently begun using the notion of rising and falling terms to present snapshots of trends in search behavior; see, for example, Google's Zeitgeist and Ask's Top Searches [8, 16]. The canonical approach for doing this is to compare the frequency of each search term in a given week to its frequencies in the previous week, and to find those for which the change has been largest. Of course, this definition depends critically on how we choose to measure change. The tricky point here is that there are many natural ways to try quantifying a trend involving just two data points—i.e., in the case of the search engine application, the two data points are the frequencies of occurrence in the previous week and the current week.

To make this concrete, suppose we have text from two different time periods, and we want to define the amount of change experienced by a given word w. We define each occurrence of each word to be a token, and let n0 and n1 denote the total number of tokens in the text from the first and second periods, respectively. Let f0(w) and f1(w) denote the total number of occurrences of the word w in the first and second periods, respectively, and define p0(w) = f0(w)/n0 and p1(w) = f1(w)/n1.

Now, two basic ways to rank words by the significance of their "falling" and "rising" patterns would be to measure absolute change or relative change. We define the absolute change of w to be f1(w) − f0(w); we can also define the normalized absolute change as f1(w)(n0/n1) − f0(w) to correct for the fact that the amount of text from the first and second periods may differ significantly. Analogously, we define the relative change of w to be f1(w)/f0(w) and the normalized relative change to be (n0 f1(w))/(n1 f0(w)). Charikar et al. [10], motivated by the application to trend detection at Google, discuss streaming algorithms for computing these types of change measures on large datasets. While the exact methods for measuring change in search frequency have not been made public, Ask Jeeves indicates that it ranks terms by a variant of relative change, while Charikar et al. suggest that Google ranks terms by a variant of absolute change.

If one thinks about it intuitively, absolute and relative change are each quite extreme measures for this task, though in opposite directions. In order to be among the most highly ranked for absolute change, a word must have a very high rate of occurrence; this means that the lists of most significant risers and fallers will be dominated by very common words, and even fluctuations in stop-word use can potentially swamp more interesting trends. On the other hand, in order to be among the most highly ranked for relative change, a word generally must occur extremely rarely in one of the two periods and have a massive spike in the other; such trends can often be artifacts of a particular transient pattern of usage. (As a degenerate case of this latter issue, it is not clear how to deal with words that failed to occur in one of the two periods.) As an illustration of these issues, Table 2 shows the five most significant falling and rising words from the text of weekly US Presidential radio addresses, with the two periods consisting of the years 2002 and 2003.
A ranking by normalized absolute change mainly puts stop-words in the top few positions, for the reason suggested above; mixed in with this are more topic-specific words that experienced huge fluctuations, like "Iraq." The ranking by relative change is harder to interpret, but consists primarily of words that occurred very rarely in one year or the other.

Table 2  Most prominent falling and rising words in text of weekly US Presidential radio addresses, comparing 2002 to 2003

Normalized absolute change:
  Falling words: to, i, president, security, it
  Rising words:  iraq, are, and, iraqi, of

Relative change:
  Falling words: welfare, she, mexico, created, love
  Rising words:  aids, rising, instead, showing, government

Probabilistic generative model:
  Falling words: homeland, trade, security, senate, president
  Rising words:  iraq, iraqi, aids, seniors, coalition

Is there a principled way to interpolate between these two extremes, favoring topic-specific words that experienced large changes and that, while relatively frequent, were still not among the most overwhelmingly common words in the corpus? Our proposal here is that the state-based models considered earlier provide a very simple measure of change that performs such an interpolation. Recall that, in ranking bursts produced by a two-state automaton, we used a weight function that measured the probability of the burst relative to the probability of the automaton having remained in its base state. For the present application, we can imagine performing the following analogous calculation. Suppose that, after observing the first period, we posit the following model for generating the words in the second period: n1 tokens will be generated independently at random, with the jth token taking the value w with probability p0(w). In other words, the observed word frequencies in the first period are assumed to generate the tokens in the second period via a simple probabilistic generative model.

Now, the significance of an increase in word w in the second period is based simply on the probability that f1(w) tokens out of n1 would take the value w; given the model just defined, this is equal to

    (n1 choose f1(w)) · p0(w)^f1(w) · (1 − p0(w))^(n1 − f1(w)).

(Since this will typically be a very small quantity, we work with logarithms to perform the actual calculation.) By choosing the words with p1(w) > p0(w) for which this probability is lowest, we obtain a list of the top rising terms; the analogous computation produces a list of the top falling terms. The third part of Table 2 shows the results of this computation on the same set of US Presidential radio addresses; while we see some overlap with the previous two lists, the probabilistic generative model arguably identifies words in a way that is largely free from the artifacts introduced by the two simpler measures.

Of course, there are many other ways to interpolate between absolute and relative change. The point is simply that a natural generative model can produce results that are much cleaner than these cruder measures, and hence, without significantly increasing the difficulty of the ranking computation, yield trend summaries of this type that are potentially more refined.
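The three rankings of Table 2 can be reproduced schematically as follows (a sketch with our own function names; the tiny smoothing constants guard against words that are absent from one of the two periods, which the text notes is a degenerate case).

import math

def changes(f0, f1, n0, n1, w):
    """f0, f1: word -> count in period 0 / period 1; n0, n1: total token counts."""
    abs_change = f1.get(w, 0) * (n0 / n1) - f0.get(w, 0)                # normalized absolute
    rel_change = (n0 * f1.get(w, 0)) / (n1 * max(f0.get(w, 0), 1e-9))   # normalized relative
    return abs_change, rel_change

def log_generative_score(f0, f1, n0, n1, w):
    """log P[f1(w) of the n1 tokens equal w | each token is w with prob. p0(w)]."""
    p0 = max(f0.get(w, 0) / n0, 1e-12)
    k = f1.get(w, 0)
    return (math.lgamma(n1 + 1) - math.lgamma(k + 1) - math.lgamma(n1 - k + 1)
            + k * math.log(p0) + (n1 - k) * math.log(1.0 - p0))

Rising terms are the words with p1(w) > p0(w) whose score is most negative; the symmetric computation gives the falling terms.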

3 Applications: Weblogs, Queries, and Usage Data

Weblogs

Personal home pages have been ubiquitous on the Web since its initial appearance; they are the original example of spontaneous Web content creation on a large scale by users who in many cases are not technically sophisticated. Weblogs (also referred to as blogs) can be viewed as a more recent step in the evolution of this style of personal publishing, consisting generally of dated, journal-style entries with links and commentary. Whereas a typical home page may be relatively static, a defining feature of weblogs is this explicit temporal annotation and regular updating. A significant sub-population of webloggers focus on news events, commenting on items of interest that they feel are being ignored or misportrayed by traditional news organizations; in some cases, these weblogs have readerships of substantial size. As a result, it is also natural to consider the space of weblogs as containing a stream of news that parallels the mainstream news media, much more heterogeneous both in focus and in authoring style.

All this makes it natural to apply the type of temporal analysis we have been discussing to the information stream consisting of indexable weblog content. In one of the first such applications to this domain, Dan Chan, motivated in part by the word burst model of [23], implemented a word-burst tracker on his widely-used weblog search site Daypop [11]. As we saw earlier with bursty topics from collections of research papers [27], the Daypop application provides another basic example of the way in which a list of bursty words can provide natural entry points into a corpus—in this case to search for items of interest in current news and commentary. By including word bursts computed both from weblog text and from headlines in the mainstream news, the site allows one to identify both overlaps and differences in the emphases of these two parallel media.

In a roughly concurrent study, Kumar et al. focused on a combined analysis of weblog content and link structure [24]. Because so much of weblog discourse takes place through reference to what other webloggers are reading and writing, the temporal dynamics of the link structure conveys important information about emerging topics and discussion. Kumar et al. work on the problem of identifying subgraph bursts in this structure, consisting of a significant number of link appearances among a small set of sites in a short period of time. They observe that this poses greater computational challenges than the analogous search for bursty words: whereas word bursts can be identified one word at a time, subgraph bursts need a more global analysis, since they are not the result of activity at any one site in isolation.

From a methodological point of view, one appealing feature of weblogs as a domain is the richness of the data; in principle, one can track the growth of a topic at an extremely fine-grained level, using both the detailed temporal annotations and the links that webloggers use to indicate where they learned a piece of information. Adar et al. [1] and Gruhl et al. [17] exploit this to determine not just whether a given topic burst has occurred, but to some extent how and why it occurred. This involves modeling the spread of the topic as a kind of epidemic on the link structure—identifying the "seeds" of the topic, where it appeared earliest in time, and then tracing out the process of contagion by which the topic spread from one weblog to another. Adar et al.
use this to define a ranking measure for weblogs, based on an estimate of their ability to start such an epidemic process. Gruhl et al. attempt to learn parameters of these epidemic processes, using an instance of the General Cascade Model for the diffusion of information in social networks, proposed by Kempe et al. [22].

Search Engine Queries

The logs of queries made to a large search engine provide a rich domain for temporal analysis; here the text stream consists not of documents but of millions of very short query strings issued by search engine users. Earlier, we discussed a simple use of query logs to identify the most prominently rising and falling query terms from one week to the next. We now discuss two recent pieces of work that perform temporal analysis of such logs for the purpose of enhancing and improving search applications. Vlachos et al. [36] study query logs from the MSN search engine, providing techniques to identify different types of temporal patterns in them. For example, the frequency of the query "cinema" has a peak every weekend, while the frequency of the query "Easter" builds to a single peak each spring and then drops abruptly. Vlachos et al. apply a threshold-based technique to a moving average of the daily frequencies for a given query in order to find the burst periods for the query, and they propose a "query-by-burst" technique that can identify queries with burst periods that closely overlap in time. Using Fourier analysis, they also build a representation of the temporal periodicities in a query's rate over time, and then apply time-series matching techniques to identify other queries with very similar temporal patterns—such similarity can be taken as a form of evidence in the identification of related queries. Diaz and Jones [12] also build temporal profiles, but they use the timestamps of the documents returned in response to a query, rather than the timestamps of the invocations of the query by users. Thus, in a corpus of news articles, a query about a specific natural disaster will tend to produce articles that are tightly grouped around a single date. Such temporal profiles become features of a query that can be integrated into a probabilistic model of document relevance; experiments in [12] show that the use of such features can lead to performance improvements.

Usage Data

On-line activity involves user behavior over short time-scales—browsing, searching, and communicating—as well as the creation of less ephemeral written content, including the text of pages and files on the Web. On-line information streams encode information about both these kinds of data, and the division between the two is far from clear. While research paper archives and news streams clearly represent written content, the written material in weblogs is quite closely tied to the usage patterns of its authors—their recent history of reading and communication. The text that is posted on forums and discussion boards also reflects the dynamics of visitors to sites on a rapid time-scale; and search engine query logs also occupy a kind of middle ground, consisting of text encoding the behavior of searchers at the site.

To conclude our survey of different application areas, we consider an on-line information stream that encodes a basic form of usage data—the sequence of downloads performed by users at a high-traffic Web site. Many active Web sites are organized around a set of items that are available for purchase or download, using a fairly consistent high-level metaphor: at an e-commerce site like amazon.com, or an archive of research papers like arxiv.org, there is navigational structure at the front, followed by a large set of "description pages," one associated with each item that is available. Each description page provides the option to acquire the corresponding item.

In joint work with Jon Aizen, Dan Huttenlocher, and Tony Novak [3], the author studied the dynamics of download behavior at one such site, the Internet Archive, which maintains a publicly accessible media collection consisting of old films, live concerts, free on-line books, and other items available for download. The basic definition in [3] is the "batting average" (henceforth, the BA) of an on-line item: the number of acquisitions divided by the number of visits to the description page. Thus a high BA is a reflection of item quality, indicating that a large fraction of users who viewed the item's description chose to download it. Note that this is different from a simple "hit counter" which measures the raw number of item visits or downloads; the BA addresses the more subtle notion of users' reactions to the item description.

Suppose we want to measure fluctuations in the BA over time, looking for evidence of bursty behavior in which the BA of an item changes suddenly; in order to do this it is necessary to define a meaningful notion of the "instantaneous BA" of an item as a function of time. Such an instantaneous BA must be synthesized from the 0–1-valued sequence of user download decisions: each entry in the usage log simply reflects whether a particular user chose to download the item or not. In [3], two different methods are considered for determining an instantaneous BA from this sequence. First, one can apply Gaussian smoothing, computing an average of each point in the download sequence with its neighbors, weighted by coefficients that decay according to a Gaussian function. Alternately, one can treat the download decisions as the 0–1-valued outputs of a Hidden Markov model (HMM), with hidden states representing underlying download probabilities; the maximum-likelihood state sequence can then be taken to represent the item's BA at each point in time.
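A sketch of the first method (the window width sigma and the truncation radius are free parameters of our own choosing); the HMM alternative described next replaces this smoothing with a maximum-likelihood state sequence.

import math

def smoothed_ba(decisions, sigma=50.0, radius=150):
    """decisions: 0/1 values, one per visit to the item's description page
    (1 = the visitor downloaded the item).  Returns the instantaneous BA."""
    offsets = range(-radius, radius + 1)
    weights = [math.exp(-0.5 * (d / sigma) ** 2) for d in offsets]
    ba = []
    for i in range(len(decisions)):
        num = den = 0.0
        for d, w in zip(offsets, weights):
            j = i + d
            if 0 <= j < len(decisions):
                num += w * decisions[j]
                den += w
        ba.append(num / den)
    return ba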
A large number of states was used in the HMM computation, so as to densely sample the set of possible probabilities; in order to be able to handle long download sequences efficiently, this required the use of an improved maximum-likelihood algorithm due to Felzenszwalb et al. [14] that took advantage of the structure of the state transitions so as to run in time linear in the number of states, rather than quadratic. The HMM approach produces sharp, discrete changes in the BA, while the Gaussian smoothing approach yields more gradual changes; Fig. 1 illustrates this, showing the BA as a function of time for each of these approaches applied to the same downloadable movie at the Internet Archive. One can see how the sharp steps in the HMM plot approximately line up with the more gradual curves in the Gaussian-smoothed plot.

Fig. 1  Tracking the batting average of a downloadable movie at the Internet Archive as a function of time. The upper plot shows smoothing by averaging with a Gaussian sliding window; the lower plot shows the maximum-likelihood state sequence in a Hidden Markov model

Are these sharp steps useful in the context of the underlying application? In [3], it is argued that the discrete breakpoints in the BA produced by the HMM in fact capture a crucial feature of popularity dynamics at sites like the Internet Archive. Specifically, the download properties of an item, as reflected in measures like the BA, often change abruptly rather than gradually, due to the appearance of an external link from some other high-traffic site—which suddenly drives in a new mix of users with new interests—or due to on-site highlighting—for example, featuring the item on a top-level page at the Internet Archive—which also raises the item's visibility. The addition of links or promotional text happens at a discrete point in time, and experiments in [3] show that state transitions in the HMM align closely with the times when these events happen. To continue with the plot in Fig. 1, for example, the first two transitions in the HMM plot align with on-site highlighting performed by the Internet Archive, the next two sharp drops correspond to the appearance of links to the item's description from high-traffic weblogs, and the final three drops correspond to technical problems on the Internet Archive site. Examples like this, as well as the more systematic evaluation performed in [3], suggest how accurate tracking of changes to an item's BA can help in understanding the set of events both on and off the site that affect the item's popularity.

4 Conclusions

In studying the temporal dynamics of on-line information streams, an issue that arises in many contexts is the problem of alignment: we want to align the "virtual" events that we find in the stream with a parallel set of events taking place outside the stream, in the "real world." One sees this, for example, in aligning bursts in a news stream or weblog index with current events; or in aligning interesting temporal dynamics in search engine query terms with seasonal patterns, upcoming holidays, names in the news, or any of a range of other factors that drive search behavior. It would be interesting to formalize this alignment problem more fully, so that it can be used both to refine the analysis of these streams and to better evaluate the analyses that are performed. We have discussed some concrete examples of steps in this direction above; these include the use of time-series matching techniques to compare temporal profiles of different search engine queries [36], and the alignment of item popularity with on-site highlighting and off-site referrers at the Internet Archive. Another interesting piece of work in this direction is that of Lavrenko et al. [25], which aligns financial news stories with the behavior of financial markets. They and others make the point that the "virtual events" in the stream may not only follow as consequences of real-world events, but may also influence them—as when an unexpected news story about a company leads to a measurable effect on the company's stock price.

Relatively little work has been done on designing algorithms for the problems considered here in a "pure" streaming model of computation, where the data must be processed in one or a small number of passes with limited storage. For some of the applications, it is not clear that the data requirements are substantial enough that the use of such a streaming model will be necessary, but for other applications, including the analysis of query logs and clickstreams, there is a clear opportunity for algorithms developed in this style. The work of Ben-David et al. [9] and Charikar et al. [10] take steps in this direction, considering different ways of measuring change for streaming data.

Another direction that would be interesting to consider further is the problem of predicting bursts and other temporal patterns, when the data must be processed in real-time rather than through retrospective analysis. How early into the emergence of a bursty news topic, for example, can one detect it and begin forming estimates of its basic shape? This is clearly related to the problem of general time-series prediction, though there is the opportunity here to use a significant amount of domain knowledge based on the content of the text information streams being studied. Gathering domain knowledge involves developing a more refined understanding of the types of temporal patterns that regularly occur in these kinds of information streams. We know, for example, that the temporal profiles of some search engine queries are periodic (corresponding to weekends or annual holidays, for example) while others look like isolated spikes; we know that some news stories build up to an expected event while others appear suddenly in response to an unexpected one; but we do not have a careful classification or taxonomy of the full range of such patterns. Clearly, this would be very useful in helping to recognize temporal patterns as they arise in on-line content.
Finally, it is worth noting that as the applications of these techniques focus not just on news stories and professionally published documents but on weblogs, e-mail, search engine queries, and browsing histories, they move into domains that are increasingly about individual behavior. Many of these large data streams are personal; their subject is us. And as we continue to develop applications that extract detailed information from these streams—media players that have a more accurate picture of your changing tastes in music than you do, or e-mail managers that encode detailed awareness of the set of people you've fallen out of touch with—we will ultimately have to deal not just with technical questions but with yet another dimension of the way in which information-aware devices interact with our daily lives.

References

1. E. Adar, L. Zhang, L.A. Adamic, R.M. Lukose, Implicit structure and the dynamics of blogspace. Workshop on the weblogging ecosystem, at the international WWW conference (2004)
2. R. Agrawal, R. Srikant, Mining sequential patterns, in Proc. Intl. Conf. on Data Engineering (1995)
3. J. Aizen, D. Huttenlocher, J. Kleinberg, A. Novak, Traffic-based feedback on the web. Proc. Natl. Acad. Sci. 101(Suppl. 1), 5254–5260 (2004)
4. J. Allan (ed.), Topic Detection and Tracking: Event Based Information Retrieval (Kluwer Academic, Norwell, 2002)

5. J. Allan, J.G. Carbonell, G. Doddington, J. Yamron, Y. Yang, Topic detection and tracking pilot study: final report, in Proc. DARPA Broadcast News Transcription and Understanding Workshop (1998)
6. R. Allen, Timelines as information system interfaces, in Proc. International Symposium on Digital Libraries (1995)
7. D. Anick, D. Mitra, M. Sondhi, Stochastic theory of a data handling system with multiple sources. Bell Syst. Tech. J. 61 (1982)
8. J. Ask, Top searches at http://static.wc.ask.com/docs/about/jeevesiq.html?o=0
9. S. Ben-David, J. Gehrke, D. Kifer, Detecting change in data streams, in Proc. 30th Intl. Conference on Very Large Databases (VLDB) (2004)
10. M. Charikar, K. Chen, M. Farach-Colton, Finding frequent items in data streams, in Proc. Intl. Colloq. on Automata Languages and Programming (2002)
11. Daypop. http://www.daypop.com
12. F. Diaz, R. Jones, Using temporal profiles of queries for precision prediction, in Proc. SIGIR Intl. Conf. on Information Retrieval (2004)
13. A. Elwalid, D. Mitra, Effective bandwidth of general Markovian traffic sources and admission control of high speed networks. IEEE Trans. Netw. 1 (1993)
14. P. Felzenszwalb, D. Huttenlocher, J. Kleinberg, Fast algorithms for large-state-space HMMs with applications to web usage analysis, in Advances in Neural Information Processing Systems (NIPS), vol. 16 (2003)
15. E. Gabrilovich, S. Dumais, E. Horvitz, NewsJunkie: providing personalized newsfeeds via analysis of information novelty, in Proceedings of the Thirteenth International World Wide Web Conference (2004)
16. Google. Zeitgeist at http://www.google.com/press/zeitgeist.html
17. D. Gruhl, R. Guha, D. Liben-Nowell, A. Tomkins, Information diffusion through blogspace, in Proc. International WWW Conference (2004)
18. D. Hand, H. Mannila, P. Smyth, Principles of Data Mining (MIT Press, Cambridge, 2001)
19. S. Havre, B. Hetzler, L. Nowell, ThemeRiver: visualizing theme changes over time, in Proc. IEEE Symposium on Information Visualization (2000)
20. D. Jensen, Personal communication (2002)
21. F.P. Kelly, Notes on effective bandwidths, in Stochastic Networks: Theory and Applications, ed. by F.P. Kelly, S. Zachary, I. Ziedins (Oxford University Press, London, 1996)
22. D. Kempe, J. Kleinberg, E. Tardos, Maximizing the spread of influence through a social network, in Proc. 9th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining (2003)
23. J. Kleinberg, Bursty and hierarchical structure in streams, in Proc. 8th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining (2002)
24. R. Kumar, J. Novak, P. Raghavan, A. Tomkins, On the bursty evolution of blogspace, in Proc. International WWW Conference (2003)
25. V. Lavrenko, M. Schmill, D. Lawrie, P. Ogilvie, D. Jensen, J. Allan, Mining of concurrent text and time series. KDD-2000 workshop on text mining (2000)
26. R. Liebscher, R. Belew, Lexical dynamics and conceptual change: analyses and implications for information retrieval. Cogn. Sci. (Online) 1 (2003)
27. K. Mane, K. Börner, Mapping topics and topic bursts in PNAS. Proc. Natl. Acad. Sci. 101(Suppl. 1), 5287–5290 (2004)
28. H. Mannila, H. Toivonen, A.I. Verkamo, Discovering frequent episodes in sequences, in Proc. Intl. Conf. on Knowledge Discovery and Data Mining (1995)
29. N. Miller, P. Wong, M. Brewster, H. Foote, Topic islands: a wavelet-based text visualization system, in Proc. IEEE Visualization (1998)
30. R. Papka, On-line new event detection, clustering, and tracking. PhD thesis, Univ. Mass. Amherst (1999)
31. C. Plaisant, B. Milash, A. Rose, S. Widoff, B. Shneiderman, LifeLines: visualizing personal histories, in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (1996)

32. L. Rabiner, A tutorial on hidden Markov models and selected applications in speech recogni- tion. Proc. IEEE 77 (1989) 33. R. Swan, J. Allan, Extracting significant time-varying features from text, in Proc. 8th Intl. Conf. on Information Knowledge Management (1999) 34. R. Swan, J. Allan, Automatic generation of overview timelines, in Proc. SIGIR Intl. Conf. on Information Retrieval (2000) 35. R. Swan, D. Jensen, TimeMines: constructing timelines with statistical models of word usage. KDD-2000 Workshop on Text Mining (2000) 36. M. Vlachos, C. Meek, Z. Vagena, D. Gunopulos, Identifying similarities, periodicities and bursts for online search queries, in Proc. ACM SIGMOD International Conference on Man- agement of Data (2004) 37. P. Wong, W. Cowley, H. Foote, E. Jurrus, J. Thomas, Visualizing sequential patterns for text mining, in Proc. IEEE Information Visualization (2000) 38. Y. Zhu, D. Shasha, Efficient elastic burst detection in data streams, in Proc. ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining (2003) Part III Advanced Topics Sketch-Based Multi-Query Processing over Data Streams

Alin Dobra, Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi

We consider the problem of approximately answering multiple general aggregate SQL queries over continuous data streams with limited memory. Our method extends the randomizing techniques of Alon et al. [2] that compute small “sketch” summaries of the streams that can then be used to provide approximate answers to aggregate queries with provable guarantees on the approximation error. By intelligently sharing the “sketches” among multiple queries, the memory required can be reduced. We provide necessary and sufficient conditions for the sketch sharing to result in correct estimation and address optimization problems that arise. We also demonstrate how existing statistical information on the base data (e.g., histograms) can be used in the proposed framework to improve the quality of the approximation provided by our algorithms. The key idea is to intelligently partition the domain of the underlying attribute(s) and, thus, decompose the sketching problem in a way that provably tightens our guarantees.

A. Dobra (B) Department of Computer Science, University of Florida, Gainesville FL, USA e-mail: [email protected]fl.edu

M. Garofalakis School of Electrical and Computer Engineering, Technical University of Crete, University Campus—Kounoupidiana, Chania 73100, Greece e-mail: [email protected]

J. Gehrke Microsoft Corporation, One Microsoft Way, Redmond, WA 98052-6399, USA e-mail: [email protected]

R. Rastogi Amazon India, Brigade Gateway, Malleshwaram (W), Bangalore 560055, India e-mail: [email protected]


1 Introduction

As apparent from the developments in the previous chapters, the strong incentive behind data-stream computation has given rise to several (theoretical and practical) studies of on-line or one-pass algorithms with limited memory requirements for different problems; examples include quantile and order-statistics computation [15, 20], estimating frequency moments and join sizes [2, 3], data clustering and decision-tree construction [10, 17], estimating correlated aggregates [12], and computing one-dimensional (i.e., single-attribute) histograms and Haar wavelet decompositions [14, 16]. Other related studies have proposed techniques for incrementally maintaining equi-depth histograms [13] and Haar wavelets [21], maintaining samples and simple statistics over sliding windows [7], as well as general, high-level architectures for stream database systems [4].

None of the earlier research efforts has addressed the general problem of processing general, possibly multi-join, aggregate queries over continuous data streams. On the other hand, efficient approximate multi-join processing has received considerable attention in the context of approximate query answering, a very active area of database research in recent years [1, 6, 11, 18, 19, 24]. The vast majority of existing proposals, however, rely on the assumption of a static data set, which enables either several passes over the data to construct effective, multi-dimensional data synopses (e.g., histograms [19] and Haar wavelets [6, 24]) or intelligent strategies for randomizing the access pattern of the relevant data items [18]. When dealing with continuous data streams, it is crucial that the synopsis structure(s) are constructed directly on the stream, that is, in one pass over the data in the fixed order of arrival; this requirement renders conventional approximate query processing tools inapplicable in a data-stream setting. (Note that, even though random-sample data summaries can be easily constructed in a single pass [23], it is well known that such summaries typically give very poor result estimates for queries involving one or more joins [1, 2, 6]¹.)

In this chapter, we tackle the hard technical problems involved in the approximate processing of complex (possibly multi-join) aggregate decision-support queries over continuous data streams with limited memory. Our approach is based on randomizing techniques that compute small, pseudo-random sketch summaries of the data as it is streaming by. The basic sketching technique was originally introduced for on-line self-join size estimation by Alon, Matias, and Szegedy in their seminal paper [3] and, as we demonstrate in our work, can be generalized to provide approximate answers to complex, multi-join, aggregate SQL queries over streams with explicit and tunable guarantees on the approximation error. We should point out that the error-bound derivation for multi-join queries is non-trivial and requires that certain acyclicity restrictions be imposed on the query's join graph.

Usually, in a realistic application, multiple queries need to be processed simultaneously over a collection of data streams. In this chapter, we also address the problem of efficiently sharing and allocating memory for the concurrent queries—we call our technique sketch sharing.

We provide necessary and sufficient conditions for multi-query sketch sharing that guarantee correctness of the resulting sketch-based estimators. Since the problem of finding the optimal sketch sharing and memory allocation turns out to be NP-hard, we provide a greedy algorithm to find reasonably good solutions.

Another important practical concern that arises in the multi-join context is that the quality of the approximation may degrade as the variance of our randomized sketch synopses increases in an explosive manner with the number of joins involved in the query. To this end, we propose novel sketch-partitioning techniques that take advantage of existing approximate statistical information on the stream (e.g., histograms built on archived data) to decompose the sketching problem in a way that provably tightens our estimation guarantees.

¹ The sampling-based join synopses of [1] provide a solution to this problem but only for the special case of static, foreign-key joins.

2 Answering a Single Multi-Join Query

We begin by describing our sketch-based techniques to approximate the result of a single multi-join aggregate SQL query over a collection of streaming relations R1,...,Rr. More specifically, we focus on approximating a multi-join stream query Q of the form: “SELECT COUNT FROM R1, R2,...,Rr WHERE E”, where E represents the conjunction of n equi-join constraints of the form Ri.Aj = Rk.Al (Ri.Aj denotes the jth attribute of relation Ri, and we use dom(Ri.Aj) to denote its domain²). The extension to other aggregate functions, e.g., SUM, is fairly straightforward [9]; furthermore, note that dealing with single-relation selections is similarly straightforward (simply filter out the tuples that fail the selection predicate from the relational stream). Our development also assumes that each attribute Ri.Aj appears in E at most once; this requirement can be easily achieved by simply renaming repeating attributes in the query.

In what follows, we describe the key ideas and results based on the join-graph model of the input query Q, since this will allow for a smoother transition to the multi-query case (Sect. 3). Given stream query Q, we define the join graph of Q (denoted by J(Q)) as follows. There is a distinct vertex v in J(Q) for each stream Ri referenced in Q (we use R(v) to denote the relation associated with vertex v). For each equality constraint Ri.Aj = Rk.Al in E, we add a distinct undirected edge e = ⟨v,w⟩ to J(Q), where R(v) = Ri and R(w) = Rk; we also label this edge with the triple ⟨Ri.Aj, Rk.Al, Q⟩ that specifies the attributes in the corresponding equality constraint and the enclosing query Q (the query label is used in the multi-query setting). Given an edge e = ⟨v,w⟩ with label ⟨Ri.Aj, Rk.Al, Q⟩, the three components of e's label triple can be obtained as Av(e), Aw(e), and Q(e). (Clearly, by the definition of equi-joins, dom(Av(e)) = dom(Aw(e)).)

² Without loss of generality, we assume that each attribute domain dom(A) is indexed by the set of integers {1,...,|dom(A)|}, where |dom(A)| denotes the size of the domain.

Fig. 1 Example query join graph

Note that there may be multiple edges between a pair of vertices in the join graph, but each edge has its own distinct label triple. Finally, for a vertex v in J(Q), we denote the attributes of R(v) that appear in the input query (or queries) as A(v); thus, A(v) = {Av(e) : edge e is incident on v}.

The result of Q is the number of tuples in the cross-product of R1,...,Rr that satisfy the equality constraints in E over the join attributes. Similar to the basic sketching method [2, 3], our algorithm constructs an unbiased, bounded-variance probabilistic estimate XQ for Q using atomic sketches built on the vertices of the join graph J(Q). More specifically, for each edge e = ⟨v,w⟩ in J(Q), our algorithm defines a family of four-wise independent random variables ξ^e = {ξ^e_i : i = 1,...,|dom(Av(e))|}, where each ξ^e_i ∈ {−1,+1}. The key here is that the equi-join attribute pair ⟨Av(e), Aw(e)⟩ associated with edge e shares the same ξ family; on the other hand, distinct edges of J(Q) use independently-generated ξ families (using mutually independent random seeds). The atomic sketch Xv for each vertex v in J(Q) is built as follows. Let e1,...,ek be the edges incident on v and, for i1 ∈ dom(Av(e1)),...,ik ∈ dom(Av(ek)), let fv(i1,...,ik) denote the number of tuples in R(v) that match values i1,...,ik in their join attributes. More formally, fv(i1,...,ik) is the number of tuples t ∈ R(v) such that t[Av(ej)] = ij, for 1 ≤ j ≤ k (t[A] denotes the value of attribute A in tuple t). Then, the atomic sketch at v is

  Xv = Σ_{i1∈dom(Av(e1))} ··· Σ_{ik∈dom(Av(ek))} fv(i1,...,ik) · Π_{j=1}^{k} ξ^{ej}_{ij}.

Finally, the estimate for Q is defined as XQ = Π_v Xv (that is, the product of the atomic sketches for all vertices in J(Q)). Note that each atomic sketch Xv is again a randomized linear projection that can be efficiently computed as tuples of R(v) are streaming in; more specifically, Xv is initialized to 0 and, for each tuple t in the R(v) stream, the quantity Π_{j=1}^{k} ξ^{ej}_{t[Av(ej)]} ∈ {−1,+1} is added to Xv.

Example 1 Consider query Q = SELECT COUNT FROM R1, R2, R3 WHERE R1.A1 = R2.A1 AND R2.A2 = R3.A2. The join graph J(Q) is depicted in Fig. 1, with vertices v1, v2, and v3 corresponding to streams R1, R2, and R3, respectively. Similarly, edges e1 and e2 correspond to the equi-join constraints R1.A1 = R2.A1 and R2.A2 = R3.A2, respectively. (Just to illustrate our notation, R(v1) = R1, Av2(e1) = R2.A1, and A(v2) = {R2.A1, R2.A2}.) The sketch construction defines two families of four-wise independent random variables (one for each edge): {ξ^{e1}} and {ξ^{e2}}. The three atomic sketches X_{v1}, X_{v2}, and X_{v3} (one for each vertex) are defined as: X_{v1} = Σ_{i∈dom(R1.A1)} f_{v1}(i) ξ^{e1}_i, X_{v2} = Σ_{i∈dom(R2.A1)} Σ_{j∈dom(R2.A2)} f_{v2}(i,j) ξ^{e1}_i ξ^{e2}_j, and X_{v3} = Σ_{j∈dom(R3.A2)} f_{v3}(j) ξ^{e2}_j. The value of the random variable XQ = X_{v1} X_{v2} X_{v3} gives the sketching estimate for the result of Q.
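To make the streaming computation concrete, the following minimal Python sketch (illustrative code, not from the chapter; all names such as XiFamily and update_R1 are hypothetical) maintains the three atomic sketches of Example 1 as tuples arrive, using a common degree-3 polynomial construction for the four-wise independent ±1 families. A practical implementation would keep many independent copies of these counters and combine them (averages and medians) to obtain the accuracy guarantees of Theorem 1 below.

```python
import random

class XiFamily:
    """Four-wise independent {-1,+1} variables via a random degree-3
    polynomial over a prime field (a common construction)."""
    P = 2_147_483_647  # Mersenne prime 2^31 - 1

    def __init__(self, rng):
        self.coeffs = [rng.randrange(self.P) for _ in range(4)]

    def __call__(self, i):
        a0, a1, a2, a3 = self.coeffs
        h = (((a3 * i + a2) * i + a1) * i + a0) % self.P  # Horner evaluation
        return 1 if h & 1 else -1

rng = random.Random(42)
xi_e1, xi_e2 = XiFamily(rng), XiFamily(rng)   # one independent family per edge

# One atomic sketch per join-graph vertex (Example 1), updated as tuples arrive.
X = {"v1": 0, "v2": 0, "v3": 0}

def update_R1(a1):            # R1 tuple with join attribute R1.A1 = a1
    X["v1"] += xi_e1(a1)

def update_R2(a1, a2):        # R2 tuple with (R2.A1, R2.A2) = (a1, a2)
    X["v2"] += xi_e1(a1) * xi_e2(a2)

def update_R3(a2):            # R3 tuple with join attribute R3.A2 = a2
    X["v3"] += xi_e2(a2)

def estimate_Q():             # X_Q = X_v1 * X_v2 * X_v3
    return X["v1"] * X["v2"] * X["v3"]
```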

Our analysis in [9] shows that the random variable XQ constructed above is an unbiased estimator for Q, and establishes the following theorem, which generalizes the earlier result of Alon et al. [2] to multi-join queries. (Here, SJv = Σ_{i1∈dom(Av(e1))} ··· Σ_{ik∈dom(Av(ek))} fv(i1,...,ik)² is the self-join size of R(v).)

Theorem 1 ([9]) Let Q be a COUNT query with n equi-join predicates such that J(Q) contains no cycles of length > 2. Then, E[XQ] = Q and, using sketching space of O(Var[XQ] · log(1/δ) / (Q² · ε²)), it is possible to approximate Q to within a relative error of ε with probability at least 1 − δ, where Var[XQ] ≤ 2^{2n} Π_v SJv.

3 Answering Multiple Join Queries: Sketch Sharing

We next turn our attention to sketch-based processing of multiple aggregate SQL queries over streams. We introduce the basic idea of sketch sharing and demonstrate how it can improve the effectiveness of the available sketching space and the quality of the resulting approximate answers. We also characterize the class of correct sketch-sharing configurations and formulate the optimization problem of identifying an effective sketch-sharing plan for a given query workload. We finally propose a greedy heuristic for computing good sketch-sharing configurations with minimal space overhead.

3.1 Sketch Sharing: Basic Concept

Consider the problem of using sketch synopses for the effective processing of a query workload Q = {Q1,...,Qq} comprising multiple (multi-join) COUNT aggregate queries. As in the previous section, we focus on COUNT since the extension to other aggregate functions is relatively straightforward; we also assume an attribute-renaming step that ensures that each stream attribute is referenced only once in each of the Qi's (of course, the same attribute can be used multiple times across the queries in Q). Finally, as before, we do not consider single-relation selections, since they can be trivially incorporated in the model by using the selection predicates to define filters for each stream. The sketching of each relation is performed using only the tuples that pass the filter; this is equivalent to introducing virtual relations/streams that are the result of the filtering process and formulating the queries with respect to these relations. This could potentially increase the number of relations and reduce the number of opportunities to share sketches (as described in this section), but would also create opportunities similar to the ones investigated by traditional MQO (e.g., constructing sketches for common filter sub-expressions). Here, we focus on the novel problem of sharing sketches and we do not investigate further how such techniques can be used for the case where selection predicates are allowed. As will become apparent in this section, sketch sharing is very different from common sub-expression sharing; traditional MQO techniques do not apply for this problem.

An obvious solution to our multi-query processing problem is to build disjoint join graphs J(Qi) for each query Qi ∈ Q, and construct independent atomic sketches for the vertices of each J(Qi).

Fig. 2 Example workload with sketch-sharing potential

The atomic sketches for each vertex of J(Qi) can then be combined to compute an approximate answer for Qi as described in Sect. 2. A key drawback of such a naive solution is that it ignores the fact that a relation Ri may appear in multiple queries in Q. Thus, it should be possible to reduce the overall space requirements by sharing atomic-sketch computations among the vertices for stream Ri in the join graphs for the queries in our workload. We illustrate this in the following example.

Example 2 Consider queries Q1 = SELECT COUNT FROM R1, R2, R3 WHERE R1.A1 = R2.A1 AND R2.A2 = R3.A2 and Q2 = SELECT COUNT FROM R1, R3 WHERE R1.A1 = R3.A2. The naive processing algorithm described above would maintain two disjoint join graphs (Fig. 2) and, to compute a single pair (X_{Q1}, X_{Q2}) of sketch-based estimates, it would use three families of random variables (ξ^{e1}, ξ^{e2}, and ξ^{e3}), and a total of five atomic sketches (X_{vk}, k = 1,...,5).

Instead, suppose that we decide to re-use the atomic sketch X_{v1} for v1 also for v4, both of which essentially correspond to the same attribute of the same stream (R1.A1). Since for each i ∈ dom(R1.A1), f_{v4}(i) = f_{v1}(i), we get X_{v4} = X_{v1} = Σ_{i∈dom(R1.A1)} f_{v4}(i) ξ^{e1}_i. Of course, in order to correctly compute a probabilistic estimate of Q2, we also need to use the same family ξ^{e1} in the computation of X_{v5}; that is, X_{v5} = Σ_{i∈dom(R3.A2)} f_{v5}(i) ξ^{e1}_i. It is easy to see that both final estimates X_{Q1} = X_{v1} X_{v2} X_{v3} and X_{Q2} = X_{v1} X_{v5} satisfy all the premises of the sketch-based estimation results in [9]. Thus, by simply sharing the atomic sketches for v1 and v4, we have reduced the total number of random families used in our multi-query processing algorithm to two (ξ^{e1} and ξ^{e2}) and the total number of atomic sketches maintained to four.
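Continuing the illustrative Python setup from Example 1 (same hypothetical XiFamily, xi_e1, xi_e2), sketch sharing for Example 2 amounts to letting Q2 reuse the atomic sketch for v1 and maintaining one extra sketch for R3 built with the shared family ξ^{e1}; this is a sketch of the idea, not the authors' implementation:

```python
# Q1 joins R1.A1 = R2.A1 and R2.A2 = R3.A2; Q2 joins R1.A1 = R3.A2.
# Vertex v1 (R1 on attribute A1) is shared, so X_v1 serves both queries;
# R3 keeps two atomic sketches: X_v3 for Q1 (family xi_e2) and
# X_v5 for Q2 (built with the shared family xi_e1).
X = {"v1": 0, "v2": 0, "v3": 0, "v5": 0}

def update_R1(a1):
    X["v1"] += xi_e1(a1)                 # shared by Q1 and Q2

def update_R2(a1, a2):
    X["v2"] += xi_e1(a1) * xi_e2(a2)     # used only by Q1

def update_R3(a2):
    X["v3"] += xi_e2(a2)                 # Q1's sketch of R3
    X["v5"] += xi_e1(a2)                 # Q2's sketch of R3, reusing xi_e1

def estimate_Q1():
    return X["v1"] * X["v2"] * X["v3"]

def estimate_Q2():
    return X["v1"] * X["v5"]
```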

Let J(Q) denote the collection of all join graphs in workload Q, i.e., all J(Qi) for Qi ∈ Q. Sharing sketches between the vertices of J(Q) can be seen as a transformation of J(Q) that essentially coalesces vertices belonging to different join graphs in J(Q). (We also use J(Q) to denote the transformed multi-query join graph.) Of course, as shown in Example 2, vertices v ∈ J(Qi) and w ∈ J(Qj) can be coalesced in this manner only if R(v) = R(w) (i.e., they correspond to the same data stream) and A(v) = A(w) (i.e., both Qi and Qj use exactly the same attributes of that stream). Such vertex coalescing implies that a vertex v in J(Q) can have edges from multiple different queries incident on it; we denote the set of all these queries as Q(v), i.e., Q(v) = {Q(e) : edge e is incident on v}. Figure 3(a) pictorially depicts the coalescing of vertices v1 and v4 as discussed in Example 2. Note that, by our coalescing rule, for each vertex v, all queries in Q(v) are guaranteed to use exactly the same set of attributes of R(v), namely A(v); furthermore, by our attribute-renaming step, each query in Q(v) uses each attribute in A(v) exactly once.

Fig. 3 Multi-query join graphs J (Q) for Example 2

This makes it possible to share an atomic sketch built for the coalesced vertex v across all queries in Q(v); as we will see shortly, however, this alone cannot guarantee the correctness of the resulting sketch-based estimates.

Estimation with Sketch Sharing

Consider a multi-query join graph J(Q), possibly containing coalesced vertices (as described above). Our goal here is to build atomic sketches corresponding to individual vertices of J(Q) that can then be used for obtaining sketch-based estimates for all the queries in our workload Q. Specifically, consider a query Q ∈ Q, and let V(Q) denote the (sub)set of vertices in J(Q) attached to a join-predicate edge corresponding to Q; that is, V(Q) = {v : edge e is incident on v and Q(e) = Q}. Our goal is to construct an unbiased probabilistic estimate XQ for Q using the atomic sketches built for vertices in V(Q).

The atomic sketch for a vertex v of J(Q) is constructed as follows. As before, each edge e ∈ J(Q) is associated with a family ξ^e of four-wise independent {−1,+1} random variables. The difference here, however, is that edges attached to node v for the same attribute of R(v) share the same ξ family, since the same sketch of R(v) corresponding to vertex v is used to estimate all queries in Q(v); this, of course, implies that the number of distinct ξ families for all edges incident on v is exactly |A(v)| (each family corresponding to a distinct attribute of R(v)). Furthermore, all distinct ξ families in J(Q) are generated independently (using mutually independent seeds). For example, in Fig. 3(a), since A_{v1}(e1) = A_{v1}(e3) = R1.A1, edges e1 and e3 share the same ξ family (i.e., ξ^{e3} = ξ^{e1}); on the other hand, ξ^{e1} and ξ^{e2} are distinct and independent. Assuming A(v) = {A1,...,Ak} and letting ξ^1,...,ξ^k denote the k corresponding distinct ξ families attached to v, the atomic sketch Xv for node v is simply defined as

  Xv = Σ_{(i1,...,ik)∈dom(A1)×···×dom(Ak)} fv(i1,...,ik) · Π_{j=1}^{k} ξ^j_{ij}

(again, a randomized linear projection). The final sketch-based estimate for query Q is the product of the atomic sketches over all vertices in V(Q), i.e., XQ = Π_{v∈V(Q)} Xv.

Correctness of Sketch-Sharing Configurations

The XQ estimate construction described above can be viewed as simply “extracting” the join (sub)graph J(Q) for query Q from the multi-query graph J(Q), and constructing a sketch-based estimate for Q as described in Sect. 2. This is because, if we were to only retain in J(Q) the vertices and edges associated with Q, then the resulting subgraph is identical to J(Q). Furthermore, our vertex coalescing (which completely determines the sketches to be shared) guarantees that Q references exactly the attributes A(v) of R(v) for each v ∈ V(Q), so the atomic sketch Xv can be utilized.

There is, however, an important complication that our vertex-coalescing rule still needs to address to ensure that the atomic sketches for vertices of J(Q) provide unbiased query estimates with variance bounded as described in Theorem 1. Given an estimate XQ for query Q (constructed as above), unbiasedness and the bounds on Var[XQ] given in Theorem 1 depend crucially on the assumption that the ξ families used for the edges in J(Q) are distinct and independent. This means that simply coalescing vertices in J(Q) that use the same set of stream attributes is insufficient. The problem here is that the constraint that all edges for the same attribute incident on a vertex v share the same ξ family may (by transitivity) force edges for the same query Q to share identical ξ families. The following example illustrates this situation.

Example 3 Consider the multi-query join graph J(Q) in Fig. 3(b) for queries Q1 and Q2 in Example 2. (J(Q) is obtained as a result of coalescing the vertex pairs v1, v4 and v3, v5 in Fig. 2.) Since A_{v1}(e1) = A_{v1}(e3) = R1.A1 and A_{v3}(e2) = A_{v3}(e3) = R3.A2, we get the constraints ξ^{e3} = ξ^{e1} and ξ^{e3} = ξ^{e2}. By transitivity, we have ξ^{e1} = ξ^{e2} = ξ^{e3}, i.e., all three edges of the multi-query graph share the same ξ family. This, in turn, implies that the same ξ family is used on both edges of query Q1; that is, instead of being independent, the pseudo-random families used on the two edges of Q1 are perfectly correlated! It is not hard to see that, in this situation, the expectation and variance derivations for X_{Q1} will fail to produce the results of Theorem 1, since many of the zero cross-product terms in the analysis of [2, 9] will fail to vanish.

As is clear from the above example, the key problem is that constraints requiring the ξ families for certain edges incident on each vertex of J(Q) to be identical can transitively ripple through the graph, forcing much larger sets of edges to share the same ξ family. We formalize this fact using the following notion of (transitive) ξ-equivalence among edges of a multi-query graph J(Q).

Definition 1 Two edges e1 and e2 in J(Q) are said to be ξ-equivalent if either (i) e1 and e2 are incident on a common vertex v, and Av(e1) = Av(e2); or (ii) there exists an edge e3 such that e1 and e3 are ξ-equivalent, and e2 and e3 are ξ-equivalent.

Intuitively, the classes of the ξ-equivalence relation represent exactly the sets of edges in the multi-query join graph J(Q) that need to share the same ξ family; that is, for any pair of ξ-equivalent edges e1 and e2, it is the case that ξ^{e1} = ξ^{e2}.

Since, for estimate correctness, we require that all the edges associated with a query have distinct and independent ξ families, our sketch-sharing algorithms only consider multi-query join graphs that are well-formed, as defined below.

Definition 2 A multi-query join graph J(Q) is well-formed iff, for every pair of ξ-equivalent edges e1 and e2 in J(Q), the queries containing e1 and e2 are distinct, i.e., Q(e1) ≠ Q(e2).

It is not hard to prove that the well-formedness condition described above is actually necessary and sufficient for individual sketch-based query estimates that are unbiased and obey the variance bounds of Theorem 1. Thus, our shared-sketch estimation process over well-formed multi-query graphs can readily apply the single-query results of [2, 9] for each individual query in our workload.
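The well-formedness test reduces to grouping edges into ξ-equivalence classes and checking that no class contains two edges of the same query. A minimal Python sketch using union–find (the data layout and names are illustrative, not the chapter's implementation):

```python
def well_formed(edges):
    """Check Definition 2 on a multi-query join graph.

    `edges` is a list of (edge_id, query_id, endpoints), where `endpoints`
    is a pair of (vertex, attribute) incidences, e.g.
    ("e1", "Q1", (("v1", "R1.A1"), ("v2", "R2.A1"))).
    Two edges are xi-equivalent when they (transitively) touch some vertex
    on the same attribute; well-formedness requires that no equivalence
    class contains two edges of the same query.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:          # path halving
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    # Edges sharing a (vertex, attribute) incidence must share a xi family.
    for eid, _, endpoints in edges:
        for incidence in endpoints:
            union(("edge", eid), ("slot",) + incidence)

    # No xi-equivalence class may contain two edges of the same query.
    seen = {}  # class root -> set of query ids
    for eid, qid, _ in edges:
        root = find(("edge", eid))
        if qid in seen.setdefault(root, set()):
            return False
        seen[root].add(qid)
    return True
```

On the graph of Fig. 3(b), all three edges fall into one equivalence class that contains both edges of Q1, so the check fails; on the graph of Fig. 3(a) it succeeds.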

3.2 Sketch Sharing Problem Formulation

Given a large workload Q of complex queries, there can obviously be a large number of well-formed join graphs for Q, and all of them can potentially be used to provide approximate sketch-based answers to queries in Q. At the same time, since the key resource constraint in a data-streaming environment is imposed by the amount of memory available to the query processor, our objective is to compute approximate answers to queries in Q that are as accurate as possible given a fixed amount of memory M for the sketch synopses. Thus, in the remainder of this chapter, we focus on the problem of computing (i) a well-formed join graph J(Q) for Q, and (ii) an allotment of the M units of space to the vertices of J(Q) (for maintaining iid copies of atomic sketches), such that an appropriate aggregate error metric (e.g., average or maximum error) for all queries in Q is minimized.

More formally, let mv denote the sketching space allocated to vertex v (i.e., the number of iid copies of Xv). Also, let MQ denote the number of iid copies built for the query estimate XQ. Since XQ = Π_{v∈V(Q)} Xv, it is easy to see that MQ is actually constrained by the minimum number of iid atomic sketches constructed for each of the nodes in V(Q); that is, MQ = min_{v∈V(Q)} {mv}. By Theorem 1, this implies that the (square) error for query Q is equal to WQ/MQ, where WQ = 8 Var[XQ] / E[XQ]² is a constant for each query Q (assuming a fixed confidence parameter δ). Our sketch-sharing optimization problem can then be formally stated as follows.

Problem Statement

Given a query workload Q = {Q1,...,Qq} and an amount of sketching memory M, compute a multi-query graph J(Q) and a space allotment {mv : for each node v in J(Q)} such that one of the following two error metrics is minimized:

• Average query error in Q = Σ_{Q∈Q} WQ/MQ,
• Maximum query error in Q = max_{Q∈Q} {WQ/MQ},

subject to the constraints: (i) J(Q) is well-formed; (ii) Σ_v mv ≤ M (i.e., the space constraint is satisfied); and (iii) for all vertices v in J(Q) and for all queries Q ∈ Q(v), MQ ≤ mv.

The above problem statement assumes that the “weight” WQ for each query Q ∈ Q is known. Clearly, if coarse statistics in the form of histograms for the stream relations are available (e.g., based on historical information or coarse a-priori knowledge of data distributions), then estimates for E[XQ] and Var[XQ] (and, consequently, WQ) can be obtained by estimating join and self-join sizes using these histograms [9]. In the event that no prior information is available, we can simply set each WQ = 1; unfortunately, even for this simple case, our optimization problem is intractable (see Sect. 3.3).

In the following subsection, we first consider the sub-problem of optimally allocating sketching space (such that query errors are minimized) to the vertices of a given, well-formed join graph J(Q). Subsequently, in Sect. 3.4, we consider the general optimization problem where we also seek to determine the best well-formed multi-query graph for the given workload Q. Since most of these questions turn out to be NP-hard, we propose novel heuristic algorithms for determining good solutions in practice. Our algorithm for the overall problem (Sect. 3.4) is actually an iterative procedure that uses the space-allocation algorithms of Sect. 3.3 as subroutines in the search for a good sketch-sharing plan.

3.3 Space Allocation Problem

In this subsection, we consider the problem of allocating space optimally, given a well-formed join graph J = J(Q), such that the average or maximum error is minimized.

Minimizing the Average Error

The problem of allocating space to sketches in a way that minimizes the average error turns out to be NP-hard even when WQ = 1. Given the intractability of the problem, we look for an approximation based on its continuous relaxation, i.e., we allow the MQ's and mv's to be continuous. The continuous version of the problem is a convex optimization problem, which can be solved exactly in polynomial time using, for example, interior point methods [22]. We can then show that a near-optimal integer solution is obtained by rounding down (to integers) the optimal continuous values of the MQ's and mv's.

Since standard methods for solving convex optimization problems tend to be slow in practice, we developed a novel specialized solution for the problem at hand.

Our solution, which we believe has applications to a much wider class of problems than the optimal space-allocation problem outlined here, is based on a novel usage of the Kuhn–Tucker optimality conditions (KT conditions). We rewrite the problem using the KT conditions, and then we solve the problem through repeated application of a specific Max-Flow formulation of the constraints. Due to space limitations, we omit a detailed description of the algorithm and the analysis of its correctness (see [8] for details). Our results are summarized in the following theorem:

Theorem 2 There is an algorithm that computes the optimal solution to the average-error continuous convex optimization problem in at most O(min{|Q|, |J|} · (|Q| + |J|)³) steps. Furthermore, rounding this optimal continuous solution results in an integer solution that is guaranteed to be within a factor of (1 + 2|J|/M) of the optimal integer solution.

Minimizing the Maximum Error

It can be easily shown (see [8] for details) that the problem of minimizing the maximum error can be solved in time O(|J| log |J|) by the following greedy algorithm: (i) take each mv proportional to max_{Q∈Q(v)} {WQ}; (ii) round down each mv component to the nearest integer; and (iii) take the remaining space s ≤ |J| and allocate one extra unit of space to each of the nodes with the s smallest values of the quantity mv / max_{Q∈Q(v)} {WQ}.
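A direct Python transcription of this greedy rule (illustrative; `nodes` is assumed to map each vertex v to a precomputed, strictly positive max_{Q∈Q(v)} WQ):

```python
def allocate_max_error(nodes, M):
    """Greedy space allotment minimizing the maximum error (Sect. 3.3).

    `nodes` maps each vertex v to W(v) = max over queries Q in Q(v) of W_Q;
    `M` is the total number of atomic-sketch copies available.
    Returns a dict of integer allotments m_v.
    """
    total_w = sum(nodes.values())
    # (i) m_v proportional to W(v), (ii) rounded down to an integer.
    m = {v: int(M * w / total_w) for v, w in nodes.items()}
    # (iii) hand the leftover units to the nodes with the smallest m_v / W(v).
    leftover = M - sum(m.values())
    for v in sorted(nodes, key=lambda v: m[v] / nodes[v])[:leftover]:
        m[v] += 1
    return m
```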

3.4 Computing a Well-Formed Join Graph

The optimization problem we are trying to solve is finding a well-formed graph J(Q) and the optimal space allocation to the vertices of J(Q) such that the average or maximum error is minimized. If we take WQ = 1 for all queries and minimize the maximum error, this optimization problem reduces to the problem of finding a well-formed join graph J(Q) with the minimum number of vertices; this problem is NP-hard (see [8] for the proof), which makes the initial optimization problem NP-hard as well.

In order to provide an acceptable solution in practice, we designed a greedy heuristic, which we call CoalesceJoinGraphs, for computing a well-formed join graph with small error. Algorithm CoalesceJoinGraphs iteratively merges the pair of vertices in J that causes the largest decrease in error, until the error cannot be reduced any further by coalescing vertices. At every step, it uses the algorithm that computes the average or maximum error for a join graph (Sect. 3.3) as a subroutine, which we denote by ComputeSpace. Also, in order to ensure that the graph J always stays well-formed, J is initially set to be equal to the set of all the individual join graphs for the queries in Q. In each subsequent iteration, only vertices for identical relations that have the same attribute sets and preserve the well-formedness of J are coalesced. Well-formedness testing essentially involves partitioning the edges of J into equivalence classes, each class consisting of ξ-equivalent edges, and then verifying that no equivalence class contains multiple edges from the same join query; it can be carried out very efficiently, in time proportional to the number of edges in J. Algorithm CoalesceJoinGraphs needs to make at most O(N³) calls to ComputeSpace, where N is the total number of vertices in all the join graphs J(Q) for the queries, and this determines its time complexity.
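One possible shape of this greedy loop in Python is sketched below (purely illustrative: it reuses the `well_formed` helper from the earlier sketch, and `compute_error` stands in for the ComputeSpace subroutine of Sect. 3.3; the authors' actual implementation may differ):

```python
def coalesce_join_graphs(vertices, edges, M, compute_error):
    """Greedy vertex coalescing in the spirit of CoalesceJoinGraphs.
    `vertices` maps a vertex id to (relation, frozenset of attributes);
    `edges` uses the list format of well_formed(); compute_error(vs, es, M)
    is assumed to return the aggregate error of the best M-unit space
    allotment for the given graph."""

    def merge(vs, es, keep, drop):
        # Coalesce `drop` into `keep`: remove the vertex, redirect incidences.
        vs = {u: spec for u, spec in vs.items() if u != drop}
        redirect = lambda inc: (keep, inc[1]) if inc[0] == drop else inc
        es = [(eid, q, (redirect(a), redirect(b))) for eid, q, (a, b) in es]
        return vs, es

    err = compute_error(vertices, edges, M)
    while True:
        best = None
        for v in vertices:
            for w in vertices:
                # Candidates: distinct vertices of the same relation with the
                # same attribute set whose merge keeps the graph well-formed.
                if v < w and vertices[v] == vertices[w]:
                    nvs, nes = merge(vertices, edges, v, w)
                    if well_formed(nes):
                        e = compute_error(nvs, nes, M)
                        if e < err and (best is None or e < best[0]):
                            best = (e, v, w)
        if best is None:
            return vertices, edges
        err, v, w = best
        vertices, edges = merge(vertices, edges, v, w)
```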

4 Improving Answer Quality: Sketch Partitioning

In the proof of Theorem 1, to ensure (with high probability) an upper bound of ε on the relative error of our estimate for a COUNT query Q with n equi-join predicates, we require sketching space proportional to Var[XQ], where Var[XQ] ≤ 2^{2n} Π_v SJv. (Recall from Sect. 2 that SJv is the self-join size of the relation R(v) corresponding to vertex v in J(Q).) Thus, an important practical concern for multi-join queries is that our upper bound on Var[XQ] and, therefore, the storage space required to guarantee a given level of accuracy increases explosively with the number of joins n in the query.

To deal with this problem, in this section, we propose novel sketch-partitioning techniques that exploit approximate statistics on the streams to decompose the sketching problem in a way that provably tightens our estimation guarantees. The basic idea is that, by intelligently partitioning the domain of the join attributes in the query and estimating portions of Q individually on each partition, we can significantly reduce the storage required to compute each query estimate XQ within a given level of accuracy. (Of course, our sketch-partitioning results are equally applicable to the dual optimization problem; that is, maximizing the estimation accuracy for a given amount of sketching space.)

The key observation we make is that, given a desired level of accuracy, the space required is proportional to the product of the self-join sizes of relations R1,...,Rr over the join attributes (Theorem 1). Further, in practice, join-attribute domains are frequently skewed and the skew is often concentrated in different regions for different attributes. As a consequence, we can exploit approximate knowledge of the data distribution(s) to intelligently partition the domains of (some subset of) the join attributes so that, for each resulting partition p of the combined attribute space, the product of the self-join sizes of the relations restricted to p is very small compared to the same product over the entire (un-partitioned) attribute space (i.e., Π_v SJv). Thus, letting X_{Q,p} denote the sketch-based estimate for the portion of Q that corresponds to partition p of the attribute space, we can expect the variance Var[X_{Q,p}] to be much smaller than Var[XQ].

Now, consider a scheme that averages over sp iid instances of the sketching estimate X_{Q,p} for partition p, and defines the overall estimate XQ for query Q as the sum of these averages over all partitions p. We can then show that E[XQ] = Q and Var[XQ] = Σ_p Var[X_{Q,p}] / sp.

Clearly, achieving small self-join sizes and variances Var[X_{Q,p}] for the attribute-space partitions p means that the total number of iid sketch instances Σ_p sp required to guarantee the prescribed accuracy level of our XQ estimator is also small. We formalize the above intuition in the following subsection and then present our sketch-partitioning results and algorithms for both single-query (Sect. 4.2) and multi-query (Sect. 4.3) environments.

4.1 Our General Technique

Consider once again the COUNT aggregate query Q from Sect. 2 with n equi-join constraints over relations R1,...,Rr. In general, our sketch-partitioning techniques partition the domain of each join attribute Ri.Aj into mj ≥ 1 disjoint subsets denoted by P_{j,1},...,P_{j,mj}. Further, the domains of a join-attribute pair Ri.Aj and Rk.Al are partitioned identically (note that dom(Ri.Aj) = dom(Rk.Al)). This partitioning on individual attributes induces a partitioning of the combined (multi-dimensional) join-attribute space, which we denote by P. Thus, mapping attribute Ri.Aj to dimension j, we get that P = {(P_{1,l1},...,P_{n,ln}) : 1 ≤ lj ≤ mj}. Each element p ∈ P identifies a unique partition of the global attribute space, and we represent by p[Ri.Aj] the partition of attribute Ri.Aj in p.

For each partition p ∈ P, we construct random variables X_{Q,p} that estimate Q on attribute values in partition p, in a manner similar to XQ in Sect. 2. Thus, for each partition p, we construct atomic sketches X_{v,p} for the vertices v of the join graph J(Q). More specifically, for each edge e = ⟨v,w⟩ in J(Q), we have an independent family of random variables ξ^{e,p} = {ξ^{e,p}_l : l ∈ p[Av(e)]}. Let e1,...,ek be the edges incident on v, and let fv(i1,...,ik) be the number of tuples t ∈ R(v) such that t[Av(ej)] = ij, for 1 ≤ j ≤ k. Then, the atomic sketch for partition p at v is

  X_{v,p} = Σ_{i1∈p[Av(e1)]} ··· Σ_{ik∈p[Av(ek)]} fv(i1,...,ik) · Π_{j=1}^{k} ξ^{ej,p}_{ij}.

Variable X_{Q,p} is then obtained as the product of the X_{v,p}'s over all the vertices, i.e., X_{Q,p} = Π_v X_{v,p}. It is easy to verify that E[X_{Q,p}] is equal to the number of tuples in the join result for partition p and thus, by linearity of expectation, E[Σ_p X_{Q,p}] = Σ_p E[X_{Q,p}] = Q. By independence across partitions, we have Var[Σ_p X_{Q,p}] = Σ_p Var[X_{Q,p}]. As in [2, 9], to reduce the variance of our partitioned estimator, we construct iid instances of each X_{Q,p}. However, since Var[X_{Q,p}] may differ widely across the partitions, we can obtain larger reductions in the overall variance by maintaining a larger number of copies for partitions with a higher variance. Let sp denote the number of iid copies of the variable X_{Q,p} maintained for partition p, and let XQ be computed as the sum of the averages of these sp copies of X_{Q,p} over all the partitions. Then, since averaging over iid copies does not alter the expectation, we get that E[XQ] = Q.

The success of our sketch-partitioning approach clearly hinges on being able to efficiently compute X_{v,p} for each (vertex, partition) pair as data tuples are streaming in. For every tuple t ∈ R(v) in the stream and for every partition p such that t lies in p, we add to X_{v,p} the quantity Π_{j=1}^{k} ξ^{ej,p}_{t[Av(ej)]}. (Note that a tuple t in R(v) typically carries only a subset of the join attributes, so it can belong to multiple partitions p.) Our sketch-partitioning techniques make the process of identifying the relevant partitions for a tuple very efficient by using the (approximate) stream statistics to group contiguous regions of values in the domain of each attribute Ri.Aj into a small number of coarse buckets (e.g., histogram statistics trivially give such a bucketization). Then, each of the mj partitions for attribute Ri.Aj comprises a subset of such buckets, and each bucket stores an identifier for its corresponding partition. Since the number of such buckets is typically small, given an incoming tuple t, the bucket containing t[Ri.Aj] (and, therefore, the relevant partition along Ri.Aj) can be determined very quickly (e.g., using binary or linear search). This allows us to very efficiently determine the relevant partitions p for streaming data tuples.
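As a small illustration (reusing the hypothetical XiFamily and rng from the earlier sketches; the bucket boundaries and the bucket-to-partition map are assumed inputs derived from the coarse histogram), routing a streaming R1 tuple to the atomic sketch of its partition for a single-join query might look as follows:

```python
import bisect
from collections import defaultdict

bounds      = [10, 20, 30, 40]   # bucket upper endpoints: [0,10), [10,20), [20,30), [30,40)
bucket_part = [0, 1, 0, 1]       # partition id assigned to each bucket

# One xi family and one atomic sketch per partition (the family is shared
# by the R1 and R2 sketches of that partition).
xi = {p: XiFamily(rng) for p in set(bucket_part)}
X1 = defaultdict(int)            # X_{1,p}: per-partition atomic sketch of R1

def update_R1(a1):
    """Add the contribution of an R1 tuple with join-attribute value a1
    (assumed to lie in [0, 40)) to the sketch of its partition."""
    p = bucket_part[bisect.bisect_right(bounds, a1)]
    X1[p] += xi[p](a1)
```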
The total storage required for the atomic sketches over all the partitions is O(Σ_p sp · Σ_{Ri.Aj} log |dom(Ri.Aj)|) to compute the final estimate XQ for query Q.³

Our sketch-partitioning approach still needs to address two very important issues: (i) selecting a good set of partitions P; and (ii) determining the number of iid copies sp of X_{Q,p} to be constructed for each partition p. Clearly, effectively addressing these issues is crucial to our final goal of minimizing the overall space allocated to the sketch while guaranteeing the required accuracy for XQ. Specifically, we aim to compute a partitioning P and a space allocation sp to each partition p such that the estimate XQ has at most ε relative error and Σ_{p∈P} sp is minimized.

Note that, by independence across partitions and the iid characteristics of individual atomic sketches, we have Var[XQ] = Σ_p Var[X_{Q,p}] / sp. Given an attribute-space partitioning P, the problem of choosing the optimal allocation of sp's that minimizes the overall sketch space while guaranteeing the required level of accuracy for XQ can be formulated as a concrete optimization problem. The following theorem describes how to compute such an optimal allocation.

Theorem 3 Consider a partitioning P of the join-attribute domains. Then, allocating space sp = 8 √Var[X_{Q,p}] · Σ_{p'∈P} √Var[X_{Q,p'}] / (ε² Q²) to each p ∈ P ensures that the estimate XQ has at most ε relative error, and Σ_p sp is minimum.

From the above theorem, it follows that, given a partitioning P, the optimal space allocation for a given level of accuracy requires a total sketch space of Σ_p sp = 8 (Σ_p √Var[X_{Q,p}])² / (ε² Q²). Obviously, this means that the optimal partitioning P with respect to minimizing the overall space requirements for our sketches is one that minimizes the sum Σ_p √Var[X_{Q,p}]. Thus, in the remainder of this section, we focus on techniques for computing such an optimal partitioning P; once P has been found, we use Theorem 3 to compute the optimal space allocation for each partition. We first consider the simpler case of a single query, and subsequently address multiple join queries in Sect. 4.3.
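For instance, a direct transcription of the allocation rule of Theorem 3 (illustrative; in practice the per-partition variances would be upper bounds estimated from the coarse stream statistics, and Q itself would only be known approximately):

```python
import math

def partition_space(variances, eps, Q):
    """Space allotment of Theorem 3: s_p proportional to sqrt(Var[X_{Q,p}]).
    `variances` maps partition p -> (an upper bound on) Var[X_{Q,p}];
    eps is the target relative error and Q the (estimated) query answer."""
    total_sqrt = sum(math.sqrt(v) for v in variances.values())
    return {p: 8 * math.sqrt(v) * total_sqrt / (eps ** 2 * Q ** 2)
            for p, v in variances.items()}
```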

³ For the sake of simplicity, we approximate the storage overhead for each ξ family for partition p by the constant O(Σ_{Ri.Aj} log |dom(Ri.Aj)|) instead of the more precise (and less pessimistic) O(Σ_{Ri.Aj} log |p[Ri.Aj]|).

4.2 Sketch-Partitioning for Single Query

We begin by describing our techniques for computing an effective partitioning P of the attribute space for the estimation of COUNT queries Q over single joins of the form R1 ⋈_{R1.A1=R2.A2} R2. Since we only consider a single join-attribute pair, for notational simplicity, we drop the relation prefixes and refer to the attributes simply as A1 and A2. Also, we ignore the single join edge in the join graph J(Q), and number the vertices corresponding to relations R1 and R2 in J(Q) by 1 and 2, respectively. Our partitioning algorithms rely on knowledge of approximate frequency statistics for attributes A1 and A2. Typically, such approximate statistics are available in the form of per-attribute histograms that split the underlying data domain dom(Aj) into a sequence of contiguous regions of values (termed buckets) and store some coarse aggregate statistics (e.g., number of tuples and number of distinct values) within each bucket.

Binary Sketch Partitioning

Consider the simple case of a binary partitioning P of dom(A1) into two subsets P1 and P2; that is, P = {P1, P2}. Let fk(i) denote the frequency of value i ∈ dom(A1) in relation Rk. For each vertex k (relation Rk), we associate with the (vertex, partition) pair (k, Pl) an atomic sketch X_{k,Pl} = Σ_{i∈Pl} fk(i) ξ^{Pl}_i, where l, k ∈ {1, 2}. We can now define X_{Q,Pl} = X_{1,Pl} X_{2,Pl} for l ∈ {1, 2}. Then, from Theorem 1, we get that the variance Var[X_{Q,Pl}] is bounded as follows:

  Var[X_{Q,Pl}] ≤ Σ_{i∈Pl} f1(i)² · Σ_{i∈Pl} f2(i)².   (1)

Theorem 3 tells us that the overall storage is proportional to (√Var[X_{Q,P1}] + √Var[X_{Q,P2}])². Thus, to minimize the total sketching space through partitioning, we need to find the partitioning P = {P1, P2} that minimizes √Var[X_{Q,P1}] + √Var[X_{Q,P2}]; that is, we aim to find a partitioning P that minimizes the function:

  F(P) = √( Σ_{i∈P1} f1(i)² · Σ_{i∈P1} f2(i)² ) + √( Σ_{i∈P2} f1(i)² · Σ_{i∈P2} f2(i)² ).   (2)

Clearly, a brute-force solution to this problem is extremely inefficient, as it requires O(2^{|dom(A1)|}) time (proportional to the number of all possible partitionings of dom(A1)). Fortunately, we can take advantage of the following classic theorem from the classification-tree literature [5] to design a much more efficient optimal algorithm.

Theorem 4 ([5]) Let Φ(x) be a concave function of x defined on some compact domain D. Let P̂ = {1,...,d}, d ≥ 2, and for all i ∈ P̂ let qi > 0 and ri be real numbers with values in D, not all equal. Then one of the partitions {P1, P2} of P̂ that minimizes

  Σ_{i∈P1} qi Φ( Σ_{i∈P1} qi ri / Σ_{i∈P1} qi ) + Σ_{i∈P2} qi Φ( Σ_{i∈P2} qi ri / Σ_{i∈P2} qi )

has the property that for all i1 ∈ P1 and i2 ∈ P2, r_{i1} < r_{i2}.

To see how Theorem 4 applies to our partitioning problem, let i ∈ dom(A1), and set ri = f1(i)² / f2(i)² and qi = f2(i)² / Σ_{j∈dom(A1)} f2(j)². Substituting in Eq. (2), we obtain

  F(P) = √( Σ_{i∈P1} f2(i)² ri · Σ_{i∈P1} f2(i)² ) + √( Σ_{i∈P2} f2(i)² ri · Σ_{i∈P2} f2(i)² )
       = Σ_{i∈dom(A1)} f2(i)² · [ √( Σ_{i∈P1} qi ri · Σ_{i∈P1} qi ) + √( Σ_{i∈P2} qi ri · Σ_{i∈P2} qi ) ]
       = Σ_{i∈dom(A1)} f2(i)² · [ Σ_{i∈P1} qi √( Σ_{i∈P1} qi ri / Σ_{i∈P1} qi ) + Σ_{i∈P2} qi √( Σ_{i∈P2} qi ri / Σ_{i∈P2} qi ) ].

Except for the constant factor Σ_{i∈dom(A1)} f2(i)² (which is always nonzero if R2 ≠ ∅), our objective function now has exactly the form prescribed in Theorem 4 with Φ(x) = √x. Since f1(i) ≥ 0 and f2(i) ≥ 0 for i ∈ dom(A1), we have ri ≥ 0, qi ≥ 0, and, for all Pl ⊆ dom(A1), Σ_{i∈Pl} qi ri / Σ_{i∈Pl} qi ≥ 0. So, all that remains to be shown is that √x is concave; since concavity is equivalent to a non-positive second derivative and (√x)'' = −(1/4) x^{−3/2} ≤ 0, Theorem 4 applies.

Applying Theorem 4 essentially reduces the search space for finding an optimal partitioning of dom(A1) from exponential to linear, since only partitionings in the order of increasing ri's need to be considered. Thus, our optimal binary-partitioning algorithm for minimizing F(P) simply orders the domain values in increasing order of the frequency ratios f1(i)/f2(i), and only considers partition boundaries between two consecutive values in that order; the partitioning with the smallest resulting value for F(P) gives the optimal solution.

Example 4 Consider the join R1 ⋈_{A1=A2} R2 of two relations R1 and R2 with dom(A1) = dom(A2) = {1, 2, 3, 4}. Also, let the frequencies fk(i) of domain values i for relations R1 and R2 be as follows:

  i      1   2   3   4
  f1(i)  20  5   10  2
  f2(i)  2   15  3   10

Without partitioning, the number of copies s of the sketch estimator XQ needed to ensure ε relative error is given by s = 8 Var[XQ] / (ε² Q²), where Var[XQ] ≤ 529 · 338 + 165² ≤ 206027 by Eq. (1). Now consider the binary partitioning P of dom(A1) into P1 = {1, 3} and P2 = {2, 4}. The total number of copies Σ_p sp of the sketch estimators X_{Q,p} for partitions P1 and P2 is Σ_p sp = 8 (√Var[X_{Q,P1}] + √Var[X_{Q,P2}])² / (ε² Q²) (by Theorem 3), where (√Var[X_{Q,P1}] + √Var[X_{Q,P2}])² ≤ (√6400 + √6400)² ≤ 25600. Thus, using this binary partitioning, the sketching space requirements are reduced by a factor of s / Σ_p sp = 206027/25600 ≈ 7.5.

Note that the partitioning P with P1 = {1, 3} and P2 = {2, 4} also minimizes the function F(P) defined in Eq. (2). Thus, our approximation algorithm based on Theorem 4 returns the above partitioning P. Essentially, since r1 = 20²/2² = 100, r2 = 5²/15² = 1/9, r3 = 10²/3² = 100/9, and r4 = 2²/10² = 1/25, only the three split points in the sequence 4, 2, 3, 1 of domain values (arranged in increasing order of ri) need to be considered. Of the three potential split points, the one between 2 and 3 results in the smallest value (177) for F(P).
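A short Python sketch of this split-point scan (assuming the square-root form of F(P) reconstructed in Eq. (2) and strictly positive f2 frequencies; the names are illustrative):

```python
import math

def optimal_binary_partition(f1, f2):
    """Linear scan for the optimal two-way split of dom(A1)."""
    order = sorted(f1, key=lambda i: f1[i] / f2[i])   # increasing frequency ratio

    def F(P1, P2):
        return sum(math.sqrt(sum(f1[i] ** 2 for i in P)
                             * sum(f2[i] ** 2 for i in P)) for P in (P1, P2))

    best = None
    for cut in range(1, len(order)):                  # split between consecutive
        P1, P2 = order[:cut], order[cut:]             # values in ratio order
        val = F(P1, P2)
        if best is None or val < best[0]:
            best = (val, P1, P2)
    return best

# Example 4 data: the best split is P1 = [4, 2], P2 = [3, 1] with F(P) ~ 177.
f1 = {1: 20, 2: 5, 3: 10, 4: 2}
f2 = {1: 2, 2: 15, 3: 3, 4: 10}
print(optimal_binary_partition(f1, f2))
```

On the Example 4 frequencies this returns the split {2, 4} | {1, 3} with F(P) ≈ 177, matching the discussion above.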

K-ary Sketch Partitioning

We now describe how to extend our earlier results to more general partitionings comprising m ≥ 2 domain partitions. By Theorem 3, we aim to find a partitioning P = {P1,...,Pm} of dom(A1) that minimizes √Var[X_{Q,P1}] + ··· + √Var[X_{Q,Pm}], where each Var[X_{Q,Pl}] is bounded as in Eq. (1). Once again, we substitute the variance bounds with the products of self-join sizes (Theorem 1); thus, we seek to find a partitioning P = {P1,...,Pm} that minimizes the function

  F(P) = √( Σ_{i∈P1} f1(i)² · Σ_{i∈P1} f2(i)² ) + ··· + √( Σ_{i∈Pm} f1(i)² · Σ_{i∈Pm} f2(i)² ).   (3)

A brute-force solution to minimizing F(P) requires an impractical O(m^{|dom(A1)|}) time. Fortunately, we have shown the following generalization of Theorem 4 that allows us to drastically reduce the problem search space and design a much more efficient algorithm.

Theorem 5 Consider the function Ψ(P1,...,Pm) = Σ_{l=1}^{m} Σ_{i∈Pl} qi Φ( Σ_{i∈Pl} qi ri / Σ_{i∈Pl} qi ), where Φ, qi and ri are defined as in Theorem 4 and {P1,...,Pm} is a partitioning of P̂ = {1,...,d}. Then, among the partitionings that minimize Ψ(P1,...,Pm), there is one partitioning {P1,...,Pm} with the following property π: for all l, l' ∈ {1,...,m} with l < l', and for all i ∈ Pl and i' ∈ Pl', ri < r_{i'}.

As described in Sect. 4.2, our objective function F(P) can be expressed as Σ_{i∈dom(A1)} f2(i)² · Ψ(P1,...,Pm), where Φ(x) = √x, ri = f1(i)²/f2(i)², and qi = f2(i)² / Σ_{j∈dom(A1)} f2(j)²; thus, minimizing F({P1,...,Pm}) is equivalent to minimizing Ψ(P1,...,Pm). By Theorem 5, to find the optimal partitioning for Ψ, all we have to do is consider an arrangement of the elements i in P̂ = {1,...,d} in the order of increasing ri's, and find m − 1 split points in this sequence such that Ψ for the resulting m partitions is as small as possible. The optimum m − 1 split points can be efficiently found using dynamic programming, as follows. Without loss of generality, assume that 1,...,d is the sequence of elements of P̂ in increasing order of ri. For 1 ≤ u ≤ d and 1 ≤ v ≤ m, let ψ(u,v) be the value of Ψ for the optimal partitioning of the elements 1,...,u (in order of increasing ri) into v parts. The equations describing our dynamic-programming algorithm are:

  ψ(u,1) = Σ_{i=1}^{u} qi Φ( Σ_{i=1}^{u} qi ri / Σ_{i=1}^{u} qi ),
  ψ(u,v) = min_{1≤j<u} { ψ(j, v−1) + Σ_{i=j+1}^{u} qi Φ( Σ_{i=j+1}^{u} qi ri / Σ_{i=j+1}^{u} qi ) },   v > 1.

The correctness of our algorithm is based on the linearity of Ψ. Also, let p(u,v) be the index of the last element in partition v − 1 of the optimal partitioning of 1,...,u into v parts (so that the last partition consists of the elements between p(u,v) + 1 and u). Then, p(u,1) = 0 and, for v > 1, p(u,v) = arg min_{1≤j<u} { ψ(j, v−1) + Σ_{i=j+1}^{u} qi Φ( Σ_{i=j+1}^{u} qi ri / Σ_{i=j+1}^{u} qi ) }; the optimal m-way partitioning can then be recovered by following these split indices backwards, starting from p(d,m).
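A compact Python sketch of this dynamic program (illustrative; it assumes the square-root objective of Eq. (3), nonzero f2 frequencies, and m at most the domain size, and it recovers the partitions by backtracking through the split indices):

```python
import math

def optimal_m_partition(f1, f2, m):
    """m-way domain split of Sect. 4.2 via dynamic programming."""
    # Order domain values by increasing r_i = f1(i)^2 / f2(i)^2.
    order = sorted(f1, key=lambda i: (f1[i] / f2[i]) ** 2)
    d = len(order)

    def cost(lo, hi):  # sqrt( sum f1^2 * sum f2^2 ) over order[lo:hi]
        s1 = sum(f1[i] ** 2 for i in order[lo:hi])
        s2 = sum(f2[i] ** 2 for i in order[lo:hi])
        return math.sqrt(s1 * s2)

    INF = float("inf")
    # psi[u][v]: best objective for the first u values split into v parts.
    psi = [[INF] * (m + 1) for _ in range(d + 1)]
    split = [[0] * (m + 1) for _ in range(d + 1)]
    psi[0][0] = 0.0
    for u in range(1, d + 1):
        for v in range(1, min(u, m) + 1):
            for j in range(v - 1, u):
                c = psi[j][v - 1] + cost(j, u)
                if c < psi[u][v]:
                    psi[u][v], split[u][v] = c, j
    # Recover the partitions by following the split pointers backwards.
    parts, u = [], d
    for v in range(m, 0, -1):
        j = split[u][v]
        parts.append(order[j:u])
        u = j
    return list(reversed(parts))
```

For the Example 4 frequencies with m = 2, this returns [[4, 2], [3, 1]], i.e., the partitions {2, 4} and {1, 3} found above.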

Sketch-Partitioning for Multi-Join Queries

When queries contain two or more joins, unfortunately, the problem of computing an optimal partitioning becomes intractable [9]. However, if attribute-value distributions within each relation are independent, then we can show that a simple approach that combines the optimal (local) partitions for each individual join attribute yields an optimal partitioning for the (global) multi-join attribute space. Thus, under this assumption, we can use the dynamic-programming algorithm described above to compute the mj optimal partitions P_{j,1},...,P_{j,mj} for each join attribute Ri.Aj, and P = {(P_{1,l1},...,P_{n,ln}) : 1 ≤ lj ≤ mj} is then our desired optimal partitioning of the global attribute space.

4.3 Sketch-Partitioning for Multiple Join Queries

In the previous subsections, we described how sketch partitioning can be used to reduce sketching space requirements for a single multi-join query. We now study how we can incorporate these sketch-partitioning ideas into the sketch-sharing techniques (from Sect. 3) to improve space utilization and answer accuracy when processing a collection Q of queries.

Suppose that we could compute a single optimal partitioning P of the global join-attribute space for the queries in Q using a variant of our earlier proposed dynamic-programming algorithm. Then, we can simply use our sketch-sharing techniques to estimate the partial result Qp for each (query, partition) pair (Q, p), where Q ∈ Q and p ∈ P. These estimates Qp for the various partitions can then be summed up to derive the final estimate for query Q. Note that computing query estimates for a partition p simply involves running our algorithms with attribute domains restricted to lie within the partition p.

A major challenge here is to derive a single partitioning P that would be optimal for all the queries in Q. The reason this is a non-trivial problem is that a join attribute Ri.Aj may be paired with different attributes in the various queries, and each query may induce a very different partitioning depending on the attributes being paired. One possibility here is to compute multiple partitionings for Ri.Aj (one per query that Ri.Aj appears in) and then try to massage them so that they become identical. There may be other approaches as well, and we intend to explore these further as part of future work.

5 Conclusions

In this chapter, we considered the problem of approximately answering multiple general aggregate SQL queries over continuous data streams with limited memory. Our approach is based on computing small “sketch” summaries of the streams that are used to provide approximate answers to complex multi-join aggregate queries with provable approximation guarantees. When multiple queries need to be concurrently processed, sketch sharing—the technique we proposed to reuse space between queries—can reduce the amount of memory required. We provided necessary and sufficient conditions for sketch sharing to result in correct estimation, and provided a greedy algorithm to solve the provably NP-hard problem of optimally sharing the sketches and allocating the available space. Furthermore, since the degradation of the approximation quality due to the high variance of our randomized sketch synopses may be a concern in practical situations, we developed novel sketch-partitioning techniques. Our proposed methods take advantage of existing statistical information on the stream to intelligently partition the domain of the underlying attribute(s) and, thus, decompose the sketching problem in a way that provably tightens the approximation guarantees.

References

1. S. Acharya, P.B. Gibbons, V. Poosala, S. Ramaswamy, Join synopses for approximate query answering, in Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia, Pennsylvania (1999), pp. 275–286
2. N. Alon, P.B. Gibbons, Y. Matias, M. Szegedy, Tracking join and self-join sizes in limited storage, in Proceedings of the Eighteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, Philadelphia, Pennsylvania (1999)
3. N. Alon, Y. Matias, M. Szegedy, The space complexity of approximating the frequency moments, in Proceedings of the 28th Annual ACM Symposium on the Theory of Computing, Philadelphia, Pennsylvania (1996), pp. 20–29
4. S. Babu, J. Widom, Continuous queries over data streams. ACM SIGMOD Rec. 30(3), 109–120 (2001)
5. L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, Classification and Regression Trees (Chapman & Hall, London, 1984)
6. K. Chakrabarti, M. Garofalakis, R. Rastogi, K. Shim, Approximate query processing using wavelets, in Proceedings of the 26th International Conference on Very Large Data Bases, Cairo, Egypt (2000), pp. 111–122
7. M. Datar, A. Gionis, P. Indyk, R. Motwani, Maintaining stream statistics over sliding windows, in Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms, San Francisco, California (2002)
8. A. Dobra, M. Garofalakis, J. Gehrke, R. Rastogi, Sketch-based multi-query processing over data streams. Manuscript available at www.cise.ufl.edu/~adobra/papers/sketch-mqo.pdf
9. A. Dobra, M. Garofalakis, J. Gehrke, R. Rastogi, Processing complex aggregate queries over data streams, in Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, Madison, Wisconsin (2002), pp. 61–72
10. P. Domingos, G. Hulten, Mining high-speed data streams, in Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, Massachusetts (2000), pp. 71–80
11. M. Garofalakis, P.B. Gibbons, Approximate query processing: taming the terabytes, in 27th Intl. Conf. on Very Large Data Bases, Rome, Italy (2001). Tutorial
12. J. Gehrke, F. Korn, D. Srivastava, On computing correlated aggregates over continual data streams, in Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, Santa Barbara, California (2001)
13. P.B. Gibbons, Y. Matias, V. Poosala, Fast incremental maintenance of approximate histograms, in Proceedings of the 23rd International Conference on Very Large Data Bases, Athens, Greece (1997), pp. 466–475
14. A.C. Gilbert, Y. Kotidis, S. Muthukrishnan, M.J. Strauss, Surfing wavelets on streams: one-pass summaries for approximate aggregate queries, in Proceedings of the 27th International Conference on Very Large Data Bases, Rome, Italy (2000)
15. M. Greenwald, S. Khanna, Space-efficient online computation of quantile summaries, in Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, Santa Barbara, California (2001)
16. S. Guha, N. Koudas, K. Shim, Data streams and histograms, in Proceedings of the 2001 ACM Symposium on Theory of Computing (STOC), Crete, Greece (2001)
17. S. Guha, N. Mishra, R. Motwani, L. O'Callaghan, Clustering data streams, in Proceedings of the 2000 Annual Symposium on Foundations of Computer Science (FOCS) (2000)
18. P.J. Haas, J.M. Hellerstein, Ripple joins for online aggregation, in Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia, Pennsylvania (1999), pp. 287–298
19. Y.E. Ioannidis, V. Poosala, Histogram-based approximation of set-valued query answers, in Proceedings of the 25th International Conference on Very Large Data Bases, Edinburgh, Scotland (1999)
20. G. Manku, S. Rajagopalan, B. Lindsay, Random sampling techniques for space efficient online computation of order statistics of large datasets, in Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia, Pennsylvania (1999)
21. Y. Matias, J.S. Vitter, M. Wang, Dynamic maintenance of wavelet-based histograms, in Proceedings of the 26th International Conference on Very Large Data Bases, Cairo, Egypt (2000)
22. S.M. Stefanov, Separable Programming. Applied Optimization, vol. 53 (Kluwer Academic, Norwell, 2001)
23. J. Vitter, Random sampling with a reservoir. ACM Trans. Math. Softw. 11(1), 37–57 (1985)
24. J.S. Vitter, M. Wang, Approximate computation of multidimensional aggregates of sparse data using wavelets, in Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia, Pennsylvania (1999)

Approximate Histogram and Wavelet Summaries of Streaming Data

S. Muthukrishnan and Martin Strauss

1 Introduction

We study a synopsis abstract data structure similar to an array abstract data type commonly seen in textbooks. Specifically, we model an array A of N real values, indexed by $\{i : 0 \le i < N\}$.


equivalent from the perspective of point queries, but are different under range queries. Whether an array is considered to be single- or multi-dimensional also matters to the approximate implementations we introduce below.

We will study approximate versions of the abstract data structure above. The response to a point query i to instance A is not A[i] but R[i], where R is some approximate representation for A. The error of this answer is R[i] − A[i]. We report the quality of a representation as $\sum_i |R[i] - A[i]|^2$, written $\|R - A\|^2$, where the sum is over all queries. An approximate data structure can have any internal storage format, but must lead to a vector R of length N—the vector of answers that it gives.

Traditionally, the advantage of implementing the approximate abstract data type instead of the ideal abstract data type is that one can store an approximate representation in much less space than the full data set. That is, the approximate data structure is a summary. We will be interested in two related types of sparse summaries, histograms and Haar wavelets.

Histograms A B-bucket histogram of length N is a partition of $[0, N)$ into intervals $[b_0, b_1) \cup [b_1, b_2) \cup \cdots \cup [b_{B-1}, b_B)$, where $b_0 = 0$ and $b_B = N$, together with a collection of B heights $h_j$, for $0 \le j < B$. The answer to a point query i is the height $h_j$ of the bucket $[b_j, b_{j+1})$ that contains i.
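As a concrete illustration of this definition, the following minimal Python sketch (ours, not from the chapter; all names are hypothetical) answers point queries against a B-bucket histogram by binary search over the bucket boundaries.

```python
import bisect

class BucketHistogram:
    """A B-bucket histogram over [0, N): boundaries b_0=0 < b_1 < ... < b_B=N
    and heights h_0..h_{B-1}. A point query i returns the height of the bucket
    containing i. (Illustrative only; names are ours.)"""
    def __init__(self, boundaries, heights):
        assert len(boundaries) == len(heights) + 1
        self.b, self.h = boundaries, heights

    def point_query(self, i):
        # Find j with b_j <= i < b_{j+1} by binary search over the boundaries.
        j = bisect.bisect_right(self.b, i) - 1
        return self.h[j]

H = BucketHistogram([0, 3, 5, 8], [1.0, 5.0, 9.0])
print(H.point_query(4))   # -> 5.0
```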

Haar Wavelets We now define Haar wavelets concisely. First, we assume that N is a power of 2. This restriction is easily overcome in our algorithms. A useful term is the support of a vector v: the support of v, denoted supp(v), is $\{i : v[i] \ne 0\}$. A dyadic subinterval of $[0, N)$ is, intuitively, either $[0, N)$ itself or, recursively, a dyadic subinterval of $[0, N/2)$ or of $[N/2, N)$. Formally, it is an interval of the form $[k2^j, (k+1)2^j)$, where j and k must be integers.

A Haar wavelet vector ψ of length N is either the vector $\frac{1}{\sqrt{N}}\chi_{[0,N)}$ or $-\frac{1}{\sqrt{|L\cup R|}}\chi_L + \frac{1}{\sqrt{|L\cup R|}}\chi_R$, where L and R are the left and right halves of a dyadic interval, respectively. (Note that $\frac{1}{\sqrt{N}}\chi_{[0,N)}$ is anomalous; it doesn't fit the pattern of all other Haar wavelet vectors.) We typically drop the term "Haar." There are N wavelet vectors in all and they form an orthonormal basis, i.e., if ψ and ϕ are wavelet vectors, then $\langle\psi,\varphi\rangle$ is 1 if ψ = ϕ and $\langle\psi,\varphi\rangle$ is zero, otherwise. The four coarsest Haar wavelets (i.e., those with largest support) are pictured in Fig. 1.

Fig. 1 The four coarsest Haar wavelets with normalization modified for visibility

A wavelet term is the index j of a wavelet vector $\psi_j$ together with a (wavelet) coefficient $c_j$. A B-term wavelet representation is a collection of B wavelet terms, whose answer vector is $\sum_{j\in\Lambda} c_j\psi_j$, where $|\Lambda| = B$. In building a B-term wavelet representation, we want to choose a set Λ of B wavelet indices and coefficients $c_j$ for $j \in \Lambda$ that tend to minimize the sum square error $\|A - \sum_{j\in\Lambda} c_j\psi_j\|^2$.

Note that any signal is exactly represented by an N-bucket histogram, where each bucket has size 1. Similarly, any signal is exactly represented by an N-term wavelet representation. For signal A, define $c_j = \langle A, \psi_j\rangle$. Then one can show that $A = \sum_j c_j\psi_j$, where the sum is over all the wavelet indexes. We therefore refer to $c_j$ as the jth wavelet coefficient of A. One can also show (Parseval's equality) that $\|A\|^2 = \sum_j c_j^2$. It follows that, for any set Λ of wavelet indices, $\|A\|^2 = \|A - \sum_{j\in\Lambda} c_j\psi_j\|^2 + \sum_{j\in\Lambda} |c_j|^2$. Since $\|A\|$ is constant, our goal of minimizing $\|A - \sum_{j\in\Lambda} c_j\psi_j\|^2$ is achieved by choosing Λ to maximize $\sum_{j\in\Lambda} |c_j|^2$, provided the $c_j$'s are optimally chosen to be the corresponding wavelet coefficients of A. Similarly, if $\psi = \frac{1}{\sqrt{|I|}}\chi_I$ and $c = \langle A, \psi\rangle$, then the error $\|A - c\psi\|^2$ is $\|A\|^2 - c^2$. The problem we address is as follows:

Problem Fix signal A of length N. Fix integer B and positive reals ε and δ. Find a B-bucket histogram (or B-term wavelet representation, respectively) R such that, except with probability δ,
$$\|A - R\|^2 \le (1 + \epsilon)\,\|A - R_{\mathrm{opt}}\|^2,$$
where $R_{\mathrm{opt}}$ is the optimal B-bucket histogram (or B-term wavelet representation, respectively).

Occasionally we will also discuss approximation in the size of the summary. That is, we will produce a histogram with B' > B buckets and compare its error with the best B-bucket histogram. Here we will give an upper bound for B'. If B' is guaranteed to be at most αB, then we denote the overall guarantee as an (α, 1 + ε)-approximation. A similar approach can be used for wavelet representations, too. The resources we use will depend quantitatively on N, B, ε, δ, and M, where M is a bound on the largest data item in A; we assume the data items are integers (generally, they are integer multiples of some smallest resolvable quantity). Several different types of resources can be measured, depending on the nature of the data. We will consider two different ways that data may be presented to our algorithm.

Ordered Aggregate Data

The data are presented in order, as A[0], A[1], A[2], ..., A[N − 1]. After the last data item is presented, the algorithm outputs a histogram or wavelet representation and halts. Here the goal is to bound the total time and total space of the algorithm. We present a deterministic construction that, for constants $c_1$ and $c_2$, consumes total time bounded by $c_1 N + (B\log(N)\log(M)/\epsilon)^{c_2}$. For typical values of the parameters, the first term dominates; that is, the algorithm takes time linear in N, independent of (small changes in) the other parameters. The space is $(B\log(N)\log(M)/\epsilon)^{c_2}$, which is much less than N for typical settings of the parameters.

Dynamic Data

This is motivated by streaming data. Here we process an arbitrary sequence of updates of the form "add x to A[i]" and "rebuild summary." The time to rebuild the histogram and the total space are both at most proportional to $(B\log(N)\log(M)\log(1/\delta)/\epsilon)^{c_2}$. In many applications, the time to process an update must be much less than the time to build a histogram. Indeed there are algorithms that make update processing much faster without significant effect on the time to build a histogram and the total space. In this article, however, we give a construction whose update-processing time meets only the same general bound of $(B\log(N)\log(M)\log(1/\delta)/\epsilon)^{c_2}$ as the time to build a histogram.

In what follows, we present techniques for various summaries. The techniques presented earlier will not only be useful in their own right for building summaries, but also be used as building blocks for more advanced techniques that are described later.

2 Wavelet Summaries for Ordered Aggregate Data

We are given the data in an ordered aggregate stream, A[0], A[1], A[2],...,A[N − 1]. Our algorithm will consist of several stages. The first stage takes in the input stream and outputs a stream of all N wavelet terms. The second stage takes a stream of all wavelet terms and outputs just the B terms with largest coefficients. Approximate Histogram and Wavelet Summaries of Streaming Data 267

Fig. 2 Computation of all Haar wavelet coefficients by a binary tree (right) of basic filters (left), each of which computes the sum and difference of its inputs. When the second item (counting from zero) is read, only the O(log(N)) basic filters in the path of 2, indicated by a double box, need to be instantiated

2.1 Online Wavelet Transform

Our goal is to consume a stream of data and output a stream of all wavelet coefficients. A basic filter reads in two elements at a time. It then outputs the sum and the difference of these two elements. Now, form an overall filter Φ by cascading N − 1 basic filters in a binary tree. The sum output of each basic filter is the input of another basic filter, and the difference output of each basic filter is sent, after normalization, to an overall output stream of Φ. The sum output of the basic filter at the root of the tree also contributes to the output stream of Φ (see Fig. 2). Each basic filter does O(1) work per input and uses O(1) space. There are N − 1 basic filters in Φ, so the total time used by Φ is O(N). Finally, note that only O(log(N)) basic filters need to be instantiated at any time, so Φ can be implemented using space O(log(N)); see Fig. 2. It is straightforward to check that Φ computes exactly the set of wavelet coefficients. (In fact, the output of Φ is often taken as the definition of the wavelet coefficients, particularly for wavelets more general than Haar wavelets, which are defined solely in terms of the two or more outputs of the basic filters.)
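The cascade of basic filters can be realized compactly by buffering at most one un-normalized sum per level of the tree. The Python sketch below is our own illustration of Sect. 2.1 (names and output format are ours, not the authors'); it yields every normalized difference, i.e., every wavelet coefficient, and finally the coefficient of the anomalous constant wavelet, using O(log N) working state.

```python
import math

def streaming_haar(stream):
    """One pass over A[0..N-1] (N a power of 2), yielding all N Haar wavelet
    coefficients; 'pending' plays the role of the cascade of basic filters,
    holding at most one un-normalized sum per level. Sign convention follows
    the wavelet definition above (right half of the support is positive)."""
    pending = {}              # level -> buffered sum awaiting its right sibling
    n = 0
    for x in stream:
        n += 1
        level, value = 0, float(x)
        while level in pending:
            left = pending.pop(level)
            width = 2 ** (level + 1)               # support size of this wavelet
            yield (value - left) / math.sqrt(width)
            value = left + value                   # un-normalized sum moves up
            level += 1
        pending[level] = value
    yield pending[max(pending)] / math.sqrt(n)     # anomalous constant wavelet

print(list(streaming_haar([1, 2, 3, 4])))
# -> [0.7071..., 0.7071..., 2.0, 5.0]; squares sum to 30 = ||A||^2 (Parseval)
```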

2.2 Finding the Largest Terms

Given a sequence of wavelet terms, we want to keep just the B terms having largest coefficient absolute values. This can be done as follows. Create a buffer of size 4B, and read the first B items into it. Now we maintain the invariant that our buffer has between B and 3B items, including the B largest items seen so far. We proceed as follows. Fill the buffer, by reading a number of items between B and 3B. Next, ensure that the items all have unequal rank, e.g., by ordering them primarily by coefficient size and secondarily by position in the buffer. Then find an approximate median—an item whose rank is between the 25th and 75th percentile. (This can be done deterministically in time O(B) (see [2]), or probabilistically by choosing a few random elements and testing them against the entire buffer.) Finally, discard buffer elements smaller than the approximate median, restoring the invariant. The total time spent on this operation is O(B) and it consumes Ω(B) items, so the time cost is O(1) per item, as desired.
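A minimal Python sketch of this buffer-based selection follows (ours; for brevity it prunes with an exact sort rather than the O(B) approximate-median step described above, which changes the constant factors but not the idea).

```python
def top_b_terms(terms, B):
    """Keep the B wavelet terms with largest |coefficient| from a stream of
    (index, coefficient) pairs using a buffer of O(B) entries, in the spirit
    of Sect. 2.2 (our code, hypothetical names)."""
    buf = []
    for term in terms:
        buf.append(term)
        if len(buf) >= 4 * B:
            # Prune back to the 2B largest, preserving the invariant that the
            # buffer always contains the B largest coefficients seen so far.
            buf.sort(key=lambda t: abs(t[1]), reverse=True)
            del buf[2 * B:]
    buf.sort(key=lambda t: abs(t[1]), reverse=True)
    return buf[:B]

print(top_b_terms(enumerate([0.3, -5.0, 1.2, 0.1, 4.0]), 2))  # [(1, -5.0), (4, 4.0)]
```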

2.3 Analysis

It is clear by construction that the overall algorithm meets our claims. In fact, a more careful analysis shows that the worst-case work upon reading any item is at most O(log(N)) for the first phase and O(B) for the second phase. By using buffers of size O(B + log(N)), one can arrange that the worst-case time per item is at most O(1). This entire algorithm first appeared in [5] and is also discussed and used in [7].

3 Offline Histogram Algorithms

In this section, we show how to find optimal histograms in full space. We consider histograms in one dimension that are optimal for point queries.

3.1 A Quadratic Time Exact Algorithm

The data is A. First, we build a structure for A from which, given ℓ and r, we can get, in time O(1), the height and error of an optimal 1-bucket histogram for A on the interval [ℓ, r). This can be done in time (and space) O(N) as follows: Stream over the data and build the prefix arrays P(A) of A and P(A²) of A², where
$$P(A)[j] = \sum_{0 \le i < j} A[i] \quad\text{and}\quad P(A^2)[j] = \sum_{0 \le i < j} A^2[i], \qquad 0 \le j \le N.$$
From these, the optimal height for a single bucket on [ℓ, r) is the average of A over [ℓ, r), and its sum-square error can be computed in O(1) as
$$\bigl(P(A^2)[r] - P(A^2)[\ell]\bigr) - \frac{(P(A)[r] - P(A)[\ell])^2}{r - \ell}.$$
A dynamic program over prefixes then combines these single-bucket errors: for each prefix [0, m) and each number of buckets ℓ, the best error is found by trying every position for the boundary of the last bucket.

This procedure works because we can decompose the error on an overall histogram into contributions from histogram fragments, in which the error attributed to a fragment depends only on the data values local to that fragment. One can substitute other error measures for sum square error in the overall algorithm. This dynamic program explicitly appears in [4].
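The sketch below (our own code, with hypothetical names) spells out this dynamic program: prefix arrays give the O(1) single-bucket error, and err[b][m] records the best error of a b-bucket histogram on the prefix A[0:m]. The total time is O(N²B).

```python
def optimal_histogram_error(A, B):
    """Exact dynamic program for the best B-bucket histogram of A under
    sum-square error, in the spirit of Sect. 3.1 (cf. [4])."""
    N = len(A)
    P, PS = [0.0] * (N + 1), [0.0] * (N + 1)     # prefix sums of A and A^2
    for i, x in enumerate(A):
        P[i + 1] = P[i] + x
        PS[i + 1] = PS[i] + x * x

    def one_bucket_error(l, r):
        # Optimal single height on [l, r) is the mean; this is its square error.
        s, ss = P[r] - P[l], PS[r] - PS[l]
        return ss - s * s / (r - l)

    INF = float("inf")
    err = [[INF] * (N + 1) for _ in range(B + 1)]
    err[0][0] = 0.0
    for b in range(1, B + 1):
        for m in range(b, N + 1):
            # The last bucket is [j, m); the rest is an optimal (b-1)-bucket
            # histogram on the prefix [0, j).
            err[b][m] = min(err[b - 1][j] + one_bucket_error(j, m)
                            for j in range(b - 1, m))
    return err

A = [1, 1, 1, 5, 5, 9, 9, 9]
print(optimal_histogram_error(A, 3)[3][len(A)])   # -> 0.0 (three flat pieces)
```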

3.2 A Linear Time Approximation Algorithm

If B ≪ N, one can get a much faster approximate result. Instead of asking, "for each ℓ and m, what is the error k of the best ℓ-bucket histogram on [0, m)?" we ask, "for each ℓ and each k, what is the largest m such that there is an ℓ-bucket histogram on [0, m) with error approximately k?" Computationally, this formulation is better because we can find m by binary search in log(N) iterations instead of exhaustively in N iterations. We will need to define more precisely what is meant by "error approximately k," and we will have to show that we can restrict the allowable values for k to a small set (of size O(B/ε)). As above, we first construct prefix arrays for A and A². Construction takes time O(N) and allows us to find, in constant time, the best height and error of a 1-bucket histogram over any queried subinterval. If $E_{\mathrm{opt}}$ denotes the error of an optimal histogram, assume that we know an E with $E_{\mathrm{opt}} \le E \le 2E_{\mathrm{opt}}$.

Then the error on $[m^*, m+1)$ is the error on $[0, m+1)$ less the error on $[0, m^*)$, which is at most $\eta - (i^*-1)\delta$. Our dynamic programming algorithm will have considered (ℓ−1)-bucket histograms on $[0, M_{i^*,\ell-1})$. By definition of $M_{i^*,\ell-1}$, there is no (ℓ−1)-bucket histogram on $[0, M_{i^*,\ell-1}+1)$ with error at most $i^*\delta$. Since there is an (ℓ−1)-bucket histogram on $[0, m^*)$ with error at most $i^*\delta$, it follows that $M_{i^*,\ell-1} \ge m^*$. If our algorithm attempted to extend the (ℓ−1)-bucket histogram on $[0, M_{i^*,\ell-1})$ by the bucket $[M_{i^*,\ell-1}, m+1) \subseteq [m^*, m+1)$, the error of the new bucket would be at most $\eta - (i^*-1)\delta$. Since the algorithm rejected m+1 in favor of m, we must have $\eta - (i^*-1)\delta > (k - i^* + 1)\delta$, or $\eta > k\delta$, as desired.

Finally, when we have computed all $M_{i,\ell}$, we can find the least i such that $M_{i+1,B} = N$ while $M_{i,B} < N$ (so that $E_{\mathrm{opt}} > i\delta$); since $M_{i+1,B} = N$, we have found a B-bucket histogram on [0, N) with error at most $((i+1)+B)\delta \le E_{\mathrm{opt}} + (B+1)\delta$. If we make $\delta \le \frac{\epsilon}{B+1}E_{\mathrm{opt}}$ (which we can, e.g., by taking $\delta = \frac{\epsilon}{2(B+1)}E$), then our error will be at most $(1+\epsilon)E_{\mathrm{opt}}$.

Note that, other than producing the prefix arrays P(A) and P(A²), the algorithm runs in time polynomial in log log(M), B, log(N), and 1/ε. Similarly, other than producing the prefix arrays, the algorithm requires space polynomial in B and 1/ε to store a B-bucket histogram for each of the O(B/ε) possible values of k. That is, this algorithm meets our desired time bound of $c_1 N + (B\log(N)\log(M)/\epsilon)^{c_2}$ but not our desired space bound for dynamic data situations. The algorithm above is somewhat reminiscent of those in [8, 9], where the authors use it in the ordered aggregate mode.

4 Basic Histograms from Wavelet Representations

In this section, we discuss several relationships between histograms and wavelet representations. We give an (O(log(N)), 1)-approximation algorithm, a (1, 9)-approximation algorithm, and a (1, 1 + ε)-approximation algorithm.

4.1 Wavelet Representations as Histograms

First note that any histogram element $\chi_I$ on interval I can be expressed as a 2 log(N)-term wavelet representation. This is because $\langle\chi_I, \psi\rangle = 0$ unless supp(ψ) intersects an endpoint of I. Each endpoint of I is in the support of log(N) + 1 wavelet vectors, but the two wavelet vectors with support [0, N) are double counted, so only 2 log(N) terms remain. (See Fig. 3.)

On the other hand, any wavelet term is itself a 4-bucket histogram (i.e., a histogram with 3 boundaries), so a B-term wavelet representation can be viewed as a (3B + 1)-bucket histogram.

It follows that the best 2 log(N)B-term wavelet representation R is at least as good as the best B-bucket histogram. Also, R can be regarded as a (6B log(N) + 1)-bucket histogram. It follows that we can find an (O(log(N)), 1)-approximation to the best B-bucket histogram in linear time using the previously discussed algorithm for computing the wavelet summary.

Fig. 3 The dot product between a wavelet and a histogram is zero unless the support of the wavelet intersects a boundary of the histogram

4.2 Wavelet Representations as Intermediate Summaries

Next, we consider the following algorithm: Given A, let R be the best 2 log(N)B-term wavelet representation to A, as above. Thus, as above, $\|A - R\|^2 \le \|A - H_{\mathrm{opt}}\|^2$, where $H_{\mathrm{opt}}$ is the optimal B-bucket histogram for A. Next, let H be the best B-bucket histogram to R. How good is H as a summary of A? By the triangle inequality and by optimality of H for R, we have

$$\|A - H\| \le \|A - R\| + \|R - H\| \le \|A - R\| + \|R - H_{\mathrm{opt}}\| \le \|A - R\| + \|R - A\| + \|A - H_{\mathrm{opt}}\| \le 3\,\|A - H_{\mathrm{opt}}\|,$$

whence $\|A - H\|^2 \le 9\,\|A - H_{\mathrm{opt}}\|^2$. Thus we have a (1, 9)-approximation algorithm to the best B-bucket histogram.

Now consider the cost to build H. It is easy to check that the boundaries of H can be chosen from among the boundaries of R; this is proved in [7]. One can modify the algorithm of Sect. 3.2 by building, in time linear in B log(N), a structure that will answer, in time O(1), the values $\sum_{\ell\le i<r} R[i]$ and $\sum_{\ell\le i<r} R[i]^2$ for any candidate bucket $[\ell, r)$.

4.3 Robust Representations

In Sect. 4.2, we used the triangle inequality to bound $\|A - H\|$ by $\|A - R\| + \|R - H\|$. This is not tight in the sense that $\|A - R\| + \|R - H\|$ might actually be as big as $3\,\|A - H\|$. In this section, we will define and construct a robust representation $R_{\mathrm{rob}}$ with the property that
$$\|A - H\|^2 \approx \|A - R_{\mathrm{rob}}\|^2 + \|R_{\mathrm{rob}} - H\|^2, \qquad \|A - H_{\mathrm{opt}}\|^2 \approx \|A - R_{\mathrm{rob}}\|^2 + \|R_{\mathrm{rob}} - H_{\mathrm{opt}}\|^2,$$
where, as above, H is optimal for $R_{\mathrm{rob}}$ and $H_{\mathrm{opt}}$ is optimal for A. This will allow us to compare $\|A - H\|$ with $\|A - H_{\mathrm{opt}}\|$, as above, but without giving up anything to the triangle inequality. (Instead, we give up a lot less in the approximation to the Pythagorean equality.) The following theorem relates a wavelet representation to a histogram in a more subtle manner.

Theorem 1 For some $T \le O(\frac{1}{\epsilon^2}\log\frac{1}{\epsilon})$ and any signal A, let $R_{\mathrm{rob}}$ be the representation formed by greedily taking 2B log(N) wavelet terms at a time until either

• we get 2TB log(N) terms, or
• taking an additional 2B log(N) wavelet terms improves (reduces) the residual $\|A - R_{\mathrm{rob}}\|_2^2$ by a factor no better than $(1 - \epsilon')$, where $\epsilon' \ge \Omega(\epsilon^2)$.

Let H be the best B-bucket representation to $R_{\mathrm{rob}}$. Then H is a (1, 1 + ε)-approximation to the best B-bucket histogram to A.

Proof Suppose $R_{\mathrm{rob}}$ has 2TB log(N) terms. Then
$$\|A - R_{\mathrm{rob}}\|_2^2 \le (1-\epsilon')^{T-1}\,\big\|A - R^{(B\log(N))}\big\|_2^2,$$
where $R^{(B\log(N))}$ is the best representation of B log(N) terms. By choice of $T = 1 + (1/\epsilon')\log(25/\epsilon^2)$, we have $(1-\epsilon')^{T-1} \approx \epsilon^2/25$ and, by previous observations, $\|A - R^{(B\log(N))}\| \le \|A - H_{\mathrm{opt}}\|$, and
$$\|A - H\|_2 \le \|A - H_{\mathrm{opt}}\|_2 + 2\|A - R_{\mathrm{rob}}\|_2 \le \|A - H_{\mathrm{opt}}\|_2 + 2(1-\epsilon')^{(T-1)/2}\,\big\|A - R^{(B\log(N))}\big\|_2 \le (1 + 2\epsilon/5)\,\|A - H_{\mathrm{opt}}\|_2,$$
so that $\|A - H\|_2^2 \le (1+\epsilon)\,\|A - H_{\mathrm{opt}}\|_2^2$ for sufficiently small ε.

Now suppose $R_{\mathrm{rob}}$ has fewer than 2TB log(N) terms, so that $R_{\mathrm{rob}}$ together with any additional 2B log(N) terms has error at least $(1-\epsilon')\,\|A - R_{\mathrm{rob}}\|_2^2$. It follows that the 2B log(N) terms in H or $H_{\mathrm{opt}}$ do not improve the square error of $R_{\mathrm{rob}}$ by more than the factor $(1-\epsilon')$. Let H', regarded as a histogram or wavelet representation, be the best linear combination of $R_{\mathrm{rob}}$ and H; our hypothesis is that $\|A - H'\|^2 \ge (1-\epsilon')\,\|A - R_{\mathrm{rob}}\|^2 = \|A - R_{\mathrm{rob}}\|^2 - \epsilon'\,\|A - R_{\mathrm{rob}}\|^2$. Note that, since H' is the best representation on a specified line (containing $R_{\mathrm{rob}}$ and H), it follows that there is a right angle at H', so that, as in Fig. 4, we have

$$\|R_{\mathrm{rob}} - H'\|^2 = \|R_{\mathrm{rob}} - A\|^2 - \|A - H'\|^2 \le \epsilon'\,\|A - R_{\mathrm{rob}}\|^2 \le (\epsilon^2/9)\,\|A - R_{\mathrm{rob}}\|^2.$$
That is, $R_{\mathrm{rob}}$ is much closer to H' than either $R_{\mathrm{rob}}$ or H' is to A. Then, as in Fig. 4, angle A-H'-H and angle A-H'-$H_{\mathrm{opt}}$ are right angles and so angle A-$R_{\mathrm{rob}}$-H and angle

Fig. 4 Illustration of histograms in Theorem 1. On the left, by optimality of H, there is a right angle as indicated. On the right, since $R_{\mathrm{rob}} \approx H'$, there are two near right angles at $R_{\mathrm{rob}}$. Since H is no farther from $R_{\mathrm{rob}}$ than $H_{\mathrm{opt}}$ is, we conclude H is almost as close to A as $H_{\mathrm{opt}}$ is. In fact, because the angles at $R_{\mathrm{rob}}$ are not quite right angles, we can only conclude that $\|A - H\| \le (1+\epsilon)\,\|A - H_{\mathrm{opt}}\|$

A-$R_{\mathrm{rob}}$-$H_{\mathrm{opt}}$ are near right angles. By comparing triangles, noting that H is no farther from $R_{\mathrm{rob}}$ than $H_{\mathrm{opt}}$ is, we conclude H is almost as close to A as $H_{\mathrm{opt}}$ is. That is, $\|A - H\| \le (1+\epsilon)\,\|A - H_{\mathrm{opt}}\|$. □

In the ordered aggregate data model, the properties we have proved above immediately lead to an algorithm. We analyze the cost of constructing $R_{\mathrm{rob}}$. While making a single pass over the ordered aggregate data, in parallel, we find the biggest 2TB log(N) terms and we compute $\|A\|^2$. Both tasks can be done in time O(N) and space O(TB log(N)). Next we build $R_{\mathrm{rob}}$ from our collection of 2TB log(N) wavelet terms in the straightforward way. We will need to track $\|A - R_{\mathrm{rob}}\|^2$ as $R_{\mathrm{rob}}$ changes, but this can be done using the formula $\|A - R_{\mathrm{rob}}\|^2 = \|A\|^2 - \|R_{\mathrm{rob}}\|^2$. Thus we obtain a (1+ε)-approximation to the histogram in the ordered aggregate model with space as desired. The cost to produce a histogram from a wavelet representation is polynomial in B, log(N), and 1/ε. This result appears in [7].

5 Histograms and Wavelets with Dynamic Data

In this section, we briefly consider data that is subject to dynamic update. We assume that we have a synopsis data structure, called an "L2 count sketch," parametrized by N and η, that efficiently supports the following operations on a(n unordered) set S of size N:

• UPDATES of the form "add x to a count C[i] for item i ∈ S";
• Unparametrized "FIND" queries. The answer is a list containing all j where $C[j]^2 \ge \eta\sum_i C[i]^2$;
• "COUNT" queries about item i, for which the answer is $\tilde{C}[i]$ such that $|\tilde{C}[i] - C[i]|^2 \le \eta\,\|C\|^2$.

There are randomized implementations of L2 count sketches [3, 6] that need space $(\log(N)/\eta)^{O(1)}$ and time $(\log(N)/\eta)^{O(1)}$ for either query. In particular, a FIND query can only return $(\log(N)/\eta)^{O(1)}$ values. (There is also a mild but necessary dependence on the range M of counts and on the failure probability of this randomized object. We do not discuss that detail further.)
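One concrete way to realize this interface is a CountSketch-style structure (a technique we name explicitly; the chapter only assumes the abstract interface, and the cited implementations [3, 6] differ in details). The sketch below is ours: COUNT uses the usual median-of-rows estimate, and FIND is implemented naively against a caller-supplied candidate set rather than with the sublinear search machinery of the real implementations.

```python
import random, statistics

class L2CountSketch:
    """Illustrative CountSketch-style structure for the UPDATE/COUNT/FIND
    interface of Sect. 5. With width w = O(1/eta) per row and a few
    independent rows, a COUNT answer errs by about sqrt(eta) * ||C||_2."""
    def __init__(self, width, depth=5, seed=0):
        self.w, self.d, self.seed = width, depth, seed
        self.rows = [[0.0] * width for _ in range(depth)]

    def _cells(self, i):
        # Bucket and +/-1 sign of item i in each row, re-derived on demand
        # from a seeded generator (deterministic for integer inputs).
        for k in range(self.d):
            r = random.Random(hash((self.seed, k, i)))
            yield k, r.randrange(self.w), 1 if r.random() < 0.5 else -1

    def update(self, i, x):          # "add x to C[i]"
        for k, b, s in self._cells(i):
            self.rows[k][b] += s * x

    def count(self, i):              # COUNT: estimate of C[i]
        return statistics.median(s * self.rows[k][b] for k, b, s in self._cells(i))

    def find(self, candidates, eta):
        # Naive FIND: test a candidate set against eta * sum_i C[i]^2, where
        # the sum of squares is itself estimated from one row's squared entries.
        f2 = sum(v * v for v in self.rows[0])
        return [i for i in candidates if self.count(i) ** 2 >= eta * f2]
```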

Given access to an L2 count sketch data structure with parameter η that depends polynomially on $B\log(N)/\epsilon$, we now show how to modify the previous algorithms to produce near-optimal histogram and wavelet representations.

Recall that a signal is equivalently specified by its full set of O(N) wavelet coefficients. Given input of the form "add x to A[i]," we form O(log(N)) additive updates to the wavelet coefficients of A. We then use the L2 count sketch on the wavelet coefficients. That is how we process updates. To build a summary in our algorithms above, we need to find the top few wavelet terms. Using the L2 count sketch, we proceed as follows:

• Do a FIND query to find one or more wavelet indices j with potentially large coefficients.
• Do COUNT queries to get an approximation $\tilde{c}_j$ to the coefficient(s).
• Do an update of $-\tilde{c}_j\psi_j$; that is, subtract $\tilde{c}_j$ from the count of wavelet index j.

In general, we need to repeat this process several times to get enough wavelet terms (O(TB log(N)) iterations will certainly suffice, though fewer iterations will also work). If we find a coefficient approximation $\tilde{c}_j$ whose square is relatively large, after we subtract it off, other coefficients (as well as the residual coefficient $c_j - \tilde{c}_j$) will appear relatively larger—and, therefore, FIND and COUNT will be more accurate—since the sum of squares of coefficients has shrunk.

Using just the L2 count sketch, we cannot get exact values for coefficients and therefore we cannot compare terms whose coefficients are close in magnitude, and we cannot choose the indices with largest coefficients. Nevertheless, one can show that the error in coefficients is dominated by the unavoidable error $\|A - R_{\mathrm{opt}}\|$ and we will only choose a suboptimal index j if it displaces another index j' whose coefficient $c_{j'}$ is only marginally better than $c_j$. It follows that we can still build an $R_{\mathrm{rob}}$ with the property that

• taking an additional O(B log(N)) wavelet terms and changing all the coefficients to their ideal values reduces the residual $\|A - R_{\mathrm{rob}}\|_2^2$ by a factor no better than $(1 - \epsilon')$, where $\epsilon' \ge \Omega(\epsilon^2)$.

This property suffices for the rest of the constructions—we can either take the top B terms for a wavelet representation or take the best B-bucket histogram to $R_{\mathrm{rob}}$. The cost in space, update time, and query time for building a representation this way is as claimed. This follows from the cost guarantees assumed about the L2 count sketch. What we have outlined is the procedure in [6].

6 Generalized Histograms

In this section, we discuss generalizing histograms slightly. First, we consider representations that are piecewise linear instead of piecewise constant. Next, we show how to find piecewise-constant histograms that are optimal for range queries (defined below) instead of point queries. Finally, we will discuss multidimensional histograms.

Fig. 5 Linear multiwavelets. The four coarsest linear multiwavelets are graphed, the basic four-by-four filter is given as a matrix, and the discrete versions of the coarsest eight linear multiwavelets are tabulated as rows. It is easy to see from symmetry that the linear multiwavelets form an orthonormal basis after normalization

6.1 Piecewise-Linear Representations

A B-piece piecewise-linear representation of a vector on [0, N) consists of a partition of [0, N) into B intervals and, for each interval, a height $h_j$ and a slope $s_j$. Thus a histogram is a piecewise-linear representation in which all slopes are zero. The answer vector associated with a piecewise-linear representation is as follows: to answer a query at position i, find the bucket j such that $b_j \le i < b_{j+1}$ and return the value at position i of that bucket's linear function (determined by $h_j$ and $s_j$). Analogously to the Haar case, such representations can be built from an orthonormal basis of linear multiwavelets (see Fig. 5), which consists of the following types of vectors:

• An anomalous linear multiwavelet vector $\chi_{[0,N)}$. (This is called a constant. The constant is also the one anomalous Haar wavelet vector.)
• An anomalous linear multiwavelet vector that takes the value $i - (N-1)/2$ at position i. (That is, the positive-slope normalized linear function of i that is orthogonal to $\chi_{[0,N)}$. This is called a slope.)
• On each dyadic interval $I = L \cup R$ of size at least 4, consisting of left half L and right half R with midpoint m, there is a linear multiwavelet that is a linear function on each of L and R and has even symmetry with respect to m. By symmetry, this linear multiwavelet is orthogonal to the slopes; the height is chosen so that it is orthogonal to the constants, too. We call this a vee.
• On each dyadic interval I as above, find a vector that is linear on each of L and R and is orthogonal to the constants, slopes, and the vee on I. This one is called an odd. (It has odd symmetry about the midpoint of I. It is the only vector of the four that is disconnected in the interior of its support.)

We now confirm that the above algorithms can be modified in a straightforward way to produce piecewise-linear representations. We need to check each of the following easy facts.

• Each data index is in O(log(N)) dyadic intervals of length at least 4, so is in the support of O(log(N)) linear multiwavelets.
• Each multiwavelet vector is a piecewise-linear vector of O(1) pieces.
• Each element of a piecewise-linear representation (i.e., for each bucket j, the vector that results from setting to zero all heights and slopes other than the jth height and slope) is the linear combination of the O(log(N)) linear multiwavelets whose support intersects an endpoint of the bucket.
• The transform that takes N data points to the N linear multiwavelet coefficients is expressible as a tree of basic filters, as in the case of Haar wavelets. For linear multiwavelets, the basic filters have four inputs and four outputs and act according to the matrix in Fig. 5. The vee and odd are sent to the overall output and the constant and slope become input to another basic filter. At the leaf level of the tree, we need special 2-input, 2-output filters that take individual data items as inputs and compute a constant and a slope as input to the 4-input, 4-output basic filters.
• We can build the best B-bucket piecewise-linear representation H to a sparse linear multiwavelet representation $R_{\mathrm{rob}}$ using the technique of Sect. 3.2; the time and space are comparable.

Finally, we note that these techniques can be used more directly to produce linear multiwavelet representations. While these are less common than Haar wavelets, histograms, or piecewise-linear representations, they are a natural variant in our context.

6.2 Range Queries

To this point, we have discussed only building histograms that are nearly optimal for point queries. We now consider range queries. That is, at the user level, a query is a pair of points (ℓ, r) and the ideal answer is $A[\ell, r) = \sum_{\ell\le i<r} A[i]$. Range queries over A reduce to point queries over the prefix array P(A): $A[\ell, r) = P(A)[r] - P(A)[\ell]$, so we build our summaries for P(A) rather than for A itself. An update "add x to A[i]" adds x to P(A)[j] for every position j > i, i.e., to the function $\chi_{[i+1,N+1)}$ (scaled by x). In turn, this causes an update to several linear multiwavelets, but only those whose support intersects an endpoint of [i+1, N+1], i.e., the O(log(N)) linear multiwavelets whose support contains i+1. The rest of the algorithms are unchanged. It follows that we can get a

1. (2, 1)-approximation algorithm in polynomial time and full space using the techniques of Sect. 3.1;
2. (2, 1+ε)-approximation algorithm for ordered aggregate data in space $(B\log(N)\log(M)/\epsilon)^{c_2}$ and time $c_1 N + (B\log(N)\log(M)/\epsilon)^{c_2}$ using the techniques of Sect. 4.3; and a
3. (2, 1+ε)-approximation algorithm for dynamic data in time and space $(B\log(N)\log(M)/\epsilon)^{c_2}$ using the techniques of Sect. 5.

Finally, note that, because the resulting 2B-bucket histogram has alternate buckets of size 1, we only need to store B boundaries and 2B heights. Arguably, the storage requirement is only 1.5 times that of a general B-bucket histogram.

We now describe briefly a (1, 1+ε)-approximation in polynomial time and full space using an offline algorithm. First we build a robust representation $R_{\mathrm{rob}}$ for P(A) in linear time, as above. Then we want to find the best piecewise-constant histogram for range queries to $R_{\mathrm{rob}}$. We now show how to do this, modifying the techniques of Sect. 3.2. The dynamic program of Sect. 3.2 asked, "What is the greatest prefix [0, m) that has a representation with ℓ buckets and error budget approximately k (in units of δ)?" Knowing the answer for a given ℓ and all k lets us answer the question for ℓ+1 and all k. The computational cost depends on the range and precision needed for the error budget; in Sect. 3.2, the error could be at most $2E_{\mathrm{opt}}$ and units of $\delta \ge \Omega(\epsilon E_{\mathrm{opt}}/B)$ suffice. To generalize for range queries, we will need to track an additional quantity, the sum suffix error of a partial histogram H on [0, m).

6.3 Multidimensional Histograms

Now we generalize the basic problem to multi-, say two-, dimensional arrays. The first issue that arises is what is meant by a two-dimensional histogram summary. One can think of partitions into rectangles, squares and other shapes, or covers (i.e., allow cells to overlap), in which case we have a variety of ways to interpret the approximation in overlapped areas. Histograms with overlapping cells are discussed in [14], where the authors present polylogarithmic space algorithms, albeit with $\Omega(N^2)$ time. Here, we focus on partition-based histograms. Among these, there are a number of variations depending upon how the partition is done. The natural and general class is one of arbitrary partitioning into B axis-parallel rectangles. Such two (and higher) dimensional problems are known to be NP-hard, but, in two dimensions, a (O(1), 1)-approximation can be computed in polynomial time and space [10]. For example, the authors in [10] give an algorithm to produce the best hierarchical histogram. A hierarchical histogram is one with a hierarchical bucketing, i.e., in which the bucketing is obtained by repeatedly partitioning one of the existing buckets into two pieces. Since any bucketing of B buckets in two

Fig. 6 A 5-bucket non-hierarchical partition refined (by dashed line) into a 6-bucket hierarchical partition histogram. The five hierarchical cuts may be made in the order indicated

dimensions can be refined to a hierarchical bucketing of 4B buckets [1], a best 4B-bucket hierarchical histogram is at least as good as the best B-bucket general histogram, so we get a (4, 1)-approximation to the general problem; see Fig. 6.

In the streaming context, consider an N × N integer-valued signal A. One can solve the B-bucket, (4, 1+ε)-approximate two-dimensional histogram problem using $(B\log\|A\|\log(N)/\epsilon)^{O(1)}$ space, per-item time, and time per histogram query. For static data, for constants $c_1$ and $c_2$, we can (4, 1+ε)-approximate the best B-bucket histogram in time $c_1 N^2 + (B\log\|A\|\log(N)/\epsilon)^{c_2}$ and space $(B\log\|A\|\log(N)/\epsilon)^{c_2}$, making one pass over the data, with the order we specify to lay out A. Both of the results above are achieved by using techniques that have been developed in the context of one-dimensional histograms [6, 7], combining them with techniques known for offline multidimensional histogramming [10], and extending them. The technical details are in [12].

References

1. F. d'Amore, P.G. Franciosa, On the optimal binary plane partition for sets of isothetic rectangles. Inf. Process. Lett. 44(5), 255–259 (1992)
2. M. Blum, R. Floyd, V. Pratt, R. Rivest, R. Tarjan, Time bounds for selection. J. Comput. Syst. Sci. 7, 448–461 (1972)
3. G. Cormode, S. Muthukrishnan, An improved data stream summary: the count-min sketch and its applications, in LATIN (2004), pp. 29–38
4. H. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. Sevcik, T. Suel, Optimal histograms with quality guarantees, in Proc. of the 1998 Intl. Conf. on Very Large Data Bases (VLDB) (1998), pp. 275–286
5. A. Gilbert, Y. Kotidis, S. Muthukrishnan, M. Strauss, Surfing wavelets on streams: one-pass summaries for approximate aggregate queries, in Proc. of the 2001 Intl. Conf. on Very Large Data Bases (VLDB) (2001), pp. 79–88
6. A. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan, M. Strauss, Fast, small-space algorithms for approximate histogram maintenance, in Proc. STOC (2002), pp. 389–398
7. S. Guha, P. Indyk, S. Muthukrishnan, M. Strauss, Histogramming data streams with fast per-item processing, in Proc. ICALP (2002), pp. 681–692
8. S. Guha, N. Koudas, K. Shim, Data-streams and histograms, in Proc. of the 2001 Annual ACM Symp. on Theory of Computing (STOC) (2001), pp. 471–475
9. S. Guha, N. Koudas, Approximating a data stream for querying and estimation: algorithms and performance evaluation, in Proc. of the 2002 Intl. Conf. on Data Engineering (ICDE) (2002), pp. 567–576
10. S. Muthukrishnan, V. Poosala, T. Suel, On rectangular partitionings in two dimensions: algorithms, complexity, and applications, in Proc. ICDT (1999), pp. 236–256
11. S. Muthukrishnan, M. Strauss, Rangesum histograms, in Proc. ACM-SIAM SODA (2003), pp. 233–242
12. S. Muthukrishnan, M. Strauss, Maintenance of multidimensional histograms, in Proc. FSTTCS (2003), pp. 352–362
13. G. Strang, V. Strela, Orthogonal multiwavelets with vanishing moments, in Proc. SPIE, ed. by H.H. Szu. Wavelet Applications, vol. 2242 (1994), pp. 2–9
14. N. Thaper, S. Guha, P. Indyk, N. Koudas, Dynamic multidimensional histograms, in Proc. ACM SIGMOD Conference (2002), pp. 428–439

Stable Distributions in Streaming Computations

Graham Cormode and Piotr Indyk

1 Introduction

In many streaming scenarios, we need to measure and quantify the data that is seen. For example, we may want to measure the number of distinct IP addresses seen over the course of a day, compute the difference between incoming and outgoing transactions in a database system, or measure the overall activity in a sensor network. More generally, we may want to cluster readings taken over periods of time or in different places to find patterns, or find the most similar signal from those previously observed to a new observation. For these measurements and comparisons to be meaningful, they must be well-defined. Here, we will use the well-known and widely used Lp norms. These encompass the familiar Euclidean (root of sum of squares) and Manhattan (sum of absolute values) norms.

In the examples mentioned above—IP traffic, database relations and so on—the data can be modeled as a vector. For example, a vector representing IP traffic grouped by destination address can be thought of as a vector of length $2^{32}$, where the ith entry in the vector corresponds to the amount of traffic to address i. For traffic between (source, destination) pairs, a vector of length $2^{64}$ is defined. The number of distinct addresses seen in a stream corresponds to the number of non-zero entries in a vector of counts; the difference in traffic between two time periods, grouped by address, corresponds to an appropriate computation on the vector formed by subtracting two vectors, and so on. As is usual in streaming, we assume that the domain


of the data and the size of the data are too massive to permit the direct computation of the functions of interest—which are otherwise mostly straightforward—and instead, we must use an amount of storage that is much smaller than the size of the data. For the remainder of this chapter, we put our description in terms of vectors, with the understanding that this is an abstraction of problems coming from a wide variety of sources. Throughout, we shall use a and b to denote vectors. The dimension of a vector (number of entries) is denoted as |a|.

Definition 1 The $L_p$ norm (for $0 < p$) of a vector a of dimension n is
$$\|a\|_p = \left(\sum_{i=1}^n \big|a[i]\big|^p\right)^{1/p}.$$

The $L_0$ norm is defined as
$$\|a\|_0 = \sum_{i=1}^n \big|a[i]\big|^0 = \big|\{i : a[i] \ne 0\}\big|,$$
where $0^0$ is taken to be 0.

These are vector norms in the standard sense: the result is non-negative, and zero only when the vector is zero; and the norm of the sum of vectors is at most the sum of their norms. For p > 0, the $L_p$ norm guarantees that $\|ka\|_p = |k|\,\|a\|_p$ for any scalar k. This does not hold for p = 0, so $L_0$ is not a norm in the strict sense. These norms immediately allow the measurement of the difference between vectors, by finding the norm of the (component-wise) difference between them. To be precise,

Definition 2 The $L_p$ distance between vectors a and b of dimension n is the $L_p$ norm of their difference,
$$\|a - b\|_p = \left(\sum_{i=1}^n \big|a[i] - b[i]\big|^p\right)^{1/p}.$$
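For reference, here is an exact (non-streaming) computation of these quantities in Python; it is only our own baseline against which the sketch-based estimates of the following sections can be checked.

```python
def lp_norm(a, p):
    """Exact L_p norm, following Definitions 1 and 2. For p = 0 we count the
    non-zero entries (with 0**0 taken to be 0)."""
    if p == 0:
        return sum(1 for x in a if x != 0)
    return sum(abs(x) ** p for x in a) ** (1.0 / p)

def lp_distance(a, b, p):
    """L_p distance: the L_p norm of the component-wise difference."""
    return lp_norm([x - y for x, y in zip(a, b)], p)

# Hamming, Manhattan and Euclidean distances between two small vectors.
a, b = [1, 0, 3, 5], [1, 2, 3, 4]
print(lp_distance(a, b, 0), lp_distance(a, b, 1), lp_distance(a, b, 2))
# -> 2  3.0  2.236...
```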

The Lp distance encompasses three very commonly used distance measures:

• Euclidean distance, given by the L2 distance, is the root of the sum of the squares of the differences of corresponding entries.
• The Manhattan distance, given by the L1 distance, is the sum of the absolute differences.
• The Hamming distance, given by the L0 distance, is the number of non-zero differences.

In this chapter, we will show how all three of these distances can be estimated for massive vectors presented in the streaming model. This is achieved by making succinct sketches of the data, which can be used as synopses of the vectors they summarize. This is described in Sect. 2. In Sect. 3, we discuss some applications of these results, to the distinct elements problem, and to computing with objects that can't be modeled as simple vectors. Lastly, we discuss related work and new directions in Sects. 4 and 5.

2 Building Sketches Using Stable Distributions

2.1 Data Stream Model

We assume a very general, abstracted model of data streams where our input arrives as a stream of updates to process. We consider vectors a, b, ..., which are presented in an implicit, incremental fashion. Each vector has dimension n, and its current state at time t is $a(t) = [a(t)[1], a(t)[2], \ldots, a(t)[n]]$. For convenience, we shall usually drop t and refer only to the current state of the vector. Initially, a is the zero vector 0, so a(0)[i] is 0 for all i. Updates to individual entries of the vector are presented as a stream of pairs. The tth update is $(i_t, c_t)$, meaning that

$$a(t)[i_t] = a(t-1)[i_t] + c_t, \qquad a(t)[j] = a(t-1)[j] \quad\text{for } j \ne i_t.$$

For the most part, we expect the data to arrive in no particular order, since it is unrealistic to expect it to be sorted on any attribute. We also assume that each index can appear many times over in the stream of updates. In some cases, the $c_t$'s will be strictly positive, meaning that entries only increase; in other cases, the $c_t$'s are allowed to be negative also. The former is known as the cash register case and the latter the turnstile case [35]. Here, we assume the more general case, that the data arrives unordered and each index can be updated multiple times within the stream.

2.2 Stable Distributions

Stable distributions are a class of statistical distributions with properties that allow them to be used in finding $L_p$ norms. This allows us to solve many problems of interest on data streams. A stable distribution is characterized by four parameters (following [38]), as follows:

• The stability parameter $\alpha \in (0, 2]$,
• The skewness parameter $\beta \in [-1, 1]$,
• The scale parameter $\gamma > 0$,
• The shift parameter δ.

Although there are different parameterizations of stable distributions, we shall fix values of β,γ and δ for this discussion. This has the effect that the different parameterization systems all coincide. We set β = 0, which makes the distribution symmetric about its mode. Setting γ = 1 and δ = 0 puts the mode of the distribution at 0 and gives a canonical distribution. Formally then, the distributions we consider are symmetric and strictly stable, but we shall simply refer to them as stable.

Definition 3 A (strictly) stable distribution is a statistical distribution with parameter α in the range (0, 2]. For any three independent random variables X, Y, Z drawn from such a distribution, and for scalars a, b, the combination aX + bY is distributed as $(|a|^\alpha + |b|^\alpha)^{1/\alpha} Z$.

These are called stable distributions because the underlying distribution remains stable as instances are summed: the sum of stable distributions (with the same α) remains stable. This can be thought of as a generalization of the central limit theorem, which states that the sum of distributions (with finite variance) will tend to a Gaussian distribution. Note that, apart from α = 2, stable distributions have unbounded variance. Several well-known distributions are known to be stable. The Gaussian (normal) distribution, with density $f(x) = \frac{e^{-x^2/2}}{\sqrt{2\pi}}$, is strictly stable with α = 2. The Cauchy distribution, with density $f(x) = \frac{1}{\pi(1+x^2)}$, is strictly stable with α = 1. For all values of α ≤ 2, stable distributions can be simulated by using appropriate transformations from uniform distributions, as we will show later.

2.3 Sketch Construction

By applying the above definition iteratively, we find that

Corollary 1 Given random variables $X_1, X_2, \ldots, X_n$ independently and identically distributed as X, a strictly stable distribution with stability parameter α = p, and a vector a, then $S = a[1]X_1 + a[2]X_2 + \cdots + a[n]X_n$ is distributed as $\|a\|_p X$.

From this corollary, we get the intuition for why stable distributions are helpful in computing $L_p$ norms and $L_p$ distances: by maintaining the inner product of variables from α-stable distributions with a vector being presented in the stream, we get a variable S which is distributed as a stable distribution scaled by the $L_p$ norm of the stream, where p = α. Maintaining this inner product as the vector undergoes updates is straightforward: given an update (i, c), we simply add $X_i \cdot c$ to S. However, what we have so far is an equality in distribution; what we are aiming for is an equality in value. To solve this problem, we proceed as follows.

Let med(X) denote the median of X, i.e., a value M such that Pr[X > M] = 1/2. Then, for any s > 0, we have $\mathrm{med}(s\cdot|X|) = s\cdot\mathrm{med}(|X|)$. In our case,

Fig. 1 Plot of empirically found median of stable distributions varying stability parameter p in the range 0 to 2

$\mathrm{med}(|S|) = \mathrm{med}(\|a\|_p\cdot|X|) = \|a\|_p\,\mathrm{med}(|X|)$. Moreover, med(|X|) depends only on p and can be precomputed in advance. For α = 1 and α = 2, med(|X|) = 1. For other values of α, med(|X|) can be found numerically: Fig. 1 shows a plot of the median values found by simulation, where each point represents one experiment of taking the median of 10,000 drawings, raised to the power 1/p. Thus, in order to estimate $\|a\|_p$, it suffices to compute an estimate z of med(|S|), and then estimate $\|a\|_p$ by $z/\mathrm{med}(|X|)$.

To estimate M = med(|S|), we take a vector $sk[1]\ldots sk[m]$ of independent samples of the random variable S. In other words, for each i, we "generate" m independent samples $x_i^1, \ldots, x_i^m$ of the $X_i$, and then compute $sk[j] = a[1]x_1^j + \cdots + a[n]x_n^j$. We call this vector sk an α-stable sketch of the vector a. This can be viewed computationally as maintaining each entry of the sketch as the inner product between the vector and appropriately chosen random vectors, i.e., $sk[j] = a\cdot x^j$. The procedure is presented in more detail in Fig. 2. Note that the vector sk can be computed in a streaming fashion; in particular, the numbers $x_i^j$ are not actually stored. It is important that we get the same value for $x_i^j$ every time it is accessed: this is done by using a pseudo-random number generator that is initialized with i to give a stream of values $x_i^j$. Then we use the following lemma (for the random variable Z = |S|, the absolute value of S).

Lemma 1 For any one-dimensional random variable Z with continuous density, let $F(t) = \Pr[Z \le t]$. There is a constant C > 0 such that for $m = \frac{C}{\epsilon^2}\log\frac{2}{\delta}$, if we take m independent samples $z_1, \ldots, z_m$ of Z and set z to be the median element of the sequence $z_1, \ldots, z_m$, then
$$\Pr\!\left[F(z) \in \left[\tfrac{1}{2} - \epsilon,\ \tfrac{1}{2} + \epsilon\right]\right] > 1 - \delta.$$

Proof Let t be such that $F(t) = \frac{1}{2} - \epsilon$. If $F(z) < \frac{1}{2} - \epsilon$, then z < t, so at least half of the samples $z_1, \ldots, z_m$ fall below t, even though each does so independently with probability only $\frac{1}{2} - \epsilon$. By a standard Chernoff bound, there is a constant

C > 0 such that
$$P_1 = \Pr\!\left[F(z) < \tfrac{1}{2} - \epsilon\right] \le \exp\!\left(-\frac{\epsilon^2 m}{C}\right).$$
Using the same argument, we obtain
$$P_2 = \Pr\!\left[F(z) > \tfrac{1}{2} + \epsilon\right] \le \exp\!\left(-\frac{\epsilon^2 m}{C}\right).$$

By setting $m = \frac{C}{\epsilon^2}\log(2/\delta)$, we obtain $P_1 + P_2 \le \delta$. □

Note that our goal is to obtain an approximation to the median M, i.e., the number such that $F(M) = \frac{1}{2}$, while the above provides us (with probability 1 − δ) with z such that $F(z) \in [\frac{1}{2} - \epsilon, \frac{1}{2} + \epsilon]$. For general functions F, z could be a bad estimate of M; e.g., if F is "flat" around the point $\frac{1}{2}$. However, if the derivative F' of F is bounded by B from below around the median, then the above implies that $z \in [M - \epsilon/B, M + \epsilon/B]$, i.e., that z = (1 ± O(ε))M for fixed B and M, which is precisely what we want. It suffices to verify that the derivative of the function F for the random variable |S| is bounded away from 0 around the median. For α ∈ {1, 2}, this can be verified analytically. For other values of α, this can be verified computationally, e.g., by plotting F'. It should be noted that the lower bound on F' depends on α, and tends to ∞ as α tends to 0.

2.4 Simulating Stable Distributions

When implementing this technique, we need to be able to generate values from a stable distribution. These can be generated by using appropriate transformations from uniform random distributions.

• For α = 1, we can use the Cauchy distribution, which is easy to draw from. If U is a uniform random distribution returning values in the range [0, 1], then $\tan(\pi(U - \frac{1}{2}))$ is distributed with the Cauchy distribution.
• For α = 2, we can use the Normal distribution, which can be drawn from using the Box–Muller transformation: If U and V are independently distributed uniformly over [0, 1], then $\sqrt{-2\ln U}\cos(2\pi V)$ is distributed as a normal distribution.
• For all other values of α ∈ (0, 2), stable distributions can be simulated using the method of Chambers, Mallows and Stuck [6]. These take uniform distributions U, V on the range [0, 1] and output a value drawn from a stable distribution with parameter α. Set $\theta(U) = \pi\cdot(U - \frac{1}{2})$. Then
$$\mathrm{stable}(U, V, \alpha) = \frac{\sin(\alpha\,\theta(U))}{\cos^{1/\alpha}(\theta(U))}\cdot\left(\frac{\cos(\theta(U)\cdot(1-\alpha))}{-\ln V}\right)^{\frac{1-\alpha}{\alpha}}$$
is distributed as a stable distribution with parameter α.

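The following Python helper (ours, not the authors') implements these three transformations; the only assumption beyond the formulas above is that the inputs u, v are uniform on (0, 1).

```python
import math, random

def theta(u):
    # Map a uniform (0, 1) value to a uniform angle in (-pi/2, pi/2).
    return math.pi * (u - 0.5)

def stable(u, v, alpha):
    """Draw one value from a symmetric alpha-stable distribution from two
    independent uniforms u, v in (0, 1), using the transformations of
    Sect. 2.4: Cauchy for alpha = 1, Box-Muller for alpha = 2, and the
    Chambers-Mallows-Stuck formula otherwise."""
    if alpha == 1.0:
        return math.tan(theta(u))
    if alpha == 2.0:
        return math.sqrt(-2.0 * math.log(u)) * math.cos(2.0 * math.pi * v)
    t = theta(u)
    return (math.sin(alpha * t) / math.cos(t) ** (1.0 / alpha)) * \
           (math.cos(t * (1.0 - alpha)) / -math.log(v)) ** ((1.0 - alpha) / alpha)

# Sanity check: for Cauchy samples, the median of |X| should be close to 1.
vals = sorted(abs(stable(random.random(), random.random(), 1.0)) for _ in range(10001))
print(vals[5000])   # roughly 1.0
```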
Algorithm to compute sketches of a stream

1: for 1 ≤ j ≤ m do
2:   sk[j] ← 0.0
3: for all tuples (i, c) do
4:   initialize-random-seed(i)
5:   for 1 ≤ j ≤ m do
6:     u ← uniformly-random-from(0, 1)
7:     v ← uniformly-random-from(0, 1)
8:     $x_i^j$ ← stable(u, v, α)
9:     sk[j] ← sk[j] + c · $x_i^j$
10: return median(|sk[1]|, ..., |sk[m]|) / med(|X|)

Fig. 2 Sketching algorithm

2.5 The Sketch Algorithm

The full algorithm to compute a sketch of a stream is given in Fig. 2. It works as follows: lines 1–2 initialize the sketch vector to a vector of all zeros. Then for each new tuple (i, c), we initialize a pseudo-random number generator with the index i (line 4), so that when we draw random values (lines 6–7), these are pseudo-random functions of i, the same every time the same value of i is seen in the stream. Each successive call to the random number generator yields a new value, but the sequence of values following each re-initialization is the same. Line 8 takes two values in the range 0 to 1, and transforms them to yield a value drawn from a stable distribution with parameter α. The jth entry of the sketch is updated, by adding on the contribution of the update, c times the stable value, in line 9. This is repeated for all m entries in the sketch. Lastly, to return an estimate of the norm of the vector, we take the median of the (absolute) values of the sketch, and divide this by the median of the absolute value of the stable distribution with parameter α. We state a theorem that summarizes the properties of this algorithm.
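To make the procedure concrete, here is a small runnable version specialized to α = 1 (the Cauchy case, where med(|X|) = 1), written by us with hypothetical names; it re-derives the values $x_i^j$ from a generator seeded by i, exactly as lines 4–8 prescribe.

```python
import math, random

class L1Sketch:
    """1-stable (Cauchy) sketch of a vector under turnstile updates,
    following Fig. 2 specialized to alpha = 1, where med(|X|) = 1.
    Illustrative code of ours, not the authors' implementation."""
    def __init__(self, m, seed=0):
        self.m = m            # sketch size, O((1/eps^2) log(1/delta))
        self.seed = seed      # sketches combine only if they share this seed
        self.sk = [0.0] * m

    def _row(self, i):
        # Re-derive the same pseudo-random Cauchy values x_i^1..x_i^m every
        # time index i is seen, by re-seeding a private generator with i.
        rng = random.Random(self.seed * 1_000_003 + i)
        return [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(self.m)]

    def update(self, i, c):
        # Process the stream tuple (i, c): "add c to a[i]".
        for j, x in enumerate(self._row(i)):
            self.sk[j] += c * x

    def estimate_l1(self):
        # median(|sk|) estimates ||a||_1, since med(|Cauchy|) = 1.
        return sorted(abs(v) for v in self.sk)[self.m // 2]

# Sketch two vectors with the same seed; the entrywise difference of the
# sketches is a sketch of a - b (see Sect. 2.7), so its median estimates
# the L1 distance ||a - b||_1 = 10 here.
a, b = L1Sketch(401), L1Sketch(401)
for i, c in [(3, 10), (7, 5), (9, 2)]:
    a.update(i, c)
for i, c in [(3, 10), (7, 1), (8, 4)]:
    b.update(i, c)
diff = [x - y for x, y in zip(a.sk, b.sk)]
print(sorted(abs(v) for v in diff)[401 // 2])   # roughly 10
```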

Theorem 1 ([26]) In space $O(\frac{1}{\epsilon^2}\log\frac{1}{\delta})$ we can compute an α-stable sketch of a vector a presented in the turnstile streaming model. Using this sketch we can compute an estimate of $\|a\|_p$ for p = α that is accurate within a factor of 1 ± ε with probability at least 1 − δ. Processing each update to the vector a takes time linear in the size of the sketch, $O(\frac{1}{\epsilon^2}\log\frac{1}{\delta})$.

This follows from the above lemma and the preceding discussion. Note that to complete the proof we must also argue that we can replace truly random samples $x_i^j$ with values drawn using pseudo-random generators. The proof of this relies on the pseudo-random generators of Nisan [37], and we refer the interested reader to the details in [26]. In practice, it suffices to use standard random number generators to generate uniform pseudo-random numbers, and use the transforms given in the previous section.

2.6 Other Estimators

In the previous sections we used the median of $sk[1], \ldots, sk[m]$ to estimate the norm of the stream vector. There are alternative estimators that one can use instead. In particular, for the L1 norm, Li et al. [32] proposed the following bias-corrected geometric mean estimator:
$$E = \cos^m\!\left(\frac{\pi}{2m}\right)\prod_{j=1}^m \big|sk[j]\big|^{1/m}.$$

This estimator is more accurate than the median estimator when the sample size is small [32]. A similar estimator can be used to estimate the more general Lp norms, p ∈ (0, 2]. Unlike the median estimator (which requires some computation to determine the right parameters of the distribution function F ), the geometric mean estimator is computable using a simple analytical formula; see [31] for more details.
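A direct implementation of this estimator is short; the log-space evaluation below is our own numerical-stability detail, not part of the formula. It can be used as a drop-in replacement for the median step in the sketch code above.

```python
import math

def geometric_mean_estimate(sk):
    """Bias-corrected geometric mean estimator for a 1-stable (Cauchy) sketch:
    E = cos^m(pi/(2m)) * prod_j |sk[j]|^(1/m). The product is evaluated in
    log-space to avoid overflow; sketch entries are assumed non-zero, which
    holds with probability 1."""
    m = len(sk)
    mean_log = sum(math.log(abs(v)) for v in sk) / m
    return math.cos(math.pi / (2 * m)) ** m * math.exp(mean_log)
```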

2.7 Combining Sketches

We now state a number of the properties of this sketching technique, which follow immediately from the method of their construction. These show how the α-stable sketches have application to a variety of circumstances.

Corollary 2

sk(a + b) = sk(a) + sk(b), sk(a − b) = sk(a) − sk(b).

These two facts follow immediately from the fact that the sketches are generated as the inner product between the vector a and vectors of values drawn from random distributions, $x^j$. So, the sketch of the sum of two vectors can be computed from the sum of their sketches. This allows the distributed computation of sketches by multiple parties: after agreeing in advance on a random number generator to use, sketches of different data can be computed in parallel, and then the sketches combined to get the sketch of the sum of the data. Similarly, the sketch of the difference of two vectors, and hence the $L_p$ distance between them, can be computed from sketches of the original vectors. This allows large data sets to be compared by only storing the short summarizing sketches of them.

Corollary 3 sk(c · a) = c · sk(a). Stable Distributions in Streaming Computations 291

Also by the linearity of construction, the sketch of a vector a scaled by a scalar c can be computed directly from the sketch of the original vector. This allows, for example, a new day's set of data to be compared against the average of the previous week's data: the sketch of the average is computed by summing the sketches of seven days' data, and scaling by $\frac{1}{7}$. Similarly, the popular exponential decay model, where we compute a weighted average of previous vectors a(0), a(1), a(2), ... as $(1-\lambda)(a(0) + \lambda a(1) + \lambda^2 a(2) + \cdots + \lambda^i a(i) + \cdots)$ for 0 < λ < 1, is easy to construct iteratively. Suppose we have a sketch of the current vector sk, and wish to include a as the new day's data. Then we can set $sk[j] \leftarrow (1-\lambda)\,sk(a)[j] + \lambda\,sk[j]$ for all j.
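This update rule is a one-liner; the helper below (ours) assumes both sketches were built from the same random values $x_i^j$, as linearity requires.

```python
def decay_combine(sk_current, sk_new_day, lam):
    """Exponential-decay combination of sketches, implementing
    sk[j] <- (1 - lam) * sk(a)[j] + lam * sk[j] entrywise."""
    return [(1.0 - lam) * new + lam * cur
            for cur, new in zip(sk_current, sk_new_day)]
```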

3 Application to Streaming Problems

In this section, we outline some of the applications within streaming and beyond that stable distributions have been used to address. These include: estimating the number of distinct items in a stream; as a way to track embeddings in small space; and for geometric problems such as clustering and approximate nearest neighbor searching.

3.1 L0 and Counting Distinct Items

Suppose we are shown a sequence of items, and want to know how many distinct items there are in the sequence. This is a fundamental question in data stream analysis, and it has a large number of applications both in this form and for generalizations of this problem. Assume that each item is an integer in the range 1..n. Then we could maintain a vector a where a[i] counts the number of occurrences of item i. Arrivals of new items can be modeled as adding one to the appropriate entry in the vector. In the turnstile streaming model, departures can be modeled as subtracting one from the corresponding entry. The number of distinct items corresponds to the L0 norm of a, that is, the number of non-zero counts. The L0 norm is somewhat more general than this, since it can also incorporate negative counts. Such negative counts arise, for example, when we want to compare two vectors of counts, and find in how many places the counts differ (the Hamming difference). Note that stable distributions do not exist for α = 0, so we cannot directly apply the sketching technique. Instead, we observe that for sufficiently small values of p, the $L_p$ norm approximates the $L_0$ norm:

Theorem 2 ([11]) The $L_0$ norm $\|a\|_0$ can be approximated to within a factor of 1 + ε by finding the $L_p$ norm of the integer-valued vector a for sufficiently small p, $0 < p \le \log(1+\epsilon)/\log U$, where U bounds the magnitude of the entries of a.

Proof We show that the $L_0$ norm of a vector can be well-approximated by $\sum_i |a[i]|^p = \|a\|_p^p$ for a small value of p (p > 0). If, for all i, we have $|a[i]| \le U$

Fig. 3 The “dominance norm” of multiple signals (left) computes the “area under the curve” of the upper envelope of multiple signals (right) for some upper bound U, then 0 p p 0 a 0 = a[i] ≤ a[i] ≤ U a[i] i i i p 0 0 ≤ U a[i] ≤ (1 + ) a[i] = (1 + ) a 0. i i

We use the fact that a[i] is an integer and ∀i :|a[i]| ≤ U. The last inequality uses p ≤ + ≤ ln(1+) ≈  U (1 ) which follows if we set p ln U ln U .

From this, it follows that if we set the α of our sketches to be sufficiently small—as small as the value of p indicated by the above analysis—and compute sketches using stable distributions, then this can approximate the number of distinct items, and more generally the L0 norm and L0 difference between vectors. Since by definition the stable distributions capture the Lp norm, we have to take no special action when the vectors may contain negative values. Because the sketch is formed by a linear projection of random vectors with the input data, it naturally and smoothly accepts updates with negative values.

When implementing this technique there are various technical details to deal with. Values drawn from stable distributions with small stability parameters α tend to grow very large, so even standard floating point formats are insufficient to handle them. However, in practice it usually suffices to set α to be a sufficiently small constant value. The experiments in [10] show that with α = 0.02, good approximations to the L0 norm and the number of distinct items can be found.

3.2 Dominance Norms

The approach of using stable distributions to capture the $L_0$ norm has been applied to other problems: in [13], the so-called "dominance norm" of data is approximated using stable distributions. The dominance norm is defined as $\sum_i \max_j a_{i,j}$ for a sequence of data items of the form $(i, a_{i,j})$, intuitively giving the "worst-case influence" of a sequence of signal values. This definition is illustrated in Fig. 3: for the three signals shown on the left, the dominance norm is computed by finding the upper envelope of the signals (shown on the right), and taking the area under this upper envelope.

Fig. 3 The "dominance norm" of multiple signals (left) computes the "area under the curve" of the upper envelope of multiple signals (right)

One can approximate this computation by transforming the input into an instance of computing $L_0$ norms. Suppose that the signal values $a_{i,j}$ are integers. We can (conceptually) replace each update $(i, a_{i,j})$ with a sequence of distinct items $(i, 1), (i, 2), \ldots, (i, a_{i,j})$. Now observe that the number of distinct items in the transformed stream is exactly the dominance norm. This shows that $L_0$ is at the heart of the dominance norm. However, this approach is not scalable: naively replacing $a_{i,j}$ from the input with $a_{i,j}$ items means that the algorithm is exponentially slow in the size of the input. Instead, we can make use of the properties of stable distributions to build an estimator whose distribution is correct. The key idea is to round each $a_{i,j}$ to the closest power of $(1+\epsilon)$, say $(1+\epsilon)^k$, and to add $k$ appropriately scaled values from a stable distribution to build a sketch with the right distribution [13].
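To make the reduction concrete, here is a tiny, deliberately naive and non-scalable Python check of the identity underlying it: expanding each update $(i, a_{i,j})$ into the distinct items $(i,1),\ldots,(i,a_{i,j})$ makes the distinct-item count equal the dominance norm. The toy stream below is made up for illustration.

```python
from collections import defaultdict

# A toy stream of (i, a_ij) updates: signal index i, observed value a_ij.
stream = [(0, 3), (1, 5), (0, 4), (2, 2), (1, 1), (0, 2)]

# Dominance norm computed directly: sum over i of the maximum value seen for i.
best = defaultdict(int)
for i, a in stream:
    best[i] = max(best[i], a)
dominance_norm = sum(best.values())

# Reduction: expand (i, a_ij) into distinct items (i, 1), ..., (i, a_ij);
# the number of distinct items equals the dominance norm.
distinct = {(i, v) for i, a in stream for v in range(1, a + 1)}

print(dominance_norm, len(distinct))   # both print 4 + 5 + 2 = 11
```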

3.3 Application to Computing Embeddings

Not all objects can be naturally modeled as vectors. In dealing with massive items that consist of text, geometric data, structured data or other objects, new methods are needed to compare and measure them. However, the α-stable sketches for $L_1$ and $L_2$ distance are sufficiently flexible that they allow the following "embedding approach". Consider any set of objects X, with a distance function D(q,r) defined for any q, r ∈ X.

Definition 4 A mapping $f: X \to L_p$ is called an embedding with distortion $c$ if, for any $q, r \in X$, we have $D(q,r) \le \|f(q) - f(r)\|_p \le c \cdot D(q,r)$.

Here, we use $L_p$ as shorthand for "a vector space with the vector $L_p$ norm". This definition can be further extended to allow the inequalities to hold with a certain probability. If the mapping f works for some p ∈ {0, 1, 2}, and if f can be computed in a streaming fashion, then we can obtain a streaming algorithm for computing short sketches of objects from the space X. That is, for any q, r ∈ X defined by a stream, we can compute their sketches such that D(q,r) can be approximated given the sketches. See [27, 33] for more on embeddings and their algorithmic applications. The simplest example of this approach is given in [14], where it is shown that biologically motivated distances on permutations can be approximated up to small constant factors by encoding information about adjacent characters in the permutation as appropriate vectors in $L_1$. More involved is the method in [12], which shows that a distance between strings can also be embedded into $L_1$. Only local information about the sequence is used in order to build the vector representation.

Fig. 4 Example parse tree for block edit distance text embedding

The construction is more complex, since the sequence is parsed into small blocks, which in turn are re-parsed at successive levels in a hierarchy until a single item is left that represents the whole string. An example parsing of a string is shown in Fig. 4: a tree is built whose leaves are the characters of the string, and whose internal nodes represent selected substrings. The parsing can be computed as successive characters are observed, and the increasingly long substrings given by the internal nodes can be represented compactly with hash values. These substrings can be thought of as defining dimensions of a high-dimensional vector space. In the paper, it is shown that the $L_1$ distance between two vectors created by this process approximates an editing distance between the corresponding strings. Since the parsing can be computed online, sketches for this distance can be computed in small space using the α-stable approach. In total, $O(\log n \log^* n)$ space is required to process a string of length n, and the embedding has distortion $O(\log n \log^* n)$.

This approach is extended from string-based data to tree structures (such as XML documents) in [22]. Using a similar parsing approach, it is shown how an appropriate editing distance on trees can be approximated up to a factor of $O(\log^2 n \log^* n)$ for trees with at most n nodes. Further, with a different kind of sketch based on stable distributions, the join size of a set of trees can be approximated. Here, the join size is the number of pairs that are within a threshold distance of each other.

Another example of this approach is given in [28]. Consider a discrete d-dimensional space $\{1, \ldots, \Delta\}^d$, and let P and Q be two subsets from that space. Define M(P,Q) to be the cost of the matching between P and Q with minimum cost: the cost of the matching is given by the sum of the distances between the paired-up points. The value of M(P,Q) is a natural measure of a difference between two sets of points. Building on the work of Charikar [8], Indyk [28] showed that M(·, ·) can be embedded into $L_1$ with distortion $O(\log \Delta)$, and that the embedding can be computed in small space. In fact, the embedding is quite simple. Let $G_i$, $i = 1, \ldots, t = \log \Delta$, be square grids over $\{1, \ldots, \Delta\}^d$ with side length $2^{i-1}$, shifted by a vector chosen uniformly at random from $[0, \Delta]^d$. For each cell c in $G_i$, let $n_P^i(c)$ be the number of points in P that fall into c; note that $n_P^i$ can be viewed as a (high-dimensional) vector. The embedding f maps P into (essentially) a concatenation of the scaled vectors $2^{i-1} n_P^i$, for $i = 1, \ldots, t$. Observe that the embedding can be computed in a streaming fashion: adding a point p to P can be implemented by incrementing t positions in f(P) that correspond to cells containing p; deleting a point from P can be implemented in an analogous way. Thus, the embedding can be naturally combined with the sketching algorithm from the previous section. It is worth mentioning that the above approximation factor $O(\log \Delta)$ cannot be much improved if one insists on proceeding through the $L_1$ norm. Specifically, Naor and Schechtman [36] showed that any such embedding must incur $\Omega(\sqrt{\log \Delta})$ distortion. It is not clear whether this lower bound is tight, and it would be interesting to resolve this issue.
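A minimal Python sketch of the grid embedding just described follows; it is an illustration of the construction under assumptions (small made-up point sets, a brute-force exact matching for comparison, and a particular choice of Δ and dimension), not Indyk's implementation. Randomly shifted grids at geometrically growing cell sizes give cell counts weighted by the cell side length, and the L1 distance between the resulting sparse vectors serves as a proxy for the matching cost.

```python
import itertools
import random
from collections import Counter

random.seed(3)
DELTA, D = 64, 2                     # grid extent {1..DELTA}^D, dimension
T = DELTA.bit_length()               # number of grid levels, about log DELTA
SHIFT = [tuple(random.uniform(0, DELTA) for _ in range(D)) for _ in range(T)]

def embed(points):
    """Sparse L1 embedding: weight-(2^(i-1)) counts of points per cell of grid G_i."""
    vec = Counter()
    for i in range(1, T + 1):
        side = 2 ** (i - 1)
        for p in points:
            cell = tuple(int((p[k] + SHIFT[i - 1][k]) // side) for k in range(D))
            vec[(i, cell)] += side   # scaled count
    return vec

def l1(u, v):
    return sum(abs(u[k] - v[k]) for k in set(u) | set(v))

def exact_matching_cost(P, Q):
    """Brute-force min-cost perfect matching under L1 point distances (small sets only)."""
    return min(sum(abs(p[0] - q[0]) + abs(p[1] - q[1]) for p, q in zip(P, perm))
               for perm in itertools.permutations(Q))

P = [(random.randrange(1, DELTA + 1), random.randrange(1, DELTA + 1)) for _ in range(6)]
Q = [(random.randrange(1, DELTA + 1), random.randrange(1, DELTA + 1)) for _ in range(6)]
print("exact matching cost :", exact_matching_cost(P, Q))
print("embedded L1 distance:", l1(embed(P), embed(Q)))   # agrees up to an O(log DELTA) factor
```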
Finally, we mention that an analogous embedding into $L_0$ gives a streaming algorithm for estimating the cost of the minimum spanning tree of a set of points P, up to a factor of $O(\log \Delta)$; see [28] for details.

3.4 Clustering and Nearest Neighbors

The sketch structure can be used as a "distance oracle", giving dependable approximations of the distance between high-dimensional vectors while keeping only a constant amount of space for each object. Such sketches can therefore be applied to a number of data indexing and data mining questions which rely on such distance computations, replacing exact distance computations with approximations. For example, in order to perform clustering on a set of high-dimensional vectors that are defined by data streams, we can keep sketches of the vectors, and then run the clustering algorithm using those sketches. This approach was investigated in [11], where experimental evidence was given that the clusterings found are of similar quality to those using exact distance measurements. The idea of replacing exact distance computations with approximate ones can be analyzed formally. For example, it is easy to show that, for the k-center objective function, using approximate distances changes the approximation quality of the result from 2 to 2 + ε [9].

A more involved approach was taken in [16]. This showed that sketches using stable distributions could be fitted into the framework of 'Locality Sensitive Hash Functions', and consequently can be used in the construction of Approximate Nearest Neighbor search structures. Although this more generally applies to non-streaming scenarios, the whole algorithm can be run on data presented in a streaming format. The space that is needed is a function of the number of data points, rather than a function of the total size of the input data.
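As a small illustration of the distance-oracle idea (a sketch under assumptions, not the experimental setup of [11]): Cauchy sketches stand in for the full vectors, approximate pairwise L1 distances are computed from the sketches alone, and a simple greedy k-center heuristic (farthest-point traversal) is run on those approximate distances. The vector dimension, sketch width, and synthetic data are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, n_points, k = 5000, 201, 60, 3      # dimension, sketch width, #vectors, #centers
R = rng.standard_cauchy(size=(m, d))      # shared Cauchy (1-stable) projection

# High-dimensional vectors drawn around k hidden centers.
centers = rng.normal(0, 10, size=(k, d))
data = np.vstack([centers[i % k] + rng.normal(0, 1, d) for i in range(n_points)])
sketches = data @ R.T                     # one m-entry sketch per vector

def approx_l1(i, j):
    """Approximate ||data[i] - data[j]||_1 from the sketches only."""
    return np.median(np.abs(sketches[i] - sketches[j]))

def greedy_k_center(dist, n, k):
    """Farthest-point heuristic using only the supplied distance oracle."""
    chosen = [0]
    while len(chosen) < k:
        far = max(range(n), key=lambda i: min(dist(i, c) for c in chosen))
        chosen.append(far)
    return chosen

print("centers picked from sketches:", greedy_k_center(approx_l1, n_points, k))
print("exact check of one distance :",
      np.abs(data[0] - data[1]).sum(), "vs", approx_l1(0, 1))
```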

4 Related Work

The sketch for $L_2$, which is formed as the inner product between the vector a and vectors r, each of whose entries is drawn independently from a Gaussian distribution, can be seen as a weaker version of the well-known Johnson–Lindenstrauss Lemma [30]. This states that there exist embeddings of high-dimensional vectors in Euclidean space into a space with dimension $O(\frac{1}{\epsilon^2}\log\frac{1}{\delta})$ which have distortion $1+\epsilon$ with probability $1-\delta$. Here, we have shown the result for a space where we use the median operator to compute the distance. It has been shown that taking the appropriately scaled $L_2$ difference between sketch vectors formed in the same way also has this property (see, for example, [29]). What we also have here is a version of this "weak Johnson–Lindenstrauss" lemma for $L_1$ and $L_0$. This is about as strong as we may hope for, since it has been shown that it is not possible to create an approximate distance-preserving embedding of $L_1$ into a lower-dimensional $L_1$ space [4]. Here, we use an $L_1$-like operator: instead of computing the $L_1$ norm of the sketch as $\sum_j |sk[j]|$, we compute $\mathrm{med}_j(|sk[j]|)$.¹

The fundamental work of Alon, Matias and Szegedy [2] (described in an earlier chapter) initiated recent focus on computing norms of data streams. An algorithm given therein computes the second frequency moment of a data stream, $F_2$. As was observed in [18], this directly gives a solution to finding the $L_2$ norm and $L_2$ difference between streams in the turnstile model. For most applications, the fact that updates can be performed very quickly, and that the necessary four-wise independent hash functions can be computed easily [39], means that this approach will be preferable in many situations. For computing the $L_1$ difference, [18] shows how to modify the Alon–Matias–Szegedy method using carefully constructed range-summable random variables. However, this is under very strong restrictions on the data: each index can be seen at most once for each vector. The approach here allows a much more general model of the data, and is easier to compute. Similarly, [20] extended the above approach to arbitrary $L_p$ norms for p ∈ (0, 2), but with the same disadvantages. The main results described in this chapter on constructing sketches using stable distributions (Sect. 2) appeared first in [26].

The distinct elements problem has attracted a great deal of study. In the arrivals-only (cash register) model, algorithms are known which are significantly faster than the approach described here; see [3, 17, 19, 23, 24], and the discussions elsewhere in this book. In the more general problem of computing the $L_0$ norm and $L_0$ difference, where entries in the implicit vector defined by the stream can be negative, the method using stable distributions is the only published solution. A detailed empirical study of this approach, and a collection of ways to increase processing speed, are given in [10].

In terms of the application of α-stable sketches to speeding up clustering, see elsewhere in this book for details of much of the other work on clustering data streams. Typically, the goal is to compute a representation of the optimal clustering of a very large number of points in some arbitrary metric space, where each point has a small representation. Here, we considered a somewhat different scenario, where the number of points to cluster is not too large, but each point is represented by a very high-dimensional vector in some $L_p$ normed space.
Hence, the two approaches are in some sense complementary and are not directly comparable.

¹Observe that since we need the median operator, this is not a normed space. This is an important restriction, since it means that one cannot immediately apply well-known techniques which work on specific normed spaces, such as clustering or similarity search. In contrast, since the Johnson–Lindenstrauss lemma does yield points in a lower-dimensional metric space, all algorithms for Euclidean space can be applied to the resulting transformed data.

There is a very large body of work on stable distributions in statistics and related areas. For pointers, see the books by Zolotarev [40, 42], and Nolan [38]. It is reasonable to say that the applications of stable distributions to streaming computations are far from exhausted.

5 Extensions and New Directions

Finally, we outline some potential areas for future research to extend the applications of stable distributions to streaming computations.

• One obstacle to implementing sketch-based summarization of very high speed data streams is that the time cost of maintaining sketches can be too expensive in some situations. This derives in part from the cost of simulating stable distributions using transforms from uniform distributions. The main cost comes from having to update every entry in the sketch with every update. For $L_2$ norms, alternative methods are known which are asymptotically faster than $\Omega(\frac{1}{\epsilon^2}\log\frac{1}{\delta})$ per update; for example, see [7, 39] or the recent work on the "fast Johnson–Lindenstrauss transform" [1]. Likewise, for the problem of approximating the number of distinct items in the arrivals-only (cash register) model, faster updates are possible. It remains an open problem to design algorithms to compute $L_1$ norms and $L_0$ norms in the turnstile model which have lower per-item update cost. Note that one cannot expect lower space costs, since lower bounds of $\Omega(\frac{1}{\epsilon^2})$ have been shown [41].
• It is of interest to address the engineering question of how to incorporate stable sketch computations into high speed data stream systems [15]. Various precomputations may be possible to speed up the computations, using appropriate look-up tables and so on. Techniques such as fixed-point arithmetic may also be appropriate for certain fixed $L_p$ values (p = 1 or p = 2, say), where the generated values do not grow too large. Other approaches may take advantage of skew in the data to, for example, collect together multiple instances of the same vector entry being updated, to further speed up the update time. There is a need to study in detail many implementation issues such as these to make the use of stable sketches within real situations a practical reality.
• The flexibility of this approach means that it is inviting to consider whether there are similar methods to compute other quantities of interest on the stream. For example, the "empirical entropy" of a sequence, given by $\sum_i a_i \log a_i$, has a number of applications, as does the "sum of logs", $\sum_i \log a_i$. Recently, progress has been made on computing the empirical entropy of counts of items in the stream [5, 21]; it remains open to determine whether stable distributions or similar techniques can also be applied to these problems.
• It is an intriguing fact that stable distributions exist only in the range α ∈ (0, 2], which corresponds to the range of $L_p$ norms that can be approximated efficiently (essentially in constant space) on the stream. Meanwhile, there are provable lower bounds on the space required to estimate $L_p$ norms for p > 2 that are polynomial in n, the dimension of the vector. The connection between these facts may be more than mere coincidence, and making this connection explicit could lead to the development of stronger lower bounds, or lower bounds for other, related problems.
• Many techniques using stable distributions make use of a natural range-summability-like property of these distributions. That is, their defining feature is that the sum of stable distributions is itself distributed stable. This results in careful constructions of random variables such that the range sum of particular subranges of variables can be computed efficiently (exponentially more efficiently than directly computing the sum). Such constructions have been shown for α = 1 and α = 2 [25]. It remains to generalize these techniques to general values of α, and to show new applications.
• Finally, there is a large literature on stable distributions, resulting from their study in statistics, economics and beyond. Applications of stable distributions to streaming computations have only just begun to make use of the wealth of existing knowledge about these distributions, and it is very conceivable that there are many other applications of these distributions to problems of practical interest in streaming computations.

References

1. N. Ailon, B. Chazelle, Approximate nearest neighbors and the fast Johnson–Lindenstrauss transform, in Proceedings of the ACM Symposium on Theory of Computing (2006)
2. N. Alon, Y. Matias, M. Szegedy, The space complexity of approximating the frequency moments, in Proceedings of the ACM Symposium on Theory of Computing (1996), pp. 20–29. Journal version in J. Comput. Syst. Sci. 58, 137–147 (1999)
3. Z. Bar-Yossef, T.S. Jayram, R. Kumar, D. Sivakumar, L. Trevisan, Counting distinct elements in a data stream, in Proceedings of RANDOM 2002 (2002), pp. 1–10
4. B. Brinkman, M. Charikar, On the impossibility of dimensionality reduction in L1, in IEEE Conference on Foundations of Computer Science (2003), pp. 514–523
5. A. Chakrabarti, G. Cormode, A. McGregor, A near-optimal algorithm for computing the entropy of a stream, in Proceedings of ACM–SIAM Symposium on Discrete Algorithms (2007)
6. J.M. Chambers, C.L. Mallows, B.W. Stuck, A method for simulating stable random variables. J. Am. Stat. Assoc. 71(354), 340–344 (1976)
7. M. Charikar, K. Chen, M. Farach-Colton, Finding frequent items in data streams, in Proceedings of the International Colloquium on Automata, Languages and Programming (ICALP) (2002), pp. 693–703
8. M.S. Charikar, Similarity estimation techniques from rounding algorithms, in Proceedings of the ACM Symposium on Theory of Computing (2002), pp. 380–388
9. G. Cormode, Sequence distance embeddings. PhD thesis, University of Warwick (2003)
10. G. Cormode, M. Datar, P. Indyk, S. Muthukrishnan, Comparing data streams using Hamming norms, in Proceedings of the International Conference on Very Large Data Bases (2002), pp. 335–345. Journal version in IEEE Trans. Knowl. Data Eng. 15(3), 529–541 (2003)
11. G. Cormode, P. Indyk, N. Koudas, S. Muthukrishnan, Fast mining of tabular data via approximate distance computations, in Proceedings of the International Conference on Data Engineering (2002), pp. 605–616
12. G. Cormode, S. Muthukrishnan, The string edit distance matching problem with moves, in Proceedings of ACM–SIAM Symposium on Discrete Algorithms (2002), pp. 667–676

13. G. Cormode, S. Muthukrishnan, Estimating dominance norms of multiple data streams, in Proceedings of the European Symposium on Algorithms (ESA). LNCS, vol. 2838 (2003)
14. G. Cormode, S. Muthukrishnan, S.C. Şahinalp, Permutation editing and matching via embeddings, in Proceedings of 28th International Colloquium on Automata, Languages and Programming, vol. 2076 (2001), pp. 481–492
15. C. Cranor, T. Johnson, O. Spatscheck, V. Shkapenyuk, Gigascope: a stream database for network applications, in Proceedings of ACM SIGMOD International Conference on Management of Data (2003), pp. 647–651
16. M. Datar, N. Immorlica, P. Indyk, V.S. Mirrokni, Locality-sensitive hashing scheme based on p-stable distributions, in Symposium on Computational Geometry (2004)
17. C. Estan, G. Varghese, M. Fisk, Bitmap algorithms for counting active flows on high speed links, in Proceedings of the Internet Measurement Conference (2003), pp. 153–166
18. J. Feigenbaum, S. Kannan, M. Strauss, M. Viswanathan, An approximate L1-difference algorithm for massive data streams, in IEEE Conference on Foundations of Computer Science (1999), pp. 501–511
19. P. Flajolet, G.N. Martin, Probabilistic counting, in IEEE Conference on Foundations of Computer Science (1983), pp. 76–82. Journal version in J. Comput. Syst. Sci. 31, 182–209 (1985)
20. J. Fong, M. Strauss, An approximate Lp-difference algorithm for massive data streams, in Symposium on Theoretical Aspects of Computer Science (STACS) (2000), pp. 193–204
21. S. Ganguly, B. Lakshminath, Estimating entropy over data streams, in Proceedings of the European Symposium on Algorithms (ESA) (2006)
22. M. Garofalakis, A. Kumar, Correlating XML data streams using tree-edit distance embeddings, in Proceedings of ACM Principles of Database Systems (2003), pp. 143–154
23. P. Gibbons, Distinct sampling for highly-accurate answers to distinct values queries and event reports, in Proceedings of the International Conference on Very Large Data Bases (2001), pp. 541–550
24. P. Gibbons, S. Tirthapura, Estimating simple functions on the union of data streams, in ACM Symposium on Parallel Algorithms and Architectures (SPAA) (2001), pp. 281–290
25. A. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan, M. Strauss, Fast, small-space algorithms for approximate histogram maintenance, in Proceedings of the ACM Symposium on Theory of Computing (2002), pp. 389–398
26. P. Indyk, Stable distributions, pseudorandom generators, embeddings and data stream computation, in IEEE Conference on Foundations of Computer Science (2000), pp. 189–197
27. P. Indyk, Algorithmic aspects of geometric embeddings (invited tutorial), in IEEE Conference on Foundations of Computer Science (2001), pp. 10–35
28. P. Indyk, Algorithms for dynamic geometric problems over data streams, in Proceedings of the ACM Symposium on Theory of Computing (2004)
29. P. Indyk, R. Motwani, Approximate nearest neighbors: towards removing the curse of dimensionality, in Proceedings of the ACM Symposium on Theory of Computing (1998), pp. 604–613
30. W.B. Johnson, J. Lindenstrauss, Extensions of Lipschitz mapping into Hilbert space. Contemp. Math. 26, 189–206 (1984)
31. P. Li, Very sparse stable random projections, estimators and tail bounds for stable random projections. Technical report (2006). arXiv:cs.DS/0611114
32. P. Li, T. Hastie, K.W. Church, Nonlinear estimators and tail bounds for dimension reduction in L1 using Cauchy random projections. J. Mach. Learn. Res. (2007)
33. J. Matoušek, Lectures on Discrete Geometry (Springer, Berlin, 2002)
34. R. Motwani, P. Raghavan, Randomized Algorithms (Cambridge University Press, Cambridge, 1995)
35. S. Muthukrishnan, Data streams: algorithms and applications, in Proceedings of ACM–SIAM Symposium on Discrete Algorithms (2003)
36. A. Naor, G. Schechtman, Planar earthmover is not in L1, in IEEE Conference on Foundations of Computer Science (2006)

37. N. Nisan, Pseudorandom generators for space-bounded computation. Combinatorica 12, 449–461 (1992)
38. J. Nolan, Stable distributions. Available from http://academic2.american.edu/~jpnolan/stable/chap1.ps
39. M. Thorup, Y. Zhang, Tabulation based 4-universal hashing with applications to second moment estimation, in Proceedings of ACM–SIAM Symposium on Discrete Algorithms (2004), pp. 615–624
40. V.V. Uchaikin, V.M. Zolotarev, Chance and Stability: Stable Distributions and Their Applications (VSP, Utrecht, 1999)
41. D. Woodruff, Optimal space lower bounds for all frequency moments, in Proceedings of ACM–SIAM Symposium on Discrete Algorithms (2004), pp. 167–175
42. V.M. Zolotarev, One Dimensional Stable Distributions. Translations of Mathematical Monographs, vol. 65 (Am. Math. Soc., Providence, 1983)

Tracking Queries over Distributed Streams

Minos Garofalakis

Effective Big Data analytics pose several difficult challenges for modern data management architectures. One key such challenge arises from the naturally streaming nature of big data, which mandates efficient algorithms for querying and analyzing massive, continuous data streams (that is, data that is seen only once and in a fixed order) with limited memory and CPU-time resources. Such streams arise naturally in emerging large-scale event monitoring applications; for instance, network-operations monitoring in large ISPs where usage information from numerous sites needs to be continuously collected and analyzed for interesting trends. In addition to memory- and time-efficiency concerns, the inherently distributed nature of such applications also raises important communication-efficiency issues, making it critical to carefully optimize the use of the underlying network infrastructure. In this chapter, we provide a brief introduction to the distributed data streaming model and the Geometric Method (GM), a generic technique for effectively tracking complex queries over massive distributed streams. We also discuss several recently-proposed extensions to the basic GM framework, such as the combination with stream-sketching tools and local prediction models, as well as more recent developments leading to a more general theory of Safe Zones and interesting connections to convex Euclidean geometry. Finally, we outline various challenging directions for future research in this area.

M. Garofalakis (B) School of Electrical and Computer Engineering, Technical University of Crete, University Campus—Kounoupidiana, Chania 73100, Greece e-mail: [email protected]


1 Introduction

Traditional data-management systems are typically built on a pull-based paradigm, where users issue one-shot queries to static data sets residing on disk, and the system processes these queries and returns their results. For several emerging application domains, however, data arrives and needs to be processed on a continuous (24 × 7) basis, without the benefit of several passes over a static, persistent data image. These continuous data streams arise naturally in new large-scale event monitoring applications that require the ability to efficiently process continuous, high-volume streams of data in real time. Such monitoring systems are routinely employed, for instance, in the network installations of large Telecom and Internet service providers where detailed usage information (Call-Detail-Records (CDRs), SNMP/RMON packet-flow data, etc.) from different parts of the underlying network needs to be continuously collected and analyzed for interesting trends. Other examples include real-time analysis tools for financial data streams, and event and operations monitoring applications for enterprise clouds and data centers. As both the scale of today's networked systems and the volumes and rates of the associated data streams continue to increase with no bound in sight, algorithms and tools for effectively analyzing them are becoming an important research mandate.

Large-scale stream processing applications rely on continuous, event-driven monitoring, that is, real-time tracking of measurements and events, rather than one-shot answers to sporadic queries. Furthermore, the vast majority of these applications are inherently distributed, with several remote monitor sites observing their local, high-speed data streams and exchanging information through a communication network. This distribution of the data naturally implies critical communication constraints that typically prohibit centralizing all the streaming data, due to either the huge volume of the data (e.g., in IP-network monitoring, where the massive amounts of collected utilization and traffic information can overwhelm the production IP network [13]), or power and bandwidth restrictions (e.g., in wireless sensornets, where communication is the key determinant of sensor battery life [27]). Finally, an important requirement of large-scale event monitoring is the effective support for tracking complex, holistic queries that provide a global view of the data by combining and correlating information across the collection of remote monitor sites. For instance, tracking aggregates over the result of a distributed join (the "workhorse" operator for combining tables in relational databases) can provide unique, real-time insights into the workings of a large-scale distributed system, including system-wide correlations and potential anomalies [7]. Monitoring the precise value of such holistic queries without continuously centralizing all the data seems hopeless; luckily, when tracking statistical behavior and patterns in large-scale systems, approximate answers (with reasonable approximation error guarantees) are often sufficient. This often allows algorithms to effectively trade off efficiency with approximation quality (e.g., using sketch-based stream approximations [7]).
Given the prohibitive cost of data centralization, it is clear that realizing sophisticated, large-scale distributed data-stream analysis tools must rely on novel algorithmic paradigms for processing local streams of data in situ (i.e., locally at the sites where the data is observed). This, of course, implies the need for intelligently decomposing a (possibly complex) global data-analysis and monitoring query into a collection of "safe" local queries that can be tracked independently at each site (without communication), while guaranteeing correctness for the global monitoring operation. This decomposition process can enable truly distributed, event-driven processing of real-time streaming data, using a push-based paradigm, where sites monitor their local queries and communicate only when some local query constraints are violated [7, 33]. Nevertheless, effectively decomposing a complex, holistic query over the global collections of streams into such local constraints is far from straightforward, especially in the case of non-linear queries (e.g., norms or joins) [33].

The bulk of early work on data-stream processing has focused on developing space-efficient, one-pass algorithms for performing a wide range of centralized computations on massive data streams; examples include computing quantiles [21], estimating distinct values [20] and set-expression cardinalities [16], counting frequent elements (i.e., "heavy hitters") [5, 11, 28], approximating large Haar-wavelet coefficients [10], and estimating join sizes and stream norms [1, 2, 15]. Monitoring distributed data streams has attracted substantial research interest in recent years [6, 30], with early work focusing on the monitoring of single values, and building appropriate models and filters to avoid propagating updates if these are insignificant compared to the value of simple linear aggregates (e.g., to the SUM of the distributed values). For instance, [31] proposes a scheme based on "adaptive filters," that is, bounds around the value of distributed variables, which shrink or grow in response to relative stability or variability, while ensuring that the total uncertainty in the bounds is at most a user-specified bound. Still, in the case of linear aggregate functions, deriving local filter bounds based on a global monitoring condition is rather straightforward, with the key issue being how to intelligently distribute the available aggregate "slack" across all sites [3, 9, 23].

In this chapter, we focus on recently-developed algorithmic tools for effectively tracking a broad class of complex queries over massive, distributed data streams. We start by describing the key elements of a generic distributed stream-processing model and define a broad class of distributed query-tracking problems addressed by our techniques. We then give an overview of the Geometric Method (GM) [24, 33] for distributed threshold monitoring that lies at the core of our distributed query-tracking methodology, and briefly discuss recent extensions to the basic GM framework that incorporate stream sketches [17] and local prediction models [18, 19]. We also summarize recent developments leading to a more general theory of Safe Zones for geometric monitoring and interesting connections to convex Euclidean geometry [26]. Finally, we conclude with a brief discussion of new research directions in this space.

2 Distributed Data Streaming and the Geometric Method

In this section, we discuss some key concepts behind data-stream processing and the distributed streaming model. We then introduce the Geometric Method (GM) [33], a generic technique for query monitoring over distributed streams.

2.1 Data Streams and Distributed Streaming

Recent years have witnessed an increasing interest in designing data-processing algorithms that work over continuous data streams, i.e., algorithms that provide results to user queries while looking at the relevant data items only once and in a fixed order (determined by the stream-arrival pattern). Such stream queries are typically continuous, implying the need for continuous, real-time monitoring of the query answer over the changing stream.

As discussed earlier in this volume, a data stream can be modeled as a massive, dynamic, one-dimensional vector v[1...N] that, at any point in time, captures the current state of the stream. (Multi-dimensional data can naturally be handled in this abstract model by simply "unfolding" the corresponding multi-dimensional frequency distribution on one vector dimension using standard techniques, e.g., row- or column-major.) The dynamic vector v is rendered through a continuous stream of updates, where each update effectively modifies values in v; the nature of these update operations gives rise to different data streaming models, such as time-series, cash-register, and turnstile streams [29].

Data-stream processing algorithms aim to compute functions (or, queries) on the stream vector v at different points during the lifetime of the stream (continuous or ad-hoc). Since N can be very large, the typical requirement here is that these algorithms work in small space (i.e., the state maintained by the algorithm) and small time (i.e., the processing time per update), where "small" is understood to mean a quantity significantly smaller than Θ(N) (typically, poly-logarithmic in N). Several such stream-processing algorithms are known for various data-analysis queries [1, 2, 5, 10, 11, 15, 16, 20, 21, 28].

The naturally distributed nature of large-scale event-monitoring applications implies one additional level of complexity, in the sense that there is no centralized observation point for the dynamic stream vector v; instead, v is distributed across several sites. More specifically, we consider a distributed computing environment, comprising a collection of k remote sites and a designated coordinator site. Streams of data updates arrive continuously at remote sites, while the coordinator site is responsible for generating approximate answers to (possibly, continuous) user queries posed over the collection of remotely-observed streams (across all sites). Following earlier work in the area [3, 7, 9, 14, 31], our distributed stream-processing model does not explicitly allow direct communication between remote sites; instead, as illustrated in Fig. 1(a), a remote site exchanges messages only with the coordinator, providing it with state information on its (locally-observed) streams.¹

Fig. 1 (a) Distributed stream processing architecture. (b) Geometric method: estimate vector e, drift vectors $u_j$, convex hull enclosing the current v (dotted outline), and bounding balls $B(e + \frac{1}{2}\Delta v_j, \frac{1}{2}\|\Delta v_j\|)$

Note that such a hierarchical processing model is, in fact, representative of a large class of applications, including network monitoring where a central Network Operations Center (NOC) is responsible for processing network traffic statistics (e.g., link bandwidth utilization, IP source–destination byte counts) collected at switches, routers, and/or Element Management Systems (EMSs) distributed across the network.

Each remote site j ∈ {1,...,k} observes (possibly, several) local update streams that incrementally render a local stream vector $v_j$ capturing the current local state of the observed stream(s) at site j. All local stream vectors $v_j$ in our distributed streaming architecture change dynamically over time; when necessary, we make this dependence explicit, using $v_j(t)$ to denote the state of the vector at time t (assuming a consistent notion of "global time" in our distributed system). The unqualified notation $v_j$ typically refers to the current state of the local stream vector. We define the global stream vector v of our distributed stream(s) as any weighted average (i.e., convex combination) of the local stream vectors $\{v_j\}$, that is, $v = \sum_{j=1}^{k} \lambda_j v_j$, where $\sum_j \lambda_j = 1$ and $\lambda_j \ge 0$ for all j. (Again, to simplify notation, we typically omit the explicit dependence on time when referring to the current global vector.) Our focus is on the problem of effectively answering user queries (or, functions) over the global stream vector at the coordinator site. Rather than one-time query/function evaluation, we assume a continuous-querying environment, which implies that the coordinator needs to continuously maintain (or, track) the answers to queries as the local update streams $v_j$ evolve at individual remote sites. There are two defining characteristics of our problem setup that raise difficult algorithmic challenges for our query tracking problems:

• The distributed nature and large volumes of local streaming data raise important communication and space/time efficiency concerns. Naïve schemes that accurately track query answers by forcing remote sites to ship every remote stream update to the coordinator are clearly impractical, since they can impose an inordinate burden on the underlying communication infrastructure (especially, for high-rate data streams and large numbers of remote sites). Furthermore, the voluminous nature of the local data streams implies that effective streaming tools are needed at the remote sites in order to manage the local stream vectors in sublinear space/time. Thus, a practical approach is to adopt the paradigm of continuous tracking of approximate query answers at the coordinator site with strong guarantees on the quality of the approximation. This allows schemes that can effectively trade off space/time/communication efficiency and query-approximation accuracy in a precise, quantitative manner.
• General, non-linear queries/functions imply fundamental and difficult challenges for distributed monitoring. For the case of linear functions, a number of approaches have been proposed that rely on the key idea of allocating appropriate "slacks" to the remote sites based on their locally-observed function values (e.g., [3, 23, 31]). Unfortunately, it is not difficult to find examples of simple non-linear functions on one-dimensional data where it is basically impossible to make any assumptions about the value of the global function based on the values observed locally at the sites [33]. (For instance, for f(v) = v² over the average v of two local scalar values, both local function values can be large while the global value is zero, e.g., when the local values are c and −c.) This renders conventional slack-allocation schemes inapplicable in this more general setting.

¹Of course, sites can always communicate with each other through the coordinator; this would only increase communication load by a factor of 2.

2.2 The Geometric Method (GM)

Sharfman et al. [33] consider the fundamental problem of distributed threshold monitoring, that is, determining whether f(v) < τ or f(v) > τ, for a given (general) function f() over the global stream vector and a fixed threshold τ. Their key idea is that, since it is generally impossible to connect the locally-observed values of f() to the global value f(v), one can employ geometric arguments to monitor the domain (rather than the range) of the monitored function f(). More specifically, assume that at any point in time, each site j has informed the coordinator of some prior state of its local vector $v_j^p$; thus, the coordinator has an estimated global vector $e = v^p = \sum_{j=1}^{k} \lambda_j v_j^p$. Clearly, the updates arriving at sites can cause the local vectors $v_j$ to drift too far from their previously reported values $v_j^p$, possibly leading to a violation of the τ threshold. Let $\Delta v_j = v_j - v_j^p$ denote the local delta vector (due to updates) at site j, and let $u_j = e + \Delta v_j$ be the drift vector from the previously reported estimate at site j. We can then express the current global stream vector v in terms of the drift vectors:
$$v = \sum_{j=1}^{k} \lambda_j \bigl(v_j^p + \Delta v_j\bigr) = e + \sum_{j=1}^{k} \lambda_j \Delta v_j = \sum_{j=1}^{k} \lambda_j (e + \Delta v_j).$$
That is, the current global vector is a convex combination of drift vectors and, thus, is guaranteed to lie somewhere within the convex hull of the delta vectors around e.

Figure 1(b) depicts an example in d = 2 dimensions. The current value of the global stream vector lies somewhere within the shaded convex-hull region; thus, as long as the convex hull does not overlap the inadmissible region (i.e., the region $\{v \in \mathbb{R}^2 : f(v) > \tau\}$ in Fig. 1(b)), we can guarantee that the threshold has not been violated (i.e., f(v) ≤ τ). The problem, of course, is that the $\Delta v_j$'s are spread across the sites and, thus, the above condition cannot be checked locally. To transform the global condition into a local constraint, we place a d-dimensional bounding ball B(c, r) around each local delta vector, of radius $r = \frac{1}{2}\|\Delta v_j\|$ and centered at $c = e + \frac{1}{2}\Delta v_j$ (see Fig. 1(b)). It can be shown that the union of all these balls completely covers the convex hull of the drift vectors [33]. This observation effectively reduces the problem of monitoring the global stream vector to the local problem of each remote site monitoring the ball around its local delta vector.

More specifically, given the monitored function f() and threshold τ, we can partition the d-dimensional space into two sets $A = \{v : f(v) \le \tau\}$ and $\bar{A} = \{v : f(v) > \tau\}$. (Note that these sets can be arbitrarily complex, e.g., they may comprise multiple disjoint regions of $\mathbb{R}^d$.) The basic protocol is now quite simple: Each site monitors its delta vector $\Delta v_j$ and, with each update, checks whether its bounding ball $B(e + \frac{1}{2}\Delta v_j, \frac{1}{2}\|\Delta v_j\|)$ is monochromatic, i.e., whether all points in the ball lie within the same region (A or $\bar{A}$). If this is not the case, we have a local threshold violation, and the site communicates its local $\Delta v_j$ to the coordinator. The coordinator then initiates a synchronization process that typically tries to resolve the local violation by communicating with only a subset of the sites in order to "balance out" the violating $\Delta v_j$ and ensure the monochromicity of all local bounding balls [33]. In the worst case, the delta vectors from all k sites are collected, leading to an accurate estimate of the current global stream vector, which is by definition monochromatic (since all bounding balls have 0 radius).

The power of the GM stems from the fact that it is essentially agnostic of the specific (global) function f(v) being monitored.² Note that the function itself is only used at a remote site when checking the monochromicity of its local ball, which essentially boils down to solving a minimization/maximization problem for f() within the area of that ball. This may, of course, be complex, but it also enables the GM to effectively trade local computation for communication.
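The Python fragment below is a simplified illustration of the local test just described, not the protocol of [33]: the site count, the monitored function, the synthetic update stream, and the sampling-based ball check are all assumptions made for the example (the real protocol solves an exact minimization/maximization of f over the ball). Each site keeps its delta vector, declares a local violation when its bounding ball stops being monochromatic, and the coordinator then performs a full synchronization.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, tau = 4, 2, 15.0                 # sites, dimension, threshold (illustrative values)
f = lambda v: float(v @ v)             # monitored non-linear function: f(v) = ||v||^2

v_prev = [rng.normal(0, 1, d) for _ in range(k)]   # last-reported local vectors v_j^p
e = np.mean(v_prev, axis=0)                        # coordinator estimate e = v^p
delta = [np.zeros(d) for _ in range(k)]            # local delta vectors Delta v_j

def ball_monochromatic(c, r, n=500):
    """Sampled check that f stays on one side of tau over the ball B(c, r).
    (Illustration only; the actual protocol solves min/max of f over the ball.)"""
    x = rng.normal(size=(n, d))
    x = c + r * (x / np.linalg.norm(x, axis=1, keepdims=True)) * rng.uniform(0, 1, (n, 1)) ** (1 / d)
    vals = np.append(np.array([f(p) for p in x]), f(c))
    return (vals <= tau).all() or (vals > tau).all()

for step in range(500):                 # stream of updates hitting random sites
    j = rng.integers(k)
    delta[j] += rng.normal(0, 0.5, d)   # local update at site j
    c, r = e + 0.5 * delta[j], 0.5 * np.linalg.norm(delta[j])
    if not ball_monochromatic(c, r):    # local threshold violation at site j
        print(f"step {step}: violation at site {j}; deltas shipped, coordinator resyncs")
        e = e + np.mean(delta, axis=0)                     # new global estimate
        v_prev = [v + dv for v, dv in zip(v_prev, delta)]  # sites report current vectors
        delta = [np.zeros(d) for _ in range(k)]
        # (the full protocol would also compare f(e) against tau and try partial syncs)
```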

From Threshold Crossing to Approximate Query Tracking

Consider the task of monitoring (at the coordinator) the value of a function f() over the global stream vector v to within θ relative error. (Our discussion here focuses on relative error; the case of monitoring to within bounded absolute error can be handled in a similar manner.)

Since all the coordinator has is the estimated value of the global stream vector $e = v^p$ based on the most recent site updates $v_j^p$, our monitoring protocol would have to guarantee that the estimated function value carries at most θ relative error compared to the up-to-date value f(v) = f(v(t)), that is, $f(v^p) \in (1 \pm \theta) f(v)$,³ which is obviously equivalent to monitoring two threshold queries on f(v):
$$f(v) \ge \frac{f(v^p)}{1+\theta} \quad\text{and}\quad f(v) \le \frac{f(v^p)}{1-\theta}.$$
These are exactly the threshold conditions that our approximate function tracking protocols will need to monitor. Note that $f(v^p)$ in the above expression is a constant (based on the latest communication of the coordinator with the remote sites). Similar threshold conditions can also be derived when the local/global values of f() are only known approximately (e.g., using sketches [1, 2] or other streaming approximations); the threshold conditions just need to account for the added approximation error [17].

²The assumption that GM only monitors functions of the (weighted) average of local stream vectors is not really restrictive: Numerous complex functions can actually be expressed as functions of the average using simple tricks, such as adding additional dimensions to the stream vectors; see, e.g., [4].

3 Enhancing GM: Sketches and Prediction Models

In this section, we give a brief overview of more recent work on extending the GM with two key stream-processing tools, namely sketches and prediction models [17–19].

3.1 GM and AMS Sketches

Techniques based on small-space pseudo-random sketch summaries of the data have proved to be very effective tools for dealing with massive, rapid-rate data streams in centralized settings [8]. The key idea in such sketching techniques is to represent a streaming frequency vector v using a much smaller (typically, randomized) sketch vector (denoted by sk(v)) that (i) can be easily maintained as the updates incrementally rendering v are streaming by, and (ii) provides probabilistic guarantees for the quality of the data approximation. The widely used AMS sketch (proposed by Alon, Matias, and Szegedy in their seminal paper [2], also discussed in detail earlier in this volume) defines the entries of the sketch vector as pseudo-random linear projections of v that can easily be maintained over the stream of updates. More specifically, each entry of sk(v) is essentially a projection (i.e., an inner product) of the v vector over a family of 4-wise independent pseudo-random variates that can be easily maintained (using a simple counter) over the input update stream.

³Throughout, the notation x ∈ (y ± z) is equivalent to |x − y| ≤ |z|.

Another important property is the linearity of AMS sketches: Given two "parallel" sketches (built using the same pseudo-random families), sk(v1) and sk(v2), the sketch of the union of the two underlying streams (i.e., the streaming vector v1 + v2) is simply the component-wise sum of their sketches, that is, sk(v1 + v2) = sk(v1) + sk(v2). This linearity makes such sketches particularly useful in distributed settings.

AMS sketch estimators can effectively approximate inner-product queries $v \cdot u = \sum_i v[i] \cdot u[i]$ over streaming data vectors and tensors. As discussed earlier in this volume, such inner products naturally map to several interesting query classes (e.g., join and multi-join aggregates, range and quantile queries, heavy hitters and top-k queries, and approximate histogram and wavelet representations). The AMS estimator function $f_{\mathrm{AMS}}()$ computed over the sketch vectors of v and u is complex, and involves both averaging and median-selection over the components of the sketch-vector inner product [1, 2]. Formally, viewing each sketch vector as a two-dimensional $n \times m$ matrix (where $n = O(\frac{1}{\epsilon^2})$, $m = O(\log(1/\delta))$, and $\epsilon, 1-\delta$ denote desired bounds on error and probabilistic confidence, respectively), the AMS estimator function is defined as
$$f_{\mathrm{AMS}}\bigl(sk(v), sk(u)\bigr) = \operatorname*{median}_{i=1,\ldots,m}\; \frac{1}{n}\sum_{l=1}^{n} sk(v)[l,i] \cdot sk(u)[l,i], \qquad (1)$$
and guarantees that, with probability $\ge 1-\delta$, $f_{\mathrm{AMS}}(sk(v), sk(u)) \in (v \cdot u \pm \epsilon\,\|v\|\,\|u\|)$ [1, 2].

Moving to the distributed streams setting, note that our discussion of the GM thus far has assumed that all remote sites maintain the full stream vector (i.e., employ Θ(N) space), which is often unrealistic for real-life data streams. In our recent work [17], we have proposed novel approximate query tracking protocols that exploit the combination of the GM and AMS sketch estimators. The AMS sketching idea offers an effective streaming dimensionality-reduction tool that significantly expands the scope of the original GM, allowing it to handle massive, high-dimensional distributed data streams in an efficient manner with approximation-quality guarantees. A key technical observation is that, by exploiting properties of the AMS estimator function, geometric monitoring can now take place in a much lower-dimensional space, allowing for communication-efficient monitoring. In fact, we show that, using appropriate lower and upper bounds on $f_{\mathrm{AMS}}()$, we can monitor a function in only m-dimensional space (where m = O(log(1/δ))). This is a crucial optimization since sketch matrices are typically very "thin", i.e., $n \gg m$, as n grows quadratically with 1/ε, whereas m depends only logarithmically on the desired confidence δ. Another technical challenge that arises is how to effectively test the monochromicity of bounding balls in this lower-dimensional space with respect to threshold conditions involving the highly non-linear median operator in the AMS estimator $f_{\mathrm{AMS}}()$ (see Eq. (1)). We have proposed a number of novel algorithmic techniques to address the technical challenges that arise, starting from the easier cases of $L_2$-norm (i.e., self-join) and range queries, and then extending them to the case of general inner-product (i.e., binary-join) queries. Our experimental study with real-life data sets demonstrates the practical benefits of our approach, showing consistent gains in communication cost compared to state-of-the-art methods.
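A small Python illustration of the estimator in Eq. (1) follows; it is a sketch under assumptions, since 4-wise independence is approximated here by fully random ±1 signs and the sketch sizes and synthetic vectors are illustrative. Each of the m columns averages n independent products of signed projections, and the median over columns gives the inner-product estimate.

```python
import numpy as np

rng = np.random.default_rng(5)
N, n, m = 2000, 200, 9         # domain size, averaging copies, median copies (illustrative)
signs = rng.choice([-1.0, 1.0], size=(n, m, N))   # stand-in for 4-wise independent +/-1 families

def ams_sketch(v):
    """n x m AMS sketch: sk[l, i] = sum_j v[j] * signs[l, i, j]."""
    return signs @ v            # shape (n, m)

def f_ams(sk_v, sk_u):
    """Estimator of Eq. (1): median over columns of the averaged entrywise products."""
    return np.median(np.mean(sk_v * sk_u, axis=0))

# Two frequency vectors built from random item arrivals, and their join sizes.
v = np.bincount(rng.integers(0, N, 3000), minlength=N).astype(float)
u = np.bincount(rng.integers(0, N, 3000), minlength=N).astype(float)
sk_v, sk_u = ams_sketch(v), ams_sketch(u)
print("join size  v.u   exact:", float(v @ v.T * 0 + v @ u), " estimate:", f_ams(sk_v, sk_u))
print("self-join ||v||^2 exact:", float(v @ v), " estimate:", f_ams(sk_v, sk_v))
```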

3.2 GM and Prediction Models

In other recent work [18, 19], we have proposed a novel combination of the geometric method with local prediction models for describing the temporal evolution of local data streams. (The adoption of prediction models has already been proven beneficial in terms of bandwidth preservation in distributed settings [7].) We demonstrate that prediction models can be incorporated in a very natural way in the geometric method for tracking general, non-linear functions; furthermore, we show that the initial geometric monitoring method of Sharfman et al. [24, 33] is only a special case of our, more general, prediction-based geometric monitoring framework. Interestingly, the mere utilization of local predictions is not enough to guarantee lower communication overheads even when predictors are quite capable of describing local stream distributions. We establish a theoretically solid monitoring framework that incorporates conditions that can lead to fewer contacts with the coordinator. We also develop a number of mechanisms, along with extensive probabilistic models and analysis, that relax the previously introduced framework, base their function on simpler criteria, and yield significant communication benefits in practical scenarios.

4 Towards Convex Safe Zones

In follow-up work to the GM, Keren et al. [24] propose a simple, generic geometric monitoring strategy that can be formally shown to encompass the original GM scheme as a special case. Briefly, assuming we are monitoring the threshold condition f(v) ≤ τ, the idea is to define a certain convex subset C of the admissible region $A = \{v : f(v) \le \tau\}$ (i.e., a convex admissible subset), which is then used to define Safe Zones (SZs) for the local drift vectors: Site j simply monitors the condition $u_j = e + \Delta v_j \in C$. The correctness of this generic monitoring scheme follows directly from the convexity of C, and our earlier observation that the global stream vector v always lies in the convex hull of the $u_j$, j = 1,...,k: If $u_j \in C$ for all nodes j then, by convexity, this convex hull (and, therefore, v) lies completely within C and, therefore, the admissible region (since C ⊆ A). (Note that the convexity of C plays a crucial role in the above correctness argument.)

While the convexity of C is needed for the correctness of the monitoring scheme, it is clear that the size of C plays a critical role in its efficiency: Obviously, a larger C implies fewer local violations and, thus, smaller communication/synchronization overheads. This, in turn, implies a fairly obvious dominance relationship over geometric distributed monitoring schemes: Given two geometric algorithms A1 and A2 (for the same distributed monitoring problem) that use the convex admissible subsets C1 and C2 (respectively), algorithm A1 is provably superior to A2 if C2 ⊂ C1. Note that, in the simple case of linear functions f(), the admissible region A itself is convex, and therefore one can choose C = A; however, for more complicated, non-linear functions, A is non-convex and quite complex. Thus, finding a "large" convex subset of A is a crucial component of effective geometric monitoring. This is, of course, a very difficult optimization problem, in general; furthermore, unlike the GM's generic "bounding ball" geometric constraints, finding an effective convex admissible subset depends very much on the specific function f() being monitored.

Fig. 2 (a) An equivalent construction of $C_{GM}$ by intersecting half-spaces (shaded regions), depicted for two points, $r_1$, $r_2$, on A's boundary. $C_{GM}$ is obtained by intersecting all such half-spaces. (b) Construction of $C_{GM}$ when $\bar{A}$ is the unit disk. Three of the half-spaces whose intersection equals $C_{GM}$ are depicted

Interestingly, the bounding ball constraints of the GM can also be cast in terms of a convex admissible subset (denoted by $C_{GM}$) that can be mathematically shown to be equivalent to the intersection of the (possibly, infinitely many) half-spaces supported by hyperplanes defined by the points at the boundary of the admissible region A (see Fig. 2(a)) [24]. As demonstrated in our recent work [26], while the GM can achieve good results and is generic (i.e., can be directly applied to any monitoring function), its performance can be far from optimal, since its underlying SZ $C_{GM}$ is often far too restrictive. In several practical scenarios, $C_{GM}$ can be drastically improved by intersecting much fewer half-spaces in order to obtain provably larger convex admissible subsets, giving significantly more efficient monitoring schemes.

As a simple example, consider the scenario depicted in Fig. 2(b): The monitored function is $f(v) = \|v\|_2$ with threshold condition $f(v) \ge 1$, which implies that the inadmissible region $\bar{A}$ is the unit disk. Clearly, while the $C_{GM}$ defined by the intersection of the half-spaces for all $r_i$'s on the boundary of the unit disk is a correct SZ, it is also far from optimal: It is easy to see that the (single) half-space $H(p, r_1)$ supported by the hyperplane through $r_1$ is a better SZ, as it satisfies the necessary convexity property and strictly contains $C_{GM}$. In this case, just a single supporting hyperplane suffices to separate the entire inadmissible region $\bar{A}$ from the current estimate/reference point p and, thus, there is no need to further intersect with other half-spaces ($H(p, r_2)$, $H(p, r_3)$, and so on).

In a nutshell, our proposed Convex Decomposition (CD) method [26] generalizes the above observation: it works by identifying convex subsets of the inadmissible region, and using them to define non-redundant collections of half-spaces that separate these subsets from the admissible region. Our CD methodology can be applied to several important approximate query monitoring tasks (e.g., norms, range aggregates, and joins), giving provably larger SZs and substantially better performance than the original GM. Of course, the general problem of determining effective SZs for arbitrary functions remains open.
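A tiny numerical check of the unit-disk example follows (an illustration only; the reference point p and the identification of $r_1$ with the boundary point nearest p are assumptions based on the figure discussion above). The single half-space supported by the tangent hyperplane at $r_1 = p/\|p\|$ contains p and excludes the open unit disk, so it is a valid convex safe zone.

```python
import numpy as np

rng = np.random.default_rng(2)

p = np.array([2.0, 1.0])                 # current estimate/reference point, outside the unit disk
r1 = p / np.linalg.norm(p)               # boundary point of the unit disk closest to p
in_safe_zone = lambda x: x @ r1 >= 1.0   # half-space H(p, r1): tangent hyperplane at r1

# (i) The reference point itself lies in the safe zone.
assert in_safe_zone(p)

# (ii) No point of the inadmissible region (the open unit disk) lies in the safe zone:
#      for ||x|| < 1 we have x . r1 <= ||x|| < 1, so the check below never fires.
angles = rng.uniform(0, 2 * np.pi, 100000)
radii = np.sqrt(rng.uniform(0, 1, 100000)) * 0.999
disk = np.stack([radii * np.cos(angles), radii * np.sin(angles)], axis=1)
assert not np.any(disk @ r1 >= 1.0)

print("H(p, r1) is a convex safe zone: it contains p and excludes the unit disk")
```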

5 Conclusions and Future Directions

We have given a brief introduction to the distributed data streaming model and the Geometric Method (GM), a generic technique for effectively tracking complex queries over massive distributed streams. We have also discussed recently-proposed extensions to the basic GM framework, such as the combination with AMS stream sketches and local prediction models, as well as recent developments leading to a more general theory of Safe Zones for geometric monitoring and interesting connections to convex Euclidean geometry. The GM framework provides a very powerful tool for dealing with continuous query computations over distributed streaming data; see, for instance, [32] for a novel application of the GM to continuous monitoring of skyline queries over fragmented dynamic data.

Continuous distributed streaming is a vibrant, rapidly evolving field of research, and a community of researchers has started forming around theoretical, algorithmic, and systems issues in the area [30]. Naturally, there are several promising directions for future research. First, the single-level hierarchy model (depicted in Fig. 1(a)) is simplistic and also introduces a single point of failure (i.e., the coordinator). Extending the model to general hierarchies is probably not that difficult (even though effectively distributing the error bounds across the internal hierarchy nodes can be challenging [7]); however, extending the ideas to general, scalable distributed architectures (e.g., P2P networks) raises several theoretical and practical challenges. Second, while most of the proposed algorithmic tools have been prototyped and tested with real-life data streams, there is still a need for real system implementations that also address some of the key systems questions that arise (e.g., what functions and query language to support, how to interface to real users and applications, and so on). We have already started implementing some of the geometric monitoring ideas using Twitter's Storm/λ-architecture, and exploiting these ideas for large-scale, distributed Complex Event Processing (CEP) in the context of the FERARI project (www.ferari-project.eu). Finally, from a more foundational perspective, there is a need for developing new models and theories for studying the complexity of such continuous distributed computations. These could build on the models of communication complexity [25] that study the complexity of distributed one-shot computations, perhaps combined with relevant ideas from information theory (e.g., distributed source coding). Some initial results in this direction have recently appeared for the case of simple norms and linear aggregates; see, e.g., [12, 22].

Acknowledgements This work was partially supported by the European Commission under ICT-FP7-FERARI (Flexible Event Processing for Big Data Architectures), www.ferari-project.eu.

References

1. N. Alon, P.B. Gibbons, Y. Matias, M. Szegedy, Tracking join and self-join sizes in limited stor- age, in Proc. of the 18th ACM Symposium on Principles of Database Systems, Philadelphia, Pennsylvania (1999) 2. N. Alon, Y. Matias, M. Szegedy, The space complexity of approximating the frequency mo- ments, in Proc. of the 28th Annual ACM Symposium on the Theory of Computing, Philadelphia, Pennsylvania (1996), pp. 20–29 3. B. Babcock, C. Olston, Distributed top-k monitoring, in Proc. of the 2003 ACM SIGMOD Intl. Conference on Management of Data, San Diego, California (2003) 4. S. Burdakis, A. Deligiannakis, Detecting outliers in sensor networks using the geometric ap- proach, in Proc. of the 28th Intl. Conference on Data Engineering (2012) 5. M. Charikar, K. Chen, M. Farach-Colton, Finding frequent items in data streams, in Proc. of the Intl. Colloquium on Automata, Languages, and Programming, Malaga, Spain (2002) 6. G. Cormode, M. Garofalakis, Streaming in a connected world: querying and tracking dis- tributed data streams, in 2007 ACM SIGMOD Intl Conf. on Management of Data, Beijing, China (2007). Tutorial 7. G. Cormode, M. Garofalakis, Approximate continuous querying over distributed streams. ACM Trans. Database Syst. 33(2) (2008) 8. G. Cormode, M. Garofalakis, P.J. Haas, C. Jermaine, Synopses for massive data: samples, histograms, wavelets, sketches. Found. Trends® Databases 4(1–3) (2012) 9. G. Cormode, M. Garofalakis, S. Muthukrishnan, R. Rastogi, Holistic aggregates in a net- worked world: distributed tracking of approximate quantiles, in Proc. of the 2005 ACM SIG- MOD Intl. Conference on Management of Data, Baltimore, Maryland (2005) 10. G. Cormode, M. Garofalakis, D. Sacharidis, Fast approximate wavelet tracking on streams, in Proc. of the 10th Intl. Conference on Extending Database Technology (EDBT’2006), Munich, Germany (2006) 11. G. Cormode, S. Muthukrishnan, What’s hot and what’s not: tracking most frequent items dynamically, in Proc. of the 22nd ACM Symposium on Principles of Database Systems,San Diego, California (2003), pp. 296–306 12. G. Cormode, S. Muthukrishnan, K. Yi, Algorithms for distributed functional monitoring. ACM Trans. Algorithms 7(2) (2011) 13. C. Cranor, T. Johnson, O. Spatscheck, V. Shkapenyuk, Gigascope: a stream database for net- work applications, in Proc. of the 2003 ACM SIGMOD Intl. Conference on Management of Data, San Diego, California (2003) 14. A. Das, S. Ganguly, M. Garofalakis, R. Rastogi, Distributed set-expression cardinality estima- tion, in Proc. of the 30th Intl. Conference on Very Large Data Bases, Toronto, Canada (2004) 15. A. Dobra, M. Garofalakis, J. Gehrke, R. Rastogi, Processing complex aggregate queries over data streams, in Proc. of the 2002 ACM SIGMOD Intl. Conference on Management of Data, Madison, Wisconsin (2002), pp. 61–72 16. S. Ganguly, M. Garofalakis, R. Rastogi, Processing set expressions over continuous update streams, in Proc. of the 2003 ACM SIGMOD Intl. Conference on Management of Data,San Diego, California (2003) 17. M. Garofalakis, D. Keren, V. Samoladas, Sketch-based geometric monitoring of distributed stream queries, in Proc. of the 39th Intl. Conference on Very Large Data Bases, Trento, Italy (2013) 18. N. Giatrakos, A. Deligiannakis, M. Garofalakis, I. Sharfman, A. Schuster, Prediction-based geometric monitoring of distributed data streams, in Proc. of the 2012 ACM SIGMOD Intl. Conference on Management of Data, Scottsdale, Arizona (2012) 19. N. Giatrakos, A. Deligiannakis, M. Garofalakis, I. Sharfman, A. 
Schuster, Distributed geometric query monitoring using prediction models. ACM Trans. Database Syst. 39(2) (2014) 20. P.B. Gibbons, Distinct sampling for highly-accurate answers to distinct values queries and event reports, in Proc. of the 27th Intl. Conference on Very Large Data Bases, Rome, Italy (2001)

21. M.B. Greenwald, S. Khanna, Space-efficient online computation of quantile summaries, in Proc. of the 2001 ACM SIGMOD Intl. Conference on Management of Data, Santa Barbara, California (2001)
22. Z. Huang, K. Yi, Q. Zhang, Randomized algorithms for tracking distributed count, frequencies, and ranks, in Proc. of the 31st ACM Symposium on Principles of Database Systems (2012)
23. R. Keralapura, G. Cormode, J. Ramamirtham, Communication-efficient distributed monitoring of thresholded counts, in Proc. of the 2006 ACM SIGMOD Intl. Conference on Management of Data, Chicago, Illinois (2006), pp. 289–300
24. D. Keren, I. Sharfman, A. Schuster, A. Livne, Shape-sensitive geometric monitoring. IEEE Trans. Knowl. Data Eng. 24(8) (2012)
25. E. Kushilevitz, N. Nisan, Communication Complexity (Cambridge University Press, Cambridge, 1997)
26. A. Lazerson, I. Sharfman, D. Keren, A. Schuster, M. Garofalakis, V. Samoladas, Monitoring distributed streams using convex decompositions, in Proc. of the 41st Intl. Conference on Very Large Data Bases (2015)
27. S.R. Madden, M.J. Franklin, J.M. Hellerstein, W. Hong, The design of an acquisitional query processor for sensor networks, in Proc. of the 2003 ACM SIGMOD Intl. Conference on Management of Data, San Diego, California (2003)
28. G.S. Manku, R. Motwani, Approximate frequency counts over data streams, in Proc. of the 28th Intl. Conference on Very Large Data Bases, Hong Kong, China (2002), pp. 346–357
29. S. Muthukrishnan, Data streams: algorithms and applications. Found. Trends Theor. Comput. Sci. 1(2) (2005)
30. NII Shonan workshop on large-scale distributed computation, Shonan Village, Japan, January (2012). http://www.nii.ac.jp/shonan/seminar011/
31. C. Olston, J. Jiang, J. Widom, Adaptive filters for continuous queries over distributed data streams, in Proc. of the 2003 ACM SIGMOD Intl. Conference on Management of Data, San Diego, California (2003)
32. O. Papapetrou, M. Garofalakis, Continuous fragmented skylines over distributed streams, in Proc. of the 30th Intl. Conference on Data Engineering, Chicago, Illinois (2014)
33. I. Sharfman, A. Schuster, D. Keren, A geometric approach to monitoring threshold functions over distributed data streams, in Proc. of the 2006 ACM SIGMOD Intl. Conference on Management of Data, Chicago, Illinois (2006), pp. 301–312

Part IV System Architectures and Languages

STREAM: The Stanford Data Stream Management System

Arvind Arasu, Brian Babcock, Shivnath Babu, John Cieslewicz, Mayur Datar, Keith Ito, Rajeev Motwani, Utkarsh Srivastava, and Jennifer Widom

1 Introduction

Traditional database management systems are best equipped to run one-time queries over finite stored data sets. However, many modern applications such as network monitoring, financial analysis, manufacturing, and sensor networks require long-running, or continuous, queries over continuous unbounded streams of data. In the STREAM project at Stanford, we are investigating data management and query processing for this class of applications. As part of the project we are building a general-purpose prototype Data Stream Management System (DSMS), also called STREAM, that supports a large class of declarative continuous queries over continuous streams and traditional stored data sets. The STREAM prototype targets environments where streams may be rapid, stream characteristics and query loads may vary over time, and system resources may be limited. Building a general-purpose DSMS poses many interesting challenges:
• Although we consider streams of structured data records together with conventional stored relations, we cannot directly apply standard relational semantics to complex continuous queries over this data. In Sect. 2, we describe the semantics and language we have developed for continuous queries over streams and relations.
• Declarative queries must be translated into physical query plans that are flexible enough to support optimizations and fine-grained scheduling decisions. Our query plans, composed of operators, queues, and synopses, are described in Sect. 3.

A. Arasu · B. Babcock · S. Babu · J. Cieslewicz · M. Datar · K. Ito · R. Motwani · U. Srivastava · J. Widom (B) Department of Computer Science, Stanford University, Stanford, CA, USA e-mail: [email protected]


• Achieving high performance requires that the DSMS exploit possibilities for shar- ing state and computation within and across query plans. In addition, constraints on stream data (e.g., ordering, clustering, referential integrity) can be inferred and used to reduce resource usage. In Sect. 4, we describe some of these techniques. • Since data, system characteristics, and query load may fluctuate over the lifetime of a single continuous query, an adaptive approach to query execution is essential for good performance. Our continuous monitoring and reoptimization subsystem is described in Sect. 5. • When incoming data rates exceed the DSMS’s ability to provide exact results for the active queries, the system should perform load-shedding by introducing approximations that gracefully degrade accuracy. Strategies for approximation are discussed in Sect. 6. • Due to the long-running nature of continuous queries, DSMS administrators and users require tools to monitor and manipulate query plans as they run. This func- tionality is supported by our graphical interface described in Sect. 7. Many additional problems, including exploiting parallelism and supporting crash recovery, are still under investigation. Future directions are discussed in Sect. 8.

2 The CQL Continuous Query Language

For simple continuous queries over streams, it can be sufficient to use a relational query language such as SQL, replacing references to relations with references to streams, and streaming new tuples in the result. However, as continuous queries grow more complex, e.g., with the addition of aggregation, subqueries, windowing constructs, and joins of streams and relations, the semantics of a conventional relational language applied to these queries quickly becomes unclear [3]. To address this problem, we have defined a formal abstract semantics for continuous queries, and we have designed CQL, a concrete declarative query language that implements the abstract semantics.

2.1 Abstract Semantics

The abstract semantics is based on two data types, streams and relations, which are defined using a discrete, ordered time domain Γ:
• A stream S is an unbounded bag (multiset) of pairs ⟨s, τ⟩, where s is a tuple and τ ∈ Γ is the timestamp that denotes the logical arrival time of tuple s on stream S.
• A relation R is a time-varying bag of tuples. The bag of tuples at time τ ∈ Γ is denoted R(τ), and we call R(τ) an instantaneous relation. Note that our definition of a relation differs from the traditional one, which has no built-in notion of time.

Fig. 1 Data types and operator classes in abstract semantics

The abstract semantics uses three classes of operators over streams and relations:
• A relation-to-relation operator takes one or more relations as input and produces a relation as output.
• A stream-to-relation operator takes a stream as input and produces a relation as output.
• A relation-to-stream operator takes a relation as input and produces a stream as output.
Stream-to-stream operators are absent: they are composed from operators of the above three classes. These three classes are "black box" components of our abstract semantics: the semantics does not depend on the exact operators in these classes, but only on generic properties of each class. Figure 1 summarizes our data types and operator classes.
A continuous query Q is a tree of operators belonging to the above classes. The inputs of Q are the streams and relations that are input to the leaf operators, and the output of Q is the output of the root operator. The output is either a stream or a relation, depending on the class of the root operator. At time τ, an operator of Q logically depends on its inputs up to τ: tuples of Si with timestamps ≤ τ for each input stream Si, and instantaneous relations Rj(τ′), τ′ ≤ τ, for each input relation Rj. The operator produces new outputs corresponding to τ: tuples of S with timestamp τ if the output is a stream S, or the instantaneous relation R(τ) if the output is a relation R. The behavior of query Q is derived from the behavior of its operators in the usual inductive fashion.
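To make the classification concrete, the following Java sketch encodes the two data types and the three operator classes as interfaces. This is purely illustrative: the type and method names are ours and do not correspond to the STREAM prototype's actual code.

import java.util.List;

// Illustrative encoding of the abstract semantics' data types and operator
// classes. All names are hypothetical, chosen for this sketch only.
public final class AbstractSemanticsSketch {
    private AbstractSemanticsSketch() {}

    // A tuple is simply a list of attribute values in this sketch.
    public record Tuple(List<Object> values) {}

    // A stream element: a tuple tagged with its logical arrival timestamp.
    public record StreamElement(Tuple tuple, long timestamp) {}

    // A relation maps each time instant tau to an instantaneous bag of tuples.
    public interface Relation { List<Tuple> instantAt(long timestamp); }

    // An unbounded, timestamp-ordered sequence of stream elements.
    public interface Stream { Iterable<StreamElement> elements(); }

    // The three operator classes; stream-to-stream operators are composed
    // from these rather than defined directly.
    public interface RelationToRelation { Relation apply(List<Relation> inputs); }
    public interface StreamToRelation   { Relation apply(Stream input); }
    public interface RelationToStream   { Stream apply(Relation input); }
}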

2.2 Concrete Language

Our concrete declarative query language, CQL (for Continuous Query Language), is defined by instantiating the operators of our abstract semantics. Syntactically, CQL is a relatively minor extension to SQL.

Relation-to-Relation Operators in CQL

CQL uses SQL constructs to express its relation-to-relation operators, and much of the data manipulation in a typical CQL query is performed using these constructs, exploiting the rich expressive power of SQL.

Stream-to-Relation Operators in CQL

The stream-to-relation operators in CQL are based on the concept of a sliding window [5] over a stream, and are expressed using a window specification language derived from SQL-99:
• A tuple-based sliding window on a stream S takes an integer N > 0 as a parameter and produces a relation R. At time τ, R(τ) contains the N tuples of S with the largest timestamps ≤ τ. It is specified by following S with "[Rows N]." As a special case, "[Rows Unbounded]" denotes the append-only window "[Rows ∞]." (A minimal sketch of this case appears after this list.)
• A time-based sliding window on a stream S takes a time interval ω as a parameter and produces a relation R. At time τ, R(τ) contains all tuples of S with timestamps between τ − ω and τ. It is specified by following S with "[Range ω]." As a special case, "[Now]" denotes the window with ω = 0.
• A partitioned sliding window on a stream S takes an integer N and a set of attributes {A1,...,Ak} of S as parameters, and is specified by following S with "[Partition By A1,...,Ak Rows N]." It logically partitions S into different substreams based on equality of attributes A1,...,Ak, computes a tuple-based sliding window of size N independently on each substream, then takes the union of these windows to produce the output relation.
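As a minimal sketch of the tuple-based case, the Java class below keeps the N most recent tuples of a stream and reports which tuple, if any, expires on each arrival. It illustrates only the semantics of "[Rows N]"; the class and method names are ours, not the system's.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Sketch of a tuple-based sliding window ("[Rows N]") as a stream-to-relation
// operator: after each arrival, the window holds the N most recent tuples.
public class RowsWindow<T> {
    private final int n;
    private final Deque<T> window = new ArrayDeque<>();

    public RowsWindow(int n) { this.n = n; }

    // Process one stream element (elements are assumed to arrive in
    // nondecreasing timestamp order) and return the expired tuple, if any.
    public T insert(T tuple) {
        window.addLast(tuple);
        return (window.size() > n) ? window.removeFirst() : null;
    }

    // The instantaneous relation R(tau) after the most recent arrival.
    public List<T> currentRelation() { return List.copyOf(window); }

    public static void main(String[] args) {
        RowsWindow<Integer> w = new RowsWindow<>(3);
        for (int v : new int[] {10, 20, 30, 40}) {
            Integer expired = w.insert(v);
            System.out.println("insert " + v + " expired=" + expired
                    + " window=" + w.currentRelation());
        }
        // Final line: insert 40 expired=10 window=[20, 30, 40]
    }
}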

Relation-to-Stream Operators in CQL

CQL has three relation-to-stream operators: Istream, Dstream, and Rstream. Istream (for "insert stream") applied to a relation R contains ⟨s, τ⟩ whenever tuple s is in R(τ) − R(τ − 1), i.e., whenever s is inserted into R at time τ. Dstream (for "delete stream") applied to a relation R contains ⟨s, τ⟩ whenever tuple s is in R(τ − 1) − R(τ), i.e., whenever s is deleted from R at time τ. Rstream (for "relation stream") applied to a relation R contains ⟨s, τ⟩ whenever tuple s is in R(τ), i.e., every current tuple in R is streamed at every time instant.
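A small sketch can make these definitions concrete. The Java program below computes, for a single time instant τ, the Istream, Dstream, and Rstream contents from the instantaneous relations R(τ − 1) and R(τ) using bag difference; it is an illustration of the semantics only, with names of our choosing.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the three relation-to-stream operators at one time instant tau.
public class RelationToStreamOps {

    // Bag (multiset) difference a - b, returned as tuple -> multiplicity.
    static <T> Map<T, Integer> bagDifference(List<T> a, List<T> b) {
        Map<T, Integer> counts = new HashMap<>();
        for (T t : a) counts.merge(t, 1, Integer::sum);
        for (T t : b) counts.merge(t, -1, Integer::sum);
        counts.values().removeIf(c -> c <= 0);
        return counts;
    }

    public static void main(String[] args) {
        List<String> prev = List.of("a", "a", "b");   // R(tau - 1)
        List<String> curr = List.of("a", "b", "c");   // R(tau)

        // Istream: tuples inserted at tau, i.e., R(tau) - R(tau - 1) -> {c=1}
        System.out.println("Istream at tau: " + bagDifference(curr, prev));
        // Dstream: tuples deleted at tau, i.e., R(tau - 1) - R(tau) -> {a=1}
        System.out.println("Dstream at tau: " + bagDifference(prev, curr));
        // Rstream: every tuple currently in R(tau)
        System.out.println("Rstream at tau: " + curr);
    }
}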

Example CQL Queries

Example 1 The following continuous query filters a stream S:

Select Istream(*)
From S [Rows Unbounded]
Where S.A > 10

Stream S is converted into a relation by applying an unbounded (append-only) window. The relation-to-relation filter "S.A > 10" acts over this relation, and the inserts to the filtered relation are streamed as the result. CQL includes a number of syntactic shortcuts and defaults for convenience, which permit the above query to be rewritten in the following more intuitive form:

Select * From S Where S.A > 10

Example 2 The following continuous query is a windowed join of two streams S1 and S2:

Select *
From S1 [Rows 1000], S2 [Range 2 Minutes]
Where S1.A = S2.A And S1.A > 10

The answer to this query is a relation. At any given time, the answer relation contains the join (on attribute A, with A > 10) of the last 1000 tuples of S1 with the tuples of S2 that have arrived in the previous 2 minutes. If we prefer instead to produce a stream containing new A values as they appear in the join, we can write "Istream(S1.A)" instead of "*" in the Select clause.

Example 3 The following continuous query probes a stored table R based on each tuple in stream S and streams the result:

Select Rstream(S.A, R.B)
From S [Now], R
Where S.A = R.A

Complete details of CQL including syntax, semantic foundations, syntactic shortcuts and defaults, equivalences, and a comparison against related continuous query languages are given in [3].

3 Query Plans and Execution

When a continuous query specified in CQL is registered with the STREAM system, a query plan is compiled from it. Query plans are composed of operators, which perform the actual processing, queues, which buffer tuples (or references to tuples) as they move between operators, and synopses, which store operator state.

3.1 Operators

Recall from Sect. 2 that there are two fundamental data types in our query language: streams, defined as bags of tuple-timestamp pairs, and relations, defined as time-varying bags of tuples. We unify these two types in our implementation as sequences of timestamped tuples, where each tuple additionally is flagged as either an insertion (+) or a deletion (−). We refer to the tuple-timestamp-flag triples as elements. Streams only include + elements, while relations may include both + and − elements to capture the changing relation state over time.
Queues logically contain sequences of elements representing either streams or relations. Each query plan operator reads from one or more input queues, processes the input based on its semantics, and writes any output to an output queue. Individual operators may materialize their relational inputs in synopses (see Sect. 3.3) if such state is useful.
The operators in the STREAM system that implement the CQL language are summarized in Table 1. In addition, there are several system operators to handle

Table 1 Operators used in STREAM query plans

Name                 Operator type         Description
select               relation-to-relation  Filters elements based on predicate(s)
project              relation-to-relation  Duplicate-preserving projection
binary-join          relation-to-relation  Joins two input relations
mjoin                relation-to-relation  Multiway join from [22]
union                relation-to-relation  Bag union
except               relation-to-relation  Bag difference
intersect            relation-to-relation  Bag intersection
antisemijoin         relation-to-relation  Antisemijoin of two input relations
aggregate            relation-to-relation  Performs grouping and aggregation
duplicate-eliminate  relation-to-relation  Performs duplicate elimination
seq-window           stream-to-relation    Implements time-based, tuple-based, and partitioned windows
i-stream             relation-to-stream    Implements Istream semantics
d-stream             relation-to-stream    Implements Dstream semantics
r-stream             relation-to-stream    Implements Rstream semantics

“housekeeping” tasks such as marshaling input and output and connecting query plans together. During execution, operators are scheduled individually, allowing for fine-grained control over queue sizes and query latencies. Scheduling algorithms are discussed later in Sect. 4.3.
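The element representation and the behavior of a stateless operator can be illustrated with a short sketch. The Java code below defines a tuple-timestamp-flag element and a select operator that drains its input queue and forwards qualifying elements; it is illustrative only and does not mirror the STREAM prototype's internal classes.

import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Predicate;

// Sketch of the element representation (tuple, timestamp, +/- flag) and a
// stateless select operator moving qualifying elements between queues.
public class SelectOperatorSketch {

    enum Sign { PLUS, MINUS }

    record Element<T>(T tuple, long timestamp, Sign sign) {}

    static <T> void runSelect(Queue<Element<T>> in, Queue<Element<T>> out,
                              Predicate<T> predicate) {
        Element<T> e;
        while ((e = in.poll()) != null) {
            // A select operator is stateless: it needs no synopsis, and it
            // forwards both + and - elements whose tuple satisfies the predicate.
            if (predicate.test(e.tuple())) out.add(e);
        }
    }

    public static void main(String[] args) {
        Queue<Element<Integer>> in = new ArrayDeque<>();
        Queue<Element<Integer>> out = new ArrayDeque<>();
        in.add(new Element<>(5, 1, Sign.PLUS));
        in.add(new Element<>(42, 2, Sign.PLUS));
        in.add(new Element<>(42, 7, Sign.MINUS));
        runSelect(in, out, v -> v > 10);   // keep tuples with value > 10
        System.out.println(out);           // only the elements carrying 42 remain
    }
}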

3.2 Queues

A queue in a query plan connects its "producing" plan operator OP to its "consuming" operator OC. At any time a queue contains a (possibly empty) collection of elements representing a portion of a stream or relation. The elements that OP produces are inserted into the queue and buffered there until they are processed by OC.
Many of the operators in our system require that elements on their input queues be read in nondecreasing timestamp order. Consider, for example, a window operator OW on a stream S as described in Sect. 2.2. If OW receives an element ⟨s, τ, +⟩ and its input queue is guaranteed to be in nondecreasing timestamp order, then OW knows it has received all elements with timestamp τ′ < τ, and it can construct the state of the window at time τ − 1. (If timestamps are known to be unique, it can construct the state at time τ.) If, on the other hand, OW does not have this guarantee, it can never be sure it has enough information to construct any window correctly. Thus, we require all queues to enforce nondecreasing timestamps.
Mechanisms for buffering tuples and generating heartbeats to ensure nondecreasing timestamps, without sacrificing correctness or completeness, are discussed in detail in [17].
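A minimal sketch of the nondecreasing-timestamp requirement follows: a queue that simply rejects out-of-order enqueues. The actual system instead buffers elements and uses heartbeats [17]; the class here only illustrates the invariant and uses names of our own choosing.

import java.util.ArrayDeque;

// Sketch: a queue that enforces nondecreasing timestamps on enqueue.
public class MonotoneQueue<T> {
    private final ArrayDeque<T> elements = new ArrayDeque<>();
    private long lastTimestamp = Long.MIN_VALUE;

    public void enqueue(T element, long timestamp) {
        if (timestamp < lastTimestamp) {
            throw new IllegalArgumentException(
                "out-of-order element: " + timestamp + " < " + lastTimestamp);
        }
        lastTimestamp = timestamp;
        elements.addLast(element);
    }

    public T dequeue() { return elements.pollFirst(); }
}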

3.3 Synopses

Logically, a synopsis belongs to a specific plan operator, storing state that may be required for future evaluation of that operator. (In our implementation, synopses are shared among operators whenever possible, as described later in Sect. 4.1.) For example, to perform a windowed join of two streams, the join operator must be able to probe all tuples in the current window on each input stream. Thus, the join operator maintains one synopsis (e.g., a hash table) for each of its inputs. On the other hand, operators such as selection and duplicate-preserving union do not require any synopses.
The most common use of a synopsis in our system is to materialize the current state of a (derived) relation, such as the contents of a sliding window or the relation produced by a subquery. Synopses also may be used to store a summary of the tuples in a stream or relation for approximate query answering, as discussed later in Sect. 6.2. Performance requirements often dictate that synopses (and queues) must be kept in memory, and we tacitly make that assumption throughout this chapter. Our system does support overflow of these structures to disk, although currently it does not implement sophisticated algorithms for minimizing I/O when overflow occurs; see, e.g., [20].

3.4 Example Query Plan

When a CQL query is registered, STREAM constructs a query plan: a tree of operators, connected by queues, with synopses attached to operators as needed. As a simple example, a plan for the query from Example 2 is shown in Fig. 2. The original query is repeated here for convenience:

Select *
From S1 [Rows 1000], S2 [Range 2 Minutes]
Where S1.A = S2.A And S1.A > 10

There are four operators in the example plan: a select, a binary-join, and one instance of seq-window for each input stream. Queues q1 and q2 hold the input stream elements, which could, for example, have been received over the network and placed into queues by a system operator (not depicted). Queue q3, which is the output queue of the (stream-to-relation) operator seq-window, holds elements representing the relation "S1 [Rows 1000]." Queue q4 holds elements for "S2 [Range 2 Minutes]." Queue q5 holds elements for the joined relation "S1 [Rows 1000] ⋈ S2 [Range 2 Minutes]," and from these elements, queue q6 holds the elements passing the select operator. q6 may lead to an output operator sending elements to the application, or to another query plan operator within the system.
The select operator can be pushed down into one or both branches below the binary-join operator, and also below the seq-window operator on S2.

Fig. 2 A simple query plan illustrating operators, queues, and synopses

However, tuple-based windows do not commute with filter conditions, and therefore the select operator cannot be pushed below the seq-window operator on S1.
The plan has four synopses, synopsis1–synopsis4. Each seq-window operator maintains a synopsis so that it can generate "−" elements when tuples expire from the sliding window. The binary-join operator maintains a synopsis materializing each of its relational inputs for use in performing joins with tuples on the opposite input, as described earlier. Since the select operator does not need to maintain any state, it does not have a synopsis.
Note that the contents of synopsis1 and synopsis3 are similar (as are the contents of synopsis2 and synopsis4), since both maintain a materialization of the same window, but at slightly different positions of stream S1. Section 4.1 discusses how we eliminate such redundancy.

3.5 Query Plan Execution

When a query plan is executed, a scheduler selects operators in the plan to execute in turn. The semantics of each operator depends only on the timestamps of the elements it processes, not on system or "wall-clock" time. Thus, the order of execution has no effect on the data in the query result, although it can affect other properties such as latency and resource utilization. Scheduling is discussed further in Sect. 4.3.
Continuing with our example from the previous section, the seq-window operator on S1, on being scheduled, reads stream elements from q1. Suppose it reads an element ⟨s, τ, +⟩. It inserts tuple s into synopsis1, and if the window is full (i.e., the synopsis already contains 1000 tuples), it removes the earliest tuple s′ in the synopsis. It then writes output elements into q3: the element ⟨s, τ, +⟩ to reflect the addition of s to the window, and the element ⟨s′, τ, −⟩ to reflect the deletion of s′ as it exits the window. Both of these events occur logically at the same time instant τ. The other seq-window operator is analogous.
When scheduled, the binary-join operator reads the earliest element across its two input queues. If it reads an element ⟨s, τ, +⟩ from q3, then it inserts s into synopsis3 and joins s with the contents of synopsis4, generating output elements ⟨s·t, τ, +⟩ for each matching tuple t in synopsis4. Similarly, if the binary-join operator reads an element ⟨s, τ, −⟩ from q3, it generates ⟨s·t, τ, −⟩ for each matching tuple t in synopsis4. A symmetric process occurs for elements read from q4. In order to ensure that the timestamps of its output elements are nondecreasing, the binary-join operator must process its input elements in nondecreasing timestamp order across both inputs.
Since the select operator is stateless, it simply dequeues elements from q5, tests the tuple against its selection predicate, and enqueues the identical element into q6 if the test passes, discarding it otherwise.
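The following Java sketch illustrates the binary-join processing just described: each side materializes its input in a synopsis, and every +/− element is joined against the opposite synopsis. It is a simplified illustration (tuples are strings, the synopses are plain lists, and merging the two input queues by timestamp is omitted); the names are ours, not the prototype's.

import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

// Sketch of binary-join processing over +/- elements with two synopses.
public class BinaryJoinSketch {

    enum Sign { PLUS, MINUS }
    record Element(String tuple, long timestamp, Sign sign) {}

    private final List<String> leftSynopsis = new ArrayList<>();
    private final List<String> rightSynopsis = new ArrayList<>();
    private final BiPredicate<String, String> joinPredicate;

    BinaryJoinSketch(BiPredicate<String, String> joinPredicate) {
        this.joinPredicate = joinPredicate;
    }

    // Process one element arriving on the left (fromLeft = true) or right input.
    List<Element> process(Element e, boolean fromLeft) {
        List<String> own = fromLeft ? leftSynopsis : rightSynopsis;
        List<String> other = fromLeft ? rightSynopsis : leftSynopsis;
        if (e.sign() == Sign.PLUS) own.add(e.tuple()); else own.remove(e.tuple());

        List<Element> out = new ArrayList<>();
        for (String t : other) {
            if (joinPredicate.test(e.tuple(), t)) {
                // String concatenation stands in for building the joined tuple s.t
                out.add(new Element(e.tuple() + "|" + t, e.timestamp(), e.sign()));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Join on equality of the first character, standing in for S1.A = S2.A.
        BinaryJoinSketch join =
                new BinaryJoinSketch((a, b) -> a.charAt(0) == b.charAt(0));
        System.out.println(join.process(new Element("a1", 1, Sign.PLUS), true));  // no match yet
        System.out.println(join.process(new Element("a2", 2, Sign.PLUS), false)); // + joined tuple
        System.out.println(join.process(new Element("a1", 5, Sign.MINUS), true)); // - joined tuple
    }
}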

4 Performance Issues

In the previous section, we introduced the basic architecture of our query processing engine. However, simply generating the straightforward query plans and executing them as described can be very inefficient. In this section, we discuss ways in which we improve the performance of our system by eliminating data redundancy (Sect. 4.1), selectively discarding data that will not be used (Sect. 4.2), and scheduling operators to most efficiently reduce intermediate state (Sect. 4.3).

4.1 Synopsis Sharing

In Sect. 3.4, we observed that multiple synopses within a single query plan may materialize nearly identical relations. In Fig. 2, synopsis1 and synopsis3 are an example of such a pair. We eliminate this redundancy by replacing the two synopses with lightweight stubs, and a single store to hold the actual tuples. These stubs implement the same interfaces as non-shared synopses, so operators can be oblivious to the details of sharing. As a result, synopsis sharing can be enabled or disabled on the fly.
Since operators are scheduled independently, it is likely that operators sharing a single synopsis store will require slightly different views of the data. For example, if queue q3 in Fig. 2 contains 10 elements, then synopsis3 will not reflect these changes (since the binary-join operator has not yet processed them), although synopsis1 will. When synopses are shared, logic in the store tracks the progress of each stub, and presents the appropriate view (subset of tuples) to each of the stubs.

Fig. 3 A query plan illustrating synopsis sharing

Clearly, the store must contain the union of its corresponding stubs: A tuple is inserted into the store as soon as it is inserted by any one of the stubs, and it is removed only when it has been removed from all of the stubs.
To further decrease state redundancy, multiple query plans involving similar intermediate relations can share synopses as well. For example, suppose the following query is registered in addition to the query in Sect. 3.4:

Select A, Max(B) From S1 [Rows 200] Group By A

Since sliding windows are contiguous in our system, the window on S1 in this query is a subset of the window on S1 in the other query. Thus, the same data store can be used to materialize both windows. The combination of the two query plans with both types of sharing is illustrated in Fig. 3.
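The stub-and-store idea can be sketched as follows. In this illustrative Java code (names and representation are ours, and the store's per-stub progress tracking is simplified to explicit insert/remove calls by each stub), a tuple stays in the shared store as long as at least one stub still holds it, and each stub sees only its own view.

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of a shared synopsis store with lightweight per-operator stubs.
public class SharedSynopsisStore<T> {

    // Maps each stored tuple to the set of stubs currently holding it.
    private final Map<T, Set<Stub>> holders = new HashMap<>();

    public class Stub {
        public void insert(T tuple) {
            holders.computeIfAbsent(tuple, t -> new HashSet<>()).add(this);
        }
        public void remove(T tuple) {
            Set<Stub> s = holders.get(tuple);
            if (s == null) return;
            s.remove(this);
            if (s.isEmpty()) holders.remove(tuple);  // removed from all stubs
        }
        // The view this stub's operator sees.
        public List<T> view() {
            return holders.entrySet().stream()
                    .filter(e -> e.getValue().contains(this))
                    .map(Map.Entry::getKey).toList();
        }
    }

    public Stub newStub() { return new Stub(); }
    public int storedTuples() { return holders.size(); }

    public static void main(String[] args) {
        SharedSynopsisStore<String> store = new SharedSynopsisStore<>();
        SharedSynopsisStore<String>.Stub a = store.newStub(), b = store.newStub();
        a.insert("t1"); b.insert("t1"); b.insert("t2");
        a.remove("t1");                      // t1 is still held by stub b
        System.out.println(store.storedTuples() + " tuples stored");   // 2
        System.out.println("a sees " + a.view() + ", b sees " + b.view());
    }
}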

4.2 Exploiting Constraints

Streams may exhibit certain data or arrival patterns that can be exploited to reduce run-time synopsis sizes. Such constraints can either be specified explicitly at stream-registration time, or inferred by gathering statistics over time [6]. (An alternate and more dynamic technique is for the streams to contain punctuations, which specify run-time constraints that also can be used to reduce resource requirements [21].)
As a simple example, consider a continuous query that joins a stream Orders with a stream Fulfillments based on attributes orderID and itemID, perhaps to monitor average fulfillment delays. In the general case, answering this query precisely requires synopses of unbounded size [2]. However, if we know that all elements for a given orderID and itemID arrive on Orders before the corresponding elements arrive on Fulfillments, then we need not maintain a join synopsis for the Fulfillments operand at all. Furthermore, if Fulfillments elements arrive clustered by orderID, then we need only save Orders tuples for a given orderID until the next orderID is seen.
We have identified several types of useful constraints over data streams. Effective optimizations can be made even when the constraints are not strictly met, by defining an adherence parameter, k, that captures how closely a given stream or pair of streams adheres to a constraint of that type. We refer to these as k-constraints:
• A referential integrity k-constraint on a many-one join between streams defines a bound k on the delay between the arrival of a tuple on the "many" stream and the arrival of its joining "one" tuple on the other stream.
• An ordered-arrival k-constraint on a stream attribute S.A defines a bound k on the amount of reordering in values of S.A. Specifically, given any tuple s in stream S, for all tuples s′ that arrive at least k + 1 elements after s, it must be true that s′.A ≥ s.A. (A sketch of how such a bound can be measured appears after this list.)
• A clustered-arrival k-constraint on a stream attribute S.A defines a bound k on the distance between any two elements that have the same value of S.A.
We have developed query plan construction and execution algorithms that take stream constraints into account in order to reduce synopsis sizes at query operators by discarding unnecessary state [9]. The smaller the value of k for each constraint, the more state that can be discarded. Furthermore, if an assumed k-constraint is not satisfied by the data, our algorithm produces an approximate answer whose error is proportional to the degree of deviation of the data from the constraint.
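As an illustration of what an ordered-arrival k-constraint measures, the Java sketch below computes, for a finite prefix of attribute values in arrival order, the smallest k for which the constraint holds. The quadratic scan is for clarity only; a profiler would maintain such an estimate incrementally, and the names here are our own.

// Sketch: smallest k such that any element arriving at least k+1 positions
// after s has an attribute value >= s.A.
public class OrderedArrivalK {

    static int measureK(int[] values) {
        int k = 0;
        for (int i = 0; i < values.length; i++) {
            for (int j = i + 1; j < values.length; j++) {
                if (values[j] < values[i]) {
                    // An out-of-order pair is only allowed if j - i <= k.
                    k = Math.max(k, j - i);
                }
            }
        }
        return k;
    }

    public static void main(String[] args) {
        System.out.println(measureK(new int[] {1, 2, 3, 4}));  // 0: fully sorted
        System.out.println(measureK(new int[] {1, 3, 2, 4}));  // 1: adjacent swap
        System.out.println(measureK(new int[] {5, 1, 2, 3}));  // 3: 1 arrives 3 positions after 5
    }
}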

4.3 Operator Scheduling

An operator consumes elements from its input queues and produces elements on its output queue. Thus, the global operator scheduling policy can have a large effect on memory utilization, particularly with bursty input streams.
Consider the following simple example. Suppose we have a query plan with two operators, O1 followed by O2. Assume that O1 takes one time unit to process a batch of n elements, and it produces 0.2n output elements per input batch (i.e., its selectivity is 0.2). Further, assume that O2 takes one time unit to operate on 0.2n elements, and it sends its output out of the system. (As far as the system is concerned, O2 produces no elements, and therefore its selectivity is 0.) Consider the following bursty arrival pattern: n elements arrive at every time instant from t = 0 to t = 6, then no elements arrive from time t = 7 through t = 13. Under this scenario, consider the following scheduling strategies:
• FIFO scheduling. When batches of n elements have been accumulated, they are passed through both operators in two consecutive time units, during which no other element is processed.

• Greedy scheduling. At any time instant, if there is a batch of n elements buffered before O1, it is processed in one time unit. Otherwise, if there are more than 0.2n elements buffered before O2, then 0.2n elements are processed using one time unit. This strategy is “greedy” since it gives preference to the operator that has the greatest rate of reduction in total queue size per unit time. The following table shows the expected total queue size for each strategy, where each table entry is a multiplier for n:

Time               0    1    2    3    4    5    6    Avg
FIFO scheduling    1.0  1.2  2.0  2.2  3.0  3.2  4.0  2.4
Greedy scheduling  1.0  1.2  1.4  1.6  1.8  2.0  2.2  1.6

After time t = 6, input queue sizes for both strategies decline until they reach 0 after time t = 13. The greedy strategy performs better because it runs O1 whenever it has input, reducing queue size by 0.8n elements each time step, while the FIFO strategy alternates between executing O1 and O2.
However, the greedy algorithm has its shortcomings. Consider a plan with operators O1, O2, and O3. O1 produces 0.9n elements per n input elements in one time unit, O2 processes 0.9n elements in one time unit without changing the input size (i.e., it has selectivity 1), and O3 processes 0.9n elements in one time unit and sends its output out of the system (i.e., it has selectivity 0). Clearly, the greedy algorithm will prioritize O3 first, followed by O1, and then O2. If we consider the arrival pattern in the previous example, then our total queue size is as follows (again as multipliers for n):

Time               0    1    2    3    4    5    6    Avg
FIFO scheduling    1.0  1.9  2.9  3.0  3.9  4.9  5.0  3.2
Greedy scheduling  1.0  1.9  2.8  3.7  4.6  5.5  6.4  3.7

In this case, the FIFO algorithm is better. Under the greedy strategy, although O3 has highest priority, sometimes it is "blocked" from running because it is preceded by O2, the operator with the lowest priority. If O1, O2 and O3 are viewed as a single block, then together they reduce n elements to zero elements over three units of time, for an average reduction of 0.33n elements per unit time, which is better than the reduction rate of 0.1n elements that O1 alone provides. Since the greedy algorithm considers individual operators only, it does not take advantage of this fact.
This observation forms the basis of our chain scheduling algorithm [4]. Our algorithm forms blocks ("chains") of operators as follows: Start by marking the first operator in the plan as the "current" operator. Next, find the block of consecutive operators starting at the "current" operator that maximizes the reduction in total queue size per unit time. Mark the first operator following this block as the "current" operator and repeat the previous step until all operators have been assigned to chains.

Fig. 4 Adaptive query processing

Chains are scheduled according to the greedy algorithm, but within a chain, execution proceeds in FIFO order. In terms of overall memory usage, this strategy is provably close to the optimal "clairvoyant" scheduling strategy, i.e., the optimal strategy based on knowledge of future input [4].
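The chain-formation step described above can be sketched directly. The Java code below follows the textual description (maximize the reduction in total queue size per unit time over blocks of consecutive operators); it simplifies tie-breaking and other details of the actual Chain algorithm [4], and the operator model (one processing time and one selectivity per operator) is our own.

import java.util.ArrayList;
import java.util.List;

// Sketch of chain formation: greedily take the consecutive block, starting at
// the current operator, with the largest reduction in queue size per unit time.
public class ChainFormation {

    // Operator i consumes its input batch in time[i] units and scales the
    // batch size by selectivity[i] (0 means its output leaves the system).
    static List<List<Integer>> formChains(double[] time, double[] selectivity) {
        int n = time.length;
        double[] size = new double[n + 1];   // batch size entering operator i
        size[0] = 1.0;
        for (int i = 0; i < n; i++) size[i + 1] = size[i] * selectivity[i];

        List<List<Integer>> chains = new ArrayList<>();
        int current = 0;
        while (current < n) {
            int bestEnd = current;
            double bestRate = -1, t = 0;
            for (int end = current; end < n; end++) {
                t += time[end];
                double rate = (size[current] - size[end + 1]) / t;
                if (rate > bestRate) { bestRate = rate; bestEnd = end; }
            }
            List<Integer> chain = new ArrayList<>();
            for (int i = current; i <= bestEnd; i++) chain.add(i + 1);  // 1-based operator names
            chains.add(chain);
            current = bestEnd + 1;
        }
        return chains;
    }

    public static void main(String[] args) {
        // First example from the text: O1 has selectivity 0.2, O2 drops everything.
        System.out.println(formChains(new double[] {1, 1}, new double[] {0.2, 0}));
        // prints [[1], [2]]: O1 and O2 form separate chains
        // Second example: O1 = 0.9, O2 = 1.0, O3 = 0 -- the whole plan is one chain.
        System.out.println(formChains(new double[] {1, 1, 1}, new double[] {0.9, 1.0, 0}));
        // prints [[1, 2, 3]]
    }
}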

5 Adaptivity

In long-running stream applications, data and arrival characteristics of streams may vary significantly over time [13]. Query loads and system conditions may change as well. Without an adaptive approach to query processing, performance may drop drastically over time as the environment changes. The STREAM system includes a monitoring and adaptive query processing infrastructure called StreaMon [10].
StreaMon has three components, as shown in Fig. 4(a): an Executor, which runs query plans to produce results, a Profiler, which collects and maintains statistics about stream and plan characteristics, and a Reoptimizer, which ensures that the plans and memory structures are the most efficient for current characteristics. In many cases, we combine the profiler and executor to reduce the monitoring overhead. The Profiler and Reoptimizer are essential for adaptivity, but they compete for resources with the Executor. We have identified a clear three-way tradeoff among run-time overhead, speed of adaptivity, and provable convergence to good strategies if conditions stabilize. StreaMon supports multiple adaptive algorithms that lie at different points along this tradeoff spectrum.
StreaMon can detect useful k-constraints (recall Sect. 4.2) in streams and exploit them to reduce memory requirements for many continuous queries. In addition, it can adaptively adjust the adherence parameter k based on the actual data in the streams. Figure 4(b) shows the portions of StreaMon's Profiler and Reoptimizer that handle k-constraints, referred to as k-Mon. When a query is registered, the optimizer notifies the profiler of potentially useful constraints. As the executor runs the query, the profiler monitors the input streams continuously and informs the reoptimizer whenever it detects a change in a k value for any of these constraints. The reoptimizer component adapts to these changes by adding or dropping constraints used by the executor and adjusting k values used for memory allocation.
StreaMon also implements an algorithm called Adaptive Greedy (or A-Greedy) [7], which maintains join orders adaptively for pipelined multiway stream joins, also known as MJoins [22]. Figure 4(c) shows the portions of StreaMon's Profiler and Reoptimizer that comprise the A-Greedy algorithm. Using A-Greedy, StreaMon monitors conditional selectivities and orders stream joins to minimize overall work in current conditions. In addition, StreaMon detects when changes in conditions may have rendered current orderings suboptimal, and reorders in those cases. In stable conditions, the orderings converged on by the A-Greedy algorithm are equivalent to those selected by a static Greedy algorithm that is provably within a cost factor < 4 of optimal. In practice, the Greedy algorithm, and therefore A-Greedy, nearly always finds the optimal orderings.
In addition to adaptive join ordering, we use StreaMon to adaptively add and remove subresult caches in stream join plans, to avoid recomputation of intermediate results [8]. StreaMon monitors costs and benefits of candidate caches, selects caches to use, allocates memory to caches, and adapts over the entire spectrum between stateless MJoins and cache-rich join trees, as stream and system conditions change.
Currently we are in the process of applying the StreaMon approach to make even more aspects of the STREAM system adaptive, including sharing of synopses and subplans, and operator scheduling.

6 Approximation

In many applications data streams can be bursty, with unpredictable peaks during which the load may exceed available system resources, especially if numerous complex queries have been registered. Fortunately, for many stream applications (e.g., in many monitoring tasks), it is acceptable to degrade accuracy gracefully by providing approximate answers during load spikes [18]. There are two primary ways in which a DSMS may be resource-limited:
• CPU-limited (Sect. 6.1): The data arrival rate may be so high that there is insufficient CPU time to process each stream element. In this case, the system may approximate by dropping elements before they are processed.
• Memory-limited (Sect. 6.2): The total state required for all registered queries may exceed available memory. In this case, the system may selectively retain some state, discarding the rest.

6.1 CPU-Limited Approximation

CPU usage can be reduced by load-shedding: dropping elements from query plans and saving the CPU time that would be required to process them to completion. We implement load-shedding by introducing sampling operators that probabilistically drop stream elements as they are input to the query plan.
The time-accuracy tradeoffs for sampling are more understandable for some query plans than others. For example, if we know a few basic statistics on the distribution of values in our streams, probabilistic guarantees on the accuracy of sliding-window aggregation queries for a given sampling rate can be derived mathematically, as we will show below. However, in more complex queries (ones involving joins, for example), the error introduced by sampling is less clear and the choice of error metric may be application-dependent.
Suppose we have a set of sliding-window aggregation queries over the input streams. A simple example is

Select Avg(Temp) From SensorReadings [Range 5 Minutes]

If we have many such queries in a CPU-limited setting, our goal is to sample the inputs so as to minimize the maximum relative error across all queries. (As an extension, we can weight the relative errors to provide "quality-of-service" distinctions.) It follows that we should select sampling rates such that the relative error is the same for all queries.
Assume that for a given query Qi we know the mean μi and standard deviation σi of the values we are aggregating, as well as the window size Ni. These statistics can be collected by the profiler component in the StreaMon architecture (recall Sect. 5). We can use the Hoeffding inequality [16] to derive a bound on the probability δ that our relative error exceeds a given threshold ε_max for a given sampling rate. We then fix δ at a low value (e.g., 0.01) and algebraically manipulate this equation to derive the required sampling rate Pi [6]:

    Pi = (1 / ε_max²) · ((σi² + μi²) / (2 Ni μi²)) · 2 log(2/δ)

Our load-shedding policy solves for the best achievable ε_max given the constraint that the system, after inserting load-shedders, can keep up with the arrival of elements. It then adds sampling operators at various points in the query plan such that the effective sampling rate for a query Qi is Pi.
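The following Java sketch evaluates the sampling rate using the formula as reconstructed above, capping the result at 1. The statistics passed in main are hypothetical, and the code illustrates only the calculation, not the system's load-shedding component.

// Sketch: required sampling rate P_i for a sliding-window aggregate, given
// mean mu, standard deviation sigma, window size N, target relative error
// eps_max, and failure probability delta (formula as reconstructed above).
public class LoadSheddingRate {

    static double samplingRate(double mu, double sigma, double windowSize,
                               double epsMax, double delta) {
        double rate = (1.0 / (epsMax * epsMax))
                * ((sigma * sigma + mu * mu) / (2.0 * windowSize * mu * mu))
                * 2.0 * Math.log(2.0 / delta);
        return Math.min(1.0, rate);   // a rate above 1 means no shedding is needed
    }

    public static void main(String[] args) {
        // Hypothetical statistics for a large window of temperature readings.
        double p = samplingRate(20.0, 5.0, 1_000_000, 0.01, 0.01);
        System.out.printf("required sampling rate P_i = %.4f%n", p);  // about 0.056
    }
}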

6.2 Memory-Limited Approximation

Even using our scheduling algorithm that minimizes memory devoted to queues (Sect. 4.3), and our constraint-aware execution strategy that minimizes synopsis sizes (Sect. 4.2), if we have many complex queries with large windows (e.g., large tuple-based windows, or any size time-based windows over rapid data streams), memory may become a constraint. Spilling to disk may not be a feasible option due to online performance requirements.
In this scenario, memory usage can be reduced at the cost of accuracy by reducing the size of synopses at one or more operators. Incorporating a window into a synopsis where no window is being used, or shrinking the existing window, will shrink the synopsis. Note that if sharing is in place (Sect. 4.1), then modifying a single synopsis may affect multiple queries.
Reducing the size of a synopsis generally tends to also reduce the sizes of synopses above it in the query plan, but there are exceptions. Consider a query plan where a sliding-window synopsis is used by a duplicate-elimination operator. Shrinking the window size can increase the operator's output rate, leading to an increase in the size of "later" synopses. Fortunately, most of these cases can be detected statically when the query plan is generated, and the system can avoid reducing synopsis sizes in such cases.
There are other methods for reducing synopsis size, including maintaining a sample of the intended synopsis content (which is not always equivalent to inserting a sample operator into the query plan), using histograms [19] or wavelets [12] when the synopsis is used for aggregation or even for a join, and using Bloom filters [11] for duplicate elimination, set difference, or set intersection. In addition, synopsis sizes can be reduced by lowering the k values for known k-constraints (Sect. 4.2). Lower k values cause more state to be discarded, but result in loss of accuracy if the constraint does not hold for the assumed k. All of these techniques share the property that memory use is flexible, and it can be traded against precision statically or dynamically. See Sect. 8.3 for discussion of future directions related to approximation.

7 The STREAM System Interface

In a system for continuous queries, it is important for users, system administrators, and system developers to have the ability to inspect the system while it is running and to experiment with adjustments. To meet these needs, we have developed a graphical query and system visualizer for the STREAM system. The visualizer allows the user to:
• View the structure of query plans and their component entities (operators, queues, and synopses). Users can view the path of data flow through each query plan as well as the sharing of computation and state within the plan.
• View the detailed properties of each entity. For example, the user can inspect the amount of memory being used (for queue and synopsis entities), the current throughput (for queue and operator entities), selectivity of predicates (for operator entities), and other properties.
• Dynamically adjust entity properties. These changes are reflected in the system in real time. For example, an administrator may choose to increase the size of a queue to better handle bursty arrival patterns.
• View monitoring graphs that display time-varying entity properties such as queue sizes, throughput, overall memory usage, and join selectivity, plotted dynamically against time.
A screenshot of our visualizer is shown in Fig. 5. The large pane at the left displays a graphical representation of a currently selected query plan.

Fig. 5 Screenshot of the STREAM visualizer

The particular query shown is a windowed join over two streams, R and S. Each entity in the plan is represented by an icon: the ladder-shaped icons are queues, the boxes with magnifying glasses over them are synopses, the window panes are windowing operators, and so on. In this example, the user has added three monitoring graphs: the rate of element flow through queues above and below the join operator, and the selectivity of the join.
The upper-right pane displays the property-value table for a currently selected entity. The user can inspect this list and can alter the values of some of the properties interactively. Finally, the lower-right pane displays a legend of entity icons and descriptions for reference.
Our technique for implementing the monitoring graphs shown in Fig. 5 is based on introspection queries on a special system stream called SysStream. Every entity can publish any of its property values at any time onto SysStream. When a specific dynamic monitoring task is desired, e.g., monitoring recent join selectivity, the relevant entity writes its statistics periodically on SysStream. Then a standard CQL query, typically a windowed aggregation query, is registered over SysStream to compute the desired continuous result, which is fed to the monitoring graph in the visualizer. Users and applications can also register arbitrary CQL queries over SysStream for customized monitoring tasks.

8 Future Directions

At the time of writing we plan to pursue the following general directions of future work.

8.1 Distributed Stream Processing

So far we have considered a centralized DSMS model where all processing takes place at a single system. In many applications, the stream data is actually produced at distributed sources. Moving some processing to the sources instead of moving all data to a central system may lead to more efficient use of processing and network resources. Many new challenges arise if we wish to build a fully distributed data stream system with capabilities equivalent to our centralized system.

8.2 Crash Recovery

The ability to recover to a consistent state following a system crash is a key feature of conventional database systems, but has yet to be investigated for data stream systems. There are some fundamental differences between DBMSs and DSMSs that play an important role in crash recovery:
• The notion of consistent state in a DBMS is defined based on transactions, which are closely tied to the conventional one-time query model. ACID transactional properties do not map directly to the continuous query paradigm.
• In a DBMS, the data in the database cannot change during down-time. In contrast, many stream applications deliver data to the DSMS from outside sources that do not stop generating data while the system is down, possibly requiring the DSMS to "catch up" following a crash.
• In a DBMS, queries underway at the time of a crash may be forgotten; it is the responsibility of the application to restart them. In contrast, registered continuous queries are part of the persistent state of a DSMS.
These differences lead us to believe that new mechanisms are needed for crash recovery in data stream systems. While logging of some type and perhaps even some notion of transactions may form a component of the solution, new techniques will be required as well.

8.3 Improved Approximation

Although some aspects of the approximation problem have already been addressed (see Sect. 6), more work is needed to address the problem in its full generality. In the memory-limited case, work is needed on the problem of sampling over arbitrary subqueries, computing "maximum-subset" as opposed to sampling approximations, and maximizing accuracy over multiple weighted queries. In the CPU-limited case, we need to address a broader range of queries, especially considering joins. Finally, we need to handle situations when the DSMS may be both CPU- and memory-limited.

A significant challenge related to approximation is developing mechanisms whereby the system can indicate to users or applications that approximation is occurring, and to what degree. The converse is also important: mechanisms for users to indicate acceptable degrees of approximation. As one step in the latter direction, we are developing extensions to CQL that enable the specification of "approximation guidelines" so that the user can indicate acceptable tolerances and priorities.

8.4 Relationship to Publish–Subscribe Systems

In a publish–subscribe (pub–sub) system (see, e.g., [1, 14, 15]), events may be published continuously, and they are forwarded by the system to users who have registered matching subscriptions. Clearly, we can map a pub–sub system to a DSMS by considering publications as streams and subscriptions as continuous queries. However, the techniques we have developed so far for processing continuous queries in a DSMS have been geared primarily toward a relatively small number of independent, complex queries, while a pub–sub system has potentially millions of simple, similar queries. We are exploring techniques to bridge the capabilities of the two: from the pub–sub perspective, provide a system that supports a more general model of subscriptions; from the DSMS perspective, extend our approach to scale to an extremely large number of queries.

References

1. M.K. Aguilera, R.E. Strom, D.C. Sturman, M. Astley, T.D. Chandra, Matching events in a content-based subscription system, in Proc. of the 18th Annual ACM Symp. on Principles of Distributed Computing (1999), pp. 53–61
2. A. Arasu, B. Babcock, S. Babu, J. McAlister, J. Widom, Characterizing memory requirements for queries over continuous data streams. ACM Trans. Database Syst. 29(1), 1–33 (2004)
3. A. Arasu, S. Babu, J. Widom, The CQL continuous query language: semantic foundations and query execution. VLDB J. 15(2), 121–142 (2006)
4. B. Babcock, S. Babu, M. Datar, R. Motwani, Chain: operator scheduling for memory minimization in data stream systems, in Proc. of the 2003 ACM SIGMOD Intl. Conf. on Management of Data (2003), pp. 253–264
5. B. Babcock, S. Babu, M. Datar, R. Motwani, J. Widom, Models and issues in data stream systems, in Proc. of the 21st ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems (2002), pp. 1–16
6. B. Babcock, M. Datar, R. Motwani, Load shedding for aggregation queries over data streams, in Proc. of the 20th Intl. Conf. on Data Engineering (2004)
7. S. Babu, R. Motwani, K. Munagala, I. Nishizawa, J. Widom, Adaptive ordering of pipelined stream filters, in Proc. of the 2004 ACM SIGMOD Intl. Conf. on Management of Data (2004)
8. S. Babu, K. Munagala, J. Widom, R. Motwani, Adaptive caching for continuous queries, in Proc. of the 21st Intl. Conf. on Data Engineering (2005), pp. 118–129
9. S. Babu, U. Srivastava, J. Widom, Exploiting k-constraints to reduce memory overhead in continuous queries over data streams. ACM Trans. Database Syst. 29(3), 545–580 (2004)
10. S. Babu, J. Widom, StreaMon: an adaptive engine for stream query processing, in Proc. of the 2004 ACM SIGMOD Intl. Conf. on Management of Data (2004). Demonstration description

11. B.H. Bloom, Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7), 422–426 (1970)
12. K. Chakrabarti, M.N. Garofalakis, R. Rastogi, K. Shim, Approximate query processing using wavelets, in Proc. of the 26th Intl. Conf. on Very Large Data Bases (2000), pp. 111–122
13. J. Gehrke (ed.), Data stream processing. IEEE Comput. Soc. Bull. Technical Comm. Database Eng. 26(1) (2003)
14. F. Fabret, H.-A. Jacobsen, F. Llirbat, J. Pereira, K.A. Ross, D. Shasha, Filtering algorithms and implementation for very fast publish/subscribe, in Proc. of the 2000 ACM SIGMOD Intl. Conf. on Management of Data (2001), pp. 115–126
15. R.E. Gruber, B. Krishnamurthy, E. Panagos, READY: a high performance event notification system, in Proc. of the 16th Intl. Conf. on Data Engineering (2000), pp. 668–669
16. W. Hoeffding, Probability inequalities for sums of bounded random variables. J. Am. Stat. Soc. 58(301), 13–30 (1963)
17. U. Srivastava, J. Widom, Flexible time management in data stream systems, in Proc. of the 23rd ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems (2004)
18. N. Tatbul, U. Cetintemel, S.B. Zdonik, M. Cherniak, M. Stonebraker, Load shedding in a data stream manager, in Proc. of the 29th Intl. Conf. on Very Large Data Bases (2003), pp. 309–320
19. N. Thaper, S. Guha, P. Indyk, N. Koudas, Dynamic multidimensional histograms, in Proc. of the 2002 ACM SIGMOD Intl. Conf. on Management of Data (2002), pp. 428–439
20. D. Thomas, R. Motwani, Caching queues in memory buffers, in Proc. of the 15th Annual ACM-SIAM Symp. on Discrete Algorithms (2004)
21. P.A. Tucker, D. Maier, T. Sheard, L. Fegaras, Exploiting punctuation semantics in continuous data streams. IEEE Trans. Knowl. Data Eng. 15(3), 555–568 (2003)
22. S. Viglas, J.F. Naughton, J. Burger, Maximizing the output rate of multi-way join queries over streaming information sources, in Proc. of the 29th Intl. Conf. on Very Large Data Bases (2003), pp. 285–296

The Aurora and Borealis Stream Processing Engines

Uğur Çetintemel, Daniel Abadi, Yanif Ahmad, Hari Balakrishnan, Magdalena Balazinska, Mitch Cherniack, Jeong-Hyon Hwang, Samuel Madden, Anurag Maskey, Alexander Rasin, Esther Ryvkina, Mike Stonebraker, Nesime Tatbul, Ying Xing, and Stan Zdonik

1 Introduction and History

Over the last several years, a great deal of progress has been made in the area of stream-processing engines (SPEs) [9, 11, 17]. Three basic tenets distinguish SPEs from current data processing engines. First, they must support primitives for streaming applications. Unlike Online Transaction Processing (OLTP), which pro- cesses messages in isolation, streaming applications entail time series operations on streams of messages. Although a time series “blade” was added to the Illustra Object-Relational DBMS, generally speaking, time series operations are not well supported by current DBMSs. Second, streaming applications entail a real-time component. If one is content to see an answer later, then one can store incoming messages in a data warehouse and run a historical query on the warehouse to find information of interest. This tactic does not work if the answer must be constructed in real time. The need for real-time answers also dictates a fundamentally differ- ent storage architecture. DBMSs universally store and index data records before making them available for query activity. Such outbound processing, where data are stored before being processed, cannot deliver real-time latency, as required by SPEs. To meet more stringent latency requirements, SPEs must adopt an alternate model, which we refer to as “inbound processing”, where query processing is performed

U. Çetintemel (B) · Y. Ahmad · J.-H. Hwang · A. Rasin · N. Tatbul · Y. Xing · S. Zdonik Department of Computer Science, Brown University, Providence, RI, USA e-mail: [email protected]

D. Abadi · H. Balakrishnan · M. Balazinska · S. Madden · M. Stonebraker Department of EECS and Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA

M. Cherniack · A. Maskey · E. Ryvkina Department of Computer Science, Brandeis University, Waltham, MA, USA

directly on incoming messages before (or instead of) storing them. Lastly, an SPE must have capabilities to gracefully deal with spikes in message load. Incoming traffic is usually bursty, and it is desirable to selectively degrade the performance of the applications running on an SPE.

The Aurora stream-processing engine, motivated by these three tenets, is currently operational. It consists of some 100K lines of C++ and Java and runs on both Unix- and Linux-based platforms. It was constructed with the cooperation of students and faculty at Brown, Brandeis, and MIT. The fundamental design of the engine has been well documented elsewhere: the architecture of the engine is described in [9], while the scheduling algorithms are presented in [10]. Load-shedding algorithms are presented in [21], and our approach to high availability in a multi-site Aurora installation is covered in [12, 15]. Lastly, we have been involved in a collective effort to define a benchmark that describes the sort of monitoring applications that we have in mind. The result of this effort is called "Linear Road" and is described in [7].

We have used Aurora to build various application systems. The first application we describe here is an Aurora implementation of Linear Road, mentioned above. Second, we have implemented a pilot application that detects late arrival of messages in a financial-services feed-processing environment. Third, one of our collaborators, a military medical research laboratory [23], asked us to build a system to monitor the levels of hazardous materials in fish. Lastly, we have used Aurora to build Medusa [8], a distributed version of Aurora that is intended to be used by multiple enterprises that operate in different administrative domains.

The current Aurora prototype has been transferred to the commercial domain, with venture capital backing. As such, the academic project is hard at work on a complete redesign of Aurora, which we call Borealis. Borealis is a distributed stream-processing system that inherits core stream-processing functionality from Aurora and distribution functionality from Medusa. Borealis modifies and extends both systems in nontrivial and critical ways to provide advanced capabilities that are commonly required by newly emerging stream-processing applications. The Borealis design is driven by our experience in using Aurora and Medusa, in developing several streaming applications including the Linear Road benchmark, and several commercial opportunities. Borealis will address several requirements of newly emerging streaming applications, which we discuss in Sect. 5.

We start with a review of the Aurora design and implementation in Sect. 2. We then present the case studies mentioned above in detail in Sect. 3 and provide a brief retrospective on what we have learned throughout the process in Sect. 4. We conclude in Sect. 5 by briefly discussing the ideas we have for Borealis in several new areas including mechanisms for dynamic modification of query specification and query results and a distributed optimization framework that operates across server and sensor networks.

Fig. 1 Aurora graphical user interface

2 The Aurora Centralized Stream Processing Engine

Aurora is based on a dataflow-style “boxes and arrows” paradigm. Unlike other stream processing systems that use SQL-style declarative query interfaces (e.g., STREAM [17]), this approach was chosen because it allows query activity to be in- terspersed with message processing (e.g., cleaning, correlation, etc.). Systems that only perform the query piece must ping-pong back and forth to an application for the rest of the work, thereby adding to system overhead and latency. In Aurora, a developer uses the GUI to wire together a network of boxes and arcs that will process streams in a manner that produces the outputs necessary to his or her application. A screen shot of the GUI used to create Aurora networks is shown in Fig. 1: the black boxes indicate input and output streams that connect Aurora with the stream sources and applications, respectively. The other boxes are Aurora oper- ators and the arcs represent data flow among the operators. Users can drag-and-drop operators from the palette on the left and connect them by simply drawing arrows between them. It should be noted that a developer can name a collection of boxes and replace it with a “superbox”. This “macro-definition” mechanism drastically eases the development of big networks. As illustrated in Fig. 2, the Aurora system is, in fact, a directed sub-graph of a workflow diagram that expresses all simultaneous query computations. We refer to this workflow diagram as the Aurora network. Queries are built from a standard set of well-defined operators (a.k.a. boxes). The arcs denote tuple queues that represent streams. Each box, when scheduled, processes one or more tuples from its input 340 U. Çetintemel et al.

Fig. 2 The Aurora processing network

queue and puts the results on its output queue. When tuples are generated at the queues attached to the applications, they are assessed according to the application's QoS specification (more on this below). By default, queries are continuous in that they can potentially run forever over push-based inputs. Ad hoc queries can also be defined at run time and are attached to connection points, which are predetermined arcs in the network where historical data is stored. Connection points can be associated with persistence specifications that indicate how long a history to keep. Aurora also allows dangling connection points that can store static data sets. As a result, connection points enable Aurora queries to combine traditional pull-based data with live push-based data. Aurora also allows the definition of views, which are queries to which no application is connected. A view is allowed to have a QoS specification as an indication of its importance. Applications can connect to the view whenever there is a need.

The Aurora operators are presented in detail in [5] and are summarized in Fig. 3. Aurora's operator choices were influenced by numerous systems. The basic operators Filter, Map and Union are modeled after the Select, Project and Union operations of the relational algebra. Join's use of a distance metric to relate joinable elements on opposing streams is reminiscent of the relational band join [14]. Aggregate's sliding window semantics is a generalized version of the sliding window constructs of SEQ [19] and SQL-99 (with generalizations including allowance for disorder (SLACK), timeouts, value-based windows, etc.). The ASSUME ORDER clause (used in Aggregate and Join), which defines a result in terms of an order which may or may not be manifested, is borrowed from AQuery [16].

Each input must obey a particular schema (a fixed number of fixed or variable length fields of the standard data types). Every output is similarly constrained. An Aurora network accepts inputs, performs message filtering, computation, aggregation, and correlation, and then delivers output messages to applications.

Fig. 3 Aurora operators

Fig. 4 QoS graph types

Moreover, every output is optionally tagged with a Quality of Service (QoS) specification. This specification indicates how much latency the connected appli- cation can tolerate, as well as what to do if adequate responsiveness cannot be as- sured under overload situations. Note that the Aurora notion of QoS is different from the traditional QoS notion that typically implies hard performance guarantees, resource reservations and strict admission control. Specifically, QoS is a multidi- mensional function of several attributes of an Aurora system. These include (i) re- sponse times—output tuples should be produced in a timely fashion; as otherwise QoS will degrade (as delays get longer); (ii) tuple drops—if tuples are dropped to shed load, then the QoS of the affected outputs will deteriorate; and (iii) values produced—QoS clearly depends on whether important values are being produced or not. Figure 4 illustrates these three QoS graph types. When a developer is satisfied with an Aurora network, he or she can compile it into an intermediate form, which is stored in an embedded database as part of the system catalog. At run-time this data structure is read into virtual memory. The Aurora run-time architecture is shown in Fig. 5. The heart of the system is the sched- uler that determines which box (i.e., operator) to run. The scheduler also determines how many input tuples of a box to process and how far to “push” the tuples toward the output. Aurora operators can store and access data from embedded in-memory 342 U. Çetintemel et al.

Fig. 5 Aurora run-time architecture

databases as well as from external databases. Aurora also has a Storage Manager that is used to buffer queues when main memory runs out. This is particularly important for queues at connection points since they can grow quite long.

The Run-Time Stats Monitor continuously monitors the QoS of output tuples. This information is important since it drives the Scheduler in its decision-making, and it also informs the Load Shedder when and where it is appropriate to discard tuples in order to shed load. Load shedding is only one of the techniques employed by Aurora to improve the QoS delivered to applications. When load shedding is not working, Aurora will try to re-optimize the network using standard query optimization techniques (such as those that rely on operator commutativities). This tactic requires a more global view of the network and thus is used more sparingly. The final tactic is to retune the scheduler by gathering new statistics or switching scheduler disciplines. The Aurora optimizer can rearrange a network by performing box swapping when it thinks the result will be favorable. Such box swapping cannot occur across a connection point; hence connection points are arcs that restrict the behavior of the optimizer as well as remember history. More detailed information on these various topics can be obtained from the referenced papers [5, 9, 10, 21].
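The latency, drop, and value-based QoS graphs of Fig. 4 can be thought of as piecewise-linear utility functions that the Run-Time Stats Monitor evaluates against observed outputs and that the Load Shedder consults before discarding tuples. The sketch below (in Python) illustrates this idea for a latency-based graph; the breakpoints, threshold, and trigger rule are invented for illustration and are not Aurora's actual policy.

import bisect

class PiecewiseLinearQoS:
    """Latency-based QoS graph: utility in [0, 1] as a function of output delay (s)."""
    def __init__(self, points):
        # points: sorted list of (delay_seconds, utility) breakpoints
        self.xs = [p[0] for p in points]
        self.ys = [p[1] for p in points]

    def utility(self, delay):
        if delay <= self.xs[0]:
            return self.ys[0]
        if delay >= self.xs[-1]:
            return self.ys[-1]
        i = bisect.bisect_right(self.xs, delay)
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        return y0 + (y1 - y0) * (delay - x0) / (x1 - x0)

# Hypothetical QoS graph: full utility up to 2 s of delay, dropping to zero at 10 s.
latency_qos = PiecewiseLinearQoS([(0.0, 1.0), (2.0, 1.0), (10.0, 0.0)])

def should_shed_load(observed_delays, threshold=0.7):
    """Trigger load shedding when average utility of recent outputs falls below threshold."""
    avg = sum(latency_qos.utility(d) for d in observed_delays) / len(observed_delays)
    return avg < threshold

print(should_shed_load([1.0, 3.0, 6.0]))
# False: the average utility (about 0.79) is still above the invented 0.7 threshold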

3 Aurora Case Studies

In this section, we present four case studies of applications built using the Aurora engine and tools. The Aurora and Borealis Stream Processing Engines 343

3.1 Financial Services Application

Financial service organizations purchase stock ticker feeds from multiple providers and need to switch in real time between these feeds if they experience too many problems. We worked with a major financial services company on developing an Aurora application that detects feed problems and triggers the switch in real time. In this section, we summarize the application (as specified by the financial services company) and its implementation in Aurora. An unexpected delay in the reporting of new prices is an example of a feed prob- lem. Each security has an expected reporting interval, and the application needs to raise an alarm if a reporting interval exceeds its expected value. Furthermore, if more than some number of alarms are recorded, a more serious alarm is raised that could indicate that it is time to switch feeds. The delay can be caused by the underlying exchange (e.g., NYSE, NASDAQ) or by the feed provider (e.g., Com- stock, Reuters). If it is the former, switching to another provider will not help, so the application must be able to rapidly distinguish between these two cases. Ticker information is provided as a real-time data feed from one or more providers, and a feed typically reports more than one exchange. As an example, let us assume that there are 500 securities within a feed that update at least once every 5 s and they are called “fast updates”. Let us also assume that there are 4000 securities that update at least once every 60 s and they are called “slow updates”. If a ticker update is not seen within its update interval, the monitoring system should raise a low alarm. For example, if MSFT is expected to update within 5 s, and 5 s or more elapse since the last update, a low alarm is raised. Since the source of the problem could be in the feed or the exchange, the moni- toring application must count the number of low alarms found in each exchange and the number of low alarms found in each feed. If the number for each of these cat- egories exceeds a threshold (100 in the following example), a high alarm is raised. The particular high alarm will indicate what action should be taken. When a high alarm is raised, the low alarm count is reset and the counting of low alarms begins again. In this way, the system produces a high alarm for every 100 low alarms of a particular type. Furthermore, the posting of a high alarm is a serious condition, and low alarms are suppressed when the threshold is reached to avoid distracting the operator with a large number of low alarms. Figure 6 presents our solution realized with an Aurora query network. We assume for simplicity that the securities within each feed are already separated into the 500 fast updating tickers and the 4000 slowly updating tickers. If this is not the case, then the separation can be easily achieved with a lookup. The query network in Fig. 6 actually represents six different queries (one for each output). Notice that much of the processing is shared. The core of this application is in the detection of late tickers. Boxes 1, 2, 3, and 4 are all Aggregate boxes that perform the bulk of this computation. An Aggregate box groups input tuples by common value of one or more of their attributes, thus effectively creating a substream for each possible combination of these attribute 344 U. Çetintemel et al.

Fig. 6 Aurora query network for the alarm correlation application values. In this case, the aggregates are grouping the input on common value of ticker symbol. For each grouping or substream, a window is defined that demarcates interesting runs of consecutive tuples called windows. For each of the tuples in one of these windows, some memory is allocated and an aggregating function (e.g., Average) is applied. In this example, the window is defined to be every consecutive pair (e.g., tuples 1 and 2, tuples 2 and 3, etc.) and the aggregating function generates one output tuple per window with a boolean flag called Alarm, which is a 1 when the second tuple in the pair is delayed (call this an Alarm tuple) and a 0 when it is on time. Aurora’s operators have been designed to react to imperfections such as delayed tuples. Thus, the triggering of an Alarm tuple is accomplished directly using this built-in mechanism. The window defined on each pair of tuples will timeout if the second tuple does not arrive within the given threshold (5 s in this case). In other words, the operator will produce one alarm each time a new tuple fails to arrive within 5 s, as the corresponding window will automatically timeout and close. The high-level specification of Aggregate boxes 1 through 4 is:

Aggregate(Group by ticker, Order on arrival, Window (Size = 2 tuples, Step = 1 tuple, Timeout = 5 sec))
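For concreteness, the following Python sketch mimics the effect of these size-2, step-1 windows on a per-ticker basis, emitting an Alarm flag of 1 when the gap between consecutive updates exceeds 5 s. It is not Aurora code, and unlike the real timeout mechanism, which fires as soon as the 5 s elapse, this offline sketch only evaluates the gap once the next tuple eventually arrives.

TIMEOUT = 5.0  # seconds; the "fast update" interval from the running example

def detect_late_tickers(updates):
    """updates: iterable of (ticker, arrival_time) in arrival order.
    Yields (ticker, time, alarm) per consecutive pair of updates for a ticker,
    mimicking a size-2, step-1 window; alarm is 1 when the second tuple is late."""
    last_seen = {}
    for ticker, t in updates:
        if ticker in last_seen:
            late = (t - last_seen[ticker]) > TIMEOUT
            yield (ticker, t, 1 if late else 0)
        last_seen[ticker] = t

feed = [("MSFT", 0.0), ("IBM", 1.0), ("MSFT", 4.0), ("MSFT", 12.0)]
print(list(detect_late_tickers(feed)))
# [('MSFT', 4.0, 0), ('MSFT', 12.0, 1)]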

Boxes 5 through 8 are Filters that eliminate the normal outputs, thereby letting only the Alarm tuples through. Box 9 is a Union operator that merges all Reuters alarms onto a single stream. Box 10 performs the same operation for Comstock. The rest of the network determines when a large number of Alarms is occurring and what the cause of the problem might be. Boxes 11 and 15 count Reuters alarms and raise a high alarm when a threshold (100) is reached. Until that time, they simply pass through the normal (low) alarms. Boxes 14 and 18 do the same for Comstock. Note that the boxes labeled Count 100 are actually Map boxes. Map takes a user-defined function as a parameter and applies it to each input tuple. That is, for each tuple t in the input stream, a Map box parameterized by a function f produces the tuple f(t). In this example, Count 100 simply applies the following user-supplied function (written in pseudocode) to each tuple that passes through:

F(x: tuple) =
    cnt++
    if (cnt % 100 != 0)
        if !suppress
            emit lo-alarm
        else
            emit drop-alarm
    else
        emit hi-alarm
        set suppress = true
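The same logic can be expressed as an ordinary stateful function, one instance per Count 100 box. The Python rendering below is a hypothetical illustration of the pseudocode above, not the Map box implementation, and the tuple layout is invented.

class Count100:
    """Escalates every 100th low alarm to a high alarm and suppresses
    subsequent low alarms, following the pseudocode above."""
    def __init__(self, threshold=100):
        self.threshold = threshold
        self.cnt = 0
        self.suppress = False

    def __call__(self, alarm_tuple):
        self.cnt += 1
        if self.cnt % self.threshold != 0:
            if not self.suppress:
                return ("lo-alarm",) + alarm_tuple
            return ("drop-alarm",) + alarm_tuple
        self.suppress = True
        return ("hi-alarm",) + alarm_tuple

count_reuters = Count100()
outputs = [count_reuters(("MSFT", i)) for i in range(1, 201)]
# outputs[0:99] are lo-alarms, outputs[99] is a hi-alarm,
# outputs[100:199] are drop-alarms, and outputs[199] is another hi-alarm.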

Boxes 12, 13, 16, and 17 separate the alarms from both Reuters and Comstock into alarms from NYSE and alarms from NASDAQ. This is achieved by using Filters to take NYSE alarms from both feed sources (Boxes 12 and 13) and merging them using a Union (Box 16). A similar path exists for NASDAQ alarms. The results of each of these streams are counted and filtered as explained above. In summary, this example illustrates the ability to share computation among queries, the ability to extend functionality through user-defined Aggregate and Map functions, and the need to detect and exploit stream imperfections.

3.2 The Linear Road Benchmark

Linear Road is a benchmark for stream-processing engines [4, 7]. This benchmark simulates an urban highway system that uses "variable tolling" (also known as "congestion pricing") [1, 13, 18], where tolls are determined according to such dynamic factors as congestion, accident proximity, and travel frequency. As a benchmark, Linear Road specifies input data schemas and workloads, a suite of continuous and historical queries that must be supported, and performance (query and transaction response time) requirements.

Variable tolling is becoming increasingly prevalent in urban settings because it is effective at reducing traffic congestion and because recent advances in microsensor technology make it feasible. Traffic congestion in major metropolitan areas is an increasing problem as expressways cannot be built fast enough to keep traffic flowing freely at peak periods. The idea behind variable tolling is to issue tolls that vary according to time-dependent factors such as congestion levels and accident proximity, with the motivation of charging higher tolls during peak traffic periods to discourage vehicles from using the roads and contributing to the congestion. Illinois, California, and Finland are among the highway systems that have pilot programs utilizing this concept.

The benchmark itself assumes a fictional metropolitan area (called "Linear City") that consists of 10 expressways of 100-mile-long segments each and 1,000,000 vehicles that report their positions via GPS-based sensors every 30 s. Tolls must be issued on a per-segment basis automatically, based on statistics gathered over the previous 5 min concerning average speed and number of reporting cars. A segment's tolls are overridden when accidents are detected in the vicinity (an accident is detected when multiple cars report close positions at the same time), and vehicles that use a particular expressway often are issued "frequent traveler" discounts.

The Linear Road benchmark demands support for five queries: two continuous and three historical. The first continuous query calculates and reports a segment toll every time a vehicle enters a segment. This toll must then be charged to the vehicle's account when the vehicle exits that segment without exiting the expressway. Again, tolls are based on current congestion conditions on the segment, recent accidents in the vicinity, and frequency of use of the expressway for the given vehicle. The second continuous query involves detecting and reporting accidents and adjusting tolls accordingly. The historical queries involve requesting an account balance or a day's total expenditure for a given vehicle on a given expressway and a prediction of travel time between two segments on the basis of average speeds on the segments recorded previously. Each of the queries must be answered with a specified accuracy and within a specified response time. The degree of success for this benchmark is measured in terms of the number of expressways the system can support, assuming 1000 position reports issued per second per expressway, while answering each of the five queries within the specified latency bounds.
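The per-segment statistics at the heart of the benchmark can be pictured with a small sketch. The Python code below maintains a 5-minute sliding window of speed reports per segment and derives a toll from the window's average speed; the toll formula, thresholds, and tuple layout are invented for illustration and are not those mandated by Linear Road.

from collections import defaultdict, deque

WINDOW = 300  # 5 minutes of position reports, in seconds

class SegmentStats:
    """Keeps a 5-minute sliding window of (time, speed) reports per segment
    and derives a toll; the toll rule here is purely illustrative."""
    def __init__(self):
        self.reports = defaultdict(deque)   # segment -> deque of (time, speed)

    def update(self, segment, time, speed):
        q = self.reports[segment]
        q.append((time, speed))
        while q and q[0][0] < time - WINDOW:
            q.popleft()

    def toll(self, segment, base=2.0, free_flow=40.0):
        q = self.reports[segment]
        if not q:
            return 0.0
        avg_speed = sum(s for _, s in q) / len(q)
        # Hypothetical rule: no toll when traffic flows freely, otherwise scale
        # with how far the average speed falls below the free-flow threshold.
        return 0.0 if avg_speed >= free_flow else base * (1 - avg_speed / free_flow)

stats = SegmentStats()
for t, seg, spd in [(0, 17, 55), (30, 17, 35), (60, 17, 25)]:
    stats.update(seg, t, spd)
print(round(stats.toll(17), 2))   # 0.08 under the invented formula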

3.3 Environmental Monitoring

We have also worked with a military medical research laboratory on an application that involves monitoring toxins in the water. This application is fed streams of data indicating fish behavior (e.g., breathing rate) and water quality (e.g., temperature, pH, oxygenation, and conductivity). When the fish behave abnormally, an alarm is sounded. Input data streams were supplied by the army laboratory as a text file. The single data file interleaved fish observations with water quality observations. The alarm message emitted by Aurora contains fields describing the fish behavior and two different water quality reports: the water quality at the time the alarm occurred and the water quality from the last time the fish behaved normally. The water quality reports contain not only the simple measurements but also the 1-/2-/4-hour sliding- window deltas for those values. The application’s Aurora processing network is shown in Fig. 7 (snapshot taken from the Aurora GUI): The input port (1) shows where tuples enter Aurora from The Aurora and Borealis Stream Processing Engines 347

Fig. 7 Aurora query network for the environmental contamination detection applications (GUI snapshot) the outside data source. In this case, it is the application’s C++ program that reads in the sensor log file. A Union box (2) serves merely to split the stream into two identical streams. A Map box (3) eliminates all tuple fields except those related to water quality. Each superbox (4) calculates the sliding-window statistics for one of the water quality attributes. The parallel paths (5) form a binary join network that brings the results of (4)’s subnetworks back into a single stream. The top branch in (6) has all the tuples where the fish act oddly, and the bottom branch has the tuples where the fish act normally. For each of the tuples sent into (1) describing abnormal fish behavior, (6) emits an alarm message tuple. This output tuple has the sliding- window water quality statistics for both the moment the fish acted oddly and for the most recent previous moment that the fish acted normally. Finally, the output port (7) shows where result tuples are made available to the C++-based monitoring application. Overall, the entire application ended up consisting of 3400 lines of C++ code (primarily for file parsing and a simple monitoring GUI) and a 53-operator Aurora query network. During the development of the application, we observed that Aurora’s stream model proved very convenient for describing the required sliding-window calcula- tions. For example, a single instance of the aggregate operator computed the 4-h sliding-window deltas of water temperature. Aurora’s GUI for designing query networks also proved invaluable. As the query network grew large in the number of operators used, there was great potential for overwhelming complexity. The ability to manually place the operators and arcs on a workspace, however, permitted a visual representation of “subroutine” boundaries that let us comprehend the entire query network as we refined it. We found that small changes in the operator language design would have greatly reduced our processing network complexity. For example, Aggregate boxes apply some window function [such as DELTA(water-pH)] to the tuples in a slid- ing window. Had an Aggregate box been capable of evaluating multiple func- tions at the same time on a single window [such as DELTA(water-pH) and DELTA(watertemp)], we could have used significantly fewer boxes. Many of these changes have since been made to Aurora’s operator language. The ease with which the processing flow could be experimentally reconfigured during development, while remaining comprehensible, was surprising. It appears 348 U. Çetintemel et al.

Table 1 Overview of a subset of the Aurora API

start and shutdown: Respectively starts processing and shuts down a complete query network.
modifyNetwork: At runtime, adds or removes schemas, streams, and operator boxes from a query network processed by a single Aurora engine.
typecheck: Validates (part of) a query network. Computes properties of intermediate and output streams.
enqueue and dequeue: Push and pull tuples on named streams.
listEntities and describe(Entity): Provide information on entities in the current query network.
getPerfStats: Provides performance and load information.

that this was only possible by having both a well-suited operator set and a GUI tool that let us visualize the processing. It seems likely that this application was developed at least as quickly in Aurora as it would have been with standard procedural programming.

We note that, for this particular application, real-time response was not required. The main value Aurora added in this case was the ease of developing stream-oriented applications.
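As an illustration of the sliding-window deltas mentioned above, the Python sketch below computes a DELTA over a time-based window, interpreted here as the newest value minus the oldest value still inside the window; that interpretation, like the sample readings, is an assumption of the sketch rather than a definition taken from Aurora's operator language.

from collections import deque

class SlidingDelta:
    """DELTA over a time-based sliding window: newest value minus the oldest
    value still inside the window (an assumed interpretation for this sketch)."""
    def __init__(self, window_seconds):
        self.window = window_seconds
        self.samples = deque()  # (time, value)

    def add(self, time, value):
        self.samples.append((time, value))
        while self.samples and self.samples[0][0] < time - self.window:
            self.samples.popleft()
        return self.samples[-1][1] - self.samples[0][1]

delta_4h = SlidingDelta(4 * 3600)
readings = [(0, 7.1), (3600, 7.0), (7200, 6.4), (18000, 6.0)]  # (seconds, water pH)
for t, ph in readings:
    print(t, round(delta_4h.add(t, ph), 2))
# prints 0.0, -0.1, -0.7, -1.0: the pH change over the trailing 4-hour window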

3.4 Medusa: Distributed Stream Processing

Medusa is a distributed stream-processing system built using Aurora as the single-site query-processing engine. Medusa takes Aurora queries and distributes them across multiple nodes. These nodes can all be under the control of one entity or be organized as a loosely coupled federation under the control of different autonomous participants. A distributed stream-processing system such as Medusa offers several benefits including incremental scalability over multiple nodes, composition of stream feeds across multiple participants, and high availability and load sharing through resource multiplexing.

The development of Medusa prompted two important changes to the Aurora processing engine. First, it became apparent that it would be useful to offer Aurora not only as a stand-alone system but also as a library that could easily be integrated within a larger system. Second, we felt the need for an Aurora API, summarized in Table 1. This API is composed of three types of methods: (i) methods to set up queries and push or pull tuples from Aurora, (ii) methods to modify query networks at runtime (operator additions and removals), and (iii) methods giving access to performance information.
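The intended call sequence for an embedding system such as Medusa might look as follows. The bindings are hypothetical Python-style stubs written only to show how the methods of Table 1 fit together; the real Aurora API is a C++ library and its exact signatures differ.

# Hypothetical Python-style stubs for the API of Table 1; bodies omitted on purpose.
class AuroraEngine:
    def start(self): ...
    def shutdown(self): ...
    def modify_network(self, network_changes): ...   # add/remove schemas, streams, boxes
    def typecheck(self, entity=None): ...
    def enqueue(self, stream_name, tuples): ...
    def dequeue(self, stream_name, max_tuples=100): ...
    def list_entities(self): ...
    def get_perf_stats(self): ...

engine = AuroraEngine()
engine.start()
engine.modify_network("...box additions/removals...")   # runtime change to the network
engine.typecheck()                                       # validate the modified network
engine.enqueue("ticker_in", [("MSFT", 1.0)])
alarms = engine.dequeue("alarms_out")
stats = engine.get_perf_stats()                          # drives load-management decisions
engine.shutdown()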

4 Experience and Lessons Learned

4.1 Support for Historical Data

From our work on a variety of streaming applications, it became apparent that each application required maintaining and accessing a collection of historical data. For example, the Linear Road benchmark, which represents a realistic application, required maintaining 10 weeks of toll history for each driver, as well as the current positions of every vehicle and the locations of accidents tying up traffic. Historical data might be used to support historical queries (e.g., tell me how much driver X has spent on tolls on expressway Y over the past 10 weeks) or serve as inputs to hybrid queries involving both streaming and historical data [e.g., report the current toll for vehicle X based on its current position (streamed data) and the presence of any accidents in its vicinity (historical data)].

In the applications we have looked at, historical data take three different forms. These forms differ by their update patterns, i.e., the means by which incoming stream data are used to update the contents of a historical collection. These forms are summarized below.

1. Open windows (connection points). Linear Road requires maintaining the last 10 weeks' worth of toll data for each driver to support both historical queries and integrated queries. This form of historical data resembles a window in its FIFO-based update pattern but must be shared by multiple queries and therefore be openly accessible.

2. Aggregate summaries (latches). Linear Road requires maintaining such aggregated historical data as: the current toll balance for every vehicle (SUM(Toll)), the last reported position of every vehicle (MAX(Time)), and the average speed on a given segment over the past 5 min (AVG(Speed)). In all cases, the update patterns involve maintaining data by key value (e.g., vehicle or segment ID) and using incoming tuples to update the aggregate value that has the appropriate key. As with open windows, aggregate summaries must be shared by multiple queries and therefore must be openly accessible.

3. Tables. Linear Road requires maintaining tables of historical data whose update patterns are arbitrary and determined by the values of streaming data. For example, a table must be maintained that holds every accident that has yet to be cleared (such that an accident is detected when multiple vehicles report the same position at the same time). This table is used to determine tolls for segments in the vicinity of the accident and to alert drivers approaching the scene of the accident. The update pattern for this table resembles neither an open window nor an aggregate summary. Rather, accidents must be deleted from the table when an incoming tuple reports that the accident has been cleared. This requires the declaration of an arbitrary update pattern.

Whereas open windows and aggregate summaries have fixed update patterns, tables require update patterns to be explicitly specified. Therefore, the Aurora query algebra (SQuAl) includes an Update box that permits an update pattern to be specified in SQL. This box has the form

UPDATE (Assume O, SQL U, Report t)

such that U is an SQL update issued with every incoming tuple and includes variables that get instantiated with the values contained in that tuple. O specifies the assumed ordering of input tuples, and t specifies a tuple to output whenever an update takes place.
Further, because all three forms of historical collections require random access, SQuAl also includes a Read box that initiates a query over stored data (also specified in SQL) and returns the result as a stream. This box has the form

READ (Assume O, SQL Q)

such that Q is an SQL query issued with every incoming tuple and includes variables that get instantiated with the values contained in that tuple.
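To make these semantics concrete, the sketch below emulates the behavior of an Update box and a Read box over a hypothetical Linear Road accident table, using Python with an embedded SQLite database in place of SQuAl; the table layout and tuple formats are invented, and the code is an analogy rather than Aurora's implementation.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accidents (segment INTEGER, vehicle INTEGER, time INTEGER)")

def update_box(tuple_in):
    """Emulates UPDATE(Assume O, SQL U, Report t): an SQL statement is issued once
    per incoming tuple, with the tuple's fields bound to its variables, and a
    report tuple is emitted after the update."""
    kind, segment, vehicle, time = tuple_in
    if kind == "accident":
        conn.execute("INSERT INTO accidents VALUES (?, ?, ?)", (segment, vehicle, time))
    elif kind == "cleared":
        conn.execute("DELETE FROM accidents WHERE segment = ? AND vehicle = ?",
                     (segment, vehicle))
    return ("accident_table_changed", segment, time)   # the Report tuple

def read_box(segment):
    """Emulates READ(Assume O, SQL Q): runs a query per incoming tuple and
    streams out the result rows."""
    return conn.execute("SELECT vehicle, time FROM accidents WHERE segment = ?",
                        (segment,)).fetchall()

update_box(("accident", 17, 4242, 900))
update_box(("accident", 17, 4711, 905))
update_box(("cleared", 17, 4242, 960))
print(read_box(17))   # [(4711, 905)]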

4.2 Synchronization

As continuous queries, stream applications inherently rely on shared data and computation. Shared data might be contained in a table that one query updates and another query reads. For example, the Linear Road application requires that vehicle position data be used to update statistics on highway usage, which in turn are read to determine tolls for each segment on the highway. Alternatively, box output can be shared by multiple queries to exploit common subexpressions or even by a single query as a way of merging intermediate computations after parallelization.

Transactions are required in traditional databases because data sharing can lead to data inconsistencies. An equivalent synchronization mechanism is required in streaming settings, as data sharing in this setting can also lead to inconsistencies. For example, if a toll charge can expire, then a toll assessment to a given vehicle should be delayed until a new toll charge is determined. The need for synchronization with data sharing is met in SQuAl via the WaitFor box, whose syntax is shown below:

WaitFor (P: Predicate, T: Timeout)

This binary operator buffers each tuple t on one input stream until a tuple arrives on the second input stream that, together with t, satisfies P (or until the timeout expires, in which case t is discarded). If a Read operation must follow a given Update operation, then a WaitFor can buffer the Read request (tuple) until a tuple output by the Update box (and input to the second input of WaitFor) indicates that the Read operation can proceed.
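A minimal sketch of these semantics follows; the buffered-tuple representation, timestamps, and example predicate are assumptions made for illustration, not Aurora's internal data structures.

import time as _time

def wait_for(pending, control_tuple, predicate, timeout, now=None):
    """Sketch of WaitFor(P, T): 'pending' holds buffered data tuples as
    (arrival_time, tuple). When a tuple arrives on the control input, release
    every buffered tuple that satisfies P with it; drop tuples older than the
    timeout; keep the rest buffered."""
    now = _time.time() if now is None else now
    released, still_pending = [], []
    for arrived, t in pending:
        if predicate(t, control_tuple):
            released.append(t)
        elif now - arrived <= timeout:
            still_pending.append((arrived, t))
        # else: timed out, so the tuple is discarded
    return released, still_pending

# A Read request for vehicle 4711 waits until the Update box reports that
# vehicle 4711's statistics were refreshed.
pending = [(100.0, ("read", 4711)), (100.0, ("read", 4242))]
control = ("updated", 4711)
released, pending = wait_for(pending, control, lambda t, c: t[1] == c[1],
                             timeout=30.0, now=105.0)
print(released)   # [('read', 4711)]
print(pending)    # [(100.0, ('read', 4242))]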

4.3 Resilience to Unpredictable Stream Behavior

Streams are by their nature unpredictable. Monitoring applications require the system to continue operation even when the unpredictable happens. Sometimes, the only way to do this is to produce approximate answers. Obviously, in these cases, the system should try to minimize errors.

We have seen examples of streams that do not behave as expected. The financial services application that we described earlier requires the ability to detect a problem in the arrival rate of a stream. The military application must fundamentally adjust its processing to fit the available resources during times of stress. In both of these cases, Aurora primitives for unpredictable stream behavior were brought to bear on the problem.

Aurora makes no assumptions that a data stream arrives in any particular order or with any temporal regularity. Tuples can be late or out of order due to the nature of the data sources, the network that carries the streams, or the behavior of the operators themselves. Accordingly, our operator set includes user-specified parameters that allow handling such "damaged" streams gracefully.

For many of the operators, an input stream can be specified to obey an expected order. If out-of-order data are known to the network designer not to be of relevance, the operator will simply drop such data tuples immediately. Nonetheless, Aurora understands that this may at times be too drastic a constraint and provides an optional slack parameter to allow for some tolerance in the number of data tuples that may arrive out of order. A tuple that arrives out of order within the slack bounds will be processed as if it had arrived in order.

With respect to possible irregularity in the arrival rate of data streams, the Aurora operator set offers all windowed operators an optional timeout parameter. The timeout parameter tells the operator how long to wait for the next data tuple to arrive. This has two benefits: it prevents blocking (i.e., no output) when one stream is stalled, and it offers another way for the network designer to characterize the value of data that arrive later than they should, as in the financial services application in which the timeout parameter was used to determine when a particular data packet arrived late.
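The slack parameter can be pictured with a small sketch. The Python generator below keeps a tuple if at most a given number of tuples with larger ordering values have already been seen, and drops it otherwise; this counting interpretation of slack is an assumption made for the sketch, not a transcription of Aurora's exact rule.

def apply_slack(stream, key=lambda t: t[0], slack=2):
    """Order-enforcing input with slack: keep a tuple if at most 'slack' tuples
    with a larger ordering value arrived before it, otherwise drop it."""
    seen_keys = []
    for t in stream:
        k = key(t)
        later = sum(1 for s in seen_keys if s > k)
        seen_keys.append(k)
        if later <= slack:
            yield t
        # else: too far out of order, so it is silently dropped

ticks = [(1, "a"), (2, "b"), (5, "c"), (3, "d"), (6, "e"), (1, "f")]
print(list(apply_slack(ticks, slack=1)))
# keeps (3, 'd') (only one larger key seen before it) but drops (1, 'f')
# (four larger keys seen before it)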

4.4 XML and Other Feed Formats Adaptor Required

Aurora provides a network protocol that may be used to enqueue and dequeue tuples via Unix or TCP sockets. The protocol is intentionally very low-level: to eliminate copies and improve throughput, the tuple format is closely tied to the format of Aurora’s internal queue format. For instance, the protocol requires that each packet contain a fixed amount of padding reserved for bookkeeping and that integer and floating-point fields in the packet match the architecture’s native format. While we anticipate that performance-critical applications will use our low-level protocol, we also recognize that the formats of Aurora’s input streams may be out- side the immediate control of the Aurora user or administrator, for example, stock quote data arriving in XML format from a third-party information source. Also, even if the streams are being generated or consumed by an application within an organi- zation’s control, in some cases protocol stability and portability (e.g., not requiring 352 U. Çetintemel et al. the client to be aware of the endianness of the server architecture) are important enough to justify a minor performance loss. One approach to addressing these concerns is to simply require the user to build a proxy application that accepts tuples in the appropriate format, converts them to Aurora’s internal format, and pipes them into the Aurora process. This approach, while simple, conflicts with one of Aurora’s key design goals—to minimize the number of boundary crossings in the system—since the proxy application would be external to Aurora and hence live in its own address space. We resolve this problem by allowing the user to provide plug-ins called con- verter boxes. Converter boxes are shared libraries that are dynamically linked into the Aurora process space; hence their use incurs no boundary crossings. A user- defined input converter box provides a hook that is invoked when data arrive over the network. The implementation may examine the data and inject tuples into the appropriate streams in the Aurora network. This may be as simple as consuming fixed-length packets and enforcing the correct byte order on fields or as complex as transforming fully formed XML documents into tuples. An output converter box performs the inverse function: it accepts tuples from streams in Aurora’s internal format and converts them into a byte stream to be consumed by an external applica- tion. Input and output converter boxes are powerful connectivity mechanisms: they provide a high level of flexibility in dealing with external feeds and sinks without incurring a performance hit. This combination of flexibility and high performance is essential in a streaming database that must assimilate data from a wide variety of sources.

4.5 Programmatic Interfaces and Globally Accessible Catalogs Are a Good Idea

Initially, Aurora networks were created using the GUI and all Aurora metadata (i.e., catalogs) were stored in an internal representation. Our experience with the Medusa system quickly made us realize that, in order for Aurora to be easily integrated within a larger system, a higher-level, programmatic interface was needed to script Aurora networks and metadata needed to be globally accessible and updatable. Although we initially assumed that only Aurora itself (i.e., the runtime and the GUI) would need direct access to the catalog representation, we encountered several situations where this assumption did not hold. For instance, in order to manage dis- tribution operation across multiple Aurora nodes, Medusa required knowledge of the contents of node catalogs and the ability to selectively move parts of catalogs from node to node. Medusa needed to be able to create catalog objects (schema, streams, and boxes) without direct access to the Aurora catalog database, which would have violated abstraction. In other words, relying on the Aurora runtime and GUI as the sole software components able to examine and modify catalog structures turned out to be an unworkable solution when we tried to build sophisticated applications on The Aurora and Borealis Stream Processing Engines 353 the Aurora platform. We concluded that we needed a simple and transparent cata- log representation that is easily readable and writable by external applications. This would make it much easier to write higher-level systems that use Aurora (such as Medusa) and alternative authoring tools for catalogs. To this end, Aurora currently incorporates appropriate interfaces and mechanisms (Sect. 3.4) to make it easy to develop external applications to inspect and modify Aurora query networks. A universally readable and writable catalog representation is crucial in an environment where multiple applications may operate on Aurora catalogs.

4.6 Performance Critical

Fundamental to an SPE is a high-performance “message bus”. This is the system that moves tuples from one operator to the next, storing them temporarily, as well as into and out of the query network. Since every tuple is passed on the bus a number of times, this is definitely a performance bottleneck. Even such trivial optimizations as choosing the right memcpy() implementation gave substantial improvements to the whole system. Second to the message bus, the scheduler is the core element of an SPE. The scheduler is responsible for allocating processor time to operators. It is tempting to decorate the scheduler with all sorts of high-level optimization such as intelligent allocation of processor time or real-time profiling of query plans. But it is important to remember that scheduler overhead can be substantial in networks where there are many operators and that the scheduler makes no contribution to the actual pro- cessing. All addition of scheduler functionality must be greeted with skepticism and should be aggressively profiled. Once the core of the engine has been aggressively optimized, the remaining hot spots for performance are to be found in the implementation of the operators. In our implementation, each operator has a “tight loop” that processes batches of input tuples. This loop is a prime target for optimization. We make sure nothing other than necessary processing occurs in the loop. In particular, housekeeping of data structures such as memory allocations and deallocation needs to be done outside of this loop so that its cost can be amortized across many tuples. Data structures are another opportunity for operator optimization. Many of our operators are stateful; they retain information or even copies of previous input. Be- cause these operators are asked to process and store large numbers of tuples, effi- ciency of these data structures is important. Ideally, processing of each input tuple is accomplished in constant time. In our experience, processing that is linear in the amount of states stored is unacceptable. In addition to the operators themselves, any parts of the system that are used by those operators in the tight loops must be carefully examined. For example, we have a small language used to specify expressions for Map operators. Because these expressions are evaluated in such tight loops, optimizing them was important. The addition of an expensive compilation step may even be appropriate. 354 U. Çetintemel et al.

These microbenchmarks measure the overhead involved in passing tuples into and out of Aurora boxes and networks; they do not measure the time spent in boxes performing nontrivial operations such as joining and aggregation. Message-passing overhead, however, can be a significant time sink in streaming databases (as it was in earlier versions of Aurora). Microbenchmarking was very useful in eliminating performance bottlenecks in Aurora’s message-passing infrastructure. This infras- tructure is now fast enough in Aurora that nontrivial box operations are the only noticeable bottleneck, i.e., CPU time is overwhelmingly devoted to useful work and not simply to shuffling around tuples.
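The tight-loop advice above can be illustrated with a short sketch. The Python class below processes a batch of tuples per scheduling step, with its output buffer allocated once outside the loop so that per-tuple work stays minimal; it mirrors the structure described in the text but is not Aurora's C++ implementation, and the batch size is arbitrary.

class FilterBox:
    """Sketch of an operator 'tight loop' that processes a batch of tuples per
    scheduling step. The output buffer is allocated once, outside the loop, so
    the per-tuple work inside the loop involves no allocation or bookkeeping."""
    def __init__(self, predicate, batch_capacity=1024):
        self.predicate = predicate
        # Output buffer allocated once and reused for every batch.
        self.out_buffer = [None] * batch_capacity

    def process_batch(self, batch):
        n = 0
        pred = self.predicate          # local binding avoids repeated lookups
        out = self.out_buffer
        for t in batch:                # the tight loop
            if pred(t):
                out[n] = t
                n += 1
        return out[:n]

box = FilterBox(lambda t: t[1] > 100)
print(box.process_batch([("MSFT", 99), ("IBM", 140), ("ORCL", 250)]))
# [('IBM', 140), ('ORCL', 250)]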

5 Ongoing Work: The Borealis Distributed SPE

This section presents the initial ideas that we have started to explore in the context of the Borealis distributed SPE, which is a follow-on to Aurora. The rest of the section will provide an overview of the new challenges that Borealis will address. More details on these challenges as well as a preliminary design of Borealis can be found in [2].

5.1 Dynamic Revision of Query Results

In many real-world streams, corrections or updates to previously processed data are available only after the fact. For instance, many popular data streams, such as the Reuters stock market feed, often include messages that allow the feed originator to correct errors in previously reported data. Furthermore, stream sources (such as sensors), as well as their connectivity, can be highly volatile and unpredictable. As a result, data may arrive late and miss their processing window or be ignored tem- porarily due to an overload situation. In all these cases, applications are forced to live with imperfect results, unless the system has means to correct its processing and results to take into account newly available data or updates. The Borealis data model extends that of Aurora by supporting such corrections by way of revision records. The goal is to process revisions intelligently, correcting query results that have already been emitted in a manner that is consistent with the corrected data. Processing of a revision message must replay a portion of the past with a new or modified value. Thus, to process revision messages correctly, we must make a query diagram “replayable”. In theory, we could process each revision mes- sage by replaying processing from the point of the revision to the present. In most cases, however, revisions on the input affect only a limited subset of output tuples, and to regenerate unaffected output is wasteful and unnecessary. To minimize run- time overhead and message proliferation, we assume a closed model for replay that generates revision messages when processing revision messages. In other words, our model processes and generates “deltas” showing only the effects of revisions rather than regenerating the entire result. The primary challenge here is to develop efficient revision-processing techniques that can work with bounded history. The Aurora and Borealis Stream Processing Engines 355
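One way to picture the closed, delta-oriented model of revision processing is a keyed aggregate that keeps just enough history to patch its own outputs. The Python sketch below does this for a running sum; the record format and the assumption that the needed history fits in memory are inventions of the sketch, not the Borealis design.

class RevisableSum:
    """Keyed running sum that supports revision records. An insertion emits a
    normal output; a revision to a previously processed value emits a revision
    of only the affected output (a 'delta'), instead of replaying the stream."""
    def __init__(self):
        self.history = {}   # (key, tuple_id) -> value   (bounded history assumed)
        self.totals = {}    # key -> running sum

    def process(self, record):
        kind, key, tuple_id, value = record
        if kind == "insert":
            self.history[(key, tuple_id)] = value
            self.totals[key] = self.totals.get(key, 0) + value
            return ("output", key, self.totals[key])
        if kind == "revise":
            old = self.history[(key, tuple_id)]
            self.history[(key, tuple_id)] = value
            self.totals[key] += value - old
            return ("revision", key, self.totals[key])

agg = RevisableSum()
print(agg.process(("insert", "MSFT", 1, 10)))   # ('output', 'MSFT', 10)
print(agg.process(("insert", "MSFT", 2, 7)))    # ('output', 'MSFT', 17)
print(agg.process(("revise", "MSFT", 1, 12)))   # ('revision', 'MSFT', 19)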

5.2 Dynamic Query Modification

In many stream-processing applications, it is desirable to change certain attributes of the query at runtime. For example, in the financial services domain, traders typ- ically wish to be alerted of interesting events, where the definition of “interesting” (i.e., the corresponding filter predicate) varies based on current context and results. In network monitoring, the system may want to obtain more precise results on a spe- cific subnetwork if there are signs of a potential denial-of-service attack. Finally, in a military stream application that MITRE [23] explained to us, they wish to switch to a “cheaper” query when the system is overloaded. For the first two applications, it is sufficient to simply alter the operator parameters (e.g., window size, filter pred- icate), whereas the last one calls for altering the operators that compose the running query. Another motivating application comes again from the financial services com- munity. Universally, people working on trading engines wish to test out new trading strategies as well as debug their applications on historical data before they go live. As such, they wish to perform “time travel” on input streams. Although this last example can be supported in most current SPE prototypes (i.e., by attaching the engine to previously stored data), a more user-friendly and efficient solution would obviously be desirable. Two important features that will facilitate online modification of continuous queries in Borealis are control lines and time travel. Control lines extend Aurora’s basic query model with the ability to change operator parameters as well as opera- tors themselves on the fly. Control lines carry messages with revised box parameters and new box functions. For example, a control message to a Filter box can contain a reference to a boolean-valued function to replace its predicate. Similarly, a control message to an Aggregate box may contain a revised window size parameter. Ad- ditionally, each control message must indicate when the change in box semantics should take effect. Change is triggered when a monotonically increasing attribute received on the data line attains a certain value. Hence, control messages specify an attribute, value pair for this purpose. For windowed operators like Aggregate, control messages must also contain a flag to indicate if open windows at the time of change must be prematurely closed for a clean start. Time travel allows multiple queries (different queries or versions of the same query) to be easily defined and executed concurrently, starting from different points in the past or “future” (typically by running a simulation of some sort). To support these capabilities, we leverage three advanced mechanisms in Borealis: enhanced connection points, connection point versions, and revision messages. To facilitate time travel, we define two new operations on connection points. The replay opera- tion replays messages stored at a connection point from an arbitrary message in the past. The offset operation is used to set the connection point offset in time. When offset into the past, a connection point delays current messages before pushing them downstream. When offset into the future, the connection point predicts future data. When producing future data, various prediction algorithms can be used based on the application. A connection point version is a distinctly named logical copy of a con- nection point. Each named version can be manipulated independently. It is possible 356 U. Çetintemel et al. 
to shift a connection point version backward and forward in time without affecting other versions. To replay history from a previous point in time t, we use revision messages. When a connection point receives a replay command, it first generates a set of revi- sion messages that delete all the messages and revisions that have occurred since t. To avoid the overhead of transmitting one revision per deleted message, we use a macro message that summarizes all deletions. Once all messages are deleted, the connection point produces a series of revisions that insert the messages and possi- bly their following revisions back into the stream. During replay, all messages and revisions received by the connection point are buffered and processed only after the replay terminates, thus ensuring that simultaneous replays on any path in the query diagram are processed in sequence and do not conflict. When offset into the future, time-offset operators predict future values. As new data become available, these predictors can (but do not have to) produce more accurate revisions to their past predictions. Additionally, when a predictor receives revision messages, possi- bly due to time travel into the past, it can also revise its previous predictions.
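The control-line mechanism can be sketched as follows: a control message carries a replacement parameter together with an (attribute, value) trigger, and the box swaps its behavior once the named, monotonically increasing attribute on the data line reaches the trigger value. The Python code below illustrates this for a Filter box; the message formats and thresholds are invented.

class ControllableFilter:
    """Filter box with a control line: a control message supplies a replacement
    predicate plus an (attribute, value) trigger; the swap happens when the
    monotonically increasing attribute on the data line reaches that value."""
    def __init__(self, predicate):
        self.predicate = predicate
        self.pending = None   # (attr_index, trigger_value, new_predicate)

    def control(self, attr_index, trigger_value, new_predicate):
        self.pending = (attr_index, trigger_value, new_predicate)

    def process(self, t):
        if self.pending and t[self.pending[0]] >= self.pending[1]:
            self.predicate = self.pending[2]
            self.pending = None
        return t if self.predicate(t) else None

# Alert on prices above 100 until time 50, then tighten the threshold to 200.
f = ControllableFilter(lambda t: t[1] > 100)
f.control(attr_index=0, trigger_value=50, new_predicate=lambda t: t[1] > 200)
stream = [(10, 150), (40, 250), (50, 150), (60, 250)]
print([t for t in (f.process(x) for x in stream) if t is not None])
# [(10, 150), (40, 250), (60, 250)]; (50, 150) is filtered by the new predicate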

5.3 Distributed Optimization

Currently, commercial stream-processing applications are popular in industrial pro- cess control (e.g., monitoring oil refineries and cereal plants), financial services (e.g., feed processing, trading engine support and compliance), and network moni- toring (e.g., intrusion detection, fraud detection). Here we see a server-heavy opti- mization problem—the key challenge is to process high-volume data streams on a collection of resource-rich “beefy” servers. Over the horizon, we see a very large number of applications of wireless sensor technology (e.g., RFID in retail applica- tions, cell phone services). Here we see a sensor-heavy optimization problem—the key challenges revolve around extracting and processing sensor data from a net- work of resource-constrained “tiny” devices. Further over the horizon, we expect sensor networks to become faster and increase in processing power. In this case the optimization problem becomes more balanced, becoming sensor-heavy/server- heavy. To date, systems have exclusively focused on either a server-heavy environ- ment or a sensor-heavy environment. Off into the future, there will be a need for a more flexible optimization structure that can deal with a very large number of de- vices and perform cross-network sensor-heavy/server-heavy resource management and optimization. The purpose of the Borealis optimizer is threefold. First, it is intended to opti- mize processing across a combined sensor and server network. To the best of our knowledge, no previous work has studied such a cross-network optimization prob- lem. Second, QoS is a metric that is important in stream-based applications, and op- timization must deal with this issue. Third, scalability, sizewise and geographical, is becoming a significant design consideration with the proliferation of stream-based applications that deal with large volumes of data generated by multiple distributed The Aurora and Borealis Stream Processing Engines 357 sensor networks. As a result, Borealis faces a unique, multiresource/multimetric op- timization challenge that is significantly different than the optimization problems explored in the past. Our current thinking is that Borealis will rely on a hierarchical, distributed optimizer that runs at different time granularities [3].

5.4 High Availability

Another part of the Borealis vision involves addressing recovery and high- availability issues. High availability demands that node failure be masked by seam- less handoff of processing to an alternate node. This is complicated by the fact that the optimizer will dynamically redistribute processing, making it more difficult to keep backup nodes synchronized. Furthermore, wide-area Borealis applications are not only vulnerable to node failures but also to network failures and more impor- tantly to network partitions. We have preliminary research in this area that leverages Borealis mechanisms including connection point versions, revision tuples, and time travel.

5.5 Implementation Status

We built a Borealis prototype on top of the Aurora and Medusa code bases. Borealis borrowed many of the Aurora modules including its GUI, the XML representation for query diagrams, portions of the runtime system, and much of the logic for boxes. Borealis also borrowed basic networking and distribution logic from Medusa.

The Borealis prototype was demonstrated in SIGMOD 2006 [6], running real-time player-visualization queries on top of a multiplayer network game. The prototype system is also available to the public through the Borealis website [22].

5.6 Commercialization

The Aurora/Borealis project led to the first commercial real-time stream processing engine, which is being developed and offered by StreamBase Inc. [20]. The company was founded in 2003 primarily by the members of the academic project. Since then, StreamBase has grown to more than 60 employees (as of summer 2007) and has a diverse client base that consists of financial services, telecommunications, and gaming companies, as well as the intelligence and military sector.

The company has been actively participating in the development and publicity of StreamSQL, an extension of SQL for live data streams, as the standard textual language to develop stream-oriented applications. StreamSQL has constructs that can seamlessly mix streams and stored tables in a single query and expressive pattern matching capabilities.

Acknowledgements This work was supported in part by the National Science Foundation under the grants IIS-0086057, IIS-0325525, IIS-0325703, and IIS-0325838; and by the Army contract DAMD17-02-2-0048. We would like to thank all past members of the Aurora, Medusa, and Borealis projects for their valuable contributions.

References

1. A guide for hot lane development: a US department of transportation federal highway administration. http://www.itsdocs.fhwa.dot.gov/JPODOCS/REPTSTE/13668.html
2. D. Abadi, Y. Ahmad, H. Balakrishnan, M. Balazinska, U. Cetintemel, M. Cherniack, J.-H. Hwang, J. Janotti, W. Lindner, S. Madden, A. Rasin, M. Stonebraker, N. Tatbul, Y. Xing, S. Zdonik, The design of the Borealis stream processing engine. Technical report CS-04-08, Department of Computer Science, Brown University (2004)
3. D. Abadi, Y. Ahmad, H. Balakrishnan, M. Balazinska, U. Cetintemel, M. Cherniack, J.-H. Hwang, W. Lindner, A. Maskey, A. Rasin, E. Ryvkina, N. Tatbul, Y. Xing, S. Zdonik, The design of the Borealis stream processing engine, in CIDR Conference (2005)
4. D. Abadi, D. Carney, U. Çetintemel, M. Cherniack, C. Convey, C. Erwin, E. Galvez, M. Hatoun, J. Hwang, A. Maskey, A. Rasin, A. Singer, M. Stonebraker, N. Tatbul, Y. Xing, R. Yan, S. Zdonik, Aurora: a data stream management system (demo description), in ACM SIGMOD Conference (2003)
5. D. Abadi, D. Carney, U. Çetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, S. Zdonik, Aurora: a new model and architecture for data stream management. VLDB J. 12(2) (2003)
6. Y. Ahmad, B. Berg, U. Çetintemel, M. Humphrey, J. Hwang, A. Jhingran, A. Maskey, O. Papaemmanouil, A. Rasin, N. Tatbul, W. Xing, Y. Xing, S. Zdonik, Distributed operation in the Borealis stream processing engine (demo description), in ACM SIGMOD Conference (2005)
7. A. Arasu, M. Cherniack, E.F. Galvez, D. Maier, A. Maskey, E. Ryvkina, M. Stonebraker, R. Tibbetts, Linear road: a stream data management benchmark, in VLDB (2004), pp. 480–491
8. M. Balazinska, H. Balakrishnan, M. Stonebraker, Contract-based load management in federated distributed systems, in NSDI Symposium (2004)
9. D. Carney, U. Çetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, S. Zdonik, Monitoring streams—a new class of data management applications, in VLDB Conference, Hong Kong, China (2002)
10. D. Carney, U. Çetintemel, A. Rasin, S. Zdonik, M. Cherniack, M. Stonebraker, Operator scheduling in a data stream manager, in VLDB Conference, Berlin, Germany (2003)
11. S. Chandrasekaran, A. Deshpande, M. Franklin, J. Hellerstein, W. Hong, S. Krishnamurthy, S. Madden, V. Raman, F. Reiss, M. Shah, TelegraphCQ: continuous dataflow processing for an uncertain world, in CIDR Conference (2003)
12. M. Cherniack, H. Balakrishnan, M. Balazinska, D. Carney, U. Çetintemel, Y. Xing, S. Zdonik, Scalable distributed stream processing, in CIDR Conference, Asilomar, CA (2003)
13. Congestion pricing: a report from intelligent transportation systems (ITS). http://www.path.berkeley.edu/leap/TTM/DemandManage/pricing.html
14. D. DeWitt, J. Naughton, D. Schneider, An evaluation of non-equijoin algorithms, in VLDB Conference, Barcelona, Catalonia, Spain (1991)
15. J. Hwang, M. Balazinska, A. Rasin, U. Çetintemel, M. Stonebraker, S. Zdonik, A comparison of stream-oriented high-availability algorithms. Technical report CS-03-17, Department of Computer Science, Brown University (2003)
16. A. Lerner, D. Shasha, AQuery: query language for ordered data, optimization techniques, and experiments, in VLDB Conference, Berlin, Germany (2003)
17. R. Motwani, J. Widom, A. Arasu, B. Babcock, S. Babu, M. Datar, G. Manku, C. Olston, J. Rosenstein, R. Varma, Query processing, approximation, and resource management in a data stream management system, in CIDR Conference (2003)

18. R.W. Poole, Hot lanes prompted by federal program. http://www.rppi.org/federalhotlanes. html 19. P. Seshadri, M. Livny, R. Ramakrishnan, SEQ: a model for sequence databases, in IEEE ICDE Conference, Taipei, Taiwan (1995) 20. StreamBase incorporated. http://www.streambase.com/ 21. N. Tatbul, U. Çetintemel, S. Zdonik, M. Cherniack, M. Stonebraker, Load shedding in a data stream manager, in VLDB Conference, Berlin, Germany (2003) 22. The Borealis project web site. http://www.cs.brown.edu/research/borealis 23. The MITRE corporation. http://www.mitre.org/ Extending Relational Query Languages for Data Streams

N. Laptev, B. Mozafari, H. Mousavi, H. Thakkar, H. Wang, K. Zeng, and Carlo Zaniolo

The design of continuous query languages for data streams, and the extent to which these should rely on database query languages, represent pivotal issues for data stream management systems (DSMSs). The Expressive Stream Language (ESL) of our Stream Mill system is designed to maximize the spectrum of applications a DSMS can support efficiently, while retaining compatibility with the SQL:2003 standards. This approach offers significant advantages, particularly for the many applications that span both data streams and databases. Therefore, ESL supports the minimal extensions required to overcome SQL's limitations in expressive power, a critical enhancement since these limitations are already quite severe for database applications and are further exacerbated in data stream applications, where, e.g., only nonblocking query operators can be used. Thus, ESL builds on user-defined aggregates and flexible window mechanisms to turn SQL into a powerful and computationally complete query language, capable of supporting applications, such as data stream mining and sequence queries, that are beyond the application scope of other DSMSs.

N. Laptev · B. Mozafari · H. Mousavi · K. Zeng · C. Zaniolo
Computer Science Department, University of California, Los Angeles, CA 90095, USA

H. Thakkar
Google Inc., Mountain View, CA 94043, USA

H. Wang
Sigma Center, Microsoft Research Asia, Beijing 100190, P.R. China


1 Introduction

A key research issue for Data Stream Management Systems (DSMSs) is deciding which data model and query language should be used. A wide spectrum of different solutions has in fact been proposed, including operator-based graphical interfaces [1], programming language extensions [2], and an assortment of other solutions provided in publish/subscribe systems [3]. However, the approach of choice for many research projects is to extend SQL and use this tested database workhorse for continuous queries on data streams. Indeed, an SQL-based approach can offer significant benefits, particularly for the many applications that span both data streams and databases. This is because application developers can then use the same language on both streaming data and stored data, rather than having to learn a new language and cope with the resulting impedance mismatch. From this vantage point, it is also clear that data stream constructs should adhere closely to the syntax and semantics of current standards (namely SQL:2003 [4]); indeed, the introduction of constructs that are superficially similar to those of SQL, but actually have different syntax or semantics, might confuse, rather than help, users writing such spanning applications. Therefore, we advocate a conservative and minimalist approach, which limits SQL extensions to those demanded by the very nature of infinite data streams and by the online, push-based computation model of continuous queries. Indeed, Data Stream Management Systems must operate as follows:

(a) Results must be pushed to the output promptly and eagerly, while input tuples continue to arrive, i.e., without waiting for (i) the results to be requested by other applications (e.g., a procedural program embedding the continuous query), and (ii) the end of the input streams (when blocking operators could finally be applied).

(b) Because of the unbounded and massive nature of data streams, not all past tuples can be kept for future use. Only synopses, such as windows, can be retained in memory; the rest must be discarded.

The main problem created by (a) is a significant loss of expressive power. Database researchers have long been aware of the expressive power limitations of SQL and other relational languages; in fact, this problem has motivated much research on topics such as recursive queries, database mining queries, sequence queries, and time-series queries. Unfortunately, these limitations are dramatically more damaging for data stream applications, for the following reasons:

1. Blocking query operators that were widely used on databases can no longer be allowed on data streams [5, 6],

2. Database extenders (i.e., libraries of external functions written in procedural languages that use BLOBs and CLOBs to exchange data), which have successfully enhanced the versatility of Object-Relational (OR) DBMSs, are much less effective in DSMSs, which process data in small increments rather than as large aggregate objects,

3. Embedding queries in a procedural programming language, a solution used extensively in relational database applications, is now of very limited effectiveness on data streams.

We now expand on these statements, starting from point 1. A blocking query operator is one that only returns answers when it detects the end of its input, while a nonblocking operator produces its answers incrementally as new input tuples arrive [5]. For continuous queries, users must see the results immediately and incrementally as new stream records arrive, rather than when the stream eventually ends: thus only nonblocking query operators are allowed on data streams [5, 6]. The nature of nonblocking queries was formally characterized in [6], where it was shown that only monotonic queries1 can be expressed by nonblocking operators. Database query languages contain many nonmonotonic (and therefore blocking) query operators and constructs: for instance, set difference and division are nonmonotonic operators in relational algebra, while constructs such as EXCEPT, NOT EXISTS, and traditional aggregates in SQL are nonmonotonic. A natural question, then, is whether all the monotonic queries expressible in a query language can be expressed using only its monotonic operators or constructs. A query language that satisfies this criterion is said to be NB-complete, i.e., complete for nonblocking queries [6]. A query language that is NB-complete is as suitable a query language for data streams as it is for databases (since by disallowing its nonmonotonic operators/constructs we only lose queries that are not suitable for continuous online answering). Unfortunately, both relational algebra and SQL-2 fail the NB-completeness test [6]. Thus, the banishment of blocking operators and the lack of NB-completeness further aggravate the expressive power limitations of SQL that have made it difficult for DBMSs to support new application domains. This problem is discussed next.

As outlined in point 2 above, following the introduction of a time-series datablade by Illustra [7], OR DBMSs offered a plethora of such libraries covering a wide spectrum of applications under an assortment of different names. These functions often use large objects (BLOBs and CLOBs) to exchange data with SQL: for instance, a whole sequence could be encoded as a BLOB and shared between the database and the datablade. This solution is less suitable for data streams, where the computation must proceed continuously in small increments, e.g., by processing each new tuple in the sequence, rather than waiting for the whole sequence to be assembled into a BLOB. Indeed, unlike in OR DBMSs, datablades have not played an important role as an extension mechanism for DSMSs.

1Queries are viewed as mappings from the database, or the data stream, to the query answer.
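To make the contrast between blocking and nonblocking (monotonic) evaluation concrete, the following minimal sketch, written in plain Python rather than ESL, compares a blocking average, which can only answer once the whole input is available, with a nonblocking running average that emits partial results while tuples keep arriving. The function names and the 200-tuple reporting interval are illustrative choices, not part of ESL.

import itertools

def blocking_avg(stream):
    # Needs the entire input: it never produces an answer on an unbounded stream.
    items = list(stream)
    return sum(items) / len(items)

def running_avg(stream, every=200):
    # Nonblocking and monotonic: emits a result every `every` tuples and never
    # retracts anything it has already produced.
    total = count = 0
    for value in stream:
        total += value
        count += 1
        if count % every == 0:
            yield total / count

# Results appear while the (potentially unbounded) stream is still being read.
for estimate in itertools.islice(running_avg(itertools.count(1)), 3):
    print(estimate)

The nonblocking version only ever appends to its output, which is exactly the monotonicity property that [6] identifies as the prerequisite for nonblocking evaluation.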

Point 3 above considers the solution of embedding SQL queries in a procedural programming language (PL); this is currently a common approach for developing complex database applications, since the application logic that cannot be expressed in SQL can then be easily implemented in the PL. This solution, however, uses cursors and get-next constructs; thus it relies on a pull-based computation model that loses much of its effectiveness in the push-based environment of data streams. Indeed, in the continuous environment of data streams, consumption (production) of input (output) tuples follows a push-based mechanism, i.e., production (consumption) occurs without waiting for get-next requests from an embedding procedural language.

The severe limitations resulting from the problems discussed in points 1–3 above suggest that, for any SQL-based DSMS, extensions to enhance the power and flexibility of its query language are crucial to compete against the many alternatives proposed, including [1–3, 8, 9], which are discussed in Sect. 7. Fortunately, a solution to SQL's expressivity and flexibility problems is at hand: user-defined aggregate functions adopt a computation model based on incremental additions to their input streams, rather than the large BLOB objects of datablades. As discussed in [10, 11], User-Defined Aggregates (UDAs) can be written in an external programming language or natively in SQL itself; the native UDA definition capability makes SQL Turing-complete and NB-complete, i.e., able to express every monotonic function expressible by a Turing Machine [6]. We can finally return to the problems caused by the unbounded nature of data streams, mentioned in (b) above, and observe that windows have proved to be by far the most popular form of synopses used in DSMSs. Windows are actually not new to database languages, since SQL:2003 supports logical and physical windows on a set of built-in aggregates called OLAP functions. Many DSMSs also support these windows, and have actually introduced new ones, such as slides and tumbles [1, 12], for their built-in aggregates. However, for a DSMS to address both problems (a) and (b) above effectively, its language should support these windows on arbitrary UDAs, besides built-in aggregates. Indeed, the Stream Mill system developed at UCLA supports (i) windows on arbitrary UDAs and (ii) the native definition of these UDAs in SQL [13]. Stream Mill is unique in this respect, and its uniqueness is reflected in the much broader range of applications it can support efficiently: for instance, its Expressive Stream Language (ESL) can express data mining queries, sequence queries, approximate queries, etc., that are beyond the reach of other DSMSs. Examples of these advanced applications will be discussed later in this chapter, which is organized as follows.

In Sect. 2, we introduce the main extensions to SQL:2003 supported in ESL, including UDAs. In Sect. 3, we extend ESL to support different kinds of windows on arbitrary UDAs. Then, in Sects. 4 and 5, respectively, we demonstrate the effectiveness of these extensions in expressing approximate computations and advanced data mining algorithms. In Sect. 6, we describe the architecture of the Stream Mill system that supports these advanced query constructs very efficiently. Section 7 contains a broad description of related work, and it is followed by the conclusion.

2 ESL: An Expressive Stream Language Based on SQL

ESL supports ad-hoc SQL queries and updates on database tables and continuous queries on data streams. Each data stream is declared by a CREATE STREAM declaration that also specifies the external wrapper from which data is imported and the timestamp associated with the stream. For instance, in Example 1, the data stream OpenAuction is declared as having start_time as its external timestamp.

Example 1 Declaring Streams in ESL

CREATE STREAM OpenAuction(
    itemID INT, sellerID CHAR(10),
    start_price REAL, start_time TIMESTAMP)
ORDER BY start_time
SOURCE ... /* Wrapper ID */;

In ESL, new streams can be defined from existing streams in ways similar to defining virtual views in SQL. For instance, to derive a stream consisting of the auctions where the asking price is above 1000, we can write:

Example 2 Performing Selection Operations on Streams

CREATE STREAM expensiveItems AS
SELECT itemID, sellerID, start_price, start_time
FROM OpenAuction
WHERE start_price > 1000

In terms of semantics, these ESL operators produce the same results as if, instead of being applied to data streams, they were applied to database tables to which new tuples are being continuously appended. Additional operators supported by ESL on data streams are (i) aggregates (built-in or user-defined), (ii) joins of a stream with a database table, and (iii) the union of two or more data streams. All the operators considered so far operate on a single data stream, except for union, which always returns tuples sorted by timestamp (thus its equivalent SQL statement is really UNION ALL followed by ORDER BY on the timestamp). In [6], it was proven that the constructs mentioned above make ESL NB-complete, and thus capable of expressing all nonblocking computations on data streams. In the rest of the chapter, we therefore concentrate on these constructs, which are the most distinctive feature of ESL, and on their effective use in expressing complex data stream applications. Space limitations prevent us from covering other constructs here, such as window joins, inasmuch as these constructs are less critical in terms of expressive power and their treatment in ESL is similar to that of other DSMSs [14, 15].
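Since the stream union amounts to UNION ALL followed by an ORDER BY on the timestamp, it can be pictured as an ordered merge of timestamp-sorted inputs, as in the small sketch below (plain Python, not ESL; the two input lists and their tuple layout are hypothetical).

import heapq

# Two timestamp-ordered inputs of (timestamp, event) tuples; the layout is
# only illustrative of the merge-by-timestamp semantics.
auctions = [(1, 'open item 7'), (4, 'open item 9'), (9, 'open item 2')]
bids = [(2, 'bid on item 7'), (3, 'bid on item 7'), (8, 'bid on item 9')]

# Because each input is already ordered, the union needs to buffer at most one
# pending tuple per input stream, which is why it remains a nonblocking operator.
for timestamp, event in heapq.merge(auctions, bids, key=lambda t: t[0]):
    print(timestamp, event)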

2.1 User-Defined Aggregates (UDAs)

In recent years, some of the most successful SQL extensions have involved aggregates, including (i) data cubes and (ii) OLAP functions, where continuous aggregates are incrementally computed on logical and physical windows [4]. Logical and physical windows for aggregates are also provided by many DSMSs, some of which also support additional window constructs based on the notions of slides and tumbles [1, 12]. Many current DBMSs and DSMSs also allow the importation of new UDAs defined in external programming languages; however, they do not support windows on such UDAs. ESL brings two important improvements to the state of the art by supporting (i) windows on arbitrary UDAs (including logical windows, physical windows, tumbles, and slides) and (ii) the definition of UDAs in SQL (besides external PLs).

Example 3 declares a UDA, myavg, equivalent to the standard avg aggregate in SQL, using a definition consisting of the three statement groups labelled INITIALIZE, ITERATE, and TERMINATE. Commercial DBMSs [9] and DSMSs supporting UDAs [1, 15] allow a programmer to use external procedural languages to specify the computations to be performed in INITIALIZE, ITERATE, and TERMINATE; this capability is also available in ESL, which, however, also supports native UDA definition. In Example 3, the first line in the UDA definition declares a local table, state, to keep the sum and count of the values processed so far. The INITIALIZE statement inserts the value taken from the input stream and sets the count to 1. The ITERATE statement updates the table by adding the new input value to the sum and 1 to the count. The TERMINATE statement returns the ratio between the sum and the count as the final result of the computation, using the INSERT INTO RETURN statement. Myavg and similar UDAs can then be used as standard SQL aggregates, with an optional GROUP BY clause.

Example 3 Defining the standard aggregate average

AGGREGATE myavg(Next Int) : Real {
    TABLE state(tsum Int, cnt Int);
    INITIALIZE : {
        INSERT INTO state VALUES (Next, 1);
    }
    ITERATE : {
        UPDATE state SET tsum = tsum + Next, cnt = cnt + 1;
    }
    TERMINATE : {
        INSERT INTO RETURN SELECT tsum/cnt FROM state;
    }
}

Natively defined UDAs provide an extensibility mechanism of great power and flexibility; in fact, SQL with natively defined UDAs becomes Turing complete, and thus can express all computable queries on database tables [6]. Aggregates such as myavg, however, are blocking and thus do not add to the expressive power of SQL in data stream applications, which instead require nonblocking aggregates. Basically, there are two ways to turn a UDA such as myavg into a nonblocking UDA. The first is to modify its definition so that it becomes a continuous aggregate that returns values during, rather than at the end of, the computation.

Example 4 The continuous average: a nonblocking UDA

AGGREGATE online_avg(Next Int) : Real {
    TABLE state(tsum Int, cnt Int);
    INITIALIZE : {
        INSERT INTO state VALUES (Next, 1);
    }
    ITERATE : {
        UPDATE state SET tsum = tsum + Next, cnt = cnt + 1;
        INSERT INTO RETURN
        SELECT tsum/cnt FROM state WHERE cnt % 200 = 0;
    }
    TERMINATE : { }
}

For instance, Example 4 shows a continuous version of average where results are returned every 200 tuples rather than at the end. Observe that in this definition the TERMINATE state is empty, which assures that the aggregate is nonblocking and monotonic.2 The second way to deal with a blocking aggregate consists of keeping the original definition unchanged and using it to continuously recompute the aggregate on a sliding window, in a fashion similar to that of SQL:2003 OLAP functions. We next illustrate an interesting application of the first class of nonblocking aggregates; window-based applications will be discussed in the next section.

2Updates and other nonmonotonic constructs can still be used freely on other database tables, such as state. But any operator applied to the data stream must be monotonic.

2.2 Pattern Queries

Since UDAs process tuples one at a time, they work naturally on physically ordered sequences and can be used to search for patterns in a sequence. Say, for instance, that we want to find the situation where users, immediately after placing an order, ask for a rebate and then cancel the order. Finding this pattern in SQL requires two self-joins on the incoming event stream of user activities. In general, recognizing a pattern of n events would require n − 1 joins, and queries involving the joins of many streams can be complex to express in SQL.

Furthermore, such queries would be very inefficient on data streams. In particular, the condition that a tuple must immediately follow another tuple is complex and inefficient to express in basic SQL, but easy with UDAs. For instance, consider an incoming stream of purchase actions:

webevents(CustomerID, ItemID, Event, Amount, Time)

We want to detect the pattern of an order, followed by a rebate, and immediately after that a cancellation of the same item. The following nonblocking UDA can then be used to return the string ‘pattern123’ together with the CustomerID whose events have just matched the pattern (the aggregate will be called with a group-by clause on CustomerID, ItemID). This UDA models a finite state machine, where 0 denotes the failure state, which is entered whenever the right combination of current state and input is not observed. Otherwise, the state is first set to 1 and then advanced step by step to 3, where ‘pattern123’ is returned and the computation continues to search for the next occurrence of the pattern.

Example 5 First the order, then the rebate, and finally the cancellation

AGGREGATE pattern(CustomerID Char, Next Char) : (Char, Char) {
    TABLE state(sno Int);
    INITIALIZE : {
        INSERT INTO state VALUES(0);
        UPDATE state SET sno = 1 WHERE Next = 'order';
    }
    ITERATE : {
        UPDATE state SET sno = 0
        WHERE NOT (sno = 1 AND Next = 'rebate')
          AND NOT (sno = 2 AND Next = 'cancel')
          AND Next <> 'order';
        UPDATE state SET sno = 1 WHERE Next = 'order';
        UPDATE state SET sno = sno + 1
        WHERE (sno = 1 AND Next = 'rebate')
           OR (sno = 2 AND Next = 'cancel');
        INSERT INTO RETURN
        SELECT CustomerID, 'pattern123' FROM state WHERE sno = 3;
    }
}
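The three-state automaton encoded by the pattern UDA above can also be rendered in a few lines of ordinary Python, which may help in reading the SQL updates; this is only an illustration of the logic, not ESL, and the event names are those of the webevents example.

def pattern123(events):
    # events: the event names of a single (CustomerID, ItemID) group, in order.
    # State 0 is the failure state; states 1-3 track order -> rebate -> cancel.
    state = 0
    for event in events:
        if event == 'order':
            state = 1                    # (re)start the pattern
        elif state == 1 and event == 'rebate':
            state = 2
        elif state == 2 and event == 'cancel':
            yield 'pattern123'           # the pattern has just been matched
            state = 0                    # keep searching for the next occurrence
        else:
            state = 0                    # wrong combination: fall back to failure

print(list(pattern123(['order', 'rebate', 'cancel', 'order', 'cancel'])))
# ['pattern123']: only the first three events complete the pattern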

While UDAs can be effectively used to search for simple patterns, this approach can easily become prohibitive for more involved patterns. Fortunately, powerful languages based on Kleene-* constructs were recently proposed to facilitate the expression of complex sequential patterns. The first such language was SQL-TS [16, 17], for which powerful query optimization techniques were also developed [16, 17]. This led to an industrial SQL-change proposal [18], which has been prototyped more recently [19]. Recently proposed query languages based on Kleene-* constructs include SASE [20], SASE+ [21], Cayuga [22], CEDR [23], and

finally, K*SQL [24, 25]. Stream Mill supports K*SQL, which is the most powerful of these languages, inasmuch as it can express complex queries over both relational data and complex XML streams. K*SQL uses nested word automata3, which in turn are implemented as UDAs. For instance, the following K*SQL query could be implemented by calling the UDA of Example 5 with a PARTITION BY CustomerId (where ORDER BY Time is implicitly assumed for our timestamp-ordered data streams).

Example 6 (Example 5 expressed in K*SQL)

SELECT 'modified-pattern123', X.CustomerId
FROM webevents PARTITION BY CustomerId
AS PATTERN (X Y Z)
WHERE X.Event = 'order'
  AND Y.Event = 'rebate' AND Y.ItemID = X.ItemID
  AND Z.Event = 'cancel' AND Z.ItemID = Y.ItemID

Thus we have a simple pattern specifying a sequence of three events, where the second (third) event immediately follows the first (second) in the separate CustomerId substream. Languages such as SASE+ [21] adopt a semantics where the next event in the pattern is only required to follow, rather than immediately follow, the previous event. This more relaxed pattern can be expressed in K*SQL by simply changing the pattern in Example 6 to AS PATTERN (X V* Y W* Z). Here V* and W* can each match zero or more successive events,4 and thus a clause such as Z.Time − Y.Time ≤ 60 might be advisable to limit the overall elapsed time to 60 minutes. By adding the prefix F. in the WHERE clause, we can express local conditions: i.e., F.Z.ItemID = F.Y.ItemID only requires that within each occurrence of F the ItemID is the same, while different occurrences of F can have different values of ItemID. This seemingly simple extension makes K*SQL strictly more expressive than its counterparts. Fortunately, despite its higher expressive power, K*SQL has also proved to be highly amenable to efficient execution over high volumes of stored sequences and data streams [25]. For these reasons, and because of the appealing syntax of K*SQL for sequence queries, the Stream Mill system also supports special built-in UDAs that can efficiently execute K*SQL queries [24]. Likewise, it was previously shown that queries expressed in SQL-TS [16] could be mapped into equivalent ESL queries through the use of specialized UDAs [28]. Similarly, UDAs were used in [28] to implement the FSA computation used by YFilter to support multiple queries on streaming XML data, thus unifying the processing of these two kinds of streams. Thus, UDAs have proved to be a key extension of SQL. Furthermore, ESL greatly enhances their power and versatility for data stream applications by providing the flexible window mechanisms for arbitrary UDAs that are discussed next.

3Nested words [26] and visibly-pushdown automata [27] can model data with both sequential and hierarchical structures, such as XML, RNA sequences, or procedural software traces.

4However, expressing “immediately following” in SASE+ is significantly more difficult.

3 Window Aggregates and Their Applications

Following SQL:2003, ESL uses the OVER clause to specify (i) the type of window (i.e., logical or physical), (ii) the size of the window (using a time span for logical windows or the number of tuples for physical ones), and (iii) the columns in the partition-by clause (if any). However, for data streams, the ORDER BY clause can be omitted, since data streams are always ordered by their timestamps. In Example 7, we have a physical window of 100 items, consisting of the current tuple and the 99 rows preceding it; a separate window is maintained for each seller, as specified by the (PARTITION BY sellerID) clause.

Example 7 For each seller, maintain the max selling price over the last 100 items sold.

CREATE STREAM LastTenAvg
SELECT sellerID, max(price) OVER
    (PARTITION BY sellerID ROWS 99 PRECEDING)
FROM ClosedPrice;

By replacing, say, ‘ROWS 99’ with ‘RANGE 5 MINUTES’, we would instead specify a logical window of five minutes. Because of their many uses, the notions of window slides and tumbles [1, 12] are now supported in many DSMSs, although they go beyond the SQL:2003 standards. These constructs are supported in ESL for arbitrary UDAs, using a ‘SLIDE’ declaration in the window clause. For instance, the following example is similar to the previous one in every respect, except that results are now returned every 10 rows, rather than after each row as in the previous case.

CREATE STREAM LastTenAvg
SELECT sellerID, max(price) OVER
    (PARTITION BY sellerID ROWS 99 PRECEDING SLIDE 10)
FROM ClosedPrice;

In this example, the size of the slide (10) is smaller than the overall size of the window (100). Tumbles instead occur when the size of the slide equals or exceeds that of the window. For instance, to break the input stream into blocks of size 600 and return the average of the last 600 input tuples at the end of each block, we can specify ROWS 599 PRECEDING SLIDE 600. Thus the sizes of the window and of the slide are both 600, the input data stream is partitioned into windows of size 600, and results are returned every 600 tuples.
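The interplay of window size and slide can be summarized by a small sketch, given below in plain Python rather than ESL: a result is produced once every slide tuples over (at most) the last rows tuples, and when the slide equals the window size the stream is simply cut into disjoint blocks, i.e., a tumble. The aggregate, data, and parameter values are illustrative.

from collections import deque

def windowed(stream, agg, rows, slide):
    window = deque(maxlen=rows)       # keeps only the most recent `rows` tuples
    for i, item in enumerate(stream, 1):
        window.append(item)
        if i % slide == 0:            # one result per slide
            yield agg(window)

data = range(1, 21)
print(list(windowed(data, sum, rows=5, slide=5)))    # tumble: disjoint blocks of 5
print(list(windowed(data, sum, rows=10, slide=5)))   # slide < window: overlapping windows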

Window Aggregates

Providing efficient, integrated management and support for an assortment of different windows represents an interesting research problem that in the past has been addressed only for specific built-in aggregates or specific classes of aggregates [1, 12, 29]. A naive approach to implementing windows on arbitrary UDAs requires the user to write six versions of an aggregate, one for each combination of (logical | physical) × (tumble | slide | no-slide). With ESL, however, users only need to write at most the following two versions: (a) the base version of the UDA, and (b) an optional window-optimized version. The second version (b) allows the user to perform delta maintenance on arbitrary UDAs, which can lead to significant performance improvements. The ESL compiler utilizes these two definitions to provide integrated support for (i) general optimization tasks that are applied to all window aggregates, along with (ii) user-specified optimizations for particular UDAs. We next illustrate how this is accomplished using the MAX aggregate as an example, whose base definition is shown below:

Example 8 Base Definition for MAX

CREATE WINDOW AGGREGATE max(Next Real) : Real {
    TABLE current(CVal Real);
    INITIALIZE : {
        INSERT INTO current VALUES (Next); /* the first value in is the first max */
    }
    ITERATE : {
        UPDATE current SET CVal = Next WHERE CVal < Next;
        INSERT INTO RETURN SELECT CVal FROM current;
    }
}

The Stream Mill system provides uniform support for physical and logical windows; thus the six combinations above degenerate to three, which we discuss next. The simplest one is that of tumbles, i.e., the case in which the UDA is called over a window whose size is smaller than or equal to that of its slide. For instance, say that MAX is called on a window of 600 tuples and the size of its slide is S ≥ 600. In this case, ESL will use the base definition of MAX in Example 8 and return the result after 600 tuples. The next S − 600 input tuples are then ignored, and the computation restarts from the (S − 600 + 1)th tuple and runs for another 600 tuples, and so on. Examples illustrating the use of tumbles in clustering and ensemble-based classification are given in Sects. 5.1 and 5.2.

The second case is that of Example 7, in which the UDA is called without the slide construct. A naive implementation consists in buffering all the window tuples into an inwindow table.5 Then, for each arriving tuple, the base MAX aggregate is recomputed over the tuples in the inwindow table. While this approach is correct, it is obviously inefficient.

5Newly arriving tuples are inserted into, and expiring tuples are removed from, the inwindow table automatically by the system for efficiency.

Thus, ESL allows users to define a specialized version of the UDA which uses the values of the tuples leaving the window to perform delta maintenance. For an aggregate such as sum, this delta maintenance involves subtracting the expiring value from the current sum. For MAX and more sophisticated aggregates the delta maintenance is more complex, but nevertheless quite beneficial. As shown in Example 9, the delta maintenance is performed in a special state called EXPIRE. In our window version of MAX, shown in Example 9, the EXPIRE event does not require any action: the expiring tuple is simply discarded (automatically, by the system). The ITERATE state of the UDA only keeps tuples that can potentially be the maximum in the window. Thus, the oldest tuple in the inwindow table is the maximum in the current window. The result of the aggregate is the same whether this delta computation is performed as soon as a tuple expires, later when a new tuple arrives, or anywhere in between these two instants. ESL takes advantage of this freedom to optimize execution.

Example 9 MAX with Windows

CREATE WINDOW AGGREGATE max(Next Real) : Real {
    TABLE inwindow(wnext Real);
    INITIALIZE : {
        INSERT INTO RETURN VALUES (Next);
        /* the system adds new tuples to inwindow */
    }
    ITERATE : {
        DELETE FROM inwindow WHERE wnext <= Next;
        INSERT INTO RETURN VALUES (oldest());
    }
    EXPIRE : { } /* expired tuples are removed automatically */
}

In the definition of window aggregates, EXPIRE is treated as an event that occurs once for each expired tuple, and the expired tuple is removed as soon as the EXPIRE statement completes execution. ESL also provides the built-in predicate oldest(), which selects the oldest tuple among the tuples of inwindow: oldest().wnext delivers the wnext column of this tuple. If the tuple has only one column, then the system allows using oldest() without the column name. Upon arrival of a new tuple, the system first proceeds to execute any outstanding EXPIRE events. The ITERATE statements are next executed on the newly arrived tuple. After the ITERATE statements, the new tuple is put into the inwindow buffer. This delta-maintenance approach to window aggregates is also used in the implementation of the basic ESL built-in aggregates. For instance, in the case of the MAX aggregate, we eliminate tuples in the buffer that are dominated by more recent tuples, thus reducing the size of the buffer from W to log(W), where W denotes the size of the window. Frequently, the approach is also effective on more complex UDAs, such as the approximate frequent-items example discussed in Sect. 4.
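The pruning performed for the built-in MAX can be sketched in plain Python (not ESL) as follows: any buffered value that is dominated by a newer, larger value can never become the window maximum and is discarded on arrival, so the oldest surviving candidate is always the current maximum, exactly as in Example 9.

from collections import deque

def sliding_max(stream, window):
    buf = deque()                            # (position, value), values decreasing
    for i, value in enumerate(stream):
        while buf and buf[-1][1] <= value:   # ITERATE: drop dominated candidates
            buf.pop()
        buf.append((i, value))
        if buf[0][0] <= i - window:          # EXPIRE: oldest candidate left the window
            buf.popleft()
        yield buf[0][1]                      # oldest surviving candidate = current max

print(list(sliding_max([3, 1, 4, 1, 5, 9, 2, 6], window=3)))
# [3, 3, 4, 4, 5, 9, 9, 9]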

For the third case, i.e., where the specified slide size is less than the window size, ESL utilizes the optimization proposed in [12]. This optimization divides the window into smaller panes; the window version of the UDA is used to perform the delta computation on the results of the base version of the UDA over each smaller pane [30] (a pane-based evaluation of this kind is sketched at the end of this section). Using these powerful UDAs we can also implement the ‘negative tuple’ semantics6 that has also been considered for windows [31].

6Besides returning values when new tuples arrive, under such semantics new, revised aggregate values can be produced as soon as a tuple expires from the time-based window.

The introduction of powerful analytics in SQL:2003 has illustrated the need for DBMSs to support a wider range of aggregates than those of SQL-2, and the important role that windows play in this context. However, window aggregates play an even more important role in DSMSs, particularly those that must support complex tasks with QoS guarantees: in fact, window UDAs can effectively support (i) approximate computations, (ii) load shedding, and (iii) complex decision support and mining tasks. Our discussion in the rest of the chapter will focus on these topics.
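The pane optimization of [12] mentioned above can be illustrated with the following sketch (plain Python, not ESL, and using sum as the base aggregate): the stream is cut into panes of gcd(window, slide) tuples, the base aggregate is computed once per pane, and each window result is obtained by combining the panes it covers, so that work is shared across overlapping windows.

from math import gcd

def paned_sums(stream, w, s):
    pane_size = gcd(w, s)
    panes, current = [], []
    for x in stream:
        current.append(x)
        if len(current) == pane_size:
            panes.append(sum(current))           # base aggregate over one pane
            current = []
            covered = w // pane_size
            if len(panes) >= covered and (len(panes) * pane_size) % s == 0:
                yield sum(panes[-covered:])      # combine panes into a window result

print(list(paned_sums(range(1, 13), w=6, s=2)))
# sums over [1..6], [3..8], [5..10], [7..12]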

4 Approximation and Sketch Aggregates

In order to assure QoS in the face of high arrival rates and bursts in the incoming data streams, DSMSs rely on (i) load shedding and (ii) approximation techniques, both of which dovetail with a UDA-oriented architecture. Indeed, Stream Mill provides an architecture where load shedding is integrated with UDAs [32], supporting error models that accommodate different requirements for multiple users, different sensitivities to load shedding, and different penalty functions. By incorporating a priori statistics of the data streams, the Stream Mill system can provide QoS guarantees for a large class of queries, including traditional SQL aggregates, statistical aggregates, and data mining functions [32]. A Bayesian approach can be used to combine past statistical information about query answers and further boost the quality of the a priori estimations [33]. We now leave the discussion of load shedding, which would take us beyond the scope of this chapter, and concentrate on sketches and other approximate aggregates.

EH Sketches

Many data mining techniques require counting the frequency of different items in a window. Computing the exact counts, however, would require storing the whole window in order to determine the tuples leaving it; for windows that are too large, approximate counting aggregates can be used instead. In [34], Datar et al. proposed the Exponential Histogram (EH) sketch algorithm for approximating the number of 1's in sliding windows over a 0–1 stream, and showed that for a δ-approximation of the number of 1's in the current window the algorithm needs O((1/δ) log W) space, where W is the window size.


The EH sketch consists of an ordered list of buckets or boxes. Every box in an EH sketch carries two types of information: a time interval and the number of observed 1's in that interval. The intervals of different boxes do not overlap, and every 1 in the current window is counted in exactly one of the boxes. Boxes are sorted by the start time of their intervals. For each newly arriving 1, EH creates a new box of size one. The algorithm then checks whether the number of boxes of the same size exceeds k/2 + 2 (where k = 1/δ); if so, it merges the oldest two such boxes. The merge operation adds up the sizes of the boxes and merges their intervals. The final estimate of the number of 1's is the aggregate size of the boxes minus half of the oldest box's size. Note that before reporting the results, we discard the boxes that do not overlap the current window. Thus, an EH UDA can be expressed by the ESL code below, and then called in a way similar to built-in SQL3 aggregates.

Example 10 Counting by Exponential Histograms

STREAM zeroOnes(val Int, t Timestamp);

AGGREGATE EHCount(next Int, t Timestamp, k Int) : {
    WINDOW EH(h Int, t Timestamp) ORDER BY t;
    TABLE memo(last Int, total Int) MEMORY VALUES (0, 0);

    AGGREGATE merge(next Int, t Timestamp, k Int) : {
        /* the state table stores the current box, count, and the last two timestamps */
        TABLE state(h Int, cnt Int, t1 Timestamp, t2 Timestamp) MEMORY;
        INITIALIZE : {
            INSERT INTO state VALUES(next, 1, t, NULL);
        }
        ITERATE : {
            UPDATE state SET cnt = cnt + 1, t2 = t1, t1 = t WHERE h = next;
            /* should early return if true */
            UPDATE state SET h = next, cnt = 1, t1 = t
            WHERE h <> next AND (SELECT cnt FROM state) < k/2 + 2;
            /* should early return if true */
            /* if the current count is k/2 + 2, delete the last box and double the next-to-last one */
            DELETE FROM EH h
            WHERE h.t = (SELECT t1 FROM state) AND h <> next
              AND (SELECT cnt FROM state) = k/2 + 2;
            UPDATE EH h SET h = h*2
            WHERE SQLCODE = 0 AND h.t = (SELECT t2 FROM state);
            UPDATE state SET h = next, cnt = 2, t1 = t WHERE SQLCODE = 0;
        }
    }; /* the end of the merge aggregate */

    INITIALIZE : ITERATE : {
        INSERT INTO EH VALUES(next, t) WHERE next > 0; /* ignore 0's */
        UPDATE memo SET total = total + next;
        SELECT merge(h, t, k) OVER (ORDER BY t DESC) FROM EH;
        /* update the last pointer due to the merge */
        UPDATE memo SET last = (SELECT max(h) FROM EH);
        INSERT INTO RETURN SELECT total FROM memo;
    }

    EXPIRE : {
        UPDATE memo SET last = h/2
        WHERE (SELECT count(1) FROM EH h WHERE h.h = last) = 1;
        /* update the total pointer */
        UPDATE memo SET total = total - h/2 - last/2;
    }
}

/* Calling the aggregate just defined from an ESL statement */
SELECT EHCount(val, t, 2) OVER (RANGE 10 MINUTE) FROM zeroOnes;
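The same bookkeeping can be written compactly in plain Python, which may help in following the ESL version above; this is an illustrative sketch of the EH logic (the box layout, parameter names, and the all-ones test stream are assumptions), not the Stream Mill implementation.

from collections import deque

class EHCounter:
    def __init__(self, k, window):
        self.k, self.window = k, window
        self.boxes = deque()                 # (timestamp of newest 1, size), oldest first

    def add(self, timestamp, bit):
        if bit:
            self.boxes.append((timestamp, 1))
            self._merge()
        # discard boxes whose newest 1 no longer overlaps the window
        while self.boxes and self.boxes[0][0] <= timestamp - self.window:
            self.boxes.popleft()

    def _merge(self):
        size = 1
        while sum(1 for _, s in self.boxes if s == size) > self.k // 2 + 2:
            oldest = [i for i, (_, s) in enumerate(self.boxes) if s == size][:2]
            merged = (self.boxes[oldest[1]][0], 2 * size)   # keep the newer timestamp
            self.boxes = deque(b for i, b in enumerate(self.boxes) if i not in oldest)
            self.boxes.insert(oldest[0], merged)
            size *= 2                                       # merging may cascade upward

    def estimate(self):
        if not self.boxes:
            return 0
        return sum(s for _, s in self.boxes) - self.boxes[0][1] // 2

eh = EHCounter(k=4, window=50)
for t in range(100):
    eh.add(t, 1)            # an all-ones stream
print(eh.estimate())        # an approximate count of the 1's in the last 50 ticks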

This sketch is used in several other structures and algorithms listed in [35]. Recently, [36] used EH to generate an approximate B-bucket equi-depth histogram for data streams with sliding windows. The proposed approach, called BAr-Splitting Histograms (BASH), is based on dividing the acceptable input range into several chunks, or bars, in such a way that each bar contains roughly the same number of items. The number of bars is limited to a fixed value greater than B to improve accuracy. To keep the sizes of the bars roughly the same as the stream passes, a splitting/merging technique is employed to split big bars and merge adjacent small ones. BASH provides a very fast and space-efficient equi-depth histogram, particularly for high-speed data streams.

Approximate Frequent Items

The problem of determining the frequent items in a data stream is important in many applications, and several algorithms have been proposed to deal with the common situation where there is enough memory for the frequent items but not for all items [37–39]. Here we focus primarily on [37], a windowed approximate frequent-items algorithm suitable for delta computation. The algorithm, shown with ESL code in Example 11, maintains k hash tables over the current window. Each hash table has a corresponding hash function, and each hash entry is an integer, used as a counter. When an item enters the window, we iterate through the k hash functions and determine the k key values; for each key value, we increment the counter at that location in the corresponding hash table. Similarly, when an item expires out of the window, we decrement the corresponding k counters. Finally, the approximate frequency of an item is determined by taking the minimum value of its k counters. Note that this minimum value may overestimate the frequency of the item if each of the k keys has at least one other item mapped to it. This algorithm can also be viewed as a Bloom filter with two exceptions: (i) there are k different hash tables instead of just one, and (ii) each entry is an integer (a counter), as opposed to a bit. In addition to the delta-maintenance property, the algorithm provides bounded error estimates. Thus, given the available amount of memory, we can estimate the expected error.

Example 11 Approximate Frequency Count

STREAM items(item Int); /* Stream of items */
TABLE hash_tables(index1 Int, index2 Int, cnt Int) MEMORY;
    /* the k hash tables; index2 goes from 1 to k */
TABLE hs(h Int, ah Int, bh Int) MEMORY; /* constants for the hash functions */
/* table initialization omitted */

/* Windowed aggregate that maintains the hash tables */
WINDOW AGGREGATE MaintainHashes(k Int) : Int {
    /* an aggregate that updates a certain hash entry */
    AGGREGATE updateCnt(k Int, h Int, ah Int, bh Int, val Int) : Int {
        INITIALIZE : ITERATE : {
            UPDATE hash_tables SET cnt = cnt + val
            WHERE index1 = ((ah*k + bh) % 31) % 4 AND index2 = h;
        }
    };
    INITIALIZE : ITERATE : { /* new item entering the window */
        SELECT updateCnt(k, h, ah, bh, 1) FROM hs;
    }
    EXPIRE : { /* item expiring */
        SELECT updateCnt(k, h, ah, bh, -1) FROM hs;
    }
};

/* Calling the UDA just defined */
SELECT MaintainHashes(item) OVER (ROWS 29 PRECEDING) FROM items;
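The counter maintenance of Example 11 can be mirrored in plain Python as below; the number of tables, bucket count, hash seeds, and test stream are illustrative assumptions rather than the choices of [37].

from collections import deque

K, BUCKETS, WINDOW = 4, 101, 30
SEEDS = [(3, 7), (5, 11), (17, 23), (29, 31)]     # one (a, b) pair per hash table
tables = [[0] * BUCKETS for _ in range(K)]
window = deque()

def bucket(i, item):
    a, b = SEEDS[i]
    return (a * hash(item) + b) % BUCKETS

def insert(item):
    window.append(item)
    for i in range(K):
        tables[i][bucket(i, item)] += 1           # item enters the window
    if len(window) > WINDOW:
        expired = window.popleft()                # delta maintenance on expiry
        for i in range(K):
            tables[i][bucket(i, expired)] -= 1

def estimate(item):
    # May overestimate if the item collides with other items in every table.
    return min(tables[i][bucket(i, item)] for i in range(K))

for x in [1, 2, 1, 3, 1, 2] * 10:
    insert(x)
print(estimate(1), estimate(2), estimate(3))      # frequencies within the last 30 items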

5 Mining Data Streams

Data stream mining represents an important area of current research and the topic of many recent papers, which primarily focus on devising mining algorithms that are fast and light enough to be executed continuously and produce real-time or quasi-real-time responses. However, online data mining represents such a difficult issue for DSMSs that no system before Stream Mill [40] has claimed success in this important application. Many of the problems facing DSMSs are similar to those of the DBMSs that, in the mid-1990s, were unable to extend the success of SQL on OLAP applications to data mining applications. Indeed, performing data mining tasks through DBMS-supported constructs and functions was exceedingly difficult [41], whereby in a visionary 1996 paper [42], Imielinski and Mannila called for a major research effort to produce a quantum leap in the functionality and usability of DBMSs, whereby mining queries could be supported with the same ease of use as relational queries are supported today. The notion of “Inductive DBMS” was thus born, which inspired much research [43], while vendors have been working on providing some data mining functionality as part of their DBMSs [44].

The Stream Mill DSMS supports a powerful data stream mining workbench called SMM, which is open and extensible. The first ingredient of SMM is a library of powerful data stream mining methods defined as window UDAs. New methods can be defined using ESL or procedural languages; in either case the definition follows the standard INITIALIZE, ITERATE, TERMINATE, and EXPIRE templates of the aggregates previously described. For analysts and other users who want to work at a higher level of abstraction, SMM supports mining models [40, 45]. Mining methods and models are discussed next.

5.1 Density-Based Clustering (DBScan)

DBScan represents a popular clustering algorithm that can be successfully applied to mining and monitoring data streams [46]. Let us assume we have a stream of two-dimensional data, where more than minPts points occurring in close proximity to each other (i.e., at distance less than eps) are assigned to the same cluster, while sparse points are instead classified as outliers. To monitor changes in the incoming stream of two-dimensional data, we employ the DBScan algorithm as follows: (i) partition the stream into blocks containing the same number of tuples, (ii) cluster the data in each block, and (iii) monitor the appearance/disappearance of new/old clusters and changes in cluster population between successive blocks. The first two tasks are accomplished by the following ESL statement, which invokes the dbscan aggregate on the input data stream Stream_of_Points(Xvalue, Yvalue, TimeStamp):

/* call dbscan with minPts = 10 and eps = 50 */
SELECT dbscan(Xvalue, Yvalue, 0, 10, 50)
OVER (ROWS 999 PRECEDING SLIDE 1000)
FROM Stream_of_Points

Here 10 and 50 are the example values we assign to two important parameters of the DBScan algorithm, minPts and eps, respectively. The third argument is for bookkeeping purposes. Observe that since the size of the slide is the same as that of the window, this is a tumble. Therefore, the Stream Mill system will use the base definition of DBScan, shown below, independent of whether a window version is available or not.

Given the two parameters eps and minPts, the DBScan algorithm works as follows: pick an arbitrary point p and find its neighbors (the points that are less than eps distance away). If p has more than minPts neighbors, then form a cluster and call DBScan on all its neighbors recursively. If p does not have more than minPts neighbors, then move on to other unclustered points in the database. Note that this can be viewed as a depth-first search.

AGGREGATE dbscan(iX Real, iY Real, Flag Int, minPt Int, eps Int) : Int {
    TABLE closepnts(X2 Real, Y2 Real, C2 Int) MEMORY;
    INITIALIZE : ITERATE : {
        /* Find neighbors of the given point */

        INSERT INTO closepnts
        SELECT X1, Y1, C1 FROM points
        WHERE sqrt((X1-iX)*(X1-iX) + (Y1-iY)*(Y1-iY)) < eps;
        /* If there are more than minPt neighbors, form a cluster */
        UPDATE clusterno SET Cno = Cno + 1 /* new cluster number */
        WHERE Flag = 0 AND SQLCODE = 0 /* a new cluster */
          AND minPt < (SELECT count(C2) FROM closepnts);
        /* Assign these neighboring points to this cluster */
        UPDATE points SET C1 = (SELECT Cno FROM clusterno)
        WHERE points.C1 = 0
          AND EXISTS (SELECT S.X2 FROM closepnts AS S
                      WHERE points.X1 = S.X2 AND points.Y1 = S.Y2)
          AND minPt < (SELECT count(C2) FROM closepnts);
        /* Call dbscan recursively */
        SELECT dbscan(X2, Y2, 1, minPt, eps)
        FROM closepnts, points WHERE X1 = X2 AND Y1 = Y2;
        DELETE FROM closepnts;
    }
}; /* end dbscan */
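The same procedure reads more directly in plain Python; the sketch below (an illustration, not ESL, with a hypothetical list of points standing in for the current block) labels outliers with 0 and grows each cluster depth-first from points that have more than minPts neighbors within eps.

from math import dist

def dbscan(points, eps, min_pts):
    labels = [0] * len(points)            # 0 = unclustered, i.e., outlier so far
    cluster = 0
    for p in range(len(points)):
        if labels[p]:
            continue
        neighbors = [q for q in range(len(points)) if dist(points[p], points[q]) < eps]
        if len(neighbors) <= min_pts:
            continue                      # sparse point: leave it as an outlier
        cluster += 1
        stack = [p]
        while stack:                      # depth-first expansion of the cluster
            q = stack.pop()
            if labels[q]:
                continue
            labels[q] = cluster
            near = [r for r in range(len(points)) if dist(points[q], points[r]) < eps]
            if len(near) > min_pts:       # only dense points spread the cluster further
                stack.extend(r for r in near if not labels[r])
    return labels

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
print(dbscan(pts, eps=2.0, min_pts=3))    # one cluster of five points plus an outlier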

5.2 Mining Data Streams with Concept Drift

Since streaming data is characterized by time-changing concepts, a basic challenge faced by data mining algorithms is to model and capture the time-evolving trends and patterns in the streams, and to make time-critical predictions. One approach is to incrementally maintain a model for the time-changing data. In this case the model is learned from the data in the most recent window. This approach has several weak points. First, given that data arrives at high speed, incremental model maintenance is usually a costly task, especially for learning methods, such as the decision tree algorithm, which are known to be unstable. Second, models trained from the data in a window may not be optimal: if the window is too large, it may contain concept drifts; if it is too small, it may result in over-fitting.

A more effective approach consists in using an ensemble-based model, whereby we partition the stream into fixed-size data chunks and learn a model from each chunk. We combine the models learned from data chunks whose class distribution is similar to the most recent training data to form our stream classifier (as shown in Fig. 1). This approach reduces classification error in a concept-drifting environment. We use ESL to implement this approach [47] effectively, taking full advantage of off-the-shelf classifier packages and other procedural routines.

The seemingly complex solution described above can be implemented in ESL in a very succinct way. We assume each data record in the stream is of the form (a1, ..., an, L), where a1, ..., an are attribute values and L is the class label. If L = TBA, then it is a testing example; otherwise it is a training example.

Fig. 1 Mining streams with concept-drift

In Example 12, we express the algorithm in one SQL statement. Here, we call the UDA ClassifyStream with the keyword SLIDE, which implements data partitioning on the stream via tumbles of size 1000. We assume that classifiers, together with their weights, are stored in a table called ensemble. In the UDA ClassifyStream, we use the classifiers in the ensemble whose weights are above a given threshold to classify each test example, where Classify is a UDA for classifying static data [11]. Once we reach the end of a data partition, we learn a new classifier from the training data in the partition, and we reset the weight of each classifier in the ensemble in proportion to its accuracy in classifying the most recent training data. The freshly weighted classifiers are then used to classify the data in the next partition.

Example 12 A Terse Expression for Complex Classifier Ensembles

SELECT ClassifyStream(S.*)
OVER (ROWS 999 PRECEDING SLIDE 1000)
FROM stream AS S;

Example 13 UDA ClassifyStream

AGGREGATE ClassifyStream(a1, ..., an, L) : Int {
    TABLE temp(a1, ..., an, L);
    INITIALIZE : ITERATE : {
        INSERT INTO RETURN
        SELECT sum(E.Classify(a1, ..., an) * E.weight) / sum(E.weight)
        FROM ensemble AS E
        WHERE L = TBA AND E.weight >= threshold;
        INSERT INTO temp VALUES (a1, ..., an, L);
    }
    TERMINATE : {
        INSERT INTO ensemble

        SELECT learn(T.*) FROM temp AS T WHERE T.L <> TBA;
        UPDATE ensemble AS E
        SET E.weight = (SELECT 1 - avg(|E.Classify(T.*) - T.L|)
                        FROM temp AS T WHERE T.L <> TBA);
    }
}
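The ensemble logic of Examples 12 and 13 can be paraphrased in plain Python as follows; the classifiers here are hypothetical threshold rules standing in for a real learner, and the weighting mirrors the 1 - avg|Classify - L| formula of Example 13.

def make_classifier(threshold):
    return lambda x: 1 if x >= threshold else 0

ensemble = [{'clf': make_classifier(t), 'weight': 1.0} for t in (0.3, 0.5, 0.7)]

def classify(x, min_weight=0.2):
    # Weighted vote of the classifiers whose weight is above the threshold.
    eligible = [m for m in ensemble if m['weight'] >= min_weight]
    votes = sum(m['clf'](x) * m['weight'] for m in eligible)
    return round(votes / sum(m['weight'] for m in eligible))

def reweight(training_chunk):
    # training_chunk: list of (attributes, label); weight = 1 - average absolute error.
    for m in ensemble:
        errors = [abs(m['clf'](x) - label) for x, label in training_chunk]
        m['weight'] = 1 - sum(errors) / len(errors)

chunk = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]   # the latest labelled partition
reweight(chunk)
print([round(m['weight'], 2) for m in ensemble])    # [0.75, 1.0, 0.75]
print(classify(0.55))                                # weighted majority vote: 1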

5.3 Mining Models

The integration of mining methods into SMM is made simple via the Mining Model Definition Language (MMDL), which supports the declaration of mining models [40]. Each mining model instance defines (i) which mining UDAs will be used in the task, (ii) the parameter values and ancillary information they will use, and (iii) the flow of stream data between these methods [40, 45]. Such flows need to be specified for complicated mining tasks. Consider a more advanced mining method such as ensemble-based weighted bagging (EBWB) [47], which is supported by SMM to improve the accuracy of classifiers in the presence of concept drifts and shifts. With EBWB, instead of maintaining a single classifier, the user maintains several small classifiers, whose classifications are later combined using some form of weighted voting. This approach assures better adaptation in the presence of concept shift and concept drift, since new classifiers can be continuously trained on the latest statistics, while older or inaccurate classifiers can be retired. Note that specifying the various steps required for weighted bagging represents a daunting task for analysts and less experienced users. Therefore, MMDL supports the specification of one or more mining flows within the mining model definition. These complex mining processes only have to be specified once, during model definition, and can then be reused by all users. Flows have been essential in the definition of many built-in mining methods in SMM [40], such as SWIM for association rule mining [48]. To the best of our knowledge, SMM's ability to support data stream mining algorithms is unique among DSMSs and CEP systems. On the other hand, systems such as MOA [49] provide a flexible and user-friendly environment for evaluating algorithms for data stream mining and for the incremental mining of data sets; however, such systems are not DSMSs designed to support QoS for continuous queries over extended periods of time.

6 The Stream Mill System

The architecture of the Stream Mill system consists of a single server and multiple clients.

The Client

Users interact with the server through the query editor, marked as α in Fig. 2. The query editor allows the user to perform the following tasks: logging in and out of the system, defining streams, queries, and aggregates, starting and stopping queries, etc. Results of these tasks are shown, by default, in the query editor's status pane. The client also provides a set of GUI modules to display the workflow and the results of the continuous queries in graphical form, e.g., β in Fig. 2. Several performance meters (marked as γ) are also at hand to continuously monitor traffic, memory utilization, queue length, and related measures of server performance.

The Server

The bulk of Stream Mill R&D efforts focused on the Server, which supports the following functional modules:

Query Compiler/Optimizer

The compiler is responsible for parsing and compiling continuous queries and for generating and modifying the query graph that describes how continuous queries are implemented by operators that take tuples from their input buffers and push them into their output buffers (in cooperation with the Buffer Manager). These operators are implemented as C/C++ functions compiled into dynamic libraries, which are then invoked by the Execution Scheduler. After careful optimizations, natively defined UDAs execute, on average, nearly as well as UDAs externally defined in C++ (with about a 30 % slowdown) and better than those defined in Java.

Buffer Manager

The Buffer Manager is responsible for managing the stream tuples as in-memory queues. Most tuples are processed and removed from these buffers as quickly as possible to free space and to reduce latency. However, when windows are used in the query, tuples must be retained in main memory for extended periods of time. This is the task of the Window Manager, which is also responsible for supporting delta-maintenance constructs for these windows, window sharing, and a suitable paging policy.

Execution Scheduler

The Execution Scheduler is responsible for deciding which operator from the query graph executes next and how many input tuples it will consume. This decision reflects the optimization criterion selected, which in turn reflects the priorities specified by the user and the load conditions currently experienced by the system.

Fig. 2 The stream mill system architecture

For instance, a low-memory condition might force the scheduler to change from a scheduling policy that minimizes response time to one that minimizes memory [50]. This change might require a modification (e.g., partitioning) of the query graph; Stream Mill is capable of performing this adjustment very quickly, efficiently, and without stopping the execution of the continuous queries [51]. Stream Mill's flexible execution model also supports on-demand generation of timestamps to minimize idle waiting in operators such as union and joins [51]. Other important modules of the system include the I/O Scheduler, which is responsible for managing incoming and departing data streams and the interaction between the server and the outside world, including the Stream Mill Client. The Database Manager, based on ATLaS and Berkeley DB, is used to support database queries and spanning applications. Various extensions and improvements being added include new data stream mining algorithms, load shedding extensions, and better GUI primitives.

7 Related Work

Data Stream Management Systems (DSMSs) represent a vibrant area of current research comprising several subareas. Because of space limitations and the availability of authoritative surveys [5, 52], we focus here on the previous work that is most relevant to ESL. The Tapestry project [53, 54] was the first to discuss ‘queries that run continuously over a growing database’, with append-only relations as the basic data model for data streams. The append-only data model was then adopted in many projects, including Tribeca [55] and Telegraph [56]. Likewise, OpenCQ [57] and the Niagara system [58] are designed to support continuous queries for monitoring web sites, and the Chronicle data model uses append-only ordered sets of tuples (chronicles) [59].

With respect to languages for continuous queries, the use of SQL and its dialects is predominant not only for DSMSs of relational DBMS lineage, but also for the various systems of mongrel lineage known as CEP systems.7 In reality, however, the apparent popularity of SQL is restricted by many exceptions and limitations. For instance, extensions to C++ are used in Hancock [2], while systems focusing on streaming XML data use XQuery or XPath [8, 60]. Finally, CEP systems tend to support SQL only as a tool of convenience for simple applications, while relying on some Java-based language for more serious applications and system extensions. Even in DSMSs with a relational database lineage, we find variations and limitations. For instance, Tribeca relies on operators adapted from relational algebra [55], while active database rules are used in OpenCQ [57]. Furthermore, the influential Aurora/Borealis project [61] focuses on providing an attractive graphical interface to define a network of continuous query operators, from which the user can then request that an equivalent SQL program be produced.

Another influential DSMS project is STREAM; this system and its Continuous Query Language (CQL) [14] feature several syntactic variations from the SQL:2003 standards, and depart from the append-only model by proposing an approach based on database queries over continuously sliding windows. Unlike many other DSMS projects, the Stream Mill project seeks to preserve the syntax and semantics of the SQL standards as far as possible. In this respect, our project is similar to the very influential Gigascope project [15], which has also adopted the append-only model, as ESL does. But unlike Gigascope, which was designed primarily for network analysis and management, ESL strives to serve a much wider range of applications, including applications not supported by other DSMSs, such as pattern queries on relational and XML streams [25, 28] and data mining queries [40]. Thus, while Gigascope relies on an SQL-2 style of aggregates, ESL adopts the SQL:2003 constructs for windowed aggregates; the same constructs are then applied to UDAs, producing a compact language that is Turing complete on stored data and NB-complete on streaming data [6, 62].

7http://en.wikipedia.org/wiki/Complex_event_processing.

8 Conclusion

A key contribution of ESL and Stream Mill is proving that rather limited extensions enable database query languages to support effectively a very wide range of data stream applications, by providing levels of expressive power and generality that match or surpass those of other query languages and systems proposed for data stream and publish/subscribe applications [1–3, 8, 9]. The merits of ESL extensions are supported by theoretical results [6, 62] and demonstrated by important applications that are beyond the reach of other DSMSs, including continuous data mining queries [40], sequence queries on the nested-word and the XML model [24, 25, 28], and algorithms for synopsis maintenance [36]. This significant leap in power and generality has been achieved while preserving the basic append-only-table semantics for data streams and minimizing extensions w.r.t. the SQL:2003 standards, as needed to facilitate the writing of applications that span both databases and data streams.

The Stream Mill prototype is now fully operational and supports (i) continuous queries on data streams [63], (ii) ad hoc queries on database tables, and (iii) ad hoc queries on table-like concrete views defined on data streams. More information on (iii), time-series queries, and XQuery on SAX in Stream Mill is available from the project web site [13].

Acknowledgements Thanks are due to Yijian Bai, Yannei Law, Stefano Emiliozzi, Shu Man Li, Vincenzo Russo, and Xin Zhou for their many contributions to the system and its enabling technology.

References

1. D. Abadi, D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, S. Zdonik, Aurora: a new model and architecture for data stream management. VLDB J. 12(2), 120–139 (2003)
2. C. Cortes, K. Fisher, D. Pregibon, A. Rogers, Hancock: a language for extracting signatures from data streams, in SIGKDD (2000), pp. 9–17
3. P. Felber, P. Eugster, R. Guerraoui, A. Kermarrec, The many faces of publish/subscribe. ACM Comput. Surv. 35(2), 114–131 (2003)
4. ISO/IEC, Database languages—SQL, ISO/IEC 9075-*:2003 (2003)
5. B. Babcock, S. Babu, M. Datar, R. Motwani, J. Widom, Models and issues in data stream systems, in PODS (2002), pp. 1–16
6. Y.-N. Law, H. Wang, C. Zaniolo, Data models and query language for data streams, in VLDB (2004), pp. 492–503
7. Illustra Information Technologies, Illustra user's guide, 1111 Broadway, Suite 2000, Oakland, CA (1994)
8. D. Florescu, C. Hillery, D. Kossmann et al., The BEA/XQRL streaming XQuery processor. VLDB J. 13(3), 294–315 (2004)
9. Oracle, Oracle9i application developer's guide: advanced queuing. Oracle, Redwood Shores, CA, USA (2002)
10. H. Wang, C. Zaniolo, Using SQL to build new aggregates and extenders for object-relational systems, in VLDB (2000), pp. 166–175
11. H. Wang, C. Zaniolo, ATLaS: a native extension of SQL for data mining, in Proceedings of the Third SIAM International Conference on Data Mining (2003), pp. 130–141
12. J. Li, D. Maier, K. Tufte, V. Papadimos, P.A. Tucker, Semantics and evaluation techniques for window aggregates in data streams, in SIGMOD Conference (2005), pp. 311–322
13. Stream Mill home. http://wis.cs.ucla.edu/stream-mill
14. A. Arasu, S. Babu, J. Widom, CQL: a language for continuous queries over streams and relations, in DBPL (2003), pp. 1–19
15. C. Cranor, Y. Gao, T. Johnson, V. Shkapenyuk, O. Spatscheck, Gigascope: high performance network monitoring with an SQL interface, in SIGMOD (ACM, New York, 2002), p. 623
16. R. Sadri, C. Zaniolo, A. Zarkesh, J. Adibi, Optimization of sequence queries in database systems, in PODS (2001)
17. R. Sadri, C. Zaniolo, A.M. Zarkesh, J. Adibi, Expressing and optimizing sequence queries in database systems. ACM Trans. Database Syst. 29(2), 282–318 (2004)
18. F. Zemke, A. Witkowski, M. Cherniak, L. Colby, Pattern matching in sequences of rows, SQL change proposal (2007). http://www.sqlsnippets.com/en/topic-12162.html

19. N. Dindar, B. Güç, P. Lau, A. Ozal, M. Soner, N. Tatbul, Dejavu: declarative pattern matching over live and archived streams of events, in SIGMOD Conference (2009), pp. 1023–1026 20. E. Wu, Y. Diao, S. Rizvi, High-performance complex event processing over streams, in SIG- MOD Conference (2006), pp. 407–418 21. D. Gyllstrom, J. Agrawal, Y. Diao, N. Immerman, On supporting kleene closure over event streams, in ICDE (2008), pp. 1391–1393 22. A.J. Demers et al., Cayuga: a high-performance event processing engine, in SIGMOD Confer- ence (2007), pp. 1100–1102 23. R.S. Barga et al., Consistent streaming through time: a vision for event stream processing, in CIDR (2007), pp. 363–374 24. B. Mozafari, K. Zeng, C. Zaniolo, K*SQL: a unifying engine for sequence patterns and XML, in SIGMOD ConferenceÐDemo Track (2010), pp. 1143–1146 25. B. Mozafari, K. Zeng, C. Zaniolo, From regular expressions to nested words: unifying lan- guages and query execution for relational and XML sequences. Proc. VLDB Endow. 3(1), 150–161 (2010) 26. R. Alur, P. Madhusudan, Adding nesting structure to words, in Developments in Language Theory (2006) 27. R. Alur, P. Madhusudan, Visibly pushdown languages, in STOC (2004), pp. 202–211 28. X. Zhou, H. Thakkar, C. Zaniolo, Unifying the processing of XML streams and relational data streams, in ICDE (2006), p. 50 29. U. Srivastava, J. Widom, Memory-limited execution of windowed stream joins, in VLDB (2004), pp. 324–335 30. Y. Bai, H. Thakkar, C. Luo, H. Wang, C. Zaniolo, A data stream language and system designed for power and flexibility, in CIKM (2006), pp. 337–346 31. L. Golab, M. Tamer Özsu, Update-pattern-aware modeling and processing of continuous queries, in ACM SIGMOD Conference (2005), pp. 658–669 32. B. Mozafari, C. Zaniolo, Optimal load shedding with aggregates and mining queries, in ICDE (2010), pp. 76–88 33. Y.-N. Law, C. Zaniolo, Improving the accuracy of continuous aggregates and mining queries on data streams under load shedding. Int. J. Bus. Intell. Data Min. 3(1), 99–117 (2008) 34. M. Datar, A. Gionis, P. Indyk, R. Motwani, Maintaining stream statistics over sliding win- dows: (extended abstract), in Proceedings of the Thirteenth Annual ACMÐSIAM Symposium on Discrete Algorithms (2002), pp. 635–644 35. C. Aggarwal, Data Streams: Models and Algorithms (Springer, Berlin, 2007) 36. H. Mousavi, C. Zaniolo, Fast and accurate computation of equi-depth histograms over data streams, in EDBT (2011), pp. 69–80 37. C. Jin, W. Qian, C. Sha, J.X. Yu, A. Zhou, Dynamically maintaining frequent items over a data stream, in Proceedings of the 12th ACM Conference on Information and Knowledge Management (CIKM) (2003) 38. M. Charikar, K. Chen, M. Farach-Colton, Finding frequent items in data streams, in Interna- tional Colloquium on Automata, Languages, and Programming (ICALP) (2000), pp. 508–515 39. G. Cormode, S. Muthukrishnan, What’s hot and what’s not: tracking most frequent items dynamically, in PODS (2003), pp. 296–306 40. H. Thakkar, N. Laptev, H. Mousavi, B. Mozafari, V. Russo, S.M.M. Carlo Zaniolo, A data stream management system for knowledge discovery, in ICDE (2011), pp. 757–768 41. S. Sarawagi, S. Thomas, R. Agrawal, Integrating association rule mining with relational database systems: alternatives and implications, in SIGMOD (1998) 42. T. Imielinski, H. Mannila, A database perspective on knowledge discovery. Commun. ACM 39(11), 58–64 (1996) 43. C. 
Zaniolo, Mining databases and data streams with query languages and rules—invited paper, in KDID 2005: Knowledge Discovery in Inductive Databases, 4th International Workshop. Lecture Notes in Computer Science, vol. 3933 (Springer, Berlin, 2006), pp. 24–37
44. Z. Tang, J. Maclennan, P.P. Kim, Building data mining solutions with OLE DB for DM and XML for analysis. SIGMOD Rec. 34(2), 80–85 (2005)

45. H. Thakkar, B. Mozafari, C. Zaniolo, Designing an inductive data stream management system: the Stream Mill experience, in SSPS (2008), pp. 79–88
46. H.-P. Kriegel, M. Ester, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, in KDD (1996), pp. 226–231
47. H. Wang, W. Fan, P.S. Yu, J. Han, Mining concept-drifting data streams using ensemble classifiers, in KDD (2003), pp. 226–235
48. B. Mozafari, H. Thakkar, C. Zaniolo, Verifying and mining frequent patterns from large windows over data streams, in ICDE (2008), pp. 179–188
49. A. Bifet, G. Holmes, B. Pfahringer, P. Kranen, H. Kremer, T. Jansen, T. Seidl, MOA: massive online analysis, a framework for stream classification and clustering. J. Mach. Learn. Res. 11, 44–50 (2010)
50. Y. Bai, C. Zaniolo, Minimizing latency and memory in DSMS: a unified approach to quasi-optimal scheduling, in SSPS (2008), pp. 58–67
51. Y. Bai, H. Thakkar, H. Wang, C. Zaniolo, Optimizing timestamp management in data stream management systems, in ICDE (2007), pp. 1334–1338
52. L. Golab, M. Tamer Özsu, Issues in data stream management. ACM SIGMOD Rec. 32(2), 5–14 (2003)
53. D. Barbara, The characterization of continuous queries. Int. J. Coop. Inf. Syst. 8(4), 295–323 (1999)
54. D.B. Terry, D. Goldberg, D.A. Nichols, B.M. Oki, Continuous queries over append-only databases, in SIGMOD Conference (1992), pp. 321–330
55. M. Sullivan, Tribeca: a stream database manager for network traffic analysis, in VLDB (1996), p. 594
56. S. Chandrasekaran et al., TelegraphCQ: continuous dataflow processing for an uncertain world, in CIDR (2003)
57. L. Liu, C. Pu, W. Tang, Continual queries for Internet scale event-driven information delivery. IEEE Trans. Knowl. Data Eng. 11(4), 583–590 (1999)
58. J. Chen, D.J. DeWitt, F. Tian, Y. Wang, NiagaraCQ: a scalable continuous query system for Internet databases, in SIGMOD (2000), pp. 379–390
59. H. Jagadish, I. Mumick, A. Silberschatz, View maintenance issues for the chronicle data model, in PODS (1995), pp. 113–124
60. A. Kumar Gupta, D. Suciu, Stream processing of XPath queries with predicates, in SIGMOD Conference (2003), pp. 419–430
61. D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, S. Zdonik, Monitoring streams—a new class of data management applications, in VLDB, Hong Kong, China (2002)
62. Y.-N. Law, H. Wang, C. Zaniolo, Relational languages and data models for continuous queries on sequences and data streams. ACM Trans. Database Syst. 36, 8 (2011)
63. C. Luo, H. Thakkar, H. Wang, C. Zaniolo, A native extension of SQL for mining data streams, in ACM SIGMOD Conference 2005 (2005), pp. 873–875

Hancock: A Language for Analyzing Transactional Data Streams

Corinna Cortes, Kathleen Fisher, Daryl Pregibon, Anne Rogers, and Frederick Smith

Massive transaction streams present a number of opportunities for data mining techniques. Transactions might represent calls on a telephone network, commercial credit card purchases, stock market trades, or HTTP requests to a web server. While historically such data have been collected for billing or security purposes, they are now being used to discover how the transactors, e.g. credit-card numbers or IP addresses, use the associated services. For over six years, we have computed evolving profiles (called signatures) of the transactors in several large data streams. The signature for each transactor captures

C. Cortes, K. Fisher, D. Pregibon, A. Rogers and F. Smith, ACM Transactions on Programming Languages and Systems (TOPLAS), Volume 26 Issue 2, March 2004, Pages 301–338. DOI: 10.1145/973097.973100, © 2004 ACM, Reprinted with permission. C. Cortes · D. Pregibon Google Research, 1440 Broadway, New York, NY 10018, USA C. Cortes e-mail: [email protected] D. Pregibon e-mail: [email protected]

K. Fisher (B) Computer Science Department, Tufts University, Medford, MA 02155, USA e-mail: kfi[email protected]

A. Rogers University of Chicago, 1100 E 58th Street, Chicago, IL 60637, USA e-mail: [email protected]

F. Smith The Mathworks, 3 Apple Hill Drive, Natick, MA 01760, USA e-mail: [email protected]

the salient features of his or her transactions through time. Programs for processing signatures must be highly optimized because of the size of the data stream (several gigabytes per day) and the number of signatures to maintain (hundreds of millions). Originally, we wrote such programs directly in C, but because signature programs often sacrificed readability for performance, they were difficult to verify and maintain. Hancock is a domain-specific language created to express computationally efficient signature programs cleanly. In this chapter, we describe the obstacles to computing signatures from massive streams and explain how Hancock addresses these problems. For expository purposes, we present Hancock using a running example from the telecommunications industry; however, the language itself is general and applies equally well to other data sources.

1 Introduction

A transactional data stream is a sequence of records that log interactions between entities. For example, a stream of stock market transactions consists of buy/sell orders for particular companies from individual investors. Likewise, a stream of credit card transactions contains records of purchases by consumers from merchants. Data mining techniques are needed to exploit such transactional data streams because these streams contain a huge volume of simple records, any one of which is rather uninformative. When the records related to a single entity are aggregated over time, however, the aggregate can yield a detailed picture of evolving behavior, in effect, capturing the "signature" of that entity.

We have analyzed data streams from the telecommunications domain for more than six years. In our initial work, we processed roughly five million (M) international call-detail records per day, generated by approximately 12M accounts. In subsequent work, we have tackled larger and larger data streams, including the complete AT&T long distance data stream, which consists of approximately 300M records from roughly 100M accounts per day. For each data stream, we compute or update signatures based on selected fields in each record in the data stream. A signature for a phone number might contain directly measurable features such as when most telephone calls are placed from that number, to what regions those calls are placed, and when the last call was placed. It might also contain derived information such as the degree to which the calling pattern from the number is "business-like" [12].

Figure 1 shows a fraud-detection application: a program taps a call-detail stream as it goes into a data warehouse and uses the tapped data to update the signatures of the telephone numbers with calls in the stream. If the new calls are inconsistent with the signature of a given telephone number, the system sends an alert to a network security representative. The representative then retrieves the relevant call-detail records from the warehouse to determine if the alert warrants further action. If the new calls are consistent with the existing signature, they are folded into the

Fig. 1 Typical use of a fraud signature

signature to allow a smooth evolution of calling behavior. Since most numbers ex- hibit consistent behavior through time, an overwhelming majority of calls are incor- porated into the appropriate signatures and stored in the warehouse without being examined manually. Programs to compute signatures must be highly optimized because of the size of the data stream and the number of signatures tracked. The size of the signature collection prevents us from keeping it entirely in memory. Consequently, signature programs are very I/O intensive: they must read from and write to the signature data on disk as they process transactions. Our initial C programs for computing telecommunication signatures were ef- ficient, but they often sacrificed readability to obtain this efficiency. Regulatory changes force frequent modifications to these programs. Consequently, program maintenance and verification, both of which require program readability, are im- portant issues. When we started working with the complete AT&T long distance data stream, we realized that we needed software that could function at scale and yet be maintained as changes were required. Hancock is a C-based domain-specific programming language that we designed and implemented in response to this need. By design, the language makes time and space efficient signature programs easy to read and write, independent of the quantity of data involved. Because Hancock manages scaling issues, it allows data analysts to experiment with new signatures quickly. The goals of this chapter are to discuss the computational difficulties in writing efficient signature code for massive data streams, to show how Hancock alleviates these difficulties, and to discuss the motivations behind the Hancock design. Previ- ous papers presented preliminary designs [5, 10], described the current design [11], and discussed implementation issues in depth [16].

2 Running Example

In this section, we describe the Cell Tower signature, which we will use as a running example to describe Hancock. This signature mines information from a wireless call stream containing records used to bill accounts for calls sent and received from mobile telephones. Although these records contain many fields, only a few are relevant for computing the Cell Tower signature:

• Mobile Phone Number (MPN)
• Dialed telephone number
• First and last cell tower used

Our illustrative application characterizes the "diameter" of a mobile phone, i.e., is the phone used exclusively in one or a few neighboring cells, or is it used in a much larger region? Such information is useful for target marketing and for developing new service offerings. To compute this information, we designed the Cell Tower signature. For each MPN, we track the five most frequently (and most recently) used cell towers and another value that captures the frequency with which calls placed from the MPN do not involve the top five cell towers. As one might expect, the top five list is dynamic, so the signature computation includes a probabilistic bumping algorithm that allows a new cell tower to enter the top five list as its frequency of use increases. Earlier work describes how to design signatures [6, 13–15].

Given the Cell Tower signature, any number of measures of "diameter" can be computed, e.g., the area of the convex hull defined by the geographical coordinates of the top five cell towers. Maintaining the list allows us to experiment with alternative measures before committing to any specific measure that might be computed directly from the call-detail stream.

As a prelude to subsequent sections in which we intersperse Hancock code with text, the following Hancock/C code describes the profile struct that we associate with each MPN in our Cell Tower signature:

#define NDIV 5
#define MAX_TOWER_NAME 12

typedef struct {
    char  celltower[NDIV][MAX_TOWER_NAME];
    float freq[NDIV];
    float other;
} profile;

The celltower array stores the names of the five most commonly used towers as fixed-size C strings. The parallel array freq measures the frequency with which the corresponding tower is used. Field other measures how many calls are not reflected in the list of the top five towers.
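The chapter does not spell out the probabilistic bumping rule, so the following C sketch is only one plausible way to fold a single call into the profile struct defined above: an exact match increments the matching slot, and an unmatched tower is charged to other and, with a probability that grows as other accumulates, replaces the least-frequent slot. The threshold formula is an assumption for illustration, not the authors' production algorithm.

#include <stdlib.h>
#include <string.h>

/* Sketch only: one possible per-call update with probabilistic bumping. */
void bump(profile *p, const char *tower) {
    int i, min_i = 0;
    for (i = 0; i < NDIV; i++) {
        if (strncmp(p->celltower[i], tower, MAX_TOWER_NAME) == 0) {
            p->freq[i] += 1.0f;            /* tower already in the top-five list */
            return;
        }
        if (p->freq[i] < p->freq[min_i])
            min_i = i;                     /* remember the least-frequent slot */
    }
    p->other += 1.0f;                      /* call does not involve the top five */
    if ((double)rand() / RAND_MAX < p->other / (p->other + p->freq[min_i] + 1.0f)) {
        strncpy(p->celltower[min_i], tower, MAX_TOWER_NAME - 1);
        p->celltower[min_i][MAX_TOWER_NAME - 1] = '\0';
        p->freq[min_i] = 1.0f;             /* bump: new tower enters the list */
    }
}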

3 The Hancock Language

Hancock makes it easier to read, write, and maintain signature programs by factoring into the language the issues that relate to scale. In this section we discuss some of these issues and describe how Hancock addresses them.

Fig. 2 High-level architecture of signature computations. The processing typically consists of several phases, each of which sorts the data in a different order and updates a different part of the signature

Prior to designing Hancock, we studied existing signature programs to under- stand their structure and the techniques they employ to achieve good performance. Figure 2 illustrates the process flow for a typical signature program from the telecommunications industry. Transaction records are collected for some time pe- riod, the length of which depends on the application (e.g., a day for marketing but just a few minutes for fraud detection). At the end of the time period, the records are processed to update the signatures. Before processing, the old signature data is copied to preserve a back-up for error-recovery purposes. During processing, the records are typically sorted in several ways, e.g., according to the originating and then the dialed phone numbers. After each sort, a pass is made over the data stream. During a pass, the portion of each signature relevant to the given sort is retrieved from disk, updated, and then written back to disk. For example, after sorting by the originating telephone number, only the portion of a signature that tracks out-bound calling would be updated typically; after sorting by the dialed number, the portion that tracks in-bound calling would be changed. Sorting the stream ensures good lo- cality for accesses to the signatures on disk and groups the information relevant to each transactor into a contiguous segment of the stream.

3.1 Logical and Physical Streams

The fields in a transaction record are often encoded and packed to save space. In studying the original signature programs, we noticed that code to decipher the representation of records was interleaved with signature processing code, which made it difficult to respond to changes in the physical representation of records.

In Hancock, we separate the physical representation of the records in a data stream from the logical (expanded) representation on which we perform computations. This separation allows one person to understand the physical representation (the expert on that data source) but many people to use the logical representation (the consumers of that data source). This division facilitates maintenance: if the physical representation changes, only the translation from the physical to the logical representation must be modified, presumably by the expert on that data source. The consumers need not modify their programs.

To declare a new stream type, programmers use the stream type operator. Generally, Hancock requires one stream type per data source. As there are many fewer data sources than there are signature programs, declaring new streams is rare. There are two forms of stream declarations: a specialized form for streams whose records are stored on disk in a fixed-width binary format, and a general form for records stored in other formats. The binary form is more convenient for the programmer, while the general form is more broadly applicable. In the general case, a stream declaration specifies a function that reads data from a file and returns a logical record. In the binary case, a stream declaration specifies both the physical and the logical representations for the records in the stream. It also specifies a function to convert from the encoded physical representation to the expanded logical representation. The following binary stream declaration introduces the stream wireless_s:

stream wireless_s {
    getvalidWCR : wcrPhy_t => wcrLog_t;
};

For this stream of wireless call records, the C type wcrPhy_t serves as the physical representation and wcrLog_t serves as the logical. The identifier getvalidWCR names the function that specifies how to convert from the physical to the logical representation. This function, whose prototype is:

char getvalidWCR(wcrPhy_t *pc, wcrLog_t *c);

checks that the record *pc is valid and if so, unpacks *pc into *c and returns a constant to indicate a successful conversion. Otherwise, getvalidWCR simply returns a different constant to tell the system to skip to the next record.

Hancock provides initializing declarations to connect the on-disk representation of persistent data with Hancock variables. For example, the following code

wireless_s s = "02-Dec-2004.calls";

declares that the variable s has type wireless_s and can be found on disk at the location named by the path expression "02-Dec-2004.calls". In the remainder of the chapter, we use the term "record" to mean the logical representation of the elements in a stream, since stream definitions are the only place where the physical representation is needed. Figure 3 shows a sample stream of wireless call records.

Fig. 3 Portion of a wireless call stream running from left to right. Each box represents a call from an originating phone number to a dialed number. The first call, for example, was placed from (973)555-1212 to (201)555-2323. Each box is labeled with a call number (C1, for example). White boxes denote calls that originated from mobile phone numbers, while dark gray/blue boxes represent calls that originated from land lines
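To make the physical-to-logical conversion concrete, the following C sketch shows what a function in the role of getvalidWCR might look like. The field names and the packed layout are hypothetical (hence the _demo suffixes), since the chapter does not show the actual wcrPhy_t and wcrLog_t definitions.

#include <string.h>

/* Hypothetical record layouts; illustrative only. */
typedef struct {
    unsigned char packed_origin[5];   /* BCD-packed 10-digit number */
    unsigned char packed_dialed[5];
    char first_cell[12];
    char last_cell[12];
    unsigned char flags;              /* bit 0: record valid, bit 1: call completed */
} wcrPhy_demo_t;

typedef struct {
    long long origin;
    long long dialed;
    char cellid[12];
    int complete;
} wcrLog_demo_t;

static long long unpack_bcd(const unsigned char *d, int n) {
    long long v = 0;
    for (int i = 0; i < n; i++)
        v = v * 100 + (d[i] >> 4) * 10 + (d[i] & 0x0F);
    return v;
}

char getvalidWCR_demo(wcrPhy_demo_t *pc, wcrLog_demo_t *c) {
    if ((pc->flags & 0x01) == 0)
        return 0;                                 /* invalid record: skip it */
    c->origin   = unpack_bcd(pc->packed_origin, 5);
    c->dialed   = unpack_bcd(pc->packed_dialed, 5);
    memcpy(c->cellid, pc->first_cell, sizeof c->cellid);
    c->complete = (pc->flags & 0x02) != 0;
    return 1;                                     /* logical record filled in */
}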

3.2 Logical, Approximate, and Physical Signatures

When the number of accounts is in the hundreds of millions, one often prefers to maintain only very small signatures for each account to reduce the I/O cost nec- essary to update the signatures. To save space, the values of a signature are often quantized or otherwise approximated before they are stored. For example, a floating point number representing the probability that a phone number is behaving like a business might be quantized into sixteen levels; similarly, the number of daily out- bound minutes may be categorized according to one of eight logarithmically spaced usage bins. These approximate signatures can be compressed conveniently into a few bytes before writing them to disk. Thus, each signature program conceptually uses three different representations of each signature: the logical representation used for computation, the approximate representation that specifies what information to preserve, and the compressed physical form that is written to disk. The original C signature programs contained routines to approximate and com- press each signature before writing it to disk and routines to uncompress and expand it before computing with it. However, the original C code occasionally performed computations not only on the logical representations, but also on the approximate and on the compressed representations. While the code was very efficient, it was highly unreadable, making it difficult to verify and maintain. Hancock’s view construct provides a mechanism to specify two views of a sin- gle piece of data and the conversion between them. Signature programs use one view to describe the logical representation of each signature and another to describe the approximate representation. Hancock’s map abstraction provides a mechanism to specify application-specific compression functions (see Sect. 4). As an example of views, consider the following declaration that specifies approx- imate (bin) and logical (minute) representations of a unit of time:

view time(bin, minute) {
    char <=> int;
    bin(m)    { return min_to_bin(m); }
    minute(b) { return bin_to_min(b); }
}

The line char <=> int declares that the bin view is represented as a char and the minute view is represented as an int. The bin function specifies how to convert from the logical to the approximate representation by computing the bin associated with m minutes. Similarly, the minute function converts from the approximate to the logical representation by assigning a default number of minutes to the bin b. To translate between these two views, the programmer uses the Hancock view operator ($):

bin b = 3;
minute m;

m = b$minute;   // Convert bin b to minutes m
...
b = m$bin;      // Convert minutes m to the corresponding bin number

Views allow Hancock programmers to document the representation they are using in a given context. Views also ensure that the definition of how to convert between their representations appears only once in the program. Both of these aspects of views make Hancock programs easier to read and maintain than the corresponding C programs.
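The conversion helpers referenced by the view are ordinary C functions. As an illustration only (the chapter does not define them), the pair below maps minute counts into eight logarithmically spaced bins, in the spirit of the usage bins mentioned earlier, and maps a bin back to a representative minute value.

/* Illustrative only: the actual binning scheme is not given in the chapter. */
char min_to_bin(int m) {
    char b = 0;
    int limit = 10;                 /* bin 0 covers 0-9 minutes; each bin doubles */
    while (b < 7 && m >= limit) {
        limit *= 2;
        b++;
    }
    return b;                       /* eight bins: 0 through 7 */
}

int bin_to_min(char b) {
    int lo = 0, hi = 10;            /* return a representative value for the bin */
    for (char i = 0; i < b; i++) {
        lo = hi;
        hi *= 2;
    }
    return (lo + hi) / 2;
}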

3.3 Signature Collections

The original signature programs used a data structure called a map to associate values with keys. Because performance requirements were tight, maps supported only two operations: associating a value with a key and retrieving the value asso- ciated with a key. To save disk space, maps were stored in a compressed format, customized for each application. Hancock retained the notion of a map with a design that balances the need for performance on the one hand with the desire to separate the abstraction from the im- plementation on the other. We struck this balance by allowing programmers to tune map representations by specifying implementation parameters in map definitions. All subsequent uses of the map definition use only the abstract interface for maps: they do not depend upon the implementation parameters. To make this discussion more concrete, consider the map declaration for the Cell Tower application, which associates a profile with each mobile phone number.

map cellTower_m {
    key 0..9999999999LL;
    split (10000, 100);
    value profile;
    default {{'\0', '\0', '\0', '\0', '\0'},
             {0.0, 0.0, 0.0, 0.0, 0.0},
             0.0};
};

The key clause indicates that the cellTower_m map will be indexed by values (of type long long) that range from 0 to 9999999999. The split clause allows programmers to tune the implementation of maps by specifying block (10000) and stripe sizes (100). These parameters, which deter- mine the decomposition of the key space in the implementation, are described in more detail in Sect. 4. The value clause of a map declaration specifies the type of data to be asso- ciated with each key. The value type may be any valid Hancock type. In the Cell Tower application, this type is a standard C struct profile, but in many other applications the value type is the approximate representation from a Hancock view type. (In this signature application, the logical and approximate representations are the same.) Finally, the default clause specifies a value to be returned if the programmer requests data for a key that does not have a value stored in the map. Such requests are relatively frequent because large transaction streams often contain data for fresh keys, for example, whenever a new telephone number, credit card number, or IP address is issued. Isolating default construction in map declarations allows code that queries a map for the value associated with a key to assume that it always receives a meaningful value, simplifying transaction processing code. For maps with type cellTower_m, the default profile contains empty strings for the cell towers and zeros for the frequencies. As with streams, Hancock map variables can be connected to the associated on- disk representations using initializing declarations. In addition, such declarations can be annotated with qualifiers const, exists, and new, which we designed to protect valuable data. The const qualifier guards against writing to structures that should not change, exists ensures that the indicated location already has appropriate persistent data, while new guarantees that the indicated location is fresh. Hancock supports a limited set of map operations that includes retrieving or up- dating the value associated with a key, iterating over a range of keys with stored values, and copying maps. Hancock maps do not support transactions, locking, sec- ondary indices, or declarative querying to avoid the associated overhead. Hancock programs access values in maps using an indexing operator <: ... :>. The code:

cellTower_m ct = "02-Dec-2004.ct";
pn_t mpn = 9735551212LL;

c = ct<:mpn:>;
...
ct<:mpn:> = c;

uses mobile phone number mpn to first read from and then write to map ct. A common idiom in Hancock

m<:key:>$logview

uses the indexing operator to get an approximate value out of a map (m) and the view operator to convert that value into the logical representation (logview). In addition to the operations that manipulate individual items, Hancock also provides an operation that converts a map into a stream with one entry for each active key1 from within a given range. The expression:

ct[startKeyExpr..stopKeyExpr]

generates a stream of active keys from map ct that fall between the values of the expressions startKeyExpr and stopKeyExpr inclusive. Such map-to-stream expressions make it easy to write programs that update every value in a map or that evaluate queries that characterize the data. For example, the Cell Tower application uses map iteration to age each cell tower frequency each day, a process that allows new information to replace old information gradually. An auxiliary program for the Cell Tower application might generate a histogram that plots the number of MPNs with each possible number of recently-used cell towers by iterating over the map and assigning each active MPN to a histogram bucket based on its profile.

Finally, Hancock provides a lazy map copy operator, written using the infix notation :=:, to support coarse-grain roll-back for error-recovery. For example, the statement new_ct:=:ct initializes the map new_ct with the data from map ct.
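The per-profile work behind such an aging pass can be a small C helper applied to every value reached through the map-to-stream iteration. The decay factor below is hypothetical, chosen only to illustrate how old information is gradually replaced.

#define DECAY 0.95f   /* hypothetical daily decay factor */

/* Shrink every tracked frequency so that towers that stop appearing in the
   stream gradually lose weight relative to newly observed towers. */
void age_profile(profile *p) {
    for (int i = 0; i < NDIV; i++)
        p->freq[i] *= DECAY;
    p->other *= DECAY;
}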

3.4 Other Persistent Data

While building the first version of Hancock and porting the original production sig- natures, we observed the need for additional persistence mechanisms. In response, we added directories to group related information persistently and pickles to support custom persistent structures that could be split across memory and disk. We briefly describe the rationale behind the design of these mechanisms; see Cortes et al. [11] for more details. We designed Hancock’s directory mechanism based on two related observa- tions. First, map file names were often used to encode auxiliary information, such as the date of processing and the source of the data. Second, many applications used a collection of maps and other structures that together constituted the persistent data for the application. Grouping related information into a single unit allows program- mers to dispense with arcane name conventions and be more confident that they will not improperly mix data from different time windows or applications. The need for custom, partially memory-resident persistent data structures became apparent in studying the auxiliary data used in some signature applications. Some of this data did not fit comfortably into memory, and yet it was not a good candi- date for the map abstraction. In response to this problem, we designed Hancock’s
pickle mechanism, which allows programmers to design persistent structures with on-disk and in-memory representations tailored to the performance of individual signature programs. This mechanism is designed to integrate user-defined persistent data structures seamlessly with the other elements of Hancock's persistent data system, in particular, with directories and initializing declarations.

1 An active key is one with an associated value stored in the map.

3.5 Events

Much of the work in computing a signature is done in response to “events” in the input stream. For example, when a program sees a new mobile phone number in a wireless_s stream, it might re-initialize per-number counters. The original signature programs contained a hierarchy of events including seeing a new area code (npa_begin), seeing a new exchange (nxx_begin),2 seeing a new phone number (line_begin), seeing an individual call record (call), seeing the last of a phone number (line_end), etc. Similar event hierarchies may be defined for streams of credit card charges, IP packets, etc. When a call-detail signature program detects an npa_begin event in a stream, it may retrieve the time zone for the triggering area code. In response to an nxx_begin event, it may retrieve all the old signatures for the newly seen exchange. For a line_begin event, it may initialize counters that it later increments in response to call events. The program may store the final values for these counters when a line_end event occurs. Analogous actions may be taken when processing other kinds of event hierarchies. In Hancock, we divide this processing into two pieces: event detection and event response. Event detection includes defining the events of interest in a given stream and specifying how to identify them. Event response indicates what to do when an event is detected. The Hancock compiler generates the control flow that se- quences the detection and response code. This control-flow code involves deeply nested loops that have an inverted control structure, which means that the code to respond to an ending event precedes the code for the corresponding beginning event. Hancock programs, which hide this structure, are much easier to read and maintain than the corresponding C programs, which mix the inverted control flow with event response code. In the remainder of this section, we discuss how to describe and de- tect events in a stream. In the next section, we discuss how to respond to detected events. To define events in a general fashion, we introduced a new kind of type into Han- cock: a multi-union. A multi-union names the set of labels it may contain and asso- ciates a type with each such label. Although we designed multi-unions to describe events, they are in fact a general construct, suitable for many purposes; hence we named their constituents “labels” instead of “events.” When we use multi-unions to describe events, however, we often refer to their labels as “events.” As an example, consider the declaration:

2 An exchange is the first six digits of a ten-digit telephone number.

Fig. 4 Stream with window of type *wcrLog_t[3:1]

munion line_e {:
    areacode_t npa_begin,
    exchange_t nxx_begin,
    pn_t       line_begin,
    wcrLog_t   call,
    pn_t       line_end,
    exchange_t nxx_end,
    areacode_t npa_end
:};

This code creates a multi-union type line_e to describe typical events for call- detail streams. A value with this type contains any subset of the declared labels, including the empty set, which we write {: :}. Each label in the set carries a value of the indicated type. If l is the current phone number and c the current call record in a stream, then the expression

{: line_begin = l, call = c :}; creates a value with type line_e. This value would describe the events that occur when the first (but not the last) call record for telephone number l appears in the stream. If e1 and e2 are multi-union values with the same type, then expression e1:+:e2 produces a new value that contains the union of the labels of e1 and e2. After describing the events of interest using a multi-union declaration, the pro- grammer must specify how to detect such events by writing an event-detection func- tion. Such a function looks at a small portion of a stream and returns a multi-union to describe the events detected in that window. To describe a small portion of a stream, Hancock provides a window type, illus- trated in Fig. 4. The size of the window determines how many records in the stream can be viewed at once. A window is like an array, but has the added notion of a “current” record. In specifying a window, the programmer indicates the placement of the current record in the window. For example, the declaration:

wcrLog_t *w[3:1]

specifies that w is a window of size three onto a stream with records of type wcrLog_t. A pointer to the current record appears in the middle slot of the window, i.e., in w[1]. Slots with lower indices (w[0]) store pointers to records earlier in the stream; slots with higher indices (w[2]) look ahead to records appearing later in the stream. If the window overlaps either the beginning or the end of the stream (or both), the slots with no corresponding stream record are set to NULL.

An event detection function takes a window onto a stream and returns a multi-union describing the events detected for the current record in that window. The Cell Tower signature uses the originDetect function to process outgoing calls:

line_e originDetect(wcrLog_t *w[3:1])
{
    line_e b, e;

    b = beginDetect(w[0], w[1]);
    e = endDetect(w[1], w[2]);
    return b :+: {: call = *w[1] :} :+: e;
}

This function calls the auxiliary functions beginDetect and endDetect. The first determines whether the current record represents a new MPN by comparing the origin from the previous record (w[0]) to the origin of the current record (w[1]). The second determines whether the current record represents the last call for an MPN by comparing the origin for the current record to the origin for the next record (w[2]).
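The chapter shows only the caller, so the following is a sketch of what beginDetect might look like in Hancock/C. It assumes the logical record exposes the originating number as a numeric field named origin (the field used in the sortedby clause of the next section); endDetect would mirror it with the roles of the two records swapped.

line_e beginDetect(wcrLog_t *prev, wcrLog_t *cur)
{
    line_e b = {: :};                    /* start with no events */

    /* A NULL previous slot means the window overlaps the start of the stream. */
    if (prev == NULL || prev->origin != cur->origin)
        b = b :+: {: line_begin = cur->origin :};

    /* nxx_begin and npa_begin would be detected analogously by comparing the
       exchange and area-code prefixes of the two originating numbers. */
    return b;
}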

3.6 Consuming a Stream

As in the original signature programs, Hancock's computation model is built around the notion of iterating over a sorted stream of transaction records. Sorting the records groups all the data relevant to one key into a contiguous segment of the stream and ensures good locality for map references that follow the sorting order. Consequently, each signature program typically makes multiple passes over its data stream. During each such pass, the signature program sorts the stream in a different order and updates a different portion of the signature associated with each key. We call each pass a phase.

We implement phases using Hancock's iterate statement, which has the following form:

Fig. 5 Filtered, sorted stream labelled with events

iterate
    ( over stream variable
      filteredby filter predicate
      sortedby sorting order
      withevents event detection function ) {
        event clauses
};

The header specifies an initial stream, a set of transformations to produce a new stream, and a function to detect events in the transformed stream. The body contains a set of event clauses that specify how to respond to the detected events. We explain each of these pieces in turn.

The over clause names the input stream. The filteredby clause specifies a predicate to drop unneeded records. For example, a wireless_s stream may include land-to-cell calls, which are not used by the Cell Tower signature. Immediately removing such records improves the efficiency of sorting and simplifies event response code. The sortedby clause describes a sorting order for the stream by listing the record fields that constitute the desired sorting key. For example, the clause sortedby origin, connecttime produces a stream sorted primarily by the originating telephone number and secondarily by the time at which the call was made. The withevents clause specifies an event detection function that computes the events triggered by the "current" record. Figure 5 shows a portion of a filtered, sorted, wireless call stream labeled with events.

The event clauses specify code to execute when an event detection function triggers an event. Events that occur simultaneously (i.e., in the same multi-union value) are processed in the order they appear in the event clauses. Given this ordering information, Hancock generates the control-flow to sequence the response code. The name of each event clause corresponds to a label in the multi-union returned by the event detection function. Each event clause takes as a parameter the value carried by the corresponding label. For example, the mobile phone number that triggers a line_begin event is passed to the line_begin event clause. The body of each event clause is a block of Hancock/C code.

As an example, Fig. 6 shows the outgoing phase of the Cell Tower signature, which processes the calls made by wireless telephone numbers. The function out, which encapsulates this phase, contains a single iterate statement that processes a wireless call stream.

 1  void out( wireless_s calls, cellTower_m ct )
 2  {
 3      profile p;
 4
 5      iterate
 6          ( over calls
 7            filteredby completeCellCall
 8            sortedby origin
 9            withevents originDetect ) {
10
11          event nxx_begin ( exchange_t npanxx ) {
12              age( ct, npanxx );
13          }
14
15          event line_begin ( pn_t mpn ) {
16              initProfile(&p);
17          }
18
19          event call ( wcrLog_t c ) {
20              aggregate(&p, c.cellid);
21          }
22
23          event line_end ( pn_t mpn ) {
24              profile old;
25              old = ct<:mpn:>;
26              ct<:mpn:> = update(&old, &p);
27          }
28      };
29  }

Fig. 6 Outgoing phase for the Cell Tower signature

It uses the predicate function completeCellCall to remove incomplete and non-cellular calls from the stream. It sorts the filtered stream by the originating phone number. It uses the function originDetect to find events in the sorted stream. The event clauses in lines 11 to 27 of Fig. 6 specify how to respond to the detected events, aging all the signatures in an exchange with the function age, initializing a temporary profile to track the calls for a given phone number with the function initProfile, integrating each call into the temporary profile with the function aggregate, and finally updating the map with the day's profile. Note that this phase does not use all the events defined in the line_e type.
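The filter predicate itself is not shown in the chapter. The sketch below is one possible shape for it, written as a plain C predicate over the logical record, with hypothetical field names (complete, origin_is_mobile) standing in for whatever the real wcrLog_t provides.

/* Illustrative only: field names are hypothetical. Returns nonzero for records
   the phase should keep, zero for records to drop before sorting. */
int completeCellCall(wcrLog_t *c)
{
    return c->complete            /* drop incomplete calls */
        && c->origin_is_mobile;   /* drop land-to-cell (non-cellular) calls */
}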

3.7 Putting It Together

In the previous section, we explained that computing a signature may require multiple passes over the data. Hancock provides the sig_main construct to express the data flow between such passes: the arcs between the phase boxes in Fig. 2 depict this construct. In addition, the sig_main function indicates the entry-point into Hancock programs, provides a simple way for programmers to specify command-line arguments, and augments initializing declarations as a way to connect persistent representations with program variables. The Hancock compiler generates code to parse the actual command-line parameters, relieving the programmer of this tedious task. As an example, consider the sig_main function for the Cell Tower signature:

void sig_main( const wireless_s calls,
               exists const cellTower_m oldCT,
               new cellTower_m newCT )
{
    newCT :=: oldCT;
    out(calls, newCT);
}

The calls parameter is a stream that contains the raw wireless call data. The syn- tax () after the variable name specifies that this parameter will be supplied as a command-line option using the -c flag. The colon indicates that the flag takes an argument, in this case the path to the on-disk representation of the wireless stream. The absence of a colon indicates that the parameter is a boolean flag. The oldCT parameter is an existing Cell Tower map, the location of which is specified using the -m flag. The newCT parameter names the Cell Tower map to hold the result of this program. The -M flag specifies the location for this map. The qualifiers on sig_ main parameters have the same meaning as qualifiers in initializing declarations. In general, the body of sig_main is a sequence of Hancock and C statements. In the Cell Tower application, sig_main copies the data from oldCT into newCT and then invokes the outgoing phase with the raw call stream and the new Cell Tower map as arguments. If the Cell Tower application required a second phase, e.g., an incoming phase, we would call it after the call to out.
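Putting this together, a compiled Cell Tower program would be invoked with the three flags described above. The executable name below is hypothetical, and the map file names simply follow the date-based paths used earlier in the chapter:

celltower -c 02-Dec-2004.calls -m 01-Dec-2004.ct -M 02-Dec-2004.ct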

4 Implementation

In this section, we give a brief overview of our implementation of Hancock. The Hancock compiler translates Hancock code into C and then invokes a platform-dependent compiler to convert the resulting code into an object file. That object file is then linked to the Hancock runtime system to produce an executable.

To implement the translation, we modified CKIT [8], a C-to-C translator written in ML, to parse Hancock and translate the resulting extended parse tree into abstract syntax for C. During the translation, we typecheck the various Hancock forms, which allows us to report errors in terms of the Hancock source code, rather than in terms of the resulting C code. The translator generates code to implement command-line processing, directories, and the iterate statement, which significantly reduces the amount of code the programmer needs to write. After translation, we use the CKIT pretty-printer to produce C code.

Fig. 7 The on-disk data representation of a Hancock map includes a self-identification field, a block index, an unordered set of blocks, and a free-list for unused space. The self-identifica- tion information is used to match the file against the declared type of a map to help prevent users from inadvertently using a map with the wrong type for an application. Each block contains a stripe index followed by the stripes, in order

The runtime system, written in C, mediates access to persistent data. It converts between the representation of data on-disk and in memory as necessary. Managing stream, directory, and pickle data is straightforward: the runtime system provides the necessary I/O and calls the appropriate user or compiler-generated functions. Managing map data is more complex because it requires a data structure that sat- isfies the temporal requirements of our applications without introducing significant space overhead. Hancock’s map implementation uses the multi-level table3 shown in Fig. 7 augmented with compression to reduce disk space requirements. To find the location of a key in the table, the runtime system splits keys into three pieces: a block number, a stripe number, and an entry number. It uses these pieces to index into different levels of the table. The top level table maps a block number to its lo- cation in memory or on disk. Each block contains a stripe index that maps a stripe number to its starting location within the block. Once the location of the relevant stripe is identified, the runtime system reads in the stripe, which contains a bitvec- tor followed by compressed data,4 and decompresses it. Finally, the entry number determines the location of the key within the decompressed stripe. Recall that programmers specify the block size and the stripe size for a map using the split clause in type declaration. The main tradeoff to consider when deciding the block size is the size of the index (12 bytes per entry on disk, 20 in memory). Choosing a stripe size, on the other hand, requires weighing various considerations, including the cost of decompressing values, the size of a decompressed stripe in relation to the processor’s primary and secondary cache sizes, and the expected mix
of access patterns for the data. See Fisher et al. [16] for a more detailed discussion of these tradeoffs.

3 Other applications, such as paging or IP address lookup for routing [17, 19, 21], use variants of a multi-level table to support large key spaces.
4 Hancock supplies default compression functions, but programmers can also specify application-specific compression routines as part of a map type declaration.
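As a concrete reading of the split (10000, 100) parameters from the Cell Tower map, the C sketch below shows one plausible way a key could be decomposed into block, stripe, and entry numbers. The chapter does not give the exact arithmetic, so the divisors here are only an interpretation of the declared block and stripe sizes.

/* One plausible decomposition for split (10000, 100); illustrative only. */
typedef struct {
    long long block;    /* which entry of the top-level block index to consult */
    long long stripe;   /* which stripe index entry inside that block          */
    long long entry;    /* position inside the decompressed stripe             */
} map_slot;

map_slot locate(long long key)
{
    map_slot s;
    s.block  = key / 10000;
    s.stripe = (key % 10000) / 100;
    s.entry  = key % 100;
    return s;
}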

5 Experiences

In this section, we describe our experiences with Hancock programs in practice. We rewrote the original domestic long-distance signature programs in Hancock. These programs, which have been running in production every day for more than six years, are as time efficient as the original C programs, while at the same time being more space efficient, shorter, and better organized. The readability of the Hancock pro- grams makes them easy to check for compliance with (frequently changing) federal legislation. The Hancock programs are also easy to update in response to changes in the transaction data. For example, updating the programs to make them aware that area code 866 and 877 had become toll-free required changing only two lines of one header file and recompiling the programs. In addition to supporting the original applications better, Hancock’s domain- specific abstractions and improved performance enabled data analysts to craft new kinds of signatures. When analysts used C directly, they were able to store only two- to four-byte signatures. Because of this limited space, they had to use very rough approximations. Although this rough data was very useful for certain kinds of mar- keting applications, it was not suitable for many other kinds of applications, notably fraud detection. Hancock’s abstractions and their efficient performance enabled ana- lysts to build signatures containing over 100 bytes. With that level of detail, analysts were able to store sufficiently precise information to enable applications previously thought to be infeasible. Most existing Hancock programs manipulate long-distance call-detail data be- cause the data analysts we work with focus on that domain. However, nothing in Hancock is specific to this domain; Hancock gives programmers full control over the description of their data sources. Programmers have used Hancock to analyze data from various sources: wireless call records, calling-card call records, telephone numbers, TCPDUMP data, IP addresses, and even referee reports.

6 Related Work

Broadly speaking, there are two different classes of work related to Hancock: work on persistent data structures and work on stream processing.

6.1 Persistent Data

Many languages provide support for persistent data. This support typically falls into one of three categories: pickle-based approaches [1, 23, 24, 27, 29], interfaces to databases, and orthogonal persistence systems [2, 20, 22]. Pickles provide a way to convert a value into a stream of bytes and vice versa, supporting persistence, but not in a way that allows data structures to reside partially in-memory and partially on-disk, a requirement for very large data. A second common approach to supporting persistence is to provide an interface to a standard relational or object-oriented database. We rejected this approach for Hancock because we did not believe that a traditional database could handle the high percentage of updates generated during daily stream processing [3, 4, 7, 16, 18, 26]. Orthogonal persistence systems automatically determine if a given piece of data must be persistent by starting from a collection of persistent roots and making all reachable data persistent. This approach is akin to automatic memory management, which can simplify programming, but at some performance cost. Because of the tight space and time requirements of our domain, we adopted a more explicit technique for persistence in Hancock.

6.2 Stream Processing

Many systems support stream operations; here we describe systems designed to handle high-volume stream data. The oldest of these systems, Tribeca [26], pre-dated Hancock. More recently, high-volume stream processing has become an area of active interest in the database community [25, 28]. Aurora [7], Telegraph [18], and STREAMS [3] are all examples of systems under development for computing with high-volume streams. In contrast to Hancock, which has been deployed in production for several years, these newer systems are in various stages of prototyping. Further experience is necessary to determine how well these systems will scale. We briefly describe the focus of each of these projects.

Tribeca [26] is a system for monitoring network traffic. It provides a query language with operations for separating and recombining streams, operations for computing moving aggregates over windows, and a restricted form of join. The separation and recombination operators might be used, for example, to convert a packet stream into a session stream. Tribeca provides more support for describing and manipulating streams than Hancock does, but it provides less support for computing with the individual elements in a stream or for integrating with persistent data.

Aurora [7] and Hancock are complementary. Aurora is a system designed to monitor stream data. It supports queries over multiple streams of data and allows queries to join and leave the system over time. One can view the queries combined with the streams as a graph. At the end of any path in the graph is an application that consumes the resulting data; that application could be a Hancock program.

Telegraph [18] is an adaptive dataflow system designed to compute continuous queries over streams of data. PSoup [9], a system built on top of Telegraph, expands upon this model to allow the query mix to change over time. This system can be used to compute aggregates from stream data, such as how many music downloads occurred in a given subnet within the last hour. Like Aurora, Telegraph is not designed to provide direct support for integrating stream data into persistent structures, the essential operation in computing signatures.

The members of the STREAMS project [3] are developing a system for executing continuous queries over multiple streams. The focus of this project is to develop fundamental models of stream data systems and efficient methods for managing resources in such systems. At present, their model explicitly excludes queries that can modify persistent data during computation. This restriction, which may be removed over time, eliminates signatures as a possible application for STREAMS.

7 Language Versus Library

One question we are often asked is why we chose to design a language rather than a library. There are two technical reasons for choosing the language option. First, expressing Hancock's event model and the information sharing it provides proved awkward in a call-back5 framework, the usual technique for implementing such abstractions. Second, by designing a language we could use the language's type system to provide more precise typechecking than is provided by C. For example, the natural way to implement maps using a library interface would require the programmer to cast between the actual type of a value and void *, thereby losing the benefits of typechecking. The scale of the data makes the complexity of finding and fixing bugs in signature programs substantial. Therefore, static error detection is essential.

The more compelling reason for us to choose a language over a library is sociological. The experience of writing a Hancock program is fundamentally different than writing the equivalent program in C. This difference arises in part because Hancock removes issues of scale, leaving programmers free to concentrate on the design of the individual profiles, and in part because Hancock provides a vocabulary tailored to the domain of signature design.
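To make the void * argument concrete, the following small C sketch shows what a library-based map interface might look like; the map_open/map_get/map_set interface and the in-memory stub behind it are hypothetical illustrations, not Hancock's API or any real library. Because values cross the interface as untyped pointers, the mistyped write near the end compiles cleanly and the error surfaces only at run time, in the stored data.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical generic map library: values are opaque byte blocks of a
 * fixed size, so all static type information is lost at the interface.
 * (A tiny in-memory stub stands in for the persistent implementation.) */
typedef struct { size_t value_size; unsigned char slots[64][64]; int used[64]; } map_t;

static map_t *map_open(size_t value_size) {
    map_t *m = calloc(1, sizeof *m);
    m->value_size = value_size;
    return m;
}
static int map_get(map_t *m, long key, void *out) {
    int i = (int)(key % 64);
    if (!m->used[i]) return -1;
    memcpy(out, m->slots[i], m->value_size);
    return 0;
}
static void map_set(map_t *m, long key, const void *val) {
    int i = (int)(key % 64);
    memcpy(m->slots[i], val, m->value_size);
    m->used[i] = 1;
}

struct call_profile { int calls_today; double avg_duration; };

int main(void) {
    map_t *m = map_open(sizeof(struct call_profile));
    long phone = 9085551234L;
    struct call_profile p;

    if (map_get(m, phone, &p) != 0)          /* no profile stored yet */
        memset(&p, 0, sizeof p);
    p.calls_today += 1;
    p.avg_duration += (42.0 - p.avg_duration) / p.calls_today;
    map_set(m, phone, &p);                   /* intended use: store a profile */

    double durations[4] = { 42.0, 7.0, 9.0, 3.0 };
    map_set(m, phone, durations);            /* wrong type, but the void *
                                                interface lets it compile   */

    if (map_get(m, phone, &p) == 0)
        printf("calls=%d avg=%.1f  (garbage after the mistyped write)\n",
               p.calls_today, p.avg_duration);
    free(m);
    return 0;
}

In Hancock, by contrast, the map's value type is declared to the compiler, so the mistyped write above would be rejected statically.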

8 Conclusions

Working with transactional data streams is like drinking from the proverbial fire hose: the volume is simply overwhelming. But this challenge provides an opportunity for data mining research to enter a new area. We believe that Hancock is a valuable tool for exploiting this opportunity.

Hancock has allowed us to improve our application base by replacing hard-to-maintain, hand-written C code with disciplined Hancock code. Because Hancock provides high-level, domain-specific abstractions, Hancock programs are easier to read and maintain than the earlier C programs. By careful design, these abstractions have efficient implementations, which allow Hancock programs to preserve the execution speed and data efficiency of the earlier C programs. Hancock gave domain experts the confidence to attack more challenging problems because it allowed them to concentrate on what to compute without worrying about how to manage the volume of data. Hancock is publicly available for non-commercial use from:

www.research.att.com/projects/hancock.

5 A call-back is a call from a function in a library "back" to a function in user code.

References

1. A.W. Appel, A runtime system. Lisp and Symbolic Computation 4(3), 343–380 (1990)
2. M. Atkinson, L. Daynes, M. Jordan, T. Printezis, S. Spence, An orthogonally persistent Java. ACM SIGMOD Rec. 25(4) (1996)
3. B. Babcock, S. Babu, M. Datar, R. Motwani, J. Widom, Models and issues in data stream systems, in Proceedings of the 2002 ACM Symposium on Principles of Database Systems (PODS 2002) (2002). See the Stream Project homepage, www-db.stanford.edu/stream, for a complete list of papers
4. D. Belanger, K. Church, A. Hume, Virtual data warehousing, data publishing, and call detail, in Proceedings of Databases in Telecommunications 1999, International Workshop. Also appears in Springer Verlag LNCS, vol. 1819 (1999), pp. 106–117
5. D. Bonachea, K. Fisher, A. Rogers, F. Smith, Hancock: a language for processing very large-scale data, in USENIX 2nd Conference on Domain-Specific Languages, USENIX Association (1999), pp. 163–176
6. P. Burge, J. Shawe-Taylor, Frameworks for fraud detection in mobile telecommunications networks, in Proceedings of the Fourth Annual Mobile and Personal Communications Seminar, University of Limerick (1996)
7. D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, S. Zdonik, Monitoring streams: a new class of data management applications, in Proceedings of the 28th VLDB Conference (2002). See the Aurora Project homepage, www.cs.brown.edu/research/aurora/main.html, for a complete list of papers
8. S. Chandra, N. Heintze, D. MacQueen, D. Oliva, M. Siff, Pre-release of C-frontend library for SML/NJ (1999). See cm.bell-labs.com/cm/cs/what/smlnj
9. S. Chandrasekaran, M.J. Franklin, Streaming queries over streaming data, in Proceedings of the 28th VLDB Conference (2002)
10. C. Cortes, K. Fisher, D. Pregibon, A. Rogers, F. Smith, Hancock: a language for extracting signatures from data streams, in Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining (2000), pp. 9–17
11. C. Cortes, K. Fisher, D. Pregibon, A. Rogers, F. Smith, Hancock: a language for analyzing transactional data streams. ACM Transactions on Programming Languages and Systems 26(2), 301–338 (2004)
12. C. Cortes, D. Pregibon, Giga mining, in Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (1998)
13. C. Cortes, D. Pregibon, Information mining platform: an infrastructure for KDD rapid deployment, in Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining (1999)
14. D.E. Denning, An intrusion-detection model. IEEE Trans. Softw. Eng. 13(2) (1987)
15. T. Fawcett, F. Provost, Adaptive fraud detection. Data Mining and Knowledge Discovery 1, 291–316 (1997)
16. K. Fisher, C. Goodall, K. Hogstedt, A. Rogers, An application-specific database, in Proceedings of 8th Biennial Workshop on Data Bases and Programming Languages (DBPL'01). LNCS, vol. 2397 (Springer, Berlin, 2002), pp. 213–227

17. P. Gupta, S. Lin, N. McKeown, Routing lookups in hardware at memory access speeds, in Proc. 17th Ann. Joint Conf. of the IEEE Computer and Communications Societies, vol. 3 (1998), pp. 1240–1247
18. J. Hellerstein, M. Franklin, S. Chandrasekaran, A. Deshpande, K. Hildrum, S. Madden, V. Raman, M. Shah, Adaptive query processing: technology in evolution, in IEEE Data Eng. Bulletin (2000), pp. 7–18. See the Telegraph Project homepage telegraph.cs.berkley.edu for a complete list of papers
19. N.-F. Huang, S.-M. Zhao, J.-Y. Pan, C.-A. Su, A fast IP routing lookup scheme for gigabit switching routers, in Proc. 18th Ann. Joint Conf. of the IEEE Computer and Communications Societies, vol. 3 (1999), pp. 1429–1436
20. M. Knasmüller, Adding persistence to the Oberon system, in Proceedings of the Joint Modular Languages Conference 97 (1997)
21. B. Lampson, V. Srinivasan, G. Varghese, IP lookups using multiway and multicolumn search. IEEE/ACM Transactions on Networking 7(3), 324–334 (1999)
22. B. Liskov, M. Castro, L. Shrira, A. Adya, Providing persistent objects in distributed systems, in Proceedings of the 13th European Conference on Object-Oriented Programming (ECOOP'99) (1999)
23. G. Nelson (ed.), Systems Programming with Modula-3 (Prentice Hall, New York, 1991)
24. R. Riggs, J. Waldo, A. Wollrath, K. Bharat, Pickling state in the Java system, in Proceedings of the USENIX 1996 Conference on Object-Oriented Technologies (COOTS) (1996)
25. SIGMOD, Proceedings of SIGMOD (2002)
26. M. Sullivan, A. Heybey, Tribeca: a system for managing large databases of network traffic, in Proceedings of the USENIX Annual Technical Conference (No. 98) (1998)
27. G. van Rossum, Python library reference (2001). python.sourceforge.net/devel-docs/lib/lib.html
28. VLDB, Proceedings of the 28th VLDB Conference (2002)
29. D.C. Wang, The asdlGen reference manual. See www.cs.princeton.edu/zephyr/ASDL (1998)

Sensor Network Integration with Streaming Database Systems

Daniel Abadi, Samuel Madden, and Wolfgang Lindner

1 Introduction

Recent advances in computing technology have led to the production of a new class of computing device—the wireless, battery powered, smart sensor [30]. Traditional sensors deployed throughout buildings, labs, and equipment are passive devices that simply modulate a voltage based on some environmental parameter. In contrast, these new sensors are active, full-fledged computers, capable not only of sampling real world phenomena but also of filtering, sharing, and combining those sensor readings with each other and with nearby Internet-equipped endpoints.

Smart-sensor technology has enabled a broad range of ubiquitous computing applications [13]: the low cost, small size, and untethered nature of these devices makes it possible to sense information at previously unobtainable resolutions. Animal biologists can monitor the movements of hundreds of different animals simultaneously, receiving updates of location as well as ambient environmental conditions every few seconds [8, 25]. Vineyard owners can place sensors on every one of their plants, providing an exact picture of how light and moisture levels vary in the microclimates around each vine [7]. Supervisors of manufacturing plants,

D. Abadi Department of Computer Science, Yale University, 51 Prospect Street, New Haven, CT 06511, USA e-mail: [email protected]

S. Madden (B) · W. Lindner Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology, 32 Vassar Street, Cambridge, MA 02139, USA e-mail: [email protected] W. Lindner e-mail: [email protected]

temperature controlled storage warehouses, and computer server rooms can monitor each piece of equipment, and automatically dispatch repair teams or shut down problematic equipment in localized areas where temperature spikes or other faults occur.

Over the past several years, we have designed and implemented a query processor for such sensor networks called TinyDB (for more information on TinyDB, see the TinyDB home page [24]). TinyDB is a distributed query processor that runs on each of the nodes in a sensor network and is explicitly designed to simplify many of the data collection applications described above. TinyDB runs on the Berkeley mote platform, on top of the TinyOS [17] operating system. TinyDB has many of the features of a traditional query processor (e.g., the ability to select, join, project, and aggregate data), but also incorporates a number of optimization features designed to minimize power consumption.

In this chapter, we review many of the features of TinyDB and other sensor network query processing systems (such as Cougar [41]). We also discuss how TinyDB interfaces with a data stream management system (DSMS), examples of which are discussed elsewhere in this book. Such integration is important for a number of reasons, including:

1. Integration provides the ability to combine stored or streaming data from the DSMS with data from the sensornet. For example, users may want to compute a join to decide if readings from a motion detector are correlated with network activity, or if truck locations match the expected locations on the planned route.
2. Integration provides a single, integrated interface for interacting with both the streaming database and the sensor network. TinyDB currently allows users to log query results into a relational database table via JDBC, but queries must still be input to TinyDB independently of the relational database. By providing a single, seamless system, users only need to understand how to configure and interact with one set of interfaces.
3. Integration offers the ability to optimize across the database system and the sensor network. For example, it may be desirable to push certain filters and aggregates into the sensor network if user queries are interested in only particular subsets or coarse summaries of readings.

To provide this integration, we interpose a proxy between the DSMS and TinyDB. This proxy selects fragments of a query input at the DSMS that can be safely and efficiently run on the sensor network, and facilitates optimization between the two query processing systems. It also hides the sensor-network-specific communication and control protocols from the DSMS, and frees the sensor network from needing to provide a complete set of query processing operators or sophisticated query parsing and optimization features. Before covering the details of this proxy system, however, we begin with a brief discussion of the TinyOS operating system and the mote hardware upon which TinyDB is built.

2 Sensor Networks and TinyOS

A sensor network node is a battery-powered, wireless computer. Typically, these nodes are physically small (a few cubic centimeters) and extremely low power (a few tens of milliwatts versus tens of watts for a typical laptop computer). Power is of utmost importance. If used naively, individual sensor nodes will deplete their energy supplies in only a few days.1 In contrast, if sensor nodes are very spartan about power consumption, months or years of lifetime are possible. Mica motes, for example, when operating at a 2 % duty cycle (between active and sleep modes), can achieve lifetimes in the 6 month range on a pair of AA batteries. This duty cycle limits the active time to 1.2 seconds per minute.

There have been several generations of motes produced. Older Mica motes have a 4 MHz, 8-bit Atmel microprocessor. Their RFM TR1000 [33] radios run at 40 kbits/s over a single shared CSMA/CA (carrier-sense multiple-access, collision avoidance) channel. Newer Mica2 nodes use a 7 MHz processor and a radio from the ChipCon [10] corporation that runs at 38.4 kbits/s. A third mote, the Mica2Dot, has hardware similar to the Mica2 but uses a slower, 4 MHz processor. Pictures of a Mica and a Mica2Dot mote are shown in Fig. 1. Mica motes are visually very similar to Mica2 motes and have exactly the same form factor. Newer motes, from Crossbow, Telos, and other companies, still use low-power, low-frequency microprocessors with very limited RAM (such as the TI MSP430 series, which has up to 10 KB of RAM and runs at up to 16 MHz), but now typically use ZigBee 802.15.4 radios with bit rates up to 250 kbits/s (such as the ChipCon 2420).

Radio messages in TinyOS are variable-sized. Typically, about twenty 50-byte messages (the default size in TinyDB) can be delivered per second. Like all wireless radios (but unlike a shared Ethernet [11], which uses the collision detection (CD) variant of CSMA), both the RFM and ChipCon radios are half-duplex, which means they cannot detect collisions because they cannot listen to their own traffic. Instead, they try to avoid collisions by listening to the channel before transmitting and backing off for a random time period when the channel is in use.

2.1 TinyOS

Berkeley provides a primitive OS for the motes called TinyOS. TinyOS consists of a set of components for managing and accessing the mote hardware, and a "C-like" programming language called nesC. TinyOS has been ported to a variety of hardware platforms, including UC Berkeley's Rene, Dot, Mica, Mica2, and Mica2Dot motes, the Blue Mote from Dust Inc. [18], and MIT's Cricket [32] platform.

1 At full power, a Berkeley Mica mote (see Fig. 1) draws about 15 mA of current. A pair of AA batteries provides approximately 2200 mAh of energy. Thus, the lifetime of a Mica2 mote will be approximately 2200/15 = 146 hours, or 6 days. Mica2Dot motes, with smaller 800 mAh batteries and a peak current consumption of 12 mA, can run at full power for just 800/12 = 66 hours, or about 3 days.

Fig. 1 Two Berkeley motes

The major features of TinyOS are:
1. A suite of software designed to simplify access to the lowest levels of hardware in an energy-efficient and contention-free way, and
2. A programming model and the nesC language designed to promote extensibility and composition of software while maintaining a high degree of concurrency and energy efficiency.
Interested readers should refer to [15]. It is interesting to note that TinyOS does not provide the traditional operating system features of process isolation or scheduling (there is only one application running at a time), and does not have a kernel, protection domains, a memory manager, or multi-threading. Indeed, in many ways, TinyOS is simply a library that provides a number of convenient software abstractions, including:
• The radio stack, which sends and receives packets over the radio and manages the MAC layer [17, 39].
• Software to read sensor values—each sensor device (e.g., the light sensor) is managed by a software component that provides commands to fetch a new reading from the sensor, and possibly access calibration and configuration features for more sophisticated digital sensors.
• Components to synchronize the clocks of a group of motes and to schedule timers to fire at specified times [12].
• Power management features that allow an application to put the device into a low-power sleep mode without compromising the state of any other software components.
• Software to manage the off-chip Flash using a simple, file-system-like interface.

• Components to localize a sensor relative to neighboring nodes by emitting sound or ultrasound pulses and measuring their time-of-flight to those neighbors [38].
Thus, TinyOS and nesC provide a useful set of abstractions on top of the bare hardware. Unfortunately, they do not make it particularly easy to author software for many of the data collection applications discussed in the introduction. Sensor networks will never be widely adopted if every application requires this level of engineering effort. The declarative model we advocate reduces these applications to a few short statements in a simple language.

3 TinyDB

In TinyDB, queries are input at the user's PC in a simple SQL-like language that describes the data the user wishes to collect and the ways in which he or she would like to combine, transform, and summarize it. The most significant way in which the variant of SQL we have developed differs from traditional SQL is that queries are continuous and periodic. That is, users register an interest in certain kinds of sensor readings (e.g., "temperatures from sensors on the 4th floor every 5 seconds") and the system streams these results out to the user. We call each period in which a result is produced an epoch. The epoch duration, or sample period, of a query refers to the amount of time between successive samples; in this example, the sample period would be 5 seconds. As we discuss various aspects of our system, we will show some examples of our language syntax and discuss its other features (both new and in common with traditional SQL) in more detail.

In TinyDB, query optimization is done as much as possible on the server-side PC, since it can be quite computationally intensive. However, because the server may not have perfect state about the status of the sensor network, and because the costs used to optimize a query initially may change over its lifetime, it is sometimes necessary to adapt running query plans once they have been sent into the network.

3.1 Query Language

Queries in TinyDB, as in SQL, consist of SELECT-FROM-WHERE-GROUP BY-HAVING blocks supporting selection, join, projection, aggregation, and grouping. TinyDB also includes explicit support for windowing and subqueries via materialization points. In queries, we view sensor data as a single virtual table with one column per sensor type. Tuples are appended to this table periodically, at well-defined intervals that are a parameter of the query. This period of time between each sample interval is the epoch, as described above. Epochs provide a convenient mechanism for structuring computation to minimize power consumption. As an example, consider the query

SELECT nodeid, light, temp
FROM sensors
SAMPLE PERIOD 1s FOR 10s

This query specifies that each sensor should report its own id, light, and temperature readings once per second for 10 seconds. The virtual table sensors contains one column for every attribute available in the catalog and one row for every possible instant in time. The term virtual means that these rows and columns are not actually materialized—only the attributes and rows referenced in active queries are actually generated. Results of this query stream out of the network, where they may be logged, output to the user, or fed into another database system. The output consists of an ever-growing sequence of tuples, clustered into 1 s time intervals. Each tuple includes a time stamp corresponding to the time it was produced.

Note that the sensors table is (conceptually) an unbounded, continuous data stream of values; as is the case in other streaming and online systems, certain blocking operations (such as sort and symmetric join) are not allowed over such streams unless a bounded subset of the stream, or window, is specified. Windows in TinyDB are defined as fixed-size materialization points over the sensor streams. Such materialization points accumulate a small buffer of data that may be used in other queries. Consider, as an example, the following query:

CREATE STORAGE POINT recentlight SIZE 8 seconds
AS (SELECT nodeid, light FROM sensors SAMPLE PERIOD 1s)

This statement provides a shared, local (i.e., single-node) location to store a streaming view of recent data, similar to materialization points in other streaming systems like Aurora or STREAM [2, 26], or to materialized views in conventional databases.

Joins are allowed between two storage points on the same node, or between a storage point and the sensors relation, in which case sensors is used as the outer relation in a nested-loops join. That is, when a sensors tuple arrives, it is joined with the tuples in the storage point at its time of arrival. This is effectively a landmark query [16], common in streaming systems. Consider, as an example:

SELECT COUNT(*)
FROM sensors AS s, recentLight AS rl
WHERE rl.nodeid = s.nodeid AND s.light < rl.light
SAMPLE PERIOD 10s

This query outputs a stream of counts indicating the number of recent light readings (from 0 to 8 samples in the past) that were brighter than the current reading.

TinyDB also includes support for grouped aggregation queries. Aggregation has the attractive property that it reduces the quantity of data that must be transmitted through the network; it can thus reduce energy consumption and bandwidth usage by replacing relatively expensive communication operations with cheaper computation operations, extending the lifetime of the sensor network significantly. TinyDB also includes a mechanism for user-defined aggregates and a metadata management system that supports their optimization.

In addition to aggregates over values produced during the same sample interval (for example, as in the COUNT query above), users want to be able to perform temporal operations. For example, in a building monitoring system for conference rooms, users may detect occupancy by measuring the maximum sound volume over time and reporting that volume periodically:

SELECT WINAVG(volume, 30s, 5s)
FROM sensors
SAMPLE PERIOD 1s

This query will report the average volume over the last 30 seconds once every 5 seconds, acquiring a sample once per second. This is an example of a sliding-window query common in many streaming systems [16, 26].

When a query is issued in TinyDB, it is assigned an identifier (id) that is returned to the issuer. This identifier can be used to explicitly stop a query via a "STOP QUERY id" command. Alternatively, queries can be limited to run for a specific time period via a FOR clause, or can include a stopping condition as a triggering condition or event; see our recent work on acquisitional query processing [23] for more detail about these language constructs.
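To make the WINAVG semantics above concrete, the following C sketch (an illustration only, not TinyDB code) computes a 30-sample window average every 5 samples over a once-per-second stream, using a circular buffer sized to the window; the simulated volume values are made up.

#include <stdio.h>

#define WINDOW 30   /* window size in samples (30 s at 1 sample/s) */
#define SLIDE   5   /* report every 5 samples (5 s)                */

static double buf[WINDOW];
static int    count = 0;     /* total samples seen so far */

/* Feed one sample per epoch; emit the average of the last WINDOW
 * samples every SLIDE epochs, once the window has filled. */
static void winavg_feed(double sample)
{
    buf[count % WINDOW] = sample;
    count++;
    if (count >= WINDOW && count % SLIDE == 0) {
        double sum = 0.0;
        for (int i = 0; i < WINDOW; i++)
            sum += buf[i];
        printf("epoch %d: WINAVG = %.2f\n", count, sum / WINDOW);
    }
}

int main(void)
{
    /* Simulated volume readings, one per one-second epoch. */
    for (int epoch = 0; epoch < 60; epoch++)
        winavg_feed(50.0 + (epoch % 10));
    return 0;
}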

3.2 Query Dissemination and Result Collection

Once a query has been optimized, it is disseminated into the network. We discuss one basic communication primitive, the routing tree. A routing tree is rooted at either the base station or a storage point; it allows the root of the network to disseminate a query and to collect query results. The routing tree is formed by forwarding the query from every node in the network: the root initially transmits the query; all child nodes that hear it process it and forward it on to their children, and so on, until the entire network has heard about the query.

Each radio message contains a hop-count, or level, indicating the distance from the broadcaster to the root. To determine their own level, nodes pick a parent node that is (by definition) one level closer to the root than they are. This parent will be responsible for forwarding the node's (and its children's) query results to the base station. We note that it is possible to have several routing trees if nodes keep track of multiple parents. This can be used to support several simultaneous queries with different roots. This type of communication topology is common within the sensor network community and is known as tree-based routing.

Figure 2 shows an example sensor network topology and routing tree.

Fig. 2 A sensor network topology, with routing tree overlay

Solid arrows indicate parent nodes, while dotted lines indicate nodes that can hear each other but do not use each other for routing. In general, a node may have several possible choices of parent; a simple approach is to choose the parent to be the ancestor node with the lowest level. In practice, it turns out that making a proper choice of parent is quite important for communication and data collection efficiency, and that real network topologies are much less regular and more complex than one might expect [14]. Unfortunately, the details of the best known techniques for forming trees in real networks are quite complicated and outside the scope of our discussion in this chapter. For a more complete discussion of these and other issues see, for example, recent work from the TinyOS group at UC Berkeley [40]. We note that there has been a plethora of work on routing in ad-hoc and sensor networks [19, 21, 27, 28], including energy-aware routing [9, 31] and special MAC protocols [42]. Our goal here is different: instead of a general-purpose routing layer, we need to disseminate information from sensors to the root, and we can leverage this knowledge about our communication patterns.

Once a routing tree has been constructed, each node has a connection to the root of the tree that is just a few radio hops long. We can then use this tree to collect data from sensors by having them forward query results up this path. In TinyDB, the routing tree evolves over time as new nodes come online, interference patterns change, or nodes run out of power. Tree maintenance is done locally, at every node, by keeping a set of candidate parents and an estimate of the quality of the communication link to each of them; when the quality of the link to the current parent is sufficiently worse than the quality of the link to another candidate parent, a switch is made.

A simple routing structure such as a routing tree is well suited to our scenario: sensor network query processors impose communication workloads on the multi-hop communication network that are very different from those of traditional ad-hoc networks with mobile nodes. Since the sensor network is programmed only through queries, communication patterns are very regular, consisting mainly of the collection of sensor readings from a region at a single node or the base station. Note that other types of routing structures beyond routing trees are necessary if the query workload has more than a few destinations, since overlaying several routing trees neglects any sharing between the trees and leads to performance decay. The discussion of such routing algorithms is beyond the scope of this chapter, but we have begun to explore such issues in our research.
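The parent-maintenance rule just described can be sketched in a few lines of C. The candidate list, the link-quality field, and the 0.15 switching margin below are illustrative assumptions rather than TinyDB's actual data structures; the point is simply that a node switches parents only when another candidate is clearly better, which avoids flapping between links of similar quality.

#include <stdio.h>

#define MAX_CANDIDATES 8
#define SWITCH_MARGIN  0.15   /* assumed hysteresis: switch only on a clear win */

struct candidate {
    int    node_id;
    int    level;         /* hop count to the root */
    double link_quality;  /* estimated delivery rate in [0, 1] */
};

struct routing_state {
    struct candidate cand[MAX_CANDIDATES];
    int    n;
    int    parent;        /* index into cand[], or -1 if none chosen yet */
};

/* Re-evaluate the parent choice: find the candidate with the best link
 * quality, but switch only when the current parent is worse by more than
 * SWITCH_MARGIN.  (A simpler rule, also mentioned in the text, is to
 * prefer the candidate with the lowest level.) */
static void maybe_switch_parent(struct routing_state *rs)
{
    int best = -1;
    for (int i = 0; i < rs->n; i++)
        if (best < 0 || rs->cand[i].link_quality > rs->cand[best].link_quality)
            best = i;
    if (best < 0)
        return;
    if (rs->parent < 0 ||
        rs->cand[best].link_quality >
            rs->cand[rs->parent].link_quality + SWITCH_MARGIN) {
        rs->parent = best;
        printf("parent -> node %d (level %d, quality %.2f)\n",
               rs->cand[best].node_id, rs->cand[best].level,
               rs->cand[best].link_quality);
    }
}

int main(void)
{
    struct routing_state rs = {
        .cand = { { 7, 1, 0.62 }, { 12, 1, 0.90 }, { 3, 2, 0.95 } },
        .n = 3, .parent = -1
    };
    maybe_switch_parent(&rs);
    return 0;
}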

3.3 Query Processing

Once a query has been disseminated, each node begins processing it. Processing is a simple loop: once per epoch, readings, or samples, are acquired from the sensors corresponding to the fields or attributes referenced in the query. This acquisition is done by a special acquisition operator. The resulting set of readings, or tuple, is routed through the query plan built in the optimization phase. The plan consists of a number of operators that are applied in a fixed order; each operator may pass the tuple on to the next operator, reject it, or combine it with one or more other tuples. Any tuple that successfully passes through the plan is transmitted up the routing tree to the node's parent, which may in turn forward the result on, or may combine it with its own data or with data collected from its other children.

The acquisition operator uses a catalog of available attributes to map names referenced in queries to low-level operating system functions that can be invoked to provide their values. This catalog abstraction allows sophisticated users to extend the sensor network with new kinds of sensors, and also provides support for sensors that are accessed via different software interfaces. For example, in the TinyDB system, users can run queries over sensor attributes like light and temperature, but can also query attributes that reflect the state of the device or operating system, such as the free RAM in the dynamic memory allocator. Table 1 shows a list of some of the sensor and system attributes that are available on current-generation sensors. This table includes energy per sample as an example of additional catalog metadata that can be used in query optimization.
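One way to picture the catalog-driven acquisition operator is as a table that maps attribute names to getter functions plus per-sample energy metadata, consulted once per epoch before the tuple flows through the rest of the plan. The C sketch below is a simplified, hypothetical rendering; the getter functions stand in for real sensor drivers, and nothing here is actual TinyDB or TinyOS code.

#include <stdio.h>
#include <string.h>
#include <stdint.h>

typedef uint16_t (*attr_getter)(void);

/* Stand-ins for the low-level driver calls the catalog would reference. */
static uint16_t get_light(void)    { return 420; }
static uint16_t get_temp(void)     { return 23;  }
static uint16_t get_free_ram(void) { return 812; }

/* One catalog entry per queryable attribute: a name, a getter, and
 * metadata (energy per sample) usable during query optimization. */
struct catalog_entry {
    const char *name;
    attr_getter get;
    double      energy_mj;
};

static const struct catalog_entry catalog[] = {
    { "light",   get_light,    0.525  },
    { "temp",    get_temp,     0.0056 },
    { "freeram", get_free_ram, 0.0    },
};

/* Acquisition operator: resolve each attribute referenced by the query
 * and sample it, producing one tuple for this epoch. */
static int acquire(const char **attrs, int n, uint16_t *tuple)
{
    for (int i = 0; i < n; i++) {
        int found = 0;
        for (size_t j = 0; j < sizeof catalog / sizeof catalog[0]; j++) {
            if (strcmp(attrs[i], catalog[j].name) == 0) {
                tuple[i] = catalog[j].get();
                found = 1;
                break;
            }
        }
        if (!found) return -1;   /* unknown attribute: reject the query */
    }
    return 0;
}

int main(void)
{
    const char *attrs[] = { "light", "temp" };   /* SELECT light, temp ... */
    uint16_t tuple[2];
    if (acquire(attrs, 2, tuple) == 0)
        printf("epoch tuple: light=%u temp=%u\n",
               (unsigned)tuple[0], (unsigned)tuple[1]);
    return 0;
}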

4 Data Integration Architecture

TinyDB and sensor network query processing systems are one piece of our flexible query processing system for sensor data. However, to make sensor networks truly useful, users need the ability to seamlessly query data from sensor networks as well as other data sources. Ideally, such a system would provide users with the same query language as well as data layout and schema management tools across both sensors and other data sources. We are in the process of adding support to Borealis (a follow-on to the Aurora [2] stream processing system) for TinyDB and other limited-capability query processors.

Table 1 Some sensors available for Mica motes and their power requirements

Sensor                     Notes                                                     Energy per sample (@3 V), mJ
Solar Radiation [36]       Amount of radiation in the 400–700 nm range that          0.525
                           allows plants to photosynthesize
Barometric Pressure [20]   Air pressure, in mbars                                    0.003
Humidity [35]              Relative humidity                                         0.5
Passive Infrared [37]      Temperature of an overhead surface                        0.0056
Ambient Temp [37]                                                                    0.0056
Accelerometer [5]          Measures movement and vibration                           0.0048
(Passive) Thermistor [6]   Uncalibrated temperature, low cost                        0.00009

In this section, we consider in broad terms the problem of integrating a sensor database system into any system for processing (non-sensor) data streams (such as Telegraph, Aurora, or other systems described in this book). We note that this problem cannot be trivially solved by simply wrapping the data output from TinyDB in statements in the appropriate "data definition language" for the streaming system in question, for the following fundamental reasons:
• TinyDB must be parameterized by one or more queries that specify the data to be extracted from it. Some mechanism is needed to derive the appropriate query to send into the network when the user requests data from the combined stream/sensor database.
• To provide reasonable performance, as many data-reducing operators (e.g., filters and projections) as possible should be executed by the sensors; however,
• TinyDB does not support the full set of operators provided by most streaming systems, so not all queries can be executed in TinyDB. Furthermore,
• Some operators (e.g., joins with external tables) require state to be made available to the sensor network to execute the query in the network, which can substantially increase the cost of executing a query in-network.
• Finally, the optimization metric in a sensor network is typically expressed in terms of energy, whereas in other streaming database systems other, more traditional metrics (e.g., latency) are typically used.

Our basic integrated architecture is shown in Fig. 3. To the right of the dashed line is the data stream processor; to the left is our sensor network system. The proxy between these two systems acts as a mediator, translating queries from the streaming system into queries on the sensor network and managing the optimization process between the two systems.

Fig. 3 Proxy architecture

To illustrate how our approach to integration works, we look in detail at how a proxy allows us to solve the two fundamental problems outlined above: query dissemination and optimization.

4.1 Proxy-Based Query Dissemination

The basic process for query dissemination in our architecture works as follows: queries arrive at the data stream management system (DSMS), which parses and verifies them, and sends them to the proxy. The proxy is responsible for transforming each user-input query into two subqueries: one that runs on the sensor network, the sensor query, and one that runs on the streaming database, the stream query. The output of the sensor query is fed into the stream query so that the output of this composite stream-sensor query is identical to that of the non-transformed query in which all processing is done inside the stream system.

Since the DSMS parses queries into query graphs of operators, the process of transforming a query into the two subqueries can be thought of as partitioning the query graph. In most cases, the actual mechanisms for partitioning queries are straightforward and well understood from prior work in the database community [22, 29]; connected subgraphs of the query plan can be pushed into the network, and in some cases operators can be commuted to arrive at plans with better performance characteristics.

The challenge in building such a proxy system, then, is determining which decompositions are feasible; that is, identifying the decompositions that can actually be executed in the sensor network. A simple option would be to hard-code knowledge about the capabilities of the sensor network into the proxy—for example, the proxy might know that sub-plans without joins can be pushed into the network. The problem with this approach is that it constrains the proxy to be usable with a single type of external data source; ideally, we would like to be able to reuse this proxy architecture for other purposes—e.g., to allow simple query processing operators to be pushed into a Website that provides selection facilities via a form-based interface.

Our capability-based approach is loosely inspired by work on mediation systems like Garlic [34]. Unlike our approach, the Garlic middleware asks client databases whether they can run a certain part of each query and at what cost. Depending on the answer, Garlic assigns parts of the queries to the clients, or runs those parts itself. As we discuss in the next sections, our system uses capabilities to locally enumerate and cost distributed plans at the proxy.

Constraint-Based Capability Language

Thus, we are developing a constraint-based capability language for expressing the set of feasible plans that can be pushed into the sensor network query processing services that users wish to interface to the streaming database. To add a new service to the proxy, the user registers a description of that system using the capability language. The language consists of a sequence of property-value predicates. Properties are derived from a candidate sensor query and refer to simple characteristics of the query, such as has_aggregates. A query is considered valid (i.e., legal to run in the sensor network) if all of the predicates evaluate to true. As a simple example, consider the service description below; it specifies that a given query processor registered with the proxy can accept queries without aggregates and with one or fewer joins of low selectivity.

{ has_aggregates = false, num_joins ≤ 1, all_join_selectivity ≤ 1 }

Initially, we provide a small number of basic properties similar to those shown in the capability language fragment above. Adding a new property is a simple matter of writing a function that accepts a query and computes the value of that property over the query. Such functions are part of the stream query processor, as they may need to access query processor statistics or pieces of the optimizer (e.g., to compute join selectivity).
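A service description of this form can be evaluated with a handful of property predicates. The C sketch below shows one plausible shape for such a validity test; the struct layout and field names mirror the fragment above but are our assumptions, not the system's actual representation.

#include <stdio.h>
#include <stdbool.h>

/* Properties derived from a candidate sensor query by the DSMS. */
struct query_props {
    bool   has_aggregates;
    int    num_joins;
    double max_join_selectivity;   /* worst-case selectivity over all joins */
};

/* A registered service description: the constraints a pushed-down
 * query must satisfy to be runnable in the sensor network. */
struct service_desc {
    bool   allow_aggregates;
    int    max_joins;
    double max_join_selectivity;
};

static bool query_is_valid(const struct query_props *q,
                           const struct service_desc *s)
{
    if (q->has_aggregates && !s->allow_aggregates)          return false;
    if (q->num_joins > s->max_joins)                        return false;
    if (q->max_join_selectivity > s->max_join_selectivity)  return false;
    return true;
}

int main(void)
{
    /* { has_aggregates = false, num_joins <= 1, all_join_selectivity <= 1 } */
    struct service_desc sensornet = { .allow_aggregates = false,
                                      .max_joins = 1,
                                      .max_join_selectivity = 1.0 };
    struct query_props candidate = { .has_aggregates = false,
                                     .num_joins = 1,
                                     .max_join_selectivity = 0.05 };
    printf("candidate %s be pushed into the network\n",
           query_is_valid(&candidate, &sensornet) ? "can" : "cannot");
    return 0;
}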

ENUMERATESENSORQUERIES(s, q)
  r ← {}
  w ← {s}
  WHILE w ≠ {}
    w′ ← {}
    FOR EACH QUERY curq IN w
      Ops ← {next operator in q[curq.id] along each non-overlapping subtree of curq}
      FOR EACH OPERATOR op IN Ops WHERE PARENT(op) ≠ NULL
        curq′ ← curq ⊕ PARENT(op)
        IF (curq′ IS VALID) THEN w′ ← w′ ∪ {curq′}
    r ← r ∪ w′
    w ← w′

Fig. 4 Sensor query enumeration algorithm

Feasible Plan Enumeration

To derive candidate feasible sensor queries from a service description like the one shown above, the stream query processor begins with a query q, which we represent as a workflow graph (e.g., a tree of joins, as in a relational database system, or a STREAM [26]-style or Aurora [2]-style "boxes and arrows" diagram). It transforms this query by pushing selections past joins and other operators and choosing an ordering for joins (and other commutable operators) that it expects will perform well based on available statistics and optimization goals. The DSMS then produces a simple query, s, that contains no selections or joins and simply projects the fields, contained in tables that reside in the sensor network, from the original query q. It then enumerates valid prefixes of q that can be added to s, where a valid query is determined by the sensor network's service description. By prefix we mean a traversal from one or more leaves of the workflow graph across one or more operators towards an output of that graph.

The pseudocode for our candidate query enumerator is given in Fig. 4. The basic idea is to try adding successively larger prefixes from q to the set of candidate queries, until any further additions result in invalid plans. We define the operator ⊕ to mean the addition of the set of operators contained in a prefix to the query.

More formally, enumerateSensorQueries constructs a set, r, of valid candidate sensor queries. Initially, r is empty. The algorithm also constructs a set w of queries that initially contains just s. It then begins looping over w. For each query curq in w, enumerateSensorQueries constructs a set of operators, Ops. Ops contains the next operators along each non-overlapping subtree in curq. These operators are found by doing a lookup in the query q. To construct new candidate plans, enumerateSensorQueries loops over each operator op in Ops and constructs a new plan, curq′, in which the parent operator of op in q has been added to curq. If curq′ is valid, the algorithm adds it to the new working set w′. Once it has looped over all plans in w, enumerateSensorQueries adds all the plans in w′ to r and replaces the plans in w with those in w′.

Fig. 5 Illustration of the enumerateSensorQueries algorithm. The initial query q is shown, and the contents of w for the first three iterations of the outer loop of the algorithm are illustrated

Note that if we move a join operator into curq′, we have joined two subtrees of curq, thereby reducing the number of operators in the set Ops that results when enumerating the closest operators in curq′, relative to the number of closest operators in curq.

Figure 5 shows an example of plan enumeration for a join query over three data sources, two of which originate in the sensor network. The initial plan q is shown at the top of the diagram, and the contents of the set w are shown after the first, second, and third iterations of the algorithm. Note that in w2, the output of the entire sub-branch of the query containing source B and the two selections over B is pushed into the network. Because the source B exists external to the network, however, we never generate plans that execute operators over B inside the network. Though this is a heuristic (that could be relaxed), we believe it to be a reasonable one, as it is unlikely that pushing down sub-plans over external sources will reduce the amount of energy used by the network.2 Were B in the sensor network, these operators would, of course, be executed in-network.

2 This would only be the case if the cardinality of the output of the sub-branch were larger than the cardinality of the inputs of the sub-branch.

The complexity of enumerateSensorQueries scales with the number of distinct in-network sources, S, and the number of operators along each path from a source to the output of the network. If there are N operators along each such path, the worst-case running time will be S^N. Though the running time is exponential, we expect the number of sources in each sensor network to be small, so that the running time will be bounded. Note also that this is a worst case, as the validity checking will eliminate many possible plans. Simple heuristics that, for example, consider pushing down only a few paths can be used to bound the complexity of very large problem instances.
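For intuition about this search space, the following C sketch (a simplification, not the Fig. 4 algorithm itself) enumerates candidate pushdowns for the special case in which each in-network source has a linear chain of operators above it: a candidate is a choice of prefix length per source, filtered by a toy validity test that forbids joins inside the network. The query shape and the validity rule are assumptions made for the example.

#include <stdio.h>

#define S 2   /* in-network sources                      */
#define N 3   /* operators on the path above each source */

/* Toy query shape: the last operator above source 0 is a join. */
static int is_join(int src, int depth) { return src == 0 && depth == N; }

/* Toy service description: no joins may run inside the network. */
static int valid(const int prefix[S])
{
    for (int s = 0; s < S; s++)
        for (int d = 1; d <= prefix[s]; d++)
            if (is_join(s, d)) return 0;
    return 1;
}

/* Enumerate every combination of per-source prefix lengths (0..N),
 * keeping only the combinations the service description allows. */
static void enumerate(int src, int prefix[S], int *count)
{
    if (src == S) {
        if (valid(prefix)) {
            (*count)++;
            printf("candidate %2d: prefix lengths =", *count);
            for (int s = 0; s < S; s++) printf(" %d", prefix[s]);
            printf("\n");
        }
        return;
    }
    for (int len = 0; len <= N; len++) {
        prefix[src] = len;
        enumerate(src + 1, prefix, count);
    }
}

int main(void)
{
    int prefix[S] = { 0 }, count = 0;
    enumerate(0, prefix, &count);
    printf("%d valid candidate sensor queries out of %d combinations\n",
           count, (N + 1) * (N + 1));   /* (N+1)^S combinations with S = 2 */
    return 0;
}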

4.2 Proxy-Based Query Optimization

Given our set of candidate sensor plans r to evaluate, we can pass them to our query optimizer, which chooses the best sensor query in r, call it r∗, to push into the sensor network, given the original plan q and the corresponding stream query, which excludes the operators in r∗.

The primary challenge for query optimization across a DSMS-sensor network integration is that a DSMS and a sensor network have different and potentially conflicting optimization goals. The DSMS is responsible for maximizing the QoS (Quality of Service) delivered to end applications, and thus may be interested in decreasing latency, increasing throughput, or maximizing the quality of results delivered to these applications, depending on user preferences. In contrast, a sensor network is primarily concerned with minimizing power consumption in order to extend lifetime and reduce cost.

In some cases, these goals conflict. As an example, consider whether a join operation (of sensor data with a static table) should be performed in a sensor network. Assume that the join predicate is highly selective, but that the static table is large, so that the table must be horizontally partitioned and the join performed in parallel on many sensors. Assuming the network has the processing capability for the operator, and that the continuous query will run long enough that the overhead of disseminating the table is outweighed by the join's low selectivity, it is in the sensor network's interest to perform the operation in-network, since this will reduce the number of transmitted tuples and thus the power used.

To implement such a join in TinyDB, we use a parallel join on nodes within broadcast range of each other, so that each sensor tuple is only transmitted once (more details on this approach are given in Sect. 5), with all nodes containing a partition receiving the same transmission. Due to the lossy nature of wireless communication, however, the broadcast might not reach all nodes containing a partition, so some result tuples may be lost, affecting the quality of results observed by the end application. Thus, there is an energy-quality tradeoff: for power reasons, the sensor network may prefer to perform the join in-network, while, for quality reasons, the DSMS may prefer to perform the join outside the network.

We attempt to resolve this conflict by adding an additional optimization metric to our DSMS optimizer: lifetime. Lifetime can be thought of as a fixed amount of energy the DSMS has bought from the sensor network; when this energy runs out, no data will be produced for the query (thus, lifetime is a per-source metric; for now, we define the lifetime of a query to be the minimum of the lifetimes of all data sources in the query). In practice, for many sensor networks, lifetime simply corresponds to the amount of time until the sensors' batteries are exhausted. The more work the query requires of a sensor network, the faster this fixed amount of energy will be used. Thus, there is an observable cost to the application when making decisions that might improve latency or quality while also increasing power utilization in the sensor network, allowing the DSMS to choose in which direction of the power/latency/quality trade-off to optimize using application QoS functions.
Briefly, query optimization works as follows: after running for some time, the DSMS observes that some optimization dimension, d, is most in need of optimization, based on the observed performance of the query and the user's QoS requirements. For example, if the predicted lifetime of the network is 1 week, and the user has requested a lifetime of 2 weeks or more, the optimizer might choose to optimize lifetime. Once the DSMS chooses the optimization dimension, the proxy is notified of this choice. The proxy then searches the set of feasible plans, r, for the plan r∗ that will most improve the desired dimension and sends this plan into the network, returning r∗ to the DSMS for further optimization and redistribution within the streaming system.

In our current approach, plan costing is done via simple statistics and heuristics, as in traditional relational database systems, or as in TinyDB for the case of lifetime [23]. It should be noted that some of the potential optimization dimensions have not been sufficiently studied in the database community. For example, estimating how the wireless loss rate of tuples (due to contention) in a sensor network is affected by moving an operator into the network, and by different implementations of that operator, is complex and potentially difficult to do correctly. Since estimation errors are compounded through composition of operators, a more greedy version of the enumeration algorithm described above might be in order, in which the search space is pruned by not exploring more than 1–3 operators along any particular branch. This also has the benefit of decreasing the running time of the enumeration algorithm.

If the proxy is already running the maximally efficient plan for the dimension in question, it can attempt to adjust the sensor sample rates to improve d. For example, if d is lifetime or latency, reducing the sample rate will likely improve these metrics. However, if d is tuple quality or throughput, increasing the sample rate might help. If the DSMS indicates that the sample rate should not be changed, the proxy simply notifies the DSMS optimizer that it can do nothing to improve d. We note that this process of re-optimization is fairly expensive, and we envision it being done rarely. We are investigating lighter-weight, local optimizations that can be used to adapt on shorter time scales [1].
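As a rough sketch of this re-optimization loop (a simplification, not the system's actual costing code), the proxy can be pictured as first picking the dimension whose QoS target is violated most, then scanning the feasible plan set for the plan with the best estimate along that dimension. The dimensions, targets, and estimated values below are made-up illustrations.

#include <stdio.h>

enum dim { LIFETIME, QUALITY, THROUGHPUT, NUM_DIMS };
static const char *dim_name[NUM_DIMS] = { "lifetime", "quality", "throughput" };

/* One feasible candidate from the set r, with estimated values per
 * dimension (all scaled so that larger is better). */
struct plan {
    const char *name;
    double est[NUM_DIMS];
};

/* Pick the dimension whose observed value falls furthest below the
 * user's QoS target (ratio < 1 means the target is violated). */
static enum dim pick_dimension(const double observed[NUM_DIMS],
                               const double target[NUM_DIMS])
{
    enum dim worst = LIFETIME;
    double worst_ratio = observed[LIFETIME] / target[LIFETIME];
    for (int d = 1; d < NUM_DIMS; d++) {
        double ratio = observed[d] / target[d];
        if (ratio < worst_ratio) { worst_ratio = ratio; worst = (enum dim)d; }
    }
    return worst;
}

/* Among the feasible sensor queries, choose the one with the best
 * estimated value along the chosen dimension. */
static const struct plan *pick_plan(const struct plan *r, int n, enum dim d)
{
    const struct plan *best = &r[0];
    for (int i = 1; i < n; i++)
        if (r[i].est[d] > best->est[d]) best = &r[i];
    return best;
}

int main(void)
{
    /* Predicted lifetime of 1 week against a requested 2 weeks, as in the text. */
    double observed[NUM_DIMS] = { 1.0, 0.95, 100.0 };
    double target[NUM_DIMS]   = { 2.0, 0.90,  80.0 };
    struct plan r[] = {
        { "filters only in-network",   { 2.1, 0.95, 90.0 } },
        { "filters + join in-network", { 3.0, 0.85, 95.0 } },
    };
    enum dim d = pick_dimension(observed, target);
    const struct plan *best = pick_plan(r, 2, d);
    printf("optimize %s: push \"%s\" into the network\n", dim_name[d], best->name);
    return 0;
}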

5 Join Operator Push Down

One of the features that our integrated architecture provides is the ability to push external tables into the sensor network. In this section, we use an example scenario to illustrate when this type of push-down processing might be useful, and look at the mechanics of executing such joins. The algorithms described in this section are derived from the REED (Robust, Efficient, Event Detection) project, and they are presented in more detail in [4].

Consider the following sensor network query designed for use in a factory environment with temperature-sensitive construction phases. The factory produces a product whose creation requires the temperature to remain in a small, fixed range that varies over time. Should the temperature fall outside this range, the product is in danger of being damaged and action must be taken immediately. A continuous query is thus desired that joins aggregate temperature readings from sensors located at various positions in the factory with a time-indexed relation that encodes the desired temperature range. Should the temperature ever fall outside the required range, an appropriate reaction will be initiated.

It might be beneficial to perform a join of this type in the sensor network. First, since the join predicate is highly selective (it only outputs data if something has gone wrong in the temperature maintenance system at the factory), it reduces the number of transmissions that sensors in the network need to make, extending network lifetime. Second, the join can be done in parallel on multiple sensor nodes. The historical table can be horizontally partitioned across sensor nodes and each new tuple joined with each partition separately. This could reduce latency.

The role of the proxy in pushing down the join operator is two-fold. First, it decides whether or not it is beneficial to the application to push down the operator at all. This is handled by the DSMS optimizer described above. Second, if the decision is made to move the join into the network, the proxy aids in the distribution of the tuples of the partitioned table to the appropriate sensor nodes.

Once the decision is made to push the join operator into the sensor network, the proxy orchestrates the distribution of the stored tuples to the appropriate nodes. The first step in this process is to flood the description of the operator (including the schema of the input and output tuples) and the name, size, and schema of the stored table down the TinyDB routing tree. Upon receiving this message from the proxy, every node in the sensor network knows whether or not it will participate in the join (by verifying that it contains the sensors that produce the fields in the input schema) and how many tuples of the join table it can store locally (by calculating the tuple size of the stored table from the table schema and comparing this value with the storage capacity the node is willing to allocate for the query).

The sensor nodes then independently form groups. A group is defined as a set of nodes that are all within broadcast range of each other and that collectively have enough storage capacity to accommodate the stored table. The advantage of storing the table in such groups is that it takes just one broadcast of a tuple from any member of the group to reach every location of the partitioned join table. Group formation is a background task that happens continuously throughout the lifetime of the join.
A sensor node that produces data relevant to the join but is not currently a member of a group initiates the group creation algorithm by notifying its neighbors that it is creating a group and using the neighbor lists from each willing participant to create a group of maximum size. The initiator then notifies each chosen participant of its acceptance into the group. Each participant then sends a message to the proxy containing its group identifier and storage capacity. This allows the proxy to decide how to partition the table for this particular group and to send the relevant tuples to each participant according to this decision. Once all tuples have been sent to all participants, the group is ready to perform the join. Since the proxy is aware of which groups have been formed and which tuples have been sent to which nodes, it can deduce where to send updates to the table, should this be necessary.

In summary, with the help of the proxy, a join of sensor data with a static table can be pushed from a DSMS into the sensor network. The proxy administers the distribution of the query and the static table into the network, and can monitor and collect query statistics to ensure that the join selectivity remains low enough that executing the join in the network remains the right thing to do [3, 4].
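Once group members have reported their identifiers and storage capacities, the proxy must split the stored table among them. The C sketch below shows one minimal, capacity-bounded way to hand out contiguous tuple ranges; the partitioning policy, message format, and numbers are assumptions made for illustration, not the REED protocol's actual scheme.

#include <stdio.h>

#define TABLE_TUPLES 600   /* tuples in the stored join table */

struct member {
    int node_id;
    int capacity;          /* tuples this node can store for the query */
    int first, count;      /* assigned partition: tuples [first, first+count) */
};

/* Greedily hand out contiguous tuple ranges until the whole table is
 * placed; returns 0 on success, -1 if the group lacks capacity. */
static int partition_table(struct member *g, int n, int table_size)
{
    int next = 0;
    for (int i = 0; i < n && next < table_size; i++) {
        int take = table_size - next;
        if (take > g[i].capacity) take = g[i].capacity;
        g[i].first = next;
        g[i].count = take;
        next += take;
    }
    return next == table_size ? 0 : -1;
}

int main(void)
{
    /* Capacities as reported by group members to the proxy. */
    struct member group[] = {
        { 11, 250, 0, 0 },
        { 14, 200, 0, 0 },
        { 17, 300, 0, 0 },
    };
    if (partition_table(group, 3, TABLE_TUPLES) != 0) {
        printf("group too small for the table\n");
        return 1;
    }
    for (int i = 0; i < 3; i++)
        printf("node %d stores tuples [%d, %d)\n",
               group[i].node_id, group[i].first,
               group[i].first + group[i].count);
    return 0;
}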

6 Conclusions

TinyDB provides a robust platform for deploying a wide range of monitoring and data collection applications on sensor networks. By itself, it provides a number of features specially tailored to sensor networks, including features that allow users to control when and how frequently data is acquired from the sensors. When integrated with a DSMS, TinyDB uses a proxy interface to facilitate query optimization between the DSMS and itself, enabling sensor nodes to participate in the processing of complex queries, such as joins between the network and external tables stored in the DSMS.

We believe this integrated data processing architecture is the key to widespread adoption of sensor networks. The vast majority of users are not concerned with the details of power management or ad-hoc, multihop networking—rather, they simply want to combine and process sensor network data with existing data streams and relations, using familiar data processing operations and tools, which is exactly what TinyDB provides.

References

1. D. Abadi, Y. Ahmad, M. Balazinska, U. Çetintemel, M. Cherniack, J. Hwang, W. Lindner, A. Maskey, A. Rasin, E. Ryvkina, N. Tatbul, Y. Xing, S. Zdonik, The design of the Borealis stream processing engine, in Conference on Innovative Data Systems Research (CIDR'05), Asilomar, CA (2005)
2. D.J. Abadi, D. Carney, U. Çetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, S. Zdonik, Aurora: a new model and architecture for data stream management. VLDB J. (2003)

3. D.J. Abadi, W. Lindner, S. Madden, J. Schuler, An integration framework for sensor networks and data stream management systems, in VLDB (2004), pp. 1361–1364
4. D.J. Abadi, S. Madden, W. Lindner, REED: robust, efficient filtering and event detection in sensor networks, in Proceedings of VLDB (2005), pp. 769–780
5. Analog Devices, Inc., ADXL202E: low-cost 2 g dual-axis accelerometer. http://products.analog.com/products/info.asp?product=ADXL202
6. Atmel Corporation, Atmel ATMega 128 microcontroller datasheet. http://www.atmel.com/atmel/acrobat/doc2467.pdf
7. T. Brooke, J. Burrell, From ethnography to design in a vineyard, in Proceedings of the Design User Experiences (DUX) Conference (2003). Case study
8. A. Cerpa, J. Elson, D. Estrin, L. Girod, M. Hamilton, J. Zhao, Habitat monitoring: application driver for wireless communications technology, in ACM SIGCOMM Workshop on Data Communications in Latin America and the Caribbean (2001)
9. J.-H. Chang, L. Tassiulas, Energy conserving routing in wireless ad-hoc networks, in Proceedings of the 2000 IEEE Computer and Communications Societies Conference on Computer Communications (INFOCOM-00) (IEEE Comput. Soc., Los Alamitos, 2000), pp. 22–31
10. Chipcon Corporation, CC1000 single chip very low power RF transceiver datasheet. http://www.chipcon.com
11. Xerox, Digital Equipment Corporation, Intel, The Ethernet, A Local Area Network: Data Link Layer and Physical Layer Specifications (Version 2.0) (1982)
12. J. Elson, L. Girod, D. Estrin, Fine-grained network time synchronization using reference broadcasts, in OSDI (2002)
13. D. Estrin, L. Girod, G. Pottie, M. Srivastava, Instrumenting the world with wireless sensor networks, in International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2001), Salt Lake City, Utah (2001)
14. D. Ganesan, B. Krishnamachari, A. Woo, D. Culler, D. Estrin, S. Wicker, Complex behavior at scale: an experimental study of low-power wireless sensor networks. Under submission (2002). Available at http://lecs.cs.ucla.edu/~deepak/PAPERS/empirical.pdf
15. D. Gay, P. Levis, R. von Behren, M. Welsh, E. Brewer, D. Culler, The nesC language: a holistic approach to network embedded systems, in ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation (PLDI) (2003)
16. J. Gehrke, F. Korn, D. Srivastava, On computing correlated aggregates over continual data streams, in Proceedings of the ACM SIGMOD Conference on Management of Data, Santa Barbara, CA (2001)
17. J. Hill, R. Szewczyk, A. Woo, S. Hollar, D.C.K. Pister, System architecture directions for networked sensors, in ASPLOS (2000)
18. Dust Inc. company web site. http://www.dust-inc.com
19. C. Intanagonwiwat, R. Govindan, D. Estrin, Directed diffusion: a scalable and robust communication paradigm for sensor networks, in MobiCOM, Boston, MA (2000)
20. Intersema, MS5534A barometer module. Technical report (2002). http://www.intersema.com/pro/module/file/da5534.pdf
21. D.B. Johnson, D.A. Maltz, Dynamic source routing in ad hoc wireless networks, in Mobile Computing, ed. by T. Imielinski, H. Korth. The Kluwer International Series in Engineering and Computer Science, vol. 353 (Kluwer Academic, Norwell, 1996)
22. A.Y. Levy, I.S. Mumick, Y. Sagiv, Query optimization by predicate move-around, in Proceedings of VLDB (1994)
23. S. Madden, M.J. Franklin, J.M. Hellerstein, W. Hong, The design of an acquisitional query processor for sensor networks, in ACM SIGMOD (2003)
24. S. Madden, W. Hong, J.M. Hellerstein, M. Franklin, TinyDB web page. http://telegraph.cs.berkeley.edu/tinydb
25. A. Mainwaring, J. Polastre, R. Szewczyk, D. Culler, Wireless sensor networks for habitat monitoring, in ACM Workshop on Sensor Networks and Applications (2002)
26. R. Motwani, J. Widom, A. Arasu, B. Babcock, S. Babu, M. Datar, C. Olston, J. Rosenstein, R. Varma, Query processing, approximation and resource management in a data stream management system, in CIDR (2003)
27. V. Park, S. Corson, Temporally-ordered routing algorithm (TORA) version 1 functional specification (1999). Internet draft, http://www.ietf.org/internet-drafts/draft-ietf-manet-tora-spec-02.txt
28. C.E. Perkins, Ad hoc on demand distance vector (AODV) routing (1999). Internet draft, http://www.ietf.org/internet-drafts/draft-ietf-manet-aodv-04.txt
29. H. Pirahesh, J.M. Hellerstein, W. Hasan, Extensible/rule based query rewrite optimization in Starburst, in Proceedings of the 1992 ACM SIGMOD International Conference on Management of Data (ACM, New York, 1992), pp. 39–48
30. G. Pottie, W. Kaiser, Wireless integrated network sensors. Commun. ACM 43(5), 51–58 (2000)
31. G.J. Pottie, W.J. Kaiser, Embedding the Internet: wireless integrated network sensors. Commun. ACM 43(5), 51 (2000)
32. N.B. Priyantha, A. Chakraborty, H. Balakrishnan, The Cricket location-support system, in MOBICOM (2000)
33. RFM Corporation, RFM TR1000 datasheet. http://www.rfm.com/products/data/tr1000.pdf
34. M.T. Roth, P.M. Schwarz, Don't scrap it, wrap it! A wrapper architecture for legacy data sources, in Proceedings of 23rd International Conference on Very Large Data Bases, August 25–29, 1997, Athens, Greece (1997), pp. 266–275
35. Sensirion, SHT11/15 relative humidity sensor. Technical report (2002). http://www.sensirion.com/en/pdf/Datasheet_SHT1x_SHT7x_0206.pdf
36. TAOS Inc., TSL2550 ambient light sensor. Technical report (2002). http://www.taosinc.com/pdf/tsl2550-E39.pdf
37. Melexis Microelectronic Integrated Systems, MLX90601 infrared thermopile module. Technical report (2002). http://www.melexis.com/prodfiles/mlx90601.pdf
38. K. Whitehouse, The design of Calamari: an ad-hoc localization system for sensor networks. Master's thesis, University of California at Berkeley (2002)
39. A. Woo, D. Culler, A transmission control scheme for media access in sensor networks, in ACM Mobicom (2001)
40. A. Woo, T. Tong, D. Culler, Taming the underlying challenges for reliable multihop routing in sensor networks, in ACM SenSys, Los Angeles, California (2003)
41. Y. Yao, J. Gehrke, Query processing in sensor networks, in Proceedings of the First Biennial Conference on Innovative Data Systems Research (CIDR) (2003)
42. W. Ye, J. Heidemann, D. Estrin, An energy-efficient MAC protocol for wireless sensor networks, in Proceedings of the IEEE Infocom (2002), pp. 1567–1576

Part V Applications

Stream Processing Techniques for Network Management

Charles D. Cranor, Theodore Johnson, and Oliver Spatscheck

1 Introduction

The phenomenal growth of the Internet has had a tremendous effect on the way people lead their lives. As the Internet becomes more and more ubiquitous, it plays an increasingly critical role in society. In addition to leisure-time activities such as gaming and Web browsing, the Internet also carries important financial transactions and other types of business communications. Clearly, our dependency on the correct operation and good performance of the Internet is increasing and will continue to do so.

For network operators, understanding the types and volumes of traffic carried on the Internet is fundamental to maintaining its stability, reliability, security, and performance. Having efficient and comprehensive network monitoring systems is the key to achieving this understanding. The process of network monitoring varies in complexity from simple long-term collection of link utilization statistics to complicated ad-hoc upper-layer protocol analysis for detecting network intrusions, tuning network performance, and debugging protocols.

Unfortunately, rapid Internet growth has not made monitoring the network any easier. In fact, three trends associated with this growth present a significant challenge to network operators and the network monitoring tools they use:

• First, Internet applications have become more sophisticated. New application-level protocols with important semantic information in their headers are being layered on top of TCP and UDP. These headers might be free-form text that must be parsed, or the header might implement an application-specific protocol that must be emulated. Network monitoring systems that collect only basic TCP/IP header information are no longer sufficient for evaluating and debugging application-level performance. Additionally, these new protocols are placing stricter demands on Internet performance—in terms of bandwidth, latency and loss.



• Second, ever-increasing Internet data rates result in more data traversing the network, making it harder to examine specific parts of network data in any detail.
• Third, Internet users are becoming more savvy and are more concerned with application-level performance than with the network-level statistics that network operators typically collect and report.

We found that the current generation of network monitoring tools such as SNMP [6], RMON [24], NetFlow [7], and tcpdump [13] no longer fully address network monitoring and debugging needs. A serious problem is that these trends make not only monitoring, but also data management, difficult. Current generation tools create very large collections of flat files, which are analyzed with hand-crafted programs (for example, Perl scripts). Managing these very large data sets is difficult—data quality problems are common and metadata is quickly lost. The data analysis often involves merging data sets and correlating data from multiple sources (e.g., correlating TCP performance data with BGP—Border Gateway Protocol, a router update protocol [20]—data). For alerting and triggering applications, we need to evaluate complex queries in real time.

To address these problems, we have created Gigascope—a fast and flexible stream database for network monitoring. Gigascope was designed around two key aspects. First, Gigascope has a highly flexible SQL-like query language, GSQL, for its interface. Using a database query language provides us with great flexibility and allows Gigascope to be quickly adapted to new problems—only the high-level query need be changed. For example, Gigascope can be used to analyze or monitor newly evolving peer-to-peer and multi-media protocols (which change often). Network operators are especially interested in these types of protocols because of their potential impact on traffic patterns. Second, Gigascope was designed using the overriding principle of reducing data as early as possible to allow high-speed monitoring. Gigascope queries are automatically broken up into hierarchical components. Low-level components can run on the network interface card itself, reducing data before it reaches the main system bus. High-level query components may run either in kernel or user space and can be used to extract application-layer information from the network. By reducing data in key locations, a single machine can monitor very high-speed links. Early data reduction also makes it practical to use high-level languages such as Perl to interpret the results of Gigascope queries.

In this chapter, we discuss stream processing techniques for network management, especially as implemented using Gigascope. We discuss the nature of network performance and management data and the types of queries that are typically posed. Next, we discuss the GSQL language, and show how it is designed to support queries on network data. Network data streams can be very high volume (multiple Gbit/s and millions of records per second) and require a highly optimized processing architecture, which we discuss next. We present some observations about performance, and conclude with discussions of some sample applications.

2 Network Monitoring

Network operators perform network monitoring for a variety of reasons, ranging from routing optimization to application performance monitoring. While these widely disparate activities have their unique characteristics, there are also many commonalities. In this section, we discuss the nature of the data and queries used for network monitoring.

2.1 Network Monitoring Data

Network monitoring data generally arrives as a data stream—the latency between the observed activity and the arrival of its measurement is small, and the data arrives continually, as it is highly undesirable to turn off the network.

The measurement data can be archived to disk and later loaded into a conventional database or analyzed using ad-hoc tools. However, there are severe problems with these approaches. Even moderate-speed networks generate huge amounts of data. For example, monitoring a single gigabit Ethernet link running at 50 % utilization generates a terabyte of data in less than five hours. Most of the data is not valuable for (legitimate) analyses. This approach therefore presents intractable problems of cost, speed, security, and privacy. The speed issue is critical because for many applications the value of the result degrades rapidly with time (you want to trigger an alarm seconds after a DDOS attack starts, not days after). Stream processing techniques which reduce these huge volumes of data into actionable knowledge in real time are critical for network monitoring.

Current generation monitoring tools such as SNMP and Netflow provide a pre-aggregated view of network activity. SNMP typically maintains a collection of counters (e.g., of bytes or packets traversing a network interface) and reports the value of these counters at regular intervals (e.g., once a minute). Netflow data is a byproduct of a router's "flow" cache. A flow is a sequence of packets from a particular source to a particular destination such that the time between successive packets is small. When the flow finishes, the router emits a flow report specifying the source, destination, byte and packet count, start and end time, and some router-specific information. The router does not monitor each flow to determine when the flow ends; instead, flows are generated as a byproduct of the flow cache cleaning algorithm (long-lived flows are also emitted for timely monitoring). The precise details of flow generation depend on the router's operating system and configuration parameters.

SNMP and Netflow are effectively the output of simple aggregation queries which reduce the volume of monitoring data to manageable levels. Their volume can still be large, e.g., Netflow monitoring data from core routers, so further reduction in a stream database is often necessary. We found that Gigascope queries are a better mechanism for this kind of pre-aggregation, being more flexible and with more precisely defined semantics. However, it is often cheaper and easier to use Netflow for very large scale monitoring because Netflow is produced by the router.

Netflow and SNMP summaries are often too coarse for the intended analysis. For a more detailed view, we need to look at packet data. Many analyses are made from packet headers—the data volume is reduced by projecting out the packet body. Unfortunately, a great deal of information about the application protocol (web, multimedia, P2P, etc.) is thereby lost. Using the port number to deduce the protocol is unreliable because of port 80 tunneling (e.g., to get through firewalls) and the use of non-standard port numbers (e.g., to get around P2P port blocking). With a stream database such as Gigascope, we can access the packet body in an efficient way and recover application protocol information.

Our primary sources of network performance data are either packet data streams or packet aggregate data streams. However, other performance data streams are available.
For example, we have incorporated active network measurements into our Gigascope queries by making them into data streams. Some particular characteristics of these data are:

• The data arrives as a stream.
• Each tuple in the stream contains a field which acts as a timestamp or a sequence number (e.g., the end time field of a Netflow record, the arrival time of a network packet, the TCP sequence number, etc.).
• A tuple might have several timestamps, which have different properties. For example, a Netflow tuple has both a start time and an end time. The stream is likely to be sorted on the end time, and almost sorted on the start time (it will be within a window, e.g., five minutes, before the end time).

The network performance data does not have much meaning in isolation. Another important type of data is configuration data. For example, to interpret network performance data we often need to understand the state of the routers. One data feed is updates of router configuration data, typically provided as daily snapshots. Another feed is a stream of route update messages, for example, messages exchanged by BGP. Analyzing BGP and related streams is also of independent interest.

2.2 Network Monitoring Queries

Queries over network data depend on the application. The archetypical network query is load monitoring, usually long term and for the purpose of network optimization. This application is a popular example because it can be answered using coarse-grained summaries such as Netflow or SNMP data. However, a flexible and high-speed tool such as Gigascope enables a wide range of critical applications.

Fine-Grained Load Monitoring

Provisioning network links to properly support customer applications requires a good understanding of the types of loads currently in the network. The network operator needs to ensure the good performance of sensitive applications while providing adequate capacity for all applications. The desired report is usually an aggregation of traffic volume grouped by time period and application type (determined from the port number). Traditionally, this monitoring is done with a combination of SNMP and NetFlow statistics. Unfortunately, the bandwidth measurements these tools make use a fixed level of aggregation that can result in undesirable smoothing of traffic characteristics. Gigascope can be used to mitigate this problem because it can easily be programmed to provide very fine-grained load information. Tools such as NetFlow are difficult to change as they are embedded within a router's firmware.

Hidden Traffic Detection

The advent of peer-to-peer networks has created a management challenge for network operators: peer-to-peer traffic soaks up a great deal of bandwidth and may carry data that is illegal to copy in the first place. Because of this impact, it is important for network operators to keep track of how much load peer-to-peer traffic is placing on network links. However, the traditional approach of monitoring well-known TCP and UDP port numbers in NetFlow records to track a class of application's network usage breaks down for peer-to-peer networks, because peer-to-peer protocols have been designed to use random or misleading port numbers in order to circumvent firewalls and rate-limiting routers. A tool such as Gigascope can detect hidden traffic by filtering application-level protocol headers in the data area of TCP/IP packets. This type of filter usually involves searching for particular combinations of keywords. These protocols evolve, but it is easy to update Gigascope queries to track the new versions.
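To make the idea concrete, the sketch below shows the kind of payload test such a keyword filter performs. It is an illustration only, not Gigascope code: the function name, the marker list, and the naive substring scan are assumptions standing in for the keyword combinations and optimized multi-pattern matching a real deployment would use.

// Illustrative only (not Gigascope code): classify a TCP payload as likely
// peer-to-peer by scanning it for application-level protocol markers.
// The marker list is a hypothetical example; real signatures change as the
// protocols evolve and would be kept current in the query definitions.
#include <cstring>
#include <string>
#include <vector>

bool looks_like_p2p(const unsigned char* payload, std::size_t len) {
    static const std::vector<std::string> markers = {
        "GNUTELLA CONNECT",      // text at the start of a Gnutella handshake
        "BitTorrent protocol"    // string carried in the BitTorrent handshake
    };
    for (const std::string& m : markers) {
        if (len < m.size()) continue;
        // naive substring scan; a production filter would use a multi-pattern
        // matcher and run as close to the packet source as possible
        for (std::size_t i = 0; i + m.size() <= len; ++i)
            if (std::memcmp(payload + i, m.data(), m.size()) == 0)
                return true;
    }
    return false;
}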

Customized Application Monitoring

Network operators are often requested to examine the performance of a critical customer application and provide some application-layer performance metrics. This type of analysis can be difficult to do because it is often necessary for the network monitor to understand the application-level protocol. Furthermore, the most meaningful statistics are generated by emulating the protocol (perhaps TCP) and reporting, e.g., lost packets and round-trip times. We often need to provide complex statistics, such as distributions (i.e., quantiles) of round-trip times. Identifying the source of performance problems sometimes requires correlating this performance data against active network measurements, router configuration data, and route update protocol data.

Network Attack Detection

Networks are under frequent attack, necessitating rapid responses. Attack detection involves looking for "anomalous" behavior (e.g., a sudden increase in traffic) and/or for known attack signatures (e.g., port scanning or a known worm signature). When an attack is discovered, the appropriate action is to generate an alert and to look more closely at the attack traffic.
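As a rough illustration of the "anomalous volume" side of this, the sketch below compares each interval's packet count against a smoothed baseline and raises an alert when the count exceeds that baseline by a fixed factor. The smoothing weight and threshold are arbitrary assumptions; this is not the chapter's detection logic, just one simple way such a trigger could be realized downstream of an aggregation query.

// Illustrative sketch (assumptions, not from the chapter): flag a sudden
// traffic increase by comparing each interval's packet count against an
// exponentially weighted moving average of recent intervals.
#include <cstdint>
#include <cstdio>

class VolumeAlert {
    double baseline_ = 0.0;
    bool warmed_up_ = false;
    const double alpha_ = 0.1;     // smoothing weight (assumed)
    const double factor_ = 5.0;    // alert when 5x the baseline (assumed)
public:
    void interval_count(uint64_t pkts) {
        if (warmed_up_ && baseline_ > 0 && pkts > factor_ * baseline_)
            std::printf("ALERT: %llu packets vs baseline %.0f\n",
                        (unsigned long long)pkts, baseline_);
        baseline_ = warmed_up_ ? (1 - alpha_) * baseline_ + alpha_ * (double)pkts
                               : (double)pkts;
        warmed_up_ = true;
    }
};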

3 Gigascope

We developed Gigascope to address the monitoring and data management shortcomings of current generation monitoring tools. Gigascope is a high-performance stream database specialized for network monitoring. To ensure our ability to achieve high performance, Gigascope has some simplifying restrictions. The primary restriction is that Gigascope is a pure stream database: it does not directly support conventional relations, much less continuous query tables. We have used Gigascope to develop a diverse collection of network monitoring applications and have not found these restrictions to be serious.

3.1 Query Language

The Gigascope query language, GSQL, is a pure stream query language with SQL-like syntax (being mostly a restriction of SQL). That is, all inputs to a GSQL query are streams, and the output is a data stream. We feel that this choice (akin to that made by Tribeca [21] and Hancock [9]) allows for precise query semantics, enables the composition of complex query processing, and simplifies the implementation of fast operators. Almost all network data contains one or more timestamps or sequence numbers, and almost all monitoring queries make explicit reference to one or more of these timestamps. We use these timestamps to explicitly limit the locality of the output tuples that an input tuple contributes to. All of our operators, including aggregation, join, and merge, are automatically non-blocking and are implemented using simple modifications to conventional pipelined implementations. There is a cost, however: GSQL cannot express monitoring queries which require the continuous query model.

Ordered Attributes

One concern in a stream database language is that of blocking operators. While some operators (such as selection and projection) need no state other than the tuple being processed, other operators (such as aggregation and join) potentially require their entire inputs before a single output can be produced. One approach to bounding the state is to use sliding windows [3]. However, our approach is to analyze the "timestamps" of the input stream(s) and the properties of the query to determine a query plan which bounds the state required to evaluate blocking operators.

We observed that network analysis data generally contains one or more timestamps or sequence numbers, that these timestamps generally increase (or decrease) with the ordinal position of a tuple in a stream, and that almost all queries make reference to these timestamps. We therefore adopt an approach similar to that of a sequence database [19]. However, the sequence database model has a couple of limitations which make it impractical for our application. First, network data streams often have several timestamps and sequence numbers, and they might not be monotonically increasing with the ordinal position of the tuple in the stream. For a simple example, Netflow records have a start and an end timestamp. A stream of Netflow records produced by a router will have monotonically increasing end timestamps, and generally (but not monotonically) increasing start timestamps. Further, most queries on Netflow data will refer to the start timestamp rather than the end timestamp. The notion of sequence is further perturbed by operators such as join and aggregation (e.g., Netflow is the result of an aggregation query). Second, network analysis queries naturally involve predicates and other references to the timestamps and sequence numbers, but not to the ordinal position of a tuple in its stream.

We make use of timestamps and sequence numbers by defining them to be ordered attributes having ordering properties. These properties might be inherent in the data source, or might be due to processing by an operator. Below is an illustrative but non-exhaustive set of ordering properties:

1. Strictly/monotonically increasing/decreasing expresses the usual notion of a timestamp.
2. Monotone nonrepeating is a generalization of monotone increasing, and might occur due to a hash function.
3. Banded-increasing(b) is a modifier of the increasing property, and states that the attribute will always be less than b below the high water mark. The start time of a Netflow record is banded-increasing (5 minutes) if all Netflows are dumped by the router every 5 minutes. This property can also occur in the output of a windowed join.
4. Increasing with probability p is another modifier of the increasing property, and states that there are occasional exceptions (at rate p) to the ordering property.
5. Increasing in group G states that among the tuples defined by the fields in G, the attribute is increasing. This property can occur after aggregation. For example, the start time of a Netflow record (an aggregation of packets) is increasing in group {sourceIP, destIP, sourcePort, destPort, protocol}.

We might need to modify these definitions to account for almost-sorted input. For example, Netflow records are sorted on the end time, and all Netflow records are dumped every 5 minutes.
Therefore, the start time of a record is always within 5 minutes of the high water mark, i.e., the start attribute is banded-increasing (5 minutes).

We use the ordering attributes to turn blocking operators into stream operators. In the current implementation of Gigascope, we use the monotone increasing property as follows:

• Join. The join predicate must contain a constraint on an ordered attribute from each table which can be used to define a join window. For example, B.ts = C.ts, or B.ts ≥ C.ts − 1 and B.ts ≤ C.ts + 1.
• Group-by and aggregation. The group key must contain at least one ordered attribute. When a tuple arrives for aggregation whose ordered attribute is larger than that in any current group, we can deduce that all of the current groups are closed and will receive no further updates in the future. All of the closed groups are flushed to the output.

The Gigascope data definition language allows the user to specify special properties of the attributes in a source stream, including the ordering properties. The query processing system will impute ordering properties of the output of query operators. For example, suppose that attribute ts is monotonically increasing and a projection operator computes the value ts/5 as one of the attributes in its output. We can impute that ts/5 is also monotonically increasing. We can perform similar reasoning for the group-by/aggregation operator. The ordering property imputation for the join operator is more complex, and depends on the constraints in the join predicate and the particular join algorithm selected. For example, if B.ts and C.ts are monotonically increasing, B.ts is in the output, and the join predicate contains the constraint B.ts = C.ts, then B.ts in the output will be monotonically increasing. If the constraint is B.ts ≥ C.ts − 1 and B.ts ≤ C.ts + 1, then in the output B.ts might be monotonically increasing or banded-increasing(2), depending on the choice of join algorithm (monotonically increasing requires more buffer space).

Our approach is similar to that of punctuation semantics for data streams [23]. By labeling attributes with ordering properties, we make the tuples self-punctuating.
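The group-by rule above can be pictured with a small sketch. The following is a minimal illustration, not Gigascope's implementation: it assumes a tuple whose timestamp ts is monotonically increasing, groups by a minute bucket of ts plus destIP, and flushes every open group as soon as the bucket value advances.

// Minimal sketch of flush-on-advance aggregation (not Gigascope code).
// Because ts is assumed monotonically increasing, a tuple whose bucket
// exceeds every open group's bucket proves those groups are closed.
#include <cstdint>
#include <iostream>
#include <map>
#include <utility>

struct Tuple { uint32_t ts; uint32_t destIP; };

class BucketCount {
    std::map<std::pair<uint32_t, uint32_t>, uint64_t> groups_; // (bucket, destIP) -> count
    uint32_t current_bucket_ = 0;

    void flush() {
        for (const auto& g : groups_)
            std::cout << "bucket=" << g.first.first << " destIP=" << g.first.second
                      << " cnt=" << g.second << "\n";
        groups_.clear();
    }

public:
    void consume(const Tuple& t) {
        uint32_t bucket = t.ts / 60;       // minute-long buckets, like time/60 in GSQL
        if (bucket > current_bucket_) {    // the ordered attribute advanced:
            flush();                       // all open groups are closed and emitted
            current_bucket_ = bucket;
        }
        ++groups_[{bucket, t.destIP}];
    }
    void close() { flush(); }              // end of stream
};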

3.2 The GSQL Language

GSQL is an SQL-like stream database language, being mostly a restriction of SQL but with some stream database extensions. Currently GSQL supports selection, join, aggregation, and stream merge (discussed below). Join queries are currently restricted to two-stream joins, and the join predicate must include a constraint which defines a window on ordered attributes from both streams. Aggregation queries should have at least one ordered attribute as one of the group-by keys, but this restriction is not enforced (the user can obtain output by flushing the query).

All queries operate over streams, which come in two flavors: Protocols and Streams (the word "Protocol" was chosen because of its connotations to the end-users). A Protocol is a data stream generated by interpreting a sequence of data packets which are presented to the Gigascope run-time system.

These data packets can be from any reasonable source—IP packets transported via OC48, Netflow packets, BGP updates, etc. The Gigascope run-time system interprets the data packets as a collection of fields using a library of interpretation functions. The schema of a Protocol stream maps field names to the interpretation functions to invoke. A Stream is the output of a Gigascope query. The fields of its tuples are packed in a standard fashion.

A Protocol defines a mechanism for interpreting a data source, but not what serves as the data source (whereas the source of a Stream is the output of a query). To completely specify a data source, the Protocol must be bound to an Interface—a symbolic name which the run-time system can bind to a source of packets (if no Interface is given, a default Interface is implied). An example which reports the destination IP and port, and a timestamp, from TCP packets on eth0 (the first Ethernet interface card) is:

DEFINE{ query_name tcpDest0; }
Select destIP, destPort, time
From eth0.TCP
Where IPVersion = 4 and Protocol = 6

The DEFINE section of a query allows the user to set properties of the query. In this case, the query name is set to tcpDest0. A user application or another GSQL query can read the output of tcpDest0 by specifying it in the From clause.

Network data streams often come from many sources, which might be analyzed separately or in combination (for example, packets flowing to and from a client site and the Internet). Therefore, GSQL contains an extension to SQL, the merge operator, which is a Union operator that preserves the ordering properties of an attribute. For an example of a merge query, suppose that we have a tcpDest1 which matches tcpDest0 except that it reads from Interface eth1:

DEFINE{ query_name tcpDest; }
Merge tcpDest0.time : tcpDest1.time
From tcpDest0, tcpDest1

The merge operator allows us to combine streams from multiple sources into a single stream. This operator is surprisingly important—we implemented it before the join operator. We developed Gigascope to monitor optical links, which are usually simplex rather than duplex. To obtain a full view of the traffic on a logical link, we need to monitor two physical interfaces and merge the resulting streams.

GSQL supports the join of two streams as long as it can determine a join window from the join predicates. However, GSQL does not currently support the join of a stream to a non-stream relation. Instead, GSQL provides support for user-written functions which can act as special types of (foreign key) joins. These have worked so well in practice that supporting non-stream tables in GSQL has become a low priority.

Users can make new functions available by adding the code for the function to the function library, and registering the function prototype in the function registry. In the function registry, the function can be marked as a partial function, meaning that it might not return a value. The processing is the same as if there is no result from a join—the tuple being processed is discarded. One or more of the parameters of the function can be marked as pass-by-handle. These parameters (which must be literals or query parameters) require expensive pre-processing before the function can use them, for example, a regular expression to be compiled.
Let us consider an example:

DEFINE{ query_name tcpByAS; }
Select peerid, tb, count(*) as Cnt
From tcpDest
Group by time/60 as tb, getlpmid(destIP, 'peerid.tbl') as peerid

The attribute time is a 1-second granularity timer, so time/60 defines minute-long buckets (when a tuple with a new value of tb arrives, all of the pre-existing groups are closed, and therefore are flushed to the output stream). The getlpmid function performs longest prefix matching—that is, it identifies which subnet an IP address belongs to. Longest prefix matching is a common network analysis activity, and researchers have developed special fast algorithms for it, which getlpmid implements. The second parameter is a pass-by-handle parameter, which indicates a file mapping each routable prefix in the Internet to the ISP which originates the prefix (i.e., obtained from a routing table). Before the getlpmid function is first invoked, the parameter handle registration function reads this file and builds a special in-memory data structure for the function. This in-memory data structure referenced by the handle is then used when the getlpmid function is called to quickly map IP addresses to the ISP which originates them.

Many network monitoring queries require the emulation of a network protocol. While it is possible to express many protocol emulation subqueries in SQL, usually it is much easier and more efficient to write C or C++ code. To capture these subqueries, Gigascope provides a mechanism for expressing user-defined operators as stream operators. The stream properties of the operator are expressed using an operator view wrapper. For example, a piece of code which estimates the packet loss rate in TCP/IP connections, aggregated by autonomous system and 60-second intervals, can be exposed to Gigascope using the following wrapper:

OPERATOR_VIEW tcp_loss {
  SOURCE(file tcp_loss);
  FIELDS {
    UINT time (increasing);
    UINT peerid;
    loss_rate_estimate;
  }
  SUBQUERIES {
    tcpq (UINT time (increasing), UINT srcIP, UINT destIP,
          UINT srcPort, UINT destPort, UINT seqNum, UINT ackNum );
  }
  SELECTION_PUSHDOWN {
    srcIP -> tcpq.srcIP;
    destIP -> tcpq.destIP;
    srcPort -> tcpq.srcPort;
    destPort -> tcpq.destPort;
  }
}

Here we specify that the user-defined operator tcp_loss receives as input a data stream named tcpq with the specified fields and ordering properties, and produces the specified output. The SELECTION_PUSHDOWN clause provides query transformation hints to the optimizer. In the following example, we join tcp_loss output with tcpByAS to produce a combined report:

DEFINE{ query_name tcp_perf; }
Select C.peerid, C.tb, C.Cnt, L.loss_rate_estimate
From tcpByAS C, tcp_loss L
Where C.peerid = L.peerid and C.tb = L.time

Network analyses often make use of complex holistic aggregates, such as quantiles and heavy hitters. In recent years, a significant literature has developed which describes how to approximate complex aggregates quickly and using small space. Gigascope incorporates a mechanism for defining user-defined aggregate functions (UDAFs), which we have used to implement several streaming algorithms. For more details, see [8].
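The pass-by-handle pattern behind getlpmid can be sketched as follows. This is not the actual Gigascope UDF interface or the real getlpmid implementation; the file format, function names, and the simple per-length lookup are assumptions chosen to keep the example short. The point is the division of labor: the expensive step (loading and indexing the prefix table) happens once at handle-registration time, and the per-tuple call only probes the prebuilt structure. A return value of false corresponds to the partial-function case, where the tuple being processed is simply discarded.

// Hedged sketch of a pass-by-handle longest-prefix-match function
// (assumptions, not Gigascope's API). Input lines are assumed to look like
// "<network-as-uint32>/<prefix-length> <peerid>".
#include <cstdint>
#include <fstream>
#include <map>
#include <sstream>
#include <string>

struct PrefixTable {
    std::map<uint32_t, uint32_t> maps[33];   // maps[len][network] = peer id
};

// "Handle registration": runs once, before the function is first invoked.
PrefixTable* register_handle(const std::string& path) {
    auto* tbl = new PrefixTable;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream ls(line);
        uint32_t net, len, id;
        char slash;
        if (ls >> net >> slash >> len >> id && len <= 32)
            tbl->maps[len][len ? net & (~0u << (32 - len)) : 0] = id;
    }
    return tbl;
}

// Per-tuple call: probe from the most specific prefix length downward.
bool getlpmid(const PrefixTable& tbl, uint32_t ip, uint32_t* peerid) {
    for (int len = 32; len >= 0; --len) {
        uint32_t key = len ? ip & (~0u << (32 - len)) : 0;
        auto it = tbl.maps[len].find(key);
        if (it != tbl.maps[len].end()) { *peerid = it->second; return true; }
    }
    return false;   // "partial function": no result, tuple is discarded
}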

4 Architecture

In this section we describe the Gigascope architecture, starting with an overview, and then examining the query interface, run-time system, and network device interface support in detail.

4.1 Architectural Overview

To build a Gigascope application, you first need an idea of the type of network measurement that needs to be made. Once that has been established, the query set (usually more than one query is required for the application) must be expressed in the GSQL language (described in detail in Sect. 3.2). The GSQL query set expression is then input into the Gigascope query compiler. The Gigascope query compiler takes a set of GSQL queries as input and generates a set of query nodes, expressed as C and C++ modules. To run a Gigascope query, the query node modules are compiled into a library and linked into the run-time components of Gigascope. When the top-level Gigascope application is started, it allocates and configures a set of queries required to support the application. The top-level application collects the resulting output streams of its queries, which can be immediately displayed, used to trigger actions, or saved for loading into a conventional data warehouse.

The query nodes (called FTAs within Gigascope for historic reasons) execute in an environment provided by the Gigascope run-time system. The run-time system is responsible for managing and tracking all FTAs, handling IPC between Gigascope components, and tracking available network input devices. The network input devices themselves are managed by a device interface layer. This layer provides a uniform interface to the various types of network interface hardware that Gigascope is capable of using, including hardware that uses custom Gigascope-based firmware.

Fig. 1 Example GSQL query and how it is split into an LFTA and a HFTA by the query compiler

When an FTA is created, it takes a set of configuration parameters as input. This allows a single FTA module to perform different types of queries based on the parameters. Once allocated, the FTA can receive input data streams from the network and other FTA modules, and produces an output data stream.

There are two types of Gigascope FTAs: low-level FTAs and high-level FTAs, called LFTAs and HFTAs, respectively. LFTAs are designed to read data from the device interface layer and may run on the network interface card itself. Thus, LFTAs need to be capable of running in a resource-constrained firmware-based environment. On the other hand, HFTAs receive their tuples from LFTAs or other HFTAs and run on the host itself, so they have fewer operational constraints. An example is depicted in Fig. 1.

The practice of splitting queries into high-level HFTAs and low-level LFTAs is crucial for monitoring high-speed packet streams without packet loss. In the best case, we can migrate the LFTAs to the network interface card and avoid the need to move most packet data to the host. Even when the LFTAs execute on the host, this architecture provides significant benefits. The largest buffers in the system are those for the network streams. The lightweight and fast LFTAs process the network streams, creating vastly reduced streams for slower and more extensive processing by the HFTAs. In this way, we minimize memory buffer requirements and therefore the packet loss rate.

We note that the FTA modules and the output that they generate are self-describing. This allows applications to query the module or the module output directly in order to determine and decode its format. This feature prevents query output from getting out of sync with the query that generated it in the event that the query was modified.
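The division of labor between the two FTA levels can be illustrated with a sketch. This is not Gigascope's code: the record layout, the direct-mapped table, and the callback standing in for the IPC path are all assumptions. What it is meant to show is the principle from the text: do only cheap, constant-memory work per packet at the low level, and let the high level merge the much smaller stream of partial results.

// Illustrative sketch of the LFTA/HFTA split (assumptions, not Gigascope code).
#include <cstdint>
#include <functional>
#include <map>
#include <utility>

struct Partial { uint32_t bucket = 0; uint32_t destIP = 0; uint64_t cnt = 0; };

class Lfta {                                    // low level: runs near the packets
    static const int SLOTS = 1024;              // small, fixed memory footprint
    Partial table_[SLOTS];
    std::function<void(const Partial&)> emit_;  // stands in for the IPC path upward
public:
    explicit Lfta(std::function<void(const Partial&)> emit) : emit_(std::move(emit)) {}

    void packet(uint32_t ts, uint32_t destIP, bool is_tcp) {
        if (!is_tcp) return;                    // selection pushed down to the LFTA
        uint32_t bucket = ts / 60;
        Partial& p = table_[destIP % SLOTS];    // direct-mapped slot
        if (p.cnt != 0 && (p.destIP != destIP || p.bucket != bucket)) {
            emit_(p);                           // collision or new bucket: evict partial
            p = Partial{};
        }
        p.bucket = bucket; p.destIP = destIP; ++p.cnt;
    }
    void flush() {                              // e.g., at end of stream
        for (Partial& p : table_) if (p.cnt) { emit_(p); p = Partial{}; }
    }
};

class Hfta {                                    // high level: runs on the host
    std::map<std::pair<uint32_t, uint32_t>, uint64_t> final_;
public:
    void partial(const Partial& p) { final_[{p.bucket, p.destIP}] += p.cnt; }
    const std::map<std::pair<uint32_t, uint32_t>, uint64_t>& result() const { return final_; }
};

An application would simply wire the two together, e.g. Hfta hfta; Lfta lfta([&](const Partial& p) { hfta.partial(p); });, with the callback replaced by shared-memory queues or on-card firmware in a real deployment.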

Fig. 2 Gigascope run-time architecture

4.2 Gigascope Run-Time System

The Gigascope run-time system provides the software and hardware environment that Gigascope queries run within. The run-time system is responsible for:

• Providing a high-level application interface. This API is used to allocate and configure FTAs, start and stop query processing, and collect query output tuples.
• Keeping statistics on the overall system operation. These statistics include information on system load and are used to fine-tune query processing.
• Defining the processing environment that FTAs run in. This includes determining how allocated HFTAs and LFTAs are assigned to processes and coordinating IPC and shared memory management among those processes. The run-time system keeps a registry of all FTAs and processes associated with a running Gigascope.
• Providing GSQL's packet data access and external functions and UDAFs. References to these functions are generated by the query compiler.
• Providing a uniform network device interface to FTA query code. This allows FTA module code to be independent of the underlying network interface being used. Gigascope currently supports three network interfaces: Libpcap, Tigon, and DAG. Libpcap is a portable low-speed interface, Tigon is a high-speed firmware-based gigabit Ethernet interface, and DAG is a high-speed gigabit Ethernet and OC48 interface.

Figure 2 shows the run-time architecture provided by the Gigascope run-time system. The main run-time system process that tracks allocated FTAs and manages IPC is called the "clearinghouse" process. When a user application process starts, it contacts the clearinghouse process in order to establish a communication path with the Gigascope run-time system. It then uses the run-time system's high-level application interface to start a query. This involves allocating HFTAs and LFTAs for the query and arranging for their output tuples to be directed back to the application for processing. Some queries do not require the use of HFTAs—in this case the user application process receives output tuples directly from LFTAs.

User applications can be written in a variety of languages including C/C++ and Perl. Interpreted applications are acceptable even in a high-speed environment provided that Gigascope can aggregate the input data down to a manageable size before handing the output tuples off to the interpreter. The Perl binding for the Gigascope application API returns tuples as associative arrays, making it easy to access and process query data.

The Gigascope run-time system groups all LFTAs into a single process, while it assigns HFTAs to a set of processes. Depending on the network device interface, the LFTA process may run on the host system or on the network device itself as part of custom Gigascope firmware. Running the LFTAs on the network device allows Gigascope to improve performance by significantly reducing input data bandwidth before it even hits the system bus. Even without special hardware, running LFTAs in a single Unix process works well because the data is reduced before it is distributed through the Gigascope IPC system to the HFTA processes. Furthermore, on the dual-processor systems we typically use for Gigascope, it is safe to assume that one processor can be devoted to performance-critical LFTA processing while the other can be shared among the HFTAs and user applications.

4.3 Gigascope Network Device Interface

In the previous section, we introduced the three network devices that Gigascope currently supports: Libpcap, Tigon, and DAG. In this section, we examine Gigascope's device support in more detail. Of the three device interfaces, the Libpcap interface is the simplest, as no special hardware support is required for it. Although Libpcap is a relatively low-performance interface, its main advantage is that it is portable across many systems: Libpcap allows Gigascope to run on any system that can run the tcpdump application. For the remainder of this section, we will focus on the operation of the more complex hardware-based DAG interface (we do not have space to discuss the Tigon interface).
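For reference, the portable Libpcap capture path looks roughly like the sketch below. This is generic libpcap usage rather than Gigascope's actual device interface layer; the interface name, snap length, and filter string are placeholders.

// Generic libpcap capture loop (illustration, not Gigascope's device layer).
#include <pcap/pcap.h>
#include <stdio.h>

static void on_packet(u_char *user, const struct pcap_pkthdr *hdr, const u_char *bytes) {
    (void)user; (void)bytes;
    printf("captured %u bytes at %ld.%06ld\n",
           hdr->caplen, (long)hdr->ts.tv_sec, (long)hdr->ts.tv_usec);
}

int main(void) {
    char errbuf[PCAP_ERRBUF_SIZE];
    pcap_t *p = pcap_open_live("eth0", 96 /* snaplen */, 1 /* promisc */, 100 /* ms */, errbuf);
    if (!p) { fprintf(stderr, "pcap_open_live: %s\n", errbuf); return 1; }

    struct bpf_program prog;
    // roughly the tcpDest0 Where clause: IPv4 TCP packets only
    if (pcap_compile(p, &prog, "ip and tcp", 1, PCAP_NETMASK_UNKNOWN) == 0)
        pcap_setfilter(p, &prog);

    pcap_loop(p, -1, on_packet, NULL);   // run until interrupted
    pcap_close(p);
    return 0;
}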

DAG Network Interface

The DAG is a PCI network monitoring card manufactured by Endace that is capable of monitoring very high-speed networks. We are currently using both gigabit Ethernet and OC48 DAG cards with Gigascope. Each DAG card contains a packet processor, a radio clock interface, a network MAC interface, and a high-performance PCI bus interface. Unlike the Tigon, the DAG card uses host memory to store packet data. The DAG clock interface allows the DAG to provide very precise timestamps with captured packets. Details on the DAG hardware can be found on the Endace Web site [11].

Gigascope's DAG software environment consists of two main parts provided by Endace: a device driver used to control the card, and a small library used by applications such as Gigascope to access packets captured by the card. When the host system boots, the DAG device driver allocates a large, physically contiguous buffer in host memory for use as a receive ring buffer. This ring buffer is directly mapped into the LFTA process's address space. Gigascope polls this ring buffer to detect when new packet data arrives and should be processed by the FTA code.

Like Tigon-based Gigascopes, the DAG run-time architecture requires special device driver support and uses polling to avoid interrupt overheads. However, in the DAG case the LFTAs run on the host rather than on the card. Network data is delivered directly to the LFTA application process, completely bypassing the kernel, but not the system PCI bus. Some Endace cards can perform filtering to reduce the data load. Gigascope models these filters as selection and projection operators.
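The polling pattern described above, an LFTA draining a memory-mapped receive ring without taking interrupts, can be sketched as follows. The record layout, ring size, and index scheme are invented for illustration; they are not the Endace DAG interface.

// Hedged sketch of consumer-side ring-buffer polling (assumed layout).
// The capture device (producer) appends fixed-size records and advances the
// write index; the LFTA process (consumer) spins, draining new records.
#include <atomic>
#include <cstddef>
#include <cstdint>

struct Record { uint64_t ts_ns; uint32_t caplen; uint8_t bytes[60]; };

struct Ring {
    static constexpr size_t N = 1 << 16;        // power-of-two capacity
    std::atomic<uint64_t> wr{0};                // advanced by the producer
    uint64_t rd = 0;                            // private to the consumer
    Record slots[N];
};

template <typename Fn>
void poll_ring(Ring& ring, Fn&& handle_record, std::atomic<bool>& running) {
    while (running.load(std::memory_order_relaxed)) {
        uint64_t wr = ring.wr.load(std::memory_order_acquire);
        while (ring.rd != wr) {                 // drain everything published so far
            handle_record(ring.slots[ring.rd % Ring::N]);
            ++ring.rd;
        }
        // nothing new: keep spinning (a real system might pause or back off)
    }
}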

5 Example Applications

We conclude our evaluation with some real-world measurements from deployed Gigascopes. For business reasons, we have anonymized the data presented here.

Fine-Grained Load Monitoring

We deployed a Gigascope to measure network traffic in parallel with an existing NetFlow-based measurement system on a link in an Internet data center. For the Gigascope, we used an aggregation interval of 1 second. The NetFlow aggregation interval is at least 30 seconds, but can be longer depending on the behavior of the flow. We measured the same traffic for a period of one day with both Gigascope and NetFlow. Averaged over the full day, both Gigascope and NetFlow measured 2.1 kbps. However, the finer-grained Gigascope measurement found peak bandwidths of up to 729 kbps, while NetFlow only reported peaks of 190 kbps, indicating that the higher peaks were getting reduced by the relatively coarse-grained NetFlow aggregation.

Hidden Network Traffic Detection

We deployed a Gigascope within an access network and collected bandwidth data on peer-to-peer traffic. We collected our data using two queries. The first query used only well-known TCP/UDP port numbers to identify peer-to-peer traffic. This corresponds to the type of measurement that can be made with NetFlow. The second query, on the other hand, searches for peer-to-peer traffic by searching packet payloads for application-level peer-to-peer protocol headers. This corresponds to the kind of flexible queries that can easily be made with Gigascope to detect hidden network traffic. One day's worth of results for our queries is shown in Fig. 3. The plot clearly shows that the Gigascope query detects more than two times the amount of peer-to-peer traffic that the simple port-based query detects.

Fig. 3 Hidden network traffic detection

Fig. 4 Customized application monitoring

Customized Application Monitoring

Finally, we deployed Gigascope on a customer link and wrote a custom query to analyze the application-level performance of the customer's protocol. Our query tracks the number and IP address of all application users and provides detailed performance statistics for 5 % of all users. The detailed statistics include round-trip time, client loss, client latency, server loss, and server latency. We also capture all ICMP unreachable error messages to track network errors. The processed load for one day's worth of traffic is shown in Fig. 4. At the peak load of 1M packets/second the Gigascope is monitoring 90K application users generating 1.4 Gbps of traffic.

6 Related Work

One of the most common network monitoring tools is the tcpdump [13] program. Tcpdump is based on a packet capture library (libpcap) that uses the Berkeley packet filter (BPF) [16] to capture network traffic.

Both Gigascope and tcpdump were designed to enable network data to be reduced as early as possible. Tcpdump achieves this goal by providing a simple configuration language that allows the specification of a single BPF filtering expression. The Windmill [15] framework uses a similar approach to tcpdump to reduce data early on in the kernel. However, in addition to the basic filtering tcpdump performs, Windmill provides a set of protocol modules that allow the reassembly of protocol information. Another way to reduce data volumes early is to execute trusted code in the kernel. This approach is used by the FLAME [1] architecture. The main drawback of such an approach is that the semantics of the code executed in the kernel are known only to the programmer, making it difficult to automatically optimize the execution of a query.

In addition to research efforts in network monitoring, there are also a large number of commercial products in this area. These products range from simple flow-based monitoring tools to complex application-layer analyzers. The most dominant flow-based tool is the Netflow [7] tool implemented on a large number of CISCO routers. Netflow operates by counting packets traversing the router, hashed on their IP addresses and port numbers. The router will periodically flush its Netflow hash table based on time and current network load.

In the area of application-layer analyzers, the most promising products appear to be the solutions offered by Narus [17] and Niksun [18]. Both of those solutions provide a set of predefined configurable reports. However, neither provides the flexibility for users to specify new queries which then get executed on the line cards. In addition, our experience with vendors shows that persuading them to add new functionality to their network monitors is expensive and has a long lead time.

From an architectural perspective, Gigascope borrows heavily from the x-kernel [12]. Gigascope takes x-kernel abstractions, applies them to GSQL queries, and implements them using a combination of message queues, shared memory, and firmware.

The database community has developed the concept of a data stream [14], which is similar to the concept underlying GSQL. Data streams are used to describe data sets that arrive in a continual fashion, such as financial data, sensor data, and network traffic data. One approach proposed for data stream databases is the continuous query model [2, 4, 22]. The database model used for GSQL is most similar to the recent body of work [5, 10, 14] that explores the problem of query evaluation over sensor networks. A significant difference between Gigascope and this work is that Gigascope is targeted towards high-performance applications, with a corresponding emphasis on performance optimization.

7 Summary

Our experiences with large-scale network analysis convinced us that conventional methods and tools had significant limitations. We developed a stream database, Gigascope, to address these problems. By expressing network sniffing as SQL queries and applying careful optimizations, we developed a very high-speed, very flexible, and rapidly reconfigurable network monitor. Furthermore, much of the required data analysis is performed by Gigascope. The output is readily loaded into conventional stored-table databases and queried with SQL, providing a consistent developer's interface.

References

1. K.G. Anagnostakis, S. Ioannidis, S. Miltchev, J. Ioannidis, M.B. Greenwald, J.M. Smith, Efficient packet monitoring for network management, in Proceedings of IFIP/IEEE Network Operations and Management Symposium (NOMS) (2002)
2. A. Arasu, B. Babcock, S. Babu, J. McAlister, J. Widom, Characterizing memory requirements for queries over continuous data streams, in Principles of Database Systems (2002)
3. B. Babcock, S. Babu, M. Datar, R. Motwani, J. Widom, Models and issues in data stream systems, in Principles of Database Systems (2002), pp. 1–16
4. S. Babu, J. Widom, Continuous queries over data streams. SIGMOD Rec. 30(3), 109–120 (2001)
5. P. Bonnet, J. Gehrke, P. Seshadri, Towards sensor database systems, in 2nd Intl. Conf. on Mobile Data Management (2001)
6. J.D. Case, M. Fedor, M.L. Schoffstall, C. Davin, RFC 1157: Simple Network Management Protocol (SNMP) (1990)
7. Cisco. Netflow services and application. http://www.cisco.com/
8. G. Cormode, T. Johnson, F. Korn, S. Muthukrishnan, D. Srivastava, Holistic UDAFs at streaming speeds, in Proc. ACM SIGMOD (2004)
9. C. Cortes, K. Fisher, D. Pregibon, A. Rogers, F. Smith, Hancock: a language for extracting signatures from data streams, in Proc. Sixth Intl. Conf. on Knowledge Discovery and Data Mining (2000), pp. 9–17
10. DSKI. Dski—the data stream kernel interface. http://www.ittc.ku.edu/datastream
11. Endace. Endace web page. http://www.endace.com
12. N.C. Hutchinson, L.L. Peterson, Design of the x-Kernel, in Proceedings of the SIGCOMM'88 Symposium, Stanford, Calif. (1988), pp. 65–75
13. V. Jacobson, C. Malan, S. McCanne, Libpcap and tcpdump home page. http://www.tcpdump.org/
14. S. Madden, M. Franklin, Fjording the stream: an architecture for queries over streaming sensor data, in Intl. Conf. on Data Engineering (2002)
15. G.R. Malan, F. Jahanian, An extensible probe architecture for network protocol performance measurement, in ACM SIGCOMM'98 (1998)
16. S. McCanne, V. Jacobson, The BSD packet filter: a new architecture for user-level packet capture, in USENIX Winter (1993), pp. 259–270
17. Narus. Narus platform. http://www.narus.com/w/solutions/platform/
18. Niksun. Product solutions. http://www.niksun.com/product-list.html
19. P. Seshadri, M. Livny, R. Ramakrishnan, The design and implementation of a sequence database system, in Proc. of the 22nd VLDB Conf. (1996)
20. J.W. Stewart III, BGP4: Inter-Domain Routing in the Internet (Addison-Wesley, Reading, 1999)

21. M. Sullivan, A. Heybey, Tribeca: a system for managing large databases of network traffic, in Proc. USENIX Annual Technical Conf. (1998)
22. D. Terry, D. Goldberg, D. Nichols, B. Oki, Continuous queries over append-only databases, in Proc. ACM SIGMOD Conf. (1992), pp. 321–330
23. P. Tucker, D. Maier, T. Sheard, L. Fegaras, Exploiting punctuation semantics in continuous data streams. IEEE Trans. Knowl. Data Eng. 15(3), 555–568 (2003)
24. S. Waldbusser, RFC 2819: remote monitoring management information base (2000)

High-Performance XML Message Brokering

Yanlei Diao and Michael J. Franklin

1 Introduction

For distributed environments including Web Services, data and application integration, and personalized content delivery, XML [5] is becoming the common wire format for data. In this emerging distributed infrastructure, XML message brokers [1, 2] will play a key role as central exchange points for messages sent between applications and/or users. The context in which such a message broker operates is shown in Fig. 1. Users (equivalently, applications, or organizations) subscribe to the message broker by providing profiles expressing their data interests. After arriving at the message broker, these profiles become "standing queries," which are executed on all incoming data. Data sources publish their data by pushing streams of XML messages to the broker. The broker delivers to each user the messages that match his data interests; these messages are presented in the format required by the user.

There are three main functions provided by XML message brokers: filtering, transformation, and routing. Filtering matches XML messages to a large set of queries that represent the data interests of specific users. Transformation restructures matched messages according to user-specific requirements. Routing involves the transmission of the customized data to the relevant users.

This work has been supported in part by the National Science Foundation under the ITR grants IIS0086057 and SI0122599 and by Boeing, IBM, Intel, Microsoft, Siemens, and the UC MICRO program. Y. Diao (B) Department of Computer Science, University of Massachusetts Amherst, Amherst, MA, USA e-mail: [email protected]

M.J. Franklin Computer Science Division, University of California, Berkeley, Berkeley, CA, USA e-mail: [email protected]


Fig. 1 Overview of XML message brokering

For XML filtering, user queries are usually expressed in a language such as XPath 1.0 [11], which is used to specify constraints over both structure and content using path expressions. An earlier project, XFilter [3], pioneered the use of event-based parsing and Finite State Machines (FSMs) for fast structure-oriented XML filtering. In XFilter, the structural part of each path expression is converted into an FSM by mapping location steps of the expression to machine states. For each arriving XML document, the events raised during parsing are used to drive such FSMs through their various transitions. By creating a separate FSM for each distinct path expression, however, XFilter fails to exploit commonality among the path expressions, and thus may perform redundant work.

Based on this insight, we have developed "YFilter", an XML filtering system aimed at providing efficient filtering for large numbers (e.g., 10s or 100s of thousands) of path queries. The key innovation in YFilter is a Nondeterministic Finite Automaton (NFA)-based representation of path expressions which combines all queries into a single machine. YFilter exploits commonality among path queries by merging the common prefixes of the paths so that they are processed at most once. The NFA-based implementation also provides additional benefits including a relatively small machine size, flexibility in dealing with diverse characteristics of data and queries, incremental machine construction, and ease of maintenance.

An important challenge that arises due to the shared structure matching is the handling of value-based predicates that address the contents of elements. We have developed two alternative approaches to handling such predicates. Following the heuristic of evaluating "cheap" predicates early in the execution, as in relational systems, one approach evaluates predicates as soon as the addressed elements are read from a message. The other approach delays predicate evaluation until the structure of a path expression has been entirely matched.

The XML filtering solution described so far has focused on the matching of messages to large numbers of queries, but has not addressed the customization of output. Our work on XML transformation is aimed at developing the next level of functionality, i.e., transforming the XML messages on a per-query basis, in order to provide customized data delivery. User queries specifying transformation requirements can be written using a subset of XQuery [4]. To support such queries, we leverage the YFilter shared path matching engine [12], and develop alternatives for building customization functionality on top of it. In particular, we have developed three techniques that differ in the extent to which they push work down into the path matching engine. Due to an inherent tension between shared path matching and customized result generation, aggressive path sharing requires more sophisticated post-processing. To reduce the cost of such post-processing, we have developed provably safe optimizations, based on query and DTD (if available) inspection, that are able to simplify the post-processing of individual queries, and a set of techniques that enable the sharing of such post-processing across multiple queries.

Roadmap

The remainder of this chapter is organized as follows. Section 2 describes the broker architecture. Sections 3 and 4 present our solutions to XML filtering and transformation. Section 5 covers the related work and Sect. 6 presents conclusions.

2 Architectural Overview

Our XML message broker architecture is shown in Fig. 2. The primary inputs are the queries that represent user profiles and the XML messages themselves. Queries become active as soon as they arrive at the message broker. Inside the broker, an arriving query is parsed for use by the Query Processor, where the execution plan of the new query is merged with those of the existing queries without recompiling any of them. Incoming messages are filtered and transformed on-the-fly for the entire set of queries. These messages need not conform to DTDs (Document Type Definitions) or XML schemas, but such conformance can be exploited.

Internally, the broker runs an incoming message through an event-based XML parser. Parsing events are passed to the query processor to drive the query execution. They are also used to incrementally construct a node-labeled tree, which provides a materialization of the parsed message for those parts of query processing that need it. The nodes are assigned integer identifiers according to a pre-order traversal of the tree. The query processor produces output in an intermediate format that contains identifiers of nodes in the parsed XML message, organized for efficient translation into customized output messages. The intermediate output of the query processor is fed to the "message factory", which combines the element tags in queries with the intermediate output and forwards the resulting messages for delivery.

The query processor is the focus of our research work. The foundation of the query processor is a shared path matching engine. The shared path matching approach that this engine implements is described in the next section.

1 XML Schemas provide richer information than DTDs (e.g., robust and extensible datatyping). Since the structural information we exploit is provided by both types of definition, XML Schemas and DTDs are used under the same conditions in this work.

Fig. 2 XML message broker architecture

3 Shared Processing of Path Expressions

In our work, the solution to efficient matching of path expressions is based on two key observations. First, any single path expression written using the axes ("/", "//") and node tests (element name or "*") can be transformed to a regular expression. Thus, there exists an FSM that accepts the language described by such a path expression [18]. Second, if two path expressions share a sub-expression, the language described by that sub-expression can be accepted by a single FSM. For large-scale filtering of XML data, exploiting such commonality is the key to efficiency and scalability. In this section, we describe our shared path processing approach based on a combined FSM.

3.1 An NFA-Based Model with an Output Function

In YFilter, all of the path queries written using the axes and node tests described above are combined into a single FSM that takes the form of a Nondeterministic Finite Automaton (NFA). All common prefixes of the paths are represented only once in the NFA.

Figure 3 shows an example of such an NFA, representing eight queries (the process for constructing such a machine is described in the following section). A circle denotes a state. Two concentric circles denote an accepting state; such states are also marked with the IDs of the queries they represent. A directed edge represents a transition. The symbol on an edge represents the input that triggers the transition. The special symbol "*" matches any element. The symbol "ε" is used to mark a transition that requires no input. In the figure, shaded circles represent states shared by queries. Note that the common prefixes of all the queries are shared.

Fig. 3 Path queries and a corresponding NFA

Also note that the NFA contains multiple accepting states. While each query in the NFA has only a single accepting state, the NFA represents multiple queries. Identical (and structurally equivalent) queries share the same accepting state (recall that at this point in the discussion, we are not considering predicates). This NFA can be formally defined as a Moore Machine [18]. The output function of this Moore Machine is a mapping from the set of accepting states to a partitioning of the identifiers of all queries in the system, where each partition contains the identifiers of all the queries that share the accepting state.

Some Comments on Efficiency

A key benefit of using an NFA-based approach is the tremendous reduction in machine size it affords. It is reasonable to be concerned that using an NFA-based model could lead to performance problems due to (for example) the need to support multiple transitions from each state. A standard technique for avoiding such overhead is to convert the NFA into an equivalent DFA [18]. A straightforward conversion could theoretically result in severe scalability problems due to an explosion in the number of states. But, as pointed out in [15], this explosion can be avoided in many cases by placing restrictions on the set of DTDs (i.e., document types) and queries supported, and by lazily constructing the DFA.

The experimental results on the performance of the NFA-based approach reported in our earlier work [12], however, indicate that such concerns about NFA performance in this environment are unwarranted. In fact, in the YFilter system, path evaluation (using the NFA) is sufficiently fast that it is not the dominant cost of filtering in many cases. Rather, other costs such as document parsing and result collection are often more expensive than the basic path matching. Thus, while it may be possible to further improve path matching speed, we believe that the substantial benefits of flexibility and ease of maintenance provided by the NFA model outweigh any marginal performance improvements that remain to be gained by even faster path matching.

Fig. 4 NFA fragments for location steps, and examples of merging NFA fragments

3.2 Constructing a Combined NFA

Having presented the basic NFA model used by YFilter, we now describe an incremental process for NFA construction and maintenance. The shared NFA shown in Fig. 3 was the result of applying this process to the eight queries shown in that figure.

Figure 4(a) shows four basic location steps in our subset of XPath, and the directed graphs, called NFA fragments, that correspond to these location steps. The NFA for a path expression, denoted as NFAp, can be built by concatenating all the NFA fragments for its location steps. The final state of this NFAp is the (only) accepting state for the expression. NFAps are combined into a single NFA as follows: There is a single initial state shared by all NFAps. To insert a new NFAp, we traverse the combined NFA until either (i) the accepting state of the NFAp is reached, or (ii) a state is reached for which there is no transition that matches the corresponding transition of the NFAp. In the first case, we make that final state an accepting state (if it is not already one) and add the query ID to the query set associated with the accepting state. In the second case, we create a new branch from the last state reached in the combined NFA. This branch consists of the mismatched transition and the remainder of the NFAp. Figure 4(b) provides three examples of this process. Note from (b2) that in the NFA fragment for a location step with "//", the ε-transition is needed so that when combining NFA fragments representing "//" and "/" steps, the resulting NFA accurately maintains the different semantics of both steps.

It is important to note that because NFA construction in YFilter is an incremental process, new queries can easily be added to an existing system. This ease of maintenance is a key benefit of the NFA-based approach.
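To make the prefix-sharing construction concrete, here is a minimal Python sketch of inserting linear path expressions (using only "/", "//", element names, and "*") into a combined NFA. The state encoding, the modeling of "//" as an ε-transition into a state with a "*" self-loop, and the add_query interface are illustrative assumptions, not the actual YFilter data structures.

class CombinedNFA:
    def __init__(self):
        self.next_id = 0
        self.start = self._new_state()
        self.trans = {}          # (state, symbol) -> state; symbol is an element name, "*", or "eps"
        self.self_loop = set()   # states carrying a "*" self-loop (used to model "//")
        self.accepting = {}      # accepting state -> set of query ids sharing it

    def _new_state(self):
        sid = self.next_id
        self.next_id += 1
        return sid

    def add_query(self, query_id, steps):
        # steps is a list of (axis, name) pairs, e.g. [("/", "a"), ("//", "b")]
        state = self.start
        for axis, name in steps:
            if axis == "//":
                # reuse (or create) the eps-transition and its self-looping target state
                nxt = self.trans.get((state, "eps"))
                if nxt is None:
                    nxt = self._new_state()
                    self.trans[(state, "eps")] = nxt
                    self.self_loop.add(nxt)
                state = nxt
            # reuse (or create) the transition on the element name or "*"
            nxt = self.trans.get((state, name))
            if nxt is None:
                nxt = self._new_state()
                self.trans[(state, name)] = nxt
            state = nxt
        self.accepting.setdefault(state, set()).add(query_id)

nfa = CombinedNFA()
nfa.add_query(1, [("/", "a"), ("/", "b")])    # /a/b
nfa.add_query(2, [("/", "a"), ("//", "c")])   # /a//c shares the existing /a prefix

Because insertion only traverses existing transitions and branches at the first mismatch, adding a query touches only the states along its own path, which is what makes the construction incremental.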

3.3 Executing the NFA

Following the XFilter approach, the NFA is executed in an event-driven fashion. When an XML document arrives to be filtered, it is parsed with an event-based parser; each time a new element or the end of an element is encountered, an event is raised.

Fig. 5 An example of the NFA execution

The start-of-element events trigger various transitions in the NFA. Since the machine is non-deterministic, many states can be active simultaneously. In addition, the nesting of XML elements requires that when an "end-of-element" event is raised, NFA execution must backtrack to the states it was in when the corresponding "start-of-element" was raised. A stack mechanism is used to track multiple active states and enable backtracking.

Figure 5 shows the evolution of the contents of the stack as an example XML document is parsed. Each state in the stack is represented by its state ID, as shown in Fig. 3. On receiving a start-of-element event, the execution engine follows all matching transitions from all currently active states. For each active state, four checks are performed. First, if a transition marked by the incoming element name is present, the next state is added to the set of new active states. A transition marked by the "*" symbol is checked in the same way. Then, the state itself is added to the set if it has a self-loop, which is represented by an underlined state ID in Fig. 5. Finally, if an ε-transition is present, the state after the ε-transition is processed immediately according to these same rules. The interested reader is referred to [12] for more details on the NFA-based processing of path expressions.

Finally, it is important to note that, unlike a traditional NFA, whose goal is to find one accepting state for an input, the NFA execution in this work must continue until all potential accepting states have been reached. This is because all queries that match the input document need to be found for the purpose of XML filtering.
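The following sketch, which builds on the CombinedNFA class from the construction sketch in Sect. 3.2 above, illustrates this event-driven execution with a stack of active-state sets: a start-of-element event applies the four checks described here to every active state, and an end-of-element event simply pops the stack to backtrack. It is a simplified illustration of the mechanism, not the YFilter engine.

def start_element(nfa, stack, name):
    new_active = set()
    for state in stack[-1]:
        follow(nfa, state, name, new_active)
    stack.append(new_active)
    matched = set()
    for s in new_active:                      # collect queries whose accepting state was reached
        matched |= nfa.accepting.get(s, set())
    return matched

def follow(nfa, state, name, new_active):
    for symbol in (name, "*"):                # checks 1 and 2: element-name and "*" transitions
        nxt = nfa.trans.get((state, symbol))
        if nxt is not None:
            new_active.add(nxt)
    if state in nfa.self_loop:                # check 3: a self-loop keeps the state active
        new_active.add(state)
    eps = nfa.trans.get((state, "eps"))       # check 4: eps-transitions are processed immediately
    if eps is not None:
        follow(nfa, eps, name, new_active)

def end_element(stack):
    stack.pop()                               # backtrack to the states active before this element

stack = [{nfa.start}]
print(start_element(nfa, stack, "a"))         # <a>      -> set()
print(start_element(nfa, stack, "b"))         # <a><b>   -> {1}
end_element(stack)
print(start_element(nfa, stack, "c"))         # <a><c>   -> {2}, reached via the "//" self-loop state
end_element(stack); end_element(stack)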

3.4 Predicate Evaluation

The discussion so far has focused on the structure matching aspects of YFilter. In an XPath expression, however, predicates can be applied to address properties of elements, such as the text data and attributes. In this section, we focus on these value-based predicates and briefly describe the techniques used in YFilter to evaluate them. The discussion on the evaluation of nested path expressions is postponed until Sect. 4.

Given the NFA-based model for path matching, an intuitive approach to supporting value-based predicates would be to extend the NFA by including additional transitions to states that represent the successful evaluation of the predicates. Unfortunately, such an approach could result in an explosion of the number of states in the NFA, and would destroy the sharing of path expressions, the key advantage of using an NFA. YFilter instead uses a separate selection operator that evaluates value-based predicates by interacting with the NFA-based processing of path expressions.

Traditional relational query processing uses the heuristic of pushing selections down in the query plan to avoid unnecessary future work. Following this intuition, we developed an approach called Inline that processes value-based predicates as soon as the elements that those predicates address are matched during structure matching. To do so, predicates applied to a location step are associated with the state representing this location step in the NFA; these predicates are evaluated using selection immediately when this state is visited in the NFA execution. For each query, bookkeeping information is maintained, indicating which predicates of that query can be satisfied by a particular element. When an accepting state is reached, the bookkeeping information for the queries of that state is checked, and those queries for which all the predicates can be successfully evaluated are returned as matches.

We also developed an alternative approach, called Selection Postponed (SP), that waits until an entire path expression is matched during structural matching, and at that point applies all the value-based predicates for the matched path. In SP, the predicates are stored with each query. When an accepting state is reached in the NFA, selections are performed in bulk for each query associated with the state. If all predicates of a query evaluate to true, then the query is satisfied. A more detailed description of the above two approaches is provided in [12].

3.5 Performance Results for Shared Path Processing

We have performed a detailed performance study of our YFilter implementation. In the study, we compared the performance of the NFA-based path matching approach in YFilter, the FSM-per-query approach used by XFilter, and a hybrid approach that exploits a reduced degree of shared path processing. We also investigated the tradeoffs between the Inline and SP approaches to value-based predicates. The full results are reported in an earlier paper [12]. Here, we summarize those results as follows:

• YFilter can provide order-of-magnitude performance improvements over both XFilter and the hybrid approach. In fact, as discussed earlier, path processing using YFilter is sufficiently fast that in many cases it is outweighed by other costs for XML filtering such as document parsing and result collection.
• The NFA-based approach is robust and efficient under query workloads with varying proportions of "//" operators and "*" operators. This is important because it is these operators that introduce non-determinism into the path matching process.

The NFA-based approach was also shown to perform well using a number of DTDs with different characteristics.
• The maintenance cost (i.e., as queries are added and removed) of the NFA structure is small, due to the incremental construction that a nondeterministic version of an FSM enables and due to the sharing of structure inherent in the NFA approach.
• For value-based predicates, the SP approach was found to perform much better than the Inline approach. The Inline approach suffers from high execution complexity (e.g., the bookkeeping overhead). Moreover, early predicate evaluation is not effective in eliminating future work of structure matching or predicate evaluation, due to the shared nature of path matching and the effect of recursive elements in the presence of "//" operators in path queries. In contrast, SP is relatively simple and uses path matching to prune the set of queries for which predicate evaluation needs to be considered, thus achieving a significant performance gain.

4 Customized Result Generation

The XML filtering solution as presented so far has focused on the matching of large numbers of queries, but has not addressed the customization of output. Support for such customization can significantly increase the complexity of processing. This section presents approaches that leverage the YFilter path matching engine for result customization; in particular, they use the path engine for shared path matching and perform post-processing on the path engine output to extract the specific XML elements needed for the customized results.

4.1 Input and Output Specifications

We focus on user queries written in a subset of XQuery. Consider Query 1 below, which is based on the Book DTD from the XQuery use cases [7].

<sections>{
  for $s in $msg//section
  where $s//figure/title = "XML processing"
  return
    <section>
      { $s/title }
      { $s//figure }
    </section>
}</sections>

This query specifies that for each section containing a figure whose title is "XML processing", a "section" element containing the title of that section and all of its

figures should be returned. Note that in a document conforming to the Book DTD, section elements may contain other sections as well as figures and other elements. This query requires results to be returned for all distinct sections matching the query, in the same order that the matching sections appear in the message.

More specifically, the queries we consider consist of a single FLWR (i.e., For-Let-Where-Return) expression, optionally enclosed in elements defined by constant tags. The FLWR expression contains the following clauses in the specified order:

• A for clause containing a variable name and a path expression;
• An optional where clause that contains a set of conjunctive predicates, each of which takes the form of a triplet: path expression, op, constant;
• A return clause that contains interleaved constant tags and path expressions, where all constant tags have a matching close tag.

The current implementation in this work does not support the let clause. The semantics of such queries is as follows: The for clause creates an ordered sequence of variable bindings to distinct element nodes. The where clause, if present, restricts the set of bindings passed to the return clause. The return clause is invoked once for each variable binding. At each invocation of the return clause, tags cause the construction of new XML fragments and path expressions select nodes from the current variable binding. The final result of the FLWR expression is an ordered sequence of the results of these invocations.

For conciseness, the path expression in a for clause is referred to as the binding path, those in a where clause as predicate paths, and those inside a return clause as return paths. In this work, we consider predicate and return paths that are relative to the binding path of that query (i.e., they are prefixed by the variable name used in the binding path). As in our previous work on XML filtering, a path expression consists of a sequence of location steps; location steps can contain the child "/" or the descendant "//" axis, and element name tests. Path expressions containing such location steps are referred to as navigation paths. We also allow location steps to contain predicates that compare the attributes or text data of elements to a constant, referred to as value-based predicates. In this work, binding paths can contain an arbitrary number of value-based predicates in any location step. A predicate path is a navigation path with a value-based predicate attached to the last location step (which is effectively a nested path expression imposed on its binding path). A return path is simply a navigation path.2

As stated in Sect. 2, the output of the query processor is an intermediate representation that is passed on to the message factory component of the broker. In this representation, the nodes selected from the message are organized into a sequence of groups, such that each group corresponds to a single invocation of the return clause. Inside a group, nodes are contained in a sequence of lists. The sequencing of lists corresponds to the ordering of the return paths in the return clause. Each list contains the nodes matching the return path in their document order. For example, the output of Query 1 would have the following format:

......
section_i:   [title_i1], [figure_i1, ...]
section_i+1: [title_(i+1)1], [figure_(i+1)1, ...]
......

where section_i represents a group, and the numbering ..., i, i+1, ... represents the ordering of those groups. The sequence inside a group consists of a list of identifiers of title nodes (in the above example there is only a single title per section) followed by a list of identifiers of figure nodes. This representation is referred to as the groupSequence-listSequence format.

2 The approaches we describe in this section can be extended to support more general XQuery scenarios. This extension is our future work, and will not be discussed further here.
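As a concrete (and purely hypothetical) illustration, the intermediate output of Query 1 could be represented in memory along the following lines; the node ids are invented and the actual encoding used by the system is not shown in this chapter.

# One group per invocation of the return clause; inside each group, one list of
# node ids per return path ($s/title, then $s//figure), in document order.
query1_output = [
    {"group": "section_i",   "lists": [[8], [12, 15]]},    # one title, two figures
    {"group": "section_i+1", "lists": [[22], [27]]},
]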

4.2 Query Processor Details

In the YFilter system, as shown in Fig. 2, the query processor consists of two main runtime components: a shared path matching engine and a post-processing module. Given a parsed query, the compiler in the query processor inserts the navigation paths from the query into the shared path matching engine, and adds the execution plan for the remainder of the query to the post-processing module. For an incoming message, the shared path matching engine consumes the parsing events to match its contained navigation paths. The post-processing module further processes the output of the path matching engine to generate customized results in the groupSequence-listSequence format.

Our solutions to customized result generation are developed in the context of the particular output format provided by the path matching engine. For a navigation path matched by an incoming message, the path engine delivers a stream of "path-tuples", each of which represents a unique match of this path. A path-tuple contains one field per location step in the path, and the value of the field is the identifier of the element node bound to the location step. When multiple paths are matched by a message, the path engine delivers its output as streams of path-tuples, one stream for each path.

Figure 6(a) shows a node-labeled tree for a message fragment, where nodes are annotated with their assigned ids. Path-tuple streams that are output from the path engine for different paths are illustrated in Fig. 6(b). Take the stream for the path "//section//figure": it contains three path-tuples, and each path-tuple contains two node ids, representing a unique combination of the two location step bindings.

The path engine guarantees that path-tuples in each stream are produced such that the node ids in the last field of the path-tuples appear in monotonically increasing order. This stream order is exploited in the processing algorithms as described in the following sections. It is worth noting, however, that ordering on other fields of path-tuples is not guaranteed.
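A hypothetical encoding of such path-tuple streams is sketched below; the node ids are invented and only illustrate the per-stream ordering guarantee on the last field.

# One stream per navigation path; one tuple per match, one node id per location step.
streams = {
    "//section":         [(2,), (6,)],
    "//section//figure": [(2, 4), (6, 8), (2, 9)],   # ordered by the last (figure) field
    "//section/title":   [(2, 3), (6, 7)],
}

# The last field increases monotonically within each stream; other fields need not be ordered.
figs = streams["//section//figure"]
assert all(a[-1] < b[-1] for a, b in zip(figs, figs[1:]))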

Fig. 6 An example of the NFA execution

The following three subsections present three different query processing approaches that differ in the extent to which they exploit the path matching engine. In all of them, post-processing of the matching engine output is done via query plans using relational-style operators. Such a post-processing plan is created per XQuery query (i.e., the post-processing phase is not shared). The discussion on shared post-processing is postponed to Sect. 4.7.

Much of the subtlety of developing solutions to this problem arises from the inherent tension between shared processing at the lower level (which is essential for good performance) and customized query result generation. The matching engine returns the path-tuples in a stream in a single, fixed order to all queries that include the corresponding path. The paths, however, may be used quite differently by the various queries, and thus potential inconsistencies such as unintended duplicates or ordering problems can arise with aggressive path sharing (these cases will be discussed in detail shortly). In the following, we describe these approaches in order of increasing path sharing, and focus on how the additional complications raised by increased sharing are addressed.

4.3 Shared Matching of “for” Clauses

The first approach uses the path matching engine to process only binding paths (i.e., paths that appear in for clauses). In this approach, the navigation part of the binding path of each query is inserted into the engine. Then, during the processing of a message, the output of the engine for each path is directed to the post-processing plans of its corresponding queries. This approach is referred to as PathSharing-F. Consider Query 2 below:

<figures>{
  for $f in $msg//section[@id ≤ 2]//figure
  where $f/title = "XML processing"
  return
    <figure>
      { $f/image }
    </figure>
}</figures>

Fig. 7 A query plan using PathSharing-F

Figure 7 highlights the post-processing plan for this query under PathSharing-F. In the figure, the multiple arrows above the matching engine represent the streams of path-tuples (note that queries that have a common binding path share a common stream). The thick arrow denotes the stream used by Query 2, which contains the path-tuples matching the binding path "//section//figure". In the following, the last field of these path-tuples is referred to as the binding field, because it contains the ids of the nodes that are actually bound by the binding path. These nodes are referred to as the BoundNodes. The box above the thick arrow contains the post-processing execution plan. The operators in this plan are, from the bottom up:

Selection

A selection operator is placed at the bottom of a query plan to evaluate any value-based predicates (e.g., comparisons of the attributes or text data of elements to a constant) attached to a binding path. The evaluation is done for each path-tuple by checking predicates against the nodes referenced by the path-tuple. Selection emits only those path-tuples for which all predicates evaluate to True.

Duplicate Elimination (DupElim)

The XQuery specification requires that duplicate nodes bound to a path be eliminated based on node identity [4]. Accordingly, we define duplicates in the stream for a binding path as path-tuples that contain the same node id in the binding field. Such duplicates arise when multiple path-tuples in a stream reference the same BoundNode.

For example, consider Query 2 and the XML fragment:

<section id="1"><section id="2"><figure><title>XML processing</title></figure></section></section>

The matching engine outputs two path-tuples for the binding path. The first corresponds to "<section id="1"> ... <figure>" and the second to "<section id="2"> ... <figure>". These two path-tuples reference the same BoundNode (the figure element), so the second could cause redundant work and produce a duplicate result. The DupElim operator avoids these problems by ensuring that each BoundNode is emitted at most once. In this case, a simple scan-based DupElim operator can be used because, as described in the previous section, path-tuples in the stream are ordered by their binding field. It should be noted, however, that DupElim cannot be pushed before the selection because it is not known which (if any) of the path-tuples referencing the same BoundNode will pass the selection.
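A minimal sketch of such a scan-based DupElim, relying only on the stream being ordered by its binding (last) field, is shown below; the tuples are invented for illustration.

def dup_elim(path_tuples):
    last_bound = None
    for t in path_tuples:
        bound = t[-1]                 # the binding field is the last field of the path-tuple
        if bound != last_bound:       # stream order makes a single comparison sufficient
            last_bound = bound
            yield t

# Two tuples reaching the same figure node (id 5) through different sections:
print(list(dup_elim([(1, 5), (2, 5), (1, 9)])))   # -> [(1, 5), (1, 9)]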

Where-Filter

This operator evaluates the where clause on each path-tuple until a predicate in the where clause evaluates to False or the entire where clause evaluates to True. Path-tuples in the latter case are emitted. For each path-tuple, a predicate path is evaluated with a tree search routine that uses a depth-first search in the sub-tree of the parsed message rooted at the BoundNode of the path-tuple. The search routine for a path returns True as soon as any node satisfying the predicate is found.

Return-Select

This operator applies the return clause to the BoundNodes of the path-tuples that survive the Where-Filter. It uses the tree search routine to select nodes for each return path. Unlike the Where-Filter, however, the tree search routine here must retrieve all nodes matching a return path rather than stopping at the first match. In addition, it generates results in the groupSequence-listSequence format. Each input path-tuple causes the creation of a new group. The ordering of return paths in the query defines the sequence of lists within each group. For each list, the nodes selected for the corresponding return path are placed in the order that they appear in the message.

Recall that the results of a FLWR expression must be ordered in accordance with the order of the variable bindings of the for clause. Since the stream for the binding path is ordered in this way, and the remaining processing steps do not change that order, we are assured that the order produced by PathSharing-F is correct.

4.4 Shared Matching of “Where” Clauses

PathSharing-F only uses the path matching engine to process binding paths. The next approach, PathSharing-FW, in addition pushes the navigation part of the predicate paths from the where clause into the matching engine to exploit further sharing. Recall that predicate paths are defined to be relative to the binding paths. Since the matching engine treats all paths as being independent, the predicate paths must first be extended by pre-pending their corresponding binding path. For example, consider Query 3 below:

<sections>{
  for $s in $msg//section
  where $s/title = "XML" and $s/figure/title = "XML processing"
  return
    <section>
      { $s//section//title }
      { $s/figure }
    </section>
}</sections>

The first predicate path "/title" is transformed into "//section/title" and the second becomes "//section/figure/title". These extended predicate paths, along with the binding path, are inserted into the matching engine. Note that since common prefixes of paths are shared in the matching engine, the extension of these paths does not add significantly to their processing cost.

As in PathSharing-F, the path-tuple streams for each query are then post-processed by a query plan that executes the remaining portion of that query. This arrangement is shown in Fig. 8. The stream corresponding to a binding path is passed through a selection operator and a DupElim operator as before. The output of the DupElim operator is then matched with the streams corresponding to the predicate paths. The path-tuples resulting from the matching process are piped to a Return-Select that works as described before.

In PathSharing-FW, the Where-Filter of PathSharing-F is replaced by a left-deep tree of semijoins with the binding path stream as the leftmost input. Recall that the predicate paths are extended by pre-pending them with the corresponding binding path. Thus, the common field on which each semijoin will match is the binding field, i.e., the last common field between the binding path tuples and the predicate path tuples. The result of a semijoin, therefore, is a stream containing only those binding path tuples that have matching predicate path tuples. Figure 8 shows an example for the leftmost semijoin.

The semijoin operators can be implemented using a simple merge-based algorithm if it is known that the predicate path streams are delivered in monotonically increasing order of BoundNode id. In general, however, there are cases where such ordering cannot be assumed.

Fig. 8 A query plan using PathSharing-FW

Consider the execution of Query 3 when applied to the following XML fragment:

<section><section><figure><title>XML processing</title></figure></section><figure><title>XML processing</title></figure></section>

In this case, the stream for the predicate path "//section/figure/title" would contain a path-tuple corresponding to (section2, figure1, title1) followed by a path-tuple corresponding to (section1, figure2, title2), where the subscript indicates the first or the second occurrence of the element name. This stream is not properly ordered by the binding field (i.e., "section"). In such cases, since the binding path stream is ordered properly, we can use a hash-based implementation of semijoin where the binding path stream is used as the probing stream. Sufficient conditions for determining when the more efficient merge-based approach can be used are discussed in Sect. 4.6. Note, however, that both approaches order the output correctly, resulting in semantics identical to those provided by PathSharing-F.
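The following sketch shows a hash-based semijoin in this spirit: it only assumes that the binding-path stream is ordered by its binding field (which is guaranteed), while the predicate-path stream may arrive in any order. The tuple layouts are simplified assumptions for illustration.

def semijoin(binding_tuples, predicate_tuples, pred_binding_pos):
    # Hash the binding-field values that occur in the predicate-path stream...
    matched = {t[pred_binding_pos] for t in predicate_tuples}
    # ...then probe with the ordered binding-path stream, preserving its order.
    return [t for t in binding_tuples if t[-1] in matched]

binding   = [(1,), (2,), (3,)]          # //section tuples; binding field = section id
predicate = [(2, 5, 6), (1, 8, 9)]      # //section/figure/title tuples, unordered by section
print(semijoin(binding, predicate, pred_binding_pos=0))   # -> [(1,), (2,)]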

4.5 Shared Matching of "Return" Clauses

The third alternative approach, PathSharing-FWR, aims at further increasing sharing by also pushing the return paths into the path matching engine. Return paths differ from predicate paths in that they do not constrain the set of matching binding path tuples, so the semijoin approach cannot be used for them. Instead, outer-join semantics are required.

This particular work requires a slightly more specialized operator than a generic outer-join, however, because results must be generated in the groupSequence-listSequence format.

Fig. 9 A query plan using PathSharing-FWR

Thus, we have implemented our own n-way outer-join operator, called OuterJoin-Select. As Fig. 9 shows, OuterJoin-Select takes as its leftmost input the binding path stream resulting from the semijoins of the PathSharing-FW approach. It performs left outer joins on the binding field with each of the return path streams. Generation of the results in the required format is performed as part of the outer join processing. Each path-tuple in the binding path stream causes the creation of a new group. The outer join between this path-tuple and a return path stream results in a new list within the group, containing the node ids in the last field of the return path tuples that have matched the binding path tuple. If no such matches are found, an empty list is kept in the group for this return path. The issue of hash- versus merge-based implementation of OuterJoin-Select is similar to that of the semijoins.

Note from Fig. 9 that DupElim operators are required on each of the return path streams to prevent duplicate results from being generated by OuterJoin-Select. Here, the notion of duplicates is defined on the combination of the binding field and the last field of the path-tuple, called the return field. Recall that a return path stream is always ordered by the return field. If it also arrives ordered by the binding field, a scan-based approach suffices for DupElim. Otherwise, hashing is used.

As can be seen, PathSharing-FWR, the approach that exploits path sharing to the fullest extent, requires the most sophisticated post-processing. As mentioned earlier, this complexity results from the tension between shared path matching and result customization. It is important to note that this problem cannot be easily solved in the path matching engine. Consider a path expression that is the binding path in one query and a return path in another. In this case, the path-tuple stream produced for that path expression will be used (by different queries) as two different types of streams. Since the two types of streams have different notions of duplicates, duplicate elimination cannot be done in the engine, but must be done in a usage-specific manner during post-processing. Similar issues arise with the ordering of path-tuples expected by the different uses of the stream.
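A rough sketch of such an n-way left outer join that emits groupSequence-listSequence output directly is given below. It uses hash-based matching on the binding field and assumes, purely for illustration, that the binding field is the first field of each return-path tuple.

from collections import defaultdict

def outerjoin_select(binding_tuples, return_streams):
    # Index each return-path stream: binding-field value -> return-field node ids.
    indexes = []
    for stream in return_streams:
        idx = defaultdict(list)
        for t in stream:
            idx[t[0]].append(t[-1])
        indexes.append(idx)
    groups = []
    for bt in binding_tuples:           # the binding stream drives group creation and order
        bound = bt[-1]
        groups.append({"bound": bound,
                       "lists": [idx.get(bound, []) for idx in indexes]})
    return groups

titles  = [(2, 3), (6, 7)]              # return-path tuples (binding field, ..., return field)
figures = [(2, 4), (2, 9)]
print(outerjoin_select([(2,), (6,)], [titles, figures]))
# -> [{'bound': 2, 'lists': [[3], [4, 9]]}, {'bound': 6, 'lists': [[7], []]}]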

4.6 Simplifying Post-Processing

Duplicates and stream ordering are two fundamental issues that complicate post-processing for customized result generation. With additional knowledge, however, it is sometimes possible to infer cases when duplicates cannot arise, or when path-tuples will arrive in a needed order. In the first case, DupElim operators can be removed from the post-processing plans. In the second case, cheaper scan- or merge-based operator implementations can be used in place of the more expensive hash-based ones.

We have derived a set of sufficient conditions that enable the detection of some situations where post-processing can be simplified. These conditions involve the presence of "//" axes in queries, and the potential for recursive elements (i.e., elements that have the same element name and contain each other) in the messages. The first type requires examining the queries, and the second can be checked by examining a DTD, if present. The set of sufficient conditions and a detailed description of how they are applied to optimize post-processing plans are provided in [13].

4.7 Shared Post-Processing

A common feature of the three approaches to sharing path matching among queries is that they all require a separate post-processing plan for each query. We have also developed an initial set of techniques that can further improve sharing by allowing some of the post-processing work to be shared across related but non-identical queries, in particular, ones that have path expressions (and hence, path-tuple streams) in common. Similar in spirit to some techniques proposed for shared Continuous Query (CQ) processing over (typically non-XML) data streams [9, 10, 17, 20, 22], the techniques developed in this work are highly tailored to the unique set of operators used and to a specific data flow through them for processing large numbers of XQuery queries. Examples of these techniques include query rewriting, shared GroupBy for OuterJoin-Select, selection-DupElim pull-up, and shared selection. The interested reader is referred to [13] for details of these techniques.

4.8 Performance Results for Customized Result Generation

The experiments reported in [13] have compared the performance of the three alternatives for exploiting YFilter's shared path matching engine, with and without optimizations. The experiments also investigated the performance of a suite of techniques to share the post-processing among queries. These results can be summarized as follows:

• PathSharing-FWR, when combined with optimizations based on queries and the DTD, usually provides the best performance. This approach is the most aggressive of the three in terms of path sharing. Without optimizations, however, PathSharing-FWR performs quite poorly, due to high post-processing costs.
• Optimization of query plans using query information improves the performance of all alternatives, and the addition of DTD-based optimizations improves them further.
• For non-recursive data, DTD-based optimizations can remove all DupElim operators and turn all hash-based implementations into merge/scan-based ones. Recursive data, however, stresses the post-processing of queries containing "//" axes and limits the effectiveness of optimizations.
• Finally, experiments on extending PathSharing-FWR with shared post-processing showed excellent scalability improvements, allowing the processing of 100,000 queries in less than half a second.

5 Related Work

Our work on XML message brokering is related to Continuous Query (CQ) processing, publish/subscribe, XML filtering, XML stream processing, and multi-query processing.

CQ systems support shared processing of multiple standing queries over (typically non-XML) data streams. The concept of expression signatures was introduced by TriggerMan [17]. Using such expression signatures, NiagaraCQ [9, 10] incrementally groups query plans, and CACQ [22] supports the sharing of physical operators among tuples. OpenCQ [20] uses grouped triggers for CQ condition checking. Our techniques for sharing post-processing, though similar in spirit to those used in some of these systems, are developed particularly for XQuery processing.

Publish/subscribe systems, e.g., Le Subscribe [14] and Xyleme [23], match incoming events with a very large number of subscriptions, each of which is typically a set of conjunctive predicates. These systems use restricted query languages and data structures tailored to the query languages to achieve high system throughput.

A number of XML filtering systems have been developed to efficiently match a large set of path queries with streaming documents. XFilter [3] builds a Finite State Machine (FSM) for each path query and employs a query index on all the FSMs to process all queries simultaneously. XTrie [8] supports shared processing of the common substrings of path expressions which only contain parent-child operators. In [15], all path expressions are combined into a single DFA, resulting in good performance but with significant limitations on the flexibility of the approach. YFilter and Index-Filter are compared through a detailed performance study in [6]. MatchMaker [19] supports shared tree pattern matching using disk-resident indexes on the tree patterns, with limited filtering performance. XPush [16] builds a pushdown automaton for a subset of tree-pattern queries, sharing both path navigation and predicate evaluation among them. It requires some precomputation of the machine to achieve good performance. As stated previously, these systems only provide the lowest level of functionality required by XML message brokers.

In the context of XML stream processing, some other recent work uses transducer-based mechanisms for processing path expressions with qualifiers [21] or XQuery containing FLWR expressions [24]. These approaches, however, are developed for single-query processing.

Multi-query processing [25–27] considers small numbers of queries (e.g., tens) and uses heuristics to approximate the optimal global plan. In contrast, high-volume XML message brokering needs to handle sets of queries orders of magnitude larger in a dynamic environment. Thus, scalability of the approach and incremental construction of query plans are the major concerns unique to our work.

6 Conclusions

In this chapter, we presented our approaches to shared processing of queries for XML filtering and transformation in the context of high-performance XML message brokering. For XML filtering, we developed an NFA-based shared path matching engine that provides flexibility and excellent performance by exploiting overlap of path expressions. To support the customization of output, we developed three different ways of exploiting the shared path matching engine. The most aggressive of the three in terms of path sharing is shown to perform best when combined with optimizations based on the queries and the DTD. Moreover, when post-processing of the path matching output is also shared among queries, excellent scalability can be achieved.

The results presented in this chapter, as well as other efforts cited in the related work, have demonstrated that XML message brokering is a rich source of research issues. Furthermore, as XML continues to gain acceptance in technologies such as Web Services, Event-based Processing, Application Integration, and Mobile Operators, this work will be of increasing commercial importance. As such, there are many important problems to be addressed in future work. These include incorporation of additional features such as ordering in result customization, support of data aggregation within each message and across multiple messages, and ultimately, the extension of XML message brokering in a wide-area distributed environment through the investigation of content-based routing. Research on all of these issues is currently underway.

References

1. Microsoft BizTalk Server (2002). http://www.microsoft.com/biztalk
2. Sybase Financial Fusion Message Broker (2003). http://www.sybase.com/products/middleware/messagebroker
3. M. Altinel, M.J. Franklin, Efficient filtering of XML documents for selective dissemination of information, in Proc. of Int'l Conf. on Very Large Databases (2000)

4. S. Boag, D. Chamberlin, M. Fernandez, D. Florescu, J. Robie, J. Simeon, XQuery 1.0: an XML query language (2002). W3C working draft. http://www.w3.org/TR/xquery
5. T. Bray, J. Paoli, C. Sperberg-McQueen, E. Maler, Extensible markup language (XML) 1.0. W3C recommendation (2000). http://www.w3.org/TR/2004/REC-xml-20001006
6. N. Bruno, L. Gravano, N. Koudas, D. Srivastava, Navigation- vs. index-based XML multi-query processing, in Proc. of IEEE Int'l Conf. on Data Engineering (2003)
7. D. Chamberlin, P. Fankhauser, D. Florescu, M. Marchiori, J. Robie, XML query use cases. W3C working draft (2002). http://www.w3.org/TR/xmlquery-use-cases/
8. C.Y. Chan, P. Felber, M.N. Garofalakis, R. Rastogi, Efficient filtering of XML documents with XPath expressions, in Proc. of IEEE Int'l Conf. on Data Engineering (2002)
9. J. Chen, D.J. DeWitt, J.F. Naughton, Design and evaluation of alternative selection placement strategies in optimizing continuous queries, in Proc. of IEEE Int'l Conf. on Data Engineering (2002)
10. J. Chen, D.J. DeWitt, F. Tian, Y. Wang, NiagaraCQ: a scalable continuous query system for Internet databases, in Proc. of ACM SIGMOD Conf. on Management of Data (2000)
11. J. Clark, S. DeRose, XML path language (XPath)—version 1.0 (1999). http://www.w3.org/TR/xpath
12. Y. Diao, M. Altinel, M.J. Franklin, H. Zhang, P. Fischer, Path sharing and predicate evaluation for high-performance XML filtering. ACM Trans. Database Syst. (2003)
13. Y. Diao, M.J. Franklin, Query processing for high-volume XML message brokering, in Proc. of Int'l Conf. on Very Large Databases (2003)
14. F. Fabret, H.A. Jacobsen, F. Llirbat, J. Pereira, K.A. Ross, D. Shasha, Filtering algorithms and implementation for very fast publish/subscribe, in Proc. of ACM SIGMOD Conf. on Management of Data (2001)
15. T.J. Green, G. Miklau, M. Onizuka, D. Suciu, Processing XML streams with deterministic automata, in Proc. of IEEE Int'l Conf. on Database Theory (2003)
16. A. Gupta, D. Suciu, Stream processing of XPath queries with predicates, in Proc. of ACM SIGMOD Conf. on Management of Data (2003)
17. E.N. Hanson, C. Carnes, L. Huang, M. Konyala, L. Noronha, S. Parthasarathy, J. Park, A. Vernon, Scalable trigger processing, in Proc. of IEEE Int'l Conf. on Data Engineering (1999)
18. J.E. Hopcroft, J.D. Ullman, Introduction to Automata Theory, Languages and Computation (Addison-Wesley, Reading, 1979)
19. L.V. Lakshmanan, S. Parthasarathy, On efficient matching of streaming XML documents and queries, in Proc. of Int'l Conf. on Extending Database Technology (2002)
20. L. Liu, C. Pu, W. Tang, Continual queries for Internet scale event-driven information delivery. IEEE Trans. Knowl. Data Eng. (1999)
21. B. Ludäscher, P. Mukhopadhyay, Y. Papakonstantinou, A transducer-based XML query processor, in Proc. of Int'l Conf. on Very Large Databases (2002)
22. S.R. Madden, M.A. Shah, J.M. Hellerstein, V. Raman, Continuously adaptive continuous queries over streams, in Proc. of ACM SIGMOD Conf. on Management of Data (2002)
23. B. Nguyen, S. Abiteboul, G. Cobena, M. Preda, Monitoring XML data on the Web, in Proc. of ACM SIGMOD Conf. on Management of Data (2001)
24. D. Olteanu, T. Kiesling, F. Bry, An evaluation of regular path expressions with qualifiers against XML streams, in Proc. of IEEE Int'l Conf. on Data Engineering (2003)
25. A. Rosenthal, U.S. Chakravarthy, Anatomy of a modular multiple query optimizer, in Proc. of Int'l Conf. on Very Large Databases (1988)
26. P. Roy, S. Seshadri, S. Sudarshan, S. Bhobe, Efficient and extensible algorithms for multi-query optimization, in Proc. of ACM SIGMOD Conf. on Management of Data (2000)
27. T.K. Sellis, Multiple-query optimization. ACM Trans. Database Syst. (1988)

Fast Methods for Statistical Arbitrage

Eleftherios Soulas and Dennis Shasha

1 Motivation

In many time series applications, the data set under study is a vast stream of continuously updated data. Often those data arrive in bursts. Many examples are found in finance, telecommunications, physics, biology, astronomy, among others [3, 18–20].

The main motivation of our project is to infer linear relationships among multiple financial variables, yielding expressions of the form 4 ∗ DJI − 35 ∗ SNP500 = 2.4 ∗ NASDAQ. We use such relationships to back-test a simple trading strategy. For the most part we based our analyses on FOREX data, but we also experimented with non-financial datasets. Finally, we provide software capable of being used in a variety of generalized machine learning and streaming applications in many environments. The framework we developed is named StatLearn and is based on two fundamental blocks: SketchStream to filter for relevant time series (a reimplementation of StatStream [2, 4, 16]) and LearnStream to make predictions. Thus, we provide machine learning algorithms in an on-line context.

2 SketchStream

The first step of our learning framework is to select the best candidate data from a large number of possible candidates and train on them in order to optimize the accuracy of our results.

To the memory of Lefteris’s father, Christos Soulas, who supported and guided Lefteris in every step of the way. E. Soulas · D. Shasha (B) New York University, 251 Mercer St, New York, NY 10012, USA e-mail: [email protected] E. Soulas e-mail: [email protected]


Fig. 1 Cooperative vs uncooperative data

To implement this, we find the pairs most correlated (or anti-correlated) with our target variable, in an effort to combine all of the most statistically significant data sources that can explain the variance of the time series we are trying to predict. Therefore, we are interested in calculating all the pairwise correlations of Ns streams in a sliding window of size sw. Doing this naively requires O(sw ∗ Ns^2) time. For applications with many streams, this would take too long. Thus, our task is to study efficient algorithms to speed this up. We categorize the time series as cooperative or uncooperative.

• Cooperative—Time series that exhibit a substantial degree of regularity. There are several data reduction techniques that can be used in order to compress long time series into a few coefficients, such as Fourier Transform methods [5–7], Wavelet Transforms [8, 9], Singular Value Decompositions [10], and Piecewise Constant Approximations [11].
• Uncooperative—Time series with no such periodic regularities. An example is the sequence of returns of a stock price, because increases or decreases in returns do not necessarily follow any particular pattern.

For uncooperative time series, the number of Fourier coefficients required to describe a time series grows large, as shown in Fig. 1. Stock market returns ((today's_price − yesterday's_price) / yesterday's_price) are "white noise-like", that is, there is almost no correlation from one time point to the next. For collections of time series that do not concentrate power in the first few Fourier/Wavelet coefficients, which we have termed uncooperative, we use a sketch-based algorithmic framework called SketchStream. The SketchStream framework partitions time as follows:

Fig. 2 Time partitioning in SketchStream

• timepoint—The smallest unit of time over which the system collects data, for example a second.
• basic window—A consecutive subsequence of timepoints over which the system maintains a compressed representation, for example two minutes.
• sliding window—A user-defined consecutive subsequence of basic windows over which the user wants statistics, for example an hour.

This way the user can submit queries for highly correlated stream pairs which were above a certain threshold for the past sliding window, for example between 12 noon and 1 PM, between 12:02 and 1:02, between 12:04 and 1:04, etc. Figure 2 depicts a graphical representation of these time intervals in SketchStream. The choice of sw and bw depends on the application under consideration because bw is the delay before results are reported. Smaller values of bw require more work.

2.1 Intuition & Guarantees of the Sketch Approach

As pointed out above, the Fourier Transform behaves very poorly for "white noise" style data. For such data, sketches are invoked.

The sketch approach, as developed by Kushilevitz et al. [14], Indyk et al. [15], and Achlioptas [12], provides a very elegant bound: with high probability, a random mapping taking points in R^m to points in (R^d)^(2b+1) (the (2b+1)-fold cross-product of R^d with itself) approximately preserves distances (with higher fidelity the larger b is). All these bounds follow from the seminal work of Johnson and Lindenstrauss [13]. The JL lemma shows that the approximation of the original distance may achieve sufficient accuracy with high probability as long as the synopsis size is larger than a given bound. This synopsis is used to filter out the non-correlated time series pairs. This approach will be further addressed in the following sections.

We use correlation and distance more or less interchangeably because one can be computed from the other once the data has been normalized. Specifically, Pearson correlation is related to Euclidean distance as follows:

D^2(x̂, ŷ) = 2 ∗ (1 − corr(x, y)),

where x̂ and ŷ denote the normalized series.

Given a point x ∈ R^m, we compute its dot product with d random vectors r_i ∈ {1, −1}^m. The first random projection of x is given by y_1 = (x · r_1, x · r_2, ..., x · r_d). We compute 2b more such random projections y_2, ..., y_(2b+1). If w is another point in R^m and z_1, ..., z_(2b+1) are its projections using dot products with the same random vectors, then the median of ||y_1 − z_1||, ||y_2 − z_2||, ..., ||y_(2b+1) − z_(2b+1)|| is a good estimate of the distance between x and w. It lies within a θ(1/d) factor of that distance with probability 1 − (1/2)^b.
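A small numerical illustration of this guarantee follows: two points are projected onto 2b+1 independent groups of d random ±1 vectors, and the median of the per-group distances (normalized by sqrt(d) so that it is comparable to the original distance) is used as the estimate. All parameter values here are arbitrary choices made for the demonstration.

import numpy as np

rng = np.random.default_rng(0)
m, d, b = 256, 16, 4                       # point dimension, vectors per group, 2b+1 groups
x = rng.standard_normal(m)
w = rng.standard_normal(m)

estimates = []
for _ in range(2 * b + 1):
    R = rng.choice([-1.0, 1.0], size=(d, m))        # one group of d random +/-1 vectors
    y, z = R @ x, R @ w                             # projections of x and w
    estimates.append(np.linalg.norm(y - z) / np.sqrt(d))

print("true distance   :", np.linalg.norm(x - w))
print("median estimate :", np.median(estimates))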

2.2 Sketch Implementation

We want to find the most highly positively and negatively correlated pairs over sliding window lengths. That could entail redoing the above operations every basic window. Because that could be expensive, SketchStream makes use of structured random vectors. The idea is to form each structured random vector r from the concatenation of nb = sw/bw random vectors: r = (s_1, s_2, ..., s_nb), where each s_i has length bw. Furthermore, each s_i is either u or −u, where u is a random vector in the space {−1, +1}^bw (the random vector u remains the same for all s_i; only the sign changes). This choice is determined by another random binary vector b of length nb: if b_i = 1 then s_i = u, and if b_i = 0 then s_i = −u. The structured approach leads to an asymptotic performance of O(nb) integer additions and O(log bw) floating point operations per time series, per random vector. There is a 30 to 40 factor improvement in runtime over recomputing the structured vectors from scratch.

In order to compute the dot products with structured random vectors, we first compute dot products with the random vector u. For each random vector r of length equal to the sliding window length sw = nb ∗ bw, the dot product with each successive length-sw chunk of the stream (successive chunks are one timepoint apart, and bw is the length of a basic window) is computed to form the sketch.

Example

The theory above may be somewhat difficult to understand at first, so consider this simple example. As we mentioned, a random vector r_bw is constructed as follows:

r_bw = (r_1, r_2, ..., r_bw), where r_i = +1 with probability 1/2 and −1 with probability 1/2.

To form a random vector of length sw, another random vector b of length nb = sw/bw, which we call the control vector, is constructed as follows:

b = (b_1, b_2, ..., b_nb), where b_i = +1 with probability 1/2 and −1 with probability 1/2.

The random vector r for a sliding window is then built as follows:

r = (r_bw ∗ b_1, r_bw ∗ b_2, ..., r_bw ∗ b_nb)

Now consider the following example. Assume we are given a time series X = (x_1, x_2, ...), a sliding window of size sw = 12, and a basic window of size bw = 4. If the random vector within a basic window is r_bw = (1, 1, −1, 1) and the control vector is b = (1, −1, 1), the random vector for the sliding window will be r_sw = (1, 1, −1, 1, −1, −1, 1, −1, 1, 1, −1, 1). We now form the sketches of two windows (of size sw):

X^1_sw = (x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_10, x_11, x_12),   (1)
X^5_sw = (x_5, x_6, x_7, x_8, x_9, x_10, x_11, x_12, x_13, x_14, x_15, x_16).   (2)

The sketch for X^1 (with this one random vector) is

X^1_sk = r_sw · X^1_sw
       = b_1 ∗ (r_bw · (x_1, x_2, x_3, x_4)) + b_2 ∗ (r_bw · (x_5, x_6, x_7, x_8)) + b_3 ∗ (r_bw · (x_9, x_10, x_11, x_12))
       = b · (r_bw · (x_1, x_2, x_3, x_4), r_bw · (x_5, x_6, x_7, x_8), r_bw · (x_9, x_10, x_11, x_12)).   (3)

The sketch for X^5 is

X^5_sk = r_sw · X^5_sw
       = b_1 ∗ (r_bw · (x_5, x_6, x_7, x_8)) + b_2 ∗ (r_bw · (x_9, x_10, x_11, x_12)) + b_3 ∗ (r_bw · (x_13, x_14, x_15, x_16))
       = b · (r_bw · (x_5, x_6, x_7, x_8), r_bw · (x_9, x_10, x_11, x_12), r_bw · (x_13, x_14, x_15, x_16)).   (4)

This reduction of randomness does not noticeably diminish the accuracy of the sketch estimation.
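The identity exploited in Eqs. (3) and (4) can be checked numerically with a few lines of code: the dot product with the structured sliding-window vector equals the control vector b dotted with the per-basic-window dot products, which is what allows sketches to be maintained from basic-window results. The concrete numbers below are just the toy values of the example.

import numpy as np

bw, nb = 4, 3
r_bw = np.array([1, 1, -1, 1])
b = np.array([1, -1, 1])
r_sw = np.concatenate([bi * r_bw for bi in b])       # structured random vector of length sw

x = np.arange(1, 17)                                 # x_1 .. x_16
X1, X5 = x[0:12], x[4:16]                            # the two sw-length windows

direct = r_sw @ X1
via_basic_windows = b @ np.array([r_bw @ X1[i*bw:(i+1)*bw] for i in range(nb)])
assert direct == via_basic_windows                                                # Eq. (3)
assert r_sw @ X5 == b @ np.array([r_bw @ X5[i*bw:(i+1)*bw] for i in range(nb)])   # Eq. (4)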

Fig. 3 Sketch by random projection

2.3 Sketch Vector Partitioning

Suppose we have 60 random vectors (this number is a parameter that can be modified) to which each window is compared; the sketch vector is the vector of the dot products with those random vectors. In Fig. 3, we can graphically see the procedure described above.

2.4 Grid Structure

To determine the closeness of the vectors, we partition each sketch vector into sub-vectors and build data structures to determine the similarity of the sub-vectors. The algorithmic framework described above, and presented in the papers related to StatStream [2, 4, 16], behaves as follows. Suppose we are seeking points within some distance d in the original time series space.

• Partition each sketch vector s of size N into groups of some size g.
• The ith group of each sketch vector s is placed in the ith grid structure of dimension g (in Fig. 5, g = 2).
• If two sketch vectors s1 and s2 are within distance c ∗ d (c is a user-defined parameter) in more than a fraction f of the groups, then the corresponding windows are candidate-highly-correlated windows and will be checked exactly.

In Fig. 4, you can see how the sketches are compared and how the grid filtering outputs the candidate highly correlated pairs.


Fig. 4 Grid structure

Fig. 5 Grid structure

In order to derive a better intuition about the grid structure of Fig. 5, an analysis of its components is desirable. Assume a set of data points in a 2D space, where a 2-dimensional orthogonal regular grid is super-imposed on this space. In practice, the indexed space is bounded. Let the spacing of the grid be a. The indexing space is then a 2-dimensional square with diameter A, partitioned into [A/a]^2 small cells. Each cell is a 2-dimensional square with side length a. All the cells are stored in a 2-dimensional array in main memory.

In such a main memory grid structure, we can compute the cell to which a particular point belongs. Let us use (c1, c2) to denote a cell that is the c1-th in the first dimension and the c2-th in the second dimension. A point p with coordinates (x1, x2) is within the cell (⌊x1/a⌋, ⌊x2/a⌋). We say that point p is mapped to that cell. This can be easily extended to k dimensions.
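A minimal sketch of this cell computation, using integer division by the grid spacing a, is shown below; the points and spacing are arbitrary, and the structure extends directly to k dimensions.

def cell_of(point, a):
    # Map a point to its grid cell by taking the integer part of each coordinate / a.
    return tuple(int(coord // a) for coord in point)

a = 0.5
grid = {}                                  # (c1, c2) -> list of points mapped to that cell
for p in [(0.2, 1.3), (0.4, 1.1), (2.0, 0.1)]:
    grid.setdefault(cell_of(p, a), []).append(p)

print(cell_of((0.4, 1.1), a))              # -> (0, 2)
print(grid[(0, 2)])                        # the points mapped to that cell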

Fig. 6 Sketch by random projection

2.5 Results

Sketches work very well for uncooperative data. Figure 6 compares the distances of the Fourier and sketch approximations for 1000 pairs of 256-timepoint windows with a basic window of length 32. Here the sketch size is 30 and the number of Singular Value Decomposition coefficients is 30. As one can see, the sketch distances are closer to the real distances.

3 LearnStream

After finding the time series most highly positively and negatively correlated with our target using SketchStream, we use stochastic gradient descent to predict complex linear relationships on streams. The LearnStream platform is thus capable of combining stream processing and machine learning.

3.1 Sliding Window Stochastic Gradient Descent

To evaluate a classifier's performance, given by a machine learning scheme, either a special testing set or a cross-validation technique may be employed. A test set contains pre-selected examples different from those in the training set, and is used only for evaluation, not for training. If data are scarce, it is sensible to use cross-validation in order not to waste any data that could be useful in enhancing classifier performance; all data are then used both for training the classifier and for testing its performance.

More examples do not necessarily mean better classifier performance. Even though the classifier becomes better on the training set, it could actually perform worse on the test data. This is due to the possibility of overfitting the classifier function, so that it fits too tightly to the training data. As an extreme example, suppose that we want to predict the wealth of a person and one of the attributes is a unique key, the social security number. Overfitting might lead us to conclude that the most informative attribute is the social security number because, given that, we would (in our training set) know the person's wealth. In Sect. 7, we describe how we found the most appropriate time window for each application.

As with every other classifier, Stochastic Gradient Descent has to be fitted with two arrays: an array X of size [n_samples, n_features] holding the training samples, and an array y of size n_samples holding the target values (class labels) for the training samples. We train our regressor over a sliding window [t_i, t_i+sw] and then use the regressor model obtained to predict the value for t_i+sw+1. For our predictions to have meaning, we need to use the data up to time point t to predict values for t + 1. For example, we monitor the ticks of the prices for three stocks A[t_0, t_sw], B[t_0, t_sw] and C[t_0, t_sw] and we want to make predictions for the price of C[t_sw+1]. To put this in the X–y framework, all the stocks in X start from time point t_0 and the target stock in y from t_1, ..., t_n. This way we can train the model on the relationship between the present and the future data.
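The following is a minimal sketch of this sliding-window scheme using scikit-learn's SGDRegressor. The toy price streams, the window length of 72 and the constant learning rate of 0.0015 are illustrative stand-ins for the tuned values discussed in Sect. 7; it is a simplification, not a transcript of the LearnStream code.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

def sliding_window_predictions(X, y, sw):
    """Train on the window [t, t+sw) and predict the target at t+sw, for every t.
    Row t of X holds the feature streams at time t and is already paired with
    the target one step ahead (the X-y framing described in the text)."""
    preds = []
    for t in range(len(y) - sw):
        model = SGDRegressor(learning_rate="constant", eta0=0.0015,
                             penalty="l2", max_iter=5, tol=None)
        model.fit(X[t:t + sw], y[t:t + sw])
        preds.append(model.predict(X[t + sw:t + sw + 1])[0])
    return np.array(preds)

# Lagged framing: the streams up to time t predict the target at t + 1.
A, B, C = (np.cumsum(np.random.randn(500)) for _ in range(3))   # toy price streams
X = np.column_stack([A, B, C])[:-1]
y = C[1:]
preds = sliding_window_predictions(X, y, sw=72)
```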

3.2 Iterative Tuning of Sliding Window

We expect different outcomes with different window sizes. If the algorithm trains on 50 time points, the linear model will be different from when it trains on 5000. We don’t know a priori which sliding window length will be capable of predicting with the smallest error. Therefore, an iterative tuning process is performed off-line to determine what would be a suitable interval for this particular dataset. In particular, when we have a plethora of data at our disposal, we execute parallel versions of our algorithm with various window sizes. For each one we calculate the error of the prediction (relative to the correct result) and then we pick the sliding window which minimizes this error. Our assumption is that for a specific dataset the variance of its values will not dramatically change in the future. Varying this might be necessary for your application.
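One possible shape for this off-line tuning loop, assuming a callable such as the sliding-window predictor sketched in Sect. 3.1 is available:

```python
import numpy as np

def tune_window(X, y, candidate_windows, run):
    """Off-line window tuning: `run(X, y, sw)` returns one-step-ahead predictions
    for a given window length (e.g. the sliding-window predictor sketched in
    Sect. 3.1); keep the window with the smallest mean relative error."""
    errors = {}
    for sw in candidate_windows:
        preds = run(X, y, sw)
        truth = y[sw:sw + len(preds)]                 # preds[i] targets y[sw + i]
        errors[sw] = np.mean(np.abs(preds - truth) / np.abs(truth))
    best = min(errors, key=errors.get)
    return best, errors
```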

3.3 Exponential Random Picking

In time series analysis, recent data is often more relevant than older data. For example, the present stock price encodes more recent information than do historical prices. Traders do not want to forget the past entirely, but they aim to place emphasis on the present, as the balances may have changed concerning old statistics. Therefore, in every time window we perform bootstrapping in order to try to decrease the variance of our classifier, giving preference to more recent values.

Fig. 7 Exponential selection

For example, if SGD were to train on a sliding window with the values 1, 2, 3, 4, 5, our algorithm randomly forms a new dataset which may have the form 3, 1, 4, 5, 5.⁶ The random selection is based on an exponential distribution which selects values closer to current time points with higher probability. Figure 7 shows the probability of every time point in a sliding window being a candidate for the new training dataset.
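A minimal sketch of this exponential resampling; the decay knob scale is an assumption of ours, since the chapter does not specify the exact distribution parameters:

```python
import numpy as np

rng = np.random.default_rng()

def exponential_resample(window, scale=0.3):
    """Bootstrap a training set of the same size as `window`, sampling indices
    with replacement and with exponentially higher probability for recent
    points (the last element is the most recent)."""
    n = len(window)
    ages = np.arange(n)[::-1]                       # age 0 = most recent point
    weights = np.exp(-ages / (scale * n))
    weights /= weights.sum()
    idx = rng.choice(n, size=n, replace=True, p=weights)
    return np.asarray(window)[idx]

print(exponential_resample([1, 2, 3, 4, 5]))        # e.g. [3 1 4 5 5]
```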

3.4 Warm Start/Use Historical Predictions

We may wish to focus more on recent data, but that does not mean we should neglect the past. Specifically, we used the coefficients produced by the last execution in the previous sliding window. By doing this we affect the algorithm's initialization point. This method works surprisingly well; much better than plain Stochastic Gradient Descent. In retrospect, this was a reasonable approach because our data is sequential, so the warm start would possibly give a good head start in contrast to starting from no a priori information.
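One way to realize the warm start with scikit-learn is to keep a single SGDRegressor with warm_start=True, so that each fit call starts from the coefficients of the previous window. This is a sketch of the idea, not the chapter's exact implementation:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# With warm_start=True, each fit() call starts from the coefficients left over
# from the previous window instead of from scratch.
model = SGDRegressor(learning_rate="constant", eta0=0.0015, penalty="l2",
                     max_iter=5, tol=None, warm_start=True)

A, B, C = (np.cumsum(np.random.randn(500)) for _ in range(3))   # toy price streams
X, y = np.column_stack([A, B, C])[:-1], C[1:]

sw, preds = 72, []
for t in range(len(y) - sw):
    model.fit(X[t:t + sw], y[t:t + sw])             # warm-started from the previous window
    preds.append(model.predict(X[t + sw:t + sw + 1])[0])
```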

3.5 Learning Rate

Despite the abundance of methods for learning from examples, there are only a few that can be used effectively for online learning. Classic batch training algorithms cannot straightforwardly handle non-stationary data. As described in [21], there exists the problem of "catastrophic interference", in which training with new examples interferes excessively with previously learned examples, leading to saturation and slow convergence. Variants of stochastic gradient descent methods are proposed in

⁶We allow selecting the same value more than once.

[22], which mainly implement an adaptive learning rate for achieving online learning. In Sect. 7, we discuss how we selected the appropriate α for our application.

4 Experimental Results

In this section, we present the experimental results of our work with real data and discuss the purity and the associated attributes of the three algorithms we studied. The experiments compared three versions of Stochastic Gradient Descent: • Plain Stochastic Gradient Descent. From here on we will refer to it as plainSGD. • Stochastic Gradient Descent with random sample picking. From here on we will refer to it as approxSGD. • Stochastic Gradient Descent with random sample picking and warm start. From here on we will refer to it as approxWarmSGD.

5 Datasets

The experiments were mainly conducted using three datasets: two proprietary FOREX (foreign exchange) datasets, with which we tested our trading strategy, as well as one publicly available household electric power consumption dataset.

5.1 FOREX Data Sets

Our primary dataset was obtained from Capital K Partners. It is composed of 28 different currency pairs with ticks aggregated every 100 ms. There are in total 80 days of data, starting from 01/05/2012 until 09/08/2012. Every pair consists of 42 features (such as the best bid/ask price, best bid/ask volume, mid prices, bid/ask range, bid/ask slope, etc.). For the sake of simplicity, we used only the bid/ask prices. The procedure we followed was to use all the pairs associated with the EUR currency (i.e., EURCAD, EURAUD, EURGBP) to predict the price of the EURUSD pair. In other words, we form the training matrix X by using the prices from all the currency pairs and the target array y by using the prices from the EURUSD pair. The problem is that we could not predict both bid and ask prices simultaneously as they depend on different attributes.⁷ There are two separate instances of our algorithm running in parallel: one associated with the bid price and one with the ask price.

In this way, we can combine the predictions and create a buy/sell signal for our trading system.

The second FOREX dataset was obtained from a hedge fund in Singapore. It is far simpler and contains ticks aggregated every 5 minutes for a time period of around three months. Because 5 minutes is so long, it omits a lot of information, so we do not expect results consistent with the 100 millisecond data.

⁷We predicted the mid price (bid + ask)/2 but we couldn't use it effectively in our trading strategy later on.

5.2 Individual Household Electric Power Consumption Data Set [1]

This dataset contains measurements of electric power consumption in one household with a one-minute sampling rate over a period of almost 4 years. Different electrical quantities and some sub-metering values are available. The dataset consists of roughly 2,070,000 values covering 4 years (16/12/2006–26/11/2010). Its features are:

1. Date—Date in format dd/mm/yyyy.
2. Time—Time in format hh:mm:ss.
3. Global Active Power—Household global minute-averaged active power (in kilowatt).
4. Global Reactive Power—Household global minute-averaged reactive power (in kilowatt).
5. Voltage—Minute-averaged voltage (in volt).
6. Global Intensity—Household global minute-averaged current intensity (in ampere).
7. Sub Metering 1—Energy sub-metering No. 1 (in watt-hour of active energy). It corresponds to the kitchen, containing mainly a dishwasher, an oven and a microwave (hot plates are not electric but gas powered).
8. Sub Metering 2—Energy sub-metering No. 2 (in watt-hour of active energy). It corresponds to the laundry room, containing a washing machine, a tumble drier, a refrigerator and a light.
9. Sub Metering 3—Energy sub-metering No. 3 (in watt-hour of active energy). It corresponds to an electric water heater and an air conditioner.

The reason we used a dataset so unrelated to the other two, which have an obvious financial similarity, was to cross-check the predictive accuracy of our algorithm and show that this work can be useful to fields other than financial analysis. We use attributes 1, 2, 4, 5, 6, 7, 8, 9 to predict the global active power of the house for the next minute, something which may help us optimize our electricity usage. The method is almost the same as the one we followed for the financial datasets, but somewhat simpler, as we deal with only one predictor.

6 Metrics

In this section, we describe the error metrics we used in order to compare the performance of the algorithms we tested.

6.1 Absolute/Relative Error

Experimental and measurement errors always create uncertainty in the final data. No measurement of a physical quantity can be entirely accurate. The absolute error in a measured quantity is the uncertainty in the quantity and has the same units as the quantity itself. For example, if you predict the price of a stock to be 1.3223 $ ± 0.00002 $, the 0.00002 $ is an absolute error. The relative error (also called the fractional error) is obtained by dividing the absolute error in the quantity by the quantity itself:

    relative error = absolute error / value of thing measured,

or, in terms common to error propagation,

    relative error = Δx / x,

where x is any variable. The relative error is usually more relevant than the absolute error. For example, a $1 error in the price of a cheap stock is probably more serious than a $1 error in the quote of an expensive one. Relative errors are dimensionless. When reporting relative errors, it is usual to multiply the fractional error by 100 and report it as a percentage.

Relative error was our baseline method to determine whether or not our predictions were meaningful. Scikit-learn [17] implements three more metrics for regression, which we used to further analyze our estimator; they are presented below.

6.2 Mean Squared Error

If ŷ_i is the predicted value of the ith sample and y_i is the corresponding true value, then the mean squared error (MSE) estimated over n_samples is defined as:

    MSE(y, ŷ) = (1 / n_samples) · Σ_{i=0}^{n_samples − 1} (y_i − ŷ_i)².

The best value is 0.0; higher values are worse.

6.3 Explained Variance Score

Explained variation measures the proportion to which a mathematical model accounts for the variation (dispersion) of a given data set. If ŷ is the estimated target output and y is the corresponding (correct) target output, then the explained variance is estimated as follows:

    Explained Variance(y, ŷ) = 1 − Var{y − ŷ} / Var{y}.

The best possible score is 1.0; lower values are worse.

6.4 R2 Score, the Coefficient of Determination

The coefficient of determination, denoted R², is used in the context of statistical models whose main purpose is the prediction of future outcomes on the basis of other related information. R² is a number between 0 and 1.0, used to describe how well a regression line fits a set of data. An R² near 1.0 indicates that a regression line fits the data well, while an R² closer to 0 indicates that a regression line does not fit the data very well. It is the proportion of variability in a data set that is accounted for by the statistical model. It provides a measure of how well future outcomes are likely to be predicted by the model. If ŷ_i is the predicted value of the ith sample and y_i is the corresponding true value, then the R² score estimated over n_samples is defined as:

    R²(y, ŷ) = 1 − [ Σ_{i=0}^{n_samples − 1} (y_i − ŷ_i)² ] / [ Σ_{i=0}^{n_samples − 1} (y_i − ȳ)² ],

where ȳ = (1 / n_samples) · Σ_{i=0}^{n_samples − 1} y_i.

The best possible score is 1.0; lower values are worse.
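All of these metrics can be computed directly with scikit-learn [17], plus a one-line relative error; the toy values below are purely illustrative:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, explained_variance_score, r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

rel_err = np.mean(np.abs(y_true - y_pred) / np.abs(y_true))   # average relative error
mse = mean_squared_error(y_true, y_pred)
evs = explained_variance_score(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(rel_err, mse, evs, r2)
```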

7 Parameter Tuning

As described in the previous sections, we have 5 distinct parameters in our algorithm. In order to determine the appropriate value of every parameter for every dataset, we isolated them and modified only the parameter under consideration, so as to attain the optimal solution. In particular, in order to select the optimal parameters for each dataset, we performed a grid search over a set of possible values for each parameter and selected the combination for which the classifier performs best.

Due to the fact that we base our model on an online framework, the boundaries between training and testing overlap to some extent. In particular, our algorithm continuously updates the coefficients for every sliding window and thus learns and adapts to the future constantly. For our parameter tuning phase, though, we used the first 80 % of our data to perform the grid search and fix the parameters (a sketch of this procedure follows the list below). Then we kept them the same for the testing data.

• Learning Rate (α) ∈ R
The learning rate can be any real value, so it is not feasible to test every combination. We narrowed the problem down to determining how many decimal points α should have. As a matter of fact, we came to the realization that a value with two zeros after the decimal point performs much better than a larger or a smaller learning rate. Reflecting that insight, we set α = 0.0015.
• Regularization Term ∈ {L1, L2, Elastic Net}
L1 and L2 have almost the same results, but for most of the cases L2 delivers slightly more accuracy. For the regularization coefficient C, which signifies the inverse of the regularization strength, we performed a grid search for each application in the space [10⁻³, 10³] and selected the best set of parameters.
• Loss Function ∈ {Squared Loss, Huber, Epsilon Insensitive}
Huber Loss and Squared Loss were much better on average than Epsilon Insensitive Loss. Huber was slightly better than Squared Loss for approxWarmSGD on every dataset, while for the other two algorithms it was mainly dependent on the dataset. In sum, the improvement or regression, in terms of relative error and the other metrics stated, was sufficiently minor to avoid further examination.
• Sliding Window (SW) ∈ N
The iterative tuning procedure followed in order to figure out the proper sw was explained in Sect. 3.2. The results we obtained helped us set the following values:

    SW_Capital K FOREX dataset = 72,
    SW_Singapore FOREX dataset = 100,
    SW_Electrical consumption dataset = 155.

• Approximation Time Points ≤ SW size
If SW is the sliding window on which plainSGD uses all the samples to train the model, then let approxSW be the size of the dataset used by approxSGD and approxWarmSGD. We found empirically that if we set Size_approxSW = Size_SW − 1, we produce a prediction with the smallest relative and mean squared error. We contrast this approach with setting Size_approxSW = Size_SW / 2, in order to see how our algorithm behaves when we decrease the number of time points examined. These results could change on other data sets.
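A possible shape for the grid search described above (as noted before the list), using scikit-learn's SGDRegressor on the first 80 % of the data; the candidate values, the coarse window stride and the loss names (which assume a recent scikit-learn) are illustrative assumptions:

```python
import itertools
import numpy as np
from sklearn.linear_model import SGDRegressor

def grid_search_sgd(X, y, sw):
    """Off-line tuning on the first 80% of the data: try each combination of
    learning rate, penalty and loss, score it by one-step-ahead relative error
    over a coarse pass of non-overlapping windows, and keep the best."""
    split = int(0.8 * len(y))
    Xtr, ytr = X[:split], y[:split]
    best, best_err = None, np.inf
    for eta0, penalty, loss in itertools.product([0.0015, 0.005],
                                                 ["l1", "l2"],
                                                 ["squared_error", "huber"]):
        errs = []
        for t in range(0, len(ytr) - sw, sw):
            m = SGDRegressor(loss=loss, penalty=penalty, eta0=eta0,
                             learning_rate="constant", max_iter=5, tol=None)
            m.fit(Xtr[t:t + sw], ytr[t:t + sw])
            p = m.predict(Xtr[t + sw:t + sw + 1])[0]
            errs.append(abs(p - ytr[t + sw]) / abs(ytr[t + sw]))
        if np.mean(errs) < best_err:
            best, best_err = (eta0, penalty, loss), np.mean(errs)
    return best, best_err
```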

Table 1 Electrical consumption data SW-1 approx. Points

Metric                  plainSGD          approxSGD         approxWarmSGD
Average relative error  46.5268326356 %   35.1390031025 %   12.3209254031 %
Mean squared error      1.20014251329     0.836286330532    0.15540327753
Explained variance      0.556287783866    0.654150947989    0.913688374021
R2 score                0.333425379143    0.535515792892    0.913687016623

Table 2 Electrical consumption data SW/2 approx. Points

Metric                  plainSGD          approxSGD         approxWarmSGD
Average relative error  46.5268326356 %   43.9709146395 %   12.3961653422 %
Mean squared error      1.20014251329     1.33460563753     0.154441109978
Explained variance      0.556287783866    0.550474129003    0.914222866339
R2 score                0.333425379143    0.258742826809    0.914221416883

8 Experiments

8.1 Electric Power Consumption Data Set

In Tables 1 and 2 you see the results for the electric power consumption dataset in terms of the metrics referred to above. We present results both for SW − 1 and SW/2 approximation points. For this specific dataset the time lag we selected was 60, which means that we make our predictions for the electric power consumption one hour ahead. We can see that the method (approxWarmSGD) is better than the other two in all four metrics. The remarkable thing, though, is that even when we decrease the number of time points examined by half, there is almost no loss in the accuracy of the results.

8.2 Singapore Hedge Fund FOREX Data Set

Likewise, Tables 3 and 4 show the results for the Singapore FOREX data set. The error metrics were almost the same for both bid and ask predictions, so we show only one of them. As we described above, this dataset is not appropriate for high frequency trading; we mainly use it as a contrast to our second FOREX dataset, which is high frequency and for which the accuracy is far better. Nonetheless, the returns (even with a time lag of 5 minutes) are still bound to follow a random walk, thus we can classify them as uncooperative. As we can see, for this particular dataset approxSGD without the warm start performs best. The reason may be that the values are infrequent (ticks every 5 minutes).

Table 3 Results for Singapore FOREX data SW-1 approx. Points

Metric                  plainSGD (Bid/Ask)   approxSGD (Bid/Ask)   approxWarmSGD (Bid/Ask)
Average relative error  0.1366274551 %       0.133887583 %         0.2294997927 %
Mean squared error      5.79361776e-06       5.552743418e-06       1.547903339e-05
Explained variance      0.9990655232         0.9991046995          0.9974962369
R2 score                0.9990535129         0.9990928638          0.9974712337

Table 4 Singapore FOREX data SW/2 approx. Points

Metric                  plainSGD (Bid/Ask)   approxSGD (Bid/Ask)   approxWarmSGD (Bid/Ask)
Average relative error  0.1366274551 %       0.09281447603 %       0.1551851906 %
Mean squared error      5.79361778e-06       3.045854724e-06       7.963935037e-06
Explained variance      0.9990655232         0.9995082637          0.9987048952
R2 score                0.9990535129         0.9995024072          0.9986989542

Table 5 Capital K FOREX data, SW-1 approx. Points

Metric (Bid/Ask)        plainSGD                approxSGD               approxWarmSGD
Average relative error  0.01167 % / 0.0117 %    0.0116 % / 0.0117 %     1.502e-03 % / 1.554e-03 %
Mean squared error      2.201e-08 / 2.223e-08   2.175e-08 / 2.191e-08   1.287e-09 / 1.372e-09
Explained variance      0.999                   0.999                   0.999
R2 score                0.9812 / 0.9828         0.9814 / 0.9831         0.9989 / 0.9989

8.3 Capital K FOREX Data Set

This brings us to the cornerstone of our data sets, the one on which we based our theoretical study to produce a trading strategy that could potentially generate substantial profit. As stated above, we have 80 days of data, each of which consists of approximately 860,000 ticks. Each date is a separate dataset for which we have different results. In Tables 5 and 6, you can see the average values over all the dates. approxWarmSGD is superior in all metrics. The results hold even if we use SW/2 time points.

Table 6 Capital K FOREX data, SW/2 approx. Points

Metric (Bid/Ask)        plainSGD                approxSGD               approxWarmSGD
Average relative error  0.01167 % / 0.0117 %    0.0314 % / 0.0313 %     1.701e-03 % / 1.751e-03 %
Mean squared error      2.201e-08 / 2.223e-08   1.538e-07 / 1.517e-07   1.514e-09 / 1.602e-09
Explained variance      0.999                   0.999                   0.998
R2 score                0.9812 / 0.9828         0.8687 / 0.8832         0.9987 / 0.9987

9 Trading Strategy

Most empirical studies with high-frequency data look at the time series of volatility, trading volume and spreads. Several researchers argue that all these time series follow a U-shaped or a J-shaped pattern, i.e., the highest point of these variables occurs at the opening of the trading day, they fall to lower levels during the midday period, and then rise again towards the close. The behavior of these variables is not easy to explain theoretically using the basic models involving three kinds of agents: the informed trader, the uninformed trader and the market maker. The introduction of a distinction between discretionary and non-discretionary uninformed traders partially overcomes this difficulty. If the uninformed or liquidity traders choose the time of their trades, then they can congregate in the periods when trading costs are low. Some other models go further in explaining the positive relation between volatility and spread, indicating that more volatility is associated with the revelation of more information, and thus the market becomes more uncertain and spreads widen [23, 24].

9.1 Spread Based Trading

In contrast to the above, the approach we followed was again a very basic one, aimed at showing that the learning algorithm can make a consistent positive profit even if the trading technique does not make use of macroeconomic theories or models. Our strategy relies mostly on the spread of the bid and ask prices for consecutive time points. At this point we are not dealing with volatility; we trade one unit of EUR/USD at a time. It is obvious, though, that if you trade one unit and get a profit of 0.002 $, it is the same as if you trade 1 million units and get a profit of 2000 $. Of course, if you try to perform such a trade tracking volatility, you must check first that the amount of the security you wish to buy is available. Specifically, we have the two following cases where we get a buy/sell signal.

Fig. 8 Profit graph for every trading day

• If bid_prediction_{t+1} > ask_t, then buy at time t and sell at t + 1. If the bid price at time point t + 1 is larger than the ask price at time t, we make a positive profit because we buy something that will later be sold for a higher price than the price we bought it at.
• If ask_prediction_{t+1} < bid_t, then sell at time t (sell short) and rebuy it at t + 1 to return to a neutral position. If the ask price at time point t + 1 is smaller than the bid price at time t, then selling something now and rebuying it later gains a profit by exploiting this spread.
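A minimal sketch of these two rules as code, trading one unit per signal; the array names are ours and the profit of each trade is realized at t + 1:

```python
def spread_trading_profit(bid, ask, bid_pred, ask_pred):
    """bid_pred[t] / ask_pred[t] are the predictions, made at time t, of the
    bid/ask price at t+1. Trade one unit per signal and realize the result at
    t+1, following the two rules above."""
    profit = 0.0
    for t in range(len(bid) - 1):
        if bid_pred[t] > ask[t]:            # buy at t, sell at t+1
            profit += bid[t + 1] - ask[t]
        elif ask_pred[t] < bid[t]:          # sell short at t, rebuy at t+1
            profit += bid[t] - ask[t + 1]
    return profit
```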

As is easily understood, though, if the prediction is not accurate we end up buying something that we have to sell for less, or selling something and then needing to buy it back at a higher price. Table 7 shows the trading results for every day and the total profit in the end. The first column (goldenMethodProfit) is the maximum profit we could gain if we predicted the exact value for the next time point; that is the unattainable ideal. As we see, approxWarmSGD is the only method that in the end has a positive profit. The astonishing thing is that even on the days when it loses, the algorithm somehow manages to limit the loss to about 1/10 of the loss of the other two methods. One can see the above results graphically represented in Fig. 8. For the Singapore FOREX dataset the results were not so encouraging, though, as you can see in Table 8. It is very difficult to capture the essence of the currency movement by using ticks aggregated every 5 minutes. At least, we can state that our algorithm could sustain a loss 37.14 times less than the other two algorithms.

Table 7 Capital K FOREX dataset trading results: per-day profit of goldenMethodProfit, plainSGD, approxSGD and approxWarmSGD for each trading day from 2012/05/01 to 2012/08/09, together with the totals.

Table 8 Singapore FOREX dataset trading results

goldenMethodProfit   plainSGD      approxSGD     approxWarmSGD
11.810265 $          −5.259669 $   −5.209989 $   −0.14704 $

9.2 Other Methods we Tried

We thought of other techniques as well that take into account longer-term trading. Those methods, though, need further examination, as problems could occur: for instance, if we do not sell when the prediction turns out to be wrong and just hold the amount purchased until we find a more profitable choice, we might end up stacking up our inventory and keep losing money because we never sell anything. Although it is not easy to develop such a strategy, we believe that we can generate much better results if we design a method that tries to minimize loss rather than immediately buying and selling to return to a neutral position.

10 Conclusion

In this paper, we developed a framework called StatLearn consisting of fast algorithms for time series data streams. StatLearn concentrates on solving two problems: finding correlations in an online sliding-window manner and predicting the future of financial time series based on an online approximation algorithm we created. We tested our results with data collected from the stock market and with publicly available data sets.

10.1 Future Work

The work in this paper can be used as the basis for several future research directions including improving the trading strategy (especially the thresholds used to open or close a trade), incorporating other machine learning techniques such as boosted random forests, and finding outliers.

References

1. A. Frank, A. Asuncion, UCI Machine Learning Repository (2010). http://archive.ics.uci.edu/ml
2. R. Cole, D. Shasha, X. Zhao, Fast window correlations over uncooperative time series, in Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD'05, New York, NY, USA (ACM, New York, 2005), pp. 743–749
3. X. Zhao, High performance algorithms for multiple streaming time series. PhD thesis, New York, NY, USA, AAI3205697 (2006)

4. Y. Zhu, D. Shasha, Statstream: statistical monitoring of thousands of data streams in real time, in Proceedings of the 28th International Conference on Very Large Data Bases, VLDB’02, VLDB Endowment (2002), pp. 358–369 5. R. Agrawal, C. Faloutsos, A.N. Swami, Efficient similarity search in sequence databases, in Proceedings of the 4th International Conference on Foundations of Data Organization and Algorithms, FODO’93, London, UK, UK (Springer, Berlin, 1993), pp. 69–84 6. C. Faloutsos, M. Ranganathan, Y. Manolopoulos, Fast subsequence matching in time-series databases. Technical report College Park, MD, USA (1993) 7. D. Rafiei, A. Mendelzon, Similarity-based queries for time series data. SIGMOD Rec. 26(2), 13–25 (1997) 8. K.-P. Chan, A.W.-C. Fu, Efficient time series matching by wavelets, in Proceedings of the 15th International Conference on Data Engineering, ICDE’99, Washington, DC, USA (IEEE Comput. Soc., Los Alamitos, 1999), p. 126 9. A.C. Gilbert, Y. Kotidis, S. Muthukrishnan, M.J. Strauss, Surfing wavelets on streams: one- pass summaries for approximate aggregate queries, in VLDB (2001), pp. 79–88 10. F. Korn, H.V. Jagadish, C. Faloutsos, Efficiently supporting ad hoc queries in large datasets of time sequences, in SIGMOD (1997), pp. 289–300 11. E. Keogh, K. Chakrabarti, M. Pazzani, S. Mehrotra, Dimensionality reduction for fast similar- ity search in large time series databases. J. Knowl. Inf. Syst. 3, 263–286 (2000) 12. D. Achlioptas, Database-Friendly Random Projections (ACM, New York, 2001), pp. 274– 281 13. W.B. Johnson, J. Lindenstrauss, Extensions of Lipschitz mappings into a Hilbert space (1982) 14. E. Kushilevitz, R. Ostrovsky, Y. Rabani, Efficient search for approximate nearest neighbor in high dimensional spaces, in Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC’98, New York, NY, USA (ACM, New York, 1998), pp. 614–623 15. P. Indyk, Stable distributions, pseudorandom generators, embeddings, and data stream com- putation. J. ACM 53(3), 307–323 (2006) 16. D. Shasha, Statstream pictorial architecture (2005). http://cs.nyu.edu/shasha/papers/statstream/ architecture.html 17. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011) 18. M. Kontaki, A.N. Papadopoulos, Efficient similarity search in streaming time sequences, in Proceedings of the 16th International Conference on Scientific and Statistical Database Man- agement, SSDBM’04, Washington, DC, USA (IEEE Comput. Soc., Los Alamitos, 2004), p. 63 19. T. Buchgraber, D. Shutin, H.V. Poor, A sliding-window online fast variational sparse Bayesian learning algorithm, in ICASSP (2011), pp. 2128–2131 20. M. Velazco, A. Oliveira, C. Lyra, Neural networks give a warm start to linear optimiza- tion problems, in Proceedings of the 2002 International Joint Conference Neural Networks, IJCNN’02, vol. 2 (2002), pp. 1871–1876 21. R.S. Sutton, S.D. Whitehead, Online learning with random representations, in Proceedings of the Tenth International Conference on Machine Learning (Morgan Kaufmann, San Mateo, 1993), pp. 314–321 22. D. Saad (ed.), On-Line Learning in Neural Networks (Cambridge University Press, Cam- bridge, 1998) 23. C.C. Chang, S.S. Chen, R. Chou, C.W. Hsin, Intraday return spillovers and its variations across trading sessions. Rev. Quant. Finance Account. 
36, 355–390 (2011)
24. C.M.C. Lee, B. Mucklow, M.J. Ready, Spreads, depths, and the impact of earnings information: an intraday analysis. Rev. Financ. Stud. 6(2), 345–374 (1993)

Adaptive, Automatic Stream Mining

Spiros Papadimitriou, Anthony Brockwell, and Christos Faloutsos

Sensor devices and embedded processors are becoming widespread, especially in measurement/monitoring applications. Their limited resources (CPU, memory and/or communication bandwidth and power) pose some interesting challenges. We need concise, expressive models that represent the important features of the data and lend themselves to efficient estimation. In particular, under these severe constraints, we want models and estimation methods which (a) require little memory and a single pass over the data, (b) can adapt and handle arbitrary periodic components, and (c) can deal with various types of noise.

This material is based upon work supported by the National Science Foundation under Grants No. DMS-9819950 and IIS-0083148. This material is based upon work supported by the National Science Foundation under Grants No. IIS-9817496, IIS-9988876, IIS-0083148, IIS-0113089, IIS-0209107, IIS-0205224, INT-0318547, SENSOR-0329549, EF-0331657, IIS-0326322, by the Pennsylvania Infrastructure Technology Alliance (PITA) Grant No. 22-901-0001, and by the Defense Advanced Research Projects Agency under Contract No. N66001-00-1-8936. Additional funding was provided by donations from Intel, and by a gift from Northrop-Grumman Corporation. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, or other funding parties.

S. Papadimitriou (B)
Business School, Rutgers University, New Brunswick, NJ, USA
e-mail: [email protected]

A. Brockwell Department of Statistics, Carnegie Mellon University, Pittsburgh, PA, USA e-mail: [email protected]

C. Faloutsos Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, USA e-mail: [email protected]


We propose AWSOM (Arbitrary Window Stream mOdeling Method), which allows sensors in remote or hostile environments to efficiently and effectively discover interesting patterns and trends. This can be done automatically, i.e., with no prior inspection of the data or any user intervention and expert tuning before or during data gathering. Our algorithms require limited resources and can thus be incorporated in sensors—possibly alongside a distributed query processing engine [6, 10, 26]. Updates are performed in constant time with respect to stream size, using logarithmic space. Existing forecasting methods (SARIMA, GARCH, etc.) or "traditional" Fourier and wavelet analysis fall short on one or more of these requirements. To the best of our knowledge, AWSOM is the first framework that combines all of the above characteristics.

Experiments on real and synthetic datasets demonstrate that AWSOM discovers meaningful patterns over long time periods. Thus, the patterns can also be used to make long-range forecasts, which are notoriously difficult to perform. In fact, AWSOM outperforms manually set up auto-regressive models, both in terms of long-term pattern detection and modeling, as well as by at least 10× in resource consumption.

1 Introduction

Several applications produce huge amounts of data in the form of a semi-infinite stream of values [16, 18, 19, 21], typically samples or measurements. Formally, a stream is a discrete sequence of numbers X_0, X_1, ..., X_t, .... Time sequences have attracted attention [7], for forecasting in financial, sales, environmental, ecological and biological time series, to mention a few. However, several new and exciting applications have recently become possible.

The emergence of cheap and small sensors has attracted significant attention. Sensors are small devices that gather measurements—for example, temperature readings, road traffic data, geological and astronomical observations, patient physiological data, etc. There are numerous, fascinating applications for such sensors and sensor networks, in fields such as health care and monitoring, industrial process control, civil infrastructure [9], road traffic safety and smart houses, to mention a few. Although current small sensor prototypes [23] have limited resources (512 bytes to 128 KB of storage), dime-sized devices with memory and processing power equivalent to a PDA are not far away. In fact, PDA-like devices with data gathering units are already being employed in some of the above applications. The goal in the next decade is single-chip computers with powerful processors and 2–10 GB [9] of non-volatile storage. Furthermore, embedded processors are becoming ubiquitous and their power has yet to be harnessed. A few examples of such applications are (a) intelligent (active) disks [30] that learn input traffic patterns and do appropriate prefetching and buffering, and (b) intelligent routers that monitor data traffic and simplify network management.

Fig. 1 Automobile traffic, complete series, one day and one hour (the first two are 8- and 4-point averages to make the trends easier to see). There is clearly a daily periodicity. Also, in each day there is another distinct pattern (morning and afternoon rush hours). However, at an hour scale traffic is highly bursty—in fact, it can be modeled by self-similar noise. We want a method that can capture all this information automatically, with one pass and using limited memory!

From now on, we use the term "sensor" broadly, to refer to any embedded computing device with fairly limited processing, memory and (optionally) communication resources which generates a semi-infinite sequence of measurements.

The resource limitations unavoidably imply the need for certain trade-offs—it is impossible to store everything. Furthermore, we want to make the most of available resources, allowing the sensor to adapt and operate without supervision for as long as possible. This is the problem we address in this work. The goal is a "language" (i.e., model/representation) for efficient and effective stream mining. We want to collect information, in real time and without any human intervention, and discover patterns such as those illustrated in Fig. 1. This problem is orthogonal to that of continuous query processing. We focus on an adaptive algorithm that can look for arbitrary patterns and requires no prior knowledge and no initial human tuning to guide it. There are situations when we do not know beforehand what we are looking for. Furthermore, it may be impossible to guide the sensor as it collects data, due to the large volume of data and/or limited or unavailable communication. If further exploration is desired, users can issue further queries, guided by the general long-term patterns, to quickly narrow down the "search space." In detail, the main requirements are (see also Fig. 1):

1. Concise models. The models should be able to capture most of the regularities in real-world signals, using limited resources. In particular, we want
(a) Periodic component identification; humans can achieve this task, visually, from the time-plot. Our method should automatically spot multiple periodic components, each of unknown, arbitrary period.
(b) Noise filtering/identification; various types of "noise" are present in most real signals; our framework should deal with it and identify several of its characteristics.

(c) Finally, we need a few, simple patterns (i.e., equations and features), so they can be easily communicated to other nearby sensors and/or to a central processing site.
2. Streaming framework. In particular, we want
(a) An online, one-pass algorithm, since we can afford neither the memory nor the time for offline updates, much less multiple passes over the data stream.
(b) Limited memory use, since sensor memory will be exhausted unless our method carefully detects redundancies (or, equivalently, patterns) and exploits them.
(c) Any-time forecasting and outlier detection; it is not enough to do compression (e.g., of long silence periods, or by ignoring small Fourier or wavelet coefficients). The model should be generative and thus able to report outliers. An outlier can be defined as any value that deviates too much from our forecast (e.g., by two standard deviations).
3. Unsupervised, automatic operation. In a general sensor setting, we cannot afford human intervention.

The AWSOM framework can extract several key features of the stream, in a principled way. It can do so in a single pass, with minimal resource requirements. Based on these features, it can immediately provide information about the stream at several levels. In brief, AWSOM has all of the above characteristics, while none of the previously published methods (AR and variations, Fourier analysis, wavelet decomposition—see Sect. 2.2) can claim the same.

2 Related Work

Previous work broadly falls into two categories. The first includes work done by the databases community on continuous query processing. These methods employ increasingly sophisticated mathematical methods, but the focus is typically on some form of compression (or synopses) which do not employ generative models. The other includes various popular statistical models and methods for time series forecasting. However, standard estimation methods for these models typically fail in one or both of requirements 2 and 3 above. Thus the authors believe that there is a need for straightforward methods of time series model building which can be applied in real time to semi-infinite streams of data, using limited memory. Table 1 summarizes the main characteristics of some key earlier techniques and our proposed AWSOM framework.

2.1 Continuous Queries and Stream Processing

An interesting method for discovering representative trends in time series using sketches was proposed by Indyk et al. [24]. A representative trend is a section of the Adaptive, Automatic Stream Mining 503

Table 1 Comparison of methods

Method           Contin. streams   Trends/Forecast   Automatic   Memory
DFT (N-point)    NO                NO                –           –
SWFT (N-point)   YES(?)            NO                –           –
DWT (N-point)    NO                NO                –           –
IncDWT [21]      YES               NO                –           –
Sketches [24]    NO                YES(?)            –           –
AR/ARIMA         YES               YES               NO [7]      W²
AWSOM            YES               YES               YES         m|D|²
Although this framework employs varying res- olutions in time, it does so by straight aggregation, using manually selected aggre- gation levels (although the authors discuss the use of a geometric progression of time frames) and can only deal with, essentially, linear trends. Recently, Bulut et al. [8] proposed an approach for hierarchical stream summarization (similar to that of [21]) which focuses on simple queries and communication/caching issues for wavelet coefficients. Even more recently, Palpanas et al. [28] consider approxima- tion of time-series with amnesic functions. They propose novel techniques suitable for streaming, and applicable to a wide range of user-specified approximating func- tions. Finally, in other areas of database research, Zhang et al. [35] present a framework for spatio-temporal joins using multiple-granularity indices. Aggregation levels are pre-specified and the main focus is on efficient indexing. Yufei et al. [31] recently presented an approach to recursively predict motion sequence patterns.

2.2 Time Series Methods

None of the continuous querying methods deal with pattern discovery and forecasting. The typical "textbook" approaches to forecasting (i.e., generative time series models) include auto-regressive (AR) models and their generalizations, auto-regressive moving average (ARMA), auto-regressive integrated moving average (ARIMA) and seasonal ARIMA (SARIMA) [7]. Other popular time-series models include GARCH (generalized auto-regressive conditional heteroskedasticity) [5] and ARFIMA (auto-regressive fractionally integrated moving average) [4]. These time-series models could be used; however, standard estimation methods for these models fail in one or both of requirements 2 and 3 stated in the introduction. Furthermore, these methods often have a number of other limitations.

Existing model-fitting methods are typically batch-based (i.e., they do not allow online update of parameters). Established methods for determining model structure are at best computationally intensive, besides not being easily automated. Large window sizes introduce severe estimation problems, both in terms of resource requirements as well as accuracy. In addition, ARIMA models cannot handle bursty time series, even when the bursts are re-occurring. While GARCH models [5] can handle the class of bursty white noise sequences, they do not have the richness in structure to model a wide variety of different time series. Recently, the ARIMA model has been extended to

Table 2 Symbols and definitions

Symbol         Definition
X, P, ...      Matrices (boldface capital).
y, q, b, ...   Vectors (boldface lower-case).
X_t            Value at time t = 0, 1, ... (sometimes also X[t]).
N              Number of points so far from {X_t}.
W_{l,t}        Wavelet (or, detail) coefficient (level l, time t); also denoted W[l,t].
V_{l,t}        Scaling (or, smooth) coefficient (level l, time t); also denoted V[l,t].
β_i^(l)        AWSOM coefficient at time lag i, for the equation at level l.
AWSOM(n_0)     AWSOM model of order n_0 (i.e., with n_0 model coefficients per level). The total order of this model is k ≡ n_0.

ARFIMA, which handles the class of self-similar bursty sequences [4]. However, ARFIMA models pose particular computational problems, since they cannot be expressed in Markovian form, making the likelihood computationally burdensome to evaluate. All the above methods deal with linear forecasting. Nonlinear modeling methods [32] also require human intervention to choose the appropriate windows for non-linear regression or to configure an artificial neural network.

2.3 Other

There is a large body of work in the signal processing literature related to compression and feature extraction. Typical tools include the Fast Fourier Transform (FFT), as well as the Discrete Wavelet Transform (DWT) [29]. However, most of these algorithms (a) deal with fixed-length signals of size N, and (b) cannot do forecasting (i.e., do not employ a generative model).

3 Background Material

In this section we give a very brief introduction to some necessary background material. Table 2 summarizes the key notation used in our development.

3.1 Auto-Regressive (AR) Modeling

Auto-regressive models are the most widely known and used. We present the basic ideas (see also Appendix A)—more information can be found in, e.g., [7]. The main idea is to express X_t as a function of its previous values, plus (filtered) noise ε_t:

    X_t = φ_1 X_{t−1} + ··· + φ_W X_{t−W} + ε_t,    (1)

where W is a window that is determined by trial and error, or by using a criterion that penalizes model complexity (i.e., large values of W), like the Akaike Information Criterion (AIC). Seasonal variants (SAR, SAR(I)MA) also use window offsets that are multiples of a single, fixed period (i.e., besides terms of the form X_{t−i}, the equation contains terms of the form X_{t−Si} where S is a constant). The typical ARIMA modeling approach involves manual preprocessing to remove trend and seasonal components.

3.2 Recursive Least Squares (RLS)

Recursive Least Squares (RLS) is a method that allows dynamic update of a least-squares fit. The least squares solution to an overdetermined system of equations Xb = y, where X ∈ R^{m×k} (measurements), y ∈ R^m (output variables) and b ∈ R^k (regression coefficients to be estimated), is given by the solution of X^T X b = X^T y. Thus, all we need for the solution are the projections

    P ≡ X^T X   and   q ≡ X^T y.

We need only space O(k² + k) = O(k²) to keep the model up to date. When a new row x_{m+1} ∈ R^k and output y_{m+1} arrive, we can update

    P ← P + x_{m+1} x_{m+1}^T   and

    q ← q + y_{m+1} x_{m+1}.

In fact, it is possible to update the regression coefficient vector b without explicitly inverting P to solve P b = q. In particular (see, e.g., [34]) the update equations are

    G ← G − (1 + x_{m+1}^T G x_{m+1})^{−1} G x_{m+1} x_{m+1}^T G,
    b ← b − G x_{m+1} (x_{m+1}^T b − y_{m+1}),

where the matrix G can be initialized to G ← ε I (where ε is a small positive number and I is the k × k identity matrix).
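A direct transcription of these updates into numpy; the ε initialization follows the text, and the toy stream at the end is only an illustration:

```python
import numpy as np

class RLS:
    """Recursive least squares: keep G (a running, regularized inverse of X^T X)
    and the coefficient vector b, refreshing both in O(k^2) work per new row,
    following the update equations above."""
    def __init__(self, k, eps=1e-3):
        self.G = eps * np.eye(k)       # G <- eps * I, eps a small positive number
        self.b = np.zeros(k)

    def update(self, x, y):
        x = np.asarray(x, dtype=float)
        Gx = self.G @ x
        self.G -= np.outer(Gx, Gx) / (1.0 + x @ Gx)   # G <- G - (1 + x^T G x)^{-1} G x x^T G
        self.b -= (self.G @ x) * (x @ self.b - y)     # b <- b - G x (x^T b - y)
        return self.b

# Toy usage: recover b = [2, -1] from noisy streaming rows.
rng = np.random.default_rng(0)
rls = RLS(k=2)
for _ in range(10000):
    x = rng.standard_normal(2)
    rls.update(x, 2.0 * x[0] - 1.0 * x[1] + 0.01 * rng.standard_normal())
print(rls.b)   # approaches [2, -1] as more rows arrive (the eps*I start acts like ridge regularization)
```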

RLS and AR

In the context of auto-regressive modeling (Eq. (1)), we have one equation for each stream value X_{w+1}, ..., X_t, ..., i.e., the mth row of the X matrix above is

    x_m^T = [X_{m−1} X_{m−2} ... X_{m−w}] ∈ R^w

for m = t − w = 1, 2, ... (t > w). In this case, the solution vector b consists precisely of the auto-regression coefficients in Eq. (1), i.e.,

    b = [φ_1 φ_2 ... φ_w]^T ∈ R^w.

3.3 Wavelets

The N-point discrete wavelet transform (DWT) of a length-N time sequence gives N wavelet coefficients. Wavelets are best introduced with the Haar transform, because of its simplicity (a more rigorous introduction can be found, e.g., in [29]). At each level l of the construction we keep track of two sets of coefficients, each of which "looks" at a time window of size 2^l:

• V_{l,t}—the smooth component, which consists of the N/2^l scaling coefficients. These capture the low-frequency component of the signal; in particular, the frequency range [0, 1/2^l].
• W_{l,t}—the detail component, which consists of the N/2^l wavelet coefficients. These capture the high-frequency component; in particular, the range [1/2^l, 1/2^{l−1}].

The construction starts with V_{0,t} = X_t, and W_{0,t} is not defined. At each iteration l = 1, 2, ..., lg N we perform two operations on V_{l−1,t} to compute the coefficients at the next level:

• Differencing, to extract the high frequencies: W_{l,t} = (V_{l−1,2t} − V_{l−1,2t−1}) / √2;
• Smoothing, which averages¹ each consecutive pair of values and extracts the low frequencies: V_{l,t} = (V_{l−1,2t} + V_{l−1,2t−1}) / √2.

We stop when W_{l,t} consists of one coefficient (which happens at l = lg N + 1). The scaling coefficients are needed only during the intermediate stages of the computation. The final wavelet transform is the set of all wavelet coefficients along with V_{lg N+1,0}. Starting with V_{lg N+1,0} (which is also referred to as the signal's scaling coefficient) and following the inverse steps, we can reconstruct each V_{l,t} until we reach V_{0,t} ≡ X_t. Figure 2 illustrates the final effect for a signal with N = 16 values. Each wavelet coefficient is the result of projecting the original signal onto the corresponding basis signal (i.e., taking the dot product of the signal with the basis). Figure 2 shows the scalogram, that is, the energy (i.e., squared magnitude) of each wavelet coefficient versus the location in time and frequency it is "responsible" for.

¹The scaling factor of 1/√2 in both the difference and averaging operations is present in order to preserve total signal energy (i.e., the sum of squares of all values).

Fig. 2 Haar bases and correspondence to time/frequency (for signal length N = 16). Each wavelet coefficient is a linear projection of the signal onto the respective basis

Figure 2 shows the scalogram, that is, the energy (i.e., squared magnitude) of each wavelet coefficient versus the location in time and frequency it is "responsible" for. In general, there are many wavelet transforms, but they all follow the pattern above: a wavelet transform uses a pair of filters, one high-pass and one low-pass. For example, in Haar wavelets, this pair consists of the simple differencing and averaging filters, respectively. For our purposes here, we restrict ourselves to wavelets of the Daubechies family, which have desirable smoothness properties and successfully compress many real signals. Although Haar wavelets are by far the most commonly used (largely due to their simplicity), in practice they are too unsmooth and introduce significant artifacting [29]. Thus, unless otherwise specified, we use Daubechies-6.
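As a concrete illustration of the differencing/smoothing recursion above, the following Python sketch computes a batch Haar DWT for a signal whose length is a power of two. It is a toy version for exposition only; the chapter's experiments use Daubechies-6 filters, and the incremental algorithm is discussed next.

```python
import numpy as np

def haar_dwt(x):
    """Batch Haar DWT. Returns (details, smooth): details[l-1] holds the
    level-l wavelet coefficients W_{l,t}; smooth is the final scaling coefficient."""
    v = np.asarray(x, dtype=float)                    # V_{0,t} = X_t
    details = []
    while len(v) > 1:
        odd, even = v[0::2], v[1::2]                  # V_{l-1,2t-1}, V_{l-1,2t}
        details.append((even - odd) / np.sqrt(2.0))   # differencing -> W_{l,t}
        v = (even + odd) / np.sqrt(2.0)               # smoothing    -> V_{l,t}
    return details, v[0]

# Example: a length-16 signal, as in Fig. 2
W, smooth = haar_dwt(np.sin(2 * np.pi * np.arange(16) / 8.0))
```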

Incremental Wavelets

This part is a very brief overview of how to compute the DWT incrementally. This is the main idea of IncDWT [21], which uses Haar wavelets. In general, when using a wavelet filter of length L, the wavelet coefficient at a particular level is computed using the L corresponding scaling coefficients of the previous level. Recall that L = 2 for Haar (average and difference of two consecutive points), and L = 6 for Daubechies-6, which we typically use. Thus, we need to remember the last L − 1 scaling coefficients at each level. We call these the wavelet crest.

Definition 1 (Wavelet Crest) The wavelet crest at time t is defined as the set of scaling coefficients (wavelet smooths) that need to be kept in order to compute the new wavelet coefficients when X_t arrives.

Lemma 1 (DWT Update) Updating the wavelet crest requires space (L − 1) lg N + L = O(L lg N) = O(lg N), where L is the width of the wavelet filter (fixed) and N the number of values seen so far.

Proof See [21]. Generalizing to non-Haar wavelets and taking into account the wavelet filter width is straightforward. □
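Below is a minimal sketch of the bookkeeping behind Lemma 1, specialized to Haar (L = 2, so L − 1 = 1 buffered scaling coefficient per level). The class and attribute names are illustrative; IncDWT [21] generalizes this idea to longer Daubechies filters.

```python
import math

class HaarCrest:
    """Incremental Haar DWT: O(lg N) space, O(1) amortized time per point."""
    def __init__(self):
        self.pending = {}   # level -> buffered (unpaired) scaling coefficient
        self.coeffs = []    # emitted (level, wavelet coefficient); kept only for clarity

    def append(self, x):
        """Process one new point X_t, cascading upward while pairs complete."""
        v, level = float(x), 0
        while level in self.pending:
            prev = self.pending.pop(level)              # V_{level,2t-1} arrived earlier
            self.coeffs.append((level + 1, (v - prev) / math.sqrt(2.0)))
            v = (v + prev) / math.sqrt(2.0)             # new scaling coefficient
            level += 1
        self.pending[level] = v
```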

Wavelet Properties

In this section, we emphasize the DWT properties which are relevant to AWSOM.

Computational Complexity

The DWT can be computed in O(N) time and, as new points arrive, it can be updated in O(1) amortized time. This is made possible by the structure of the time/frequency decomposition which is unique to wavelets. For instance, the Fourier transform also decomposes a signal into frequencies (i.e., sum of sines), but requires O(N lg N) time to compute and cannot be updated as new points arrive.

Time/Frequency Decomposition

Notice (see scalogram in Fig. 2) that higher level coefficients are highly localized in time, but involve uncertainty in frequency and vice versa. This is a fundamental trade-off of any time/frequency representation and is a manifestation of the uncertainty principle, according to which localization in frequencies is inversely proportional to localization in time. The wavelet representation is an excellent choice when dealing with semi-infinite streams in limited memory: it "compresses" well many real signals, while it is fast to compute and can be updated online.

Wavelets and Decorrelation

A wavelet transform with filter of length 2L can decorrelate only certain signals, provided their Lth order (or less) backward difference² is a stationary random process [29]. For real signals, this value of L is not known in advance and may be impractically large: the space complexity of computing new wavelet coefficients is O(L lg N)—see Lemma 1.

Wavelet Variance

One further benefit of using wavelets is that they decompose the variance across scales. Furthermore, the plot of log-power versus scale can be used to detect self-similar components (see Appendix B for a brief overview).

²In particular, the Daubechies-2L wavelet filters are essentially Lth order backward differences (and, consequently, the scaling filters are essentially 2L-point weighted moving averages).

Fig. 3 AWSOM—Intuition and demonstration. AWSOM captures intra-scale correlations (a). Also, (b) demonstrates why we fit different models per level

4 Proposed Method

In this section, we introduce our proposed model. What equations should we be looking for to replace ARIMA’s (see Eq. (1))?

4.1 Intuition Behind Our Method

First Part—Information Representation

This is a crucial choice—what is a good way to represent the key information in the series, given the severe resource constraints in a streaming, sensor setting? We want a powerful and flexible representation that can adapt to the sequence, rather than expect someone to adapt the sequence to the representation. We propose to use wavelets because they are extremely successful in compressing most real signals, such as voice and images [17], seismic data [38], biomedical signals [1] and economic time sequences [20]. By using wavelet coefficients, we immediately discard many redundancies (i.e., near-zero valued wavelet coefficients) and focus on what really matters. Furthermore, the DWT can be computed quickly and updated online.

Second Part—Correlations

In the wavelet domain, how can we capture arbitrary periodicities? A periodic signal will have high-energy wavelet coefficients at the scales that correspond to its frequency. Also, successive coefficients on the same level should have related values (see Fig. 3(a)). Thus, in order to capture periodic components, we should look for intra-scale correlations between wavelet coefficients. The last question we need to answer is: what type of regression models should we use to quantify these correlations? Our proposed method tries to capture these by fitting linear regression models in the wavelet domain. These can also be updated online with RLS.

Summary

We propose using the wavelet representation of the series and capturing correlations in the wavelet domain (see Fig. 3(b)). If we naïvely try to apply linear auto-regression in the time domain, there are several problems:
• Window size. In order to capture long-term (non-sinusoidal) periodic components, the window size has to be as large as the period. If we hope to capture any information about long-term periodic trends, the window size has to be extremely large, and this clearly violates the limited memory requirements (it also creates other problems, as discussed next). However, the wavelet coefficients at each level capture information at a coarser resolution, but at windows whose size increases exponentially with the level.
• Noise. Even with a large window size, the presence of noise (typically at higher frequencies) severely affects model fitting. Dealing with noise in the time domain is possible (moving averages being the simplest approach), but cannot be done easily in an online setting and often requires human intervention (seasonal models are typically used for this reason, where the period has to be manually identified). However, the wavelet transform filters out this noise quite successfully, and does so in a principled manner that has been proven to work for several real signals. Furthermore, it allows us to extract certain important characteristics of the noise.
Therefore, our approach significantly improves modeling power while dramatically reducing memory requirements, by essentially performing auto-regression on the wavelet representation instead of the original signal. At the same time, it adheres to the principle of no prior human intervention and tuning (or, adapting to new data as they arrive).

4.2 AWSOM Modeling

We express the wavelet coefficients at each level as a function of the n0 previous coefficients of the same level, i.e.,

W_{l,t} = β_1^{(l)} W_{l,t−1} + β_2^{(l)} W_{l,t−2} + ··· + β_{n_0}^{(l)} W_{l,t−n_0}   (2)

where W_{l,t} are the wavelet coefficients (at time t and level l) and β_i^{(l)} are the AWSOM coefficients (for level l and lag i). We estimate one set of such coefficients for each level l; note that l is at most lg N. This is a model of order n_0, denoted as AWSOM(n_0). To summarize:

Input: continuous, numerical stream X[1], X[2], ..., X[t], ...; model order n_0 (typically n_0 ≤ 6).
Output: wavelet variances V_l; AWSOM coefficients β_i^{(l)}; wavelet crest (or, initial conditions) W[N − i − 1, l].
Here N is the number of points seen so far, 1 ≤ l ≤ ⌈lg N⌉, and 1 ≤ i ≤ n_0.

UpdateCrest(X[t]):
  Foreach l ≥ 0 s.t. 2^l divides t:
    Compute V[l, t/2^l]
    If 2^{l+1} divides t:
      Compute W[l, t/2^{l+1}]
      Delete W[l, t/2^{l+1} − L]

Update(X[t]):
  UpdateCrest(X[t])
  Foreach new coefficient W[l, t′] in the wavelet crest:
    Find the linear model it belongs to, based on l and t′ mod Λ
    Update P ≡ X^T X and q ≡ X^T y for this model

ModelSelection:
  For each linear model:
    Estimate SSR of complete model
    For each subset of regression variables:
      Compute SSR of reduced model (from Lemma 2)
      Estimate probability that reduction in variance is not due to chance
    Select the subset of variables with highest probability
      (or keep all if not within 95% confidence interval)

Fig. 4 High-level description of the algorithms

Definition 2 (AWSOM Model and Order) A model with n_0 coefficients β_i^{(l)}, i = 1, 2, ..., n_0, for each wavelet level l is denoted as AWSOM(n_0). Its (total) order is n_0.

This can capture arbitrary periodic components and is sufficient for many real signals. Figure 4 describes the algorithm; we update the wavelet crest online (as described before) and use recursive least squares to update the regression models, also online.
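A toy end-to-end sketch of the Update step of Fig. 4 is shown below, reusing the HaarCrest and RLS helpers sketched earlier. It fits one AWSOM(n_0) model per level with the G/b recursion instead of maintaining P and q explicitly, and it omits the "t mod Λ" grouping; all names are illustrative.

```python
from collections import defaultdict, deque
import numpy as np

class AWSOMSketch:
    """One linear model per wavelet level, updated online (toy version)."""
    def __init__(self, n0=6, eps=1e-3):
        self.n0 = n0
        self.crest = HaarCrest()                                 # earlier sketch
        self.lags = defaultdict(lambda: deque(maxlen=n0))        # last n0 W_{l,*} per level
        self.models = defaultdict(lambda: rls_init(n0, eps))     # earlier sketch

    def update(self, x_t):
        seen = len(self.crest.coeffs)
        self.crest.append(x_t)                                   # UpdateCrest(X[t])
        for level, w in self.crest.coeffs[seen:]:                # new coefficients only
            lag = self.lags[level]
            if len(lag) == self.n0:                              # enough history for a row
                x_row = np.array(lag, dtype=float)[::-1]         # [W_{l,t-1}, ..., W_{l,t-n0}]
                rls_update(self.models[level], x_row, w)
            lag.append(w)
```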

4.3 Model Selection

Many of the dependencies may be statistically insignificant, so the respective coefficients β_i^{(l)} should be set to zero. We want to (a) avoid over-fitting and (b) present to the user those patterns that are important. We can do model selection using only data gathered online and with time complexity independent of the stream size. Next, we show how feature selection can be done from the data gathered online (i.e., P and q for each AWSOM equation). The algorithm is sketched in Fig. 4. The main idea is to determine whether the reduction in error achieved by adding extra parameters is statistically significant (based on some criterion), i.e., whether it cannot be attributed to noise.

Model Testing and Selection

The key quantity we need is the total squared error (or, square sum of residuals (SSR)).

Lemma 2 (Square Sum of Residuals) If b is the least-squares solution to the overdetermined equation Xb = y, then

s_n ≡ Σ_{i=1}^{n} (x_i^T b − y_i)² = b^T P b − 2 b^T q + y².

Proof Straightforward from the definition of s_n, which in matrix form is s_n = (Xb − y)^T (Xb − y). □

Thus, besides P and q, we only need to update y² (a single number), by adding y_i² to it as each new value arrives. Now, if we select a subset I = {i_1, i_2, ..., i_p} ⊆ {1, 2, ..., k} of the k variables x_1, x_2, ..., x_k, then the solution b_I for this subset is given by P_I b_I = q_I and the SSR by s_n^I = b_I^T P_I b_I − 2 b_I^T q_I + y², where the subscript I denotes straight row/column selection (e.g., P_I = [p_{i_j,i_k}]_{i_j,i_k ∈ I}). The F-test (Fisher test) [15] is a standard method for determining whether a reduction in variance is statistically significant. The F-test is based on the sample variances, which can be computed directly from the SSR (Lemma 2). Although the F-test holds precisely (i.e., non-asymptotically) under normality assumptions, in practice it works well in several circumstances, especially when the population size is large (as is the case with semi-infinite streams).
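The following sketch shows how Lemma 2 and the subset comparison might be evaluated from the streaming summaries alone (P, q, the scalar y², and the row count n). The use of scipy.stats.f and the degrees of freedom follow the standard nested-model F-test, which is our reading of the procedure rather than a quote from the text.

```python
import numpy as np
from scipy import stats

def ssr(P, q, y2, idx=None):
    """Square sum of residuals from the summaries (Lemma 2), optionally
    restricted to the variable subset idx."""
    if idx is not None:
        P, q = P[np.ix_(idx, idx)], q[idx]
    b = np.linalg.solve(P, q)
    return b @ P @ b - 2.0 * b @ q + y2

def f_test_pvalue(P, q, y2, n, subset):
    """Probability that dropping the variables NOT in `subset` loses nothing
    but noise; k = total variables, p = |subset|, n = rows seen so far."""
    k, p = len(q), len(subset)
    ssr_full, ssr_red = ssr(P, q, y2), ssr(P, q, y2, list(subset))
    F = ((ssr_red - ssr_full) / (k - p)) / (ssr_full / (n - k))
    return 1.0 - stats.f.cdf(F, k - p, n - k)
```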

4.4 Complexity

In this section, we show that our proposed AWSOM models can be easily estimated with a single-pass, "any-time" algorithm. From Lemma 1, estimating the new wavelet coefficients requires space O(lg N). In fact, since we typically use Daubechies-6 wavelets (L = 6), we need to keep exactly 5 lg N + 6 values. The AWSOM models can be dynamically updated using RLS. At each level, we fit a separate regression model; since the number of levels is lg N, this leads to the following result:

Lemma 3 (Logarithmic Space Complexity) Maintaining an AWSOM(k) model requires O(lg N + mk²) space, where N is the length of the signal so far, k is the total AWSOM order, and m = ⌈lg N⌉ = O(lg N) is the number of equations.

Fig. 5 (a) Memory space requirements (normalized): Space needed to keep the models up-to-date (AWSOM and AR with equivalent, fair window size). (b) Time complexity versus stream size (Python prototype), including model selection; the relationship is exactly linear, as expected

Proof Keeping the wavelet crest scaling coefficients requires space O(lg N). If we use recursive least squares, we need to maintain a k × k matrix for each of the m equations in the model. □

Auto-regressive models with a comparable window size need space O(m²k²), since the equivalent fair window size is W ≈ mk. Here, "fair" means that the total number of AWSOM coefficients plus the number of initial conditions we need to store is the same for both methods. This is the information that comprises the data synopsis and that would eventually have to be communicated. However, the device gathering the measurements needs extra storage space in order to update the models; this extra space is, in fact, much larger for AR than for AWSOM (see Fig. 5(a)). Thus, this definition of equivalent window actually favors AR.

Lemma 4 (Time Complexity) Updating the model when a new data point arrives requires O(k²) time on average, where k is the number of AWSOM coefficients in each equation.

Proof On average, the wavelet crest scaling coefficients can be updated in O(1) amortized time. Although a single step may require O(lg N) time in the worst case, on average the (amortized) time required is O(Σ_{i=0}^{N} B(i)/N) = O(1), where B(i) is the number of trailing zeros in the binary representation of i.³ Updating the k × k matrix for the appropriate linear equation (which can be identified in O(1) time, based on the level l) requires time O(k²). □

Auto-regressive models with a comparable window size need O(m²k²) time per update.

³Seen differently, IncDWT is essentially a pre-order traversal of the wavelet coefficient tree.

Table 3 Description of datasets (sizes are in number of points, 1K = 1024 points)

Dataset      Size   Description
Triangle     64K    Triangle wave (i.e., piecewise linear; amplitude 5, period 256)
Mix          256K   Square wave (amplitude 25, period 256) plus sine (amplitude 5, period 64)
Sunspot      2K     Sunspot data
Automobile   32K    Automobile traffic sensor trace from a large Midwestern state
Temperature  32K    Measurements from an indoor temperature sensor (per second, degrees Celsius)

Corollary 1 (Constant-Time Update) When the model parameters have been fixed (typically k is a small constant ≈ 6 and m ∼ lg N), the model requires space O(lg N) and amortized time O(1) for each update.

Figure 5(b) shows that this is clearly the case, as expected.

5 Experimental Evaluation

We compared AWSOM against standard AR (with the equivalent, fair window size—see Sect. 4.4), as well as hand-tuned (S)ARIMA (wherever possible). Our prototype AWSOM implementation is written in Python, using Numeric Python for fast array manipulation. We used the standard ts package from R⁴ for AR and (S)ARIMA models. We illustrate the properties of AWSOM and how to interpret the models using synthetic datasets, and then show how these apply to real datasets (see Table 3). Only the first half of each sequence was used to estimate the models, which were then applied to generate a sequence of length equal to that of the entire second half. For AR and (S)ARIMA, the last values (as dictated by the lags) of the first half were used to initiate generation. For AWSOM we again used as many of the last wavelet coefficients from each DWT level of the first half as were necessary to start applying the model equations. We should note that generating more than, say, 10 steps ahead is very rare: most methods in the literature [32] generate one step ahead, then obtain the correct value of X_{t+1}, and only then try to generate X_{t+2}. Nevertheless, our goal is to capture long-term behavior under severe resource constraints, and AWSOM achieves this efficiently.

⁴R version 1.6.0; see http://www.r-project.org/.

5.1 Interpreting the Models

Visual Inspection

A “forecast” is essentially a by-product of any generative time series model: ap- plication of any model to generate a number of “future” values reveals precisely the trends and patterns captured by that model. In other words, synthesizing points based on the model is the simplest way for any user to get a quick, yet fairly accurate idea of what the trends are or, more precisely, what the model thinks they are. Thus, what we expect to see (especially in a long-range forecast) is the important patterns that can be identified from the real data. However, an expert user can extract even more precise information from the models.

Variance Test

As explained in Appendix B, if the signal is self-similar, then the plot of log-power versus scale is linear.

Definition 3 (Variance Log-Power Diagnostic) The log-power vs. scale plot is the wavelet variance log-power diagnostic plot (or just the log-power diagnostic). In particular, the correlation coefficient ρ_α quantifies the relation. If the plot is linear (in a range of scales), the slope α̂ is the self-similarity exponent (−1 < α < 0; the closer to zero, the more bursty the series).

A large value of |ρ_α|, at least across several scales, indicates that the series component in those scales may be modeled using, e.g., a fractional noise process with parameter dictated by α (see Automobile). However, we should otherwise be careful in drawing further conclusions about the behavior within these scales. We should note that after the observation by [25], fractional noise processes and, in general, self-similar sequences have revolutionized network traffic modeling. Furthermore, self-similar sequences appear in atomic clock fluctuations, river minima, and compressed video bit-rates [4, 29], to mention a few examples.

Wavelet Variance (Energy and Power)

The magnitude of the variance within each scale serves as an indicator of which frequency components are the dominant ones in the sequence. To precisely interpret the results, we also need to take into account the fundamental uncertainty in frequencies (see Fig. 12). However, the wavelet variance plot quickly gives us the general picture of important trends. Furthermore, it guides us to focus on AWSOM coefficients around frequencies with large variance. To summarize, the steps are: (i) Examine the log-power diagnostic to identify sub-bands that correspond to a self-similar component. These may be modeled using a fractional noise process for generation purposes; for forecasting purposes they are just that: noise. (ii) Examine the wavelet energy spectrum to quickly identify important sub-bands.
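A short sketch of both diagnostics: estimate the wavelet variance V̂_l as the average squared coefficient per level (see Appendix B), then regress log₂ V_l on the level l to obtain the slope α̂ and the correlation ρ_α. It reuses the haar_dwt sketch from Sect. 3.3 for brevity (the chapter's experiments use Daubechies-6), and follows the document's convention that the slope approximates the self-similarity exponent.

```python
import numpy as np

def log_power_diagnostic(x):
    """Return (V, alpha_hat, rho): per-level wavelet variances, the slope of
    log2(V_l) vs. l, and the correlation coefficient of that linear fit.
    Assumes all V_l > 0."""
    details, _ = haar_dwt(x)                          # sketch from Section 3.3
    V = np.array([np.mean(w ** 2) for w in details])  # estimator of V_l (Appendix B)
    levels = np.arange(1, len(V) + 1)
    alpha_hat, _ = np.polyfit(levels, np.log2(V), 1)
    rho = np.corrcoef(levels, np.log2(V))[0, 1]
    return V, alpha_hat, rho
```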

Experimental Goals

In each case, we demonstrate that AWSOM can provide information to answer the following questions, using limited resources and no supervision:
(Q1) Identify and capture periodic components: this can be done by simply inspecting the forecasts, or by examining the wavelet variance. The latter also gives information about the relative "significance" (essentially, amplitude) of each component.
(Q2) Diagnose the presence of self-similar noise in the appropriate scales: the log-power diagnostic provides the necessary information.
(Q3) Perform long-range forecasts: besides identifying periodic components, the AWSOM coefficients at the appropriate scales capture their behavior (regardless of their relative amplitude).

5.2 Synthetic Datasets

We present synthetic datasets to illustrate the basic properties of AWSOM, its behavior on several characteristic classes of sequences, and the principles behind interpreting the models. Applying the models to generate a number of "future" data points is the quickest way to see if each method captures long-term patterns (see Fig. 7).

Triangle

AR fails to capture anything because the window is not large enough. SAR estimation (with no differencing, no MA component, and only a manually pre-specified 256-lag seasonal component) fails completely. In fact, R segfaults after several minutes, even without using maximum-likelihood estimation (MLE). However, AWSOM captures the periodicity. This is immediately evident from the forecasts, which capture the trend almost perfectly. Also, inspection of the wavelet variance (Fig. 9(a)) shows a single spike at scale 7 (which corresponds to a window of 2⁷ = 128; this is as expected, since there is zero change among consecutive windows of the next size, 256).

Fig. 6 Wavelet log-power diagnostic (real datasets). The horizontal axis is wavelet scale (l) and the vertical axis is log-power (log V_l) (see Appendix B). A linear trend with negative slope (between 0 and −1) indicates the presence of self-similar noise. Automobile exhibits self-similarity in scales up to 6 (which roughly corresponds to one hour) but not overall

Mix

AR is again confused and does not capture even the sinusoidal component. SAR estimation (without MLE) fails (R's optimizer returns an error, after several minutes of computation). Quick inspection of the AWSOM forecast shows clearly the two periodic components. The wavelet variance plot (Fig. 9(b)) also gives this information: there is again a spike at scale 7 (or, window 2⁷ = 128; same interpretation as Triangle), which corresponds to the square wave periodic component. There is also a rise at scale 5 (or window 2⁵ = 32), which corresponds to the sinusoidal periodic component.⁵ The difference in magnitude of the variances is precisely due to the difference in amplitude of the two periodic components: indeed, the square wave component is a much "stronger" one. However, regardless of the strength (i.e., variance) of each component, the AWSOM coefficients capture their behavior. In summary, using limited resources, AWSOM provides all the key information about the series.

5.3 Real Datasets

For the real datasets, we again show long-range forecasts (see Fig. 8), as well as the marginal distribution quantile–quantile plots (or Q–Q plots; see Figs. 11 and 10).⁶

⁵The nonzero value at scale 6 is expected and due to frequency leaks, as explained in Appendix B.
⁶These are the scatter plots of (x, y) such that p% of the values are below x in the real sequence and below y in the generated sequence. When the distributions are identical, the Q–Q plot coincides with the bisector of the first quadrant.

Fig. 7 Forecasts—synthetic datasets. Note that AR gives the wrong trend (if any), while seasonal AR fails to complete and is not shown

Sunspot

This is a well-known dataset with a time-varying "period."⁷ AR again fails completely. SAR (without an MA component, much less MLE) takes 40 minutes to estimate. AWSOM (in Python) takes less than 9 seconds. SAR misses the marginal distribution (see Fig. 11) but, more importantly, it does not discover any period; that information has to be provided manually, after an initial inspection of the data. Furthermore, SAR can deal successfully with only one period. AWSOM captures the general periodic trend, with a desirable slight confusion about the "period."

⁷Signals exhibiting this behavior are often referred to as cyclical, as opposed to periodic.

Fig. 8 Forecasts—real datasets. AR fails to detect any trend, while seasonal AR fails to complete or gives a wrong conclusion in 260× the time

First, the log-power diagnostic (Fig. 6(b)) shows some hint of self-similar noise at scales 2–4 (and, indeed, the series has some fluctuation at the month scale). We won't focus on this behavior here (see, however, the discussion on Automobile) since the other periodic (or, more accurately, cyclic) components are those that are of the most interest and dominate the series by far. The wavelet variance (Fig. 9(c)) indeed shows a spike in the vicinity of scale 6 (or window 2⁶ = 64). Each time tick is one month; it is well known that sunspots cycle at about 9–11.5 years (or, 108–138 months), with an average cycle length of about 10.8 years (approximately 129–130 months). Indeed, we see that the wavelet variance has a peak at windows of 64–128 (scales 6–7), which would correspond to repeating cycles every 128 or more months, as is the case. Furthermore, the AWSOM coefficients at those scales capture the trend at that granularity (as can be seen immediately from the forecast), with the desirable "confusion" about the period. Finally, closer examination of the series shows that the peaks in the last third are much lower than in the rest of the series. This explains the other peak in Fig. 9(c) at scales 9–10, which correspond to a window of about 1/4–1/2 of the total series length.

Fig. 9 Wavelet variances (average energy, normalized). The horizontal axis is wavelet level (l, i.e., wavelet window 2^l) and the vertical axis is wavelet variance (i.e., average of squares of wavelet coefficients for each l). The vertical axis is (linearly) normalized to a maximum of 1. Essentially, values at level l indicate the "strength" of the series component at periods in the range [2^l, 2^{l+1}] (see also Appendix B)

Automobile

This dataset has a strongly linear log-power diagnostic in scales 1–6 (Fig. 6(a)). However, the lower frequencies (i.e., larger scales) contain the most energy (see Fig. 9(d)). This indicates we should focus on these scales. The lowest frequency corresponds to a daily periodicity (approximately 4000 points per day, or about 8 periods in the entire series). The next highest frequency corresponds to the morning and afternoon rush hours (approximately 2000 points per half-day). Also, we would expect to see a rise in the wavelet variance at scales corresponding to windows of ≈1000–2000, or in the vicinity of scales 10–11. Indeed, this is the case, and at these scales the AWSOM coefficients capture the periodic components. Furthermore, there appear to be significant differences in the dips between the two halves (this is clearer in Fig. 1, where some of the noise has been removed), which explains the continued rise up to scale 12 (along with some small frequency leak, as explained in Appendix B). This trend is not repeated frequently enough to be captured by the regression models. However, the variance plot gives us a hint about this, and the regression models on the other scales still do their job. Next, we examine more closely the self-similar noise at the hour scale (or high frequencies). In this series, these frequencies (corresponding to scales 1–6) can be modeled

Fig. 10 Automobile—generation with fractional noise

Fig. 11 Marginal Q–Q plots (slope and correlation coefficients in parentheses)

by fractional noise. Figure 10 shows a generated sequence with fractional noise, as identified by AWSOM (compare to Fig. 8, middle row). The fractional difference parameter⁸ is estimated as δ̂ ≡ −α̂/2 ≈ 0.276, and the amplitude is chosen to match the total variance in those scales. However, for unsupervised outlier detection this is not necessary: what would really constitute an outlier would be, for instance, days that (a) do not follow the daily and rush-hour patterns, or (b) whose variance in the fractional noise scales is very different. This can be captured automatically by the series components in the appropriate frequency sub-bands, which AWSOM identifies as a periodic component and bursty noise, respectively.

Temperature

This dataset consists of temperature measurements (in degrees Celsius) from small sensors that attach to the joystick port.⁹ Each time tick is one second; thus, the entire dataset spans approximately 10 hours. The interpretation is similar to that of Automobile, except that there isn't a strong noise component at any scale (see Fig. 6(c); the slope is not negative, as required for diagnosing self-similar noise). Also, observe that the wavelet variance (Fig. 9(e)) tells us that the strongest trends happen at the largest scales (from 10 on, or windows >1024). In other words, the most interesting activity is at time scales larger than about half an hour, which is indeed the case.

⁸This parameter is also related to the Hurst exponent.
⁹http://www.ices.cmu.edu/sensornets/.

6 Conclusions

Sensor networks are becoming increasingly popular, thanks to falling prices and increasing storage and processing power. We presented AWSOM, which achieves all of the following goals:
1. Concise patterns. AWSOM provides linear models with few coefficients; it can detect arbitrary periodic components, it gives information across several frequencies, and it can diagnose self-similarity and long-range dependence.
2. Streaming framework. We can update patterns in an "any-time" fashion, with one pass over the data, in time independent of stream size and using O(lg N) space (where N is the length of the sequence so far). Furthermore, AWSOM can do forecasting (directly, for the estimated model).
3. Unsupervised operation. Once we decide the largest AWSOM order, no further intervention is needed; the sensor can be left alone to collect information.
We showed real and synthetic data where our method captures the periodicities and burstiness, while manually selected AR (or even (S)ARIMA generalizations, which are not suitable for streams with limited resources) fails completely. AWSOM is an important first step toward hands-off stream mining, combining simplicity with modeling power. Continuous queries are useful for evidence gathering and hypothesis testing once we know what we are looking for. AWSOM is the first method to deal directly with the problem of unsupervised stream mining and pattern detection, and to fill this gap.

Acknowledgements We thank Becky Buchheit for her help with the automobile traffic datasets and Mike Bigrigg for the temperature sensor data.

Appendix A: Auto-Regressive Modeling

In its simplest form, an auto-regressive model of order p, or AR(p), expresses X_t as a linear combination of previous values, i.e., X_t = φ_1 X_{t−1} + ··· + φ_p X_{t−p} + ε_t or, more concisely,

φ(L) X_t = ε_t, where L is the lag operator and φ(L) is a polynomial defined on this operator:

L X_t ≡ X_{t−1},
φ(L) = 1 − φ_1 L − φ_2 L² − ··· − φ_p L^p,

and ε_t is a white noise process, i.e., E[ε_t] = 0 and Cov[ε_t, ε_{t−k}] = σ_ε² if k = 0, and 0 otherwise.

Using least squares, we can estimate σ_ε² from the sum of squared residuals (SSR). This is used as a measure of estimation error; when generating "future" points, ε_t is set to E[ε_t] ≡ 0. The next step up are auto-regressive moving average models. An ARMA(p, q) model expresses the values X_t as

φ(L) X_t = θ(L) ε_t,

where θ(L) = 1 − θ_1 L − ··· − θ_q L^q. Estimating the moving average coefficients θ_i is fairly involved. State-of-the-art methods use maximum-likelihood (ML) algorithms, employing iterative methods for nonlinear optimization, whose computational complexity depends exponentially on q. ARIMA(p, d, q) models are similar to ARMA(p, q) models, but operate on (1 − L)^d X_t, i.e., the dth order backward difference of X_t:

φ(L) (1 − L)^d X_t = θ(L) ε_t.

Finally, SARIMA(p, d, q)×(P, D, Q)_T models are used to deal with seasonalities, where

φ(L) Φ(L^T) (1 − L)^d (1 − L^T)^D X_t = θ(L) Θ(L^T) ε_t,

and where the seasonal difference polynomials,

Φ(L^T) = 1 − Φ_1 L^T − Φ_2 L^{2T} − ··· − Φ_P L^{PT},
Θ(L^T) = 1 − Θ_1 L^T − Θ_2 L^{2T} − ··· − Θ_Q L^{QT},

are similar to φ(L) and θ(L) but operate on lags that are multiples of a fixed period T. The value of T is yet another parameter that either needs to be estimated or set based on prior knowledge about the series X_t.
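To make the AR(p) definition concrete, here is a small self-contained example: simulate an AR(2) process and recover the φ coefficients by ordinary least squares (the batch analogue of the RLS recursion of Sect. 3.2). The coefficients, noise level, and sample size are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
phi, n = np.array([0.6, 0.3]), 5000                 # illustrative AR(2) coefficients
p = len(phi)

# Simulate X_t = phi_1 X_{t-1} + phi_2 X_{t-2} + eps_t
x = np.zeros(n)
for t in range(p, n):
    x[t] = phi @ x[t - p:t][::-1] + rng.normal(scale=0.5)

# Build the regression Xb = y: row m is [X_{m-1}, ..., X_{m-p}]
X = np.column_stack([x[p - i - 1:n - i - 1] for i in range(p)])
y = x[p:]
phi_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(phi_hat)                                      # close to [0.6, 0.3]
```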

Appendix B: More Wavelet Properties

Frequency Properties

Wavelet filters employed in practice can only approximate an ideal bandpass filter, since they are of finite length L. The practical implication is that wavelet coefficients at level l correspond roughly to the frequency range [1/2^{l+1}, 1/2^l] or, equivalently, periods in [2^l, 2^{l+1}] (see Fig. 12 for the actual correspondence). This has to be taken into account for precise interpretation of AWSOM models by an expert.

Fig. 12 Illustration of Haar and Daubechies-6 cascade gain (levels 3–5). The horizontal axis is frequency and the curves show how much of each frequency is "represented" at each wavelet level. As expected, D-6 filters (used in all experiments) have better band-pass properties

Wavelet Variance and Self-Similarity

The wavelet variance decomposes the variance of a sequence across scales. Due to space limitations, we mention basic definitions and facts; details can be found in [29].

Definition 4 (Wavelet Variance) If {W_{l,t}} is the DWT of a series {X_t}, then the wavelet variance V_l is defined as V_l = Var[W_{l,t}].

Under certain general conditions, V̂_l = (2^l/N) Σ_{t=1}^{N/2^l} W_{l,t}² is an unbiased estimator of V_l. Note that the sum is precisely the energy of {X_t} at scale l.

Definition 5 (Self-Similar Sequence) A sequence {X_t} is said to be self-similar, following a pure power-law process, if S_X(f) ∝ |f|^α, where −1 < α < 0 and S_X(f) is the SDF.¹⁰

It can be shown that V_l ≈ 2 ∫_{1/2^{l+1}}^{1/2^l} S_X(f) df; thus, if {X_t} is self-similar, then

log V_l ∝ l,

i.e., the plot of log V_l versus the level l should be linear. In fact, the slope of the log-power versus scale plot should be approximately equal to the exponent α. This fact, and how to estimate V_l, are what the reader needs to keep in mind.

References

1. M. Akay (ed.), Time Frequency and Wavelets in Biomedical Signal Processing (Wiley, New York, 1997)
2. A. Arasu, B. Babcock, S. Babu, J. McAlister, J. Widom, Characterizing memory requirements for queries over continuous data streams, in PODS (2002)
3. B. Babcock, C. Olston, Distributed top-k monitoring, in Proc. SIGMOD (2003)
4. J. Beran, Statistics for Long-Memory Processes (Chapman & Hall, London, 1994)
5. T. Bollerslev, Generalized autoregressive conditional heteroskedasticity. J. Econom. 31, 307–327 (1986)
6. P. Bonnet, J.E. Gehrke, P. Seshadri, Towards sensor database systems, in Proc. MDM (2001)
7. P.J. Brockwell, R.A. Davis, Time Series: Theory and Methods, 2nd edn. Springer Series in Statistics (Springer, Berlin, 1991)
8. A. Bulut, A.K. Singh, SWAT: hierarchical stream summarization in large networks, in Proc. 19th ICDE (2003)
9. L.R. Carley, G.R. Ganger, D. Nagle, MEMS-based integrated-circuit mass-storage systems. Commun. ACM 43(11), 72–80 (2000)

¹⁰The spectral density function (SDF) is the Fourier transform of the auto-covariance sequence (ACVS) S_{X,k} ≡ Cov[X_t, X_{t−k}]. Intuitively, it decomposes the variance into frequencies.

10. D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, S.B. Zdonik, Monitoring streams—a new class of data management applications, in Proc. VLDB (2002)
11. Y. Chen, G. Dong, J. Han, B.W. Wah, J. Wang, Multi-dimensional regression analysis of time-series data streams, in Proc. VLDB (2002)
12. J. Considine, F. Li, G. Kollios, J.W. Byers, Approximate aggregation techniques for sensor databases, in Proc. ICDE (2004)
13. G. Das, K.-I. Lin, H. Mannila, G. Renganathan, P. Smyth, Rule discovery from time series, in Proc. KDD (1998)
14. M. Datar, A. Gionis, P. Indyk, R. Motwani, Maintaining stream statistics over sliding windows, in Proc. SODA (2002)
15. M.H. DeGroot, M.J. Schervish, Probability and Statistics, 3rd edn. (Addison-Wesley, Reading, 2002)
16. A. Dobra, M.N. Garofalakis, J. Gehrke, R. Rastogi, Processing complex aggregate queries over data streams, in Proc. SIGMOD (2002)
17. C. Faloutsos, Searching Multimedia Databases by Content (Kluwer Academic, Norwell, 1996)
18. M.N. Garofalakis, P.B. Gibbons, Wavelet synopses with error guarantees, in Proc. SIGMOD (2002)
19. J. Gehrke, F. Korn, D. Srivastava, On computing correlated aggregates over continual data streams, in Proc. SIGMOD (2001)
20. R. Gencay, F. Selcuk, B. Whitcher, An Introduction to Wavelets and Other Filtering Methods in Finance and Economics (Academic Press, San Diego, 2001)
21. A.C. Gilbert, Y. Kotidis, S. Muthukrishnan, M. Strauss, Surfing wavelets on streams: one-pass summaries for approximate aggregate queries, in Proc. VLDB (2001)
22. S. Guha, N. Koudas, Approximating a data stream for querying and estimation: algorithms and performance evaluation, in Proc. ICDE (2002)
23. J. Hill, R. Szewczyk, A. Woo, S. Hollar, D. Culler, K. Pister, System architecture directions for networked sensors, in Proc. ASPLOS-IX (2000)
24. P. Indyk, N. Koudas, S. Muthukrishnan, Identifying representative trends in massive time series data sets using sketches, in Proc. VLDB (2000)
25. W. Leland, M. Taqqu, W. Willinger, D. Wilson, On the self-similar nature of Ethernet traffic. IEEE Trans. Netw. 2(1), 1–15 (1994)
26. S.R. Madden, M.A. Shah, J.M. Hellerstein, V. Raman, Continuously adaptive continuous queries over streams, in SIGMOD Conf. (2002)
27. C. Olston, J. Jiang, J. Widom, Adaptive filters for continuous queries over distributed data streams, in Proc. SIGMOD (2003)
28. T. Palpanas, M. Vlachos, E.J. Keogh, D. Gunopulos, W. Truppel, Online amnesic approximation of streaming time series, in Proc. ICDE (2004)
29. D.B. Percival, A.T. Walden, Wavelet Methods for Time Series Analysis (Cambridge University Press, Cambridge, 2000)
30. E. Riedel, C. Faloutsos, G.R. Ganger, D. Nagle, Data mining on an OLTP system (nearly) for free, in SIGMOD Conf. (2000)
31. Y. Tao, C. Faloutsos, D. Papadias, B. Liu, Prediction and indexing of moving objects with unknown motion patterns, in Proc. SIGMOD (2004)
32. A.S. Weigend, N.A. Gerschenfeld, Time Series Prediction: Forecasting the Future and Understanding the Past (Addison-Wesley, Reading, 1994)
33. B.-K. Yi, N. Sidiropoulos, T. Johnson, H. Jagadish, C. Faloutsos, A. Biliris, Online data mining for co-evolving time sequences, in Proc. ICDE (2000)
34. P. Young, Recursive Estimation and Time-Series Analysis: An Introduction (Springer, Berlin, 1984)

35. D. Zhang, D. Gunopulos, V.J. Tsotras, B. Seeger, Temporal aggregation over data streams using multiple granularities, in Proc. EDBT (2002)
36. Y. Zhu, D. Shasha, Statstream: statistical monitoring of thousands of data streams in real time, in Proc. VLDB (2002)
37. Y. Zhu, D. Shasha, Efficient elastic burst detection in data streams, in Proc. KDD (2003)
38. R. Zuidwijk, P. de Zeeuw, Fast algorithm for directional time-scale analysis using wavelets, in Proc. SPIE, Wavelet Applications in Signal and Image Processing VI, vol. 3458 (1998)

Conclusions and Looking Forward

Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi

Today, data streaming is part of the mainstream, and several data streaming products are now publicly available. Data streaming algorithms are powering complex event processing, predictive analytics, and big data applications in the cloud. In this final chapter, we provide an overview of current data streaming products and of applications of data streaming to cloud computing, anomaly detection, and predictive modeling. We also identify future research directions for mining and doing predictive analytics on data streams, especially in a distributed environment.

1 Data Streaming Products

Given the need for processing data streams in manufacturing, financial trading, logistics, telecom, health monitoring, and web analytics applications, a number of the leading software vendors have launched commercial data streaming products. We describe three prominent products below.
TIBCO StreamBase is a high-performance system for rapidly building applications that analyze and act on real-time streaming data.


StreamBase's EventFlow language represents stream processing flows and operators as graphical elements—operators can be placed on a canvas and connected with arrows that represent the flow of data. Furthermore, StreamBase's StreamSQL extends the standard SQL querying model and relational operators to also perform processing on continuous data streams. EventFlow and StreamSQL both extend the semantics of standard SQL by adding rich windowing constructs and stream-specific operators. Windows are definable over time or the number of messages, and define the scope of a multi-message operator such as an aggregate or a join. EventFlow and StreamSQL operators provide the capability to filter streams; to merge, combine, and correlate multiple streams; and to run time-window-based aggregations and computations on real-time streams. In addition, developers can easily extend the set of operators available to either EventFlow or StreamSQL modules by writing their own operators.
IBM System S provides a programming model and an execution platform for user-developed applications that ingest, filter, analyze, and correlate potentially massive volumes of continuous data streams. In System S, users create applications in the form of dataflow graphs consisting of analytics operators interconnected by streams. System S provides a toolkit of type-generic built-in stream processing operators, which include all basic stream-relational operators, as well as a number of plumbing operators (such as stream splitting, demultiplexing, etc.). It also provides an extensible operator framework, which supports the addition of new type-generic and configurable operators.
Microsoft StreamInsight implements a lightweight streaming architecture that supports highly parallel execution of continuous queries over high-speed event data. Developers can write their applications using Microsoft's .NET languages, such as Visual C#, leveraging the advanced language platform LINQ (Language Integrated Query) as an embedded query language. By using LINQ, developers can write SQL-like queries in a declarative fashion that process and correlate data from multiple streams into meaningful results. The optimizer and scheduler of the StreamInsight server in turn ensure optimal query performance. Incoming events are continuously streamed into standing queries in the StreamInsight server, which processes and transforms the data according to the logic defined in each query. The query result at the output can then be used to trigger specific actions. Each event consists of the following parts: (i) an event header that contains metadata defining the event kind (interval vs. point events) and one or more timestamps that define the time interval for the event, and (ii) the payload of the event, a .NET data structure that contains the data associated with the event.
StreamInsight provides the following query functionality: (i) a filter operation to express Boolean predicates over the event payload and discard events that do not satisfy the predicates; (ii) a grouping operation that partitions the incoming stream based on event properties such as location or ID and then applies other operations or complete query fragments to each group separately; (iii) hopping and sliding windows to define windows over event streams—a sliding window contains the events within the last X time units, at each point in time; (iv) built-in aggregations for sum, count, min, max, and average that typically operate on time windows; (v) a TopK operation to identify heavy hitters in an event stream; (vi) a powerful join operation that matches events from two sources if their times overlap and executes the join predicate specified on the payload fields; (vii) a union operation that multiplexes several input streams into a single output stream; and (viii) user-defined functions.

2 Data Streaming in the Cloud

Many internet applications, such as web site statistics monitoring and analytics, intrusion detection systems, and spam filters, have real-time processing requirements. To provide real-time analytics functionality to such applications, a number of systems for processing data streams in the cloud have been developed recently. We describe a few of these below.
Spark Streaming extends Apache Spark for doing large-scale stream processing. It chops up the live stream into batches of X seconds. Each batch of data is treated as a Resilient Distributed Dataset (RDD) and processed using RDD operations such as map, count, join, etc. Finally, the processed results of the RDD operations are returned in batches and can be persisted in HDFS. Thus, Spark represents a stream of data as a sequence of RDDs (referred to as a DStream) and applies transformations that modify data from one DStream to another using standard RDD operations. Spark Streaming also has support for defining Windowed DStreams that gather together data over a sliding window. Each window has two parameters, a window length and a sliding interval. At each time t, the function for time t is applied to the union of all RDDs of the parent DStream between times t and (t − window length). One can use Spark Streaming to get hashtags from a Twitter stream, count the number of hashtags in each window, or join incoming tweets with a file containing spam keywords to filter out bad tweets.
Amazon Kinesis enables users to collect and process large streams of data records in real time. A producer puts data records into Amazon Kinesis streams. For example, a web server sending log data to an Amazon Kinesis stream is a producer. Each data record consists of a sequence number, a partition key, and a data blob that is not interpreted by Kinesis. The data records in a stream are distributed into multiple shards, using the partition key associated with each data record to determine which shard a given data record belongs to. Specifically, an MD5 hash function is used to map partition keys to 128-bit integer values and to map associated data records to shards. When a stream is created, the number of shards for the stream is specified by the application. Each consumer reads data records from a particular shard.
Apache Kafka is a high-throughput publish-subscribe messaging system. Kafka maintains feeds of messages in categories called topics. Producers publish messages to a topic, and the messages are delivered to consumers who have subscribed to the topic. Each topic consists of multiple partitions, and producers can specify which partition each message within a topic is assigned to. Consumers label themselves with a consumer group name, and each message published to a topic is delivered to one consumer instance within each subscribing consumer group. Kafka provides both ordering guarantees and load balancing over a pool of consumer processes by assigning partitions in the topic to consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. Kafka provides a total order over messages only within a partition, not between different partitions in a topic. Use cases for Apache Kafka include messaging, website activity tracking (page views, searches), and stream processing.
Apache Storm makes it easy to reliably process unbounded streams of data. The basic primitives Storm provides for doing stream transformations are "spouts" and "bolts". A spout is a source of streams.
For example, a spout may connect to the Twitter API and emit a stream of tweets. A bolt consumes any number of input streams, does some processing, and possibly emits new streams. Bolts can do anything from running functions and filtering tuples to performing streaming aggregations and streaming joins. For example, a bolt may perform complex stream transformations like computing a stream of trending topics from a stream of tweets. Networks of spouts and bolts are packaged into a "topology" to do real-time computation. A topology is a graph of stream transformations where each node is a spout or a bolt. Links between nodes in the topology indicate how tuples should be passed around between nodes. For example, when a spout or bolt emits a tuple, it sends the tuple to every bolt to which it has an outgoing edge. Thus, a Storm topology consumes streams of data and processes those streams in arbitrarily complex ways, repartitioning the streams between each stage of the computation as needed.
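As an illustration of the windowed DStream operations described above, a minimal PySpark sketch might count hashtags over a sliding window. It reads from a local socket instead of the Twitter API, and the host, port, checkpoint path, and durations are placeholders; this is a sketch against the DStream-based Spark Streaming API, not production code.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="HashtagCounts")
ssc = StreamingContext(sc, batchDuration=1)          # 1-second micro-batches
ssc.checkpoint("/tmp/hashtag-checkpoint")            # needed for windowed state

lines = ssc.socketTextStream("localhost", 9999)      # placeholder text source
hashtags = (lines.flatMap(lambda line: line.split(" "))
                 .filter(lambda word: word.startswith("#")))

# Count each hashtag over a 60-second window that slides every 10 seconds
counts = hashtags.countByValueAndWindow(windowDuration=60, slideDuration=10)
counts.pprint()

ssc.start()
ssc.awaitTermination()
```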

3 Complex Event Processing

Complex event processing (CEP) techniques discover complex events by analyzing patterns and correlating other events. For instance, a CEP application may analyze tweet events in a Twitter stream to detect complex events such as an earthquake or a plane accident. Another CEP application in the manufacturing domain may generate an alert when a measurement exceeds a predefined threshold of time, temperature, or other value. CEP applications include algorithmic stock trading, network anomaly detection, credit-card fraud detection, demand spike detection in retail, and security monitoring. CEP applications may require complex pattern matching over high-speed data streams. For instance, network intrusion detection systems (NIDS) like Snort perform deep packet inspection on an IP traffic stream to check if the packet payload (and header) matches a given set of signatures of well-known security threats (e.g., viruses, worms). The signature patterns are frequently specified using general regular expressions, because regular expressions are a more expressive pattern language than plain strings. Thus, matching thousands of regular expressions on data streams is a critical capability in network security applications. Current matching algorithms employ finite automata-based representations to detect regular expression patterns.

Another example in the retail space is detection of spikes in product demand. Such spikes can occur because of different types of events; e.g., a snowstorm can trigger the sale of snow shovels. Similarly, there may be a jump in the sale of textbooks when students go back to school. One strategy to detect sudden demand spikes is to maintain a sliding window and compute the mean demand and standard deviation over the window. If the current observed demand deviates from the mean by more than two standard deviations, then we can generate a demand spike event. Numenta's Grok application employs an innovative approach to detect anomalies in numeric streaming data. It continuously trains a machine learning model on a sliding window of the most recently observed data and uses the model to predict the next most likely value based on the sliding window of previous values. Specifically, the model returns the probability that the next value in the data stream is the observed value. If multiple contiguous stream values have low occurrence probabilities, then Grok generates an anomaly event. Grok is being used to detect anomalies in streaming metrics (e.g., CPU utilization, disk writes) from server clusters.
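The two-standard-deviation rule sketched above takes only a few lines to implement; the window length, warm-up size, and threshold multiplier below are illustrative choices, not values from the text.

```python
from collections import deque
import math

class SpikeDetector:
    """Flag demand values deviating from the sliding-window mean by more
    than k standard deviations."""
    def __init__(self, window=288, k=2.0, warmup=30):
        self.values, self.k, self.warmup = deque(maxlen=window), k, warmup

    def observe(self, value):
        spike = False
        if len(self.values) >= self.warmup:                 # wait for enough history
            mean = sum(self.values) / len(self.values)
            var = sum((v - mean) ** 2 for v in self.values) / len(self.values)
            spike = abs(value - mean) > self.k * math.sqrt(var)
        self.values.append(value)
        return spike
```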

4 Big Data and Predictive Modeling

As businesses increasingly computerize and automate their operations, they have the ability to collect massive amounts of data and leverage the data to automate decision making at every step. Consequently, Big Data analytics and predictive modeling are ubiquitous today across a wide range of domains such as telecom, healthcare, internet, finance, retail, media, transportation, and manufacturing, and are leading to: (i) Improved customer experience—online content providers and web search companies analyze historical customer actions (e.g., clicks, searches, browsed pages) to learn about customer preferences and then show relevant content and search results that match a user's interests; (ii) Reduced costs—in transportation and manufacturing, analysis of past equipment operations data enables proactive prediction of failures and servicing of equipment prior to failures; and (iii) Higher revenues—in online advertising and e-commerce, targeting customers with relevant ads and recommending relevant products can lead to higher ad clicks and product purchases.
Data streaming algorithms are critical for Big Data analytics and predictive modeling. Below, we list applications of data streaming algorithms in various predictive modeling steps:
• Data cleaning. Collected data is frequently dirty, with errors, outliers, and missing values. For a numeric attribute, values greater than a certain threshold (e.g., the 98th percentile, or the mean plus two times the standard deviation) or smaller than a certain threshold (e.g., the 2nd percentile, or the mean minus two times the standard deviation) can be considered outliers. Similarly, for a categorical attribute, values that occur very frequently or infrequently can be regarded as outliers. Single-pass algorithms for computing a wide variety of data statistics, such as percentiles, mean, and standard deviation for numeric attributes, and value frequencies for categorical attributes, can help to detect outliers and prune them, thus ensuring high data quality.

• Data statistics and visualization. Univariate data statistics provide insights into attribute and target data distributions—these include different percentiles, mean, and standard deviation for numeric attributes, and cardinality and top frequent values/keywords for categorical and text attributes. Text attributes, in particular, may contain millions of keywords, making it difficult to store counts for individual keywords in main memory. These analyses require scalable algorithms for constructing histograms over numeric attribute values, and for computing attribute value frequencies and cardinality. Bivariate data statistics shed light on the degree of correlation between attribute and target values. Commonly used correlation metrics include Pearson's Correlation Coefficient for numeric attributes, and information gain or mutual information for categorical and text attributes. While Pearson's Coefficient can be computed efficiently in one pass over the data, information gain and mutual information estimation requires calculating attribute-target value pair frequencies using streaming techniques.
• Feature engineering. To obtain models with high predictive accuracy, raw data needs to be transformed into higher-level representations with predictive power. For instance, individual pixels within an image may not have much signal, but higher-level aggregations such as color histograms or shapes in the image may be a lot more predictive. A common transformation is numeric attribute binning, which can be carried out on large data sets using stream quantile computation algorithms. Another option is to construct interaction features over attribute pairs; e.g., in online advertising, an interaction feature involving user gender and ad category has more signal compared to the individual attributes themselves (since women are more likely to click on jewelry or beauty product ads, while men are more likely to click on ads related to automotive parts). Here, min-wise hashing techniques can be used to efficiently compute counts for frequent attribute value pairs belonging to interaction features. More recently, deep learning has shown promise for unsupervised learning of higher-level features in speech recognition, computer vision, and natural language processing applications. Efficient algorithms for learning deep neural networks on large datasets are an active area of research.
• Feature selection. Noisy and redundant features can cause trained models to over-fit the data, thus adversely impacting their accuracy. In order to prune such features, we need scalable algorithms for (i) computing correlations between attributes and the target using metrics such as Pearson's coefficient, information gain, or mutual information, and selecting highly correlated features with predictive power, (ii) computing pairwise attribute correlations and dropping redundant features that are highly correlated with other features, and (iii) reducing data dimensionality using techniques such as PCA, SVD, and matrix factorization.
• Model training. Online learning algorithms such as stochastic gradient descent (SGD) can be used to train linear and logistic regression models, and to perform tasks such as matrix factorization. SGD updates model parameters with the gradient computed for each example, as opposed to the entire dataset as is done by traditional batch learning algorithms like BFGS.
For clustering problems, pro- posed data streaming algorithms first independently cluster data partitions in an Conclusions and Looking Forward 535
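To make the data-cleaning step concrete, the following minimal Python sketch (illustrative only; the class and variable names are our own) maintains a running mean and standard deviation in a single pass using Welford's update, and then flags values outside the mean plus or minus two standard deviations as outliers.

```python
import math


class RunningStats:
    """Single-pass (Welford) tracker of mean and variance for a numeric attribute."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def push(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


# Toy data held in memory purely for illustration; the statistics themselves
# are computed in one pass.
data = [12.0, 14.5, 13.2, 15.1, 250.0, 13.8, 14.0]
stats = RunningStats()
for x in data:
    stats.push(x)

lo, hi = stats.mean - 2 * stats.std(), stats.mean + 2 * stats.std()
outliers = [x for x in data if x < lo or x > hi]
print(outliers)  # the extreme value 250.0 is flagged
```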

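The random-sample strategy mentioned under model training can be realized with classical reservoir sampling (Algorithm R). The sketch below is a minimal illustration; the function name and toy stream are hypothetical.

```python
import random


def reservoir_sample(stream, k, rng=random):
    """Keep a uniform random sample of size k in a single pass over the stream."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)        # inclusive on both ends
            if j < k:
                sample[j] = item         # replace with probability k / (i + 1)
    return sample


# Train on a random sample of a large stream without materializing the stream.
sample = reservoir_sample(range(1_000_000), k=1000)
print(len(sample))  # 1000
```

After processing n items, each item remains in the reservoir with probability k/n, which is exactly the uniformity guarantee the sampling-based training strategy relies on.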
In addition to data streaming algorithms, scaling predictive modeling to large, terabyte-sized datasets also requires parallelizing the algorithms over large machine clusters. Two main parallelization paradigms have been proposed in the literature.

• Loosely-coupled paradigm. In this paradigm, data is partitioned across machines, the algorithm is run locally on each machine, and the data summary/synopsis structures from the different machines are then combined into a single summary/synopsis structure. Thus, the loosely-coupled paradigm fits well with Map-Reduce-style computation, with reducers combining the synopses computed by the mappers. Note that multiple Map-Reduce iterations may be performed in this paradigm, with the synopsis computed at the end of each iteration used to initialize the synopsis at the beginning of the next iteration. As an example, consider k-means clustering. The synopsis contains the k cluster centroids. Each mapper assigns points in its data partition to the closest centroid and computes k new centroids for the new assignment. The mappers transmit the new cluster centroids, along with the size of each cluster, to the reducers, which combine the centroids into k new centroids using weighted averaging. The loosely-coupled paradigm is especially well suited to parallelizing algorithms whose synopses satisfy the composability property, that is, sketches computed on individual data partitions can be combined to yield a single synopsis for the entire dataset. Several synopsis structures, such as the Flajolet–Martin (FM) sketch and the Count-Min sketch, satisfy the composability property. Algorithms that maintain such synopses can be naturally parallelized using Map-Reduce, with mappers computing synopses for each data partition and the reducer composing a single synopsis from the synopses computed by the mappers (see the mergeable-sketch example after this list).

• Tightly-coupled paradigm. In many instances, synopses may not satisfy the composability property, making the loosely-coupled paradigm inapplicable for parallelization. For example, in online machine learning algorithms like SGD, parameter values computed for each data partition cannot easily be combined to yield the same parameter values that would have been obtained had the algorithm been run on the full dataset. To handle such scenarios, tight coupling between the parallel instances is needed. In the tightly-coupled paradigm, the synopsis structure is replicated across multiple machines, and the synopsis replicas are kept synchronized using distributed protocols while the algorithm is running on each machine.

A popular realization of the tightly-coupled paradigm for ML algorithms is a centralized parameter server. The parameter server synchronizes the values of parameter replicas distributed across the machines using asynchronous messages. Specifically, each time a parameter value is updated on a machine, the change in the parameter value is propagated to all the replicas through the parameter server. Furthermore, in order to reduce communication, changes are only propagated if they exceed a certain threshold value (a toy sketch of this threshold-based propagation appears below). Scaling ML algorithms through distributed implementations on machine clusters is currently an active area of research.
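To illustrate the composability property underlying the loosely-coupled paradigm, the following toy Count-Min sketch (a simplified, hash-based illustration, not a production implementation; all names are our own) supports a merge operation that combines sketches built independently on different data partitions by cell-wise addition.

```python
import random


class CountMinSketch:
    """Toy Count-Min sketch; sketches built on different partitions with the
    same seed can be merged by cell-wise addition (composability)."""

    def __init__(self, width=2048, depth=5, seed=42):
        self.width, self.depth = width, depth
        rng = random.Random(seed)  # same seed => same hash functions everywhere
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        for row, salt in enumerate(self.salts):
            yield row, hash((salt, item)) % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        return min(self.table[row][col] for row, col in self._cells(item))

    def merge(self, other):
        """Fold in a sketch computed on another data partition."""
        for row in range(self.depth):
            for col in range(self.width):
                self.table[row][col] += other.table[row][col]


# Mapper-style usage: sketch each partition independently, then compose.
s1, s2 = CountMinSketch(), CountMinSketch()
for x in ["a", "b", "a", "c"]:
    s1.add(x)
for x in ["a", "c", "c"]:
    s2.add(x)
s1.merge(s2)
print(s1.estimate("a"))  # >= 3 (Count-Min never underestimates)
```

FM sketches compose in the same spirit, except that the partition sketches are combined with a bitwise OR of their bitmaps rather than an addition of counters.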

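The threshold-based update propagation used by parameter servers can be illustrated with the following toy sketch (purely illustrative; real parameter servers are distributed, asynchronous systems, and the class names here are our own): each worker accumulates local parameter deltas and pushes them to a central server only once they exceed a threshold, after which the server refreshes every replica.

```python
class ToyParameterServer:
    """Toy centralized parameter server: workers push deltas, the server
    applies them and refreshes every registered replica."""

    def __init__(self, init_params):
        self.params = dict(init_params)
        self.workers = []

    def register(self, worker):
        self.workers.append(worker)
        worker.replica = dict(self.params)

    def push(self, name, delta):
        self.params[name] += delta
        for w in self.workers:                     # broadcast the new value
            w.replica[name] = self.params[name]    # (synchronously, for simplicity)


class ToyWorker:
    """Accumulates local updates and propagates them only once they exceed
    a threshold, in order to reduce communication."""

    def __init__(self, server, threshold=0.25):
        self.server, self.threshold = server, threshold
        self.pending = {}
        server.register(self)

    def local_update(self, name, delta):
        self.pending[name] = self.pending.get(name, 0.0) + delta
        if abs(self.pending[name]) >= self.threshold:
            self.server.push(name, self.pending.pop(name))


server = ToyParameterServer({"w": 0.0})
w1, w2 = ToyWorker(server), ToyWorker(server)
w1.local_update("w", 0.125)   # below threshold: kept local
w1.local_update("w", 0.25)    # accumulated 0.375 >= 0.25: pushed and broadcast
print(server.params["w"], w2.replica["w"])  # 0.375 0.375
```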
Popular systems for statistical and exploratory data analysis, such as R, load the entire dataset into main memory. Consequently, these systems are incapable of analyzing large, terabyte-sized datasets that do not fit in memory. To fill the gap, a number of systems for predictive analytics on big data have been developed in recent years. These systems rely on online algorithms and parallelization to different degrees in order to handle large datasets. We describe the salient characteristics of some prominent systems below.

Vowpal Wabbit (VW) supports training of a wide spectrum of ML models: linear models with a variety of loss functions (squared, logistic, hinge and quantile loss), matrix factorization models, and latent Dirichlet allocation (LDA) models. To scale to terabyte-sized datasets, it uses the online SGD algorithm to learn model parameters. Furthermore, it employs a parallel implementation of SGD based on the loosely-coupled paradigm that combines the parameter values computed on each data partition using simple averaging.

GraphLab provides a high-level programming interface for rapid development of distributed ML algorithms based on the tightly-coupled paradigm. Users specify dependencies among ML model parameters using a graph abstraction, the logic for updating parameter values based on those of their neighbors, and rules for propagating parameter updates to neighbors in the graph. The GraphLab distributed infrastructure partitions parameters across machines to minimize communication, appropriately updates parameter values and transparently propagates parameter updates. GraphLab already provides implementations of a large collection of ML methods, such as clustering, matrix factorization and LDA.

Skytree includes many ML methods such as k-means clustering, linear regression, gradient-boosted trees and random forests. Through a combination of new efficient ML algorithms and parallelization, Skytree is able to achieve significant speedups on massive datasets.

Azure ML supports a broad range of ML modeling methods, including scalable boosted decision trees, deep neural networks, logistic and linear regression, and clustering using a variant of the standard k-means approach. The linear regression implementation is based on the online SGD algorithm, while the logistic regression implementation relies on the batch BFGS learning algorithm.

Apache Mahout offers a scalable machine learning library with support for k-means clustering, matrix factorization, LDA, logistic regression and random forests. Mahout implements a streaming k-means algorithm that processes data points one by one and makes only one pass through the data. The algorithm maintains a set of cluster centroids in memory along with a distance threshold parameter. For a new data point, if its distance from the closest cluster centroid is less than the threshold, then the point is simply assigned to the closest cluster (and that cluster's centroid is updated). However, if the distance to the closest centroid is greater than the threshold, then the point starts a new cluster. Finally, if the number of clusters exceeds what can be held in memory, the distance threshold is increased and existing clusters are merged (a toy sketch of this thresholded streaming clustering appears below). The Mahout implementation of logistic regression and matrix factorization uses the online SGD algorithm.

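The thresholded one-pass clustering idea just described can be sketched as follows. This is a toy illustration in the spirit of the description above, not Mahout's actual implementation; the class and parameter names are our own.

```python
import math


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class StreamingClusterer:
    """Toy one-pass, threshold-based clustering sketch."""

    def __init__(self, threshold=1.0, max_clusters=100):
        self.threshold = threshold
        self.max_clusters = max_clusters
        self.centroids = []  # each entry is [coordinates, point count]

    def add(self, point):
        if self.centroids:
            c = min(self.centroids, key=lambda c: dist(c[0], point))
            if dist(c[0], point) < self.threshold:
                # Assign to the closest cluster and move its centroid.
                c[1] += 1
                c[0] = [x + (p - x) / c[1] for x, p in zip(c[0], point)]
                return
        self.centroids.append([list(point), 1])  # start a new cluster
        if len(self.centroids) > self.max_clusters:
            self._shrink()

    def _shrink(self):
        # Grow the threshold and greedily merge clusters that now fall within it.
        self.threshold *= 2
        merged = []
        for coords, count in self.centroids:
            for m in merged:
                if dist(m[0], coords) < self.threshold:
                    total = m[1] + count
                    m[0] = [(a * m[1] + b * count) / total for a, b in zip(m[0], coords)]
                    m[1] = total
                    break
            else:
                merged.append([coords, count])
        self.centroids = merged


clusterer = StreamingClusterer(threshold=1.5, max_clusters=10)
for p in [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]:
    clusterer.add(p)
print(len(clusterer.centroids))  # 2
```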
Spark MLlib is Spark's scalable machine learning library, consisting of common learning algorithms and utilities for classification, regression, clustering, collaborative filtering, and dimensionality reduction. MLlib computes basic statistics such as mean and variance for numeric attributes, as well as correlations based on Pearson's coefficient and Pearson's chi-squared tests. For classification and regression, it supports linear models with a variety of loss functions (such as squared, logistic and hinge loss) with L1 and L2 regularization, decision trees, and naïve Bayes. MLlib supports the SGD and L-BFGS optimization methods to compute the parameter values of linear models that minimize the loss function value. MLlib also supports collaborative filtering using the alternating least squares (ALS) algorithm to learn latent factors, k-means clustering, and dimensionality reduction methods such as SVD and PCA (a minimal usage sketch follows).
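As a brief illustration of how such a library is used, the following sketch fits a linear model with Spark's DataFrame-based pyspark.ml API. The data and column names are toy assumptions, and exact parameter defaults and solver selection depend on the Spark version installed.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Toy training data: two numeric features and a label.
df = spark.createDataFrame(
    [(1.0, 2.0, 5.0), (2.0, 1.0, 4.0), (3.0, 4.0, 11.0), (4.0, 3.0, 10.0)],
    ["x1", "x2", "label"],
)

# Assemble the raw columns into the single vector column MLlib expects.
features = VectorAssembler(inputCols=["x1", "x2"], outputCol="features").transform(df)

# Fit a regularized linear model; MLlib selects the underlying solver internally.
lr = LinearRegression(featuresCol="features", labelCol="label", regParam=0.01)
model = lr.fit(features)
print(model.coefficients, model.intercept)

spark.stop()
```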