Resource and Knowledge Discovery in Large Scale Dynamic Networks

The Pennsylvania State University The Graduate School Department of Computer Science and Engineering RESOURCE AND KNOWLEDGE DISCOVERY IN LARGE SCALE DYNAMIC NETWORKS A Thesis in Computer Science and Engineering by Mei Li c 2007 Mei Li ° Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy August 2007 The thesis of Mei Li has been reviewed and approved* by the following: Wang-Chien Lee Associate Professor of Computer Science and Engineering Thesis Co-Adviser Chair of Committee Anand Sivasubramaniam Professor of Computer Science and Engineering Thesis Co-Adviser Thomas La Porta Professor of Computer Science and Engineering Chao-Hsien Chu Associate Professor of Information Science and Technology Peng Liu Associate Professor of Information Science and Technology Raj Acharya Professor of Computer Science and Engineering Department Head *Signatures are on file in the Graduate School iii Abstract A massive amount of information, including multimedia files, relational data, scientific data, system usage logs, etc., is being collected and stored in a large number of host nodes connected as large scale dynamic networks (LSDNs), such as peer-to-peer (P2P) systems and sensor networks. A wide spectrum of applications, e.g., resource locating, network attack detection, market analysis, and scientific exploration, relies on efficient discovery and retrieval of resources and knowledge from the vast amount of data distributed in the network systems. With the rapid growth in the volume of data and the scale of networks, simply transferring the data generated at different host nodes to a single site for storing and processing becomes impractical, incurring excessive communication overhead while raising privacy concerns. Thus, a major challenge faced by LSDNs is to design decentralized infrastructures and algorithms that enable efficient resource and knowledge discovery in large scale dynamic networks. In this dissertation, various resource and knowledge discovery tasks ranging from simple tasks such as query processing to complex tasks such as network attack detection are systematically investigated, with a synergy of research efforts spanning multiple dis- ciplines, including distributed computing, network and data management. Efficient and robust infrastructures and algorithms are proposed to support these tasks, with particu- lar attention paid to various system issues including load balancing, maintenance, adaptivity to dynamic changes, data distribution and users access pattern in the networks. The superiority of these proposed ideas is demonstrated through extensive experiments iv using both synthetic data and real data. This dissertation provides profound insights on exploiting the vast amount of data for different applications, e.g., system performance tuning, network attack detection, market analysis, opens the new research direction on distributed data mining, and provides a solid foundation for exploring various data management tasks in the networks systems. It is expected that this study will have a deep impact on the deployment of various applications that mandate efficient management and mining of the vast amount of data distributed in the network systems. v Table of Contents List of Tables ...................................... xiii List of Figures ..................................... xiv Acknowledgments ................................... xviii Chapter 1. Introduction ................................ 1 1.1 Peer-to-PeerSystems........................... 2 1.2 ResearchIssues .............................. 3 1.2.1 ResourceDiscovery........................ 4 1.2.2 KnowledgeDiscovery. 5 1.3 DesignChallenges ............................ 6 1.4 DissertationOverview .......................... 7 1.5 Contributions ............................... 9 1.6 Roadmap ................................. 11 Chapter 2. Related Works .............................. 12 2.1 Resource Discovery and Knowledge Discovery in Centralized Systems 12 2.1.1 Resource Discovery in Centralized Systems . 12 2.1.2 ClusteringinCentralizedSystems . 15 2.2 DistributedResourceDiscovery . 16 2.2.1 SearchinUnstructuredOverlays . 16 vi 2.2.2 TopologyOptimization. 18 2.2.3 DesignofStructuredOverlays. 18 2.3 DistributedKnowledgeDiscovery . 23 Chapter 3. Content Based Search .......................... 28 3.1 DesignofSemanticSmallWorld . 28 3.1.1 Introduction............................ 28 3.1.2 Background............................ 32 3.1.2.1 SmallWorldNetwork . 32 3.1.2.2 pSearch and Rolling Index. 33 3.1.3 SemanticSmallWorld . 34 3.1.3.1 Overview ........................ 35 3.1.3.2 Constructing a k-Dimensional Semantic Small World 36 3.1.4 DimensionReduction . 41 3.1.4.1 NamingEncoding . 43 3.1.4.2 NamingEmbedding . 44 3.1.5 Search............................... 47 3.1.5.1 Search Space Resolution . 48 3.1.5.2 QueryAlgorithm. 51 3.1.6 SimulationSetup ......................... 53 3.1.7 SimulationResults . 57 3.1.7.1 Scalability ....................... 58 3.1.7.2 PeerClusteringEffects . 60 vii 3.1.7.3 Adaptivity to Data Distribution and Query Locality 62 3.1.7.4 TolerancetoPeerFailures. 66 3.1.7.5 LoadBalancing. 67 3.1.7.6 ResultQuality ..................... 69 3.1.8 Summary ............................. 72 3.2 ProcessingComplexQueries. 72 3.2.1 Introduction............................ 73 3.2.2 ResearchFormulation . 76 3.2.3 RangeQueryProcessing . 77 3.2.3.1 Range Query Algorithm . 78 3.2.3.2 Multi-Destination Query Propagation . 79 3.2.4 KNNQueryProcessing . 84 3.2.4.1 Incremental KNN Query Algorithm . 85 3.2.4.2 KNN Search Space Refinement . 85 3.2.4.3 ApproximateKNNQuery. 90 3.2.5 SimulationSetup ......................... 90 3.2.6 SimulationResults . 91 3.2.6.1 RangeQuery ...................... 91 3.2.6.2 KNNQuery ...................... 96 3.2.7 Summary ............................. 104 Chapter 4. Managing Multi-Dimensional Data Objects .............. 106 4.1 Introduction................................ 107 viii 4.2 Preliminaries ............................... 110 4.2.1 Background............................ 110 4.2.1.1 BalancedTreeIndex. 110 4.2.1.2 SkipGraph....................... 112 4.2.1.3 Wavelet......................... 113 4.2.2 SystemModel........................... 116 4.3 DistributedPeerTree(DPTree). 116 4.3.1 OverviewofDPTree . 117 4.3.2 Overlay Structure and Navigation Algorithm . 119 4.3.2.1 Tree-AwareOverlay . 119 4.3.2.2 Aggressive Navigation . 121 4.3.3 Wavelet-assisted Load balancing . 125 4.3.3.1 LoadMonitoring . 128 4.3.3.2 Load Adjusting . 131 4.3.4 MaintenanceinDPTree . 135 4.3.4.1 Peer Join/Leave/Failure . 136 4.3.4.2 Data Insertion/Deletion . 137 4.4 ApplicationofDPTree . 141 4.4.1 RangeQuery ........................... 142 4.4.2 KNearestNeighborQuery . 142 4.5 PerformanceEvaluation . 143 4.5.1 LoadBalancing.......................... 143 4.5.1.1 EffectofNetworkSize . 144 ix 4.5.1.2 Effect of Initial Load Distribution . 146 4.5.1.3 EffectofWaveletSize . 146 4.5.1.4 Enhancement on LR Mechanisms . 147 4.5.2 Routing .............................. 149 4.5.2.1 EffectofNetworkSize . 149 4.5.2.2 EffectofPeerDistribution . 151 4.5.3 QueryPerformance. 151 4.5.3.1 PointQuery ...................... 152 4.5.3.2 RangeQuery ...................... 154 4.5.3.3 KNNQuery ...................... 154 4.5.4 MaintenanceOverheads . 155 4.5.4.1 Overlay Maintenance Overheads . 155 4.5.4.2 Tree Maintenance Overheads . 157 4.6 Summary ................................. 157 Chapter 5. Identifying Frequent Items ........................ 159 5.1 Introduction................................ 159 5.2 In-NetworkFiltering ........................... 166 5.2.1 Aggregate Computation . 168 5.2.1.1 FormingHierarchy . 169 5.2.1.2 Computing Aggregates . 171 5.2.1.3 Updating Hierarchy . 172 5.2.2 CandidateFiltering. 172 x 5.2.2.1 ItemPartitioning . 173 5.2.2.2 Candidate Set Optimization . 173 5.2.3 CandidateVerification . 176 5.3 AnalysisofnetFilter ........................... 177 5.3.1 CostModelfornetFilter. 179 5.3.2 Cost Model for the Naive Approach . 180 5.3.3 Optimal Setting for the Size of Filters (g) ........... 180 5.3.4 Optimal Setting for the Number of Filters (f)......... 181 5.3.5 Setting netFilter Optimally In Practice . 183 5.4 PerformanceEvaluation . 186 5.4.1 EffectoftheFilterSizes . 187 5.4.2 EffectoftheNumberofFilters . 190 5.4.3 EffectofDataSkewness . 192 5.4.4 EffectofThreshold. 195 5.5 Summary ................................. 195 Chapter 6. Monitoring Changes on the Data Distribution ............. 197 6.1 Introduction................................ 197 6.1.1 ProblemFormulation. 199 6.1.2 Contributions........................... 202 6.2 Preliminaries ............................... 204 6.3 Wavenet.................................. 205 6.3.1 DataSummarization . 206 xi 6.3.1.1 The Weakness of Histogram . 206 6.3.1.2 Localwavelet . 209 6.3.2 DesignIssuesofWavenet . 212 6.3.3 Localwavelet Construction in a Sparsely Populated Data Do- main................................ 213 6.3.4 Adaptive Monitoring . 218 6.3.4.1 LocalFilterSetup . 219 6.3.4.2 FilterResolution . 223 6.4 PerformanceEvaluation . 225 6.4.1 ExperimentsSetup . 225 6.4.1.1 DataSets........................ 226 6.4.1.2 Setting of Localwavelet and Histogram . 226 6.4.2 PerformanceMetrics . 228 6.4.3 Results .............................. 229 6.4.3.1 SummaryErrors . 230 6.4.3.2 EffectoftheThreshold . 232 6.4.3.3 Effect of the Summary Size . 234 6.4.3.4 Effect of Adaptive Monitoring . 234 6.4.3.5 EffectofLocalwaveletRefinement . 237 6.5 Summary ................................. 240 Chapter 7. Distributed

Resource and Knowledge Discovery in Large Scale Dynamic Networks

Text Mining Course for KNIME Analytics Platform

A Survey of Clustering with Deep Learning: from the Perspective of Network Architecture

A Tree-Based Dictionary Learning Framework

Enhancement of DBSCAN Algorithm and Transparency Clustering Of

Comparison of Dimensionality Reduction Techniques on Audio Signals

DBSCAN++: Towards Fast and Scalable Density Clustering

Robust Hierarchical Clustering∗

Density-Based Clustering of Static and Dynamic Functional MRI Connectivity

Space-Time Hierarchical Clustering for Identifying Clusters in Spatiotemporal Point Data

Chapter G03 – Multivariate Methods

A Study of Hierarchical Clustering Algorithm

An Improvement for DBSCAN Algorithm for Best Results in Varied Densities