Resource and Knowledge Discovery in Large Scale Dynamic Networks

Total Page:16

File Type:pdf, Size:1020Kb

Resource and Knowledge Discovery in Large Scale Dynamic Networks The Pennsylvania State University The Graduate School Department of Computer Science and Engineering RESOURCE AND KNOWLEDGE DISCOVERY IN LARGE SCALE DYNAMIC NETWORKS A Thesis in Computer Science and Engineering by Mei Li c 2007 Mei Li ° Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy August 2007 The thesis of Mei Li has been reviewed and approved* by the following: Wang-Chien Lee Associate Professor of Computer Science and Engineering Thesis Co-Adviser Chair of Committee Anand Sivasubramaniam Professor of Computer Science and Engineering Thesis Co-Adviser Thomas La Porta Professor of Computer Science and Engineering Chao-Hsien Chu Associate Professor of Information Science and Technology Peng Liu Associate Professor of Information Science and Technology Raj Acharya Professor of Computer Science and Engineering Department Head *Signatures are on file in the Graduate School iii Abstract A massive amount of information, including multimedia files, relational data, scientific data, system usage logs, etc., is being collected and stored in a large number of host nodes connected as large scale dynamic networks (LSDNs), such as peer-to-peer (P2P) systems and sensor networks. A wide spectrum of applications, e.g., resource locating, network attack detection, market analysis, and scientific exploration, relies on efficient discovery and retrieval of resources and knowledge from the vast amount of data distributed in the network systems. With the rapid growth in the volume of data and the scale of networks, simply transferring the data generated at different host nodes to a single site for storing and processing becomes impractical, incurring excessive communication overhead while raising privacy concerns. Thus, a major challenge faced by LSDNs is to design decentralized infrastructures and algorithms that enable efficient resource and knowledge discovery in large scale dynamic networks. In this dissertation, various resource and knowledge discovery tasks ranging from simple tasks such as query processing to complex tasks such as network attack detection are systematically investigated, with a synergy of research efforts spanning multiple dis- ciplines, including distributed computing, network and data management. Efficient and robust infrastructures and algorithms are proposed to support these tasks, with particu- lar attention paid to various system issues including load balancing, maintenance, adap- tivity to dynamic changes, data distribution and users access pattern in the networks. The superiority of these proposed ideas is demonstrated through extensive experiments iv using both synthetic data and real data. This dissertation provides profound insights on exploiting the vast amount of data for different applications, e.g., system performance tuning, network attack detection, market analysis, opens the new research direction on distributed data mining, and provides a solid foundation for exploring various data man- agement tasks in the networks systems. It is expected that this study will have a deep impact on the deployment of various applications that mandate efficient management and mining of the vast amount of data distributed in the network systems. v Table of Contents List of Tables ...................................... xiii List of Figures ..................................... xiv Acknowledgments ................................... xviii Chapter 1. Introduction ................................ 1 1.1 Peer-to-PeerSystems........................... 2 1.2 ResearchIssues .............................. 3 1.2.1 ResourceDiscovery........................ 4 1.2.2 KnowledgeDiscovery. 5 1.3 DesignChallenges ............................ 6 1.4 DissertationOverview .......................... 7 1.5 Contributions ............................... 9 1.6 Roadmap ................................. 11 Chapter 2. Related Works .............................. 12 2.1 Resource Discovery and Knowledge Discovery in Centralized Systems 12 2.1.1 Resource Discovery in Centralized Systems . 12 2.1.2 ClusteringinCentralizedSystems . 15 2.2 DistributedResourceDiscovery . 16 2.2.1 SearchinUnstructuredOverlays . 16 vi 2.2.2 TopologyOptimization. 18 2.2.3 DesignofStructuredOverlays. 18 2.3 DistributedKnowledgeDiscovery . 23 Chapter 3. Content Based Search .......................... 28 3.1 DesignofSemanticSmallWorld . 28 3.1.1 Introduction............................ 28 3.1.2 Background............................ 32 3.1.2.1 SmallWorldNetwork . 32 3.1.2.2 pSearch and Rolling Index. 33 3.1.3 SemanticSmallWorld . 34 3.1.3.1 Overview ........................ 35 3.1.3.2 Constructing a k-Dimensional Semantic Small World 36 3.1.4 DimensionReduction . 41 3.1.4.1 NamingEncoding . 43 3.1.4.2 NamingEmbedding . 44 3.1.5 Search............................... 47 3.1.5.1 Search Space Resolution . 48 3.1.5.2 QueryAlgorithm. 51 3.1.6 SimulationSetup ......................... 53 3.1.7 SimulationResults . 57 3.1.7.1 Scalability ....................... 58 3.1.7.2 PeerClusteringEffects . 60 vii 3.1.7.3 Adaptivity to Data Distribution and Query Locality 62 3.1.7.4 TolerancetoPeerFailures. 66 3.1.7.5 LoadBalancing. 67 3.1.7.6 ResultQuality ..................... 69 3.1.8 Summary ............................. 72 3.2 ProcessingComplexQueries. 72 3.2.1 Introduction............................ 73 3.2.2 ResearchFormulation . 76 3.2.3 RangeQueryProcessing . 77 3.2.3.1 Range Query Algorithm . 78 3.2.3.2 Multi-Destination Query Propagation . 79 3.2.4 KNNQueryProcessing . 84 3.2.4.1 Incremental KNN Query Algorithm . 85 3.2.4.2 KNN Search Space Refinement . 85 3.2.4.3 ApproximateKNNQuery. 90 3.2.5 SimulationSetup ......................... 90 3.2.6 SimulationResults . 91 3.2.6.1 RangeQuery ...................... 91 3.2.6.2 KNNQuery ...................... 96 3.2.7 Summary ............................. 104 Chapter 4. Managing Multi-Dimensional Data Objects .............. 106 4.1 Introduction................................ 107 viii 4.2 Preliminaries ............................... 110 4.2.1 Background............................ 110 4.2.1.1 BalancedTreeIndex. 110 4.2.1.2 SkipGraph....................... 112 4.2.1.3 Wavelet......................... 113 4.2.2 SystemModel........................... 116 4.3 DistributedPeerTree(DPTree). 116 4.3.1 OverviewofDPTree . 117 4.3.2 Overlay Structure and Navigation Algorithm . 119 4.3.2.1 Tree-AwareOverlay . 119 4.3.2.2 Aggressive Navigation . 121 4.3.3 Wavelet-assisted Load balancing . 125 4.3.3.1 LoadMonitoring . 128 4.3.3.2 Load Adjusting . 131 4.3.4 MaintenanceinDPTree . 135 4.3.4.1 Peer Join/Leave/Failure . 136 4.3.4.2 Data Insertion/Deletion . 137 4.4 ApplicationofDPTree . 141 4.4.1 RangeQuery ........................... 142 4.4.2 KNearestNeighborQuery . 142 4.5 PerformanceEvaluation . 143 4.5.1 LoadBalancing.......................... 143 4.5.1.1 EffectofNetworkSize . 144 ix 4.5.1.2 Effect of Initial Load Distribution . 146 4.5.1.3 EffectofWaveletSize . 146 4.5.1.4 Enhancement on LR Mechanisms . 147 4.5.2 Routing .............................. 149 4.5.2.1 EffectofNetworkSize . 149 4.5.2.2 EffectofPeerDistribution . 151 4.5.3 QueryPerformance. 151 4.5.3.1 PointQuery ...................... 152 4.5.3.2 RangeQuery ...................... 154 4.5.3.3 KNNQuery ...................... 154 4.5.4 MaintenanceOverheads . 155 4.5.4.1 Overlay Maintenance Overheads . 155 4.5.4.2 Tree Maintenance Overheads . 157 4.6 Summary ................................. 157 Chapter 5. Identifying Frequent Items ........................ 159 5.1 Introduction................................ 159 5.2 In-NetworkFiltering ........................... 166 5.2.1 Aggregate Computation . 168 5.2.1.1 FormingHierarchy . 169 5.2.1.2 Computing Aggregates . 171 5.2.1.3 Updating Hierarchy . 172 5.2.2 CandidateFiltering. 172 x 5.2.2.1 ItemPartitioning . 173 5.2.2.2 Candidate Set Optimization . 173 5.2.3 CandidateVerification . 176 5.3 AnalysisofnetFilter ........................... 177 5.3.1 CostModelfornetFilter. 179 5.3.2 Cost Model for the Naive Approach . 180 5.3.3 Optimal Setting for the Size of Filters (g) ........... 180 5.3.4 Optimal Setting for the Number of Filters (f)......... 181 5.3.5 Setting netFilter Optimally In Practice . 183 5.4 PerformanceEvaluation . 186 5.4.1 EffectoftheFilterSizes . 187 5.4.2 EffectoftheNumberofFilters . 190 5.4.3 EffectofDataSkewness . 192 5.4.4 EffectofThreshold. 195 5.5 Summary ................................. 195 Chapter 6. Monitoring Changes on the Data Distribution ............. 197 6.1 Introduction................................ 197 6.1.1 ProblemFormulation. 199 6.1.2 Contributions........................... 202 6.2 Preliminaries ............................... 204 6.3 Wavenet.................................. 205 6.3.1 DataSummarization . 206 xi 6.3.1.1 The Weakness of Histogram . 206 6.3.1.2 Localwavelet . 209 6.3.2 DesignIssuesofWavenet . 212 6.3.3 Localwavelet Construction in a Sparsely Populated Data Do- main................................ 213 6.3.4 Adaptive Monitoring . 218 6.3.4.1 LocalFilterSetup . 219 6.3.4.2 FilterResolution . 223 6.4 PerformanceEvaluation . 225 6.4.1 ExperimentsSetup . 225 6.4.1.1 DataSets........................ 226 6.4.1.2 Setting of Localwavelet and Histogram . 226 6.4.2 PerformanceMetrics . 228 6.4.3 Results .............................. 229 6.4.3.1 SummaryErrors . 230 6.4.3.2 EffectoftheThreshold . 232 6.4.3.3 Effect of the Summary Size . 234 6.4.3.4 Effect of Adaptive Monitoring . 234 6.4.3.5 EffectofLocalwaveletRefinement . 237 6.5 Summary ................................. 240 Chapter 7. Distributed
Recommended publications
  • Text Mining Course for KNIME Analytics Platform
    Text Mining Course for KNIME Analytics Platform KNIME AG Copyright © 2018 KNIME AG Table of Contents 1. The Open Analytics Platform 2. The Text Processing Extension 3. Importing Text 4. Enrichment 5. Preprocessing 6. Transformation 7. Classification 8. Visualization 9. Clustering 10. Supplementary Workflows Licensed under a Creative Commons Attribution- ® Copyright © 2018 KNIME AG 2 Noncommercial-Share Alike license 1 https://creativecommons.org/licenses/by-nc-sa/4.0/ Overview KNIME Analytics Platform Licensed under a Creative Commons Attribution- ® Copyright © 2018 KNIME AG 3 Noncommercial-Share Alike license 1 https://creativecommons.org/licenses/by-nc-sa/4.0/ What is KNIME Analytics Platform? • A tool for data analysis, manipulation, visualization, and reporting • Based on the graphical programming paradigm • Provides a diverse array of extensions: • Text Mining • Network Mining • Cheminformatics • Many integrations, such as Java, R, Python, Weka, H2O, etc. Licensed under a Creative Commons Attribution- ® Copyright © 2018 KNIME AG 4 Noncommercial-Share Alike license 2 https://creativecommons.org/licenses/by-nc-sa/4.0/ Visual KNIME Workflows NODES perform tasks on data Not Configured Configured Outputs Inputs Executed Status Error Nodes are combined to create WORKFLOWS Licensed under a Creative Commons Attribution- ® Copyright © 2018 KNIME AG 5 Noncommercial-Share Alike license 3 https://creativecommons.org/licenses/by-nc-sa/4.0/ Data Access • Databases • MySQL, MS SQL Server, PostgreSQL • any JDBC (Oracle, DB2, …) • Files • CSV, txt
    [Show full text]
  • A Survey of Clustering with Deep Learning: from the Perspective of Network Architecture
    Received May 12, 2018, accepted July 2, 2018, date of publication July 17, 2018, date of current version August 7, 2018. Digital Object Identifier 10.1109/ACCESS.2018.2855437 A Survey of Clustering With Deep Learning: From the Perspective of Network Architecture ERXUE MIN , XIFENG GUO, QIANG LIU , (Member, IEEE), GEN ZHANG , JIANJING CUI, AND JUN LONG College of Computer, National University of Defense Technology, Changsha 410073, China Corresponding authors: Erxue Min ([email protected]) and Qiang Liu ([email protected]) This work was supported by the National Natural Science Foundation of China under Grant 60970034, Grant 61105050, Grant 61702539, and Grant 6097003. ABSTRACT Clustering is a fundamental problem in many data-driven application domains, and clustering performance highly depends on the quality of data representation. Hence, linear or non-linear feature transformations have been extensively used to learn a better data representation for clustering. In recent years, a lot of works focused on using deep neural networks to learn a clustering-friendly representation, resulting in a significant increase of clustering performance. In this paper, we give a systematic survey of clustering with deep learning in views of architecture. Specifically, we first introduce the preliminary knowledge for better understanding of this field. Then, a taxonomy of clustering with deep learning is proposed and some representative methods are introduced. Finally, we propose some interesting future opportunities of clustering with deep learning and give some conclusion remarks. INDEX TERMS Clustering, deep learning, data representation, network architecture. I. INTRODUCTION property of highly non-linear transformation. For the sim- Data clustering is a basic problem in many areas, such as plicity of description, we call clustering methods with deep machine learning, pattern recognition, computer vision, data learning as deep clustering1 in this paper.
    [Show full text]
  • A Tree-Based Dictionary Learning Framework
    A Tree-based Dictionary Learning Framework Renato Budinich∗ & Gerlind Plonka† June 11, 2020 We propose a new outline for adaptive dictionary learning methods for sparse encoding based on a hierarchical clustering of the training data. Through recursive application of a clustering method the data is organized into a binary partition tree representing a multiscale structure. The dictionary atoms are defined adaptively based on the data clusters in the partition tree. This approach can be interpreted as a generalization of a discrete Haar wavelet transform. Furthermore, any prior knowledge on the wanted structure of the dictionary elements can be simply incorporated. The computational complex- ity of our proposed algorithm depends on the employed clustering method and on the chosen similarity measure between data points. Thanks to the multi- scale properties of the partition tree, our dictionary is structured: when using Orthogonal Matching Pursuit to reconstruct patches from a natural image, dic- tionary atoms corresponding to nodes being closer to the root node in the tree have a tendency to be used with greater coefficients. Keywords. Multiscale dictionary learning, hierarchical clustering, binary partition tree, gen- eralized adaptive Haar wavelet transform, K-means, orthogonal matching pursuit 1 Introduction arXiv:1909.03267v2 [cs.LG] 9 Jun 2020 In many applications one is interested in sparsely approximating a set of N n-dimensional data points Y , columns of an n N real matrix Y = (Y1;:::;Y ). Assuming that the data j × N ∗R. Budinich is with the Fraunhofer SCS, Nordostpark 93, 90411 Nürnberg, Germany, e-mail: re- [email protected] †G. Plonka is with the Institute for Numerical and Applied Mathematics, University of Göttingen, Lotzestr.
    [Show full text]
  • Enhancement of DBSCAN Algorithm and Transparency Clustering Of
    IJRECE VOL. 5 ISSUE 4 OCT.-DEC. 2017 ISSN: 2393-9028 (PRINT) | ISSN: 2348-2281 (ONLINE) Enhancement of DBSCAN Algorithm and Transparency Clustering of Large Datasets Kumari Silky1, Nitin Sharma2 1Research Scholar, 2Assistant Professor Institute of Engineering Technology, Alwar, Rajasthan, India Abstract: The data mining is the technique which can extract process of dividing the data into similar objects groups. A level useful information from the raw data. The clustering is the of simplification is achieved in case of less number of clusters technique of data mining which can group similar and dissimilar involved. But because of less number of clusters some of the type of information. The density based clustering is the type of fine details have been lost. With the use or help of clusters the clustering which can cluster data according to the density. The data is modeled. According to the machine learning view, the DBSCAN is the algorithm of density based clustering in which clusters search in a unsupervised manner and it is also as the EPS value is calculated which define radius of the cluster. The hidden patterns. The system that comes as an outcome defines a Euclidian distance will be calculated using neural networks data concept [3]. The clustering mechanism does not have only which calculate similarity in the more effective manner. The one step it can be analyzed from the definition of clustering. proposed algorithm is implemented in MATLAB and results are Apart from partitional and hierarchical clustering algorithms analyzed in terms of accuracy, execution time. number of new techniques has been evolved for the purpose of clustering of data.
    [Show full text]
  • Comparison of Dimensionality Reduction Techniques on Audio Signals
    Comparison of Dimensionality Reduction Techniques on Audio Signals Tamás Pál, Dániel T. Várkonyi Eötvös Loránd University, Faculty of Informatics, Department of Data Science and Engineering, Telekom Innovation Laboratories, Budapest, Hungary {evwolcheim, varkonyid}@inf.elte.hu WWW home page: http://t-labs.elte.hu Abstract: Analysis of audio signals is widely used and this work: car horn, dog bark, engine idling, gun shot, and very effective technique in several domains like health- street music [5]. care, transportation, and agriculture. In a general process Related work is presented in Section 2, basic mathe- the output of the feature extraction method results in huge matical notation used is described in Section 3, while the number of relevant features which may be difficult to pro- different methods of the pipeline are briefly presented in cess. The number of features heavily correlates with the Section 4. Section 5 contains data about the evaluation complexity of the following machine learning method. Di- methodology, Section 6 presents the results and conclu- mensionality reduction methods have been used success- sions are formulated in Section 7. fully in recent times in machine learning to reduce com- The goal of this paper is to find a combination of feature plexity and memory usage and improve speed of following extraction and dimensionality reduction methods which ML algorithms. This paper attempts to compare the state can be most efficiently applied to audio data visualization of the art dimensionality reduction techniques as a build- in 2D and preserve inter-class relations the most. ing block of the general process and analyze the usability of these methods in visualizing large audio datasets.
    [Show full text]
  • DBSCAN++: Towards Fast and Scalable Density Clustering
    DBSCAN++: Towards fast and scalable density clustering Jennifer Jang 1 Heinrich Jiang 2 Abstract 2, it quickly starts to exhibit quadratic behavior in high di- mensions and/or when n becomes large. In fact, we show in DBSCAN is a classical density-based clustering Figure1 that even with a simple mixture of 3-dimensional procedure with tremendous practical relevance. Gaussians, DBSCAN already starts to show quadratic be- However, DBSCAN implicitly needs to compute havior. the empirical density for each sample point, lead- ing to a quadratic worst-case time complexity, The quadratic runtime for these density-based procedures which is too slow on large datasets. We propose can be seen from the fact that they implicitly must compute DBSCAN++, a simple modification of DBSCAN density estimates for each data point, which is linear time which only requires computing the densities for a in the worst case for each query. In the case of DBSCAN, chosen subset of points. We show empirically that, such queries are proximity-based. There has been much compared to traditional DBSCAN, DBSCAN++ work done in using space-partitioning data structures such can provide not only competitive performance but as KD-Trees (Bentley, 1975) and Cover Trees (Beygelzimer also added robustness in the bandwidth hyperpa- et al., 2006) to improve query times, but these structures are rameter while taking a fraction of the runtime. all still linear in the worst-case. Another line of work that We also present statistical consistency guarantees has had practical success is in approximate nearest neigh- showing the trade-off between computational cost bor methods (e.g.
    [Show full text]
  • Robust Hierarchical Clustering∗
    Journal of Machine Learning Research 15 (2014) 4011-4051 Submitted 12/13; Published 12/14 Robust Hierarchical Clustering∗ Maria-Florina Balcan [email protected] School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213, USA Yingyu Liang [email protected] Department of Computer Science Princeton University Princeton, NJ 08540, USA Pramod Gupta [email protected] Google, Inc. 1600 Amphitheatre Parkway Mountain View, CA 94043, USA Editor: Sanjoy Dasgupta Abstract One of the most widely used techniques for data clustering is agglomerative clustering. Such algorithms have been long used across many different fields ranging from computational biology to social sciences to computer vision in part because their output is easy to interpret. Unfortunately, it is well known, however, that many of the classic agglomerative clustering algorithms are not robust to noise. In this paper we propose and analyze a new robust algorithm for bottom-up agglomerative clustering. We show that our algorithm can be used to cluster accurately in cases where the data satisfies a number of natural properties and where the traditional agglomerative algorithms fail. We also show how to adapt our algorithm to the inductive setting where our given data is only a small random sample of the entire data set. Experimental evaluations on synthetic and real world data sets show that our algorithm achieves better performance than other hierarchical algorithms in the presence of noise. Keywords: unsupervised learning, clustering, agglomerative algorithms, robustness 1. Introduction Many data mining and machine learning applications ranging from computer vision to biology problems have recently faced an explosion of data. As a consequence it has become increasingly important to develop effective, accurate, robust to noise, fast, and general clustering algorithms, accessible to developers and researchers in a diverse range of areas.
    [Show full text]
  • Density-Based Clustering of Static and Dynamic Functional MRI Connectivity
    Rangaprakash et al. Brain Inf. (2020) 7:19 https://doi.org/10.1186/s40708-020-00120-2 Brain Informatics RESEARCH Open Access Density-based clustering of static and dynamic functional MRI connectivity features obtained from subjects with cognitive impairment D. Rangaprakash1,2,3, Toluwanimi Odemuyiwa4, D. Narayana Dutt5, Gopikrishna Deshpande6,7,8,9,10,11,12,13* and Alzheimer’s Disease Neuroimaging Initiative Abstract Various machine-learning classifcation techniques have been employed previously to classify brain states in healthy and disease populations using functional magnetic resonance imaging (fMRI). These methods generally use super- vised classifers that are sensitive to outliers and require labeling of training data to generate a predictive model. Density-based clustering, which overcomes these issues, is a popular unsupervised learning approach whose util- ity for high-dimensional neuroimaging data has not been previously evaluated. Its advantages include insensitivity to outliers and ability to work with unlabeled data. Unlike the popular k-means clustering, the number of clusters need not be specifed. In this study, we compare the performance of two popular density-based clustering methods, DBSCAN and OPTICS, in accurately identifying individuals with three stages of cognitive impairment, including Alzhei- mer’s disease. We used static and dynamic functional connectivity features for clustering, which captures the strength and temporal variation of brain connectivity respectively. To assess the robustness of clustering to noise/outliers, we propose a novel method called recursive-clustering using additive-noise (R-CLAN). Results demonstrated that both clustering algorithms were efective, although OPTICS with dynamic connectivity features outperformed in terms of cluster purity (95.46%) and robustness to noise/outliers.
    [Show full text]
  • Space-Time Hierarchical Clustering for Identifying Clusters in Spatiotemporal Point Data
    International Journal of Geo-Information Article Space-Time Hierarchical Clustering for Identifying Clusters in Spatiotemporal Point Data David S. Lamb 1,* , Joni Downs 2 and Steven Reader 2 1 Measurement and Research, Department of Educational and Psychological Studies, College of Education, University of South Florida, 4202 E Fowler Ave, Tampa, FL 33620, USA 2 School of Geosciences, University of South Florida, 4202 E Fowler Ave, Tampa, FL 33620, USA; [email protected] (J.D.); [email protected] (S.R.) * Correspondence: [email protected] Received: 2 December 2019; Accepted: 27 January 2020; Published: 1 February 2020 Abstract: Finding clusters of events is an important task in many spatial analyses. Both confirmatory and exploratory methods exist to accomplish this. Traditional statistical techniques are viewed as confirmatory, or observational, in that researchers are confirming an a priori hypothesis. These methods often fail when applied to newer types of data like moving object data and big data. Moving object data incorporates at least three parts: location, time, and attributes. This paper proposes an improved space-time clustering approach that relies on agglomerative hierarchical clustering to identify groupings in movement data. The approach, i.e., space–time hierarchical clustering, incorporates location, time, and attribute information to identify the groups across a nested structure reflective of a hierarchical interpretation of scale. Simulations are used to understand the effects of different parameters, and to compare against existing clustering methodologies. The approach successfully improves on traditional approaches by allowing flexibility to understand both the spatial and temporal components when applied to data. The method is applied to animal tracking data to identify clusters, or hotspots, of activity within the animal’s home range.
    [Show full text]
  • Chapter G03 – Multivariate Methods
    g03 – Multivariate Methods Introduction – g03 Chapter g03 – Multivariate Methods 1. Scope of the Chapter This chapter is concerned with methods for studying multivariate data. A multivariate data set consists of several variables recorded on a number of objects or individuals. Multivariate methods can be classified as those that seek to examine the relationships between the variables (e.g., principal components), known as variable-directed methods, and those that seek to examine the relationships between the objects (e.g., cluster analysis), known as individual-directed methods. Multiple regression is not included in this chapter as it involves the relationship of a single variable, known as the response variable, to the other variables in the data set, the explanatory variables. Routines for multiple regression are provided in Chapter g02. 2. Background 2.1. Variable-directed Methods Let the n by p data matrix consist of p variables, x1,x2,...,xp,observedonn objects or individuals. Variable-directed methods seek to examine the linear relationships between the p variables with the aim of reducing the dimensionality of the problem. There are different methods depending on the structure of the problem. Principal component analysis and factor analysis examine the relationships between all the variables. If the individuals are classified into groups then canonical variate analysis examines the between group structure. If the variables can be considered as coming from two sets then canonical correlation analysis examines the relationships between the two sets of variables. All four methods are based on an eigenvalue decomposition or a singular value decomposition (SVD)of an appropriate matrix. The above methods may reduce the dimensionality of the data from the original p variables to a smaller number, k, of derived variables that adequately represent the data.
    [Show full text]
  • A Study of Hierarchical Clustering Algorithm
    International Journal of Information and Computation Technology. ISSN 0974-2239 Volume 3, Number 11 (2013), pp. 1225-1232 © International Research Publications House http://www. irphouse.com /ijict.htm A Study of Hierarchical Clustering Algorithm Yogita Rani¹ and Dr. Harish Rohil2 1Reseach Scholar, Department of Computer Science & Application, CDLU, Sirsa-125055, India. 2Assistant Professor, Department of Computer Science & Application, CDLU, Sirsa-125055, India. Abstract Clustering is the process of grouping the data into classes or clusters, so that objects within a cluster have high similarity in comparison to one another but these objects are very dissimilar to the objects that are in other clusters. Clustering methods are mainly divided into two groups: hierarchical and partitioning methods. Hierarchical clustering combine data objects into clusters, those clusters into larger clusters, and so forth, creating a hierarchy of clusters. In partitioning clustering methods various partitions are constructed and then evaluations of these partitions are performed by some criterion. This paper presents detailed discussion on some improved hierarchical clustering algorithms. In addition to this, author have given some criteria on the basis of which one can also determine the best among these mentioned algorithms. Keywords: Hierarchical clustering; BIRCH; CURE; clusters ;data mining. 1. Introduction Data mining allows us to extract knowledge from our historical data and predict outcomes of our future situations. Clustering is an important data mining task. It can be described as the process of organizing objects into groups whose members are similar in some way. Clustering can also be define as the process of grouping the data into classes or clusters, so that objects within a cluster have high similarity in comparison to one another but are very dissimilar to objects in other clusters.
    [Show full text]
  • An Improvement for DBSCAN Algorithm for Best Results in Varied Densities
    Computer Engineering Department Faculty of Engineering Deanery of Higher Studies The Islamic University-Gaza Palestine An Improvement for DBSCAN Algorithm for Best Results in Varied Densities Mohammad N. T. Elbatta Supervisor Dr. Wesam M. Ashour A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer Engineering Gaza, Palestine (September, 2012) 1433 H ii stnkmegnelnonkcA Apart from the efforts of myself, the success of any project depends largely on the encouragement and guidelines of many others. I take this opportunity to express my gratitude to the people who have been instrumental in the successful completion of this project. I would like to thank my parents for providing me with the opportunity to be where I am. Without them, none of this would be even possible to do. You have always been around supporting and encouraging me and I appreciate that. I would also like to thank my brothers and sisters for their encouragement, input and constructive criticism which are really priceless. Also, special thanks goes to Dr. Homam Feqawi who did not spare any effort to review and audit my thesis linguistically. My heartiest gratitude to my wonderful wife, Nehal, for her patience and forbearance through my studying and preparing this study. I would like to express my sincere gratitude to my advisor Dr. Wesam Ashour for the continuous support of my master study and research, for his patience, motivation, enthusiasm, and immense knowledge. His guidance helped me in all the time of research and writing of this thesis. I could not have imagined having a better advisor and mentor for my master study.
    [Show full text]