Effect of Distance Measures on Partitional Clustering Algorithms

Sesham Anand et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 6 (6), 2015, 5308-5312

Effect of Distance Measures on Partitional Clustering Algorithms using Transportation Data

Sesham Anand#1, P Padmanabham*2, A Govardhan#3
#1 Dept of CSE, M.V.S.R Engg College, Hyderabad, India
*2 Director, Bharath Group of Institutions, BIET, Hyderabad, India
#3 Dept of CSE, SIT, JNTU Hyderabad, India

Abstract: Similarity/dissimilarity measures in clustering algorithms play an important role in grouping data and in quantifying how much data objects differ from each other. The importance of clustering algorithms for transportation data has been illustrated in previous research. This paper compares the effect of different distance/similarity measures on a partitional clustering algorithm, kmedoids (PAM), using a transportation dataset. A recently developed open source data mining software, ELKI, has been used and the results are illustrated.

Keywords: clustering, transportation data, partitional algorithms, cluster validity, distance measures

I. INTRODUCTION

Clustering as an unsupervised data mining technique has been widely used in various applications for analyzing data. Kmeans and kmedoids are two very widely used partitional algorithms. Since kmeans is best suited to the Euclidean distance metric, the kmedoids algorithm has been chosen in order to study other distance measures. Details of the transportation data, its attributes, the aim of the research and the application of clustering have been presented previously [2,3,4,5]. Another aim of the research was to compare different open source software [4]. Previous research illustrated the importance and use of clustering on transportation data and reported the results of applying clustering. It has been observed that the clustering results varied with different algorithms and with different parameters.

The aim of the current research is to study the effect of distance metrics on the clustering algorithms, to come out with recommendations on suitable distance measures, and to determine whether a particular distance measure needs to be modified specifically for the transportation dataset. The work has been done using the ELKI [1] software, a recently developed Java based data mining software. Different distances have been selected for the kmedoids algorithm and different parameters have been recorded. Suitable conclusions have been made. All the cluster evaluation measures and distances used in the work are discussed.

II ELKI SOFTWARE

The software used for the present research is ELKI [1], which stands for Environment for Developing KDD-Applications Supported by Index-Structures. It is a Java based open source software. The version used is ELKI 0.6.0. The original aim of ELKI is to be used in research on algorithms, with an emphasis on unsupervised methods in cluster analysis and outlier detection [1]. High performance is achieved by using data index structures such as R*-trees. ELKI is designed to be easy to extend for researchers in data mining, particularly in the clustering domain. ELKI provides a large collection of highly parameterizable algorithms in order to allow easy and fair evaluation and benchmarking of algorithms [1].

Data mining research usually leads to many algorithms for similar kinds of tasks, and these algorithms often need to be compared. In ELKI, data mining algorithms and data management tasks are separated, which allows for an independent evaluation. This aspect is unique to ELKI among data mining frameworks such as Weka, RapidMiner and Tanagra. ELKI has many implementations of distance and similarity measures, file formats and data types, and is open to the development of new ones. Visualization of data and results is also available, and ELKI supports the evaluation of different algorithms using measures such as pair-counting, entropy based, BCubed based, set-matching based, editing-distance based and Gini measures.

Although not exhaustive, Table 1 shows the wide variety of clustering algorithms implemented in ELKI. The algorithms are highly efficient, and a wide variety of parameters can be used to customize them.

Table 1: Clustering algorithms in ELKI
Category | Algorithms Implemented
Partitional Clustering | kmeans (Lloyd and MacQueen), kmedoids (PAM & EM), kMedians (Lloyd), kmeansBisecting, kmeansBatchedLloyd, kmeansHybridLloydMacQueen
Hierarchical | ExtractFlatClusteringFromHierarchy, NaiveAgglomerativeHierarchicalClustering, SLINK
Correlation | CASH, COPAC, ERiC, FourC, HiCO, LMCLUS, ORCLUS
Subspace | CLIQUE, DiSH, DOC, HiSC, P3C, PROCLUS, SUBCLU, PreDeCon
Others | Canopy PreClustering, DBScan, DeLiClu, EM, AffinityPropagatingClustering

Fig 1: ELKI Home Screen

The ELKI home screen provides many parameters for selecting clustering algorithms and customizing them. Fig 1 shows the initial menu of the software. Many clustering algorithms have been implemented in ELKI, unlike in other open source software. Easy to use menus allow the researcher to select and customize various parameters as shown in Fig 2.

Fig 2: ELKI algorithm selection

Although not yet mature, ELKI provides a visualization facility to view the data and the clustering results, as shown in Fig 3.

Fig 3: Visualization of Data

III DEFINITION AND MOTIVATION FOR CURRENT RESEARCH

The authors' previous research [2,3,4,5] presented the original aim of the research and the application of clustering techniques to transportation data. Ultimately it is proposed to come out with recommendations on algorithms and distance measures suitable for the HIS Transportation dataset. Preprocessing and suitable dimensionality reduction [5] techniques were applied to the data, and it was concluded that clustering on factor scores is more suitable than clustering on the original data. It now remains to be seen which distance measures are suitable and need to be used.

The objective of clustering the data has been to answer questions such as how rich or prosperous a household is, what its mobility level is, and what vehicles are available to it. The intention is to analyze why people are investing money to acquire vehicles and whether this is related to the inadequacy of public transport. A further question is whether it is possible to predict the potential of households and how likely they are to buy vehicles in the future.

IV ALGORITHM USED

This paper is intended to study the effect of distance measures and initialization methods in partitional clustering algorithms when applied to a transportation dataset. Two of the most widely used partitional clustering algorithms are kmeans [7] and kmedoids [6] (also known as partitioning around medoids, PAM). Partitional clustering algorithms construct k clusters, that is, they classify the data into k groups which satisfy the partition requirements: each group must contain at least one object, and each object must belong to exactly one group. All these algorithms require the value of k as input. Since the dataset under consideration is numeric, these two algorithms are suitable for application to the dataset.

The kmeans algorithm [7] is more widely used than kmedoids in the research literature, but it has some drawbacks. It is best suited to Euclidean distance and its derivatives, since it is designed naturally for squared Euclidean distance. If any other similarity measure is used, it may not converge to a global minimum. Precisely for this reason, the kmedoids algorithm has been chosen for further work, and the intention is to change the distance measure for each run of the algorithm and observe the effect.

A. Algorithm

The kmedoids algorithm has two phases [6], a build phase and a swap phase. In the first phase, an initial clustering is obtained by selecting k representative objects. The first object selected is the one for which the sum of dissimilarities to all other objects is as small as possible. The subsequent steps select the remaining k-1 objects. In order to select these objects, the following steps are carried out [6]:

1. Select an object i which is not yet selected.
2. Select a non-selected object j and calculate the difference between its dissimilarity D_j to the most similar previously selected object and its dissimilarity d(j,i) to object i.
3. If this difference is positive, object j will contribute to the decision to select object i. Therefore C_ji = max(D_j - d(j,i), 0).
4. Calculate the total gain obtained by selecting object i: Σ_j C_ji.
5. Choose the not yet selected object i which maximizes Σ_j C_ji.
6. These steps are carried out until k objects have been found.

The second phase consists of trying to improve the representative objects, thereby improving the overall [...]

c) FarthestPointsInitialMeans: The farthest-point method picks an arbitrary point p. It then picks a point q that is as far from p as possible. Each subsequent point r is selected to maximize the distance to the nearest of all centroids picked so far, that is, to maximize min{dist(r, p), dist(r, q), ...}. After all k representatives are chosen, the partition of the data D is made: a cluster C_j consists of all points closer to representative s_j than to any other representative.

d) PAMInitialMeans: This method [6] is based on the partitioning around medoids (PAM) algorithm, which searches for k representative objects in the dataset [...]
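The build phase of kmedoids (steps 1 to 6 above) and the farthest-point initialization can be illustrated with a short Python sketch. This is only a minimal illustration under stated assumptions, not ELKI's implementation: the function names pam_build, farthest_points, euclidean and manhattan are hypothetical, objects are assumed to be numeric tuples, the candidate object i is counted among the non-selected objects j when summing the gain, ties are broken by the lowest index, and the "arbitrary" starting point of the farthest-point method is taken to be index 0 by default. The distance function is a parameter, so the same build phase can be run with different distance measures.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Manhattan (city-block) distance between two numeric tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))


def pam_build(points, k, dist=euclidean):
    """Build phase of PAM: greedily select k medoids, returned as indices."""
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")
    # Pairwise dissimilarity matrix d[a][b].
    d = [[dist(points[a], points[b]) for b in range(n)] for a in range(n)]
    # First medoid: the object with the smallest sum of dissimilarities.
    first = min(range(n), key=lambda i: sum(d[i]))
    medoids = [first]
    # D[j]: dissimilarity of j to its most similar selected object.
    D = d[first][:]
    while len(medoids) < k:
        best_i, best_gain = None, -1.0
        for i in range(n):
            if i in medoids:
                continue
            # Total gain of selecting i: sum over non-selected j of
            # C_ji = max(D_j - d(j, i), 0). Assumption: j == i is included.
            gain = sum(max(D[j] - d[j][i], 0.0)
                       for j in range(n) if j not in medoids)
            if gain > best_gain:  # strict '>' keeps the lowest index on ties
                best_i, best_gain = i, gain
        medoids.append(best_i)
        # Update each object's distance to its nearest selected medoid.
        D = [min(D[j], d[j][best_i]) for j in range(n)]
    return medoids


def farthest_points(points, k, dist=euclidean, start=0):
    """Farthest-point initialization: returns indices of k representatives."""
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")
    chosen = [start]
    # nearest[j]: distance from j to the closest representative so far.
    nearest = [dist(p, points[start]) for p in points]
    while len(chosen) < k:
        # Pick r maximizing min{dist(r, p), dist(r, q), ...}.
        r = max(range(n), key=lambda j: nearest[j])
        chosen.append(r)
        nearest = [min(nearest[j], dist(points[j], points[r]))
                   for j in range(n)]
    return chosen
```

For example, for the six points (0,0), (0,1), (1,0), (10,10), (10,11), (11,10) and k = 2, pam_build returns the indices [3, 0] with both the Euclidean and the Manhattan distance, that is, one medoid in each of the two visible groups. Passing a different dist function is all that is needed to repeat the run with another distance measure.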