Large Scale Clustering


Large Scale Clustering
Clustering at Large Scales. Veronika Strnadová, CSC Lab

What is clustering?
Finding groups of highly similar objects in unlabeled datasets — or: finding groups of highly connected vertices in a graph (figure: clusters C_1, C_2, C_3).

Applications
• Image Segmentation — Normalized Cuts and Image Segmentation. Jianbo Shi and Jitendra Malik, 2001
• Load Balancing in Parallel Computing — Multilevel k-way Partitioning Scheme for Irregular Graphs. George Karypis and Vipin Kumar, 1998
• Genetic Mapping (linkage groups LG_1, LG_2) — Computational approaches and software tools for genetic linkage map estimation in plants. Jitendra Cheema and Jo Dicks, 2009
• Community Detection — Modularity and community structure in networks. Mark E.J. Newman, 2006

Clustering in Genetic Mapping
"The problem of genetic mapping can essentially be divided into three parts: grouping, ordering, and spacing." — Computational approaches and software tools for genetic linkage map estimation in plants. Cheema and Dicks, 2009

Challenges in the large scale setting
Big Data: many popular clustering algorithms are inherently difficult to parallelize, and inefficient at large scales.
• "The computational issue becomes critical when the application involves large-scale data." — Jain, Data Clustering: 50 Years Beyond k-means
• "There is no clustering algorithm that can be universally used to solve all problems." — Xu and Wunsch, A Comprehensive Overview of Basic Clustering Algorithms

Popular Clustering Techniques
• k-means: k-means / k-medians / k-medoids; better, bigger k-means
• Spectral: optimizing NCut; PIC
• Hierarchical: single linkage; BIRCH; CURE
• Density-based: DBSCAN; II DBSCAN
• Multilevel: Metis; ParMetis; multithreaded Metis
• Modularity: community detection

k-means
• "…simpler methods such as k-means remain the preferred choice in many large-scale applications." — Kulis and Jordan, Revisiting k-means: New Algorithms via Bayesian Nonparametrics, 2012
• "The k-means algorithm is very simple and can be easily implemented in solving many practical problems." — Xu and Wunsch, Survey of Clustering Algorithms, 2005
• "In spite of the fact that k-means was proposed over 50 years ago and thousands of clustering algorithms have been published since then, k-means is still widely used." — Jain, Data Clustering: 50 Years Beyond k-means

k-means
Given: x_1, x_2, …, x_n \in R^d and k.
Objective: minimize \sum_{j=1}^{k} \sum_{i=1}^{n} r_{ij} ||x_i - \mu_j||_2^2, where r_{ij} is an indicator function.
Initialize \mu_1, \mu_2, …, \mu_k. Repeat:
1. Find the optimal assignments of the x_i to the \mu_j.
2. Re-evaluate each \mu_j given the assignments.
Until \sum_{j=1}^{k} \sum_{i=1}^{n} r_{ij} ||x_i - \mu_j||_2^2 converges to a local minimum.

Example: k = 2 (four panels: initialization + Step 1, then Step 2, Step 1, Step 2)
Least Squares Quantization in PCM. Stuart P. Lloyd, 1982
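The two-step loop above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions — the explicit `init` argument and the vectorized distance computation are my implementation choices, not part of Lloyd's original formulation:

```python
import numpy as np

def kmeans(X, k, init, n_iter=100):
    """Lloyd's algorithm: alternate optimal assignment and centroid update."""
    mu = init.astype(float).copy()
    for _ in range(n_iter):
        # Step 1: set r_ij = 1 for the closest center (squared Euclidean distance).
        dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Step 2: re-evaluate mu_j as the mean of the points assigned to cluster j.
        for j in range(k):
            if (labels == j).any():
                mu[j] = X[labels == j].mean(axis=0)
    return mu, labels
```

Each of the two steps can only decrease the objective J, which is why the loop converges to a local minimum.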
Convergence of k-means
Let J = \sum_{j=1}^{k} \sum_{i=1}^{n} r_{ij} ||x_i - \mu_j||_2^2.
• Minimize J with respect to r (a discrete optimization): simply assigning each x_i to its closest \mu ensures that ||x_i - \mu_j||^2 is minimized for each x_i, so let
r_{ij} = 1 if j = argmin_{j'=1,…,k} ||x_i - \mu_{j'}||_2^2, and r_{ij} = 0 otherwise.
• Minimize J with respect to \mu (set the derivative of J with respect to \mu_j to 0):
-2 \sum_{i=1}^{n} r_{ij} (x_i - \mu_j) = 0  ⟹  \mu_j = \frac{1}{n_j} \sum_{i=1}^{n} r_{ij} x_i, where n_j = \sum_i r_{ij} is the number of points assigned to cluster j.

Distributed k-means
Given: x_1, x_2, …, x_n \in R^d and k; same objective as above.
Initialize \mu_1, \mu_2, …, \mu_k and keep them as a global variable. Repeat:
1. Map: find the optimal assignment of x_i to \mu_j (the mapper outputs key = j, value = x_i).
2. Reduce: re-evaluate \mu_j given the assignments (a combiner emits partial sums of the x_i sharing a key; the reducer takes a key j and its list of values x_i and computes a new \mu_j).
Until the objective converges to a local minimum.
Parallel k-Means Clustering Based on MapReduce. Zhao et al., 2009

Divide-and-conquer k-medians
• Divide the x_i into l groups \chi_1, …, \chi_l.
• For each group, find O(k) centers.
• Assign each x_i to its closest center.
• Weight each center by the number of points assigned to it.
• Let \chi' be the set of weighted centers; cluster \chi' to find exactly k centers.
Clustering Data Streams: Theory and Practice. Sudipto Guha, Adam Meyerson, Nina Mishra, Rajeev Motwani, 2001
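The map/combine/reduce decomposition described above can be simulated in a single process. This is a toy sketch — the function names and the list-of-chunks data layout are mine; the actual system runs on Hadoop with the framework doing the shuffling by key:

```python
import numpy as np

def kmeans_map(chunk, mu):
    """Map: for each local point x, emit (key = index of nearest center, value = x)."""
    return [(int(((mu - x) ** 2).sum(axis=1).argmin()), x) for x in chunk]

def kmeans_combine(pairs):
    """Combine: collapse local pairs into per-key (partial sum, count) records,
    so each node ships at most k small records instead of its whole chunk."""
    acc = {}
    for j, x in pairs:
        s, n = acc.get(j, (0.0, 0))
        acc[j] = (s + x, n + 1)
    return acc

def kmeans_reduce(partials, mu):
    """Reduce: merge the partial sums for each key j and compute the new mu_j."""
    new_mu = mu.copy()
    for j in range(len(mu)):
        recs = [p[j] for p in partials if j in p]
        if recs:
            new_mu[j] = sum(s for s, _ in recs) / sum(n for _, n in recs)
    return new_mu
```

One k-means iteration over distributed chunks is then `kmeans_reduce([kmeans_combine(kmeans_map(c, mu)) for c in chunks], mu)`; the combiner is what keeps per-iteration communication proportional to k rather than to n.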
Disadvantages of k-means
• Assumes clusters are spherical in shape.
• Assumes clusters are of approximately equal size.
• Assumes we know k ahead of time.
• Sensitive to outliers.
• Converges only to a local optimum of the objective.

Addressing the k problem: a parametric, probabilistic view of k-means
Assume that the points x_i were generated from a mixture of k (multivariate) Gaussian distributions:
N(x | \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)
p(x) = \sum_{k=1}^{K} \pi_k N(x | \mu_k, \Sigma), where \sum_{k=1}^{K} \pi_k = 1.
Pattern Recognition and Machine Learning. Christopher M. Bishop, 2006

What is the most likely set of parameters \theta = \{\{\pi_k\}, \{\mu_k\}\} that generated the points x_i? We want to maximize the likelihood p(x | \theta), which is solved by Expectation Maximization.
Xu and Wunsch, Survey of Clustering Algorithms

DP-Means
A probabilistic, nonparametric view of k-means: assume a "countably infinite" mixture of Gaussians, and attempt to estimate the parameters of these Gaussians and the way they are combined. This leads to an intuitive, k-means-like algorithm.
Revisiting k-means: New Algorithms via Bayesian Nonparametrics. Brian Kulis and Michael I. Jordan, 2012
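The mixture density above can be written out directly. A small sketch — the helper names are mine, and, as in the formula on this slide, a single covariance Σ is shared across components:

```python
import numpy as np

def gaussian_density(x, mu, cov):
    """N(x | mu, Sigma): multivariate normal density in D dimensions."""
    D = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(cov))
    return float(np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)) / norm

def mixture_density(x, weights, means, cov):
    """p(x) = sum_k pi_k N(x | mu_k, Sigma), with the pi_k summing to 1."""
    return sum(w * gaussian_density(x, m, cov) for w, m in zip(weights, means))
```

EM alternates soft responsibilities (expectation) with weighted parameter updates (maximization); k-means is the limit of this procedure when \Sigma = \epsilon I and \epsilon \to 0, where the responsibilities become hard assignments.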
DP-Means
Given: x_1, …, x_n and a penalty parameter \lambda.
Goal: minimize \sum_{j=1}^{k} \sum_{i} r_{ij} ||x_i - \mu_j||^2 + \lambda k.
Algorithm:
Initialize k = 1, \mu_1 = the global mean, and r_{i1} = 1 for all i = 1 … n. Repeat:
• For each x_i:
  • compute d_{ij} = ||x_i - \mu_j||_2^2 for j = 1 … k;
  • if min_j d_{ij} > \lambda, start a new cluster: set k = k + 1 and \mu_k = x_i;
  • otherwise, assign x_i to cluster argmin_j d_{ij}.
• Generate the clusters c_1, …, c_k based on the r_{ij}.
• For each cluster, recompute \mu_j.
Revisiting k-means: New Algorithms via Bayesian Nonparametrics. Brian Kulis and Michael I. Jordan, 2012

Optimizing a Different Global Objective: Spectral Clustering and Graph Cuts

Spectral Clustering
Key idea: use the eigenvectors of a graph Laplacian matrix L(G) to cluster a graph. For the example graph G on vertices {1, …, 5}:

L(G) = D - W =
[  3  -1  -1  -1   0 ]
[ -1   2   0  -1   0 ]
[ -1   0   3  -1  -1 ]
[ -1  -1  -1   4  -1 ]
[  0   0  -1  -1   2 ]

Partitioning Sparse Matrices with Eigenvectors of Graphs. Alex Pothen, Horst D. Simon, and Kang-Pu Liu, 1990
Normalized Cuts and Image Segmentation. Jianbo Shi and Jitendra Malik, 2001
A Tutorial on Spectral Clustering. Ulrike von Luxburg, 2007

Spectral Clustering Algorithm
Given: data points x_i, a similarity function s(x_i, x_j) between them, and k.
Objective: minimize the ratio cut (or normalized cut).
1. Construct the (normalized) graph Laplacian L(G(V, E)) = D - W.
2. Find the k eigenvectors corresponding to the k smallest eigenvalues of L.
3. Let U be the n × k matrix of these eigenvectors.
4. Use k-means to find k clusters C', letting the rows x'_i of U be the points.
5. Assign data point x_i to the jth cluster if x'_i was assigned to cluster j.
A Tutorial on Spectral Clustering. Ulrike von Luxburg, 2007

Graph Cuts
A graph cut of G(V, E) is the sum of the weights on a set of edges S_E ⊆ E which cross partitions A_i of the graph:
W(A_i, \bar{A}_i) = \sum_{x_i \in A_i, x_j \in \bar{A}_i} w_{ij}
cut(A_1, …, A_k) = \frac{1}{2} \sum_{i=1,…,k} W(A_i, \bar{A}_i)
Examples (two different partitions of the same graph): cut(A_1, A_2) = 7 versus cut(A_1, A_2) = 2.

Normalized Cuts
Ratiocut(A_1, A_2, …, A_k) = \sum_{i=1,…,k} \frac{cut(A_i, \bar{A}_i)}{|A_i|}
Ncut(A_1, A_2, …, A_k) = \sum_{i=1,…,k} \frac{cut(A_i, \bar{A}_i)}{vol(A_i)}
Example: for the cut of weight 2 above, Ratiocut(A_1, A_2) = 0.226 and Ncut(A_1, A_2) = 0.063.

Relationship between Eigenvectors and Graph Cuts
For some subset of vertices A, let f = (f_1, …, f_n)^T \in R^n such that:
f_i = \sqrt{|\bar{A}| / |A|} if v_i \in A, and f_i = -\sqrt{|A| / |\bar{A}|} if v_i \in \bar{A}.
Then f^T L f = |V| \cdot RatioCut(A, \bar{A}), f \perp \mathbf{1} (i.e. \sum_i f_i = 0), and ||f|| = \sqrt{n}.
Therefore, minimizing RatioCut is equivalent to minimizing f^T L f, subject to f \perp \mathbf{1}, ||f|| = \sqrt{n}, and f_i as defined above.
A Tutorial on Spectral Clustering. Ulrike von Luxburg, 2007
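For k = 2 the relaxation above reduces to thresholding the eigenvector of the second-smallest eigenvalue of L (the Fiedler vector). A minimal sketch of the unnormalized variant, assuming a symmetric weight matrix W is already given; for general k one would instead run k-means on the rows of the eigenvector matrix U:

```python
import numpy as np

def spectral_bipartition(W):
    """Unnormalized spectral clustering with k = 2: build L = D - W and split
    the vertices by the sign of the Fiedler vector."""
    D = np.diag(W.sum(axis=1))
    L = D - W
    # np.linalg.eigh returns eigenvalues in ascending order for symmetric input,
    # so column 1 is the eigenvector of the second-smallest eigenvalue.
    _, vecs = np.linalg.eigh(L)
    fiedler = vecs[:, 1]
    return fiedler >= 0
```

Using D^{-1}L or D^{-1/2} L D^{-1/2} instead of L gives the normalized variants, whose relaxations correspond to Ncut rather than RatioCut.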