Enabling Parallelism and Optimizations in Data Mining Algorithms for Power-Law Data

ENABLING PARALLELISM AND OPTIMIZATIONS IN DATA MINING ALGORITHMS FOR POWER-LAW DATA A Dissertation Presented to The Academic Faculty By Ankush Mandal In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the School of Computer Science Georgia Institute of Technology August 2020 Copyright c Ankush Mandal 2020 ENABLING PARALLELISM AND OPTIMIZATIONS IN DATA MINING ALGORITHMS FOR POWER-LAW DATA Approved by: Dr. Vivek Sarkar, Advisor Dr. Santosh Pande School of Computer Science School of Computer Science Georgia Institute of Technology Georgia Institute of Technology Dr. Anshumali Shrivastava, Dr. Richard Vuduc Co-advisor School of Computational Science Department of Computer Science and Engineering Rice University Georgia Institute of Technology Dr. Hyesoon Kim Date Approved: June 30, 2020 School of Computer Science Georgia Institute of Technology ACKNOWLEDGEMENTS First and foremost, I would like to convey my deepest gratitude to my advisor Dr. Vivek Sarkar for his mentorship, encouragement, and support throughout my graduate study at Georgia Institute of Technology. His guidance and enthusiasm in research were invaluable to my development as a graduate student. He has been my source of inspiration whenever I faced difficulties. I owe him so much for giving me the opportunity to join Habanero Extreme Scale Software Research Group. I would like to express the greatest degree of appreciation to my other mentor Dr. Anshumali Shrivastava for giving me the opportunity to work with him. His expertise and insights were indispensable to the success of my works. I would like to thank the rest of my committee members - Dr. Hyesoon Kim, Dr. Santosh Pande, and Dr. Richard Vuduc for their valuable feedback on my work. I am grateful to all the members of the Habanero Extreme Scale Software Research group for their help and feedback in my research work. My study at Georgia Tech would not have been as entertaining and educational without them. Additionally, I have been very fortunate to have great friends in my graduate life. I am very thankful for all their support and encouragement. Special thanks to Abhishek, Ananya, Arijit, Arkabandhu, Arpita, Arunim, Debarshi, Hamim, Himadri, Nilanjan, Pranabendu, Prasanth, Pushan, Puspita, Rabimba, Rohan, Sagnik, Saikat, Sandip, Shams, Shantanu, Suvadip, Siam, Sourav, Sriraj, Trijit. Finally, I would like to conclude by acknowledging that I am indebted to my parents Kalpana Mandal and Durgapada Mandal, and my sister Ankita Mandal for their uncondi- tional love. They deserve the maximum credit for this work. They continuously supported and encouraged me to pursue my dreams. Without them standing by my side, I could not have come this far. TABLE OF CONTENTS List of Tables . viii List of Figures . ix Summary . xii Chapter 1: Introduction . 1 1.1 Thesis Statement . 5 Chapter 2: Matryoshka: Frequency Estimation with GPU Parallelism for Skewed Data . 6 2.1 Frequency Estimation Problem . 9 2.2 Sketch - Overview & Parallelism . 10 2.2.1 Overview of GPU Parallelism . 12 2.2.2 Problem: Sketch on GPU . 12 2.3 Our Proposal: Matryoshka . 14 2.3.1 Matryoshka - Hierarchical Exploitation of Skewness . 15 2.3.2 Head-first Scan - Light Weight Heavy Hitter Detection . 17 2.4 Theoretical Analysis . 19 2.4.1 Matryoshka Sketching Analysis . 19 2.4.2 Head-first Scan Analysis . 23 iv 2.5 Implementation . 25 2.6 Evaluations . 26 2.6.1 Experimental Setup . 26 2.6.2 Performance Results . 29 2.6.3 Accuracy Comparison with CMS . 34 2.6.4 Effects of Parameters . 35 2.6.5 Performance Analysis on GPU . 36 2.7 Related Works . 38 Chapter 3: Topkapi: Frequent Elements Finding with Shared and Distributed Memory Parallelism . 40 3.1 Problem Statement . 42 3.1.1 φ-Approximate Heavy Hitters . 43 3.2 Previous Solutions & Their Limitations . 44 3.2.1 Exact Algorithms . 44 3.2.2 Approximate Algorithms . 45 3.3 Our Proposal: Topkapi . 48 3.3.1 Intuition . 48 3.3.2 Topkapi: Algorithm Descriptions . 49 3.3.3 Topkapi: Properties . 50 3.3.4 Practical Considerations . 52 3.4 Topkapi: Theoretical Analysis . 52 3.5 Implementation . 53 3.5.1 Multi-core Parallelism . 54 v 3.5.2 Distributed Parallelism . 54 3.5.3 Parallelizing Baselines: Frequent Algorithms and Count-min Sketch 56 3.6 Evaluations . 57 3.6.1 Code and Experimental Setup . 57 3.6.2 Performance Metrics . 57 3.6.3 Datasets . 58 3.6.4 Performance Comparison with Approximate Methods . 59 3.6.5 Precision for Reported top-K ..................... 64 3.6.6 Performance Comparison with Exact Method . 65 Chapter 4: Word Embedding with Efficient Fine Grain Parallelism . 67 4.1 Background on Word2Vec . 70 4.1.1 Word2Vec: Learning Model . 70 4.1.2 Word2Vec Algorithms . 72 4.2 Shortcomings of Current Solutions . 75 4.3 Proposed Approach . 77 4.3.1 NinjaUpdate . 78 4.3.2 FrequentSkip . 86 4.4 Evaluation . 88 4.4.1 Experimental Setup . 88 4.4.2 Performance Comparison . 90 4.4.3 Empirical Analysis of NinjaVec . 93 Chapter 5: Conclusions . 98 vi References . 108 vii LIST OF TABLES 3.1 Precision Comparison between Approximate Methods . 64 4.1 Accuracy for word similarity (WS353 dataset) . 93 4.2 Accuracy for word analogy (Google analogy dataset) . 93 viii LIST OF FIGURES 1.1 Frequency distributions of elements in different real-world datasets . 3 2.1 Naive CMS Parallelization Efficiency on GPU . 9 2.2 Sketch Data Structure . 10 2.3 Parallel Sketch on Multi-core CPU . 11 2.4 Parallel Sketch on GPU . 13 2.5 Matryoshka Sketching Strategy . 16 2.6 Frequency distributions of items in different real datasets (presented again for ease of reference) . 28 2.7 Performance comparison for updates on Zipf datasets . 31 2.8 Performance comparison for query on Zipf datasets . 32 2.9 Performance comparison for updates on real datasets . 33 2.10 Performance comparison for query on real datasets . 34 2.11 Average relative error comparison on Zipf datasets . 35 2.12 Average relative error comparison on real datasets . 35 2.13 Effect of b ................................... 36 2.14 Effect of l ................................... 36 2.15 Percentage improvement over reference CMS GPU in GPU performance metrics . 37 ix 3.1 Performance comparison with FA and CMS for 16GB data. Number of threads per node is 8. Used a cluster of Intel R Westmere processors with each node having 12 cores. 58 3.2 Performance comparison with FA and CMS for 128GB data. Number of threads per node is 8. Used a cluster of Intel R Westmere processors with each node having 12 cores. 59 3.3 Performance comparison with FA and CMS for varying number of threads. Data Size=16GB and Number of Nodes=1. Used a single node with 32 cores from four IBM Power R 7 chip. 60 3.4 Performance comparison with FA and CMS for varying data size on 8 nodes. Number of threads per node is 8. Used a cluster of Intel R Westmere processors with each node having 12 cores. 61 3.5 Performance comparison with FA and CMS for high number of threads (32 and 64) in distributed setting. Used a cluster of IBM Power R 7 processors where each node has 32 cores from four processor chips. 61 3.6 Execution time break down for Topkapi, FA, and CMS for 4 nodes and 1GB data size. Number of threads per node is 8. Used a cluster of Intel R Westmere processors with each node having 12 cores. 62 3.7 Performance comparison with FA and CMS for K=50, 200 on 16GB data. Number of threads per node is 8. 63 3.8 Performance comparison with Exact Method - Spark wordcount() + parallel sort() for 16GB and 128GB data. Number of threads per node is 8. Used a cluster of Intel R Westmere processors with each node having 12 cores. 65 3.9 Performance comparison with Exact Method - Spark wordcount() + parallel sort() for varying data size on 8 nodes. Number of threads per node is 8. Used a cluster of Intel R Westmere processors with each node having 12 cores. 66 4.1 Strategies used in different Word2Vec algorithms . 68 4.2 Skip-gram model architecture . 71 4.3 Strategies used in different Word2Vec algorithms (presented again for ease of reference) . 74 x 4.4 Statistics for the first GEMM call in pWord2vec over One Billion Words dataset 76 4.5 Workflow for NinjaUpdate .......................... 78 4.6 SGD code . 79 4.7 Optimized SGD code with specialization . 85 4.8 Outline of code generated for multi-versioning . 86 4.9 Comparison of speedups achieved in SGD . 91 4.10 Comparison of speedups achieved in training time . 92 4.11 Performance gain for gradient update step in different scenarios . 94 4.12 Execution time and accuracy with varying threshold for FrequentSkip on One Billion Words dataset . 95 4.13 Performance scaling with number of threads on One Billion Words dataset . 96 4.14 Performance with varying number of case specialization based on the frequency on One Billion Words dataset . 96 xi SUMMARY Today’s data mining tasks.

Enabling Parallelism and Optimizations in Data Mining Algorithms for Power-Law Data

GPTPU: Accelerating Applications Using Edge Tensor Processing Units Kuan-Chieh Hsu and Hung-Wei Tseng University of California, Riverside {Khsu037, Htseng}@Ucr.Edu

Amd Filed: February 24, 2009 (Period: December 27, 2008)

Quad-Core Catamount and R&D in Multi-Core Lightweight Kernels

ADVANCED MICRO DEVICES, INC. (Exact Name of Registrant As Specified in Its Charter)

AMD's Early Processor Lines, up to the Hammer Family (Families K8

N51tp/Te AMD PUMA Rev 0.6 2008/Nov/05

AMD Zen Rohin, Vijay, Brandon Outline

AMD Details Next-Generation Platform for Notebook Pcsamd Details Next-Generation Platform for Notebook Pcs 18 May 2007

Intel Corporation, Founded 1968, Is the Largest Microprocessor Company in the World, with the Greatest Overall Share of the Microprocessor Market Worldwide

PUMA: a Programmable Ultra-Efficient Memristor-Based Accelerator for Machine Learning Inference

PUMA” Microarchitecture PUMA = Programmable Ultra-Efficient Memristor-Based Accelerator

AMD ENERGY STAR Server Energy Efficiency Technology