Ray: A Distributed Framework for Emerging AI Applications
Total Pages: 16
File Type: PDF, Size: 1020 KB
Recommended publications
- Natural Proofs for Verification of Dynamic Heaps
P. Madhusudan, University of Illinois at Urbana-Champaign (visiting MSR, Bangalore). Workshop on Making Formal Verification Scalable and Usable, CMI, Chennai, India. Joint work with Xiaokang Qiu (graduating!), Pranav Garg, Gennaro Parlato, and Andrei Stefanescu.
Goal: build reliable software, that is, systems with proven reliability and security guarantees (not the same as finding bugs!). The approach is deductive verification with automatic theorem proving: Floyd/Hoare-style verification with user-supplied modular annotations (pre/postconditions, class invariants) and loop invariants. Verification conditions are derived automatically for each linear program segment (using weakest-precondition/strongest-postcondition) and are then proved valid using mostly automatic theorem proving, such as SMT solvers.
Target: reliable systems software, from operating systems, browsers, VMs, and app platforms up to applications such as Angry Birds, Excel, and mail clients. Several success stories: Microsoft Hypervisor verification using VCC [MSR, EMIC], and a secure foundation for mobile apps [@Illinois, ASPLOS'13]. The latter, in collaboration with King's systems group at Illinois, is the first OS architecture that provides verifiable, high-level abstractions for building mobile applications. Application state is private by default: metadata on files controls access by apps, and only apps with appropriate permissions can access files; every page is encrypted before being sent to the storage service; memory is isolated between apps; when a page-fault handler serves a file-backed page for a process, the file has to have been opened by that same process; only the current app can write to the screen buffer; IPC channels are correct; and more. The closing message: verification tools are mature enough to build reliable software, and they work!
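As a hedged illustration of the verification-condition step described in this talk (our invented example, not the speaker's), the following Python sketch uses the z3 SMT solver to prove the VC of a single Hoare triple; the triple and variable names are assumptions made for the example.

```python
# A minimal sketch (assumed example, not from the talk): check the VC for
# the Hoare triple {x >= 0} x := x + 1 {x >= 1}.
# wp(x := x + 1, x >= 1) is x + 1 >= 1, so the VC is: x >= 0 ==> x + 1 >= 1.
from z3 import Int, Implies, Not, Solver, unsat

x = Int("x")
pre = x >= 0
wp = x + 1 >= 1          # weakest precondition of the postcondition
vc = Implies(pre, wp)    # verification condition for the segment

s = Solver()
s.add(Not(vc))           # the VC is valid iff its negation is unsatisfiable
assert s.check() == unsat
print("VC proved valid")
```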
- Type 2 Diabetes Prediction Using Machine Learning Algorithms
Parisa Karimi Darabi*, Mohammad Jafar Tarokh; Department of Industrial Engineering, K. N. Toosi University of Technology, Tehran, Iran. Original article; received 15 Jul 2020, revised 1 Aug 2020, accepted 5 Aug 2020. *Correspondence: Parisa Karimi Darabi ([email protected]). DOI: 10.29252/jorjanibiomedj.8.3.4
Background and Objective: Diabetes is currently one of the leading causes of death in the world; the introduction notes that people die every year from the disease and that, by 2045, the number of cases will reach 629 million (1). Because many factors are involved, diagnosis of this disease is complex and prone to human error. This study analyzed the risk of having diabetes based on laboratory information, lifestyle, and family history with the help of machine learning algorithms; when the model is trained properly, people can examine their risk of having diabetes.
Material and Methods: To classify patients, eight different machine learning algorithms (logistic regression, nearest neighbor, decision tree, random forest, support vector machine, naive Bayes, neural network, and gradient boosting) were analyzed using Python and evaluated by accuracy, sensitivity, specificity, and ROC-curve parameters.
Results: The model based on the gradient boosting algorithm showed the best performance, with a prediction accuracy of 95.50%.
Conclusion: In the future, this model can be used for diagnosing diabetes; the study is also a basis for further research developing models with other machine learning algorithms.
Keywords: prediction, diabetes, machine learning, gradient boosting, ROC curve.
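A hedged sketch of the kind of pipeline the paper describes, using scikit-learn's gradient boosting classifier; the synthetic data and default hyperparameters below are illustrative assumptions, not the paper's dataset or settings.

```python
# Illustrative sketch only: the dataset and hyperparameters are assumptions,
# not the paper's actual data or configuration.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in for lab-result, lifestyle, and family-history features.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
pred = model.predict(X_test)
prob = model.predict_proba(X_test)[:, 1]

print("accuracy:", accuracy_score(y_test, pred))
print("ROC AUC: ", roc_auc_score(y_test, prob))
```

Sensitivity and specificity, the paper's other two metrics, can be read off the same predictions via sklearn.metrics.confusion_matrix.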
- Quick Install for AWS EMR
Version: 6.8. Doc build date: 01/21/2020. Copyright © Trifacta Inc. 2020, all rights reserved; the Documentation is confidential and proprietary to Trifacta Inc. For third-party license information, select About Trifacta from the Help menu.
Contents:
1. Release Notes: changes to system behavior (changes to the language, the APIs, configuration, and the object model); release notes for versions 6.8, 6.4, 6.0, and 5.1.
2. Quick Start: install from AWS Marketplace with EMR; upgrade for AWS Marketplace with EMR.
3. Configure: configure for AWS (EC2 role-based authentication; enable S3 access, including creating Redshift connections; configure for EMR).
- ArrowSAM: In-Memory Genomics Data Processing Using Apache Arrow (Delft University of Technology)
Tanveer Ahmad, Nauman Ahmed, Johan Peltenburg, and Zaid Al-Ars. Accepted author manuscript; to cite this publication, please use the final published version. DOI: 10.1109/ICCAIS48893.2020.9096725. Publication date: 2020.
Citation (APA): Ahmad, T., Ahmed, N., Peltenburg, J., & Al-Ars, Z. (2020). ArrowSAM: In-Memory Genomics Data Processing Using Apache Arrow. In 2020 3rd International Conference on Computer Applications & Information Security (ICCAIS): Proceedings (pp. 1-6). IEEE. https://doi.org/10.1109/ICCAIS48893.2020.9096725
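Only the citation metadata survives in this excerpt, but the title describes holding SAM alignment records in Apache Arrow's in-memory columnar format. A minimal sketch of that general idea with pyarrow follows; the column names, types, and records are invented, not the paper's actual ArrowSAM schema.

```python
# Hedged sketch: an invented, simplified SAM-like schema in Arrow columnar
# form; the real ArrowSAM schema in the paper may differ.
import pyarrow as pa

sam = pa.table({
    "qname": ["read1", "read2"],
    "flag":  pa.array([99, 147], type=pa.int32()),
    "rname": ["chr1", "chr1"],
    "pos":   pa.array([10468, 10532], type=pa.int64()),
    "mapq":  pa.array([60, 60], type=pa.int8()),
    "cigar": ["100M", "100M"],
})

# Columnar layout lets downstream tools share the same in-memory buffers
# (e.g., via Arrow IPC) instead of re-parsing SAM text at every stage.
with pa.OSFile("reads.arrow", "wb") as f:
    with pa.ipc.new_file(f, sam.schema) as writer:
        writer.write_table(sam)
```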
- Scalable Fault-Tolerant Elastic Data Ingestion in AsterixDB
UC Irvine Electronic Theses and Dissertations. Author: Raman Grover. Publication date: 2015. Permalink: https://escholarship.org/uc/item/9xv3435x. License: https://creativecommons.org/licenses/by-nd/4.0/. Peer reviewed.
Dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Information and Computer Science, University of California, Irvine. Dissertation committee: Professor Michael J. Carey (chair), Professor Chen Li, Professor Sharad Mehrotra.
Contents (abridged): Chapter 1 introduces the challenges in data feed management, namely genericity and extensibility, the fetch-once compute-many model, scalability and elasticity, and fault tolerance, followed by the contributions and organization of the thesis. Chapter 2 surveys related work: stream processing engines, data routing engines, extract-transform-load (ETL) systems, and other systems (Flume, Chukwa, Kafka, Sqoop). Chapter 3 gives background and preliminaries on AsterixDB: its architecture, data model, and query facilities.
- Anomaly Detection and Identification Scheme for VM Live Migration in Cloud Infrastructure
Tian Huang (a), Yongxin Zhu (a,*), Yafei Wu (a), Stéphane Bressan (b), Gillian Dobbie (c). (a) School of Microelectronics, Shanghai Jiao Tong University, China; (b) School of Computing, National University of Singapore, Singapore; (c) Department of Computer Science, University of Auckland, New Zealand. Future Generation Computer Systems 56 (2016) 736-745. Received 3 March 2015; received in revised form 11 May 2015.
Highlights: The authors extend Local Outlier Factors to detect anomalies in time series and adapt to smooth environmental changes. They propose Dimension Reasoning LOF (DR-LOF), which can point out the "most anomalous" dimension of the performance profile; DR-LOF provides clues for administrators to pinpoint and clear the anomalies. They incorporate DR-LOF and Symbolic Aggregate approXimation (SAX) measurement as a scheme to detect and identify anomalies during VM live migration. In experiments, they deploy a typical computing application on small-scale clusters and evaluate the scheme by imposing two kinds of common anomalies onto VMs.
Abstract (excerpt): Virtual machines (VMs) offer simple and practical mechanisms to address many of the manageability problems of leveraging heterogeneous computing resources. VM live migration is an important feature of virtualization in cloud computing: it allows administrators to transparently tune the performance of the computing infrastructure. However, VM live migration may open the door to security threats.
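The scheme builds on Local Outlier Factor (LOF). As a hedged illustration of the base technique only, not of the paper's DR-LOF extension, here is LOF applied to an invented two-dimensional VM performance profile with scikit-learn.

```python
# Hedged sketch of plain Local Outlier Factor on invented VM performance
# samples (CPU %, memory %); this is the base technique the paper extends,
# not their DR-LOF algorithm.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
normal = rng.normal(loc=[40.0, 55.0], scale=3.0, size=(200, 2))
anomaly = np.array([[95.0, 90.0]])          # a migration-time spike
profile = np.vstack([normal, anomaly])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(profile)           # -1 marks outliers
print("flagged samples:", np.where(labels == -1)[0])
```

DR-LOF additionally reasons about which dimension (here, CPU or memory) makes a flagged sample anomalous; plain LOF only scores the sample as a whole.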
- RAPIDS: The Platform Inside and Out (Release 0.8)
Joshua Patterson, GM, Data Science.
RAPIDS is an end-to-end accelerated GPU data science platform: data preparation (Dask, cuDF, cuIO), model training (cuML for machine learning and cuGraph for graph analytics, alongside deep learning frameworks such as PyTorch, Chainer, and MxNet), and visualization (cuXfilter, pyViz), all sharing GPU memory.
Data processing evolution (faster data access, less data movement): Hadoop-era processing reads from HDFS on disk between every query, ETL, and ML training step. Spark's in-memory processing brings a 25-100x improvement with less code and language flexibility, but runs primarily in CPU memory. Traditional GPU processing adds a further 5-10x improvement yet requires more code, is language-rigid, and still shuttles data between CPU and GPU. RAPIDS runs substantially on the GPU. Data movement and transformation, "the bane of productivity and performance", forces applications to copy and convert data between CPU and GPU memory on every load and read; keeping data on the GPU eliminates these copy-and-convert steps.
Learning from Apache Arrow: when each system has its own internal memory format, 70-80% of computation is wasted on serialization and deserialization, and similar functionality is implemented in multiple projects. When all systems utilize the same memory format, there is no overhead for cross-system communication, and projects can share functionality (e.g., a Parquet-to-Arrow reader).
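A hedged sketch of the "keep data on the GPU" idea using cuDF; it assumes a working RAPIDS installation and an NVIDIA GPU, and the data and column names are invented.

```python
# Hedged sketch: a small ETL step that stays in GPU memory end to end.
# Requires a RAPIDS install and an NVIDIA GPU; data and columns are invented.
import cudf

df = cudf.DataFrame({
    "user":  ["a", "b", "a", "c", "b"],
    "spend": [10.0, 3.5, 7.25, 1.0, 4.5],
})

# Filter and aggregate without copying back to host memory; the API
# deliberately mirrors pandas.
result = (
    df[df["spend"] > 2.0]
    .groupby("user")
    .agg({"spend": "sum"})
)
print(result)
```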
- A Decidable Logic for Tree Data-Structures with Measurements
Xiaokang Qiu and Yanjun Wang, Purdue University.
Abstract: We present Dryaddec, a decidable logic that allows reasoning about tree data-structures with measurements. This logic supports user-defined recursive measure functions based on Max or Sum, and recursive predicates based on these measure functions, such as AVL trees or red-black trees. We prove that the logic's satisfiability is decidable. The crux of the decidability proof is a small model property which allows us to reduce the satisfiability of Dryaddec to quantifier-free linear arithmetic theory, which can be solved efficiently using SMT solvers. We also show that Dryaddec can encode a variety of verification and synthesis problems, including natural proof verification conditions for functional correctness of recursive tree-manipulating programs, legality conditions for fusing tree traversals, and synthesis conditions for conditional linear-integer arithmetic functions. We developed the decision procedure and successfully solved 220+ Dryaddec formulae raised from these application scenarios, including verifying functional correctness of programs manipulating AVL trees, red-black trees and treaps, checking the fusibility of height-based mutually recursive tree traversals, and counterexample-guided synthesis from linear integer arithmetic specifications. To our knowledge, Dryaddec is the first decidable logic that can solve such a wide variety of problems requiring flexible combination of measure-related, data-related and shape-related properties for trees.
1 Introduction (excerpt): Logical reasoning about tree data-structures has been needed in various application scenarios such as program verification [32,24,26,42,49,4,14], compiler optimization [17,18,9,44] and webpage layout engines [30,31].
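As an illustration of the Max-based recursive measure functions and predicates the abstract mentions (our notation, not taken from the paper), an AVL-style height measure and balance predicate can be written as:

```latex
% Example recursive Max-based measure and predicate (our notation,
% not necessarily the paper's):
\begin{align*}
  \mathit{ht}(\mathit{nil}) &= 0\\
  \mathit{ht}(\mathit{node}(l, x, r)) &= 1 + \max(\mathit{ht}(l), \mathit{ht}(r))\\
  \mathit{avl}(\mathit{nil}) &= \mathit{true}\\
  \mathit{avl}(\mathit{node}(l, x, r)) &= \mathit{avl}(l) \land \mathit{avl}(r)
      \land |\mathit{ht}(l) - \mathit{ht}(r)| \le 1
\end{align*}
```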
- Data Intensive Computing with LINQ to HPC: Technical Overview
Ade Miller, Principal Program Manager, LINQ to HPC.
Agenda: introduction to LINQ to HPC; using LINQ to HPC; systems management; integrating with other data technologies.
The data spectrum: one extreme is analyzing raw, unstructured data, where the analyst does not know exactly what the data contains nor what cube would be justified, and needs to do ad-hoc analyses that may never be run again; this is the end LINQ to HPC targets. The other extreme is analytics targeting a traditional data warehouse (Parallel Data Warehouse), where the analyst knows the cube he or she wants to build and knows the data sources.
What kind of data? Large data volumes (100s of TBs to 10s of PBs) of non-traditional types: unstructured and semi-structured data with weak relational schemas (text, images, videos, logs), coming from new sources such as sensors and devices, traditional applications, web servers, and public data. New technologies, distributed parallel processing frameworks with MapReduce-style programming models that are easy to scale on commodity hardware, yield new insights for new questions: How popular is my product? What is the best ad to serve? Is this a fraudulent transaction?
Overview: moving the data to the compute. So how does it work? First, store the data (data-intensive computing with Windows HPC Server 2008 R2 Service Pack 2). Second, take the processing to the data, as in this sample query from the deck:
var logentries = from line in logs where !line.StartsWith("#") select new LogEntry(line); var user = from access
- arrow: Integration to 'Apache' 'Arrow' (R package)
Package 'arrow', September 5, 2021.
Title: Integration to 'Apache' 'Arrow'
Version: 5.0.0.2
Description: 'Apache' 'Arrow' <https://arrow.apache.org/> is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. This package provides an interface to the 'Arrow C++' library.
Depends: R (>= 3.3)
License: Apache License (>= 2.0)
URL: https://github.com/apache/arrow/, https://arrow.apache.org/docs/r/
BugReports: https://issues.apache.org/jira/projects/ARROW/issues
Encoding: UTF-8
Language: en-US
SystemRequirements: C++11; for AWS S3 support on Linux, libcurl and openssl (optional)
Biarch: true
Imports: assertthat, bit64 (>= 0.9-7), methods, purrr, R6, rlang, stats, tidyselect, utils, vctrs
RoxygenNote: 7.1.1.9001
VignetteBuilder: knitr
Suggests: decor, distro, dplyr, hms, knitr, lubridate, pkgload, reticulate, rmarkdown, stringi, stringr, testthat, tibble, withr
Collate: 'arrowExports.R' 'enums.R' 'arrow-package.R' 'type.R' 'array-data.R' 'arrow-datum.R' 'array.R' 'arrow-tabular.R' 'buffer.R' 'chunked-array.R' 'io.R' 'compression.R' 'scalar.R' 'compute.R' 'config.R' 'csv.R' 'dataset.R' 'dataset-factory.R' 'dataset-format.R' 'dataset-partition.R' 'dataset-scan.R' 'dataset-write.R' 'deprecated.R' 'dictionary.R' 'dplyr-arrange.R' 'dplyr-collect.R' 'dplyr-eval.R' 'dplyr-filter.R' 'expression.R' 'dplyr-functions.R' 'dplyr-group-by.R' 'dplyr-mutate.R' 'dplyr-select.R' 'dplyr-summarize.R'
- Cloud Computing Paradigms for Pleasingly Parallel Biomedical Applications
Thilina Gunarathne, Tak-Lon Wu, Jong Youl Choi, Seung-Hee Bae, Judy Qiu; School of Informatics and Computing / Pervasive Technology Institute, Indiana University, Bloomington. {tgunarat, taklwu, jychoi, sebae, xqiu}@indiana.edu
Abstract: Cloud computing offers exciting new approaches for scientific computing that leverage the hardware and software investments in large-scale data centers by major commercial players. Loosely coupled problems are very important in many scientific fields and are on the rise with the ongoing move towards data-intensive computing. There exist several approaches to leverage clouds and cloud-oriented data processing frameworks to perform pleasingly parallel computations. This paper presents three pleasingly parallel biomedical applications: (1) assembly of genome fragments, (2) sequence alignment and similarity search, and (3) dimension reduction in the analysis of chemical structures, implemented utilizing the cloud infrastructure-service-based utility computing models of Amazon Web Services and Microsoft Windows Azure, as well as the MapReduce-based data processing frameworks Apache Hadoop and Microsoft DryadLINQ. The paper reviews and compares these frameworks based on performance, efficiency, cost, and usability. The cloud-service-based utility computing model and managed parallelism (MapReduce) exhibited comparable performance and efficiency for the applications considered. The authors analyze the variations in cost between the different platform choices (e.g., EC2 instance types), highlighting the need to select the appropriate platform based on the nature of the computation.
1. Introduction (excerpt): Scientists are overwhelmed with the increasing data processing needs arising from the storm of data flowing through virtually every field of science.
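As a hedged illustration of the "pleasingly parallel" pattern the paper studies, independent work items with no inter-task communication, here is a minimal Python sketch; the task itself is a stand-in, not one of the paper's biomedical workloads.

```python
# Hedged sketch of the pleasingly parallel pattern: every item is processed
# independently, so the work scales out with no coordination between tasks.
from multiprocessing import Pool

def align(fragment: str) -> str:
    # Stand-in for an independent unit of work, e.g., aligning one sequence
    # fragment against a reference (not the paper's actual pipeline).
    return fragment[::-1]

if __name__ == "__main__":
    fragments = ["ACGT", "TTGA", "CCAT", "GGTA"]
    with Pool(processes=4) as pool:
        results = pool.map(align, fragments)   # each item runs independently
    print(results)
```

On a cloud platform the same structure maps directly onto a MapReduce job with an empty reduce phase, which is why frameworks like Hadoop and DryadLINQ handle such workloads well.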
- Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters
Qinghao Hu (1,2), Peng Sun (3), Shengen Yan (3), Yonggang Wen (1), Tianwei Zhang (1,*). (1) School of Computer Science and Engineering, Nanyang Technological University; (2) S-Lab, Nanyang Technological University; (3) SenseTime. {qinghao.hu, ygwen, tianwei.zhang}@ntu.edu.sg, {sunpeng1, yanshengen}@sensetime.com
Abstract: Modern GPU datacenters are critical for delivering Deep Learning (DL) models and services in both the research community and industry. When operating a datacenter, optimization of resource scheduling and management can bring significant financial benefits. Achieving this goal requires a deep understanding of the job features and user behaviors. We present a comprehensive study about the characteristics of DL jobs and resource management. First, we perform a large-scale analysis of real-world job traces from SenseTime. We uncover some interesting conclusions from the perspectives of clusters, jobs and users, which can facilitate the cluster system designs. Second, we introduce a general-purpose framework, which manages resources based on historical data.
1 Introduction (excerpt): Over the years, we have witnessed the remarkable impact of Deep Learning (DL) technology and applications on every aspect of our daily life, e.g., face recognition [65], language translation [64], advertisement recommendation [21], etc. The outstanding performance of DL models comes from the complex neural network structures and may contain trillions of parameters [25]. Training a production model may require large amounts of GPU resources to support thousands of petaflops operations [15]. Hence, it is a common practice for research institutes, AI companies and cloud providers to build large-scale GPU clusters to facilitate DL model development. These clusters are managed in a multi-tenancy fashion, offering