Programming Models to Support Data Science Workflows

Total Page:16

File Type:pdf, Size:1020Kb

Programming Models to Support Data Science Workflows UNIVERSITAT POLITÈCNICA DE CATALUNYA (UPC) BARCELONATECH COMPUTER ARCHITECTURE DEPARTMENT (DAC) Programming models to support Data Science workflows PH.D. THESIS 2020 | SPRING SEMESTER Author: Advisors: Cristián RAMÓN-CORTÉS Dra. Rosa M. BADIA SALA VILARRODONA [email protected] [email protected] Dr. Jorge EJARQUE ARTIGAS [email protected] iii ”Apenas él le amalaba el noema, a ella se le agolpaba el clémiso y caían en hidromurias, en salvajes ambonios, en sustalos exas- perantes. Cada vez que él procuraba relamar las incopelusas, se enredaba en un grimado quejumbroso y tenía que envul- sionarse de cara al nóvalo, sintiendo cómo poco a poco las arnillas se espejunaban, se iban apeltronando, reduplimiendo, hasta quedar tendido como el trimalciato de ergomanina al que se le han dejado caer unas fílulas de cariaconcia. Y sin em- bargo era apenas el principio, porque en un momento dado ella se tordulaba los hurgalios, consintiendo en que él aprox- imara suavemente sus orfelunios. Apenas se entreplumaban, algo como un ulucordio los encrestoriaba, los extrayuxtaba y paramovía, de pronto era el clinón, la esterfurosa convulcante de las mátricas, la jadehollante embocapluvia del orgumio, los esproemios del merpasmo en una sobrehumítica agopausa. ¡Evohé! ¡Evohé! Volposados en la cresta del murelio, se sen- tían balpamar, perlinos y márulos. Temblaba el troc, se vencían las marioplumas, y todo se resolviraba en un profundo pínice, en niolamas de argutendidas gasas, en carinias casi crueles que los ordopenaban hasta el límite de las gunfias.” Julio Cortázar, Rayuela v Dedication This work would not have been possible without the effort and patience of the people around me. Thank you for encouraging me to get to this point and sharing so many great moments on the way. To Laura, for her kindness and devotion, and for supporting me through thick and thin despite my difficult character. Special thanks to my loving mother and father, who have al- ways guided and encouraged me to never stop learning; and who bought me my first computer. Last but not least, to my sweet little sister, Marta, who always stands on my side and whose affection and support keeps me always up. Wholeheartedly, Cristián Ramón-Cortés Vilarrodona vii Declaration of authorship I hereby declare that, except where specific reference is made to the work of others, this thesis dissertation has been composed by me and it is based on my own work. None of the contents of this dissertation has been previously published nor submitted, in whole or in part, to any other examination in this or any other university. Signed: Date: ix Acknowledgements I gratefully thank my supervisors Rosa M. Badia Sala and Jorge Ejarque Artigas for all their assistance during my career at the Barcelona Supercomputing Center (BSC-CNS) and for giving me the opportunity to collaborate on this project. I also thank all my colleagues, current and former members of the Workflows and Dis- tributed Computing team from the Barcelona Supercomputing Center (BSC), for their useful comments, remarks, and engagement through the learning process of this thesis: Francesc Lordan, Javier Alvarez, Francisco Javier Conejero, Daniele Lezzi, Ramon Amela, Pol Al- varez, Sergio Rodríguez, Carlos Segarra, Marc Dominguez, Salvi Solà, Hatem Elshazly, and Nihad Mammadli. On the other hand, this thesis has been financed by the Ministry of Economy and Com- petitiveness with a predoctoral grant under contract BES-2016-076791. This work has also been partially supported by the Spanish Government (SEV2015-0493); the Spanish Ministry of Science and Innovation (contract TIN2015-65316-P); the Generalitat de Catalunya (con- tract 2014-SGR-1051); the Joint Laboratory for Extreme Scale Computing (JLESC); the Intel- BSC Exascale Lab; and the European Commission through the Horizon 2020 Research and Innovation program under contracts 687584 (TANGO project), 676556 (MuG project), 690116 (EUBra-BIGSEA project), 671591 (NEXTGenIO project), and 800898 (ExaQUte project). xi UNIVERSITAT POLITÈCNICA DE CATALUNYA (UPC) BARCELONATECH Computer Architecture Department (DAC) Abstract Programming models to support Data Science workflows by Cristián RAMÓN-CORTÉS VILARRODONA Data Science workflows have become a must to progress in many scientific areas such as life, health, and earth sciences. In contrast to traditional HPC workflows, they are more heterogeneous; combining binary executions, MPI simulations, multi-threaded applications, custom analysis (possibly written in Java, Python, C/C++ or R), and real-time processing. Furthermore, in the past, field experts were capable of programming and running small simulations. However, nowadays, simulations requiring hundreds or thousands of cores are widely used and, to this point, efficiently programming them becomes a challenge even for computer sciences. Thus, programming languages and models make a considerable effort to ease the programmability while maintaining acceptable performance. This thesis contributes to the adaptation of High-Performance frameworks to support the needs and challenges of Data Science workflows by extending COMPSs, a mature, gene- ral-purpose, task-based, distributed programming model. First, we enhance our prototype to orchestrate different frameworks inside a single programming model so that non-expert users can build complex workflows where some steps require highly optimised state of the art frameworks. This extension includes the @binary, @OmpSs, @MPI, @COMPSs, and @MultiNode annotations for both Java and Python workflows. Second, we integrate container technologies to enable developers to easily port, dis- tribute, and scale their applications to distributed computing platforms. This combination provides a straightforward methodology to parallelise applications from sequential codes along with efficient image management and application deployment that ease the pack- aging and distribution of applications. We distinguish between static, HPC, and dynamic container management and provide representative use cases for each scenario using Docker, Singularity, and Mesos. Third, we design, implement and integrate AutoParallel, a Python module to automati- cally find an appropriate task-based parallelisation of affine loop nests and execute them in parallel in a distributed computing infrastructure. It is based on sequential programming and requires one single annotation (the @parallel Python decorator) so that anyone with intermediate-level programming skills can scale up an application to hundreds of cores. Finally, we propose a way to extend task-based management systems to support continu- ous input and output data to enable the combination of task-based workflows and dataflows (Hybrid Workflows) using one single programming model. Hence, developers can build complex Data Science workflows with different approaches depending on the requirements without the effort of combining several frameworks at the same time. Also, to illustrate the capabilities of Hybrid Workflows, we have built a Distributed Stream Library that can be easily integrated with existing task-based frameworks to provide support for dataflows. The library provides a homogeneous, generic, and simple representation of object and file streams in both Java and Python; enabling complex workflows to handle any data type without dealing directly with the streaming back-end. Keywords: Distributed Computing, High-Performance Computing, Data Science pipe- lines, Task-based Workflows, Dataflows, Containers, COMPSs, PyCOMPSs, AutoParallel, Docker, Pluto, Kafka xiii UNIVERSITAT POLITÈCNICA DE CATALUNYA (UPC) BARCELONATECH Computer Architecture Department (DAC) Resumen Programming models to support Data Science workflows por Cristián RAMÓN-CORTÉS VILARRODONA Los flujos de trabajo de Data Science se han convertido en una necesidad para progresar en muchas áreas científicas como las ciencias de la vida, la salud y la tierra. A diferencia de los flujos de trabajo tradicionales para la CAP, los flujos de Data Science son más heterogé- neos; combinando la ejecución de binarios, simulaciones MPI, aplicaciones multiproceso, análisis personalizados (posiblemente escritos en Java, Python, C/C++ o R) y computa- ciones en tiempo real. Mientras que en el pasado los expertos de cada campo eran capaces de programar y ejecutar pequeñas simulaciones, hoy en día, estas simulaciones representan un desafío incluso para los expertos ya que requieren cientos o miles de núcleos. Por esta razón, los lenguajes y modelos de programación actuales se esfuerzan considerablemente en incrementar la programabilidad manteniendo un rendimiento aceptable. Esta tesis contribuye a la adaptación de modelos de programación para la CAP para afrontar las necesidades y desafíos de los flujos de Data Science extendiendo COMPSs, un modelo de programación distribuida maduro, de propósito general, y basado en tareas. En primer lugar, mejoramos nuestro prototipo para orquestar diferentes software para que los usuarios no expertos puedan crear flujos complejos usando un único modelo donde algunos pasos requieran tecnologías altamente optimizadas. Esta extensión incluye las anotaciones de @binary, @OmpSs, @MPI, @COMPSs, y @MultiNode para flujos en Java y Python. En segundo lugar, integramos tecnologías de contenedores para permitir a los desarrol- ladores portar, distribuir y escalar fácilmente sus aplicaciones en plataformas distribuidas. Además de una metodología sencilla para paralelizar aplicaciones a partir de códigos se- cuenciales, esta combinación proporciona una gestión de imágenes y
Recommended publications
  • Evaluating DDS, MQTT, and Zeromq Under Different Iot Traffic Conditions
    Evaluating DDS, MQTT, and ZeroMQ Under Different IoT Traffic Conditions Zhuangwei Kang Robert Canady Abhishek Dubey Vanderbilt University Vanderbilt University Vanderbilt University Nashville, Tennessee Nashville, Tennessee Nashville, Tennessee [email protected] [email protected] [email protected] Aniruddha Gokhale Shashank Shekhar Matous Sedlacek Vanderbilt University Siemens Technology Siemens Technology Nashville, Tennessee Princeton, New Jersey Munich, Germany [email protected] [email protected] [email protected] Abstract Keywords: Publish/Subscribe Middleware, Benchmark- ing, MQTT, DDS, ZeroMQ, Performance Evaluation Publish/Subscribe (pub/sub) semantics are critical for IoT applications due to their loosely coupled nature. Although OMG DDS, MQTT, and ZeroMQ are mature pub/sub solutions used for IoT, prior studies show that their performance varies significantly under different 1 Introduction load conditions and QoS configurations, which makes Distributed deployment of real-time applications and middleware selection and configuration decisions hard. high-speed dissemination of massive data have been hall- Moreover, the load conditions and role of QoS settings in marks of the Internet of Things (IoT) platforms. IoT prior comparison studies are not comprehensive and well- applications typically adopt publish/subscribe (pub/- documented. To address these limitations, we (1) propose sub) middleware for asynchronous and cross-platform a set of performance-related properties for pub/sub mid- communication. OMG Data Distribution Service (DDS), dleware and investigate their support in DDS, MQTT, ZeroMQ, and MQTT are three representative pub/sub and ZeroMQ; (2) perform systematic experiments under technologies that have entirely different architectures (de- three representative, lab-based real-world IoT use cases; centralized data-centric, decentralized message-centric, and (3) improve DDS performance by applying three and centralized message-centric, respectively).
    [Show full text]
  • This Paper Must Be Cited As
    Document downloaded from: http://hdl.handle.net/10251/64607 This paper must be cited as: Luzuriaga Quichimbo, JE.; Pérez, M.; Boronat, P.; Cano Escribá, JC.; Tavares De Araujo Cesariny Calafate, CM.; Manzoni, P. (2015). A comparative evaluation of AMQP and MQTT protocols over unstable and mobile networks. 12th IEEE Consumer Communications and Networking Conference (CCNC 2015). IEEE. doi:10.1109/CCNC.2015.7158101. The final publication is available at http://dx.doi.org/10.1109/CCNC.2015.7158101 Copyright IEEE Additional Information © 2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. A comparative evaluation of AMQP and MQTT protocols over unstable and mobile networks Jorge E. Luzuriaga∗, Miguel Perezy, Pablo Boronaty, Juan Carlos Cano∗, Carlos Calafate∗, Pietro Manzoni∗ ∗Department of Computer Engineering Universitat Politecnica` de Valencia,` Valencia, SPAIN [email protected], jucano,calafate,[email protected] yUniversitat Jaume I, Castello´ de la Plana, SPAIN [email protected], [email protected] Abstract—Message oriented middleware (MOM) refers to business application [6]. It works like instant messaging or the software infrastructure supporting sending and receiving email, and the difference towards these available
    [Show full text]
  • Communicate the Future PROCEEDINGS HYATT REGENCY ORLANDO 20–23 MAY | ORLANDO, FL and Summit.Stc.Org #STC18
    Communicate the Future PROCEEDINGS HYATT REGENCY ORLANDO 20–23 MAY | ORLANDO, FL www.stc.org and summit.stc.org #STC18 SOCIETY FOR TECHNICAL COMMUNICATION 1 Efficiency exemplified Organizations globally use Adobe FrameMaker (2017 release) Request demo to transform the way they create and deliver content Accelerated turnaround time for 70% reduction in printing and paper customized publications material cost Faster creation and delivery of content 50% faster production of PDF for new products across devices documentation 20% faster development of Accelerated publishing across formats course content Increased efficiency and reduced 20% improvement in process translation costs while producing multilingual manuals efficiency For a personalized demo or questions, write to us at [email protected] © 2018 Adobe Systems Incorporated. All rights reserved. The papers published in these proceedings were reproduced from originals furnished by the authors. The authors, not the Society for Technical Communication (STC), are solely responsible for the opinions expressed, the integrity of the information presented, and the attribution of sources. The papers presented in this publication are the works of their respective authors. Minor copyediting changes were made to ensure consistency. STC grants permission to educators and academic libraries to distribute articles from these proceedings for classroom purposes. There is no charge to these institutions, provided they give credit to the author, the proceedings, and STC. All others must request permission. All product and company names herein are the property of their respective owners. © 2018 Society for Technical Communication 9401 Lee Highway, Suite 300 Fairfax, VA 22031 USA +1.703.522.4114 www.stc.org Design and layout by Avon J.
    [Show full text]
  • Advanced Architecture for Java Universal Message Passing (AA-JUMP)
    The International Arab Journal of Information Technology, Vol. 15, No. 3, May 2018 429 Advanced Architecture for Java Universal Message Passing (AA-JUMP) Adeel-ur-Rehman1 and Naveed Riaz2 1National Centre for Physics, Pakistan 2School of Electrical Engineering and Computer Science, National University of Science and Technology, Pakistan Abstract: The Architecture for Java Universal Message Passing (A-JUMP) is a Java based message passing framework. A- JUMP offers flexibility for programmers in order to write parallel applications making use of multiple programming languages. There is also a provision to use various network protocols for message communication. The results for standard benchmarks like ping-pong latency, Embarrassingly Parallel (EP) code execution, Java Grande Forum (JGF) Crypt etc. gave us the conclusion that for the cases where the data size is smaller than 256K bytes, the numbers are comparative with some of its predecessor models like Message Passing Interface CHameleon version 2 (MPICH2), Message Passing interface for Java (MPJ) Express etc. But, in case, the packet size exceeds 256K bytes, the performance of the A-JUMP model seems to be severely hampered. Hence, taking that peculiar behaviour into account, this paper talks about a strategy devised to cope up with the performance limitation observed under the base A-JUMP implementation, giving birth to an Advanced A-JUMP (AA- JUMP) methodology while keeping the basic workflow of the original model intact. AA-JUMP addresses to improve performance of A-JUMP by preserving its various traits like portability, simplicity, scalability etc. which are the key features offered by flourishing High Performance Computing (HPC) oriented frameworks of now-a-days.
    [Show full text]
  • Dcamp: Distributed Common Api for Measuring
    DCAMP: DISTRIBUTED COMMON API FOR MEASURING PERFORMANCE A Thesis presented to the Faculty of California Polytechnic State University San Luis Obispo In Partial Fulfillment of the Requirements for the Degree Master of Science in Computer Science by Alexander Paul Sideropoulos December 2014 c 2014 Alexander Paul Sideropoulos ALL RIGHTS RESERVED ii COMMITTEE MEMBERSHIP TITLE: dCAMP: Distributed Common API for Measuring Performance AUTHOR: Alexander Paul Sideropoulos DATE SUBMITTED: December 2014 COMMITTEE CHAIR: Michael Haungs, Ph.D. Associate Professor of Computer Science COMMITTEE MEMBER: Aaron Keen, Ph.D. Assistant Professor of Computer Science COMMITTEE MEMBER: John Bellardo, Ph.D. Associate Professor of Computer Science iii ABSTRACT dCAMP: Distributed Common API for Measuring Performance Alexander Paul Sideropoulos Although the nearing end of Moore's Law has been predicted numerous times in the past [22], it will eventually come to pass. In forethought of this, many modern computing systems have become increasingly complex, distributed, and parallel. As software is developed on and for these complex systems, a common API is necessary for gathering vital performance related metrics while remaining transparent to the user, both in terms of system impact and ease of use. Several distributed performance monitoring and testing systems have been proposed and implemented by both research and commercial institutions. How- ever, most of these systems do not meet several fundamental criterion for a truly useful distributed performance monitoring system: 1) variable data delivery mod- els, 2) security, 3) scalability, 4) transparency, 5) completeness, 6) validity, and 7) portability [30]. This work presents dCAMP: Distributed Common API for Measuring Per- formance, a distributed performance framework built on top of Mark Gabel and Michael Haungs' work with CAMP.
    [Show full text]
  • Efficient Support for Data-Intensive Scientific Workflows on Geo
    Efficient support for data-intensive scientific workflows on geo-distributed clouds Luis Eduardo Pineda Morales To cite this version: Luis Eduardo Pineda Morales. Efficient support for data-intensive scientific workflows ongeo- distributed clouds. Computation and Language [cs.CL]. INSA de Rennes, 2017. English. NNT : 2017ISAR0012. tel-01645434v2 HAL Id: tel-01645434 https://tel.archives-ouvertes.fr/tel-01645434v2 Submitted on 13 Dec 2017 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. A Lucía, que no dejó que me rindiera. Acknowledgements One page is not enough to thank all the people who, in one way or another, have helped, taught, and motivated me along this journey. My deepest gratitude to them all. To my amazing supervisors Gabriel Antoniu and Alexandru Costan, words fall short to thank you for your guidance, support and patience during these 3+ years. Gabriel: ¡Gracias! For your trust and support went often beyond the academic. Alex: it has been my honor to be your first PhD student, I hope I met the expectations. To my family, Lucía, for standing by my side every time, everywhere; you are the driving force in my life, I love you.
    [Show full text]
  • VSI's Open Source Strategy
    VSI's Open Source Strategy Plans and schemes for Open Source so9ware on OpenVMS Bre% Cameron / Camiel Vanderhoeven April 2016 AGENDA • Programming languages • Cloud • Integraon technologies • UNIX compability • Databases • Analy;cs • Web • Add-ons • Libraries/u;li;es • Other consideraons • SoDware development • Summary/conclusions tools • Quesons Programming languages • Scrip;ng languages – Lua – Perl (probably in reasonable shape) – Tcl – Python – Ruby – PHP – JavaScript (Node.js and friends) – Also need to consider tools and packages commonly used with these languages • Interpreted languages – Scala (JVM) – Clojure (JVM) – Erlang (poten;ally a good fit with OpenVMS; can get good support from ESL) – All the above are seeing increased adop;on 3 Programming languages • Compiled languages – Go (seeing rapid adop;on) – Rust (relavely new) – Apple Swi • Prerequisites (not all are required in all cases) – LLVM backend – Tweaks to OpenVMS C and C++ compilers – Support for latest language standards (C++) – Support for some GNU C/C++ extensions – Updates to OpenVMS C RTL and threads library 4 Programming languages 1. JavaScript 2. Java 3. PHP 4. Python 5. C# 6. C++ 7. Ruby 8. CSS 9. C 10. Objective-C 11. Perl 12. Shell 13. R 14. Scala 15. Go 16. Haskell 17. Matlab 18. Swift 19. Clojure 20. Groovy 21. Visual Basic 5 See h%p://redmonk.com/sogrady/2015/07/01/language-rankings-6-15/ Programming languages Growing programming languages, June 2015 Steve O’Grady published another edi;on of his great popularity study on programming languages: RedMonk Programming Language Rankings: June 2015. As usual, it is a very valuable piece. There are many take-away from this research.
    [Show full text]
  • Apache Taverna
    Apache Taverna hp://taverna.incubator.apache.org/ Donal Fellows San Soiland-Reyes @donalfellows @soilandreyes [email protected] [email protected] hp://orcid.org/0000-0002-9091-5938 hp://orcid.org/0000-0001-9842-9718 Alan R Williams @alanrw [email protected] hp://orcid.org/0000-0003-3156-2105 This work is licensed under a Creave Commons Collaboraons Workshop 2015-03-26 A,ribuIon 3.0 Unported License. Taverna Workflow Ecosystem • Workflow Language — SCUFL2 (and t2flow) • Workflow Engine — Taverna • Used in… – Taverna Command Line Tool – Taverna Server – Taverna WorkBench • Allied services – myExperiment, workflow repository – Service Catalographer, service catalog software • Instantiated as BioCatalogue, BiodiversityCatalogue, … NERSC Workflow Day 2 Map of the Taverna Ecosystem Taverna Ruby client Taverna Online Applicaon-Specific Portals Player UI UI Plugins UI Plugins Plugins Taverna UI Plugins Lite Taverna Taverna Workbench UI Plugins Command UI Plugins Line Tool REST API Components Taverna Other TavernaUI Plugins APIs Server Servers Taverna SOAP API Taverna Core AcIvity Engine UI Ps UI Plugins Plugins Workflow Service many Repository Catalogs services… 3 Users, Scientific Areas, Projects Taverna In Use NERSC Workflow Day 4 Taverna Users Worldwide NERSC Workflow Day 5 Taverna Uses — Scientific Areas • Biodiversity — BioVeL project • Digital Preservation — SCAPE project • Astronomy — AstroTaverna product • Solar Wind Physics — HELIO project • In silico Medicine — VPH-Share project NERSC Workflow Day 6 Biodiversity: BioVeL • Virtual e-LaBoratory for
    [Show full text]
  • A Review of Scalable Bioinformatics Pipelines
    Data Sci. Eng. (2017) 2:245–251 https://doi.org/10.1007/s41019-017-0047-z A Review of Scalable Bioinformatics Pipelines 1 1 Bjørn Fjukstad • Lars Ailo Bongo Received: 28 May 2017 / Revised: 29 September 2017 / Accepted: 2 October 2017 / Published online: 23 October 2017 Ó The Author(s) 2017. This article is an open access publication Abstract Scalability is increasingly important for bioin- Scalability is increasingly important for these analysis formatics analysis services, since these must handle larger services, since the cost of instruments such as next-gener- datasets, more jobs, and more users. The pipelines used to ation sequencing machines is rapidly decreasing [1]. The implement analyses must therefore scale with respect to the reduced costs have made the machines more available resources on a single compute node, the number of nodes which has caused an increase in dataset size, the number of on a cluster, and also to cost-performance. Here, we survey datasets, and hence the number of users [2]. The backend several scalable bioinformatics pipelines and compare their executing the analyses must therefore scale up (vertically) design and their use of underlying frameworks and with respect to the resources on a single compute node, infrastructures. We also discuss current trends for bioin- since the resource usage of some analyses increases with formatics pipeline development. dataset size. For example, short sequence read assemblers [3] may require TBs of memory for big datasets and tens of Keywords Pipeline Á Bioinformatics Á Scalable Á CPU cores [4]. The analysis must also scale out (horizon- Infrastructure Á Analysis services tally) to take advantage of compute clusters and clouds.
    [Show full text]
  • Zeromq
    ZeroMQ Martin Sústrik <> ØMQ is a messaging system, or "message-oriented middleware", if you will. It's used in environments as diverse as financial services, game development, embedded systems, academic research and aerospace. Messaging systems work basically as instant messaging for applications. An application decides to communicate an event to another application (or multiple applications), it assembles the data to be sent, hits the "send" button and there we go—the messaging system takes care of the rest. Unlike instant messaging, though, messaging systems have no GUI and assume no human beings at the endpoints capable of intelligent intervention when something goes wrong. Messaging systems thus have to be both fault-tolerant and much faster than common instant messaging. ØMQ was originally conceived as an ultra-fast messaging system for stock trading and so the focus was on extreme optimization. The first year of the project was spent devising benchmarking methodology and trying to define an architecture that was as efficient as possible. Later on, approximately in the second year of development, the focus shifted to providing a generic system for building distributed applications and supporting arbitrary messaging patterns, various transport mechanisms, arbitrary language bindings, etc. During the third year the focus was mainly on improving usability and flattening the learning curve. We've adopted the BSD Sockets API, tried to clean up the semantics of individual messaging patterns, and so on. Hopefully, this chapter will give an insight into how the three goals above translated into the internal architecture of ØMQ, and provide some tips for those who are struggling with the same problems.
    [Show full text]
  • Presto: the Definitive Guide
    Presto The Definitive Guide SQL at Any Scale, on Any Storage, in Any Environment Compliments of Matt Fuller, Manfred Moser & Martin Traverso Virtual Book Tour Starburst presents Presto: The Definitive Guide Register Now! Starburst is hosting a virtual book tour series where attendees will: Meet the authors: • Meet the authors from the comfort of your own home Matt Fuller • Meet the Presto creators and participate in an Ask Me Anything (AMA) session with the book Manfred Moser authors + Presto creators • Meet special guest speakers from Martin your favorite podcasts who will Traverso moderate the AMA Register here to save your spot. Praise for Presto: The Definitive Guide This book provides a great introduction to Presto and teaches you everything you need to know to start your successful usage of Presto. —Dain Sundstrom and David Phillips, Creators of the Presto Projects and Founders of the Presto Software Foundation Presto plays a key role in enabling analysis at Pinterest. This book covers the Presto essentials, from use cases through how to run Presto at massive scale. —Ashish Kumar Singh, Tech Lead, Bigdata Query Processing Platform, Pinterest Presto has set the bar in both community-building and technical excellence for lightning- fast analytical processing on stored data in modern cloud architectures. This book is a must-read for companies looking to modernize their analytics stack. —Jay Kreps, Cocreator of Apache Kafka, Cofounder and CEO of Confluent Presto has saved us all—both in academia and industry—countless hours of work, allowing us all to avoid having to write code to manage distributed query processing.
    [Show full text]
  • ADMI Cloud Computing Presentation
    ECSU/IU NSF EAGER: Remote Sensing Curriculum ADMI Cloud Workshop th th Enhancement using Cloud Computing June 10 – 12 2016 Day 1 Introduction to Cloud Computing with Amazon EC2 and Apache Hadoop Prof. Judy Qiu, Saliya Ekanayake, and Andrew Younge Presented By Saliya Ekanayake 6/10/2016 1 Cloud Computing • What’s Cloud? Defining this is not worth the time Ever heard of The Blind Men and The Elephant? If you still need one, see NIST definition next slide The idea is to consume X as-a-service, where X can be Computing, storage, analytics, etc. X can come from 3 categories Infrastructure-as-a-S, Platform-as-a-Service, Software-as-a-Service Classic Cloud Computing Computing IaaS PaaS SaaS My washer Rent a washer or two or three I tell, Put my clothes in and My bleach My bleach comforter dry clean they magically appear I wash I wash shirts regular clean clean the next day 6/10/2016 2 The Three Categories • Software-as-a-Service Provides web-enabled software Ex: Google Gmail, Docs, etc • Platform-as-a-Service Provides scalable computing environments and runtimes for users to develop large computational and big data applications Ex: Hadoop MapReduce • Infrastructure-as-a-Service Provide virtualized computing and storage resources in a dynamic, on-demand fashion. Ex: Amazon Elastic Compute Cloud 6/10/2016 3 The NIST Definition of Cloud Computing? • “Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.” On-demand self-service, broad network access, resource pooling, rapid elasticity, measured service, http://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-145.pdf • However, formal definitions may not be very useful.
    [Show full text]