Programming Models to Support Data Science Workflows
Total Page:16
File Type:pdf, Size:1020Kb
UNIVERSITAT POLITÈCNICA DE CATALUNYA (UPC) BARCELONATECH COMPUTER ARCHITECTURE DEPARTMENT (DAC) Programming models to support Data Science workflows PH.D. THESIS 2020 | SPRING SEMESTER Author: Advisors: Cristián RAMÓN-CORTÉS Dra. Rosa M. BADIA SALA VILARRODONA [email protected] [email protected] Dr. Jorge EJARQUE ARTIGAS [email protected] iii ”Apenas él le amalaba el noema, a ella se le agolpaba el clémiso y caían en hidromurias, en salvajes ambonios, en sustalos exas- perantes. Cada vez que él procuraba relamar las incopelusas, se enredaba en un grimado quejumbroso y tenía que envul- sionarse de cara al nóvalo, sintiendo cómo poco a poco las arnillas se espejunaban, se iban apeltronando, reduplimiendo, hasta quedar tendido como el trimalciato de ergomanina al que se le han dejado caer unas fílulas de cariaconcia. Y sin em- bargo era apenas el principio, porque en un momento dado ella se tordulaba los hurgalios, consintiendo en que él aprox- imara suavemente sus orfelunios. Apenas se entreplumaban, algo como un ulucordio los encrestoriaba, los extrayuxtaba y paramovía, de pronto era el clinón, la esterfurosa convulcante de las mátricas, la jadehollante embocapluvia del orgumio, los esproemios del merpasmo en una sobrehumítica agopausa. ¡Evohé! ¡Evohé! Volposados en la cresta del murelio, se sen- tían balpamar, perlinos y márulos. Temblaba el troc, se vencían las marioplumas, y todo se resolviraba en un profundo pínice, en niolamas de argutendidas gasas, en carinias casi crueles que los ordopenaban hasta el límite de las gunfias.” Julio Cortázar, Rayuela v Dedication This work would not have been possible without the effort and patience of the people around me. Thank you for encouraging me to get to this point and sharing so many great moments on the way. To Laura, for her kindness and devotion, and for supporting me through thick and thin despite my difficult character. Special thanks to my loving mother and father, who have al- ways guided and encouraged me to never stop learning; and who bought me my first computer. Last but not least, to my sweet little sister, Marta, who always stands on my side and whose affection and support keeps me always up. Wholeheartedly, Cristián Ramón-Cortés Vilarrodona vii Declaration of authorship I hereby declare that, except where specific reference is made to the work of others, this thesis dissertation has been composed by me and it is based on my own work. None of the contents of this dissertation has been previously published nor submitted, in whole or in part, to any other examination in this or any other university. Signed: Date: ix Acknowledgements I gratefully thank my supervisors Rosa M. Badia Sala and Jorge Ejarque Artigas for all their assistance during my career at the Barcelona Supercomputing Center (BSC-CNS) and for giving me the opportunity to collaborate on this project. I also thank all my colleagues, current and former members of the Workflows and Dis- tributed Computing team from the Barcelona Supercomputing Center (BSC), for their useful comments, remarks, and engagement through the learning process of this thesis: Francesc Lordan, Javier Alvarez, Francisco Javier Conejero, Daniele Lezzi, Ramon Amela, Pol Al- varez, Sergio Rodríguez, Carlos Segarra, Marc Dominguez, Salvi Solà, Hatem Elshazly, and Nihad Mammadli. On the other hand, this thesis has been financed by the Ministry of Economy and Com- petitiveness with a predoctoral grant under contract BES-2016-076791. This work has also been partially supported by the Spanish Government (SEV2015-0493); the Spanish Ministry of Science and Innovation (contract TIN2015-65316-P); the Generalitat de Catalunya (con- tract 2014-SGR-1051); the Joint Laboratory for Extreme Scale Computing (JLESC); the Intel- BSC Exascale Lab; and the European Commission through the Horizon 2020 Research and Innovation program under contracts 687584 (TANGO project), 676556 (MuG project), 690116 (EUBra-BIGSEA project), 671591 (NEXTGenIO project), and 800898 (ExaQUte project). xi UNIVERSITAT POLITÈCNICA DE CATALUNYA (UPC) BARCELONATECH Computer Architecture Department (DAC) Abstract Programming models to support Data Science workflows by Cristián RAMÓN-CORTÉS VILARRODONA Data Science workflows have become a must to progress in many scientific areas such as life, health, and earth sciences. In contrast to traditional HPC workflows, they are more heterogeneous; combining binary executions, MPI simulations, multi-threaded applications, custom analysis (possibly written in Java, Python, C/C++ or R), and real-time processing. Furthermore, in the past, field experts were capable of programming and running small simulations. However, nowadays, simulations requiring hundreds or thousands of cores are widely used and, to this point, efficiently programming them becomes a challenge even for computer sciences. Thus, programming languages and models make a considerable effort to ease the programmability while maintaining acceptable performance. This thesis contributes to the adaptation of High-Performance frameworks to support the needs and challenges of Data Science workflows by extending COMPSs, a mature, gene- ral-purpose, task-based, distributed programming model. First, we enhance our prototype to orchestrate different frameworks inside a single programming model so that non-expert users can build complex workflows where some steps require highly optimised state of the art frameworks. This extension includes the @binary, @OmpSs, @MPI, @COMPSs, and @MultiNode annotations for both Java and Python workflows. Second, we integrate container technologies to enable developers to easily port, dis- tribute, and scale their applications to distributed computing platforms. This combination provides a straightforward methodology to parallelise applications from sequential codes along with efficient image management and application deployment that ease the pack- aging and distribution of applications. We distinguish between static, HPC, and dynamic container management and provide representative use cases for each scenario using Docker, Singularity, and Mesos. Third, we design, implement and integrate AutoParallel, a Python module to automati- cally find an appropriate task-based parallelisation of affine loop nests and execute them in parallel in a distributed computing infrastructure. It is based on sequential programming and requires one single annotation (the @parallel Python decorator) so that anyone with intermediate-level programming skills can scale up an application to hundreds of cores. Finally, we propose a way to extend task-based management systems to support continu- ous input and output data to enable the combination of task-based workflows and dataflows (Hybrid Workflows) using one single programming model. Hence, developers can build complex Data Science workflows with different approaches depending on the requirements without the effort of combining several frameworks at the same time. Also, to illustrate the capabilities of Hybrid Workflows, we have built a Distributed Stream Library that can be easily integrated with existing task-based frameworks to provide support for dataflows. The library provides a homogeneous, generic, and simple representation of object and file streams in both Java and Python; enabling complex workflows to handle any data type without dealing directly with the streaming back-end. Keywords: Distributed Computing, High-Performance Computing, Data Science pipe- lines, Task-based Workflows, Dataflows, Containers, COMPSs, PyCOMPSs, AutoParallel, Docker, Pluto, Kafka xiii UNIVERSITAT POLITÈCNICA DE CATALUNYA (UPC) BARCELONATECH Computer Architecture Department (DAC) Resumen Programming models to support Data Science workflows por Cristián RAMÓN-CORTÉS VILARRODONA Los flujos de trabajo de Data Science se han convertido en una necesidad para progresar en muchas áreas científicas como las ciencias de la vida, la salud y la tierra. A diferencia de los flujos de trabajo tradicionales para la CAP, los flujos de Data Science son más heterogé- neos; combinando la ejecución de binarios, simulaciones MPI, aplicaciones multiproceso, análisis personalizados (posiblemente escritos en Java, Python, C/C++ o R) y computa- ciones en tiempo real. Mientras que en el pasado los expertos de cada campo eran capaces de programar y ejecutar pequeñas simulaciones, hoy en día, estas simulaciones representan un desafío incluso para los expertos ya que requieren cientos o miles de núcleos. Por esta razón, los lenguajes y modelos de programación actuales se esfuerzan considerablemente en incrementar la programabilidad manteniendo un rendimiento aceptable. Esta tesis contribuye a la adaptación de modelos de programación para la CAP para afrontar las necesidades y desafíos de los flujos de Data Science extendiendo COMPSs, un modelo de programación distribuida maduro, de propósito general, y basado en tareas. En primer lugar, mejoramos nuestro prototipo para orquestar diferentes software para que los usuarios no expertos puedan crear flujos complejos usando un único modelo donde algunos pasos requieran tecnologías altamente optimizadas. Esta extensión incluye las anotaciones de @binary, @OmpSs, @MPI, @COMPSs, y @MultiNode para flujos en Java y Python. En segundo lugar, integramos tecnologías de contenedores para permitir a los desarrol- ladores portar, distribuir y escalar fácilmente sus aplicaciones en plataformas distribuidas. Además de una metodología sencilla para paralelizar aplicaciones a partir de códigos se- cuenciales, esta combinación proporciona una gestión de imágenes y