Programa de Doctorado de Inteligencia Artificial
Escuela Técnica Superior de Ingenieros Informáticos

PhD Thesis

Týr: Storage-Based HPC and Big Data Convergence Using Transactional Blobs

Author: Pierre Matri
Supervisors: Prof. Dra. María S. Pérez Hernández and Prof. Dr. Gabriel Antoniu

June, 2018

Tribunal nombrado por el Sr. Rector Magfco. de la Universidad Politécnica de Madrid, el día 11 de Junio de 2018.

Presidente: Dr. Antonio Pérez Ambite

Vocal: Dr. Jesús Carretero Pérez

Vocal: Dr. Antonio Cortés Roselló

Vocal: Dr. Christian Pérez

Secretario: Dr. Alberto Sánchez Campos

Suplente: Dr. Antonio García Dopico

Suplente: Dr. Alejandro Calderón Mateos

Realizado el acto de defensa y lectura de la Tesis el día 11 de Junio de 2018 en la Escuela Técnica Superior de Ingenieros Informáticos.

Calificación:

EL PRESIDENTE VOCAL 1 VOCAL 2

VOCAL 3 EL SECRETARIO

To Dimitris.

To my family, who make it all worthwhile.

Agradecimientos

My very first comments are directed to my three tremendous advisors, María, Alexandru and Gabriel. You have set an example of excellence as researchers, mentors, instructors, and tactful moderators of my wildest ideas. Needless to say, this thesis would not have been possible without your continuous support, guidance and wisdom. I could not thank my fellow Ph.D. students from the BigStorage project enough. I cannot express how grateful I am for you having been awesome colleagues, and above all awesome friends. Thank you all. Álvaro, Danilo, Dimitris, Eugene, Georgios, Fotis(es), Linh, Michal, Nafiseh, Ovidiu, Thanos, Umar, Yacine, you all helped in your own way to make these three years an unforgettable experience, hence significantly contributing to this thesis. Many thanks to the members of the Ontology Engineering Group, who were here to welcome me, support me, bring tasty cakes, and force me to learn Spanish all along the way. For all this, thank you Ahmad, Daniel, Freddy, Idafen, Julia, María, Miguel Ángel, Nelson, Pablo, Raúl, and everybody I don't have the space to name. A special thanks goes to José Ángel and Ana for helping me through the endless administrative labyrinth. Similar thanks go to the tremendous KerData team, with whom I first started research in Rennes, and with whom I finish writing this manuscript. Luis, Hadi, Nathanaël, Matthieu, Radu, Luc: you definitely contributed to this thesis, each in your own way, around lunches, coffee and team meetings. I hereby devise my beloved coffee machine to the team. Thanks to Aurélie for her administrative support that prevented precocious hair loss. I also want to thank Rob Ross and Phil Carns for warmly welcoming me at Argonne National Laboratory during three summer months. Besides being a fantastic research environment, I could also share precious time with awesome people I simply cannot forget to acknowledge here.

Let me switch to French for one paragraph. Je tiens bien évidemment à remercier ma famille, pour leur soutien inconditionnel qui m'a toujours permis de me consacrer à la réalisation de mes rêves. Parce que même au loin vous êtes toujours là, vous me supportez depuis toujours, et vous avez toujours cru en moi. Merci maman, papa. Merci Céline, François et Caroline. Merci Valéry, Marie et Arthur. Merci Sarah. Merci Gabriel.

Figure 1: Empirically-determined productivity (words/min) relative to the available coffee or tea supply around the office, with 95% confidence intervals.

I need to dedicate a paragraph to all those who, willingly or unwillingly, contributed to this thesis. In no particular order, thanks to the reviewers of VLDB, CCGrid and Cluster who, by rejecting my contributions, pushed me to seek better. Thanks to Iberia, whose incalculable delays developed my ninja writing skills in airports around the world. Thanks to the worldwide coffee supply chain, whose hard work and effective caffeine supported my productivity all along the way, as plotted in Figure 1.

I want to finish with a very special mention for Dimitris, who left us way too soon, way too young, and to whom I dedicate this thesis. Dimitris, I could not thank you enough for being one of the most inspiring and cheerful souls I have ever known. I could not thank you enough for your courage and inspiration. I could not thank you enough for the laughter and the beers. I could not thank you enough for being you. I simply could not thank you enough. May you rest in peace, with all my love, my wishes and my thoughts.

Abstract

The ever-growing data sets processed on HPC platforms raise major challenges for the underlying storage layer. A promising alternative to traditional file-based storage systems are simpler blobs (binary large objects). They offer lower overhead and better performance at the cost of dropping largely unused features such as file hierarchies or permissions. In a similar fashion, blobs are increasingly considered for replacing distributed file systems for Big Data Analytics (BDA) or as a base for storage abstractions such as key-value stores or time-series databases. From these observations we advocate that blobs provide a solid storage model for convergence between HPC and BDA platforms. We identify data consistency as a hard problem to solve in this context because of the different choices made by both communities: while BDA developers typically rely on the storage system to provide data access coordination, the lack of such semantics on HPC platforms requires developers to use application-level tools for this task. In this thesis we propose the key design principles of Týr, a converging storage system designed to answer the needs of both HPC and BDA applications, natively offering data access coordination in the form of transactions. We demonstrate the relevance and efficiency of its design in the light of convergence in multiple applicative contexts from both communities. These experiments validate that Týr delivers its promise of high throughput and versatility, hence fueling storage-based convergence between HPC and BDA.

Resumen

La creciente cantidad de datos procesados en plataformas HPC supone un reto para el sistema de almacenamiento subyacente. Una alternativa prometedora a los sistemas de almacenamiento basados en ficheros es el uso de BLOBs (Binary Large OBjects). Esta alternativa ofrece menor sobrecarga y mejor rendimiento a cambio de eliminar características típicas de los sistemas de ficheros, como la jerarquía en forma de directorios o los permisos. De manera análoga, los blobs pueden utilizarse para reemplazar sistemas de ficheros en el área de Big Data Analytics (BDA) o como base para otras abstracciones de almacenamiento, tales como bases de datos clave-valor o de series de tiempo. A partir de estas observaciones, podemos concluir que los blobs son un modelo de almacenamiento sólido para lograr la convergencia entre plataformas HPC y BDA. En este contexto, uno de los problemas críticos que hay que resolver es la consistencia de los datos, debido a las diferentes elecciones de cada una de las comunidades: mientras que los desarrolladores de BDA delegan habitualmente la responsabilidad de coordinar el acceso a los datos al sistema de almacenamiento, la falta de dicha capacidad en las plataformas HPC requiere que los desarrolladores tengan que utilizar herramientas a nivel de aplicación para realizar esta tarea. Esta tesis propone los principios de diseño principales de Týr, un sistema de almacenamiento convergente diseñado para responder a las necesidades de aplicaciones HPC y BDA, ofreciendo de forma nativa la coordinación en el acceso a los datos en forma de transacciones. La tesis demuestra la relevancia y eficiencia de este diseño aplicado a múltiples escenarios de ambos campos. Los experimentos implementados muestran las características de rendimiento y versatilidad ofrecidas por Týr, lo que supone un importante impulso para lograr la deseada convergencia entre HPC y BDA.

Contents

1 Introduction 1
1.1 Objectives ...... 3
1.2 Contributions ...... 4
1.3 Publications ...... 5
1.4 Organization of the manuscript ...... 7

I Context: distributed storage, consistency for HPC and Big Data 9

2 HPC and BDA: Divergent stacks, convergent storage needs 11
2.1 Comparative overview of HPC and BDA stacks ...... 12
2.2 HPC: Centralized, file-based storage ...... 14
2.3 BDA: Modular, application-purpose storage ...... 15
2.4 Conclusions: challenges of storage convergence between HPC and BDA ...... 16

3 Distributed storage paradigms 17
3.1 Distributed file systems ...... 18
3.2 Key-value stores ...... 22
3.3 Document, columnar, graph databases ...... 24
3.4 Blob storage ...... 27
3.5 Conclusions: which storage paradigm for converging storage? ...... 30

4 Consistency management 31
4.1 Storage consistency models ...... 32
4.2 Application-specific consistency management ...... 34
4.3 Transactional semantics ...... 36

4.4 CAP: Consistency, Availability, Partition tolerance ...... 40
4.5 Conclusions: which consistency model for converging storage? ...... 41

II Týr: a transactional, scalable system 43

5 Blobs as storage model for convergence 45
5.1 General overview, intuition and methodology ...... 45
5.2 Storage call distribution for HPC and BDA applications ...... 49
5.3 Replacing file-based by blob-based storage ...... 52
5.4 Conclusion: blob is the right storage paradigm for convergence ...... 58

6 Týr design principles and architecture 61
6.1 Key design principles ...... 62
6.2 Predictable data distribution ...... 63
6.3 Transparent multi-version concurrency control ...... 64
6.4 ACID transactional semantics ...... 66
6.5 Atomic transform operations ...... 67

7 Týr protocols 71
7.1 Lightweight transaction protocol ...... 72
7.2 Handling reads: direct, multi-chunk and transactional protocols ...... 74
7.3 Handling writes: transactional protocol, atomic transforms ...... 78
7.4 Bookkeeping: purging expired versions ...... 80

8 Implementation details 83
8.1 Multiple, configurable keyspaces ...... 84
8.2 Persistent data storage ...... 85
8.3 Fault-tolerance ...... 85
8.4 Event-driven server design ...... 88

III Evaluation: Týr in support of HPC and BDA convergence 91

9 Real-time, transactional data aggregation in support of system monitoring for BDA 93
9.1 ALICE and MonALISA ...... 94

9.2 Experimental setup ...... 95
9.3 Transactional write performance ...... 96
9.4 Read performance ...... 99
9.5 Reader / writer isolation ...... 100
9.6 Horizontal scalability ...... 101
9.7 Conclusions ...... 101

10 Large-scale logging in support of computational steering for HPC 103
10.1 Distributed logging for computational steering ...... 104
10.2 Experimental setup ...... 106
10.3 Baseline performance ...... 109
10.4 Further increasing append performance by increasing parallelism ...... 110
10.5 Scaling distributed logging up to 120,000 cores ...... 112
10.6 Conclusions ...... 113

11 Týr for HPC and BDA convergence 115
11.1 Týr as a storage backend for HPC applications ...... 115
11.2 Týr as a storage backend for BDA applications ...... 119
11.3 Conclusions ...... 120

12 TýrFS: Building a file system abstraction atop Týr 123
12.1 Motivation: the overhead of reading small files ...... 124
12.2 TýrFS design ...... 126
12.3 TýrFS implementation ...... 130
12.4 Experimental setup ...... 134
12.5 Performance in support of HPC applications ...... 135
12.6 Performance in support of BDA applications ...... 137
12.7 Conclusions ...... 140

IV Conclusions: Achievements and Perspectives 143

13 Conclusions 145
13.1 Achievements ...... 146
13.2 Perspectives ...... 148

Bibliography 151

List of Figures

1 Productivity estimation relative to the available coffee supply around the office...... viii

2.1 Typical HPC and BDA software stacks...... 12

5.1 Measured relative amount of different storage calls to the persistent file system for HPC applications ...... 50
5.2 Measured relative amount of different storage calls to the persistent file system for BDA applications ...... 51
5.3 Average aggregate throughput across all HPC applications varying the compute-to-storage ratio, with 95% confidence intervals...... 54
5.4 Comparison of read throughput for each HPC application with Lustre, BlobSeer and Rados, with 95% confidence intervals...... 54
5.5 Comparison of write throughput for each HPC application with Lustre, BlobSeer, and Rados, with 95% confidence...... 55
5.6 Average performance improvement relative to Lustre for HPC applications using blob-based storage, with 95% confidence...... 55
5.7 Comparison of read throughput for each Big Data application with HDFS, BlobSeer, Rados and CephFS, with 95% confidence intervals...... 58
5.8 Comparison of write throughput for each Big Data application with HDFS, BlobSeer, Rados and CephFS, with 95% confidence intervals...... 58
5.9 Average performance improvement relative to HDFS for Big Data applications using blob-based storage, with 95% confidence intervals...... 59

6.1 Týr versioning model ...... 65
6.2 Týr version management example ...... 65

7.1 Warp lightweight transaction protocol overview. In this example, the directionality of the edge (t1, t2) will be decided by s4, last common server in the transaction chains, during the backwards pass. Similarly, the directionality of (t2, t3) will be decided by s4...... 73
7.2 Týr read protocol for a read spanning multiple chunks. The client sends a read query for chunks c1 and c2 to the version manager, which relays the query to the servers holding the chunks with the correct version information...... 76
7.3 Týr read protocol inside a transaction. The client sends a read query for chunks c1 and c2 to the version manager, which relays the query to the servers holding the chunks with the correct version information, and responds to the client with a snapshot of the latest chunk versions. Subsequent read on c3 is addressed directly to the server holding the chunk data using the aforementioned chunk version information...... 77

8.1 High-level internal architecture of Týr server ...... 84

9.1 Simplified MonALISA data storage layout, showing five generators on two different clusters, and three levels of aggregation. Solid arrows indicate events written, dotted arrows represent event aggregation. Each rectangle indicates a different blob. Dotted rectangles denote aggregate blobs...... 95
9.2 Baseline performance of Týr, Rados, BlobSeer and Azure Storage, varying the number of clients, with 95% confidence intervals...... 97
9.3 Synchronized write performance of Týr, Rados, BlobSeer and Azure Storage, varying the number of clients, with 95% confidence intervals...... 98
9.4 Read performance of Týr, Rados, BlobSeer and Azure Storage, varying the number of clients, with 95% confidence intervals...... 99
9.5 Read throughput of Týr, Rados, BlobSeer and Azure Storage for workloads with varying read to write ratio. Each bar represents the average read throughput of 200 concurrent clients averaged over one minute...... 100
9.6 Týr horizontal scalability. Each point shows the average throughput of the cluster over a one-minute window with a 65% read / 35% write workload, and 95% confidence intervals...... 101

10.1 Illustration of telemetry event collection. Simulation ranks generate events to a central log (a). Such events are used directly for online tasks such as visualization or computational steering (b), or are stored for further offline processing (c). Computational steering in turn controls the simulation (d)...... 104
10.2 Append performance of Lustre and GPFS on Theta to a single file, varying the number of clients. The performance axis uses a logarithmic scale...... 108
10.3 Append log architecture using write proxies for parallel file systems...... 109
10.4 Baseline append performance, using up to 8,192 append client cores, with 1 KB appends...... 110
10.5 Illustration of the multiple independent files method, where each output log is independently mapped to a group of nodes...... 111
10.6 Baseline append performance, using up to 8,192 append client cores, with 1 KB appends...... 112
10.7 Aggregate append performance of Týr at large scale using up to 100,000 writer and 20,000 storage cores...... 113

11.1 Comparison of read throughput for each HPC application with Lustre, BlobSeer and Rados, with 95% confidence intervals...... 116
11.2 Comparison of write throughput for each HPC application with Lustre, BlobSeer, and Rados, with 95% confidence...... 117
11.3 Average performance improvement relative to Lustre for HPC applications using blob-based storage, with 95% confidence...... 117
11.4 Average performance improvement relative to Lustre for HPC applications using blob storage, with 95% confidence on Theta...... 118
11.5 Average performance improvement at scale relative to a 32-node setup for HPC applications using blob-based storage on Theta...... 119
11.6 Comparison of read throughput for each Big Data application with HDFS, BlobSeer, Týr, Rados and CephFS, with 95% confidence intervals...... 119
11.7 Comparison of write throughput for each Big Data application with HDFS, BlobSeer, Týr, Rados and CephFS, with 95% confidence intervals...... 120
11.8 Average performance improvement relative to HDFS for Big Data applications using blob-based storage, with 95% confidence intervals...... 121

12.1 Lustre and OrangeFS read throughput for a fixed-size data set when varying the size of the individual files it is composed of...... 125
12.2 Breakdown of reads between metadata (M) and data (D) access for a fixed-size data set varying the size of the individual files it is composed of...... 125
12.3 Typical read protocol for a small file with centralized metadata management (e.g. Lustre or HDFS)...... 126
12.4 Typical read protocol for a small file with decentralized metadata management on three storage nodes s1, s2 and s3 (e.g. OrangeFS)...... 127
12.5 High-level overview of the proposed TýrFS architecture. Each node integrates both a data and a metadata service. The metadata service itself is composed of a local metadata store and the dynamic metadata service...... 130
12.6 TýrFS, Lustre and OrangeFS read throughput for a fixed-size data set, varying the size of the individual files it is composed of...... 136
12.7 Breakdown of reads between metadata (M) and data (D) access for a fixed-size data set varying the size of the individual files it is composed of...... 136
12.8 TýrFS horizontal scalability, varying the number of clients and the number of cluster nodes with 32 KB input files...... 137
12.9 TýrFS, Ceph and HDFS read throughput for a fixed-size data set, varying the size of the individual files it is composed of...... 138
12.10 TýrFS, Ceph and HDFS write throughput with Sort (balanced I/O) and Tokenizer (write-intensive) applications...... 139
12.11 Application completion time difference with 32 MB input files relative to that obtained with 32 KB input files, with TýrFS, Ceph and HDFS. 100% represents the baseline performance for 32 MB files...... 140

List of Tables

5.1 HPC and BDA application summary ...... 49
5.2 Spark operation breakdown...... 51
5.3 HPC storage call translation rules on a flat namespace ...... 53
5.4 BDA storage call translation rules on a flat namespace ...... 57

List of Algorithms

1 High-level algorithm of the metadata resolver...... 133

Chapter 1

Introduction

Among the major trends in computer science are the growth of High-Performance Computing (HPC) and the Big Data phenomenon. Both lead to a dramatic increase in the size, sophistication and complexity of HPC and Cloud platforms. One of the challenges associated with this movement is data storage and management, which is widely recognized as critical to the overall performance of a broad class of applications. This challenge is increasingly relevant as the data size and throughput requirements of large-scale applications grow. Both the HPC and Big Data Analytics (BDA) communities have paved the way to significant increases in storage performance by optimizing existing systems or inventing new data models able to cope with this new demand. Yet, the astonishing growth rate of current extreme-scale platforms and the predicted storage requirements of tomorrow's systems make the quest for increased performance more critical than ever. Researchers in the fields of distributed computing and data storage are spending considerable resources to tackle these issues. Interestingly, while the challenges are similar between cloud computing and extreme computing, both communities have diverged significantly in terms of proposed solutions and research orientation. As a result, HPC and BDA stacks remain completely separated today. Indeed, cloud computing platforms exhibit desirable features such as scalability, elasticity and usability, which represent for their users a cost-effective platform for large-scale data warehousing and analytics. The gap between both platforms tends to narrow thanks to the addition of HPC-oriented features such as GPU computing or high-speed interconnects in new cloud computing platforms (118). However, running HPC applications on this type of platform remains difficult due to the high overhead of cloud techniques such as virtualization, or because of the lower performance and higher latency of the interconnection between compute nodes (18). Although these challenges are the focus of an abundant scientific literature, the divergences remain significant.

In contrast, while little studied in practice, the research orientations for the data management layer of HPC and BDA are similar: trading versatility for performance. As such, we advocate that storage offers a high potential for convergence despite its current divergences. Indeed, on one side, extreme computing relies on optimizing existing file systems such as Lustre (142), GPFS (140) or PVFS (137). One of the main rationales behind this choice is the required support for legacy applications, some of which are close to three decades old (62). On the other side, the fast-evolving stack adopted by the cloud community embraces and combines small, workload-specific storage systems that achieve high performance by dropping a significant portion of distributed file system semantics. Yet, we have witnessed in recent years a trend to apply strikingly similar techniques in support of these divergent orientations. In particular, object storage has proven to be a strong storage model both for extreme and cloud computing. In the former, it serves as support for Lustre (32) or DeltaFS (175), while in the latter it is the base building block for the Ceph (167) file system, or is exposed directly to the user in the form of key-value or document stores (45; 55; 92; 148). Despite this similarity in the approaches leveraged, several critical divergences remain between the two communities. Among the most important ones are the choices made with respect to storage consistency. This can include synchronizing storage operations or ensuring that the storage system remains in a consistent state in case of failures, and is critical for a range of applications that include real-time indexing or data aggregation systems. While the HPC community tends to control such consistency at the application level, the Big Data community typically delegates this task to the storage system through the use of storage transactions (70). Transactions essentially group multiple storage operations into a single task that is applied consistently and atomically to the storage system. Yet, no blob storage system today offers transaction support, which is difficult to provide due to the challenges associated with the potentially large size of the objects.
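To make the idea of storage-level transactions concrete, the short C sketch below shows how an application-facing interface might group several blob writes so that they become visible atomically. This is purely illustrative: the txn_begin / txn_write / txn_commit names and the single-process, in-memory "blob" are assumptions made for the example, and do not correspond to Týr's actual API, which is introduced in Part II.

/* Hypothetical illustration of transactional blob writes (not Tyr's API).
 * A transaction buffers several writes and applies them all at once on
 * commit, so readers never observe a partially applied update. */
#include <stdio.h>
#include <string.h>

#define BLOB_SIZE 64
#define MAX_OPS   8

static char blob[BLOB_SIZE];            /* a single in-memory "blob" */

struct op  { size_t off; char data[16]; size_t len; };
struct txn { struct op ops[MAX_OPS]; int count; };

static void txn_begin(struct txn *t) { t->count = 0; }

static void txn_write(struct txn *t, size_t off, const char *d, size_t len) {
    /* buffer the write; nothing is visible to readers yet */
    struct op *o = &t->ops[t->count++];
    o->off = off;
    o->len = len;
    memcpy(o->data, d, len);
}

static void txn_commit(struct txn *t) {
    /* apply all buffered writes as a single step; in a real distributed
     * setting the storage system would enforce this atomicity */
    for (int i = 0; i < t->count; i++)
        memcpy(blob + t->ops[i].off, t->ops[i].data, t->ops[i].len);
}

int main(void) {
    struct txn t;
    txn_begin(&t);
    txn_write(&t, 0, "event:42;", 9);    /* raw event record   */
    txn_write(&t, 32, "count=1;", 8);    /* aggregated counter */
    txn_commit(&t);                      /* both become visible together */
    printf("%.9s ... %.8s\n", blob, blob + 32);
    return 0;
}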

1.1 Objectives

In this thesis, we hypothesize that despite important divergences, storage-based convergence between HPC and Big Data is not only possible, but also leads to substantial performance improvements over the state of the art. Considering the common trend towards object storage in both the HPC and BDA communities, we assert that such convergence can be achieved through a variety of use-cases while retaining support for legacy applications. In order to prove this hypothesis, this thesis fulfills a series of sub-objectives summarized as follows:

1. Investigate a series of existing data storage paradigms for distributed computing and understand their limitations with respect to HPC and Big Data convergence,

2. Detail the main approaches to data consistency as leveraged in both communities, understand their limitations and respective applicability for convergence,

3. Determine through empirical evaluation whether such convergence is possible by studying the storage I/O patterns performed by a series of I/O-intensive applications from both the HPC and Big Data communities,

4. Formulate a series of design principles that lay the foundation of an efficient data storage system aimed at HPC and Big Data convergence,

5. Formally describe the protocols and algorithms that can be leveraged by a storage system based on these design principles,

6. Provide a prototype implementation of a storage system based on the aforementioned design principles, protocols and algorithms,

7. Evaluate this prototype in multiple contexts that demonstrate its relevance for both HPC and Big Data platforms with real-life use-cases and applications, while retaining compatibility with legacy applications.

1.2 Contributions

The main contributions of this thesis can be summarized as follows:

An empirical proof that object-based storage is suitable for achieving storage-based convergence between HPC and BDA. We leverage a series of real-world applications from the HPC community as well as various industry-standard benchmarks from the Big Data community and analyze the distribution of their storage calls. We conclude that the vast majority of the storage calls performed by these applications are simple file reads and writes that are easily covered by object storage systems. We also demonstrate that the rest of the calls can be mapped atop object storage. In this case, we prove that the performance decrease caused by such mapping is largely compensated by the performance gains permitted by the use of a storage system highly optimized for simple object reads and writes. This work was carried out in collaboration with Yevhen Alforov, Álvaro Brandon, Michael Kuhn, Philip Carns, Thomas Ludwig, Alexandru Costan, María S. Pérez and Gabriel Antoniu. It was published in (110).

Týr: a transactional storage system aimed at storage-based convergence between HPC and Big Data. We propose a series of design principles for building a converging storage system based on the observations mentioned in the previous paragraph. In particular, we defend that this system should provide transactions as a way to enforce strong consistency at the storage level in order to offer support for a variety of use-cases such as real-time data aggregation. We describe the algorithms and protocols that enable it to guarantee such consistency at high speed. Based on this design we introduce Týr, a transactional storage system built with convergence in mind. We present the techniques we used to convert these contributions into a practical system that we evaluate extensively using the applications mentioned in the previous paragraph. This work was carried out in collaboration with Alexandru Costan, Gabriel Antoniu, Jesús Montes and María S. Pérez. It was published in (113; 114).

A Týr-based distributed logging architecture for HPC platforms. We leverage Týr's transactional capabilities at high velocity to offer fast and scalable distributed logging for HPC platforms. We prove on up to 120,000 cores of a leadership-class supercomputer that this architecture enables computational steering services to scale

well beyond what was possible using file-based storage. We additionally demonstrate how transactions could be used in this context to provide efficient real-time monitoring. This work was carried out in collaboration with Philip Carns and Robert Ross at Argonne National Laboratory, Chicago, Illinois, USA, as well as María S. Pérez, Alexandru Costan and Gabriel Antoniu. It was published in (111).

A Týr-based, large-scale system monitoring and data aggregation service for BDA. We build a storage backend for the MonALISA monitoring service (101; 102) of the CERN LHC ALICE experiment (14). This backend enables MonALISA to scale beyond what was possible with the previous relational data storage. It leverages Týr's built-in transactions to perform data aggregation in near real-time at high speed. This work was carried out in collaboration with Alexandru Costan, Gabriel Antoniu, Jesús Montes and María S. Pérez. It was published in (113; 114).

A POSIX-compliant, transactional, distributed file system implementation built atop Týr. We detail the design and implementation of a complete, POSIX-compliant distributed file system built as a thin layer atop Týr. By doing so, we prove that Týr effectively retains compatibility with legacy applications that expect such file-based storage to be provided. This file system leverages dynamic metadata replication to provide fast access to files, which we show significantly benefits applications processing large numbers of small files. We demonstrate that this file system performs well and shows near-linear scalability properties on both HPC and Big Data platforms. This work was carried out in collaboration with María S. Pérez, Alexandru Costan, Luc Bougé and Gabriel Antoniu. TýrFS was published in (115); its underlying metadata management principles were published in (112; 116).

1.3 Publications

The work presented in this manuscript was published in several peer-reviewed journals, conferences and research reports listed below:

Journal articles

• Matri, P., Pérez, M. S., Costan, A., Bougé, L., and Antoniu, G. Keeping up with storage: Decentralized, write-enabled dynamic geo-replication. Future Generation Computer Systems (2017) — JCR Q1 (116)

Conferences

• Matri, P., Costan, A., Antoniu, G., Montes, J., and Pérez, M. S. Towards efficient location and placement of dynamic replicas for geo-distributed data stores. In Proceedings of the ACM 7th Workshop on Scientific Cloud Computing (2016), ACM, pp. 3–9 (112)

• Matri, P., Costan, A., Antoniu, G., Montes, J., and Pérez, M. S. Týr: blob storage meets built-in transactions. In SC16: International Conference for High Performance Computing, Networking, Storage and Analysis (2016), IEEE, pp. 573–584 — CORE A, Best student paper award finalist (113)

• Matri, P., Alforov, Y., Brandon, A., Kuhn, M., Carns, P., and Ludwig, T. Could blobs fuel storage-based convergence between HPC and Big Data? In IEEE International Conference on Cluster Computing (CLUSTER) (2017), IEEE, pp. 81–86 — CORE A (110)

• Matri, P., Costan, A., Antoniu, G., and Pérez, M. S. TýrFS: Increasing small files access performance with dynamic metadata replication. In IEEE/ACM 18th International Symposium on Cluster, Cloud and Grid Computing (CCGrid) (2018), IEEE — CORE A (115)

• Matri, P., Carns, P., Ross, R., Costan, A., Pérez, M. S., and Antoniu, G. SLoG: A large-scale logging middleware for HPC and Big Data convergence. In IEEE 38th International Conference on Distributed Computing Systems (ICDCS) (2018), IEEE — CORE A (111)

Technical reports

• Matri, P., Costan, A., Antoniu, G., Montes, J., and Pérez, M. S. Týr: Efficient Transactional Storage for Data-Intensive Applications. Technical Report RT-0473, Inria Rennes Bretagne-Atlantique; Universidad Politécnica de Madrid, Jan. 2016 (114)

1.4 Organization of the manuscript

The rest of the manuscript is organized in four parts:

The first part details the context of our work, presenting the state of the art in the related research areas. It consists of three chapters. Chapter 2 provides an overview of the applicative stacks for BDA and HPC. It identifies the key divergences as well as the convergence opportunities, including the storage layer. Chapter 3 zooms in on that layer, describing the diverse storage paradigms that are common on BDA and HPC platforms. We describe their applicability for storage-based convergence and the related challenges in this context. In Chapter 4 we focus on one of these challenges: data consistency and write coordination. We review the different storage consistency models and semantics in the current scientific literature in the light of HPC and BDA convergence. We conclude with the limitations of the existing approaches and the new challenges that need to be tackled for such convergence to meet the ever-growing needs of large-scale data analytics and simulation applications.

The second part motivates the choice of blobs as a storage model to address the challenges highlighted in the first part. Based on this choice, it introduces the core contribution of this thesis: Týr, a transactional blob storage system aimed at supporting HPC and BDA use-cases at large scale. It consists of four chapters. Chapter 5 details the main reasons making blobs an ideal storage model candidate for HPC and BDA convergence. We leverage a series of preliminary experiments that prove that the semantics and API provided by blob storage systems cover the needs of a series of representative applications from both communities. Using two state-of-the-art blob storage systems from the scientific literature, we show the performance benefits compared to file-based storage. Chapter 6 builds from these observations a set of key design principles that drive the design of Týr. These principles answer the needs of legacy applications while considering the additional challenges of convergence identified in the first part. Chapter 7 details the protocols implementing these principles. In particular, we present the read protocols able to guarantee read consistency for large objects at high velocity, the lightweight transaction protocol which forms the base of Týr, as well as the bookkeeping protocol for managing expired data at scale. Finally, Chapter 8 discusses the key engineering aspects we followed when developing a software artifact implementing the aforementioned design principles and protocols.

The third part evaluates the relevance of Týr for HPC and BDA convergence in various applicative contexts. It contains four chapters. First, Chapter 9 details the BDA-oriented storage requirements of a real-life monitoring platform for a large-scale experiment at CERN. In particular, it motivates the pertinence of transactions for real-time data aggregation and experimentally shows how the design of Týr provides an efficient storage solution for this application. Chapter 10 details how the design of Týr is relevant to large-scale data logging in the context of computational steering on HPC platforms. We highlight the limitations of the current solutions proposed in this context, and demonstrate on a leadership-class supercomputer that Týr provides an efficient answer to these challenges at very large scale. Chapter 11 demonstrates, using the same applications as in Chapter 5, that the design of Týr effectively delivers on its promises for existing applications. Finally, we present in Chapter 12 the design of TýrFS, a lightweight file-based interface implemented atop Týr that permits retaining support for legacy applications expecting file system semantics. We show how such an interface can be optimized for the unique challenges of managing large numbers of small files.

The fourth part wraps up our work. Chapter 13 summarizes the aforementioned contributions, discusses the limitations of our work and presents a series of perspectives that are interesting to explore in the context of HPC and BDA convergence.

Part I

Context: distributed storage, consistency for HPC and Big Data


Chapter 2

HPC and BDA: Divergent stacks, convergent storage needs

In the decades since the invention of the computer, HPC has contributed significantly to our quality of life, driving scientific innovation, enhancing engineering design and consumer goods manufacturing, as well as strengthening national and international security. This has been recognized and emphasized by both government and industry, with major ongoing investments in areas encompassing weather forecasting, scientific research and development, as well as drug design and healthcare outcomes. More recently, with the explosion of data available from the development of the internet, industry and other organizations have recognized that data represents a gold mine of competitive possibility. Companies like Uber and Airbnb may be among the best known users of BDA, but at the same time numerous traditional organizations, from advertising to weather forecasting, have started to look at BDA solutions as they try to find new ways to interact with customers and create new business models and opportunities. HPC applications are increasingly facing very large amounts of data as well. The key difference to be seen in that context between BDA and HPC applications is the type of data both systems are dealing with. While the focus of HPC is mainly on structured data, BDA often needs to handle new types of data originating from a wide variety of sources, with very diverse structures if any. Hence the challenge associated with processing such data is mainly a software challenge. Furthermore, most tasks, whether they are deemed HPC or BDA, are part of a wider workflow. For example, to get value from HPC, it is often not enough to run a model. One has to collect the data, transform it into a form the model can accept, run the model, interpret the results, and finally present them in a form that people can use.

In this chapter we compare the typical HPC and BDA stacks, and identify storage as a strong candidate for HPC and BDA convergence. Indeed, the challenges faced and the directions taken on both sides are strikingly similar: trading versatility for performance. As such, we advocate that the storage layer offers strong opportunities for convergence.

Figure 2.1: Typical HPC and BDA software stacks.

2.1 Comparative overview of HPC and BDA stacks

Significant differences exist between HPC and BDA platforms. While HPC mostly focuses on large computational loads, BDA targets applications that need to handle very large and complex data sets; these data sets are typically of the order of multiple petabytes or exabytes in size. BDA applications are thus very demanding in terms of storage, to accommodate such a massive amount of data, while HPC is usually thought of more in terms of sheer computational needs. In Figure 2.1 we highlight the stacks that are typical for HPC and BDA applications. These key differences between both paradigms result in very different sets of requirements. A traditional cloud-based BDA platform offers features that are attractive to the general public. These services comprise single, loosely coupled instances (an instance of an OS running in a virtual environment) and storage systems backed by service level agreements (SLAs) that provide the end user guaranteed levels of service. These clouds are generally designed to offer the following features:

Instant availability: Cloud offers almost instant availability of resources.
Large capacity: Users can instantly scale the number of applications within the cloud.
Software choice: Users can design instances to suit their needs from the OS up.
Virtualized: Instances can be easily moved to and from similar clouds.
Service-level performance: Users are guaranteed a minimal level of performance.

Although these features serve much of the market, HPC users generally have a different set of requirements:

Close to the hardware: HPC libraries and applications are often designed to work closely with the hardware, requiring specific OS drivers and hardware support.
Tuned hardware: HPC hardware is often selected on the basis of communication, memory, and processor speed for a given application set.
Tuned storage: Storage is often designed for a specific application set and user base.
Userspace communication: HPC user applications often need to bypass the OS kernel and communicate directly with remote user processes.
Batch scheduling: All HPC systems use a batch scheduler to share limited resources.

These different sets of requirements have a significant impact on the data storage layer for HPC and BDA platforms, which diverges widely in terms of both architecture and design principles. Many reasons explain the storage divergence between HPC and BDA. These include the different software development models, virtualization, scheduling, resource allocation, and stateful networks vs. stateful services. Another key difference lies in the specifics of the data, which has a significant impact on the underlying software and hardware infrastructure, or stack. A stack can be seen as a set of different components, including the operating system, execution framework, provisioning, remote console or power management, cluster monitoring, parallel file system and scheduling, and the development and performance monitoring tools enabling end users to interact with the cluster, as well as the underlying hardware parts such as CPU, memory and networking.

In the following sections we zoom in on the key differences between HPC and BDA storage stacks.

2.2 HPC: Centralized, file-based storage

HPC applications tend to be used for computational analysis, data-intensive research, rich media, three-dimensional computer modelling, seismic processing, data mining and large-scale simulation. Driven by CPU-intensive processing, such applications handle large volumes of data over short periods of time, while in some cases also permitting simultaneous access from multiple servers. The need to process large volumes of data quickly has huge repercussions for HPC storage, given that storage I/O capabilities are typically much lower than those of processors. An HPC storage system needs large capacity accessible at high speed, needs to be highly expandable, and must offer a single global namespace accessible to all users involved in the project. State-of-the-art storage systems for high-performance computing (HPC) mainly consist of parallel file systems deployed on remote storage servers. That is, the compute and storage resources are segregated, and interconnected through a shared high-performance network. This architecture is explained by two main factors. First, many legacy scientific applications are compute-intensive, and barely touch the persistent storage except for initial input, occasional checkpointing, and final output. Therefore the shared network between compute and storage nodes does not become a performance bottleneck or single point of failure. Second, a parallel file system proves to be highly effective for the concurrent I/O workload commonly seen in scientific computing. In essence, a parallel file system splits a (large) file into smaller subsets so that multiple clients can access the file in parallel. As large-scale applications become more data-intensive, moving very large amounts of data in and out of the remote file system is becoming a significant performance bottleneck, greatly hindering time to results. Hence researchers spend significant efforts on improving the I/O throughput of the aforementioned architecture by means of various techniques, such as burst buffers (105).
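As an illustration of this striping principle, the C sketch below maps a byte offset to the storage server and local stripe that would hold it under a simple round-robin layout. The 1 MiB stripe size and the four servers are assumptions made for the example, not parameters of any specific file system.

/* Sketch of round-robin file striping, in the spirit of a parallel file
 * system's default layout (illustrative parameters, not a real client). */
#include <stdio.h>
#include <stdint.h>

#define STRIPE_SIZE (1 << 20)   /* 1 MiB stripes (assumed)                 */
#define NUM_SERVERS 4           /* storage servers the file is striped on  */

/* Which server, and which stripe on that server, hold a given byte offset. */
static void locate(uint64_t offset, int *server, uint64_t *stripe_on_server) {
    uint64_t stripe_index = offset / STRIPE_SIZE;       /* global stripe number */
    *server           = (int)(stripe_index % NUM_SERVERS);
    *stripe_on_server = stripe_index / NUM_SERVERS;
}

int main(void) {
    uint64_t offsets[] = { 0, 3 * STRIPE_SIZE + 12345, 9 * STRIPE_SIZE };
    for (int i = 0; i < 3; i++) {
        int s;
        uint64_t k;
        locate(offsets[i], &s, &k);
        printf("offset %llu -> server %d, local stripe %llu\n",
               (unsigned long long)offsets[i], s, (unsigned long long)k);
    }
    return 0;
}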

2.3 BDA: Modular, application-purpose storage

On the other hand, BDA architectures have been built from the ground up to be able to grind through enormous amounts of data with the highest possible performance. The data motion issue that arises in HPC systems is solved by bringing the computation to the storage nodes rather than the opposite. This principle has driven the design of the underlying storage stack. While multiple BDA frameworks exist, Hadoop (145) is arguably the one that most impacted the architecture of BDA platforms today. Hadoop is a software solution that was developed to solve the challenge of performing rapid analysis of vast, often disparate data sets. The results of these analytics, especially when produced quickly, can significantly improve an organization's ability to extract knowledge from the aforementioned data. Hadoop embraces bringing the computation to the data. Essentially, Hadoop was originally designed to leverage multiple compute nodes. Each of these nodes has its direct attached storage (DAS), which has to be large enough to store the data of any assigned analytics job. One of the nodes in a Hadoop cluster, called the master node, is in charge of coordinating all these nodes. It also maintains a large amount of metadata so that it understands which data is where on the cluster and which node is the most available. The master node assigns the analysis of an analytics request to a node within its cluster that has the appropriate data on its internal storage. When the job is complete, the result is sent back to the master node and given to the user. The concept behind this architecture is that it can be assembled inexpensively, with standard off-the-shelf hard disk and flash drives and no expensive networking requirements. Hadoop abstracts the storage layer access by proposing an extendable set of data access strategies that essentially separate the computation from the storage. This brings various benefits such as transparent application portability across platforms and independence from the underlying storage system. File-based storage is arguably the de facto standard in the industry. The Hadoop Distributed File System (HDFS) (145), for example, is an integral part of the Hadoop stack. However, most data warehousing solutions today also tend to offer connectors enabling users to analyze vast amounts of data originating from a wide range of storage systems and paradigms. These systems include key-value and columnar stores (Riak (86), Cassandra (92), Aerospike (148)), object stores (Amazon S3 (20), Ceph (167)) or time series databases (OpenTSDB (16)).

2.4 Conclusions: challenges of storage convergence between HPC and BDA

While designing a common storage solution for HPC and BDA could improve the portability of applications between platforms, fitting the requirements of both communities is hard (134). Indeed, the broad variety of storage systems available for HPC and BDA exhibit different characteristics or deployment models targeted at a specific range of applications. Defining a storage solution able to combine the requirements of both HPC and BDA applications is challenging. For example, it is unclear which storage paradigm is a natural fit for handling data in a broad range of scientific application codes, nor what extent of code modification would be required to adopt it. In particular, there is a tension between the colocation of data and computation that is embraced by cloud platforms and the contradictory dynamic task scheduling and execution that is core to HPC storage architectures (63). Also, while file-based storage is common to both HPC and BDA stacks, many differences exist in their design principles and implementation. Additionally, BDA embraces a wider variety of storage options such as key-value stores or object stores. Consequently, in order to achieve storage-based convergence between HPC and BDA, it is critical to evaluate which storage paradigm is better suited to answering the precise requirements of both communities and applications.

Chapter 3

Distributed storage paradigms

Distributed storage services are generally classified by the way they are consumed and interfaced with on the client side. Although file storage and key-value systems have been used for decades, the explosion of unstructured data, the proliferation of globally distributed cloud-based architectures, and the need for simpler and lower-cost storage solutions gave momentum to a new type of storage called object storage. Common to all these paradigms are two key notions shared by most persistent storage systems: fault-tolerance and scalability. Let us define these notions:

Fault-tolerance: A fault-tolerant system should be immune to transient or partial failures. Such faults include network and server failures causing data or service unavailability; the system should also preserve data consistency and integrity under concurrent user access.

Scalability: A scalable system is able to harness extra resources added to increase network storage capacity, improving performance and addressing the need for supplementary storage. When these additional resources are provided by upgrading existing hardware, the system is deemed vertically scalable. In contrast, in a horizontally-scalable system, also referred to as a scale-out system, these new resources are provided by connecting additional machines to the resource pool.

In this chapter we discuss the differences between file storage, key-value stores, and object storage in order to understand the benefits and limitations of these approaches in the light of HPC and BDA convergence.

3.1 Distributed file systems

A distributed file system organizes data as files within a hierarchy of directories and subdirectories, using naming conventions based on characteristics such as extensions, categories or applications. It provides permanent storage in which a set of objects that exist from their explicit creation until their explicit destruction are immune to system failures. This permanent storage consists of several federated storage resources in which clients can create, delete, read or write files. Unlike local file systems, storage resources and clients are dispersed in a network. Files are shared between users in a hierarchical and unified view: files are stored on different storage resources, but appear to users as if they were located in a single place. The distributed nature of distributed file systems usually remains transparent to the user, hiding the complexity of the underlying system. Users should be able to access the system regardless of where they log in from, be able to perform the same operations on distributed and local file systems, and should not have to care about faults due to the distributed nature of the file system, thanks to fault-tolerance mechanisms. Transparency can also be seen in terms of performance: operations on a distributed file system are expected to show performance at least similar to that of the same operations executed on a local file system.

3.1.1 A common standard: POSIX-IO

The transparency of distributed file systems is largely due to them implementing part or all of the POSIX I/O standard, which has defined the expected API and behavior of local file systems for decades. At a high level, POSIX I/O is composed of two distinct components that are often conflated: the POSIX I/O API and the POSIX I/O semantics. The POSIX I/O API essentially defines how applications read and write data. It encompasses many of the calls any scientific programmer should be familiar with, including open, close, read, write, and lseek, which, as such, are an integral part of today's applications and libraries. The API defines key features provided by the file system, such as a hierarchical namespace or permissions. In contrast, the POSIX I/O semantics define what is and is not guaranteed to happen when a POSIX I/O API call is made. For example, the guarantee that data has been committed to somewhere durable when a write call returns without an error is a semantic aspect of the POSIX write API. While this notion of write being honest about whether it actually wrote data may seem trivial, these semantics can incur a significant performance penalty at scale despite not being strictly necessary for many applications. While most distributed file systems strictly adhere to the POSIX I/O API, their compliance with the associated semantics greatly varies from system to system. Key differences between systems lie in their consistency semantics and in their caching and locking policies. For example, PVFS (137) does not support POSIX I/O atomicity semantics. Instead, it guarantees sequential consistency only when concurrent clients access non-overlapping regions of a file (referred to as non-conflicting writes semantics). On the other hand, Lustre (142) guarantees consistency in the presence of concurrent conflicting requests. It is worth noting that some file systems choose to offer only a subset of the POSIX I/O API. This is most notably the case of the Hadoop Distributed File System (HDFS) (145). Indeed, the Hadoop data analytics framework is purpose-built for Hadoop workloads. Hence, it does not need the random file update semantics that are core to POSIX I/O. This significantly eases the design of the file system by eliminating the synchronization required for handling potentially conflicting write operations.
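The distinction between the API and its semantics can be illustrated with a few lines of C: the calls below are the POSIX I/O API itself, while whether the data is already durable or visible to other clients when write() returns is a semantic question that systems such as Lustre, PVFS or HDFS answer differently. The explicit fsync() call is shown only to make the durability step visible; the file name is arbitrary.

/* The POSIX I/O API in its simplest form. Whether the data is durable or
 * visible to other clients once write() returns is a *semantic* question,
 * answered differently by different distributed file systems. */
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    const char *msg = "checkpoint step 42\n";
    int fd = open("output.dat", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    if (write(fd, msg, strlen(msg)) < 0)   /* API: request the write       */
        perror("write");
    if (fsync(fd) < 0)                     /* semantics: force durability, */
        perror("fsync");                   /* needed where write() alone   */
                                           /* gives weaker guarantees      */
    close(fd);
    return 0;
}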

3.1.2 Various data and metadata management architectures

Although most of the existing file systems adhere to part or all of the POSIX I/O standard, the underlying architecture permitting them to do so greatly varies. This architecture also significantly impacts the performance and the level of POSIX I/O compliance of the system. Five main distributed file system categories can be identified (158):

Client-Server File Systems: A client-server file system exports a standardized view of its local file system to remote clients. The associated communication protocol enables a variety of heterogeneous clients to share the same namespace to store and retrieve information. Because of its simplicity, this model remains widely used today despite its weaknesses. Those include solely relying on local mechanisms such as RAID (129) to provide fault-tolerance, offering constrained performance under high load, or only permitting vertical scalability because of the single-server nature of these file systems. Common examples of client-server file systems are NFS (139) or CIFS (98).

Cluster File Systems: In a cluster file system, the namespace is the aggregation of resources from multiple federated servers connected over the network. This resource aggregation is kept hidden from the user. Compared to client-server file systems, the multiple-server nature enables relying on additional mechanisms to provide a higher level of fault-tolerance, such as data replication. A key feature offered by most cluster file systems is their horizontal scalability, achieved by connecting additional servers to the federation. Most distributed file systems in use today are cluster file systems. Examples include PVFS (137), Lustre (142), GPFS (140), GlusterFS (68) or Ceph (167).

Asymmetric File Systems: An asymmetric file system is a file system in which one or more dedicated metadata managers maintain the file system and its associated disk structures. This architecture greatly eases providing POSIX I/O semantics, but may introduce additional bottlenecks and scalability challenges because of the need to separately scale the metadata and data management systems. This is the case for the vast majority of distributed file systems, including HDFS (145), Lustre (142) or GPFS (140).

Symmetric File Systems: A symmetric file system distributes data and metadata over all server nodes of the system in such a way that all nodes are able to understand the disk structure. This metadata distribution and the associated look-ups are implemented using peer-to-peer (P2P) protocols. Such an architecture enables horizontal scalability, but substantially complicates the underlying design of the system. Notable examples of symmetric file systems include PVFS (137) or GlusterFS (68).

Parallel File Systems: In a parallel file system, the data blocks are striped, in parallel, across multiple storage devices on multiple storage servers. This architecture distributes the concurrent load from multiple clients across multiple servers, consequently resulting in improved multi-user capabilities. In addition, this parallelism permits storing files whose size exceeds the storage capacity of a single server. However, parallel file systems raise additional challenges to guarantee that the stripes are synchronized and consistent. The vast majority of cluster file systems in use today are parallel (136).

Each distributed file system can be characterized as a combination of these generic types. As each type shows different characteristics with respect to performance, scalability or fault-tolerance, the choice of the appropriate file system for a specific target use-case is key to designing large-scale data storage platforms.

3.1.3 Limitations

While file systems exhibit strong benefits for managing and organizing data, these benefits are usually overshadowed by significant performance and scalability limitations as the data size grows. These limits are the result of their design. First, POSIX I/O is almost universally agreed to be one of the most significant limitations standing in the way of I/O performance for large exascale systems. This is due to various constraints it poses relative to the behavior of the file system when shared by multiple concurrent users. These include its stateful nature, prescriptive metadata, and its often-unneeded strong consistency when facing concurrent, conflicting writes. While these features are not problematic by themselves, they imply a substantial performance cost that the users have to pay no matter their actual requirements. For that reason, several file systems choose to relax parts of the POSIX I/O standard to offer increased performance for specific use-cases where these features limit performance for no tangible benefit (137). Also, various extensions to the POSIX I/O standard have been proposed in the literature to achieve similar results (19; 90; 163). Yet, the most significant limitation with respect to performance can be seen as the hierarchical nature of file systems itself, which is at the very root of their design. Indeed, when accessing any piece of data, its associated metadata must also be fetched. In a hierarchical design, this implies fetching multiple layers of metadata. In asymmetric file systems, metadata operations tend to become a bottleneck at scale because of the concentration of requests on a small number of machines. In symmetric file systems, this bottleneck effect is reduced at the cost of a higher access latency due to the distribution of the metadata across the whole system, which only increases with the size of the system and the hierarchical depth of the files that are accessed.
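A simple way to see the cost of hierarchy is to count metadata lookups: resolving a POSIX path conceptually requires one lookup per directory component, whereas a flat, key-based namespace resolves any name in a single lookup. The sketch below merely counts components for an illustrative path; the actual number of round trips depends on caching and on the metadata architecture of the file system.

/* Illustrative cost model: one metadata lookup per path component for a
 * hierarchical namespace, versus a single lookup for a flat key. The path
 * and the counting rule are purely illustrative. */
#include <stdio.h>

static int path_lookups(const char *path) {
    int n = 0;
    for (const char *p = path; *p; p++)
        if (*p == '/' && *(p + 1) != '\0')   /* one lookup per component */
            n++;
    return n;
}

int main(void) {
    const char *deep = "/projects/climate/run42/output/step-0001/part-0007.dat";
    printf("hierarchical: %d metadata lookups for %s\n", path_lookups(deep), deep);
    printf("flat key    : 1 lookup for key \"%s\"\n", deep);
    return 0;
}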

These two factors inherently limit the scalability potential of distributed file systems, either making them unsustainable for exascale applications or requiring significant changes to the very design of these systems. This ultimately negates the very reason they were initially built for: providing a simple and efficient way for users to manage shared data.

3.2 Key-value stores

Key-value stores (or key-value databases) propose a fundamentally simpler data representation model compared to distributed file systems. They emerged as an efficient replacement for relational databases, which often fail to meet the velocity requirements of BDA for simple use-cases where relationality is not critical. Key-value stores are the simplest of the NoSQL (Not Only SQL) paradigms (99). At a high level, key-value stores propose an associative array as the fundamental data model, in which each key is associated with one and only one value in a collection. This relationship is referred to as a key-value pair. In each key-value pair the key is represented by an arbitrary string. The value can be any kind of data, such as an image, a user preference file or a document. The value is opaque to the key-value database, requiring no upfront data modeling or schema definition. Some popular key-value stores are Dynamo (55), Cassandra (92), Riak (86) or Aerospike (148).

Two fundamental differences exist with respect to distributed file systems. First, the key-value pairs are organized in a flat namespace, hence avoiding the performance penalty of managing large, deep hierarchies altogether. Second, while the basic set of operations is roughly similar across systems, no fixed standard exists that would constrain the exact set of features and semantics offered by each system. Consequently, a wide variety of key-value stores exists, each potentially exhibiting a different set of features and semantics suited to different ranges of applications.

3.2.1 Minimal API and semantics

As a fundamental data representation or data structure, the API and semantics offered by key-value stores are generally very simple. They provide a way to store, retrieve and update data using simple commands; the path to retrieve data is a direct request to the object in memory or on disk. The minimal API usually consists of these operations:

Add: Add a key-value pair to the collection of data.

Reassign: Change or reassign the value associated with a key; in practice this operation is often the same as add.

Remove: Remove a key-value pair from the collection of data.

Get: Get the value associated with a key.
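The sketch below illustrates this minimal API with an in-memory Python dictionary; the class and method names mirror the operations above and are purely illustrative, not the API of any particular system.

```python
# Minimal sketch of the key-value API described above, backed by an
# in-memory Python dictionary. Names and types are illustrative only.

class KeyValueStore:
    def __init__(self):
        self._data = {}            # key (str) -> opaque value (bytes)

    def add(self, key: str, value: bytes) -> None:
        """Add a key-value pair to the collection."""
        self._data[key] = value

    def reassign(self, key: str, value: bytes) -> None:
        """Change the value associated with a key; same as add in practice."""
        self._data[key] = value

    def remove(self, key: str) -> None:
        """Remove a key-value pair from the collection."""
        self._data.pop(key, None)

    def get(self, key: str) -> bytes:
        """Get the value associated with a key."""
        return self._data[key]

store = KeyValueStore()
store.add("user:42", b"opaque value, e.g. a serialized profile")
print(store.get("user:42"))
```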

This API is often associated with simple semantics: all operations are atomic, stateless and idempotent. Consequently, the challenges associated with sharing a single key-value store among multiple users and coordinating concurrent requests are minimal. The simplicity of this model also makes a key-value store fast, easy to use and flexible. Yet, the variety of available systems with no defined standard may intuitively come at the expense of portability. However, in practice, the inherent simplicity of the data model makes porting an application to a different key-value store simple, or even transparent with an appropriate thin abstraction layer.

3.2.2 Scalability

A key-value store distributes the key-value pairs across the whole system, enabling them to be retrieved using P2P protocols. This data distribution is very similar to the metadata storage architecture of symmetric file systems. Combined with their flat namespace, this enables key-value stores to exhibit nearly infinite horizontal scalability (76). Another important aspect of the simplicity of the API and semantics offered is the ability to shift part of the work associated with data operations to the user. Indeed, a client usually communicates with all server instances instead of a single entry point, and is given sufficient knowledge about the data distribution across the cluster to pinpoint the server hosting any piece of data it wants to read or delete, independently of any dedicated metadata server. This enables key-value stores to reduce the cost and the latency associated with most storage operations, and further improves their horizontal scalability.
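The following sketch illustrates the principle of client-side data location: the client derives the responsible server directly from the key, without contacting a metadata service. The server list and the simple modulo placement are assumptions made for illustration; production systems typically rely on consistent hashing or comparable placement schemes so that adding or removing a node only remaps a fraction of the keys.

```python
# Sketch of client-side data location: the client hashes the key to find
# which server holds it, without any metadata server on the request path.
# The server list and modulo placement are deliberately simplified.
import hashlib

SERVERS = ["node-01:7000", "node-02:7000", "node-03:7000"]  # hypothetical nodes

def locate(key: str) -> str:
    digest = hashlib.sha1(key.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(SERVERS)
    return SERVERS[index]

print(locate("user:42"))   # the client contacts this server directly
```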

3.2.3 Limitations

The main limitation associated with key-value stores is the result of this very simple API. Specifically, modifying a key-value pair is only possible by replacing its value as a whole. In practice, this limits key-value stores to two usage scenarios: small data pairs or static content. Most key-value stores are indeed designed to handle relatively small data objects, with a typical size ranging from a few bytes to a few megabytes. Some key-value stores are specifically designed to handle large data objects by leveraging data striping, such as Amazon S3 (20). Given the lack of an update operation, these objects often represent static content such as pictures or videos. Such key-value stores are usually referred to as object stores. This makes key-value stores unfit for a wide range of use-cases. This is a significant issue for HPC applications that frequently need to update potentially large output files generated, for example, by large-scale simulations or event generators.

3.3 Document, columnar, graph databases

A wide variety of storage models exists between the complex schemas of relational databases and the extreme simplicity of key-value stores. These models target applications that require some structure in the stored data. Among the most popular of them are document databases, wide column databases and graph databases. The main design objective of these stores is to provide a “best of both worlds” approach between relational databases and key-value stores. Indeed, a common characteristic shared by these data models is the shift from an opaque value to a structured value understandable by the storage system. The storage system is hence able to use this structure to extract metadata, perform optimizations or provide the user with advanced query functionality.

3.3.1 Document databases

Document databases extend key-value stores by enabling users to store documents instead of opaque values. A document is a data object whose structure can be understood by the storage system, usually expressed in JSON (33) or XML (34). The structure of these documents often permits richer semantics than relational databases, offering for instance nested documents or data structures such as associative arrays and lists. This rich object definition enables complex documents to be represented as a single object instead of being spread across multiple tables, hence eliminating relational mapping. Documents do not have to adhere to a pre-defined schema, nor are they expected to contain the same set of keys or structure.

Document databases are able to offer more features than key-value stores by leveraging the structure of the data, contributing to a richer experience. Such features include the possibility to query or update only parts of a document, to filter results or to execute complex queries using database-managed secondary indexes. In practice, the set of query API or query language features available, as well as the expected performance of the queries, vary significantly from one system to another. Likewise, the specific set of indexing options and configurations that are available varies greatly by implementation. Among the most popular document databases are MongoDB (45), CouchDB (21) or RethinkDB (161). Document databases are also available as cloud software-as-a-service (SaaS) offerings with Amazon SimpleDB (133) or Azure Cosmos DB (130).
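As an illustration of partial document updates, the sketch below modifies a single nested field of a JSON-like document addressed by a dotted path. The document and the helper function are hypothetical; actual query languages and update operators differ from one document database to another.

```python
# Sketch of a partial update on a nested document, addressed by a dotted
# path, as document databases typically allow. The document and the helper
# are illustrative; real query languages differ between systems.
import json

doc = {
    "name": "Ada",
    "address": {"city": "Madrid", "zip": "28040"},
    "tags": ["hpc", "storage"],
}

def set_field(document: dict, dotted_path: str, value) -> None:
    *parents, leaf = dotted_path.split(".")
    node = document
    for part in parents:
        node = node.setdefault(part, {})   # descend, creating levels if needed
    node[leaf] = value

set_field(doc, "address.city", "Rennes")   # update only one nested field
print(json.dumps(doc, indent=2))
```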

3.3.2 Wide column databases

Wide column databases take document databases one step closer to relational databases, or more specifically to column-oriented relational databases. The data model offered by this type of storage system is very similar to that of relational databases, being composed of a set of tables containing columns. Wide column stores embrace a hybrid approach between document and relational databases, mixing the key-value pairs and variable schemas of the former with the declarative characteristics of the latter. Wide column databases adopt a storage model similar to column-oriented relational databases: they store data tables as sections of columns of data rather than as rows of data. This data layout yields performance advantages for large-scale data analytics, where only a subset of the columns has to be scanned.

Because of its performance for large-scale analytics, this type of data store is extremely popular in the context of BDA (38). The first wide column database, BigTable (44), was introduced by Google to complement its proprietary Map-Reduce implementation. Yahoo! later introduced the HBase (65) data store for the Hadoop framework, while Facebook developed Cassandra (92) to serve their own data analytics needs. These databases retain the high scalability of key-value stores. For example, Cassandra has been deployed by Apple on clusters composed of up to 75,000 nodes and hosting more than 10 PB of data (159).

Other databases of this type include (143) or ScyllaDB (109), as well as SaaS offerings, such as Azure Tables (66).
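The benefit of the column-oriented layout for analytical scans can be illustrated with the following sketch, which contrasts a row-wise and a column-wise in-memory representation of the same table; the data is illustrative and does not reflect the on-disk format of any specific system.

```python
# Sketch of why a column-oriented layout helps analytical scans: only the
# column of interest is traversed, instead of every field of every row.

rows = [
    {"user": "u1", "country": "ES", "clicks": 12},
    {"user": "u2", "country": "FR", "clicks": 7},
    {"user": "u3", "country": "ES", "clicks": 3},
]

# Row-oriented: a scan touches every field of every row.
total_row_layout = sum(r["clicks"] for r in rows)

# Column-oriented: each column is stored contiguously; the aggregation
# only reads the "clicks" column.
columns = {
    "user":    [r["user"] for r in rows],
    "country": [r["country"] for r in rows],
    "clicks":  [r["clicks"] for r in rows],
}
total_column_layout = sum(columns["clicks"])

assert total_row_layout == total_column_layout == 22
```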

3.3.3 Graph databases

Compared to other types of storage systems, graph databases make the relations (or edges) between individual documents a first-class citizen. Such relations allow individual documents (or nodes) to be assembled into a complex graph. This obviates the need for manually inferring relations between documents using special properties such as foreign keys, joins or out-of-band processing such as Map-Reduce. Graph databases are often schema-less, allowing for the flexibility of a document database, while supporting relationships in a similar way to that of a traditional relational database. Graph databases offer a specific query language dedicated to traversing the data structure. Hence they excel for a range of applications such as social networking, pattern discovery or deep data traversal.

This query power however comes at the cost of scalability. Indeed, graph databases are notoriously hard to scale horizontally, for the same reason relational databases are. To ensure the best performance, the graph should be distributed so that frequently co-accessed data is co-located. Yet, partitioning a data set composed of connected entities is an NP-hard problem (84). The partitioning method impacts the final query performance, and needs to be fine-tuned for the expected workload (170). For that reason, Neo4j (165) or Amazon Neptune (7) do not partition the graph, hence targeting vertical scalability. Distributed graph databases such as Titan (121) offer horizontal scalability at the cost of performance for deep traversals.
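The kind of traversal graph databases are optimized for can be sketched as a plain breadth-first search over adjacency lists; real systems expose such traversals through dedicated query languages, and the graph below is purely illustrative.

```python
# Sketch of a deep traversal (breadth-first) over a small social graph
# stored as adjacency lists; the data is illustrative.
from collections import deque

edges = {
    "alice": ["bob", "carol"],
    "bob":   ["dave"],
    "carol": ["dave", "erin"],
    "dave":  [],
    "erin":  [],
}

def reachable(start: str) -> set:
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in edges.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

print(reachable("alice"))   # {'alice', 'bob', 'carol', 'dave', 'erin'}
```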

3.3.4 Limitations

The main limitation of document, wide column and graph databases lies in the very reason they are popular for web application development: the data structure they expect makes them unfit for handling large, unstructured data. These databases trade versatility for expressiveness to various extents, which in practice limits their scope of application. Yet, interestingly, these NoSQL paradigms can be implemented as thin layers over a simpler, opaque storage layer. This is for example the case of the multi-model OrientDB storage system (155), offering both a document database and a graph database over a common storage layer, or HBase, which offers a wide column store over HDFS. The Titan graph database is in turn implemented as a layer above a pluggable backend such as HBase or Cassandra.

Furthermore, similarly to key-value stores, these storage models are typically targeted at storing relatively small data objects, with a size ranging from a couple of bytes to a few megabytes. This is well below the data object size that is typical for both HPC and BDA storage systems (138).

3.4 Blob storage

A key observation that can be made from the limitations of distributed file systems and key-value stores is that they relate to different aspects of these systems. Indeed, distributed file systems are mostly limited by their hierarchical nature and semantics, that is, the way they manage metadata. In contrast, the limitations of key-value stores derive from the insufficient set of operations permitted on the data. As such, it is possible to design a storage system combining most of the advantages of both distributed file systems and key-value stores. Such storage systems are named blob (Binary Large Object) stores, or object stores. They combine the flat namespace and small API that make key-value stores efficient and highly scalable with the update operations that make distributed file systems adaptable to a wide range of use-cases. Typical examples of such blob storage systems include RADOS (167), or hosted services available on cloud platforms such as Azure Storage Blobs (39) or Amazon S3 (20).

3.4.1 Data model

The data model and the API of blob storage systems are largely similar to those of key-value stores. The two major changes relate to the way a piece of data can be read or written. Indeed, a blob storage system offers the possibility to overwrite only a part of the data associated with a key. Symmetrically, the read operation is extended to allow retrieving only a part of the data. Combined with the data striping found in parallel file systems, this API change effectively enables blob storage systems to be used with large, mutable data objects. Blob storage systems retain the two features of key-value stores that explain their scalability: a flat namespace and a decentralized architecture.
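A minimal sketch of this extended API is given below: compared to a key-value store, read and write take an offset, so that only part of a blob is touched. The in-memory model deliberately ignores striping, replication and persistence.

```python
# Sketch of the blob read/write API: unlike a plain key-value store, a blob
# can be read or overwritten partially, at a given offset.

class BlobStore:
    def __init__(self):
        self._blobs = {}   # key (str) -> bytearray

    def write(self, key: str, offset: int, data: bytes) -> None:
        blob = self._blobs.setdefault(key, bytearray())
        end = offset + len(data)
        if end > len(blob):
            blob.extend(b"\x00" * (end - len(blob)))   # grow the blob
        blob[offset:end] = data

    def read(self, key: str, offset: int, size: int) -> bytes:
        return bytes(self._blobs[key][offset:offset + size])

store = BlobStore()
store.write("simulation/output", 0, b"timestep-0000")
store.write("simulation/output", 9, b"0042")          # in-place partial update
print(store.read("simulation/output", 0, 13))          # b'timestep-0042'
```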

3.4.2 Semantics and concurrency control

State-of-the-art blob storage systems such as RADOS generally retain the consistency guarantees of key-value stores, enabling writes on an object to be performed atomically. There exists one significant exception to this rule: striping large data objects across multiple servers raises additional challenges when a write operation spans multiple data chunks. Indeed, as two data chunks are functionally separate objects potentially located on different servers, strict consistency control is necessary to guarantee the correctness of this operation when processing concurrent, conflicting requests. This consistency management generally comes at the cost of performance. In practice, most object storage systems make no guarantee regarding the correctness of operations spanning multiple chunks (13), requiring application developers to take special care when using such storage systems in a multi-user, shared context.

Some blob storage systems strive to offer higher consistency guarantees by embracing version-based metadata management, enabling multiple versions of a blob to co-exist in the system at any given time. This is notably the case of BlobSeer (127), which embraces Multi-Version Concurrency Control (MVCC) (26) to cope with concurrent, conflicting requests. BlobSeer distributes the version tree across the whole cluster in order to achieve high horizontal scalability.
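The following sketch illustrates the principle of version-based concurrency control on a single blob: a write publishes a new immutable version, and a reader that pinned an earlier version keeps observing a stable snapshot. This is a drastic simplification of BlobSeer's distributed, tree-based version management and is only meant to convey the idea.

```python
# Sketch of multi-version concurrency control for a single blob: writes
# never modify data in place, they publish a new version; readers that
# pinned an older version still see a consistent snapshot.

class VersionedBlob:
    def __init__(self):
        self._versions = {0: b""}   # version number -> immutable content
        self.latest = 0

    def write(self, offset: int, data: bytes) -> int:
        base = bytearray(self._versions[self.latest])
        end = offset + len(data)
        if end > len(base):
            base.extend(b"\x00" * (end - len(base)))
        base[offset:end] = data
        self.latest += 1
        self._versions[self.latest] = bytes(base)     # publish the new version
        return self.latest

    def read(self, version: int, offset: int, size: int) -> bytes:
        return self._versions[version][offset:offset + size]

blob = VersionedBlob()
v1 = blob.write(0, b"hello world")
v2 = blob.write(6, b"there")
print(blob.read(v1, 0, 11))   # b'hello world'  (old snapshot still readable)
print(blob.read(v2, 0, 11))   # b'hello there'
```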

3.4.3 Modular storage

The simple yet versatile data model and API offered by blob storage systems already enable them to be used in a variety of applicative contexts. Most notably, they can serve as building blocks for different storage abstractions, enabling developers to focus on high-level design rather than on low-level storage. Additionally, blob storage systems are increasingly used as the underlying data storage layer for distributed file systems. For example, Ceph (167) relies on the purpose-designed RADOS distributed blob storage system to manage data distribution and persistence across the cluster. Although not distributed, the Object Storage Devices (OSDs) of Lustre (142) serve a similar purpose. RADOS offers additional storage abstractions such as the Ceph Object Gateway (object storage) or the Ceph Block Device.

3.4.4 Limitations

Blob storage systems combine positive aspects of both distributed file systems and key-value stores, but to some extent, they also combine their drawbacks. Most notably, performance-wise, the features that are added on top of key-value stores invariably impact the performance of the system. Indeed, some form of concurrency control is necessary to provide ordering between concurrent requests. In BlobSeer, the distributed metadata management used to guarantee the correctness of operations spanning multiple chunks is inherently incompatible with client-side data location. This significantly increases the overall system latency because of the metadata fetch required for each operation. This metadata access cost is in practice comparable to that experienced with symmetric distributed file systems. When such correctness guarantees are not provided, as in the case of RADOS, application developers must take special care regarding request coordination to prevent data from being corrupted by concurrent, conflicting write operations.

Overall, flat namespaces are friendly to applications by offering low data-management overhead. Yet, this comes at the expense of user experience, as they do not provide efficient ways for humans to manage the data through an intuitive interface such as the simple hierarchical file explorers available for file systems. This may quickly become an issue as the number of objects in a namespace grows, calling for some external metadata management system to be implemented.

When a storage system is deployed in a multi-user context, for instance on a supercomputer, access control and security need to be addressed. The lack of a folder hierarchy requires access control to be implemented at the object or at the namespace level. Considering that separately managing permissions for potentially large numbers of objects is challenging by itself, namespace-level permissions will generally be preferred. This effectively requires blob storage systems to handle multiple namespaces concurrently. While not a technical challenge by itself, to the best of our knowledge, none of the state-of-the-art storage systems available implements this feature.

3.5 Conclusions: which storage paradigm for converging storage?

Most HPC and BDA platforms today still rely on distributed file systems, which used to provide a good balance between performance and versatility. However, the fast growth of the data volumes and throughput requirements of data-intensive applications running on a new generation of extreme-scale platforms pushes this storage paradigm to its limits. As the research investments necessary for these systems to cope with such challenging requirements grow, there is a clear trend in both the HPC and BDA communities to move away from distributed file systems altogether.

While key-value and other NoSQL data stores provide a highly efficient data model able to cope with more than a billion I/O operations per second on a single server (104), they lack support for the large, mutable data objects required by a wide range of extreme-scale HPC simulations and industry-standard I/O frameworks. Such support for large objects is precisely the main target of blob storage systems. In this context, they would provide a middle-ground approach combining the data model required for HPC applications with a horizontal scalability on par with that of key-value stores and a virtually infinite capacity critical for BDA applications. Furthermore, the deployment of blob storage does not conflict with legacy applications, thanks to their capability to serve as a building block for highly efficient file system interfaces. As such, blob storage systems constitute a strong candidate for achieving storage-based convergence between HPC and BDA.

However, despite being a very promising alternative to distributed file systems, blob storage systems are not exempt from limitations. Specifically, we have seen that significant challenges lie in their consistency guarantees when faced with concurrent, conflicting writes. While some systems choose not to provide any such guarantees, others do so at the expense of a substantial increase in storage operation latency compared to key-value stores. For blob storage systems to be proposed as an alternative to file systems for both HPC and BDA, we believe that a middle-ground approach is also required here: providing configurable consistency guarantees while exhibiting only minimal overhead when no such guarantees are required.

Chapter 4

Consistency management

One of the storage guarantees required by virtually any application is data consistency. Despite being a common requirement for storage systems, providing this guarantee has always posed hard challenges to system developers because of the wide range of conditions that can potentially result in data inconsistencies. Besides the system and network failures common to all systems, concurrent, conflicting storage requests can also yield incorrect results. This is the case when the storage system purposely relaxes these guarantees, such as Ceph (167) for write operations crossing a chunk boundary (13). Data corruption may also occur when a series of storage requests is incorrectly expected to be executed atomically by the system, as with concurrent read-update-write requests (23). The latter case requires the user to prevent two conflicting requests from ever executing simultaneously. This can be ensured either by relying on applicative locks or by leveraging synchronization primitives provided by a layer further down the storage stack. Defining at which layer to implement consistency management is a hard question. In this chapter we discuss the solutions generally adopted by HPC and BDA developers, which significantly diverge.

As in any storage stack, ensuring data consistency in a convergent HPC and BDA storage system is critical. Integrating consistency management, however, is not an easy task. Indeed, HPC and BDA rely on opposite principles with respect to data consistency. The former typically leverages a storage system offering POSIX I/O strong consistency, while leaving it up to the user to rely on high-level synchronization tools defined in the framework or directly inside the application when required. The latter usually relies on a wide variety of storage systems offering low-level consistency depending on the application requirements. The systems offering strong consistency often take care of request synchronization based on user-provided semantics when required. In this chapter we provide an overview of the mechanisms used on each side to ensure data consistency, to understand their advantages and inherent limitations in the light of HPC and BDA convergence.

4.1 Storage consistency models

Central to the consistency guarantees made by any storage system is the consistency model it provides. A consistency model is a contract between a distributed data store and a set of processes, which specifies what the results of read/write operations are in the presence of concurrency (124). It defines the general guarantees made by the storage system regarding the consistency a client can expect when dealing with concurrent, conflicting requests. Such conflicting requests can be either read-write conflicts (concurrent read and write operations) or write-write conflicts (two concurrent write operations) on the same data object. Consistency models are split into two general categories: strong consistency and weak consistency. The former guarantees that all accesses are seen by all client processes in the same order (sequentially). The latter relaxes this guarantee by allowing different clients to perceive the same variables in different states (58).

4.1.1 Strong consistency models

Ideally, a storage system can hide the fact that it is replicated from clients and act indistinguishably from a storage system running on a single computer. Replication would then increase the system's fault-tolerance without causing it to expose any abnormal behavior to clients. Strong consistency models differ in the guarantees they make regarding the ordering of operations:

Strict consistency: A read operation must return the result of the latest write operation which occurred on the data item. This is the strongest possible consistency model. It is only possible with a global, perfectly synchronized clock, with all writes instantaneously visible to all clients. It is therefore not applicable to distributed systems; it is the consistency model of uniprocessor systems.

Linearizability: Writes should appear to be instantaneous. Informally, once a write completes, all later reads (where “later” is defined by wall-clock start time) should return the value of that write or the value of a later write. Once a read returns a particular value, all later reads should return that value or the value of a later write (79). It is the consistency model of transactional systems such as Spanner (50) or HyperDex (60).

Sequential consistency: The result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program (93). Compared to linearizability, sequential consistency does not make guarantees regarding the relative ordering of requests originating from different clients. This consistency model is that of the ZooKeeper distributed key-value store (80). It is also the consistency model imposed by the POSIX I/O standard (88) and therefore that of Lustre (142), GPFS (140) or PVFS (137).

4.1.2 Weak consistency models

The weak consistency models differ in the precise details of which operation reorderings are allowed within the activity of a client, and in whether there are any constraints at all on the information provided to different clients. These relaxed consistency models make it possible, in some cases, to overcome the performance penalty incurred by strong consistency guarantees.

Eventual consistency: If no updates take place for a long time, all replicas holding the same data will gradually converge to the same value. It is the consistency model of epidemic protocols, and is very common in NoSQL stores such as Dynamo (55), Cassandra (92) or Riak (86).

Causal consistency: All writes that are (potentially) causally related must be seen by every process in causal order (81). This is for example the case with a read followed by a write on the same value from the same client, or with a write followed by a read of the same value from different clients. Causal consistency is the foundation of the COPS key-value store (106).

4.2 Application-specific consistency management

Any application requiring higher storage consistency guarantees than those provided by the underlying storage system needs to incorporate mechanisms targeted at achieving this level of consistency. For example, read-update-write operations require the atomic execution of the read and write operations as a single unit. Should the storage system not provide the tools to express such a consistency requirement, the application developer needs to ensure at the application level that this atomic execution is enforced. To the best of our knowledge, no POSIX I/O-compliant distributed file system, weakly-consistent object store or blob store provides such semantics today. Two main techniques are used to ensure such atomicity: either locking, or avoiding concurrent requests altogether.

4.2.1 Application-Level Locking

Application-level locking is the most intuitive way of ensuring that multiple storage requests are executed atomically. Essentially, locks create a protected region in the application code that cannot be executed by more than one process at the same time. When accessing this protected region, a process atomically acquires a lock on this region. Any concurrent process trying to access the region while it is locked is paused until the region is unlocked by the process holding the lock.

Application-level locking is an intuitive way to express consistency requirements, allowing developers to fine-tune the locking mechanisms according to their knowledge of the application internals and requirements. Yet, besides its negative performance impact, application-level locking is not exempt from weaknesses. First, for non-trivial applications, designing lock-based consistency management is difficult and error-prone. Indeed, one needs to ensure the highest possible lock granularity so as not to hinder application performance beyond what is necessary, while at the same time handling all potential failure conditions leading to deadlocks.

Furthermore, locks only provide atomicity from the point of view of the application; the storage system is oblivious to the semantics expressed by application-level locks. Hence, the data can become corrupted by partially-executed protected regions during failure scenarios of the storage system, of the application or of the network.
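The sketch below shows a protected read-update-write region guarded by a lock. A process-local threading.Lock stands in for a distributed lock service, which is an assumption made for brevity; as noted above, the storage system itself remains unaware of the semantics of the protected region.

```python
# Sketch of application-level locking around a read-update-write sequence.
# A local threading.Lock stands in for a distributed lock service here.
import threading

counter_lock = threading.Lock()
storage = {"counter": 0}          # stand-in for a remote key-value pair

def increment_counter():
    with counter_lock:            # begin protected region
        value = storage["counter"]        # read
        value += 1                        # update
        storage["counter"] = value        # write
                                  # end protected region

threads = [threading.Thread(target=increment_counter) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(storage["counter"])         # 8: no lost updates thanks to the lock
```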

4.2.2 Deferred consistency

In some cases, an application only needs to ensure consistency of the storage at a specific point in time. This is for example the case when multiple ranks of an HPC application are able to work independently from each other, only merging the data after the data processing finishes. This is usually referred to as deferred consistency. This form of consistency intuitively yields optimal I/O performance. Yet, in many cases, it requires designing the application in a way that is not intuitive or natural to the application developer. Not all applications can be expressed in such a way, limiting its application to a few specific use-cases.

4.2.3 I/O proxying

When an application needs to collaborate on writing to a single object in a way that could result in data inconsistency, one can use one or multiple processes to perform I/O operations on behalf of the other processes. Each process is assigned an exclusive region of the data object, so that only one node can write to any given region at a time. This technique is often implemented in data management frameworks, such as in MPI I/O collective operations (157). The main drawback of this technique is the substantial performance cost resulting from the additional network requests required to perform I/O operations, or from a single I/O node becoming overloaded. Special care needs to be taken when implementing I/O proxying in order to ensure the fault-tolerance of the application.
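A minimal sketch of the exclusive region assignment underlying I/O proxying is shown below; the object size and number of ranks are illustrative assumptions.

```python
# Sketch of exclusive region assignment for I/O proxying: each rank owns a
# disjoint byte range of a shared output object, so no two ranks ever write
# to the same region. Sizes and rank count are illustrative.

OBJECT_SIZE = 1 << 30     # 1 GiB shared output object (assumed)
NUM_RANKS = 8

def region_for_rank(rank: int) -> tuple:
    """Return the (offset, length) byte range owned by a rank."""
    chunk = OBJECT_SIZE // NUM_RANKS
    offset = rank * chunk
    # the last rank absorbs any remainder
    length = OBJECT_SIZE - offset if rank == NUM_RANKS - 1 else chunk
    return offset, length

for rank in range(NUM_RANKS):
    print(rank, region_for_rank(rank))
```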

4.2.4 Limitations

Overall, application-level consistency management offers the highest potential for optimization by the application developer: all three techniques can be fine-tuned to the exact requirements of the application. In most cases, however, this requires a substantial development and debugging effort that is only worthwhile for very large, highly throughput-sensitive applications. In particular, such an investment is rarely suited to fast-evolving BDA applications, in which scalability is favored over raw performance.

4.3 Transactional semantics

In a transactional model, the role of enforcing data consistency rules is left to the storage system. The user specifies these rules using a simple API that informs the storage system of the expected semantics for any given set of requests. Essentially, a transaction is a set of storage operations that are to be processed and executed as an atomic unit, as a single logical operation on the data. Two major sets of properties exist to define the execution guarantees provided by transactional systems for any storage operation: ACID and BASE.
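The sketch below illustrates the kind of client-side transactional API discussed here: operations issued between the creation of a transaction and its commit are buffered and applied as a single atomic unit. The interface is illustrative and does not correspond to the API of any specific system.

```python
# Sketch of a client-side transactional API: operations issued between the
# creation of the transaction and commit() are buffered and applied as one
# atomic unit. Purely illustrative, single-node, in-memory.

class Transaction:
    def __init__(self, store: dict):
        self._store = store
        self._writes = {}

    def put(self, key: str, value: bytes) -> None:
        self._writes[key] = value          # buffered, not yet visible

    def commit(self) -> None:
        # In a distributed store, a commit protocol (e.g. two-phase commit)
        # would order and validate the transaction here.
        self._store.update(self._writes)   # applied as a single unit

store = {}
tx = Transaction(store)
tx.put("account:a", b"90")
tx.put("account:b", b"110")
tx.commit()                                 # both writes become visible, or none
print(store)
```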

4.3.1 Strong consistency with the ACID model

The stronger of the two models is known as ACID (75). It states that the four following properties must be satisfied for every operation:

Atomicity: All of the operations in the transaction will complete, or none will. If one part of the transaction fails, the entire transaction fails.

Internal Consistency: The storage will be in a consistent state when the transaction begins and ends. This property ensures that any transaction will bring the storage from one valid state to another. This rule must be satisfied for all nodes in a cluster. Internal consistency relates to the preservation of invariant relations that exist among items within a database, and is consequently largely dependent on the application semantics (160).

Isolation: The transaction will behave as if it is the only operation being performed on the data. Each transaction has to execute in total isolation from concurrently running operations.

Durability: Upon completion of the transaction, the operation will not be reversed.

ACID has been the cornerstone of relational databases for decades. It is increasingly being used in a variety of non-relational systems such as key-value stores (53; 61; 141) or even file systems (59).

4.3.2 Relaxing consistency with the BASE model

A key observation leading to the BASE model (131) is that not every application truly needs ACID-compliant transactions. Relaxing certain ACID properties has opened up a wealth of possibilities, which has enabled some data stores to achieve massive scalability and performance for their applications. Whereas ACID defines the key characteristics required for reliable transaction processing, these properties inherently require the storage system to offer strong consistency and hence to sacrifice availability. These opposing characteristics are captured in the acronym BASE:

Basically Available: The system is guaranteed to be available for querying by all users. (No isolation.)

Soft State: The values stored in the system may change over time, as per the eventual consistency model described below.

Eventually Consistent: As data is added to the system, the system’s state is gradually replicated across all nodes. For example, when a file is written to HDFS, the replicas of the data blocks are created in different data nodes after the original data blocks have been written. For the short period before the blocks are replicated, the state of the file system is not consistent.

In practice, BASE has allowed for more efficient horizontal scaling at cost-effective levels for systems that do not need strong consistency. The requirement to check the consistency of every single transaction at every moment of every change incurs substantial processing costs for the storage system. Eventual consistency has allowed for the effective development of systems able to deal with the exponential increase of data in a variety of storage systems.

4.3.3 Optimistic or pessimistic concurrency control

Common to every system enabling users to execute conflicting transactions is the need to ensure that conflicting transactions are sequentially ordered to preserve storage consistency. Concurrency control mechanisms are targeted at ensuring this ordering. The two main strategies are optimistic and pessimistic concurrency control (119):

Pessimistic concurrency control prevents the possibility of conflicting operations being applied on a data object by using locks. This strategy guarantees that changes are made safely and consistently, avoiding conflicts at all times during the processing of the transaction. The disadvantage of pessimistic locking is that an object is locked from the time it is first accessed in a transaction until the transaction finishes, making it inaccessible to other transactions during that time. If most transactions simply read from the object and never change it, an exclusive lock may be overkill, as it may cause lock contention.

Optimistic concurrency control assumes that although conflicts are possible, they will be very rare. Instead of locking every object every time it is used, the system merely looks for indications that two users actually did try to update the same object at the same time. This strategy allows conflicts to happen while a transaction is in progress, but prevents the transaction from successfully being applied should such a conflict be detected upon commit.

Optimistic concurrency control is more efficient and offers higher performance when the probability of conflicts is low, by avoiding unnecessary locking. In doing so, it also provides additional isolation between readers and writers. Despite these advantages, in high-contention scenarios or when the duration of a transaction is long, optimistic concurrency control may cause a significantly higher abort rate that ultimately hinders the performance of the application.
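The following sketch illustrates optimistic concurrency control with a per-object version number: an update is validated at commit time and retried if another writer committed in between. The in-memory store and its API are illustrative.

```python
# Sketch of optimistic concurrency control: each object carries a version;
# an update is validated at commit time and retried on conflict.

class ConflictError(Exception):
    pass

store = {"stock": (0, 10)}        # key -> (version, value)

def read(key):
    return store[key]             # returns (version, value)

def conditional_write(key, expected_version, new_value):
    version, _ = store[key]
    if version != expected_version:
        raise ConflictError(key)  # another writer committed in between
    store[key] = (version + 1, new_value)

def decrement_stock():
    while True:                   # optimistic retry loop
        version, value = read("stock")
        try:
            conditional_write("stock", version, value - 1)
            return
        except ConflictError:
            continue              # re-read and try again

decrement_stock()
print(store["stock"])             # (1, 9)
```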

4.3.4 Consensus protocols

A common requirement for distributed transactional systems is for the sequential order between transactions to be consistent across all nodes. This requires all nodes participating in any given set of transactions to reach a common consensus on the relative ordering of these transactions. Consensus protocols have been the focus of extensive research in the past decades, leading to four main protocols:

Two-phase commit (2PC) (47) coordinates all processes that participate in a distributed transaction on whether to commit or abort the transaction. It consists of two phases: vote and commit. In the first phase, each node participating in the transaction votes for the transaction to be applied or rejected. In the commit phase, after all nodes have voted, the transaction is applied if and only if all nodes voted accordingly. A single negative vote causes the transaction to abort. In this protocol, it is critical for the node initiating the transaction to remain available. In case this node fails, some participants in a distributed system will not be able to resolve their transactions and will block.

Three-phase commit (3PC) (147) eliminates the blocking problem at the cost of latency by introducing a third phase between vote and commit: pre-commit. In this phase, the initiating node acknowledges the results of the votes. If the initiating node fails before sending such an acknowledgement, the other nodes will unanimously agree that the operation was aborted. This protocol is however not resilient to network partitions. Should such a partition occur during the pre-commit phase, it is possible for some nodes not to be able to converge on the outcome of the transaction.

Paxos (29) solves the partition tolerance issue by only requiring a majority of nodes to agree on the outcome of a transaction for a decision to be made. This ensures that in a given vote, if the majority of nodes has agreed upon a certain outcome, any node trying to propose a value subsequently would learn that outcome from the other nodes and would agree to that outcome only. This also means that Paxos does not block even if half the nodes fail to reply. The inner complexity and generality of Paxos have led to multiple derivatives being proposed in the literature (54; 94; 95; 96; 128). Paxos is at the core of the ACID guarantees of Google's Spanner (50).

Transaction chains (174) provide a faster alternative to Paxos for the case where consensus is used for storage transactions. They rely on all nodes involved in a transaction being known ahead of time by the transaction initiator and being logically ordered in a consistent way. The protocol is based on two phases, vote and commit. During these phases, which are similar to those of 2PC, a commit message is passed from node to node according to the chain order, ensuring that the commit messages of two conflicting transactions will pass through their common nodes in the same order. When the number of nodes involved in a transaction is low, this protocol yields significant performance improvements over Paxos and 3PC while not sacrificing partition tolerance. It is the core of the HyperDex key-value store (61).
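As an illustration, the sketch below implements the vote and commit phases of the two-phase commit protocol described above for in-memory participants; timeouts, persistence of the decision and coordinator recovery are deliberately left out.

```python
# Minimal sketch of two-phase commit: the coordinator first collects votes,
# then tells every participant to commit only if all votes were positive.
# Failure handling (timeouts, logging, recovery) is intentionally omitted.

class Participant:
    def __init__(self, name, will_vote_yes=True):
        self.name = name
        self.will_vote_yes = will_vote_yes
        self.state = "init"

    def prepare(self) -> bool:                # phase 1: vote
        self.state = "prepared" if self.will_vote_yes else "aborted"
        return self.will_vote_yes

    def commit(self):                          # phase 2: commit
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants) -> bool:
    votes = [p.prepare() for p in participants]       # vote phase
    if all(votes):
        for p in participants:                         # commit phase
            p.commit()
        return True
    for p in participants:                             # one "no" aborts all
        p.abort()
    return False

nodes = [Participant("n1"), Participant("n2"), Participant("n3")]
print(two_phase_commit(nodes), [p.state for p in nodes])
```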

4.3.5 Limitations

In contrast with application-specific consistency management, the transactional model transfers to the storage system the responsibility of ensuring the consistency of the operations. This undoubtedly yields significant advantages. First, it provides users with a simple API enabling them to express the interdependency relations between operations without having to deal with the details of actually enforcing this interdependency. Second, it provides consistency guarantees even in case of failures of the client application. Third, it enables storage system developers to optimize and integrate commit protocols deep in the design of the system.

Yet, transactions do not necessarily deal with all aspects of concurrency, and may not be sufficient to express highly complex relations between different requests. Furthermore, although relatively fast protocols such as transaction chains exist, they incur a tangible performance penalty that has to be paid even when transactional semantics are not needed by the application. Finally, transactions take some control over the fine-grained optimization of concurrency management out of the hands of the developer, offering less potential for optimization.

4.4 CAP: Consistency, Availability, Partition tolerance

The consistency model adopted by the storage system has implications that go beyond the data consistency guarantees alone. Indeed, the CAP theorem (35) conjectures that in any networked shared-data system there is a fundamental trade-off between consistency, availability, and partition tolerance. The theorem states that such systems can only guarantee two of the following three properties:

Mutual Consistency: Every node in a distributed cluster returns the same, most recent, successful write. The returned results are always mutually consistent, i.e. they represent a logical state of the system. The consistency model used in the proof of the CAP theorem is sequential consistency.

Availability: Every non-failing node returns a response for all read and write requests in a reasonable amount of time.

Partition Tolerance: The system continues to function and upholds its consistency guarantees in spite of network partitions. Distributed systems guaranteeing partition tolerance can gracefully recover from partitions once the partition heals.

In this big picture, most distributed file systems are usually CP (they combine strong Consistency with Partition tolerance). In contrast, key-value stores and object stores are either CP or AP (Availability with Partition tolerance), depending on whether they offer strong or weak consistency.

It is worth noting that CA (strong Consistency with Availability) is often not a meaningful choice for distributed systems. Indeed, the potential for partitions is part of the very definition of a distributed system. In a CA system, ignoring network partitions automatically forfeits C, A, or both, depending on whether the system corrupts data or crashes when experiencing an unexpected partition.

However, the failure model used in the CAP theorem is unconstrained; it only considers an omnipotent adversary that can induce any number of faults into the system. Under such an assumption, no system can remain available or consistent. By making a stronger assumption that limits failures to at most a threshold of servers, it is possible to develop systems offering consistency, availability and partition tolerance as long as this failure threshold is not reached. This is most notably the case of HyperDex Warp (60; 61) and Google's Spanner (50).

In the context of HPC and BDA convergence, neither availability nor partition tolerance can be forfeited. Thus the question is how to adjust the consistency setting to answer the precise needs of the application without sacrificing performance.

4.5 Conclusions: which consistency model for converging storage?

We can observe that the storage systems used for HPC tend to offer strong consistency. In contrast, BDA relies on a variety of systems targeted at different use-cases, each offering a different consistency model ranging from BASE to full ACID. While application-specific and transactional concurrency management each have advantages of their own, it is clear that an application that does not require transactions can be executed on a transactional system, while the opposite is not possible or would require a significant amount of work. In the end, transactional systems offer a versatile and generic solution to handling consistency in a wide range of applications. Indeed, they can at the same time meet the needs of legacy HPC and BDA applications expecting strong consistency and those of BDA applications accepting either strong consistency or lesser guarantees.

Part II

Týr: a transactional, scalable data storage system


Chapter 5

Blobs as storage model for convergence

State-of-the-art storage systems for both HPC and BDA show a common trend towards relaxing many of the concurrent file access semantics, trading such strong guarantees for increased performance. Nevertheless, some differences remain. Specifically, while the BDA community increasingly relies on systems dropping the POSIX I/O API altogether, the HPC community tends to provide this relaxed set of semantics behind the same API. Although this choice increases backwards compatibility with legacy applications, it also has significant performance implications due to the design constraints this standard imposes, i.e. a hierarchical namespace or permission management.

In this chapter, we propose and discuss blobs as an alternative to traditional distributed file systems. Inspired by their API similarity with distributed file systems, allowing for random reads and writes in binary objects, we analyze the benefits and limitations of blobs as a storage model for HPC and BDA convergence.

5.1 General overview, intuition and methodology

Very few HPC applications actually rely on the full POSIX I/O API. The MPI-IO standard not only relaxes many of the associated semantics but also drops many of its operations altogether (97; 132). For example, it does not expose the file hierarchy or permissions to the end user. Therefore, applications leveraging these libraries are not supposed to expect these features to be provided.

The same observation can be drawn for BDA applications. Hadoop (145), Spark (173) and Flink (41), the three most popular BDA frameworks, are all able to leverage the HDFS file system. Yet, none of these frameworks exposes the hierarchical nature of the underlying storage to the application. Furthermore, these frameworks are also able to leverage wide column stores, which lack a hierarchical structure entirely (56; 164).

In this chapter we show experimentally that blobs can indeed substitute traditional file systems for both HPC and BDA workloads. We demonstrate that the hierarchical nature of distributed file systems is only provided as a convenience to the end-user, and is only seldom exploited by the applications, if ever. The consequence is that the majority of storage operations are file operations (open, close, read, write, create, delete). These operations are very close to the ones proposed by blob storage systems. Directory-level operations (opendir, mkdir, rmdir) do not have a blob counterpart because of the flat nature of the blob namespace. Should applications really need them (e.g., legacy applications), such operations can be emulated using a full database scan, as sketched below. This implementation would intuitively not be as optimized as its counterpart on current file systems. Yet, since we expect these calls to be vastly outnumbered by blob-level operations, this performance penalty is likely to be compensated by the gains permitted by using a flat namespace and simpler semantics.
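A possible emulation of a directory listing on a flat namespace is sketched below: listing a "directory" becomes a scan over all keys sharing a common prefix. The key naming convention is an assumption made for illustration.

```python
# Sketch of emulating a directory listing on a flat blob namespace: since
# there is no hierarchy, "listing /data/run1" becomes a scan over all keys
# sharing that prefix. Key names here are illustrative.

blob_keys = [
    "/data/run1/input.nc",
    "/data/run1/output-000.nc",
    "/data/run1/output-001.nc",
    "/data/run2/input.nc",
]

def emulate_opendir(prefix: str):
    """Return the immediate children of a 'directory' prefix."""
    prefix = prefix.rstrip("/") + "/"
    children = {key[len(prefix):].split("/", 1)[0]
                for key in blob_keys if key.startswith(prefix)}
    return sorted(children)

print(emulate_opendir("/data/run1"))
# ['input.nc', 'output-000.nc', 'output-001.nc']
```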

5.1.1 Applications

The challenges posed by convergence between HPC and BDA applications have raised many discussions in the community (6). One of the ideas emerging from these discussions is that applications cannot be considered separately from the underlying software stack; the fuel for convergence could be a wide variety of HPC and BDA applications leveraging converged services and a converged underlying infrastructure. Although designing a full convergent stack is not the objective of this thesis, we focus on storage as one of the critical milestones that could make such convergence possible. We choose I/O-intensive applications, which could benefit most from such converged storage.

We base our experiments on a number of I/O-intensive applications extracted from the literature (64; 103; 166) that cover the diversity of I/O workloads commonly encountered on both HPC and BDA platforms.

5.1.1.1 HPC Applications

The HPC applications we use are based on MPI. They all leverage large input or output datasets associated with large-scale computations atop centralized storage, usually provided by a distributed, POSIX I/O-compliant file system such as Lustre (142).

mpiBLAST (51) is a parallel MPI implementation of NCBI BLAST (171). It is a read-intensive biomolecular tool targeted at searching for regions of similarity between biological sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance.

MOM (Modular Ocean Model) (71) is a read-intensive three-dimensional ocean circulation model targeted at understanding the ocean climate system. It is developed at the NOAA's Geophysical Fluid Dynamics Laboratory.

ECOHAM5 (ECOlogical model, HAMburg, version 5) (149; 150) is a write-intensive three-dimensional biogeochemical ecosystem model focusing on the North Sea (73; 108). It models the pelagic and benthic cycles of carbon, nitrogen, phosphorus, silicon and oxygen on the northwest European continental shelf. It was developed at the Hamburg Institute of Oceanography.

Ray Tracing is a balanced read-write workload extracted from BigDataBench (166), itself derived from (48; 152). It uses scene description files to generate images by tracing the path of light and simulating the effects of its encounters with virtual objects.

5.1.1.2 BDA Applications

As the leading open-source Big Data processing and analytics framework, Apache Spark (173) appears as an ideal candidate for this research. SparkBench (17; 103) is a widely-recognized, industry-standard benchmarking suite for Spark. It comprises a representative set of workloads belonging to four application types: machine learning, graph processing, streaming, and SQL queries. We extract the test applications from this tool, focusing on data-intensive ones:

Sort is a widely used benchmark that reads input data and sorts it based on a given key. It is I/O-intensive since all the data read will be processed and written back to the file system. For example, it can be used to sort a series of sensor readings by date.

Grep is a filtering workload that searches in the input data for lines containing a given word and saves these lines into HDFS. In contrast to Sort, the size of the input and output will not be equal, and some data will be filtered out. It can be used to extract all the documents that contain a word of interest in a huge corpus.

Decision Tree is a machine learning workload that reads a dataset containing rows with a series of features and a class they belong to. This dataset is then split into a training and a test set. The workload creates a predictive model with the training set that is able to predict the class of the elements in the test set. These predictions are written back to disk. An example could be classifying data packets into malicious or nonmalicious activity.

Connected Component is an algorithm that finds the subgraphs in a graph in which any two vertices are connected by a path but are not connected to any other node on the supergraph. This can be seen as a way of finding clusters of nodes. For example, in a social network it can be used to find communities of users. The implementation in the benchmark will read a dataset and write the labels of each component back to disk.

Tokenizer is a Spark application we specifically developed to add to this set. It reads a text file, tokenizes each line into words, and calculates the NGrams=2 for each line. These NGrams are saved in a text file. This is a common preprocessing step in topic modeling for documents, where word patterns have to be extracted as an input to a machine learning model.

5.1.1.3 Summary

In Table 5.1 we provide a summary of the applications we leverage for our experiments in this chapter. For each application we provide the read and written data sizes, as well as the resulting application profile.

Application                 Usage             Reads     Writes     Profile

HPC:
mpiBLAST (BLAST)            Protein docking   27.7 GB   12.8 MB    Read-intensive
MOM                         Oceanic model     19.5 GB   3.2 GB     Read-intensive
ECOHAM (EH)                 Oceanic model     0.4 GB    9.7 GB     Write-intensive
Ray Tracing (RT)            Video processing  67.4 GB   71.2 GB    Balanced

BDA:
Sort                        Text Processing   5.8 GB    5.8 GB     Balanced
Connected Component (CC)    Graph Processing  13.1 GB   71.2 MB    Read-intensive
Grep                        Text Processing   55.8 GB   863.8 MB   Read-intensive
Decision Tree (DT)          Machine Learning  59.1 GB   4.7 GB     Read-intensive
Tokenizer                   Text Processing   55.8 GB   235.7 GB   Write-intensive

Table 5.1: HPC and BDA application summary

5.1.2 Experimental platform

We run experiments on the Grid'5000 (25) experimental testbed, which spans 11 sites in France and Luxembourg. For these experiments, we used the parapluie cluster in Rennes. Each node embeds 2 x 12-core 1.7 GHz 6164 HE processors, 48 GB of RAM, and a 250 GB HDD. Network connectivity is provided either by Gigabit Ethernet (MTU = 1500 B) or by 4 x 20G DDR InfiniBand. We use the former for BDA applications and the latter for HPC applications in order to match the usual configuration of each platform.

HPC applications run atop Lustre 2.9.0 and MPICH 3.2 (72), on a 32-node cluster that we configure with multiple ratios of storage-to-compute nodes. Networking is provided on each node by 4 x 20G DDR InfiniBand. BDA applications run atop Spark 2.1.0, Hadoop / HDFS 2.7.3 and Ceph Kraken using Gigabit Ethernet connectivity. The cluster is composed of the same 32 nodes.

5.2 Storage call distribution for HPC and BDA applications

In this section we demonstrate that the actual I/O calls made by both HPC and BDA applications are not incompatible with the set of features provided by state-of-the-art blob storage systems. Our intuition is that read and write calls are vastly predominant in the workloads of those applications and that other features of distributed file systems such as directory listings are rarely used, if at all.

Figure 5.1: Measured relative amount of different storage calls (file read, file write, directory, other) to the persistent file system for HPC applications, shown as a storage call ratio (%) for BLAST, MOM, EH, EH (MPI) and RT.

5.2.1 Tracing HPC applications

Figure 5.1 summarizes the relative count of storage calls performed by our set of HPC applications. The most important observation, for all four applications, is the predominance of reads and writes. Except for ECOHAM, no application performed any call to the storage system other than reading or writing files, thus confirming our first intuition. This was expected, because the MPI-IO standard does not permit any other operation. The few storage calls other than read and write (mainly extended attribute reads and directory listings) are due to the run script necessary to prepare the run and collect results after it finishes. These steps can be performed outside the I/O-heavy MPI part of the application, which results in only reads and writes being performed (EH / MPI). We conclude that the only operations performed by our set of HPC applications, namely file I/O, can be mapped to blob I/O on a blob storage system. Consequently, these applications appear to be suited to run unmodified atop blob storage.

5.2.2 Tracing BDA applications

Figure 5.2 shows the relative count of storage calls performed by our set of BDA applications to HDFS. Similar to what we observed with HPC applications, the storage calls are vastly dominated by reads and writes to files. In contrast with HPC, however, all applications also cause Spark to perform a handful of directory operations (91 in total across all our applications, out of over 10 million individual storage operations in total). These directory operations are not related to the data processing, because input / output files are accessed directly using read and write calls.

Figure 5.2: Measured relative amount of different storage calls (file read, file write, directory, other) to the persistent file system for BDA applications, shown as a storage call ratio (%) for Sort, Grep, DT, CC and Tokenizer.

Analyzing these directory operations, we note that they are related solely to (i) creating the directories necessary to maintain the logs of the application execution, (ii) listing the input files before each application runs if the input data is set as a directory, and (iii) maintaining the .sparkStaging directory. This directory is internally used by Spark to share information related to the application between nodes and is filled during the application submission. It contains application files such as the Spark and application archives, as well as distributed files (11).

We analyze in detail the directory operations performed by BDA applications. Table 5.2 shows the breakdown of all such directory operations across all applications by storage call. We note that only the input data directories are listed, meaning that Spark accesses all the other files it needs directly by their path. Consequently, a flat namespace such as the one provided by blob storage systems could probably be used.

Operation   Action                  Count
mkdir       Create directory        43
rmdir       Remove directory        43
opendir     Open / list directory   5  (input directory)
opendir     Open / list directory   0  (other directories)

Table 5.2: Spark directory operation breakdown. For comparison, all applications performed over 10 million total individual storage operations during this benchmark.

5.3 Replacing file-based with blob-based storage

In this section we demonstrate the potential of blob-based storage to suit the storage needs of both HPC and Big Data applications. To do so, we deploy each aforementioned application atop two state-of-the-art blob storage systems. We show that the performance of these applications running atop converged blob-based storage matches or exceeds that of the same applications running atop Lustre (142) for HPC, as well as Ceph (167) and HDFS (145) for Big Data.

5.3.1 Overview of the blob storage systems

We run our applications atop two state-of-the-art blob storage systems: BlobSeer (127) and Rados (169). We use only the basic blob storage functionality they provide, and do not make use of any advanced features they may support. Although their high-level designs have similarities, these two systems have different strengths and weaknesses resulting from design decisions made for each to support specific use cases.

Rados is the storage layer for Ceph. Rados has the ability to scale to thousands of hardware devices by using management software that runs on each of the individual storage nodes. The software provides features such as thin provisioning, snapshots and replication.

BlobSeer is an open-source distributed storage system that offers an interface similar to that of Rados. It is specifically designed for highly concurrent access patterns. Its architecture is based on Multi-Version Concurrency Control (MVCC), associated with decentralized metadata management based on a distributed tree. It supports atomic snapshots and replication.

We assess the performance impact of replacing file-based with blob-based storage by observing three metrics.

• The job completion time is the total execution time of the application, from submission to completion.

POSIX Call         Translated Call
create(/foo/bar)   create(/foo bar)
open(/foo/bar)     open(/foo bar)
read(fd)           read(bd)
write(fd)          write(bd)
mkdir(/foo)        Unsupported operation
opendir(/foo)      Unsupported operation
rmdir(/foo)        Unsupported operation

Table 5.3: HPC storage call translation rules on a flat namespace

• The read bandwidth and write bandwidth respectively represent the average data transferred per unit of time for read and write requests.

We collect all these metrics on the compute nodes by instrumenting the storage systems, and we aggregate the results. All the storage systems leveraged in these experiments are configured in a similar fashion. Specifically, all systems use a stripe size of 64 MB and a replication factor of 3.

5.3.2 Replacing Lustre with blob-based storage on HPC

In this section, we demonstrate how blob-based storage systems can be used to transparently support HPC applications while matching or exceeding Lustre I/O performance, by replacing the latter with both BlobSeer and Rados. We experiment using three storage-to-compute node configurations in order to ensure that our results are independent of the cluster configuration. We run the same experiments respectively with 28 compute / 4 storage nodes, 24 / 8 and 20 / 12, and average the results of 100 experiment runs. On each node, we deploy a small interceptor to redirect POSIX storage calls to the blob storage system. It is based on FUSE (153), which is supported by most Linux kernels today. This adapter translates file operations to blob operations according to Table 5.3. Directory operations are not supported, as we showed previously that they are unnecessary for HPC applications. The APIs of the blob storage systems we consider allow for a direct mapping between file-based and blob-based storage operations. We partially implement the stat function: the file size is mapped to the blob size, the permissions are set to 777, the block size and allocated block size are set to 512 bytes, and the inode number is set to the hash of the blob key. The remaining information is set to 0. Our implementation does not support symbolic or hard links, which are not needed by our applications.
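To make the translation of Table 5.3 and the partial stat emulation concrete, the following sketch shows how such an interceptor could flatten a POSIX path into a blob key and fill in the stat fields listed above. It is an illustration only, not the actual adapter: blob_size() is a hypothetical stand-in for the client API of the underlying blob store (BlobSeer or Rados), the exact separator used for flattened keys may differ, and in the real adapter a function like this would back the FUSE getattr callback.

    #include <string.h>
    #include <stdint.h>
    #include <sys/stat.h>

    /* Hypothetical client call of the underlying blob store: size in bytes
       of the blob named by `key`. */
    extern int64_t blob_size(const char *key);

    /* Flatten a POSIX path into a flat blob key, as in Table 5.3
       (e.g. "/foo/bar" becomes "/foo bar"). */
    static void path_to_key(const char *path, char *key, size_t len) {
        size_t i, j = 0;
        for (i = 0; path[i] != '\0' && j + 1 < len; i++) {
            /* keep the leading '/', flatten the remaining separators */
            key[j++] = (i > 0 && path[i] == '/') ? ' ' : path[i];
        }
        key[j] = '\0';
    }

    /* Simple 64-bit hash (FNV-1a) standing in for the key hash used as inode. */
    static uint64_t key_hash(const char *key) {
        uint64_t h = 0xcbf29ce484222325ULL;
        for (; *key; key++)
            h = (h ^ (uint8_t)*key) * 0x100000001b3ULL;
        return h;
    }

    /* Partial stat emulation: only the fields described in the text are
       filled in, everything else stays at zero. */
    static int adapter_getattr(const char *path, struct stat *st) {
        char key[256];
        path_to_key(path, key, sizeof(key));
        memset(st, 0, sizeof(*st));
        st->st_mode    = S_IFREG | 0777;   /* permissions forced to 777       */
        st->st_size    = blob_size(key);   /* file size := blob size          */
        st->st_blksize = 512;              /* fixed block size                */
        st->st_blocks  = 1;                /* minimal allocated-size report   */
        st->st_ino     = key_hash(key);    /* inode := hash of the blob key   */
        return 0;
    }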

Figure 5.3: Average aggregate throughput (GB/s) across all HPC applications for Lustre, BlobSeer and Rados, varying the compute-to-storage ratio (28/4, 24/8, 20/12), with 95% confidence intervals.

Figure 5.4: Comparison of read throughput (GB/s) for each HPC application (BLAST, MOM, EH / MPI, RT) with Lustre, BlobSeer and Rados, with 95% confidence intervals.

In Figure 5.3 we plot the average aggregate read and write bandwidth for all applications while varying the compute-to-storage node ratio. We note that, for our configuration, the 24 compute node / 8 storage node setup results in the highest bandwidth for all storage systems; hence, the following experiments are performed with that configuration. This ratio is much lower than on common HPC platforms (3:1 vs. ∼70:1 at ORNL, for instance (144)), mainly because the jobs we run are significantly more data-intensive than compute-intensive. We note from these results that blob storage systems consistently outperform Lustre in all configurations for both reads and writes; we detail these results in the following experiments.

Figure 5.5: Comparison of write throughput (GB/s) for each HPC application (BLAST, MOM, EH / MPI, RT) with Lustre, BlobSeer and Rados, with 95% confidence intervals.

Figure 5.6: Average performance improvement relative to Lustre for HPC applications using blob-based storage (average relative completion time, %), with 95% confidence intervals.

For read-intensive applications such as BLAST and MOM, this performance increase allows blob storage systems with 4 storage nodes to achieve a bandwidth comparable to Lustre's with 8 storage nodes. In Figure 5.4 we plot the average read bandwidth for each of our HPC applications with Lustre file-based storage and with BlobSeer or Rados blob-based storage. We note an average 14% reduction of the total read time when using Rados compared to Lustre. This is because, unlike Lustre and BlobSeer, Rados enables clients to locate any piece of data without prior communication with a dedicated metadata node. We plot in Figure 5.5 the average write bandwidth obtained in similar conditions.

These results show that blob storage systems still outperform Lustre in all cases except Ray Tracing, for which Rados is outperformed by Lustre. However, the performance gain is lower, at 6% on average. The result for Ray Tracing is explained by its particular access patterns, which include concurrent write accesses to files. This pattern is unfavourable for systems leveraging lock-based concurrency management such as Lustre and Rados; BlobSeer's multi-version concurrency control excels in that case. We observe that BlobSeer outperforms Rados in write performance for Ray Tracing, in contrast to the trend we observe with other applications. This is because this application tends to perform smaller writes. Lustre and Týr are both efficient at this type of workload. In the case of Týr, this is largely due to the co-location of the version metadata with the first data chunk of the blobs. In Figure 5.6 we plot the average improvement in application completion time, in which the I/O performance gains are diluted by compute operations. As expected considering the previous results, read-intensive applications exhibit the greatest decrease: BLAST and MOM show a completion time reduction of nearly 8% with both blob storage systems. In contrast, write-intensive applications such as ray tracing show a lower 3% completion time decrease with BlobSeer or Rados as the underlying storage when compared with Lustre.

5.3.3 Running BDA applications atop blob storage

In this section we run the same set of experiments for the set of Big Data applications. We demonstrate that BlobSeer and Rados significantly outperform HDFS for all applications. In order to provide an additional baseline for the performance of file systems, we also run these applications atop the Ceph (167) file system, itself based on Rados. We use the same configuration as in Section 5.2, running computation alongside storage on 32 nodes. We integrate the storage adapter for blob storage directly inside HDFS: the Hadoop installation has been modified to redirect storage calls to the blob storage systems. The translation between POSIX-like calls and flat-namespace blob operations is done using the translation rules defined in Table 5.4. We map all file operations directly to their blob counterpart. Directory operations are simulated using namespace scans; despite these being very costly, the low number of such calls relative to file operations should not significantly impact application performance when the number of files is low. We implement the CRUSH algorithm (168) to provide Spark with the physical data location on Rados and Ceph, and we use the client API capabilities to provide that information in BlobSeer.

Original operation   Rewritten operation
create(/foo/bar)     create(/foo bar)
open(/foo/bar)       open(/foo bar)
read(fd)             read(bd)
write(fd)            write(bd)
mkdir(/foo)          Dropped operation
opendir(/foo)        scan(), return all keys like /foo *
rmdir(/foo)          scan(), remove all keys like /foo *

Table 5.4: BDA storage call translation rules on a flat namespace

In Figure 5.7 we plot the average read bandwidth achieved for SparkBench. While Rados significantly outperforms all systems for all applications, we note an overall performance tie between BlobSeer and Ceph. This is because of the heavy metadata management in both cases, which heavily impacts access latency and consequently read bandwidth. The performance cost of file-based storage is highlighted by the performance difference between Ceph and Rados, upon which it is based. As with HDFS, this is mostly due to the additional communication with dedicated metadata servers required in the critical path of read requests, necessary for the hierarchy management. Figure 5.8 shows the average write bandwidth for these applications. Similar to what was observed with HPC, we note a consistent write bandwidth improvement with blob storage over HDFS and Ceph. We also note the same pattern we observed with HPC: Rados outperforms BlobSeer on read-intensive applications, whereas BlobSeer enables higher throughput on write-intensive applications. This is visible with the Tokenizer application, where lock contention due to the lack of multi-version concurrency control in Rados causes significant performance loss on concurrent write access. In Figure 5.9 we plot the relative improvement in total application completion time, in which I/O gains are diluted by computation. Running Big Data applications atop blobs improves application completion time by up to 22% compared to HDFS and 7% compared to Ceph.

Figure 5.7: Comparison of read throughput (GB/s) for each Big Data application (Sort, Grep, DT, CC, Tokenizer) with HDFS, BlobSeer, Rados and CephFS, with 95% confidence intervals.

Figure 5.8: Comparison of write throughput (GB/s) for each Big Data application (Sort, Grep, DT, CC, Tokenizer) with HDFS, BlobSeer, Rados and CephFS, with 95% confidence intervals.

For Big Data, the highest gains are obtained with read-intensive applications such as Grep and Decision Tree. Write-intensive applications such as Tokenizer also benefit from improved performance, although the gains are comparatively smaller because of the overall greater complexity of the write protocols of each storage system.

5.4 Conclusion: blob is the right storage paradigm for convergence

In this chapter, we obtained important results that confirm the applicability of blobs for HPC and BDA convergence. First, the set of applications from both HPC and

Figure 5.9: Average performance improvement relative to HDFS for Big Data applications using blob-based storage (average relative completion time, %), with 95% confidence intervals.

BDA communities almost exclusively perform file-level operations, which are very close to those provided by blob storage. Second, replacing file-based with blob-based storage does not hinder the performance of the applications. On the contrary, a performance gain of up to 32% is observed compared to file systems, mostly explained by the optimized read path that the flat namespace permits. In particular, the direct read capability and the simple, decentralized metadata management scheme of Rados excel for read performance, while the multi-version concurrency control of BlobSeer supports high write velocity for highly concurrent workloads. These results prove that, for both HPC and BDA use-cases, blobs are able to support legacy applications at high velocity. This makes them ideal candidates for storage-based convergence between HPC and BDA.

Chapter 6

Týr design principles and architecture

In the previous chapter we demonstrated that blobs are a strong candidate for answering the storage needs of both legacy HPC and BDA applications. Yet, these experiments also highlighted the limitations of state-of-the-art systems for such use-cases. While Rados performance is excellent under low write contention, its lock-based concurrency control limits its throughput in highly concurrent use-cases. Conversely, the multi-version concurrency control of BlobSeer supports its performance under high concurrency, but the distributed metadata tree that is core to its design induces a significant read latency that hinders application performance.

In addition, while these blob systems answer the needs of legacy applications, a range of advanced use-cases for BDA applications require stronger consistency semantics, ensuring the correctness of multi-object operations. This is notably the case of data indexing and aggregation (Chapter 9), or the synchronization of application data and metadata (Chapter 12). The latter use-case is illustrated by the design of the Warp Transactional Filesystem (WTF) (59), which relies on transactions for metadata operations, or the reliance of HBase (65) on ZooKeeper (80) for synchronization purposes.

In this chapter, we propose the design of Týr, a blob storage system designed from the ground up to support multi-object synchronization with transactional semantics. Týr notably combines the predictable data distribution that is central to the low latency of Rados with the multi-version concurrency control that is key to the high performance of BlobSeer under high write contention.

6.1 Key design principles

In this section, we summarize the key design principles that lay the foundation of Týr in support of HPC and BDA convergence. Its key features are:

Predictable data distribution. Similarly to Rados (169) and a number of key-value stores (55; 86; 148), Týr enables clients to access any piece of data without prior communication with any metadata node. This ensures low read and write latency by eliminating the remote metadata fetches required in BlobSeer (127). We detail this feature in Section 6.2.

Transparent multi-version concurrency control. Multi-version concurrency con- trol isolates readers from concurrent writers. Essentially, it enables writers to modify a copy of the original data while enabling concurrent readers to access the most recent version. Each user connected to the database sees a snapshot of the database at a particular instant in time. Any changes made by a writer will not be seen by other clients until the changes have been completed. This approach proves critical to the performance of BlobSeer under high contention. Contrary to BlobSeer, T´yr does not expose multiple versions to the reader, keeping this complexity hidden from the clients. We detail version management in Section 6.3.

ACID transactional semantics. Týr offers built-in ACID transactions as a way to ensure the consistency and correctness of multi-object operations. The lightweight transactional protocol is core to the internals of Týr. It notably ensures Týr's linearizability, and supports its replication-based fault tolerance as well as the internal version garbage collection. We detail Týr's transactional semantics in Section 6.4.

Atomic transform operations. Atomic transform operations enable users to perform simple binary data manipulation in place, with a single round-trip between the client and the storage cluster. This includes binary additions, multiplications or bitwise operations. Such operations are key to efficient data aggregation (Chapter 9). They are supported by the internal transactional protocol of Týr. We detail these operations in Section 6.5.

6.2 Predictable data distribution

The data are distributed in the cluster using a combination of consistent hashing and data striping techniques.

Consistent hashing (83) is a well-known technique for distributing data in a cluster. The unique key associated with each object is hashed with a function that is common to all nodes in the cluster. The hashed value determines the server responsible for the data with this key; each node in the cluster is responsible for a part of the whole hash value range. The benefit of this technique is twofold. First, it distributes the load across the whole cluster and helps alleviate potential hot spots when the number of objects is high enough. Second, it obviates the need for a centralized metadata server. Indeed, any node in the cluster is able to locate any piece of data with only knowledge of the mapping between nodes and hash value ranges. Furthermore, when this mapping is shared with the clients, they are able to send requests directly to the nodes holding the value they seek to read or write. This eliminates unnecessary hops for pinpointing the location of data in the cluster. Such a feature is key to the low latency of key-value stores such as Chord (151), Dynamo (55) or Riak (86).

Data striping complements this data distribution. Each blob in the system is split into chunks of fixed size, which are distributed over all nodes of the system using consistent hashing. I/O is thereby spread over a larger number of nodes, supporting the scalability and performance of the system. This principle also enables storing objects whose size exceeds the capacity of a single machine.

With a chunk size s, the first chunk c1 of a blob contains the bytes in the range [0, s); the second chunk c2, possibly stored on another node, contains the bytes in the range [s, 2s); and chunk cn contains the bytes in the range [(n − 1) · s, n · s). Because chunks are located in the cluster using consistent hashing, their size must be fixed.
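A small helper makes this mapping explicit. It is a sketch only: the 64 MB chunk size mirrors the default configuration used elsewhere in this thesis, and the names are illustrative.

    #include <stdint.h>

    #define CHUNK_SIZE (64ULL * 1024 * 1024)   /* fixed chunk size s (64 MB here) */

    /* Index of the chunk holding byte `offset` of a blob: chunk c_n covers
       the byte range [(n-1)*s, n*s), i.e. indices are 1-based as in the text. */
    static uint64_t chunk_index(uint64_t offset) {
        return offset / CHUNK_SIZE + 1;
    }

    /* First and last chunk touched by a read or write of `len` bytes at `offset`. */
    static void chunk_span(uint64_t offset, uint64_t len,
                           uint64_t *first, uint64_t *last) {
        *first = chunk_index(offset);
        *last  = (len == 0) ? *first : chunk_index(offset + len - 1);
    }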

Chunks are distributed in the cluster as follows. Given a hash function h(x), the output range [h_min, h_max] of the function is treated as a circular space, or ring (h_min wrapping around to h_max). Each node is assigned a different random value within this range, which represents its position on the ring. For any given chunk n of a blob k, a position on the ring is calculated by hashing the concatenation of k and n using h(k : n). The primary node holding the data for a chunk is the first one encountered while walking the ring past this position. Additional replicas are stored on servers determined by continuing to walk the ring until the proper number of nodes is found.
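The placement rule can be sketched as follows. The hash function, the node table and the ring size are illustrative stand-ins: a real implementation would use the cluster's shared hash function and its current ring state.

    #include <stdint.h>

    #define RING_NODES  8          /* illustrative cluster size        */
    #define REPLICATION 3          /* number of replicas per chunk     */

    /* Ring position of each node, assigned at random when it joins
       (kept sorted here for simplicity). */
    static const uint64_t ring[RING_NODES] = {
        0x1000, 0x2A00, 0x47FF, 0x6100, 0x8C30, 0xA000, 0xC555, 0xE9AA
    };

    /* Illustrative hash standing in for h(k : n), the hash of the
       concatenation of the blob key and the chunk number. */
    static uint64_t h(const char *key, uint64_t chunk) {
        uint64_t v = 0xcbf29ce484222325ULL;
        for (; *key; key++)
            v = (v ^ (uint8_t)*key) * 0x100000001b3ULL;
        v ^= chunk;
        return (v * 0x100000001b3ULL) & 0xFFFF;   /* shrink to the toy ring space */
    }

    /* Primary node: first node encountered while walking the ring past the
       hashed position; replicas follow on the next nodes of the ring. */
    static void locate(const char *key, uint64_t chunk, int out[REPLICATION]) {
        uint64_t pos = h(key, chunk);
        int first = 0;
        while (first < RING_NODES && ring[first] < pos)
            first++;                               /* walk past `pos`            */
        for (int r = 0; r < REPLICATION; r++)
            out[r] = (first + r) % RING_NODES;     /* wrap around h_max -> h_min */
    }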

6.3 Transparent multi-version concurrency control

Versioning is the core principle we use for data management; a new data version is generated each time a blob is written to. This yields the following advantages:

High throughput under heavy access concurrency. Compared to systems where locks are used to synchronize concurrent access to a given piece of data, versioning offers significant performance advantages. Once written, a data version is considered immutable. As such, the data is never overwritten; instead, writes operate on a copy of the current version. The main benefit of this approach is that readers are not impacted by concurrent writers, and do not need to wait for conflicting write operations to finish. This contrasts with lock-based storage systems such as Rados, which typically offer degraded read throughput under heavy access concurrency because of lock contention.

Enhanced fault tolerance. A significant challenge when data is not versioned arises in the presence of failures. Indeed, should a fault occur during a write, the data could be left in an inconsistent state, rendering data recovery difficult. In contrast, since versioning ensures that writes only operate on a copy of the data, the previous version can be restored and the incomplete write discarded.

We propose to handle versioning at the chunk level, in contrast with traditional implementations of multi-version concurrency control. This obviates the need for centrally generating sequential version numbers, which are normally used to determine the ordering of successive versions. Generating such successive version numbers in a distributed context is a difficult problem; the common solutions either rely on a heavyweight consensus algorithm such as Paxos (95), or come at the cost of fault tolerance when a single sequencer is used, as in BlobSeer.

Figure 6.1: Týr versioning model. When a version v2 of the blob (k, v1) is written which only affects chunk c2, only the versions of the blob and of c2 are changed; the version ids of c1 and c3 remain unchanged. Sequential version numbers are used for simplicity.

Figure 6.2: Version management example. Version v1 of this blob only affected chunk c1, v2 affected both c1 and c2. In this example, v6 is composed of the chunk versions (v4, v3, v6). This versioning information is stored on the blob's metadata nodes. Sequential version numbers are used for simplicity.

In contrast, we propose to use non-sequential numbers for versioning. The ordering of successive versions of any given chunk is retained at each node storing a replica of this chunk. The blob version identifier is the same as the most recent version identifier of its chunks, as illustrated in Figure 6.1. For any given write operation, only the nodes holding affected data chunks will receive information regarding this new version. Consequently, the latest version of a blob is composed of a set of chunks with possibly different version identifiers, as illustrated by Figure 6.2. To be able to read a consistent version of any given blob, information regarding all successive versions of all chunks composing a blob is stored on the same nodes holding replicas of the first data chunk of the blob. These nodes are called the version managers for the blob. Co-locating the first chunk of a blob and its version managers enables faster writes to the beginning of blobs. It also avoids using any metadata registry for version management.
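The bookkeeping performed by a version manager can be pictured with a small data structure: for every committed blob version, it records which version identifier of each chunk belongs to that snapshot. The structure below is purely illustrative (fixed-size arrays and field names stand in for the real persistent metadata).

    #include <stdint.h>

    #define MAX_CHUNKS   16        /* illustrative bound on chunks per blob   */
    #define MAX_VERSIONS 32        /* illustrative bound on retained versions */

    /* One committed version of a blob, as tracked by its version managers:
       a (non-sequential) blob version id plus the version id of every chunk
       composing this snapshot.  Chunks untouched by a write keep the version
       id they had in the previous snapshot (cf. Figure 6.2). */
    struct blob_version {
        uint64_t blob_vid;                   /* id of this blob version       */
        uint64_t chunk_vid[MAX_CHUNKS];      /* chunk_vid[i] = version of c_i */
    };

    /* Version history of one blob, ordered from oldest to newest.  Reading a
       consistent snapshot amounts to picking one entry and fetching each
       chunk c_i at exactly chunk_vid[i]. */
    struct blob_versions {
        int                 nchunks;
        int                 nversions;
        struct blob_version history[MAX_VERSIONS];
    };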

6.4 ACID transactional semantics

ACID transaction support is the key feature driving the design of Týr. Transactions provide the user with an expressive way to indicate the relation between a set of queries, and to ensure the correctness of the resulting operation, which is either fully applied or rejected as a complete unit. We argue that co-designing a storage system with its transaction processing provides the following features without requiring additional protocols:

Consistent data replication. Replicating an object for fault tolerance is a significant challenge for distributed systems that do not rely on a centralized metadata management service. Indeed, all the replicas must agree on an ordering between the successive writes. A common method for doing so is to rely on logical clocks, or on arbitrary ordering based, for example, on the local time at which the write was applied on the server receiving the operation from the client. Transactions inherently enable the nodes holding the replicas of any given chunk to agree on a serializable order. From the point of view of the storage system, multiple replicas of a chunk are considered as different objects to which the same operation is applied in the context of a transaction.

Version bookkeeping. A significant challenge in versioned systems is to determine when it is safe to prune old versions of a chunk that are not used anymore. We propose to rely on the transaction algorithm so that all metadata nodes for a blob are able to agree on the oldest version of the blob that must be kept, and to piggyback that information to the nodes storing chunk replicas of this blob.

Yet, designing an efficient transaction commit protocol in a distributed environment is hard. Traditional approaches used by relational database systems typically rely on centralized transaction managers aimed at coordinating transaction execution in a cluster. However, this approach has three main drawbacks. First, it creates a single point of failure in the system as well as a potential hotspot, hence defeating the benefits of distributed metadata management for write scalability. Second, the cost of such transaction ordering is paid even when transactions do not conflict with each other, hence hindering the write performance of the cluster. Finally, some form of complex, heavyweight static analysis is usually required to ensure that no long-running transaction is prioritized over smaller ones and hence blocks them from committing.

To alleviate these issues, a new generation of distributed storage systems such as Google's Megastore (50), Google's Spanner (50) or Scatter (67) rely on a Paxos protocol (95) to provide such synchronization, assigning objects to different Paxos groups depending on their key. Besides the inner complexity of the Paxos protocol, this scheme incurs significant overhead for operations spanning objects assigned to different groups. Furthermore, as with centralized transaction managers, the cost of such synchronization impacts all transactions, including those for which it would not be required. Techniques exist to reduce this overhead, but they also typically rely on some form of complex static analysis, as in Rococo (125) or Lynx (70).

In contrast with those approaches, the Warp transaction chain protocol (61) introduced in the HyperDex key-value store (61) completely decentralizes transaction processing, so that no transaction manager or predefined consensus groups are required. It enables transactions to be ordered on the fly, delaying synchronization until a conflict is detected. This significantly reduces transaction processing costs in cases where applications are specifically designed to prevent such conflicts from happening, as with many legacy HPC applications. Furthermore, this protocol allows for lock-free reads which, combined with the predictable data distribution, ensure a very low latency for read operations in most cases. Besides their low overhead, transaction chains also ensure the scalability of the system, which is critical for both HPC and BDA applications. Indeed, Warp enables transactions to involve only the servers that are impacted by the transaction (that is, servers holding chunks that are read from or written to in the context of that transaction). This eliminates hot spots in the system, in addition to reducing communication costs.

6.5 Atomic transform operations

Transaction-based coordination provides the user with a simple way to express interdependency between related storage operations. However, for simple common patterns, their cost can be further reduced. This is notably the case for read-modify-write operations, which require two round-trips between the client and the server: one for reading the current value of the data, and a second to optimistically apply the modifications to the storage. There are two main issues with this. First, the two round-trips increase the operation latency. Second, the transaction would abort should a conflicting transaction commit between the read and write operations, further increasing latency and leaving it up to the developer to handle retries when required. This problem is well known and has been studied extensively in the database and shared memory literature (22; 135; 154). Two common alternatives to transaction processing exist for simple read-modify-write use-cases:

Conditional primitives such as compare-and-swap (CAS) provide a solid building block for higher-level operations. Yet, while they effectively decouple the read and write operation from the point of view of the storage system, they yield little to no performance gains when compared to transactions. This is because they still require a query to read the current value of the data, and because a failing CAS operation also requires the developer to manually handle failure scenarios. Nevertheless, they proved to be a solid alternative to lock-based synchronization in HPC systems (43).

Atomic transform primitives provide a way for developers to transfer the responsibility of the data modification to the storage system. This is particularly interesting when the transform operation is simple, such as arithmetic or bitwise operations. With this scheme, the client does not communicate to the storage cluster the new data to be written, but instead the modification to be applied to the current data. This significantly cuts communication costs by enabling such operations to be performed with a single round-trip. In addition, because the operation is applied atomically by the storage system, the probability of aborts due to conflicting operations is significantly reduced. Should such an abort happen, the burden of retrying is transferred to the storage system, which eases the task of the application developer.
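From the client's perspective, such a primitive could look like the call below. The function name and signature are hypothetical, not part of any existing API; the point is only that the client ships the operation and its operand rather than the new value, and that a single round-trip suffices.

    #include <stdint.h>

    /* Hypothetical atomic-transform call: ask the storage system to add
       `delta` to the 64-bit integer stored at `offset` inside blob `key`.
       The read-modify-write happens server side, in one round-trip. */
    extern int blob_atomic_add64(const char *key, uint64_t offset, int64_t delta);

    /* Typical aggregation pattern: bump a per-slot counter without reading
       it first and without handling retries in the application. */
    static int count_event(const char *counter_blob, uint64_t slot) {
        return blob_atomic_add64(counter_blob, slot * sizeof(int64_t), 1);
    }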

Despite their benefits for use-cases such as data aggregation, neither of these solutions has been adopted in HPC storage systems to date. The main reason is that current HPC file systems are heavily constrained by the POSIX I/O API, which does not include such operations. Furthermore, efficient comparison or conditional checks are lacking in the local storage abstractions that form the foundation of most HPC storage systems (43). We advocate that the transactional foundation of the storage system we propose provides an ideal base for supporting these operations. Indeed, transaction protocols already include such conditional checks. They can be extended to support both CAS and atomic transform primitives without additional overhead, ensuring the consistency of such operations during the commit phase.

Chapter 7

Týr protocols

Designing the protocols enabling Týr users to read and write data is a difficult task. Indeed, the core design principles of data chunking and multi-version concurrency control raise additional challenges compared to other transactional storage systems. A key reason for this difficulty lies in the size of the objects being stored. Transactional key-value stores typically only support values of up to a few megabytes, which can be entirely stored on a single node. In transactional file systems such as the Warp Transactional Filesystem, which do not offer direct read access to the data, the scope of transactions is limited to metadata operations, in which the objects are small. A naive approach consists of considering the multiple chunks that compose a single blob as independent objects, and of applying existing transactional protocols to coordinate write operations. Yet, such an implementation would cause significant problems for read operations spanning multiple chunks, which would contradict the following two requirements:

Consistent multi-chunk reads: When serving reads spanning multiple chunks, it is important to ensure that the versions of all chunks that are returned together form a consistent version of the blob. This must be guaranteed even while processing concurrent, conflicting write operations. As such, the versions of each chunk cannot be considered independently from one another.

Repeatable reads: This isolation level is particularly desirable for large objects, whose potentially large size prevents them from being read entirely with a single read operation. Essentially, repeatable reads guarantee that, in the context of a transaction, all read operations to the same object will return data from the same consistent blob version. This is particularly challenging in the presence of conflicting write operations, as the latest version of the blob can change between two read operations. While multi-version concurrency control solves part of the issue by ensuring that older versions are never overwritten, it does not solve all problems. In particular, the storage system should ensure that an older chunk version is only deleted when it is safe to do so, i.e., when no running transaction could potentially require that version of the chunk.

We solve these challenges by carefully designing the read and write protocols of Týr, which ensure that these two requirements are met.

7.1 Lightweight transaction protocol

Write coordination in Týr leverages the Warp optimistic transaction protocol, whose correctness and performance have been proven in (61). This section provides an overview of Warp, upon which Týr is built.

In order to commit a transaction, the client constructs a chain of servers which will be affected by it. These nodes are all the ones storing the written data chunks, and one node holding the data for each chunk read during the transaction (if any). This set of servers is sorted in a predictable order, such as a bitwise ordering on the IP / port pair. The ordering ensures that conflicting transactions pass through their shared set of servers in the exact same order. The client addresses the request to the first node of the chain, designated as the coordinator for that request. This node will validate the chain and ensure that it is up-to-date according to the latest ring status. If not, that node will construct a new chain and forward the request to the coordinator of the new chain.
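The deterministic ordering of the chain can be sketched as below, with each participant identified by its IP address and port; the comparison implements the "bitwise ordering on the IP / port pair" mentioned above, and everything else (types, sizes) is illustrative.

    #include <stdint.h>
    #include <stdlib.h>

    /* A chain participant: a node storing a chunk written by the
       transaction, or one replica of a chunk it reads. */
    struct chain_node {
        uint32_t ip;      /* IPv4 address, host byte order */
        uint16_t port;
    };

    /* Bitwise ordering on the (IP, port) pair.  Because every client sorts
       with the same comparator, two conflicting transactions traverse their
       shared servers in the same order. */
    static int chain_cmp(const void *a, const void *b) {
        const struct chain_node *x = a, *y = b;
        if (x->ip != y->ip)     return (x->ip < y->ip) ? -1 : 1;
        if (x->port != y->port) return (x->port < y->port) ? -1 : 1;
        return 0;
    }

    /* Sort the participants; the first element becomes the coordinator the
       client sends the commit request to. */
    static void build_chain(struct chain_node *nodes, size_t n) {
        qsort(nodes, n, sizeof(*nodes), chain_cmp);
    }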

Commit protocol. A linear transaction commit protocol guarantees that all transactions either succeed and are serializable, or abort with no effect. This protocol consists of one forward pass to optimistically validate the values read by the client and ensure that they remained unchanged by concurrent transactions, followed by a backward pass to propagate the result of the transaction – either success or failure – and actually commit the changes to memory. Dependency information is embedded by the nodes in the chain during both forward and backward passes to enforce a serializable order across all transactions. A background garbage collection process limits the number of dependencies by removing those that have completed both passes.

Figure 7.1: Warp lightweight transaction protocol overview: (a) propagation graph, (b) dependency graph. In this example, the directionality of the edge (t1, t2) will be decided by s4, the last common server in the transaction chains, during the backwards pass. Similarly, the directionality of (t2, t3) will be decided by s4.

Transaction validation. The coordinator node does not necessarily own a copy of all the chunks read by a transaction, which are distributed across the cluster. As such, one node responsible for each chunk being read in any given transaction must validate it by ensuring that this transaction neither conflicts with nor invalidates previously validated transactions for which the backward pass is not complete. Every node in the commit chain ensures that transactions do not read values written by, or write values read by, previously validated transactions. Nodes also check each value against the latest one stored in their local memory to verify that the data was not changed by a previously committed transaction. The validation step fails if a transaction fails either of these tests. A transaction is aborted by sending an abort message backwards through the chain members that previously validated it; these members remove the transaction from their local state, enabling other transactions to validate instead. Servers validate each transaction exactly once, during the forward pass through the chain. As soon as the forward pass is completed, the transaction may commit on all servers. The last server of the chain commits the transaction immediately after validating it, and sends the commit message backwards along the chain.

Transaction serialization. Enforcing a serializable order across all transactions requires that the transaction commit order does not create any dependency cycles. To this end, a local dependency graph across transactions is maintained at each node, with the vertices being transactions and each directed edge specifying a conflicting pair of transactions. A conflicting pair is a pair of transactions where one transaction writes at least one data chunk read or written by the other. Whenever a transaction validates or commits after another one at a node, this information is added to the transaction message sent through the chain: the second transaction will be recorded as a dependency of the first. This determines the directionality of the edges in the dependency graph. A transaction is only persisted in memory after all of its dependencies have committed, and is delayed at the node until this condition is met.

Figure 7.1 illustrates this protocol with an example set of conflicting transaction chains and the associated dependency graph. This example shows three transaction chains executing. Figure 7.1a shows the individual chains with the server on which they execute. The black dot represents the coordinating server for each transaction, the plain lines the forward pass, and the dashed lines the backwards pass. Because they overlap at some servers, they form conflicting pairs as shown on the dependency graph in Figure 7.1b. The directionality of the edges will be decided by the protocol, as the chain is executed.

7.2 Handling reads: direct, multi-chunk and transactional protocols

As per our requirements, the read protocol of Týr needs to ensure the correctness of consistent reads spanning multiple chunks, while providing repeatable-read isolation in the context of a transaction and direct reads whenever possible. Three cases need to be considered:

Direct reads. This protocol is used outside a transaction, when the read operation can be served from a single chunk.

Multi-chunk reads. This protocol is used outside of a transaction, when the read operation spans multiple chunks. It ensures that the client is returned a consistent view of the blob.

Transactional, repeatable reads. This protocol ensures that, inside a transaction, the client is able to issue successive read operations on the same blob, potentially at different offsets. Each of these individual read operations must be performed on the same blob version that was accessed by the first read operation.

7.2.1 Direct reads

Direct reads are a desirable feature at the core of the performance of key-value stores. They enable the client to address a read request directly to a node of the storage system that holds the requested data, without any prior communication with any metadata server. The client leverages the information it holds about the distribution of the data in the cluster to first locate a node holding the desired data, by hashing the blob key and chunk number (calculated from the data offset). The read request is sent directly to that node, which responds with the data from the latest committed version of the chunk as per its internal dependency graph. Should the request have been sent by the client to a wrong node in the first place, the node receiving the request uses its own knowledge of the data distribution to forward the request to the correct node, which returns the requested data directly to the client. Such errors may happen if the token ring has changed since the last time the client updated this information. To avoid further errors, the new state of the token ring is piggybacked on the response message so that the client can update it without issuing further requests. This is the most efficient protocol Týr implements, involving only a single node of the storage system. However, it is not applicable to multi-chunk reads, as none of the servers involved has information about the set of chunk versions that forms a consistent view of the blob. For the same reason, it cannot offer repeatable reads inside a transaction. As such, the Týr client only uses this protocol for reads spanning exactly one chunk, performed outside the context of a transaction.
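On the client side, a direct read therefore boils down to computing the chunk index from the offset, locating a node that holds it, and sending the request there. The sketch below is illustrative only: locate_replicas() and send_read() are hypothetical placeholders for the ring lookup of Section 6.2 and the actual client/server exchange.

    #include <stdint.h>
    #include <stddef.h>

    #define CHUNK_SIZE (64ULL * 1024 * 1024)

    /* Hypothetical helpers: ring lookup as described in Section 6.2, and a
       plain request/response exchange with one storage node. */
    extern int locate_replicas(const char *key, uint64_t chunk, int out[3]);
    extern int send_read(int node, const char *key, uint64_t chunk,
                         uint64_t off_in_chunk, void *buf, size_t len);

    /* Direct read: valid only when the requested range lies inside a single
       chunk and the read happens outside a transaction. */
    static int direct_read(const char *key, uint64_t offset,
                           void *buf, size_t len) {
        uint64_t chunk = offset / CHUNK_SIZE + 1;          /* 1-based index */
        if (len == 0 || (offset + len - 1) / CHUNK_SIZE + 1 != chunk)
            return -1;              /* spans chunks: use the other protocols */
        int replicas[3];
        locate_replicas(key, chunk, replicas);
        /* Any replica can serve the latest committed version; a wrong guess
           is transparently forwarded by the receiving node. */
        return send_read(replicas[0], key, chunk,
                         offset % CHUNK_SIZE, buf, len);
    }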

7.2.2 Multi-chunk reads

When reading a portion of a blob which overlaps multiple chunks, it is necessary to obtain the version identifiers of each of those chunks forming a consistent version of the blob, as previously explained in Section 6.3. Only the version manager nodes for the blob to be read hold complete information about the chunk versions that form a consistent blob version.

Figure 7.2: Týr read protocol for a read spanning multiple chunks. The client sends a read query for chunks c1 and c2 to the version manager, which relays the query to the servers holding the chunks with the correct version information.

As such, the protocol used for multi-chunk reads leverages these nodes to determine which versions of each accessed chunk to read from. To perform a multi-chunk read, the client directs its request to any of the version manager nodes for the blob. Similarly to the direct read protocol, these nodes are located using the client's knowledge of the token ring, the version managers being the same servers holding the data for the first chunk of the blob. From the version information it holds about the blob, the version manager determines, for the most recent committed version of the blob, the versions of each of the chunks accessed by the read operation. It also uses its own knowledge of the token ring to locate the servers holding the data for these chunks, selecting one at random. The read request is forwarded in parallel to each of these servers, including in the request the exact version identifier to read from. Each server replies directly to the client with the requested data, and the client reconstructs the data in memory as it receives the responses for each chunk. The protocol is illustrated by Figure 7.2. By instructing the servers holding chunk data to serve a specific version of each chunk, this protocol ensures that the client is returned a consistent version of the blob. Yet, it does not allow for repeatable reads. As such, the client selects this protocol for multi-chunk reads that are executed outside of the context of a transaction.
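Server side, the version manager's role in this protocol is essentially a lookup followed by a fan-out. A minimal sketch of that step follows, with illustrative types; latest_committed_version(), chunk_version_in() and forward_read() are hypothetical stand-ins for the version metadata lookup and the messages of Figure 7.2.

    #include <stdint.h>

    struct chunk_read { uint64_t chunk; uint64_t version; };

    /* Hypothetical: latest committed blob version known to this version
       manager, the chunk version belonging to that blob version, and a
       message forwarding a versioned chunk read to the node holding the
       chunk (which then replies directly to the client). */
    extern uint64_t latest_committed_version(const char *key);
    extern uint64_t chunk_version_in(const char *key, uint64_t blob_vid,
                                     uint64_t chunk);
    extern void     forward_read(const char *key, struct chunk_read r,
                                 int client);

    /* Multi-chunk read: pin one committed blob version, then forward each
       chunk request tagged with the exact chunk version to read from. */
    static void serve_multichunk_read(const char *key, const uint64_t *chunks,
                                      int nchunks, int client) {
        uint64_t vid = latest_committed_version(key);
        for (int i = 0; i < nchunks; i++) {
            struct chunk_read r = {
                .chunk   = chunks[i],
                .version = chunk_version_in(key, vid, chunks[i]),
            };
            forward_read(key, r, client);   /* sent in parallel in practice */
        }
    }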

7.2.2.1 Transactional, repeatable reads

In the context of a transaction, a further modification of the multi-chunk read protocol is required to provide repeatable reads. The transactional read algorithm is essentially a combination of the multi-chunk and direct read protocols, combining the correctness of the former for the first read request with the simplicity and velocity of the latter for subsequent requests.

Figure 7.3: Týr read protocol inside a transaction. The client sends a read query for chunks c1 and c2 to the version manager, which relays the query to the servers holding the chunks with the correct version information and responds to the client with a snapshot of the latest chunk versions. A subsequent read on c3 is addressed directly to the server holding the chunk data, using the aforementioned chunk version information.

The first read request is processed as in the multi-chunk read protocol: the request is transmitted to a version manager, which in turn forwards it with the appropriate version identifiers to the servers holding the requested data. A key difference enabling repeatable reads is that the version manager node returns to the client the whole set of version identifiers of the chunks that together form the last committed blob version. This version information is cached by the client in the context of the transaction. Upon executing subsequent read requests for the same blob, the client extracts from that cached version information the version identifiers of each chunk to be accessed. The data is then requested directly from the servers holding the chunk data, as in the direct read protocol, including in the request message the version identifier to read from. The whole process is illustrated by Figure 7.3. All this version management is done internally by the client and is kept hidden from the application. This protocol enables the client to read the same consistent version of the blob even in the presence of concurrent writes to the chunks being accessed.
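On the client side, repeatable reads then reduce to caching that snapshot for the duration of the transaction, along the lines of the sketch below. All names are illustrative, the two transport calls are hypothetical placeholders for the message exchanges of Figure 7.3, and bounds checks are omitted for brevity.

    #include <stdint.h>
    #include <string.h>

    #define MAX_CHUNKS 16

    /* Per-transaction snapshot of one blob: filled by the version manager's
       reply to the first read, reused for every later read of that blob. */
    struct blob_snapshot {
        char     key[256];
        int      valid;
        uint64_t chunk_vid[MAX_CHUNKS];   /* chunk_vid[i] = version of chunk i+1 */
    };

    /* Hypothetical transports for the two read paths of Figure 7.3. */
    extern int read_via_version_manager(const char *key, uint64_t chunk,
                                        void *buf, uint64_t *vids_out);
    extern int read_direct_versioned(const char *key, uint64_t chunk,
                                     uint64_t version, void *buf);

    static int tx_read_chunk(struct blob_snapshot *s, const char *key,
                             uint64_t chunk, void *buf) {
        if (!s->valid || strcmp(s->key, key) != 0) {
            /* First read of this blob in the transaction: go through a
               version manager and keep the returned version snapshot. */
            int rc = read_via_version_manager(key, chunk, buf, s->chunk_vid);
            strncpy(s->key, key, sizeof(s->key) - 1);
            s->key[sizeof(s->key) - 1] = '\0';
            s->valid = 1;
            return rc;
        }
        /* Later reads: contact the chunk holder directly, pinned to the
           version recorded in the snapshot. */
        return read_direct_versioned(key, chunk, s->chunk_vid[chunk - 1], buf);
    }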

7.3 Handling writes: transactional protocol, atomic transforms

The complexity of version management is taken care of by the read protocol. As such, the basic write protocol of Týr is largely based on the unmodified Warp protocol. We introduce two modifications that are necessary for keeping the metadata managers up to date and for supporting in-place, atomic transform operations.

7.3.1 General protocol

The version managers of a blob have complete information about the successive versions of the blob. Thus, the version managers of any given blob have to be made aware of any write to this blob. We achieve this by systematically including the version managers in the commit chain, in addition to the nodes holding the chunk data. The version managers of all the blobs being written to in a given transaction are ordered using the same algorithm as the rest of the chain. This ensures that, during the backward pass of the transaction commit protocol, the transaction will have successfully committed on every chunk storage node before it is marked as committed on the version manager nodes. Version managers are inserted at the beginning of the chain; they are as such the first nodes informed of a transaction during the forward pass, and the last nodes informed of a successfully committed transaction during the backwards pass. The latter fact is key to the correctness of both the multi-chunk and transactional read protocols: it guarantees that a version manager always returns versions that have also been marked committed by the nodes holding the chunk data.
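In terms of the chain construction sketched in Section 7.1, this modification amounts to sorting the two groups separately and concatenating them, version managers first. A minimal illustration follows (the chain_node type and comparator are repeated here for self-containment; deduplication of nodes appearing in both groups is omitted).

    #include <stdlib.h>
    #include <string.h>
    #include <stdint.h>

    struct chain_node { uint32_t ip; uint16_t port; };

    static int chain_cmp(const void *a, const void *b) {
        const struct chain_node *x = a, *y = b;
        if (x->ip != y->ip)     return (x->ip < y->ip) ? -1 : 1;
        if (x->port != y->port) return (x->port < y->port) ? -1 : 1;
        return 0;
    }

    /* Build the commit chain: the version managers of every blob written by
       the transaction come first (sorted), then the chunk holders (sorted).
       The forward pass runs left to right and the backward pass right to
       left, so the version managers learn of the final commit last. */
    static size_t build_write_chain(struct chain_node *vms, size_t nvm,
                                    struct chain_node *chunks, size_t nck,
                                    struct chain_node *out) {
        qsort(vms, nvm, sizeof(*vms), chain_cmp);
        qsort(chunks, nck, sizeof(*chunks), chain_cmp);
        memcpy(out, vms, nvm * sizeof(*vms));
        memcpy(out + nvm, chunks, nck * sizeof(*chunks));
        return nvm + nck;
    }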

Correctness: A transaction t1 conflicts with another transaction t2 only if it reads a chunk written to by t2, if it writes to a chunk read by t2, or if both write to a common set of chunks. The chain commit protocol enables the first two cases to be detected during the forward pass, regardless of the chain ordering. Placing the ordered list of version managers at the beginning of the chain does not break the correctness of the chain commit algorithm: if t1 and t2 write to a common set of blob chunks, the version managers for these blobs will be included in both commit chains, sorted in the same order, and will be inserted at the beginning of each. Consequently, two conflicting transactions keep the same relative chain order, as required by the commit protocol.

7.3.2 Atomic transforms

Atomic transforms provide a convenient way to express simple binary operations, such as arithmetic or bitwise operations. They operate on a binary value of fixed size located at a specific offset of the blob. The size of the binary value depends on the operation applied, but typically does not exceed 64 bits. We modify the write protocol to support atomic operations. Two cases are possible:

The binary value to be modified is contained in a single chunk. Handling of this case leverages the consistent ordering natively provided by the Týr transaction protocol. During the forward pass, an atomic update operation is handled as any write operation. During the backwards pass, all servers holding a replica of the binary value to be modified apply the atomic operation. As per multi-version concurrency control, the operation is applied on a new version of the chunk.

The binary value to be modified spans multiple chunks. Applying the operation in this context poses an additional challenge, as the value to be modified is not fully contained on a single machine. When such a case is detected, Týr internally transforms the operation into a two-step read-update-write operation that is performed by the storage server on behalf of the user. In the context of the user transaction, the transaction coordinator issues a read request for the binary value to be modified, and inserts in the transaction a write operation with the updated value. Contrary to the general case, this succession of a read and a write operation can cause the transaction to fail under concurrent, conflicting write accesses. Should this happen, the operation is automatically and transparently retried by the transaction coordinator.

Atomic operations on binary values spanning multiple chunks incur a significant write overhead because of the required two-step operation and the possible retries. Yet, this case should be rare in practice, because of the relatively large size of a data chunk (64 MB by default) compared to that of the binary value (up to 64 bits). The application developer can eliminate this overhead entirely by ensuring that atomic operations are always aligned to 64 bits.
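As an aside, an application can check up front whether a given offset will hit this slow path; a tiny helper, with an illustrative 64 MB chunk size:

    #include <stdint.h>

    #define CHUNK_SIZE (64ULL * 1024 * 1024)

    /* True if a `len`-byte value at `offset` is fully contained in one chunk.
       With len == 8 and 8-byte-aligned offsets this always holds, which is
       why aligning atomic operands to 64 bits avoids the two-step fallback. */
    static int atomic_fits_in_chunk(uint64_t offset, uint64_t len) {
        return offset / CHUNK_SIZE == (offset + len - 1) / CHUNK_SIZE;
    }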

7.4 Bookkeeping: purging expired versions

Týr uses multi-version concurrency control as part of its base architecture in order to handle lock-free read / write concurrency. Týr also uses versioning in support of the read protocol, specifically to achieve write isolation and to ensure that a consistent version of any blob can be read even in the presence of concurrent writes. A background process called the version garbage collector is responsible for continuously removing unused chunk versions on every node of the cluster. A chunk version is defined as unused if it is not part of the latest blob version, and if no version of the blob it belongs to is currently being read as part of a transaction.

How to determine the unused chunk versions? The transaction protocol defines a serializable order between transactions. It is therefore trivial for every node to know the latest version of any given chunk it holds, by keeping ordering information between versions. Determining whether a chunk version is part of a blob version being read inside a transaction is, however, not trivial. Intuitively, one way to address this challenge is to make the version managers of the blob responsible for deciding when chunk versions can be deleted. This is possible because the read protocol ensures that a read operation on any given blob inside a transaction will always hit a version manager of this blob. Hence, at least one version manager node is aware of any running transaction performing a read operation in the cluster. Finally, the version managers are aware of the termination of a transaction as they are part of the commit chain.

7.4.1 Detecting and deleting unused chunk versions

The key information for deciding which chunk versions to delete is stored by the version managers. A node never deletes a chunk version unless it has been cleared to do so by every version manager of the blob the chunk belongs to. Version managers keep a list of every running transaction for which they served as relay for the first read operation on a blob. Upon receiving a read operation from a client, a version manager increments a usage counter on the latest version of the blob (i.e., the blob version used to construct the chunk version list as per the transactional read protocol detailed in Section 7.2.2.1). This usage counter is decremented after the transaction is committed, or after a defined read timeout, for which the default is 5 minutes. Such a timeout is necessary in order not to maintain blob versions indefinitely in case of a client failure.

The list of currently used chunk versions of any blob is communicated to all the nodes holding the chunks of this blob by means of the transactional commit protocol: for every blob written to by a given transaction, each version manager piggybacks on the forward and backward pass messages the list of chunk versions currently in use. This guarantees that every node which is part of the chain will receive this chunk version usage information as part of the protocol, either during the forward or the backward pass. For any given blob, the version garbage collector of any node is free to delete any chunk version that (a) is not the latest, (b) is older than the latest transaction for which version usage information for that blob has been received from the version managers, and (c) was not part of the chunk versions in use according to the latest version usage information received for this blob. The version garbage collector may also safely delete any chunk version that has been superseded by a newer version for longer than the read timeout.
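The deletion rule itself is a simple predicate over the information the version managers piggyback on commits. A sketch, with illustrative types and the 5-minute default timeout from above:

    #include <stdint.h>
    #include <time.h>

    #define READ_TIMEOUT 300   /* default read timeout: 5 minutes, in seconds */

    /* What a storage node knows about one locally stored chunk version. */
    struct chunk_version_info {
        int      is_latest;        /* (a) part of the latest blob version?      */
        uint64_t commit_seq;       /* local serialization order of its commit   */
        time_t   committed_at;
        int      reported_in_use;  /* (c) listed as in use in the most recent
                                      usage information from the managers?      */
    };

    /* Decide whether the garbage collector may delete this chunk version.
       `last_usage_seq` is the serialization position of the newest transaction
       for which usage information for this blob has been received. */
    static int may_delete(const struct chunk_version_info *v,
                          uint64_t last_usage_seq, time_t now) {
        if (v->is_latest)
            return 0;
        if (v->commit_seq < last_usage_seq && !v->reported_in_use)
            return 1;                               /* rules (a), (b), (c) hold */
        /* Already superseded and older than the read timeout. */
        return difftime(now, v->committed_at) > READ_TIMEOUT;
    }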

7.4.2 Optimizing the message size

Týr limits the commit message overhead when the number of chunk versions currently in use is high. Specifically, when this number reaches a set threshold, the currently used chunk versions are piggybacked by the version managers on the transaction messages as a Bloom filter (28). A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. The version garbage collector can efficiently check whether a blob version is in the set of currently running transactions. Bloom filters are guaranteed never to cause false negatives, ensuring that no currently used chunk will be deleted. However, Bloom filters can return false positives, which may cause a chunk version to be incorrectly considered as being in use. These chunk versions will eventually be deleted during a subsequent transaction involving this blob, or once the read timeout is exceeded. The false positive probability of a Bloom filter depends on its size, and is a system parameter; the default is a 0.1% error probability. With this configuration, 100 concurrently running transactions on each of the 3 version managers of a blob would cause a commit message overhead of less than 0.6 kilobytes.
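The size claim can be checked with the standard Bloom filter sizing formula m = -n ln(p) / (ln 2)^2; a short sketch, with the constants mirroring the defaults quoted above:

    #include <math.h>
    #include <stdio.h>

    /* Number of bits for a Bloom filter holding n elements with false
       positive probability p: m = -n * ln(p) / (ln 2)^2. */
    static double bloom_bits(double n, double p) {
        return -n * log(p) / (log(2) * log(2));
    }

    int main(void) {
        double bits  = bloom_bits(100, 0.001);   /* 100 transactions, 0.1% error */
        double bytes = bits / 8.0;
        /* ~1438 bits, i.e. ~180 bytes per version manager; with 3 version
           managers this stays under the ~0.6 KB quoted in the text. */
        printf("%.0f bits (%.0f bytes) per filter, %.0f bytes for 3 managers\n",
               bits, bytes, 3 * bytes);
        return 0;
    }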

Chapter 8

Implementation details

We implement the design principles of Týr in a software prototype. This includes the Týr server, an asynchronous C client library, as well as partial C++ and Java bindings. The server itself is approximately 25,000 lines of Rust and GNU C code. This chapter describes key aspects of the implementation. Týr is designed following a distributed, modular and loosely coupled architecture. Each node runs a set of different modules, executed on different threads. Figure 8.1 shows the interaction between these modules:

• The Cluster Manager maintains the ring state between cluster nodes using a weakly consistent gossip protocol (52) to propagate information around the cluster (i.e. ring position allocations) and a φ accrual failure detector (78) to detect and confirm node failures.

• The Query Processor coordinates the requests and handles the transaction protocol using cluster information from the cluster manager (5, 6). It acts as the interface to the storage agent (7, 8).

• The Query Router is the main communication interface between a server and both the rest of the cluster and the clients. It receives incoming client requests (1), parses, validates and, if necessary, forwards them to the appropriate server according to cluster state information (2, 3). It then forwards them to the query processor (4, 9), and responds to the client (10).

• The Storage Agent is responsible for the persistent storage and retrieval of both data and version information.

Figure 8.1: High-level internal architecture of a T´yr server, showing the interactions between the query router, query processor, cluster manager and storage agent when handling a client request (steps 1 to 10).

8.1 Multiple, configurable keyspaces

The notion of keyspace is central to the design of T´yr. A keyspace offers an isolated namespace which contains related sets of objects. We design T´yr so that multiple keyspaces may co-exist in parallel in the same cluster. Different keyspaces would typically be used to logically separate resources belonging to different users or applications. Keyspaces in T´yr are akin to buckets in Amazon S3. They offer the following features:

Access control: Access control in T´yr is implemented at the keyspace level. These permissions are designed to control read and / or write access to a keyspace for different groups and users. Indeed, because of the requirements of the direct read feature, implementing access control at the object level would be a costly operation, which we argue is rarely needed in practice in both HPC and BDA systems. Access is provided in an all-or-nothing fashion: a permission on a keyspace is granted for all objects in that keyspace.

Specific configuration: Several T´yr configuration parameters can be set on a per-keyspace basis. This makes it possible to fine-tune the behavior of a keyspace depending on precise application requirements. The parameters tunable for each keyspace notably include the chunk size, the replication factor and whether or not to enable multi-version concurrency control. For example, a keyspace used as a cache, in which data safety is not critical, would typically reduce the replication factor in order to increase read performance. In a keyspace used as a key-value store, where keys are completely replaced at each write, multi-version concurrency control would also be disabled to avoid the small performance and storage overhead associated with this feature.

In the current implementation, all keyspaces are distributed across all nodes in the cluster. The configuration associated with each keyspace is stored by each node. This configuration is statically set when the keyspace is created, and cannot be altered. All keyspaces use distinct consistent hashing parameters, chosen at random when the keyspace is created, so that the mapping of blob keys to specific nodes is different for each keyspace.
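As an illustration of how these immutable, per-keyspace settings could be expressed on the client side, the short sketch below creates a cache keyspace with a reduced replication factor and MVCC disabled. The tyr_keyspace_create() function, the configuration structure and the keyspace name are hypothetical stand-ins, not the actual T´yr C API.

#include <stddef.h>

struct keyspace_config {
    size_t chunk_size;     /* chunk size in bytes */
    int    replication;    /* number of replicas per chunk */
    int    enable_mvcc;    /* multi-version concurrency control on/off */
};

/* Hypothetical prototype, for illustration only. */
int tyr_keyspace_create(void *cluster, const char *name,
                        const struct keyspace_config *cfg);

int create_cache_keyspace(void *cluster_handle) {
    struct keyspace_config cfg = {
        .chunk_size  = 4 * 1024 * 1024,
        .replication = 1,   /* cache data: favor speed over safety */
        .enable_mvcc = 0,   /* keys are fully replaced at each write */
    };
    /* The settings are fixed at creation time and cannot be altered later,
     * as described above. */
    return tyr_keyspace_create(cluster_handle, "simulation-cache", &cfg);
}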

8.2 Persistent data storage

Data persistence in T´yr is largely based on the RocksDB key-value storage system, which is especially suited for storing data on flash drives. It has a Log-Structured Merge (LSM) design with flexible trade-offs between the write-amplification, read-amplification and space-amplification factors. These guarantee that the amount of I/O performed per read or write operation is small, and that the total disk space used is close to that of the data it contains. It features important characteristics such as multi-threaded compactions, which make it especially suitable for storing multiple terabytes of data in a single database. Each keyspace in T´yr is mapped on each node to a different RocksDB database. This greatly eases the design of the keyspace drop operation, which can be performed by dropping the database as a whole on all nodes in the cluster. This design could also permit future releases to distribute data over different storage tiers for each keyspace, to accommodate diverse performance / storage cost trade-offs in the same cluster.
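The sketch below shows what the per-keyspace database mapping could look like using the public RocksDB C API (rocksdb/c.h). The on-disk path layout and the option values are illustrative assumptions, not those used by the actual implementation.

#include <rocksdb/c.h>
#include <stdio.h>
#include <stdlib.h>

rocksdb_t *open_keyspace_db(const char *keyspace_name) {
    char path[256], *err = NULL;
    /* Hypothetical directory layout: one RocksDB database per keyspace. */
    snprintf(path, sizeof path, "/var/lib/tyr/%s", keyspace_name);

    rocksdb_options_t *opts = rocksdb_options_create();
    rocksdb_options_set_create_if_missing(opts, 1);
    rocksdb_options_increase_parallelism(opts, 4);   /* multi-threaded compactions */

    rocksdb_t *db = rocksdb_open(opts, path, &err);
    rocksdb_options_destroy(opts);
    if (err != NULL) {
        fprintf(stderr, "cannot open keyspace %s: %s\n", keyspace_name, err);
        free(err);
        return NULL;
    }
    /* Dropping the keyspace then amounts to closing and destroying this database. */
    return db;
}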

8.3 Fault-tolerance

Fault-tolerance is critical to the design of T´yr. The storage system is able to gracefully tolerate the loss or malfunction of nodes in the cluster. Its implementation heavily relies on a Gossip protocol to detect failures, as well as on the failure-awareness of the protocols and mechanisms used to handle read and write requests.

8.3.1 Gossip-based failure detection

T´yr relies on an efficient Gossip protocol to solve three major challenges:

Cluster membership T´yr maintains cluster membership lists at each node, so that each node in the cluster has a complete view of all the nodes that compose it. This information is critical for data distribution with consistent hashing.

Failure detection and recovery T´yr automatically detects failed nodes, and notifies the rest of the cluster of any failure. This information is used internally to re-execute transactions impacted by failed nodes during their execution. T´yr attempts to recover failed nodes by reconnecting to them periodically.

Configuration and metrics propagation T´yr leverages Gossip to loosely exchange and synchronize configuration across the cluster. For example, operations such as creating or deleting a keyspace are propagated across the cluster using Gossip.

Gossip is based on the SWIM protocol (52). It is performed over UDP with a configurable but fixed fanout and interval. This ensures that network usage is constant with regard to the number of nodes. Complete state exchanges with a random node are done periodically over TCP, but less often than gossip messages. This increases the likelihood that the membership list converges properly, since the full state is exchanged and merged. Failure detection is done by periodic random probing using a configurable interval. If the probed node fails to acknowledge within a reasonable time (typically some multiple of the average round-trip time), an indirect probe is attempted: a configurable number of random nodes are asked to probe the same node, in case network issues prevented the direct probe from succeeding. If both the direct probe and the indirect probes fail within a reasonable time, the node is marked as suspicious and this knowledge is gossiped to the cluster. A suspicious node is still considered a member of the cluster. If the suspected member does not dispute the suspicion within a configurable period of time, the node is finally considered dead, and this state is in turn gossiped to the cluster.

8.3.2 Handling storage node failures

T´yr gracefully handles server node failures using two mechanisms. Data replication guarantees the availability of the data in case of node failures. A transaction re-execution mechanism detects the transactions impacted by failing nodes, automatically retrying them should such an event occur.

8.3.2.1 Data replication

The data in T´yr is replicated on a number of nodes that is configurable per keyspace. This reduces the likelihood of data loss should nodes fail or be temporarily unavailable in the cluster. Data availability is guaranteed as long as the number of failing nodes is below the configured replication factor for a given keyspace. While availability is guaranteed, node failures can severely impact the performance of the cluster as well as the request latency. Indeed, failing nodes concentrate the requests for ranges of keys onto a smaller number of servers. In its current state, T´yr does not automatically rebalance the data when nodes fail. T´yr clients automatically retry timed-out read and write requests using a different data replica as source, should a defined timeout elapse before a response is received.
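The client-side failover described above can be summarized by the following sketch. The tyr_read_replica() call and its parameters are hypothetical stand-ins for the actual asynchronous client API; only the retry structure is the point here.

#include <stddef.h>

/* Hypothetical prototype: reads from one specific replica, returns 0 on success. */
int tyr_read_replica(const char *key, size_t off, size_t len, void *buf,
                     int replica_index, int timeout_ms);

int read_with_failover(const char *key, size_t off, size_t len, void *buf,
                       int replication_factor, int timeout_ms) {
    for (int r = 0; r < replication_factor; r++) {
        /* Try one replica; on timeout or error, fall back to the next one. */
        if (tyr_read_replica(key, off, len, buf, r, timeout_ms) == 0)
            return 0;
    }
    return -1;   /* all replicas failed or timed out: data currently unavailable */
}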

8.3.2.2 Transaction re-execution

When a node fails, all executing transactions that include that node in their chain are severely affected: such a failure can prevent a transaction from executing correctly. Since each node in the chain is responsible for forwarding the message to the next member of the chain, the basic protocol does not tolerate failing members. This problem is solved using a transaction re-execution mechanism. The transaction coordinator constantly monitors failing nodes in the cluster. Should a node fail that is included in any chain this node coordinates, it calculates a new transaction chain that excludes the failing node. The transaction protocol is then re-started with that new chain. Each node in the chain checks this re-execution against its local state in order to verify whether it has already processed this transaction. This ensures that the transaction is resumed from its point of failure. In its current state, T´yr does not support simultaneous failures of the transaction coordinator and of another node in any given transaction chain.
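A minimal sketch of the chain recomputation step is given below, under the assumption that the coordinator can query the failure detector through an is_alive() predicate; both the predicate and the integer node identifiers are illustrative.

#include <stddef.h>

int is_alive(int node_id);   /* hypothetical: backed by the gossip failure detector */

/* Copies the live members of `chain` into `new_chain` and returns its length.
 * The transaction protocol is then restarted with this shorter chain, and
 * nodes that already processed the transaction simply skip their step. */
size_t rebuild_chain(const int *chain, size_t len, int *new_chain) {
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        if (is_alive(chain[i]))
            new_chain[n++] = chain[i];
    return n;
}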

8.4 Event-driven server design

The internal design of T´yr largely follows established best practices for high-performance servers. It seeks to use the full capabilities of the machine it runs on by maximizing parallelism and reducing as much as possible the coordination between execution threads. To do so, T´yr uses a fully non-blocking, event-driven design based on the LibUV library (85). This principle is applied at all stages of T´yr: for network I/O, for internal query processing as well as for persistent storage access.
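For readers unfamiliar with libuv, the tiny sketch below illustrates the pattern this design builds on: all work is expressed as callbacks driven by a non-blocking event loop. The periodic callback is a placeholder, not actual T´yr code.

#include <uv.h>
#include <stdio.h>

static void on_tick(uv_timer_t *handle) {
    (void)handle;
    /* In a real server this would be e.g. a garbage-collection or gossip pass. */
    printf("periodic maintenance task\n");
}

int main(void) {
    uv_loop_t *loop = uv_default_loop();
    uv_timer_t timer;
    uv_timer_init(loop, &timer);
    uv_timer_start(&timer, on_tick, 1000, 1000);   /* fire every second */
    return uv_run(loop, UV_RUN_DEFAULT);           /* the loop drives all callbacks */
}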

8.4.1 Client-server and internal cluster communication

The efficiency of the client-server communication is key to the overall performance of the storage server. Two libraries are used to guarantee the performance of T´yr in that respect:

Data serialization T´yr uses a custom data serialization format, in which each message is the concatenation of the serialized metadata and, if the operation is a write operation, of the compressed data associated with it. The metadata is serialized using Google's FlatBuffers library (8), which does not require data parsing, unpacking or memory allocation for decoding the metadata, making it one of the most efficient serialization libraries available. The data is compressed with LZ4 (46), which is able to process data at near-memory speed by leveraging various low-level hardware optimizations.
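The following sketch shows what assembling such a write frame could look like, using the standard LZ4 C API for the payload. The FlatBuffers metadata serialization is omitted (the metadata buffer is taken as an opaque input), and the exact frame layout is an illustrative assumption rather than the wire format actually used by T´yr.

#include <lz4.h>
#include <stdlib.h>
#include <string.h>

/* Returns a newly allocated frame: [metadata][LZ4-compressed payload], or NULL. */
char *build_write_frame(const char *metadata, int meta_len,
                        const char *payload, int payload_len, int *frame_len) {
    int max_compressed = LZ4_compressBound(payload_len);
    char *frame = malloc((size_t)meta_len + (size_t)max_compressed);
    if (frame == NULL)
        return NULL;

    memcpy(frame, metadata, (size_t)meta_len);
    int compressed = LZ4_compress_default(payload, frame + meta_len,
                                          payload_len, max_compressed);
    if (compressed <= 0) { free(frame); return NULL; }

    *frame_len = meta_len + compressed;
    return frame;
}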

Low-level wire protocol The low-level transmission of requests, both client-server and internal to the cluster, is based on the KCP protocol (10). KCP provides fast, ordered and error-checked delivery of streams over UDP packets, guaranteeing lower latency and overall higher throughput than TCP. It is notably able to adapt its configuration to the quality of the network between two nodes, hence improving usage of the available bandwidth. Being based on UDP, this protocol is ideal for T´yr as it does not require handshakes before transmitting data. This notably supports the read protocols, in which the node responding to a request can be different from the one the request was initially sent to.

8.4.2 Processing queues

T´yr includes multiple parallel, non-blocking processing queues for highly efficient concurrent query execution. When a request arrives, it is immediately pushed to one of the processing queues using a round-robin algorithm. The processing queue, or query processor, is then responsible for acting upon the request. Specifically, it coordinates all actions that should be taken in response to that request: retrieving information from the storage agent, forwarding the request to another node or processing a transaction. I/O operations in T´yr are non-blocking: callbacks are used at each stage of the processing of a request for asynchronous operations. The number of concurrently executing operations in a single queue is bounded in order to guarantee a low processing latency even at higher loads. A transaction arbiter constitutes the only resource shared between processing queues. It essentially maintains the transaction dependency graph, and is leveraged to make decisions on the ordering of conflicting transactions.
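The round-robin dispatch step can be summarized by the short sketch below; the request structure and the queue_push() helper are hypothetical, only the dispatching logic itself is the point.

#include <stdatomic.h>

#define NUM_QUEUES 8

struct request;                                          /* opaque incoming request */
void queue_push(int queue_index, struct request *req);   /* hypothetical */

static atomic_uint next_queue;

void dispatch(struct request *req) {
    /* Each request goes to the next queue in turn; queues never block each
     * other, and only the transaction arbiter is shared between them. */
    unsigned q = atomic_fetch_add(&next_queue, 1) % NUM_QUEUES;
    queue_push((int)q, req);
}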

Part III

Evaluation: T´yr in support of HPC and BDA convergence


Chapter 9

Real-time, transactional data aggregation in support of system monitoring for BDA

In this chapter, we seek to prove the relevance of transactional blobs as a storage model for purpose-built applications. In particular, we demonstrate the performance benefits of the two key features of T´yr: transactions and atomic operations.

We base these experiments on a real-world BDA application that is part of ALICE (A Large Ion Collider Experiment) (14) at CERN (European Organization for Nuclear Research) (100). In particular, we consider the needs of the monitoring system collecting and aggregating in real-time telemetry events from more than 80 datacenters around the world: MonALISA (101; 102).

We use T´yr to process and aggregate real data from MonALISA, and compare its performance with other state-of-the-art systems on the Microsoft Azure cloud platform. We prove that T´yr throughput outperforms its competitors by up to 75% when faced with transactional operations, while providing significantly higher consistency guarantees. We scale the storage system to up to 256 nodes to demonstrate the scalability of such transactional operations.

9.1 ALICE and MonALISA

ALICE (14) is one of the four LHC (Large Hadron Collider) experiments running at CERN (100). ALICE collects data at a rate of up to 4 Petabytes per run and produces more than 10^9 data files per year. Tens of thousands of CPUs are required to process and analyze them. The CPU and storage capacities are distributed over more than 80 datacenters around the world. We focus on the management of the monitoring information collected in real-time about all ALICE resources. More than 350 MonALISA services are running at sites around the world, collecting information about ALICE computing facilities, local- and wide-area network traffic, and the state and progress of the many thousands of concurrently running jobs. This yields more than 1.1 million measurements pushed to MonALISA, each with an update frequency of one second. In order to be presented to the end user, the raw data is aggregated to produce about 35,000 system-overview metrics, grouped under different time granularity levels.

9.1.1 Managing monitoring data: what could be improved

The current MonALISA implementation is based on a PostgreSQL database (122). Aggregation is performed by a background worker task at regular intervals. With the constant increase in the number of collected metrics, this storage architecture becomes inefficient. Time-series databases such as OpenTSDB (146) or KairosDB (126) were considered to replace the current architecture. However, storing each event individually, along with the related metadata such as tags, leads to a significant overhead. In the context of MonALISA, the queries are known at the time the measurements are stored by the system. This opens the way to a highly customized storage backend using a data layout that would at the same time dramatically increase throughput, reduce metadata overhead, and ultimately lower both the computing and storage requirements for the cluster. We use T´yr to develop such a backend.

9.1.2 The need for transactions

The blob-based storage layout for the MonALISA system is as follows. All measurements (timestamp, measurement) are appended to a per-generator blob. Measurements are then averaged over a one-minute window with different granularity levels (machine, cluster, site, region, and job). This layout is illustrated in Figure 9.1. Updating an aggregate is a three-step operation: read the old value, update it with the new data, and write the new value (read-update-write). In order to guarantee the correctness of such operations, all writes must be atomic. This atomicity also enables hot snapshotting of the data. As an optimization for aggregate computation, it is desirable for the read-update-write operations to be performed in place, using T´yr's atomic transform operations.

Figure 9.1: Simplified MonALISA data storage layout, showing five generators on two different clusters and three levels of aggregation (per-generator, per-cluster and global aggregates). Solid arrows indicate events written, dotted arrows represent event aggregation. Each rectangle indicates a different blob; dotted rectangles denote aggregate blobs.
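The two aggregate-update paths mentioned above can be contrasted with the following client-side sketch. The tyr_* function names, the transaction handle and the atomic-add helper are hypothetical stand-ins for the actual API; only the difference in structure between read-update-write and an in-place atomic transform is the point.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical prototypes, for illustration only. */
void *tyr_tx_begin(void);
int   tyr_tx_read(void *tx, const char *blob, size_t off, void *buf, size_t len);
int   tyr_tx_write(void *tx, const char *blob, size_t off, const void *buf, size_t len);
int   tyr_tx_commit(void *tx);
int   tyr_atomic_add_int64(const char *blob, size_t off, int64_t delta);

/* Read-update-write: fetch the aggregate, update it, write it back atomically. */
int update_aggregate_ruw(const char *blob, size_t off, int64_t delta) {
    int64_t value = 0;
    void *tx = tyr_tx_begin();
    tyr_tx_read(tx, blob, off, &value, sizeof value);
    value += delta;
    tyr_tx_write(tx, blob, off, &value, sizeof value);
    return tyr_tx_commit(tx);   /* aborts and retries are handled by the caller */
}

/* Atomic transform: the addition is applied in place, saving the read round-trip. */
int update_aggregate_atomic(const char *blob, size_t off, int64_t delta) {
    return tyr_atomic_add_int64(blob, off, delta);
}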

9.2 Experimental setup

We deploy T´yr on the Microsoft Azure Compute (49) platform on up to 256 nodes. For all experiments, we use D2 v2 general-purpose instances, located in the East US region (Virginia). Each virtual machine has access to 2 CPU cores, 7 GB RAM and 60 GB of SSD storage. The host server is based on 2.4 GHz Intel Xeon E5-2673 v3 processors and is equipped with 10 Gigabit Ethernet connectivity. We compare T´yr with Rados (169) and BlobSeer (127). We also plot the results obtained with Microsoft Azure Storage Blobs (39), a managed blob storage system available on the Microsoft Azure platform. It comes in three flavors: append blobs, block blobs and page blobs. Append blobs are optimized for append operations, block blobs are optimized for large uploads, and page blobs are optimized for random reads and writes. These experiments use Rados 0.94.4 (Hammer), BlobSeer 1.2.1, and the version of Azure Storage Blobs available at the time of these experiments (April 2016).

Using a real-world data set: We use a dump of real data obtained from MonALISA. This data set is composed of ∼ 4.5 million individual measurement events, each one being associated with a specific monitored site. We use multiple clients to replay these events, each holding a different portion of the data. Clients loop over the data set to generate more events when the size of the data is not sufficient. The data is stored in each system following the layout described in Section 9.1.2. The read operations are performed by querying ranges of data, simulating a realistic usage of the MonALISA interface. In order to further increase read concurrency, the data is queried following a power-law distribution.

Ensuring write consistency on non-transactional systems: Because of the lack of native transaction support in T´yr's competitors, we use ZooKeeper 3.4.8 (80) (ZK), an industry-standard, high-performance distributed synchronization service which is part of the Hadoop (145) stack. ZooKeeper allows us to synchronize writes to the data stores with a set of distributed locks. ZooKeeper locks are handled at the lowest possible granularity: one lock is used for each aggregate offset (8-byte granularity), except for Azure, for which we have to use coarse-grained locks (512-byte granularity). Indeed, Azure page blobs, which we use for storing the aggregates because of their random write capability, require writes to be aligned on a non-configurable 512-byte page size. Furthermore, append blobs are not suited for operating on small data objects such as MonALISA events, as they are limited to 50,000 appends. Throughout our experiments, we measure the relative impact of the choice of ZooKeeper as a distributed lock provider. Overall, the results show that ZooKeeper accounts for less than 5% of the total request latency for write storage operations. Although faster, more optimized distributed locks such as (42) may be available, using them would not significantly change the results.
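As an illustration of the per-offset locking used for the non-transactional systems, the sketch below acquires and releases a lock through the ZooKeeper C client. It is a deliberately simplified create/retry variant rather than the standard sequential-ephemeral lock recipe, the znode path layout is hypothetical, and the parent znodes are assumed to already exist.

#include <zookeeper/zookeeper.h>   /* header location may vary across installs */
#include <stdio.h>
#include <unistd.h>

int lock_aggregate_offset(zhandle_t *zh, const char *blob, long offset) {
    char path[256];
    snprintf(path, sizeof path, "/locks/%s/%ld", blob, offset);
    for (;;) {
        /* Ephemeral znode: released automatically if the client dies. */
        int rc = zoo_create(zh, path, "", 0, &ZOO_OPEN_ACL_UNSAFE,
                            ZOO_EPHEMERAL, NULL, 0);
        if (rc == ZOK) return 0;            /* lock acquired */
        if (rc != ZNODEEXISTS) return rc;   /* unexpected error */
        usleep(1000);                       /* held by another writer: retry */
    }
}

void unlock_aggregate_offset(zhandle_t *zh, const char *blob, long offset) {
    char path[256];
    snprintf(path, sizeof path, "/locks/%s/%ld", blob, offset);
    zoo_delete(zh, path, -1);               /* -1: ignore the znode version */
}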

9.3 Transactional write performance

High transactional write performance is the key requirement that guided the design of T´yr. In this section we provide its baseline performance under a transactional workload. We measure the transactional write performance of T´yr, Rados, BlobSeer and Microsoft Azure Blobs with the MonALISA workload.

Figure 9.2: Baseline performance of T´yr (RUW), Rados, BlobSeer and Azure Storage (aggregated performance in millions of operations per second), varying the number of concurrent writers, with 95% confidence intervals.

We perform data aggregation in T´yr using atomic transforms as well as with transactional read-update-write operations. This is to provide a fair baseline for comparison with the other systems; indeed, T´yr is the only system to support such semantics. T´yr transactions are used to synchronize the storage of the events and their indexing in the context of a concurrent setup. All systems are deployed on a 32-node cluster, except for Azure Storage which does not offer the possibility to tune the number of machines in the cluster.

9.3.1 Comparing with non-transactional systems

First, we want to better understand the true cost of the lightweight transaction algorithm T´yr implements. To do so, we run the MonALISA workload using transactions on T´yr, using only read-update-write aggregation. We do not use any form of synchronization on the other systems, as none of them supports such semantics natively. In this setup, T´yr is the only system guaranteeing the correctness of the operations.

We plot the results in Figure 9.2. Overall, T´yr performance is similar to that of Rados, while still providing significantly stronger consistency guarantees. These results are explained by its lock-free, multi-version concurrency control design, which sustains its performance in such highly concurrent workloads. The performance gap with BlobSeer is mainly due to the lower overhead of metadata management in T´yr compared to the distributed tree BlobSeer relies on.

Figure 9.3: Synchronized write performance of T´yr (atomic and RUW), Rados + ZK, BlobSeer + ZK and Azure + ZK (aggregated performance in millions of operations per second), varying the number of concurrent writers, with 95% confidence intervals.

9.3.2 Comparing with lock-based transactions

In this section we provide a comparison while ensuring data consistency for all systems. We use ZooKeeper to synchronize writes on Rados, BlobSeer and Azure Blobs. The results, depicted in Figure 9.3, show that the peak throughput of T´yr outperforms that of its competitors by 78% while supporting higher concurrency levels. Atomic updates allow T´yr to further increase performance by saving the cost of read operations for simple updates. The significant drop of performance in the case of Rados, Azure Storage and BlobSeer at higher concurrency levels is due to the increasing lock contention. This issue appears most frequently on the global aggregate blob, which is written to for each event indexed. In contrast, our measurements show that T´yr's performance drop is due to CPU exhaustion, mainly because of the additional resources required for handling incoming network requests. Under lower concurrency, however, we can see that the transaction protocol incurs a slight processing overhead, resulting in comparable performance for T´yr and Rados when the update concurrency is low. BlobSeer is penalized by its tree-based metadata management, which incurs a non-negligible overhead compared to T´yr and Rados. Overall, Azure shows a lower performance and a higher variability than the other systems; at higher concurrency levels, however, Azure performs better than both Rados and BlobSeer. This could be explained by a higher number of nodes in the Azure Storage cluster, although the lack of visibility into its internals does not allow us to draw a conclusive explanation. We can also observe the added value of atomic transform operations in the context of this experiment, which enable T´yr to increase its performance by 33% by cutting the cost of read operations for simple aggregate updates and reducing the transaction failure rate.

Figure 9.4: Read performance of T´yr, Rados, BlobSeer and Azure Storage (aggregated performance in millions of operations per second), varying the number of concurrent readers, with 95% confidence intervals.

9.4 Read performance

We evaluate the read performance of a 32-node T´yr cluster and compare it with the results obtained with Rados and BlobSeer on a similar setup. As a baseline, we measure the performance of the same workload on the Azure Storage platform. We preload in each of these systems the whole MonALISA dataset, for a total of ∼ 100 Gigabytes of uncompressed data. We then perform random reads of 800 bytes each from both the raw data and the aggregates, following a power-law distribution to increase read concurrency. This read size corresponds to a 100-minute average of aggregated data. We throttle the number of concurrent requests in the system to a maximum of 1,000. We plot the results in Figure 9.4. The lightweight read protocol of both T´yr and Rados allows them to process reads at near-wire speed and to outperform the peak throughput of both BlobSeer and Azure Storage by 44%. On the other hand, BlobSeer requires multiple hops to fetch the data in the distributed metadata tree, which incurs an additional networking cost that limits the total performance of the cluster. Under higher concurrency, we observe a slow drop in throughput for all the compared systems except Azure Storage, due to the CPUs of the cluster becoming overloaded. Once again, the linear scalability of Azure could be explained by the higher number of nodes in its cluster, although this cannot be verified because of the lack of visibility into Azure internals. T´yr and Rados show a similar performance pattern; measurements show Rados outperforming T´yr by a margin of approximately 7%. This performance penalty can partly be explained by the slight overhead of the multi-version concurrency control in T´yr, enabling it to support transactional operations.

Figure 9.5: Read throughput of T´yr (atomic and RUW), Rados, BlobSeer and Azure Storage for workloads with varying read-to-write ratios. Each bar represents the average read throughput per reader (thousands of operations per second) of 200 concurrent clients averaged over one minute.

9.5 Reader / writer isolation

A key, well-known advantage of multi-version concurrency control is the isolation it provides between readers and writers. We demonstrate this in T´yr by simultaneously performing reads and writes in a 32-node cluster, using the same setup and methodology as in the two previous experiments. To that end, we preload half of the MonALISA dataset in the cluster and measure read performance while concurrently writing the remaining half of the data. We run the experiments using 200 concurrent clients. With this configuration, all three systems proved to perform above 85% of their peak performance for both reads and writes, thus giving comparable results and a fair comparison between the systems. Among these clients, we vary the ratio of readers to writers in order to measure the performance impact of different usage scenarios. For each of these experiments, we monitor the average throughput per reader. The results, depicted in Figure 9.5, clearly illustrate the added value of the multi-version concurrency control on which both T´yr and BlobSeer are based. For these two systems, we observe a near-stable average read performance per client despite the varying number of concurrent writers. In contrast, Rados, which outperforms T´yr for a 95/5 read-to-write ratio, shows a clear performance drop as this ratio decreases. Azure performance decreases in a similar fashion as the number of writers increases.

Figure 9.6: T´yr horizontal scalability on 32 to 256 nodes using 100 to 1,600 clients. Each point shows the average throughput of the cluster (millions of operations per second) over a one-minute window with a 65% read / 35% write workload, and 95% confidence intervals.

9.6 Horizontal scalability

We test the performance of T´yr when increasing the cluster size up to 256 nodes. This results in an increased throughput as the load is distributed over a larger number of nodes. We use the same setup as for the previous experiment, varying the number of nodes and the number of clients, and plotting the aggregated throughput achieved among all clients over a one-minute time window. We use the same 35% write / 65% read workload (with atomic transforms) as in the previous experiment. Figure 9.6 shows the impact of the number of nodes in the cluster on system performance. We see that the maximum average throughput of the system scales near-linearly as new servers are added to the cluster. With this setup, a 256-node T´yr cluster peaks at 26.4 million operations per second.

9.7 Conclusions

In this chapter, we demonstrated the effectiveness of T´yr in a BDA application context. Using T´yr as a storage backend for MonALISA exercises write coordination, enabling us to prove the performance and scalability of both transactional operations and atomic transforms. In particular, we highlight the performance of T´yr's lightweight transaction protocol, which provides very high velocity compared to lock-based solutions while incurring only a small overhead compared to non-transactional systems. We finally note the relevance of multi-version concurrency control, which shields readers from the impact of a concurrent, conflicting write workload.

Chapter 10

Large-scale logging in support of computational steering for HPC

The use-cases for data logging span beyond system monitoring for BDA. A distributed shared log is a versatile primitive for a variety of applications. In scientific applications, distributed logs can play many roles, such as in-situ visualization of large data streams, centralized collection of telemetry, data aggregation from arrays of physical sensors or live data indexing. Distributed shared logs are an especially challenging data structure to implement on common HPC platforms due to the lack of an efficient append operation in Lustre (142) or GPFS (140). While part of the POSIX standard, this operation has not been a main focus during the development of parallel file systems. Although application-specific, custom-built solutions are possible, they require a significant development effort and often fail to meet the performance requirements of data-intensive applications running at large scale. Because of its capability to handle append operations at high velocity, T´yr is a good candidate for such applications. The increasing node-local storage capabilities provided by modern, leadership-class supercomputers offer the opportunity to deploy T´yr alongside the application in HPC environments. This node-local storage is provided by non-volatile memory (NVM) devices, such as solid-state disks. For example, MareNostrum IV (BSC, 13th in the TOP500) (2) and Theta (ANL, 16th) (77) nodes embed SSDs of 240 and 128 GB respectively. Sierra (4) (LLNL) and Summit (5) (ORNL) will offer 800 GB of node-local SSD storage, while Aurora (1) (ANL) nodes will be equipped with non-volatile random-access memory (NVRAM).

Figure 10.1: Illustration of telemetry event collection. Simulation ranks generate events to a central log (a). Such events are used directly for online tasks such as visualization or computational steering (b), or are stored for further offline processing (c). Computational steering in turn controls the simulation (d).

In this chapter we demonstrate the applicability of the T´yr design in support of large-scale data logging on HPC platforms, specifically focusing on telemetry event management. We detail how the design of T´yr enables it to scale far beyond the capabilities of traditional file-based storage.

10.1 Distributed logging for computational steering

In telemetry event management, the events generated by an application carry data related to its state, its performance, or other application-specific information. Analyzing these events while the application is running can be very useful, for instance to verify that the application behaves as expected, or to run in-situ visualization (57). This information can also be used for in-situ analytics (176) or computational steering (37; 87; 172). The latter can in turn influence the simulation in real-time, e.g. increasing its resolution on a particular zone, or re-running part of it with different parameters. We illustrate this in Figure 10.1.

In this chapter we focus on the application of telemetry event management to computational steering, which presents unique challenges for data storage that we will describe in detail.

10.1.1 Required API

The general API for computational steering is that of a simple multi-producer, single-consumer distributed log. Multiple ranks generate events that are ingested by the log storage system in parallel. A computational steering process concurrently reads and processes the generated events, extracting the required information to steer the computation in near real-time. The minimal API we consider in this chapter for computational steering is very simple, being composed of only two operations, as sketched below. All event generator ranks insert events into a shared log using an append function. The computational steering process repeatedly calls a pop method, which iterates over the log, returning the next event each time it is called.
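The two-operation interface could be declared as follows; the type and function names are illustrative, not a specific library's API.

#include <stddef.h>

struct event_log;   /* opaque handle to a shared, distributed log */

/* Called concurrently by every simulation rank producing telemetry events. */
int log_append(struct event_log *log, const void *event, size_t len);

/* Called by the single steering process; returns the size of the next unread
 * event, or a negative value when no event is currently available. */
long log_pop(struct event_log *log, void *buf, size_t max_len);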

10.1.2 Challenges

When only offline analytics tasks are necessary, one could think of generating individual logs on each rank that would be collected and merged once the application has finished. While this solution could be applicable to small-scale simulations generating low amounts of data, it becomes increasingly impractical as the size of the application increases. Running a post-processing step merging data from all ranks into a single log may take a significant amount of time, often more than the simulation itself, and merely collecting the logs may require a significant development effort. This approach is simply not possible when using these logs for online tasks such as computational steering, which requires collecting logs in real time from all nodes running the simulation. In that case, a naive implementation is to stream all events using MPI (72) to the controller node, but this approach fails should the event throughput momentarily exceed the controller capacity, while at the same time tying the simulation to the simulation controller. In such a case, a buffer between the two components helps both mitigate bursty generation rates and decouple the simulation from its control. When the data must be used at the same time for multiple purposes, a central log storage simplifies the architecture of the application, consequently requiring less development effort, often increasing performance and simplifying application audit and debugging.

While implementing distributed logging on HPC platforms is already complicated because of the limitations of traditional file-based storage, scaling it brings additional challenges. Most notably, parallel file systems are well known to be generally bad at handling concurrent, conflicting writes originating from different clients (107; 117; 156; 162). Unfortunately, distributed logs are by design a highly concurrent data structure in which multiple processes essentially compete to write to the tail of a unique object.

10.2 Experimental setup

All experiments carried out in this chapter are based on the Theta supercomputer (77), hosted at the Argonne Leadership Computing Facility (ALCF) of Argonne National Laboratory. Theta is a 9.63-petaflops system, ranked 16th in the TOP500 as of June 2017. It is composed of 3,624 nodes, each containing a 64-core Knights Landing Xeon Phi CPU with 16 GB of high-bandwidth in-package memory (MCDRAM), 192 GB of DDR4 RAM, and a 128 GB SSD. Network connectivity is provided by a Cray Aries interconnect in a Dragonfly configuration. The Theta storage system is based on Lustre: its 10 PB parallel file system spans 170 storage servers. Theta is also connected to the GPFS file system of the Mira supercomputer (3), co-located in the same facility, using high-speed interconnects. This configuration allows for experiments on both Lustre and GPFS from the same experimental platform.

10.2.1 Evaluated systems

We compare the performance of T´yr with that of Zlog (12), as well as with that of the parallel file systems available on Theta: Lustre and GPFS.

Zlog is a logging system which implements the CORFU (24) log design over the Ceph file system (167). At its core lies a network sequencer, with a simple atomic counter initialized with the size of the file and incremented by the size of each write as data is appended to the log. Multiple such sequencers can be deployed over the network, but one and only one sequencer is used for each log. We worked with the authors of Zlog to reimplement the sequencer over UDP, which offers orders of magnitude better performance than the original TCP implementation on our experimental platform.

In addition to T´yr and Zlog, we also consider the performance achievable by the parallel file systems Lustre (142) and GPFS (140).

10.2.2 Platform configuration

For each application run, we use approximately a 5:1 compute-to-storage core ratio that we empirically determined to be optimal for our setup. Yet, this ratio is highly dependent on this particular setup, and would likely be different in practice. First, Theta's Knights Landing architecture embeds a large number of cores per node, and we measured the storage not to be CPU-bound during our experiments; as such, using a lower core-per-node count would have resulted in a higher ratio. Second, our benchmark dedicates all resources to I/O. Any real application would very likely be less I/O-intensive and consequently require a higher ratio. We expect the ratio for a real application to be closer to 20:1, which is for instance the compute-to-storage ratio for Lustre on Theta.

Although we could choose to co-locate the storage system with the application on each machine, we argue that this setup is not ideal for a distributed logging use case. Indeed, the concentration of writes to a small number of objects inherently creates hot spots in the cluster. As such, it is intuitively preferable to dedicate a smaller number of machines entirely to storage rather than a fraction of all machines.

We use a RAMdisk to store the data on each node. We argue that this setup is reasonable as RAM behaves similarly to NVRAM with respect to performance and access times. We expect NVRAM to be increasingly used in the architecture of future machines, as will be the case for Aurora (1). Additionally, Theta compute nodes embed significantly more RAM than SSD space, which would cause the SSD to behave like a RAMdisk thanks to the file system cache.

10.2.3 Benchmark

The minimal API required for a shared log is straightforward to benchmark. Indeed, a single operation is performed by all event generator ranks: append. We develop a simple MPI-based benchmark that sequentially generates events at maximum rate from a number of generator ranks. These events are appended to a single shared log using that operation.

Figure 10.2: Append performance of Lustre and GPFS on Theta to a single file (appends per second per writer), varying the number of concurrent writers. The performance axis uses a logarithmic scale.

The computational steering process is simulated by a single dedicated rank which reads the logs concurrently, repeatedly calling the pop method. As we only focus on storage I/O, the data is discarded immediately by this process. Although very simple, we advocate that this benchmark is well suited to evaluate the maximum performance that can be achieved by the system. Since the append operation causes very high write contention in distributed logging systems, a large number of processes concurrently pushing data at maximum rate to a single log puts very high pressure on the storage system.
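A condensed sketch of such a benchmark is given below: every rank but the last produces fixed-size events at maximum rate, while the last rank plays the steering process and drains the log. The log_append()/log_pop() calls are the hypothetical bindings from Section 10.1.1 (used here without the explicit log handle for brevity), and the event size and duration are illustrative.

#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define EVENT_SIZE 1024            /* 1 KB appends, as in the experiments */
#define DURATION_S 600.0           /* 10-minute measurement window */

extern int  log_append(const void *event, size_t len);   /* hypothetical binding */
extern long log_pop(void *buf, size_t max_len);          /* hypothetical binding */

int main(int argc, char **argv) {
    int rank, size;
    char event[EVENT_SIZE];
    memset(event, 'x', sizeof event);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double start = MPI_Wtime();
    long ops = 0;
    if (rank == size - 1) {                  /* single steering rank: drain the log */
        char buf[EVENT_SIZE];
        while (MPI_Wtime() - start < DURATION_S)
            if (log_pop(buf, sizeof buf) > 0) ops++;      /* events are discarded */
    } else {                                 /* generator ranks: append at full rate */
        while (MPI_Wtime() - start < DURATION_S)
            if (log_append(event, sizeof event) == 0) ops++;
    }

    long total = 0;
    MPI_Reduce(&ops, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("aggregate operations over %d ranks: %ld\n", size, total);
    MPI_Finalize();
    return 0;
}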

10.2.4 Evaluating native performance for parallel file systems

We first performed preliminary throughput experiments with Lustre and GPFS, allowing us to draw the baseline performance of parallel file systems. We concurrently append to a single file from all ranks, and plot the results in Figure 10.2. We can see a clear decrease in performance when increasing the process count for both Lustre and GPFS, which behave similarly. Going from one to 8 processes halves the append performance, and using 256 processes yields less than one append per core per second. A well-known method in HPC systems to alleviate the low performance exhibited by most parallel file systems when facing highly conflicting I/O patterns is the use of write proxies: dedicated processes through which all append operations from the append clients are channeled. This eliminates the costly file system access concurrency at the expense of a small increase in append latency.

Figure 10.3: Append log architecture using write proxies for parallel file systems, with writers channeling their appends through dedicated proxy processes, one per log.

The proxy process receives append operations from the append clients. It internally maintains an exclusive cursor on the file hosting the partition it is associated with, as well as an offset counter. For each append operation, it generates a write to the file at the next free offset along with the current timestamp, and increments the counter by the size of the event. For this reason, in the rest of this chapter, we channel writes from append clients to the distributed file systems through such a write proxy, dedicating one write proxy to each different log. This architecture is outlined in Figure 10.3. In contrast, we evaluate T´yr and Zlog without such a process, exposing the storage directly to the append clients, as these two systems are designed specifically to handle concurrent, conflicting I/O patterns.
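The core of such a proxy is a simple single-writer loop, sketched below. The request structure and the receive_append() helper (which would sit on top of MPI or sockets) are hypothetical, error handling and the per-event timestamp are omitted for brevity.

#include <unistd.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>

struct append_req { const void *data; size_t len; };
int receive_append(struct append_req *req);   /* hypothetical: from MPI or sockets */

void proxy_loop(const char *partition_path) {
    int fd = open(partition_path, O_WRONLY | O_CREAT, 0644);
    off_t offset = lseek(fd, 0, SEEK_END);     /* resume at the current file size */

    struct append_req req;
    while (receive_append(&req) == 0) {
        /* The proxy is the only writer, so no file-system-level locking is
         * needed: appends are serialized by this single cursor. */
        pwrite(fd, req.data, req.len, offset);
        offset += (off_t)req.len;
    }
    close(fd);
}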

10.3 Baseline performance

In this section we seek to show that deploying storage systems such as Zlog or T´yr on the compute nodes yields significantly better performance than Lustre or GPFS, even using write proxies. We focus on storing a stream of events generated by a distributed application. Theta being one of the very few supercomputing platforms offering access to the local SSD drives of the compute nodes, we are able to compare T´yr and Zlog on these machines. The application uses between 1 and 8,192 cores as append clients. We deploy T´yr and Zlog using between 64 and 1,536 cores for the storage servers, on different machines from the append clients. We measured this ratio to be optimal for our setup. We measure the number of appends per writer core over 10 minutes. We do not use any write batching or compression. T´yr and Zlog are configured with a replication factor of 3.

Figure 10.4: Baseline append performance of Lustre + Proxy, GPFS + Proxy, T´yr and Zlog, using up to 8,192 append client cores, with 1 KB appends (throughput in thousands of appends per second).

We plot the results in Figure 10.4. Overall, they show that the performance achieved by T´yr or Zlog is superior to that achieved by the parallel file systems when the number of concurrent writers is low. As the concurrency increases, T´yr retains its performance advantage over the other systems. In contrast, the lock-based concurrency control of Ceph limits the performance of Zlog as writer cores are added to the system. Overall, the performance of all systems tends to decrease slightly as more writers are added, because of the increasing write contention at the tail of the log. For Lustre and GPFS, this is also due to the saturation of the proxy node.

10.4 Further increasing append performance by increasing parallelism

One common method to increase throughput on HPC platforms is to leverage multiple independent files. In this model, each writer rank is assigned to an I/O group. Each group is assigned one or more files that it accesses independently. This method improves performance as each single file is shared by a lower number of nodes, consequently reducing access concurrency. Such a setup is illustrated in Figure 10.5. State-of-the-art streaming data systems on clouds such as Kafka (89) or BookKeeper (80) leverage a similar level of parallelism, splitting each log into a variable number of partitions.

Figure 10.5: Illustration of the multiple independent files method, where each output log is independently mapped to a group of nodes.

In this model, when a writer generates an event, it is directed to one and only one partition. Each partition hence receives only a subset of all append operations. A partition is mapped to a single object (i.e., a blob for T´yr, a log for Zlog, a file for Lustre and GPFS). The write throughput achievable for a partition is hence determined by the maximum throughput the underlying storage system can handle for a single object. Increasing the number of partitions of a log increases the total throughput supported by the log by increasing write parallelism. In this section we seek to understand how adding partitions affects the append performance of a single log. We perform the same experiment as in the previous section, increasing the number of partitions in the log. We use a dedicated proxy process per partition for Lustre and GPFS. We plot the results in Figure 10.6, which shows how increasing the number of partitions impacts the performance of the storage. We use the same setup as in the previous section with 8,192 clients, varying the number of partitions between 1 and 64. The results show a clear increase in the throughput per client. As expected, doubling the number of partitions roughly doubles the throughput achieved on the log by evenly splitting the load across multiple files, reducing hot spots at the tail of the log and consequently increasing throughput. The results show a clear advantage for T´yr over the other storage systems, consistent with our observations from the previous section. We also measured the achieved read performance during this experiment, and did not note any substantial performance degradation when increasing the number of partitions, with only a 5.3% performance drop between 1 and 8,192 partitions.

Figure 10.6: Append performance of Lustre + Proxy, GPFS + Proxy, T´yr and Zlog using 8,192 append client cores with 1 KB appends, varying the number of partitions between 1 and 64 (throughput in millions of appends per second).
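The mapping of a writer to one and only one partition, described above, can be as simple as a stable hash of the writer's rank; the sketch below is one illustrative way of doing it, not the assignment scheme used in the experiments.

#include <stdint.h>

/* Returns the partition index for a given writer rank. */
uint32_t partition_for_writer(uint32_t writer_rank, uint32_t num_partitions) {
    /* Knuth multiplicative hash: spreads consecutive ranks across partitions. */
    return (writer_rank * 2654435761u) % num_partitions;
}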

10.5 Scaling distributed logging up to 120,000 cores

Leveraging the findings made in the previous section, this section is dedicated to validating this knowledge at scale by applying it to distributed logging. We experiment using up to 100,000 client cores (1,563 nodes) and 20,000 storage cores (313 nodes). This scale roughly corresponds to half the size of Theta. We use the same append benchmark as in the previous sections. We use 2 output logs for every storage node; we showed this setting to be reasonable in Section 10.4. We vary the number of client cores from 12,500 to 100,000. Based on our previous experiments, 16 client cores per proxy yield the best overall performance. We hence vary the number of partitions proportionally, between 780 and 6,250. We use both T´yr and Lustre in this experiment, with the same configuration as in the previous section. T´yr is deployed on 20,000 storage cores.

We measure the aggregated throughput achieved in append operations per second across all clients. Overall, the results depicted in Figure 10.7 confirm the near-horizontal scalability that can be achieved by increasing the number of concurrent partitions, using either file- or object-based storage as backend. We measured a throughput of up to 174 million appends per second on Lustre, and up to 241 million appends per second with T´yr. These results confirm that our findings from the previous section can be applied to achieve near-linear scalability of distributed logging on a high-end supercomputing platform. We argue that achieving such scalability using only the parallel file systems available on this platform would have been extremely hard if not impossible, and would likely have required a significant development effort. While we did not perform experiments at larger scale due to allocation constraints, all the data seems to indicate that further scaling distributed logging should be possible using this setup.

Figure 10.7: Aggregate append performance of Lustre + Proxy and T´yr at large scale, using up to 100,000 writer and 20,000 storage cores (throughput in millions of appends per second), varying the number of writer cores.

10.6 Conclusions

In this chapter, we have demonstrated the relevance of the T´yr design for distributed logging at very large scale. Indeed, the lock-free design of the storage system provides sustained write performance even when facing extreme write concurrency, compared to lock-based storage systems such as Zlog + Ceph. In particular, we could scale the system up to 241 million operations per second using 20,000 cores for the T´yr storage cluster, coping with the events generated by 100,000 client ranks.

Chapter 11

T´yr for HPC and BDA convergence

In this chapter we demonstrate the effectiveness of T´yr's key design principles in support of legacy HPC and BDA applications. This evaluation leverages the same experimental platform and configuration as in Chapter 5, adding T´yr as a competitor to BlobSeer, Rados and CephFS. This chapter focuses on non-transactional performance, and seeks to prove that the performance of T´yr is competitive with, or even exceeds, that of other state-of-the-art storage systems even when such transactional semantics are not needed. This is the case for the set of legacy applications we introduced in Chapter 5. In particular, using exactly the same experiments and configuration as in Chapter 5, we seek to understand how T´yr behaves compared with its blob storage competitors. For HPC applications, we also provide an overview of the horizontal scalability of T´yr: we replicate the above experiments on up to 2,048 storage cores and 6,144 application ranks on a high-end, leadership-class supercomputer hosted at Argonne National Laboratory.

11.1 T´yr as a storage backend for HPC applications

In this section we add T´yr as a backend storage system for the BLAST, MOM, ECOHAM and Ray Tracing applications, and compare its performance with that of Lustre, BlobSeer and Rados.

Figure 11.1: Comparison of average read throughput (GB/s) for each HPC application (BLAST, MOM, EH / MPI, RT) with Lustre, BlobSeer, Rados and T´yr, with 95% confidence intervals.

11.1.1 Testbed performance

In Figure 11.1 we plot the aggregate read throughput for all storage systems. For all configurations, we note that T´yr's performance significantly exceeds that of BlobSeer. This is mainly because of its two key features: its more efficient metadata management and its predictable data distribution, which significantly cut the cost of read accesses since most of them do not need to hit any metadata service. While Rados keeps a slight performance advantage over T´yr in all cases, the difference is relatively small (less than 5% on average). This difference is mainly due to the cost of multi-version concurrency control, and of the more complex write path required for offering transactional semantics in T´yr.

Figure 11.2: Comparison of average write throughput (GB/s) for each HPC application with Lustre, BlobSeer, Rados and T´yr, with 95% confidence intervals.

Figure 11.2 shows the write performance achieved by T´yr compared with the other systems. As for the read throughput, T´yr outperforms BlobSeer because of its simpler metadata management scheme. Interestingly, for Ray Tracing, Rados performance drops behind that of all other systems because of the high lock contention resulting from the high write access concurrency. For this application, T´yr shows a significant performance advantage for this balanced workload thanks to its more efficient protocols and metadata management, which at the same time isolate readers from writers and provide non-blocking, concurrent write operations.

Figure 11.3: Average performance improvement relative to Lustre for HPC applications using blob-based storage (average relative completion time, %), with 95% confidence intervals.

We plot in Figure 11.3 the resulting application completion time improvement. Consistently with the read and write throughput, T´yr outperforms BlobSeer in all cases. Ray Tracing being the most write-intensive application, the efficient design of T´yr gives it a slight 2% completion time improvement compared to Rados.

11.1.2 Replicating the results on a high-end platform

In this section, we seek to prove that the results obtained in the previous section are reproducible on a high-end supercomputer. To do so, we leverage the Theta supercomputer hosted at Argonne National Laboratory, and run the same experiments as in the aforementioned section. Although this is arguably not an easy task due to the strong limitations of the platform and the lack of superuser rights, blob storage systems such as T´yr and Rados can be deployed on such a platform.

Figure 11.4: Average performance improvement relative to Lustre for HPC applications using blob storage on Theta (average relative completion time, %), with 95% confidence intervals.

We deploy the applications as described in the previous section, using 32 nodes (totaling 2,048 cores), with 24 nodes for computation and 8 nodes for storage, and measure the completion time of each application. Experiments with Lustre use the file system available on the machine, totaling 170 storage nodes and shared across all users. We plot the results in Figure 11.4, which shows the performance improvement to be significantly higher than on our testbed. The reason for this performance increase lies in the technical characteristics of Theta, which offers significantly more RAM than SSD space. As such, for the most part, the storage systems deployed on these nodes behave as in-memory storage systems. We acknowledge that the setup of this platform is particular in this regard, and consequently that the results are not representative of those of another platform with a different setup. Yet, we advocate that these results demonstrate that deploying blob storage systems on high-end HPC platforms is possible without requiring any application modification. We observe a higher variance in the results compared to Grid'5000, which we attribute to the shared nature of the centralized storage.

In Figure 11.5 we scale all four applications on up to 128 nodes of Theta, or 8,192 cores. We notice a near-linear decrease of the computation time as the size of the cluster increases. With applications such as ECOHAM or Ray Tracing, the performance improvement is slightly lower than with purely read-intensive applications due to the significantly higher cost of write operations compared to read operations.

Figure 11.5: Average performance improvement at scale relative to the 32-node setup for HPC applications using blob-based storage on Theta (aggregate relative completion time, %), varying the number of computation / storage nodes.

Figure 11.6: Comparison of average read throughput (GB/s) for each Big Data application (Sort, Grep, DT, CC, Tokenizer) with HDFS, BlobSeer, Rados, CephFS and T´yr, with 95% confidence intervals.

11.2 T´yr as a storage backend for BDA applications

Figure 11.6 shows the average read bandwidth for these applications. T´yrconstantly outperforms both Ceph and BlobSeer in all cases, which is due to the direct read capa- bility it shares with Rados. Our measurements during these experiments demonstrate that over 98% of the read operations performed by Spark could be served by the direct read protocol. The remaining 2% were served by the multi-chunk read protocol. Yet, similarily to what we could observe with HPC, T´yrperformance is slightly lower than Rados performance due to the cost of the additional consistency guarantees it provides. Figure 11.7 highlights the benefits of multi-version concurrency control. The through- put achieved by T´yris less dependent on the application I/O profile than that of HDFS,

119 HDFS BlobSeer Rados Ceph T´yr Figure 11.7: Comparison 1 of write throughput for each Big Data application with HDFS, BlobSeer, T´yr,Ra- 0.9 dos and CephFS, with 95% confidence intervals. 0.8

0.7

0.6

Avg. write throughput (GB / s) Sort Grep DT CC Tokenizer Application

Ceph or Rados. While Týr is outperformed by Rados for read-intensive applications, the situation is reversed for write-intensive applications. With the Tokenizer application, Týr's throughput is over 15% higher than that of Rados. Conversely, Týr's throughput is only 5% lower than that of Rados for read-intensive applications. In this experiment as well, Týr outperforms BlobSeer in all cases because of its more efficient metadata management scheme. In Figure 11.8 we plot the relative improvement in total application completion time, in which the I/O gains are diluted by computation. As a result of the previous observations on read and write performance, Týr enables significantly lower application completion times than Ceph and BlobSeer for all applications. Due to its higher write performance under highly concurrent workloads, Týr brings a substantial reduction in completion time for write-intensive applications such as Tokenizer, while causing a very slight completion time increase when reads are frequent.

11.3 Conclusions

In this chapter, we have drawn a performance baseline for Týr using legacy applications initially designed for file-based storage. Overall, we show that the design of Týr enables it to significantly outperform both BlobSeer and traditional file systems in all cases, for both HPC and BDA use-cases. This is due to its design, which leverages non-blocking writes thanks to multi-version concurrency control and direct writes using consistent hashing techniques.

Figure 11.8: Average performance improvement relative to HDFS for Big Data applications (Sort, Grep, DT, CC, Tokenizer) using blob-based storage, with 95% confidence intervals (average relative completion time, in %).

Týr shows a small performance penalty compared to Rados, except for write-intensive applications, for which multi-version concurrency control is particularly efficient. This lower performance is the result of the higher consistency guarantees of Týr, which provides additional data consistency thanks to its fully transactional API. Legacy applications such as the ones used in these tests are not designed to take advantage of these features. The previous chapters demonstrated how these features further highlight the relevance of Týr's design in the context of purpose-built applications.

Chapter 12

TýrFS: Building a file system abstraction atop Týr

In Chapters 5 and 11 we demonstrated that blobs can cover a large part of the POSIX I/O API of a file system while dropping features that are not necessary for a wide range of applications. Yet, a potentially large number of legacy applications not relying on an I/O or data processing framework such as MPI-IO or Spark may rely on POSIX I/O operations not supported by blob storage systems. This is notably the case of directory operations. While legacy applications could be rewritten to take advantage of blob storage, we acknowledge that it is not reasonable to propose a storage solution replacing file-based storage that would break compatibility with these legacy applications. Ideally, blob storage would integrate seamlessly into existing storage infrastructures for both HPC and BDA, supporting at the same time legacy applications relying on strict POSIX I/O semantics and newer applications taking advantage of the performance improvements permitted by blob storage. Furthermore, this transition should not hinder the performance of these legacy applications. Ceph (167) has demonstrated that blobs are a strong building block upon which a file system can be efficiently built. Indeed, it leverages Rados (169) as its storage layer. Yet, the lack of transactional semantics in the latter causes Ceph to drop important consistency guarantees (13), which can lead to data corruption under concurrent workloads. In contrast, Týr natively provides these transactional semantics. In this chapter, we prove that such features enable the development of a POSIX I/O compliant

file system that provides high throughput while retaining important data consistency guarantees. A particular and hard problem we focus on while developing this file system layer is the management of very small files. Indeed, workloads composed of very small files are becoming increasingly important for a range of applications processing data from Internet of Things (IoT) applications, generated by data-intensive applications such as CM1 (36) or HACC (74), or processed by large-scale workflows such as Montage (27), CyberShake (69) or LIGO (15). While a wide range of work tackles this issue, the optimizations proposed in the literature that have a significant impact on read performance rely on archiving, and optionally indexing, files. Although such approaches are indisputably effective for analyzing large sets of data, they invariably result in dropping a significant part of the POSIX I/O API. Notably, no related work considers modifying a file after it has been written. Instead, we advocate that the direct read capabilities of Týr make it a strong building block for such a file system. In this chapter, we first describe the additional challenges associated with managing small files. We demonstrate why the existing approaches fail to provide adequate performance with such workloads, and how Týr's data management comes to the rescue with simple design principles. Based on a prototype implementation of a file system called TýrFS that leverages these principles, we demonstrate experimentally the relevance of this design for the efficient management of small files.

12.1 Motivation: the overhead of reading small files

We first describe the issue in more detail by experimentally studying the concrete performance effects of reducing file size with a large data set. To that end, we plot the time taken and the file system CPU usage when reading a constant-size data set while decreasing the size of the files it contains. We perform these experiments using both the Lustre (142) (2.10.0) and OrangeFS (123) (2.9.6) distributed file systems, which we set up on 24 nodes of the Paravance cluster in Rennes (France) of the Grid'5000 testbed (30). On 24 other machines of the same cluster (384 cores) we launch a simple ad-hoc MPI benchmark reading in parallel the contents of a pre-loaded set of files. The set size varies from 38,400 × 32 MB files

Figure 12.1: Lustre and OrangeFS read throughput for a fixed-size data set when varying the size of the individual files it is composed of (aggregate read throughput in GB/s; individual file sizes from 32 KB to 32 MB).

Figure 12.2: Breakdown of reads between metadata (M) and data (D) access for a fixed-size data set, varying the size of the individual files it is composed of (I/O time proportion in %; individual file sizes from 32 KB to 32 MB).

(100 files per rank) to 3.84 × 10⁷ × 32 KB files (100,000 files per rank), for a total combined size of 1.22 TB. In Figure 12.1 we plot the aggregate throughput achieved by Lustre and OrangeFS. For both systems we note a decrease of the read throughput as we reduce the size of the files. The decrease is as large as one order of magnitude when replacing 32 MB files with 32 KB files. In Figure 12.2 we detail these results. We compute the total execution time of the benchmark with Lustre and OrangeFS, and highlight the proportion of this time spent on the open operation fetching the metadata (M) and on the read operation actually reading the data (D). Interestingly, with smaller file sizes, the open operation takes a significant portion of the total I/O time. This proportion is above 90% for 32 KB files for both systems, while it is less than 35% for larger file sizes. These results clearly demonstrate that reducing the size of the files puts a very high

Figure 12.3: Typical read protocol for a small file with centralized metadata management (e.g. Lustre or HDFS): the client first fetches the file metadata (md(/foo/file)) from the metadata node, then reads the data (r(id)) from the storage node holding it.

pressure on the file system metadata. This is caused by the very architecture of most file systems, in which metadata and data are separated from each other. This implies that metadata needs to be accessed before performing any read operation. Consequently, we argue that reducing the cost of metadata management without changing the file system architecture cannot possibly address the root cause of the performance drop. These observations guide the design and implementation of TýrFS, which we detail in the following sections.
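To make the experimental procedure concrete, the following is a minimal sketch (in Python) of the kind of ad-hoc MPI read benchmark described above. It assumes mpi4py and a pre-loaded directory of files split evenly across ranks; it is an illustration of the measurement principle, not the exact benchmark used in these experiments.

# Minimal MPI read benchmark sketch (illustrative, not the exact benchmark used here).
import os
import sys
import time
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nranks = comm.Get_rank(), comm.Get_size()

dataset_dir = sys.argv[1]
files = sorted(os.listdir(dataset_dir))[rank::nranks]  # even split of the file set across ranks

comm.Barrier()
start = time.time()
read_bytes = 0
for name in files:
    with open(os.path.join(dataset_dir, name), "rb") as f:
        read_bytes += len(f.read())
comm.Barrier()
elapsed = time.time() - start

total = comm.reduce(read_bytes, op=MPI.SUM, root=0)
if rank == 0:
    print(f"Aggregate read throughput: {total / elapsed / 1e9:.2f} GB/s")

Each rank reads its share of the files sequentially, and the aggregate throughput is computed at rank 0, mirroring the aggregate figures reported in this section.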

12.2 TýrFS design

In this section we detail the main architectural and design goals that drive the design of a file system optimized for small files. We consider as small files those for which it does not make sense to split them into multiple pieces (chunking). Since this notion is variable and cannot be defined precisely, we will consider as small any file whose size is under a few megabytes.

12.2.1 Detailing the metadata distribution issue

Let us consider the behavior of a typical distributed file system when reading a file whose path is /foo/file. We consider the case of centralized metadata management, which is the base architecture of Lustre (142) or HDFS (164). As shown in Figure 12.3, the request is performed in two steps. First, the client interrogates the metadata server to get the layout information of the file to read, including the actual node on which the data is located. Only with this information can the client access the data in a second step. While the cost of such metadata fetching is amortized when reading large files, it cannot be amortized with small ones. Should the user seek to read many files, the cost of this

Figure 12.4: Typical read protocol for a small file with decentralized metadata management on three storage nodes s1, s2 and s3 (e.g. OrangeFS): the client resolves the metadata for /, /foo and /foo/file on potentially different nodes before reading the data.

communication with the metadata node is to be paid for each file access. This largely explains the results observed in Section 12.1. The previous results also show a further performance degradation with OrangeFS (123) compared to Lustre. This is caused by the distribution of the metadata across the entire cluster, which brings better horizontal scalability at the cost of additional metadata requests, as illustrated in Figure 12.4. The location of the metadata for any given file in the cluster is independent of the location of its data. Intuitively, we understand that two problems are to be tackled when dealing with small files at scale:

• First, the client should be able to locate by itself the pieces of data to be read, without requiring communication with a dedicated metadata server. This is similar to the direct read principle of Týr.

• Second, all the metadata required to read a file should be colocated with the file data whenever possible.

These two steps would allow the client to bypass metadata requests entirely, consequently offering significantly better performance when dealing with many small files. While these principles are seemingly easy, they are challenging to implement for distributed file systems because of the hierarchical nature of their namespace.

12.2.2 Data distribution using consistent hashing

In typical file systems, data is distributed across the cluster in a non-predictable fashion. A metadata process is responsible for assigning a set of nodes to each file based on some system-specific constraints (91). This location is saved in the file metadata, which therefore needs to be queried in order to later pinpoint the location of the data in the cluster. While this allows great latitude in finely tweaking data location, it inherently imposes at least two successive requests in order to retrieve data from the cluster. Because the data is small, this request cost is hardly amortizable for small files. In contrast, the predictable data distribution that is key to the design of Týr enables certain read requests to entirely bypass any metadata node. Such predictable data distribution is however challenging for file systems due to their hierarchical nature. To access any single file, a client needs to read the metadata of the whole folder hierarchy of this file in order to check, for example, that the user has the correct permissions for reading it. Let us illustrate this with a simple example. A folder at /foo contains many files to be processed by a data analytics tool. To distribute the load across the storage nodes, these files should arguably be spread across the cluster. As such, while clients would be able to locate each file in the cluster thanks to the predictable location algorithm, they would still need to access the metadata for /, /foo and for the file to be read. All these metadata pieces are potentially located on a different set of servers, and would require the additional communication that we are precisely seeking to eliminate. Although such predictable data distribution is a step towards the solution, it cannot be considered independently of metadata management. Some file systems such as Ceph (167) partially leverage predictable distribution to reduce the size of the metadata without addressing the issue of its distribution.
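To make the contrast concrete, the sketch below (in Python) shows how a client can locate the node holding a file's data from its path alone, using a simple consistent-hashing ring. The HashRing class and its parameters are illustrative assumptions, not Týr's actual implementation.

# Minimal consistent-hashing ring (illustrative, not Týr's actual implementation).
import hashlib
from bisect import bisect_right

class HashRing:
    def __init__(self, nodes, vnodes=64):
        # Each physical node is mapped to several points on the ring
        # to smooth the data distribution.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.sha1(key.encode()).hexdigest(), 16)

    def lookup(self, key: str) -> str:
        # Return the node responsible for the key: first ring point clockwise.
        hashes = [p for p, _ in self.ring]
        idx = bisect_right(hashes, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["s1", "s2", "s3"])
# The client locates the data for /foo/a.txt without contacting any metadata
# server: the blob key is simply the absolute file path.
print(ring.lookup("/foo/a.txt"))

Because the placement is a pure function of the key, any client can compute it locally and contact the right storage node directly, which is the property the remainder of this chapter builds upon.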

12.2.3 Dynamic replication for metadata management

To significantly reduce the overhead of metadata fetching we propose to collocate it with the data it concerns. Let us keep our example of the /foo folder containing many files. This implies that the node storing the data for the file /foo/a.txt would also hold the metadata for /, /foo and /foo/a.txt. We do so by leveraging an architecture roughly similar to that of OrangeFS. Each node composing the file system cluster integrates two services. First, a metadata service manages meta-information relative to folders and

files. Second, a data service persists the data relative to the objects (files or folders) stored on this node.

Data and metadata colocation: As a first principle, we propose to always store the metadata for any given file on the same nodes as the first chunk of the file. As small files are by definition composed of a single chunk, this effectively eliminates the need to remotely fetch the file metadata before accessing its data. This contrasts with the design of OrangeFS, which does not guarantee such colocation. This design is conceptually similar to the decision of embedding the contents of small-enough files directly inside inodes in the Ext4 local file system (40). Colocating the file data with the metadata of its folder hierarchy is a significantly harder problem. Indeed, a folder may contain many other folders and files, potentially stored on different nodes. Specifically, the root folder itself is by definition an ancestor of all folders and files. One could think of statically replicating this metadata to every relevant node. Yet, this solution would have huge implications in terms of storage overhead, because of the potentially large amount of metadata replicated, and in terms of write performance, because of the need to maintain consistency across all metadata replicas.
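Under the same assumptions as the previous sketch, this first principle can be expressed as a simple placement rule: the metadata of a file is stored on the node holding its first chunk. The chunk-key naming below is a hypothetical convention used only for illustration.

# Building on the HashRing sketch above (illustrative, not TýrFS internals).
def chunk_node(ring, path: str, chunk_index: int) -> str:
    # Node holding a given chunk; chunk 0 is keyed by the path itself.
    key = path if chunk_index == 0 else f"{path}#chunk{chunk_index}"
    return ring.lookup(key)

def metadata_node(ring, path: str) -> str:
    # First principle: a file's metadata lives with its first chunk.
    return chunk_node(ring, path, 0)

ring = HashRing(["s1", "s2", "s3"])
# For a small (single-chunk) file, a single request to this node can return
# both the metadata and the data.
assert metadata_node(ring, "/foo/a.txt") == chunk_node(ring, "/foo/a.txt", 0)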

Dynamic metadata replication: To alleviate this overhead, we replicate the metadata dynamically. The metadata agent on each node determines which metadata to replicate locally based on the current workload it serves. This problem has been extensively studied for geo-distributed storage systems, in which dynamic replication is used to selectively replicate data objects to various discrete regions while minimizing write and storage overhead by adapting the replication based on user requests (82; 112; 116). Our solution is largely inspired by this proven design and set of best practices.

Reducing the metadata overhead: We further reduce the storage overhead by only replicating the metadata that is strictly necessary for accessing a file. Indeed, only a tiny subset of the whole metadata for the folder hierarchy of a file has to be accessed before reading it: permissions. In addition, we argue that permission changes are a very infrequent operation, performed either when a folder is first created or as a manual management task that is unlikely to be time-sensitive; we could not find any application in which permission changes are in the critical path. As such, replicating the permissions is unlikely to have a significant negative effect on the performance of real-world applications. Intuitively, this solution implies that the closer a folder is to

Figure 12.5: High-level overview of the proposed TýrFS architecture. Each node integrates both a data and a metadata service; the metadata service itself is composed of a local metadata store and the dynamic metadata replication service.

the root of the hierarchy, the more likely its permissions are to be replicated to a large number of nodes. As an extreme case, the permissions for the root folder itself would be replicated to every node in the cluster.

In Figure 12.5 we show a high-level overview of the proposed architecture, highlighting the different modules on each node composing the file system cluster.

12.3 TýrFS implementation

In this section we detail the implementation of the above principles in a prototype file system optimized for small files, which we name TýrFS. We implement it as a thin layer atop Týr. The system is composed of a data service, described in Section 12.3.1, directly mapped to Týr's native semantics. A lightweight metadata management module runs on each server node; we describe its implementation in detail in Section 12.3.2. Finally, a client library exposes both a mountable and a native interface optimized for TýrFS, which we detail in Section 12.3.2.1.

12.3.1 Implementation of the data service

The data service directly leverages the capabilities natively offered by Týr to efficiently distribute and store data in the cluster. A file in TýrFS is directly mapped to a blob in Týr, with the blob key being the absolute file path. The predictable distribution of the data across the cluster that we describe in Section 12.2.2 is inherited from Týr. The stored data can thus be located and accessed by

TýrFS clients directly. Read and write operations as per the POSIX-I/O standard are mapped directly to the blob read and write operations offered by Týr. Each folder is also stored as a blob holding a list of all the objects it contains. As such, a folder listing only requires reading the blob, while adding a file to a folder is implemented as a blob append. Although not implemented in our prototype, a background process running on each node would be responsible for compacting the file list when objects are deleted from a folder. Týr provides data chunking for large files. In contrast to other distributed file systems such as Ceph, its transactional nature guarantees the correctness of read and write operations spanning multiple chunks, consequently adhering strictly to the semantics guaranteed by the POSIX-I/O standard. A side effect of using consistent hashing for data distribution is the need to move the physical location of the data when a file is moved in the folder hierarchy. This operation is implemented by writing "staple" records at the new location of the object and of its descendants, referencing their previous location. The data transfer occurs in the background. Moving a folder containing many descendants is consequently a costly operation. We argue that this trade-off is justified by the scarcity of such operations compared to the shorter critical path, and thus higher performance, of file reads and writes.
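The following sketch illustrates how such a mapping from POSIX-like operations to blob operations could look. The tyr client object and its read, write, append and size methods are hypothetical placeholders, not the actual Týr client API.

# Sketch of a POSIX-to-blob mapping keyed by absolute paths (hypothetical API).
def read_file(tyr, path: str, offset: int, size: int) -> bytes:
    # A file is a blob whose key is its absolute path.
    return tyr.read(key=path, offset=offset, size=size)

def write_file(tyr, path: str, offset: int, data: bytes) -> None:
    tyr.write(key=path, offset=offset, data=data)

def list_folder(tyr, path: str) -> list[str]:
    # A folder is a blob holding the list of entries it contains,
    # one entry name per line in this sketch.
    listing = tyr.read(key=path, offset=0, size=tyr.size(key=path))
    return listing.decode().splitlines()

def create_file(tyr, folder: str, name: str) -> None:
    # Adding an entry to a folder is a simple blob append; an empty blob
    # is created for the file itself.
    tyr.append(key=folder, data=(name + "\n").encode())
    tyr.write(key=f"{folder.rstrip('/')}/{name}", offset=0, data=b"")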

12.3.2 Popularity-based metadata service

The metadata service is implemented as a lightweight plugin running inside each Týr server process. Its responsibilities are threefold. First, it serves as a persistence layer for the metadata of the files whose first chunk is stored on this node. Second, it integrates a dynamic replication agent, which analyzes the received requests to determine for which folders to replicate and store metadata locally. Finally, it serves as a resolver agent for metadata requests, serving the requested information directly if it is available locally, or forwarding the request to the appropriate node, determined using consistent hashing, if not. The most interesting part of this metadata service is arguably the dynamic replication agent, which is the focus of this section.

Popularity as a dynamic replication metric: We use popularity as the main metric for dynamic replication to determine which folders to locally replicate metadata for. We

define popularity as the number of metadata requests for any given folder in a defined time period. These popularity measurements are performed independently for each node. Consequently, on any given node, the more frequently metadata for a specific folder is requested, the more likely it is to be considered for dynamic replication to that node. This metric is particularly suited for workloads accessing vast numbers of small files. Considering that at least one metadata request is needed for reading each file, a data set composed of many files will be significantly more likely to have the metadata for its ancestor folders replicated locally than a data set composed of a single large file. To measure the popularity of each folder, we record the number of times its metadata is requested from the local resolver agent. Keeping exact counters for each possible folder is memory-intensive, so we only focus on the k most frequently requested folders, k being user-configurable. This problem is known as top-k counting. A variety of memory-efficient, approximate solutions to this problem exist in the literature. For this use-case, such an approximate algorithm is tolerable as popularity estimation errors cannot jeopardize the correctness of the system. Our implementation uses Space-Saving (120) as the top-k estimator. It guarantees strict error bounds for approximate request counts, while exhibiting a small, predictable memory complexity of O(k). The output of the algorithm is an approximate list of the k folders for which the metadata is the most frequently requested since the data structure was first initialized. We perform these measurements over a sliding time window in order to favor immediate popularity over historical measurements.
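As an illustration of the popularity estimation described above, the following is a minimal sketch of the Space-Saving algorithm (120) with k counters. It is a simplified illustration rather than the exact estimator used in TýrFS.

# Space-Saving top-k estimator (sketch): at most k counters are kept, and the
# least-popular counter is recycled when a previously-unseen key appears.
class SpaceSaving:
    def __init__(self, k: int):
        self.k = k
        self.counters = {}  # key -> (approximate count, overestimation error)

    def insert(self, key: str) -> None:
        if key in self.counters:
            count, err = self.counters[key]
            self.counters[key] = (count + 1, err)
        elif len(self.counters) < self.k:
            self.counters[key] = (1, 0)
        else:
            # Evict the key with the smallest count and reuse its counter;
            # the evicted count becomes the error bound of the new key.
            victim = min(self.counters, key=lambda x: self.counters[x][0])
            min_count, _ = self.counters.pop(victim)
            self.counters[key] = (min_count + 1, min_count)

    def top_k(self) -> list:
        return sorted(self.counters, key=lambda x: -self.counters[x][0])

estimator = SpaceSaving(k=2)
for folder in ["/foo", "/foo", "/bar", "/foo", "/baz"]:
    estimator.insert(folder)
print(estimator.top_k())  # '/foo' is reported as the most popular folder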

When should an object's metadata be dynamically replicated: Whenever a metadata request arrives at the metadata resolver for a folder whose metadata is not locally replicated, the resolver agent determines whether or not it should be replicated. It does so by adding the requested folder path to the current time window of the top-k estimator. If the folder path is present in the output of the estimator, this path should be considered popular and the associated metadata should be replicated locally. The agent determines which node holds the metadata for this folder, and remotely fetches the metadata from that node. If the metadata is to be kept locally, it is stored in a temporary local store. In any case, the fetched metadata is returned to the client. This process is highlighted by the pseudo-code of Algorithm 1.

Algorithm 1 High-level algorithm of the metadata resolver.
function ResolveFolderMetadata(path)
    let current window be the current top-k window
    Insert(path, current window)
    if path is available locally then
        return the locally-available metadata for path
    end if
    md ← RemoteFetchMetadata(path)
    if path is in the output of current window then
        SubscribeMetadataUpdates(path)
        keep md in the local metadata store
    end if
    return md
end function

Handling metadata updates: Although folder permission modifications are a very rare occurrence in real-world applications, such metadata updates should nonetheless be consistently forwarded to metadata replicas. We integrate inside the TýrFS metadata agent a synchronous metadata notification service. When the permissions for any given folder are modified, this service synchronously notifies all nodes locally storing a dynamic replica of this metadata. Subscription to updates is performed atomically by a remote metadata agent during the metadata fetch, if that agent has determined that the metadata should be dynamically replicated to it. Acknowledgement is sent to the client requesting the metadata change only after the update has been applied to all dynamic replicas.
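A minimal sketch of this synchronous update path is shown below. The data structures and the blocking notify callback are illustrative assumptions, not TýrFS internals.

# Sketch of the synchronous permission-update path (illustrative assumptions).
from typing import Callable, Dict, Set

def change_permissions(
    permissions: Dict[str, int],              # primary metadata store (path -> mode)
    subscribers: Dict[str, Set[str]],         # path -> nodes holding a dynamic replica
    notify: Callable[[str, str, int], None],  # blocking call: (node, path, mode)
    path: str,
    mode: int,
) -> None:
    # Apply the change to the primary copy of the metadata.
    permissions[path] = mode
    # Synchronously push the update to every subscribed replica; the client
    # is acknowledged only once every replica has applied the change.
    for node in subscribers.get(path, set()):
        notify(node, path, mode)

# Example: two nodes replicate the permissions of /foo.
perms, subs = {"/foo": 0o755}, {"/foo": {"s2", "s3"}}
change_permissions(perms, subs, lambda n, p, m: None, "/foo", 0o700)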

A background bookkeeping service monitors changes in the current time window of the top-k estimator in order to identify the metadata of previously-popular objects that are no longer popular. Such locally available metadata is marked for deletion, but is kept locally as long as the available storage permits. When space needs to be reclaimed, the node deletes the locally-replicated metadata marked for deletion, and synchronously cancels its subscription to the corresponding metadata update notifications.
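The bookkeeping logic can be sketched as follows, reusing the SpaceSaving estimator from the earlier sketch; again, the surrounding structures are illustrative assumptions rather than TýrFS internals.

# Background bookkeeping sketch: mark unpopular replicas, evict on space pressure.
def bookkeeping_pass(estimator, local_metadata, marked_for_deletion):
    popular = set(estimator.top_k())
    # Metadata that is no longer popular is only marked for deletion; it stays
    # available locally until space actually needs to be reclaimed.
    for path in list(local_metadata):
        if path not in popular:
            marked_for_deletion.add(path)

def reclaim_space(local_metadata, marked_for_deletion, unsubscribe):
    # When space is needed, evict marked entries and cancel the corresponding
    # update subscriptions so that stale replicas are never served.
    for path in list(marked_for_deletion):
        local_metadata.pop(path, None)
        unsubscribe(path)
        marked_for_deletion.discard(path)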

12.3.2.1 TýrFS client

The TýrFS client provides two different interfaces. A native interface exposes all POSIX-I/O functions. In addition, a mountable interface, itself based on the native library, allows TýrFS to be mounted as any other file system; it leverages the FUSE library. The native interface is implemented as a thin wrapper around the Týr native client library, which handles all the message serialization and communication with the storage cluster. As an optimization to the TýrFS design, the native interface adds a function to those of the POSIX-I/O standard. In common file systems, reading a file is split between the metadata fetch (open) and the data read operation (read). With the TýrFS design embracing the colocation of data and metadata, this can be further optimized by combining the two operations into a single one. Doing so saves the cost of an additional client request to the file system cluster, thereby reducing read latency and increasing throughput. The TýrFS client therefore exposes a native method with this behavior, named fetch, which opens and reads a file given by its absolute path.
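The benefit of the combined operation can be illustrated as follows; the tyrfs client object and its open, read, close and fetch methods are hypothetical placeholders for the native interface.

# Combined open+read illustration (hypothetical client API).
def read_posix_style(tyrfs, path: str) -> bytes:
    # Two round trips: one to resolve the metadata (open), one to read the data.
    fd = tyrfs.open(path)
    try:
        return tyrfs.read(fd)
    finally:
        tyrfs.close(fd)

def read_with_fetch(tyrfs, path: str) -> bytes:
    # A single request: the node holding the colocated metadata and first
    # chunk returns both in one round trip.
    return tyrfs.fetch(path)

For small files, the single fetch round trip replaces the open-then-read pair, which is precisely the saving that the colocation of data and metadata makes possible.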

12.4 Experimental setup

We execute all of the following experiments on the Grid'5000 testbed, as in Section 12.1. The machines we use are part of the Paravance cluster (9), located in Rennes. Each node embeds 16 cores (2 × 8-core Intel Xeon E5-2630v3 processors) as well as 128 GB of RAM. Connectivity between nodes is provided by low-latency 10 Gigabit Ethernet; all nodes are connected to the same switch. Node-local storage is provided by two 600 GB spinning drives. We compare the performance of TýrFS with that of multiple file systems, from both the HPC and the Big Data communities. Lustre 2.10.0 (142) and OrangeFS 2.9.6 (123) are deployed according to best practices. With Lustre we dedicate one node to metadata. For Big Data applications we compare TýrFS with Ceph 12.2.0 (167) and HDFS 2.8.0 (31). We benchmark the performance of TýrFS using two different application profiles. For HPC file systems we use a simple ad-hoc MPI benchmark that reads in parallel, from a configurable number of ranks, the contents of a pre-loaded set of files; each rank performs its reads sequentially. It accesses the file system through the TýrFS native client

and leverages the fetch operation it provides. The benchmark reads through the whole pre-loaded random dataset, evenly distributing the files across all ranks. No actual computation is performed on the data. Big Data Analytics applications run atop Spark 2.2.0 (173) and are extracted from SparkBench (103). For Lustre and OrangeFS we use 24 machines to deploy the storage systems, plus 8 machines to deploy the client applications. With Lustre we dedicate one node to metadata management. Big Data applications are deployed alongside storage on all 48 nodes of the cluster.

12.5 Performance in support of HPC applications

We first compare the performance results obtained with TýrFS to those of Lustre and OrangeFS. The curves for Lustre and OrangeFS are reproduced for reference. The size of the input data set is constant. We vary the number of input files from 3,840 × 32 MB files (1 file per rank) to 3.84 × 10⁶ × 32 KB files (10,000 files per rank). The data is generated randomly. In Figure 12.6 we plot the total execution time of the benchmark for TýrFS, Lustre and OrangeFS. To evaluate the baseline performance of the metadata replication system, we also plot the performance achieved by TýrFS with dynamic replication disabled. The performance for large files is on par with that of Lustre or OrangeFS, due to the possible amortization of the metadata requests. However, TýrFS shows a significant performance improvement over those two systems for small files. Disabling metadata replication causes TýrFS performance to drop to the level of other distributed file systems. The similarity of the performance of TýrFS without metadata replication with that of OrangeFS is due to the fact that both embrace similar distributed metadata management principles. This unequivocally demonstrates the effectiveness of dynamic metadata replication with small files, enabling the file system to offer high read performance even at the smallest file sizes. Figure 12.7 shows the relative breakdown of the I/O time between the metadata (M) and data (D) access inside the fetch operation. We note that, with TýrFS, the relative time spent on metadata access tends to decrease as the file size increases. This clearly contrasts with other file systems, in which the throughput drop is mainly due to the high cost of metadata management. In TýrFS, the main factor impacting

Figure 12.6: TýrFS, Lustre and OrangeFS read throughput for a fixed-size data set, varying the size of the individual files it is composed of (aggregate read throughput in GB/s; individual file sizes from 32 KB to 32 MB; TýrFS is also shown with dynamic replication disabled, NoRep).

Figure 12.7: Breakdown of reads between metadata (M) and data (D) access for a fixed-size data set, varying the size of the individual files it is composed of, for TýrFS with and without dynamic replication (NoRep).

read time is the added cost of issuing additional read requests. The efficiency of the dynamic metadata replication is highlighted by the performance of TýrFS with this service disabled, which here also is similar to that of other file systems.

12.5.1 Scalability

Týr, the base building block of TýrFS, has a demonstrated near-linear horizontal scalability that makes it particularly suitable for large-scale use-cases. TýrFS does not introduce any centralization that could jeopardize this property. In particular, it inherits from OrangeFS the distributed metadata management principles that support the horizontal scalability of that system. As such, we argue that TýrFS should be able to inherit this desirable feature. In this section we experimentally demonstrate the horizontal scalability of TýrFS.

Figure 12.8: TýrFS horizontal scalability, varying the number of clients (64 to 1,536) and the number of cluster nodes (2 to 64) with 32 KB input files (aggregate read throughput in GB/s).

We adopt the same setup as in the previous section. We vary the number of storage nodes from 2 to 64 on the Paravance cluster. We set up clients on up to 24 nodes of a second Grid'5000 cluster, Parasilo. Both clusters are co-located in the same facility, and are interconnected with low-latency 2 × 40G Ethernet links aggregated using the Link Aggregation Control Protocol (LACP). Each client node runs 64 ranks of the MPI read benchmark, which concurrently read files from our random, pre-loaded 1.22 TB data set. We run the benchmark for 5 minutes, reading the dataset from within an endless loop. The dataset for this experiment contains 3.84 × 10⁷ × 32 KB files. In Figure 12.8 we plot the measured aggregate throughput of TýrFS across all clients, varying the number of nodes from 2 to 64 and the number of concurrent clients from 64 to 1,536. The results demonstrate that TýrFS shows near-linear horizontal scalability when adding nodes to the storage cluster, being able to serve more clients with higher aggregate throughput. The decrease in throughput observed with the smallest cluster sizes when increasing the number of clients is due to the saturation of the storage servers. In this configuration, and despite working on a data set composed of very small files, we measured an aggregate throughput of TýrFS as high as 36 GB/s across all clients with 64 storage servers and 1,536 concurrent client ranks.

12.6 Performance in support of BDA applications

For these experiments we set up a configuration according to best practices for BDA. We deploy both the storage system and the computation on all 48 nodes of the cluster.

Figure 12.9: TýrFS, Ceph and HDFS read throughput for a fixed-size data set, varying the size of the individual files it is composed of (aggregate read throughput in GB/s; individual file sizes from 32 KB to 32 MB).

We compare the performance of TýrFS with that of Ceph and HDFS. The experiments are based on a series of different Spark applications.

12.6.1 Measuring read throughput

We leverage SparkBench (103) to confirm that the design of TýrFS yields significant performance improvements with small files for applications exhibiting a wide range of I/O profiles. We use exactly the same setup as in the previous experiment, and execute three read-intensive benchmarks extracted from the benchmark suite: Connected Component (CC), Grep (GR) and Decision Tree (DT). These benchmarks are intended to evaluate the I/O performance of TýrFS when faced with different workload types, respectively graph processing, text processing and machine learning. In Figure 12.9 we plot the aggregate throughput obtained across all nodes when varying the size of the input data set files. Overall, the results are very similar to those obtained with Lustre and OrangeFS. This is not surprising, and is mostly due to the very similar architecture of all these file systems, in which the relative cost of metadata management increases as the file size decreases. For these experiments as well, the throughput drop we observe with TýrFS when using 32 KB files instead of 32 MB files is only about 25%. In similar conditions, other systems show a much more significant throughput drop of more than 90%. This results in a 3.8× increase of total computation time with OrangeFS, wasting computing resources.

Figure 12.10: TýrFS, Ceph and HDFS write throughput with the Sort (balanced I/O) and Tokenizer (write-intensive) applications (aggregate write throughput in GB/s).

12.6.2 Measuring write throughput

Not all applications are read-intensive. Although write performance is not a main target of TýrFS, it is critical to precisely understand the impact of dynamic metadata replication for balanced and write-intensive workloads as well. To do so, we add two more applications to the experiment: Sort (SO) from SparkBench, which illustrates a balanced I/O workload, and Tokenizer (TZ), which provides a write-intensive one. These applications are the same as in Chapters 5 and 11. In Figure 12.10 we plot the measured write throughput for Sort and Tokenizer with TýrFS, Ceph and HDFS. The performance of the write operations is similar across the two experiments. TýrFS exhibits a write performance consistent with that of Ceph. Ceph outperforms TýrFS and HDFS for write throughput by margins of 8% and 35%, respectively.

12.6.3 Application completion time

Considering the application as a whole, Figure 12.11 summarizes our results for Big Data applications. We measure the application completion time difference for 32 MB files relative to that obtained with 32 KB files. The throughput advantage of TýrFS over Ceph and HDFS with small files is clearly visible in the application completion time, allowing it to retain at least 73% of its performance when using 32 KB instead of 32 MB files. Ceph and HDFS in similar conditions retain less than 5% of their performance, hence wasting costly resources. Overall, the impact of small input files is

Figure 12.11: Application completion time difference with 32 MB input files relative to that obtained with 32 KB input files, with TýrFS, Ceph and HDFS; 100% represents the baseline performance for 32 MB files.

significantly lower for all three systems with the Sort and Tokenizer applications because of the higher influence of writes, for which all three systems show comparable performance. The write-intensive Tokenizer application gives a 4% advantage to Ceph because of its slightly higher write throughput compared to TýrFS.

12.7 Conclusions

We propose TýrFS as a POSIX I/O compliant file system atop Týr, especially optimized for small files. TýrFS enables clients to locate any piece of data in the cluster independently of any metadata server. A dynamic metadata replication module adapts to the workload to efficiently replicate the necessary metadata on the nodes from which the associated data is frequently accessed. Clients can then read frequently accessed data directly, without involving additional servers. We demonstrated a performance improvement of more than an order of magnitude for small file sizes compared to state-of-the-art storage systems. TýrFS is specifically designed to benefit use cases involving large numbers of small files. Yet, while outside the scope of this thesis, no fundamental part of its design makes TýrFS unfit for large files. The metadata distribution in TýrFS will however be less effective with large files. The main reason is that the metadata for any given file is colocated only with the first chunk of the file. Reading a different chunk will trigger a remote metadata fetch for the file, which is unnecessary for small files. These results further demonstrate the effectiveness of Týr's design and principles in

the context of storage-based convergence between HPC and BDA. In particular, TýrFS supports the needs of legacy applications expecting full POSIX I/O compliance, while enabling applications built to leverage blobs to benefit from additional performance improvements compared to traditional file-based storage.

Part IV

Conclusions: Achievements and Perspectives


Chapter 13

Conclusions

The challenges faced in achieving convergence between HPC and BDA platforms are staggering. While the literature tackling these hard problems for data processing is rich, very few works have studied this convergence from the angle of storage. The storage stacks leveraged by HPC and BDA applications remain largely separated today, effectively preventing application portability between platforms and limiting their applicability to a new generation of large-scale data processing and analytics platforms. Furthermore, the weaknesses of existing approaches limit their applicability in the context of HPC and BDA convergence. In particular, the centralized metadata management associated with the legacy file-based storage model causes high read and write latency because of the metadata management overhead. Existing blob storage systems either lack low-latency direct read capability, which hinders read throughput, or use lock-based concurrency control, which diminishes performance when facing high write concurrency. Finally, the need for write coordination is left up to the user to manage, making efficient applications very difficult to design when such features are needed. The work presented in this thesis led to the design of Týr, a storage system that addresses many of the limitations listed above. We experimentally proved the usefulness of Týr's design in a variety of large-scale applicative contexts on both HPC and BDA platforms, showing its applicability in the context of storage-based convergence. The main objective set at the beginning of this manuscript is thus fulfilled. In the rest of this chapter, we underline the relevance of our contributions presented so far, and show how they enabled us to solve a series of sub-objectives centered on our

main goal. Finally, we discuss several choices we made and detail related research perspectives that are yet to be explored.

13.1 Achievements

Blobs as a storage model for HPC and BDA convergence. By carefully studying the state of the practice in the field of large-scale storage for HPC and BDA platforms, we proposed binary large objects (blobs) as a relevant storage model for convergence. Indeed, the semantics and access patterns they permit are close to those of file systems, hence enabling transparent support for legacy applications. By not requiring centralized metadata management, they permit a variety of protocol optimizations such as direct reads, supporting their high throughput but coming at the expense of write consistency guarantees. Finally, state-of-the-art approaches demonstrated the benefits of efficient multi-version concurrency control for applications exhibiting high write contention. Yet, providing all these benefits at the same time is challenging. We consequently noted the limitations of existing approaches in that respect, which provide either low storage operation latency or high performance under concurrent access patterns, but not both.

A set of key design principles for large-scale, efficient converging storage. We proposed a set of key design principles that enable building a storage system targeted at HPC and BDA convergence by overcoming most limitations of the existing approaches. First, we advocate the relevance of predictable data and metadata distribution using consistent hashing techniques to enable direct read and write access to data without requiring centralized metadata management, while also uniformly distributing the load across the cluster. Transparent multi-version concurrency control supports high write performance for highly-contended I/O patterns that are increasingly common for a variety of applications (e.g. append logs, data aggregation). These features are leveraged to provide ACID transactional semantics, ensuring data consistency while letting users orchestrate multiple storage calls without requiring complex application-level coordination such as locking. Finally, these semantics enable atomic transform operations, which bring a significant throughput increase to applications performing data aggregation by reducing the amount of communication required for simple update patterns.

Týr: putting these principles into practice. We designed Týr, a distributed blob storage system implementing all of the above design principles. We defined the protocols that enable Týr to deliver its promise of strong concurrency, high velocity and horizontal scalability. A family of read protocols provides either direct access to the data, consistent multi-chunk reads, or transactional read isolation. While all these protocols provide strong read consistency, Týr is able to transparently select the protocol providing the highest throughput and lowest latency in a specific request context. A lightweight transactional write protocol coordinates writes across the system, providing linearizability at scale as well as supporting high-velocity atomic transform operations. This base protocol is extended with efficient version bookkeeping that supports the background garbage collection of expired data while retaining the repeatability of transactional reads. We fully implemented all these principles and protocols in a software prototype.

Real-life benefits demonstrated through multiple applicative scenarios. We successfully proved the relevance of Týr's design in four different applicative scenarios:

• A storage backend for a BDA monitoring system that substantially improves data aggregation performance. In the context of large-scale data monitoring, the efficiency of data aggregation is key to delivering relevant information in near real-time. We developed a Týr-based storage backend for a real-life monitoring system from CERN, and evaluated its performance using production data. Týr demonstrated superior data aggregation performance compared to other systems thanks to the atomic transform operations enabled by its transactional semantics, while constantly providing higher consistency guarantees and solid read performance due to the predictable data distribution it leverages. This work is the result of a collaboration with María S. Pérez, Alexandru Costan, Gabriel Antoniu and Jesús Montes.

• A storage platform for computational steering that supports high-velocity data logging at very large scale. We successfully deployed Týr in the context of data logging and computational steering at very large scale on HPC platforms. Through a series of synthetic experiments, we proved Týr's superior performance under extreme access concurrency. It provides up to 30%

higher append performance compared to other storage systems, and up to 5 orders of magnitude higher than state-of-the-art distributed file systems. We proved the horizontal scalability of Týr in this context by deploying our solution on up to 120,000 cores of the Theta supercomputer. This work was carried out during a 3-month internship at Argonne National Laboratory under the supervision of Philip Carns and Robert Ross, as well as María S. Pérez, Alexandru Costan, and Gabriel Antoniu.

• A versatile storage system for a variety of HPC and BDA applications. We demonstrated the applicability of Týr as storage for legacy applications through a series of thorough experiments with data-intensive HPC and BDA applications. In particular, we proved that the combination of its multi-version concurrency control and direct read access capability consistently provides high-performance access to the data despite the variety of I/O patterns exhibited. We successfully replicated the results obtained for HPC applications at scale on the Theta supercomputer. This work is the result of a collaboration with Álvaro Brandon, Yevhen Alvorov, María S. Pérez, Alexandru Costan, Gabriel Antoniu, Michael Kuhn, Philip Carns and Thomas Ludwig.

• TýrFS, a Týr-based distributed file system that improves performance with many small files. We finally demonstrated how Týr's design can serve as a base building block for other storage abstractions, taking the example of a file system optimized for the hard use-case of many small files. Indeed, highly-efficient direct read capabilities provide the very low read overhead that is required to efficiently cope with such a difficult I/O pattern. We combined this feature with dynamic metadata replication, showing performance improvements of more than an order of magnitude compared to other file systems from both the HPC and BDA communities. This work was carried out in collaboration with María S. Pérez, Alexandru Costan and Gabriel Antoniu.

13.2 Perspectives

The choices we made during the development of this work favored some research lines over others. Although this thesis yielded a wide variety of achievements, there certainly

are many directions that were left unexplored. In this section we detail the potential research areas that address new challenges related to this work.

Leveraging transactional operations in HPC contexts. In this thesis we demonstrated how the semantics and the set of guarantees offered by Týr's transactional architecture can prove beneficial for HPC applications in distributed logging contexts. We also detailed, with TýrFS, how transactions can prove crucial for the design of high-level storage abstractions. Indeed, a series of potential use-cases exists for explicit storage coordination on HPC systems. We are convinced that transactions have a case in HPC applications for collaboratively maintaining complex data structures. However, the lack of such semantics on current platforms makes it very hard to formally demonstrate the effectiveness of transactions when leveraged directly at the application level. Thus, a promising research direction is to study the impact of transactional operations and optimistic write coordination on the design of large-scale applications for next-generation HPC platforms. Such progress would obviate the need for application-level write coordination in most cases, while at the same time greatly simplifying the design of several I/O framework operations, among which MPI Collective I/O.

Exploring a wider variety of transactional storage abstractions over transactional blobs. We have shown with TýrFS how transactional blobs have the capacity to serve as a low-level distributed storage layer for a variety of storage abstractions. Yet, besides distributed file systems, there are many such abstractions that could potentially leverage blob storage. These include transactional key-value stores, transactional wide-column databases or graph databases. A great deal of work certainly exists in the domain, but we believe the angle of convergence is yet to be considered. Indeed, such abstractions could be deployed in a wide variety of applicative contexts, from both the HPC and BDA communities. One could imagine proposing Týr as a base storage layer on both cloud- and HPC-oriented platforms, letting users choose the exact interface and semantics they need without investing time and effort in low-level data management or portability across platforms.

Geo-distributed environments. A challenge closely tied to the Big Data movement is the increasing need to geographically distribute data processing and warehousing.

This is due to multiple factors, two of the most important ones being the lack of sufficient resource availability in a single location for very large data sets, and the natural geo-distribution of the data sources combined with the high cost of transferring such large amounts of data over large distances. The challenges associated with such distribution are even higher when considering transactional semantics and strong consistency models, which potentially incur latency that is incompatible with the near real-time requirements of many data analytics applications. A promising future research question is how transactional blob storage can fit into this larger picture, both by reducing processing latency and by proposing adequate consistency models and trade-offs from the angle of HPC and BDA convergence.

Týr as a base for HPC and BDA convergence. The future work associated with HPC and BDA convergence spans far beyond storage. We believe that this movement will lead to many unexplored and unsolved research directions, which notably include inherently convergent wide-area, multi-stage workflows. One could imagine HPC-like programming tools and techniques fitting into the picture, or further cloud processing systems being adapted to the specifics of HPC platforms. We advocate that a convergent storage platform such as the one Týr provides is able to fuel portability and reduce the scope of the challenges of data processing, by proposing a single storage model for a wide variety of upcoming applications and use-cases designed from the ground up with convergence in mind.

Bibliography

[1] Aurora. https://aurora.alcf.anl.gov, 2014. Accessed: 2017-11-20. 103, 107

[2] Marenostrum. https://www.bsc.es/marenostrum/marenostrum, 2014. Accessed: 2017-11-20. 103

[3] Mira. https://www.alcf.anl.gov/mira, 2014. Accessed: 2017-11-20. 106

[4] Sierra. https://computation.llnl.gov/computers/sierra, 2014. Accessed: 2017-11-20. 103

[5] Summit. https://www.olcf.ornl.gov/olcf-resources/compute-systems/summit/, 2014. Accessed: 2017-11-20. 103

[6] Big Data and Extreme-scale Computing (BDEC). http://www.exascale.org/bdec/, 2016. Accessed: 2017-11-24. 46

[7] Amazon Neptune. https://aws.amazon.com/neptune/, 2017. Accessed: 2018-01-07. 26

[8] FlatBuffers. http://google.github.io/flatbuffers/, 2017. Accessed: 2017-11-27. 88

[9] Grid’5000 – Rennes hardware. https://www.grid5000.fr/mediawiki/index.php/Rennes:Hardware, 2017. Accessed: 2017-10-14. 134

[10] KCP. https://github.com/skywind3000/kcp, 2017. Accessed: 2017-11-28. 88

[11] Running Spark on YARN. https://spark.apache.org/docs/latest/running-on-yarn.html, 2017. Accessed: 2017-11-27. 51

[12] Zlog. https://nwat.xyz/blog/2014/10/26/zlog-a-distributed-shared-log-on-ceph/, 2017. Accessed: 2017-11-28. 106

[13] Ceph: differences from POSIX. http://docs.ceph.com/docs/kraken/cephfs/posix/, 2018. Accessed: 2018-01-07. 28, 31, 123

[14] Aamodt, K., Quintana, A. A., Achenbach, R., Acounis, S., Adamová, D., Adler, C., Aggarwal, M., Agnese, F., Rinella, G. A., Ahammed, Z., et al. The ALICE experiment at the CERN LHC. Journal of Instrumentation 3, 08 (2008), S08002. 5, 93, 94

[15] Abramovici, A., Althouse, W. E., et al. LIGO: The laser interferometer gravitational-wave observatory. science (1992), 325–333. 124

[16] Agrawal, B., Chakravorty, A., Rong, C., and Wlodarczyk, T. W. R2time: a framework to analyse open TSDB time-series data in HBase. In Cloud Computing Technology and Science (CloudCom), 2014 IEEE 6th International Conference on (2014), IEEE, pp. 970–975. 16

[17] Agrawal, D., Butt, A., Doshi, K., Larriba-Pey, J.-L., Li, M., Reiss, F. R., Raab, F., Schiefer, B., Suzumura, T., and Xia, Y. SparkBench– a spark performance testing suite. In Technology Conference on Performance Evaluation and Benchmarking (2015), Springer, pp. 26–44. 47

[18] Akioka, S., and Muraoka, Y. Hpc benchmarks on amazon ec2. In Advanced Information Networking and Applications Workshops (WAINA), 2010 IEEE 24th International Conference on (2010), IEEE, pp. 1029–1034. 2

[19] Ali, N., Carns, P., Iskra, K., Kimpe, D., Lang, S., Latham, R., Ross, R., Ward, L., and Sadayappan, P. Scalable I/O forwarding framework for high-performance computing systems. In Cluster Computing and Workshops, 2009. CLUSTER’09. IEEE International Conference on (2009), IEEE, pp. 1–10. 21

[20] Amazon. Amazon web services. https://aws.amazon.com/, 2015. Accessed: 2017-12-10. 16, 24, 27

[21] Anderson, J. C., Lehnardt, J., and Slater, N. CouchDB: The Definitive Guide: Time to Relax. O’Reilly Media, Inc., 2010. 25

[22] Bailis, P., Fekete, A., Franklin, M. J., Ghodsi, A., Hellerstein, J. M., and Stoica, I. Coordination avoidance in database systems. Proceedings of the VLDB Endowment 8, 3 (2014), 185–196. 68

[23] Bairavasundaram, L. N., Arpaci-Dusseau, A. C., Arpaci-Dusseau, R. H., Goodson, G. R., and Schroeder, B. An analysis of data corruption in the storage stack. ACM Transactions on Storage (TOS) 4, 3 (2008), 8. 31

[24] Balakrishnan, M., Malkhi, D., Prabhakaran, V., Wobber, T., Wei, M., and Davis, J. D. Corfu: A shared log design for flash clusters. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation (2012), USENIX Association, pp. 1–1. 106

[25] Balouek, D., Carpen Amarie, A., Charrier, G., Desprez, F., Jeannot, E., Jeanvoine, E., Lèbre, A., Margery, D., Niclausse, N., Nussbaum, L., Richard, O., Pérez, C., Quesnel, F., Rohr, C., and Sarzyniec, L. Adding virtualization capabilities to the Grid’5000 testbed. In Cloud Computing and Services Science, I. Ivanov, M. Sinderen, F. Leymann, and T. Shan, Eds., vol. 367 of Communications in Computer and Information Science. Springer International Publishing, 2013, pp. 3–20. 49

[26] Bernstein, P. A., and Goodman, N. Multiversion concurrency control—theory and algorithms. ACM Transactions on Database Systems (TODS) 8, 4 (1983), 465–483. 28

[27] Berriman, G., Good, J., Curkendall, D., et al. Montage: An on-demand image mosaic service for the nvo. In Astronomical Data Analysis Software and Systems XII (2003), vol. 295, p. 343. 124

[28] Bloom, B. H. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM 13, 7 (1970), 422–426. 81

[29] Boichat, R., Dutta, P., Frølund, S., and Guerraoui, R. Deconstructing Paxos. ACM SIGACT News 34, 1 (2003), 47–67. 39

[30] Bolze, R., Cappello, F., Caron, E., Daydé, M., Desprez, F., Jeannot, E., Jégou, Y., Lanteri, S., Leduc, J., Melab, N., et al. Grid’5000: A large scale and highly reconfigurable experimental grid testbed. The International Journal of High Performance Computing Applications 20, 4 (2006), 481–494. 124

[31] Borthakur, D., et al. HDFS architecture guide. Hadoop Apache Project 53 (2008). 134

[32] Braam, P. J., et al. The Lustre storage architecture. White Paper, Cluster File Systems, Inc., Oct 23 (2003). 2

[33] Bray, T. The JavaScript Object Notation (JSON) data interchange format. 24

[34] Bray, T., Paoli, J., Sperberg-McQueen, C. M., Maler, E., and Yergeau, F. Extensible markup language (XML). World Wide Web Journal 2, 4 (1997), 27–66. 24

[35] Brewer, E. A. Towards robust distributed systems. In PODC (2000), vol. 7. 40

[36] Bryan, G. H., and Fritsch, J. M. A benchmark simulation for moist nonhydrostatic numerical models. Monthly Weather Review 130, 12 (2002), 2917–2928. 124

[37] Butnaru, D., Buse, G., and Pflüger, D. A parallel and distributed surrogate model implementation for computational steering. In 2012 11th International Symposium on Parallel and Distributed Computing (June 2012), pp. 203–210. 104

[38] Byun, C., Arcand, W., Bestor, D., Bergeron, B., Hubbell, M., Kepner, J., McCabe, A., Michaleas, P., Mullen, J., O'Gwynn, D., et al. Driving big data with big compute. In High Performance Extreme Computing (HPEC), 2012 IEEE Conference on (2012), IEEE, pp. 1–6. 25

[39] Calder, B., Wang, J., Ogus, A., Nilakantan, N., Skjolsvold, A., McKelvie, S., Xu, Y., Srivastav, S., Wu, J., Simitci, H., et al. Windows Azure Storage: a highly available cloud storage service with strong consistency. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (2011), ACM, pp. 143–157. 27, 95

[40] Cao, M., Bhattacharya, S., and Ts’o, T. Ext4: The next generation of ext2/3 filesystem. In LSF (2007). 129

[41] Carbone, P., Katsifodimos, A., Ewen, S., Markl, V., Haridi, S., and Tzoumas, K. Apache flink: Stream and batch processing in a single engine. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering 36, 4 (2015). 46

[42] Carlson, J. L. Redis in action. Manning Publications Co., 2013. 96

[43] Carns, P., Harms, K., Kimpe, D., Ross, R., Wozniak, J., Ward, L., Curry, M., Klundt, R., Danielson, G., Karakoyunlu, C., et al. A case for optimistic coordination in HPC storage systems. In High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion: (2012), IEEE, pp. 48–53. 68, 69

[44] Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems (TOCS) 26, 2 (2008), 4. 25

[45] Chodorow, K. MongoDB: The Definitive Guide: Powerful and Scalable Data Storage. O'Reilly Media, Inc., 2013. 2, 25

[46] Collet, Y. LZ4: Extremely fast compression algorithm. 88

[47] Connolly, T. M., and Begg, C. E. Database systems: a practical approach to design, implementation, and management. Pearson Education, 2005. 38

[48] Cook, R. L., Porter, T., and Carpenter, L. Distributed ray tracing. In ACM SIGGRAPH computer graphics (1984), vol. 18, ACM, pp. 137–145. 47

[49] Copeland, M., Soh, J., Puca, A., Manning, M., and Gollob, D. Microsoft Azure and cloud computing. In Microsoft Azure. Springer, 2015, pp. 3–26. 95

[50] Corbett, J. C., Dean, J., Epstein, M., Fikes, A., Frost, C., Furman, J. J., Ghemawat, S., Gubarev, A., Heiser, C., Hochschild, P., et al. Spanner: Google's globally distributed database. ACM Transactions on Computer Systems (TOCS) 31, 3 (2013), 8. 33, 39, 41, 67

[51] Darling, A. E., Carey, L., and Feng, W. C. The design, implementation, and evaluation of mpiBLAST. Tech. rep., Los Alamos National Laboratory, 2003. 47

[52] Das, A., Gupta, I., and Motivala, A. SWIM: Scalable weakly-consistent infection-style process group membership protocol. In Dependable Systems and Networks, 2002. DSN 2002. Proceedings. International Conference on (2002), IEEE, pp. 303–312. 83, 86

[53] Das, S., El Abbadi, A., and Agrawal, D. ElasTraS: An elastic transactional data store in the cloud. HotCloud 9 (2009), 131–142. 36

[54] De Prisco, R., Lampson, B., and Lynch, N. Revisiting the Paxos algorithm. In International Workshop on Distributed Algorithms (1997), Springer, pp. 111–125. 39

[55] DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., and Vogels, W. Dynamo: Amazon’s highly available key-value store. ACM SIGOPS operating systems review 41, 6 (2007), 205–220. 2, 22, 33, 62, 63

[56] Dede, E., Sendir, B., Kuzlu, P., Hartog, J., and Govindaraju, M. An evaluation of Cassandra for Hadoop. In Cloud Computing (CLOUD), 2013 IEEE Sixth International Conference on (2013), IEEE, pp. 494–501. 46

[57] Dorier, M., Sisneros, R., Peterka, T., Antoniu, G., and Semeraro, D. Damaris/Viz: a nonintrusive, adaptable and user-friendly in situ visualization framework. In Large-Scale Data Analysis and Visualization (LDAV), 2013 IEEE Symposium on (2013), IEEE, pp. 67–75. 104

[58] Dubois, M., Scheurich, C., and Briggs, F. Memory access buffering in multiprocessors. In ACM SIGARCH Computer Architecture News (1986), vol. 14, IEEE Computer Society Press, pp. 434–442. 32

[59] Escriva, R., and Sirer, E. G. The design and implementation of the Warp transactional filesystem. In 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16) (2016), USENIX Association. 36, 61

[60] Escriva, R., Wong, B., and Sirer, E. G. HyperDex: A distributed, searchable key-value store. In Proceedings of the ACM SIGCOMM 2012 conference on Applications, technologies, architectures, and protocols for computer communication (2012), ACM, pp. 25–36. 33, 41

[61] Escriva, R., Wong, B., and Sirer, E. G. Warp: Lightweight multi-key transactions for key-value stores. arXiv preprint arXiv:1509.07815 (2015). 36, 39, 41, 67, 72

[62] Fischer, P., Kruse, J., Mullen, J., Tufo, H., Lottes, J., and Kerkemeier, S. Nek5000 – spectral element CFD solver. Argonne National Laboratory, Mathematics and Computer Science Division, Argonne, IL, see https://nek5000.mcs.anl.gov/index.php/MainPage (2008). 2

[63] Fox, G., Qiu, J., Jha, S., Ekanayake, S., and Kamburugamuve, S. Big data, simulations and HPC convergence. In Big Data Benchmarking. Springer, 2015, pp. 3–17. 16

[64] Fox, G. C., Shantenu, J., Judy, Q., Saliya, E., and Andre, L. Towards a comprehensive set of big data benchmarks. Advances in Parallel Computing 26 (2015), 47–66. 46

[65] George, L. HBase: the definitive guide: random access to your planet-size data. O'Reilly Media, Inc., 2011. 25, 61

[66] Giardino, J., Haridas, J., and Calder, B. How to get most out of Windows Azure Tables, 2013. 26

[67] Glendenning, L., Beschastnikh, I., Krishnamurthy, A., and Anderson, T. Scalable consistency in Scatter. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (2011), ACM, pp. 15–28. 67

[68] GlusterFS. The Gluster web site. https://www.gluster.org/, 2015. Accessed: 2018-01-04. 20

[69] Graves, R., Jordan, T. H., Callaghan, S., Deelman, E., et al. CyberShake: A physics-based seismic hazard model for Southern California. Pure and Applied Geophysics 168, 3-4 (2011), 367–381. 124

[70] Gray, J., and Reuter, A. Transaction processing: concepts and techniques. Elsevier, 1992. 2, 67

[71] Griffies, S. M. Elements of the Modular Ocean Model (MOM). GFDL Ocean Group Tech. Rep 7 (2012), 620. 47

[72] Gropp, W., and Lusk, E. Users guide for MPICH, a portable implementation of MPI. Tech. rep., Argonne National Lab., IL (United States), 1996. 49, 105

[73] Große, F., Greenwood, N., Kreus, M., Lenhart, H., Machoczek, D., Pätsch, J., Salt, L. A., and Thomas, H. Looking beyond stratification: A model-based analysis of the biological drivers of oxygen depletion in the North Sea. Biogeosciences Discussions (2015), 2511–2535. 47

[74] Habib, S., Morozov, V., Finkel, H., et al. The universe at extreme scale: multi-petaflop sky simulation on the BG/Q. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (2012), IEEE Computer Society Press, p. 4. 124

[75] Haerder, T., and Reuter, A. Principles of transaction-oriented database recovery. ACM Computing Surveys (CSUR) 15, 4 (1983), 287–317. 36

[76] Han, J., Haihong, E., Le, G., and Du, J. Survey on NoSQL database. In Pervasive Computing and Applications (ICPCA), 2011 6th International Conference on (2011), IEEE, pp. 363–366. 23

[77] Harms, K., Leggett, T., Allen, B., Coghlan, S., Fahey, M., Holohan, C., McPheeters, G., and Rich, P. Theta: Rapid installation and acceptance of an XC40 KNL system. Concurrency and Computation: Practice and Experience 30, 1 (2018). 103, 106

[78] Hayashibara, N., Défago, X., Yared, R., and Katayama, T. The φ accrual failure detector. In Reliable Distributed Systems, 2004. Proceedings of the 23rd IEEE International Symposium on (2004), IEEE, pp. 66–78. 83

[79] Herlihy, M. P., and Wing, J. M. Axioms for concurrent objects. In Proceedings of the 14th ACM SIGACT-SIGPLAN symposium on Principles of programming languages (1987), ACM, pp. 13–26. 33

[80] Hunt, P., Konar, M., Junqueira, F. P., and Reed, B. ZooKeeper: Wait-free coordination for internet-scale systems. In USENIX Annual Technical Conference (2010), vol. 8, Boston, MA, USA. 33, 61, 96, 110

[81] Hutto, P. W., and Ahamad, M. Slow memory: Weakening consistency to enhance concurrency in distributed shared memories. In Distributed Computing Systems, 1990. Proceedings., 10th International Conference on (1990), IEEE, pp. 302–309. 33

[82] Jayalakshmi, D., Ranjana, T. R., and Ramaswamy, S. Dynamic data replication across geo-distributed cloud data centres. In International Conference on Distributed Computing and Internet Technology (2016), Springer, pp. 182–187. 129

[83] Karger, D., Lehman, E., Leighton, T., Panigrahy, R., Levine, M., and Lewin, D. Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the world wide web. In Proceedings of the twenty-ninth annual ACM symposium on Theory of computing (1997), ACM, pp. 654–663. 63

[84] Karmarkar, N., and Karp, R. M. The differencing method of set partitioning. Tech. rep., Technical Report UCB/CSD 82/113, Computer Science Division, University of California, Berkeley, 1982. 26

[85] Kelly, N. Node.js asynchronous compute-bound multithreading. 88

[86] Klophaus, R. Riak core: building distributed applications without shared state. In ACM SIGPLAN Commercial Users of Functional Programming (2010), ACM, p. 14. 16, 22, 33, 62, 63

[87] Ko, S., Zhao, J., Xia, J., Afzal, S., Wang, X., Abram, G., Elmqvist, N., Kne, L., Riper, D. V., Gaither, K., Kennedy, S., Tolone, W., Ribarsky, W., and Ebert, D. S. VASA: Interactive computational steering of large asynchronous simulation pipelines for societal infrastructure. IEEE Transactions on Visualization and Computer Graphics 20, 12 (Dec 2014), 1853–1862. 104

[88] Kranzlmüller, D., Kacsuk, P., and Dongarra, J. Recent advances in parallel virtual machine and message passing interface. The International Journal of High Performance Computing Applications 19, 2 (2005), 99–101. 33

[89] Kreps, J., Narkhede, N., Rao, J., et al. Kafka: A distributed messaging system for log processing. In Proceedings of the NetDB (2011), pp. 1–7. 110

[90] Kuhn, M. A semantics-aware I/O interface for high performance computing. In International Supercomputing Conference (2013), Springer, pp. 408–421. 21

[91] Laitala, J. Metadata management in distributed file systems. 128

[92] Lakshman, A., and Malik, P. Cassandra: a decentralized structured storage system. ACM SIGOPS Operating Systems Review 44, 2 (2010), 35–40. 2, 16, 22, 25, 33

[93] Lamport, L. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE transactions on computers, 9 (1979), 690–691. 33

[94] Lamport, L. Fast Paxos. Distributed Computing 19, 2 (2006), 79–103. 39

[95] Lamport, L., et al. Paxos made simple. ACM Sigact News 32, 4 (2001), 18–25. 39, 64, 67

[96] Lamport, L., and Massa, M. Cheap Paxos. In Dependable Systems and Networks, 2004 International Conference on (2004), IEEE, pp. 307–314. 39

[97] Latham, R., Ross, R., and Thakur, R. The impact of file systems on MPI-IO scalability. In European Parallel Virtual Machine/Message Passing Interface Users' Group Meeting (2004), Springer, pp. 87–96. 45

[98] Leach, P. J., and Naik, D. A common internet file system (CIFS/1.0) protocol. Tech. rep., Internet-Draft, IETF, 1997. 20

[99] Leavitt, N. Will NoSQL databases live up to their promise? Computer 43, 2 (2010). 22

[100] Lefevre, C. The CERN accelerator complex. Tech. rep., 2008. 93, 94

[101] Legrand, I., Newman, H., Voicu, R., Cirstoiu, C., Grigoras, C., Dobre, C., Muraru, A., Costan, A., Dediu, M., and Stratan, C. MonALISA: An agent based, dynamic service system to monitor, control and optimize distributed systems. Computer Physics Communications 180, 12 (2009), 2472–2498. 5, 93

[102] Legrand, I., Voicu, R., Cirstoiu, C., Grigoras, C., Betev, L., and Costan, A. Monitoring and control of large systems with MonALISA. Commu- nications of the ACM 52, 9 (2009), 49–55. 5, 93

[103] Li, M., Tan, J., Wang, Y., Zhang, L., and Salapura, V. SparkBench: a comprehensive benchmarking suite for in memory data analytic platform Spark. In Proceedings of the 12th ACM International Conference on Computing Frontiers (2015), ACM, p. 53. 46, 47, 135, 138

[104] Li, S., Lim, H., Lee, V. W., Ahn, J. H., Kalia, A., Kaminsky, M., Andersen, D. G., Seongil, O., Lee, S., and Dubey, P. Architecting to achieve a billion requests per second throughput on a single key-value store server platform. In ACM SIGARCH Computer Architecture News (2015), vol. 43, ACM, pp. 476–488. 30

[105] Liu, N., Cope, J., Carns, P., Carothers, C., Ross, R., Grider, G., Crume, A., and Maltzahn, C. On the role of burst buffers in leadership-class storage systems. In Mass Storage Systems and Technologies (MSST), 2012 IEEE 28th Symposium on (2012), IEEE, pp. 1–11. 14

[106] Lloyd, W., Freedman, M. J., Kaminsky, M., and Andersen, D. G. Don't settle for eventual: scalable causal consistency for wide-area storage with COPS. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (2011), ACM, pp. 401–416. 33

[107] Logan, J., and Dickens, P. Towards an understanding of the performance of MPI-IO in Lustre file systems. In 2008 IEEE International Conference on Cluster Computing (Sept 2008), pp. 330–335. 106

[108] Lorkowski, I., Pätsch, J., Moll, A., and Kühn, W. Interannual variability of carbon fluxes in the North Sea from 1970 to 2006–competing effects of abiotic and biotic drivers on the gas-exchange of CO2. Estuarine, Coastal and Shelf Science 100 (2012), 38–57. 47

[109] Mahgoub, A., Ganesh, S., Meyer, F., Grama, A., and Chaterji, S. Suitability of NoSQL systems—Cassandra and ScyllaDB—for IoT workloads. In Communication Systems and Networks (COMSNETS), 2017 9th International Conference on (2017), IEEE, pp. 476–479. 26

[110] Matri, P., Alforov, Y., Brandon, A., Kuhn, M., Carns, P., and Ludwig, T. Could blobs fuel storage-based convergence between HPC and Big Data? In Cluster Computing (Cluster), 2017 IEEE International Conference on (2017), IEEE, pp. 81–86. 4, 6

[111] Matri, P., Carns, P., Ross, R., Costan, A., Pérez, M. S., and Antoniu, G. SLoG: A large-scale logging middleware for HPC and Big Data convergence. In Distributed Computing Systems (ICDCS), 2018 IEEE 38th International Conference on (2018), IEEE. 5, 6

[112] Matri, P., Costan, A., Antoniu, G., Montes, J., and Pérez, M. S. Towards efficient location and placement of dynamic replicas for geo-distributed data stores. In Proceedings of the ACM 7th Workshop on Scientific Cloud Computing (2016), ACM, pp. 3–9. 5, 6, 129

[113] Matri, P., Costan, A., Antoniu, G., Montes, J., and Pérez, M. S. Týr: blob storage meets built-in transactions. In High Performance Computing, Networking, Storage and Analysis, SC16: International Conference for (2016), IEEE, pp. 573–584. 4, 5, 6

[114] Matri, P., Costan, A., Antoniu, G., Montes, J., and Pérez, M. S. Týr: Efficient Transactional Storage for Data-Intensive Applications. Technical Report RT-0473, Inria Rennes Bretagne-Atlantique; Universidad Politécnica de Madrid, Jan. 2016. 4, 5, 6

[115] Matri, P., Costan, A., Antoniu, G., and Pérez, M. S. TýrFS: Increasing small files access performance with dynamic metadata replication. In Cluster, Cloud and Grid Computing (CCGrid), IEEE/ACM 18th International Symposium on (2018), IEEE. 5, 6

[116] Matri, P., Pérez, M. S., Costan, A., Bougé, L., and Antoniu, G. Keeping up with storage: Decentralized, write-enabled dynamic geo-replication. Future Generation Computer Systems (2017). 5, 6, 129

[117] McLay, R., James, D., Liu, S., Cazes, J., and Barth, W. A user-friendly approach for tuning parallel file operations. In SC14: International Conference for High Performance Computing, Networking, Storage and Analysis (Nov 2014), pp. 229–236. 106

[118] Mehrotra, P., Djomehri, J., Heistand, S., Hood, R., Jin, H., Lazanoff, A., Saini, S., and Biswas, R. Performance evaluation of Amazon EC2 for NASA HPC applications. In Proceedings of the 3rd workshop on Scientific Cloud Computing (2012), ACM, pp. 41–50. 1

[119] Menascé, D. A., and Nakanishi, T. Optimistic versus pessimistic concurrency control mechanisms in database management systems. Information Systems 7, 1 (1982), 13–27. 37

[120] Metwally, A., Agrawal, D., and El Abbadi, A. Efficient computation of frequent and top-k elements in data streams. In International Conference on Database Theory (2005), Springer, pp. 398–412. 132

[121] Mishra, V. Titan graph databases with Cassandra. In Beginning Apache Cassandra Development. Springer, 2014, pp. 123–151. 26

[122] Momjian, B. PostgreSQL: introduction and concepts, vol. 192. Addison-Wesley New York, 2001. 94

[123] Moore, M., Bonnie, D., Ligon, B., Marshall, M., Ligon, W., Mills, N., Quarles, E., Sampson, S., Yang, S., and Wilson, B. OrangeFS: Advancing PVFS. FAST poster session (2011). 124, 127

[124] Mosberger, D. Memory consistency models. ACM SIGOPS Operating Systems Review 27, 1 (1993), 18–26. 32

[125] Mu, S., Cui, Y., Zhang, Y., Lloyd, W., and Li, J. Extracting more concurrency from distributed transactions. In OSDI (2014), vol. 14, pp. 479–494. 67

[126] Namiot, D. Time series databases. In DAMDID/RCDL (2015), pp. 132–137. 94

[127] Nicolae, B., Antoniu, G., Bougé, L., Moise, D., and Carpen-Amarie, A. BlobSeer: Next-generation data management for large scale infrastructures. Journal of Parallel and Distributed Computing 71, 2 (2011), 169–184. 28, 52, 62, 95

[128] Ongaro, D., and Ousterhout, J. K. In search of an understandable consensus algorithm. In USENIX Annual Technical Conference (2014), pp. 305–319. 39

[129] Patterson, D. A., Gibson, G., and Katz, R. H. A case for redundant arrays of inexpensive disks (RAID), vol. 17. ACM, 1988. 19

[130] Paz, J. R. G. Introduction to Azure Cosmos DB. In Microsoft Azure Cosmos DB Revealed. Springer, 2018, pp. 1–23. 25

[131] Pritchett, D. BASE: An ACID alternative. Queue 6, 3 (2008), 48–55. 37

[132] Rajasekaran, S., and Reif, J. Handbook of parallel computing: models, algo- rithms and applications. CRC Press, 2007. 45

[133] Ramanathan, S., Goel, S., and Alagumalai, S. Comparison of cloud database: Amazon's SimpleDB and Google's Bigtable. In Recent Trends in Information Systems (ReTIS), 2011 International Conference on (2011), IEEE, pp. 165–168. 25

[134] Reed, D. A., and Dongarra, J. Exascale computing and big data. Communications of the ACM 58, 7 (2015), 56–68. 16

[135] Reed, D. P. Implementing atomic actions on decentralized data. ACM Trans- actions on Computer Systems (TOCS) 1, 1 (1983), 3–23. 68

[136] Ross, R., Carns, P., and Metheny, D. Parallel file systems. In Data Engineering. Springer, 2009, pp. 143–168. 21

[137] Ross, R. B., Thakur, R., et al. PVFS: A parallel file system for Linux clusters. In Proceedings of the 4th annual Linux showcase and conference (2000), pp. 391–430. 2, 19, 20, 21, 33

[138] Sakr, S., Liu, A., Batista, D. M., and Alomari, M. A survey of large scale data management approaches in cloud environments. IEEE Communications Surveys & Tutorials 13, 3 (2011), 311–336. 27

[139] Sandberg, R., Goldberg, D., Kleiman, S., Walsh, D., and Lyon, B. Design and implementation of the Sun network filesystem. In Proceedings of the Summer USENIX conference (1985), pp. 119–130. 20

[140] Schmuck, F. B., and Haskin, R. L. GPFS: A shared-disk file system for large computing clusters. In FAST (2002). 2, 20, 33, 103, 107

[141] Schütt, T., Schintke, F., and Reinefeld, A. Scalaris: reliable transactional p2p key/value store. In Proceedings of the 7th ACM SIGPLAN workshop on ERLANG (2008), ACM, pp. 41–48. 36

[142] Schwan, P., et al. Lustre: Building a file system for 1000-node clusters. In Proceedings of the 2003 Linux symposium (2003), vol. 2003, pp. 380–386. 2, 19, 20, 28, 33, 47, 52, 103, 107, 124, 126, 134

[143] Sen, R., Farris, A., and Guerra, P. Benchmarking Apache Accumulo BigData distributed table store using its continuous test suite. In Big Data (BigData Congress), 2013 IEEE International Congress on (2013), IEEE, pp. 334–341. 26

[144] Shipman, G. M., Dillow, D. A., Oral, H. S., and Wang, F. The Spider center wide file system: from concept to reality. Tech. rep., Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Center for Computational Sciences, 2009. 54

[145] Shvachko, K., Kuang, H., Radia, S., and Chansler, R. The Hadoop distributed file system. In Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on (2010), IEEE, pp. 1–10. 15, 19, 20, 46, 52, 96

[146] Sigoure, B. OpenTSDB: The distributed, scalable time series database. Proc. OSCON 11 (2010). 94

[147] Skeen, D., and Stonebraker, M. A formal model of crash recovery in a distributed system. IEEE Transactions on Software Engineering, 3 (1983), 219–228. 39

[148] Srinivasan, V., Bulkowski, B., Chu, W.-L., Sayyaparaju, S., Gooding, A., Iyer, R., Shinde, A., and Lopatic, T. Aerospike: architecture of a real-time operational DBMS. Proceedings of the VLDB Endowment 9, 13 (2016), 1389–1400. 2, 16, 22, 62

[149] Stegert, C. Modelling the annual development of a zooplankton population in the North Sea. 47

[150] Stegert, C., Moll, A., and Kreus, M. Validation of the three-dimensional ECOHAM model in the German Bight for 2004 including population dynamics of Pseudocalanus elongatus. Journal of Sea Research 62, 1 (2009), 1–15. 47

[151] Stoica, I., Morris, R., Liben-Nowell, D., Karger, D. R., Kaashoek, M. F., Dabek, F., and Balakrishnan, H. Chord: a scalable peer-to-peer lookup protocol for internet applications. IEEE/ACM Transactions on Networking (TON) 11, 1 (2003), 17–32. 63

[152] Stone, J. An efficient library for parallel ray tracing and animation. Tech. rep., Intel Supercomputer Users Group Proceedings, 1995. 47

[153] Szeredi, M. FUSE: Filesystem in userspace. http://fuse.sourceforge.net/, 2010. Accessed: 2017-11-01. 53

[154] Terry, D. B., Theimer, M. M., Petersen, K., Demers, A. J., Spreitzer, M. J., and Hauser, C. H. Managing update conflicts in Bayou, a weakly connected replicated storage system, vol. 29. ACM, 1995. 68

[155] Tesoriero, C. Getting Started with OrientDB. Packt Publishing Ltd, 2013. 26

[156] Tessier, F., Vishwanath, V., and Jeannot, E. TAPIOCA: An I/O library for optimized topology-aware data aggregation on large-scale supercomputers. In 2017 IEEE International Conference on Cluster Computing (CLUSTER) (Sept 2017), pp. 70–80. 106

[157] Thakur, R., Gropp, W., and Lusk, E. Data sieving and collective I/O in ROMIO. In Frontiers of Massively Parallel Computation, 1999. Frontiers '99. The Seventh Symposium on the (1999), IEEE, pp. 182–189. 35

[158] Thanh, T. D., Mohan, S., Choi, E., Kim, S., and Kim, P. A taxonomy and survey on distributed file systems. In Networked Computing and Advanced Information Management, 2008. NCM’08. Fourth International Conference on (2008), vol. 1, IEEE, pp. 144–149. 19

[159] The Apache Foundation. Apache Cassandra. https://cassandra.apache.org/, 2014. Accessed: 2017-11-20. 25

[160] Thomas, R. H. A majority consensus approach to concurrency control for multiple copy databases. ACM Transactions on Database Systems (TODS) 4, 2 (1979), 180–209. 36

[161] Tiepolo, G. Getting started with RethinkDB. Packt Publishing Ltd, 2016. 25

[162] Tsujita, Y., Yoshinaga, K., Hori, A., Sato, M., Namiki, M., and Ishikawa, Y. Multithreaded two-phase I/O: Improving collective MPI-IO performance on a Lustre file system. In 2014 22nd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (Feb 2014), pp. 232–235. 106

[163] Vilayannur, M., Lang, S., Ross, R., Klundt, R., Ward, L., et al. Extending the POSIX I/O interface: a parallel file system perspective. Tech. rep., Argonne National Laboratory (ANL), 2008. 21

[164] Vora, M. N. Hadoop-HBase for large-scale data. In Computer Science and Network Technology (ICCSNT), 2011 International Conference on (2011), vol. 1, IEEE, pp. 601–605. 46, 126

[165] Vukotic, A., Watt, N., Abedrabbo, T., Fox, D., and Partner, J. Neo4j in action. Manning Publications Co., 2014. 26

[166] Wang, L., Zhan, J., Luo, C., Zhu, Y., Yang, Q., He, Y., Gao, W., Jia, Z., Shi, Y., Zhang, S., Zheng, C., Lu, G., Zhan, K., Li, X., and Qiu, B. BigDataBench: A big data benchmark suite from Internet services. In 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA) (Feb 2014), IEEE. 46, 47

[167] Weil, S. A., Brandt, S. A., Miller, E. L., Long, D. D., and Maltzahn, C. Ceph: A scalable, high-performance distributed file system. In Proceedings of the 7th symposium on Operating systems design and implementation (2006), USENIX Association, pp. 307–320. 2, 16, 20, 27, 28, 31, 52, 56, 106, 123, 128, 134

[168] Weil, S. A., Brandt, S. A., Miller, E. L., and Maltzahn, C. CRUSH: Controlled, scalable, decentralized placement of replicated data. In Proceedings of the 2006 ACM/IEEE conference on Supercomputing (2006), ACM, p. 122. 56

[169] Weil, S. A., Leung, A. W., Brandt, S. A., and Maltzahn, C. RADOS: a scalable, reliable storage service for petabyte-scale storage clusters. In Proceedings of the 2nd international workshop on Petascale data storage: held in conjunction with Supercomputing '07 (2007), ACM, pp. 35–44. 52, 62, 95, 123

[170] Yang, S., Yan, X., Zong, B., and Khan, A. Towards effective partition management for large graphs. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (2012), ACM, pp. 517–528. 26

[171] Ye, J., McGinnis, S., and Madden, T. L. BLAST: improvements for better sequence analysis. Nucleic acids research 34, suppl 2 (2006), W6–W9. 47

[172] Yi, H., Rasquin, M., Fang, J., and Bolotnov, I. A. In-situ visualization and computational steering for large-scale simulation of turbulent flows in complex geometries. In 2014 IEEE International Conference on Big Data (Big Data) (Oct 2014), pp. 567–572. 104

[173] Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J., Shenker, S., and Stoica, I. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation (2012), USENIX Association, pp. 2–2. 46, 47, 135

[174] Zhang, Y., Power, R., Zhou, S., Sovran, Y., Aguilera, M. K., and Li, J. Transaction chains: achieving serializability with low latency in geo-distributed storage systems. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (2013), ACM, pp. 276–291. 39

[175] Zheng, Q., Ren, K., Gibson, G., Settlemyer, B. W., and Grider, G. DeltaFS: Exascale file systems scale better without dedicated servers. In Proceed- ings of the 10th Parallel Data Storage Workshop (2015), ACM, pp. 1–6. 2

[176] Ziegeler, S., Atkins, C., Bauer, A., and Pettey, L. In situ analysis as a parallel I/O problem. In Proceedings of the First Workshop on In Situ Infrastructures for Enabling Extreme-Scale Analysis and Visualization (New York, NY, USA, 2015), ISAV2015, ACM, pp. 13–18. 104
