DETERMINISTIC EXECUTION OF MULTITHREADED APPLICATIONS FOR RELIABILITY OF MULTICORE SYSTEMS


Dissertation

for the purpose of obtaining the degree of doctor at the Technische Universiteit Delft, by the authority of the Rector Magnificus Prof. ir. K. C. A. M. Luyben, chairman of the Board for Doctorates, to be defended in public on Friday 19 June 2015 at 10:00

by

Hamid MUSHTAQ

Master of Science in Embedded Systems
born in Karachi, Pakistan

This dissertation has been approved by the:
Promotor: Prof. dr. K. L. M. Bertels
Copromotor: Dr. Z. Al-Ars

Composition of the doctoral committee:
Rector Magnificus, chairman
Prof. dr. K. L. M. Bertels, Technische Universiteit Delft, promotor
Dr. Z. Al-Ars, Technische Universiteit Delft, copromotor

Independent members:
Prof. dr. H. Sips, Technische Universiteit Delft
Prof. dr. N. H. G. Baken, Technische Universiteit Delft
Prof. dr. T. Basten, Technische Universiteit Eindhoven, Netherlands
Prof. dr. L. J. M. Rothkrantz, Netherlands Defence Academy
Prof. dr. D. N. Pnevmatikatos, Technical University of Crete, Greece

Keywords: Multicore, Fault Tolerance, Reliability, Deterministic Execution, WCET

The work in this thesis was supported by Artemis through the SMECY project (grant 100230).

Cover image: An artist's impression of a view of Saturn from its moon Titan. The image was taken from http://www.istockphoto.com and used with permission.

ISBN 978-94-6186-487-1

Copyright © 2015 by H. Mushtaq

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means without the prior written permission of the copyright owner.

Printed in the Netherlands

Dedicated to my family

ACKNOWLEDGEMENTS

I am thankful to Almighty Allah, who in His infinite mercy has guided me to complete my PhD. My deepest gratitude goes to my supervisors, Dr. Zaid Al-Ars and Professor Koen Bertels. Their professional guidance, enthusiastic encouragement, valuable support, useful critiques and constructive recommendations were the key motivations throughout my PhD. I have been amazingly fortunate to have an advisor like Dr. Zaid Al-Ars, who gave me the freedom to explore on my own and at the same time guided me whenever I faltered. I am thankful to him for carefully reading and commenting on countless revisions of research papers and for encouraging me to meet high standards.

I am especially indebted to my beloved wife and children, Ismail and Khadija, for their unflagging support, strength and encouragement throughout my PhD, and for understanding the importance of this work and its hectic working hours.

I would also like to thank the CE staff, including CE secretary Lidwina Tromp and system administrators Eef and Erik, for their administrative and technical assistance throughout. Moreover, I take this opportunity to thank my fellow PhD students and friends, such as Imran Ashraf, Nauman Ahmed, Mottaqiullah Taouil, Innocent Agbo, George Razvan Voicu, Cuong Pham, Tariq Abdullah, Olivier Debant, Muhammad Nadeem, Faisal Nadeem, Seyab Khan, Fakhar Anjam, AN Tabish, Shah Muhammad, Shehzad Gishkori, Faisal Aslam, Laiq Hassan, Atif Khan and many more. They all helped make my time during my PhD more fun and interesting.

Furthermore, a special thanks to my family, for encouraging me in all my pursuits and inspiring me to follow my dreams. I have no words to express how grateful I am to my parents for all of the sacrifices that you have made on my behalf. Your prayers and support have always benefited me. I would also like to thank my brothers Ishtiaq and Aleem and my sister Asma for their love and support.

I am also highly obliged to my father-in-law and my mother-in-law for their encouragement and support. I am especially thankful to my mother-in-law for coming here and helping us during the birth of my daughter Khadija.

Finally, I would like to dedicate this thesis to all my teachers, professors, friends, family and so many others whom I have inadvertently left out. I sincerely thank all of them for their help, support and guidance.

Hamid Mushtaq Delft, June 2015


SUMMARY

Constant reduction in the size of transistors has made it possible to implement many cores on a single die. However, smaller transistors are more susceptible to both temporary and permanent faults. To make such systems more reliable, online fault tolerance techniques can be applied. A common approach for providing fault tolerance is to perform redundant execution of the software by means of program replication. In this approach, the replicated copies of a program (known as replicas) follow the same execution sequence and produce the same output if given the same input. This requirement necessitates that the replicas handle non-deterministic events, such as asynchronous signals and non-deterministic functions, deterministically. This is usually done by having one replica log the non-deterministic events and having the other replicas replay them at the same point in program execution. In a shared memory multithreaded program, this also means that the replicas perform non-deterministic shared memory accesses deterministically, so that they do not diverge in the absence of faults.

In this thesis, we employ two techniques for doing so: record/replay and deterministic multithreading. Both of our schemes are implemented using a user-level library and do not require a modified kernel. Moreover, they are very portable, since they do not depend upon any special hardware for deterministic execution. In addition, we compare the advantages and disadvantages of both schemes in terms of performance, memory consumption and reliability. We also show how our techniques improve upon existing techniques in terms of performance, scalability and portability. Lastly, we implement specialized hardware extensions to further improve the performance and scalability of deterministic multithreading.

Deterministic multithreading is useful not only for fault tolerance, but also for debugging and testing of multithreaded applications running on a multicore system. It can be useful in reducing the time needed to calculate the worst-case execution time (WCET) of tasks running on multicore systems, as deterministic multithreading reduces the number of possible states a multithreaded program can reach. Finding a good (less pessimistic) WCET estimate of a real time task is much simpler if it runs on a single core processor than if it runs on a multicore processor concurrently with other tasks. This is because those tasks can share resources, such as a shared cache or a shared bus, and/or may need to concurrently read and/or write shared data. In this thesis, we show that using deterministic shared memory accesses helps in reducing the number of possible states used by the estimation and therefore reduces the WCET calculation time. Moreover, we implement optimizations to further reduce the WCET calculation time as well as to get a tighter WCET estimate, besides utilizing our specialized hardware extensions for that purpose.


SAMENVATTING

Due to the continuous shrinking of transistors, it has become possible to place multiple processors on a single chip. However, smaller transistors are more sensitive to both temporary and permanent faults. To make such systems more reliable, online fault tolerance techniques can be applied. One possible way to implement fault tolerance is by replicating a program. In this approach, the replicated copies of a program (so-called replicas) are executed in the same manner, so that they can produce the same output. To make this possible, we have to make asynchronous signals and non-deterministic functions deterministic. This is usually done by recording the non-deterministic events in one replica and then replaying these recorded events in the other replicas.

In this thesis, we have used two techniques to guarantee deterministic behavior: record/replay and deterministic multithreading. Both techniques are very portable, since they require neither operating system kernel modifications nor special hardware. Furthermore, in this thesis we have compared the advantages and disadvantages of both techniques in terms of speed, memory usage and reliability. We have also shown how our techniques improve upon existing techniques in terms of speed, scalability and portability. Finally, we have implemented specialized hardware extensions for deterministic multithreading to improve speed and scalability.

Deterministic multithreading is useful not only for fault tolerance, but also for testing multithreaded applications that run on a multicore system. It can be useful in reducing the time needed to calculate the worst-case execution time (WCET) of programs running on multicore systems. Calculating a good WCET approximation is much simpler when a program runs on a single processor than when it runs on multiple processors concurrently. In this thesis, we show that the use of a deterministic memory model can considerably reduce the complexity of calculating the WCET. Furthermore, we have implemented several optimizations to reduce this complexity even further.


CONTENTS

Summary
Samenvatting
1 Introduction
   1.1 Basic concepts
   1.2 Background and related work
      1.2.1 Fault tolerance
      1.2.2 WCET calculation
   1.3 Our contribution
      1.3.1 Fault tolerance for multicore systems
      1.3.2 WCET calculation
   1.4 Thesis organization
2 Fault tolerance techniques for multicore systems
3 Fault tolerance using record/replay
4 Deterministic multithreading
5 Comparison of fault tolerance methods
6 WCET calculation using deterministic multithreading
7 Conclusions and future research
List of Publications
Curriculum Vitæ


1 INTRODUCTION

While modern nano-scale technology has made it possible to implement multiple cores on a single die, it has also aggravated the reliability problem, as smaller transistors are more prone to permanent and transient faults. However, online fault tolerance techniques can mitigate errors in such systems. For that purpose, this thesis employs two techniques: record/replay and deterministic multithreading. Besides fault tolerance, deterministic multithreading can also help in reducing the calculation time for the worst-case execution time (WCET) of real time systems running on shared memory multicore systems. Therefore, in this thesis, we also quantify the improvement in calculation time that deterministic multithreading brings.

In this chapter, Section 1.1 discusses the basic concepts of fault tolerance and WCET. This is followed by Section 1.2 on background and related work. Next, we define our contributions in Section 1.3. Finally, we describe the thesis organization in Section 1.4.

1.1. BASIC CONCEPTS

In this section, we present the basic concepts related to the field of fault tolerance and WCET. Our discussion is based on the way a system behaves and interacts with other systems in its environment [2]. In this thesis, a system can be hardware based, for example a processor, or software based, such as a running application. A system consists of components which can be systems themselves. The service delivered by a system is its behavior as perceived by other systems using it. The total state of a system is the set of its internal and external states. The external state of a system is represented by its output. A system is said to fail when its external state deviates from the correct state. The cause of this failure is one or more faults within the system or external to it.

Fault propagation is illustrated in Figure 1.1. When a fault becomes active, it impacts the total state of one or more components of the system. The deviation of the total state of a component from the correct state is known as an error. When an error propagates to affect the external state of the system, the error is said to be activated. When the error is activated, failure of the system is said to have occurred. The time between fault activation and failure is known as the error latency. In other words, a fault might lead to an error, which in turn might lead to the failure of the system.

Figure 1.1: Fault propagation and fault tolerance

FAULTS

Faults can be classified into four different classes depending on their persistence, effect, source and boundary.

With respect to persistence, a fault can be permanent, intermittent or transient. Permanent faults are continuous in time, while a transient fault occurs for only a short period of time. An intermittent fault is a repetitive malfunction of a device or system that occurs at intervals.

With respect to effect, a fault can be either activated or dormant. An activated fault is one which has produced an error, while a dormant fault is one that has yet to produce an error. An activated fault can be further classified into latent and detected, where a latent fault is one which has produced an error that has still not been detected by the system.

The source of a fault can be either software or hardware. Software faults can be, for example, design faults or malicious attacks like trojan horses. Lastly, a fault can be either due to a component internal to the system or external to it.

A fault can produce a number of errors in a computing system, such as control-flow errors, data corruption errors, logical errors, buffer overflows, memory leaks, data races, deadlocks/livelocks, infinite loops and wild memory writes.

FAILURES

Failures can be classified into three different classes based on their domain, action and consistency.

In terms of domain, failures can be either timing related or content related. Timing failures mean that the failing system either responds too early or too late. On the other hand, content failures mean that the content of the information delivered by the system is in a corrupt state.

In terms of action taken by a failing system, failures can be divided into halt and erratic failures. By halt failure, we mean that the system stops responding on failure, while erratic failure means that the failing system keeps responding but in an abnormal manner. Halting on failure is a good property, as errors are not propagated to other systems in the environment. Systems which halt on failure are known as fail-stop systems.

In terms of consistency, there can be byzantine and consistent failures. When a byzantine failure happens, some or all users of the system will perceive different service. On the other hand, for consistent failures, all users will perceive identical service. Service failure of a system causes a permanent or transient external fault for other system(s) that receive service from that system.

FAULT TOLERANCE

Fault tolerance means to avoid failures in the presence of faults. A system is said to be fault tolerant if faults do not affect the external state of that system. It can however allow its components to fail, as long as they do not corrupt its external state. A fault tolerant system must be able to detect errors and recover from them. The time between fault activation and error detection is known as the error detection latency.

Figure 1.2 shows the steps which are taken to make a system fault tolerant. These steps are proactive fault management, error detection, fault diagnosis and recovery. More details about these steps are discussed in Chapter 2.

WORST CASE EXECUTION TIME (WCET)

Adapting multicore systems to real time embedded systems is a challenging task, as a real time process, besides being error free, must also meet timing deadlines. The real time scheduler needs to know the maximum time in which a task will complete. This time is known as the WCET. Finding a good (less pessimistic) WCET estimate of a task is much simpler if it runs on a single core processor than if it runs on a multicore processor concurrently with other tasks. This is because those tasks can share resources, such as shared caches or a shared bus, and/or may need to concurrently read and/or write shared data.
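As a minimal single-core illustration of the concept (a generic, textbook-style example of our own, not one of the benchmarks analyzed in this thesis; the bound N_MAX is hypothetical), the WCET of a task is governed by its worst-case path and loop bounds rather than by typical inputs:

    #define N_MAX 1024           /* hypothetical, documented loop bound */

    /* The caller guarantees n <= N_MAX. A WCET analyzer must charge this
       task for N_MAX iterations and the more expensive branch arm, even
       if typical inputs use a small n and the cheap arm. */
    long task(const long *buf, int n)
    {
        long sum = 0;
        for (int i = 0; i < n; i++)
            sum += (buf[i] < 0) ? -buf[i] : buf[i];
        return sum;
    }

On a multicore, the analysis additionally depends on what the other cores do to the shared cache, bus and data, which is what the rest of this thesis addresses.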

1.2. BACKGROUND AND RELATED WORK

1.2.1. FAULT TOLERANCE

A fault tolerant system which uses redundant execution needs to make sure that the redundant processes do not diverge in the absence of faults, that is, they should have the same states for the same input. In a single threaded program, in the absence of any fault, the only possible causes of divergence among the replicas are non-deterministic functions (such as gettimeofday) or asynchronous signals/interrupts. However, in multithreaded programs running on multicore processors, there is one more source of non-determinism, which is shared memory accesses. These accesses are much more frequent than interrupts or signals. Therefore, efficient deterministic execution of replicas in such systems is much more difficult to achieve.

Figure 1.2: Classification of steps used for fault tolerance

One method to ensure that redundant processes access shared memory in the same order is record/replay. In this method, all interleavings of shared memory accesses by different cores or processors are recorded in a log, which can be replayed to have a replica which follows the original execution. Examples of schemes using this method are Rerun [3] and Karma [4]. These schemes intercept cache coherence protocols to record inter-processor data dependencies, so that they can be replayed later on in the same order. While Rerun only optimizes recording, Karma optimizes both recording and replaying, thus making it suitable for online fault tolerance. It shows good scalability as well. One disadvantage of record/replay approaches is that they require a large memory for recording. Moreover, when used for fault tolerance, the redundant processes need to communicate with each other, as one replica records the log while the other reads from it. Respec [5] is a record/replay software approach that only logs synchronization objects rather than every shared memory access. If divergence is found between the replicas, it rolls back and re-executes from a previous checkpoint. However, if divergence is found again on re-execution, a race condition is assumed. At that point, a stricter deterministic execution is performed, which can induce a large overhead.

The disadvantage of employing record/replay for deterministic shared memory accesses is that it requires communication between the replicas for shared memory accesses, making the fault tolerance method less reliable, as the shared buffer used for communication can itself become corrupted by one of the replicas. Moreover, it requires extra memory.

To eliminate this communication and memory requirement, we can employ deterministic multithreading, where a multithreaded process always has the same memory interleaving for the same input. The ideal situation would be to make a multithreaded program deterministic even in the presence of race conditions, that is, to provide strong determinism. This is not possible to do efficiently with software alone though. One can use a relaxed memory model where every thread writes to its own private memory, while data is committed to shared memory only at intervals. However, stopping threads regularly for committing to shared memory degrades performance, as demonstrated by CoreDet [6]. We can reduce the amount of committing to the shared memory by only committing at synchronization points such as locks, barriers or thread creation. This approach is taken by DTHREADS [7]. Here one can still imagine the slowdown in the case of applications with high lock frequencies. Moreover, since in this case committing to the shared memory is done less frequently, more data has to be committed at a time, thus also making it slow for applications with high memory usage. This is why hardware approaches have been proposed to increase the efficiency of deterministic execution.
An example of such an approach is Calvin [8], which uses the same concept as CoreDet for deterministic execution but makes use of special hardware for that purpose.

Since performing deterministic execution in software alone is inefficient, one can relax the requirements to improve efficiency. For example, Kendo [9] does this by only supporting deterministic execution for well written programs that protect every shared memory access through locks. In other words, it supports deterministic execution only for programs without race conditions. The authors of Kendo call this weak determinism. Considering the fact that most well written programs are race free and there exist tools to detect race conditions, such as Valgrind [1], weak determinism is sufficient for most well written multithreaded programs. The basic idea of Kendo is that it uses a logical clock for each thread to determine when a thread may acquire a lock: the thread with the least value of the logical clock gets the lock.

Though quite efficient, Kendo still suffers from portability problems. First of all, it requires deterministic hardware performance counters for counting logical clocks. Many popular platforms (including many x86 platforms) do not have any hardware performance counters that are deterministic [10]. Secondly, Kendo needs modification of the kernel to allow reading from the hardware performance counters.

A more detailed survey of the fault tolerance techniques for shared memory multicore systems is given in Chapter 2.
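To make the logical-clock idea concrete, the following is a minimal sketch written for illustration only; it is not Kendo's actual implementation (Kendo derives clocks from deterministic performance counter events and pauses them while a thread waits), and the placement of the tick() calls is our assumption:

    /* Illustrative sketch: deterministic lock acquisition driven by
       per-thread logical clocks. */
    #include <pthread.h>
    #include <stdatomic.h>

    #define NTHREADS 4

    static atomic_long clocks[NTHREADS];   /* per-thread logical clocks */
    static pthread_mutex_t real_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Instrumentation inserted at deterministic points (e.g. once per
       basic block), so a clock depends only on the thread's own progress,
       never on physical timing. */
    static void tick(int id) { atomic_fetch_add(&clocks[id], 1); }

    /* A thread may acquire the lock only when its clock is the global
       minimum (ties broken by thread id), so the acquisition order is a
       pure function of the clock sequences, i.e. deterministic. */
    static int is_my_turn(int id)
    {
        long mine = atomic_load(&clocks[id]);
        for (int t = 0; t < NTHREADS; t++) {
            long c = atomic_load(&clocks[t]);
            if (t != id && (c < mine || (c == mine && t < id)))
                return 0;
        }
        return 1;
    }

    static void det_lock(int id)
    {
        while (!is_my_turn(id))
            ;                     /* spin until all other clocks pass ours */
        pthread_mutex_lock(&real_lock);
        tick(id);                 /* move our clock past the acquisition */
    }

    static void det_unlock(int id)
    {
        pthread_mutex_unlock(&real_lock);
        tick(id);
    }

Note the liveness caveat this simplification exposes: a thread that stops ticking stalls every lock acquisition, which is why real systems pause and fast-forward clocks around blocking operations.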

1.2.2. WCET CALCULATION

Modern processors have features such as cache hierarchies and out of order execution, which are meant to improve the average case execution time of programs running on them. However, these features make it much more difficult to determine a tight WCET. In addition, more complex architectures mean more states for a model checker to keep track of, making it more prone to state explosion problems. Despite these problems, there exist production level tools, such as Chronos [11], that can estimate a good WCET for programs running on single core processors.

Multicore systems, on the other hand, have an additional complexity due to shared resources, such as shared memories. With shared memory, tasks running on different cores also need to synchronize to access the shared data, for example by using locks. This makes it difficult to deduce tight WCET bounds for such systems. Synchronization of shared memory accesses also means many different possible interleavings of the threads, which further aggravates the problem of calculating the WCET; a rough count at the end of this subsection illustrates the blow-up. Multicore systems can also have timing anomalies due to shared resources and shared memory accesses. For example, assume that a path ABD is the worst-case path if seen separately, where A, B and D are basic blocks. In the presence of a shared L2 cache however, another path, say ACD, might become the worst-case path if a thread running on another core evicts more instructions from C than from B in the L2 cache. Therefore, whenever analyzing WCET for a multicore, we always need to consider all the tasks running on different cores together, which can significantly increase the complexity of timing analysis.

Recently, several papers have been published which deal with calculating WCET on multicore processors. A survey of those techniques is given in [12]. Some of those assume that there are no shared memory accesses by the tasks running on the different cores. In other words, they assume that tasks are running embarrassingly parallel to each other. They only cater for the problem of shared L2 cache accesses [13][14] and the shared bus [15]. Papers like [16] and [17] do consider shared memory synchronization, but they assume simpler processor architectures which do not have any cache, but only scratchpad memories. Such processors are not mainstream and require special programming techniques, since the scratchpad memories have to be manually managed by the programmer. [18] considers both cache coherence as well as synchronization operations, such as spin locks, for shared memory accesses. The authors use UPPAAL [19] based model checking for that purpose. They do take into account shared memory accesses, but their solution suffers from the state explosion problem even for very simple programs. [22] also uses model checking but does not support synchronization operations. [20] recently proposed a mathematical model to determine the WCET of multicore systems with caches and cache coherence using abstract interpretation. However, they still do not consider the cache coherence that is generated due to accessing the shared synchronization objects. Moreover, they do not perform any evaluation.

A more detailed survey of the WCET calculation techniques for shared memory multicore systems is given in Chapter 6.
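As the rough count promised above (a standard combinatorial bound, not a result from this thesis): n threads that each perform k ordered operations admit

    I(n, k) = (nk)! / (k!)^n

distinct interleavings. For n = 2, k = 4 this is 8!/(4! 4!) = 70, but for n = 4, k = 4 it is already 16!/(4!)^4 = 63,063,000. Deterministic multithreading collapses this space to a single admissible interleaving per input, which is what makes model checking based WCET estimation tractable.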

1.3. OUR CONTRIBUTION

1.3.1. FAULT TOLERANCE FOR MULTICORE SYSTEMS

A common approach for providing fault tolerance is to perform redundant execution of the software. This is done by using the state machine replication approach [21]. In this approach, the replicated copies of a process (known as replicas) follow the same execution sequence and produce the same output if given the same input. This requirement necessitates that the replicas handle non-deterministic events, such as asynchronous signals and non-deterministic functions (such as gettimeofday), deterministically. This is usually done by having one replica log the non-deterministic events and having the other replicas replay them at the same point in program execution. In a shared memory multithreaded program, this also means that the replicas perform non-deterministic shared memory accesses deterministically, so that they do not diverge in the absence of faults.

One way of making sure that the redundant processes access the shared memory in the same order is to perform record/replay, where the leader process records the order of locks (used to access shared memory) in a queue which is shared between the leader and the follower. The follower in turn reads from that queue to enforce the same lock acquisition order. This approach is used by Respec [5], and it is the first approach that we use in this thesis; a minimal sketch of the queue mechanism follows the contribution list below. The second approach that we use is deterministic multithreading, where, given the same input, a multithreaded process always has the same lock interleaving. This makes sure that the redundant processes acquire the locks in the same order without communicating with each other. We adapt the method used by Kendo [9] to do this, but unlike Kendo, our scheme requires neither deterministic hardware performance counters (which are not available on many platforms [10], including many x86 systems), nor kernel modification for deterministic execution. The logical clocks used for deterministic execution are instead inserted by the compiler.

Moreover, we also implemented hardware extensions to aid in deterministic multithreading. Having hardware decreases portability, but gives a significant improvement in performance and scalability. We can sum up our contributions to fault tolerance in this thesis as follows.

1. We discuss the implementation of our two schemes for deterministic execution on multicore platforms for fault tolerance.

2. Both of our schemes are implemented using a user-level library and do not require a modified kernel.

3. Both of our schemes are very portable since they do not depend upon any special hardware for deterministic execution.

4. Both of our schemes show better performance than existing approaches on selected benchmarks.

5. We compare the advantages and disadvantages of both schemes in terms of performance, memory consumption and reliability.

6. We also implement specialized hardware extensions for deterministic multithreading to improve the performance and scalability.
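As promised above, here is a minimal sketch of the shared lock-order queue behind the first scheme. It is our own simplification for illustration (a single lock, a fixed-size ring, and no overflow handling, backpressure or fault containment), not the implementation evaluated in Chapter 3:

    /* The leader records the order in which threads acquire a lock; the
       follower replays that order. The queue is assumed to live in memory
       shared by both replicas (e.g. a mmap'd region); setup is elided. */
    #include <pthread.h>
    #include <stdatomic.h>

    #define QSIZE 4096          /* assumed large enough; no overflow check */

    typedef struct {
        atomic_long head;       /* next entry the follower will consume */
        atomic_long tail;       /* next entry the leader will fill */
        int tid[QSIZE];         /* thread id per lock acquisition */
    } order_queue_t;

    /* Leader replica: acquire first, then publish "tid got the lock". */
    static void leader_lock(pthread_mutex_t *m, order_queue_t *q, int tid)
    {
        pthread_mutex_lock(m);
        long t = atomic_load(&q->tail);
        q->tid[t % QSIZE] = tid;
        atomic_store(&q->tail, t + 1);  /* release the entry to the follower */
    }

    /* Follower replica: wait until the log says it is this thread's turn. */
    static void follower_lock(pthread_mutex_t *m, order_queue_t *q, int tid)
    {
        for (;;) {
            long h = atomic_load(&q->head);
            if (h < atomic_load(&q->tail) && q->tid[h % QSIZE] == tid) {
                pthread_mutex_lock(m);
                atomic_store(&q->head, h + 1);
                return;
            }
        }
    }

The sketch also makes the reliability drawback visible: both replicas trust the shared queue, so a corrupted queue corrupts both, which is exactly the motivation given above for deterministic multithreading.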

1.3.2. WCET CALCULATION

Real time tasks, besides being error free, must also meet timing deadlines. For that purpose, it is necessary to calculate the WCET of a real time task. Finding a good (less pessimistic) WCET estimate of a real time task is much simpler if it runs on a single core processor than if it runs on a multicore processor concurrently with other tasks. This is because those tasks can share resources, such as a shared cache or a shared bus, and/or may need to concurrently read and/or write shared data.

Recently, there has been an increase in interest in solving the problem of finding the WCET of tasks running on multicore processors, ranging from on-chip hardware support to software solutions for commodity off the shelf (COTS) processors. However, most of those do not take into account shared memory accesses. In [18], the authors do take into account shared memory accesses, but the state explosion problem of the model checking based approach they use limits its effectiveness.

In this thesis, we show that using deterministic shared memory accesses [9][25] helps in reducing the possible number of states used by the model checker and therefore reduces the WCET calculation time. We can sum up the contribution of this thesis to WCET calculation as follows.

1. Limiting the state space explosion problem by utilizing deterministic execution when calculating the WCET of a multithreaded program running on multicores using model checking.

2. Implementing optimizations to further reduce the WCET calculation time as well as to get a tighter WCET estimate.

3. Utilizing our specialized hardware extensions to further reduce the WCET calculation time.

1.4. THESIS ORGANIZATION

The thesis is organized as follows.

In Chapter 2, we give a survey of fault tolerance techniques for shared memory multicore systems.

In Chapter 3, we discuss our own implementation of fault tolerance for shared memory multicore systems using record/replay. We also compare its performance with the state of the art.

In Chapter 4, we discuss our own implementation of deterministic multithreading, which, like record/replay, can be used for deterministic execution of redundant processes running on multicore systems for fault tolerance. We also compare our method with the state of the art.

In Chapter 5, we give a comparison of fault tolerance using record/replay and deterministic multithreading. We also discuss the performance improvement obtained by using hardware based deterministic multithreading.

In Chapter 6, we show the reduction in WCET calculation time that we obtain by using deterministic multithreading.

In Chapter 7, we present the conclusions and discuss future research areas.

REFERENCES

[1] N. Nethercote and J. Seward. Valgrind: A framework for heavyweight dynamic binary instrumentation. In PLDI '07, pages 89–100, 2007.

[2] A. Avizienis, J. C. Laprie, B. Randell, and C. Landwehr. Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing, 1(1):11–33, 2004.

[3] D. R. Hower and M. D. Hill. Rerun: Exploiting episodes for lightweight memory race recording. In Proceedings of the 35th Annual International Symposium on Computer Architecture, ISCA '08, pages 265–276, Washington, DC, USA, 2008. IEEE Computer Society.

[4] A. Basu, J. Bobba, and M. D. Hill. Karma: Scalable deterministic record-replay. In Proceedings of the International Conference on Supercomputing, ICS '11, pages 359–368, New York, NY, USA, 2011. ACM.

[5] D. Lee, B. Wester, K. Veeraraghavan, S. Narayanasamy, P. M. Chen, and J. Flinn. Respec: Efficient online multiprocessor replay via speculation and external determinism. In ASPLOS '10, pages 77–90, 2010.

[6] T. Bergan, O. Anderson, J. Devietti, L. Ceze, and D. Grossman. CoreDet: A compiler and runtime system for deterministic multithreaded execution. SIGARCH Comput. Archit. News, 38:53–64, March 2010.

[7] T. Liu, C. Curtsinger, and E. D. Berger. Dthreads: Efficient deterministic multithreading. In SOSP '11, pages 327–336, 2011.

[8] D. Hower, P. Dudnik, M. Hill, and D. Wood. Calvin: Deterministic or not? Free will to choose. In HPCA '11, pages 333–334, February 2011.

[9] M. Olszewski, J. Ansel, and S. Amarasinghe. Kendo: Efficient deterministic multithreading in software. In ASPLOS '09, pages 97–108, 2009.

[10] V. Weaver and J. Dongarra. Can hardware performance counters produce expected, deterministic results? In FHPM '10, 2010.

[11] X. Li, Y. Liang, T. Mitra, and A. Roychoudhury. Chronos: A timing analyzer for embedded software. Science of Computer Programming, Special Issue on Experimental Software and Toolkit, 69(1-3), December 2007.

[12] H. Mushtaq, Z. Al-Ars, and K. Bertels. Accurate and efficient identification of worst-case execution time for multicore processors: A survey. In IDT, 2013.

[13] J. Yan and W. Zhang. WCET analysis for multi-core processors with shared L2 instruction caches. In RTAS, 2008.

[14] W. Zhang and J. Yan. Accurately estimating worst-case execution time for multi-core processors with shared direct-mapped instruction caches. In RTCSA, 2009.

[15] S. Chattopadhyay, A. Roychoudhury, and T. Mitra. Modeling shared cache and bus in multi-cores for timing analysis. In SCOPES, 2010.

[16] H. Ozaktas, C. Rochange, and P. Sainrat. Automatic WCET analysis of real-time parallel applications. In 13th International Workshop on Worst-Case Execution Time Analysis, 2013.

[17] M. Gerdes, T. Ungerer, and R. Knorr. Timing analysable synchronisation techniques for parallel programs on embedded multi-cores. PhD thesis.

[18] A. Gustavsson, A. Ermedahl, B. Lisper, and P. Pettersson. Towards WCET analysis of multicore architectures using UPPAAL. In Proceedings of the 10th International Workshop on Worst-Case Execution Time Analysis, 2010.

[19] G. Behrmann, A. David, and K. G. Larsen. A tutorial on UPPAAL. Pages 200–236, Springer, 2004.

[20] S. Chattopadhyay. Time-predictable execution of embedded software on multi-core platforms. PhD thesis, National University of Singapore, 2012.

[21] F. B. Schneider. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Comput. Surv., 22:299–319, December 1990.

[22] L. Wu and W. Zhang. A model checking based approach to bounding worst-case execution time for multicore processors. ACM Trans. Embed. Comput. Syst., 11(S2):56:1–56:19, August 2012.

[23] R. Baumann. Soft errors in advanced semiconductor devices, Part I: The three radiation sources. IEEE Transactions on Device and Materials Reliability, 1(1):17–22, March 2001.

[24] S. Nomura, M. D. Sinclair, C.-H. Ho, V. Govindaraju, M. de Kruijf, and K. Sankaralingam. Sampling + DMR: Practical and low-overhead permanent fault detection. In ISCA '11, pages 201–212, 2011.

[25] H. Mushtaq, Z. Al-Ars, and K. Bertels. DetLock: Portable and efficient deterministic execution for shared memory multicore systems. In High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion, pages 721–730, 2012.

2 FAULT TOLERANCE TECHNIQUES FOR MULTICORE SYSTEMS

SUMMARY

With the advent of modern nano-scale technology, it has become possible to implement multiple processing cores on a single die. The shrinking transistor sizes however have made reliability a concern for such systems, as smaller transistors are more prone to permanent as well as transient faults. To reduce the probability of failures of such systems, online fault tolerance techniques can be applied. These techniques need to be efficient, as they execute concurrently with applications running on such systems. This chapter discusses the challenges involved in online fault tolerance and existing work which tackles these challenges. We classify fault tolerance into four different steps, which are proactive fault management, error detection, fault diagnosis and recovery, and discuss related work for each step, with focus on techniques for shared memory multicore/multiprocessor systems. We also highlight the additional difficulties in tolerating faults for parallel execution on shared memory multicore/multiprocessor systems.

This chapter is based on the following paper.

Mushtaq, H.; Al-Ars, Z.; Bertels, K., Survey of fault tolerance techniques for shared mem- ory multicore/multiprocessor systems, Design and Test Workshop (IDT), 2011 IEEE 6th International, pp. 12-17, 11-14 Dec. 2011


Survey of Fault Tolerance Techniques for Shared Memory Multicore/Multiprocessor Systems

Hamid Mushtaq, Zaid Al-Ars, Koen Bertels
Computer Engineering Laboratory
Delft University of Technology
Delft, the Netherlands
{H.Mushtaq, Z.Al-Ars, K.L.M.Bertels}@tudelft.nl

Abstract— With the advent of modern nano-scale technology, it has become possible to implement multiple processing cores on a single die. The shrinking transistor sizes however have made reliability a concern for such systems, as smaller transistors are more prone to permanent as well as transient faults. To reduce the probability of failures of such systems, online fault tolerance techniques can be applied. These techniques need to be efficient, as they execute concurrently with applications running on such systems. This paper discusses the challenges involved in online fault tolerance and existing work which tackles these challenges. We classify fault tolerance into four different steps, which are proactive fault management, error detection, fault diagnosis and recovery, and discuss related work for each step, with focus on techniques for shared memory multicore/multiprocessor systems. We also highlight the additional difficulties in tolerating faults for parallel execution on shared memory multicore/multiprocessor systems.

Fig. 1. Fault propagation and fault tolerance

I. INTRODUCTION

It has become possible to integrate billions of transistors on a single die with modern nano-scale technology and therefore allow many processing cores to be implemented on the same chip. While this advancement allows software with a large level of parallelism to execute very efficiently on such processors, it has also introduced reliability issues, as the small transistors are more susceptible to both transient [2] and permanent [18] faults. This necessitates the implementation of efficient and scalable online fault tolerance (FT) techniques to reduce the probability of failures of such systems.

Fault tolerance of programs running sequentially on uniprocessors is well understood and many efficient solutions exist for that purpose. On the other hand, programs running in parallel on shared memory multicore processors present a greater challenge due to shared memory accesses, which are a frequent source of non-determinism. This paper gives a survey of work done on fault tolerance with primary focus on work for shared memory multicore systems.

Section II discusses the basic concepts of a system, faults, failures and fault tolerance. Then we classify fault tolerance into four different steps: proactive fault management (discussed in Section III), error detection (discussed in Section IV), fault diagnosis (discussed in Section V) and recovery (discussed in Section VI). Redundant execution for fault tolerance is discussed in Section VII. We finally conclude this paper with Section VIII.

II. BASIC CONCEPTS

In this section, we present the basic concepts related to the field of fault tolerance. Our discussion is based on the way a system behaves and interacts with other systems in its environment [38].

A system is an entity that interacts with other systems. A system can be hardware based, for example a processor, or software based, such as a running application. A system consists of components which can be systems themselves. The service delivered by a system is its behavior as perceived by other systems using it. The total state of a system is the set of its internal and external states. A system is said to fail when its external state deviates from the correct state. The cause of this failure is one or more faults within the system or external to it. Fault propagation is illustrated in Figure 1. When a fault becomes active, it impacts the total state of one or more components of the system. The deviation of the total state of a component from the correct state is known as an error. When an error propagates to affect the external state of the system, the error is said to be activated. When the error is activated, failure of the system is said to have occurred. The time between fault activation and failure is known as the error latency. In other words, a fault might lead to an error which in turn might lead to the failure of the system.

A. Faults

Faults can be classified into four different classes depending on their persistence, effect, source and boundary. With respect to persistence, a fault can be permanent, intermittent or transient. Permanent faults are continuous in time, while a transient fault is random and occurs for only a short period of time.


An intermittent fault is a repetitive malfunction of a device or system that occurs at intervals. With respect to effect, a fault can be either activated or dormant. An activated fault is one which has produced an error, while a dormant fault is one that has yet to produce an error. An activated fault can be further classified into latent and detected, where a latent fault is one which has produced an error that has still not been detected by the system. The source of a fault can be either software or hardware. Software faults can be for example design faults or malicious attacks like trojan horses. Lastly, a fault can be either due to a component internal to the system or external to it. A fault can produce a number of errors in a computing system, such as control-flow errors, data corruption errors, logical errors, buffer overflows, memory leaks, data races, deadlocks/livelocks, infinite loops and wild memory writes.

B. Failures

Failures can be classified into three different classes based on their domain, action and consistency. In terms of domain, failures can be either timing related or content related. Timing failures mean that the failing system either responds too early or too late. On the other hand, content failures mean that the content of the information delivered by the system is in a corrupt state.

In terms of action taken by a failing system, failures can be divided into halt and erratic failures. By halt failure, we mean that the system stops responding on failure, while erratic failure means that the failing system keeps responding but in an abnormal manner. Halting on failure is a good property, as errors are not propagated to other systems in the environment. Systems which halt on failure are known as fail-stop systems.

In terms of consistency, there can be byzantine and consistent failures. When a byzantine failure happens, some or all users of the system will perceive different service. On the other hand, for consistent failures, all users will perceive identical service. Service failure of a system causes a permanent or transient external fault for other system(s) that receive service from that system.

C. Fault tolerance

Fault tolerance means to avoid failures in the presence of faults. A system is said to be fault tolerant if faults do not affect the external state of that system. It can however allow its components to fail, as long as they do not corrupt its external state. A fault tolerant system must be able to detect errors and recover from them. The time between fault activation and error detection is known as the error detection latency.

Figure 2 shows the steps which are taken to make a system fault tolerant. These steps are proactive fault management, error detection, fault diagnosis and recovery. In the coming sections, we discuss these steps and the related work done, with special focus on shared memory multicore/multiprocessor systems.

III. PROACTIVE FAULT MANAGEMENT

Proactive fault management means predicting failures of components before they happen and taking precautionary steps to prevent them, as illustrated in Figure 1. Software rejuvenation [11] is a proactive fault management technique that tries to avoid faults due to software aging. Software aging is the degradation of an application or system with time. Degradation can happen due to resource leakage, such as memory leaks or accumulation of numerical errors for example. In multithreaded applications, deadlocks may also appear due to software aging. Software rejuvenation tends to avoid these aforementioned problems by periodically restarting applications in a clean state.

Another way of performing proactive fault management is to proactively check for errors in the system, that is, check for errors when the system is idle for example, or by doing system monitoring. One such technique is memory scrubbing [21], in which memories are periodically checked for errors and corrected, even while they are not in use. Another example of a system which uses proactive error checking is [23], which predicts faults through system monitoring. In case of abnormal behavior detection, such as aberrant temperature or disk errors in a node, the tasks executing on it are migrated to a healthy node.

IV. ERROR DETECTION

Error detection is the process of detecting errors in the system. Timing errors are normally detected by using watchdog timers, while for content errors, redundant execution is normally applied.

A. Watchdog timers and processors

A watchdog timer is a timer that is used to check if a system, or a subsystem in it, is stuck, for example due to an infinite loop, in which case it triggers a corrective measure by the system.

A watchdog processor is a coprocessor that is used to detect system level errors by monitoring the behavior of the system. A survey of different kinds of watchdog processors is given in [16]. Watchdog processors can be used to check control flow errors. This is done by associating signatures at each node of a program and providing the same signatures to the watchdog processor. [15] shows that 90 percent of control flow errors can be detected through watchdog processors with very low hardware and memory overhead.

B. Redundancy

Redundancy is a technique in which multiple processing elements are used to process the same data. One such technique is dual modular redundancy (DMR), in which two elements are used. An error is detected when the contents of the two processing elements diverge. Another technique is triple modular redundancy (TMR) or N-modular redundancy, which in addition to detecting errors can also locate the faulty element through majority voting. Moreover, the system can continue to execute by masking the faulty element. In such systems, the voter also needs to be reliable, as it can become a single point of failure.
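As a toy illustration of the TMR voting just described (our own sketch, not taken from any of the surveyed systems), a voter over three redundant results can both mask a single fault and locate the faulty unit:

    /* Majority vote over three redundant results.
       faulty_unit: -1 = all agree, 0/1/2 = disagreeing unit, -2 = no majority.
       Usage: vote_result_t r = tmr_vote(out_a, out_b, out_c); */
    typedef struct { long value; int faulty_unit; } vote_result_t;

    static vote_result_t tmr_vote(long a, long b, long c)
    {
        vote_result_t r = { 0, -1 };
        if (a == b || a == c) {
            r.value = a;                       /* a is part of the majority */
            if (a != c)      r.faulty_unit = 2;
            else if (a != b) r.faulty_unit = 1;
        } else if (b == c) {
            r.value = b;                       /* a is the odd one out */
            r.faulty_unit = 0;
        } else {
            r.faulty_unit = -2;                /* three-way disagreement */
        }
        return r;
    }

The no-majority case is exactly why the voter itself, as noted above, must be reliable: it is the one component whose output cannot be voted on.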


Fig. 2. Classification of steps used for fault tolerance

While N-modular redundancy is used to tolerate hardware faults, N-diversity is used to tolerate software faults, such as logical bugs left during development. The main idea is that if there is a fault in one version, it can be masked out by using majority voting. Authors in [24] discuss various software fault tolerance techniques using design and data diversity. [7] and [6] are examples of systems which use this technique for tolerating software faults.

For error detection of software running on a single core, fault tolerant systems commonly employ redundant execution at different levels of abstraction: at instruction level [20], process level [22] or virtual machine level [4]. Schemes which work at instruction level have low error detection latencies, while schemes which work at process and virtual machine level allow errors to propagate before detecting them. In the absence of faults, these schemes need to make sure that each replica starts with the same initial state, executes input data in the same order and performs the same computations. This method is not straightforward to implement, especially for parallel programs running on shared memory multicore/multiprocessor systems. Section VII discusses the related work done to tackle this problem.

V. FAULT DIAGNOSIS

Fault diagnosis is the process of identifying the location and type of a fault. Location of a fault can be determined either preemptively, that is, before its activation, or after an error has been detected due to its activation, as shown in Figure 1.

Failure identification of a fail-stop component is relatively easy, as it stops responding when it fails. Time-out is a common mechanism to detect failures of fail-stop components. For example, in a message passing environment, a permanent failure of a processor would be assumed if it stops sending messages.

In TMR systems, a faulty component can be located by a majority voter. Another method to locate faulty components is online self-tests. Through this mechanism, a system can find permanent and sometimes intermittent faults in it by testing itself. Online self tests can be applied concurrently with application execution and therefore can proactively detect dormant faults by locating failed hardware components. Online self tests can be performed using pure hardware built-in self-test (BIST) [39] approaches or using software based self test (SBST) [26]. The benefit of software based approaches is that they do not require any change to the system hardware. Software based techniques are becoming more relevant with the increasing number of cores, as a core can be dedicated to perform the self tests on the system.

An example of an online self test scheme is [8], which describes and evaluates three different scheduling policies to find permanent and intermittent faults. In this scheme, the online self test can be performed through either special hardware (BIST) or software (SBST). When a test is performed on a processor, it is logically isolated from the rest of the system, while the task which was being executed by that processor is migrated to another processor for continued operation. In the proposed system, only one task can execute at a time on a processor. Self tests are done periodically and the scheduling policies try to select the idle processors, or those which are running low priority tasks, for testing. Since the test is performed concurrently with application execution, intermittent faults, such as those that occur during a burst of computing in a processor, can also be detected.

The type of a fault can be found by using retry/replay methods. For example, in [25], the same BIST test is applied twice in a row. Knowing that transient faults occur infrequently, it can be assumed that a transient fault would not occur twice in a row. Hence, if the test fails both times, the failure is considered permanent. mSWAT [40] can also differentiate between a hardware fault and software bugs for a multithreaded program running on a multicore system. After an error is detected, execution is restarted from the last checkpoint. If no error is detected this time, the fault is assumed to be transient or a non-deterministic software bug, otherwise a permanent fault or deterministic software bug is assumed. In that case, execution is replayed on different cores. If the same error occurs again, a deterministic software bug is assumed, otherwise a permanent fault is assumed. In that case, mSWAT does another replay for further analysis to find the faulty core.

VI. RECOVERY

When a fault is detected, it is important to recover from it. As shown in Figure 1, in a fault tolerant system, recovery must be done before failure of the system occurs. It can be done by either performing error handling, fault handling or both. Error handling means to eliminate errors from the system without removing the source of the fault.

On the other hand, fault handling is the process of removing the source of the fault, to prevent reoccurrence of the fault. Error handling is enough for recovering from transient faults, as it is not necessary to locate the source of the fault for transient faults.

A. Error handling

Two different schemes can be used for error handling, namely checkpoint and repair, and masking. In checkpoint and repair, the state of the system is periodically saved (checkpointed) and when an error is detected, it is rolled back to a previously valid state by using the checkpoint. The benefit of this scheme is that it can be used to tolerate long error detection latencies [12]. On the other hand, in masking, the erroneous components are masked out by using majority voting on the states of redundant components. The state of an erroneous component may be restored by using the state of one of the non-erroneous redundant components [22]. It is a more efficient technique than checkpoint and repair, as no rollback is required. However, it is unable to tolerate long error detection latencies. Therefore, the introduction of latent faults in such systems needs to be avoided, as they may eventually corrupt most of the redundant states to make recovery impossible [36].

Checkpointing can be mainly categorized into coordinated checkpointing and uncoordinated checkpointing. In coordinated checkpointing [30], the processes in the system coordinate with each other to take the checkpoint, while in uncoordinated checkpointing, each process separately takes its own checkpoint. The recovery is achieved through a special recovery phase which reforms the global state of the system to perform recovery. Uncoordinated checkpointing is usually avoided however, due to its proneness to domino effects [37].

Commodity shared memory multicore processors are equipped with a memory management unit (MMU), which allows accelerating the checkpointing process by using copy-on-write techniques. It allows incremental checkpointing, that is, only saving pages dirtied since the last checkpoint. Authors in [27] and [28] were the first to implement checkpointing for parallel programs running on shared memory multiprocessor systems by using this technique. Their scheme allows the original application to continue execution while checkpointing is performed. This is made possible by giving read only access to the memory pages of the program when starting checkpointing, so whenever something is written to a page for the first time, a page fault is trapped by the OS and the content of that page is saved in the checkpoint, besides giving write access back to that page. Authors in [29] improve upon [28] by using translation lookaside buffer (TLB) misses to record data. This avoids the overhead of setting write accesses of pages.

Normally checkpoints are stored in a non-volatile memory due to its reliability. However, schemes like Respec [14] keep the checkpoint as a forked process in Linux. This improves the efficiency of both storing and restarting from a checkpoint, especially in systems with a large amount of RAM. This method is less reliable though. However, its reliability can be increased by using memory scrubbing.
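To make the fork-based checkpointing idea concrete, here is a minimal user-level sketch under strong simplifying assumptions (a single-threaded process and no file or network state; fork() clones only the calling thread, so a real system like Respec must do considerably more):

    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Take a checkpoint: the stopped child holds a copy-on-write snapshot
       of our address space. Returns 0 in the parent (normal execution)
       and 1 in the child, i.e. after a rollback has been triggered. */
    static int take_checkpoint(pid_t *ckpt)
    {
        pid_t pid = fork();
        if (pid < 0) { perror("fork"); exit(1); }
        if (pid > 0) { *ckpt = pid; return 0; }  /* parent continues */
        raise(SIGSTOP);      /* child sleeps until a rollback wakes it */
        return 1;            /* child resumes from the snapshot state */
    }

    /* Roll back: wake the snapshot and abandon the (faulty) current run. */
    static void rollback(pid_t ckpt)
    {
        kill(ckpt, SIGCONT);
        exit(0);
    }

    /* Commit: the checkpoint is no longer needed, discard the child. */
    static void discard_checkpoint(pid_t ckpt)
    {
        kill(ckpt, SIGKILL);
        waitpid(ckpt, NULL, 0);
    }

The copy-on-write pages shared with the stopped child are what make this cheap in systems with a large amount of RAM, and also why, as noted above, the snapshot is only as reliable as the DRAM it sits in.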
B. Fault handling

Fault handling involves isolating the faulty component and recovering the system from the fault. Moreover, tasks which were being computed on a faulty core need to be reassigned to a working core or a spare core, which is known as reassignment. Repair of a faulty component can be done in a reconfigurable system through reconfiguration [34]. When hardware resources are exhausted, a reconfigurable system might also emulate a hardware component in software [35], so that the system continues to perform, albeit with degraded performance.

Isolation of a faulty component is done to make the fault originating from it dormant, so that errors are not propagated to the other components in the system. In typical shared memory multicore processors, different processes run in separate address spaces by using the virtual memory system supported by the MMU. This makes sure that a wild write in one process, due to an uninitialized pointer for example, does not affect the execution of other processes in the system. Virtual memory therefore provides an efficient scheme for isolation and error confinement. However, virtual memory alone is not enough to confine errors when different processes communicate through shared memory or at kernel level, since the kernel itself manages the virtual memory. An error in the kernel could bring down the whole operating system.

Hive [31] addresses this issue by using independent kernels known as cells. In this way, a fault damages only one cell rather than the whole system. To prevent wild writes from one cell to the memory of another, each cell uses firewall hardware. On failure of a cell, pages writable from that cell are discarded, which prevents any cell from reading data from those pages. This requires prompt detection of the failure of a cell, which Hive does by applying an aggressive fault detection scheme that includes heuristic checks and a distributed agreement protocol.

Hypervisor-based fault tolerance [4] takes this a step further by running different guest operating systems in isolated environments. This isolation makes sure that failure of one guest OS does not affect the other guest OSs. Moreover, the authors in [32] and [33] have exploited the isolation provided by a hypervisor to execute device drivers inside virtual machines for fault tolerance and portability. Due to the isolation provided by the hypervisor, a faulty driver does not impact the rest of the system.

VII. REDUNDANT EXECUTION FOR FAULT TOLERANCE

Process level and virtual machine level fault tolerant systems apply redundant execution for fault tolerance. This requires deterministic execution of the redundant components with respect to each other. For this purpose, these systems need to cater for non-deterministic events, such as interrupts, signals and DMAs, and non-deterministic functions, such as time of the day. As an example, [13] uses hardware performance counters to count instructions so as to identify the point at which an interrupt occurred in the primary replica and execute the interrupt at the same point of execution in the other replicas.


In multithreaded programs running on multicore processors, there is one more source of non-determinism, namely shared memory accesses, and these accesses are much more frequent than interrupts or signals. Efficient deterministic execution of replicas in such systems is therefore much more difficult to achieve, and remains an active area of research.

A comparison of the different methods that can be used for deterministic redundant execution is shown in Table I.

TABLE I: COMPARISON OF DIFFERENT METHODS FOR DETERMINISTIC REDUNDANT EXECUTION OF SHARED MEMORY MULTITHREADED PROGRAMS

Property / Technique | Language based (e.g., SHIM) | Record/Replay (e.g., Karma) | Deterministic execution of programs (e.g., Calvin)
Scalability          | Reasonable                  | Reasonable                  | Poor
Programmability      | Difficult for arbitrary programs | Easy                   | Easy
Deadlock prevention  | Can be difficult to prevent | Does not prevent            | Mutex-based deadlocks can be eliminated

One way of executing replicas in a deterministic fashion is to use deterministic parallel languages. Examples of such languages are StreamIt [44], SHIM [5] and Deterministic Parallel Java [1]. However, porting programs written in traditional languages to deterministic languages is difficult, as the learning curve is high for programmers used to programming in traditional languages. Moreover, in languages based on the Kahn Process Network model, such as SHIM, it is difficult to write programs without introducing deadlocks [41].

Deterministic redundant execution at runtime can be done through hardware, software or a combination of both. Some hardware schemes use the record and replay method for achieving deterministic execution. In this method, all interleavings of shared memory accesses by different processors are recorded in a log, which can be replayed to have a replica follow the original execution. Examples of schemes using this method are Rerun [10] and Karma [42]. These schemes intercept cache coherence protocols to record inter-processor data dependencies, so that they can be replayed later on in the same order. While Rerun only optimizes recording, Karma optimizes both recording and replaying, thus making it suitable for online fault tolerance. It shows good scalability as well.

Unlike the record/replay method, Calvin [9] executes programs deterministically, that is, given the same input, a program always produces the same output. It does so by executing instructions in the form of chunks and later committing them at barrier points. It uses a relaxed memory model, where instructions are committed in such a way that only the total store order (TSO) of the program has to be maintained. An advantage of this method is that mutex-based deadlocks can be eliminated [43]. Moreover, no inter-replica communication is required, thus making this method more dependable than record/replay. The disadvantage of this method, though, is scalability, as it depends upon barriers to commit chunks.

The disadvantage of existing hardware methods for deterministic execution is that they are applied at system level. They cannot, for example, perform deterministic execution of individual applications running on a system. Capo [17] is the first scheme to address this issue. It implements a virtualization layer that allows different applications to use the hardware resources for deterministic replay. Non-deterministic events, such as interrupts and signals, are handled by the software, while for shared memory access interleavings, the underlying hardware for deterministic replay can be used.

Besides hardware methods, software-only methods for deterministic execution also exist. One such method is CoreDet [3], which uses bulk-synchronous quanta along with store buffers and a relaxed memory model to achieve determinism. It is therefore similar to Calvin, but implemented in software. Since it is implemented in software, it has a very high overhead, 1-11x for 8 cores, as compared to 0.5x-2x for Calvin.

Kendo [19] is a software approach that works only on programs without data races, that is, those that access shared memory only through synchronization objects. It executes threads deterministically and performs load balancing by only allowing a thread to complete a synchronization operation when its clock becomes less than those of the other threads. The clock is calculated from retired stores, is paused while waiting for a lock and is resumed after the lock is acquired. Since this method requires global communication among threads for reading clock values, it also has limited scalability.
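The logical-clock idea behind Kendo can be illustrated as below: a thread may only complete a lock acquisition when no other thread has a smaller clock, which makes the acquisition order a deterministic function of the clocks. This is a simplified, hypothetical rendering (busy-waiting, explicit clock increments instead of retired-store counters, no pausing of clocks while blocked), not Kendo's actual implementation.

    #include <pthread.h>
    #include <stdatomic.h>

    #define NTHREADS 4

    /* One logical clock per thread; Kendo derives these from retired
     * stores, here the runtime would increment them explicitly. */
    static atomic_ulong clocks[NTHREADS];

    /* A thread may proceed only if its clock is the minimum
     * (ties broken by thread id), so the winner is deterministic. */
    static int my_clock_is_min(int tid)
    {
        unsigned long mine = atomic_load(&clocks[tid]);
        for (int i = 0; i < NTHREADS; i++) {
            unsigned long c = atomic_load(&clocks[i]);
            if (c < mine || (c == mine && i < tid))
                return 0; /* someone else is "earlier": keep waiting */
        }
        return 1;
    }

    static void det_lock(pthread_mutex_t *m, int tid)
    {
        while (!my_clock_is_min(tid))
            ; /* spin until it is deterministically our turn */
        pthread_mutex_lock(m);
        atomic_fetch_add(&clocks[tid], 1); /* move our clock past this acquire */
    }

Note that the loop over all threads' clocks is exactly the global communication that limits Kendo's scalability, as stated above.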
Respec [14] is a record/replay software approach that only logs synchronization objects rather than every shared memory access. If divergence is found between the replicas, it rolls back and re-executes from a previous checkpoint. However, if divergence is found again on re-execution, a race condition is assumed. At that point, a stricter deterministic execution is performed, which can induce a large overhead.

VIII. CONCLUSION

In this paper we discussed related work done on online fault tolerance techniques with a focus on techniques for shared memory multicore/multiprocessor systems. We discussed the steps which are taken to achieve fault tolerance, which are proactive fault management, error detection, fault diagnosis and recovery. Proactive fault management is a precautionary step to prevent failures of components in the system, whereas error detection is performed to detect errors before they lead to failure of the system. We also discussed fault diagnosis techniques, which are used to locate failed components and to determine the type of a fault. Moreover, we discussed recovery techniques such as checkpoint and repair, reconfiguration and reassignment. Finally, we discussed related work on deterministic redundant execution of parallel programs running on shared memory multicore/multiprocessor systems, which is still an active area of research.

ACKNOWLEDGMENT

This research is supported by Artemis through the SMECY project (grant 100230).


REFERENCES

[1] http://dpj.cs.uiuc.edu.
[2] R. C. Baumann. Soft errors in advanced semiconductor devices-part i: the three radiation sources. Device and Materials Reliability, IEEE Transactions on, vol. 1, pages 17-22, March 2001.
[3] T. Bergan, O. Anderson, J. Devietti, L. Ceze, and D. Grossman. CoreDet: a compiler and runtime system for deterministic multithreaded execution. SIGARCH Comput. Archit. News, vol. 38, pages 53-64, March 2010.
[4] T. C. Bressoud and F. B. Schneider. Hypervisor-based fault tolerance. In Proceedings of the fifteenth ACM symposium on Operating systems principles, SOSP '95, pages 1-11, New York, NY, USA, 1995. ACM.
[5] S. A. Edwards and O. Tardieu. SHIM: a deterministic model for heterogeneous embedded systems. In Proceedings of the 5th ACM international conference on Embedded software, EMSOFT '05, pages 264-272, New York, NY, USA, 2005. ACM.
[6] B. Chun, P. Maniatis, and S. Shenker. Diverse replication for single-machine byzantine-fault tolerance. In USENIX 2008 Annual Technical Conference, pages 287-292, Berkeley, CA, USA, 2008. USENIX Association.
[7] H. P. Reiser, F. J. Hauck, R. Kapitza, and W. Schröder-Preikschat. Kemari: VM synchronization for fault tolerance. In European Dependable Computing Conference, pages 67-68, 2006.
[8] O. Heron, J. Guilhemsang, N. Ventroux, and A. Giulieri. Analysis of on-line self-testing policies for real-time embedded multiprocessors in DSM technologies. In On-Line Testing Symposium (IOLTS), 2010 IEEE 16th International, pages 49-55, 2010.
[9] D. R. Hower, P. Dudnik, M. D. Hill, and D. A. Wood. Calvin: Deterministic or not? Free will to choose. In High Performance Computer Architecture (HPCA), 2011 IEEE 17th International Symposium on, pages 333-334, February 2011.
[10] D. R. Hower and M. D. Hill. Rerun: Exploiting episodes for lightweight memory race recording. In Computer Architecture, 2008. ISCA '08. 35th International Symposium on, pages 265-276, 2008.
[11] Y. Huang, C. Kintala, N. Kolettis, and N. Dudley Fulton. Software rejuvenation: Analysis, module and applications. In Proceedings of the IEEE Intl. Symposium on Fault Tolerant Computing, FTCS 25, 1995.
[12] W. W. Hwu and Y. N. Patt. Checkpoint repair for high-performance out-of-order execution machines. Computers, IEEE Transactions on, vol. 36, pages 1496-1514, December 1987.
[13] C. M. Jeffery and R. J. O. Figueiredo. Towards byzantine fault tolerance in many-core computing platforms. In Dependable Computing, 2007. PRDC 2007. 13th Pacific Rim International Symposium on, pages 256-259, 2007.
[14] D. Lee, B. Wester, K. Veeraraghavan, S. Narayanasamy, P. M. Chen, and J. Flinn. Respec: efficient online multiprocessor replay via speculation and external determinism. SIGPLAN Not., vol. 45, pages 77-90, March 2010.
[15] A. Mahmood and E. J. McCluskey. Watchdog processors: Error coverage and overhead. Dig. 15th Annu. Int. Symp. Fault-Tolerant Comput., FTCS-15, Ann Arbor, MI.
[16] A. Mahmood and E. J. McCluskey. Concurrent error detection using watchdog processors-a survey. Computers, IEEE Transactions on, vol. 37, pages 160-174, 1988.
[17] P. Montesinos, M. Hicks, S. T. King, and J. Torrellas. Capo: a software-hardware interface for practical deterministic multiprocessor replay. SIGPLAN Not., vol. 44, pages 73-84, March 2009.
[18] S. Nomura, M. D. Sinclair, C. Ho, V. Govindaraju, M. Kruijf, and K. Sankaralingam. Sampling + DMR: practical and low-overhead permanent fault detection. SIGARCH Comput. Archit. News, vol. 39, pages 201-212, June 2011.
[19] M. Olszewski, J. Ansel, and S. Amarasinghe. Kendo: efficient deterministic multithreading in software. SIGPLAN Not., vol. 44, pages 97-108, March 2009.
[20] G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, and D. I. August. SWIFT: software implemented fault tolerance. In Code Generation and Optimization, 2005. CGO 2005. International Symposium on, pages 243-254, 2005.
[21] A. M. Saleh, J. J. Serrano, and J. H. Patel. Reliability of scrubbing recovery-techniques for memory systems. Reliability, IEEE Transactions on, vol. 39, pages 114-122, April 1990.
[22] A. Shye, J. Blomstedt, T. Moseley, V. J. Reddi, and D. A. Connors. PLR: A software approach to transient fault tolerance for multicore architectures. Dependable and Secure Computing, IEEE Transactions on, vol. 6, pages 135-148, 2009.
[23] G. Vallee, C. Engelmann, A. Tikotekar, T. Naughton, K. Charoenpornwattana, C. Leangsuksun, and S. L. Scott. A framework for proactive fault tolerance. In Availability, Reliability and Security, 2008. ARES 08. Third International Conference on, pages 659-664, 2008.
[24] Z. Xie, H. Sun, and K. Saluja. A survey of software fault tolerance techniques.
[25] A. Sanyal, S. Alam, and S. Kundu. BIST to detect and characterize transient and parametric failures. Design Test of Computers, IEEE, vol. 27, pages 50-59, 2010.
[26] N. Kranitis, A. Paschalis, D. Gizopoulos, and G. Xenoulis. Software-based self-testing of embedded processors. IEEE Trans. Comput., vol. 54, pages 461-475, April 2005.
[27] K. Li, J. F. Naughton, and J. S. Plank. Real-time concurrent checkpoint for parallel programs. SIGPLAN Not., vol. 25, pages 79-88, February 1990.
[28] K. Li, J. F. Naughton, and J. S. Plank. Low-latency, concurrent checkpointing for parallel programs. IEEE Trans. Parallel Distrib. Syst., vol. 5, pages 874-879, August 1994.
[29] J. Liao and Y. Ishikawa. A new concurrent checkpoint mechanism for real-time and interactive processes. In Computer Software and Applications Conference (COMPSAC), 2010 IEEE 34th Annual, pages 47-52, 2010.
[30] E. N. Elnozahy, L. Alvisi, Y. Wang, and D. B. Johnson. A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv., vol. 34, pages 375-408, September 2002.
[31] J. Chapin, M. Rosenblum, S. Devine, T. Lahiri, D. Teodosiu, and A. Gupta. Hive: fault containment for shared-memory multiprocessors. In Proceedings of the fifteenth ACM symposium on Operating systems principles, SOSP '95, pages 12-25, New York, NY, USA, 1995. ACM.
[32] J. LeVasseur, V. Uhlig, J. Stoess, and S. Götz. Unmodified device driver reuse and improved system dependability via virtual machines. In Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation, vol. 6, pages 2-2, Berkeley, CA, USA, 2004. USENIX Association.
[33] K. Fraser, S. Hand, R. Neugebauer, I. Pratt, A. Warfield, and M. Williamson. Safe hardware access with the Xen virtual machine monitor. In 1st Workshop on Operating System and Architectural Support for the on demand IT InfraStructure (OASIS), 2004.
[34] N. Aggarwal, P. Ranganathan, N. P. Jouppi, and J. E. Smith. Configurable isolation: building high availability systems with commodity multi-core processors. SIGARCH Comput. Archit. News, vol. 35, pages 470-481, June 2007.
[35] A. Miele. A software framework for dynamic self-repair in embedded SoCs exploiting reconfigurable devices. International Conference on Automation, Quality and Testing, Robotics, vol. 2, pages 1-6, 2010.
[36] C. Basile, Z. Kalbarczyk, and R. K. Iyer. Active replication of multithreaded applications. IEEE Trans. Parallel Distrib. Syst., vol. 17, pages 448-465, May 2006.
[37] B. Randell. System structure for software fault tolerance. In Proceedings of the international conference on Reliable software, pages 437-449, New York, NY, USA, 1975. ACM.
[38] A. Avizienis, J. C. Laprie, B. Randell, and C. Landwehr. Basic concepts and taxonomy of dependable and secure computing. Dependable and Secure Computing, IEEE Transactions on, vol. 1, pages 11-33, 2004.
[39] H. Al-Asaad, B. T. Murray, and J. P. Hayes. Online BIST for embedded systems. IEEE Des. Test, vol. 15, pages 17-24, October 1998.
[40] S. K. S. Hari, M. Li, P. Ramachandran, B. Choi, and S. V. Adve. mSWAT: Low-cost hardware fault detection and diagnosis for multicore systems. In Microarchitecture, 2009. MICRO-42. 42nd Annual IEEE/ACM International Symposium on, pages 122-132, December 2009.
[41] B. Jiang, E. Deprettere, and B. Kienhuis. Hierarchical run time deadlock detection in process networks. In Signal Processing Systems, 2008. SiPS 2008. IEEE Workshop on, pages 239-244, October 2008.
[42] A. Basu, J. Bobba, and M. D. Hill. Karma: scalable deterministic record-replay. In Proceedings of the international conference on Supercomputing, ICS '11, pages 359-368, New York, NY, USA, 2011. ACM.
[43] T. Liu, C. Curtsinger, and E. D. Berger. Dthreads: Efficient deterministic multithreading. In SOSP '11, October 2011.
[44] W. Thies, M. Karczmarek, and S. Amarasinghe. StreamIt: A language for streaming applications. In Proceedings of the 11th International Conference on Compiler Construction, CC '02, pages 179-196, London, UK, 2002. Springer-Verlag.


3 FAULT TOLERANCE USING RECORD/REPLAY

SUMMARY

The abundant computational resources available in multicore systems have made it feasible to implement otherwise prohibitively intensive tasks on consumer grade systems. However, these systems integrate billions of transistors to implement multiple cores on a single die, thus raising reliability concerns, as smaller transistors are more susceptible to transient as well as permanent faults.

A common approach for providing fault tolerance is to perform redundant execution of the software. This is done by using the state machine replication approach. In this approach the replicated copies of a process (known as replicas) follow the same execution sequence and produce the same output if given the same input. This requirement necessitates that the replicas handle non-deterministic events such as asynchronous signals and non-deterministic functions (such as gettimeofday) deterministically. This is usually done by having one replica log the non-deterministic events and having the other replicas replay them at the same point in program execution. In a shared memory multithreaded program, this also means that the original and replica processes perform non-deterministic shared memory accesses deterministically, so that they do not diverge in the absence of faults.

In this chapter, we describe a software based efficient fault tolerance scheme that is implemented using a user-level library and does not require a modified kernel. The record and replay of synchronization operations is made efficient and scalable by eliminating atomic operations and true and false sharing of cache lines. Moreover, the error detection mechanism is optimized to perform memory comparisons of the replicas efficiently in user-space. With our initial algorithm for record/replay, the overhead is up to 46% for 4 threads, which is reduced to a maximum of 18% with an improved algorithm. This is lower than comparable systems published in literature.


This chapter is based on the following papers.

1. Mushtaq, H.; Al-Ars, Z.; Bertels, K., A user-level library for fault tolerance on shared memory multicore systems, Design and Diagnostics of Electronic Circuits & Systems (DDECS), 2012 IEEE 15th International Symposium on, pp. 266-269, 18-20 April 2012

2. Mushtaq, H.; Al-Ars, Z.; Bertels, K., Efficient software-based fault tolerance approach on multicore platforms, Design, Automation & Test in Europe Conference & Exhibition (DATE), 2013, pp. 921-926, 18-22 March 2013

A User-level Library for Fault Tolerance on Shared Memory Multicore Systems

Hamid Mushtaq, Zaid Al-Ars, Koen Bertels
Computer Engineering Laboratory, Delft University of Technology, Delft, the Netherlands
{H.Mushtaq, Z.Al-Ars, K.L.M.Bertels}@tudelft.nl

Abstract—The ever decreasing transistor size has made it possible to integrate multiple cores on a single die. On the downside, this has introduced reliability concerns, as smaller transistors are more prone to both transient and permanent faults. However, the abundant extra processing resources of a multicore system can be exploited to provide fault tolerance by using redundant execution. We have designed a library for multicore processing that can make a multithreaded user-level application fault tolerant through simple modifications to the code. It uses the abundant cores found in the system to perform redundant execution for error detection. Besides that, it also allows recovery through checkpoint/rollback. Our library is portable since it does not depend on any special hardware. Furthermore, the overhead our library adds to the original application (up to 46% for 4 threads) is less than that of other existing approaches, such as Respec.

I. INTRODUCTION

Although the shrinking transistor size has made it possible to implement multiple cores on a single die, it has also made reliability a concern, as smaller transistors are more prone to both transient [1] as well as permanent [2] faults. However, the abundant processing resources of a multicore system can be exploited to provide fault tolerance through redundant execution.

One way to use the abundant processing resources to provide fault tolerance is by using the state machine replication approach [3]. For multithreaded programs running on shared memory multicore systems, it is required that threads of the replicas access shared memory in the same order. In other words, shared memory accesses should be deterministic. Our library ensures this. Overall it provides the following features.

• Efficient deterministic execution in the presence of lock-based shared memory accesses.
• Optimized memory comparison of the replicas for error detection.
• Checkpoint/rollback to perform recovery from transient errors.

In Section II we discuss the background and related work. In Section III, we discuss the overview of our library, while in Section IV, we discuss its implementation. In Section V, we present and discuss our results. We finally conclude the paper with Section VI.

II. BACKGROUND AND RELATED WORK

Fault tolerance is achieved by three major steps: error detection, isolation and recovery [4]. With redundant execution, an error is detected if the replicas diverge. Since any kind of divergence is used to detect an error, it is important to remove any source of divergence which is not due to an error. The only sources of non-determinism in a single threaded program are non-deterministic functions, such as gettimeofday, and asynchronous signals, while for a multithreaded program, there is also non-determinism due to shared memory accesses. Moreover, shared memory accesses are usually much more frequent than non-deterministic functions and asynchronous signals, which makes implementing efficient state machine based replication more difficult for a multithreaded program running on a shared memory multicore system than for a single threaded application.

Two main approaches, record/replay and deterministic multithreading, exist for this purpose. In record/replay, the order of shared memory accesses of the original process is recorded so that it can be replayed by the other replicas. On the other hand, with deterministic multithreading, given the same input, a process always performs the same ordering of shared memory accesses. It has to be noted though that for non-deterministic functions, such as gettimeofday, and asynchronous signals, record/replay is the only viable method. Our library uses the record/replay approach.

For record/replay, both hardware and software-based methods exist. Karma [5] and DeLorean [6] are examples of hardware-based approaches. While Karma intercepts the cache coherence protocols to record inter-processor data dependencies and later uses these recorded dependencies to replay, DeLorean uses a relaxed memory model, where each processor executes instructions in chunks concurrently and an arbiter is used to commit the chunks. Replaying is done by replaying the chunks in the order in which they were committed.

Respec [7] is a software-based method. It logs the ordering of acquisition and release of synchronization objects, such as mutexes, to make replicas acquire the synchronization objects in the same order. It also performs checkpoint/rollback to perform recovery.

For deterministic multithreading also, both hardware and software-based methods exist. An example of a hardware-based approach is Calvin [8]. It executes a program deterministically by executing instructions in the form of chunks and committing them at barrier points deterministically. Kendo [9] is a software based approach that only deals with programs without data races. For efficient deterministic execution, it performs load balancing by only allowing a thread to acquire a synchronization object if that thread has executed fewer instructions than the other threads. The number of instructions is calculated by counting the retired stores.


III. OVERVIEW OF THE LIBRARY

Our library is intended to reduce the probability of failures in the presence of transient faults. The data flow diagram of our fault tolerance scheme is shown in Figure 1.

Fig. 1. Data flow diagram of our fault tolerance scheme

Initially, the leader process creates its replica (the follower process) and the watchdog. The execution is divided into time slices (epochs). At the end of each epoch, the memories of the leader and follower processes are compared by the leader. If no divergence is found, a checkpoint is taken and output to files or screen is committed. The previous checkpoint is also deleted. The checkpoint is basically a suspended process which is identical to the leader process at the time the checkpoint is taken. If a divergence is found at the end of an epoch, the leader process signals the checkpoint process to start and kills itself and its follower. When the checkpoint process starts, it becomes the leader and creates its own follower. It might also happen that the leader or follower process is unable to reach the end of an epoch, due to some error which hangs it. In that case, the watchdog process detects such hangs by using timeouts and signals the checkpoint process to start. The watchdog process itself is less vulnerable to transient faults as it remains idle most of the time.

IV. IMPLEMENTATION

This section discusses the implementation of our library. In Section IV-A, we discuss how the follower process is created. In Section IV-B, we discuss our memory allocation technique, which is followed by Section IV-C, where we explain how we are able to deterministically access the shared memory through mutexes. Section IV-D discusses our error detection mechanism, while Section IV-E discusses recovery. Lastly, in Section IV-F, we discuss how our library handles I/O and non-deterministic functions such as gettimeofday. Note that currently our library does not support deterministic replaying of asynchronous signals, which is left as future work.

A. Follower creation

Our library assumes that threads in the application are created once at the start of the application. Therefore, we create the follower process at the point in the code where the threads are created. For this purpose, we replace pthread_create with r_pthread_create, which internally calls the pthread_create function and indirectly calls Linux's fork function.

To make sure that the follower threads have the same stack contents as the leader, our library itself allocates the memory used for thread stacks. These allocated stacks are then passed as attributes to the pthread_create function (called from r_pthread_create). When the leader calls the fork function, although all threads besides the one calling it die in the forked process, the stack contents are still present in memory. Therefore, to recreate a thread with the same stack in the follower, we call pthread_create with the same stack attribute. For thread identification, we use a thread local variable, so that we can relate a thread in the follower process with that in the leader process.

To make sure that a follower thread also has the same register values as the corresponding leader thread, the start routine passed as an argument to the pthread_create function (called from r_pthread_create) is not the user's start routine itself but wrapper_thread_start, which then calls the start routine provided as a parameter to it. For the leader process, this function calls our barrier function r_pthread_barrier_wait in the beginning. When r_pthread_barrier_wait is called for the first time, each thread of the leader saves its registers, including the instruction pointer, by using the C setjmp function. The follower is itself created by the main thread by calling the fork function from within the r_pthread_barrier_wait function. The newly forked process then recreates the threads, and each thread jumps to the same location as the corresponding thread in the leader process by using the C longjmp function. This is done by having the forked process also pass wrapper_thread_start as the start routine to pthread_create. But unlike in the leader process, the follower threads call longjmp at the beginning of this function.

Note that since the main thread is replicated by the fork itself, we do not need to recreate it. However, r_pthread_barrier_wait needs to be inserted just after the code where the threads are created, so that all the threads are at a barrier point when the follower is forked.
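The setjmp/longjmp mechanics described above can be sketched as follows. This is a minimal illustration under simplifying assumptions (the names thread_ctx, my_tid and follower_thread_start are ours; the actual barrier and fork logic is omitted); it relies on the fact that the follower thread is recreated on the very stack memory that fork copied, so the saved frame is still valid when longjmp is called.

    #include <pthread.h>
    #include <setjmp.h>

    #define MAX_THREADS 64

    static jmp_buf thread_ctx[MAX_THREADS]; /* copied into the follower by fork */
    static int is_leader = 1;               /* cleared in the forked follower   */
    static __thread int my_tid;             /* thread-local id, same in both    */

    /* Called by each leader thread inside r_pthread_barrier_wait to save
     * its registers, including the instruction pointer. All leader threads
     * are parked inside this function when the follower is forked, so the
     * saved frames are still live at fork time. */
    void save_context(void)
    {
        if (setjmp(thread_ctx[my_tid]) != 0) {
            /* A recreated follower thread resumes execution here, with the
             * same stack contents and registers the leader thread had. */
        }
    }

    /* Start routine used when the follower recreates its threads. */
    void *follower_thread_start(void *arg)
    {
        my_tid = *(int *)arg;
        longjmp(thread_ctx[my_tid], 1); /* jump into the saved frame */
        return NULL;                    /* never reached */
    }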
B. Memory allocation

In an operating system with Address Space Layout Randomization (ASLR), malloc can be non-deterministic. This is because malloc internally uses mmap for allocating memory blocks of large sizes, and mmap can be non-deterministic. Therefore, whenever the memory allocator uses mmap, we make sure the follower has the same address returned for mmap by calling mmap with the MAP_FIXED flag and the address returned by the leader process.

The variables used by our library (not related to original program execution) to perform deterministic execution may have different values for the leader and follower processes, for example, the flag used to distinguish the leader process from the follower process. For these variables, we use a separate memory, which is allocated with mmap. This memory is not compared for error detection.
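A sketch of this deterministic allocation path is shown below: the leader records the address mmap returned, and the follower maps at that exact address with MAP_FIXED. The log_addr/read_addr helpers stand in for the library's leader-to-follower logging channel and are assumptions of this sketch.

    #include <stddef.h>
    #include <sys/mman.h>

    void  log_addr(void *addr);  /* leader: record address for the follower (assumed) */
    void *read_addr(void);       /* follower: read the leader's address (assumed)     */

    extern int is_leader;

    /* Deterministic mmap: leader and follower end up with the same mapping. */
    static void *det_mmap(size_t len)
    {
        void *addr;
        if (is_leader) {
            addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            log_addr(addr);
        } else {
            /* MAP_FIXED forces the mapping to the leader's address. */
            addr = mmap(read_addr(), len, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
        }
        return addr;
    }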

Memory segment Memory segment R PTHREAD MUTEX LOCK function (ref pthread mutex log tm) with N pages with N pages if isLeader then lock(m.mutex) 4096 × N 4096 × N m.leaderClock = m.leaderClock +1 while m.threadClock[tid] > 0 Memory space end while m.threadClock[tid]=m.leaderClock else Fig. 3. Memory pages can be grouped into segments to reduce the overhead while not (m.threadClock[tid]==(m.followerClock +1)) of memory comparison for error detection end while m.threadClock[tid]=0 the code. This function will create a barrier (by calling lock(m.mutex) end if r pthread barrier wait) only if the program has reached the end function end of an epoch.

D. Error detection

At regular intervals (epochs) of one second, the leader and follower processes calculate checksums by performing modular sums of the contents of the dirtied (modified) memory pages, which are then compared by the leader. If a discrepancy is found, a fault is detected. The follower keeps its checksum in the shared memory so that the leader can read it from there for comparison. We perform memory comparison at barriers which are already found in the program. If an insufficient number of barriers is found in the program, the programmer can insert our library function r_potential_barrier_wait in the code. This function will create a barrier (by calling r_pthread_barrier_wait) only if the program has reached the end of an epoch.

To note down dirtied pages, at the start of each epoch we give only read access to memory pages, so that page faults can be trapped to note down dirtied pages. To reduce the number of such page faults, however, we exploit the concept of spatial locality of data and segment memory into multiple pages, as shown in Figure 3. A write on any part of a read-protected segment of N pages is handled by giving write access to all the N pages in that segment. This improves the execution considerably, as discussed in Section V, where we discuss the performance evaluation.

Fig. 3. Memory pages can be grouped into segments to reduce the overhead of memory comparison for error detection

The watchdog process is used to detect hangs and recover from them. At the end of each epoch, the leader process sends a signal to the watchdog process to signal that it is not hung.

E. Recovery

For fault recovery, we use checkpoint/rollback. Checkpointing is done by forking a process and suspending it. If the leader process detects an error or the watchdog detects a hang, a signal is sent to the checkpoint process to start execution. The leader and follower processes are also killed. The checkpoint process then becomes the new leader and forks its own follower. The checkpoint process also resets the mutex clocks (which exist in shared memory), since they could have been corrupted by an error.

Creation of the checkpoint process is very similar to creation of the follower (see Section IV-A), with the difference being that the checkpoint process is suspended in the beginning. Only when it is signalled to start does it recreate the threads and start execution.
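The fork-and-suspend checkpoint can be sketched as below. The choice of SIGSTOP/SIGCONT and the become_leader hook are assumptions of this illustration rather than the library's exact mechanism.

    #include <signal.h>
    #include <sys/types.h>
    #include <unistd.h>

    static pid_t checkpoint_pid = -1;

    /* Take a checkpoint: the forked child suspends itself immediately,
     * preserving a copy-on-write snapshot of the parent's memory. */
    void take_checkpoint(void)
    {
        pid_t pid = fork();
        if (pid == 0) {                    /* child: the checkpoint process */
            raise(SIGSTOP);                /* sleep until recovery is needed */
            /* Resumed by SIGCONT: become the new leader, i.e. recreate
             * threads, reset shared mutex clocks, fork a fresh follower. */
            /* become_leader(); (assumed helper) */
            return;
        }
        if (checkpoint_pid > 0)
            kill(checkpoint_pid, SIGKILL); /* discard the previous checkpoint */
        checkpoint_pid = pid;
    }

    /* On divergence or hang: kill the replicas and wake the checkpoint. */
    void rollback(pid_t leader, pid_t follower)
    {
        kill(leader, SIGKILL);
        kill(follower, SIGKILL);
        kill(checkpoint_pid, SIGCONT);
    }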
F. I/O and non-deterministic functions

For I/O, our library allows deterministic I/O for sequential file access and screen writes. A write to a file or the screen is only performed after making sure that no error occurred during an epoch. For that purpose, no output is committed during an epoch; instead it is buffered. Therefore, our library defines a structure r_FILE, which contains not only the FILE pointer but also the buffer. The r_fopen function returns a pointer to this structure. The buffers are then committed at the end of an epoch, after comparing the buffer contents of the leader and follower by using checksums. For sequential file reading, we provide the functions r_fread and r_fscanf. The r_FILE structure also contains the file offset value at the last epoch, so that the file pointer can be rewound to the previous value in case of a rollback.

For non-deterministic functions such as gettimeofday, our library allows the programmer to create a deterministic wrapper by using the functions r_log_data and r_read_data. r_log_data is called by the leader process to log the outputs of such a function, while the follower process reads the outputs by calling the r_read_data function.
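A deterministic wrapper in the style just described could look as follows. The exact r_log_data/r_read_data signatures are not given in the paper, so the ones used here are assumptions of this sketch.

    #include <stddef.h>
    #include <sys/time.h>

    extern int is_leader;
    void r_log_data(const void *buf, size_t len);  /* assumed signature */
    void r_read_data(void *buf, size_t len);       /* assumed signature */

    /* Deterministic gettimeofday: the leader queries the real clock and
     * logs the result; the follower replays the logged value, so both
     * replicas observe exactly the same "time of day". */
    int r_gettimeofday(struct timeval *tv)
    {
        int ret;
        if (is_leader) {
            ret = gettimeofday(tv, NULL);
            r_log_data(tv, sizeof *tv);
            r_log_data(&ret, sizeof ret);
        } else {
            r_read_data(tv, sizeof *tv);
            r_read_data(&ret, sizeof ret);
        }
        return ret;
    }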

V. PERFORMANCE EVALUATION

We selected 5 benchmarks, one from the PARSEC [10] and four from the SPLASH-2 [11] benchmark sets. We ran all our benchmarks on an 8 core (dual socket with 2 quad cores), 2.67 GHz Intel Xeon processor with 32GB of RAM, running CentOS Linux version 5, with kernel 2.6.18. We used gcc 4.4.4 and optimization level -O3 to compile our benchmarks. The results are shown in Table I. For each benchmark, we show the results for 2 and 4 threads (for redundant execution, 4 threads means 8 threads in total). We compare the original execution time with the deterministic execution time (excluding the overhead of error detection, checkpointing and the watchdog) and the overall time, which includes all the overheads. We used memory segments of size 4 (see Section IV-D for a discussion of memory grouping). For fluidanimate, which has a high lock frequency, our library only adds an overhead of 42%, while Respec adds an overhead of 67%. Also, for ocean, which has high memory consumption, Respec adds an overhead of 43%, while our library adds an overhead of just 12% due to the optimized memory comparison scheme.

TABLE I: PERFORMANCE RESULTS OF OUR SCHEME FOR THE SELECTED BENCHMARKS

Benchmark    | Threads | Epochs | Locks   | Pages compared | Original time (ms) | Deterministic exec time (ms) & overhead | Overall time (ms) & overhead
Fluidanimate | 2       | 3      | 4313983 | 33890          | 1988               | 2298 (16%)                              | 2379 (20%)
Fluidanimate | 4       | 2      | 7466285 | 25380          | 1205               | 1696 (41%)                              | 1712 (42%)
Ocean        | 2       | 4      | 5046    | 104504         | 2432               | 2459 (1%)                               | 3091 (27%)
Ocean        | 4       | 3      | 10092   | 80618          | 1890               | 1895 (0%)                               | 2119 (12%)
Water-nsq    | 2       | 3      | 125047  | 12624          | 1837               | 1859 (1%)                               | 1861 (1%)
Water-nsq    | 4       | 2      | 188142  | 24912          | 1090               | 1112 (2%)                               | 1170 (7%)
Radiosity    | 2       | 2      | 6124778 | 6920           | 920                | 1140 (24%)                              | 1153 (25%)
Radiosity    | 4       | 1      | 6170016 | 12616          | 589                | 854 (44%)                               | 865 (46%)
Raytrace     | 2       | 1      | 121958  | 2344           | 1116               | 1216 (9%)                               | 1260 (13%)
Raytrace     | 4       | 1      | 121960  | 2348           | 663                | 700 (6%)                                | 738 (11%)

Figure 4 shows the impact of grouping memory pages on performance. We show the results for memory segment sizes of 1, 2, 4 and 8. Note that the overhead shown is after subtracting the overhead of deterministic execution. We can see that for applications like Ocean, which have high memory usage, we get significant performance gains using page grouping. However, grouping too many pages can also cause the application to compare more pages which have not actually been modified by the application, thus creating unnecessary overhead. This is evident for Raytrace, which has lower memory usage than the other benchmarks. However, for 4 pages, all five benchmarks show a performance gain.

Fig. 4. Reduction in overhead by grouping memory into segments

VI. CONCLUSION

In this paper, we described the design and implementation of a user-level library for fault tolerance of multithreaded user-level applications running on shared memory multicore systems. Our library requires the programmer to make only small modifications to the program to provide fault tolerance. It allows the creation of a multithreaded redundant process for detecting errors and provides a facility for checkpointing and rollback for recovery. We also applied several optimizations to speed up the execution, like reducing the memory required for record/replay logging and optimizing memory comparison for error detection. Empirical measurements on the tested benchmarks show that the overhead does not exceed 46% for four threads.

ACKNOWLEDGMENT

This research is supported by Artemis through the SMECY project (grant 100230).

REFERENCES

[1] R. Baumann, "Soft errors in advanced semiconductor devices-part i: the three radiation sources," Device and Materials Reliability, IEEE Transactions on, vol. 1, no. 1, pp. 17-22, March 2001.
[2] S. Nomura, M. D. Sinclair, C.-H. Ho, V. Govindaraju, M. de Kruijf, and K. Sankaralingam, "Sampling + dmr: practical and low-overhead permanent fault detection," in Proceedings of the 38th annual international symposium on Computer architecture, ser. ISCA '11, 2011, pp. 201-212.
[3] F. B. Schneider, "Implementing fault-tolerant services using the state machine approach: a tutorial," ACM Comput. Surv., vol. 22, pp. 299-319, December 1990.
[4] H. Mushtaq, Z. Al-Ars, and K. Bertels, "Survey of fault tolerance techniques for shared memory multicore/multiprocessor systems," in Design and Test Workshop (IDT), 2011 IEEE 6th International, December 2011, pp. 12-17.
[5] A. Basu, J. Bobba, and M. D. Hill, "Karma: scalable deterministic record-replay," in Proceedings of the international conference on Supercomputing, ser. ICS '11, 2011, pp. 359-368.
[6] P. Montesinos, L. Ceze, and J. Torrellas, "Delorean: Recording and deterministically replaying shared-memory multiprocessor execution efficiently," in Proceedings of the 35th Annual International Symposium on Computer Architecture, ser. ISCA '08, 2008, pp. 289-300.
[7] D. Lee, B. Wester, K. Veeraraghavan, S. Narayanasamy, P. M. Chen, and J. Flinn, "Respec: efficient online multiprocessor replay via speculation and external determinism," in Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems, ser. ASPLOS '10, 2010, pp. 77-90.
[8] D. Hower, P. Dudnik, M. Hill, and D. Wood, "Calvin: Deterministic or not? Free will to choose," in High Performance Computer Architecture (HPCA), 2011 IEEE 17th International Symposium on, February 2011, pp. 333-334.
[9] M. Olszewski, J. Ansel, and S. Amarasinghe, "Kendo: efficient deterministic multithreading in software," SIGPLAN Not., vol. 44, pp. 97-108, March 2009.
[10] C. Bienia, S. Kumar, J. P. Singh, and K. Li, "The parsec benchmark suite: characterization and architectural implications," in Proceedings of the 17th international conference on Parallel architectures and compilation techniques, ser. PACT '08, 2008, pp. 72-81.
[11] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, "The splash-2 programs: characterization and methodological considerations," SIGARCH Comput. Archit. News, vol. 23, pp. 24-36, May 1995.

Efficient Software-Based Fault Tolerance Approach on Multicore Platforms

Hamid Mushtaq, Zaid Al-Ars, Koen Bertels
Computer Engineering Laboratory, Delft University of Technology
{H.Mushtaq, Z.Al-Ars, K.L.M.Bertels}@tudelft.nl

Abstract—This paper describes a low overhead software-based fault tolerance approach for shared memory multicore systems. The scheme is implemented at user-space level and requires almost no changes to the original application. Redundant multithreaded processes are used to detect soft errors and recover from them. Our scheme makes sure that the execution of the redundant processes is identical even in the presence of non-determinism due to shared memory accesses. It provides a very low overhead mechanism to achieve this. Moreover, it implements a fast error detection and recovery mechanism. The overhead incurred by our approach ranges from 0% to 18% for the selected benchmarks. This is lower than comparable systems published in literature.

I. INTRODUCTION

The abundant computational resources available in multicore systems have made it feasible to implement otherwise prohibitively intensive tasks on consumer grade systems. However, these systems integrate billions of transistors to implement multiple cores on a single die, thus raising reliability concerns, as smaller transistors are more susceptible to both transient [12] as well as permanent [13] faults.

A common approach for providing fault tolerance is to perform redundant execution of the software. This is done by using the state machine replication approach [14]. In this approach the replicated copies of a process (known as replicas) follow the same execution sequence and produce the same output if given the same input. This requirement necessitates that the replicas handle non-deterministic events such as asynchronous signals and non-deterministic functions (such as gettimeofday) deterministically. This is usually done by having one replica log the non-deterministic events and having the other replicas replay them at the same point in program execution. In a shared memory multithreaded program, this also means that the original and replica processes perform non-deterministic shared memory accesses deterministically, so that they do not diverge in the absence of faults.

Different software-based solutions have been proposed for deterministic execution of shared memory multithreaded programs on multicore processors, such as DTHREADS [9] and CoreDet [1], which are too slow to be used for practical purposes. On the other hand, Kendo [7], while an efficient solution, suffers from a portability problem, as it requires the use of deterministic hardware performance counters, which are not available on many platforms [10]. Respec [5] is a record/replay approach for fault tolerance that requires kernel modification and also does not have a highly efficient method of memory comparison for error detection. Moreover, it does not perform deterministic execution very efficiently for benchmarks with high lock frequencies.

In this paper, we describe a software based efficient fault tolerance scheme that performs the following.

1) The scheme is implemented using a user-level library and does not require a modified kernel.
2) Record and replay of synchronization operations is made efficient and scalable by eliminating atomic operations and true and false sharing of cache lines.
3) The error detection mechanism is optimized to perform memory comparisons of the replicas efficiently in user-space.

In Section II we discuss the background and related work, while in Section III, we discuss the implementation. In Section IV, we evaluate the performance of our scheme. We finally conclude the paper with Section V.

II. BACKGROUND AND RELATED WORK

For error detection of software running on a single core, fault tolerant systems commonly employ redundant execution at different levels of abstraction: at instruction, process or virtual machine level [15]. Schemes which work at instruction level have low error detection latencies, especially those which operate at hardware level. On the other hand, schemes which work at process and virtual machine level allow an error to propagate before detecting it. Another important issue in process level and virtual machine level systems is that they need to cater for non-deterministic events, such as interrupts, and non-deterministic functions, such as time of the day. They need to make sure that execution of the replicas is deterministic with respect to each other. Such schemes usually use the concept of primary and backup replicas, where the primary is responsible for logging information about the non-deterministic events to be used by the backups. For this purpose, non-deterministic events such as asynchronous signals have to be executed at the same point in the code by the replicas. As an example, [16] defers asynchronous signal handling to known points in the code, such as function calls, system calls or backward branches.

In multithreaded programs running on multicore processors, there is one more source of non-determinism, which is shared memory accesses. These accesses are much more frequent than interrupts or signals. Therefore, efficient deterministic execution of replicas in such systems is much more difficult to achieve.


Lately, effort has been made to create deterministic languages that ensure deterministic execution of a program. Examples of programming languages designed for deterministic parallel execution are StreamIt [8] and SHIM [3]. However, porting programs written in traditional languages to deterministic languages is difficult, as the learning curve is high for programmers used to programming in traditional languages. Therefore, deterministic execution at runtime is still the only viable solution for most users.

One such method for runtime deterministic execution is CoreDet [1], which uses bulk-synchronous quanta along with store buffers and a relaxed memory model to achieve determinism. Since this method requires bulk-synchronous quanta, it has a very high overhead (1-11x for 8 cores) and limited scalability.

Kendo [7] is a software approach that works only on programs without data races, that is, those that access shared memory only through synchronization objects. It executes threads deterministically and performs load balancing by only allowing a thread to complete a synchronization operation when its logical clock, which is used to perform deterministic execution, becomes less than those of the other threads. Since this method requires global communication among threads for reading clock values, it also has limited scalability.

Respec [5] is a record/replay software approach that only logs synchronization objects rather than every shared memory access. If divergence is found between the replicas, it rolls back and re-executes from a previous checkpoint. However, if divergence is found again on re-execution, a race condition is assumed. At that point, a stricter deterministic execution is performed. Respec uses producer-consumer queues: a queue is shared between a thread in the leader and its corresponding thread in the follower, and is used to record logical clocks for mutexes. Each recorded operation atomically increments a clock. Since having a producer-consumer queue for each mutex would require a large amount of memory, Respec only uses a fixed number of clocks, that is, 512. The hash of the address of a mutex is used to point to its logical clock. A thread in the follower process only acquires a mutex when its logical clock matches that recorded by the corresponding thread of the leader.

III. FAULT TOLERANCE SCHEME

Our fault tolerant scheme is intended to reduce the probability of failures in the presence of transient faults. The data flow diagram of our fault tolerance scheme is shown in Figure 1.

Fig. 1. Data flow diagram of our fault tolerance scheme

Initially, the leader process (which is the original process, highlighted in the figure) creates the watchdog and follower processes. The follower process is identical to the leader process and follows the same execution path. The execution is divided into time slices known as epochs. An epoch starts and ends at a program barrier. At the end of each epoch, the memories of the leader and follower processes are compared by the follower. If no divergence is found, a checkpoint is taken and output to files or screen is committed. The previous checkpoint is also deleted. The checkpoint is basically a suspended process which is identical to the leader process at the time the checkpoint is taken. If a divergence is found at the end of an epoch, the follower process signals an error to the leader process, which in turn signals the checkpoint process to start and kills itself and its follower. This can also happen inside an epoch, if the follower sees that the parameters (of synchronization functions or system calls) logged by the leader do not match those read by the follower. When the checkpoint process starts, it becomes the leader and creates its own follower. It might also happen that the leader or follower process is unable to reach the end of an epoch, due to some error which hangs it. In that case, the watchdog process detects such hangs by using timeouts and signals the checkpoint process to start. The watchdog process itself is less vulnerable to transient faults as it remains idle most of the time.
At that point, a stricter deterministic execution determinism in the presence of non-deterministic functions and is performed. It uses producer-consumer queues. A queue is shared memory accesses. Moreover, we need to make sure shared between a thread in the leader and its corresponding that the leader and follower processes use the same memory thread in the follower and is used to record logical clocks addresses. For this we need to have a deterministic memory for mutexes. Each recorded operation atomically increments a allocation scheme. Finally we also need to make sure that we clock. Since having a producer-consumer queue for each mutex have deterministic I/O. Below we discuss how we handle these will require a large memory, Respec only uses fixed number issues. of clocks, that is, 512. The hash of the address of a mutex is used to point to its logical clock. A thread in the follower 1) Replica creation: Our library creates a follower from process only acquires a mutex when its logical clock matches the leader process by using fork system call, at the beginning that recorded by the corresponding thread of the leader. and also when a rollback is done. This is because at a rollback, the checkpoint process becomes the leader and creates its own follower, which uses the same memory addresses as the leader III.FAULT TOLERANCE SCHEME process. We use our own version of pthread create function to Our fault tolerant scheme is intended to reduce probability make sure that the leader and follower processes use the same of failures in the presence of transient faults. The data flow stack addresses for the threads. For this purpose, the leader diagram of our fault tolerance scheme is shown in Figure 1. process logs these addresses to be consumed by the follower. Initially, the leader process (which is the original process For thread identification, we use a thread local variable, so that highlighted in the figure) creates the watchdog and follower we can relate a thread in the follower process with that in the processes. The follower process is identical to the leader leader process. 27

Produces Consumes Algorithm 1 Pseudocode for deterministic lock Thread 1 of Thread 1 of Queue for Thread 1 Leader Follower function R PTHREAD MUTEX LOCK(ref pthread mutex log t m) q = GetQueue(tid) . There is a separate queue for each thread if isLeader then Shared Memory r = lock(m.mutex) if r == 0 then . Only if lock call is successful, increment the clock m.clock = m.clock + 1 . m.clock does not need to be atomic Produces Consumes Thread 2 of Thread 2 of Queue for Thread 2 end if Leader Follower while !pushq(q, MUTEXLOCK, m.mutex, m.clock, r) end while return r else . Follower Fig. 2. Communication between the leader and follower processes for while not !popq(q, ref type, ref mutex, ref clock, ref r) deterministic execution end while if type != MUTEXLOCK and mutex != m.mutex then . Logged parameters do not match SignalErrorAndExit() 2) Memory allocation: We implement our own memory end if if r != 0 then allocation functions to allocate memory deterministically. In an return r operating system with Address Space Layout Randomization end if while (m.clock+1) != clock (ASLR), malloc can be non-deterministic. This is because end while malloc internally uses mmap for allocating memory blocks lock(m.mutex) m.clock = m.clock + 1 of large sizes and mmap can be non-deterministic. Therefore, return 0 end if whenever the memory allocator uses mmap, we make sure the end function follower has the same address returned for mmap by calling function PUSHQ(q, type, addr, clock, r) . Called by Leader mmap with MAP FIXED flag and the address returned by the lindex = GetLeaderQIndex(tid) leader process.We also make sure that threads of the leader and if checkQElementsForZero(lindex) then q[lindex].type = type follower processes call the malloc function in the same order q[lindex].addr = addr by internally using a mutex, which is locked and unlocked q[lindex].clock = clock q[lindex].r = r + 1 deterministically. SetLeaderQIndex((lindex + 1) %QCAPACITY) return TRUE else The variables used by our library (not related to original return FALSE program execution) to perform deterministic execution, may end if have different values for the leader and follower processes, for end function example, the flag used to distinguish the leader process from function POPQ(q, ref type, ref addr, ref clock, ref r) . Called by Follower findex = GetFollowerQIndex(tid) the follower process. For these variables, we use a separate if checkQElementsForNonZero(qindex) then memory, which is allocated with mmap. This memory is not type = q[findex].type addr = q[findex].addr compared for error detection. clock = q[findex].clock r = q[findex].r - 1 setQElementsToZero(findex) 3) Deterministic shared memory accesses: For redundant SetFollowerQIndex((findex + 1) %QCAPACITY) deterministic execution, it is necessary that the leader and return TRUE else follower processes perform shared memory accesses in the return FALSE same order. For this purpose, a mutex is enclosed in a special end if data structure, which also contains a pointer to clocks for that end function mutex to aid in deterministic execution. Whenever a thread in the leader process acquires a mutex, it increments the mutex’s clock. A thread in the follower only acquires the same mutex in communication between the leader and follower processes is its execution, when its clock matches that for the corresponding shown in Figure 2. After acquiring the mutex, the follower thread in the leader. also increments its clock for that mutex. 
3) Deterministic shared memory accesses: For redundant deterministic execution, it is necessary that the leader and follower processes perform shared memory accesses in the same order. For this purpose, a mutex is enclosed in a special data structure, which also contains a pointer to the clocks for that mutex to aid in deterministic execution. Whenever a thread in the leader process acquires a mutex, it increments the mutex's clock. A thread in the follower only acquires the same mutex in its execution when its clock matches that of the corresponding thread in the leader.

We create our own deterministic versions of pthread's synchronization functions, such as pthread_mutex_lock, pthread_mutex_unlock, pthread_trylock, pthread_cond_wait, pthread_broadcast, pthread_barrier_wait, etc. Since pthread_mutex_lock is the most frequently used, and is also used in our implementation of the other pthread synchronization functions, we discuss our pthread_mutex_lock algorithm here; it is shown in Algorithm 1. We also have our own versions of the data structures representing the synchronization objects, for example, pthread_mutex_log_t instead of pthread_mutex_t. Here m represents an object of the pthread_mutex_log_t structure, which holds a mutex and its clocks. There is one such object for each mutex in the program. Therefore, deterministic access to a mutex is independent of other mutexes in the program, hence improving scalability.

Algorithm 1 Pseudocode for deterministic lock

    function R_PTHREAD_MUTEX_LOCK(ref pthread_mutex_log_t m)
        q = GetQueue(tid)              ▷ There is a separate queue for each thread
        if isLeader then
            r = lock(m.mutex)
            if r == 0 then             ▷ Only if the lock call is successful, increment the clock
                m.clock = m.clock + 1  ▷ m.clock does not need to be atomic
            end if
            while !pushq(q, MUTEXLOCK, m.mutex, m.clock, r)
            end while
            return r
        else                           ▷ Follower
            while !popq(q, ref type, ref mutex, ref clock, ref r)
            end while
            if type != MUTEXLOCK or mutex != m.mutex then  ▷ Logged parameters do not match
                SignalErrorAndExit()
            end if
            if r != 0 then
                return r
            end if
            while (m.clock + 1) != clock
            end while
            lock(m.mutex)
            m.clock = m.clock + 1
            return 0
        end if
    end function

    function PUSHQ(q, type, addr, clock, r)        ▷ Called by the leader
        lindex = GetLeaderQIndex(tid)
        if checkQElementsForZero(lindex) then
            q[lindex].type = type
            q[lindex].addr = addr
            q[lindex].clock = clock
            q[lindex].r = r + 1
            SetLeaderQIndex((lindex + 1) % QCAPACITY)
            return TRUE
        else
            return FALSE
        end if
    end function

    function POPQ(q, ref type, ref addr, ref clock, ref r)   ▷ Called by the follower
        findex = GetFollowerQIndex(tid)
        if checkQElementsForNonZero(findex) then
            type = q[findex].type
            addr = q[findex].addr
            clock = q[findex].clock
            r = q[findex].r - 1
            setQElementsToZero(findex)
            SetFollowerQIndex((findex + 1) % QCAPACITY)
            return TRUE
        else
            return FALSE
        end if
    end function

When a leader thread acquires a mutex, it increments the leader's clock for that mutex and also records that value in a circular queue, so that the follower can acquire the lock when its clock reaches one less than the same value. The communication between the leader and follower processes is shown in Figure 2. After acquiring the mutex, the follower also increments its clock for that mutex.

Unlike Respec, which uses a hash table of 512 entries to keep the clocks for all the synchronization objects, we use a separate clock for each mutex. The benefit of this is that we can avoid using atomic variables for accessing the clocks, as a clock can be incremented after acquiring the lock.

We also optimize the queue access by avoiding atomic variables and avoiding true and false sharing of cache lines. For that purpose, we use a lockless queue, as shown by the pushq and popq functions in Algorithm 1. This is unlike Respec, which uses atomic operations where necessary to access the queue. The typical method of implementing a lockless queue (which we call naive in this paper) is to use shared tail and head indexes. Since in this method the producer and consumer read the head or tail index at the same time as the other is writing to it, this causes cache thrashing. Hence it is a true sharing problem. We avoid this by having local indexes for the producer (leader) and consumer (follower). The check for emptiness and fullness is done by checking the data values instead. The producer only writes to the queue when all the fields of the queue element it is about to write to are zero, while the consumer only reads a queue element when all of its fields are non-zero.

Here, since the value of r, which represents the result returned by a synchronization function, can be zero, we add one to its value while pushing and subtract one from it when popping. We make sure that the indexes for the leader and follower do not share the same cache line by having sufficient padding between them. This ensures that we do not have a false sharing problem either.
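In C, the queue described above might look as follows. This is an illustrative sketch, not the library's actual code: the capacity is arbitrary, the fields are assumed to be non-zero when valid (the clock starts at 1, the type constant is non-zero, and r is stored as r + 1), and the volatile qualifiers and memory barriers needed on weakly ordered machines are omitted.

    #include <string.h>

    #define QCAPACITY 4096

    struct qelem {                /* an all-zero element means "empty slot" */
        int type;
        void *addr;
        unsigned long clock;
        int r;                    /* stored as r + 1 so a result of 0 stays non-zero */
    };

    struct spsc_queue {
        struct qelem e[QCAPACITY];
        /* Each index is written by one side only; the alignment keeps the
         * two indexes on different cache lines (no false sharing). */
        int head __attribute__((aligned(64)));   /* leader's local index   */
        int tail __attribute__((aligned(64)));   /* follower's local index */
    };

    /* Leader side: succeed only if the slot is completely zero (empty). */
    static int pushq(struct spsc_queue *q, int type, void *addr,
                     unsigned long clock, int r)
    {
        struct qelem *s = &q->e[q->head];
        if (s->type || s->addr || s->clock || s->r)
            return 0;                    /* full: the caller spins and retries */
        s->addr = addr; s->clock = clock; s->r = r + 1;
        s->type = type;                  /* written last: marks the slot ready */
        q->head = (q->head + 1) % QCAPACITY;
        return 1;
    }

    /* Follower side: succeed only if the slot is fully written (non-zero). */
    static int popq(struct spsc_queue *q, int *type, void **addr,
                    unsigned long *clock, int *r)
    {
        struct qelem *s = &q->e[q->tail];
        if (!s->type || !s->addr || !s->clock || !s->r)
            return 0;                    /* empty: the caller spins and retries */
        *type = s->type; *addr = s->addr; *clock = s->clock; *r = s->r - 1;
        memset(s, 0, sizeof *s);         /* hand the slot back to the leader */
        q->tail = (q->tail + 1) % QCAPACITY;
        return 1;
    }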
4) System Calls, Non-deterministic Functions and I/O: We use LD_PRELOAD to replace the system call wrappers found in glibc with our own versions, which perform logging. This is possible because most system calls can be, and usually are, called through their user-space wrapper functions. This method will not work, however, if a system call is made without using the wrapper function, for example by using inline assembly. So, with our library, the programmer needs to make sure not to make a system call directly. Since the glibc library sometimes also makes system calls directly, for example by making the clone system call in the pthread_create function, we provide our own version of pthread_create. We also provide our own versions of non-deterministic functions such as gettimeofday and rand, and preload them using LD_PRELOAD.

The leader logs the parameters and output of these non-deterministic functions and system calls. The logged parameters are used by the follower to check for errors (by checking for discrepancies), whereas the logged output is simply read by the follower. Furthermore, each non-deterministic function and system call is protected by a deterministic lock, so that the leader and follower processes perform these calls in the same order.

For I/O, our library allows deterministic I/O for sequential file access and screen writes. A write to a file or the screen is only performed after making sure that no error occurred during an epoch. For that purpose, no output is committed during an epoch; instead, it is buffered. Our library overrides the write and read system call wrappers to allow buffering of the data. The buffers are committed at the end of an epoch, after comparing the buffer contents of the leader and follower using hash-sums. For this purpose, each file opened for writing is allocated a special buffer. It is important that the addresses of these buffers are the same for the leader and follower processes; for this we use a deterministic memory allocation scheme like the one described in Section III-A2. For sequential file reading, the file offset value is saved at the end of each epoch, so that the file can be rewound to the previous value in case of a rollback.

B. Error detection

At regular intervals of 1 second, known as epochs, the dirtied memories of the leader and follower processes are compared. However, the epoch time is reduced to 100 ms if a file or screen output occurs during the epoch. Instead of comparing the memories word by word, the leader and follower processes calculate hash-sums of the dirtied (modified) memory pages, which are then compared. If a discrepancy is found, a fault is detected. The hash-sums are calculated much faster by using the CRC32 instruction of the SSE4.2 instruction set found on modern x86 processors.

The comparison is made even faster by assigning each thread to calculate the hash-sum of a different portion of the memory. The follower keeps its hash-sums in shared memory, so that the leader can read them from there for comparison. We perform memory comparison at barriers which are already present in the program, rather than stopping the threads and creating a barrier. This improves performance, as threads already wait for each other at barriers. If insufficient barriers are found in the program, the programmer can insert calls to the function potential_barrier_wait, which is provided by our library. This function creates a barrier only when required, that is, at the end of an epoch.

Since our scheme runs at the user-space level, we cannot note down dirtied pages while handling page faults (from the kernel) the way Respec does, which is the most efficient method possible. We therefore take special steps to improve performance at the user-space level. At the start of each epoch, we give only read access to the allocated memory pages. Whenever a page is accessed for writing, the OS sends a signal to the accessing thread. In the signal handler, the address of the memory page is noted down and both read and write access is given to that memory page. In this way, we only need to compare the dirtied memory pages at the end of an epoch. Sending signals on each memory page access violation can slow down execution. Therefore, to reduce the number of such signals, we exploit the concept of spatial locality of data and group the memory pages into segments, as shown in Figure 3. A write on any part of a read-protected segment of N pages is handled by giving write access to all the N pages in that segment. This improves the execution considerably, as discussed in Section IV, where we discuss the performance evaluation.

[Fig. 3. Memory pages can be grouped into segments to reduce the overhead of memory comparison for error detection.]
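The user-space dirty-page tracking can be sketched as follows, assuming 4 KB pages, segments aligned to N pages, and a hypothetical helper record_dirty_segment that notes the segment down for the end-of-epoch comparison. A production handler would also need to deal with async-signal-safety and with genuine segmentation faults.

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>

    #define PAGE_SIZE 4096UL
    #define SEG_PAGES 4                    /* N pages unprotected per fault */

    extern void record_dirty_segment(void *seg_start);   /* hypothetical */

    /* On the first write to a read-only segment, note the segment down and
     * re-enable writes for all of its pages at once. */
    static void on_write_fault(int sig, siginfo_t *si, void *ctx)
    {
        uintptr_t page = (uintptr_t)si->si_addr & ~(PAGE_SIZE - 1);
        uintptr_t seg  = page & ~(SEG_PAGES * PAGE_SIZE - 1);

        record_dirty_segment((void *)seg);
        mprotect((void *)seg, SEG_PAGES * PAGE_SIZE, PROT_READ | PROT_WRITE);
    }

    void install_tracking(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = on_write_fault;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);
    }

    /* Called at the start of every epoch: make the tracked region read-only
     * again, so that the next writes fault and get recorded. */
    void begin_epoch(void *region, size_t len)
    {
        mprotect(region, len, PROT_READ);
    }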

Some functions, like the one for comparing memories, change the stacks differently for the leader and follower threads. For those functions, we switch to a temporary stack, so that the original stack remains unaltered by them.

The watchdog process is used to detect hangs and recover from them. At the end of each epoch, the leader process sends a signal to the watchdog process to indicate that it is not hung. In that signal, the process ID of the checkpoint is also sent, so that the watchdog is able to start the checkpoint process in case it detects a hang of the leader or follower process. Hangs are detected by using a timeout. Besides sending the process ID of the checkpoint, the leader also sends the process IDs of itself and the follower process when it forks the follower, so that the watchdog process can kill the leader and follower processes before starting the checkpoint process.

C. Recovery

As discussed previously, for fault recovery we use checkpoint/rollback. Whenever the leader takes a checkpoint, it kills the previous checkpoint. If the leader process detects an error, or the watchdog process detects a hang, a signal is sent to the last checkpoint process, so that the checkpoint process can start execution. The leader and its follower are killed at that point. The checkpoint process then assumes the role of the leader and forks its own follower. It also creates a new checkpoint. Moreover, it resets the mutex clocks (which exist in shared memory), since they could have been corrupted by an error. Checkpoints are taken only at barrier points. For creating a multithreaded follower, we have implemented a special multithreaded fork function that replicates the leader process to create the follower.

IV. PERFORMANCE EVALUATION

We selected 8 benchmarks: two from the PARSEC [2] and six from the SPLASH-2 [11] benchmark sets. We ran all our benchmarks on an 8-core (dual socket with two quad cores), 2.67 GHz Intel Xeon system with 32 GB of RAM. All programs were compiled using gcc 4.4.4 with optimization level -O3. The results are shown in Table I.

For each benchmark, we show the results when the benchmark runs with 2 and 4 threads. The number of epochs executed is shown in the third column, while the number of synchronization operations performed by each process is shown in the fourth. This is followed by the number of barriers and the total number of memory pages compared for error detection. Then the table shows the redundant execution time, which is the time to execute two instances of the same application. For redundant execution, each instance is executed on one of the two quad-core processors of the dual-socket system. Next, we show the time for the deterministic execution scheme, which is execution without error detection and checkpointing, but with deterministic locking and unlocking of the mutexes. This is followed by the overall execution (with error detection, checkpointing and the watchdog process). Finally, we show the overheads of the deterministic and overall execution with respect to the redundant execution time. For the overall execution, the results are shown with a memory grouping size of 4.

A. Results

Figure 4 shows the improvement that we get by avoiding atomic variables and having an optimized queue. The left bars are obtained by running the benchmarks with 2 threads, while the right bars are obtained with 4 threads. The lower portion of each bar shows the overhead with our lockless queue and our method of keeping the clocks for mutexes, the middle portion shows the additional overhead we get when using Respec's method of a hash table for the mutex clocks, and the uppermost portion shows the additional overhead of using a naive lockless queue.

[Fig. 4. Deterministic execution overhead]
We can see that for fluidanimate, which has a high lock frequency, we obtain a significant improvement in the performance of deterministic execution. Furthermore, our method of using separate clocks for each mutex is more scalable than Respec's method of using a limited number of clocks accessed through a hash table, which requires atomic operations. The scalability can be assessed from the fact that for two threads, our scheme and Respec's scheme perform similarly, while for four threads, our scheme performs far better. From this result, we can predict that our scheme will show even better results compared to Respec for larger numbers of cores. Furthermore, our lockless queue also shows much better scalability than a naive lockless queue, due to avoiding true and false sharing of cache lines. Note that we do not get as much improvement for radiosity, which also has a high lock frequency, because it uses far fewer mutexes than fluidanimate, and hence fewer clocks, causing more contention in the communication between the leader and follower threads.

Figure 5 shows the improvement that we get from using the CRC32 instruction for calculating the hash-sums for error detection (as opposed to Respec, which does not use that instruction). The results are especially impressive for benchmarks which modify a higher number of pages, such as fluidanimate, ocean and radix.

[Fig. 5. Memory comparison overhead with and without the CRC32 instruction]
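The hash-sums can be computed with the CRC32 instruction through the SSE4.2 intrinsics, for example as in the sketch below, which hashes one 4 KB page eight bytes at a time (assuming a 64-bit x86 build compiled with -msse4.2; the seed and the lack of a final inversion are arbitrary choices, since this is a hash-sum rather than a standard CRC).

    #include <nmmintrin.h>   /* SSE4.2 intrinsics, including _mm_crc32_u64 */
    #include <stddef.h>
    #include <stdint.h>

    uint32_t page_hashsum(const void *page)
    {
        const uint64_t *p = (const uint64_t *)page;
        uint64_t crc = ~0ULL;
        for (size_t i = 0; i < 4096 / sizeof(uint64_t); i++)
            crc = _mm_crc32_u64(crc, p[i]);   /* one CRC32 instruction per 8 bytes */
        return (uint32_t)crc;
    }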

In Figure 6, we show the overall results in the form of bar graphs, with each factor shown separately. Due to the optimizations we discussed and the reduction in epoch overhead, which is discussed in the next section, the overhead never exceeds 18% for four threads and is negligible for benchmarks with small memory usage and low lock frequencies.

[Fig. 6. Overall overhead, broken down into deterministic execution overhead, epoch overhead, memory comparison overhead, and watchdog and other overheads]

TABLE I. PERFORMANCE RESULTS OF OUR SCHEME FOR THE SELECTED BENCHMARKS

Benchmark    | Threads | Epochs | Synch Ops | Barriers | Pages modified | Redundant exec time (ms) | Det exec time (ms) | Overall time (ms) | Det exec overhead w.r.t redundant exec time (%) | Overall overhead w.r.t redundant exec time (%)
fluidanimate | 2 | 3 | 8689200  | 161   | 25238  | 2073 | 2149 | 2168 | 4  | 5
fluidanimate | 4 | 2 | 16909680 | 161   | 25136  | 1294 | 1370 | 1392 | 6  | 8
ocean        | 2 | 3 | 10092    | 10763 | 103398 | 2819 | 2929 | 3008 | 4  | 7
ocean        | 4 | 2 | 20184    | 10763 | 74518  | 1840 | 1960 | 2029 | 7  | 10
radiosity    | 2 | 2 | 8981938  | 19    | 10378  | 1199 | 1330 | 1347 | 11 | 12
radiosity    | 4 | 2 | 8900495  | 19    | 10344  | 887  | 1019 | 1043 | 15 | 18
radix        | 2 | 2 | 18       | 12    | 57706  | 1072 | 1074 | 1138 | 0  | 6
radix        | 4 | 1 | 54       | 12    | 35170  | 624  | 628  | 666  | 1  | 7
raytrace     | 2 | 2 | 243914   | 2     | 200    | 1214 | 1277 | 1325 | 5  | 9
raytrace     | 4 | 1 | 243918   | 2     | 148    | 690  | 706  | 702  | 2  | 2
swaptions    | 2 | 1 | 77020    | 1     | 6      | 510  | 510  | 510  | 0  | 0
swaptions    | 4 | 1 | 77020    | 1     | 8      | 239  | 242  | 242  | 1  | 1
volrend      | 2 | 3 | 459760   | 241   | 160    | 2490 | 2496 | 2503 | 0  | 1
volrend      | 4 | 2 | 463922   | 241   | 154    | 1443 | 1444 | 1462 | 0  | 1
water        | 2 | 2 | 125047   | 664   | 474    | 1889 | 1941 | 1948 | 3  | 3
water        | 4 | 2 | 188142   | 664   | 550    | 1125 | 1139 | 1148 | 1  | 2

[Fig. 7. Reduction in overhead by grouping memory into segments: epoch overhead (%) versus the number of pages in a memory segment, per benchmark]

B. Impact of Grouping Memory Pages

Figure 7 shows the impact of grouping memory pages (see Section III-B) on the performance. The overhead shown is the epoch overhead, which mainly consists of the overhead of the signals used for noting down dirtied memory pages. We can see that for applications like ocean and radix, which have high memory usage, we get significant performance gains using page grouping. However, it has to be noted that there is a limit to the number of pages which can be grouped for optimal performance, as grouping too many pages will cause the application to compare pages which have not actually been modified, thus creating unnecessary overhead.

V. CONCLUSION

In this paper, we described the design and implementation of a user-space leader/follower based fault tolerance scheme for multithreaded applications running on multicore processors. We applied several optimizations to speed up the execution, such as avoiding atomic variables and true and false sharing of cache lines when recording/replaying synchronization operations, reducing the number of signals sent by the OS on page faults (used to note down dirtied pages), and using the CRC32 instruction from the SSE4.2 instruction set to greatly improve error checking performance. Empirical measurements on the tested benchmarks show that the overhead does not exceed 18% for four threads. We compared our results with Respec and showed that our scheme is more efficient in performing deterministic execution and in comparing memories for error detection. Moreover, we showed that by grouping memory pages, we considerably reduce the overhead of the signals used for noting modified memory pages.

ACKNOWLEDGMENT

This research has been funded by the projects SMECY 100230, iFEST 100203 and REFLECT 248976.

REFERENCES

[1] T. Bergan, O. Anderson, J. Devietti, L. Ceze, and D. Grossman. CoreDet: a compiler and runtime system for deterministic multithreaded execution. SIGARCH Comput. Archit. News, 38:53-64, March 2010.
[2] C. Bienia, S. Kumar, J. P. Singh, and K. Li. The PARSEC benchmark suite: characterization and architectural implications. PACT '08, pages 72-81.
[3] S. A. Edwards and O. Tardieu. SHIM: a deterministic model for heterogeneous embedded systems. EMSOFT '05, pages 264-272.
[4] D. Hower, P. Dudnik, M. Hill, and D. Wood. Calvin: Deterministic or not? Free will to choose. HPCA '11, pages 333-334, February 2011.
[5] D. Lee, B. Wester, K. Veeraraghavan, S. Narayanasamy, P. M. Chen, and J. Flinn. Respec: Efficient online multiprocessor replay via speculation and external determinism. ASPLOS '10, pages 77-90.
[6] P. Montesinos, L. Ceze, and J. Torrellas. DeLorean: Recording and deterministically replaying shared-memory multiprocessor execution efficiently. ISCA '08, pages 289-300.
[7] M. Olszewski, J. Ansel, and S. Amarasinghe. Kendo: efficient deterministic multithreading in software. SIGPLAN Not., 44:97-108, March 2009.
[8] W. Thies, M. Karczmarek, and S. P. Amarasinghe. StreamIt: A language for streaming applications. In Proceedings of the 11th International Conference on Compiler Construction, CC '02, pages 179-196.
[9] T. Liu, C. Curtsinger, and E. D. Berger. Dthreads: Efficient deterministic multithreading. SOSP '11, pages 327-336.
[10] V. Weaver and S. McKee. Can hardware performance counters be trusted? In IEEE International Symposium on Workload Characterization, IISWC 2008, pages 141-150.
[11] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 programs: characterization and methodological considerations. SIGARCH Comput. Archit. News, 23:24-36, May 1995.
[12] R. Baumann. Soft errors in advanced semiconductor devices, part I: the three radiation sources. IEEE Transactions on Device and Materials Reliability, 1(1):17-22, March 2001.
[13] S. Nomura, M. D. Sinclair, C.-H. Ho, V. Govindaraju, M. de Kruijf, and K. Sankaralingam. Sampling + DMR: practical and low-overhead permanent fault detection. ISCA '11, pages 201-212.
[14] F. B. Schneider. Implementing fault-tolerant services using the state machine approach: a tutorial. ACM Comput. Surv., 22:299-319, December 1990.
[15] H. Mushtaq, Z. Al-Ars, and K. Bertels. Survey of fault tolerance techniques for shared memory multicore/multiprocessor systems. IDT 2011, pages 12-17.
[16] A. Shye, J. Blomstedt, T. Moseley, V. Reddi, and D. Connors. PLR: A software approach to transient fault tolerance for multicore architectures. IEEE Transactions on Dependable and Secure Computing, 6(2):135-148, 2009.
[17] O. Laadan, N. Viennot, and J. Nieh. Transparent, lightweight application execution replay on commodity multiprocessor operating systems. SIGMETRICS Perform. Eval. Rev., 38(1):155-166, June 2010.

4. DETERMINISTIC MULTITHREADING

SUMMARY

In this chapter, we present DetLock, a runtime system to ensure deterministic execution of multithreaded programs running on multicore systems. DetLock does not rely on any hardware support or kernel modification to ensure determinism. For tracking the progress of the threads, logical clocks are used. Unlike previous approaches, which rely on non-portable hardware to update the logical clocks, DetLock employs a compiler pass to insert code for updating these clocks, thus increasing portability. Moreover, unlike state-of-the-art approaches, which update the logical clocks only after the corresponding instructions have executed, DetLock can update the logical clocks ahead of time, thus further improving the performance of deterministic multithreading. For 4 cores, the average overhead of these clocks on the tested benchmarks is brought down from 16% to 2% by applying several optimizations. Moreover, the average overall overhead, including deterministic execution, is 14%. We also employed DetLock for fault tolerance.

This chapter is based on the following papers.

1. Mushtaq, H.; Al-Ars, Z.; Bertels, K., DetLock: Portable and Efficient Deterministic Execution for Shared Memory Multicore Systems, High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion, pp. 721-730, 10-16 Nov. 2012

2. Mushtaq, H.; Al-Ars, Z.; Bertels, K., Efficient and highly portable deterministic mul- tithreading (DetLock), Computing, vol. 96, pp. 1131-1147, 2014

3. Mushtaq, H.; Al-Ars, Z.; Bertels, K., Fault tolerance on multicore processors using deterministic multithreading, Design and Test Symposium (IDT), 2013 8th Inter- national, 16-18 Dec. 2013


2012 SC Companion: High Performance Computing, Networking, Storage and Analysis

DetLock: Portable and Efficient Deterministic Execution for Shared Memory Multicore Systems

Hamid Mushtaq, Zaid Al-Ars, Koen Bertels
Computer Engineering Laboratory
Delft University of Technology
Delft, the Netherlands
{H.Mushtaq, Z.Al-Ars, K.L.M.Bertels}@tudelft.nl

Abstract—Multicore systems are not only hard to program but also hard to test, debug and maintain. This is because the traditional way of accessing shared memory in multithreaded applications is to use lock-based synchronization, which is inherently non-deterministic and can cause a multithreaded application to have many different possible execution paths for the same input. This problem can be avoided, however, by forcing a multithreaded application to have the same lock acquisition order for the same input.

In this paper, we present DetLock, which is able to run multithreaded programs deterministically without relying on any hardware support or kernel modification. The logical clocks used for performing deterministic execution are inserted by the compiler. For 4 cores, the average overhead of these clocks on the tested benchmarks is brought down from 20% to 8% by applying several optimizations. Moreover, the overall overhead, including deterministic execution, is comparable to state of the art systems such as Kendo, even surpassing it for some applications, while providing more portability.

I. INTRODUCTION

Single threaded programs are much easier to test, debug and maintain than their multithreaded counterparts. This is because the only sources of non-determinism in them are interrupts or signals, which are rare. On the other hand, multithreaded programs have a frequent source of non-determinism in the form of shared memory accesses. Due to this, multithreaded programs suffer from repeatability problems, which means that running the same program with the same input can result in different outputs. This repeatability problem makes multithreaded programs hard to test and debug. Furthermore, it is also difficult to build fault tolerant versions of these programs. This is because fault tolerance systems usually depend upon replicas (identical copies of redundant processes) to detect errors.

If access to shared data is not protected by synchronization objects in a multithreaded program, we can have race conditions, which may produce unexpected results. Running a program with race conditions deterministically does not avoid the problem of having unexpected results with those race conditions, but just makes sure that we get the same output for the same input.

The ideal situation would be to make a multithreaded program deterministic even in the presence of race conditions. This is not possible to do efficiently with software alone, though. One can use a relaxed memory model where every thread writes to its own private memory, while data is committed to shared memory only at intervals. However, stopping threads regularly for committing to shared memory degrades performance, as demonstrated by CoreDet [2], which has a maximum overhead of 11x for 8 cores. We can reduce the amount of committing to the shared memory by only committing at synchronization points such as locks, barriers or thread creation. This approach is taken by DTHREADS [11]. Here one can still imagine the slowdown in case of applications with high lock frequencies. Moreover, since in this case committing to the shared memory is done less frequently, more data has to be committed, thus also making it slow for applications with high memory usage.

This is why hardware approaches have been proposed to increase the efficiency of deterministic execution. Two such approaches are Calvin [4] and DMP [14]. They use the same concept as CoreDet for deterministic execution, but make use of special hardware for that purpose.

Since performing deterministic execution in software alone is inefficient, we can relax the requirements to improve efficiency. For example, Kendo [9] does this by only supporting deterministic execution for well-written programs that protect every shared memory access through locks. In other words, it supports deterministic execution only for programs without race conditions. The authors of Kendo call this Weak Determinism. Considering the fact that most well-written programs are race free and that there exist tools to detect race conditions, such as Valgrind [8], Weak Determinism is sufficient for most well-written multithreaded programs. Therefore, DetLock also only supports Weak Determinism.

The basic idea of Kendo is that it uses logical clocks for each thread to determine when a thread will acquire a lock. The thread with the least value of the logical clock gets the lock. Though quite efficient, Kendo still suffers from portability problems. First of all, it requires deterministic hardware performance counters for counting logical clocks. Many popular platforms (including many x86 platforms) do not have any hardware performance counter that is deterministic [12]. Secondly, Kendo needs modification of the kernel to allow reading from the hardware performance counters for deterministic execution.


[Figure 1: DetLock modifies the LLVM IR code by inserting code for updating logical clocks. The DetLock pass runs on the LLVM IR produced by the frontend, before the backend translates the IR into the final executable binary.]

To overcome the portability issues faced by Kendo, our tool DetLock takes a completely software-based approach to updating the logical clocks. The code for updating the clocks is inserted through an LLVM [5] compiler pass. Since LLVM is a popular open source compiler framework available on many platforms, our approach is portable across a wide range of platforms. Moreover, it requires no modification of the kernel. We can sum up the contributions of this paper as follows.

• A portable mechanism to update logical clocks for Weak Deterministic execution that depends upon the compiler rather than hardware performance counters, since many platforms have no deterministic counters available.
• A user-space approach to update the logical clocks that does not require modifying the kernel.
• A number of optimization steps to reduce the overhead of the code used to update the logical clocks and improve the performance of deterministic execution.

This paper is organized as follows. In Section II, we discuss the background and related work, while in Section III, we give an overview of DetLock's architecture. This is followed by Section IV, where we present the optimization methods used to improve the performance of DetLock. In Section V, we evaluate the performance of our scheme. We finally conclude the paper with Section VI.

II. BACKGROUND AND RELATED WORK

Single threaded programs are mostly deterministic in behavior. We say mostly because interrupts and signals can introduce non-determinism even in single threaded programs. However, these non-deterministic events are rare. On the other hand, in multithreaded programs running on multicore processors, shared memory accesses are a frequent source of non-determinism.

One way to ensure determinism of multithreaded programs is to write code for them in a deterministic parallel language. Examples of such languages are StreamIt [10], SHIM [3] and Deterministic Parallel Java [1]. The disadvantage of this approach is that porting programs written in traditional languages to deterministic languages is difficult, as the learning curve is high for programmers used to programming in traditional languages. Moreover, in languages which are based on the Kahn Process Network model, such as SHIM, it is difficult to write programs without introducing deadlocks [7].

Deterministic execution at runtime can be done either through hardware or software. Calvin [4] is a hardware approach that executes instructions in the form of chunks and later commits them at barrier points. It uses a relaxed memory model, where instructions are committed in such a way that only the total store order (TSO) of the program has to be maintained. DMP [14] uses a similar relaxed memory approach. The disadvantage of hardware approaches is that they are restricted to the platforms they were developed for.

Besides hardware methods, software-only methods for deterministic execution also exist. One such method is CoreDet [2], which uses bulk-synchronous quanta along with store buffers and a relaxed memory model to achieve determinism. It is therefore similar to Calvin, but implemented in software. Logical clocks are used for deterministic execution. Since CoreDet is implemented in software, it has a very high overhead, up to 11x for 8 cores, as compared to the maximum of 2x for Calvin. Another similar approach is DTHREADS [11]. It runs threads as separate processes, so that the memory regions which are modified can be tracked through the memory management unit. Only at synchronization points, such as locks, barriers and thread creation, does it update the shared memory from the local memories of the threads. Therefore, it avoids the overhead of using bulk-synchronous quanta like CoreDet and also does not need to maintain logical clocks like CoreDet. However, the overhead for programs with high lock frequency or large memory usage is still very high.

Since performing deterministic execution in software alone is inefficient, Kendo [9] relaxes the requirements by only working for programs without race conditions (Weak Determinism). It does not use any special hardware besides the deterministic hardware performance counters found in some processors. It executes threads deterministically and performs load balancing by only allowing a thread to complete a synchronization operation when its clock becomes less than those of the other threads, with ties broken by thread IDs. The clock is calculated from retired stores; it is paused when waiting for a lock and resumed after the lock is acquired. Kendo still suffers from portability problems, as it requires hardware performance counters which are deterministic. Many platforms, including many x86 platforms, do not have any deterministic hardware performance counter [12]. Moreover, Kendo requires modification of the kernel to read from such hardware performance counters.

One technique related to deterministic multithreading is record/replay. In this method, all interleavings of shared memory accesses by different cores/processors are recorded in a log, which can be replayed to have a replica follow the original execution. Examples of schemes using this method are Rerun [16] and Karma [15]. These schemes intercept cache coherence protocols to record inter-processor data dependencies, so that they can be replayed later on, in the same order.


While Rerun only optimizes recording, Karma optimizes both recording and replaying, thus making it suitable for online fault tolerance. It shows good scalability as well. The disadvantage of record/replay approaches, as compared to deterministic multithreading, is that they require a large memory for recording. Moreover, when used for fault tolerance, the redundant processes need to communicate with each other, as one replica records the log while the other reads from it.

Respec [6] is a record/replay software approach that only logs synchronization objects rather than every shared memory access. If divergence is found between the replicas, it rolls back and re-executes from a previous checkpoint. However, if divergence is found again on re-execution, a race condition is assumed. At that point, a stricter deterministic execution is performed, which can induce a large overhead.

[Figure 2: Kendo's method of acquiring locks for deterministic execution. Thread 1, trying to acquire a lock at logical clock 1029, must wait until Thread 2's clock (at 329) gets past 1029; only then can Thread 1 acquire the lock.]

III. OVERVIEW OF THE ARCHITECTURE

In this section, we discuss the architecture of DetLock and the application programming interface (API) that it provides to the programmer.

A. Architecture

We use Kendo's algorithm to perform deterministic execution. However, unlike Kendo, which requires deterministic hardware performance counters that are not available on many platforms, we insert code to update the logical clocks at compile time. This also means we do not need to modify the kernel, which Kendo requires in order to read from the performance counters. Figure 1 shows the point of compilation where the DetLock pass executes, which is before the LLVM IR (Intermediate Representation) code is translated to the final binary code by the LLVM backend.

The unit of our logical clock is one instruction. For instructions which take more than one clock cycle, the logical clock is updated according to the approximate number of clock cycles they take. However, to keep our discussion simple, in this paper one instruction equals one logical clock count for DetLock.

Kendo's method of acquiring locks deterministically is illustrated in Figure 2. In this figure, an example is given for a process with two threads. If Thread 1 is trying to acquire a lock when its logical clock is 1029, it will not be able to do so while Thread 2's clock is at 329, because that is less than 1029. But as soon as Thread 2's clock gets past 1029, Thread 1 will acquire the lock.

So basically our purpose is not only to reduce the code that updates the clocks, but also to update the clocks as soon as possible. In fact, at compile time it is possible to increment the clock even before instructions are executed. For example, if we know that a leaf function (a function with no function calls) executes a fixed number of instructions, we can increment the logical clock before executing any instruction of that function. So, for example, if Thread 2 in Figure 2 has a logical clock of 329 and is about to execute a leaf function with 701 instructions, we can add 701 right away to its logical clock, making it 1030 from 329. In this way, Thread 1, whose clock is at 1029, can acquire the lock without waiting for Thread 2 to actually have executed that number of instructions.

Therefore, in all the optimizations we apply, besides trying to reduce the clock update overhead, we also try to increment the clock as soon as possible. Without any optimization, we update the clock at the start of each basic block of the LLVM IR. If there is a function call inside a block, we split that block, such that each block either contains no function call or starts and ends with a function call. Then we update the clock at the top of each block if that block contains no function calls; otherwise we update the clocks in between the function calls. By splitting blocks in such a way, we can more easily apply the optimizations.

To illustrate the effect of our optimizations, we are going to show how the optimizations change the example function shown in Figure 3. This function is taken from the Radiosity benchmark of SPLASH-2 [13]. The clocks associated with each block are shown to the right of the assignment operators. A block in parallelogram shape implies that it contains one or more function calls.

B. Application Programming Interface

We provide our own functions for locks, barriers and thread creation for deterministic execution. They internally use the pthread library. However, it is not necessary for the programmer to modify the code to use them. A header file is provided by us that replaces the definitions of these functions with ours. The header file can be specified in the makefile, thus making it unnecessary to modify source code files. Moreover, the code to initialize the clock for the main thread is inserted by the compiler.
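To make the clock rule of Figure 2 concrete, the following is a minimal sketch of Kendo-style deterministic lock acquisition as described in Section III-A. All names (det_clock, self, nthreads, det_mutex_lock) are illustrative, busy-waiting stands in for the real wait mechanism, and memory-ordering details are omitted.

    #include <pthread.h>
    #include <stdint.h>

    #define MAX_THREADS 64

    /* Per-thread logical clocks, advanced by the compiler-inserted code. */
    volatile uint64_t det_clock[MAX_THREADS];
    extern int self, nthreads;       /* this thread's id and the thread count */

    /* A thread may complete a synchronization operation only when no other
     * thread has a smaller clock (ties broken by thread id, as in Kendo). */
    static int my_turn(void)
    {
        for (int t = 0; t < nthreads; t++) {
            if (t == self)
                continue;
            if (det_clock[t] < det_clock[self] ||
                (det_clock[t] == det_clock[self] && t < self))
                return 0;
        }
        return 1;
    }

    int det_mutex_lock(pthread_mutex_t *m)
    {
        while (!my_turn())
            ;                        /* the clock is paused while waiting */
        int r = pthread_mutex_lock(m);
        det_clock[self]++;           /* resume counting after the acquisition */
        return r;
    }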
Moreover, the code to initialize the clock for the main For example, if we know that a leaf function (a function thread is inserted by the compiler. with no function calls) executes fixed amount of instructions, It has to be noted that since our method depends upon we can increment the logical clock before executing any the compiler to insert clocks for deterministic execution, it instruction of that function. So for example, if Thread 2 in is not possible to increment the clocks in functions which are Figure 2 has logical clock of 329 and is about to execute implemented in a library (Since they have not been compiled

723 35

entry=8 1: function ISCLOCKABLE(out Int avg, ref Function f) 2: if hasLoops(f) or hasUnclockedFunctions(f) then for.cond=3 3: return false 4: end if for.body=3 5: clocks = getClocksOfAllPaths(f)

lor.lhs.false=2 6: avg = mean(clocks) 7: s = std(clocks)

lor.lhs.false8=2 8: r = range(clocks) 9: if r > (m / 2.5) or s > (m/5)then if.end=28 10: return false 11: end if if.end27.i=3 12: return true 13: end function if.else33.i=2 if.then29.i=2 if.then.i=1 14: function UPDATECLOCKABLEFUNCLIST if.else39.i=1 if.then35.i=1 if.else.i=3 15: modified = true 16: while modfied do _Z17intersection_typeP6_patchP6VertexP3RayPfff.exit=4 17: modified = false 18: for all f in Program do if.end21=1 19: if (not clockableList.find(f)) and isClockable(avg, f) then

lor.lhs.false23=FC 20: removeClockFromFunction(f) 21: clockableList.insert(f, avg)

split.lor.lhs.false23=1 22: modified = true 23: end if if.then28=1 for.inc=2 24: end for 25: end while if.then30=5 26: end function

return=1 Figure 4: Pseudocode for Optimization 1 (Function Clock- ing) Figure 3: Example function for discussing the optimizations made zero by our optimizations. In this paper, we highlight with our pass). This problem also exists for functions which such blocks with gray color. The optimizations are discussed are built in the compiler, as LLVM generates no code below. for them at IR level. For many built-in functions such as memset and math functions, we just keep an estimate of the A. Optimization 1 (Function Clocking) instructions they take and increment the clock accordingly. As discussed in Section III-A, the sooner the clocks For memset and other functions which depend upon the are updated, the better, and leaf functions with only one size parameter, we increment the clock considering the size basic block are perfect candidates for such an optimization. parameter. Since most built-in functions are simple, we can Clocks can be removed from such functions and instead be use an estimate for them. We provide a text file (instructions added to the basic blocks calling such functions. Besides estimate file) for such purpose, where these functions can functions with only one blocks, our method also considers be defined with the approximate number of instructions leaf functions with multiple blocks, given that there are no they take along with their dependency on input parameters. loops in such functions. If our pass sees that all possible However, this is not always possible for functions in shared paths taken by such a function do not differ by much, we libraries. One way is to ignore them and the other way calculate the mean value for all possible paths and use that is to add them in the instructions estimate file if possible mean value to update the clock. The criteria we have set (If the instructions count for them can be approximated is that the minimum and maximum clock difference of all satisfactorily). possible paths should not be more than the mean value Another concern are functions which internally use locks, divided by 2.5. Moreover the standard deviation between such as malloc. For such functions, we provide our own all the different paths should not be greater than one fifth of implementation which replaces the locks with our own the mean value. This is checked by calling the isClockable deterministic locks. function shown in Figure 4. We call such leaf functions as clocked functions. By IV. PERFORMANCE OPTIMIZATIONS intuition, we can judge that it is also possible to clock We apply several optimizations to reduce the clock updat- functions which call only clocked functions. In this way, ing overhead. Moreover, we try to increment clocks as soon we can even clock functions which are not necessarily leaf as possible so that waiting time for threads who are waiting functions. The UpdateClockableFunctList function shown in for other thread’s clocks to go past them is reduced. Clock Figure 4 shows how we do this. Our algorithm greedily updating code is removed from the blocks whose clocks are searches for all such functions. This is done by first checking

724 36 4.D ETERMINISTICMULTITHREADING

1: function UPDATEOPT2ACLOCKS(ref bool modified, ref BasicBlock bb) if.end21=1 When this function is called Entry block of a function is passed as bb 2: if visitedList.find(bb) then 3: return lor.lhs.false23=90 4: end if 5: visited.insert(bb) 6: modified = false 7: if meetsOpt2aCondNodeRequirements(bb) then if.then28=1 for.inc=2 8: if allSuccessorsHaveNonZeroClock(bb) then 9: modified = true Figure 5: Part of example function after applying Optimiza- 10: end if 11: min = minimumOfSucessors(bb) tion 1 (Function Clocking) 12: setClock(bb, GetClock(bb) + min) 13: subtractFromAllSuccessors(bb, min) for all the functions in the program to see if they can be 14: else 15: if meetsOpt2aMergeNodeRequirements(bb) then clocked and making them clocked functions if possible. If 16: pushClockUp(bb) at the end, we see that one or more functions were added to 17: end if the clocked functions list, which is signaled by the modified 18: end if 19: succList = getAllSuccessors(bb) flag, we iterate over all the functions once again to search for 20: for all succ in succList do more clockable functions. We keep on repeating this process 21: updateOpt2aClocks(modified, succ) until no more function is added to clockable functions’ list 22: end for 23: end function in an iteration. Part of example function after applying this optimization 24: function PUSHCLOCKUP(ref BasicBlock mergeBlock) 25: clock = getClock(mergeBlock) is shown in Figure 5. Originally, the block lor.lhs.false23 26: removeClock(mergeBlock) had a function call at the start, therefore it was split in such 27: predList = getAllPredecessors(mergeBlock) a way that lor.lhs.false23 contained the function call and 28: for all pred in predList do the remaining instructions in that block. 29: setClock(pred, GetClock(pred) + clock) split.lor.lhs.false23 30: if meetsOpt2aMergeNodeReq(pred) then However, this optimization notices that the function called 31: pushClockUp(pred) in lor.lhs.false23 is clockable, thus no splitting of the block 32: end if is done and the mean number of instructions from all paths 33: end for 34: end function of that function are added to the clock of lor.lhs.false23. Moreover, clocks from all the blocks of the called function 35: function APPLYOPT2A are removed. 36: for all f in Program do 37: modified = true 38: while modfied do B. Optimization 2 (For Conditional Blocks) 39: visitedList.clear() 40: updateOpt2aClocks(modified, f.entry()) This optimization deals with if-else and switch statements 41: end while 42: end for and consists of two parts, a and b. The part a is a precise 43: end function optimization, meaning that no estimation of clocks is used. They are just rearranged, so as to remove clocks from blocks Figure 6: Pseudocode for Optimization 2a if possible and incrementing the clock as soon as possible. On the other hand, part b is not necessarily precise, but we sure that no unclocked function call exists in the par- make sure that the clock does not diverge significantly after ent block and its successors. Moreover, it makes sure that pass. that the parent block is dominating the successors, that meet- 1) Part a: This optimization is based on the princi- is, the successors are not merge blocks. Similarly sOpt2aMergeNodeRequirements ple that if a block has two or more successors, we can also makes sure that none make the successor with the least clock zero and sub- of the blocks in consideration have unclocked function calls. 
tract its original value from all its siblings, while also It also makes sure that the merge block is not a loop header. adding its original clock to the parent block. Another It should be noted that after having parsed all the blocks principle of this optimization is that if all predecessors of a function and applying this optimization, if it is still of a merge block have that merge block as their only possible to apply this optimization once more to reduce successor, the clocks could be shifted from the merge clock updating code, it is applied. This is done by checking node to them. The pseudocode of this optimization is the modified flag. shown in Figure 6. The meetsOpt2aCondNodeRequirements The example function after applying one pass of this call on line 7 checks if a node meets the first principle, optimization is shown in Figure 7. The sequence of events while meetsOpt2aMergeNodeRequirements call on line 15 that will happen are given below (Refer to Figure 3 for checks for the second principle. Note that for the first original clock values). principle, meetsOpt2aCondNodeRequirements also makes is made 0 by , which itself becomes 29 • if.then.i if.end


• if.then.i is made 0 by if.end, which itself becomes 29 and makes if.end27 equal to 2.
• if.then.i reaches the merge node _Z17intersection_typeP6_patchP6... through if.else.i.
• Merge node _Z17intersection_typeP6_patchP6... becomes 0 while propagating its clock to all of its 4 predecessors, which are if.else39, if.then35.i, if.then29.i and if.else.i, whose values now become 5, 5, 6 and 7 respectively.
• if.end27 subtracts 2 from if.else33 and if.then29.i to make them 0 and 4 respectively, while itself becoming 4.
• if.else33.i takes the value of 5 from if.else39.i and if.then35.i after making them 0.

Note that after applying this one pass, further optimization is still possible, but after the second pass (shown in Figure 8), no further optimization is possible.

[Figure 7: Part of the example function after applying the first iteration of Optimization 2a: if.end=29, if.end27.i=4, if.else33.i=5, if.then29.i=4, if.then.i=0, if.else39.i=0, if.then35.i=0, if.else.i=7, _Z17intersection_typeP6_patchP6VertexP3RayPfff.exit=0.]

[Figure 8: Part of the example function after the second and final iteration of Optimization 2a: if.end=29, if.end27.i=8, if.else33.i=1, if.then29.i=0, if.then.i=0, if.else39.i=0, if.then35.i=0, if.else.i=7, _Z17intersection_typeP6_patchP6VertexP3RayPfff.exit=0.]

2) Part b: Part b of this optimization deals with if conditions, such as those made by the blocks if.end21, lor.lhs.false23 and if.then28 in Figure 10. The pseudocode for this optimization is shown in Figure 9. The variable swSucc in Figure 9 represents the block in the middle, which is lor.lhs.false23 in this example, while endSucc represents the merge node, which is if.then28 in this example. The meetsOpt2bRequirements function call at line 6 checks if a pattern like the one shown in Figure 10 is formed.

     1: function UPDATEOPT2BCLOCKS(ref BasicBlock bb)  ▷ The entry block of a function is passed as bb
     2:     if visitedList.find(bb) then
     3:         return
     4:     end if
     5:     visitedList.insert(bb)
     6:     swSucc, endSucc, meetsReq = meetsOpt2bRequirements(bb)
     7:     if meetsReq then
     8:         modifyClocks(bb, swSucc, endSucc)
     9:         updateOpt2bClocks(endSucc)
    10:         swSuccList = getAllSuccessors(swSucc)
    11:         for all succ in swSuccList do
    12:             if succ != endSucc then
    13:                 updateOpt2bClocks(succ)
    14:             end if
    15:         end for
    16:     else
    17:         succList = getAllSuccessors(bb)
    18:         for all succ in succList do
    19:             updateOpt2bClocks(succ)
    20:         end for
    21:     end if
    22: end function

    23: function APPLYOPT2B
    24:     for all f in Program do
    25:         visitedList.clear()
    26:         updateOpt2bClocks(f.entry())
    27:     end for
    28: end function

Figure 9: Pseudocode for Optimization 2b

If the block lor.lhs.false23 were not jumping to for.inc, that is, if it had no successor other than if.then28, we could have straight away removed the clock updating code from if.end21 and added its clock value to if.then28 to make it 2. That optimization, like part a, would have been precise. However, since lor.lhs.false23 has one more successor, our algorithm checks to see how much clock divergence we would get by removing the clock from if.end21. The criterion we keep is that if the divergence is less than one tenth, we proceed with the optimization. In this case, by removing the clock from if.end21, if we jumped to for.inc from lor.lhs.false23, the divergence would be 1/93, which is well below one tenth. Therefore, we proceed with it. The example function now becomes what is shown in Figure 10.

Note that this pass also determines whether the clock has to be removed from the upper block (if.end21 in this case) or the lower block (if.then28 in this case). We prefer to remove it from the lower block (and add it to the upper block) so that the clock is incremented ahead of time. However, in certain cases, we remove it from the upper block (and add it to the lower block). One such case is when the upper block is at a higher loop depth than the lower block. Removing the clock from the upper block is beneficial here, since it is in a more critical path and therefore we save clock updating overhead. Another case where we remove the clock from the upper block (and add it to the lower block) is when the lower block has a higher clock than the upper block and the middle block has


more than one successor. This is because shifting the clock to the upper block in this case would cause a larger divergence in the clock. In this example, since if.end21 is at a higher loop depth than if.then28, we remove the clock from if.end21 and add it to if.then28.

[Figure 10: Part of the example function after applying Optimization 2b: if.end21=0, lor.lhs.false23=90, if.then28=2.]

C. Optimization 3 (Averaging of Clocks)

This optimization is based on the fact that the paths emanating from a block in a function can match closely in total clock values. One can imagine Optimization 1 (Function Clocking) as a specialized case of it: for Function Clocking, we just considered the paths emanating from the entry block, but here we also check paths starting from other blocks. When forming paths for a block, we only consider blocks dominated by it (execution must pass through the dominating block to reach its dominated blocks).

The pseudocode for this optimization is shown in Figure 11. When finding paths for a block, we stop when we see backedges or when we see blocks with unclocked function calls. Moreover, we stop at a merge node if any of its successors is not dominated by the block in question. Like Optimizations 2a and 2b, we start to search for this optimization from the entry block of a function. If we find a block whose paths can be averaged, then after removing the clocks from the blocks in its paths, we start to look for other blocks in the function. For this we consider the successors of the blocks in the paths (from which the clocks were removed), given that those successors are not within the paths themselves. This is done by the code from lines 13 to 16 in Figure 11.

     1: function UPDATEOPT3CLOCKS(ref BasicBlock bb)  ▷ The entry block of a function is passed as bb
     2:     if visitedList.find(bb) then
     3:         return
     4:     end if
     5:     visitedList.insert(bb)
     6:     if meetsOpt3Requirements(bb) then
     7:         clocks, touchedBlocksList = getClocksOfAllOpt3Paths(bb)
     8:         if isClockable(avg, clocks) then
     9:             setClock(bb, avg)
    10:             for all tb in touchedBlocksList do
    11:                 removeClock(tb)
    12:             end for
    13:             tbSuccList = getAllSuccessorsOfTB(touchedBlocksList)
    14:             for all succ in tbSuccList do
    15:                 updateOpt3Clocks(succ)
    16:             end for
    17:             return
    18:         end if
    19:     end if
    20:     succList = getAllSuccessors(bb)
    21:     for all succ in succList do
    22:         updateOpt3Clocks(succ)
    23:     end for
    24: end function

    25: function APPLYOPT3
    26:     for all f in Program do
    27:         visitedList.clear()
    28:         updateOpt3Clocks(f.entry())
    29:     end for
    30: end function

Figure 11: Pseudocode for Optimization 3

The example function after applying this optimization is shown in Figure 12. In this figure, the accumulated clocks of the four different paths emanating from if.end were 37, 38, 38 and 29, with a mean value of 35.5 and a standard deviation of 4.36. Since the range here is 8 (37 minus 29), which is less than mean/2.5, and the standard deviation of 4.36 is less than mean/5, we assign a clock of 35 to if.end while removing the clocks from all the blocks in the paths. Note that we did not consider nodes below the merge node _Z17intersection_typeP6_patchP6... because it has for.inc as a successor, which is not dominated by if.end.

[Figure 12: Part of the example function after applying Optimization 3: if.end=35, if.end27.i=0, if.else33.i=0, if.then29.i=0, if.then.i=0, if.else39.i=0, if.then35.i=0, if.else.i=0, _Z17intersection_typeP6_patchP6VertexP3RayPfff.exit=0.]

D. Optimization 4 (Loops)

This optimization considers the fact that loops are often executed multiple times. So, for example, if you have a for loop, the increment operation will take place just before the next iteration. Therefore, we check for backedges, and if we see that the clock of the block from which the backedge originates is less than a certain threshold value and is also less than the clock of the block it is jumping to, we merge its clock value into that block's clock and remove the clock updating code from it. In this example, the clock of for.inc is merged with for.cond.

Figure 13 shows the example function after applying this final optimization.
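As a small illustration of the loop optimization, the comments below show where the inserted updates would sit before and after the pass; the counts are again made up for the example.

    void scale(int *a, int n)
    {
        /* Before Optimization 4, the latch block (the i++ and branch back)
         * carried its own small update, say det_clock[self] += 2, executed
         * on every iteration.  Since 2 is below the threshold and below the
         * header's clock, the pass merges it into the header's update: */
        for (int i = 0;
             i < n;       /* header: det_clock[self] += 5 + 2 (merged latch) */
             i++)         /* latch: its clock update is removed */
        {
            a[i] *= 2;    /* body: det_clock[self] += 3 */
        }
    }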
In this example, the clock of for.inc is merged entry block. When forming paths for a block, we only con- with for.cond. sider blocks dominated by it (execution must pass through Figure 13 shows the example function after applying this the dominating block to reach its dominated blocks). final optimization.

727 39

Table I: Performance results of our scheme for the selected benchmarks

Benchmark                                      | Ocean     | Raytrace  | Water-nsq  | Radiosity | Volrend   | Average
Original Exec Time (ms)                        | 2903      | 670       | 1451       | 496       | 1340      | -
Locks/sec                                      | 343       | 227835    | 126034     | 2211621   | 443070    | -
Clockable Functions                            | 1         | 33        | 1          | 39        | 35        | -

After Inserting Clocks
With No Optimization                           | 2918 (1%) | 718 (7%)  | 2082 (43%) | 698 (41%) | 1446 (8%) | 20%
With Function Clocking Only (O1)               | 2901 (0%) | 706 (5%)  | 2072 (43%) | 644 (30%) | 1445 (8%) | 17%
With Conditional Blocks Optimization Only (O2) | 2889 (0%) | 715 (7%)  | 1779 (23%) | 643 (30%) | 1392 (4%) | 13%
With Averaging of Clocks Only (O3)             | 2898 (0%) | 702 (5%)  | 2072 (43%) | 675 (36%) | 1445 (8%) | 18%
With Loops Optimization Only (O4)              | 2903 (0%) | 707 (6%)  | 1752 (21%) | 677 (36%) | 1442 (8%) | 14%
With All Optimizations                         | 2895 (0%) | 695 (4%)  | 1748 (20%) | 562 (13%) | 1386 (3%) | 8%

After Inserting Clocks and Performing Deterministic Execution
With No Optimization                           | 2918 (1%) | 768 (15%) | 2096 (44%) | 855 (72%) | 1451 (8%) | 28%
With Function Clocking Only (O1)               | 2924 (1%) | 758 (13%) | 2085 (44%) | 711 (43%) | 1448 (8%) | 22%
With Conditional Blocks Optimization Only (O2) | 2918 (1%) | 766 (14%) | 1785 (23%) | 788 (57%) | 1398 (4%) | 20%
With Averaging of Clocks Only (O3)             | 2916 (0%) | 742 (11%) | 2090 (44%) | 807 (63%) | 1450 (8%) | 25%
With Loops Optimization Only (O4)              | 2904 (0%) | 760 (13%) | 1761 (21%) | 837 (69%) | 1448 (8%) | 22%
With All Optimizations                         | 2915 (0%) | 742 (11%) | 1758 (21%) | 683 (38%) | 1395 (4%) | 15%

(Execution times in ms; the percentages in parentheses are overheads relative to the original execution time.)

[Figure 13: Example function after applying all optimizations: entry=8, for.cond=5, for.body=3, lor.lhs.false=2, lor.lhs.false8=2, if.end=35, if.end27.i=0, if.else33.i=0, if.then29.i=0, if.then.i=0, if.else39.i=0, if.then35.i=0, if.else.i=0, _Z17intersection_typeP6_patchP6VertexP3RayPfff.exit=0, if.end21=0, lor.lhs.false23=90, if.then28=2, for.inc=0, if.then30=5, return=1.]

V. PERFORMANCE EVALUATION

We selected only those benchmarks from SPLASH-2 [13] which have only locks and barriers as synchronization operations, as we have not yet implemented other synchronization operations, such as condition variables. All benchmarks were run on a 2.66 GHz quad core machine and compiled with maximum optimization enabled (level -O4 for clang/llvm). We first discuss the results. Afterwards, we show how clocking instructions ahead of time improves the deterministic execution. Lastly, we compare our results with those from Kendo.

A. Results

Table I shows the performance overheads with the different optimizations, and Figure 14 gives a pictorial view of that overhead. The left bars in Figure 14 show the performance overhead without applying optimizations, while the bars on the right show the overhead after applying all the optimizations. The lower portion of each bar is the overhead of the inserted clock updating code only, while the upper portion shows the additional overhead for deterministic execution.

[Figure 14: Overhead of inserting clocks and deterministic execution, per benchmark]

From Table I, we can see that different optimizations affect different benchmarks differently. For example, Optimization 4 (Loops Optimization) has a significant impact on the performance of Water-nsq while not having that much effect on the other benchmarks. This is because Water-nsq frequently executes a loop with a small body. The optimization that had the most impact on performance overall is Optimization 2 (Conditional Blocks Optimization). This is because conditional paths are frequently found in programs, and this optimization efficiently reduces the clock updates for such paths. Optimization 3 had the least impact. This is because it is unlikely for a program to have all the paths originating from a node match in clock values.


As far as Optimization 1 (Function Clocking) goes, Radiosity is one benchmark where this optimization significantly improved the performance. This is because this benchmark has functions which are compute intensive and execute frequently. One interesting result of this optimization is that it significantly reduces the overhead of deterministic execution in addition to the reduction in clock updating overhead. This is discussed in more detail in the next section. Overall, we see that the average overhead of inserting clocks is at 8%, whereas the average overhead for deterministic execution is at 15%, with the overhead not exceeding 38% even for Radiosity, which has a very high lock frequency.

B. Effect of Updating Clocks Ahead of Time

There is a great benefit in updating the clock as soon as possible, so that threads waiting at a lock acquisition have to wait less. This effect is more evident for a benchmark like Radiosity, which has a high lock frequency. From Table I, we can see that for Radiosity, although Optimization 2 (Conditional Blocks Optimization) reduces the clock overhead by the same amount as Optimization 1 (Function Clocking), Optimization 1 adds far less additional overhead for deterministic execution, at 13% (43% - 30%), as compared to 27% (57% - 30%) in the case of Optimization 2. This is because Optimization 1 is able to increment the clock more aggressively ahead of time, as it works on whole functions.

Figure 15 illustrates the effect of updating clocks ahead of time for Radiosity. Since the Function Clocking optimization, where possible, increments the clock ahead of time the most (more than the other optimizations), we only consider the result of Optimization 1 (Function Clocking) here. The leftmost bar is without any optimization, the middle one is with the Function Clocking optimization but with clocks updated at the end of the basic blocks, whereas the rightmost bar is the same optimization but with clocks updated at the beginning of the basic blocks. From the figure, by looking at the upper portion of the bars, which represents the additional overhead of deterministic execution, we can see that updating clocks at the start of the block improves deterministic execution significantly as compared to updating them at the end.

Figure 15: Improvement of the Radiosity benchmark from updating clocks ahead of time (bar chart, Overhead (%): without optimization; with Opt 1 and clocks at end of blocks; with Opt 1 and clocks at start of blocks; lower portion: overhead of clocks inserted, upper portion: additional overhead with deterministic execution)

C. Comparison with Kendo

In Table II, we compare our results with those of Kendo. Note that the purpose of our scheme is not to surpass Kendo in performance but to make it more portable while retaining sufficient efficiency. Since neither the data sets used by Kendo nor its source code are publicly available, we list the results directly from their paper. We tried to use data sets which match the locks/sec frequency of those used by Kendo. For Radiosity and Volrend, however, we could not find matching data sets and instead used data sets with higher lock frequencies than Kendo.

The only benchmark which performs worse than Kendo is Water-nsq. This is because Water-nsq executes a small for loop very frequently. The code inside that for loop contains an if statement. Although Optimization 2 (Conditional Blocks Optimization) and Optimization 4 (Loops Optimization) work to reduce the overhead of clock updates in that loop, it still updates clocks frequently enough in that loop to have a relatively high overhead.

For Radiosity, which has a very high lock frequency, our scheme surpasses Kendo in performance. This is even when we used a data set which has a higher lock frequency than what Kendo used. This improvement in performance over Kendo can be explained by the fact that at compile time, we are able to update clocks before instructions are executed and thus reduce the waiting time for a benchmark like Radiosity, which has a high lock frequency. On the other hand, Kendo only updates the logical clocks when it receives overflow interrupts from the hardware performance counter that counts retired stores. Therefore, it cannot perform clock updates ahead of time. It also has to balance the chunk size of instructions executed between each interrupt, so as to reduce the impact of frequent interrupts while also maintaining frequent enough interrupts to keep the clocks incrementing. For Radiosity, the authors of Kendo had to manually adjust the chunk size to get the best performance, which is the one listed in Table II. Our scheme requires no such manual adjustments.

We even show slight improvement over Kendo for benchmarks which do not have very high lock frequencies, such as Raytrace and Volrend. This improvement can be explained by the fact that our scheme updates the clock more frequently than Kendo. Although in the case of Kendo there may be less overhead of updating the clocks, threads which are in the process of acquiring a lock, and thus waiting for other threads' clocks to go past them, have to wait longer due to the slow update of the clocks. Moreover, our optimizations prefer to update the clock even before instructions are executed. So even when we are updating the clocks less frequently, it is not because we are delaying their update, but because we (most of the time) already updated them ahead of time.


Table II: Performance results of our scheme as compared to Kendo

Benchmark                Ocean    Raytrace   Water-nsq   Radiosity   Volrend
Results for Kendo
  Locks/sec              279      216979     143202      939771      79612
  Overhead               1%       18%        7%          53%         7%
Results for our scheme
  Locks/sec              343      227835     126034      2211621     443070
  Overhead               0%       11%        21%         38%         4%

VI. CONCLUSION

In this paper, we described our tool DetLock, which consists of an LLVM compiler pass to insert code for updating logical clocks for Weak Deterministic execution. Since our scheme does not depend on any hardware or modification of the kernel, it is very portable. Moreover, we apply several optimizations to reduce the amount of code inserted for clock updating. Furthermore, since the algorithm for Weak Determinism that we use gives the lock to the thread with the minimum logical clock, we try to increment the clocks of threads as soon as possible so that threads waiting for locks have to wait less. We increment the clocks even before instructions are executed if possible. On average, the overhead of inserting clock updating code is only 8%, whereas the overall overhead including deterministic execution is 15% for the selected benchmarks. This performance is comparable to Kendo, while providing more portability. In fact, for some applications, DetLock can even surpass Kendo in performance.

ACKNOWLEDGMENT

This research has been funded by the projects Smecy 100230, iFEST 100203 and REFLECT 248976.

REFERENCES

[1] http://dpj.cs.uiuc.edu.
[2] T. Bergan, O. Anderson, J. Devietti, L. Ceze, and D. Grossman. CoreDet: a compiler and runtime system for deterministic multithreaded execution. SIGARCH Comput. Archit. News, 38:53-64, March 2010.
[3] S. A. Edwards and O. Tardieu. SHIM: a deterministic model for heterogeneous embedded systems. In Proceedings of the 5th ACM International Conference on Embedded Software, EMSOFT '05, pages 264-272, New York, NY, USA, 2005. ACM.
[4] D. Hower, P. Dudnik, M. Hill, and D. Wood. Calvin: Deterministic or not? Free will to choose. In High Performance Computer Architecture (HPCA), 2011 IEEE 17th International Symposium on, pages 333-334, Feb. 2011.
[5] C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In Proceedings of the 2004 International Symposium on Code Generation and Optimization (CGO '04), Palo Alto, California, Mar. 2004.
[6] D. Lee, B. Wester, K. Veeraraghavan, S. Narayanasamy, P. M. Chen, and J. Flinn. Respec: Efficient online multiprocessor replay via speculation and external determinism, 2010.
[7] H. Mushtaq, Z. Al-Ars, and K. Bertels. Survey of fault tolerance techniques for shared memory multicore/multiprocessor systems. In Design and Test Workshop (IDT), 2011 IEEE 6th International, pages 12-17, Dec. 2011.
[8] N. Nethercote and J. Seward. Valgrind: A framework for heavyweight dynamic binary instrumentation. In Proceedings of the 2007 Conference on Programming Language Design and Implementation, 2007.
[9] M. Olszewski, J. Ansel, and S. Amarasinghe. Kendo: efficient deterministic multithreading in software. SIGPLAN Not., 44:97-108, March 2009.
[10] W. Thies, M. Karczmarek, and S. P. Amarasinghe. StreamIt: A language for streaming applications. In Proceedings of the 11th International Conference on Compiler Construction, CC '02, pages 179-196, London, UK, 2002. Springer-Verlag.
[11] T. Liu, C. Curtsinger, and E. D. Berger. Dthreads: Efficient deterministic multithreading. In SOSP '11, Oct. 2011.
[12] V. Weaver and J. Dongarra. Can hardware performance counters produce expected, deterministic results? In The 3rd Workshop on Functionality of Hardware Performance Monitoring, FHPM '10, Atlanta, Georgia, 2010.
[13] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 programs: characterization and methodological considerations. SIGARCH Comput. Archit. News, 23:24-36, May 1995.
[14] J. Devietti, B. Lucia, L. Ceze, and M. Oskin. DMP: deterministic shared memory multiprocessing. In Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '09, pages 85-96, New York, NY, USA, 2009. ACM.
[15] A. Basu, J. Bobba, and M. D. Hill. Karma: scalable deterministic record-replay. In Proceedings of the International Conference on Supercomputing, ICS '11, pages 359-368, New York, NY, USA, 2011. ACM.
[16] D. R. Hower and M. D. Hill. Rerun: Exploiting episodes for lightweight memory race recording. In Proceedings of the 35th Annual International Symposium on Computer Architecture, ISCA '08, pages 265-276, Washington, DC, USA, 2008. IEEE Computer Society.


Computing (2014) 96:1131–1147 DOI 10.1007/s00607-013-0370-9

Efficient and highly portable deterministic multithreading (DetLock)

Hamid Mushtaq · Zaid Al-Ars · Koen Bertels

Received: 22 March 2013 / Accepted: 5 November 2013 / Published online: 19 November 2013 © Springer-Verlag Wien 2013

Abstract In this paper, we present DetLock, a runtime system to ensure deterministic execution of multithreaded programs running on multicore systems. DetLock does not rely on any hardware support or kernel modification to ensure determinism. For tracking the progress of the threads, logical clocks are used. Unlike previous approaches, which rely on non-portable hardware to update the logical clocks, DetLock employs a compiler pass to insert code for updating these clocks, thus increasing portability. For 4 cores, the average overhead of these clocks on the tested benchmarks is brought down from 16 to 2 % by applying several optimizations. Moreover, the average overall overhead, including deterministic execution, is 14 %.

Keywords Multicore · Multithreading · Determinism

Mathematics Subject Classification 68N01 · 68W10 · 68N20

1 Introduction

Single threaded programs are much easier to test, debug and maintain than their multithreaded counterparts. This is because the only source of non-determinism in them is

This paper extends DetLock, which was originally published in MuCoCoS 2012, affiliated with SC12. In this journal paper, we further improve the performance of DetLock by applying several more optimizations. Furthermore, we evaluate the performance with several more benchmarks.

H. Mushtaq (B) · Z. Al-Ars · K. Bertels
TU Delft, Delft, Netherlands
e-mail: [email protected]
Z. Al-Ars
e-mail: [email protected]
K. Bertels
e-mail: [email protected]

interrupts or signals, which are rare. On the other hand, multithreaded programs have a more frequent source of non-determinism in the form of shared memory accesses. Due to this, multithreaded programs suffer from repeatability problems, which means that running the same program with the same input can result in different outputs. This repeatability problem makes multithreaded programs hard to test and debug. Furthermore, it is also difficult to build fault tolerant versions of these programs. This is because fault tolerant systems usually depend upon replicas (identical copies of redundant processes) to detect errors.

If access to shared data is not protected by synchronization objects in a multithreaded program, we can have race conditions, which may produce unexpected results. Running a program with race conditions deterministically does not avoid the problem of having unexpected results, but just makes sure that the same results can be replicated. The ideal situation would be to make a multithreaded program deterministic even in the presence of race conditions. This is not possible to do efficiently with software alone though. One can use a relaxed memory model where every thread writes to its own private memory, while data to shared memory is committed only at intervals. However, stopping threads regularly for committing to shared memory degrades performance, as demonstrated by CoreDet [2], which has a maximum overhead of 11x for 8 cores. We can reduce the amount of committing to the shared memory by only committing at synchronization points such as locks, barriers or thread creation. This approach is taken by DTHREADS [15]. Here one can still imagine the slowdown in case of applications with high lock frequencies. Moreover, since in this case committing to the shared memory is done less frequently, more data has to be committed, thus also making it slow for applications with high memory usage. This is why hardware approaches have been proposed to increase the efficiency of deterministic execution. Two such approaches are Calvin [7] and DMP [4]. They use the same concept as CoreDet for deterministic execution but make use of special hardware for that purpose.

Since performing deterministic execution in software alone is inefficient, we can relax the requirements to improve efficiency. For example, Kendo [13] does this by only supporting deterministic execution for well written programs that protect every shared memory access through locks. In other words, it supports deterministic execution only for programs without race conditions. The authors of Kendo call it Weak Determinism. Considering the fact that most well written programs are race free and there exist tools to detect race conditions, such as Valgrind [12], Weak Determinism is sufficient for most well written multithreaded programs. Therefore, DetLock also only supports Weak Determinism.

The basic idea of Kendo is that it uses logical clocks for each thread to determine when a thread will acquire a lock. The thread with the least value of logical clock gets the lock. Though being quite efficient, Kendo still suffers from portability problems. First of all, it requires deterministic hardware performance counters for counting logical clocks. Many popular platforms (including many x86 platforms) do not have any hardware performance counter that is deterministic [16]. Secondly, Kendo needs modification of the kernel to allow reading from the hardware performance counters for deterministic execution.



To overcome the portability issues faced by Kendo, our tool DetLock uses a completely software-based approach to updating the logical clocks. The code for updating the clocks is inserted through an LLVM [8] compiler pass. Since LLVM is a popular open source compiler framework available on many platforms, our approach is portable across a wide range of platforms. Moreover, it requires no modification of the kernel. We can sum up the contribution of this paper as follows.
- A portable mechanism to update logical clocks for Weak Deterministic execution that depends upon the compiler rather than using hardware performance counters, since many platforms have no such deterministic counters available.
- A user-space approach to update the logical clocks that does not require modifying the kernel.
- A number of optimization techniques to reduce the overhead of the code used to update the logical clock and improve the performance of deterministic execution.
This paper is an extension of our previous work on this topic [11]. In this paper, we apply several more optimizations to improve the performance. This paper is organized as follows. In Sect. 2, we discuss the background and related work, while in Sect. 3, we give an overview of DetLock's architecture. This is followed by Sect. 4, where we present the optimization methods used to improve the performance of DetLock. In Sect. 5, we evaluate the performance of our scheme, and we finally conclude the paper with Sect. 6.

2 Background and related work

In this section, first we will discuss the state of the art for deterministic execution and then discuss our contribution.

2.1 State of the art

Single threaded programs are mostly deterministic in behavior. We say mostly because interrupts and signals can introduce non-determinism even in single threaded programs. However, these non-deterministic events are rare. On the other hand, in multithreaded programs running on multicore processors, shared memory accesses are a frequent source of non-determinism. One way to ensure determinism of multithreaded programs is to write code for them in a deterministic parallel language. Examples of such languages are StreamIt [14] and SHIM [5]. The disadvantage of this approach is that porting programs written in traditional languages to deterministic languages is difficult, as the learning curve is high for programmers used to programming in traditional languages. Moreover, in languages which are based on the Kahn Process Network model, such as SHIM, it is difficult to write programs without introducing deadlocks [10].

Deterministic execution at runtime can be done either through hardware or software. Calvin [7] is a hardware approach that executes instructions in the form of chunks and later commits them at barrier points. It uses a relaxed memory model, where instructions are committed in such a way that only the total store order (TSO) of the

program has to be maintained. DMP [4] uses a similar relaxed memory approach. The disadvantage of hardware approaches is that they are restricted to the platforms they were developed for.

Besides hardware methods, software-only methods for deterministic execution also exist. One such method is CoreDet [2], which uses bulk synchronous quanta along with store buffers and a relaxed memory model to achieve determinism. Therefore, it is similar to Calvin, but implemented in software. Logical clocks are used for deterministic execution. Since CoreDet is implemented in software, it has a very high overhead, possibly up to 11x for 8 cores, as compared to the maximum of 2x for Calvin. Another similar approach is DTHREADS [15]. It runs threads as separate processes, so that modified memory can be tracked through the memory management unit. Only at synchronization points, such as locks, barriers and thread creation, does it update the shared memory from the local memories of the threads. Therefore, it avoids the overhead of using bulk synchronous quanta like CoreDet and also does not need to maintain logical clocks like CoreDet. However, the overhead for programs with high lock frequency or large memory usage is still very high.

Since performing deterministic execution in software alone is inefficient, Kendo [13] relaxes the requirements by only working for programs without race conditions (Weak Determinism). It does not use any hardware besides the deterministic hardware performance counters found in some processors. It executes threads deterministically and performs load balancing by only allowing a thread to complete a synchronization operation when its clock becomes less than those of the other threads, with ties broken by thread IDs. The clock is calculated from retired stores; it is paused when waiting for a lock and resumed after the lock is acquired. Kendo still suffers from portability problems as it requires hardware performance counters which are deterministic. Many platforms, including many x86 platforms, do not have any deterministic hardware performance counter [16]. Moreover, Kendo requires modification of the kernel to read from such hardware performance counters. A technique related to deterministic multithreading is record/replay. Examples of systems using this technique are Rerun [6], Karma [1] and Respec [9].

2.2 Our contribution

As discussed in the previous section, we already have tools such as Kendo to execute multithreaded programs deterministically on multicore platforms. However, one main bottleneck of using Kendo is that it requires deterministic hardware performance counters, which are not available on many platforms. For the evaluation of their tool, the authors of Kendo had to specifically use the Core 2 processor, which had deterministic retired stores counters available on it. As we can see from Fig. 1, which shows the retired stores difference compared to the expected value, none of the listed processors besides the Core 2 has a deterministic retired stores counter. Moreover, Kendo also requires modification of the kernel to access these performance counters.

Now imagine a scenario where a company wants to reduce the cost of testing for its software, as well as ease its maintainability, by making it deterministic. If they go for the Kendo technique, it would make their software non-portable, as it would be unable to run on processors which do not have any deterministic hardware performance


Fig. 1 Determinism of retired stores performance counter of various processors [16]

counter. Moreover, you can expect users not to want to modify the kernels of their operating systems. This is where our technique is useful, as it allows a program to run on every machine, without requiring modification of the kernel. This work is an extension of our previous work [11] on this topic. By applying new optimizations, we were able to further reduce the overhead of the clock updating code inserted by our compiler pass, and improve the performance of deterministic execution.

Fig. 2 DetLock modifies the LLVM IR code by inserting code for updating logical clocks

3 Overview of the architecture

In this section, we discuss the architecture of DetLock and the application programming interface (API) that it provides to the programmer.

3.1 Architecture

We use Kendo's algorithm to perform deterministic execution. However, unlike Kendo, which requires deterministic hardware performance counters that are not available on many platforms, we insert code to update logical clocks at compile time. This also means that we do not need to modify the kernel, which is required by Kendo to read from performance counters. Figure 2 shows the point of compilation where the DetLock pass executes, which is before the LLVM IR (intermediate representation) code is translated to the final binary code by the LLVM backend.

The unit of our logical clock is one instruction. For instructions which take more than one clock cycle, the logical clock is updated according to the approximate number of clock cycles they take. However, to keep our discussion simple, in this paper one instruction equals one logical clock count for DetLock. Kendo's method of acquiring locks deterministically is illustrated in Fig. 3. In this figure, an example is given for a process with two threads. If Thread 1 is trying to acquire a lock when its logical clock is 1,029, it will not be able to do so if Thread


Fig. 3 Kendo’s method of acquiring locks for deterministic execution

2's clock is at 329, because of it being less than 1,029. But as soon as Thread 2's clock gets past 1,029, Thread 1 will acquire the lock. So basically our purpose is not only to reduce the code that updates the clocks but also to update the clocks as soon as possible. In fact, at compile time it is possible to increment the clock even before instructions are executed. For example, if we know that a leaf function (a function with no function calls) executes a fixed number of instructions, we can increment the logical clock before executing any instruction of that function. Therefore, in all optimizations we apply, besides trying to reduce the clock update overhead, we also try to increment the clock as soon as possible. Without any optimization, we update the clock at the start of each basic block of the LLVM IR. If there is a function call inside a block, we split that block, such that each block either contains no function call or starts and ends with a function call. Then we update the clock at the top of each block if that block contains no function calls, otherwise we update the clocks in between the function calls. By splitting blocks in such a way, we can more easily apply optimizations.
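To make the acquisition rule of Fig. 3 concrete, below is a minimal sketch of a deterministic lock in the spirit of Kendo's algorithm. The names (det_lock, g_clock, hasGlobalMinClock) are illustrative and not taken from the DetLock or Kendo sources, and details such as deterministically incrementing the clock after acquisition are omitted.

#include <atomic>
#include <pthread.h>

constexpr int kMaxThreads = 64;
std::atomic<unsigned long> g_clock[kMaxThreads];   // per-thread logical clocks

// True if thread `me` currently holds the minimum clock, ties broken by ID.
static bool hasGlobalMinClock(int me, int nthreads) {
    unsigned long mine = g_clock[me].load();
    for (int t = 0; t < nthreads; ++t) {
        if (t == me) continue;
        unsigned long c = g_clock[t].load();
        if (c < mine || (c == mine && t < me))
            return false;
    }
    return true;
}

// Deterministic lock acquisition: wait until our logical clock is the global
// minimum, then take the lock. While spinning, our own clock does not advance,
// since only instrumented application code updates it.
void det_lock(pthread_mutex_t* m, int me, int nthreads) {
    while (!hasGlobalMinClock(me, nthreads)) {
        // busy-wait; the other threads' clocks advance past ours
    }
    pthread_mutex_lock(m);
}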

3.2 Application programming interface

We provide our own functions for locks, barriers and thread creation for deterministic execution. They internally use the pthread library. However, it is not necessary for the programmer to modify the code to use them. A header file is provided by us that replaces the definition of these functions with ours. The header file can be specified in the makefile, thus making it unnecessary to modify source code files. Moreover, the code to initialize the clock for the main thread is inserted by the compiler.

It has to be noted that since our method depends upon the compiler to insert clocks for deterministic execution, it is not possible to increment the clocks in functions which are implemented in a library (since they have not been compiled with our pass). This

problem also exists for functions which are built into the compiler, as LLVM generates no code for them at the IR level. For many built-in functions, such as memset and math functions, we just keep an estimate of the instructions they take and increment the clock accordingly. For memset and other functions which depend upon the size parameter, we increment the clock considering the size parameter. Since most built-in functions are simple, we can use an estimate for them. We provide a text file (the instructions estimate file) for this purpose, where these functions can be defined with the approximate number of instructions they take along with their dependency on input parameters. However, this is not always possible for functions in shared libraries. One way is to ignore them and the other way is to add them to the instructions estimate file if possible (if the instruction count for them can be approximated satisfactorily). Another concern is functions which internally use locks, such as malloc. For such functions, we provide our own implementation which replaces the locks with our own deterministic locks.
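As an illustration of how a size-dependent entry from the instructions estimate file could be applied, the sketch below charges a base cost plus a per-byte cost for memset. The wrapper name, the cost constants and the thread-local clock variable are all assumptions made for the sake of the example.

#include <cstddef>
#include <cstring>

extern thread_local unsigned long det_clock;   // this thread's logical clock

// Hypothetical instrumented built-in: the clock is charged in proportion to
// the size parameter, ahead of the actual call, mirroring an estimate-file
// entry of the form "memset: 4 + 1 * size".
void* det_memset(void* dst, int value, size_t n) {
    const unsigned long kBase = 4;      // assumed fixed setup cost
    const unsigned long kPerByte = 1;   // assumed cost per byte written
    det_clock += kBase + kPerByte * n;
    return memset(dst, value, n);
}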

4 Performance optimizations

We apply several optimizations to reduce the clock updating overhead. Moreover, we try to increment clocks as soon as possible so that the waiting time for threads that are waiting for other threads' clocks to go past them is reduced. Clock updating code is removed from the blocks whose clocks are made zero by our optimizations. In this paper, we highlight such blocks in gray. To illustrate the effect of our optimizations, we are going to show how the optimizations change example functions. The clocks associated with each block are shown to the right of the assignment operators. Moreover, a block in parallelogram shape implies that it contains one or more function calls. The optimizations are discussed below.

4.1 Optimization 1 (function clocking)

As discussed in Sect. 3.1, the sooner the clocks are updated, the better, and leaf functions with only one basic block are perfect candidates for such an optimization. Clocks can be removed from such functions and instead be added to the basic blocks calling such functions. Besides functions with only one block, our method also considers leaf functions with multiple blocks, given that there are no loops in such functions. If our pass sees that all possible paths taken by such a function do not differ by much, we calculate the mean value over all possible paths and use that mean value to update the clock. The criteria we have set is that the minimum and maximum clock difference of all possible paths should not be more than the mean value divided by 2.5. Moreover, the standard deviation between all the different paths should not be greater than one fifth of the mean value. We call such leaf functions clocked functions. By intuition, we can judge that it is also possible to clock functions which call only clocked functions. In this way, we can even clock functions which are not necessarily leaf functions. More detail on this optimization can be found in [11].

Previously, in [11], we did not consider the possibility of clocked functions being called indirectly through function pointers. In that case, since we remove the clocks


of clocked functions, the clock does not get updated. To correct this problem, we create clones of the clocked functions. The clock updating code is removed from the cloned clocked functions but remains in the original clocked functions. Wherever our pass finds a direct call of a clocked function, it replaces it with a call to the cloned version of that clocked function. However, since indirect calls still call the original clocked function, the clock does get updated properly even with indirect calls.

Fig. 4 Example function before and after applying optimization 2b
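In outline, the result of this cloning can be pictured as below. This is a hand-written illustration of the transformation, not actual pass output; the function names and clock values are invented.

extern thread_local unsigned long det_clock;

// Original clocked leaf function: it keeps its own clock update so that
// indirect calls through function pointers still advance the clock.
int absval(int x) {
    det_clock += 3;
    return x < 0 ? -x : x;
}

// Clone used for direct calls: no clock update inside; the calling block
// absorbs the cost (+3) into its own update, ahead of the call.
int absval_clone(int x) {
    return x < 0 ? -x : x;
}

void caller(int (*fp)(int), int v) {
    det_clock += 5 + 3;   // the block's own cost plus the clocked function's
    absval_clone(v);      // direct call rewritten to target the clone
    fp(v);                // indirect call: the original updates the clock itself
}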

4.2 Optimization 2 (conditional blocks)

This optimization deals with if-else and switch statements and consists of four parts, which are described below.

4.2.1 Opt 2a: pushing clocks upwards

This optimization is based on the principle that if a block has two or more successors, we can make the successor with the least clock zero and subtract its original value from all its siblings, while also adding its original clock to the parent block. Another principle of this optimization is that if all predecessors of a merge block have that merge block as their only successor, the clocks could be shifted from the merge node to them. More details about this optimization can be found in [11].
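A small sketch of the first rule on assumed clock values: hoisting the minimum successor clock into the parent zeroes the cheapest successor and shrinks its siblings. This is an illustration of the idea, not pass code.

#include <algorithm>
#include <vector>

struct Block { unsigned long clock; std::vector<Block*> succs; };

// Sketch of opt 2a's first rule: move the minimum successor clock into the
// parent and subtract it from every successor.
void pushClocksUpwards(Block& parent) {
    if (parent.succs.size() < 2) return;
    unsigned long least = (*std::min_element(
        parent.succs.begin(), parent.succs.end(),
        [](const Block* a, const Block* b) { return a->clock < b->clock; }))->clock;
    for (Block* s : parent.succs)
        s->clock -= least;           // the cheapest successor becomes 0
    parent.clock += least;           // its cost is now counted earlier
}

For example, a parent with successors clocked 2 and 5 gains 2, leaving successors of 0 and 3, so the clock update disappears entirely from the cheaper path.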

4.2.2 Opt 2b: cloning blocks

In this optimization, we clone blocks where possible to reduce the number of clock updates. For example, for the example function shown in Fig. 4, block if.end6 is cloned, so that for the paths formed by blocks split.entry1, if.then4 and if.end6, the clock needs to be updated only once, rather than twice or thrice.


Fig. 5 Example function before and after applying optimization 2c

Fig. 6 Part of an example function before and after applying optimization 2d

4.2.3 Opt 2c: adding additional blocks

In this optimization, we add new blocks where necessary to reduce the clock updating code. To illustrate this, an example function before and after applying this optimization is shown in Fig. 5. Three new blocks are added to update the clocks removed from the three blocks shown in gray in the optimized version. The accumulated clock of those three blocks is also added to the clock of block if.end. With this optimization, for the path from for.cond5 to if.end, the clock is updated only twice, while in the unoptimized version it is updated 5 times. Note that the block (...) here represents a bundle of basic blocks, which we do not show due to space limitations.

4.2.4 Opt 2d: pushing clocks downwards

In this optimization, we update the clock from top to bottom. This optimization can remove clocks more efficiently in some cases as compared to opt 2a. However, since we try to update clocks as soon as possible, we apply this optimization only for paths where the accumulated clock is less than a certain value. Part of an example function, before and after applying this optimization, is shown in Fig. 6. We can see that with

this optimization, for the part of the example function shown, the clock is only updated once, rather than twice or thrice as in the original version of that function.

4.3 Optimization 3 (averaging of clocks)

This optimization is based on the fact that the paths emanating from a block in a function could lie close together in total clock values. One can imagine it as a specialized case of optimization 1 (function clocking). For function clocking, we just considered the paths emanating from the entry block, but here we also check for paths besides the entry block. When forming paths for a block, we only consider blocks dominated by it (execution must pass through the dominating block to reach its dominated blocks). More details about this optimization can be found in [11].

4.4 Optimization 4 (loops)

This optimization deals with loops. The different types of optimization we applied on loops are discussed next.

4.4.1 Opt 4a: forwarding clocks from blocks with backedges

This optimization considers the fact that loops are often executed multiple times. So for example, if you have a for loop, the increment operation will take place just before the next iteration. Therefore we check for back edges, and if we see that the clock of the block from which the backedge originates is less than a certain threshold value and is also less than the clock of the block it is jumping to, we merge its clock value into that block's clock and remove the clock updating code from it.
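Sketched on a plain for loop with invented clock values (the threshold itself is not specified here), the effect is the following:

extern thread_local unsigned long det_clock;

void before(int n) {
    for (int i = 0; i < n; ++i) {
        det_clock += 5;      // header/body block: clock 5
        /* ... body ... */
        det_clock += 2;      // backedge block (increment and jump): clock 2
    }
}

// After opt 4a: the backedge block's small clock (2, below the threshold and
// below the jump target's 5) is merged into the target block, saving one
// clock update per iteration.
void after(int n) {
    for (int i = 0; i < n; ++i) {
        det_clock += 7;      // 5 + 2 charged together at the top
        /* ... body ... */
    }
}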

4.4.2 Opt 4b: incrementing clocks before for loops

This optimization is based on the fact that for many for loops, the number of iterations can be checked at compile time. So for example, if at compile time we can see that a for loop is executed n times, we can update the clock ahead of time. First our pass figures out the least number of instructions an iteration in the loop will execute. This number, multiplied by n, is incremented before execution of the for loop. Inside the for loop, we update the clock only where it is necessary. An example function before and after applying this optimization is shown in Fig. 7. Here, the minimum number of instructions executed by the for loop in an iteration is 21. Therefore it is multiplied by the number of iterations N of the for loop.

The pseudocode for optimization 4b is shown in Fig. 8. For each block in a function, it checks if it is a loop header. This optimization is only applied to innermost loops, as they are usually the most compute intensive types of loops. The meetsOpt4bRequirements function checks if all blocks inside the loop are at the same level, and also checks other things, such as that no block has unclocked function calls. If all the requirements are met, the optimization is applied. preds here are the predecessor blocks of the loop header. For example, in the example function shown in Fig. 7, block


for.cond2.preheader.lr.ph is the predecessor. If there is only one predecessor, which has no other successor than the loop header, the clock is added to that predecessor block, as shown by the code on lines 14 to 19. Otherwise, a block is added in between the loop header and its predecessors to update the clock. The be block returned by meetsOpt4bRequirements is the block that contains the backedge, which is for.inc2 in this case. This is passed as a parameter to the updateClocksInLoop function, which shifts the constant clock value of the loop, that is, the least number of instructions the loop will always execute in an iteration, to the header block. The value in the header block is then shifted to the predecessor block. In this case, the updateClocksInLoop function added 15 to the original clock of for.cond2.preheader, which is later shifted to the predecessor block for.cond2.preheader.lr.ph.

UpdateClocksInLoop works by first concentrating the clock in all merge nodes which are guaranteed to be executed in an iteration. Such blocks are for.inc, for.inc1 and for.inc2 in the example function. The list of all blocks preceding such a merge node is checked to see which one has the minimum value, and that minimum value is added

Fig. 7 Example function before and after applying optimization 4b


to such a merge node. For example, if.then and if.else are blocks preceding the merge node for.inc. if.then17 is not included in the list since it is dominated by if.else, which is already preceding the merge node in question. Since if.else has the minimum clock of the two, its clock (1) is added to for.inc, making the clock of for.inc 6, while making the clock of if.else 0, and also subtracting 1 from if.then. The same is done with the other two merge nodes. After this step, for.inc and for.inc1 contain clock values of 6 each, while for.inc2 contains a clock value of 3. So, overall these three merge nodes contain a clock value of 15. This clock value is removed from all these nodes and shifted to for.cond2.preheader to make its clock 21 instead of 6, while making the clocks of these merge nodes 0. Later on, the clock is removed from for.cond2.preheader and updated before executing the loop, in for.cond2.preheader.lr.ph, by multiplying it with the number of times the loop will execute.

Fig. 8 Pseudocode for optimization 4b
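The net effect on the example of Fig. 7 can be pictured as follows, using the paper's numbers (at least 21 instructions per iteration, N known at compile time). The code is an illustration, not the actual transformed IR.

extern thread_local unsigned long det_clock;

void before(int N) {
    for (int i = 0; i < N; ++i) {
        det_clock += 21;     // guaranteed per-iteration minimum, paid each time
        /* ... body; longer paths add their extra clocks separately ... */
    }
}

// After opt 4b: the guaranteed part of every iteration is charged once,
// ahead of the loop, in the predecessor block; inside the loop only the
// path-dependent extras still update the clock.
void after(int N) {
    det_clock += 21UL * N;   // hoisted: N iterations times 21 instructions
    for (int i = 0; i < N; ++i) {
        /* ... only blocks exceeding the per-iteration minimum update det_clock ... */
    }
}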

4.4.3 Opt 4c: cloning while loops

It is not possible to apply optimization 4b where the number of iterations of a loop cannot be determined at compile time, which is usually the case with while loops.


Fig. 9 Part of an example function before and after applying optimization 4c

If a while loop executes a constant number of instructions in each iteration, we clone that while loop, so that the clock is updated every other iteration rather than at each iteration. A part of an example function before and after applying this optimization is shown in Fig. 9. Note that even if we applied optimization 4a here, the clock would have been updated after every 8 instructions. But by cloning these loops, it is updated after 16 instructions. Note that we also add blocks to update the clock if the while loop exits on an odd iteration.
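The cloning can be pictured as unrolling by two with a fix-up for an odd iteration count. This is an illustration using the example's 8 instructions per iteration, not the actual transformed code.

extern thread_local unsigned long det_clock;

void before(bool (*cond)()) {
    while (cond()) {
        det_clock += 8;      // updated on every iteration
        /* ... body ... */
    }
}

// After opt 4c: the body is cloned so the clock is normally updated once per
// two iterations; an extra update compensates when the loop exits after the
// first (odd) copy.
void after(bool (*cond)()) {
    while (cond()) {
        /* ... body (first copy, no clock update) ... */
        if (!cond()) {
            det_clock += 8;  // odd-iteration exit: charge the single iteration
            break;
        }
        det_clock += 16;     // two iterations charged together, ahead of time
        /* ... body (second copy) ... */
    }
}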

5 Performance evaluation

We tested the performance of DetLock with 8 benchmarks, 6 from SPLASH-2 [17] and 2 from PARSEC [3]. All benchmarks were run on a 2.66 GHz quad core machine and compiled with maximum optimization enabled (level -O4 for clang/llvm). We first discuss the results. Afterwards, we show how clocking instructions ahead of time improves deterministic execution.

5.1 Results

Table 1 shows the performance overheads with different optimizations and Fig. 10 gives a pictorial view of that overhead. In Table 1, along with the results with different optimizations, we also show the original execution times, locks per second and the number of clocked functions for each benchmark. Note that all times are in milliseconds. The left bars in Fig. 10 show the performance overhead without applying optimizations, while the bars in the middle show performance after applying the optimizations of [11] only. The bars on the right show the overhead after applying all the optimizations, including those introduced in this paper. The lower portion of each bar is the overhead of the inserted clock updating code only, while the upper portion shows the additional overhead for deterministic execution.

From Fig. 10, we can see that the new optimizations introduced in this paper improved performance for several benchmarks, as shown by the decrease in size of the right bars. The improvement relative to [11] is most significant for barnes, water and swaptions. For water, for example, the overhead of deterministic execution is brought down from 20 to 0 %. Table 1 shows performance with different optimizations.

Table 1 Performance results of our scheme for the selected benchmarks: for each benchmark (barnes, ocean, volrend, water, swaptions, fluidanimate, raytrace, radiosity, plus the average), the original execution time in milliseconds, locks/sec, the number of clocked functions, and the execution times with overheads after inserting clocks as well as after inserting clocks and performing deterministic execution, each under No Opt, Only Opt 1, Opt 2, Opt 3, Opt 4+1, the optimizations of [11], and All Opts


Fig. 10 Overheads (bar chart, Overhead (%) per benchmark: barnes, ocean, volrend, water, swaptions, fluidanimate, raytrace, radiosity; lower portion: overhead of clock updating code, upper portion: additional overhead of deterministic execution)

Optimization 4 is shown combined with optimization 1 in the table, because some loops in the benchmarks consist of clocked functions, and without adding optimization 1, the impact of optimization 4 could not be seen in such cases. Overall, we see that all optimizations work to reduce the clock updating overhead as well as the deterministic execution overhead. However, from the average column, we can see that while optimization 2 reduced the clock update overhead more than optimizations 4 and 1 combined, the latter reduced the overall time, including deterministic execution, more than optimization 2. The reason for this is discussed next.

5.2 Effect of updating clocks ahead of time

From Table 1, we can see that optimizations 4 and 1 combined reduce the overall deterministic execution time more than optimization 2. This is even though optimization 2 reduced the clock updating overhead more than optimizations 4 and 1 combined. The reason for this is that updating clocks ahead of time reduces the waiting time of a thread which is in the process of acquiring a lock, as the clocks of the other threads progress more quickly this way (even before the execution of some instructions), thus allowing a waiting thread's clock to reach the global minimum more quickly. This effect is most pronounced for swaptions and radiosity. For swaptions, for example, the clock updating overhead with optimization 2 is 7 % while the overall deterministic execution overhead with optimization 2 is 29 %. On the other hand, with optimizations 4 and 1 combined, the clock updating overhead is 3 % while the overall deterministic overhead is only 13 %, which means an increase of only 10 % over the clock updating overhead, as compared to an increase of 22 % with optimization 2. Similarly, for radiosity, there is an increase of 26 % (46-20 %) over the clock updating overhead with optimizations


4 and 1 combined, as compared to an increase of 48 % (57-9 %) with optimization 2. This is because optimizations 4 and 1 can increment the clock ahead of time more aggressively than optimization 2, as optimizations 4 and 1 work at the function and loop levels, whereas optimization 2 works only at the basic block level.

6 Conclusion

In this paper, we described our tool DetLock, which consists of an LLVM compiler pass to insert code for updating logical clocks for Weak Deterministic execution. Since our scheme does not depend on any hardware or modification of the kernel, it is very portable. Moreover, we apply several optimizations to reduce the amount of code inserted for clock updating. Furthermore, since the algorithm for Weak Determinism that we use gives the lock to the thread with the minimum logical clock, we try to increment the clocks of threads as soon as possible so that threads waiting for locks have to wait less. We increment the clocks even before instructions are executed if possible. On average, the overhead of inserting clock updating code is only 2 %, whereas the overall overhead including deterministic execution is 14 % for the selected benchmarks. This is an improvement over our previous work [11], with which, on average, the overhead of inserting clock updating code is 6 %, while the overall overhead including deterministic execution is 20 %.

Acknowledgments This research has been funded by the projects Smecy 100230, iFEST 100203 and REFLECT 248976.

References

1. Basu A, Bobba J, Hill MD (2011) Karma: scalable deterministic record-replay. In: ICS '11. ACM, New York, pp 359-368
2. Bergan T, Anderson O, Devietti J, Ceze L, Grossman D (2010) CoreDet: a compiler and runtime system for deterministic multithreaded execution. SIGARCH Comput Archit News 38:53-64
3. Bienia C, Kumar S, Singh JP, Li K (2008) The PARSEC benchmark suite: characterization and architectural implications. In: PACT '08. ACM, New York, pp 72-81
4. Devietti J, Lucia B, Ceze L, Oskin M (2009) DMP: deterministic shared memory multiprocessing. In: ASPLOS '09. ACM, New York, pp 85-96
5. Edwards SA, Tardieu O (2005) SHIM: a deterministic model for heterogeneous embedded systems. In: EMSOFT '05. ACM, New York, pp 264-272
6. Hower DR, Hill MD (2008) Rerun: exploiting episodes for lightweight memory race recording. In: ISCA '08. IEEE Computer Society, Washington, DC, pp 265-276
7. Hower D, Dudnik P, Hill M, Wood D (2011) Calvin: deterministic or not? Free will to choose. In: HPCA '11, pp 333-334
8. Lattner C, Adve V (2004) LLVM: a compilation framework for lifelong program analysis & transformation. In: CGO '04, Palo Alto
9. Lee D, Wester B, Veeraraghavan K, Narayanasamy S, Chen PM, Flinn J (2010) Respec: efficient online multiprocessor replay via speculation and external determinism. In: ASPLOS '10, pp 77-89
10. Mushtaq H, Al-Ars Z, Bertels K (2011) Survey of fault tolerance techniques for shared memory multicore/multiprocessor systems. In: IDT '11, pp 12-17
11. Mushtaq H, Al-Ars Z, Bertels K (2012) DetLock: portable and efficient deterministic execution for shared memory multicore systems. In: MuCoCoS '12, Salt Lake City
12. Nethercote N, Seward J (2007) Valgrind: a framework for heavyweight dynamic binary instrumentation. In: Proceedings of the 2007 programming language design and implementation conference


13. Olszewski M, Ansel J, Amarasinghe S (2009) Kendo: efficient deterministic multithreading in software. SIGPLAN Not 44:97-108
14. Thies W, Karczmarek M, Amarasinghe SP (2002) StreamIt: a language for streaming applications. In: CC '02. Springer-Verlag, London, pp 179-196
15. Liu T, Curtsinger C, Berger ED (2011) Dthreads: efficient deterministic multithreading. In: SOSP '11
16. Weaver V, Dongarra J (2010) Can hardware performance counters produce expected, deterministic results? In: FHPM '10, Atlanta
17. Woo SC, Ohara M, Torrie E, Singh JP, Gupta A (1995) The SPLASH-2 programs: characterization and methodological considerations. SIGARCH Comput Archit News 23:24-36


Fault Tolerance on Multicore Processors using Deterministic Multithreading

Hamid Mushtaq, Zaid Al-Ars, Koen Bertels
Computer Engineering Laboratory
Delft University of Technology
Delft, the Netherlands
{H.Mushtaq, Z.Al-Ars, K.L.M.Bertels}@tudelft.nl

Abstract—This paper describes a software based fault tolerance approach for multithreaded programs running on multicore processors. Redundant multithreaded processes are used to detect soft errors and recover from them. Our scheme makes sure that the execution of the redundant processes is identical even in the presence of non-determinism due to shared memory accesses. This is done by making sure that the redundant processes acquire the locks for accessing the shared memory in the same order. Instead of using a record/replay technique to do that, our scheme is based on deterministic multithreading, meaning that for the same input, a multithreaded program always has the same lock interleaving. Unlike record/replay systems, this eliminates the requirement for communication between the redundant processes. Moreover, our scheme is implemented totally in software, requiring no special hardware, making it very portable. Furthermore, our scheme is totally implemented at user-level, requiring no modification of the kernel. For selected benchmarks, our scheme adds an average overhead of 49% for 4 threads.

I. INTRODUCTION

The abundant computational resources available in multicore systems have made it feasible to implement otherwise prohibitively intensive tasks on consumer grade systems. However, these systems integrate billions of transistors to implement multiple cores on a single die, thus raising reliability concerns, as smaller transistors are more susceptible to both transient [10] as well as permanent [11] faults.

A common approach for providing fault tolerance is to perform redundant execution of the software. This is done by using the state machine replication approach [12]. In this approach the replicated copies of a process (known as replicas) follow the same execution sequence and produce the same output if given the same input. This requirement necessitates that the replicas handle non-deterministic events such as asynchronous signals and non-deterministic functions (such as gettimeofday) deterministically. This is usually done by having one replica log the non-deterministic events and have the other replicas replay them at the same point in program execution. In a shared memory multithreaded program, this also means that the replicas perform non-deterministic shared memory accesses deterministically, so that they do not diverge in the absence of faults.

One way of making sure that the redundant processes access the shared memory in the same order is to perform record/replay, where the leader process records the order of the locks (to access shared memory) in a queue which is shared between the leader and follower. The follower in turn reads from that queue to have the same lock acquisition order. This approach is used by Respec [4] and our previous work [6]. However, this requires communication between the leader and follower process, which decreases reliability, as the memory used for communicating might itself become corrupted due to a soft error. Moreover, it requires extra memory.

In this scheme, instead of depending on record/replay, we use deterministic multithreading, that is, given the same input, a multithreaded process always has the same lock interleaving. This makes sure that the redundant processes acquire the locks in the same order without communicating with each other. We adapt the method used by Kendo [5] to do this, but unlike Kendo, our scheme neither requires deterministic hardware performance counters, which are not available on many platforms [8] (including many x86 systems), nor kernel modification for deterministic execution. The logical clocks used for deterministic execution are inserted by the compiler instead.

We can sum up the contributions of this paper as follows.
1) The scheme is implemented using a user-level library and does not require a modified kernel.
2) The scheme uses deterministic multithreading instead of record/replay to ensure that the redundant processes acquire locks for shared memory access in the same order. This eliminates the requirement of communication between replicas for deterministic shared memory accesses, making the system more reliable (by increasing isolation) and consuming less memory.
3) The scheme is very portable since it does not depend upon any special hardware for deterministic execution.

In Section II we discuss the background and related work, while in Section III, we discuss our fault tolerance scheme. This is followed by Section IV, where we discuss the implementation. In Section V, we evaluate the performance of our scheme, and we finally conclude the paper with Section VI.

II. BACKGROUND AND RELATED WORK

A fault tolerant system which uses redundant execution needs to make sure that the redundant processes do not diverge in the absence of faults. In a single threaded program, in the absence of any fault, the only possible causes of divergence


among the replicas can be non-deterministic functions (such as gettimeofday) or asynchronous signals/interrupts. However, in multithreaded programs running on multicore processors, there is one more source of non-determinism, which is shared memory accesses. These accesses are much more frequent than interrupts or signals. Therefore, efficient deterministic execution of replicas in such systems is much more difficult to achieve.

One method to ensure that redundant processes access shared memory in the same order is record/replay. Both software and hardware methods exist for that purpose. An example of a hardware approach is Karma [14], which intercepts the cache coherence protocols to record inter-processor data dependencies and later uses these recorded dependencies to replay. Respec [4] is a software-based method. It logs the order of acquisition and release of synchronization objects, such as mutexes, to make the replicas acquire the synchronization objects in the same order. It also performs checkpoint/rollback to perform recovery.

The disadvantage of employing record/replay for deterministic shared memory accesses is that it requires communication between the replicas, making the fault tolerance scheme less reliable, as the shared memory used for communication can itself become corrupted by one of the replicas. Moreover, it requires extra memory.

To eliminate this communication and memory requirement, we can employ deterministic multithreading, where a multithreaded process always has the same memory interleaving for the same input. The ideal situation would be to make a multithreaded program deterministic even in the presence of race conditions, that is, to provide strong determinism. This is not possible to do efficiently with software alone, though. One can use a relaxed memory model where every thread writes to its own private memory, while data is committed to shared memory only at intervals. However, stopping threads regularly to commit to shared memory degrades performance, as demonstrated by CoreDet [1], which has a maximum overhead of 11x for 8 cores. We can reduce the amount of committing to shared memory by committing only at synchronization points such as locks, barriers or thread creation. This approach is taken by DTHREADS [7]. Here one can still imagine the slowdown for applications with high lock frequencies. Moreover, since committing to shared memory is then done less frequently, more data has to be committed, which also makes it slow for applications with high memory usage. This is why hardware approaches have been proposed to increase the efficiency of deterministic execution. An example of such an approach is Calvin [3], which uses the same concept as CoreDet for deterministic execution but makes use of special hardware for that purpose.

Since performing deterministic execution in software alone is inefficient, one can relax the requirements to improve efficiency. For example, Kendo [5] does this by supporting deterministic execution only for well written programs that protect every shared memory access through locks. In other words, it supports deterministic execution only for programs without race conditions. The authors of Kendo call this Weak Determinism. Considering that most well written programs are race free and that there exist tools to detect race conditions, such as Valgrind [15], Weak Determinism is sufficient for most well written multithreaded programs.

The basic idea of Kendo is that it uses a logical clock for each thread to determine when a thread will acquire a lock. The thread with the least value of logical clock gets the lock. Though quite efficient, Kendo still suffers from portability problems. First of all, it requires deterministic hardware performance counters for counting logical clocks. Many popular platforms (including many x86 platforms) do not have any hardware performance counters that are deterministic [8]. Secondly, Kendo needs modification of the kernel to allow reading from the hardware performance counters.

III. FAULT TOLERANCE SCHEME

Our fault tolerance scheme is intended to reduce the probability of failures in the presence of transient faults. The block diagram of our fault tolerance scheme is shown in Figure 1.

Fig. 1. Block diagram of our fault tolerance scheme (leader, follower, checkpoint and watchdog processes communicating through shared memory)

Initially, the leader process (which is the original process, highlighted in the figure) creates the watchdog and follower processes. The follower process is identical to the leader process and follows the same execution path. The execution is divided into time slices known as epochs. An epoch starts and ends at a program barrier. At the end of each epoch, the memories of the leader and follower processes are compared. If no divergence is found, a checkpoint is taken and the output to files or screen is committed. The previous checkpoint is also deleted. The checkpoint is basically a suspended process which is identical to the leader process at the time the checkpoint is taken. If a divergence is found at the end of an epoch, execution is restarted from the last checkpoint by resuming the checkpoint process and killing the leader and follower processes. This can also happen inside an epoch, if the follower sees that the parameters of the system calls logged by the leader do not match those it reads itself, in which case it signals the leader process to restart execution from the last checkpoint. When the checkpoint process starts, it becomes the leader and creates its own follower. The watchdog process is used to detect timeout errors and recover from them. This is done by having the watchdog process signal the checkpoint process to start on a timeout error.
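To make the epoch protocol concrete, the following is a minimal sketch (ours, not the authors' implementation) of the leader's work at an epoch-ending barrier; helpers such as compare_dirty_pages(), take_checkpoint() and rollback_to_checkpoint() are hypothetical placeholders for the mechanisms described above.

    /* Sketch of the epoch-end protocol; all helper functions are
     * hypothetical placeholders, not the actual API of the scheme. */
    #include <stdbool.h>

    extern bool is_leader;                  /* set when the follower is forked */
    extern bool compare_dirty_pages(void);  /* hash-sum comparison of memories */
    extern void take_checkpoint(void);      /* suspend a copy of the leader    */
    extern void commit_buffered_output(void);
    extern void rollback_to_checkpoint(void);

    /* Called by the leader at the program barrier that ends an epoch. */
    void epoch_end(void)
    {
        if (!is_leader)
            return;
        if (compare_dirty_pages()) {     /* leader and follower match     */
            take_checkpoint();           /* new checkpoint, kill old one  */
            commit_buffered_output();    /* flush file/screen buffers     */
        } else {
            rollback_to_checkpoint();    /* divergence: resume checkpoint */
        }
    }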

The approach used in this paper is different from Respec and our previous work [6] in the way it makes sure that the leader and follower processes are identical in the presence of non-deterministic shared memory accesses. While Respec and our previous work used the record/replay technique, where the leader logs the synchronization operations (to access shared memory) in a queue which is then read by the follower so that it performs the synchronization operations in the same order, in this paper we use deterministic multithreading, which does not require this type of communication, thus improving isolation and fault tolerance. The difference between the two approaches is illustrated in Figure 2. Note that in this approach we still use shared memory between leader and follower, but only to log the results of non-deterministic functions and the type, input parameters and results of system calls, besides using it for memory comparison at the end of epochs for error checking.

Fig. 2. Steps used by our fault tolerance scheme (an unreliable system is made reliable through state machine replication plus error checking and recovery; double modular redundancy with checkpoint/rollback can use record/replay, as in Respec and our previous work, or deterministic multithreading, as in this paper; triple modular redundancy with fault masking is the alternative)

At this moment, our fault tolerance scheme does not work with programs that use inter-process communication (through pipes and shared memory, for example). The only form of I/O allowed is disk I/O and screen output. Moreover, our scheme assumes that there are no data races in the program. Lastly, we have not added functionality to handle asynchronous signals. However, this functionality can be added for user space by handling asynchronous signals at synchronous points, such as system calls, as done by Scribe [13].

IV. IMPLEMENTATION

In this section, we discuss the implementation of our fault tolerance scheme. We start with Section IV-A, where we discuss how deterministic execution of the replicas is performed. This is followed by Section IV-B, which discusses error detection. Finally, in Section IV-C, we discuss our recovery mechanism.

A. Deterministic execution

For deterministic execution, we need to ensure determinism in the presence of non-deterministic functions and shared memory accesses. Moreover, we need to make sure that the leader and follower processes use the same memory addresses, for which we need a deterministic memory allocation scheme. Finally, we also need to make sure that we have deterministic I/O. Below we discuss how we handle these issues.

1) Replica creation: Our library assumes that threads in the application are created once at the start of the application. Therefore, we create the follower process at the point in the code where the threads are created. For this purpose, we replace the pthread_create function with our own to make sure the threads of the replicas use the same memory addresses. More detail on this can be found in [6].

2) Memory allocation: We implement our own memory allocation functions to allocate memory deterministically. Our implementation replaces the locks in the original memory allocation functions with our own deterministic locks. The variables used by our library (not related to original program execution) to perform deterministic execution may have different values for the leader and follower processes, for example, the flag used to distinguish the leader process from the follower process. For these variables, we use a separate memory, which is allocated with the mmap system call. This memory is not compared for error detection.

3) Deterministic shared memory accesses: We use Kendo's algorithm to perform deterministic execution. However, unlike Kendo, which requires deterministic hardware performance counters that are not available on many platforms, we insert code to update the logical clocks at compile time. This also means that we do not need to modify the kernel, which Kendo requires to read from the performance counters. Figure 3 shows the point in the compilation flow where our compiler pass executes, which is after the source is translated to LLVM IR (Intermediate Representation) code and before the final binary code is generated by the LLVM backend.

The unit of our logical clock is one instruction. For instructions which take more than one clock cycle, the logical clock is updated according to the approximate number of clock cycles they take.

Kendo's method of acquiring locks deterministically works by giving the lock to the thread with the minimum clock first. For example, in a process with two threads, if Thread 1 is trying to acquire a lock when its logical clock is 1029, it will not be able to do so while Thread 2's clock is at 329, that being less than 1029. But as soon as Thread 2's clock gets past 1029, Thread 1 will acquire the lock. So basically our purpose is not only to reduce the code that updates the clocks, but also to update the clocks as soon as possible, so that the logical clock of a thread waiting for a lock becomes the minimum more quickly. In fact, at compile time it is possible to increment the clock even before some instructions are executed, since at compile time we can count the number of instructions. Therefore, in all optimizations we apply, besides trying to reduce the clock update overhead, we also try to increment the clock as soon as possible. Without any optimization, we update the clock at the start of each basic block of the LLVM IR. If there is a function call inside a block, we split that block, such that each block either contains no function call or starts and ends with a function call. Then we update the clock at the top of each block if that block contains no function calls; otherwise we update the clocks in between the function calls. By splitting blocks in such a way, we can more easily apply optimizations.
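The clock-based acquisition rule just described can be sketched as follows (our illustration, not Kendo's or our library's actual code); det_clock and NTHREADS are assumed names, and ties are broken deterministically by thread id.

    #include <pthread.h>
    #include <stdatomic.h>

    #define NTHREADS 4
    _Atomic unsigned long det_clock[NTHREADS]; /* one logical clock per thread */

    /* A thread may take the lock only when no other thread is logically
     * behind it (with a deterministic tie-break on thread id). */
    static int my_turn(int tid)
    {
        unsigned long mine = atomic_load(&det_clock[tid]);
        for (int t = 0; t < NTHREADS; t++) {
            if (t == tid)
                continue;
            unsigned long other = atomic_load(&det_clock[t]);
            if (other < mine || (other == mine && t < tid))
                return 0;
        }
        return 1;
    }

    void det_mutex_lock(pthread_mutex_t *m, int tid)
    {
        while (!my_turn(tid))
            ;                                 /* wait for the others to catch up */
        pthread_mutex_lock(m);
        atomic_fetch_add(&det_clock[tid], 1); /* move past ties deterministically */
    }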

Fig. 3. Our tool modifies the LLVM IR code by inserting code for updating logical clocks, used for deterministic multithreading (source code, LLVM frontend, LLVM IR, our pass inserting the clock updating code, LLVM backend, final executable)

We apply the following optimizations to reduce the logical clock updating overhead, as well as to reduce the waiting time of threads waiting for a lock by incrementing clocks ahead of time.

Optimization 1 (Function Clocking): As discussed previously, the sooner the clocks are updated the better, and leaf functions (functions that do not call any function) with only one basic block are perfect candidates for such an optimization. Clocks can be removed from such functions and instead be added to the basic blocks calling them. Besides functions with only one block, our method also considers leaf functions with multiple blocks, given that there are no loops in such functions. If our pass sees that all possible paths taken through such a function do not differ by much, we calculate the mean clock value over all possible paths and use that mean value to update the clock. We call such leaf functions clocked functions. By intuition, we can judge that it is also possible to clock functions which call only clocked functions. In this way, we can even clock functions which are not necessarily leaf functions.

Optimization 2 (Conditional Blocks Optimization): This optimization is based on the principle that if a block has two or more successors, given that those successors are not merge nodes, we can set the clock of the successor with the least clock to zero and subtract its original value from all its siblings, while also adding its original clock to the parent block. Another principle of this optimization is that if all predecessors of a merge block have that merge block as their only successor, the clocks can be shifted from the merge node to them. It should be noted that after all the blocks of a function have been parsed and this optimization applied, it is applied once more if still possible.

Optimization 3 (Averaging of Clock): This optimization is based on the fact that the paths emanating from a block in a function may lie close together in total clock value. One can see it as a generalized case of Function Clocking: for Function Clocking we considered only the paths emanating from the entry block, but here we also check paths from blocks other than the entry block. When forming paths for a block, we only consider blocks dominated by it (execution must pass through the dominating block to reach its dominated blocks). We also make sure there are no loops or unclocked functions in such paths. If we find such a block in a function, we remove the clocks from all the blocks in the averaged paths and assign the mean clock value to that block.

Optimization 4 (Loops Optimization): This optimization considers the fact that loops are often executed multiple times. For example, in a for loop, the increment operation takes place just before the next iteration. Therefore, we check for back edges, and if we see that the clock of the block from which the back edge originates is less than a certain threshold value and also less than the clock of the block it jumps to, we merge its clock value into that block's clock and remove the clock updating code from it.

4) System calls, non-deterministic functions and I/O: We use LD_PRELOAD to preload the system call wrappers found in glibc with our own versions, which log the type, input parameters and output of the system calls. The type and input parameters are then read by the follower for error detection, while the output logged by the leader is used directly by the follower, instead of actually executing the system call. Working only with user-space wrappers of system calls is possible because most system calls are usually made through their user-space wrapper functions. This method will not work, however, if a system call is made without using the wrapper function, for example, through inline assembly. So, with our library, the programmer needs to make sure not to make a system call directly. Since the glibc library itself sometimes makes system calls directly, for example, the clone system call in the pthread_create function, we provide our own version of pthread_create. We also provide our own versions of non-deterministic functions such as rand and preload them using LD_PRELOAD. The follower uses the output logged by the leader for such functions. Each system call is protected by a deterministic lock to make sure that system calls occur in the same order in the replicas. For the system call mmap, which modifies the address space of the process, we take a special approach. The follower still uses the address returned to the leader, but passes it as a parameter combined with the MAP_FIXED flag when calling the mmap function, to ensure that the follower uses the same memory addresses as the leader.

For I/O, our library allows deterministic I/O for sequential file access and screen writes. A write to a file or the screen is only performed after making sure that no error occurred during an epoch. For that purpose, no output is committed during an epoch; instead it is buffered. Our library overrides the write and read system call wrappers to allow buffering of the data. The buffers are committed at the end of an epoch, after comparing the buffer contents of the leader and follower using hash-sums. For this purpose, each file opened for writing is allocated a special buffer. It is important that the addresses of these buffers are the same for the leader and follower processes. For this purpose, we use a deterministic memory allocation scheme like the one described in Section IV-A2. For sequential file reading, the file offset value is saved at the end of each epoch, so that the file can be rewound to the previous value in case of rollback.
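As an illustration of the preloading just described, a wrapper for the libc rand function could look like the following sketch (ours; log_write and log_read stand in for the shared-memory log and are hypothetical). Compiled into a shared library, it would be activated with, e.g., LD_PRELOAD=./librandwrap.so ./app.

    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdbool.h>

    extern bool is_leader;
    extern void log_write(int v);   /* leader appends to the shared log */
    extern int  log_read(void);     /* follower consumes from that log  */

    int rand(void)
    {
        if (is_leader) {
            /* look up and call the real rand(), then log its result */
            int (*real_rand)(void) = (int (*)(void))dlsym(RTLD_NEXT, "rand");
            int v = real_rand();
            log_write(v);
            return v;
        }
        return log_read();          /* follower replays the leader's value */
    }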

Fig. 4. Memory pages can be grouped into segments to reduce the overhead of memory comparison for error detection (e.g. Segment 1 spanning Pages 1 to 4 of the memory space)
B. Error detection

At regular intervals of 1 second, known as epochs, the dirtied (modified) memory pages of the leader and follower processes are compared. However, the epoch time is reduced to 100 ms if a file or screen output occurs during the epoch. Instead of comparing the memories page by page, the leader and follower processes calculate hash-sums of the dirtied (modified) memory pages, which are then compared. If a discrepancy is found, a fault is detected.

The comparison is made even faster by assigning each thread to calculate the hash-sum of a different portion of the memory. The leader keeps its hash-sums in shared memory so that the follower can read them from there for comparison. We perform the memory comparison at barriers which are already found in the program, rather than stopping the threads and creating a barrier. This improves the performance, as the threads already wait for each other at barriers. Otherwise, we create barriers at system call points if necessary (at the end of an epoch, that is).

Since our scheme runs at the user level, we cannot note down dirtied pages while handling page faults (from the kernel), the way Respec does, which is the most efficient method possible. Therefore, we take special steps to improve performance. At the start of each epoch, we give only read access to the allocated memory pages. Whenever a page is accessed for writing, the OS sends a signal to the accessing thread. In the signal handler, the address of the memory page is noted down and both read and write access is given to that memory page. In this way, we only need to compare the dirtied memory pages at the end of an epoch. Sending a signal on each memory page access violation can slow down execution. Therefore, to reduce the number of such signals, we exploit the spatial locality of data and segment the memory into groups of multiple pages, as shown in Figure 4. A write on any part of a read-protected segment of N pages is handled by giving write access to all N pages in that segment.
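The write-protection mechanism described above can be sketched as follows (a simplification, with an unsynchronized dirty list and constants chosen for illustration):

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>

    #define PAGE_SIZE 4096UL
    #define SEG_PAGES 4        /* pages per segment, here a grouping size of 4 */

    static void *dirty[1 << 16];   /* addresses of dirtied segments */
    static int   ndirty;

    static void write_fault(int sig, siginfo_t *si, void *ctx)
    {
        (void)sig; (void)ctx;
        uintptr_t seg = (uintptr_t)si->si_addr & ~(SEG_PAGES * PAGE_SIZE - 1);
        dirty[ndirty++] = (void *)seg;          /* note the dirtied segment */
        /* re-enable writing for the whole segment, so later writes to any
         * of its pages cost no further signal */
        mprotect((void *)seg, SEG_PAGES * PAGE_SIZE, PROT_READ | PROT_WRITE);
    }

    void epoch_start(void *mem, size_t len)
    {
        struct sigaction sa = { .sa_sigaction = write_fault,
                                .sa_flags = SA_SIGINFO };
        sigaction(SIGSEGV, &sa, NULL);
        ndirty = 0;
        mprotect(mem, len, PROT_READ);          /* all pages read-only again */
    }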
Some functions, like the one for comparing memories, change the stacks differently for the leader and follower threads. For those purposes, we switch to a temporary stack, so that the original stack remains unaltered by such functions.

The watchdog process is used to detect and recover from timeout errors. Details can be found in [6].

C. Recovery

As discussed previously, for fault recovery we use checkpoint/rollback. Whenever the leader takes a checkpoint, it kills the previous checkpoint. If the leader process detects an error, or the watchdog process detects a hang, a signal is sent to the last checkpoint process, so that the checkpoint process can start execution. The leader and its follower are killed at that point. The checkpoint process then assumes the role of the leader and forks its own follower. It also creates a new checkpoint. Checkpoints are taken only at barrier points. For creating a multithreaded follower, we have implemented a special multithreaded fork function that replicates the leader process to create the follower. More detail on this can be found in [6].

V. PERFORMANCE EVALUATION

We selected 5 benchmarks, 1 from the PARSEC [2] and 4 from the SPLASH-2 [9] benchmark suites. We ran all our benchmarks on an 8 core (dual socket with 2 quad cores), 2.67 GHz Intel Xeon processor with 32 GB of RAM. All programs were compiled with maximum optimization enabled (level -O4 for clang/llvm). All benchmarks were run with 4 threads, meaning we had 8 threads overall for the 2 redundant processes.

A. Results

The results are shown in Table I. In this table, by Redundant Exec we mean the results obtained by letting the leader and follower processes execute freely, without any deterministic execution or fault tolerance; the Deterministic Exec overhead is the overhead with deterministic execution only; and the Overall Exec overhead includes all the components of our fault tolerance scheme. For overall execution, the results are shown with a memory grouping size of 4.

Fig. 5. Overhead of our fault tolerance scheme (deterministic execution overhead, epoch overhead and overall overhead per benchmark)

Figure 5 shows the different overheads separately. The epoch overhead here represents the overhead of checkpointing, of the signals for noting dirtied (modified) memory pages, and of the watchdog process. For the benchmarks with high lock frequencies (Fluidanimate and Radiosity), the overhead of deterministic execution is, expectedly, quite large, whereas for Ocean, which has high memory usage, the epoch overhead is the largest component. This is because for Ocean, a large number of signals is received when memory pages are modified during an epoch.

B. Deterministic execution overhead

Figure 6 shows the deterministic execution performance improvement that we get by applying the optimizations in the compiler pass that inserts the code to update the logical clocks. The lower portion of the bars (in white) in Figure 6 shows the overhead of deterministic execution with the optimizations, while the upper portion (in blue) shows the additional overhead when the optimizations are not applied.

TABLE I
PERFORMANCE RESULTS OF OUR SCHEME FOR THE SELECTED BENCHMARKS

Benchmark                                                Fluidanimate  Ocean       Water-nsq   Radiosity   Raytrace
Original Exec Time (ms)                                  1142          1870        1416        962         677
Locks/sec                                                6266655       5397        132798      6072689     225635
Pages Compared                                           3628          40182       371         8730        109
Epochs                                                   3             3           2           2           2
Redundant (Redt) Exec Time (ms) and Overhead             1204 (5%)     1872 (0%)   1441 (2%)   980 (2%)    697 (3%)
Det. Exec Time (ms), No Optimization (w.r.t. Redt)       3072 (155%)   2022 (8%)   2102 (46%)  1892 (93%)  813 (17%)
Det. Exec Time (ms), With Optimizations (w.r.t. Redt)    2603 (116%)   2014 (8%)   1688 (17%)  1534 (57%)  767 (10%)
Overall Exec Time (ms) and Overhead (w.r.t. Redt)        2639 (119%)   2391 (28%)  1771 (23%)  1584 (62%)  795 (14%)

Fig. 6. Improvement in deterministic execution performance by applying optimizations (per benchmark: the deterministic execution overhead, plus the additional overhead without optimizations)

We get the improvement in performance for two reasons: firstly, by reducing the clock updating code, and secondly, by updating the clocks ahead of time (see Section IV-A3 for a discussion of the deterministic algorithm that we use and why updating clocks ahead of time is beneficial). For Radiosity, Optimization 1 (Function Clocking) is applicable to a large number of functions, some of them quite compute intensive. Therefore, by incrementing the clocks for those clockable functions ahead of time, we significantly reduce the waiting time for threads which are about to acquire a lock. For Fluidanimate and Water-nsq, the improvement was mostly due to Optimization 4 (Loops Optimization), because these benchmarks contain compute intensive small loops. Optimization 3 (Averaging of Clock) worked well for Raytrace because our compiler pass could find paths in it for which the clocks could be averaged. Furthermore, Optimization 2 (Conditional Blocks Optimization) was useful for reducing the clock overhead of most of the benchmarks, because such conditional paths are commonly found in programs.

VI. CONCLUSION

In this paper, we described the design and implementation of a user-level leader/follower based fault tolerance scheme for multithreaded applications running on multicore processors. Instead of using the record/replay technique to ensure deterministic shared memory accesses by the replicas, we used deterministic multithreading, where the redundant processes do not need to communicate with each other to ensure deterministic shared memory accesses. This improves the isolation between the redundant processes, increasing fault tolerance and reliability, besides consuming less memory. To increase portability, we avoid using any special hardware for deterministic execution and avoid modifying the kernel. We instead implemented a compiler pass that inserts code to update logical clocks for deterministic multithreading. We also applied several optimizations to reduce the overhead of the logical clock updating code. In the absence of faults, our fault tolerance scheme adds an average overhead of 49% for the selected benchmarks, with 4 threads.

ACKNOWLEDGMENT

This research has been funded by the projects SMECY 100230, iFEST 100203 and REFLECT 248976.

REFERENCES

[1] T. Bergan, O. Anderson, J. Devietti, L. Ceze, and D. Grossman. CoreDet: a compiler and runtime system for deterministic multithreaded execution. SIGARCH Comput. Archit. News, 38:53-64, March 2010.
[2] C. Bienia, S. Kumar, J. P. Singh, and K. Li. The PARSEC benchmark suite: characterization and architectural implications. PACT '08, pages 72-81.
[3] D. Hower, P. Dudnik, M. Hill, and D. Wood. Calvin: Deterministic or not? Free will to choose. HPCA '11, pages 333-334, February 2011.
[4] D. Lee, B. Wester, K. Veeraraghavan, S. Narayanasamy, P. M. Chen, and J. Flinn. Respec: Efficient online multiprocessor replay via speculation and external determinism. ASPLOS '10, pages 77-90.
[5] M. Olszewski, J. Ansel, and S. Amarasinghe. Kendo: efficient deterministic multithreading in software. SIGPLAN Not., 44:97-108, March 2009.
[6] H. Mushtaq, Z. Al-Ars, and K. Bertels. A user-level library for fault tolerance on shared memory multicore systems. In Proc. 15th IEEE Symposium on Design and Diagnostics of Electronic Circuits and Systems, Tallinn, Estonia, April 2012.
[7] T. Liu, C. Curtsinger, and E. D. Berger. Dthreads: Efficient deterministic multithreading. SOSP '11, pages 327-336.
[8] V. Weaver and S. McKee. Can hardware performance counters be trusted? In IEEE International Symposium on Workload Characterization (IISWC 2008), pages 141-150.
[9] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 programs: characterization and methodological considerations. SIGARCH Comput. Archit. News, 23:24-36, May 1995.
[10] R. Baumann. Soft errors in advanced semiconductor devices, part I: the three radiation sources. IEEE Transactions on Device and Materials Reliability, 1(1):17-22, March 2001.
[11] S. Nomura, M. D. Sinclair, C.-H. Ho, V. Govindaraju, M. de Kruijf, and K. Sankaralingam. Sampling + DMR: practical and low-overhead permanent fault detection. ISCA '11, pages 201-212.
[12] F. B. Schneider. Implementing fault-tolerant services using the state machine approach: a tutorial. ACM Comput. Surv., 22:299-319, December 1990.
[13] O. Laadan, N. Viennot, and J. Nieh. Transparent, lightweight application execution replay on commodity multiprocessor operating systems. SIGMETRICS Perform. Eval. Rev., 38(1):155-166, June 2010.
[14] A. Basu, J. Bobba, and M. D. Hill. Karma: scalable deterministic record-replay. In Proceedings of the International Conference on Supercomputing, ICS '11, 2011, pages 359-368.
[15] N. Nethercote and J. Seward. Valgrind: a framework for heavyweight dynamic binary instrumentation. In Proceedings of the 2007 Programming Language Design and Implementation Conference, 2007.

5 COMPARISON OF FAULT TOLERANCE METHODS

This chapter compares the two different schemes that we developed for fault tolerance on shared memory multicore systems. The first scheme uses the record/replay technique, where one replica logs the order of shared memory accesses and the others read from that log, while the second scheme uses deterministic multithreading, meaning that for the same input, a multithreaded program always has the same lock interleaving. Unlike record/replay systems, this eliminates the requirement for communication between the redundant processes for shared memory accesses. Both of our schemes are implemented entirely in software, requiring no special hardware, thus making them very portable. Furthermore, our schemes are implemented entirely at the user level, requiring no modification of the kernel. We also implemented a hardware extension to improve the performance and scalability of deterministic multithreading. We compared our deterministic multithreading scheme (DetLock) with the state of the art, Kendo. We showed that for several benchmarks, especially those with a high frequency of shared memory accesses, our scheme outperforms Kendo, due to its ability to update the clocks more frequently and ahead of time.

For record/replay, the overhead is less than 25% for all benchmarks, going down to approximately 0% for some benchmarks. For deterministic multithreading, the overhead depends on the usage of shared memory: with low shared memory usage it is less than 25%, while with high usage it can reach up to 56%. We also compare our deterministic multithreading scheme (DetLock) to other approaches, for example Kendo, and show an overhead reduction of up to 189% for 8 threads. Even when Kendo is assisted with hardware, the hardware assisted version of DetLock can improve performance by up to 26% over Kendo.

This chapter is based on the following paper.

Mushtaq, H.; Al-Ars, Z.; Bertels, K., Fault Tolerance on Multicore Processors using Deterministic Execution, submitted to IEEE Transactions on Reliability.


Fault Tolerance on Multicore Processors Using Deterministic Execution

Hamid Mushtaq, Koen Bertels, Zaid Al-Ars
Computer Engineering Laboratory
Delft University of Technology
{H.Mushtaq, K.L.M.Bertels, Z.Al-Ars}@tudelft.nl

Abstract—This paper describes and compares two different schemes for fault tolerance on shared memory multicore systems. The first scheme uses the record/replay technique, where one replica logs the order of shared memory accesses while other replicas read from that log. The second scheme uses deterministic multithreading, meaning that for the same input, a multithreaded program always has the same lock interleaving. Unlike record/replay systems, this eliminates the requirement for communication between the redundant processes for shared memory accesses. Both of our schemes are implemented totally in software, and require no modification of the kernel. For record/replay, the overhead is less than 25% for all benchmarks, down to approximately 0% for some benchmarks. For deterministic multithreading, it depends on the usage of shared memory. With low shared memory usage, the overhead is less than 25%, while with high usage, it can reach up to 56%. We also implemented a hardware extension to aid in deterministic multithreading, to improve the performance and scalability. We compared our deterministic multithreading scheme (DetLock) to other approaches, for example Kendo, and showed an overhead reduction of up to 189% for 8 threads. Even when Kendo is assisted with hardware, the hardware assisted DetLock version can improve performance by up to 26% over Kendo.

Keywords—multicore, deterministic execution, shared memory, fault tolerance, reliability.

I. INTRODUCTION

The abundant computational resources available in multicore systems have made it feasible to implement otherwise prohibitively intensive tasks on consumer grade systems. However, these systems integrate billions of transistors to implement multiple cores on a single die, thus raising reliability concerns, as smaller transistors are more susceptible to transient [14] as well as permanent [15] faults.

A common approach for providing fault tolerance is to perform redundant execution of the software. This is done by using the state machine replication approach [16]. In this approach, the replicated copies of a process (known as replicas) follow the same execution sequence and produce the same output if given the same input. This requirement necessitates that the replicas handle non-deterministic events, such as asynchronous signals and non-deterministic functions (such as gettimeofday), deterministically. This is usually done by having one replica log the non-deterministic events and having the other replicas replay them at the same point in program execution. In a shared memory multithreaded program, this also means that the replicas perform non-deterministic shared memory accesses deterministically, so that they do not diverge in the absence of faults.

One way of making sure that the redundant processes access the shared memory in the same order is to perform record/replay, where the leader process records the order of locks (to access shared memory) in a queue which is shared between the leader and follower. The follower in turn reads from that queue to have the same lock acquisition order. This approach is used by Respec [6] and our previous work [10], and is the first approach that we use in this paper. The second approach that we use is deterministic multithreading, where given the same input, a multithreaded process always has the same lock interleaving. This makes sure that the redundant processes acquire the locks in the same order without communicating with each other. We adapt the method used by Kendo [7] to do this, but unlike Kendo, our scheme neither requires deterministic hardware performance counters, which are not available on many platforms [12] (including many x86 systems), nor kernel modification for deterministic execution. The logical clocks used for deterministic execution are inserted by the compiler instead. Moreover, we apply several optimizations to further improve the deterministic execution performance. Furthermore, we also implemented hardware extensions to aid in deterministic multithreading. Using this hardware decreases portability, but gives a significant improvement in performance and scalability.

We can sum up the contributions of this paper as follows.
1) We discuss the implementation of our two schemes for deterministic execution on multicore platforms for fault tolerance. Both of our schemes are implemented using a user-level library and do not require a modified kernel. Secondly, both of our schemes are very portable, since they do not depend upon any special hardware for deterministic execution.
2) We compare the advantages and disadvantages of both schemes in terms of performance, memory consumption and reliability.
3) We compare with existing approaches and show that our schemes have better performance on most of the selected benchmarks.
4) We also implement hardware extensions for deterministic multithreading to improve performance and scalability.

In Section II we discuss the background and related work. In Section III, we give an overview of our fault tolerance method. This is followed by Section IV, where we discuss

2 the detailed fault tolerance implementation. In Section V, memory model where every thread writes to its own private we discuss record/replay, while in Section VI, we discuss memory, while data to shared memory is committed only at deterministic multithreading. In Section VII, we evaluate the intervals. However, stopping threads regularly for committing performance of our method. Finally, we conclude the paper to shared memory degrades performance as demonstrated by with Section VIII. CoreDet [2], which has a maximum overhead of 11x for 8 cores. We can reduce the amount of committing to the shared II.BACKGROUNDANDRELATEDWORK memory by only committing at synchronization points such as A fault tolerant system which uses redundant execution locks, barriers or thread creation. This approach is taken by needs to make sure that the redundant processes do not diverge DTHREADS [11]. Here one can still imagine the slowdown in the absence of faults, that is, they should have the same in case of applications with high lock frequencies. Moreover, states for the same input. In a single threaded program, in the since in this case committing to the shared memory is done less absence of any fault, the only possible causes of divergence frequently, more data has to be committed, thus also making among the replicas can be non-deterministic functions (such it slow for applications with high memory usage. This is why as gettimeofday) or asynchronous signals/interrupts. hardware approaches have been proposed to increase efficiency However, in multithreaded programs running on multicore of deterministic execution. An example of such approach is processors, there is one more source of non-determinism, Calvin [4], which uses the same concept as CoreDet for which is shared memory accesses. These accesses are much deterministic execution but make use of a special hardware more frequent than interrupts or signals. Therefore, efficient for that purpose. deterministic execution of replicas in such systems is much Since performing deterministic execution in software alone more difficult to achieve. is inefficient, one can relax the requirements to improve effi- One method to ensure redundant processes access shared ciency. For example, Kendo [7] does this by only supporting memory in the same order is record/replay. In this method, deterministic execution for well written programs that protect all interleaving of shared memory accesses by different cores every shared memory access through locks. In other words, or processors are recorded in a log, which can be replayed to it supports deterministic execution only for programs without have a replica which follows the original execution. Examples race conditions. The authors of Kendo call it Weak Determin- of schemes using this method are Rerun [5] and Karma [1]. ism. Considering the fact that most well written programs are These schemes intercept cache coherence protocols to record race free and there exist tools to detect race conditions, such inter-processor data dependencies, so that they can be replayed as Valgrind [20], Weak Determinism is sufficient for most well later on, in the same order. While Rerun only optimizes written multithreaded programs. recording, Karma optimizes both recording and replaying, The basic idea of Kendo is that it uses logical clocks for thus making it suitable for online fault tolerance. It shows each thread to determine when a thread will acquire a lock. good scalability as well. 
The disadvantage of record/replay The thread with the least value of logical clock gets the approaches as compared to deterministic multithreading is lock. Though being quite efficient, Kendo still suffers from that they require a large memory for recording. Moreover, portability problems. First of all, it requires deterministic hard- when used for fault tolerance, the redundant processes need ware performance counters for counting logical clocks. Many to communicate with each other for shared memory accesses, popular platforms (including many x86 platforms) do not have as one replica records the log while the other reads from any hardware performance counters that are deterministic [12]. it. Respec [6] is a record/replay software approach that only Secondly, Kendo needs modification of the kernel to allow logs synchronization objects rather than every shared memory reading from the hardware performance counters. access. If divergence is found between the replicas, it rolls- back and re-executes from a previous checkpoint. However, III.FAULT TOLERANCE METHOD if divergence is found again on re-execution, a race condition is assumed. At that point, a stricter deterministic execution is Our fault tolerant method is intended to reduce probability performed, which has a larger overhead. of failures in the presence of transient faults. The block The disadvantage of employing record/replay for determin- diagram of our fault tolerance method is shown in Figure 1. istic shared memory accesses is that it requires communication Initially, the leader process (which is the original process between the replicas for shared memory accesses, making the highlighted in the figure) creates the watchdog and follower fault tolerant method less reliable as the shared buffer used processes. The follower process is identical to the leader for communication can itself become corrupted by one of the process and follows the same execution path. The execution replicas. Moreover it requires extra memory. is divided into time slices known as epochs. An epoch starts To eliminate this communication and memory requirement and ends at a program barrier. At the end of each epoch, the for shared memory accesses, we can employ deterministic memories of the leader and follower processes are compared. multithreading, where a multithreaded process has always If no divergence is found, a checkpoint is taken and the output the same memory interleaving for the same input. The ideal to files or screen is committed. The previous checkpoint is also situation would be to make a multithreaded program deter- deleted. The checkpoint is basically a suspended process which ministic even in the presence of race conditions, that is, is identical to the leader process at the time the checkpoint provide Strong Determinism. This is not possible to do ef- is taken. If a divergence is found at the end of an epoch, ficiently with software alone though. One can use a relaxed execution is restarted from the last checkpoint by resuming 68 5.C OMPARISONOFFAULTTOLERANCEMETHODS


the checkpoint process and killing the leader and follower processes. When the checkpoint process starts, it becomes the leader and creates its own follower. It might also happen that the leader or follower processes are unable to reach the end of an epoch, due to some error which hangs them. In that case, the watchdog process detects those hangs by using timeouts and signals the checkpoint process to start. The watchdog process itself is less vulnerable to transient faults, as it remains idle most of the time.

Fig. 1. Block diagram of our fault tolerance method (leader, follower, checkpoint and watchdog processes communicating through shared memory)

Figure 2 shows the stages of fault tolerance that our method implements. The method starts by ensuring deterministic execution, followed by error detection and then recovery. For deterministic execution, we implemented and evaluated two different schemes: record/replay (see Section V) and deterministic multithreading (see Section VI).

Fig. 2. Stages of our fault tolerance method (deterministic execution via record/replay or deterministic multithreading, the two schemes used in this paper; error detection via double modular redundancy; recovery via checkpoint/rollback)

IV. IMPLEMENTATION

In this section, we discuss the implementation of our fault tolerance method. We start with Section IV-A, where we discuss how deterministic execution of the replicas is performed. This is followed by Section IV-B, which discusses error detection. Finally, in Section IV-C, we discuss our recovery mechanism.

A. Deterministic execution

For deterministic execution, we need to ensure that the replicas use the same memory addresses. We also need to ensure determinism in the presence of non-deterministic functions and shared memory accesses. Moreover, we need to make sure that the leader and follower processes use the same memory addresses, for which we need a deterministic memory allocation scheme. Finally, we also need to make sure that we have deterministic I/O. Below we discuss how we handle these issues.

1) Replica creation: Our library assumes that threads in the application are created once at the start of the application. Therefore, we create the follower process at that point in the code where the threads are created. For this purpose, we replace the pthread_create function with our own to make sure the threads of the replicas use the same memory addresses. More detail on this can be found in [10].

2) Memory allocation: We implement our own memory allocation functions to allocate memory deterministically. Our implementation replaces the locks in the original memory allocation functions with our own deterministic locks. The variables used by our library (not related to original program execution) to perform deterministic execution may have different values for the leader and follower processes, for example, the flag used to distinguish the leader process from the follower process. For these variables, we use a separate memory, which is allocated with the mmap system call. This memory is not compared for error detection.

3) Deterministic shared memory accesses: We have used two different schemes for ensuring deterministic shared memory accesses, as shown in Figure 2. The first one is record/replay, discussed in Section V, and the second one is deterministic multithreading, discussed in Section VI.

4) System calls, non-deterministic functions and I/O: We use LD_PRELOAD to preload the system call wrappers found in glibc with our own versions, which log the type, input parameters and output of the system calls. The type and input parameters are then read by the follower for error detection, while the output logged by the leader is used directly by the follower, instead of actually executing the system call. Working only with user-space wrappers of system calls is possible, because most system calls are usually made through their user-space wrapper functions. This method will not work, however, if a system call is made without using the wrapper function, for example, by using inline assembly. So, with our library, the programmer needs to make sure not to make a system call directly. Since the glibc library sometimes also makes system calls directly, for example, by making the clone system call in the pthread_create function, we provide our own version of pthread_create. We also provide our own versions of non-deterministic functions such as rand and preload them using LD_PRELOAD. The follower uses the output logged by the leader for such functions. Each system call is protected by a deterministic lock to make sure that system calls occur in the same order in the replicas. For the system call mmap, which modifies the address space of the process, we take a special approach. The follower still uses the address returned to the leader, but passes it as a parameter combined with the MAP_FIXED flag when calling the mmap function, to ensure that the follower uses the same memory addresses as the leader.
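The special handling of mmap can be sketched like this (our illustration; log_write_ptr and log_read_ptr are hypothetical stand-ins for the shared log):

    #include <stdbool.h>
    #include <sys/types.h>
    #include <sys/mman.h>

    extern bool  is_leader;
    extern void  log_write_ptr(void *p);
    extern void *log_read_ptr(void);

    void *det_mmap(size_t len, int prot, int flags, int fd, off_t off)
    {
        if (is_leader) {
            void *p = mmap(NULL, len, prot, flags, fd, off);
            log_write_ptr(p);        /* publish the address the kernel chose */
            return p;
        }
        /* follower: request exactly the leader's address */
        void *leader_addr = log_read_ptr();
        return mmap(leader_addr, len, prot, flags | MAP_FIXED, fd, off);
    }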


For I/O, our library allows deterministic I/O for sequential file access and screen writes. A write to a file or the screen is only performed after making sure that no error occurred during an epoch. For that purpose, no output is committed during an epoch; instead it is buffered. Our library overrides the write and read system call wrappers to allow buffering of the data. The buffers are committed at the end of an epoch, after comparing the buffer contents of the leader and follower using hash-sums. For this purpose, each file opened for writing is allocated a special buffer. It is important that the addresses of these buffers are the same for the leader and follower processes. For this purpose, we use a deterministic memory allocation scheme like the one described in Section IV-A2. For sequential file reading, the file offset value is saved at the end of each epoch, so that the file can be rewound to the previous value in case of rollback.

At this moment, our fault tolerance scheme does not work with programs that use inter-process communication (through pipes and shared memory, for example). The only form of I/O allowed is disk I/O and screen output. Moreover, our scheme assumes that there are no data races in the program. Lastly, we have not added functionality to handle asynchronous signals. However, this functionality can be added for user space by handling asynchronous signals at synchronous points, such as system calls, as done by Scribe [17].

B. Error detection

At regular intervals of 1 second, known as epochs, the dirtied (modified) memory pages of the leader and follower are compared. However, the epoch time is reduced to 100 ms if a file or screen output occurs during the epoch. Instead of comparing the memories page by page, the leader and follower processes calculate hash-sums of the dirtied (modified) memory pages, which are then compared. If a discrepancy is found, a fault is detected.

The comparison is made even faster by assigning each thread to calculate the hash-sum of a different portion of the memory. The leader keeps its hash-sums in shared memory so that the follower can read them from there for comparison. We perform the memory comparison at barriers which are already found in the program, rather than stopping the threads and creating a barrier. This improves the performance, as the threads already wait for each other at barriers. If insufficient barriers are found in the program, the programmer can insert calls to the function potential_barrier_wait, which is provided by our library. This function creates a barrier only when required, that is, at the end of an epoch.

Since our scheme runs at the user level, we cannot note down dirtied pages while handling page faults (from the kernel), the way Respec does. Therefore, we take special steps to improve its performance. At the start of each epoch, we give only read access to the allocated memory pages. Whenever a page is accessed for writing, the OS sends a signal to the accessing thread. In the signal handler, the address of the memory page is noted down and both read and write access is given to that memory page. In this way, we only need to compare the dirtied memory pages at the end of an epoch. Sending a signal on each memory page access violation can slow down execution. Therefore, to reduce the number of such signals, we exploit the spatial locality of data by segmenting memory into groups of multiple pages, as shown in Figure 3. A write on any part of a read-protected segment of N pages is handled by giving write access to all N pages in that segment.

Fig. 3. Segmentation of memory to reduce memory checking overhead

Some functions, like those for comparing memories, change the stacks differently for the leader and follower threads. For those purposes, we switch to a temporary stack, so that the original stack remains unaltered by such functions.

The watchdog process is used to detect and recover from timeout errors. Details can be found in [10].

C. Recovery

As discussed previously, for fault recovery we use checkpoint/rollback. Whenever the leader takes a checkpoint, it kills the previous checkpoint. If the leader process detects an error, or the watchdog process detects a hang, a signal is sent to the last checkpoint process, so that the checkpoint process can start execution. The leader and its follower are killed at that point. The checkpoint process then assumes the role of the leader and forks its own follower. It also creates a new checkpoint. Checkpoints are taken only at barrier points. For creating a multithreaded follower, we have implemented a special multithreaded fork function that replicates the leader process to create the follower. More detail on this can be found in [10].

V. RECORD/REPLAY

For redundant deterministic execution, it is necessary that the leader and follower processes perform shared memory accesses in the same order. For this purpose, each mutex is enclosed in a special data structure, which also contains a pointer to the clocks for that mutex, to aid in deterministic execution. Whenever a thread in the leader process acquires a mutex, it increments the mutex's clock. A thread in the follower only acquires the same mutex in its execution when its clock matches that of the corresponding thread in the leader.

We create our own deterministic versions of pthread's synchronization functions. Since pthread_mutex_lock is the most widely used, and is also used in our implementation of the other pthread synchronization functions, we discuss our pthread_mutex_lock algorithm here, which is shown in Algorithm 1. We also have our own versions of the data structures representing the synchronization objects, for example, pthread_mutex_log_t instead of pthread_mutex_t. Here m represents an object of the pthread_mutex_log_t structure, which holds a mutex and its clocks. There is one such object for each mutex in the program.
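In C, the wrapped-mutex structure and the queue entries it is logged through might look like the following sketch (field names and the qentry type are illustrative; only pthread_mutex_log_t is named in the text):

    #include <pthread.h>

    #define QCAPACITY 1024

    struct qentry {                /* one logged synchronization operation  */
        int           type;        /* e.g. MUTEXLOCK; 0 marks an empty slot */
        void         *addr;        /* which mutex was acquired              */
        unsigned long clock;       /* mutex clock value after acquisition   */
        int           r;           /* stored result + 1, so 0 means empty   */
    };

    typedef struct {
        pthread_mutex_t mutex;     /* the real lock                         */
        unsigned long   clock;     /* per-mutex acquisition counter         */
    } pthread_mutex_log_t;         /* one object per mutex in the program   */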


Therefore, deterministic access to a mutex is independent of the other mutexes in the program, hence improving scalability.

Algorithm 1 Pseudocode for deterministic lock

    function R_PTHREAD_MUTEX_LOCK(ref pthread_mutex_log_t m)
        q = GetQueue(tid)             ▷ There is a separate queue for each thread
        if isLeader then
            r = lock(m.mutex)
            if r == 0 then            ▷ Only on a successful lock call, increment the clock
                m.clock = m.clock + 1 ▷ m.clock does not need to be atomic
            end if
            while !pushq(q, MUTEXLOCK, m.mutex, m.clock, r)
            end while
            return r
        else                          ▷ Follower
            while !popq(q, ref type, ref mutex, ref clock, ref r)
            end while
            if type != MUTEXLOCK or mutex != m.mutex then
                SignalErrorAndExit()  ▷ Logged parameters do not match
            end if
            if r != 0 then
                return r
            end if
            while (m.clock + 1) != clock
            end while
            lock(m.mutex)
            m.clock = m.clock + 1
            return 0
        end if
    end function

    function PUSHQ(q, type, addr, clock, r)                  ▷ Called by the leader
        lindex = GetLeaderQIndex(tid)
        if checkQElementsForZero(lindex) then
            q[lindex].type = type
            q[lindex].addr = addr
            q[lindex].clock = clock
            q[lindex].r = r + 1
            SetLeaderQIndex((lindex + 1) % QCAPACITY)
            return TRUE
        else
            return FALSE
        end if
    end function

    function POPQ(q, ref type, ref addr, ref clock, ref r)   ▷ Called by the follower
        findex = GetFollowerQIndex(tid)
        if checkQElementsForNonZero(findex) then
            type = q[findex].type
            addr = q[findex].addr
            clock = q[findex].clock
            r = q[findex].r - 1
            setQElementsToZero(findex)
            SetFollowerQIndex((findex + 1) % QCAPACITY)
            return TRUE
        else
            return FALSE
        end if
    end function

When a leader thread acquires a mutex, it increments the leader's clock for that mutex and also records that value in a circular queue, so that the corresponding follower thread can acquire the mutex when its own clock reaches one less than the same value. The communication between the leader and follower processes is shown in Figure 5. After acquiring the mutex, the follower also increments its clock for that mutex.

Fig. 4. An example sequence for deterministic shared memory access with a mutex, using our algorithm

An example sequence using our deterministic lock and unlock algorithm is shown in Figure 4, where we can see that the threads in the follower process acquire a mutex in the same order as in the leader process. The queue elements which the threads are currently writing to or reading from are shown in bold. We can see that thread 1 of the follower process is unable to acquire the mutex while the queue is empty. Also, thread 0 of the follower process is unable to acquire the mutex until thread 0 increments followerClock such that followerClock+1 equals the value which is being read from the queue.

Fig. 5. Communication between the leader and follower processes for deterministic execution (each leader thread produces into a per-thread queue in shared memory, from which the corresponding follower thread consumes)

Unlike Respec, which uses a hash table of 512 entries to keep the clocks for all the synchronization objects, we use a separate clock for each mutex. The benefit of this is that we can avoid using atomic variables for accessing the clocks, as the clock can be incremented after acquiring the lock. We also optimize the queue access by avoiding atomic variables and avoiding true and false sharing of cache lines. For that purpose, we use a lockless queue, as shown by the pushq and popq functions in Algorithm 1. This is unlike Respec, which uses atomic operations if necessary to access the queue. The typical method of using a lockless queue (which we call naive in this paper) is to use shared tail and head indexes. Since in this method the producer and consumer read the head or tail indexes at the same time as the other is writing to them, this causes cache thrashing. Hence it is a true sharing problem. We avoid this by having local indexes for the producer (leader) and consumer (follower). The check for emptiness and fullness is done by checking the data values instead. The producer only writes to the queue when the queue element it is about to write to is zero, while the consumer only reads when the queue element is non-zero. Here, since the value of r, which represents the result returned by a synchronization function, can be zero, we add one to its value when pushing and subtract one from it when popping. We make sure that the indexes for the leader and follower do not share the same cache line by having sufficient padding between them. This makes sure that we do not have the problem of false sharing.
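A C sketch of this lockless single-producer/single-consumer queue is given below (ours, simplified from Algorithm 1; a production version would also need memory-ordering guarantees so that the r field is observably written last):

    #define QCAPACITY 1024

    struct qentry { int type; void *addr; unsigned long clock; int r; };

    struct spsc_queue {
        struct qentry slots[QCAPACITY];
        int  lindex;               /* used only by the leader (producer)   */
        char pad[64];              /* keep the two indexes on separate     */
        int  findex;               /* cache lines to avoid false sharing   */
    };

    int pushq(struct spsc_queue *q, int type, void *addr,
              unsigned long clock, int r)
    {
        struct qentry *e = &q->slots[q->lindex];
        if (e->r != 0)
            return 0;              /* slot not yet consumed: queue is full */
        e->type  = type;
        e->addr  = addr;
        e->clock = clock;
        e->r     = r + 1;          /* written last; nonzero marks it ready */
        q->lindex = (q->lindex + 1) % QCAPACITY;
        return 1;
    }

    int popq(struct spsc_queue *q, int *type, void **addr,
             unsigned long *clock, int *r)
    {
        struct qentry *e = &q->slots[q->findex];
        if (e->r == 0)
            return 0;              /* nothing produced yet: queue is empty */
        *type  = e->type;
        *addr  = e->addr;
        *clock = e->clock;
        *r     = e->r - 1;
        e->r   = 0;                /* hand the slot back to the leader     */
        q->findex = (q->findex + 1) % QCAPACITY;
        return 1;
    }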

6 is non-zero. Here, since the value of r, which represents the 1) Compile time optimizations: At compile time, we mod- result returned by a synchronization function can be zero, we ify the LLVM intermediate code to improve deterministic add one to its value while pushing and subtract one from it execution performance. These optimizations are discussed when popping. We make sure that the indexes for leader and below. follower do not share the same cache line by having sufficient Optimization 1 (Function Clocking) - As discussed previ- padding between them. This makes sure that we do not have ously, the sooner the clocks are updated, the better, and leaf the problem of false sharing. functions (functions that do not call any function) with only one basic block are perfect candidates for such an optimiza- tion. Clocks can be removed from such functions and instead VI.DETERMINISTIC MULTITHREADING be added to the basic blocks calling such functions. Beside Our tool for deterministic multithreading which we call functions with only one blocks, our method also considers leaf DetLock, uses Kendo’s algorithm to perform deterministic ex- functions with multiple blocks, given that there are no loops in ecution. However, unlike Kendo which requires deterministic such functions. If our pass sees that all possible paths taken hardware performance counters, which are not available on by such a function do not differ by much, we calculate the many platforms, such as most Intel and AMD processors. mean value for all possible paths and use that mean value DetLock inserts code to update logical clocks at compile to update the clock. We call such leaf functions as clocked time. This also means that we do not need to modify the functions. By intuition, we can judge that it is also possible kernel which is required by Kendo to read from performance to clock functions which call only clocked functions. In this counters. Figure 6 shows the compilation step where our way, we can even clock functions which are not necessarily compiler pass executes, which is after generating the LLVM leaf functions. IR (Intermediate Representation) code and before generating Optimization 2 - (Conditional Blocks Optimization) This the final binary code. optimization is based on the principle that if a block has two The logical clock is incremented by 1 with each instruction or more successors, given that those successors are not merge in the code. For instructions which take more than one nodes, we can make the successor with the least clock zero clock cycles, the logical clock is updated according to the and subtract its original value from all its siblings, while also approximate number of clock cycles they take. adding its original clock to the parent block. Another principle The Kendo’s method of acquiring locks deterministically of this optimization is that if all predecessors of a merge block works by giving the lock to the thread with the minimum clock have that merge block as their only successor, the clocks could first. So basically our purpose is not only to reduce the code be shifted from the merge node to them. It should be noted that that updates the clocks but also to update the clocks as soon as after having parsed all the blocks of a function and applying possible, so that the thread waiting for a lock will be able to this optimization, if it is still possible to apply this optimization acquire it more quickly. In fact, at compile time it is possible to once more, it is applied. 
A. Optimizations

We apply two types of optimizations: firstly, the compile time optimizations, which are done during the source code compilation, and secondly, the application level optimizations, which require modifying the source code. These optimizations are discussed below.

1) Compile time optimizations: At compile time, we modify the LLVM intermediate code to improve deterministic execution performance. These optimizations are discussed below.

Optimization 1 (Function Clocking) - As discussed previously, the sooner the clocks are updated, the better, and leaf functions (functions that do not call any function) with only one basic block are perfect candidates for such an optimization. Clocks can be removed from such functions and instead be added to the basic blocks calling them. Besides functions with only one block, our method also considers leaf functions with multiple blocks, given that there are no loops in such functions. If our pass sees that all possible paths taken through such a function do not differ by much, we calculate the mean clock value over all possible paths and use that mean value to update the clock. We call such leaf functions clocked functions. By intuition, we can judge that it is also possible to clock functions which call only clocked functions. In this way, we can even clock functions which are not necessarily leaf functions.

Optimization 2 (Conditional Blocks Optimization) - This optimization is based on the principle that if a block has two or more successors, given that those successors are not merge nodes, we can set the clock of the successor with the least clock to zero and subtract its original value from all its siblings, while also adding its original clock to the parent block. Another principle of this optimization is that if all predecessors of a merge block have that merge block as their only successor, the clocks can be shifted from the merge node to them. It should be noted that after having parsed all the blocks of a function and applied this optimization, if it is still possible to apply this optimization once more, it is applied.

Optimization 3 (Averaging of Clock) - This optimization is based on the fact that the paths emanating from a block in a function can be close together in total clock values. One can see it as a generalization of Function Clocking: for Function Clocking, we just considered the paths emanating from the entry block, but here we also check paths from blocks other than the entry block. When forming paths for a block, we only consider blocks dominated by it (execution must pass through the dominating block to reach its dominated blocks). We also make sure there are no loops or unclocked functions in such paths. If we find such a block in a function, we remove the clocks from all the blocks in the averaged paths and assign the mean clock value to that block.

Optimization 4 (Loop Optimization) - This optimization considers the fact that loops are often executed multiple times. For example, if you have a for loop, the increment operation takes place just before the next iteration. Therefore, we check for backedges, and if we see that the clock of the block from which the backedge originates is less than a certain threshold value and is also less than the clock of the block it is jumping to, we merge its clock value into that block's clock and remove the clock updating code from it. Another part of this optimization is incrementing the clock even before the loop is executed. This is possible if the number of iterations is contained in a variable or represented by a number and is therefore known at compile time, and there is no function call inside the loop. If there are conditions inside the loop, the minimum clock of all possible execution paths multiplied by the number of iterations can be added before executing the loop, and the rest can be updated during the loop execution; but if there are no conditions inside the loop, the number of instructions inside the loop can simply be multiplied by the number of iterations of the loop. In this case, there is no need to update the clock inside the loop at all. Updating the clock before executing the loop has two benefits. First, it reduces the clock updating overhead. Secondly, it also improves deterministic execution time, as threads can now update their clocks faster, and therefore any thread which is waiting for the other threads to go past it in clock values has to wait less to acquire a lock. More details about this optimization can be found in [9]. Several more optimizations are applied, which are also reported in [9].

2) Application level optimizations: Some optimizations are too complex or impossible for the compiler to perform, as they may require application knowledge to work. Such optimizations require the programmer to slightly modify the code. We discuss these optimizations below.


Optimization 5 (Loop Peeling) - By loop peeling, we can accentuate the effect of certain optimizations that we apply for improving the performance of deterministic execution. For example, with Optimization 4, we can increment the clock even before the loop is executed. However, if we have locks inside the loop, we cannot apply this optimization. By rewriting the loop such that we obtain a loop without locks, we can improve the performance. An example loop is given in Figure 7. Due to the mutex lock and unlock calls, we cannot update the clock before this loop is executed.

    for (i = 0; i < n; i++)
    {
        .....
        if (get_lock_cond(i))
        {
            lock(mutex);
            perform_calc(param);
            unlock(mutex);
        }
        else
            perform_calc(param);
    }

Fig. 7. Example loop for Optimization 5

But if we rewrite this loop as shown in Figure 8, the second loop, which starts on line 11, does not contain any locks, and we can update the clock before executing it.

    1   flag = 0;
    2   cutoff = 0;
    3   for (i = 0; i < n; i++)
    4   {
    5       .....
    6       insert_iter_param(param);
    7       flag = flag | get_lock_cond(i);
    8       cutoff = cutoff + !flag;
    9   }
    10
    11  for (i = 0; i < cutoff; i++)
    12  {
    13      .....
    14      perform_calc(get_iter_param());
    15  }
    16
    17  for (i = cutoff; i < n; i++)
    18  {
    19      .....
    20      if (get_lock_cond(i))
    21      {
    22          lock(mutex);
    23          perform_calc(get_iter_param());
    24          unlock(mutex);
    25      }
    26      else
    27          perform_calc(get_iter_param());
    28  }

Fig. 8. Modified example loop for Optimization 5

Note that it is possible that the condition for the mutex lock and unlock is never true during the original loop execution. In that case, the third loop, which starts on line 17, would not be executed at all. However, if the condition does hold true at least once during the loop execution, then we note down the first index at which it was true (by using the variables flag and cutoff) and only use the third loop from that index onwards.

Optimization 6 (Mutually exclusive accesses of synchronization objects) - This optimization consists of two parts, which are discussed below.

Optimization 6.1 - In some parallel applications, it is possible that not all the threads access the same shared memory. For example, in Figure 9 we show a grid based application, where the accesses to the elements on the borders need to be synchronized. These border areas are denoted by Ix (I1, I2, etc.). It is obvious that some threads do not share borders with each other. For example, Thread 0 does not share any border with threads 4, 5, 6 and 7. This means that they never access any shared memory between them. Therefore, when comparing clocks for Thread 0, we do not need to compare its clocks with those of threads 4, 5, 6 and 7. In other words, Thread 0 never needs to wait for threads 4, 5, 6 and 7 to acquire a lock.

Optimization 6.2 - Secondly, even for those threads that do share borders, it is not always necessary for one thread to wait for another. For example, although threads 0 and 1 share borders, we know that when Thread 0 is accessing shared memory in the region I2, it does not need to wait for Thread 1. This is because Thread 1 never accesses any shared memory from the region I2. A software sketch of these mutual exclusion constraints is shown below.
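The following is a minimal sketch of how Optimization 6 can be realized in software: each thread only compares its clock against the threads it can ever share memory with. The interaction masks and thread count are illustrative assumptions, not values from the actual implementation.

    #include <stdint.h>

    #define NTHREADS 8

    volatile uint64_t clk[NTHREADS];

    /* interact[t] has bit u set if thread t ever synchronizes with
     * thread u; the bit for the thread itself is ignored. For the grid
     * of Figure 9, thread 0 would not have the bits for threads 4..7
     * set. The masks below are example values. */
    static const uint32_t interact[NTHREADS] = {
        0x0F, 0x0F, 0x0F, 0x0F, 0xF0, 0xF0, 0xF0, 0xF0
    };

    int may_acquire(int me)
    {
        for (int t = 0; t < NTHREADS; t++) {
            if (t == me || !(interact[me] & (1u << t)))
                continue;  /* never shares memory with t: skip its clock */
            if (clk[t] < clk[me] || (clk[t] == clk[me] && t < me))
                return 0;
        }
        return 1;
    }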
Fault Tolerance on Multicore Processors using Deterministic Execution: Using Record/Replay and Deterministic Multithreading

Hamid Mushtaq, Zaid Al-Ars, Koen Bertels
Computer Engineering Laboratory, Delft University of Technology
{H.Mushtaq, Z.Al-Ars, K.L.M.Bertels}@tudelft.nl

Abstract—This paper describes a software based fault tolerance approach for multithreaded programs running on multicore processors. Redundant multithreaded processes are used to detect soft errors and recover from them. Our scheme makes sure that the execution of the redundant processes is identical even in the presence of non-determinism due to shared memory accesses. This is done by making sure that the redundant processes (replicas) acquire the locks for accessing the shared memory in the same order. We test, evaluate and compare two different schemes for that purpose. The first one uses a record/replay technique, where one replica logs the order of shared memory accesses and the others read from that log. The second scheme uses Deterministic Multithreading, meaning that for the same input, a multithreaded program always has the same lock interleaving. Unlike record/replay systems, this eliminates the requirement for communication between the redundant processes. Both of our schemes are implemented totally in software, requiring no special hardware, making them very portable. Furthermore, our schemes are totally implemented at user-level, requiring no modification of the kernel. The empirical results show that record/replay is more efficient at the cost of more memory as compared to Deterministic Multithreading for benchmarks with high shared memory usage. Otherwise, the performance of these two schemes is comparable.

I. INTRODUCTION

The abundant computational resources available in multicore systems have made it feasible to implement otherwise prohibitively intensive tasks on consumer grade systems. However, these systems integrate billions of transistors to implement multiple cores on a single die, thus raising reliability concerns, as smaller transistors are more susceptible to both transient [14] as well as permanent [15] faults.

A common approach for providing fault tolerance is to perform redundant execution of the software. This is done by using the state machine replication approach [16]. In this approach, the replicated copies of a process (known as replicas) follow the same execution sequence and produce the same output if given the same input. This requirement necessitates that the replicas handle non-deterministic events, such as asynchronous signals and non-deterministic functions (such as gettimeofday), deterministically. This is usually done by having one replica log the non-deterministic events and having the other replicas replay them at the same point in program execution. In a shared memory multithreaded program, this also means that the replicas perform non-deterministic shared memory accesses deterministically, so that they do not diverge in the absence of faults.

One way of making sure that the redundant processes access the shared memory in the same order is to perform record/replay, where the leader process records the order of locks (to access shared memory) in a queue which is shared between the leader and follower. The follower in turn reads from that queue to follow the same lock acquisition order. This approach is used by Respec [6] and our previous work [10], and is the first approach that we use in this paper. The second approach that we use is Deterministic Multithreading, where, given the same input, a multithreaded process always has the same lock interleaving. This makes sure that the redundant processes acquire the locks in the same order without communicating with each other. We adapt the method used by Kendo [7] to do this, but unlike Kendo, our scheme requires neither deterministic hardware performance counters, which are not available on many platforms [12] (including many x86 systems), nor kernel modification for deterministic execution. The logical clocks used for deterministic execution are inserted by the compiler instead.

We can sum up the contributions of this paper as follows:
1) Both of our schemes are implemented using a user-level library and do not require a modified kernel.
2) Both of our schemes are very portable, since they do not depend upon any special hardware for deterministic execution.
3) We compare the advantages and disadvantages of both schemes in terms of performance, memory consumption and reliability.

In Section II, we discuss the background and related work, while in Section III, we discuss our fault tolerance scheme. This is followed by Section IV, where we discuss the implementation. In Section V, we discuss Record/Replay, while in Section VI, we discuss Deterministic Multithreading. In Section VII, we evaluate the performance of our schemes, and we finally conclude the paper with Section VIII.
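As an aside, the deterministic locking idea described above can be sketched as follows. This is a simplified wait-for-turn scheme in the style of Kendo, assuming compiler-maintained per-thread logical clocks; it glosses over details such as pausing a thread's clock while it spins, and all names are illustrative.

    #include <pthread.h>
    #include <stdint.h>

    #define NTHREADS 4

    volatile uint64_t clk[NTHREADS];   /* logical clock of each thread */

    /* A thread may acquire a lock only when no other thread has a
     * smaller clock (ties broken by thread id), so every run with the
     * same input yields the same lock interleaving. */
    static int my_turn(int me)
    {
        for (int t = 0; t < NTHREADS; t++) {
            if (t == me)
                continue;
            if (clk[t] < clk[me] || (clk[t] == clk[me] && t < me))
                return 0;   /* a thread with a smaller clock goes first */
        }
        return 1;
    }

    void det_lock(int me, pthread_mutex_t *m)
    {
        while (!my_turn(me))
            ;                /* spin until it is deterministically our turn */
        pthread_mutex_lock(m);
        clk[me]++;           /* move past this acquisition so others proceed */
    }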





Fig. 9. Mutual exclusivity of shared memory accesses

Fig. 11. Communication between threads with hardware-assisted deterministic execution


Fig. 10. Communication between threads with software-only deterministic execution


B. Hardware extension

Figure 10 shows how the threads update and read the clocks for deterministic execution in DetLock. Note that in the case of Kendo, the clocks are updated by the kernel on receiving hardware interrupts when a certain store count is reached. For a small number of cores, the overhead of reading the clocks of all threads may not be high. However, this does not scale well to a higher number of cores. Moreover, for DetLock, there is also overhead induced by cache coherence, since while one thread is updating its clock, other threads might be reading from it, thus causing invalidations for cache coherence.

To reduce these overheads, we can build hardware which reads the clocks from all the threads and outputs 1 if a thread satisfies the condition for acquiring a lock and 0 otherwise. In this way, threads do not need to read the clocks of the other threads, making this process faster and avoiding the cache invalidations needed for maintaining cache coherence. Figure 11 shows the hardware-assisted process of writing the clocks and reading the conditions for acquiring a lock. We use the same hardware for both the hardware-assisted DetLock and Kendo versions, which are discussed below.

1) DetLockHW: The block diagram of the hardware which assists DetLock is shown in Figure 12. Each core writes to one of the input registers of the comparator. The comparator writes 1 to the output register whose corresponding input register contains the smallest clock value, while writing 0 to all the other output registers. In this way, a thread can know whether to acquire a lock by just reading the value of its corresponding output register. In case of two or more input registers having the smallest value, the one with the least index has 1 written to its corresponding output register. Through this hardware, there is no need for the threads to read each other's clocks, and no overhead is incurred due to the cache invalidations for maintaining cache coherence that occur in DetLock. For Optimization 6, we can skip comparing the clocks of some cores. To this end, each core has a register to represent its interaction with the other cores, where the bits of that register represent the cores. The clock of a core is compared if its corresponding bit value is 1 and ignored otherwise. A sketch contrasting the software-only and hardware-assisted acquire checks is shown below.

Fig. 12. Block diagram of the proposed hardware to perform deterministic execution
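The following sketch contrasts the software-only acquire check with the hardware-assisted one. CMP_INPUT and CMP_OUTPUT are hypothetical memory-mapped registers of the proposed comparator; the names and addresses are made up for illustration.

    #include <stdint.h>

    #define NTHREADS 4

    volatile uint64_t clk[NTHREADS];

    volatile uint64_t *const CMP_INPUT  = (volatile uint64_t *)0xF0000000;
    volatile uint64_t *const CMP_OUTPUT = (volatile uint64_t *)0xF0000040;

    /* Software-only: a waiting thread polls all other threads' clocks,
     * which pulls their cache lines back and forth on every update. */
    int may_acquire_sw(int me)
    {
        for (int t = 0; t < NTHREADS; t++)
            if (t != me &&
                (clk[t] < clk[me] || (clk[t] == clk[me] && t < me)))
                return 0;
        return 1;
    }

    /* Hardware-assisted: each core publishes its clock to its input
     * register and polls only its own output register; the comparator
     * writes 1 there when this core holds the smallest clock (lowest
     * index wins ties), so no inter-core cache traffic is needed. */
    int may_acquire_hw(int me)
    {
        CMP_INPUT[me] = clk[me];
        return (int)CMP_OUTPUT[me];
    }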


Fig. 13. Performance of record/replay with 2 and 4 threads (2T and 4T)

Fig. 14. Performance of deterministic multithreading with 2 and 4 threads (2T and 4T)

2) KendoHW: The hardware-assisted Kendo version uses the same hardware as the hardware-assisted DetLock version, except that rather than the program writing clock values to the hardware's input, the input is taken directly from the hardware performance counters for retired stores. This eliminates the need for programming interrupts that update the clock value, as in the original Kendo. This also means that the clock value is updated after each retired store, instead of after a certain number of stores depending on the chosen chunk size, as in the original Kendo version.

Fig. 15. Comparison of record/replay and deterministic multithreading for 4 threads. The left column is for record/replay (RR) and the right column for deterministic multithreading (DM).

VII. PERFORMANCE EVALUATION

We selected 7 benchmarks, one from the PARSEC [3] and six from the SPLASH-2 [13] benchmark sets. We ran all our benchmarks on an 8 core (dual socket with 2 quad cores), 2.67 GHz Intel Xeon processor with 32GB of RAM. All programs were compiled using the clang compiler with the maximum optimization level set. The results are shown in Table I.

In this section, we first compare the performance of record/replay with deterministic multithreading. This is followed by a section which discusses the effect of queue size on the performance of record/replay. Next, we compare our record/replay with Respec and our deterministic multithreading scheme with Kendo.

A. Comparison of record/replay and deterministic multithreading performance

The comparison of the performance of record/replay and deterministic multithreading is shown in Table I. For deterministic multithreading, we show results both with and without optimizations. The first column contains the benchmark names, while the second column shows the number of threads used. This is followed by the number of mutexes used in the benchmark. Next, the locks acquired during the execution are shown, followed by the number of memory pages compared for error detection. This is followed by the redundant execution time, which is the time to execute two replicas of the benchmark programs freely, that is, without fault tolerance or deterministic execution. Then we show the deterministic execution time for record/replay, followed by the fault tolerant execution time with record/replay. After that, we show the deterministic execution time with deterministic multithreading, for both the unoptimized and optimized versions. Lastly, we show the fault tolerant execution time with deterministic multithreading using the optimized version.

The performance is also illustrated in Figures 13, 14 and 15. Figure 13 shows the overhead for record/replay with two and four threads, while Figure 14 shows the same for deterministic multithreading.


TABLE I
PERFORMANCE RESULTS OF OUR SCHEME FOR THE SELECTED BENCHMARKS

Benchmark    | Threads | Mutexes | Locks   | Pages compared | Red exec (ms) | RR det (ms) | RR FT (ms) | DM unopt (ms) | DM opt (ms) | DM FT (ms)
Fluidanimate | 2       | 73802   | 3694640 | 3107           | 2079          | 2130 (2%)   | 2130 (2%)  | 2583 (24%)    | 2370 (14%)  | 2370 (14%)
Fluidanimate | 4       | 108242  | 7156250 | 2060           | 1250          | 1292 (3%)   | 1295 (4%)  | 2218 (77%)    | 1662 (33%)  | 1667 (33%)
Water        | 2       | 527     | 125047  | 351            | 2790          | 2841 (2%)   | 2926 (5%)  | 4462 (60%)    | 2870 (3%)   | 2879 (3%)
Water        | 4       | 527     | 188042  | 284            | 1407          | 1438 (2%)   | 1449 (3%)  | 2062 (47%)    | 1432 (2%)   | 1465 (4%)
Ocean        | 2       | 8       | 498     | 299026         | 5290          | 5460 (3%)   | 5844 (10%) | 5330 (1%)     | 5320 (1%)   | 5572 (5%)
Ocean        | 4       | 8       | 996     | 290633         | 2898          | 2917 (1%)   | 3593 (24%) | 2915 (1%)     | 2899 (0%)   | 3607 (24%)
Radiosity    | 2       | 3918    | 9001121 | 29318          | 1539          | 1632 (6%)   | 1680 (9%)  | 2437 (58%)    | 2032 (32%)  | 2066 (34%)
Radiosity    | 4       | 3922    | 8747666 | 29340          | 1005          | 1136 (13%)  | 1180 (17%) | 1907 (90%)    | 1544 (54%)  | 1569 (56%)
Radix        | 2       | 78      | 29      | 29318          | 965           | 966 (0%)    | 1018 (5%)  | 968 (0%)      | 968 (0%)    | 985 (2%)
Radix        | 4       | 86      | 71      | 29340          | 606           | 609 (0%)    | 661 (9%)   | 615 (1%)      | 615 (1%)    | 692 (14%)
Raytrace     | 2       | 3918    | 121957  | 8772           | 1158          | 1162 (0%)   | 1173 (1%)  | 2606 (125%)   | 1207 (4%)   | 1242 (7%)
Raytrace     | 4       | 3922    | 121959  | 8950           | 690           | 694 (1%)    | 696 (1%)   | 816 (18%)     | 760 (10%)   | 765 (11%)
Barnes       | 2       | 2053    | 1606025 | 7624           | 1652          | 1678 (2%)   | 1734 (5%)  | 1985 (20%)    | 1684 (2%)   | 1696 (3%)
Barnes       | 4       | 2053    | 1606033 | 7097           | 1029          | 1108 (8%)   | 1149 (12%) | 1126 (9%)     | 1062 (3%)   | 1099 (7%)


Fig. 16. Reduction in deterministic execution overhead of deterministic multithreading by applying optimizations

TABLE II
PERFORMANCE RESULTS OF OUR SCHEME FOR THE SELECTED BENCHMARKS

Benchmark    | Original (ms) | No Opt (ms) | Opt 1 (ms) | Opt 1+2 (ms) | Opt 1+2+3 (ms) | Opt 1+2+3+4 (ms) | Opt 1+2+3+4+5+6 (ms)
Fluidanimate | 1250          | 2218 (77%)  | 2163 (73%) | 2125 (70%)   | 2075 (66%)     | 1950 (56%)       | 1662 (33%)
Water        | 1407          | 2062 (47%)  | 2068 (47%) | 1829 (30%)   | 1736 (23%)     | 1432 (2%)        | -
Ocean        | 2898          | 2915 (1%)   | 2917 (1%)  | 2913 (1%)    | 2915 (1%)      | 2899 (0%)        | -
Radiosity    | 1005          | 1907 (90%)  | 1742 (73%) | 1620 (62%)   | 1577 (57%)     | 1544 (54%)       | -
Radix        | 606           | 615 (1%)    | 614 (1%)   | 616 (1%)     | 612 (1%)       | 615 (1%)         | -
Raytrace     | 690           | 816 (18%)   | 807 (17%)  | 765 (11%)    | 765 (11%)      | 760 (10%)        | -
Barnes       | 1029          | 1126 (9%)   | 1074 (4%)  | 1072 (4%)    | 1059 (3%)      | 1062 (3%)        | -

Lastly, Figure 15 shows the performance of record/replay side by side with deterministic multithreading for four threads. In all these figures, the lower portion of each bar shows the overhead of deterministic execution, while the whole bar shows the overhead of fault tolerant execution.

The improvement in performance of deterministic multithreading by applying different optimizations is shown in Figure 16. For each benchmark, the leftmost bar shows the overhead without any optimization. The next bar shows the overhead with only Optimization 1 applied; this is followed by the bar which shows the overhead with Optimizations 1 and 2 combined. Similarly, the next bar shows the overhead with Optimizations 1, 2 and 3 combined. The rightmost bar shows the overhead after applying all the optimizations. Since Optimizations 5 and 6 were only applied to the Fluidanimate benchmark, for all the other benchmarks the optimizations are shown only up to Optimization 4. These overheads are also shown in Table II.

From Figure 15, we can see that record/replay performs better than deterministic multithreading for benchmarks with large shared memory accesses, such as Fluidanimate and Radiosity, but for the other benchmarks, there is not much difference. Therefore, in benchmarks with low to moderate shared memory accesses, deterministic multithreading would be a more favorable approach due to its lower memory consumption and less inter-process communication. The scalability of these two approaches also follows a similar trend to their performance, that is, benchmarks with high shared memory accesses scale



Fig. 17. Effect of queue size on performance of the Fluidanimate benchmark

Fig. 18. Comparison of our record/replay approach with that of Respec

better with record/replay, while for the other benchmarks there is not much difference.

In Section VI, we discussed how modifying the code can improve the performance of deterministic multithreading. We discussed two optimizations for that purpose, Optimizations 5 and 6. We show results for both 4 and 8 threads. The results for the Fluidanimate benchmark are shown in Table III. Fluidanimate had an overhead of 92% with 4 threads and 247% with 8 threads for deterministic execution, even with all compile time optimizations applied. By applying Optimization 5, we reduced this overhead to 57% for 4 threads and 190% for 8 threads, as shown in the fifth column of Table III. Modifying the code also improved the performance without deterministic execution, but even relative to that version, the deterministic execution overhead is just 70% (shown after the / in the table) as opposed to 92% for 4 threads, and 228% as compared to 247% for 8 threads. We get even further improvement by applying Optimization 6. With that optimization, we were able to reduce the overhead to as little as 35% for 4 threads and 99% for 8 threads.

The Radiosity and Raytrace benchmarks were not race free, and we had to add some more locks to make them race free. However, we noted that for Radiosity, the deterministic multithreading version even performed deterministically when some races were left, indicating that deterministic multithreading provides stronger determinism than record/replay.

B. Effect of queue size on record/replay performance

The effect of queue size on the record/replay performance of the Fluidanimate benchmark is shown in Figure 17. We can see that the performance increases with the queue size. However, the performance remains the same for queue sizes above 1024. Therefore, a queue size of 1024 is optimal in terms of performance and memory efficiency. Note that for the other benchmarks, the queue size did not matter as much; even with a queue size of 16, they performed as well as with higher queue sizes. The reason the queue size matters more for Fluidanimate is that it contains a larger number of mutexes, and therefore there is less contention among the threads for the same mutexes.

C. Comparison with Respec

Figure 18 shows the improvement that we get by avoiding atomic variables and having an optimized queue. The left bars are obtained by running the benchmarks with 2 threads, while the right bars are obtained with 4 threads. The lower portion of each bar shows the overhead with our record/replay scheme, where we use a lockless queue and an optimized method for storing the clocks for mutexes. The middle portion shows the additional overhead that Respec incurs by using a hash table for mutex clocks. The uppermost portion shows the additional overhead of using a naive lockless queue.

We can see that for Fluidanimate and Radiosity, which have high lock frequencies, we get a significant improvement in the performance of deterministic execution. Furthermore, our method of using separate clocks for each mutex is more scalable than Respec's method of using a limited number of clocks and accessing them through a hash table, which requires atomic operations. The scalability here can be assessed by the fact that for two threads, our scheme and Respec's scheme perform similarly, while for four threads, our scheme performs far better. From this result, we can predict that our scheme will compare even more favorably to Respec for a larger number of cores. Furthermore, our lockless queue also shows much better scalability than a naive lockless queue due to avoiding true and false sharing of cache lines.

D. Comparison of DetLock with Kendo

In Table IV, we compare DetLock with Kendo. Note that the purpose of our scheme is not to surpass Kendo in performance but to make it more portable while retaining sufficient efficiency. We also compare the hardware assisted versions of DetLock and Kendo in this table. For the performance evaluation, we selected two benchmarks from PARSEC, which are Fluidanimate and Swaptions, and three from SPLASH, which are Radiosity, Raytrace and Water. Fluidanimate and Radiosity have a very


TABLE III
PERFORMANCE RESULTS OF THE FLUIDANIMATE BENCHMARK BEFORE AND AFTER MODIFICATION OF CODE

Fluidanimate               | Orig | Modified | Det         | Det with Opt5    | Det with Opt5&6.2 | Det with Opt5&6.1&6.2
Exec Time & Overheads (4T) | 1239 | 1148     | 2382 (92%)  | 1946 (57%/70%)   | 1673 (35%/46%)    | NA
Exec Time & Overheads (8T) | 1327 | 1172     | 4612 (247%) | 3851 (190%/228%) | 3586 (170%/205%)  | 2646 (99%/125%)

TABLE IV PERFORMANCE RESULTS OF OUR SCHEME AS COMPARED TO KENDO

Benchmark             | Fluidanimate | Swaptions    | Radiosity    | Raytrace    | Water
Results for 4 threads
Locks/sec             | 6655543      | 216979       | 2666366      | 120984      | 377701
Lock frequency        | High         | Low          | High         | Low         | Low
Chunk size used       | 2000         | 1000         | 2500         | 800         | 7000
Original              | 26.88        | 27.03        | NC           | 63.38       | 35.96
Non-det               | 27.22 (1%)   | 26.68 (0%)   | 41.5         | 63.53 (0%)  | 37.2 (3%)
Unopt SW1             | 46.56 (73%)  | 29.90 (11%)  | 58.14 (41%)  | 68.80 (9%)  | 52.56 (46%)
DetLock SW1           | 39.79 (48%)  | 30.02 (11%)  | 51.07 (24%)  | 63.37 (0%)  | 44.76 (24%)
Unopt SW2             | 40.25 (50%)  | 36.29 (34%)  | 63.68 (55%)  | 69.88 (10%) | 49.39 (37%)
DetLock SW2           | 38.05 (42%)  | 28.92 (7%)   | 49.0 (18%)   | 66.72 (5%)  | 47.15 (31%)
DetLock SW2 with Opt6 | 34.59 (29%)  | -            | -            | -           | -
Kendo                 | 41.1 (53%)   | 34.55 (28%)  | 81.1 (95%)   | 65.11 (3%)  | 43.1 (19%)
DetLock HW1           | 28.01 (4%)   | 27.96 (3%)   | 49.34 (20%)  | 64.78 (2%)  | 44.27 (23%)
DetLock HW2           | 34.73 (29%)  | 29.48 (9%)   | 48.1 (16%)   | 64.29 (1%)  | 44.10 (23%)
DetLock HW2 with Opt6 | 32.22 (20%)  | -            | -            | -           | -
KendoHW               | 32.87 (22%)  | 29.99 (11%)  | 49.6 (20%)   | 64.43 (2%)  | 43.8 (21%)
Results for 8 threads
Locks/sec             | 22746816     | 227835       | 5273035      | 120965      | 542817
Lock frequency        | High         | Low          | High         | Low         | Low
Chunk size used       | 2000         | 1000         | 2500         | 800         | 7000
Original              | 16.02        | 13.73        | NC           | 63.39       | NC
Non-det               | 16.18 (1%)   | 13.59 (0%)   | 22.9         | 63.68 (0%)  | 42.6
Unopt SW1             | 62.82 (292%) | 18.03 (31%)  | 53.94 (136%) | 75.46 (19%) | 64.35 (51%)
DetLock SW1           | 49.36 (208%) | 16.22 (18%)  | 36.57 (60%)  | 73.65 (16%) | 59.45 (40%)
Unopt SW2             | 53.3 (233%)  | 23.92 (74%)  | 93.61 (308%) | 74.22 (17%) | 62.28 (46%)
DetLock SW2           | 39.29 (145%) | 16.99 (24%)  | 47.2 (106%)  | 73.47 (16%) | 58.4 (37%)
DetLock SW2 with Opt6 | 29.35 (83%)  | -            | -            | -           | -
Kendo                 | 59.61 (272%) | 34.14 (149%) | 71.6 (213%)  | 65.08 (3%)  | 57.1 (34%)
DetLock HW1           | 38.88 (143%) | 15.84 (15%)  | 29.75 (30%)  | 63.74 (0%)  | 58.0 (36%)
DetLock HW2           | 29.52 (84%)  | 16.55 (21%)  | 39.4 (72%)   | 63.76 (0%)  | 58.0 (36%)
DetLock HW2 with Opt6 | 24.0 (50%)   | -            | -            | -           | -
KendoHW               | 28.13 (76%)  | 16.99 (24%)  | 37.2 (62%)   | 63.61 (0%)  | 58.4 (37%)

high number of shared memory accesses, while Swaptions, Water and Raytrace have low shared memory accesses. The benchmarks were run using the Marssx86 [19] simulator, which is a cycle accurate simulator for multicore x86 systems. All benchmarks were run using 4 and 8 cores, where each core is occupied by a different thread. The simulator has been modified to include our hardware-assisted approach.

Kendo updates clocks by using performance counters. The performance counter for retired stores, for example, is programmed to give an interrupt after a fixed number of stores. So, for example, we can set the performance counter to update the clock after every 2000 stores. The authors of Kendo call this number the chunk size. For benchmarks which were used in the Kendo paper, we use the same chunk size, but for the other benchmarks, which are Fluidanimate and Swaptions, we use a chunk size of 2000 and 1000, respectively. Our implementation does not involve performance counters and interrupts, as that would require further changes to the simulator and also modification of the kernel. However, we mimic Kendo as closely as possible by updating the retired store count after a number of stores equal to the chunk size. The results for Kendo are therefore a little better than in reality, as there are no interrupts. It should be noted that those interrupts can have a significant impact on performance, as the authors of Kendo showed that reducing the chunk size from 2500 to 1000 increased the interrupt overhead for Radiosity from about 10% to 20%. Note that in the Kendo paper, the authors only used a chunk size of less than 2000 for 2 benchmarks out of 10; that is why we think keeping the chunk size for Fluidanimate and Swaptions, which were not used in the Kendo paper, at 2000 and 1000 is fair enough.

For each benchmark, we show the frequency of locks acquired, followed by the time taken by the original program. This is followed by the time taken by the code which includes the clock update code for deterministic execution, but does not perform deterministic execution. We call this the Non-det version. This is followed by the performance of DetLock. Here, the SW1 version updates the clock as discussed in the previous sections, that is, it counts every instruction for the clock update, while the SW2 version counts only loads and stores. In most cases,

the SW2 version gives better performance. This is because we work on the intermediate code, which is not an exact representation of the final compiled code. By counting each instruction, we may increase the error, as instructions are much greater in number than loads and stores alone. Next in the table, we show the performance of Kendo. This is followed by the hardware assisted versions of DetLock, with version HW1 counting every instruction for the clock update while HW2 counts only loads and stores. Finally, we show the results of hardware assisted Kendo (KendoHW).

The original time for Radiosity and Water (with 8 threads) was not calculated (indicated as NC in the table) due to a bug causing a crash during simulation. In those cases, the percentage overheads in the table are calculated with respect to the non-deterministic version with clock updating code (Non-det).

For benchmarks with a high amount of shared memory accesses, DetLock outperforms Kendo. The reason DetLock performs better for such benchmarks is that the waiting time is less, as DetLock updates the clock much more frequently than Kendo, which only updates the clock after a given number of chunks. For example, for Fluidanimate, the improvement is 189% for 8 threads (from 272% to 83%). As far as benchmarks with a low frequency of shared memory accesses are concerned, DetLock performs better for Swaptions. This is because Swaptions contains compute intensive loops whose clock count can be updated ahead of time, that is, before executing those loops, thus increasing the deterministic execution performance by reducing the waiting time for threads trying to acquire locks. On the other hand, there is not much opportunity to significantly update the clock ahead of time for Water and Raytrace. Therefore, for Water, DetLock performs on par with Kendo, while for Raytrace, Kendo performs slightly better than DetLock. We suspect that the reason Kendo performs slightly better for Raytrace is that the clock used by Kendo is more accurate, as Kendo counts the actual stores happening during run time, while we count loads and stores or instructions at compile time, and that too on the intermediate code.

As far as the hardware assisted versions of Kendo and DetLock are concerned, the DetLock version is on par with Kendo for all benchmarks except Fluidanimate and Swaptions. For Fluidanimate, DetLock improves performance by 26% for 8 threads. Even for Swaptions, which has a low lock frequency, DetLock performs a little better as compared to Kendo, due to its ability to update clocks ahead of time. Unlike the original Kendo version, which updates the clock after a certain number of stores depending on the selected chunk size, our hardware assisted Kendo version updates the clock after each store. This is why it performs so much better than the original Kendo version.

VIII. CONCLUSION AND FUTURE WORK

In this paper, we described the design and implementation of a user-level leader/follower based fault tolerance scheme for multithreaded applications running on multicore processors. We use two different techniques to ensure deterministic shared memory accesses by the replicas. The first one is a record/replay technique, where one replica logs the order of shared memory accesses, and the second one is deterministic multithreading, where the redundant processes do not need to communicate with each other to ensure deterministic shared memory accesses. From the experimental evaluation, we see that record/replay is faster and more scalable than deterministic multithreading, at the cost of more memory, on benchmarks with large shared memory accesses. On benchmarks with small to moderate shared memory usage, the performance of the two is comparable.

For 4 threads, record/replay has an overhead of less than 25% for all benchmarks, while with deterministic multithreading, the overhead is also less than 25% for benchmarks with little to moderate shared memory accesses, but can reach 56% for those with high shared memory accesses. The overhead of deterministic multithreading was significantly reduced by applying various optimizations. For example, for Fluidanimate, the overhead was brought down from 77% to just 33% for 4 threads using these optimizations.

To improve performance and scalability, we implemented a hardware extension to aid deterministic multithreading. As compared to Kendo, our deterministic multithreading scheme (DetLock) showed an overhead reduction of up to 189% for 8 threads. Even when Kendo is assisted with hardware, the hardware assisted DetLock version can improve performance by up to 26% over Kendo.

ACKNOWLEDGMENT

This research is supported by Artemis through the SMECY project (grant 100230).

REFERENCES

[1] A. Basu, J. Bobba, and M. D. Hill. Karma: scalable deterministic record-replay. In Proceedings of the International Conference on Supercomputing, ICS '11, pages 359–368, New York, NY, USA, 2011. ACM.
[2] T. Bergan, O. Anderson, J. Devietti, L. Ceze, and D. Grossman. CoreDet: a compiler and runtime system for deterministic multithreaded execution. SIGARCH Comput. Archit. News, 38:53–64, March 2010.
[3] C. Bienia, S. Kumar, J. P. Singh, and K. Li. The PARSEC benchmark suite: characterization and architectural implications. PACT '08, pages 72–81.
[4] D. Hower, P. Dudnik, M. Hill, and D. Wood. Calvin: Deterministic or not? Free will to choose. HPCA '11, pages 333–334, February 2011.
[5] D. R. Hower and M. D. Hill. Rerun: Exploiting episodes for lightweight memory race recording. In Proceedings of the 35th Annual International Symposium on Computer Architecture, ISCA '08, pages 265–276, Washington, DC, USA, 2008. IEEE Computer Society.
[6] D. Lee, B. Wester, K. Veeraraghavan, S. Narayanasamy, P. M. Chen, and J. Flinn. Respec: Efficient online multiprocessor replay via speculation and external determinism. ASPLOS '10, pages 77–90.
[7] M. Olszewski, J. Ansel, and S. Amarasinghe. Kendo: efficient deterministic multithreading in software. ASPLOS '09, pages 97–108.
[8] H. Mushtaq, Z. Al-Ars, and K. Bertels. DetLock: Portable and Efficient Deterministic Execution for Shared Memory Multicore Systems. MuCoCoS '12.
[9] H. Mushtaq, Z. Al-Ars, and K. Bertels. Efficient and Highly Portable Deterministic Multithreading (DetLock). Computing, vol. 96, pp. 1131–1147, 2014.
[10] H. Mushtaq, Z. Al-Ars, and K. Bertels. Efficient Software Based Fault Tolerance Approach on Multicore Platforms. DATE '13.
[11] T. Liu, C. Curtsinger, and E. D. Berger. Dthreads: Efficient deterministic multithreading. SOSP '11, pages 327–336.
[12] V. Weaver and J. Dongarra. Can Hardware Performance Counters Produce Expected, Deterministic Results? FHPM '10.
[13] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 programs: characterization and methodological considerations. ISCA '95, pages 24–36.


[14] R. Baumann. Soft errors in advanced semiconductor devices, part I: the three radiation sources. IEEE Transactions on Device and Materials Reliability, 1(1):17–22, March 2001.
[15] S. Nomura, M. D. Sinclair, C.-H. Ho, V. Govindaraju, M. de Kruijf, and K. Sankaralingam. Sampling + DMR: practical and low-overhead permanent fault detection. ISCA '11, pages 201–212.
[16] F. B. Schneider. Implementing fault-tolerant services using the state machine approach: a tutorial. ACM Comput. Surv., 22:299–319, December 1990.
[17] O. Laadan, N. Viennot, and J. Nieh. Transparent, lightweight application execution replay on commodity multiprocessor operating systems. SIGMETRICS '10, pages 155–166.
[18] A. Basu, J. Bobba, and M. D. Hill. Karma: scalable deterministic record-replay. ICS '11, pages 359–368.
[19] A. Patel, F. Afram, S. Chen, and K. Ghose. MARSS: A full system simulator for multicore x86 CPUs. DAC '11, pages 1050–1055.
[20] N. Nethercote and J. Seward. Valgrind: A framework for heavyweight dynamic binary instrumentation. PLDI '07, pages 89–100.

6 WCET CALCULATION USING DETERMINISTIC MULTITHREADING

SUMMARY

In this chapter, we use a model checking based approach to calculate the WCET, where we apply optimizations to reduce the number of states stored by the model checker. Furthermore, we use deterministic shared memory accesses to further reduce the calculation time, memory and number of states needed for calculating the WCET. By optimizing the model checking code, we are able to complete benchmarks which otherwise have state explosion problems. Furthermore, by using deterministic execution, we significantly reduce the calculation time (up to 158x), memory (up to 89x) and states needed (up to 188x) for calculating the WCET, with a negligible increase (up to 4%) in the calculated WCET for a multicore system with 4 cores. Lastly, unlike other state-of-the-art approaches, which perform a binary search for the WCET by running several iterations, our method calculates the WCET in just one iteration. Taking all these optimizations into consideration, the gain in speed of WCET calculation is between 1775x and 2471x for 4 threads.

This chapter is based on the following papers.

1. Mushtaq, H.; Al-Ars, Z.; Bertels, K., Accurate and efficient identification of worst- case execution time for multicore processors: A survey, Design and Test Symposium (IDT), 2013 8th International, 16-18 Dec. 2013


2. Mushtaq, H.; Al-Ars, Z.; Bertels, K., Calculation of Worst-Case Execution Time for Multicore Processors using Deterministic Execution, Submitted to PATMOS 2015.

Accurate and Efficient Identification of Worst-Case Execution Time for Multicore Processors: A Survey

Hamid Mushtaq, Zaid Al-Ars, Koen Bertels
Computer Engineering Laboratory
Delft University of Technology
Delft, the Netherlands
{H.Mushtaq, Z.Al-Ars, K.L.M.Bertels}@tudelft.nl

Abstract—Parallel systems were for a long time confined to high-performance computing. However, with the increasing popularity of multicore processors, parallelization has also become important for other computing domains, such as desktops and embedded systems. Mission-critical embedded software, like that used in the avionics and automotive industry, also needs to guarantee real time behavior. For that purpose, tools are needed to calculate the worst-case execution time (WCET) of tasks running on a processor, so that the real time system can make sure that real time guarantees are met. However, due to the shared resources present in a multicore system, this task is made much more difficult as compared to finding the WCET for a single core processor. In this paper, we will discuss how recent research has tried to solve this problem and what the open research problems are.

Figure 1. Methods used for WCET calculation (WCET calculation divides into measurement based methods, i.e. measurement and simulation, and static methods, i.e. model checking and static analysis)

I. INTRODUCTION

For a long time, single core processors ruled the desktop and embedded market. The popularity of the single core processor could be attributed to the portability it provided.

A program written for one processor could be ported to a faster version of the same processor without changing a single line of code. However, at one point it was no longer possible to build faster single core processors due to the huge amount of power they would need. That is the point where multicore processors came into existence, as they are more power efficient. Nowadays, multicore processors are common in desktops, laptops and mobile phones. However, industries which use mission critical embedded software, such as the avionics and automotive industries, have been reluctant to employ multicore systems. The reason is that such software also needs to meet timing deadlines for real time performance. For guaranteeing real time performance, the real time scheduler needs to know the worst-case execution time (WCET) of each task. Finding a good (less pessimistic) estimate of the WCET of a task is much simpler if it runs on a single core processor than if it runs on a multicore processor concurrently with other tasks. This is because those tasks can share resources, such as a shared cache or a shared bus, and/or may need to concurrently read and/or write shared data.

Figure 2. Measurement based vs static methods (normalized distribution of execution times, with the measured WCET below and the calculated WCET above the exact WCET)

Recently, there has been an increasing interest in solving the problem of finding the WCET of tasks running on multicore processors, ranging from hardware solutions to software solutions for Commodity Off The Shelf (COTS) processors. In this paper, we discuss the research done in this context and also point out the open issues. In Section II, we provide the necessary background to help the reader understand the problem of WCET. This is followed by Section III, which discusses the WCET calculation techniques employed for single core processors. Next, we discuss the research that has been done for calculating the WCET on multicore processors in Section IV. This is followed by Section V on open issues. We finally conclude the paper with Section VI.

II. BACKGROUND

Multicore processors can be useful in embedded systems, such as automotive systems, as that would mean that software could be made more centralized. This translates to less cable usage in cars, and therefore less fuel consumption, as more cable length is directly proportional to fuel consumption in cars. Moreover, with processors with more cores, more functionality could be added; for example, we could have an improved braking system which uses more sensors [27].

Mission critical embedded systems perform hard real time tasks, which need to complete within a certain time period. To be able to guarantee that those tasks finish within that time period, their WCET should be known. For single core processors, techniques to find the WCET are well known and there are several tools available for that. Those techniques and tools can be found in the literature [30].

Figure 4. Path analysis methods used for WCET calculation

The major steps taken in calculating the WCET are shown in Figure 3 [1]. The input executable is read to construct a control flow graph (CFG). Afterwards, control-flow analysis (CFA) is performed, which carries out steps such as removing infeasible paths, trying to find loop bounds and determining the frequencies of execution of paths. At that point, the user can also provide information such as loop bounds which could not be found by CFA, or known values at certain locations in the program, so that CFA can more precisely find infeasible paths. The annotated CFG with the micro-architectural analysis is then used to find the WCET of each basic block. Finally, path analysis is performed to find the WCET. Since the goal is to calculate a WCET which is at least equal to the real WCET, the calculated WCET is almost always greater than the real WCET, as shown in Figure 2. Therefore, the quality of a tool measuring WCET is assessed by how close the WCET it calculates is to the real WCET, in other words, how much tighter the WCET it calculates.

Figure 3. Steps for WCET calculation using static analysis

As seen from Figure 1, there are two main methods of finding WCET: measurement based methods and static methods. In measurement based methods, the execution time is measured either through direct measurement or through simulation of the code, by giving different inputs. The obvious drawback of this method is that the WCET can be underestimated, as not all possible paths can be tested with a limited number of inputs. This fact is shown in Figure 2, where the curve for the measurement-based experiments is shown with a solid line, while the curve with all possible inputs is shown with a dotted line. To overcome this, one could put a safety margin on top of the measured WCET. However, the safety margin is still just a guess, and the picked WCET could end up less than the real WCET. One way to obtain better estimates is to measure the worst-case execution time of each basic block and then try to find the path with the worst-case time by adding the times taken by these blocks. However, this would only work for very simple processors. In the presence of advanced features, such as pipelines, branch prediction, out-of-order execution and caches, this would not work, since the worst-case execution time of a block is then dependent on the path followed by the program. For example, the cache misses in a basic block depend upon from which path the program reached that basic block.

Due to the fact that measurement based methods can underestimate the WCET, we limit ourselves to static methods only. There are two major kinds of static methods used for calculating WCET, namely static analysis and model checking. The combination of these two is also employed in some cases. The details of these methods are discussed in the next two sections.

III. WCET FOR SINGLE CORE PROCESSORS

While the main focus of this paper is on finding the WCET for multicore processors, it is first important to discuss the techniques applied to single core processors. This is because, even for single core processors, finding the WCET is not straightforward, as typical single core processors are designed to have good average-case execution time, through features such as pipelining, cache memories, out-of-order execution, speculation and branch prediction. All these features make accurate timing analysis a difficult problem [4].

These performance-enhancing features also introduce timing anomalies. For example, one may assume that it would be safe to use the cache miss time for WCET calculation. However, [21] showed that for out-of-order execution, it is possible that in some cases a cache hit would increase the time as compared to a cache miss.

There exist two main techniques for finding the WCET for single core processors, namely static analysis and model checking. Static analysis is more computationally efficient in finding the WCET as compared to model checking, as model checking can have state-space explosion problems. However, model checking can find tighter WCET estimates. This is because model checking can model a processor more accurately, while static analysis can only approximate the processor model, as it needs to find the WCET without actually running the program. These two techniques can be combined, though, to achieve the best results. In Section III-A, we discuss related work using static analysis, while in Section III-B, we discuss techniques which employ model checking for finding WCET.

A. Static analysis

Static analysis techniques try to find the WCET without actually running the program. Since they do not run the program, they need to approximate the processor model. Therefore, static analysis techniques are divided into two steps. The first step is performing CFA on the CFG, and performing value analysis to find loop bounds, values to eliminate infeasible paths, and addresses to help in finding cache hits and misses, followed by processor modeling to obtain the WCET of each basic block in the program. There are three techniques to do this: abstract interpretation (AI), integer linear programming (ILP) and constraint logic programming (CLP). The second step is to find the WCET using the WCETs of the basic blocks. The different techniques employed for this purpose (path analysis) are shown in Figure 4.

1) Tree based: The tree based method for path analysis traverses the CFG in bottom-up fashion, combining the WCETs of the basic blocks along the way (see [30] for more detail). This method is quite efficient but suffers from some limitations. For example, it is not possible to represent goto statements. Also, it is difficult to eliminate infeasible paths. An example of a paper using this technique is [9], which employs it for a processor with a pipeline, branch prediction and an instruction cache.

2) Explicit path enumeration technique: This technique tries to find the longest path in terms of execution time by looking at all the possible paths in the program. It first tries to eliminate all infeasible paths in the program. This method suffers from low performance, as the number of paths that need to be examined increases exponentially with the program size.

3) ILP: Due to the problem associated with the explicit path enumeration technique, the authors in [19] propose integer linear programming (ILP) to solve the WCET problem implicitly. That is why this technique is known as the implicit path enumeration technique (IPET). Equation 1 is the basic expression for calculating the WCET with this technique. Here c_i is the cost of basic block i and x_i is the number of times that basic block is executed. The WCET is given by finding the maximum value of Equation 1:

\sum_{i=1}^{n} c_i x_i \qquad (1)

The authors of [19] extended their work to also account for architectures with instruction caches in [20]. Both the modeling of the instruction cache (processor modeling) and the calculation of the WCET (path analysis) are done using ILP. A basic block is further divided into line blocks, where each line block represents contiguous instructions which use the same cache line. Also, information is kept for each line block on whether it incurs a cache miss or a cache hit. This method might work for simple models, but for more complex processor architectures, which include pipelining, speculative execution, branch prediction and out-of-order execution for example, it becomes prohibitively difficult to use ILP due to its restrictive nature. For those purposes, Abstract Interpretation (AI), which is discussed next, is much more feasible.

4) AI+ILP: Abstract Interpretation (AI) is a dataflow technique to approximate the model of a processor. AI can be used, for example, to get a set of possible values for a variable. However, since AI is an approximate method, it might also include values in a set which would never occur in the program. Therefore, techniques using AI overestimate the WCET, at the benefit of finding the WCET in less time as compared to model checking.

That is why the authors in [29] separate the processor modeling and path analysis steps. For processor modeling, they use AI, through which they model the pipeline and caches. Through AI, they can classify an instruction as always hit, always miss, persistent (miss the first time and then always hit) or unclassified. In the case of unclassified, both scenarios are considered, that is, cache hit and cache miss, since, as previously discussed, due to timing anomalies it is not enough to consider a cache miss as the worst case scenario.

While [29] checks for always miss, always hit and unclassified instructions in a global scope, [15] also considers local scopes like loops and functions. They argue that the same cache line block used in different scopes might not interfere with each other and would therefore be mutually exclusive, so in this way we could have blocks which are classified as persistent only in that local scope. This method reduced estimates of the WCET by up to 74%.

5) CLP: An alternative to processor modeling and path analysis using ILP is to use constraint logic programming (CLP). The drawback of ILP is that we are limited to linear constraints only, and for representing disjunction we have to duplicate a block. For example, if a block can be reached from two different paths, it has to be duplicated into two blocks, each having a different WCET; but with CLP, we can actually define, through constraint equations, the WCET values for that block for different paths. The authors in [22] showed that using CLP significantly reduced the WCET calculation time as compared to ILP, as far fewer blocks are required due to the ability to represent disjunctions through constraint equations.

B. Model checking

The problem with static analysis is that it is an approximate method, due to the approximate nature of the processor modeling steps. With model checking, on the other hand, we can build a more concrete model of a processor, and therefore obtain tighter WCET estimates. For example, when cache accesses cannot be classified with AI, we have to check the execution time with both a cache miss and a cache hit. With model checking, some of those instructions which were unclassified with AI can be classified, thus tightening up the WCET estimates. [14] is an example which supports model checking for finding the WCET for Java processors. The authors use the UPPAAL [3] model checker for that purpose. They noticed that model checking was enough for typical tasks in embedded systems. However, for larger applications, it was too slow. The authors recommended that model checking could be combined with static analysis in such a way that the more important code fragments are analyzed with model checking, while the remainder of the application is analyzed with static analysis.

[23] also uses a model checker. Instead of using each instruction in the model checker, the authors use basic blocks, thus reducing the number of states for model checking.

[17] combines model checking with static analysis to find the WCET. The model checking is useful in deriving loop bounds which change dynamically. For example, for loop bounds that depend upon two variables, the user just has to feed in a range of values of these variables and the model checker can extract the loop bounds from them.
n The problem with static analysis is that it is an approximate c x . (1) method, due to the approximate nature of the processor mod- i i eling steps. With model checking on the other hand, we can i=1  build a more concrete model of a processor, and therefore have The authors of [19] extended their work to also account for tighter WCET estimates. For example, when cache accesses architectures with instruction caches in [20]. Both modeling cannot be classified with AI, we have to check execution of the instruction cache (processor modeling) and calculating time with both cache miss and cache hit. On the other hand, of WCET (path analysis) is done using ILP. A basic block with model checking, some of those instructions which were is further divided into line blocks, where each line block unclassified in AI, could be classified, thus tightening up the represents contiguous instructions which use the same cache WCET estimates. [14] is an example which supports model line. Also, information is kept for a line block whether it checking for finding WCET for Java processors. The authors incurs a cache miss or is a cache hit. This method might use UPPAAL [3] model checker for that purpose. The authors work for simple models, but for more complex processor noticed that model checking was enough for typical tasks in architectures, which include pipelining, speculative execution, embedded systems. However, for larger applications, it was too branch prediction and out-of-order execution for example, it slow. The authors recommended that model checking could becomes prohibitively difficult to use ILP due to its restrictive be combined with static analysis in such a way, that the nature. For those purposes, Abstract Interpretation (AI), which more important code fragments could be analyzed with model is discussed next, is much more feasible. checking while the remainder of the application with static analysis. 4) AI+ILP: Abstract Interpretation (AI) is a dataflow tech- nique to approximate model of a processor. AI can be used for [23] also uses a model checker. Instead of using each example to get a set of possible values for a variable. However, instruction in the model checker, the authors use basic blocks, since AI is an approximate method, it might also include values thus reducing the number of states for model checking. in a set, which would not occur in the program. Therefore, techniques using AI overestimate WCET at the cost of finding [17] combine model checking with static analysis to find WCET in less time as compared to model checking. WCET. The model checking is useful in deriving loop bounds, which change dynamically. For example, for loop bounds that That is why authors in [29] separate processor modeling depend upon two variables, the user just has to feed a range and path analysis steps. For processor modeling, they use AI. of values of these variables and the model checker can extract Through AI, they model pipeline and caches. Through AI, they the loop bounds from them. 86 6.WCET CALCULATIONUSINGDETERMINISTICMULTITHREADING

Figure 5. Example of timing anomaly in a multicore processor [32] (a CFG on core P1 with alternative paths ABD and ACD, while core P2 competes for the shared L2 cache)

IV. WCET FOR MULTICORE PROCESSORS

We discussed previously that single core processors can have timing anomalies in the presence of complex performance enhancing features. Multicore processors have another source of timing anomalies due to shared resources, such as shared cache memory. An example is shown in Figure 5. Let us assume the path ABD is the worst-case path if seen separately. In the presence of a shared L2 cache, however, ACD might become the worst-case path if a thread running on another core evicts more instructions from C than from B in the L2 cache. Therefore, when analyzing the WCET for a multicore, we always need to consider all the tasks running on the different cores together, which can significantly increase the complexity of the timing analysis.

In Section IV-A, we discuss the WCET calculation methods used for multicore systems, which are static analysis and model checking, whereas in Section IV-B, we discuss techniques that assist in WCET estimation.

A. WCET calculation methods

As in the case of single core processors, WCET calculation techniques can be divided into static analysis and model checking. These two techniques for multicore systems and their comparison are discussed next.

1) Static analysis: The first work done in this regard was [32], which extends [29] to a multicore processor with private L1 caches but a shared L2 cache. Through AI, this method tries to find out which instructions are always cache hits, or always cache hits after the first time; it considers all other instructions as cache misses. This method first checks for L1 cache misses separately. An L1 cache miss implies either an L2 cache hit or an L2 cache miss. The basic idea is to check if the same cache block will be used by another thread running on another core. If that is the case, the block is marked as an L2 cache miss if the other thread uses that block within a loop, otherwise it is marked as always-except-one-hit. If the cache block is not used by the other core, it is marked as always-hit. The WCET is found by solving the linear constraints formed by AI. The authors extend this method to also include data caches in [33]. The drawback of this method, though, is that it considers the effect of caches in isolation, without including performance enhancing features such as branch prediction and speculative execution, which can cause timing anomalies. In the case of timing anomalies, it is not enough to assume a cache miss as the worst case.

To solve this problem, [7] includes pipelining, branch prediction and speculative execution in the analysis, although only instruction L2 caches are considered. With timing anomalies taken into consideration, the timing analysis becomes more complex, as we cannot just assume a cache miss when we are not sure whether a cache access is a miss or a hit. This is the reason the authors classify cache accesses as always-hit, always-miss and unclassified. For unclassified accesses, both a cache hit and a cache miss are tried to find the WCET.

[13] uses a technique similar to [7] but improves the WCET calculation by employing bypassing of caches. Bypassing a load instruction, for a given cache level, means that if there is a miss, the memory block is not brought into the cache, while if there is a hit, the age of no cache block is altered. In this way, instructions which are rarely used in a program can be bypassed, thus reducing inter-core conflicts and therefore improving the timing analysis. The instructions to be bypassed are chosen by the compiler at compile time. [13] only considers bypassing for instruction caches, while [18] does it for both instruction and data caches. It has to be noted, though, that not every processor has a bypass instruction, and therefore this method is not portable.

2) Model checking: Besides static analysis, model checking is also a viable approach for calculating the WCET of a multicore processor. We can either use model checking alone or combine it with static analysis. When model checking is used alone, both the processor modeling and the path analysis are done using the model checker. The user can query the model checker with a guessed WCET, to see if the maximum time calculated by the model checker is less than or equal to the guess. The guess is refined until the WCET matches the maximum time calculated by the model checker.

[31] uses model checking to estimate the WCET. The model checking language used is PROMELA, which is the language of the SPIN [2] model checker. The approach works for shared L2 caches. The authors show that using model checking improves the tightness of the WCET as compared to static-analysis-only approaches. This is because model checking can check every possible interleaving of the threads running on different cores, and therefore some cache accesses which cannot be classified as cache miss or cache hit can be properly classified by a model checker. One problem with using a model checker is the state-space explosion problem. The authors of [31] tried to reduce this by first finding the L1 and L2 cache hits and misses of a task by assuming it is running on a single core processor. This information is then imported for model checking the real scenario, that is, the tasks running on a multicore processor. In this way, only the L2 hits need to be taken care of, as L2 misses would still be misses on a multicore.

[8] combines static analysis with model checking to calculate the WCET. AI is used to model the shared cache, but model checking is used to model the shared bus, which is the bus used to read from and write to the main memory. The main reason for using model checking for the shared bus is that it is much simpler to model it with a model checker than with AI. Furthermore, since it is more accurate, it also gives tighter WCET estimates.

[12] uses UPPAAL [3] to model a multicore system with private L1 caches and a shared L2 cache. This method also works for tasks communicating with each other through shared memory using spin locks.
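The guess-and-refine procedure described above amounts to a binary search driven by yes/no queries to the model checker. A schematic C driver for it (our sketch, not code from any of the cited tools; check_exceeds stands in for a full model checker run verifying an assertion such as "elapsed <= bound", and is simulated here by a hidden constant):

    #include <stdio.h>

    /* Stand-in for one run of the model checker: returns nonzero if
     * some interleaving can run longer than `bound` cycles. */
    static unsigned hidden_wcet = 1378;      /* simulates the model  */
    static int check_exceeds(unsigned bound) { return hidden_wcet > bound; }

    /* Invariant: the WCET lies in (lo, hi], with hi a known safe bound.
     * Each iteration costs one full model checker run. */
    static unsigned find_wcet(unsigned lo, unsigned hi)
    {
        while (lo + 1 < hi) {
            unsigned mid = lo + (hi - lo) / 2;
            if (check_exceeds(mid))
                lo = mid;    /* bound too low: some path exceeds it  */
            else
                hi = mid;    /* mid is already safe: tighten from above */
        }
        return hi;           /* tightest safe bound found */
    }

    int main(void)
    {
        printf("WCET bound: %u cycles\n", find_wcet(0, 1u << 20));
        return 0;
    }

Searching a range of 2^20 cycles this way costs about 20 checker runs, which is why approaches that answer the quantitative question in a single run are attractive.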

Since each instruction is modeled, the state space is large, and therefore this method only works for small programs. Even for those programs, a WCET can only be calculated with two cores; for more cores, state-space explosion is observed.

[6] uses model checking and static analysis. This method supports shared memory communication among the tasks. A program is divided into communication and execution slices. At the start of an execution slice, data is loaded into the private caches of the cores, and at the end of the communication slice, data is put back in the main memory. State-space explosion occurs when there is too much communication involved. Also, this method slows down execution, as data has to be read from and written to the main memory at each execution and communication slice.

3) Model checking vs static analysis: Figure 6 compares static analysis with model checking for finding the WCET for multicore processors. We can see that static analysis is faster and simpler to implement, but finds a more pessimistic WCET than model checking based approaches, which in turn are more complex to set up. The problem with model checking is that it suffers from a scalability problem: with more cores, more states are possible, causing state-space explosion for larger programs. The good thing, though, is that model checking can be aided by static analysis to reduce those states, as done by some papers discussed in Section IV-A2. However, none of those papers used a processor with more than 2 cores, suggesting that even by combining static analysis with model checking, it is still difficult to find a scalable method.

Figure 6. Model checking vs static analysis comparison. Static analysis: pros are that it is faster and simpler to implement; its con is that it finds a more pessimistic WCET. Model checking: its pro is that it finds a tighter WCET; its cons are that it is complex to set up and suffers from state explosion and scalability problems.

B. Assisting WCET estimation

Hardware approaches can help in estimating a tighter WCET. For example, [10] proposes a hardware mechanism to allow execution of synchronization operations such as mutex locks in bounded time. The logic for the hardware synchronization primitives (such as test-and-set and fetch-and-increment/decrement) is nested in the memory controller.

Cache locking and cache partitioning [28] can make the task of WCET calculation much easier. Cache partitioning means that the tasks running on different cores use separate portions of the shared cache, while cache locking allows a user to load certain data in the cache and lock it, that is, prevent it from being replaced. The benefit of cache partitioning is that one can calculate the WCET of tasks running on separate cores separately. While cache partitioning can ease the calculation of the WCET, it can also reduce performance, as less cache space is available to each task and more cache misses may occur.

[16] proposes synchronized cache management to ease finding a tighter WCET. This is done by using page coloring. Physical pages of different colors do not cause cache conflicts. Moreover, there are a limited number of pages of the same color. Accesses within the same colored memory by different cores cause conflicts only when the number of cache ways is exhausted. The authors view each color as a shared resource, where a lock is required to access that shared resource. For locking, the authors implemented a synchronization protocol.

[26] proposes an interference-aware arbiter, through which the maximum time for a hard real-time task (HRT) to access a shared resource, such as shared memory, has an upper bound. The system assumes that both HRT and non-real-time (NRT) tasks are running concurrently and makes sure that the access to a shared resource by an HRT is bounded in time, to ease WCET calculation.

V. OPEN ISSUES

For single core processors, there are several tools available for estimating the WCET of tasks, as discussed in [30]. However, no such tool is available yet for multicore processors, since, as we saw, timing analysis for multicore processors is much more complex due to the increased number of possible states caused by access to shared resources. Here, we discuss the still open issues in estimating the WCET for multicore systems.

A. Scalability

Although the scalability of static analysis is better than that of model checking, the static analysis methods are still not very scalable, as the possible number of states with a multicore processor is still much larger than with a single core one. There are no papers yet that use more than two cores for experiments. All of the static analysis approaches that we discussed use ILP for path analysis. It would be interesting to use CLP instead, because [22] showed that CLP is much faster than ILP on a single core. A combination of model checking and static analysis methods could represent the most appropriate solution. One way to solve the scalability issue would be to perform model checking only on the compute-intensive part of the code and use static analysis for the rest.

B. Synchronization

Almost all the methods that we discussed for finding the WCET on multicore processors ignore the problem of data sharing between the cores, and those that do consider it have some limitations. [6] writes data back to the main memory after every communication slice and reads it back from the main memory at the start of each execution slice, thus incurring a much larger overhead than keeping the shared data in cache. [10] describes a hardware approach to bound the time of synchronization operations, but the obvious drawback of this method is that it needs hardware modifications, thus impacting portability.

One possible solution is to use deterministic execution [25] [24], where locks for shared memory access are acquired in such a way that only one schedule is possible. Although this method ensures determinism of shared memory accesses only for a given input, [5] showed that even when an exhaustive set of inputs is considered, deterministic execution can have a smaller schedule space than non-deterministic approaches. However, to employ this method, the problem of global clock reading that [25] and [24] rely on would need to be solved first, as global clock reading can cause cache evictions and therefore increase cache coherence activity.

VI. CONCLUSION

In this paper, we discussed the challenges of finding the WCET for multicore processors and discussed some recent approaches in that direction. However, none of the existing approaches has been tried and tested for more than two cores, raising concerns about the scalability of such approaches. Also, most of these approaches ignore the fact that data could be shared between the cores; those that do consider it suffer from performance or portability problems. We also gave suggestions on how these scalability and synchronization problems could be solved.

ACKNOWLEDGMENT

This work is carried out under the BENEFIC project (CA505), a project labeled within the framework of CATRENE, the EUREKA cluster for Application and Technology Research in Europe on NanoElectronics.

REFERENCES

[1] http://compilation.gforge.inria.fr/2010_12_Aussois/programpage/pdfs/MAIZA.Claire.aussois2010.pdf
[2] http://spinroot.com/spin/whatispin.html
[3] Gerd Behrmann, Alexandre David, and Kim G. Larsen. A tutorial on UPPAAL. Pages 200-236. Springer, 2004.
[4] Christoph Berg, Jakob Engblom, and Reinhard Wilhelm. Requirements for and design of a processor with predictable timing. In Perspectives Workshop: Design of Systems with Predictable Behaviour, 16.-19. November 2003, volume 03471 of Dagstuhl Seminar Proceedings. IBFI, Schloss Dagstuhl, 2004.
[5] Tom Bergan, Joseph Devietti, Nicholas Hunt, and Luis Ceze. The deterministic execution hammer: How well does it actually pound nails? In 2nd Workshop on Determinism and Correctness in Parallel Programming, 2011.
[6] Frédéric Boniol, Hugues Cassé, Eric Noulard, and Claire Pagetti. Deterministic execution model on COTS hardware. In Proceedings of the 25th International Conference on Architecture of Computing Systems, pages 98-110, Berlin, Heidelberg, 2012.
[7] S. Chattopadhyay, C.L. Kee, A. Roychoudhury, T. Kelter, P. Marwedel, and H. Falk. A unified WCET analysis framework for multi-core platforms. In Proceedings of the 18th Real-Time and Embedded Technology and Applications Symposium (RTAS), pages 99-108, 2012.
[8] Sudipta Chattopadhyay, Abhik Roychoudhury, and Tulika Mitra. Modeling shared cache and bus in multi-cores for timing analysis. In Proceedings of the 13th International Workshop on Software & Compilers for Embedded Systems, pages 6:1-6:10, New York, NY, USA, 2010.
[9] Antoine Colin and Isabelle Puaut. A modular and retargetable framework for tree-based WCET analysis. In Proceedings of the 13th Euromicro Conference on Real-Time Systems, pages 37-44, 2001.
[10] M. Gerdes, F. Kluge, T. Ungerer, C. Rochange, and P. Sainrat. Time analysable synchronisation techniques for parallelised hard real-time applications. In Design, Automation & Test in Europe Conference & Exhibition 2012, pages 671-676.
[11] Sylvain Girbal, Miquel Moretó, Arnaud Grasset, Jaume Abella, Eduardo Quiñones, Francisco J. Cazorla, and Sami Yehia. On the convergence of mainstream and mission-critical markets. In Proceedings of the 50th Annual Design Automation Conference, pages 185:1-185:10, New York, NY, USA, 2013.
[12] Andreas Gustavsson, Andreas Ermedahl, Björn Lisper, and Paul Pettersson. Towards WCET analysis of multicore architectures using UPPAAL. In Proceedings of the 10th International Workshop on Worst-Case Execution Time Analysis, pages 103-113. Österreichische Computer Gesellschaft, 2010.
[13] D. Hardy, T. Piquet, and I. Puaut. Using bypass to tighten WCET estimates for multi-core processors with shared instruction caches. In Proceedings of the 30th Real-Time Systems Symposium, pages 68-77, 2009.
[14] Benedikt Huber and Martin Schoeberl. Comparison of implicit path enumeration and model checking based WCET analysis. In Proceedings of the 9th International Workshop on Worst-Case Execution Time (WCET) Analysis, 2009.
[15] Bach Khoa Huynh, Lei Ju, and A. Roychoudhury. Scope-aware data cache analysis for WCET estimation. In Real-Time and Embedded Technology and Applications Symposium, pages 203-212, 2011.
[16] Christopher J. Kenna, Jonathan L. Herman, Bryan C. Ward, and James H. Anderson. Making shared caches more predictable on multicore platforms. In Euromicro Conference on Real-Time Systems, 2013.
[17] Sungjun Kim, Hiren D. Patel, and Stephen A. Edwards. Using a model checker to determine worst-case execution time. Technical report, Columbia University Technical Reports, 2009.
[18] Benjamin Lesage, Damien Hardy, and Isabelle Puaut. Shared data caches conflicts reduction for WCET computation in multi-core architectures. In Proceedings of the 18th International Conference on Real-Time and Network Systems, page 2283, Toulouse, France, 2010.
[19] Yau-Tsun Steven Li and Sharad Malik. Performance analysis of embedded software using implicit path enumeration. In Proceedings of the 32nd Annual ACM/IEEE Design Automation Conference, pages 456-461, New York, NY, USA, 1995.
[20] Yau-Tsun Steven Li, Sharad Malik, and Andrew Wolfe. Performance estimation of embedded software with instruction cache modeling. In ACM Transactions on Design Automation of Electronic Systems, pages 257-279, 1999.
[21] T. Lundqvist and P. Stenstrom. Timing anomalies in dynamically scheduled microprocessors. In Proceedings of the 20th IEEE Real-Time Systems Symposium, pages 12-21, 1999.
[22] Amine Marref and Guillem Bernat. Predicated worst-case execution-time analysis. In Proceedings of the 14th Ada-Europe International Conference on Reliable Software Technologies, pages 134-148, Berlin, Germany, 2009.
[23] Alexander Metzner. Why model checking can improve WCET analysis. In Rajeev Alur and Doron A. Peled, editors, Computer Aided Verification, volume 3114 of Lecture Notes in Computer Science, pages 334-347. Springer Berlin Heidelberg, 2004.
[24] H. Mushtaq, Z. Al-Ars, and K. Bertels. DetLock: Portable and efficient deterministic execution for shared memory multicore systems. In High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion, pages 721-730, 2012.
[25] Marek Olszewski, Jason Ansel, and Saman Amarasinghe. Kendo: efficient deterministic multithreading in software. SIGPLAN Not., 44(3):97-108, March 2009.
[26] Marco Paolieri, Eduardo Quiñones, Francisco J. Cazorla, Guillem Bernat, and Mateo Valero. Hardware support for WCET analysis of hard real-time multicore systems. In Proceedings of the 36th Annual International Symposium on Computer Architecture, ISCA '09, pages 57-68, New York, NY, USA, 2009. ACM.
[27] H. Shah, A. Raabel, and A. Knoll. Challenges of WCET analysis in COTS multi-core due to different levels of abstraction. In Workshop on High-performance and Real-time Embedded Systems (HiRES 2013), 2013.
[28] V. Suhendra and T. Mitra. Exploring locking & partitioning for predictable shared caches on multi-cores. In Design Automation Conference, 2008. DAC 2008. 45th ACM/IEEE, pages 300-303, 2008.
[29] Henrik Theiling, Christian Ferdinand, and Reinhard Wilhelm. Fast and precise WCET prediction by separated cache and path analyses. Real-Time Syst., 18(2/3):157-179, May 2000.
[30] Reinhard Wilhelm, Jakob Engblom, Andreas Ermedahl, Niklas Holsti, Stephan Thesing, David Whalley, Guillem Bernat, Christian Ferdinand, Reinhold Heckmann, Tulika Mitra, Frank Mueller, Isabelle Puaut, Peter Puschner, Jan Staschulat, and Per Stenström. The worst-case execution-time problem: overview of methods and survey of tools. ACM Trans. Embed. Comput. Syst., 7(3):36:1-36:53, May 2008.
[31] Lan Wu and Wei Zhang. A model checking based approach to bounding worst-case execution time for multicore processors. ACM Trans. Embed. Comput. Syst., 11(S2):56:1-56:19, August 2012.
[32] Jun Yan and Wei Zhang. WCET analysis for multi-core processors with shared L2 instruction caches. In Real-Time and Embedded Technology and Applications Symposium, 2008. RTAS '08. IEEE, pages 80-89, 2008.
[33] Wei Zhang and Jun Yan. Accurately estimating worst-case execution time for multi-core processors with shared direct-mapped instruction caches. In Embedded and Real-Time Computing Systems and Applications, 2009. RTCSA '09. 15th IEEE International Conference on, pages 455-463, 2009.

Calculation of Worst-Case Execution Time for Multicore Processors using Deterministic Execution

Hamid Mushtaq, Zaid Al-Ars, Koen Bertels
Computer Engineering Laboratory
Delft University of Technology
Delft, the Netherlands
{H.Mushtaq, Z.Al-Ars, K.L.M.Bertels}@tudelft.nl

Abstract—Safety critical real time systems need to meet strict timing deadlines. We use a model checking based approach to calculate the WCET, where we apply optimizations to reduce the number of states stored by the model checker. Furthermore, we use deterministic shared memory accesses to further reduce the calculation time, memory and number of states needed for calculating the WCET. By optimizing the model checking code, we were able to complete benchmarks which otherwise had state explosion problems. Furthermore, by using deterministic execution, we significantly reduced the calculation time (up to 158x), memory (up to 89x) and states needed (up to 188x) for calculating the WCET, with a negligible increase (up to 4%) in the calculated WCET, for a multicore system with 4 cores. Lastly, unlike other state-of-the-art approaches, which find the WCET through a binary search requiring several iterations, our method calculates the WCET in just one iteration. Taking all these optimizations into consideration, the gain in speed was from 1775x to 2471x for 4 threads.

I. INTRODUCTION

Adapting multicore systems to real time embedded systems is a challenging task, as a real time process, besides being error free, must also meet timing deadlines. The real time scheduler needs to know the worst-case execution time (WCET) of each task. Finding a good (less pessimistic) WCET estimate of a task is much simpler if it runs on a single core processor than if it runs on a multicore processor concurrently with other tasks. This is because those tasks can share resources, such as a shared cache or a shared bus, and/or may need to concurrently read and/or write shared data.

Recently, there has been increased interest in solving the problem of finding the WCET of tasks running on multicore processors, ranging from on-chip hardware support to software solutions for commodity off-the-shelf (COTS) processors. But most of those do not take shared memory accesses into account. In [6], the authors do take shared memory accesses into account, but the state explosion problem of the model checking based approach they use limits its effectiveness.

In this paper, we investigate whether deterministic shared memory accesses [7] [8] reduce the possible number of states used by the model checker and therefore reduce the WCET calculation time. The contributions of this paper are as follows.

• Limiting the state space explosion problem by utilizing deterministic execution when calculating the WCET of a multithreaded program running on multicores using model checking.
• Implementing optimizations to further reduce the size of the state space as well as to get a tighter WCET estimation.
• Using only one iteration to calculate the WCET rather than performing binary search as used by the current state-of-the-art approaches (which requires several iterations).

In Section II, we discuss the background, while in Section III, we discuss the implementation. This is followed by Section IV, which discusses the performance evaluation. We finally conclude the paper with Section V.

II. BACKGROUND

Safety critical real time embedded systems need not only to be functionally correct but also to meet strict timing deadlines. For this purpose, it is necessary to calculate the WCET of their tasks. However, calculating the WCET is not straightforward for modern processors, due to features such as multi-level caches and out-of-order execution.

There are two methods of calculating the WCET, measurement based methods and static methods, as shown in Figure 1.

Figure 1. Methods used for WCET calculation (measurement based methods: measurement and simulation; static methods: model checking and static analysis)

Figure 2. Measurement based vs static methods (normalized distribution of execution times: the measured WCET underestimates the exact WCET, while the static WCET overestimates it)

In measurement based methods, we test the runtime by giving different inputs. However, it is often very difficult to test a program with all different inputs, and not checking the program with every possible input might give an underestimated WCET, as shown in Figure 2. A more appropriate approach is to use static methods, which can be classified into static-analysis and model-checking based. In static analysis, rather than running the program with different inputs, all possible paths are statically

checked for calculating the WCET. In static analysis, abstract interpretation [13] is often used to model the architectural features of a processor using an approximated model. With model checking, on the other hand, one can write code for a precise model of the processor. The result is a tighter WCET, but at a higher computational overhead.

Table I. COMPARISON OF DIFFERENT TOOLS

Tool      | Method used                      | L2 cache | Shared data | L1 CC
[10][11]  | Static analysis                  | +        | -           | -
[5]       | Static analysis & model checking | +        | -           | -
[15][16]  | Static analysis                  | -        | +           | -
[9]       | Model checking                   | +        | -           | -
[6], This | Model checking                   | +        | +           | +

In this section, we first describe the problem of WCET calculation (Section II-A), and then describe deterministic execution, which can be used to reduce the number of states during model checking (Section II-B).

A. WCET calculation

Modern processors have features such as cache hierarchies and out-of-order execution, which are meant to improve the average-case execution time of the programs running on them. However, these features make it much more difficult to determine a tight WCET. In addition, more complex architectures mean more states for a model checker to keep track of, making it more prone to state explosion problems. Despite these problems, there exist sophisticated tools, such as Chronos [14], that can estimate a good WCET for programs running on single core processors. Multicore systems, on the other hand, have an additional complexity due to shared resources, such as shared memories. With shared memory, tasks running on different cores also need to synchronize to access the shared data, for example by using locks. This makes it difficult to deduce tight WCET bounds for such systems. Synchronization of shared memory accesses also means that many different interleavings of the threads are possible, which further aggravates the problem of calculating the WCET. Such systems can have timing anomalies due to shared resources and shared memory accesses. For example, assume that a path ABD is the worst-case path if seen separately, where A, B and D are basic blocks. In the presence of a shared L2 cache, however, another path, say ACD, might become the worst-case path if a thread running on another core evicts more instructions from C than from B in the L2 cache. Therefore, when analyzing the WCET for a multicore, we always need to consider all the tasks running on the different cores together, which can significantly increase the complexity of the timing analysis.

Recently, several papers have been published which deal with calculating the WCET on multicore processors. A survey of those techniques is given in [12]. Some of them assume that there are no shared memory accesses by the tasks running on the different cores; in other words, they assume that the tasks run embarrassingly parallel to each other. They only cater for the problem of shared L2 cache accesses [10] [11] and the shared bus [5]. Papers like [15] and [16] do consider shared memory synchronization, but they assume simpler processor architectures which do not have any cache, only scratchpad memories. Such processors are not mainstream and require special programming techniques, since the scratchpad memories have to be managed manually by the programmer.

[6] considers both cache coherence and synchronization operations such as spin locks for shared memory accesses. The authors use UPPAAL [3] based model checking for that purpose. They do take shared memory accesses into account, but their solution suffers from the state explosion problem even for very simple programs. [9] also uses model checking but does not support synchronization operations. [17] recently proposed a mathematical model to determine the WCET of multicore systems with caches and cache coherence using abstract interpretation. However, they still do not consider the cache coherence generated by accessing the shared synchronization objects. Moreover, they do not perform any evaluation.

All the tools described above are shown in Table I, along with the methods they use and the platforms they are made for. There is only one tool [6] which considers shared data on systems with L1 cache coherence (L1 CC). However, as explained before, it is too slow, as it suffers from state explosion problems even for very small programs. Our tool improves upon that.

In this paper, we investigate whether the model checking overhead of calculating the WCET can be reduced using deterministic shared memory accesses [7] [8]. We use the SPIN model checker [1] with its associated language PROMELA for model checking. Moreover, unlike [6], which uses a very simple example, we use real benchmark programs written in C. We use Chronos [14] to compile those programs into assembly and also to construct the control flow graph (CFG). The assembly code and CFG are then used to generate the PROMELA code for the model checking that calculates the WCET.

B. Deterministic execution

Multithreaded programs have a frequent source of non-determinism in the form of shared memory accesses. Due to this, multithreaded programs can have many possible thread interleavings, which makes it difficult to find the WCET of such programs. We can remove this interleaving altogether if we know the input of the program and perform deterministic execution. Deterministic execution makes sure that the threads always perform shared memory synchronization in the same sequence. Even for multiple inputs, we can still reduce the possibilities, as explained by [4].

One such algorithm for deterministic execution is Kendo [7], which uses a logical clock for each thread to determine when a thread may acquire a lock: the thread with the least logical clock value gets the lock. For example, Thread 1 will be unable to acquire a lock while its logical clock (say 1029) is higher than that of Thread 2 (say 329); as soon as Thread 2's clock gets past 1029, Thread 1 may acquire the lock. With DetLock [8], it was shown that updating the clocks ahead of time improves performance compared to Kendo. Therefore, in this paper we also update the clocks ahead of time.
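A minimal C sketch of this logical-clock discipline (our illustration of the idea in [7], not Kendo's actual API; for brevity the clock here ticks only at lock release, whereas Kendo derives it from retired instruction counts, and ties are broken by the lower thread index, as in the hardware comparator described later in this paper):

    #include <pthread.h>
    #include <stdatomic.h>

    #define NUM_THREADS 2

    static atomic_ulong det_clock[NUM_THREADS];
    static pthread_mutex_t the_lock = PTHREAD_MUTEX_INITIALIZER;

    /* A thread may proceed only while it holds the globally
     * smallest logical clock (lowest index wins ties), so the
     * order of lock acquisitions is fixed for a given input. */
    static int has_turn(int self)
    {
        unsigned long mine = atomic_load(&det_clock[self]);
        for (int t = 0; t < NUM_THREADS; t++) {
            if (t == self)
                continue;
            unsigned long other = atomic_load(&det_clock[t]);
            if (other < mine || (other == mine && t < self))
                return 0;      /* someone else is logically behind us */
        }
        return 1;
    }

    void det_lock(int self)
    {
        while (!has_turn(self))
            ;                  /* spin until we are the logically oldest */
        pthread_mutex_lock(&the_lock);
    }

    void det_unlock(int self)
    {
        pthread_mutex_unlock(&the_lock);
        atomic_fetch_add(&det_clock[self], 1);  /* keep the clocks moving */
    }

Note how the spin loop reads every other thread's clock; this is exactly the global clock reading whose cache coherence cost motivates the hardware support assumed later in this paper.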


III. IMPLEMENTATION

In this section, we discuss the implementation of our tool. Firstly, in Section III-A, we discuss the architecture of the processor used. Next, in Section III-B, we discuss our method of finding the WCET using deterministic execution, while in Section III-C, we discuss an optimization applied to reduce the calculated WCET under deterministic execution. Finally, in Section III-D, we describe our method of model checking which avoids performing a binary search to calculate the WCET, as done by other state-of-the-art approaches.

A. Processor architecture

Since the focus of this paper is to see how much reduction in analysis time we get by using deterministic execution, we assume a simple processor model: every instruction takes one cycle, an L1 cache miss takes 10 cycles, and an L2 cache miss takes 80 cycles. Moreover, a taken branch costs 3 extra cycles. The architecture of the processor is shown in Figure 3. There are separate L1 caches for instructions and data, while the L2 cache is shared. We also assume that the instruction cache has as many read ports as there are cores. For cache coherence, we use the MESI cache coherence protocol. We also assume that every shared memory access takes place within a spinlock. For checking whether a certain memory access can cause a cache miss, we check the memory addresses. We check for shared L2 cache misses only for instructions, while assuming all the data is already in the L2 cache. This assumption is not unreasonable, since a typical L2 cache can be large enough to accommodate the data for one loop cycle of a real time process. Similarly, for the non-shared data, we assume it has already been brought into the L1 caches of the cores, since the benchmarks we used are small enough for the local data to fit in the L1 caches.

Figure 3. Architecture of the processor used

B. WCET calculation

As shown in Figure 4, we use the Chronos tool to extract the CFG and assembly code from the input C source. The CFG and assembly code, together with the processor model, are then used to generate the PROMELA code whose model checking calculates the WCET.

Figure 4. Steps for WCET calculation (input source file, Chronos, CFG and assembly, PROMELA code generation with the processor model, model checking, WCET)

We have two kinds of processes in our model checker. There is a core process for each of the cores, while a clock and bus process represents the processor clock and the shared bus which manages cache coherency. Figure 5 shows the communication channels between the clock and bus process and a core's process. Through the ClockBus2Core channel, the clock and bus process tells a core to either go ahead or wait. A core's process, on the other hand, sends the number of clock cycles it needs to advance to the clock and bus process on the Core2ClockBus channel, along with other information, such as the address of a shared memory access. Note that the only shared memory accesses we allow in our model are the shared mutexes and the shared variables within locks.

Figure 5. Block diagram of communication between a core process and the clock and bus process (channels ClockBus2Core and Core2ClockBus)

Since a model in which we explicitly synchronize each thread at each clock cycle is quite costly, we have devised a method to significantly reduce the overhead without introducing errors. The C-style pseudo-code for this purpose, which is part of the clock and bus process, is shown in Listing 1. Only the cores with the minimum clock are allowed to progress, while those which have advanced ahead have to wait. In this way, we make sure that the clocks of the cores stay synchronized, and yet avoid the overhead of explicit synchronization.

    atomic {
        min_clock = get_minimum_clock();
        for( pid = 0; pid < NUM_OF_PROC; pid++ ) {
            if( clock[pid] == min_clock )
                advance( pid );
    }}

Listing 1. Pseudo-code to advance clock cycles

For accessing instructions from the L2 cache, the C-style pseudo-code that makes sure the cores progress properly is shown in Listing 2. This code is part of a core's process. For example, if a core A experiences an L2 cache miss for a cache line, the next core B reading the same cache line will read it from the L2 cache, as core A will already have brought it in. However, since core B accesses that cache line at a later time, to model this properly we save that time in st (line 5), and the clock of core B is advanced using this value if its clock was less than that of core A (line 8). Here l1_mt is the L1 cache miss penalty, while l2_mt is the L2 cache miss penalty.

    1  atomic {
    2      line = get_cache_line_of_addr(addr);
    3      if (!in_l2_cache[line]) {
    4          in_l2_cache[line] = true;
    5          st[line] = clock[pid] + l2_mt;
    6          wait_for_cycles( pid, l2_mt + l1_mt );
    7      }
    8      else if ( clock[pid] <= st[line] ) {
    9          to_wait = st[line] - clock[pid] + l1_mt;
    10         wait_for_cycles( pid, to_wait );
    11     }
    12     else
    13         wait_for_cycles( pid, l1_mt );
    14 }

Listing 2. Pseudo-code to access the L2 cache for instructions

We also use a simplified cache coherence model for the shared data, which is read-modified only within locks, since only one core will be reading-modifying that data at a time.

We check the WCET both with and without deterministic execution.

Table II. BENCHMARK CHARACTERISTICS

Benchmark                        | Basic blocks | Cond branches | Locks | Max L2 cache misses
Fluidanimate (ComputeForcesMT)   | 11           | 2             | 2     | 12
Network (thread_ippktcheck)      | 20           | 11            | 2     | 16
Radiosity (radiosity_averaging)  | 6            | 3             | 3     | 12

Table IV. IMPROVEMENT BY UPDATING CLOCK AHEAD OF A LOCK ACQUISITION FOR THE RADIOSITY BENCHMARK

                   |        2 threads         |        4 threads
Parameters         | W/o opt | With opt       | W/o opt | With opt
WCET (cycles)      | 1440    | 1378 (1.04x)   | 1728    | 1516 (1.15x)
Calc time (secs)   | 0.17    | 0.1 (1.7x)     | 3.33    | 3.42 (0.97x)
Mem consumed (MB)  | 179.0   | 178.8 (1x)     | 192.28  | 201.3 (0.96x)
States (millions)  | 0.021   | 0.019 (1.11x)  | 0.21    | 0.34 (0.62x)

Figure 6. Optimization to improve deterministic execution, WCET and WCET calculation time (two threads with Ahead[0] = 40 and Ahead[1] = 45 acquiring locks Lk1 and Lk2 after stretches of 20 to 25 instructions)

For deterministic execution, we assume a hardware based comparator, as fully software based deterministic execution would cause substantial cache coherence activity due to reading the shared clocks. Each core writes to one of the input registers of the comparator. The comparator writes 1 to the output register whose corresponding input register contains the smallest clock value, while writing 0 to all the other output registers. In this way, a thread knows whether it may acquire a lock by just reading the value of its corresponding output register. In case two or more input registers hold the smallest value, the one with the least index writes 1 to its corresponding output register. Through this hardware, there is no need for threads to read the other threads' clocks, and no overhead is incurred due to the cache invalidations that would otherwise occur for maintaining cache coherence.

Since we assume a hardware based mechanism for deterministic execution, to have a fairer comparison between the deterministic and non-deterministic methods, we also compare deterministic execution with a method where we assume a hardware based synchronization mechanism, that is, where a lock can be acquired immediately, without the cache coherency overhead of a compare-and-swap operation on a shared variable.

C. Optimization of deterministic execution

We use the DetLock mechanism of updating the clocks for deterministic execution. One limitation of that method is that a thread cannot update its clock ahead of time while it is waiting for a lock. For a program with two threads, Thread 0 can acquire a lock only when its clock is less than or equal to that of Thread 1, as expressed by the condition below, where dt[0] and dt[1] are the logical clocks of Thread 0 and Thread 1 respectively.

    dt[0] ≤ dt[1]

To overcome the above mentioned limitation, besides the logical clock, we introduce two more variables for each thread: ahead and nl. Here ahead is used by a thread waiting for a lock to tell the other threads that it is not going to acquire a subsequent lock, different from the one it is waiting for, at least for the number of instructions assigned to ahead. On the other hand, nl is the number of the lock, or more precisely the address of the mutex in question. The condition for lock acquisition by Thread 0 now changes to the following (a C sketch of this condition is given after Section III-D below).

    dt[0] ≤ dt[1] + ahead[1] × (nl[1] ≠ nl[0])

This mechanism can be understood more easily through Figure 6. Suppose Thread 0 has reached the place where it is trying to acquire Lk2. With the DetLock-only approach, it would not be able to acquire that lock until Thread 1 has unlocked Lk1. With this new optimization, however, it can acquire Lk2 even if Thread 1 has not acquired lock Lk1 yet. Basically, when Thread 1 reaches the point where it is about to acquire Lk1, Thread 0 knows that Thread 1 is not going to acquire Lk2 until it has executed more instructions than Thread 0 has executed up till now. With this optimization, we can reduce the calculated WCET, albeit at the cost of a slightly increased WCET calculation time.

D. Avoiding binary search

Approaches using model checking to calculate the WCET, such as [9] and [6], use assertions to find the WCET, so they have to run the model checker several times in a binary search fashion to reach the right WCET value. Although [6] talks about using the sup operator of UPPAAL to avoid binary search, the current stable release of UPPAAL, which is version 4.0.13, does not support the sup operator; only versions 4.1 and greater do. That is why the authors of [6] discuss the sup operator only in the future work section of their paper. Our technique for avoiding binary search, on the other hand, works on the stable release version of SPIN.

We avoid performing binary search by logging the value of the elapsed time instead. We make use of the VAR_RANGES flag in SPIN to log the ranges of the variables. However, VAR_RANGES only gives ranges from 0-255. Since SPIN generates C files which are further compiled to make the executable model checking binary, it is possible to modify the C code to log the full integer value of the elapsed time. We have written a script that does exactly that by inserting code in the logval function to log the value of the elapsed time, the maximum of which is taken as the WCET. Running the model checker with the VAR_RANGES flag increases the calculation time, but it is still a much faster method than running several iterations to reach the WCET value through binary search.
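Returning to the acquisition condition of Section III-C, the following C sketch (our rendering, reusing the paper's dt, ahead and nl names; the function itself is illustrative, not the paper's implementation) generalizes it from two threads to N threads:

    #define NUM_THREADS 2

    /* Thread `self`, about to take the mutex numbered nl[self], may
     * proceed once no other thread can still reach this same mutex
     * with a smaller logical clock. */
    int may_acquire(int self,
                    const unsigned long dt[NUM_THREADS],
                    const unsigned long ahead[NUM_THREADS],
                    const int nl[NUM_THREADS])
    {
        for (int t = 0; t < NUM_THREADS; t++) {
            if (t == self)
                continue;
            unsigned long bound = dt[t];
            if (nl[t] != nl[self])
                bound += ahead[t];   /* t is parked on a different mutex and
                                        has promised ahead[t] instructions of
                                        progress before reaching this one */
            if (dt[self] > bound)
                return 0;            /* t might still get here first */
        }
        return 1;
    }

For two threads this reduces exactly to dt[0] ≤ dt[1] + ahead[1] × (nl[1] ≠ nl[0]) as stated above.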

IV. PERFORMANCE EVALUATION

In this section, we discuss the results that we achieved by applying the optimizations and using deterministic execution. Section IV-A discusses the results, while Section IV-B shows the further improvement achieved by avoiding the binary search to reach the WCET value.

A. Results

We selected three benchmarks. One is Fluidanimate from the PARSEC [18] benchmark suite, one is a Network protocol benchmark from EEMBC [2], and lastly we have Radiosity from SPLASH [19]. We only used portions of these applications, namely the portions that include shared memory accesses. The characteristics of those parts of the code are shown in Table II; the names of the functions from which the code is taken are also shown. To run these benchmarks, we used a computer with 96 GB of RAM. We used the DCOLLAPSE flag of SPIN for compressing memory. The results are shown in Table III.

Table III. PERFORMANCE RESULTS
(columns: Fluidanimate 2 cores, Fluidanimate 4 cores, Network 2 cores, Radiosity 2 cores, Radiosity 4 cores)

1. Deterministic with optimized clock updates and hardware support
   WCET (cycles):           1209 | 1209 | 1276 | 1378 | 1506
   Calculation time (secs): 0.11 | 25.1 | 1050 | 0.1 | 3.42
   Memory consumed (MB):    179.8 | 410.6 | 11414 | 178.8 | 201.3
   States (millions):       0.031 | 2.9 | 218.7 | 0.019 | 0.34

2. Non-deterministic with hardware support
   WCET (cycles):           1200 (0.99x) | 1200 (0.99x) | 1257 (0.99x) | 1346 (0.98x) | 1442 (0.96x)
   Calculation time (secs): 0.74 (6.7x) | 3980 (158x) | 7300 (7x) | 0.36 (3.6x) | 428 (125x)
   Memory consumed (MB):    194.2 (1.08x) | 36595.4 (89x) | 81919.9 (7.2x) | 182.3 (1.02x) | 3445.1 (17x)
   States (millions):       0.21 (6.8x) | 544.4 (188x) | 1576.6 (7.2x) | 0.053 (2.79x) | 46.96 (138x)

3. Non-deterministic without hardware support
   WCET (cycles):           1254 (1.04x) | ∞ | ∞ | 1444 (1.05x) | ∞
   Calculation time (secs): 7.78 (71x) | ∞ | ∞ | 7.6 (76x) | ∞
   Memory consumed (MB):    289.5 (1.61x) | ∞ | ∞ | 337.8 (1.89x) | ∞
   States (millions):       2.03 (65x) | ∞ | ∞ | 1.63 (86x) | ∞

4. Deterministic without optimized clock updates and with hardware support
   WCET (cycles):           1224 (1.01x) | 1364 (1.13x) | 1297 (1.02x) | 1509 (1.1x) | 1889 (1.25x)
   Calculation time (secs): 0.46 (4.18x) | 414 (16x) | 9450 (9x) | 0.32 (3.2x) | 4.49 (1.31x)
   Memory consumed (MB):    187.8 (1.04x) | 3653.3 (8.9x) | 81919.9 (7.2x) | 180.8 (1.01x) | 208.9 (1.04x)
   States (millions):       0.13 (4.19x) | 57.8 (20x) | 1645.37 (7.5x) | 0.041 (2.16x) | 0.47 (1.38x)

Using only the approach of [6], without applying the optimizations discussed in Section III-B, none of the benchmarks could complete, due to the state explosion problem. With our optimizations, most of the configurations could finish. Those configurations that still could not finish are indicated with an ∞ mark in Table III.

The first panel in Table III shows the results for deterministic execution with clocks updated ahead of time. The second panel shows the results for non-deterministic execution with hardware based lock acquisition, which does not require cache coherence for the shared mutexes. Hardware based lock acquisition is added to have a fair comparison with the deterministic case, because we used hardware based deterministic execution to avoid the excessive cache coherency that comes with software based deterministic execution. The third panel shows non-deterministic execution with normal software based lock acquisition, which does involve cache coherence for the shared mutexes. Lastly, the fourth panel shows the results for deterministic execution that updates the clock not ahead of time but after execution, like Kendo.

From Table III, we can see that deterministic execution with optimized clock updates (clocks updated ahead of time) gives the best results in terms of calculation time, memory consumed and number of states stored. Introducing deterministic execution does, however, increase the WCET slightly, due to the extra code included in the programs; the increase in WCET is not more than 4% for the selected benchmarks. The improvement in calculation time, memory consumed and number of states stored scales with the number of threads used. For example, for the Fluidanimate benchmark, the calculation time, memory consumed and number of states were reduced by as much as 158x, 89x and 188x respectively for 4 threads. From Panel 3 of Table III, we can see that the lack of hardware support causes the cache coherency for shared mutexes to significantly increase calculation time, memory and states. The comparison of non-deterministic execution with respect to the deterministic version is also illustrated in Figure 7, where the bars on the left (in white) show the overhead of non-deterministic execution with hardware support, while the colored bars on the right show the overhead without hardware support. In cases where the latter could not complete, we leave that column empty, that is, no bar is drawn. In that figure, CT represents calculation time, MEM represents memory consumed and ST represents the number of states used by the model checker.

Figure 7. Slowdown with non-deterministic execution w.r.t. deterministic execution (Panels 2 & 3 in Table III), per benchmark and thread count, for calculation time (CT), memory (MEM) and states (ST), with and without hardware based locks

Figure 8. Slowdown of updating clocks after execution (Panel 4 in Table III)

Our method of updating clocks ahead of time also significantly improves both the WCET and the calculation time, as compared to updating clocks after execution. The improvement in WCET comes from the fact that by updating clocks ahead of time, we reduce the waiting time of a thread waiting for a lock. That waiting also increases the possible number of states, thus increasing the calculation time and the memory consumed. The slowdown caused by updating clocks after execution is illustrated in Figure 8.

In Table IV, we show the improvement in WCET that we observed by applying the optimization discussed in Section III-C, that is, by using the ahead and nl variables to allow a thread to proceed with lock acquisition even when another thread is waiting for a lock but has a lower logical clock value. We show the numbers for the Radiosity benchmark for both 2 and 4 threads. The other two benchmarks do not have different mutexes, so we could not apply this optimization to them. In the W/o opt column, we use the basic DetLock mechanism of updating the clock ahead of time, while in With opt, we also use the ahead and nl variables. In the With opt column, we also show the improvement (>1x) or degradation (<1x) of all the parameters as compared to the W/o opt column. From the table, for 4 threads, we can see an improvement in WCET at the cost of an increased number of states, memory consumption and calculation time. However, these numbers are still much better than in the non-deterministic case (see Table III).

B. Improvement by avoiding binary search

In Section III-D, we discussed how we can avoid binary search by modifying the C code generated by SPIN to include code that logs the elapsed time value. This avoids performing binary search as done by the state-of-the-art approaches that use model checking to calculate the WCET, such as [9] and [6]. Table V shows the overall speedup for 4 threads, including the speedup that comes from avoiding the binary search. The column titled W/o VR shows the calculation time without using VAR_RANGES (see Section III-D for a discussion of the VAR_RANGES flag), while the With VR column shows the calculation time when using it. Next we show the speedup that we achieved with deterministic execution (DE) with clocks updated ahead of time, followed by the number of iterations needed to reach the WCET if binary search (BS) is used. The overall speedup is then calculated using the following formula.

    (W/o VR / With VR) × DE speedup × BS iterations

Table V. IMPROVEMENT BY AVOIDING BINARY SEARCH (FOR 4 THREADS)

Benchmark    | W/o VR (secs) | With VR (secs) | DE speedup | BS iterations | Overall speedup
Fluidanimate | 18.7          | 25.1           | 158x       | 21            | 2471x
Radiosity    | 2.70          | 3.42           | 125x       | 18            | 1775x

From the table, we can see that for the Fluidanimate benchmark, the overall speedup is as high as 2471x.
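To make the formula concrete, plugging in the Fluidanimate row of Table V gives (18.7 / 25.1) × 158 × 21 ≈ 0.745 × 158 × 21 ≈ 2472, i.e. the reported 2471x up to rounding of the intermediate speedups; likewise, the Radiosity row gives (2.70 / 3.42) × 125 × 18 ≈ 1776, matching the reported 1775x.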
SIGPLAN Not., 44(3):97–108, March 2009. states, memory consumption and calculation time. However, [8] H. Mushtaq, Z. Al-Ars, and K. Bertels. Detlock: Portable and efficient deterministic these numbers are still much better than the non-deterministic execution for shared memory multicore systems. In High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion:, pages 721–730, case (see Table III). 2012. B. Improvement by avoiding binary search [9] Lan Wu and Wei Zhang. A model checking based approach to bounding worst- case execution time for multicore processors. ACM Trans. Embed. Comput. Syst., In Section III-D, we discussed how we can avoid binary 11(S2):56:1–56:19, August 2012. [10] Jun Yan and Wei Zhang. Wcet analysis for multi-core processors with shared l2 search by modifying the C code generated by SPIN to include instruction caches. In RTAS,2008. the code to log the elapsed time value. This method avoids [11] Wei Zhang and Jun Yan. Accurately estimating worst-case execution time for performing binary search as done by the state of the art multi-core processors with shared direct-mapped instruction caches. In RTCSA, approaches that use model checking to calculate WCET, such 2009. [12] Hamid Mushtaq and Zaid Al-Ars and Koen Bertels. Accurate and efficient as [9] and [6]. Table V shows the overall speedup for 4 threads, identification of worst-case execution time for multicore processors: A survey. In including that which comes from avoiding the binary search. IDT 2013. The column titled W/o VR shows the calculation time without [13] S. Bygde. Static WCET Analysis Based on Abstract Interpretation and Counting of using VAR RANGES (See Section III-D to see discussion Elements. Lic. dissertation, School of Innovation, Design and Engineering, March 2010. about the VAR RANGES flag), while the With VR column [14] Xianfeng Li, Yun Liang, Tulika Mitra, Abhik Roychoudhury. Chronos: A Timing shows the calculation time by using it. Next we show the Analyzer for Embedded Software. Science of Computer Programming, Special speedup that we achieved with deterministic execution (DE) issue on Experimental Software and Toolkit, 69(1-3), December 2007. with clocks updated ahead of time, followed by the number of [15] Haluk Ozaktas, Christine Rochange, and Pascal Sainrat. Automatic WCET Analysis of Real-Time Parallel. In 13th International Workshop on Worst-Case iterations used to reach WCET if binary search (BS) is used. Execution Time Analysis, 2013. The overall speedup is then calculated by using the following [16] Mike Gerdes, Theo Ungerer and Rudolf Knorr. Timing Analysable Synchronisation formula. Techniques for Parallel Programs on Embedded Multi-Cores. Phd Thesis. [17] Sudipta Chattopadhyay. Time-predictable Execution of Embedded Software on (W ithout V R/W ith VR) DE speedup BS iters × × Multi-core Platforms. Phd Thesis. [18] C. Bienia, S. Kumar, J. P. Singh, and K. Li. The parsec benchmark suite: From the table, we can see that for the Fluidanimate characterization and architectural implications. In PACT, 2008. benchmark, the overall speedup is as high as 2471x. [19] S.C. Woo, M. Ohara, E. Torrie, J.P. Singh, and A. Gupta. The SPLASH-2 Programs: Characterization and Methodological Considerations. In ISCA, 1995. V. CONCLUSIONS [20] J. Gustafsson, B. Lisper, C. Sandberg and N. Bermudo. A tool for automatic flow analysis of C-programs for WCET calculation. In WORDS, 2003. 

7. CONCLUSIONS AND FUTURE RESEARCH

In this chapter, we first discuss the conclusions of this thesis in Section 7.1 and then give recommendations for future work in Section 7.2.

7.1. CONCLUSIONS
In this thesis, we discussed two different techniques for improving the reliability of multicore systems: record/replay and deterministic multithreading. Besides that, we also showed how the calculation time of WCET can be brought down significantly by using deterministic multithreading. Below, we discuss the conclusions that we drew from Chapters 2 to 6.

CHAPTER 2
In this chapter, we gave a survey of different fault tolerance techniques employed for shared memory multicore systems. With the advent of nano-scale technology, transistor sizes have shrunk considerably. However, smaller transistors are more prone to transient and permanent faults. Fault tolerant systems can employ redundancy to check for errors. We highlighted the difficulties in tolerating faults for multicore systems. The main conclusions are the following.

1. Synchronization operations for shared memory accesses are a frequent source of non-determinism. This creates problems in fault tolerant systems for multithreaded programs running on multicore systems, as the redundant processes must execute identically in the absence of any fault. A minimal example of such non-determinism is sketched after this list.

2. To make sure redundant processes execute identically, we can use deterministic multithreading, which can be either language based or runtime based. The problem with the language based techniques, though, is that they have a steep learning curve, and it is easy to introduce deadlocks. Therefore, it is preferable to use runtime based methods.
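
To make the first point concrete, consider the following minimal pthreads sketch (our own illustration, not one of the benchmarks used in this thesis). Two threads race for the same mutex; the operating system scheduler decides which one acquires it first, so even two fault-free redundant copies of this program can produce different outputs.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static char order[3];   /* records the lock acquisition order */
static int pos = 0;

static void *worker(void *arg)
{
    /* Nothing constrains which thread wins this race: the lock
     * acquisition order is non-deterministic across runs. */
    pthread_mutex_lock(&lock);
    order[pos++] = *(const char *)arg;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, "A");
    pthread_create(&t2, NULL, worker, "B");
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%s\n", order);   /* "AB" on some runs, "BA" on others */
    return 0;
}

Deterministic multithreading forces the same order on every run, so redundant processes observe identical lock acquisition orders and any divergence between them can be attributed to a fault.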

CHAPTER 3
In this chapter, we discussed our own implementation of fault tolerance for shared memory multicore systems using record/replay. We showed how our method is faster, more portable and more scalable than state-of-the-art approaches. The following conclusions can be drawn from that chapter.

1. Reducing the number of atomic operations used to access the record/replay queue makes it more scalable, as this reduces contention between the cores.

2. The queue can be further optimized by eliminating true and false sharing of the cache lines.

3. The cost of error detection in user-space can be reduced by segmenting memory into multiple pages, so that fewer page faults are needed to identify modified memory.

4. The hardware CRC instruction supported by modern processors can significantly reduce the time needed for memory comparison using the checksum method, as illustrated in the sketch at the end of this section.

The overhead incurred by our approach ranges from 0% to 18% for the selected benchmarks. This is lower than comparable systems published in the literature.
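
As an illustration of the CRC-based comparison, the fragment below (an illustrative sketch, not the implementation from Chapter 3) uses the CRC32 instruction of SSE4.2-capable x86-64 processors, available through the _mm_crc32_u64 intrinsic, to checksum a memory region. Comparing two regions, for example the same page in two redundant processes, then reduces to comparing two small checksums. It assumes the region length is a multiple of 8 and must be compiled with -msse4.2.

#include <nmmintrin.h>   /* SSE4.2 intrinsics: _mm_crc32_u64 */
#include <stdint.h>
#include <stddef.h>

/* Checksum a memory region 8 bytes at a time with the hardware
 * CRC32C instruction (region length assumed a multiple of 8). */
static uint32_t crc32c_region(const void *buf, size_t len)
{
    const uint64_t *words = buf;
    uint64_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len / 8; i++)
        crc = _mm_crc32_u64(crc, words[i]);
    return (uint32_t)crc ^ 0xFFFFFFFFu;
}

/* Error detection: instead of comparing regions byte by byte, the
 * redundant processes only exchange and compare checksums. */
static int regions_match(const void *a, const void *b, size_t len)
{
    return crc32c_region(a, len) == crc32c_region(b, len);
}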

CHAPTER 4
In this chapter, we discussed our own implementation of deterministic multithreading, which, like record/replay, can be used for deterministic execution of redundant processes running on multicore systems for fault tolerance. The following conclusions can be drawn from that chapter.

1. Besides being more portable, a compiler based deterministic multithreading approach can outperform an approach that relies on hardware performance counters for logical clocks, because the compiler based approach can update those logical clocks ahead of time (a simplified sketch of such clock-based locking is given at the end of this section).

2. It is possible to make deterministic multithreading much more efficient if we assume there are no race conditions in the program.

3. Runtime deterministic multithreading would always be limited in scalability due to the requirement of global communication among the threads.

For 4 cores, the average overhead of these clocks on the tested benchmarks is brought down from 16% to 2% by applying several optimizations. Moreover, the average overall overhead, including deterministic execution, is 14%. We also employed our deterministic multithreading scheme for fault tolerance.
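
The following sketch shows the idea behind logical-clock based deterministic locking as used by Kendo and DetLock; it is a simplification with names of our own choosing, not the actual DetLock code. Each thread carries a logical clock, and a thread may acquire a lock only when its clock is the global minimum (ties broken by thread id), so the acquisition order is a deterministic function of the clocks. In the real scheme, the clocks also advance during ordinary execution through the compiler-inserted updates discussed above, which update them ahead of time and prevent threads from stalling one another; unlocking is an ordinary pthread_mutex_unlock.

#include <pthread.h>
#include <stdatomic.h>

#define NTHREADS 4

/* Per-thread logical clocks; compiler-inserted code advances a
 * thread's clock as it executes (ahead of time in DetLock). */
static _Atomic unsigned long clocks[NTHREADS];

/* Acquire mutex m deterministically as thread `me`: wait until our
 * logical clock is the global minimum (ties broken by thread id),
 * so the lock acquisition order is the same in every run. */
static void det_lock(pthread_mutex_t *m, int me)
{
    for (;;) {
        unsigned long mine = atomic_load(&clocks[me]);
        int is_min = 1;
        for (int t = 0; t < NTHREADS; t++) {
            unsigned long other = atomic_load(&clocks[t]);
            if (t != me && (other < mine || (other == mine && t < me)))
                is_min = 0;
        }
        if (is_min)
            break;   /* deterministically our turn */
    }
    pthread_mutex_lock(m);
    /* Move our clock past the tie so the next thread can proceed. */
    atomic_fetch_add(&clocks[me], 1);
}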

CHAPTER 5
In this chapter, we compared the performance of record/replay with deterministic multithreading for fault tolerance. Moreover, we created a hardware based version of our deterministic multithreading method (DetLock). The following conclusions can be drawn from that chapter.

1. Record/replay is generally faster. However, the benefit of deterministic multithreading is that it does not require communication among the redundant processes for synchronization operations, which improves its isolability, a property that is crucial for reliability and fault tolerance.

2. For some benchmarks, our compiler based deterministic multithreading approach, combined with our hardware, outperforms fully hardware based deterministic multithreading approaches.

For record/replay, the overhead is less than 25% for all benchmarks, and approximately 0% for some of them. For deterministic multithreading, the overhead depends on the usage of shared memory: with low shared memory usage, the overhead is less than 25%, while with high usage, it can reach up to 56%. We also compared our deterministic multithreading scheme (DetLock) to other approaches, such as Kendo, and showed an overhead reduction of up to 189% for 8 threads. Even when Kendo is assisted with hardware, the hardware assisted version of DetLock can improve performance by up to 26% over Kendo.

CHAPTER 6
In this chapter, we used deterministic multithreading to calculate the WCET of a multithreaded real-time process. The following conclusions can be drawn from the chapter.

1. With deterministic multithreading, we can significantly reduce the WCET calculation time, as the possible number of states a program can reach is reduced.

2. However, for deterministic multithreading to reduce WCET calculation time, we need a hardware based mechanism for comparing logical clocks. This is needed because software based approaches would increase cache coherence activity and therefore increase the WCET calculation time.

By optimizing the model checking code, we are able to complete benchmarks that otherwise suffer from state explosion. Furthermore, by using deterministic execution, we significantly reduce the calculation time (up to 158x), the memory (up to 89x) and the states needed (up to 188x) for calculating WCET, with a negligible increase (up to 4%) in the calculated WCET for a multicore system with 4 cores. Lastly, unlike other state-of-the-art approaches that perform binary search to find the WCET by running several iterations, our method calculates the WCET in just one iteration. Taking all these optimizations into consideration, the gain in speed of WCET calculation ranges from 1775x to 2471x for 4 threads.
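
The single-iteration calculation can be pictured with the following schematic fragment (names and structure are our own assumptions; in the actual approach, the logging is added to the C verifier code that SPIN generates, as discussed in Chapter 6). Rather than asserting a guessed bound and bisecting it over many exhaustive runs, one exhaustive run keeps the maximum modeled elapsed time over all explored states, which is the WCET.

/* Schematic only: hook invoked for every state explored by the
 * model checker; `elapsed` is the modeled execution time stored
 * in the current state (an assumed name). */
static unsigned long wcet;   /* running maximum over all states */

static void log_elapsed(unsigned long elapsed)
{
    if (elapsed > wcet)
        wcet = elapsed;   /* one exhaustive run yields the WCET */
}

/* A binary search approach would instead re-run the verifier with
 * assert(elapsed <= guess) and bisect `guess`, paying one full
 * state space exploration per iteration. */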

7.2. FUTURE RESEARCH
The following topics can be considered for future research.

Combining runtime deterministic multithreading with language based deterministic multithreading
In this thesis, we limited our evaluation to 8 cores. In the future, we would like to have an efficient deterministic multithreading method that works for many cores. As we increase the number of cores, the overhead increases significantly. Therefore, for manycore systems, we may need to combine language based deterministic multithreading with runtime based multithreading. As discussed in Chapter 2, the disadvantage of using only language based multithreading is the increased programming difficulty. In addition, there is a high probability of introducing deadlocks into the code. By combining the runtime method with the language based method, it would be possible to introduce small changes in popular languages, such as C, to ensure deterministic execution, thus making it easier to write multithreaded programs that execute deterministically.

Evaluate our fault tolerant scheme with fault injection
In this thesis, we did not evaluate the fault coverage of our fault tolerant scheme. In the future, we would like to conduct fault injection experiments to do so.

Deterministic multithreading for mission-critical systems
In this thesis, we ran the redundant processes on the same machine. However, for mission-critical systems, it makes more sense to use separate machines. In the future, we could evaluate the performance of deterministic execution across multiple machines. Moreover, we used the technique of checkpoint/rollback with double modular redundancy. For mission-critical systems, however, it makes more sense to use triple modular redundancy with a faster recovery method.

Combining model checking with static analysis for calculating WCET with deterministic multithreading
In this thesis, we used a model checking based approach to calculate WCET. However, the model checking based approach does not scale well, even with deterministic multithreading. An approach that combines static analysis with model checking could be used to make it more scalable. With such an approach, it may become possible to calculate the WCET of real world programs running on existing multicore processors.

LIST OF PUBLICATIONS

INTERNATIONAL JOURNALS
1. Mushtaq, H.; Al-Ars, Z.; Bertels, K., Efficient and highly portable deterministic multithreading (DetLock), Computing, vol. 96, pp. 1131-1147, 2014

2. Mushtaq, H.; Al-Ars, Z.; Bertels, K., Fault Tolerance on Multicore Processors using Deterministic Execution, submitted to ACM Transactions on Computer Systems.

INTERNATIONAL CONFERENCES
1. Mushtaq, H.; Al-Ars, Z.; Bertels, K., Survey of fault tolerance techniques for shared memory multicore/multiprocessor systems, Design and Test Workshop (IDT), 2011 IEEE 6th International, pp. 12-17, 11-14 Dec. 2011

2. Mushtaq, H.; Al-Ars, Z.; Bertels, K., A user-level library for fault tolerance on shared memory multicore systems, Design and Diagnostics of Electronic Circuits & Systems (DDECS), 2012 IEEE 15th International Symposium on, pp. 266-269, 18-20 April 2012

3. Mushtaq, H.; Al-Ars, Z.; Bertels, K., DetLock: Portable and Efficient Deterministic Execution for Shared Memory Multicore Systems, High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion, pp. 721-730, 10-16 Nov. 2012

4. Mushtaq, H.; Al-Ars, Z.; Bertels, K., Efficient software-based fault tolerance approach on multicore platforms, Design, Automation & Test in Europe Conference & Exhibition (DATE), 2013, pp. 921-926, 18-22 March 2013

5. Mushtaq, H.; Al-Ars, Z.; Bertels, K., Fault tolerance on multicore processors using deterministic multithreading, Design and Test Symposium (IDT), 2013 8th International, 16-18 Dec. 2013

6. Mushtaq, H.; Al-Ars, Z.; Bertels, K., Accurate and efficient identification of worst-case execution time for multicore processors: A survey, Design and Test Symposium (IDT), 2013 8th International, 16-18 Dec. 2013

7. Mushtaq, H.; Sabeghi, M.; Bertels, K., A Runtime Profiler: Toward Virtualization of Polymorphic Computing Platforms, Reconfigurable Computing and FPGAs (ReConFig), 2010 International Conference on, pp. 144-149, 13-15 Dec. 2010

8. M. Sabeghi, H. Mushtaq, K.L.M. Bertels, Runtime Multitasking Support on Reconfigurable Accelerators, 1st International Workshop on Highly-Efficient Accelerators and Reconfigurable Technologies (HEART 2010), 1 June 2010, Tsukuba, Japan


CURRICULUM VITÆ

Hamid MUSHTAQ

Hamid Mushtaq was born in Karachi, Pakistan in 1980. He received his Bachelor's degree in Electrical Engineering with the President's Gold Medal (for being the best in academics) in 2002. Thereafter, he worked for several different governmental and private companies in Karachi for 5 years, mostly in the field of embedded software. In 2008, he came to Delft to pursue his master's degree in Embedded Systems, which he completed in 2010. The same year, after finishing his master's degree, he started working as a PhD student in the department of Computer Engineering, under the supervision of Dr. Zaid Al-Ars and Prof. Koen Bertels. His research interests include runtime systems, compilers, parallel programming, reliable systems and big data.
