High-Performance Computing Methods in Large-Scale Power System Simulation

Lukas Razik Institute for Automation of Complex Power Systems


High-Performance Computing Methods in Large-Scale Power System Simulation

Von der Fakultät für Elektrotechnik und Informationstechnik der Rheinisch-Westfälischen Technischen Hochschule Aachen zur Erlangung des akademischen Grades eines Doktors der Ingenieurwissenschaften genehmigte Dissertation

vorgelegt von

Dipl.-Inform. Lukas Daniel Razik

aus Hindenburg

Berichter: Univ.-Prof. Antonello Monti, Ph. D. Univ.-Prof. Dr.-Ing. Andrea Benigni

Tag der mündlichen Prüfung: 8. Mai 2020

Diese Dissertation ist auf den Internetseiten der Universitätsbibliothek online verfügbar.

Bibliographische Information der Deutschen Nationalbibliothek: Die Deutsche Nationalbibliothek verzeichnet diese Publikation in der Deutschen Nationalbibliografie; detaillierte bibliografische Daten sind im Internet über http://dnb.d-nb.de abrufbar.

D 82 (Diss. RWTH Aachen University, 2020)

Herausgeber: Univ.-Prof. Dr.ir. Dr. h. c. Rik W. De Doncker Direktor E.ON Energy Research Center

Institute for Automation of Complex Power Systems (ACS) E.ON Energy Research Center Mathieustraße 10 52074 Aachen

E.ON Energy Research Center I 81. Ausgabe der Serie ACS I Automation of Complex Power Systems

Copyright Lukas Razik Alle Rechte, auch das des auszugsweisen Nachdrucks, der auszugsweisen oder vollständigen Wiedergabe, der Speicherung in Datenverarbeitungsanlagen und der Übersetzung, vorbehalten.

Printed in Germany

ISBN: 978-3-942789-80-6 1. Auflage 2020

Verlag: E.ON Energy Research Center, RWTH Aachen University Mathieustraße 10 52074 Aachen Internet: www.eonerc.rwth-aachen.de E-Mail: [email protected]

Zusammenfassung

In der seit 2009 geltenden Erneuerbare-Energien-Richtlinie der Europäischen Union haben sich die Mitgliedsstaaten darauf verständigt, dass der Anteil erneuerbarer Energien bis 2020 bei mindestens 20 % des Energieverbrauchs liegen soll. Die damit einhergehende wachsende Zahl von erneuerbaren Energieerzeugern wie Photovoltaik- und Windkraftanlagen führt zu einer vermehrt dezentralen Stromerzeugung, die ein komplexeres Stromnetzmanagement erfordert. Um dennoch einen sicheren Netzbetrieb zu gewährleisten, findet ein Wandel von konventionellen Stromnetzen zu sogenannten Smart Grids statt, bei denen z. B. nicht nur Statusinformationen der Stromerzeuger, sondern auch der Verbraucher (z. B. Wärmepumpen und Elektrofahrzeuge) in das Netzmanagement einbezogen werden. Die Nutzung von Flexibilitäten auf Erzeugungs- und Nachfrageseite und der Einsatz von Energiespeichern zur Erreichung einer stabilen und wirtschaftlichen Stromversorgung erfordert neue Lösungen für die Planung und den Betrieb von Smart Grids. Andernfalls können Veränderungen an den Systemen des öffentlichen Energiesektors (Stromnetz, IKT-Infrastruktur, Energiemarkt usw.) zu unerwarteten Problemen und damit auch zu Stromausfällen führen. Computersimulationen können deswegen helfen, das Verhalten von Smart Grids bei Veränderungen abzuschätzen, ohne das Risiko negativer Folgen bei unausgereiften Lösungen oder Inkompatibilitäten einzugehen.

Die wesentliche Zielsetzung der vorliegenden Dissertation ist die Anwendung und Analyse von Methoden des High-Performance Computings (HPC) und der Informatik zur Verbesserung von (Co-)Simulationssoftware elektrischer Energiesysteme, um komplexere Komponentenmodelle sowie größere Systemmodelle in angemessener Zeit simulieren zu können. Durch die zunehmende Automatisierung und Regelung in Smart Grids, die immer höheren Anforderungen an deren Flexibilität und die Notwendigkeit einer stärkeren Marktintegration der Verbraucher werden Stromnetzmodelle immer komplexer. Die Simulationen erfordern daher eine immer höhere Leistungsfähigkeit der eingesetzten Rechnersysteme. Der Schwerpunkt der Arbeiten liegt deshalb auf der Verbesserung verschiedener Aspekte moderner und derzeit entwickelter Simulationslösungen. Dabei sollten jedoch keine neuen Simulationskonzepte oder -anwendungen entwickelt werden, die ein Hochleistungsrechnen auf Supercomputern oder großen Computerclustern erst erforderlich machen würden.

Vielmehr werden in dieser Dissertation die Integrationen moderner direkter Löser für dünnbesetzte lineare Systeme in verschiedene Stromnetzsimulations-Backends und die anschließenden Analysen mithilfe von großskaligen Stromnetzmodellen vorgestellt. Darüber hinaus wird eine neue Methode zur automatischen grobgranularen Parallelisierung von Stromnetz-Systemmodellen auf Komponentenebene präsentiert. Neben solchen konkreten Anwendungen von HPC-Methoden auf Simulationsumgebungen wird auch eine vergleichende Analyse verschiedener HPC-Ansätze zur Leistungssteigerung Python-basierter Software mithilfe von (Just-in-Time-)Kompilierern vorgestellt, da Python – in der Regel eine interpretierte Programmiersprache – im Bereich der Softwareentwicklung im Energiesektor immer beliebter wird. Im Weiteren stellt die Dissertation die Integration einer HPC-Netzwerktechnologie auf Basis des offenen InfiniBand-Standards in ein Software-Framework vor, das für die Kopplung verschiedener Simulationsumgebungen zu einer Co-Simulation und für den Datenaustausch in Hardware-in-the-Loop (HiL) Aufbauten genutzt werden kann.

Für die Verarbeitung von Energiesystemtopologien durch Simulationsumgebungen, auf denen die oben genannten HPC-Methoden angewendet wurden, ist die Unterstützung eines standardisierten Datenmodells notwendig. Die Dissertation behandelt daher auch das Common Information Model (CIM), wie in IEC 61970 / 61968 standardisiert, welches für die Spezifikation von Datenmodellen zur Repräsentierung von Energiesystemtopologien verwendet werden kann. Zunächst wird ein gesamtheitliches Datenmodell vorgestellt, das für Co-Simulationen des Stromnetzes mit dem zugehörigen Kommunikationsnetz und dem Energiemarkt durch eine Erweiterung von CIM entwickelt wurde. Um eine nachhaltige Entwicklung von CIM-bezogenen Softwaretools zu erreichen, wird im Folgenden eine automatisierte (De-)Serializer-Generierung aus CIM-Spezifikationen vorgestellt. Die Deserialisierung von CIM-Dokumenten ist ein Schritt, der für die anschließend entwickelte Übersetzung von CIM-basierten Netztopologien in simulatorspezifische Systemmodelle genutzt wird, die ebenfalls in dieser Dissertation behandelt wird.

Viele der vorgestellten Erkenntnisse und Ansätze können auch zur Verbesserung anderer Software im Bereich der Elektrotechnik und darüber hinaus genutzt werden. Zudem wurden alle in der Dissertation vorgestellten Ansätze in öffentlich zugänglichen Open-Source-Softwareprojekten implementiert.


Abstract

In the Renewables Directive of the European Union, in effect since 2009, the member states agreed that the share of renewable energy should be 20 % of the total energy consumption by 2020. The concomitantly growing number of renewable energy producers such as photovoltaic systems and wind power plants leads to a more decentralized power generation, which results in a more complex power grid management. To nevertheless ensure a secure power grid operation, conventional power grids are being transformed into so-called smart grids where, for instance, not only status information of power producers but also of consumers (e. g. heat pumps and electric vehicles) is included in the power grid management. The utilization of flexibility on the generation and demand side and the use of energy storage systems for achieving a stable and economic power supply require new solutions for the planning and operation of smart grids. Otherwise, changes to the systems in the public energy sector (i. e. power grid, information and communications technology (ICT) infrastructure, energy market, etc.) can lead to unexpected problems such as power failures. Computer simulations can therefore help to estimate the behavior of smart grids in response to changes without the risk of negative consequences in case of immature solutions or incompatibilities.

The main objective of this dissertation is the application and analysis of high-performance computing (HPC) and computer science methods for improving power system (co-)simulation software to allow simulating more detailed models in a time appropriate for the particular use case. Through more automation and control in smart grids, the higher demand for flexibility, and the need for a stronger market integration of consumers, power system models become more and more complex. This requires an ever greater performance of the utilized computer systems. The focus was on the improvement of different aspects of state-of-the-art and currently developed simulation solutions. The intention was not to develop new simulation concepts or applications that would make large-scale HPC on supercomputers or large computer clusters necessary.

The dissertation presents the integration of modern direct solvers for sparse linear systems in various power grid simulation back-ends and subsequent analyses with the aid of large-scale power grid models. Furthermore, a new method for an automatic coarse-grained parallelization of power grid system models at component level is shown. Besides such concrete applications of HPC methods to simulation environments, a comparative analysis of various HPC approaches for the performance improvement of Python-based software with the aid of (just-in-time) compilers is presented, as Python – usually an interpreted programming language – is becoming increasingly popular in the area of power system related software. Moreover, the dissertation shows the integration of an HPC interconnect solution based on InfiniBand – an open standard – in a software framework for the coupling of different simulation environments to a co-simulation and for Hardware-in-the-Loop (HiL) setups.

The processing of power system topologies by the simulation environments to which the aforementioned HPC methods were applied requires the support of a standardized data model. Therefore, the dissertation concerns the Common Information Model (CIM) as standardized, inter alia, in IEC 61970 / 61968, which can be used for the specification of data models representing power system topologies. At first, a holistic data model is introduced that was developed for co-simulations of the power grid with the associated communication network and the energy market by extending CIM. To achieve a sustainable development of CIM-related software tools, an automated (de-)serializer generation from CIM specifications is presented. The deserialization of CIM documents is a step needed for the subsequently developed template-based translation from CIM to simulator-specific system models, which is also covered in this dissertation.

Many of the presented findings and approaches can be used for improving further software in the area of electrical engineering and beyond. Moreover, all presented approaches were implemented in publicly accessible open-source software projects.

Acknowledgement

I would like to thank the following people:

My doctoral supervisor, Prof. Antonello Monti, for the guidance and the support of my initiatives throughout my whole time as doctoral student at the Institute for Automation of Complex Power Systems, my second reviewer, Prof. Andrea Benigni, for the kind feedback on my dissertation manuscript, and Prof. Ferdinanda Ponci for the helpful feedback and support regarding my scientific publications.

My colleagues Jan Dinkelbach, for reading the manuscript (especially the boring parts) and the great support during my way from a computer scientist to an engineer, Markus Mirz, for a great cooperation as well as the inclusion of my humble self in interesting additional projects and activities, Steffen Vogel for the assistance in software-technical matters, Simon Pickartz for the sophisticated LaTeX template, and Stefan Dähling for proofreading the final version.

All student researchers and students who participated in the research and development related to this dissertation.

The Réseau de Transport d'Électricité co-workers Adrien Guironnet and Gautier Bureau for a successful and enjoyable cooperation.



Vor allem möchte ich meinen Eltern danken, die auf vieles verzichtet und es mir durch Ihre Unterstützung erst ermöglicht haben, diesen beruflichen Weg zu beschreiten. Zu guter Letzt danke auch Dir, mein Schatz, für Deine Unterstützung und Geduld während meiner Promotionszeit!

Aachen, May 2020 Lukas Daniel Razik


Contents

Acknowledgement viii

List of Publications xv

1 Introduction 1
1.1 Challenges in Smart Grids 1
1.2 Large-Scale Multi-Domain Co-Simulation as a Solution 3
1.3 Contribution 6
1.4 Outline 11

2 Multi-Domain Co-Simulation 13
2.1 Fundamentals and Related Work 14
2.1.1 Architecture and Topology Data Model 14
2.1.2 Common Information Model 15
2.1.3 Simulation of Smart Grids 16
2.1.4 Classification of Simulations 16
2.2 Use Case 17
2.3 Challenges 18
2.4 Concept of the Co-Simulation Environment 19
2.4.1 Holistic Topology Data Model 19
2.4.2 Model Data Processing and Simulation Setup 22
2.4.3 Synchronization 23
2.4.4 Co-Simulation Runtime Interaction 24
2.5 Validation by Use Case 26
2.6 Conclusion 27

3 Automated De-/Serializer Generation 29
3.1 CIM Formalisms and Formats 31
3.2 CIM++ Concept 33


3.3 From CIM UML to Compilable C++ Code 35
3.3.1 Gathering Generated CIM Sources 37
3.3.2 Refactoring Generated CIM Sources 38
3.3.3 Primitive CIM Data Types 40
3.4 Automated CIM (De-)Serializer Generation 41
3.4.1 The Common Base Class 41
3.4.2 Integrating an XML Parser 42
3.4.3 Unmarshalling 43
3.4.4 Unmarshalling Code Generator 46
3.4.5 Marshalling 49
3.5 libcimpp Implementation 50
3.6 Evaluation 50
3.7 Conclusion and Outlook 51

4 From CIM to Simulator-Specific System Models 55
4.1 CIMverter Fundamentals 57
4.1.1 Modelica 57
4.1.2 Template Engine 59
4.2 CIMverter Concept 59
4.3 CIMverter Implementation 62
4.3.1 Mapping from CIM to Modelica 63
4.3.2 CIM Object Handler 64
4.4 Modelica Workshop Implementation 65
4.4.1 Base Class of the Modelica Workshop 66
4.4.2 CIM to Modelica Object Mapping 66
4.4.3 Component Connections 67
4.5 Evaluation 68
4.6 Conclusion and Outlook 70

5 Modern LU Decompositions in Power Grid Simulation 75
5.1 LU Decompositions in Power Grid Simulation 76
5.1.1 From DAEs to LU Decompositions 76
5.1.2 LU Decompositions for Linear System Solving 78
5.1.3 KLU, NICSLU, GLU, and Basker by Comparison 80
5.2 Analysis of Modern LU Decompositions for Electrical Circuits 83
5.2.1 Analysis on Benchmark Matrices from Large-Scale Grids 84
5.2.2 Analysis on Power Grid Simulations 92
5.3 Conclusion and Outlook 95


6 Exploiting Parallelism in Power Grid Simulation 97
6.1 Parallelism in Simulation Models 98
6.1.1 Task Scheduling 100
6.1.2 Task Parallelization in DPsim 106
6.1.3 System Decoupling 110
6.2 Analysis of Task Parallelization in DPsim 111
6.2.1 Use Cases 112
6.2.2 Schedulers 113
6.2.3 System Decoupling 117
6.2.4 Compiler Environments 122
6.3 Conclusion and Outlook 124

7 HPC Python Internals and Benefits 127
7.1 HPC Python Fundamentals 129
7.1.1 Classical Python 130
7.1.2 PyPy 136
7.1.3 Numba 139
7.1.4 Cython 143
7.2 Benchmarking Methodology 147
7.3 Comparative Analysis 150
7.4 Conclusion and Outlook 155

8 HPC Network Communication for HiL and RT Co-Simulation 157
8.1 VILLAS Fundamentals 158
8.2 InfiniBand Fundamentals 159
8.2.1 InfiniBand Architecture 161
8.2.2 OpenFabrics Software Stack 165
8.3 Concept of InfiniBand Support in VILLAS 167
8.3.1 VILLASnode Basics 167
8.3.2 Original Read and Write Interface 167
8.3.3 Requirements on InfiniBand Node-Type Interface 170
8.3.4 Memory Management of InfiniBand Node-Type 171
8.3.5 States of InfiniBand Node-Type 172
8.3.6 Implementation of InfiniBand Node-Type 173
8.4 Analysis of the InfiniBand Support in VILLAS 175
8.4.1 Service Types of InfiniBand Node-Type 178
8.4.2 InfiniBand vs. Zero-Latency Node-Type 181
8.4.3 InfiniBand vs. Existing Server-Server Node-Types 182
8.5 Conclusion and Outlook 183

9 Conclusion 185
9.1 Summary and Discussion 185


9.2 Outlook 189

A Code Listings 193
A.1 Exploiting Parallelism in Power Grid Simulation 193

B Python Environment Measurements 195
B.1 Execution Times 195
B.2 Memory Space Consumption 197

List of Acronyms 201

Glossary 207

List of Figures 209

List of Tables 213

Bibliography 215

List of Publications

Journal Articles

[DRM20] S. Dähling, L. Razik, and A. Monti. "OWL2Go: Auto-generation of Go data models for OWL ontologies with integrated serialization and deserialization functionality". In: To appear in SoftwareX (2020).

[Raz+19b] L. Razik, N. Berr, S. Khayyam, F. Ponci, and A. Monti. "REM-S–Railway Energy Management in Real Rail Operation". In: IEEE Transactions on Vehicular Technology 68.2 (Feb. 2019), pp. 1266–1277. doi: 10.1109/TVT.2018.2885007.

[Kha+18] S. Khayyamim, N. Berr, L. Razik, M. Fleck, F. Ponci, and A. Monti. "Railway System Energy Management Optimization Demonstrated at Offline and Online Case Studies". In: IEEE Transactions on Intelligent Transportation Systems 19.11 (Nov. 2018), pp. 3570–3583. issn: 1524-9050. doi: 10.1109/TITS.2018.2855748.

[Mir+18] M. Mirz, L. Razik, J. Dinkelbach, H. A. Tokel, G. Alirezaei, R. Mathar, and A. Monti. "A Cosimulation Architecture for Power System, Communication, and Market in the Smart Grid". In: Hindawi Complexity 2018 (Feb. 2018). doi: 10.1155/2018/7154031.

[Raz+18a] L. Razik, M. Mirz, D. Knibbe, S. Lankes, and A. Monti. "Automated deserializer generation from CIM ontologies: CIM++ — an easy-to-use and automated adaptable open-source library for object deserialization in C++ from documents based on user-specified UML models following the Common Information Model (CIM) standards for the energy sector". In: Computer Science - Research and Development 33.1 (Feb. 2018), pp. 93–103. issn: 1865-2042. doi: 10.1007/s00450-017-0350-y.

[Raz+18b] L. Razik, J. Dinkelbach, M. Mirz, and A. Monti. "CIMverter — a template-based flexibly extensible open-source converter from CIM to Modelica". In: Energy Informatics 1.1 (Oct. 2018), p. 47. issn: 2520-8942. doi: 10.1186/s42162-018-0031-5.

[Gre+16] F. Gremse, A. Höfter, L. Razik, F. Kiessling, and U. Naumann. "GPU-accelerated adjoint algorithmic differentiation". In: Computer Physics Communications 200 (2016), pp. 300–311. issn: 0010-4655. doi: 10.1016/j.cpc.2015.10.027.

[Fin+09b] R. Finocchiaro, L. Razik, S. Lankes, and T. Bemmerl. "Low-Latency Linux Drivers for Ethernet over High-Speed Networks". In: IAENG International Journal of Computer Science 36.4 (2009).

Book Chapters

[Fin+10] R. Finocchiaro, L. Razik, S. Lankes, and T. Bemmerl. "Transparent Integration of a Low-Latency Linux Driver for Dolphin SCI and DX". In: Electronic Engineering and Computing Technology. Ed. by S.-I. Ao and L. Gelman. Dordrecht: Springer Netherlands, 2010, pp. 539–549. isbn: 978-90-481-8776-8. doi: 10.1007/978-90-481-8776-8_46.

Conference Articles

[Raz+19a] L. Razik, L. Schumacher, A. Monti, A. Guironnet, and G. Bureau. "A comparative analysis of LU decomposition methods for power system simulations". In: 2019 IEEE Milan PowerTech. June 2019, pp. 1–6.

[Vog+17] S. Vogel, M. Mirz, L. Razik, and A. Monti. "An Open Solution for Next-generation Real-time Power System Simulation". In: 2017 IEEE Conference on Energy Internet and Energy System Integration (EI2). Nov. 2017, pp. 1–6. doi: 10.1109/EI2.2017.8245739.


[Pic+16] S. Pickartz, N. Eiling, S. Lankes, L. Razik, and A. Monti. "Migrating LinuX Containers Using CRIU". In: High Performance Computing. Ed. by M. Taufer, B. Mohr, and J. M. Kunkel. Cham: Springer International Publishing, 2016, pp. 674–684. isbn: 978-3-319-46079-6.

[Var+11] E. Varnik, L. Razik, V. Mosenkis, and U. Naumann. "Fast Conservative Estimation of Hessian Sparsity". In: Fifth SIAM Workshop on Combinatorial Scientific Computing, May 19–21, 2011, Darmstadt, Germany. May 2011, pp. 18–21.

[Fin+09a] R. Finocchiaro, L. Razik, S. Lankes, and T. Bemmerl. "ETHOM, an Ethernet over SCI and DX Driver for Linux". In: Proceedings of 2009 International Conference of Parallel and Distributed Computing (ICPDC 2009), London, UK. 2009.

[Fin+08] R. Finocchiaro, L. Razik, S. Lankes, and T. Bemmerl. "ETHOS, a generic Ethernet over Sockets Driver for Linux". In: Proceedings of the 20th IASTED International Conference. Vol. 631. 017. 2008, p. 239.


1 Introduction

In 1993, German newspapers carried announcements saying that sun, water, and wind would not cover more than 4 percent of the electricity demand even in the long run [Büt16]. As early as 2007, their share of the electricity supply amounted to 14.2 percent. This was also the year when the first official definition of smart grid was provided by the Energy Independence and Security Act of 2007 [Con07], which was approved by the US Congress in January 2007. Meanwhile, the term smart grid is used worldwide for research, development, and investment programs with regard to technology innovations and the expansion of power grids. The principal approaches for the transformation of conventional power grids into smart grids were developed between 2005 and 2008 by the expert Advisory Council of the European Technology Platform to establish a conceptual basis for a secure grid integration of significant electrical generation capacities on the basis of renewable, mostly volatile and weather-dependent energy sources.

1.1 Challenges in Smart Grids

Smart grids particularly require an improved coordination of grid operation and grid user behavior with the aid of information and communications technology (ICT), with the objective of ensuring a sustainable, economical, reliable, secure, and eco-friendly power supply in an environment of increased energy efficiency and decreased greenhouse gas emissions. For instance, Smart Distribution plays a major role in the area

of smart grids. It can be divided into three pillars with the following challenges [BS14a; BS14b]:

1. Automation and remote control of local distribution grids: e. g. voltage control at distribution level (traditional as well as including the grid users), possibilities of power flow control, accelerated fault location and resumption of normal grid operation, as well as enhanced protection concepts;

2. Flexibility by virtual power plants (VPPs): i. e. demand side management and benefits of VPPs in a prospective market organization;

3. Smart Metering and market integration of consumers: i. e. dynamic tariffs, demand side response, and electromobility.

Since the efficiency of proper solutions can be improved by collecting and analyzing information (i. e. data), a research field called Energy Informatics was established around 2012, with conferences such as the DACH+ Conference on Energy Informatics and the ACM e-Energy Conference. Furthermore, Centers for Energy Informatics, for example at the University of Southern Denmark and the University of Southern California, were founded to address ICT challenges of smart grids, e. g. with the help of artificial intelligence and machine learning approaches. However, new approaches and international standards in the area of ICT are not sufficient. Besides regulatory (i. e. legal) aspects, the introduction of new market rules is also necessary. A successful realization of the European goals for the reduction of greenhouse gases, the increase of energy efficiency, as well as the continuously rising use of renewable energy sources requires a harmonized design of the interrelations between all participants in the process of electrical power supply [BS14a; BS14b]. Both the proponents and the opponents of renewable energies agree: the contribution of 37.8 percent from renewable energy sources to gross electricity consumption in Germany [Umw19] can be significantly increased in the long term only with a utilization of flexibility (on the generation and demand side) and the use of energy storage systems. Since modifications of the subsystems involved in the energy sector (i. e. power grid, ICT infrastructure, energy market, etc.) involve the risk of technical and economic faults such as destabilizations, major changes should not be made without an accurate analysis of possible effects on the power system. Computer simulations can help to estimate the behavior of systems in response to modifications with the aid of mathematical models to avoid negative consequences that could occur in real systems.

In the following, different kinds of power system simulation are introduced, and it is motivated why large-scale multi-domain co-simulation is a solution to the three pillars of smart grid challenges presented here.

1.2 Large-Scale Multi-Domain Co-Simulation as a Solution

There are different types of (co-)simulations depending on their goals. Depending on the considered aspects, simulation types can be classified by

• mathematical models: e. g. pure algebraic equations for steady-state observations or ordinary differential equations (ODEs) for dynamic observations;

• simulation time: which, e. g., can be continuous for "floating" physical processes or discrete for events that occur at particular points in time, marking a change of the system's state;

• orchestration: which, e. g., can be hybrid, when multiple system models from different domains are simulated by the same solver, or a co-simulation, when multiple system models are computed by different simulation solvers which are coupled (i. e. exchange information during the simulation) [Sch+15].

Obviously, this list is not complete. The next sections, however, shall provide a more general overview of simulation types with some of their goals to motivate the contribution of this work. First, it shall be differentiated between online and offline simulations.

Online Simulation

Online simulations are, e. g., performed for steady-state security assessment (SSA) and dynamic security assessment (DSA). In case of SSA, power flow simulations of a sequence of (n-1)-states are needed to examine the abidance of the principle that, under the predicted maximum transmission and supply responsibilities, grid security is ensured even when a component such as a transformer or a line unexpectedly becomes inoperative. The DSA is based on dynamic simulation, which supplements the steady-state grid security calculations with calculations of power plant dynamics in case of nearby short circuits and grid equipment outages. These dynamic stability calculations can be very time-consuming. In Germany this means that, for a timely availability of DSA results to be used as a decision aid for the dispatcher, all (n-1)-scenarios should be available within 5 minutes.


This means that around 100 dynamic stability calculations must be accomplished within this time frame (i. e. on average only about 3 seconds per calculation if they were executed sequentially). Such a real-time (RT) requirement is very challenging and hence requires an intelligent management of the calculation cases as well as low simulation execution times [BS14a; BS14b].

Offline Simulation

Offline simulations (steady-state, electromagnetic transient (EMT) simulation, etc.) are performed, e. g., for grid expansion planning, maintenance planning, commissioning of new operating equipment, and so forth. As offline simulations are not performed simultaneously with grid operation, they do not have any RT requirements. Nevertheless, low simulation execution times are important to obtain simulation results in acceptable time frames when many use cases or scenarios (e. g. the same power grid with various switching events changing its topology during simulation) have to be simulated, or in case of simulation models with thousands of nodes.

Large-Scale Simulation

Such simulations with several thousand nodes, called large-scale, become important when simulation environments shall also be applicable to real-world scenarios rather than to lab experiments only. Though there are commercial simulation tools which allow large-scale power grid simulations for certain use cases, they have a significant disadvantage: they are closed source, and thus changes to existing models (i. e. component models) or to the solvers are often not possible. However, further development of models is an essential concern of scientific research to adapt them for future applications in smart grids, as the lack of inertia in power grids caused by a decreasing share of big power generators and more distributed energy resources (DERs) can lead to frequency instabilities that cannot be simulated by conventional models. Hence, at the Institute for Automation of Complex Power Systems (ACS) new methods and concepts are implemented in open-source software which can be used and improved by everyone. Here, it should be noted that not only publicly funded scientific facilities can benefit from open-source simulation software but also economic enterprises, of which some increasingly count on open-source alternatives instead of closed-source products. For instance, RTE-France, the French transmission system operator (TSO), also develops open-source simulation environments such as Dynaωo [Gui+18]. But, as in case of commercial software, compliance with international standards of associations such as CIGRE, IEC, IEEE, and VDE is crucial for the comparability

of solution approaches, study results, and applicability in existing system environments.

Co-Simulation

Especially in case of co-simulation – a definition is given by [Sch+15] – where multiple simulators are coupled together, standardized data models for the information exchange between them are usually necessary. There are single-domain and multi-domain co-simulations.

Single-domain co-simulations can be conducted with and without RT requirements. Without RT requirements, co-simulations can be useful if the involved simulators have complementary features but there is no need for a synchronization of the simulation time with the real time (i. e. the co-simulation time can run slower or faster than the wall clock). Particularly in the power grid domain, RT requirements can come into play, e. g. with (power) Hardware-in-the-Loop (HiL), Control-in-the-Loop (CiL), and Software-in-the-Loop (SiL) use cases, where a solution (i. e. an embedded system such as a control device) has to be connected to a simulated environment to verify its correct functioning within a real environment. A special case hereof is the geographically distributed real-time simulation (GD-RTS), which is based on the concept of a virtual interconnection of laboratories in real time [Mon+18]. In this concept, a monolithic simulation model is partitioned into subsystem portions that are simulated concurrently on multiple digital real-time simulators (DRTSs). As a result, comprehensive and large-scale real-world scenarios can be simulated for the validation of the interoperability of novel hardware and software solutions with the existing power grid and without the need for the various in-the-loop setups to be located at the same facility.

Multi-domain co-simulation in the following denotes a coupling of one or more power grid simulators with other simulators of different domains such as ICT, market, weather, and so on, to obtain a holistic view of the power system. Therefore, the term power system in this work does not stand for the power grid only but for the power grid together with any associated system such as the ICT infrastructure and the energy market in a holistic view. This is the key for an extensive analysis and understanding of smart grids as depicted in [BS14a; BS14b].

The previous sections motivated the use of large-scale single- and multi-domain co-simulation as a solution for the analysis and development of smart grids with a continually growing share of renewable energy sources. The merits of applying simulations during power system operation as well as for the planning of power systems, as has been done for decades, are undisputed.


Due to the three main challenges arising through the transition to smart grids (see Sect. 1.1) – more automation and control in local distribution grids (e. g. because of the needed digitalization), the higher demand for flexibility (e. g. by demand side management), and the need for a stronger market integration of consumers – the power system models become more and more complex. This requires an ever greater performance of the utilized computer systems.

1.3 Contribution

The main objective of this dissertation is the application of high-performance computing (HPC) methods in the area of Energy Informatics and their analysis for improving power system (co-)simulation software to allow simulating more complex component models as well as larger system models in an appropriate time. While in the past processor performance increased continuously with increasing CPU clock rates, since around 2005 this has no longer been the case because of the power wall [Bos11]. From then on, computer performance was increased by a growing number of cores per processor and by accelerators such as, e. g., graphics processing units (GPUs) and Intel Xeon Phi adapters. Nowadays, the power draw is not a problem of central processing units (CPUs) only but also of whole supercomputers. Therefore, while the trend to more parallelism continues, HPC system designers are more and more turning to hardware architectures and accelerators with high power efficiency (usually measured in FLOPS per watt) like GPUs, Advanced RISC Machines (ARM) processor based systems, or field-programmable gate array (FPGA) accelerators [Gag+19]. As in case of multi-core and manycore systems with special instruction sets for performance improvements (e. g. vector instructions), software nowadays must be adapted continuously to make use of such new hardware features and accelerators. Under these circumstances, the focus is on the improvement of different aspects of state-of-the-art and currently developed simulation solutions in academia as well as in enterprises. Thus, the intention was not to develop new simulation concepts or applications that would make large-scale HPC on supercomputers or large computer clusters necessary. However, especially the computer and network hardware of modern commodity clusters is in the focus of the contribution.

Figure 1.1 shows the real-world challenge of an improved coordination of smart grid operation and grid user behavior. This is addressed by a solution based on an appropriate and therefore increasingly complex modeling as well as (co-)simulation for smart grid planning and operation.

[Figure 1.1: Contribution overview of this work. The transition of conventional power grids to smart grids poses the challenge that smart grids require an improved coordination of grid operation and grid user behavior; the solution is appropriate and more complex modeling and (co-)simulation for smart grid planning and operation. The figure maps the contributions of Chapters 2 to 8 onto the three aspects modeling, simulation, and information exchange, all resting on high-performance computing and energy informatics.]

The three major aspects of the solution, to which the contribution of this work refers, are modeling, simulation, and information exchange. Arrows from bottom to top illustrate the contribution of this work to these major aspects of large-scale power system (co-)simulation.

On the one hand, mathematical component models become more complex, for example because of an increasing use of power electronics; on the other hand, the complexity of system models increases, for example because of new electrical equipment and facilities, in case of smart grids ever more often with connections to other domains such as ICT, weather, mechanics, energy market, etc., which also need to be simulated. Therefore, a contribution of this thesis is the presentation of a multi-domain co-simulation architecture with a holistic (i. e. multi-domain) topology data model which is based on the Common Information Model (CIM) as standardized by IEC 61970 / 61968 / 62325, describing terms in the energy sector and relations between them. CIM plays an important role as it belongs to the IEC core semantic standards for smart grids [LE].

CIM makes use of the Unified Modeling Language (UML), which is state-of-the-art in computer science for the specification of classes and their relationships in object-oriented software design. This thesis therefore contributes the concept of an automated (de-)serializer generation from a specification based on UML. Among others, the automated code generation process implements a CIM data model in C++ according to the given UML specification. It can be applied whenever the UML specification changes between its versions, which usually happens a couple of times per year. This avoids manual changes in a code base with currently around one thousand classes and many relations between each other, which would be very time-consuming and error-prone. The resulting (de-)serializer allows reading CIM documents into C++ according to the CIM-based data model, modifying the data in the main memory, and writing the data back into CIM documents.

Due to CIM's fine granularity over several abstraction levels, a component (e. g. a power transformer) consists of many CIM objects. This is a reason why a mapping from CIM to a simulator-specific system model is intricate. However, when a mapping to a system model of a certain simulator is achieved, the mapping often can also be used for system models of different simulators. Therefore, a template-based mapping from CIM to system models is proposed. The templates allow a specification of how model parameters from a CIM document have to be written into the simulator-specific system model target format. The advantage of templates is that if the system model format is written in a given language (e. g. Modelica), the templates are written in the same language with placeholders for the data from a CIM object to be mapped. Therefore, the user does not need to learn another language for specifying the system model format.
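To make the template idea more concrete, the following minimal Python sketch fills a Modelica-like line component declaration with parameter values read from a CIM object. It is an illustration only: the component name PiLine, the parameter names, and the use of Python's string.Template are assumptions made for this sketch and are not the actual CIMverter templates, which are presented in Chapter 4.

    from string import Template

    # Illustrative sketch only: a Modelica-like line component written as a
    # template with named placeholders.  The component name "PiLine", the
    # parameter names, and the use of string.Template are assumptions for
    # this example and not the actual CIMverter templates.
    LINE_TEMPLATE = Template(
        '  PiLine $name(R = $r, X = $x, B = $b)\n'
        '    "translated from CIM ACLineSegment $mrid";\n'
    )

    def render_line(cim_line: dict) -> str:
        # Fill the placeholders with parameters read from a CIM object.
        return LINE_TEMPLATE.substitute(
            name=cim_line["name"], mrid=cim_line["mRID"],
            r=cim_line["r"], x=cim_line["x"], b=cim_line["bch"],
        )

    print(render_line({"mRID": "_L1", "name": "line1",
                       "r": 0.2, "x": 0.4, "bch": 1e-6}))

The point of such a design is that the output format lives entirely in the template text, so adapting the generated system model only requires editing the template, not the converter code.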

Simulation environments make use of various approaches for the transition from a system model to a system of mathematical equations, which is solved by the simulation solver. In case of power grid simulations, for instance, the resistive companion approach in combination with the Newton(-Raphson) method can be applied, which results in a linear system of equations (LSE) for each time step and Newton iteration. In another approach, all component models can be combined into a differential-algebraic system of equations (DAE) which is then passed to a DAE solver that finally linearizes it to LSEs as well. For power grid simulations, LSEs typically are very sparse (i. e. the fraction of non-zero elements in the matrix is typically much less than 1 %) and therefore require appropriate LSE solvers. The contribution of this work is a comparative analysis of several modern LU decompositions for the solution of sparse LSEs coming from power grids against KLU¹, which is a well-established LU decomposition for electric circuits and therefore taken as the reference. The LU decompositions concerned are called modern as they are developed especially for current multi-core or massively parallel computer architectures. The comparison is based on benchmark matrices that arose during power grid simulation and on simulations performed by existing simulation environments in which the most promising LU decompositions were integrated.

¹ The "K" in KLU stands for "Clark Kent", which is the civilian identity behind the fictional superhero Superman. This is an allusion to SuperLU, which is a well-known LU decomposition for sparse linear systems [DP10].
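As a minimal illustration of the kind of problem these libraries solve, the following Python sketch factorizes a small sparse conductance matrix and solves G v = i, as has to be done in every time step and Newton iteration. The 4-node matrix is invented for this example, and SciPy's splu(), which wraps SuperLU, merely stands in here for KLU, NICSLU, GLU, or Basker; it is not the setup used in the analyses of Chapter 5.

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    # Sketch of the sparse linear system G*v = i that nodal analysis yields
    # in every time step; the 4-node conductance matrix is made up for
    # illustration.
    G = sp.csc_matrix(np.array([
        [ 2.0, -1.0,  0.0,  0.0],
        [-1.0,  3.0, -1.0,  0.0],
        [ 0.0, -1.0,  3.0, -1.0],
        [ 0.0,  0.0, -1.0,  2.0],
    ]))
    i = np.array([1.0, 0.0, 0.0, -1.0])

    lu = spla.splu(G)   # sparse LU factorization of G
    v = lu.solve(i)     # cheap forward/backward substitution per right-hand side
    print(v)

Because the sparsity pattern of such matrices usually stays the same across time steps, parts of the factorization work can be reused and only the numerical factorization and the triangular solves have to be repeated, a property that solvers for circuit matrices typically exploit.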

There are various methods of expressing parallelism in power system simulation. On the one hand, the processing within a simulation solver can be parallelized, for instance with the aid of a certain parallel programming paradigm in the solver's programming language (e. g. with parallel constructs using OpenMP in C++ [Ope19b]). Similarly, on the other hand, parallelism in a system model can also be expressed with the aid of a formalism for parallel structures in the model (e. g. with parallel constructs in the modeling language ParModelica [Geb+12]). Besides such an explicit expression of parallelism in the solver or model, it is also possible to extract parallelism, e. g., from mathematical models at equation level, which is a variant of already existing automatic fine-grained parallelization of mathematical models. The contribution of this work is, however, the introduction of an automatic exploitation of parallelism in system models at component level, therefore called an automatic coarse-grained parallelization of mathematical models. For this coarse-grained parallelization, parallel task schedulings are introduced. Accordingly, various task schedulers allow the parallel processing of tasks related to component models within one simulation step. An analysis of the whole implementation shows the execution time speedups with respect to different scheduling methods and other modeling and software technical aspects.

Power system simulation requires not only the simulation itself but also data processing before the simulation (e. g. load and generation profiles), during the simulation (e. g. data exchanged between simulators), and after the simulation (e. g. simulation results). Since Python, as a modern and relatively easy-to-learn scripting language, is enjoying ever growing popularity among programming beginners, many power engineers program diverse parts of software projects in the area of power system simulation in Python. Especially the pre- and postprocessing of simulation data is performed in Python, while the simulation cores are often programmed in other programming languages such as C++. Sometimes execution times of (usually interpreted) Python applications are too long for given use cases, and there is not enough time or a lack of know-how to port the Python application to a more runtime-efficient language such as, e. g., C++. Admittedly, there are Python modules, just-in-time (JIT) compilers, and Python language extensions which allow improving the runtime efficiency of Python programs, but their internals and benefits are rather unknown. The contribution of this work is therefore an overview and comparative analysis of the most popular approaches for the performance improvement of Python, not necessarily with the aid of parallelization (e. g. multithreading).

Co-simulations as well as HiL setups require an information exchange between simulators as well as between devices and simulators. Especially in case of RT applications, short latencies in the information exchange can be crucial. To reduce latencies, HPC interconnects, in contrast to commonly used interconnects, provide connection modes in which data is directly transmitted to or read from the main memory of a remote server without involving the operating system or a process running on the remote server, as is usually the case. Therefore, a contribution of this thesis is the presentation of InfiniBand (IB), a widely-used HPC network communication standard, and its integration in a state-of-the-art software framework that can, for instance, be freely used for hard RT coupling of devices with simulators in case of HiL setups as well as for the coupling of simulators in case of hard RT co-simulations with very low latencies.

All the contributed approaches were implemented or integrated in existing or new open-source software projects which can be used and investigated. Moreover, the concepts and analyses introduced in this work for an improved modeling, simulation, and information exchange shall support other researchers, developers, and users of (co-)simulation software.
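As a small illustration of the Python acceleration approaches compared in Chapter 7, the sketch below JIT-compiles a numerical loop with Numba. The kernel, a plain dense matrix-vector product, is a made-up stand-in for the benchmark algorithms used later; it only shows the usage pattern, not a result of the comparative analysis.

    import numpy as np
    from numba import njit

    # The @njit decorator compiles the function to machine code on its first
    # call; later calls bypass the interpreter.  The kernel is a made-up
    # stand-in for a numerical hot loop.
    @njit(cache=True)
    def matvec(Y, v):
        n = v.shape[0]
        out = np.zeros(n)
        for k in range(n):
            for m in range(n):
                out[k] += Y[k, m] * v[m]
        return out

    Y = np.random.rand(500, 500)
    v = np.random.rand(500)
    print(matvec(Y, v)[:3])   # the first call includes the compilation time

Chapter 7 discusses when such approaches pay off and why Python threads are not always executed truly in parallel.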


1.4 Outline

Chapter 2 shows the benefits of multi-domain co-simulation and introduces an appropriate co-simulation environment for the three smart grid domains power grid, communication network, and energy market, developed in the research project SINERGIEN. This SINERGIEN environment is the starting point for several approaches, concepts, and analyses which are presented in the following chapters. The usage of UML for the specification of CIM allows extending it to a holistic topology data model that is used for the SINERGIEN co-simulation environment with simulators for the three mentioned domains.

Chapter 3 presents the automated (de-)serializer generation from a specification based on UML. The automated deserializer generation is implemented in the CIM++ software project, which can map CIM, as specified by UML, to a C++ code base, also implementing the holistic CIM-based data model. The thus created open-source software library allows reading and writing arbitrary CIM-based documents in C++.

Chapter 4 shows the approach on how CIM-based documents for power grid topology representation can be translated into simulator-specific system models with the aid of template documents. In the SINERGIEN environment this became necessary for the power grid simulator based on Modelica to run simulations of power grid topologies stored in CIM-based documents, as CIM is used more and more by distribution system operators (DSOs) and TSOs. The translation from CIM to a simulator-specific model was implemented in the open-source software CIMverter. It uses template documents, making it possible to modify the simulator-specific system models to be outputted in case the input format of the target simulator changes, e. g. because of a newer version which allows setting more parameters or new component models to be included in a system model. This allows a flexible adaption of the translation from CIM to a supported simulator-specific model without a recompilation of CIMverter, which is also shown in this chapter.

Chapter 5 outlines the comparative analysis of several modern LU decompositions for sparse linear systems. In the first part of the analysis they are compared by means of different benchmark matrices arising from simulations of large-scale power grids. This analysis was a help for deciding which LU decomposition is worth being integrated into existing simulation environments. In the second part, the most promising modern decomposition (after its integration) is compared with the reference decomposition by simulations with both a fixed time step and a variable time step solver. Therefore, these LU decompositions were i. a. integrated into the DAE

solver used by the open-source simulation environments OpenModelica and Dynaωo.

Chapter 6 presents the approach for exploiting parallelism in power grid simulation from the newly introduced type of approaches described as automatic coarse-grained parallelization of mathematical models, for a higher performance through the thereby enabled parallel computations in power system simulators. This approach is applied to a newly developed open-source power grid simulator called DPsim. At first, the implemented parallelization approach is categorized into the existing parallelism categories of simulation models. Moreover, an overview of formally defined scheduling methods for the parallel processing of data-independent tasks is provided. A performance analysis of the implemented task parallelization methods follows.

Chapter 7 provides an overview of the internals of HPC approaches to improve the runtime of Python applications and a comparative analysis of these approaches. The comparative analysis is based on various benchmark algorithms of different algorithm classes that were programmed in Python and in C++, as an efficient reference. This comparative analysis can help Python programmers to choose the right approach for increasing the performance of Python applications with or without multithreading, with threads that are executed truly in parallel, which is not always the case in Python, as will be explained, too.

Chapter 8 presents the integration of an HPC network communication into HiL and RT co-simulation. The HPC interconnect solution chosen for the integration in the open-source VILLASframework, which can be utilized for the setup of HiL simulations and (even hard) RT coupling of DRTSs, is based on IB. IB was chosen as it is an open standard that is implemented by various manufacturers. The integration of IB is also compared with other communication methods provided by the VILLASframework.

Chapter 9 concludes the dissertation, providing a summary and discussion of all topics of this work. Moreover, it gives an overview of future work that can be conducted for an improvement of the introduced concepts as well as their analyses and implementations.

2 Multi-Domain Co-Simulation

More and more distributed energy resources (DERs) at distribution level cause bidirectional power flows between distribution and transmission level, which require changes in the related information and communications technology (ICT) and energy market mechanisms. The therewith associated extension of the measurement devices in lower voltage layers, for instance, requires appropriate communication network capabilities for meeting the requirements on the exchange of measurement data between the measurement devices and all involved entities such as control centers and substations. Therefore, electrical grids and the belonging communication networks should be planned holistically to take the interactions between both domains into account [Li+14]. Apart from that, new energy market models are developed for customers (i. e. prosumers) to empower them to a more active role in the exchange of energy with the grid [WH16] in a way that their behavior will be considered in grid operation [EFF15] and possibly vice versa. Given these facts, it is reasonable to include also the energy market simulation in the planning to get a holistic picture of future grids.

Future studies on power grids which integrate energy market mechanisms, the communication network, and the power grid are hampered by a lack of established modeling approaches encompassing the three domains, and there are only few tools which enable a joint simulation. In this chapter a comprehensive data model is presented together with a co-simulation architecture based on it. Both enable an investigation of dynamic interactions between power grid, communication network, and market. Such interactions can be technical constraints of the grid which require actions

on the market side as well as communication failures which affect the communication between grid and market, or market decisions that change the behavior of a generation unit or of energy prosumers connected to the grid. For this purpose, a data model based on the Common Information Model (CIM), as standardized in IEC 61970/61968/62325, was created to be able to describe an entire smart grid topology with components and actors from all three domains. This data model is called SINERGIEN_CIM as it resulted from the research project SINERGIEN. It allows the storage of the whole network topology with components from all three domains in a single well-defined data model, hiding some complexity of the simulation from the users. SINERGIEN_CIM-based topology descriptions are processed by the co-simulation architecture as presented in [Mir+18].

Some parts of the SINERGIEN co-simulation architecture will be addressed in the following as they are relevant for the research and development that is presented in the following chapters of this dissertation. After a section on the related work and another one on various use cases for multi-domain simulation, the challenges for the realization of the implemented SINERGIEN co-simulation environment are discussed. It follows a section about the concept and a further one on its validation by a use case. The chapter is concluded with final remarks in its last section. The work in this chapter has been partially presented in [Mir+18]¹.

2.1 Fundamentals and Related Work

2.1.1 Architecture and Topology Data Model

A major formal modeling method for future intelligent power grids is given by the Smart Grid Architecture Model (SGAM) [Sma12]. The SGAM framework provides five layers: physical components in the network (component layer), protocols for information exchange between services or systems (communication layer), data models which define the rules for data structures (information layer), functions and services (function layer), as well as business and market models (business layer). Furthermore, the model divides all two-dimensional layers along the domain dimension from generation over transmission, and so forth, to customer premises, and along the zones dimension from process over field, and so on, to the market.

¹ "A Cosimulation Architecture for Power System, Communication, and Market in the Smart Grid" by Markus Mirz, Lukas Razik, Jan Dinkelbach, Halil Alper Tokel, Gholamreza Alirezaei, Rudolf Mathar, and Antonello Monti is licensed under CC BY 4.0


SGAM shall accelerate and standardize the development of unified data models, services, and applications in industry and research. In this context, the SINERGIEN data model and the co-simulation framework build upon SGAM as follows:

• the unified data model formally defines the data exchange structure in alignment with the information layer concept of SGAM (see Sect. 2.4);

• the domain-specific simulators of our co-simulation environment include models of power grid and communication network components as well as market actors in the distribution, DER, and customer premise domains of the SGAM component layer;

• the communication layer is abstracted by a co-simulation interface and software extensions for the particular domain-specific simulators in order to enable data exchange between the components (see Sect. 2.4);

• the example use case presented in Sect. 2.2 with an optimal management of distributed battery storage systems is an example of a system function that would fall on the SGAM function layer. Furthermore, the business model motivating the provision of a proper system function, e. g. an incentive by a distribution system operator (DSO), is defined within the business layer.

For our unified data model we chose CIM as a well-established basis for power grid data that can be extended in a flexible manner. An extension of CIM was needed for the communication infrastructure and the energy market, as for example shown in [Haq+11] and [Fre+09].

2.1.2 Common Information Model

Some of the most important smart grid related standards (i. e. core standards) are from the IEC Technical Committee 57 (IEC TC 57). The so-called CIM is standardized in IEC 61970 (Energy Management Systems), IEC 61968 (Distribution Management), and IEC 62325 (Energy Market Communications) [IEC12b; IEC12a; IEC14]. Therefore, CIM belongs to the core standards included in the IEC/TR 62357 reference architecture [IEC; IEC16b]. Originally, CIM was developed as a database model for energy management systems (EMSs) and supervisory control and data acquisition (SCADA) systems but then changed into an object-oriented approach for electric distribution, transmission, and generation. Use cases of CIM are system integration using pre-defined interfaces between the


IT of distribution management systems (DMSs) and automation parts, custom system integration using XML-based payloads for semantically sound coupling of systems, and serializing topology data using the Resource Description Framework (RDF) [Usl+12]. The IEC considers CIM and the IEC 61850 series as the pillars for a realization of the smart grid objectives of interoperability and device management [LE].
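To give an impression of what such serialized topology data looks like, the following Python sketch parses a tiny, hand-written CIM RDF/XML fragment. The namespace URI, class, and property names follow common CIM RDF/XML conventions (rdf:ID identifiers and class.property element names), but the fragment is invented and no particular CIM profile is implied; generated deserializers for complete CIM documents are the subject of Chapter 3.

    import xml.etree.ElementTree as ET

    # Tiny, hand-written CIM RDF/XML fragment for illustration only; the
    # namespace URI and property names follow common CIM conventions, but
    # the values are invented.
    DOC = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                      xmlns:cim="http://iec.ch/TC57/2013/CIM-schema-cim16#">
      <cim:ACLineSegment rdf:ID="_line1">
        <cim:IdentifiedObject.name>Feeder line 1</cim:IdentifiedObject.name>
        <cim:ACLineSegment.r>0.20</cim:ACLineSegment.r>
        <cim:ACLineSegment.x>0.35</cim:ACLineSegment.x>
      </cim:ACLineSegment>
    </rdf:RDF>"""

    NS = {"rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
          "cim": "http://iec.ch/TC57/2013/CIM-schema-cim16#"}

    root = ET.fromstring(DOC)
    for obj in root.findall("cim:ACLineSegment", NS):
        rdf_id = obj.get("{" + NS["rdf"] + "}ID")
        name = obj.findtext("cim:IdentifiedObject.name", namespaces=NS)
        r = float(obj.findtext("cim:ACLineSegment.r", namespaces=NS))
        print(rdf_id, name, r)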

2.1.3 Simulation of Smart Grids

Example approaches for co-simulations of power grids and communication networks are presented in [Li+14; Hop+06; ZCN11; Lin+12]; they focus on short-term effects and therefore do not include the energy market. In MOCES [EFF15] a holistic approach is taken for modeling distributed energy systems, but the result is a monolithic simulation rather than a co-simulation combining a hybrid simulation for the physical part with an agent-based part for behavior-based simulations, e. g., coming from the market. With the SINERGIEN co-simulation environment the advantages of existing tools shall be harnessed, which enhances the credibility of simulation results and obviates reinventing the wheel. The SINERGIEN co-simulation platform consists of several domain-specific simulators with the possibility to use the "best tool" for each domain.

2.1.4 Classification of Simulations

In [Sch+15] a classification scheme for energy-related co-simulations is introduced, with the four modeling categories continuous processes, discrete processes / events, roles, and statistical elements. The power grid in the SINERGIEN co-simulation environment is modeled in Modelica. A short introduction to Modelica is provided in Sect. 4.1.1. Thermal systems [Mol+14] as well as power grids [MNM16] have been modeled in Modelica. The Modelica models in the SINERGIEN co-simulation express continuous as well as discrete processes and events, which makes the power grid simulation a hybrid simulation. The communication network is simulated with available discrete event simulation (DES) tools, such as ns-3. In a DES the simulation time does not proceed continuously but advances with the occurrence of certain events such as packet arrivals, timer expiries, etc. [WGG10]. The energy market simulation was also implemented as a DES, but in Python, which is flexible and suitable for testing different optimization methods [Har+12]. Each market participant aims at optimizing the schedule for its assets, e. g., minimizing energy costs and maximizing its profit. Examples of statistical elements are, e. g., the wind farm models of the power grid simulator.
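To illustrate the DES principle described above, the following minimal sketch (not taken from any of the mentioned simulators) advances the simulation time only when the next time-stamped event is processed from a priority queue:

#include <functional>
#include <iostream>
#include <queue>
#include <vector>

// An event carries its occurrence time and an action to execute.
struct Event {
    double time;
    std::function<void()> action;
    bool operator>(const Event& other) const { return time > other.time; }
};

int main() {
    // min-heap ordered by event time
    std::priority_queue<Event, std::vector<Event>, std::greater<Event>> queue;
    queue.push({0.5, [] { std::cout << "packet arrival\n"; }});
    queue.push({0.2, [] { std::cout << "timer expiry\n"; }});

    double simTime = 0.0;
    while (!queue.empty()) {
        Event e = queue.top();
        queue.pop();
        simTime = e.time;   // time jumps directly to the next event
        e.action();
    }
}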


In view of the above, the SINERGIEN co-simulation environment is formalized as a coupled discrete event system specification as defined in [ZPK00]. This formalization is shown in Sect. 2.4.

2.2 Use Case

The SINERGIEN environment can be used for an evaluation of different scenarios with fast phenomena in the range of microseconds to seconds (i. e. with smaller simulation time steps) between highly dynamic power grid components, e. g., power electronics, and the communication network, as well as slow phenomena in the range of minutes to hours (i. e. with larger simulation time steps) that involve market entities, the power grid, and the communication network.

More on these two phenomena classes can be found in [Mir+18], with a focus on slow phenomena and a discussion of fast phenomena that describes the adaptations needed for fast phenomena investigations. Based on this classification it can be concluded that the three simulators do not necessarily need to participate in each co-simulation. The example use case chosen in [Mir+18] for a validation of the SINERGIEN environment was an optimal management of distributed storage systems for peak-shaving to support the grid operation. The SINERGIEN environment including the communication network allows testing the effects of communication failures on the operation strategy and eventually on the electrical grid, which can provide valuable insights for decision making. Simulation results for this example are also provided in [Mir+18]. Before the co-simulation is initiated, it is necessary to define and store the topology under investigation along with the scenario-specific parameters. For example, various scenarios in which failures in the communication network are stochastically or deterministically set by the user in the data model can be examined. From a user perspective it would be advantageous if all components, their links, and parameters could be defined in one environment rather than splitting this information between different software solutions and formats. Then, the data model for the topology needs to include components that couple different domains. Under these requirements, the following challenges were identified:

• definition of a common data model that includes components of all domains and their interconnections;


• interaction of simulators with different simulation types, e. g., event-driven for the communication network and continuous processes for the power grid;

• choice of the co-simulation time step, which is limited by the synchronization method connecting the simulators.

2.3 Challenges

A major issue in the coupling of simulators with different modeling approaches is the selection and implementation of a synchronization mechanism which ensures proper progress of the simulation time and a timely data exchange between the simulators. This selection is of crucial significance for minimizing the error propagation in the co-simulation and the synchronization overhead in terms of simulation time. Since this is out of the scope of this work, please refer to [Mir+18] for more details. The definition and implementation of a new, proper data model involving all three mentioned domains, in contrast, is crucial for the whole following work on large-scale co-simulation.

Holistic Topology Data Model

A common data model that covers the power grid, communication infrastructure, and electricity market did not exist. Besides the benefit for the user of a co-simulation environment of having a single data model for the specification of a holistic co-simulation topology, also the data exchange between simulators is simplified. A system description that encompasses all components of smart grids as shown in Fig. 2.1 (1) can be either used directly by a single multi-domain smart grid simulator or divided into subsystems for a co-simulation as in Fig. 2.1 (2). For many components, this division is obvious since their parameters are only needed by one domain-specific simulator, but some components (called inter-domain components) constitute natural coupling points between the three domains. For instance, a battery storage device connected to the grid can act as a market participant that offers its capability to charge or discharge. In order to enable its participation in the energy market, the battery storage needs an interface, which is a communication modem in this case. The modem can be seen as a part of the battery storage. For a co-simulation, the information on inter-domain components must be split into several parts as each simulator has to simulate a dedicated part of these components.


2.4 Concept of the Co-Simulation Environment

2.4.1 Holistic Topology Data Model

As already mentioned, a holistic data model for a co-simulation topology spanning all three domains can be based on CIM with an extension by further classes. These classes, introduced to complete CIM in its representation of smart grids, are linked to already existing CIM classes using the Unified Modeling Language (UML).

Figure 2.1: Exemplary topology including components of (1) all domains and (2) domain-specific topologies


The proposed format can be structured in four packages:

• Original CIM (IEC 61970 / 61968 / 62325),

• Communication,

• Market, and

• EnergyGrid.

Whenever suitable, original CIM classes are used. However, some components do not have an associated class in the standard yet and are therefore added in one of the other three packages. This approach allows a flexible update to a new CIM version without losing the added classes and their links. The most important feature of the SINERGIEN data model is the interconnection of domains. Examples of inter-domain components, namely BatteryStorage, SolarGeneratingUnit, and MarketCogeneration, are shown in Fig. 2.2, an excerpt from the SINERGIEN data model. According to the UML diagram, the energy market components are associated with the power grid components, whereas power grid components have an aggregation relationship to communication devices.

Figure 2.2: Inter-domain connections between classes of power grid, communication network, and market

This means that parameters specific to the market, communication network, and power grid which relate to the same device are linked with each other. Therefore, all information on one device is easily accessible, but at the same time there is a separation according to the domains. The connections between classes of different domains are defined in a logical and not a topological manner. Topological connections, in contrast, exist to interconnect power grid components, for instance. In the mentioned battery storage device example, the data model is as follows: the device is a part of the grid and has electrical parameters. Furthermore, the battery storage might participate in the market, e. g., as part of a virtual power plant (VPP). Market-specific information can be stored in MarketBatteryStorage class objects, which are associated with the BatteryStorage. The communication modem ComMod, which could be used to communicate with the VPP, is aggregated to the BatteryStorage class (a code sketch of these links follows after the list below). The three additional packages EnergyGrid, Communication, and Market are needed for the following reasons:

• some newer components occurring in power grids are missing in original CIM. For instance, it was necessary to create a new model for electrical energy storages like stationary batteries. A battery storage is a conducting equipment that is able to regulate its energy throughput in both directions. Therefore, the class BatteryStorage added in the EnergyGrid package is a specialization of a CIM RegulatingConductingEquipment since it can influence the flow of power at a specific point in the grid.

• the key component of the Market package for the scenarios that we would like to investigate is a VPP, since the aggregation of small DER units enables their participation in electricity markets.

• the Communication package includes all additionally defined classes that are related to the communication network model, such as classes for communication links and technologies, modems, and network nodes, along with their parameters and their relations with the classes in CIM, the power grid package, and the market package.

Figure 2.3 shows an excerpt from the communication data model with an aggregation to a WindGeneratingUnit. By means of the associated classes for modems, communication requirements, and channels, the model enables a description of network parameters and topology. More on the packages can be found in [Mir+18].
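As a rough illustration of these logical inter-domain links (the actual data model is defined in UML/CIM; all class members and names below are simplified or hypothetical), the battery storage example could be expressed in C++ as follows:

#include <string>

// Power grid (EnergyGrid package) side of the example
namespace PowerGrid {
    struct ComMod { std::string technology; };       // communication modem

    struct RegulatingCondEq { /* electrical base class from original CIM */ };

    struct BatteryStorage : RegulatingCondEq {        // EnergyGrid extension
        double  ratedEnergy = 0.0;                    // electrical parameter
        ComMod* modem = nullptr;                      // aggregation into the Communication domain
    };
}

// Market package side of the example
namespace Market {
    struct MarketBatteryStorage {
        PowerGrid::BatteryStorage* device = nullptr;  // logical association to the grid component
        double offeredFlexibility = 0.0;              // market-specific parameter
    };
}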


2.4.2 Model Data Processing and Simulation Setup

The overall information flow for the simulation setup is depicted in Fig. 2.4. After the holistic topology is edited in a graphical Topology Builder, including all objects of the three domains, it is forwarded to the co-simulation interface. In order to execute a simulation, the Modelica solver requires a Modelica model, whereas the communication network topology can be given to the communication network simulator in CIM format, which includes the components of the network, their connections, and parameters. The co-simulation interface incorporates a component called CIMverter, based on CIM++ [Raz+18a] presented in Chap. 3. The CIMverter [Raz+18b] reads in the CIM document and outputs a Modelica system model (Chap. 4) for the power grid simulator. In contrast, the Python-based market simulation relies on a C++/Python interface, which could be realized using one of the common libraries for wrapping C++ data types and functions in Python, to retrieve the market-relevant information from the C++ objects and store it in Python objects. A detailed explanation of the translation from CIM to Modelica is given in Chap. 4.

Figure 2.3: Communication network class association example


2.4.3 Synchronization

The synchronization during simulation is performed at fixed time steps. For slow phenomena scenarios this is managed by mosaik, a well-established co-simulation framework [SST11]. It allows coupling the three simulators in a simple manner, as explained in Sect. 2.4.4, in the case of longer synchronization time steps. VILLASnode, a software project for coupling real-time simulations in LANs [Vog+17; Ste+17], is a suitable alternative to mosaik in the case of synchronizations with very short synchronization time steps. In Modelica, the synchronization data exchange is achieved by integrating Modelica blocks of the Modelica_DeviceDrivers library, which was originally developed for interfacing devices to Modelica environments [Thi19]. The library conveniently allows the definition of a fixed interval for data exchange that can be different from the simulation time step. More on this choice and the integration can be found in [Mir+18]. Figure 2.5 depicts the flow of time for the co-simulation and each simulator. The power grid and market simulators compute in parallel, whereas the communication network is waiting for their inputs.

Figure 2.4: Overall SINERGIEN architecture for simulation setup

The interactions between the simulators in each co-simulation step can be formalized by

u_p(n + 1) = F_c(F_m(u_m(n))),        (2.1)

u_m(n + 1) = F_c(F_p(u_p(n))),        (2.2)

where u_p, u_m, and u_c are the corresponding input values of the power grid, energy market, and communication network simulators at each time step. Therefore, initial values u_p(0), u_m(0), and u_c(0) have to be set at the beginning of the co-simulation. n denotes the current co-simulation time step. F_c (communication), F_m (market), and F_p (power grid) are the functions describing the calculation within a step.
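A minimal sketch of this fixed-step exchange follows; the functions Fp, Fm, and Fc below are placeholder wrappers standing for one synchronization interval of each simulator, not the actual simulator interfaces:

#include <vector>

using Signal = std::vector<double>;   // values exchanged in one synchronization step

Signal Fp(const Signal& u);           // power grid step (placeholder, to be provided)
Signal Fm(const Signal& u);           // energy market step (placeholder)
Signal Fc(const Signal& u);           // communication network step (placeholder)

void cosimulate(Signal up, Signal um, int steps) {
    for (int n = 0; n < steps; ++n) {
        // power grid and market can run in parallel on the inputs of step n
        Signal yp = Fp(up);
        Signal ym = Fm(um);
        // the communication network forwards their outputs as inputs of step n+1
        up = Fc(ym);                  // corresponds to (2.1)
        um = Fc(yp);                  // corresponds to (2.2)
    }
}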

2.4.4 Co-Simulation Runtime Interaction

Figure 2.6 shows the coupling of the simulators for their co-simulation runtime interaction with the following entities:

mosaik As already mentioned, mosaik is used for the coordination during the synchronization steps of several minutes (in simulation time) regarding all simulators [Sch19].

Market Simulator Implemented in Python, it can make use of mosaik's so-called high-level API as illustrated in Fig. 2.6.

Communication Network Simulator Based on available DES tools, their network simulation modules are extended with inter-process communication functionalities for message exchange with mosaik.

Figure 2.5: Synchronization scheme of simulators at co-simulation time steps


Power System Simulator The integration of so-called TCPIP_Send/Recv_IO blocks from Modelica_DeviceDrivers into the Modelica models allows the exchange of simulation data via sockets, but in the form of Modelica variables as bitvectors instead of messages in JSON, an open-source and human-readable data format [ecm19]. Therefore, the MODD Server is implemented.

MODD Server It receives commands from the socket connected with mosaik. Based on these commands it, for example, starts the power grid simulator, or it receives the bitstream from Modelica_DeviceDrivers and encapsulates it into JSON messages before transferring them to mosaik.

Figure 2.6: Scheme of runtime interaction between co-simulation components


Besides the synchronization steps controlled by mosaik, there are also more fine-grained synchronization steps of fractions of seconds between the power grid and communication network simulators. That is why a VILLASnode gateway is included.

VILLASnode Instead of the Transmission Control Protocol (TCP) as in the case of mosaik, VILLASnode can make use of InfiniBand (IB) interconnects for data exchange between real-time simulators on different machines and of shared-memory regions on the same machine. The use of shared-memory regions and IB interconnects leads to lower latencies and consequently to shorter synchronization time steps, as shown in Chap. 8.

For more on the formalization of the SINERGIEN co-simulation and the limitations of the environment please refer to [Mir+18].

2.5 Validation by Use Case

The proper functioning of the SINERGIEN co-simulation environment has been validated with the aid of different use case scenarios. In the use case presented in [Mir+18] it is assumed that a VPP operator tries to reduce the VPP's peak power. This behavior could be desired by the responsible DSO and come with financial incentives. Therefore, a peak-shaving algorithm is utilized for an optimal management of distributed battery storage systems. First of all, simulation results obtained without the SINERGIEN environment were compared with results obtained with the SINERGIEN environment to demonstrate that the results do not change under the assumption of an ideal communication network when simulating the same scenario. Furthermore, another scenario was presented where the communication network was supposed to impair the control loop between the power grid and the market due to communication device failures. More details on the co-simulated scenarios can again be found in [Mir+18], as the simulations themselves are not in the focus of this work. With the simulation results of both scenarios, the proper functioning of the co-simulation environment has been shown.


2.6 Conclusion

The architecture of the implemented multi-domain co-simulation environment presented here shows the applicability of the CIM-based holistic data model for smart grid simulations which include the three domains power grid, communication network, and market. The data model facilitates the use of the software environment since the domain-specific smart grid component parameters and their interconnections can be modified and stored in a self-contained topology description. Due to the SINERGIEN co-simulation approach the user can take advantage of established domain-specific simulators for each domain. For this purpose, also new software tools have been developed. The ModPowerSystem library can be used for scientific research on various models since Modelica as modeling language simplifies the development and improvement of component models. Because of the increasing use of CIM-based documents for grid topology representation, the choice of Modelica led to the development of a CIM to Modelica mapping that is presented in Chap. 4. Besides the initiation of the CIM related topics (Chap. 3 and Chap. 4), the SINERGIEN co-simulation architecture illustrates how the work on HPC Python (Chap. 7) and the integration of InfiniBand in VILLAS (Chap. 8) can be used in power system co-simulation. The work in Chap. 5 and Chap. 6, however, contributes to a higher performance of the simulation itself, which is accomplished by the simulators of the co-simulation environment. In the following chapter, the automated generation of a (de-)serializer for reading and writing CIM-based documents, implemented in the mentioned CIM++ software library, is presented.


3 Automated De-/Serializer Generation

Due to the growing automation in smart grids, driven by increasing digitalization and a rising number of decentralized energy systems, the actors in this area are increasingly dependent on ICT systems that must be compatible with each other, which in particular concerns the data exchange between these systems. Therefore, different countries, organizations, and vendors started to develop smart grid related standards with different foci on technical and economic aspects. Eventually, only few national standards have been integrated into standards of the International Electrotechnical Commission (IEC) or the International Organization for Standardization (ISO) [Usl+12]. In recent years, the CIM standards (IEC 61970/61968/62325, see Sect. 2.1.2) have been subject to numerous research activities, often related to use cases for CIM [MDC09; DK12; Wei+07]. Some of them, like the research project SINERGIEN, also introduce extensions by classes not included in original CIM as, for instance, in [MMS13], where a methodology for modeling telemetry in power systems using IEC 61970/68 in the case of a US independent system operator is presented. There are also harmonization approaches because of the data exchange between energy related software systems based on the CIM standards and ICT for substation automation based on IEC 61850 [LK17; Bec10; SRS10]. Since CIM is object-oriented, it specifies classes of objects containing information about energy system aspects as well as relations between these classes (referred to as the ontology) [GDD+06]. Currently, more and more commercial software tools in the energy sector provide import and export of CIM documents.

Moreover, there are already about 200 corporate members organized in the CIM User Group (CIMug) providing CIM models for common visual Unified Modeling Language (UML) editors [CIM]. This high acceptance among companies and institutions has pushed the adoption of CIM also in the simulation environment as presented in Chap. 2 with respect to the SINERGIEN co-simulation environment, where the data format of the multi-domain component-based co-simulation model (referred to as the topology) is based on CIM. As the topology, including power grid and communication network components as well as energy market actors, evolves continuously, a high compatibility, updatability, and extensibility of the chosen data model are key requirements. The object-oriented design with concepts such as inheritance, associations, aggregations, etc. led to a CIM data encapsulation format referred to as RDF/XML [IEC06], which comes from the area of the semantic web [AH11] and is not as common in other domains. This and the huge specification of CIM with hundreds of classes and relations between them, which make CIM very extensible and universally applicable in comparison to other, more specific and static data models, have a deterrent effect on new users. Moreover, keeping CIM based software continuously up-to-date can require considerable effort, especially in the scientific and academic area. These could be the reasons why there are hardly any software libraries especially for handling CIM documents. Therefore, in this chapter an automated (de-)serializer generation from UML based CIM ontologies is presented. The approach was introduced in [Raz+18a] and implemented in a chain of tools for generating an open-source library libcimpp within the CIM++ software project [FEI19a]. libcimpp can be used for reading CIM RDF/XML documents directly into CIM C++ objects (called deserialization) and is currently also being extended for serialization (i. e. writing of CIM C++ objects from the memory to RDF/XML documents). Due to a model-driven architecture (MDA), libcimpp can be adapted to new CIM versions or user-specified CIM based ontologies in an automated way. For this purpose, the approach makes use of a common visual UML editor and our CIM++ toolchain generating a complete compilable CIM C++ codebase from given CIM UML models (i. e. CIM profiles) which are kept up-to-date (e. g. by the CIMug). It is also shown how this CIM C++ codebase can be used for holding the deserialized CIM objects as well as for an automated generation of C++ code for exactly this deserialization. Hence, if the CIM C++ codebase changes (because of changes in the CIM UML), there is no need to adapt the code of libcimpp by hand.


The direct deserialization into C++ objects makes the library very easy to apply because its user neither needs any CIM RDF/XML knowledge nor has to handle intermediate representations of the CIM RDF/XML document like a Document Object Model (DOM) in combination with the Resource Description Framework (RDF) syntax. For instance, in the case of a power grid topology stored in CIM documents, a power grid simulator can directly access the CIM objects, deserialized by libcimpp, in the form of common C++ objects. The chapter gives a short introduction to the data formats as well as other components used in CIM++, followed by an overview of the overall concept. Then it describes how the Common Information Model (CIM) is automatically mapped to compilable C++ code which is used by the CIM++ Deserializer (i. e. libcimpp) during the so-called unmarshalling step, explained subsequently together with its automated generation. Following this, the final libcimpp is introduced. Finally, the chapter is concluded by a summary and an outlook on future work. The work in this chapter has been partially presented in [Raz+18a]1.

3.1 CIM Formalisms and Formats

An introduction to CIM is provided by Sect. 2.1.2. CIM makes use of several formalisms and formats which are explained in the following.

UML

UML is a well-established formalism for graphical object-oriented modeling [RJB04]. In CIM, only UML class diagrams with attributes and inheritance as well as associations, aggregations, and compositions with multiplicities are used. The CIM UML contains no class methods, as CIM defines just the semantics of its object classes and their relations, i. e. which kind of information a CIM object contains, without any functionality of the objects. CIM UML diagrams can, like other UML diagrams, be edited with visual UML editors and stored in a proprietary or open format like XML Metadata Interchange (XMI) [KH02]. Conveniently, the CIMug provides such CIM model drafts [CIM]. While UML, or XMI respectively, is used for the definition of all classes with their attributes and relations among them, the actual objects (i. e. instances of these classes) are stored in the form of RDF/XML documents.

1 Reprinted by permission from Springer Nature Customer Service Centre GmbH: Springer Computer Science – Research and Development (“Automated deserial- izer generation from CIM ontologies: CIM++—an easy-to-use and automated adaptable open-source library for object deserialization in C++ from documents based on user-specified UML models following the Common Information Model (CIM) standards for the energy sector“, Lukas Razik, Markus Mirz, Daniel Knibbe, Stefan Lankes, Antonello Monti), © (2017)


XML and RDF

The Extensible Markup Language (XML) is a widely used text-based formalism for human- and machine-readable documents [Bra+97]. In general, XML documents have a tree structure, which is why XML itself is not well suited for representing arbitrary graphs. Therefore, it is combined with RDF [Pan09]. RDF provides triples of the form "subject predicate object" which allow representing a relation (the predicate) between resources (the subject and the object). Therefore, links (i. e. instances of associations, aggregations, . . . ) between CIM objects, as specified in the UML ontology, can be expressed in RDF/XML. For instance, in List. 3.1 the object of class BatteryStorage has an rdf:ID (line 7) which is referenced in the Terminal (line 5) with the RDF/XML attribute rdf:resource="#BS7". A brief introduction to CIM with its key concepts is provided by [McM07].

XML Parsers

There are three common types of pure XML parsers [Fri16; HR07; KH14]. During parse time, the so-called DOM parser generates a treelike structure with strings of the whole document, which can be very memory demanding.

Listing 3.1: Snippet of a CIM document representing an IEEE European Low Voltage Test Feeder with an additional BatteryStorage

 1  <cim:Terminal rdf:ID="...">
 2    <cim:IdentifiedObject.name>T1</cim:IdentifiedObject.name>
 3    ...
 4    ...
 5    <cim:... rdf:resource="#BS7"/>
 6  </cim:Terminal>
 7  <cim:BatteryStorage rdf:ID="BS7">
 8    ...
 9    <cim:IdentifiedObject.name>Battery-1</cim:IdentifiedObject.name>
10    ...
11    <cim:...>5000</cim:...>
12    ...
13    <cim:...>400</cim:...>
14    ...
15    ...
16  </cim:BatteryStorage>

For further processing, the particular strings have to be picked out manually and interpreted or converted to the desired data types. To avoid loading a whole document into memory, StAX parsers (a kind of pull parser [Slo01]) can be used. They are a compromise between DOM and Simple API for XML (SAX) parsers as they allow random access to all elements within a document. SAX parsers are most commonly used. They traverse XML documents linearly and trigger event callbacks at certain positions. Since one linear reading of the CIM document is sufficient for its deserialization, a SAX parser is used.

C++ Source Code Analysis

For C++ source code analysis, correction, and adaptation, which is needed in several steps of the automated generation, a compiler front-end was chosen. It can transform source code into a so-called abstract syntax tree (AST) [Aho03]. Further functionalities provided by the compiler front-end allow, e. g., static code analysis [Bou13] and source code manipulations to be performed. One of the conceptual ideas is to use the data from the AST as input for a template engine.

Template Engines

Template engines are mainly used for the generation of dynamic web pages [Fow02; STM10]. The core idea behind them is to separate static content (e. g. HTML code defining the structure of a web page) from dynamic data (e. g. the actual web page content). Therefore, the static part can be written in template documents with placeholders that are filled by the template engine with data from a database, as described in Sect. 3.4.4.

3.2 CIM++ Concept

A conceptual overview of the automated (de-)serializer generation from CIM UML is presented in Fig. 3.1. The upper part of the diagram shows the automated code generation process from the definition of the ontology in CIM UML to the (un-)marshalling code generation of the CIM++ (De-)Serializer libcimpp. The lower part shows the deserialization process from a given topology (based on the specified CIM ontology) to CIM C++ objects. The CIM based specification, which represents classes and their relations in UML, is loaded with a visual UML editor and transformed to a C++ codebase. Before this C++ codebase can be included by the (de-)serializer's source code (i. e. libcimpp), it is adapted by the developed CIM++ code toolchain to compilable C++ code, as the original CIM C++ codebase is not complete, as explained later.


Figure 3.1: Overall concept of the CIM++ project


This adapted codebase is then used by the CIM++ (Un-)Marshalling Generator for generating the unmarshalling code needed by the CIM++ (de-)serializer. Originally, only deserialization was implemented in libcimpp, but serialization is currently being implemented as this concept can be applied in both directions. The code toolchain as well as the (un-)marshalling generator make use of a compiler front-end, and the latter also makes use of a template engine that gets its data from abstract syntax trees created by the compiler front-end while reading in the adapted CIM++ codebase. Afterwards, the template engine can fill the data about the codebase into the (un-)marshalling code templates. After all these automated steps, which can be repeated whenever the CIM based specification in UML form is visually modified, the CIM++ deserializer can be compiled into a library. This CIM++ (de-)serializer library (libcimpp) can be used by C++ programs for reading (by deserialization of C++ objects) and writing (by serialization of C++ objects) CIM documents. In the shown topology editor screenshot, for instance, all components of a grid with their links (i. e. the grid's topology corresponding to the previously defined CIM specification) are stored by a topology editor in one or more CIM RDF/XML documents. These documents can be directly transformed to C++ objects by libcimpp. C++ was chosen as programming language because of its high execution-time and memory-space efficiency and to be directly compatible with programs written in C++. Before the automated generation of (un-)marshalling code can be introduced, the mapping of the CIM UML specification to the adapted and therefore compilable C++ codebase is presented.

3.3 From CIM UML to Compilable C++ Code

With visual UML editors the CIM model can be rapidly modified or extended according to individual requirements. Moreover, many tools follow MDA approaches, making round trip engineering (RTE) possible. RTE in relation to UML allows the user to keep UML models and the related source code consistent by two software engineering principles: forward engineering, where changes to UML diagrams lead to an automated adaptation of the belonging source code, and reverse engineering (if supported by the UML editor), where changes to the source code lead to an automated adaptation of the belonging UML diagrams [Dav03; Reu+16]. In our case, these principles provide the ability for incremental development of CIM ontologies


Figure 3.2: UML diagram of the HydroPowerPlant class, whose instances can be associated with no more than one Reservoir instance

(or data models based on CIM, respectively) and the automatically generated CIM C++ codebases. This leads to better software documentation and compatibility between different (distributed) software developing entities. Unfortunately, there are no standardized canonical mappings between UML associations, aggregations, compositions, etc. on the one side and object-oriented programming (OOP) languages on the other side. Therefore, different C++ code representations for the CIM UML aspects had to be chosen for the code generation. For instance, in the case of no multiplicity, the chosen representation of an association is a pointer to an instance of the associated class. In the case of a possible multiplicity greater than 1, it is a pointer to a Standard Template Library (STL) list of pointers to instances of the associated class. The CIM UML specification of the HydroPowerPlant class is partly presented in Fig. 3.2. Since the given multiplicity of the aggregated HydroPump objects can be greater than one, a list is used for the corresponding HydroPumps attribute in the generated code, as depicted in List. 3.2. The HydroPowerPlant aggregates one or more HydroPump instances. Furthermore, there can be multiple HydroPowerPlant instances associated with a Reservoir, and a HydroPump can also exist without being aggregated by a HydroPowerPlant. Inheritance in CIM UML can be easily represented by C++ inheritance. Due to the fact that no operations are defined in CIM UML, i. a. the generated standard constructors are empty, there are no further class functions, and all UML-defined attributes stay uninitialized.

Listing 3.2: Snippet of HydroPowerPlant class

class HydroPowerPlant : public IEC61970::Base::Core::PowerSystemResource
{
public:
    std::list<IEC61970::Base::Generation::Production::HydroPump*> *HydroPumps;
    IEC61970::Base::Generation::Production::Reservoir *Reservoir;
    ...
};

The classes defined in the CIM standard as primitive types (see also Sect. 3.3.3) are generated as empty classes. In the case of the used code generator, and most likely most others, the generated enum types are not strongly typed and therefore have no scope. Besides these circumstances, due to the CIM UML standards in conjunction with C++, the generated code also comes with some software-technical deficiencies. For instance, the #include directive for the chosen std::list container is not automatically inserted, etc. The mentioned facts lead to source code files that could not be directly used for the subsequent automated (un-)marshalling code generation. Therefore, the following solution approaches were also considered: replacing C++ by any other programming language would not guarantee a solution to any of the mentioned issues; writing a new generation tool would result in an additional sophisticated software project just for the special case of generating C++ code from a machine readable CIM UML representation such as XMI. Therefore, a cost-benefit analysis led to the decision to develop a toolchain for automated code correction and adaptation based on existing, widely used transformation and C++ refactoring techniques. Thus, the developed toolchain should be easily adaptable for the usage with different general purpose CIM UML to C++ code generators. A source code transformation by hand, e. g., in the case of IEC 61970 / 61968 on around 2000 source files, would be a cumbersome and error-prone task. The demands on the generated code after its transformation by the toolchain are hierarchical includes of all header files as well as an adequate usage of the chosen container class (i. e. std::list). Furthermore, a common BaseClass for all CIM classes is needed, as will be shown later.

3.3.1 Gathering Generated CIM Sources

The first steps are performed by the CIM-Include-Tool, which groups together all C++ source files that are created by the code generator of the visual UML editor from the CIM UML. The tool scans all source files written by the code generator for the container class that was chosen for associations with multiplicities and adds missing header includes (here, i. e., #include <list>). In the case of the used code generator, all files are grouped together according to the CIM packages. For instance, the definition of the IEC 61970 class Terminal, located in the package Base::Core, is stored in the directory path IEC61970/Base/Core. This is why all occurrences of #include "Terminal.h" are transformed to


# include " IEC61970 / Base / Core / Terminal .h" for keeping the hierarchical structure of all directories and files [Daw; ISO14].

3.3.2 Refactoring Generated CIM Sources

After that, the CIMRefactorer, based on the Clang LibTooling library, is executed. Clang is a compiler front-end supporting C/C++, developed within the LLVM compiler infrastructure project [LLV]. During parse time, the library creates an AST containing objects that represent the whole source code, like declarations and statements [DMS14]. For the AST traversal the visitor design pattern [KWS07] is used to evaluate and process the AST. Due to the usage of a visitor pattern, the implementation of the so-called composite does not need to be adapted if its processing has to be changed or extended. If a new action has to be performed on the AST, only a new visitor has to be implemented. Clang provides the class template clang::RecursiveASTVisitor for this need. It is derived from, with an appropriate implementation of a visitor class given as template parameter, as pictured in Fig. 3.3 for an example visitor MyASTVisitor. By design there also exists the MyASTConsumer class inheriting from clang::ASTConsumer, which determines the entry point of the AST. It calls TraverseDecl() of the AST visitor, which then calls the appropriate methods of the given MyASTVisitor. The CIM models provided by the CIMug include UML enumerations, e. g., for units and multipliers. Thereby, several enumerations contain the same symbols. For example, the enumeration UnitSymbol contains the enumerator m as the unit for meters while the enumeration UnitMultiplier contains the enumerator m as the SI prefix for milli. Since C++ requires unique symbols, which is not the case for the symbol m declared twice, the generated code with unscoped enum types is incorrect.

Figure 3.3: UML diagram of the class MyASTVisitor

Therefore, VisitDecl() i. a. adds the class keyword to all visited unscoped enumerations. However, this does not define them as classes; it is just a reuse of an existing C++ keyword. Furthermore, the visitor checks for each statement whether any used data type is an enumeration and adds its corresponding scope as prefix. Hence, e. g., the unscoped enumeration with a corresponding assignment statement

enum UnitSymbol { F, ... }
...
const IEC61970::Base::Domain::UnitSymbol Capacitance::unit = F;

is adapted by VisitDecl() to a strongly typed enumeration

enum class UnitSymbol { F, ... }
...
const IEC61970::Base::Domain::UnitSymbol Capacitance::unit = IEC61970::Base::Domain::UnitSymbol::F;

with the needed scope in the assignment statement. Such modifications are not performed by the visitor on the existing code directly but are temporarily stored in the designated container provided by Clang for later usage, to avoid invalid ASTs. Furthermore, initialization lists are added to the standard constructors for all class attributes, which are provided by Clang together with their data types. Also, all pointer operators * to the chosen container type in the case of associations with given multiplicities are removed. The list of attributes with their data types specified in the visited class declaration is also provided by Clang. Thus, such associations are finally represented as lists of pointers:

std::list<IEC61970::Base::Generation::Production::HydroPump*> HydroPumps;

Almost all of the thousands of CIM headers include other headers, which would lead to many repeatedly visited declarations and consequently to very long execution times. As already mentioned, MyASTConsumer defines the entry points of the AST, which are the top-level declarations of the CIM C++ headers. A top-level declaration is not included in another declaration. Hence, each top-level declaration is traversed in order to visit all nodes of the AST. During this, the position of each node in the source code is stored in a hash table with an average-case time complexity for search operations of O(1). As a result, in case a declaration's position is already contained in the table, the declaration is ignored.
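A minimal sketch of such a visitor/consumer pair, loosely following Fig. 3.3, is given below; it is a simplified illustration and not the actual CIMRefactorer code, with the position bookkeeping reduced to a string-keyed set:

#include "clang/AST/ASTConsumer.h"
#include "clang/AST/ASTContext.h"
#include "clang/AST/RecursiveASTVisitor.h"
#include <string>
#include <unordered_set>

class MyASTVisitor : public clang::RecursiveASTVisitor<MyASTVisitor> {
public:
    explicit MyASTVisitor(clang::ASTContext &Ctx) : Context(Ctx) {}

    bool VisitDecl(clang::Decl *D) {
        // Remember each declaration's source position; repeatedly included
        // headers would otherwise be processed again and again.
        std::string Loc =
            D->getLocation().printToString(Context.getSourceManager());
        if (!Locations.insert(Loc).second)
            return true;                 // already seen, just continue traversal
        // ... evaluate or rewrite the declaration here,
        //     e. g. turn unscoped enums into scoped ones ...
        return true;
    }

private:
    clang::ASTContext &Context;
    std::unordered_set<std::string> Locations;
};

class MyASTConsumer : public clang::ASTConsumer {
public:
    explicit MyASTConsumer(clang::ASTContext &Ctx) : Visitor(Ctx) {}

    // Entry point: called for every top-level declaration of a translation unit.
    bool HandleTopLevelDecl(clang::DeclGroupRef DR) override {
        for (clang::Decl *D : DR)
            Visitor.TraverseDecl(D);     // traverse the AST below this declaration
        return true;
    }

private:
    MyASTVisitor Visitor;
};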


3.3.3 Primitive CIM Data Types

The CIM standards do not only define classes for virtual instances of real objects but also so-called primitive (data) types String, Integer, Float, Decimal, and Boolean, which correspond to intrinsic data types of many programming languages. All other CIM data types are classes that can contain these primitive types. For the CIM classes representing such primitive types, just empty skeletons are generated, with the result that they must be implemented depending on their purpose, which can differ between different CIM resp. libcimpp users. In the used CIM model there is also the Decimal type, which is not specified as primitive (in present CIM standards) but is used like the four others and is thus handled by the toolchain like a primitive type. Two different methods for the implementation of primitive types have been discussed: simple type definitions, e. g., with typedef on intrinsic C++ data types, and the implementation of C++ classes. For the unmarshalling step (explained in Sect. 3.4.3) it is mandatory that the class attributes provide reading from C++ string streams. Since a design decision was to throw exceptions when trying to read from never defined CIM class attributes, the primitive types were implemented in the form of classes (and not, e. g., just as typedefs on intrinsic C++ types). Moreover, in the case of numeric data types a sufficient precision can only be guaranteed since C++11, which is the standard already used for CIM++ because of other language features. The primitive String type is based on std::string since it can store arbitrarily long UTF-8 encoded strings as required by the standard. The integral type Integer is implemented based on long, whose size depends on the used platform but usually is at least 32 bit, which should be sufficient in most cases. Float is CIM's floating-point number type, for which the double type was chosen instead of float, as a sufficient accuracy is more important in the case of CIM than a higher runtime performance. All these types already provide reading from streams. Boolean is based on bool, which also provides reading from streams, but only in the case of the digits 0/1 and not in the case of the words true/false as used in CIM RDF/XML documents. Therefore, it was implemented with appropriate stream and cast operators which i. a. make it also comparable with other types. Decimal was implemented based on std::string to keep the read value as it is, because of the standard's requirement that it should be able to represent a base-10 real number without any restriction on its length. Afterwards, it can be converted by the libcimpp user, e. g., into an arbitrary-precision representation such as provided, for example, by the


Multiple-Precision Binary Floating-Point Library With Correct Rounding (MPFR) [Fou+07].
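The class-based design of the primitive types can be illustrated by the following simplified sketch of a Boolean wrapper; member names and the exact exception behavior are assumptions of this sketch, not the actual libcimpp implementation:

#include <istream>
#include <stdexcept>
#include <string>

class Boolean {
public:
    Boolean() : value(false), initialized(false) {}

    // Reading an attribute that was never set in the CIM document throws.
    operator bool() const {
        if (!initialized)
            throw std::runtime_error("reading an undefined CIM attribute");
        return value;
    }

    // Accepts the textual true/false used in CIM RDF/XML (plain bool does not).
    friend std::istream& operator>>(std::istream& in, Boolean& b) {
        std::string token;
        in >> token;
        b.value = (token == "true" || token == "1");
        b.initialized = true;
        return in;
    }

private:
    bool value;
    bool initialized;
};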

Overall CIM C++ Source Code Transformation

In addition to the previously described procedures and the implementation of the primitive data types in the form of a patch, a couple of code fixing patches are also applied to the generated CIM C++ code. Besides software-technical details like a correction of definitions inside the IEC61970Version class or making all source files Portable Operating System Interface (POSIX) conform [IEE18], some conceptual issues also have to be solved. As the CIM standards define an enumerated type for the three-digit currency codes of ISO 4217, which can have a leading 0, these are interpreted in C++ code as octal numbers, which is why such leading zeros must be removed. Moreover, CIM defines an attribute switch, which in C++ is a keyword. Therefore, the attribute is renamed, which must be taken into account during the unmarshalling step of the deserializer later on for reading in the attribute by its original name. Afterwards, the code is checked for its compilability with clang-check, which could also be done by any C++ compiler. This check serves to detect errors when the CIM standard is changed or extended with the aid of a visual UML editor; if it is successful, the code can be used for the automated CIM++ (De-)Serializer generation. Finally, the documentation generator doxygen is applied to the now compilable CIM C++ code as support for the CIM++ user [FEIe].

3.4 Automated CIM (De-)Serializer Generation

With a UML to source code generator and the previously introduced toolchain, the CIM UML model is transformed to a compilable codebase which can be used as a CIM data model with instantiatable C++ objects. These objects can then be filled with data read from a CIM RDF/XML document by a common XML parser with the aid of automatically generated unmarshalling routines. Alternatively, the objects can be filled or modified by C++ statements and serialized into a CIM RDF/XML document. However, to be able to store CIM C++ objects in (e. g. STL) containers, further work needs to be done.

3.4.1 The Common Base Class

During the reading of CIM RDF/XML documents, the thereby created CIM objects are stored on the heap and therefore referenced by pointers which are collected in a list container.

Due to the fact that STL containers store items of one single type, the concept of base class pointers is used. This means that objects of derived classes can also be referenced by pointers of their base classes. The motivation is that not all CIM classes inherit from the CIM class IdentifiedObject. Due to the absence of a common base class for all CIM classes, it is not possible to collect them in a container of objects of one base class. Several solutions for this issue have been discussed. One possibility would be to use typeless pointers (void*), but as C++ is a strictly typed language, this was rejected. Another possibility is the use of a container type (like boost::any from the Boost.Any Library [Hen]). However, to remain with the STL, to keep it simple for the CIM++ user, and to avoid further software dependencies, each top-level CIM class (i. e. a class without a super class) derives from a newly introduced BaseClass. As a consequence, it is the base class for all CIM classes and is thus added to the CIM C++ codebase by the previously introduced CIMRefactorer.
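The idea can be sketched as follows (simplified; the virtual destructor is an assumption of this sketch, added so that derived objects can be handled polymorphically and recovered via dynamic_cast through base pointers):

#include <list>

class BaseClass {                       // common base of all top-level CIM classes
public:
    virtual ~BaseClass() = default;     // polymorphic base, enables dynamic_cast
};

class IdentifiedObject : public BaseClass { /* name, ... */ };
class Terminal : public IdentifiedObject { /* ... */ };

// All deserialized CIM objects can now be collected in one STL container:
std::list<BaseClass*> cimObjects;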

3.4.2 Integrating an XML Parser

Basically, CIM RDF/XML can be read by any XML parser. As already described in Sect. 3.2, RDF extends XML i. a. by the possibility of referencing other elements from within an XML element. There are a couple of libraries for RDF handling, such as Apache Jena for Java [Apa] and the Redland RDF Libraries written in C [Bec]. The relevant Redland libraries are librdf, the actual RDF/XML parser, and libraptor, providing the data access by RDF triples. Redland's implementation is similar to a DOM parser: data from RDF/XML documents is stored in its own container residing in the main memory. However, the main goal of CIM++ is to deserialize the CIM objects stored in RDF/XML into C++ objects. Consequently, all CIM data already stored in an intermediate format would need to be copied into the objects instantiated according to the defined CIM C++ classes. Therefore, the choice fell on a SAX parser which, with a succeeding unmarshalling step, can directly fill the read CIM RDF/XML data into the CIM C++ objects. The first versions of the developed libcimpp library were using the event-based SAX parser of libxml++ [Cum], which is a C++ wrapper for the well-established libxml library. In the current libcimpp version, libxml++ [Cum] was replaced by the Arabica XML Toolkit, which comes with uniform SAX wrappers for several XML parsers [Hig], making libcimpp usable on different Unix-like operating systems as well as on Windows. All event-based SAX parsers provide callback functions called whenever a certain event occurs during parse time. In the case of libcimpp these methods call the unmarshalling code, which instantiates the proper CIM C++ objects and fills them with the read data.

Whenever a new opening XML tag is encountered, startElement is called, which gets the XML tag and its attributes that will be stored for later use. If the tag represents a CIM class, an object of this class is instantiated on the heap and referenced by a BaseClass pointer, which is pushed onto a stack and later on popped from the stack by a call of endElement. If an opened XML tag contains an RDF attribute which refers to another CIM object, a task is created and inserted into a task queue. This can be the case for all kinds of CIM UML associations. Finally, at the end of the document endDocument is called, which processes all tasks of the task queue. These tasks connect the associated CIM objects by pointers. Therefore, all objects of the CIM document have to be instantiated before their pointers can be set correctly. Furthermore, a certain routine is called whenever the SAX parser encounters characters which represent no XML tag. These characters and the uppermost elements of the tag and the object stack are passed to an assignment function which interprets the characters as values and tries to assign them to the proper attributes of the belonging CIM object.
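The following self-contained sketch summarizes these callbacks; it is a simplified illustration, not the Arabica-based handler of libcimpp, and the factory helper as well as the stack and queue bookkeeping are assumptions of this sketch:

#include <queue>
#include <sstream>
#include <stack>
#include <string>
#include <unordered_map>

class BaseClass;                                     // common base (Sect. 3.4.1)

// Hypothetical factory helper standing in for the CIMFactory of Sect. 3.4.3.
BaseClass* createCIMObject(const std::string& tag) { return nullptr; }

struct Task {                                        // deferred association link
    BaseClass*  subject;                             // object holding the association
    std::string association;                         // XML tag naming the association
    std::string targetRdfId;                         // rdf:resource of the referenced object
};

class CIMContentHandler {                            // sketch of the SAX callbacks
public:
    void startElement(const std::string& tag, const std::string& rdfId,
                      const std::string& rdfResource) {
        tags.push(tag);
        if (!rdfId.empty()) {                        // tag opens a new CIM object
            BaseClass* obj = createCIMObject(tag);   // instantiated on the heap
            objectsByRdfId[rdfId] = obj;
            objects.push(obj);
        } else if (!rdfResource.empty()) {           // tag references another object
            tasks.push({objects.top(), tag, rdfResource});
        }
    }
    void characters(const std::string& chars) {
        std::stringstream buffer(chars);
        // look up the generated assignment function for tags.top() and call it,
        // e. g. assign_IdentifiedObject_name(buffer, objects.top());
    }
    void endElement(const std::string&) {
        tags.pop();                                  // pop the object stack as well
        if (!objects.empty()) objects.pop();         // (simplified bookkeeping)
    }
    void endDocument() {
        // all objects exist now, so the queued tasks can link them by pointer
        while (!tasks.empty()) { /* resolve via objectsByRdfId */ tasks.pop(); }
    }
private:
    std::stack<std::string> tags;
    std::stack<BaseClass*>  objects;
    std::queue<Task>        tasks;
    std::unordered_map<std::string, BaseClass*> objectsByRdfId;
};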

3.4.3 Unmarshalling

The previously introduced assignment functions form the core functions of the unmarshaller. Since the CIM UML model is transformed into a correct, compilable CIM C++ codebase, it is possible to map XML elements with their contents into the previously instantiated CIM C++ objects. For this purpose, a proper mapping function was defined, which will be described using the example of the CIM RDF/XML snippet shown in List. 3.1. For instance, a function has to assign the name of the Terminal element (List. 3.1 line 2) to the name attribute of the corresponding C++ object, which is an instance of the Terminal class that inherits the attribute from IdentifiedObject, whose code snippet is shown in List. 3.3. The simplest way in general would be using reflection of the programming language, which is the ability to examine, introspect, and modify its own structure and behavior at runtime [DM95]. Reflection in OOP languages i. a. allows "looking" into an object. For instance, that would allow the program to check if a certain object has the attribute name and access it at runtime. Usually, it would also be possible to iterate through all attributes of an object, entirely independently of their types. Contrary to dynamic programming languages such as Python, which provide reflection and also object alteration at runtime [ŠD12; Chu01], C++ by design provides only very limited reflection mechanisms.


Listing 3.3: Snippet of the CIM C++ class IdentifiedObject.

class IdentifiedObject
{
public:
    IdentifiedObject();
    IEC61970::Base::Domain::String name;
    ...
};

Listing 3.4: Assignment function for IdentifiedObject.name

bool assign_IdentifiedObject_name(std::stringstream &buffer,
                                  BaseClass* base_class_ptr)
{
    if (IEC61970::Base::Core::IdentifiedObject* element =
        dynamic_cast<IEC61970::Base::Core::IdentifiedObject*>(base_class_ptr))
    {
        buffer >> element->name;
        ...
}

Without additional programming effort, only information like, e. g., the object's type identifier can be queried at runtime, which is no solution in this context. There are methods to extend C++ by reflection mechanisms with the aid of libraries adding meta information, but such an approach would increase the complexity of the CIM++ project significantly, add further dependencies, and also deteriorate its maintainability and flexibility. Instead, Clang's LibTooling is used for generating the mapping functions based on the information provided by the previously adapted CIM C++ codebase. A mapping function needs the object whose attribute has to be accessed, the attribute's name, and the character string which has to be interpreted and assigned to the attribute. In List. 3.1 line 2 the attribute is identified by cim:IdentifiedObject.name, where cim is the namespace. By implication, a mapping function calls an appropriate assignment function which, for the given case, is presented in List. 3.4. If the dynamic_cast is successful, the stream operator, which was previously implemented for all primitive types, is used for interpreting the given characters as the proper value and for its subsequent assignment to the attribute.


In addition to the primitive types, there are also CIM classes which are not primitive data types but are used similarly in CIM based CGMES documents [ENT16] and, in the context of OOP, are called structured data types. Apart from a value attribute, these classes just contain members representing enumerated types, units, or multipliers. CIM's Domain package defines most of these classes, such as Base::Domain::Voltage with the attributes value of the type IEC61970::Base::Domain::Float, multiplier of the type Base::Domain::UnitMultiplier, and unit of the type Base::Domain::UnitSymbol. Analogously to the presented assignment function, the assignment for an attribute nominalVoltage of the type Base::Domain::Voltage would be:

buffer >> element->nominalVoltage.value;

Since similar assignment functions have to be implemented for all such attributes, they are generated with the aid of a template engine by the unmarshaller generator explained in Sect. 3.4.4. For IEC 61970 alone, more than 3000 assignment functions are generated. Finding the right one by if-branches at runtime would lead to an average-case time complexity of O(n) for each assignment, with n being the total number of assignment functions. To improve the performance, a kind of dynamic switch statement was implemented. For this, pointers to all assignment functions are stored in a hash table with the attributes’ names as keys. Therefore, lookups in the hash table have an average time complexity of O(1). Before any assignment can take place, the proper objects have to be instantiated. As already described, this happens when a new opening XML tag is encountered. In the case of an opening cim:Terminal tag, for example, a new object is instantiated on the heap by new Base::Core::Terminal. The mapping of such an XML tag to its line of code is also done with the aid of the dynamic switch statement concept. For each CIM class, there is a function instantiating the respective objects. These functions are part of a Factory design pattern [Ale01] implemented in the CIMFactory class, which is part of libcimpp. The object’s rdf:ID is stored as key in a hash table together with a pointer to the object for later task resolving. The Task class has a resolve() method which is called for setting the association between objects as mentioned before. During construction, a Task instance gets the CIM object which represents the end of the respective association together with the association’s identifier. The identifier is the XML tag belonging to the association. To resolve a task in resolve(), the rdf:ID is looked up in the hash table to get the address of the associated CIM object. Afterwards, a set of assignment functions is used to link the objects together.
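The lookup structure itself is not shown in the text; a minimal sketch of such a dynamic switch, assuming std::unordered_map as hash table and the XML attribute tag as key, could look as follows:

#include <sstream>
#include <string>
#include <unordered_map>

class BaseClass;  // common base class of all generated CIM C++ classes

// Signature shared by the generated assignment functions (cf. List. 3.4).
using AssignFn = bool (*)(std::stringstream&, BaseClass*);

bool assign_IdentifiedObject_name(std::stringstream&, BaseClass*);  // generated elsewhere (cf. List. 3.4)

// "Dynamic switch": hash table from attribute tag to assignment function,
// giving O(1) average-case lookup instead of a chain of if-branches.
static const std::unordered_map<std::string, AssignFn> assignment_map = {
    {"cim:IdentifiedObject.name", &assign_IdentifiedObject_name},
    // ... one entry per generated assignment function (> 3000 for IEC 61970)
};

bool dispatch_assignment(const std::string &tag, std::stringstream &buffer,
                         BaseClass *object) {
    auto it = assignment_map.find(tag);
    return it != assignment_map.end() && it->second(buffer, object);
}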


3.4.4 Unmarshalling Code Generator

In the previous section, the unmarshalling process of libcimpp was described. The developed CIM-Unmarshalling-Generator application generates C++ code for the introduced classes Task and CIMFactory as well as for the assignment functions. This step is performed with the aid of the CTemplate engine [Spe]. Each template engine needs a data source for template file rendering. To be as independent as possible from any tools, no proprietary format containing the CIM model was used. It would be possible to export the available CIM model to an open format like XMI, but this approach was rejected for different reasons: like the code generation, the XMI export of the visual UML editor used can have inadequacies, too. The present corrected and adapted CIM C++ codebase already contains all needed information about the given CIM model and can be used as input for the template engine’s database. Thus, subsequent manual changes to the CIM C++ codebase can also be considered by the CIM++ toolchain. Therefore, the database needed for the template engine is built from the CIM C++ codebase. As already mentioned, the introduced class CIMFactory creates instances of CIM classes that are requested by their names. Therefore, appropriate functions are needed for each CIM class. These functions can be expressed by the template snippet presented in List. 3.5. There, {{#FACTORY}} begins a repeated section, and {{CLASS_NAME}} as well as {{QUAL_CLASS_NAME}} are placeholders which are replaced at render time by values read from the database. Based on this template, the CIM-Unmarshalling-Generator creates the appropriate function for each CIM class. Therefore, the ASTVisitor creates a section dictionary for each class definition it finds in the CIM C++ files, since the CTemplate engine works with dictionaries to set the placeholders at render time. The final code for Terminal after the so-called rendering by the template engine is shown in List. 3.6. The AST visitor also has access to a whitelist in which all CIM classes that are used like data types (i. e. they just occur in attribute declarations of other CIM classes) are listed and, as a consequence, are not directly instantiated. For these classes no sections are generated.

Listing 3.5: Snippet of CIMFactory template

{{#FACTORY}}
BaseClass* {{CLASS_NAME}}_factory() {
    return new {{QUAL_CLASS_NAME}};
}
{{/FACTORY}}


Listing 3.6: Automatically generated Terminal_factory()

BaseClass* Terminal_factory() {
    return new IEC61970::Base::Core::Terminal;
}

The function which initializes the hash table of the CIMFactory is also generated from the shown template with the aid of the created section dictionaries. The template for Task contains sections for attributes of a pointer type or of a list of pointers (in case of given multiplicities greater than 1). Although in CIM associations are generally modeled in the form of bidirectional links, in typical CIM RDF/XML documents they are implemented as unidirectional relations. This is therefore done analogously for the CIM C++ objects. An example is the association of the class Terminal with ConnectivityNode. In C++ this association is realized as a pointer attribute of Terminal. In CIM RDF/XML documents it is realized in the form of the tag cim:Terminal.ConnectivityNode with an RDF reference to the RDF ID of an object of the class ConnectivityNode. For resolving a corresponding task, a function is needed which assigns the address of the object referenced by the given RDF ID to the attribute ConnectivityNode of the class Terminal. Compositions are not used in the available CIM model, but there are many aggregations, which are unidirectional, too. Nevertheless, the CIM C++ implementation of aggregations expressed in CIM RDF/XML is not that straightforward. In C++, the aggregating object contains an attribute of a pointer type or a list of pointers which point(s) to the aggregated object(s). The XML document, however, contains XML tags which are part of elements embedded in the aggregated objects. These aggregated objects contain RDF references to their aggregating object. Therefore, functions are needed which assign pointers to the aggregated objects to the pointers or lists of pointers of the aggregating objects. The AST visitor generates an assignment function for each pointer or list attribute of the CIM C++ classes. These functions take base class pointers to the objects which have to be linked together as arguments. The lookup of the proper function is accomplished by another hash table. The main issue is the generation of the function which initializes the hash table with the correct XML tags as keys for the function pointers.
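The generated linkage functions themselves are not shown here; a sketch of one of them for the Terminal and ConnectivityNode example (function name and exact namespaces are assumptions) might look as follows:

// (CIM C++ headers of libcimpp assumed to be included)
// Sketch (assumed, simplified): task-resolving assignment function that
// links a Terminal to its referenced ConnectivityNode via base class pointers.
bool assign_Terminal_ConnectivityNode(BaseClass *terminal_ptr,
                                      BaseClass *node_ptr) {
    auto *terminal =
        dynamic_cast<IEC61970::Base::Core::Terminal*>(terminal_ptr);
    auto *node =
        dynamic_cast<IEC61970::Base::Core::ConnectivityNode*>(node_ptr);
    if (terminal != nullptr && node != nullptr) {
        terminal->ConnectivityNode = node;  // unidirectional relation, as in RDF/XML
        return true;
    }
    return false;
}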


In some cases the association representation is rather simple. For example, for Terminal with the attribute ConnectivityNode, the AST visitor generates the key value cim:Terminal.ConnectivityNode. This is expressed by the following template:

cim:{{CLASS_NAME}}.{{FIELD_NAME}}

In other cases (depending on the CIM UML specification), the generation of correct key values is different. For instance, TopologicalNode aggregates one or more instances of the CIM class Terminal, but the XML tag representing the association (here an aggregation) is written the other way round (therefore called inverted XML tag), namely cim:Terminal.TopologicalNode. Therefore, in the case of the C++ class TopologicalNode with the attribute Terminal, which represents the association, the key value can be expressed by the template:

cim:{{FIELD_NAME}}.{{CLASS_NAME}}

This procedure is sufficient in most cases, but in some CIM documents the XML tag representing the association looks different. Therefore, there are configuration files with proper mappings from key values generated by the previous template to the inverted XML tags representing associations in the CIM RDF/XML documents to be deserialized. These configuration files (one for primitive types and another for the remaining classes) are read by libcimpp at runtime. Currently there are only around a dozen such cases. With these and further template sections, the unmarshalling code of CIM++ is completed. The sections of the class Task are filled (as shown with the previous examples) by the AST visitor with the aid of further placeholders and dictionaries. Furthermore, the template for the assign function consists of two sections. The first one (ENUM_STRINGSTREAM) generates the unmarshalling function for enumerations and the second one the actual assignments of the read-in data to the CIM C++ objects. In this unmarshalling function, the stream operators for enumerations are implemented. Therefore, for all enumerated types, proper CIM RDF/XML data can be read in with the aid of streams as for primitive types. Since the enumerated types are strongly typed, besides the placeholder {{ENUM_CLASS_TYPE}} for enumerations without a scope, there is {{QUAL_ENUM_CLASS_TYPE}} for scoped enumerations. For filling these placeholders, the AST visitor traverses all enum class declarations and generates the needed section dictionaries. Finally, an ASSIGNMENT section for the assignment function contains several placeholders which are filled using section dictionaries generated while visiting all attributes of the CIM C++ classes whose type is a data type or an enumeration.
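To illustrate how such section dictionaries drive the rendering (a sketch assuming the CTemplate API and a made-up template file name; the markers are those of List. 3.5), the Terminal factory of List. 3.6 could be produced as follows:

#include <string>
#include <ctemplate/template.h>

int main() {
    ctemplate::TemplateDictionary dict("CIMFactory");

    // One section dictionary per CIM class found by the AST visitor.
    ctemplate::TemplateDictionary *sec = dict.AddSectionDictionary("FACTORY");
    sec->SetValue("CLASS_NAME", "Terminal");
    sec->SetValue("QUAL_CLASS_NAME", "IEC61970::Base::Core::Terminal");

    std::string code;
    ctemplate::ExpandTemplate("CIMFactory.tpl", ctemplate::DO_NOT_STRIP,
                              &dict, &code);   // renders List. 3.5 into List. 3.6
    return 0;
}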


Listing 3.7: serialize function of ACLineSegment

 1 std::string ACLineSegment::serialize(bool isXmlElement,
 2                                      std::map<BaseClass*,
 3                                               std::string>
 4                                          *id_map)
 5 {
 6     std::string output = "";
 7
 8     if (isXmlElement) {
 9         output.append("<cim:ACLineSegment rdf:ID=\"" +
10                       (*id_map)[this] + "\">\n");
11     }
12
13     output.append(IEC61970::Base::Wires::Conductor::
14                   serialize(false, id_map));
15
16     if (bch.value.initialized) {
17         output.append("  <cim:ACLineSegment.bch>" +
18                       std::to_string(bch.value) +
19                       "</cim:ACLineSegment.bch>\n");
20     }
21     ...
22
23     if (isXmlElement) {
24         output.append("</cim:ACLineSegment>\n");
25     }
26 }

3.4.5 Marshalling

For the serialization of CIM C++ objects from the main memory to CIM documents, BaseClass was extended by the member function

virtual std::string serialize(bool isXmlElement,
                              std::map<BaseClass*, std::string> *id_map)

that can be overridden by all CIM subclasses, as they all inherit (directly or indirectly through other classes) from BaseClass. For instance, ACLineSegment overrides it with the function partly depicted in List. 3.7. The isXmlElement parameter tells the serialize method whether the attributes to be serialized belong to a superclass of the instance (isXmlElement = false) or to the class of the instance (isXmlElement = true). In the latter case, XML element tags (see lines 10 and 24) must be wrapped around the attributes’ marshalling output (between lines 12 and 21). This means that if an instance of ACLineSegment has to be serialized, ACLineSegment::serialize is called with isXmlElement = true, leading to a serialization with the introductory XML line

<cim:ACLineSegment rdf:ID=...>. In line 14, the serialize method of its superclass Conductor is called with isXmlElement = false to achieve a serialization of the superclass’s attributes without any XML tags introducing a new Conductor object.
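As a hypothetical usage sketch (the content and construction of id_map are assumptions; only the signature above is given in the text), serializing a single object could look like this:

// (CIM C++ headers of libcimpp assumed to be included)
#include <map>
#include <string>

void marshalling_example() {
    IEC61970::Base::Wires::ACLineSegment line;

    std::map<BaseClass*, std::string> id_map;  // assumed: object pointer -> rdf:ID
    id_map[&line] = "_ACLS_1";                 // assumed ID value

    // Top-level call: isXmlElement = true wraps the attributes in
    // <cim:ACLineSegment rdf:ID=...> ... </cim:ACLineSegment> tags.
    std::string xml = line.serialize(true, &id_map);
}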

3.5 libcimpp Implementation

The CIM++ (De-)Serializer is implemented as a library which is completed with the code generated by the CIM++ toolchain. Afterwards, it can easily be built as a CMake project. The libcimpp library is available as an open-source project [FEI19a]. It already contains automatically generated code for current CIM versions. Pointers to the C++ objects deserialized from CIM documents are provided in the form of a list. Furthermore, a documentation for libcimpp is generated by Doxygen [Hee] and is available to the user.

3.6 Evaluation

The flexibility and usability of the developed and implemented approaches are demonstrated here by a use case scenario. Regarding the flexibility, it shall be shown that the developed toolchain for CIM C++ code adaption (presented in Sect. 3.3) and the CIM-Unmarshalling-Generator can be successfully applied to a given CIM model which was changed or extended by a visual UML editor. As already mentioned, the currently available open-source version of libcimpp was generated and can be used for the deserialization of different CIM versions as published by the CIMug. However, the main goal was to make CIM++ able to deserialize objects of classes added with a visual UML editor. This flexibility will be shown exemplarily by newly introduced component classes which are missing in the original CIM standards and needed in the SINERGIEN co-simulation environment. There, the original CIM classes are extended by a Sinergien package containing the mentioned additional classes. One of them is the class BatteryStorage, which has become necessary now that battery storages are increasingly integrated at distribution level. After an extension of the original IEC 61970 standard (iec61970cim16v29a_iec61968cim12v08_iec62325cim03v01a from CIMug) by the Sinergien package with the aid of Enterprise Architect (v11.0.1106), the CIM C++ code is generated and the introduced toolchain for adapting the CIM C++ code to be compilable is applied. This also allows an application of Doxygen on the code, which generates the developer documentation i. a. for the added Sinergien classes [FEId]. For instance, this

also includes the collaboration diagram of the BatteryStorage class as depicted in Fig. 3.4. After the toolchain for CIM C++ code adaption, the CIM-Unmarshalling-Generator is applied, which completes libcimpp by the code for unmarshalling. The correct functionality of the generated unmarshalling code is demonstrated in [Raz+18a] and in Chap. 4 by the translation of an established power grid topology with the aid of libcimpp, which, among other things, shows the correct functionality of the code generated by the CIM-Unmarshalling-Generator.

3.7 Conclusion and Outlook

In this chapter, the concept of an automated CIM RDF/XML (de-)serializer generation has been presented. The approach is based on an automated mapping from CIM UML to compilable C++ code with the aid of a visual UML editor, a compiler front-end, and a template engine. Using these components, the implemented code adaption toolchain is flexible enough to generate correct CIM C++ code from different CIM based ontologies which then, together with the automatically generated unmarshalling code, can be integrated into the libcimpp (de-)serializer library.

Besides software technical improvements related to libcimpp itself, the approach could be extended by serialization from C++ objects to CIM RDF/XML documents as well as to XML streams, e. g. for XMPP communication. After a definition of the required steps, the so-called marshalling code can be added to the classes by the code adaption toolchain.

Additionally, it could happen that the generated CIM C++ codebase contains circular class dependencies. In the case of present CIM models there are only a few of them (always at the same positions), which is why they are resolved during code adaption by the mentioned code patches using forward declarations. Although circular dependencies should be avoided by a clean UML design, it could be researched how such forward declarations and different solutions could be applied by the code adaption toolchain in an automated way.

Such efforts currently contribute to the first drafts of a harmonization standard [IEC17]. In contrast to the mapping from CIM primitive data types to intrinsic C++ types and classes presented in this work, in [Lee+15] a data type unification of IEC 61850 and CIM is shown. This also includes definitions of operations from CIM and IEC 61850 types to unified data types using Query/View/Transformation (QVT), which is specified by the Object Management Group (OMG) as part of MDA. Since the main importance for libcimpp is to store data adequately, transformations are only performed if a sufficient accuracy can be achieved as specified by the CIM standards.

Figure 3.4: Section of the collaboration diagram for BatteryStorage generated by Doxygen on the automatically adapted CIM C++ codebase. The entire diagram can be found in [FEIb]

However, in conformance with the CIM++ approach, the generated CIM C++ classes could be extended during their automated adaption by member functions providing such QVTs for areas where a harmonization with IEC 61850 is desirable.

Approaches to synchronize UML models and source code in an automated way are continuously improved [GDD+06; Sad+09]. The main idea behind such RTE methods is a more visual software development [Die07] which is not finished after the software design phase but is iteratively repeated during the implementation phase. Therefore, there is also ongoing research which began with reverse engineering methods and so-called Computer-Aided Software Engineering (CASE) tools [Nic+00]. For instance, in [Kol+02] a comparison of the reverse engineering capabilities of commercial and academic CASE tools is presented. Because of the increasing complexity of software systems, the application of MDA based methods is becoming more and more important. Thus, our approach contributes to these efforts.

Besides generic XML and RDF/XML parsers, as mentioned in Sect. 3.4.2, which are subjects of research activities as well [Mae12], there is also a CIM-specific parser with serialization capabilities according to [IEC16a] available, called PyCIM [Lin], currently supporting only CIM versions until 2011. Since the project is not maintained anymore, a new project for CIM document (de-)serialization called CIMpy is being developed at ACS. Besides CIM, it will also support CGMES, which is defined based on CIM [ENT16]. CGMES is currently also being integrated into libcimpp in an automated way with deserialization as well as serialization capabilities.


4

From CIM to Simulator-Specific System Models

In Chap. 3 the relevance of the Common Information Model (CIM) for power grids has been outlined and an automatically generated (de-)serializer library for documents based on the CIM has been presented. Because of the widespread use of CIM-based grid topology interchange, commercial power system simulation and analysis tools such as NEPLAN and PowerFactory can handle CIM. The problem with such proprietary simulation solutions in the academic area is often an insufficient or unavailable possibility to modify component models as well as solvers. As a consequence, many open-source and free power system simulation tools have been developed during recent years, for instance, MATPOWER [MAT19], which is compatible with the proprietary MATLAB as well as the open-source GNU Octave [Eat19] environment [ZMT11], its Python port PYPOWER [Lin19a], and pandapower [Fra19]. Other open-source solutions are programmed in the object- and component-oriented multi-domain modeling language Modelica [Fri15b]. Since it allows a declarative definition of the model equations, the Modelica user resp. programmer does not need to transform mathematical models into imperative code (i. e. assignments). Modelica simulations can be executed with the aid of proprietary environments such as Dymola and open-source ones such as OpenModelica [Fri+06] and JModelica [Åke+10] with various numerical back-ends. Modelica libraries with models for power system simulations are PowerSystems [FW14] and ModPowerSystems [MNM16]. The use of Modelica for power system simulation is not limited to academia but it

is also applied in real operation, especially with CIM-based grid topologies as shown in [Vir+17]. However, in the presented approach an intermediate data format, called IIDM, is used. The main contribution of this chapter is the presentation of a template-based transformation from CIM to Modelica system models. It has been implemented in the open-source tool CIMverter which, in its current version, transforms CIM documents into Modelica system models based on arbitrary Modelica libraries, as specified by the user. The transformation into arbitrary Modelica system models allows the execution of any kind of Modelica simulation which shall make use of information stored in CIM documents. To achieve this, CIMverter utilizes a template engine that processes template files written in Modelica, containing placeholders. These placeholders are filled by the template engine with data from CIM documents and combined to a complete system model that can be simulated in an arbitrary Modelica environment. The use of a template engine leads to encapsulation, clarity, division of labor, component reuse, single point-of-change, interchangeable views, and so forth, as stated in [Par04]. For instance, this means that in case of many interface changes of a component model, the Modelica user does not need to modify the CIMverter source files but just the templates written in Modelica. Hence, no special knowledge of CIMverter’s programming language (C++) or any domain-specific language (DSL) is needed. Furthermore, this chapter presents examples of how CIM objects can be mapped to objects of a typical Modelica power system library. Our template-based approach can also be used for conversions to formats other than Modelica. Hence, system models of the Distributed Agent-Based Simulation of Complex Power Systems (DistAIX) simulator [Kol+18] can also be generated from CIM documents, simply by adapting the template files used by CIMverter for the transformation. This chapter gives a short introduction to the data formats as well as the main software components used in CIMverter, followed by an overview of the overall concept. Then it describes how the mapping from CIM to Modelica is performed at the top level and at the bottom level with the usage of a C++ representation of the Modelica classes in the so-called Modelica Workshop. Following this, the approach and implementation are evaluated with the aid of two Modelica power system libraries and validated against a commercial simulation tool. Finally, related work is discussed and the chapter is concluded by a roundup and an outlook on future work. The work in this chapter has been partially presented in [Raz+18b]1.

1 “CIMverter—a template-based flexibly extensible open-source converter from CIM to Modelica” by Lukas Razik, Jan Dinkelbach, Markus Mirz, Antonello Monti is licensed under CC BY 4.0


4.1 CIMverter Fundamentals

For an introduction to CIM, RDF, and XML please refer to Sect. 3.1. In the following, Modelica and template engines are introduced briefly.

4.1.1 Modelica

Modelica enables engineers to focus on the formulation of the physical model by the implementation of the underlying equations in a declarative manner [Fri15b]. The physical model can be readily implemented without the necessity to fix any causality through the definition of input and output variables, thus increasing the flexibility and reusability of the models [Til01]. Besides, existing Modelica environments relieve the engineer from the implementation of numerical methods to solve the specified equation system.

Modelica Models

The concept of component modeling by equations is shown exemplarily in List. 4.1 for a constant power load, which is typically employed to represent residential and industrial load characteristics in electrical grid simulations. The presented PQLoad model is part of the ModPowerSystems [MNM16] library and is derived from the base model OnePortGrounded using the keyword extends, underlining that the Modelica language supports object-oriented modeling by inheritance. In the equation section, the physical behavior of the model is defined in a declarative manner by the common equations for active and reactive power. The parameters employed in the equations are declared in the PQLoad model beforehand, while the declarations of the complex variables voltage and current are inherited from the base model OnePortGrounded.

Listing 4.1: Component model of a constant power load

model PQLoad " Constant power load " extends ModPowerSystems.Base.Interfaces. ComplexPhasor.SinglePhase.OnePortGrounded ; parameter SI.ActivePower Pnom = 0.5 e6 " active power "; parameter SI.ReactivePower Qnom = 0.5 e6 " reactive power "; equation Pnom /3 = real (v* conj (i)); Qnom /3 = imag (v* conj (i)); end PQLoad ;

A complex system, e. g., an entire electrical grid, can be implemented as a system model by instantiating multiple components and specifying their interaction by means of connection equations, see line 25 in List. 4.6. The connect construct involves two connectors and introduces a fixed relation between their respective variables, e. g., between their voltages (equality coupling) and currents (sum-to-zero coupling). Typically, Modelica environments provide a GUI for the graphical composition of models.

Modelica Simulations

In [Fri15a] the translation and execution of a Modelica system model is sketched. At first, the system model is converted into an internal representation (i. e. an abstract syntax tree (AST)) of the Modelica compiler. On this representation, the Modelica language specific functionality is applied and the equations of the used component models (which are the blocks in the graphical representation of the system model) are connected together. This results in the so-called flat model. Then all equations are sorted according to the data-flow among them and transformed by algebraic simplification algorithms, symbolic index reduction methods, and so forth, to a set of equations that will be solved numerically. For instance, duplicates of equations are removed. Also, equations in explicit form are transformed to assignment statements (i. e. an imperative form), which is possible since they have been sorted. The established execution order leads to an evaluation of the equations in conjunction with the iteration step of the numeric solver. Subsequently, the equations are translated to C code, equipped with a driver (i. e. C code with a main routine), and compiled to an executable (i. e. a program) which is linked to the utilized numerical libraries. This program is then executed according to a configuration file which defines, e. g., the simulation’s start and end times, the numerical methods to be utilized, the simulation results format, and so forth. Initial values are usually taken from the model definitions in Modelica. For the conversion from CIM to Modelica system models it must be defined where the topology parameters (written in the CIM document to be converted) must be placed in the Modelica system model (i. e. the resulting Modelica file). For this purpose, a template engine is used, whose functional principle is introduced in the following.


4.1.2 Template Engine

A template engine (also called template processor or template system) is commonly used in web site development and, in CIMverter, generates the Modelica code. Template engines allow the separation of model (i. e. logic as well as data) and view (i. e. resulting code). For CIMverter this means, in short, that there is no Modelica code within the C++ source code of CIMverter. To achieve this, template engines have
a data model, for instance based on a database, a text/binary file, or a container type of the template engine’s programming language,
template files (also called templates) written in the language of the resulting documents together with special template language statements, and
result documents which are generated after the processing of data and template files, the so-called expanding,
as illustrated in Fig. 4.1, where an example HTML code template with a placeholder {{name}} is filled with the name from a database, resulting in a complete HTML document. Such placeholders are one type of template markers.
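As a minimal, self-contained sketch of this principle with the CTemplate engine used in this work (file name, marker name, and value are made up for illustration), the expansion can be driven as follows:

#include <iostream>
#include <string>
#include <ctemplate/template.h>

int main() {
    // hello.tpl is assumed to contain the single line:  Hello {{NAME}}!
    ctemplate::TemplateDictionary dict("example");
    dict.SetValue("NAME", "World");               // data model: marker -> value

    std::string output;
    ctemplate::ExpandTemplate("hello.tpl", ctemplate::DO_NOT_STRIP,
                              &dict, &output);    // expand the template
    std::cout << output << std::endl;             // prints: Hello World!
    return 0;
}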

4.2 CIMverter Concept

The concept of CIMverter is depicted in Fig. 4.2. The upper part shows the automated code generation process from the definition of the ontology by CIM UML to the unmarshalling code generation of the CIM++ (De-)Serializer library libcimpp. The middle part shows the transformation process from a given topology (based on the specified CIM ontology) to a Modelica system model, based on Modelica libraries which are addressed by appropriate Modelica templates.

Figure 4.1: Template engine example with HTML code

It uses and extends the concept of CIM++ as introduced in [Raz+18b]. The CIM UML ontology can be edited by a visual UML editor and exported to a CIM C++ codebase which is not compilable and therefore needs to be completed by the CIM++ code toolchain. The resulting adapted CIM C++ codebase, representing all CIM classes with their relations, is compilable and used by the CIM++ (Un-)Marshalling Generator for the generation of the code which is needed for the actual deserialization process of libcimpp. The CIM++ toolchain and the (Un-)Marshalling Generator are applied in an automated way whenever the ontology is changed. This keeps libcimpp compatible with the newest CIM RDF/XML documents. CIMverter uses libcimpp for the deserialization of CIM objects from RDF/XML documents to C++ objects. Therefore, CIMverter also includes the adapted CIM C++ codebase, especially the headers for all CIM classes. Due to the ongoing development of CIM and the concomitant automated modifications of these headers, one might suppose that the CIMverter development has to keep track of all CIM modifications, but in the vast majority of cases a subsequent modification of the CIMverter code is unneeded. This is because the continuous development of CIM mostly leads to new

Figure 4.2: Overall concept of the CIMverter project


CIM classes with further relations or new attributes in existing classes. Such extensions of existing CIM classes require no changes to the CIMverter code using them. With a Modelica editor, the component models of Modelica libraries can be edited. In case the interface of a component model is changed, the appropriate Modelica template files have to be adapted by the CIMverter user. Thereby, using the template engine with the concomitant model-view separation leads to the following advantages:

clarity: the templates are written in Modelica with only a few kinds of template keywords (i. e. markers).

division of labor: the CIMverter user, typically a person with an electrical engineering background and knowledge of Modelica, can adapt the Modelica templates easily in parallel with the CIMverter programmer, reducing conflicts during their developments. While the engineer neither needs any C++ programming skills nor any knowledge of CIMverter internals, the programmer does not need to keep CIMverter up-to-date with all Modelica libraries that could be used with CIMverter.

component reuse: for better readability, templates can include other templates, which can be reused for different component models of the same or further Modelica libraries.

interchangeable views: some Modelica models can be compiled with various options, e. g., for the use of different model equations, which can be defined directly in the code of the system model. For this purpose, the user can easily specify another set of templates.

maintenance: changes to the Modelica code to be generated, which are needed, e. g., due to changes of component model interfaces, can in many cases be achieved by editing template files. Changing a template is also less risky than changing a program, which can lead to bugs. Furthermore, recompiling and reinstalling CIMverter is unnecessary.

As already pointed out, some changes to the Modelica libraries require more than a template adaption, which is related to the mapping of the deserialized CIM C++ objects to the dictionaries of the template engine used to complete the Modelica templates to full system models. For a clear mapping of the relevant data from the CIM C++ objects to the template dictionaries, the Modelica Workshop was introduced. For each Modelica component, the Workshop contains a C++ class with attributes

holding the values to be inserted into the appropriate dictionary, which will be used for the Modelica code fragment expansion of the belonging component within the Modelica system model. The mapping from CIM C++ objects to these Modelica Workshop objects is defined in C++ code. An alternative would have been the introduction of a DSL for a more flexible mapping definition. However, a really flexible DSL would have to support data conversions and computations for data mappings from CIM to Modelica class instances. Despite tools for DSL specification and parser generation etc., the complexity of the CIMverter project would increase. Moreover, CIMverter users as well as the programmers would need to become familiar with the DSL. Both reasons would make CIMverter’s maintenance and further development more complicated and therefore less attractive to potential developers. For instance, the co-simulation framework mosaik at the beginning also made use of a specially developed DSL for scenario definitions [Sch11], but it was removed later on and now the scenarios are described in Python, in which mosaik is implemented, as this is more flexible and powerful. The Modelica Workshop and other implementation design aspects, as described in the next sections, shall perform the C++ coded mappings in an intuitive and understandable way, thus making CIMverter easily extensible by further Modelica component models and libraries.

4.3 CIMverter Implementation

As described conceptually, CIMverter utilizes libcimpp for the deserialization of CIM topology documents (e. g. power grids) and for the generation of full system models based on the chosen Modelica library (e. g. ModPowerSystems). C++ was selected as programming language because libcimpp, with its included CIM C++ codebase, as well as CTemplate are both written in C++. Since C++ is a statically typed language with strong type checking and fewer runtime type information (RTTI) capabilities than a dynamic language such as Python, speculative dynamic typecasts are used to obtain the correct CIM C++ class objects. In any case, the time for converting CIM to Modelica models is negligible in comparison to the compile time of the generated Modelica models. The usage of C++ also allows looking up CIM details in the Doxygen documentation generated from the adapted CIM C++ codebase of CIM++. CIMverter has a command line interface (CLI) and follows the UNIX philosophy of developing one program for one task [MPT78; Ray03]. Therefore, it can simply be integrated into a chain of tasks which need to be performed between the creation of a CIM topology and the simulations

within a Modelica environment, as realized in the SINERGIEN co-simulation project [Mir+18] described in Chap. 2. A configuration file is handled with the aid of the libconfig++ library, where i. a. the default graphical extent of each Modelica component can be adjusted. It also allows the definition of default CIM datatype multipliers (e. g. M for MW in case of IEC61970::Base::Domain::ActivePower) which are not defined in some CIM RDF/XML documents, such as the ones from NEPLAN based on the European Network of Transmission System Operators for Electricity (ENTSO-E) profile specified by [ENT]. After these implementation details, the main aspects of the overall implementation are presented in the following subsections.
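As an illustration (the file name and setting keys below are assumptions and do not reproduce CIMverter's actual configuration schema), such a setting can be read with libconfig++ as follows:

#include <libconfig.h++>
#include <string>

int main() {
    libconfig::Config cfg;
    cfg.readFile("config.cfg");                        // assumed file name

    // Assumed key: default unit multiplier for ActivePower values (M as in MW).
    std::string active_power_multiplier = "M";         // fallback default
    cfg.lookupValue("ActivePower.multiplier", active_power_multiplier);
    return 0;
}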

4.3.1 Mapping from CIM to Modelica

The mapping from CIM documents to Modelica system models can be divided into three levels of consideration, as in [Cao+15]. At the first level, there are the library mappings. The relevant data from the CIM C++ objects, as deserialized by CIM++, is first stored in an intermediate object representation (i. e. in the Modelica Workshop) with a class structure similar to the one of the Modelica library. Hence, for each Modelica library there can be a set of appropriate C++ class definitions in the Modelica Workshop. Object mappings are part of the second level. These are not just one-to-one mappings, as illustrated in Fig. 4.3. Sometimes, several CIM objects are mapped to one Modelica object resp. component, such as the IEC61970::Base::Wires::PowerTransformer. There are also CIM objects like IEC61970::Base::Core::Terminal (electrical connection points, linked to other CIM objects) which are not mapped to any Modelica component models. Parameter and unit conversions are performed at the third level between the CIM C++ objects and the Modelica Workshop objects. Examples are voltages, coordinates, and so forth.

Figure 4.3: Mapping at second level between CIM and Modelica objects

The next section addresses the second and third level mappings as part of the Modelica Workshop, but first, the CIM object handling is explained.

4.3.2 CIM Object Handler

The CIMObjectHandler is in charge of handling the CIM objects. Listing 4.2 shows a part of its main routine ModelicaCodeGenerator. It starts with the topological nodes because they have a central role in bus-branch based CIM topologies of power grids [Pra+11].

Listing 4.2: Snippet of the routine ModelicaCodeGenerator

ctemplate::TemplateDictionary *dict =
    new ctemplate::TemplateDictionary("MODELICA");
...
for (BaseClass *Object : this->_CIMObjects) {
  if (auto *tp_node = dynamic_cast<TPNodePtr>(Object)) {
    BusBar busbar = this->TopologicalNodeHandler(tp_node, dict);
    ...
    std::list<TerminalPtr>::iterator terminal_it;
    for (terminal_it = tp_node->Terminal.begin();
         terminal_it != tp_node->Terminal.end(); ++terminal_it) {
      ...
      if (auto *power_trafo = dynamic_cast<PowerTrafoPtr>(
              (*terminal_it)->ConductingEquipment)) {
        Transformer trafo = PowerTransformerHandler(
            tp_node, (*terminal_it), power_trafo, dict);
        Connection conn(&busbar, &trafo);
        connectionQueue.push(conn);
      }
      ...

When a TopologicalNode is found (saved as tp_node), a busbar object of the Modelica Workshop class BusBar is initialized with it. busbar is needed later on for the connections of all kinds of conducting equipment (i. e. power grid components) connected to it. Then, the inner loop iterates over all terminals of the found tp_node and checks which kind of ConductingEquipment is connected to the tp_node by the respective terminal. In the case of a PowerTransformer, a trafo object of the Modelica Workshop class Transformer is initialized with the data from the PowerTransformerHandler. Furthermore, a new connection between the previously created busbar and the trafo is constructed and pushed onto a queue of all connections. These steps are performed for all other kinds

of components, which is why the ModelicaCodeGenerator calls handlers for all of them. The tp_node and the terminal connected to the respective component (here: trafo) are passed to the appropriate component handler (here: PowerTransformerHandler). Besides, the handler also gets the main template dictionary dict, called "MODELICA". Within a handler, the conversions from the required CIM C++ object(s) to the Modelica Workshop object trafo are performed. Furthermore, a sub-dictionary (here called "TRANSFORMER", used for the Transformer subtemplate, see e. g. List. 4.4) is created and linked to the given main template dictionary (see List. 4.3). Some conversions are related to the graphical representation of the CIM objects. This is because a graphical power grid editor which can export CIM documents can link an IEC61970::Base::DiagramLayout::DiagramObject to each component, with information about the position of this component, i. e. (x, y)-coordinates, in the coordinate system of the graphical editor. Since the coordinate system of the CIM exporting editor (e. g. NEPLAN) can differ from the one of the Modelica editor (e. g. OMEdit), the coordinates are converted by the following code lines:

t_points.xPosition = trans_para[0]*x + trans_para[1];
t_points.yPosition = trans_para[2]*y + trans_para[3];

For reasons of flexibility, the four parameters trans_para can be set in the configuration file; in the case of NEPLAN and OMEdit they are initialized with {1, 0, -1, 0} (for trans_para[0] to trans_para[3]). Furthermore, the NEPLAN generated CIM documents have several DiagramObject instances linked to one component. To avoid multiple occurrences of the same component in the Modelica connections diagram, the middle point of these DiagramObject coordinates is calculated. This middle point then defines the component’s position in the Modelica connections diagram. Another conversion must be performed for the instance names of Modelica classes, which are derived from the name attribute of the CIM object and may not begin with or contain certain characters. Each such object derives its name attribute from the elementary IEC61970::Base::Core::IdentifiedObject superclass. More details on the electrical conversions will be given in the next section.
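A minimal sketch of such a name conversion is given below; the character rules and the helper name are assumptions, only the CIM_ prefix is taken from the generated models (e. g. CIM_TR1 in List. 4.6).

#include <cctype>
#include <string>

// Sketch (assumed rules): derive a Modelica-compatible instance name from
// IdentifiedObject.name by prefixing it and replacing characters that are
// not allowed in Modelica identifiers.
std::string modelica_instance_name(const std::string &cim_name) {
    std::string out = "CIM_";  // prefix as seen in the generated system models
    for (char c : cim_name)
        out += (std::isalnum(static_cast<unsigned char>(c)) || c == '_') ? c : '_';
    return out;
}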

4.4 Modelica Workshop Implementation

In List. 4.2, different CIM object handlers (e. g. PowerTransformerHandler) return appropriate Modelica Workshop objects which represent components of the targeted Modelica library. It should be stated at this juncture that CIM is not only related to power grid components and, for

instance, also includes energy market players (e. g. Customer), Asset, and so forth. Moreover, as presented in [Mir+18], CIM can also be extended by further classes of different domains. Hence, the Modelica Workshop does not need to be reduced to power grid components, even though the current Modelica Workshop is related to components for power grid simulations. This is due to ModPowerSystems being the first Modelica library targeted by CIMverter. Nonetheless, the current Modelica Workshop can be used as is for the utilization of another Modelica library, as presented in the Evaluation. To avoid reimplementations, each Modelica Workshop class representing a Modelica component, such as Slack or Transformer, inherits from the so-called ModBaseClass.

4.4.1 Base Class of the Modelica Workshop

All Modelica components need annotation information which defines the visibility of the component, its extent, rotation, etc. Each Modelica Workshop class inheriting from ModBaseClass therefore has an annotation member holding the annotation data in the form used in the Modelica component’s annotation statement. For this purpose, ModBaseClass also holds several member functions which combine the annotation data into well structured strings as needed for the template dictionary used for filling the annotation statements of all Modelica template files, since the annotation statements of all Modelica components have the same structure and the same markers (see lines 12-14 and 20-22 of List. 4.6). For the Modelica statements which differ between different Modelica components (see lines 8-11 and 16-19 of List. 4.6), there exists a virtual function set_template_values. In each of the component subclasses this function is overridden with a specialized one which sets all markers needed for a complete filling of the belonging Modelica component template, such as the one presented in List. 4.4. Further member variables of ModBaseClass hold the name of the object and the specified units information, whose default values are set in the configuration file. The object’s name is read from the name attribute of the CIM class IdentifiedObject. Besides, ModBaseClass accumulates objects of the CIM class DiagramObject, where the object’s rotation and points in the GUI coordinate system are stored.
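The following sketch summarizes this structure; apart from annotation and set_template_values, the member names are illustrative assumptions and do not reproduce the actual declaration.

#include <string>
#include <ctemplate/template.h>

// Sketch (assumed, simplified) of the common Modelica Workshop base class.
class ModBaseClass {
public:
    std::string name;        // derived from IdentifiedObject.name
    std::string annotation;  // data of the Modelica annotation statement

    // overridden by each component class (Slack, Transformer, ...) to set
    // the component-specific markers of its Modelica template
    virtual void set_template_values(ctemplate::TemplateDictionary *dict) = 0;

    virtual ~ModBaseClass() = default;
};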

4.4.2 CIM to Modelica Object Mapping

One of the most interesting mappings is from the CIM PowerTransformer to the Modelica Workshop Transformer class, as presented in Tab. 4.1. The PowerTransformer consists of two or more coupled windings and therefore accumulates objects of the class PowerTransformerEnd, which represent the connectors of the PowerTransformer [FEIc]. Further important mappings implemented in the Modelica Workshop are listed in Tab. 4.2.


CIM PowerTransformer    Contained / Accumulated Member Variables    Modelica Workshop Transformer
PowerTransformerEnd1    BaseVoltage->nominalVoltage.value * mV      Vnom1
PowerTransformerEnd2    BaseVoltage->nominalVoltage.value * mV      Vnom2
PowerTransformerEnd1    ratedS.value * mP                           Sr
PowerTransformerEnd1    r.value                                     r
PowerTransformerEnd1    x.value                                     x
                        r · Sr · 100 / Vnom1²                       Pcur
                        √(r² + x²) · Sr · 100 / Vnom1²              Vscr

Table 4.1: CIM PowerTransformer to Modelica Workshop Transformer mapping. The left column shows the primary and secondary PowerTransformerEnd which accumulate further CIM objects, as listed in the middle column, holding the information needed for the initialization of the Transformer attributes listed in the right column. The constants mV and mP stand for the voltage and power value multipliers. The bottom of the table shows that additionally two conversions are needed to calculate the rated short circuit voltage Vsc,r and the short circuit losses Pcu,r in percent.
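A minimal sketch of this third-level conversion (variable and function names are illustrative; the inputs are assumed to already include the multipliers mV and mP) is:

#include <cmath>

// Sketch of the short-circuit parameter conversion from Tab. 4.1.
void transformer_conversion(double Vnom1, double Sr, double r, double x,
                            double &Pcur, double &Vscr) {
    Pcur = r * Sr * 100.0 / (Vnom1 * Vnom1);                          // short circuit losses in %
    Vscr = std::sqrt(r * r + x * x) * Sr * 100.0 / (Vnom1 * Vnom1);   // rated short circuit voltage in %
}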

4.4.3 Component Connections

After the instantiation of all components in the Modelica system model, the connections must be defined as well. In List. 4.2, for each newly created component a connection (i. e. an instance of the Connection class) to the corresponding busbar is created. For this purpose, a function template of Connection with the signature

template <typename T> void cal_middle_points(T *component);

is called in the constructors of Connection and computes one or two middle points between the endpoints of the connection line. The four different cases for the middle points are illustrated in Fig. 4.4. Furthermore, the connectors of the different components can vary between different Modelica libraries. Therefore, the connector names can be configured in a separate configuration file, called connectors.cfg, which is included in the directory of the belonging Modelica template files. Its settings are read by all Connection constructors, combined, and fed into the dictionary which is used for filling the connections subtemplate included by the main template file. The final Modelica code generation is exemplarily presented in the next section.


CIM                                            ModPowerSystems
TopologicalNode, ExternalNetworkInjection      Slack
ACLineSegment                                  PiLine
TopologicalNode, EnergyConsumer, SvPowerFlow   PQLoad

Table 4.2: Excerpt of further important mappings from CIM to ModPowerSystems as implemented in the Modelica Workshop

4.5 Evaluation

For an evaluation of the approach and its implementation, exemplary templates as well as the resulting Modelica models are shown. To demonstrate the flexibility and applicability of CIMverter, two different power system libraries are used: the ModPowerSystems and the PowerSystems library. Besides, the simulation results obtained with the generated models are validated against the commercial simulation tool NEPLAN. The main Modelica template defines the overall structure of the Modelica system model and contains markers for component instantiations and connection equations, see List. 4.3. The inserted subtemplates hold information regarding the library and package from which the models are taken. For instance, see line 1 in the corresponding subtemplates, List. 4.4 (for ModPowerSystems) and List. 4.5 (for PowerSystems), of the Transformer model.

Figure 4.4: Connections with zero, one, and two middle points between the endpoints. The endpoints are marked with circles
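The exact routing rules are not reproduced here; the following sketch shows one plausible variant (an assumption, not the original cal_middle_points) that matches the two middle points visible in the connect annotation in lines 26-27 of List. 4.6:

#include <vector>

struct Point { double x; double y; };

// Sketch (assumed logic): route a connection with horizontal and vertical
// segments only. Aligned endpoints need no middle point; otherwise two
// middle points at half the vertical distance are inserted (Z-shaped route),
// which reproduces the points {153.80,-40.00},{153.80,-56.15},{86.00,-56.15},
// {86.00,-72.30} generated for CIM_N0 -- CIM_TR1 in List. 4.6.
std::vector<Point> middle_points(Point a, Point b) {
    std::vector<Point> pts;
    if (a.x != b.x && a.y != b.y) {
        double ym = (a.y + b.y) / 2.0;   // -56.15 for y1 = -40.00, y2 = -72.30
        pts.push_back({a.x, ym});
        pts.push_back({b.x, ym});
    }
    return pts;
}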

As use case, we generate the components for a steady-state simulation of a symmetrical power system in balanced operation. For the ModPowerSystems library, we utilize models from the PhasorSinglePhase package, since complex phasor variables and a single phase representation are functional for this type of simulation. In the case of the PowerSystems library, we perform the simulation with models from the AC3ph package, obtaining comparable results by considering the dq0 transform in the synchronously rotating reference frame. Other types of simulation might be performed by changing the package and model names accordingly in the subtemplates. The considered Transformer subtemplates, List. 4.4 and List. 4.5, contain markers to define the primary and secondary nominal voltage as well as the rated apparent power. The interface of the ModPowerSystems component specifies the Transformer’s electrical characteristics by the rated short circuit voltage Vsc,r and the short circuit losses Pcu,r, while resistance R and reactance X are defined for the PowerSystems component. In our use case, we model the benchmark system described in [Rud+06], which is a medium-voltage distribution network with rural character. Integrated components are a slack bus, busbars, transformers, Pi lines, and PQ loads. An extract of the resulting Modelica system model generated from the CIM data with the presented CIMverter converter is shown in List. 4.6. The system model of the benchmark grid was additionally generated for the use of the PowerSystems library, simply by switching from the ModPowerSystems to the PowerSystems template set.

Listing 4.3: Main Modelica template related to ModPowerSystems, including several sections (e. g. SYSTEM_SETTINGS) and subtemplates (e. g. PQLOAD)

{{#HEADER_FOOTER_SECTION}}
model {{GRID_NAME}}
{{/HEADER_FOOTER_SECTION}}
{{#SYSTEM_SETTINGS_SECTION}}
  inner ModPowerSystems.Base.System {{NAME}}(
    freq_nom(displayUnit = "{{FNOM_UNIT}}") = {{FNOM}})
    annotation(Placement(visible = {{VISIBLE}},
      transformation(extent = {{TRANS_EXTENT_POINTS}},
        rotation = {{ROTATION}})));
{{/SYSTEM_SETTINGS_SECTION}}
  ...
{{>PQLOAD}}
{{>TRANSFORMER}}
  ...
equation
{{>CONNECTIONS}}
{{#HEADER_FOOTER_SECTION}}
  ...
end {{GRID_NAME}};
{{/HEADER_FOOTER_SECTION}}


Listing 4.4: Transformer subtemplate related to ModPowerSystems library

ModPowerSystems.PhasorSinglePhase.Transformers.Transformer
  {{NAME}}(Vnom1 = {{VNOM1}}, Vnom2 = {{VNOM2}},
    Sr(displayUnit = "{{SR_DISPLAYUNIT}}") = {{SR}},
    Pcur = {{PCUR}}, Vscr = {{VSCR}})
  annotation(Placement(visible = {{VISIBLE}},
    transformation(extent = {{TRANS_EXTENT_POINTS}},
      rotation = {{ROTATION}}, origin = {{ORIGIN_POINT}})));

Listing 4.5: Transformer subtemplate related to PowerSystems library

PowerSystems.AC3ph.Transformers.TrafoStray
  {{NAME}}(redeclare record Data =
    PowerSystems.AC3ph.Transformers.Parameters.TrafoStray(
      puUnits = false,
      V_nom = { {{VNOM1}}, {{VNOM2}} },
      r = { {{R}}, 0 },
      x = { {{X}}, 0 },
      S_nom = {{SR}}))
  annotation(Placement(visible = {{VISIBLE}},
    transformation(extent = {{TRANS_EXTENT_POINTS}},
      rotation = {{ROTATION}}, origin = {{ORIGIN_POINT}})));

The connection diagrams of the resulting models (Fig. 4.5) show the same grid topology involving the respective components from both libraries. For the validation of both Modelica system models, they were built and simulated. Afterwards, the simulation results were compared with the ones of the proprietary simulation tool NEPLAN (Tab. 4.3).

4.6 Conclusion and Outlook

This chapter presents an approach for the transformation from CIM to Modelica. The mapping of CIM RDF/XML documents to Modelica system models is based on a CIM to C++ deserializer, a Modelica Workshop representing the Modelica classes in C++, and a template engine. CIMverter, the implementation of this approach, is flexible enough to address arbitrary Modelica libraries, as demonstrated by the generation of system models for two power system libraries. The mappings implemented in the CIM object handlers for ModPowerSystems did not need to be modified when switching to the PowerSystems library. Also, the Modelica Workshop classes are compatible with both libraries.


Listing 4.6: Medium-voltage benchmark grid [Rud+06] as converted from CIM to a system model based on the ModPowerSystems library

 1 model modpowersystems_mv_benchmark_grid
 2   inner ModPowerSystems.Base.System
 3     System(freq_nom(displayUnit = "Hz") = 50.0)
 4     annotation(Placement(visible = true,
 5       transformation(extent = {{0.0,-30.0},{30.0,0.0}},
 6         rotation = 0)));
 7   ...
 8   ModPowerSystems.PhasorSinglePhase.Loads.PQLoad
 9     CIM_Load12_H(Pnom(displayUnit = "W") = 15000000.000,
10       Qnom(displayUnit = "var") = 3000000.000,
11       Vnom(displayUnit = "V") = 20000.000)
12     annotation(Placement(visible = true,
13       transformation(extent = {{-8.0,-8.0},{8.0,8.0}},
14         rotation = 0, origin = {237.1,-107.8})));
15   ...
16   ModPowerSystems.PhasorSinglePhase.Transformers.Transformer
17     CIM_TR1(Vnom1 = 110000.000, Vnom2 = 20000.000,
18       Sr(displayUnit = "W") = 40000000.000,
19       Pcur = 0.63000, Vscr = 12.04000)
20     annotation(Placement(visible = true,
21       transformation(extent = {{-8.0,-8.0},{8.0,8.0}},
22         rotation = -90, origin = {86.0,-64.3})));
23   ...
24 equation
25   connect(CIM_N0.Pin1, CIM_TR1.Pin1)
26     annotation(Line(points = {{153.80,-40.00},{153.80,-56.15},
27       {86.00,-56.15},{86.00,-72.30}},
28       color = {0,0,0}, smooth = Smooth.None));
29   ...
30 end modpowersystems_mv_benchmark_grid;

Subsequently, the generated system models simulated with a Modelica environment are successfully validated against a common power system simulation tool. CIMverter has already been successfully applied in the research area of power grid simulations as, for instance, in [Din+18]. It is obvious that the current implementation can also be used for conversions into other formats than Modelica, even with the current Modelica Workshop, as the introduced template markers can be used in every file format. Therefore, the Modelica Workshop could be cleaned up and extended to a general Power Systems Workshop, addressing data formats used by other power system analysis and simulation tools. Furthermore, the template-based approach also allows target system model formats other than Modelica.


Figure 4.5: Medium-voltage benchmark grid [Rud+06] converted from CIM to system models in Modelica based on the ModPowerSystems and PowerSystems libraries


Grid    NEPLAN                 ModPowerSystems        PowerSystems
Node    |V| [kV]    ∠V [°]     |V| [kV]    ∠V [°]     |V| [kV]    ∠V [°]
N0      110.000      0.000     110.000      0.000     110.000      0.000
N1       19.531     -4.300      19.532     -4.268      19.532     -4.268
N10      18.828     -4.900      18.828     -4.852      18.828     -4.852
N11      18.825     -4.900      18.826     -4.852      18.826     -4.852

Table 4.3: Excerpt from the numerical results for node phase-to-phase voltage magnitude and angle regarding the medium-voltage benchmark grid. The models based on the ModPowerSystems and PowerSystems libraries yield equal results using the Dymola environment and the dassl solver. The results deviate marginally from the reference results obtained with the proprietary tool NEPLAN, which might be explained by numerical rounding and different solution methods.

formats than Modelica. Meanwhile, the system model format for the DistAIX simulator [FEIa] has also been implemented. Additionally, the current middle-point calculations for the Modelica connection diagrams could be improved by the usage of a graph layout library such as Graphviz [Ell+01]. This would allow CIMverter to equip the output document with proper diagram data even if the CIM topology to be converted contains no diagram data at all.


5

Modern LU Decompositions in Power Grid Simulation

With the aid of CIMverter, which was presented in Chap. 4, system models based on the ModPowerSystems (MPS) library can finally be created from up-to-date industry-standard grid models (i. e. based on the Common Information Model (CIM)). This allows scientific studies on real-world use cases, which usually have a higher complexity than simple lab examples. These studies often involve newly developed and more accurate models as well as smaller time steps for higher-resolution simulations. A possibility to accomplish more accurate simulations within the same computation time is the improvement of the numerical back-end of the utilized simulation environment. During the development of the MPS library [FEI19b] (for more on Modelica see Sect. 4.1.1) by ACS and the iTesla Power System Library (iPSL), i. a. developed by Réseau de Transport d'Électricité (RTE) [AIA19], a cooperation between RTE and ACS was established. Only a short time before, the SUNDIALS/IDA solver [Hin+05] for differential-algebraic systems of equations (DAEs) was integrated into OpenModelica to achieve a potentially higher simulation performance in case of large models with a sparse structure [Ope19a]. During its execution, IDA applies a backward differentiation formula (BDF) to the given DAE, resulting in a nonlinear algebraic system of equations that is solved by Newton iterations [HSC19]. Within each iteration, a linear system needs to be solved. For linear system solution, IDA provides several iterative and direct methods [MV11]: BLAS/LAPACK [Uni17; Uni19] implementations are supplied for

dense as well as band matrices, and KLU [DP10] as well as SuperLU_MT (a multithreaded version of the well-known SuperLU [Sup]) are supplied for sparse linear systems. In the European project PEGASE [CRS+11], KLU showed the highest overall performance of all compared LU decompositions (the others were LAPACK, UMFPACK, MUMPS, SuperLU_MT, and PARDISO), applied on linear systems (i. e. Jacobian matrices) coming from different power grid simulation scenarios. However, new LU decompositions have been developed since PEGASE: the parallelized NICSLU [CWY13] and BASKER [BRT16] for conventional shared-memory computer architectures [Roo99] and GLU for graphics processing units (GPUs) [Che+15]. This chapter provides a comparison of the mentioned LU decompositions (i. e. KLU, NICSLU, BASKER, and GLU), which were all developed especially for circuit simulation. This comprises a brief introduction to the working principles of the decompositions for an illustration of the main ideas behind them. The subsequent analysis is carried out on a set of benchmark matrices which came up during simulations with Dynaωo, an open-source simulation tool developed at RTE [Adr19]. Finally, the results are summarized and a conclusion follows. The work in this chapter has already been partially presented in [Raz+19a].

5.1 LU Decompositions in Power Grid Simulation

In many simulation environments such as OpenModelica [Fri15a], system models with algebraic and differential equations are transformed into a DAE. More on this transformation procedure from system models to DAEs is provided in Sect. 4.1.1. A numeric DAE solver computes the values of all relevant variables in the specified simulation time interval $[t_{\mathrm{start}}, t_{\mathrm{end}}]$.

5.1.1 From DAEs to LU Decompositions

Two well-known DAE solvers are DASSL [Pet82] and IDA from the open-source SUite of Nonlinear and DIfferential/ALgebraic Equation Solvers (SUNDIALS) [Hin+05]. IDA solves the initial value problem (IVP) for a DAE of the form

$$F(t, y, \dot{y}) = 0, \quad y(t_0) = y_0, \quad \dot{y}(t_0) = \dot{y}_0, \qquad (5.1)$$

where $F, y, \dot{y} \in \mathbb{R}^N$, $t$ is the independent (time) variable, $\dot{y} = \mathrm{d}y/\mathrm{d}t$, and the initial values $y_0, \dot{y}_0$ are given [HSC19].


The integration method in IDA is the so-called variable-order, variable-coefficient BDF in fixed-leading-coefficient form [BCP96] of order $q \in \{1, \ldots, 5\}$ given by the multistep formula

$$\sum_{i=0}^{q} \alpha_{n,i}\, y_{n-i} = h_n \dot{y}_n, \qquad (5.2)$$

where $y_n, \dot{y}_n$ are the computed approximations to $y(t_n)$ and $\dot{y}(t_n)$, with (time) step size $h_n = t_n - t_{n-1}$ and coefficients $\alpha_{n,i}$ determined dependent on $q$. The application of this BDF to the DAE results in the following nonlinear algebraic system to be solved at each step:

$$G(y_n) := F\!\left(t_n,\, y_n,\, h_n^{-1} \sum_{i=0}^{q} \alpha_{n,i}\, y_{n-i}\right) = 0. \qquad (5.3)$$

IDA solves Eq. (5.3) with the Newton method (or a user-defined nonlinear solver). $G(y)$, where $y := y_n$ in the $n$-th time step and $y = (y_1, \ldots, y_N)^T \in \mathbb{R}^N$, is linearized with the aid of Newton's method, by applying the Taylor series on the component $G_i$ around $y_{(m)}$, in the $m$-th Newton iteration:

$$G_i(y) = G_i(y_{(m)}) + \sum_{j=1}^{N} \frac{\partial G_i(y_{(m)})}{\partial y_j}\,\bigl(y_j - y_{j\,(m)}\bigr) + \mathcal{O}\!\left(\bigl\|y - y_{(m)}\bigr\|_2^2\right), \qquad (5.4)$$

with $i = 1, \ldots, N$, which can be shortened by using the Jacobian matrix definition [DR08]

$$J = \begin{pmatrix} \frac{\partial G_1}{\partial y_1} & \cdots & \frac{\partial G_1}{\partial y_N} \\ \vdots & & \vdots \\ \frac{\partial G_N}{\partial y_1} & \cdots & \frac{\partial G_N}{\partial y_N} \end{pmatrix} \qquad (5.5)$$

to the equation

$$G(y) = G(y_{(m)}) + J(y_{(m)})\,\bigl(y - y_{(m)}\bigr) + \mathcal{O}\!\left(\bigl\|y - y_{(m)}\bigr\|_2^2\right). \qquad (5.6)$$

Hence, neglecting the Taylor remainder (i. e. the $\mathcal{O}$-term) and setting $G(y)$ to 0 for finding the zeros, in each Newton iteration a linear system of the form

$$J\left[y_{n\,(m+1)} - y_{n\,(m)}\right] = -G(y_{n\,(m)}) \qquad (5.7)$$

needs to be solved, where $y_{n\,(m)}$ is the $m$-th approximation to $y_n$ in the $n$-th simulation time step. For solving the linear system, LU decompositions can be utilized.
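To illustrate how an LU decomposition enters Eq. (5.7), the following minimal C++ sketch performs a single Newton update on a dense system with Eigen. It only shows the structure of the iteration: in IDA the Jacobian is sparse and the linear solve is delegated to the solvers discussed in the remainder of this chapter. The callables G and J are assumptions standing for the evaluation of the residual and of the Jacobian.

// Minimal sketch (illustration only, not the IDA implementation) of one
// Newton iteration as in Eq. (5.7): solve J * dy = -G(y_(m)), y_(m+1) = y_(m) + dy.
#include <Eigen/Dense>
#include <functional>

using Vec = Eigen::VectorXd;
using Mat = Eigen::MatrixXd;

Vec newtonStep(const std::function<Vec(const Vec&)>& G,   // residual G(y)
               const std::function<Mat(const Vec&)>& J,   // Jacobian J(y)
               const Vec& y_m) {
    Mat Jm = J(y_m);                      // evaluate the Jacobian at y_(m)
    Eigen::PartialPivLU<Mat> lu(Jm);      // dense LU decomposition of J
    Vec dy = lu.solve(-G(y_m));           // solve the linear system (5.7)
    return y_m + dy;                      // next Newton iterate y_(m+1)
}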


5.1.2 LU Decompositions for Linear System Solving

LU decompositions belong to the category of direct solvers. There are various methods for different matrix types, such as, e. g., the Cholesky decomposition for Hermitian positive-definite matrices [FO08]. For the decomposition of sparse matrices, special LU decomposition algorithms are used which store just the non-zero entries of the matrices to reduce memory consumption and arithmetic operations. During factorization, a non-zero entry can arise at a position where a zero entry has been before, which is called fill-in. Therefore, LU decompositions usually perform a preordering step before the actual factorization step for fill-in reduction, leading to lower memory consumption and shorter run times during the subsequent factorization step [TW67]. In general, the problem of computing the lowest fill-in is NP-complete [Yan81]. Apart from direct solution methods, [CRS+11] also analyzed how well iterative methods perform for solving the linear systems inside the Newton iterations. The iterative Generalized Minimal Residual Algorithm (GMRES) was chosen as it is suitable for general matrices. The conclusion was that GMRES is too costly on Jacobian matrices from the area of power grids, especially when complex preconditioning methods must be applied before the solver in order to achieve a better convergence behavior. Furthermore, the Jacobian matrices are not only sparse but also generate little fill-in during the processing by the LU decompositions. Similarly, [SV01] states that large electric circuits are not easy to solve efficiently by iterative methods but that there is development potential, as not much research has been done in this area yet. In the following, the two main steps of current LU decomposition methods are introduced.

Preprocessing

Usually, the preprocessing consists of a preordering step and partial pivoting. During the preordering, permutation matrices are computed. The partial pivoting is accomplished to reduce the round-off error during the subsequent factorization. Hence, for a given linear system $Ax = b$, the final system of equations which has to be solved, after preordering and factorization with pivoting, can be represented as

$$(P A Q)\, Q^T x = P b,$$

where the row permutations as well as the partial pivoting are performed by $P$ and the column permutations by $Q$ [DP10]. The preordering methods

for fill-in reduction are usually based on one of the following heuristics: minimum degree (MD), which belongs to the greedy algorithms [Heg+01], and nested dissection (ND), which is based on graph partition computation by a divide-and-conquer approach [Geo73].

In general, nested dissection based fill-reduction algorithms are more time-consuming [Heg+01], but their results usually lead to less fill-in [KMS92]. Besides the permutations coming from fill-in reduction, some LU decomposition methods perform further permutations during preprocessing as well as matrix scaling and scheduling of the parallel factorization (if any). These are mentioned during the introduction of the particular decomposition methods.
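As an illustration of a minimum degree based preordering, the following minimal sketch (an assumption for illustration, not part of the thesis' tooling) computes a fill-reducing permutation with the AMD routine from SuiteSparse for a matrix given in compressed sparse column (CSC) form; the function name and the input arrays are placeholders, and error handling is omitted.

// Sketch: fill-reducing preordering with SuiteSparse AMD on a CSC matrix.
#include <amd.h>
#include <vector>

std::vector<int> fillReducingOrder(int n, const int* Ap, const int* Ai) {
    std::vector<int> perm(n);                    // resulting permutation
    double control[AMD_CONTROL], info[AMD_INFO];
    amd_defaults(control);                       // default AMD parameters
    amd_order(n, Ap, Ai, perm.data(), control, info);   // returns AMD_OK on success
    // perm[k] is the row/column of the original matrix that becomes the
    // k-th row/column after the preordering
    return perm;
}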

Factorization

The actual LU factorization with the factors $L$ and $U$ is performed on the previously permuted matrix $A' = PAQ$, such that $A' = LU$, and $b' = Pb$. For efficiency reasons, preorderings are not performed before each factorization. If, e. g., the values of a Jacobian change but its structure remains the same, the same permutations can be reapplied. In circuit simulation this is very often the case [CWY12].

Solving with the computed LU decomposition

Usually, LU decompositions also provide functionality for right-hand side solving, as this needs the permutations of the preordering to return correct results. Hence, for a given $A' = LU$, the solution $x$ for $Ax = b$ can be computed from the solution vector $x'$, whereby

$$A'x' = b' \;\Leftrightarrow\; Ly = b' \text{ and } Ux' = y, \quad \text{with } x = Qx' \text{ and } b = P^T b'.$$

The solving step is computationally less expensive than the two steps before, but it is repeated many times in Newton's (iterative) method. In this work, the term decomposition is used when the whole method, such as KLU, is meant, whereas factorization is used when the focus rests upon the actual factorization step of the decomposition. The considered LU decompositions for electrical circuits (NICSLU, GLU, and Basker, with KLU as reference) are compared in the following.
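To make the solving step above concrete, the following minimal C++ sketch (dense arrays for illustration only; sparse codes such as KLU store L and U in compressed form) applies the permutations and the two triangular solves. The permutation-vector convention is an assumption chosen for this illustration.

// Dense, illustration-only sketch of the solving step: given P A Q = L U,
// solve A x = b via b' = P b, L y = b', U x' = y, x = Q x'. Assumed convention:
// p[i] (q[j]) is the row (column) of A that becomes row i (column j) of P A Q.
#include <vector>

using Matrix = std::vector<std::vector<double>>;

std::vector<double> solveWithLU(const Matrix& L, const Matrix& U,
                                const std::vector<int>& p,
                                const std::vector<int>& q,
                                const std::vector<double>& b) {
    const int n = static_cast<int>(b.size());
    std::vector<double> bp(n), y(n), xp(n), x(n);
    for (int i = 0; i < n; ++i) bp[i] = b[p[i]];          // b' = P b
    for (int i = 0; i < n; ++i) {                         // forward substitution L y = b'
        double s = bp[i];
        for (int j = 0; j < i; ++j) s -= L[i][j] * y[j];
        y[i] = s / L[i][i];
    }
    for (int i = n - 1; i >= 0; --i) {                    // backward substitution U x' = y
        double s = y[i];
        for (int j = i + 1; j < n; ++j) s -= U[i][j] * xp[j];
        xp[i] = s / U[i][i];
    }
    for (int j = 0; j < n; ++j) x[q[j]] = xp[j];          // x = Q x'
    return x;
}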


5.1.3 KLU, NICSLU, GLU, and Basker by Comparison

Contrary to KLU, all newer LU decompositions (i. e. NICSLU, Basker, and GLU) are developed especially for modern computer architectures with multi-core central processing units (CPUs) or even GPUs. As single-core performance has essentially not improved since around the year 2005 [Pre12], the utilization of parallel architectures is of essential importance for a higher runtime efficiency on newer computer hardware coming with more and more CPU cores as well as more performant accelerators.

KLU

KLU is a decomposition algorithm for asymmetric sparse matrices in circuit simulation [DP10]. Besides commercial tools, such as the numerical computing environment MATLAB and the circuit simulator Xyce, KLU is integrated into IDA. Since KLU was developed with a focus on circuit matrices, it shows a high runtime efficiency in the area of power grid simulations [CRS+11]. Therefore, the OpenModelica and the Dynaωo simulation environments make use of KLU as linear solver within IDA, which is utilized as solver for the initial value problems of DAEs resulting from system models. When solving the first matrix (in a sequence), KLU performs four steps:

1. A permutation of the given matrix $A$, to be factorized into $L$ and $U$, is performed by the matrices $P$ (row) and $Q$ (column permutation) into a block triangular form (BTF):

$$P A Q = \begin{pmatrix} A_{11} & A_{12} & \cdots & A_{1n} \\ 0 & A_{22} & \ddots & \vdots \\ \vdots & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & A_{nn} \end{pmatrix}$$

The diagonal blocks are independent of each other and are therefore the only ones requiring factorization.

2. The Approximate Minimum Degree (AMD) ordering algorithm is performed block-wise on each block $A_{kk}$ for fill-in reduction [ADD04]. Fill-in is defined as a non-zero entry arising during factorization in $L$ or $U$ at a position at which $A$ has a zero entry. Fill-in reduction is a crucial step in sparse matrix factorizations, as new non-zero entries in sparse matrices require memory space (zero entries need no space). This leads to a higher memory consumption and, during further processing, to more memory accesses which can be very


time-costly, decreasing the performance of the whole factorization significantly (esp. on modern processors because of the memory wall [ECT17]). Therefore, KLU is optimized for fill-in reduction of circuit matrices. Alternatively to AMD, the Column Approximate Minimum Degree (COLAMD) ordering algorithm [Dav+04], orderings provided via CHOLMOD such as nested dissection based on METIS (an unstructured graph partitioning and sparse matrix ordering algorithm [KK95]), as well as a user-defined permutation can be chosen for each block.

3. Each $A_{kk}$ is scaled and symbolically as well as numerically factorized using KLU's implementation of Gilbert/Peierls' (GP) left-looking algorithm. The scaling of the block matrices (i. e. achieving matrix entries with comparable magnitudes) is a pre-step for the pivoting, which is performed on each $A_{kk}$ since the factorization method is also applied block-wise, and leads to a higher numerical stability.

4. Optional: The whole system is solved with the resulting factorization using block back substitution.

In case of subsequent factorizations of matrices with the same non-zero pattern, the first two steps are omitted and, in the third step, a simplified left-looking method does not perform the partial pivoting. This is therefore called a refactorization step; it allows the omission of a depth-first search within the GP algorithm, leading to a higher performance. The first two steps form the preordering. A parallelization approach was mentioned in [Abu+18] but without any implementation details. However, the official KLU version is not parallelized.
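The following sketch illustrates this workflow with the typical call sequence of the KLU C API from SuiteSparse. It is not taken from the thesis' measurement environment; the CSC arrays Ap, Ai, Ax, the updated values Ax_new, and the right-hand side b are placeholders supplied by the application, and error handling is omitted.

// Sketch of the typical KLU call sequence (SuiteSparse C API).
#include <klu.h>

void factorRefactorSolve(int n, int* Ap, int* Ai, double* Ax,
                         double* Ax_new, double* b) {
    klu_common common;
    klu_defaults(&common);                                   // defaults: BTF + AMD

    // Steps 1+2 (preordering / symbolic analysis)
    klu_symbolic* symbolic = klu_analyze(n, Ap, Ai, &common);
    // Step 3 (numerical factorization with partial pivoting)
    klu_numeric* numeric = klu_factor(Ap, Ai, Ax, symbolic, &common);
    // Step 4 (optional solve); b is overwritten with the solution
    klu_solve(symbolic, numeric, n, 1, b, &common);

    // Subsequent matrix with identical non-zero pattern but new values:
    // the refactorization reuses the preordering and the pivot order.
    klu_refactor(Ap, Ai, Ax_new, symbolic, numeric, &common);
    klu_solve(symbolic, numeric, n, 1, b, &common);

    klu_free_numeric(&numeric, &common);
    klu_free_symbolic(&symbolic, &common);
}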

NICSLU

NICSLU is a shared-memory parallelized [Roo99] LU decomposition [CWY13]. Nevertheless, some steps performed by NICSLU are similar to the ones of KLU:

1. Instead of BTF, the MC64 algorithm is utilized for finding a permutation and diagonal scaling for sparse matrices. Putting large entries on the diagonal can make the subsequent pivoting numerically more stable.

2. As opposed to KLU, the AMD algorithm for fill-in reduction is not applied on each diagonal block but on the whole matrix.

3. This step determines whether the subsequent factorization shall be performed (in 4.1.) sequentially or (in 4.2.) in parallel (i. e. with multiple threads, e. g., on several CPU cores).


4.1. The sequential factorization is based on the left-looking GP algorithm, performing a symbolic factorization, a numeric factorization, and partial pivoting.

4.2. The parallel factorization was developed based on the left-looking GP and KLU algorithm [CWY13].

5. Optional: The whole system is solved with the resulting factorization using classical right-hand side solving.

Analogous to KLU, the first two steps make up the preordering phase and, together with step 3, the whole preprocessing. In [CWY13] the authors present a benchmark, i. a., of NICSLU vs. KLU on 23 circuit matrices, with NICSLU showing speedups of 2.11 to 8.38 in geometric mean when executed with 1 to 8 parallel computing threads. These parallel speedups were one reason for the choice of NICSLU in the later presented comparative analysis of modern LU decompositions.

GLU

GLU is also a parallelized LU decomposition, but for CUDA-enabled GPUs [Che+15]. As it was also developed for circuit matrices, it has similar steps to KLU and NICSLU:

1. MC64 is performed as in NICSLU.

2. AMD is performed as in NICSLU (i. e. on the whole matrix).

3. A symbolic factorization, with 0 and 1 as the only entries for zero and non-zero values, is performed to determine the structure of L and U as well as a grouping of independent columns into so-called column-levels.

4. A hybrid right-looking LU factorization (instead of left-looking as in GP) is performed which benefits from the column-level concurrency and the symbolic factorization.

The first three (preprocessing) steps are executed on the CPU and only step 4 on the GPU. Experimental results were presented in [Che+15], including, e. g., speedups of 19.56 over KLU on the set of typical circuit matrices from the University of Florida.

Basker

Basker is the newest of the four LU decompositions and, like NICSLU, also shared-memory parallelized, but from an algorithmic point of view mostly

similar to KLU [BRT16]. It was developed as an alternative to KLU for circuit simulation, applying a two-level parallelism: between blocks and within blocks, as described below:

1. As in KLU, BTF is performed (it can be disabled). The resulting matrix has large and small diagonal blocks.

2.1. The small diagonal blocks can be factorized in parallel, by a so-called Fine Block Triangular Structure, as they do not depend on each other. Hereby, a) each small diagonal block is symbolically factorized in parallel and afterwards b) a parallel loop over all these small blocks applies the sequential GP algorithm on each of them.

2.2. The large diagonal blocks, in contrast, could be too large to be factorized by sequential GP, as this could dominate the complete LU decomposition time. Therefore, large blocks a) are reordered by ND and b) the ND structure is mapped to threads by using a task dependency graph, which is transformed into a task dependency tree representing level sets that can be executed in parallel. After that, c) the parallel ND Symbolic Factorization and d) the parallel ND Numeric Factorization are performed.

In [BRT16] a geometric mean speedup of 5.91 over KLU is stated for a CPU-based system with 16 cores.

5.2 Analysis of Modern LU Decompositions for Electrical Circuits

For the comparative analysis of the new LU decompositions with KLU as reference, they have been integrated into a measurement environment with drivers for a set of benchmark matrices, in order to evaluate which of them could be integrated into a power grid simulation environment for further analyses. In this work, the results presented in [Raz+19a] were extended by further analyses, especially on Basker.


5.2.1 Analysis on Benchmark Matrices from Large-Scale Grids

For an equal measurement of all LU decomposition methods, a measurement environment was developed in C++, which also helped with the integration of promising methods into proper simulation environments. The driver executes each decomposition and measures the wall clock time of each relevant processing step.
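A minimal sketch of how such step-wise wall clock measurements can be taken with std::chrono is shown below; wrapper, preprocess, factorize, and solve are hypothetical names standing for the respective library calls of one decomposition.

// Sketch of step-wise wall clock measurement; the wrapper and its member
// functions are placeholders for calls into KLU, NICSLU, GLU, or Basker.
#include <chrono>

template <typename F>
double wallClockSeconds(F&& step) {
    const auto start = std::chrono::steady_clock::now();
    step();                                               // run one processing step
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Possible usage within the driver (names are assumptions):
//   double tPre  = wallClockSeconds([&] { wrapper.preprocess(matrix); });
//   double tFact = wallClockSeconds([&] { wrapper.factorize(matrix);  });
//   double tSol  = wallClockSeconds([&] { wrapper.solve(rhs);         });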

Benchmark Matrices

For an analysis of the correctness and performance of the LU decompositions, a benchmark comprising seven matrices has been developed. The matrices have been extracted from Dynaωo static phasor simulations of real test cases conducted at RTE, spanning from a regional portion of the grid to a test case representing a merge of the networks of different countries, with high voltage (HV) and extra high voltage (EHV) parts. The modeling choices are the same for all scenarios (except the load models): synchronous machines with their control for classical generation units, standard static var compensators, controllers as well as special protection schemes (tap and phase shifter, current limit controller, voltage controller, etc.), primary frequency control, and primary as well as secondary voltage regulations. Loads are modeled either as first-order restorative loads, denoted as simplified loads (SLs), or as voltage dependent loads (VDLs) behind one or two transformers. Both models are used at RTE depending on the study scope and are thus of practical relevance. Tab. 5.1 presents all benchmark matrices provided by RTE with some information about their origin and their characteristics. Moreover, Fig. 5.1 depicts the matrix sparsity patterns which are typical for power grid matrices. Usually they

No.  Power Grid                        K        N      NNZ    d [%]
(1)  French EHV with SL                2000    26432    92718  0.013
(2)  French EHV with VDL               2000    60236   188666  0.0051
(3)  F. + one neighbor EHV, SL         3000    47900   205663  0.0089
(4)  F. + one neighbor EHV, VDL        3000    75300   266958  0.0047
(5)  F. + neighb. countries EHV, SL    7500    70434   267116  0.0054
(6)  F. EHV + regional HV, VDL         4000   197288   586745  0.0015
(7)  F. + neighb. countries EHV, VDL   7500   220828   693442  0.0014

Table 5.1: Characteristics of the square matrices of size N × N, with K nodes, sorted by the number of nonzeros NNZ, and with density factor d = NNZ/(N · N) in %

show a very low density factor (i. e. share of non-zero elements), mainly concentrated around the diagonal. In all the shown matrices, the upper left part corresponds to the network part. It is followed by a lot of little blocks around the diagonal: the injection models (generators, loads, etc.), which are modeled using only one interface to the network (current and voltage). Finally, the columns in the right part of the matrix, containing non-zero elements, result from the system-wide controls such as calculations of the system frequency that are related to all generators. The density factor is higher with SL models than with VDL models, as VDLs have many more variables which are mainly linked together but not with outer variables (except through a single network interface). More information on the (a-)symmetry of circuit matrices can be found in [DP10].
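As a quick consistency check of the density definition in Tab. 5.1, matrix no. (1) yields

$$d = \frac{\mathrm{NNZ}}{N \cdot N} = \frac{92\,718}{26\,432^2} \approx 1.3 \cdot 10^{-4} = 0.013\,\%,$$

which matches the tabulated value.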

Measurement Environment

The following execution time measurements were performed on a server with 2 sockets, each with an Intel Xeon E5-2643v4 3.4 GHz (3.7 GHz Turbo) 6-core CPU with Hyper-Threading (HT); 32 GB DDR4 main memory; NVIDIA TESLA P40, GP102 Pascal, 24 GB GDDR5; running an x86_64 Ubuntu 16.04 Server Linux with kernel a) 4.13.0-46-generic for general measurements, b) 4.11.5-RT (with enabled PREEMPT_RT [Lin19b]) for real-time (RT) kernel measurements, and c) 4.13.16-custom for GLU measurements with NVIDIA driver x86_64-396.44 and CUDA 9.2. The versions of the LU decompositions and compilers are: KLU v1.3.8 with gcc-7.2.0, NICSLU v3.0.1 with clang-4.0.1-6, GLU v2.0 with g++-7.2.0, all built with compiler optimization level 2 as this leads to the highest performance. All measured times are wall clock times.



Figure 5.1: Sparsity patterns of benchmark matrices


Complete Decomposition

The total execution times (i. e. preprocessing and factorization) for a complete decomposition of the benchmark matrices are plotted in Fig. 5.2. For almost all matrices, Basker is the most time-consuming method, followed by NICSLU, which on some matrices is nearly as performant as KLU. Only in case of matrix no. 3 does Basker show a better performance than NICSLU. As pictured in Fig. 5.1, this matrix has many relatively big blocks on its diagonal. While the times of all CPU-based implementations are below ca. 1 s, the GLU times are in most cases around 10 times higher. The main reason is the preprocessing time, as can be seen in the next plots.


(1) Basker, KLU, and NICSLU (2) GLU

Figure 5.2: Total (preprocessing+factorization) times

Preprocessing

The preprocessing times of KLU, as shown in Fig. 5.3, are the lowest in all cases. The main reason for this is the application of AMD on (smaller) submatrices instead of the whole matrix. In case of Basker, not only the whole runtime but also the preprocessing of matrix no. 3 is performed relatively quickly in comparison to the other matrices. For GLU it can be seen that the preprocessing occupies most of the total decomposition time, which is due to the symbolic factorization step performed on the CPU.



(1) Basker, KLU, and NICSLU (2) GLU

Figure 5.3: Preprocessing times

Factorization

The factorization times of Basker shown in Fig. 5.4 are also higher than those of KLU and NICSLU. The times which NICSLU needs for factorization are mostly equal to or lower than those of KLU, especially for matrix no. 6, which is one of the larger matrices and has a quite big dense block in its upper left corner as depicted in Fig. 5.1. The factorizations performed by GLU on the TESLA device need only a fraction of the total decomposition time but are still around 10 times slower than KLU and NICSLU on the CPU.


(1) Basker, KLU, and NICSLU (2) GLU

Figure 5.4: Factorization times

Complete Decomposition and Preprocessing on RT kernel

In Fig. 5.5 the execution times for the most promising decomposition methods, KLU and NICSLU, are compared between the generic and the


RT kernel. KLU needs considerably more time on the RT than on the generic kernel. In case of NICSLU there are only small differences between the kernels. As a consequence, compared to NICSLU, the total times of KLU are always lower with the generic kernel and often higher with the RT kernel. At this point it is important to notice that a real-time optimized system resp. kernel does not need to run faster than a generic one. Instead, the goal is that it runs deterministically within well-specified time constraints. The pure preprocessing times of KLU, as shown in Fig. 5.5, are again the lowest in all cases, for the same reason as before (the application of AMD on smaller submatrices instead of the whole matrix). Again, the run-times of KLU on the different kernels differ more than the ones of NICSLU.


(1) Total (preprocessing+factorization) (2) Preprocessing

Figure 5.5: Execution times on generic vs. RT kernel

Refactorization vs. Factorization

Since neither Basker nor GLU currently supports refactorization, these execution times were only measured for KLU and NICSLU on the generic and the RT kernel, as depicted in Fig. 5.6. For both methods the time for refactorizations is much lower than for factorizations. NICSLU performs the refactorizations much faster than KLU. On the RT kernel, most NICSLU factorizations of the provided matrices are even faster than KLU refactorizations.



(1) Generic kernel (2) RT kernel

Figure 5.6: (Re-)factorization times

Parallel Shared-Memory Processing of Basker and NICSLU

The CPU-based LU decompositions in the previously presented measurements were executed sequentially. As no official parallelized version of KLU was available to the authors, only the parallel processing of Basker and NICSLU is considered in the following. The parallel processing of NICSLU is shown in Fig. 5.7. The total execution times with multiple threads are always higher than with one single thread. This cannot be caused by the turbo mode (i. e. a higher CPU clock rate) only, as the times with two threads are also higher. Even the pure factorization times with multiple threads are higher than with a single thread. Obviously, the parallelization of NICSLU does not scale for the benchmark matrices. The reason for the low performance with 16 threads is the total number of 12 physical processors (i. e. 24 logical processors with HT), leading the operating system scheduler to switch between running and waiting threads more often. The accompanying context switching causes longer execution times. The parallel processing of Basker is shown in Fig. 5.8. Contrary to NICSLU, the factorization performed by Basker can scale well with multiple threads, e. g. for matrix no. 6. But still, Basker is not faster than the sequential KLU, even with a higher number of threads. Since Basker is in an alpha stage, it can only handle thread counts that are powers of two. This is why no more than 8 truly independently executed threads could have been started on the 12-core system; moreover, software-technical issues of Basker with some matrices led to the limit of 4 threads in the measurements.



(1) Total (preprocessing+factorization) (2) Factorization

Figure 5.7: NICSLU's scaling over multiple threads (T)


(1) Total (preprocessing+factorization) (2) Factorization

Figure 5.8: Basker’s scaling over multiple threads (T)

Alternative Preordering Methods

For a performance analysis with different preordering methods (AMD, METIS, and COLAMD), we integrated METIS and COLAMD into NICSLU. In case of METIS, the total execution times for the LU decompositions, as depicted in Fig. 5.9, are significantly higher than in case of AMD and COLAMD. The reason is the long execution times of METIS itself. This can be derived from Fig. 5.10, as the factorization times after METIS preorderings are comparable to the factorization times after AMD and COLAMD preorderings. On the generic kernel, KLU benefits from AMD. On the RT kernel it benefits from COLAMD, but in case of pure factorizations it can benefit from METIS as well. The NICSLU factorization times, in case of AMD


(1) KLU on generic kernel (2) NICSLU on generic kernel

(3) KLU on RT kernel (4) NICSLU on RT kernel

Figure 5.9: Total times with different preorderings

and COLAMD, are in all cases very close together and lower than after METIS preorderings.

5.2.2 Analysis on Power Grid Simulations

The benchmarks presented in this subsection were performed by RTE, and parts of the text were authored by the co-authors from RTE of the publication [Raz+19a]. Because of its low performance in comparison to the other LU decompositions, GLU was not selected for the integration into simulation environments. Basker, however, was integrated into OpenModelica but, because of its early development stage, it was not mature enough for adequate simulation benchmarks as it generates errors at certain system sizes. Due to NICSLU's relatively high performance on the benchmark matrices, it was integrated into the IDA versions used by OpenModelica and Dynaωo. Moreover, due to positive performance results in parallel mode,


(1) KLU on generic kernel (2) NICSLU on generic kernel

(3) KLU on RT kernel (4) NICSLU on RT kernel

Figure 5.10: Factorization times with different preorderings

Basker was integrated into the IDA version of OpenModelica for testing. This needed more effort, as Basker is in a too early development stage (e. g. it returns errors for certain matrices), which is also why it was not integrated into Dynaωo. As a result, simulations were performed with NICSLU in Dynaωo [Gui+18], which contains two solvers utilizing SUNDIALS. Three test cases have been selected to measure the performance of both LU decompositions with the aforementioned two solvers, which will be introduced later on:

(1) French EHV network with SL models

(2) French EHV network with VDL models

(3) French EHV/HV network with VDL models

Measurement Environment

For each test case, the simulation lasts for 200 s with a line disconnection at t = 100 s and is done on a machine with an Intel Core i7-6820HQ 2.7 GHz


(3.6 GHz Turbo) 4-core CPU with HT; 62 GB DDR4 main memory; running on Fedora Linux with kernel 4.13.16-100.fc25.x86_64. All measured times are wall clock times.

Dynaωo’s Fixed Time Step Solver The first of the two solvers available in Dynaωo is a fixed time step solver, inherited from PEGASE [Fab+11; FC09] and specifically designed for a fast long-term voltage stability simulation. It applies an first-order Euler method using a Newton-Raphson (NR) approximation for resolving the nonlinear system at each time step (with KINSOL, a NR based method available in SUNDIALS). In this approach, the LU decomposition for the Jacobian is computed as few times as possible. Tab. 5.2 shows that only a few milliseconds are spent in the LU de- composition and the Jacobian evaluation. Moreover, most of the time elapses for the residual evaluation. It is important to notice that the LU decompositions are performed only when there are major changes of the grid.

Case   KLU           NICSLU        Eval. J_F     Eval. F
no.    [s]     C     [s]     C     [s]     C     [s]     C
(1)    0.095   3     0.071   3     0.11    3     2.01    561
(2)    0.215   4     0.215   4     0.46    4     5.96    617
(3)    0.847   13    0.790   13    1.61    13    9.41    767

Table 5.2: Total execution times and numbers C of calls of the corresponding routines within the fixed time step solver, with Jacobian J_F and residual function vector F

Dynaωo’s Variable Time Step Solver The second solver available in Dynaωo is a variable time step solver based on SUNDIALS/IDA plus additional routines to deal with algebraic mode changes due to topology modifications of the grid. Jacobian evaluations and LU decompositions occur much more often than with the fixed time step solver. Table 5.3 presents the results with the variable time step solver. They confirm the trends observed with the individual matrices, i. e. the preorder- ing step takes more time with NICSLU than KLU but these extra costs are offset by a substantial reduction on the factorization and refactorization steps. Usually, there should be mainly refactorizations. Factorizations

should appear only when there is a change in the matrix structure (that corresponds either to a change in the grid topology or a deep change in the form of the injection equations). Keeping this point in mind, it should be possible to gain time with NICSLU on complete simulation times compared to KLU (26.67 s vs. 34.56 s in case (3)). This gain remains minimal at the moment compared to the overall numerical resolution time (36 s in case (1), 102 s in (2), and 266 s in (3)), but if improvements are also achieved on the other elementary tasks, it could help make the difference in the long term.

5.3 Conclusion and Outlook

This chapter presents the most promising recently developed LU decomposition methods (Basker, NICSLU, and GLU) for electric circuit simulation that have been found in current literature. After a short introduction of the main ideas behind the methods, a comparative analysis with KLU (as the reference LU decomposition for power grids) was conducted on benchmark matrices from large-scale phasor time-domain simulation. Through the stable integration of NICSLU into OpenModelica's and Dynaωo's IDA, it can already be used in productive environments. The immature Basker implementation, however, can be software-technically improved and tested within the OpenModelica environment, where it was integrated, to gain better runtime stability. The analysis shows that KLU and NICSLU achieve a similar performance for total execution times on the benchmark matrices, while Basker's performance, especially in single-threaded mode, is lower. However, Basker can achieve speedups of the factorization when running parallel threads.

Case   Preord. [s]   Fact. [s]   Refact. [s]   Sum [s]   D     f      Method
(1)    2.42          2.58        2.85          7.85      461   0.33   KLU
       2.74          0.88        0.72          4.34      461   0.33   NICSLU
(2)    4.98          2.81        2.72          10.51     466   0.34   KLU
       6.28          1.59        1.22          9.09      466   0.34   NICSLU
(3)    15.01         10.79       8.76          34.56     899   0.42   KLU
       18.96         4.87        2.84          26.67     899   0.42   NICSLU

Table 5.3: Accumulated execution times for the listed steps of the variable time step solver, with D LU decompositions and a factorization ratio f = #Fact. / #Refact.


Moreover, Basker’s speedup behaves superlinear in a subset of the bench- mark matrices. Superlinear speedup often occurs due to hardware features regarding CPU caches [Ris+16]. It can be caused by less data amount per thread which is fitting better into caches. Basker’s developers indeed mentioned that for a larger number of threads, the ND-tree may provide smaller cache-friendly submatrices [BRT16]. Since Basker’s implementa- tion is in an alpha state, one could achieve possibly better results with further development. For instance, it is dependend on the Trilinos library [Tri], especially on the parallel execution framework Kokkos. An individual parallelization of Basker, however, could result in a higher performance. GLU, despite its massive parallelization for GPUs, in the presented analy- sis cannot compete with current CPU-based implementations as it showed a much lower performance in all cases. The preprocessing of NICSLU is usually slower than of KLU but espe- cially refactorizations are performed faster. Such as other shared-memory parallelized LU decompositions for sparse systems in many cases, NICSLU cannot make use of multiple CPU cores. This is a problem since CPU clock speeds are not increasing anymore and the performance of processors nowadays is mainly increased by more CPU cores. Executed on an RT kernel, NICSLU has shown a better performance than KLU but there is more causal investigation needed. Both, KLU and now also NICSLU, can benefit from different preorderings. Regarding complete simulations, NICSLU can provide improvements compared to KLU, benefiting from its different refactorization step, which is more common during simulations than a complete factorization step. The analysis of the unitary LU decompositions opens new perspectives for the generic numerical schemes and the choices made to improve the performance of power grid simulation solvers as well as other power grid related software that can make use of LU decompositions. Furthermore, the integration of a performant LU decomposition (esp. into the widely-used SUNDIALS library) allows the simulation environment users to switch between different solvers not just for a better runtime performance under different circumstances – e. g. offline vs. real-time simulations – but also for a different numerical behavior. This can lead to better results in case of possible numerical instabilities but also to an alternative in case of a solver issue.

6

Exploiting Parallelism in Power Grid Simulation

Besides runtime improvements through the application of numerical methods, such as LU decompositions that are better suited in general or in the special case of power grid simulation, proper methods from the area of high-performance computing (HPC) can also be applied to the respective simulation software. One such software recently developed at the Institute for Automation of Complex Power Systems (ACS) is the Dynamic Phasor Real-Time Simulator (DPsim), which introduces the dynamic phasor (DP) approach to real-time (RT) power grid simulation, as larger simulation steps are possible without losing accuracy [Mir+19]. This leads to a smaller impact of communication delays, e. g., between geographically distributed simulators running in different laboratories with special Hardware-in-the-Loop (HiL) setups. A reason for the coupling into one RT co-simulation could be the lack of needed resources (e. g. hardware, software, know-how, location, etc.) to run a complete HiL simulation in just one laboratory [Mir+17]. DPsim uses several external software libraries, which include the VILLAS framework for the communication with other real-time simulators, control/monitoring software as well as hardware, and so forth. Grid data in a Common Information Model (CIM)-based format is read using the libcimpp library of the CIM++ project as introduced in Chap. 3. Furthermore, multiple numerical libraries are used as there are several solvers implemented in DPsim, such as a modified nodal analysis (MNA) based solver which utilizes LU factorizations on dense and sparse matrices of


Eigen [Eig19]. Also the SUite of Nonlinear and DIfferential/ALgebraic Equation Solvers (SUNDIALS) library is used as backend of DPsim's ordinary differential equation (ODE) solver. To benefit from modern shared-memory multi-core systems, the computations within one time step are partitioned into multiple tasks, as defined by the parts of the simulation such as the utilization of different solvers and interfaces (e. g. an interface for real-time data exchange, a user interface for monitoring, and so forth). At this point, it should be noted that different solvers can be utilized within a single time step as this depends on the components of the power grid model. Because of data dependencies between these tasks, they cannot all run in parallel as this would lead to data races with wrong results [Qui03]. Therefore, a task dependency analysis is performed to achieve a data-race-free parallel task execution. This chapter gives an overview of multiple kinds of model parallelization approaches on different abstraction resp. implementation levels that have been implemented, exemplarily, in the OpenModelica simulation environment. It also introduces schedulers for parallel task execution and describes how they are implemented in DPsim in combination with the task dependency analysis. This is followed by a runtime analysis of DPsim with the implemented approach on various power grids of different sizes. The chapter concludes with a discussion on the advantages and disadvantages of the parallel execution as well as of the utilized schedulers. This chapter presents outcomes of the supervised thesis [Rei19].

6.1 Parallelism in Simulation Models

Chapter 5 i. a. dealt with approaches where parallelism within numerical methods (LU decompositions in the present case) is used for shorter execution times on multi-core architectures. Instead of using the parallelism of numerical solvers, it is also possible to exploit the inherent parallelism of the model as such. The inherent parallelism of a model can either be expressed by its developer or be automatically recognized. Without any claim to completeness, [Lun+09] describes the first three of the following types of approaches for exploiting parallelism in mathematical models:

1. Explicit Parallel Programming

This type concerns approaches where parallel constructs are expressed in the programming language of the mathematical model itself. For example, ParModelica [Geb+12] is an extension to the modeling language Modelica which allows the user to express parallelism in algorithm sections (i. e.

in imperatively programmed parts of a model instead of declarative parts as expressed by equation sections). In this approach, the developer of the model is responsible for its (correct) parallelization. For this purpose, ParModelica provides parallel variables (allocated in different memory spaces) as well as functions, parfor loops, and kernel functions which are executed on OpenCL devices (e. g. graphics processing units (GPUs)) as part of so-called heterogeneous computer systems.

2. Explicit Parallelization Using Computational Components

Another type of explicit parallelization is achieved by structuring the model into computational components using strongly typed communication interfaces. For this, the architectural language properties of Modelica, supporting components and strongly typed connectors, are generalized to distributed components and connectors. An example for this approach is Transmission Line Modeling (TLM), where the physical model is distributed among numerically isolated components [Sjö+10]. Hence, the equations of each submodel can be solved independently and thus in parallel. This kind of explicit parallelization is implemented in DPsim by System Decoupling in form of two different methods: the Decoupled Line Model and Diakoptics, as presented in [Mir20].

3. Automatic Fine-Grained Parallelization of Mathematical Models

Besides the explicit expression of parallelism, it is also possible to extract parallelism from the high-level mathematical model or from the numerical methods used for solving the problem. The parallelism exploitation from mathematical models is categorized into the following subtypes:

Parallelism over time: for example in case of discrete event simulations where certain events are independent from other events and therefore can be handled in parallel;

Parallelism of the system: this means that the modeled system (i. e. the model equations) is parallelized. Much research has been done on automatic parallelization, especially on equation-level methods [Aro06; Cas13; Wal+14].

Similarly to the fine-grained approach, the following new (4.) approach type was introduced.


(4.) Automatic Coarse-Grained Parallelization of Mathematical Models

Rather than exploiting the parallelism at equation level, it is also possible to consider it at component level. This new methodology was implemented in DPsim by splitting one simulation step into separate tasks, whereby every component in the power grid model declares a list of tasks that have to be processed in each simulation step. The approach is presented in the following.

6.1.1 Task Scheduling

This chapter deals with the scheduling of tasks, i. e. parts of a solution procedure, which can be performed by multiple threads that are spawned on a multiprocessor system by the process' main thread. It does not deal with operating system schedulers for processes running on a single- or multiprocessor system [Tan09]. The term multiprocessor system refers to logical central processing units (CPUs) and therefore includes systems with a single physical CPU and multiple cores as well as systems with multiple physical CPUs and multiple cores. As the simulated models are small enough to fit into the main memory of current workstations and servers, only shared-memory parallel programming is considered. Therefore, multiple threads sharing the same memory regions can be used instead of multiple processes running in parallel on multiple interconnected processors with distributed memory. The obstacle in case of shared-memory parallelization is that multiple threads could access the same data concurrently, which can lead to so-called race conditions [Roo99] causing wrong results. Therefore, a synchronization between the parallel running threads must be performed and the execution order of program statements that depend on each other must be kept. For example, if a value is calculated in a program statement S1 and used in S2 as input value, then statement S1 must be executed before S2. Statement S2 depends on S1 and, therefore, both statements cannot be executed in parallel. Dependency analyses on statements have long been a subject of research [WB87] but can also be performed on procedures or tasks: what applies to single statements equally applies to a group of statements (i. e. a task). The scheduling of tasks to a set of processors is divided in [KA99] into different categories, as pictured in Fig. 6.1. As the considered tasks depend on each other, a scheduling variant from the scheduling and mapping category must be chosen, with the two subcategories dynamic scheduling and static scheduling. Dynamic scheduling is chosen when there is not enough a priori information about the tasks' processing durations available before

their processing. Static scheduling, in contrast, can be used in case enough a priori information is given, which can be used for a mostly efficient scheduling. Static scheduling can again be divided into approaches based on task interaction graphs and on task precedence graphs. Task interaction graphs can be used when loosely coupled communicating processes need to be scheduled, which can be the case on a distributed (memory) system. As this does not apply to the intended shared-memory parallelization, static scheduling based on task precedence graphs (in the following called task graphs) was chosen. The task processing times of a time step can be exploited for the next steps as long as the mathematical structure of the grid model resp. the control flow within the tasks does not change too much. A reason for a change of the control flow within a task could be a component switching its behavior from one simulation step to another. A switching between components (e. g. by a breaker), in contrast, could change the data flow resp. the dependencies between tasks and therefore require an updated task graph. In the following, some formal definitions for the used terms are introduced.

Basic Terms

At this point, a task can be considered as a sequence of program statements that are executed by a processor sequentially.

Definition 6.1 (Dependency and task graph) Given a set T = {T1, ..., Tn} of tasks, which is the set of nodes of the belonging task graph, an edge (Ti, Tj) ∈ E ⊆ T × T, with i, j ∈ {1, ..., n},

[Figure content: Parallel Program Scheduling splits into Job Scheduling (independent tasks) and Scheduling and Mapping (multiple interacting tasks); the latter splits into Dynamic Scheduling and Static Scheduling; Static Scheduling splits into Task Interaction Graph and Task Precedence Graph]

Figure 6.1: Categories of parallel task scheduling


expresses a data dependency of Tj on Ti, requiring that Ti must be performed before Tj, also denoted as Ti ≺ Tj. The resulting directed acyclic graph (DAG) G = (T, E) is called the task graph.

Definition 6.2 (Task types, weight, and length) Given a task graph G = (T = {T1, ..., Tn}, E ⊆ T × T),

• a task V ∈ T without incoming edges, i. e., for which ∀U ∈ T : (U, V) ∉ E holds, is called an entry task;

• a task V ∈ T without outgoing edges, i. e., for which ∀W ∈ T : (V, W) ∉ E holds, is called an exit task;

• the weight (i. e. execution time) of a task V ∈ T is given by w(V ), with the weight function w : T → N;

• the length $l_p$ of a path $p = T_{i_1} \prec \ldots \prec T_{i_k}$, with $k \in \mathbb{N}$ tasks, is defined as the sum of the weights of its tasks, i. e., $l_p = \sum_{j=1}^{k} w(T_{i_j})$.

In case of a distributed-memory system, communication costs could also be taken into account as edge weights (e. g. because of message passing between the computing nodes), but they are neglected for the intended shared-memory parallelization. An example task graph is given in Fig. 6.2, where the weight of each task is given in parentheses beside the task identifier. In the following, it is shown how task graphs can be utilized to distribute the tasks among multiple processors in an optimal way regarding their total processing time.

[Figure content: tasks T1(1), T2(2), T3(1), T4(3), T5(1), T6(1) with their dependency edges]

Figure 6.2: Example task graph


General Scheduling Problem for Parallel Processing

A task schedule must provide the start time for each task. This can be formalized on the basis of [Ull75] as follows.

Definition 6.3 (Schedule function, optimal schedule) Given a set of tasks T = {T1, ..., Tn} to be executed on a system with p ∈ N processors, a schedule function f : T → N0 which specifies the start time for each task is sought, for which the following restrictions hold:

• a task Tj that depends on Ti may not start before Ti, i. e., ∀Ti,Tj ∈ T : If Ti ≺ Tj , then f(Ti) + w(Ti) ≤ f(Tj );

• at each time point, at most p tasks are processed concurrently, i. e., ∀t ∈ N0 : |{V ∈ T |f(V ) ≤ t < f(V ) + w(V )}| ≤ p.

A schedule specified by the schedule function fopt is an optimal schedule iff. the total execution time is minimal under the restrictions above, i. e.,

$$\max_i\{f_{\mathrm{opt}}(T_i) + w(T_i)\} = \min_{f}\,\max_i\{f(T_i) + w(T_i)\}.$$

The problem of finding an optimal schedule in case of p = 2 and a single execution time t_const = w(V) for all V ∈ T can be solved deterministically in polynomial time, but for p > 2 it is generally NP-complete [Ull75]. Therefore, instead of trying to find an optimal schedule, heuristic algorithms are applied. Two classes of such heuristic schedulers are presented in the following.

Level Scheduling

A level scheduling based approach for equation-based parallelization of Modelica was implemented in OpenModelica. At the beginning, all entry tasks are assigned to the first level as they do not depend on each other. All tasks that depend on tasks in the first level only are assigned to the second level, and so forth.

Definition 6.4 (Level scheduling) Given a task V ∈ T = {T1, ..., Tn} and the set of predecessors P_V = {U ∈ T | U ≺ V}, the level function l : T → N returns the level of the task V according to the recursive definition

$$l(V) = \begin{cases} 0, & \text{if } P_V = \emptyset \\ 1 + \max\{l(S) \mid S \in P_V\}, & \text{otherwise.} \end{cases}$$

As the tasks within a certain level are independent from each other, they can be executed in any order or in parallel. In the simplest form, the tasks


[Figure content: Level 0: T1(1), T2(2), T3(1); Level 1: T4(3), T5(1); Level 2: T6(1)]

Figure 6.3: Example task graph including levels

within a level are therefore distributed among the available processors without regard to their execution times. If the integer division of n by p returns a remainder, the remaining tasks are arbitrarily distributed among the processors, which causes certain processors to execute one task more than the others. Fig. 6.3 shows exemplarily how levels could be assigned to the tasks in Fig. 6.2. Derived from this level assignment, a final schedule for p = 2 processors is illustrated in Fig. 6.4. In case of level scheduling, the synchronization (typically of threads on a shared-memory system) confines itself to barriers [Cha+08] between the executions of the levels. This leads to a simple implementation and low synchronization costs. But it could be improved by an enhanced assignment of the tasks within a level to the processors to minimize the execution time of each level. However, this corresponds to the NP-complete problem of multi-way number partitioning, where a given set of integers needs to be divided into a collection of subsets so that the sums of the numbers in each subset are as nearly equal as possible [Kor09]. A famous greedy heuristic [Cor+01] for solving this problem is to sort the numbers (here: w(Ti), with i = 1, ..., n) in decreasing order and assign each one to the subset (here: processor) with the smallest sum so far. Since the partial order ≺,

P1: T1, T2, T4, T6
P2: T3, T5

Figure 6.4: Schedule for the task graph in Fig. 6.2 with p = 2 using level scheduling

restricted to the tasks within a level, is empty (as the tasks within a level are independent), the ratio between the execution time resulting from the greedy heuristic and the optimal execution time is bounded by 4/3 − 1/(3p) [Gra69]. This can be an acceptable value in many cases, but it must be kept in mind that the division of tasks into levels is generally not optimal. With the aid of the greedy heuristic, the two smaller tasks T1 and T3 in level 0 of the example shown in Fig. 6.3 are assigned to the first processor and task T2 to the second one (see Fig. 6.5), resulting in a shorter execution time of level 0 than before (see Fig. 6.4). The total execution time of all levels therefore reduces from 7 to 6. The next implemented method is list scheduling, introduced in the following.

P1: T1, T3, T4, T6
P2: T2, T5

Figure 6.5: Schedule for the task graph in Fig. 6.2 with p = 2 using level scheduling considering execution times
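The greedy assignment underlying Fig. 6.5 can be sketched as follows; this is an illustrative Python implementation of the longest-processing-time-first heuristic for one level, not code from the simulator.

    import heapq

    def greedy_partition(weights, p):
        """Assign the independent tasks of one level (name -> execution time)
        to p processors so that the per-processor sums stay roughly balanced."""
        heap = [(0, proc) for proc in range(p)]      # (current sum, processor id)
        assignment = {proc: [] for proc in range(p)}
        for task, w in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
            total, proc = heapq.heappop(heap)        # processor with the smallest sum so far
            assignment[proc].append(task)
            heapq.heappush(heap, (total + w, proc))
        return assignment

    # Level 0 of Fig. 6.3: w(T1)=1, w(T2)=2, w(T3)=1, distributed over p=2 processors
    print(greedy_partition({"T1": 1, "T2": 2, "T3": 1}, 2))
    # e.g. {0: ['T2'], 1: ['T1', 'T3']}, i.e. level 0 finishes after 2 instead of 3 time units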

List Scheduling

A comparison of list schedules for parallel processing systems is provided by [ACD74]. All of them accomplish the following steps:

1. Creation of a scheduling list (i. e. a sequence of tasks to be scheduled) by assigning priorities to the tasks.

2. While the task graph is not empty: a) assignment of the task with the highest priority to the next available processor and b) removal of that task from the task graph.

The difference between the algorithms lies in the determination of the tasks' priorities. Two frequently used attributes for the assignment of priorities to tasks are the t-level (top level) and the b-level (bottom level). The t-level of a task V ∈ T is the length (as defined in Def. 6.2) of a longest path from an entry task to V. Analogously, the b-level of a task V is the length of a longest path from V to an exit task.


T1(1, 5), T2(2, 4), T3(1, 3)
T4(3, 4), T5(1, 2)
T6(1, 1)

Figure 6.6: Example task graph including b-levels, with node label format Ti(w(Ti), b(Ti))

Definition 6.5 (B-level function) Given a task V ∈ T = {T1, ..., Tn}, the b-level function b : T → N returns the b-level of the task V according to the recursive definition

$b(V) = \begin{cases} w(V), & \text{if } \{W \in T \mid V \prec W\} = \emptyset \\ w(V) + \max\{b(W) \mid V \prec W\}, & \text{otherwise.} \end{cases}$

A critical path (CP) of a DAG is a longest path in the DAG and thus of high importance for a schedule (see [KA99], where algorithms for t- and b-level computations are also presented). In general, scheduling in descending b-level order tends to schedule tasks on a CP first, while scheduling in ascending t-level order tends to schedule tasks in topological order (for more on topological ordering see [KK04]). In [ACD74] the performance of different heuristic list scheduling algorithms is analyzed. It has been shown that the CP-based algorithms have near-optimal performance. One of these is the Highest Level First with Estimated Times (HLFET) algorithm. Another algorithm with a similar procedure, but assuming a uniform execution time w(V) = 1 for all V ∈ T, is the Highest Level First with No Estimated Times (HLFNET) algorithm. Figure 6.6 shows the example graph, extended by the b-level for each node. Applying HLFET to it results in an optimal schedule as shown in Fig. 6.7. More on these and other scheduling algorithms can be found in [KA99].
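The b-level computation of Def. 6.5 and an HLFET-style schedule on top of it can be sketched in Python as follows. This is a simplified illustration: it processes tasks in global descending b-level order instead of maintaining an explicit ready list, and the example edge set is assumed to be consistent with the weights and b-levels of Fig. 6.6 rather than taken from the implementation.

    def b_levels(succ, w):
        """b(V) = w(V) + max over the b-levels of V's successors (Def. 6.5)."""
        b = {}
        def bl(v):
            if v not in b:
                b[v] = w[v] + (max(bl(s) for s in succ[v]) if succ[v] else 0)
            return b[v]
        for v in succ:
            bl(v)
        return b

    def hlfet_like(succ, w, p):
        """Assign start times by scheduling tasks in descending b-level order on p processors."""
        b = b_levels(succ, w)
        preds = {v: {u for u in succ if v in succ[u]} for v in succ}
        finish, proc_free, start = {}, [0] * p, {}
        for v in sorted(succ, key=lambda v: -b[v]):
            ready = max((finish[u] for u in preds[v]), default=0)
            i = min(range(p), key=lambda i: proc_free[i])   # next available processor
            start[v] = max(ready, proc_free[i])
            finish[v] = proc_free[i] = start[v] + w[v]
        return start

    succ = {"T1": {"T4"}, "T2": {"T5"}, "T3": {"T5"},
            "T4": {"T6"}, "T5": {"T6"}, "T6": set()}
    w = {"T1": 1, "T2": 2, "T3": 1, "T4": 3, "T5": 1, "T6": 1}
    print(hlfet_like(succ, w, 2))   # total schedule length 5, matching Fig. 6.7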

6.1.2 Task Parallelization in DPsim

The core part of the simulation tool is the actual simulation solver for power grid simulation. One of its main steps is calculating the system matrix A by iterating through a list of power grid components, accumulating each component's contribution. The simulation at time point t can then


P1: T1, T4, T6
P2: T2, T3, T5

Figure 6.7: Schedule for the task graph in Fig. 6.2 with p = 2 using HLFET

be sketched with the following steps:

1. computing the right-hand side vector b(t) by accumulating each component’s contribution (similar to the procedure composing the matrix A);

2. solving the system equation Ax(t) = b(t);

3. updating components’ states (e. g. equivalent current sources) using the solution x(t).

These are just the major tasks; others have to be performed in each step as well, such as the simulation of the dynamics of the mechanical parts of electromechanical components like synchronous generators, and simulation values must be exchanged between the time steps during a distributed simulation. Eventually, simulation results and logs are saved where needed. A single step is split into tasks defined by a list of tasks for each component which has to be simulated. Further tasks are added for the main step of system solving and optionally also for the logging of results as well as for data exchange with other processes (e. g. simulators) or HiL.
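A minimal NumPy/SciPy sketch of this per-step structure is given below. It is not DPsim's actual C++ implementation; the component objects with pre_step and post_step methods are hypothetical placeholders for the PreStep and PostStep tasks described above.

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    def simulate(components, A, n, t_final, dt):
        """MNA-style time loop: assemble b(t), solve A x(t) = b(t), run post-step updates."""
        lu_piv = lu_factor(A)              # decomposition computed once and reused every step
        x = np.zeros(n)
        t = 0.0
        while t < t_final:
            b = np.zeros(n)
            for c in components:           # PreStep tasks: right-hand-side contributions
                c.pre_step(t, x, b)
            x = lu_solve(lu_piv, b)        # Solve task: forward/backward substitution only
            for c in components:           # PostStep tasks: component state updates from x(t)
                c.post_step(t, x)
            t += dt
        return x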

Task Dependency Analysis

For the representation of dependencies, a system of attributes is implemented. Attributes are properties of components, such as the voltage of a voltage source, which are accessed during the simulation by a read or a write. A task has two sets of attributes: one for attributes with read accesses and one for those with write accesses. If an attribute is written by a task T1 and read by a task T2, then T2 depends on T1, which is represented by a task graph as defined in Def. 6.1 for all tasks within one simulation step. The task graph for an example circuit (see Fig. 6.8) is depicted in Fig. 6.9. In PreStep, certain values necessary for the current simulation step (i. e. contributions to the right-hand-side vector) are computed depending on

the solutions of the previous simulation step. In PostStep, certain component-specific values are calculated from the system solution computed by Sim.Solve in the current simulation step. For optimization purposes, tasks that are not necessary in a certain simulation are omitted. In case of the Resistor component, e. g., a PostStep task is processed which calculates the current through it based on the voltages from the system solution (e. g. calculated in Sim.Solve). More on the task dependency analysis can be found in [Mir20].
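How such read/write attribute sets translate into task-graph edges can be sketched as follows; the task and attribute names are illustrative and do not reflect DPsim's actual API.

    def build_task_graph(tasks):
        """tasks maps a task name to (set of read attributes, set of written attributes).
        Returns the edges of the task graph: (T1, T2) if T2 reads an attribute written by T1."""
        edges = set()
        for t1, (_, writes1) in tasks.items():
            for t2, (reads2, _) in tasks.items():
                if t1 != t2 and writes1 & reads2:
                    edges.add((t1, t2))
        return edges

    # Hypothetical attribute sets for the tasks of Fig. 6.9
    tasks = {
        "V1.PreStep":  (set(),                               {"V1.rightVector"}),
        "C1.PreStep":  (set(),                               {"C1.rightVector"}),
        "Sim.Solve":   ({"V1.rightVector", "C1.rightVector"}, {"leftVector"}),
        "R1.PostStep": ({"leftVector"},                       {"R1.current"}),
        "Sim.Log":     ({"leftVector"},                       set()),
    }
    print(sorted(build_task_graph(tasks)))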

Task Schedulers

Before the actual simulation, a scheduler analyzes the task graph in order to create a schedule for the simulation using a certain number of concurrent threads, which can be scheduled by the operating system on different parallel processors for potential execution time improvements. Several schedulers based on the presented scheduling methods (see Sect. 6.1.1) were implemented in DPsim as given in Tab. 6.1. Each scheduler has a createSchedule method for initialization purposes based on the task graph and a step method called in the main simulation loop. The SequentialScheduler sorts the task graph in topological order for obtaining a valid task schedule for sequential processing. For the actual parallel processing, different

Figure 6.8: Example circuit (voltage source V1, resistor R1, capacitor C1)

V1.PreStep, C1.PreStep → Sim.Solve → V1.PostStep, R1.PostStep, C1.PostStep, Sim.Log

Figure 6.9: Task graph resulting from Fig. 6.8


Table 6.1: Overview of the implemented schedulers

Scheduler class name    Short name     Paradigm      Algorithm
SequentialScheduler     sequential     -             Topological sort
OpenMPLevelScheduler    omp_level      OpenMP        Level scheduling
ThreadLevelScheduler    thread_level   std::thread   Level scheduling
ThreadListScheduler     thread_list    std::thread   HLF(N)ET

Application Programming Interfaces (APIs) are used: OpenMP [Ope19b], providing a simple interface for the (incremental) development of parallel applications, and the std::thread class from the system's C++ standard library [Jos12].

The OpenMPLevelScheduler has the simplest implementation as it utilizes the OpenMP API. Its step function (see List. A.1) forks a given number of concurrent threads (through a parallel section) in which a loop over the levels is processed by each thread sequentially (i. e. each thread processes each level). Within this level loop, a parallel loop over the tasks within a level is executed with an OpenMP schedule(static) clause, causing a nearly equal distribution of the tasks among the threads. As a parallel for-loop in OpenMP has an implicit barrier by default, the concurrent threads process the levels synchronously. An advantage of OpenMP is that there are many implementations for different computer platforms, although there can be significant differences in computing performance [Mül03]. Also, the simple OpenMP pragmas allow an incremental development but also prevent influence over some implementation details.

The ThreadScheduler was implemented based on the std::thread class from the C++ standard library [Wil19], implementing the step function to have more control over the synchronization between the threads. In every time step, each thread executes its list of assigned tasks successively, synchronized by atomic counters supporting two operations: an atomic increment of the counter's value and waiting until it reaches a given value, which is implemented in form of busy waiting [Tan09]. The counter of each task is incremented after its processing. Before each task is executed, the wait method is called on the counters of all tasks with an edge (in the task graph) to the current task. The actual distribution of the tasks among the threads is accomplished by the two subclasses of the ThreadScheduler.

The ThreadLevelScheduler, like the OpenMPLevelScheduler, realizes level scheduling but with a different behavior. In case of the OpenMP-based scheduler, there are barriers for all threads at each level's end, causing also

threads without tasks within a certain level to wait before the execution of (independent) tasks of the next level. Such unnecessary barriers are not conducted by the ThreadLevelScheduler. Moreover, it can make use of execution times measured during a previous execution by applying the greedy heuristic for multi-way partitioning to keep the subsequent execution time per level mostly uniform between the threads (see Sect. 6.1.1).

The ThreadListScheduler, which also derives from ThreadScheduler, implements the list scheduling algorithm based on HLFET in case execution times are provided, and on Highest Level First with No Estimated Times (HLFNET) if not (i. e. the execution time per task is assumed to be uniform).

Component-Based Modified Nodal Analysis

The system to be simulated is passed as a list of component objects to an MNA solver, implemented with the MNASolver class. All components that can be simulated using the MNA approach have the following in common:

• their internal state is initialized depending on the system frequency and time step;

• their presence may change the system matrix;

• they specify tasks such as PreStep and PostStep which have to be processed at each time step.

At simulation start, each component is initialized, its contribution is accumulated to the system matrix, and the matrix decomposition is calculated. More details on the MNA implementation itself can be found in [Mir20]. A Simulation class constructs the task graph from the given list of tasks as well as from tasks for logging and interfacing if needed. During the simulation, the scheduler's step method (for proceeding in time) is called, which executes all tasks in a correct order (i. e. avoiding race conditions). Because of the distinction in the implementation between the scheduler and the solver, the implemented framework for parallel processing is not MNA solver specific but can be adapted to any solver structure which, however, must be divisible into tasks.
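The commonalities listed above can be captured by a small component interface. The following Python sketch is purely illustrative (class and method names are assumptions, not the MNASolver C++ class hierarchy), and the resistor stamp ignores the handling of the ground node for brevity.

    import numpy as np

    class MNAComponent:
        """Interface shared by all MNA-capable components (illustrative names)."""
        def initialize(self, frequency, dt):
            """Set up the internal state for the given system frequency and time step."""
        def apply_system_matrix_stamp(self, A):
            """Accumulate this component's contribution to the system matrix A."""
        def tasks(self):
            """Return the PreStep/PostStep task objects of this component."""
            return []

    class Resistor(MNAComponent):
        def __init__(self, node1, node2, resistance):
            self.n1, self.n2, self.g = node1, node2, 1.0 / resistance

        def apply_system_matrix_stamp(self, A):
            # classic conductance stamp between the two terminal nodes
            A[self.n1, self.n1] += self.g
            A[self.n2, self.n2] += self.g
            A[self.n1, self.n2] -= self.g
            A[self.n2, self.n1] -= self.g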

6.1.3 System Decoupling

Solving a linear system of size n requires O(n³) operations, which leads to long execution times in case of large matrices. Even if the system matrix stays fixed between the simulation steps, so that an LU decomposition of it can be reused for solving the system, the

forward/backward substitutions would still require O(n²) operations at each time step. Because of requirements on the time step in real-time simulation (dependent on the simulation model, method, and use case), this would limit the size of the system model. A possible approach is to split the system into smaller matrices that can be solved independently and to compose the solution of the whole system from all partial solutions. In case the LU decomposition can be reused, the potential speedup of solving k systems of size n/k over solving a single system of size n is n² / (k · (n/k)²) = k. As the smaller matrices are independent, they can be solved concurrently, which results in a higher performance. Therefore, two methods for increasing the performance gain from the presented parallelization by splitting the system matrix into smaller parts were implemented.
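The k-fold reduction of the substitution cost can be checked with a small, purely illustrative NumPy/SciPy experiment (sizes and timings are arbitrary; the decoupled blocks are solved sequentially here, so concurrent solving would improve the result further).

    import time
    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    n, k, steps = 2000, 4, 100
    A_big = np.random.rand(n, n) + n * np.eye(n)                   # one coupled system of size n
    A_small = [np.random.rand(n // k, n // k) + n * np.eye(n // k) for _ in range(k)]
    b = np.random.rand(n)

    lu_big = lu_factor(A_big)                                       # decompositions are reused,
    lu_small = [lu_factor(Ai) for Ai in A_small]                    # only substitutions remain per step

    t0 = time.perf_counter()
    for _ in range(steps):
        lu_solve(lu_big, b)                                         # O(n^2) per step
    t1 = time.perf_counter()
    for _ in range(steps):
        for i, lui in enumerate(lu_small):                          # k * O((n/k)^2), roughly 1/k of the work
            lu_solve(lui, b[i * (n // k):(i + 1) * (n // k)])
    t2 = time.perf_counter()
    print(f"coupled: {t1 - t0:.3f} s, decoupled: {t2 - t1:.3f} s")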

Decoupled Transmission Line Model

The application of the TLM (in literature also called the decoupled transmission line model), which belongs to the explicit parallelization approaches using computational components (see Sect. 6.1), can split a grid into two subgrids which are not topologically connected. This allows the creation of two separate system matrices that can be solved concurrently during each time step. DPsim automatically recognizes such cases, solves the systems separately, and simulates the line behavior of the equivalent components connecting the two subnetworks together.

Diakoptics

Diakoptics is another method which allows the user to divide a grid into subgrids. The resulting subgrids can also be computed concurrently, and their results can be combined into the solution of the whole system. More on the implementation of TLM and diakoptics in DPsim can be found in [Mir20].

6.2 Analysis of Task Parallelization in DPsim

In the following, the performance benefits of the previously introduced parallelization methods are analyzed on models without and with system decoupling. For that purpose, the average wall clock time needed for a single simulation step is used as the metric in all analyses. It was chosen because of its importance for soft real-time simulation, where the elapsed times of all time steps must stay below a specified average. At first, the execution times for the different schedulers are analyzed for several system model sizes. Afterwards, the effect of the parallelization on the system

decoupling methods is investigated. Finally, the parallel performance is compared when DPsim is built with various popular compiler environments.

Measurement Environment

All measurements in this section were performed on a server with two sockets, each with an Intel Xeon Silver 4114 2.2 GHz (3.0 GHz Turbo) 10-core CPU with Hyper-Threading (HT); 160 GB DDR4 main memory; running an x86_64 Ubuntu 16.04 Server Linux with gcc v8.1.0 as the default compiler environment for DPsim.

6.2.1 Use Cases

The Western System Coordinating Council (WSCC) 9-bus transmission benchmark network was used as the reference network. It consists of three generators, each connected to a power transformer, and three loads connected to the generators by six lines in a ring topology. The whole network (i. e. the system model) as depicted in Fig. 6.10 was provided in form of a CIM-based file. Its components were modeled in the following way:

• synchronous generators represented with the aid of an inductance and an ideal voltage source whose value was updated in each step based on a model for transient stability studies;

• power transformers modeled as ideal transformers with an additional resistance and inductance on the primary side to model in particular the electrical impact of the windings and related power losses;

• transmission lines represented by PI models with additional small so-called snubber conductances to ground at both ends;

• loads modeled as having a constant impedance and inductive behavior, thus represented by a resistance and inductance in parallel.

More on the component models can be found in [Mir20]. For an analysis of various system model sizes, multiple replications of the WSCC 9-bus system were combined in an automated way. For this purpose, further transmission lines were added between nodes connected to loads (labeled in Fig. 6.10 as BUS5, BUS6 and BUS8) to form further rings between components of the system copies. The resulting topologies for two and three system copies are illustrated in Fig. 6.11, where different node colors signify different copies of the original 9-bus system and newly added transmission lines are represented using solid lines. Only relevant buses are shown and the omitted parts are sketched as dashed lines.


Figure 6.10: WSCC 9-bus transmission benchmark network

6.2.2 Schedulers

In the first part of the scheduler analysis, the different schedulers were compared on benchmark networks of different sizes. In Fig. 6.12 the average wall clock times per step for simulating the 9-bus system are plotted for each implemented scheduler, depending on the number of threads from one to ten (due to the 10-core server). The simulation had a time step of 100 µs, was 100 ms long, and the average execution time for a single time step was calculated over the execution times of 50 simulations. The scheduler names in the plot's legend are as defined in Tab. 6.1, whereby the adjunct meas indicates that the measured average task execution times were passed to the scheduler. The parallel processing as scheduled by all methods is slower than the sequential scheduler (dashed line). All schedulers, except the OpenMP-based one which shows an additional overhead,


Figure 6.11: Schematic representation of the connections between system copies; (1) two system copies, (2) three system copies

[Plot: wall clock time per step [s] over the number of threads for the sequential, omp_level, thread_level, thread_level meas, thread_list, and thread_list meas schedulers]

Figure 6.12: Performance comparison of schedulers for the WSCC 9-bus system

lead to similar execution times, which increase with the number of threads. Therefore, the same benchmark was performed on a network with 20 interlinked copies of the 9-bus system. For this larger system the parallel processing for all schedulers performed better than the sequential one, as depicted in Fig. 6.13.

[Plot: wall clock time per step [s] over the number of threads for all schedulers]

Figure 6.13: Performance comparison of schedulers for 20 copies of the WSCC 9-bus system

Here the OpenMP-based level scheduler implementation led to the highest speedup of ∼1.27 in relation to sequential processing, but again there are only slight differences between the schedulers. At the end of the scheduler analysis, the number of threads was fixed at eight (i. e. a few less than the number of cores, to reduce context switching caused by other system threads on the same CPU) whereas the system size was varied up to forty 9-bus copies. The resulting average execution times for a single time step plotted in Fig. 6.14 were calculated over 10 measurements because of the rising overall simulation times for larger systems. From fifteen 9-bus copies on, the parallel processing shows a performance improvement over sequential processing and again there is no relevant difference between


[Plot: wall clock time per step [s] over the number of system copies for all schedulers]

Figure 6.14: Performance comparison of schedulers for a varying number of copies of the WSCC 9-bus system

the parallel schedulers. Furthermore, the required simulation time grows quadratically with the system size.

As expected, a system must have a certain size for the parallelization to pay off, as the synchronization between multiple threads, realized by OpenMP barriers and by the counter mechanism of the other schedulers, requires too much time compared to the actual simulation computations. In the dependency graph (see Fig. 6.15), where the area of a circle (representing a task) is proportional to its execution time, it can be seen that most time is spent in one single task which solves the system equation. As the system to be solved grows quadratically with the number of nodes, the parallelization speedup is limited by this SolveTask. This is also the reason for the small differences between the various schedulers, as there is only a small number of meaningfully different schedules. The reduction of one big SolveTask to multiple smaller subtasks was therefore the main motivation for the system decoupling discussed in the following.


6.2.3 System Decoupling

In this analysis, the impact of the parallelization methods on decoupled systems was examined. For this, the 9-bus system copies were connected as described before. Then, the TLM is applied to the added transmission lines. In a second case, the added transmission lines are used as so-called splitting branches for the diakoptics method [Mir20]. Again, the simulation had a time step of 100 µs, was 100 ms long, and the average execution time for a single time step was calculated over the execution times of 10 simulations.

At first, the parallel performance for an increasing number of systems using the TLM is depicted in Fig. 6.16, exemplarily for the OpenMPLevelScheduler and the ThreadLevelScheduler (without any information about the execution times of the tasks in a previous step), depending on the number of deployed threads. The parallel processing leads to much lower execution times in case of both schedulers and scales up to 8 threads on the utilized 10-core system, although the execution times needed by sequential processing are already much lower than without TLM. The maximum achieved speedups with 8 as well as 10 threads in relation to sequential execution are around two orders of magnitude. The TLM performance of all schedulers was measured with 8 threads and is shown in Fig. 6.17. There, the average execution time per step is nearly the same for all schedulers. It does not grow linearly with the number of system copies (as the solving of the decoupled subsystems grows quadratically) and the plots have sharp increases at some points, which could stem from a system size that does not fit into the cache of a certain level anymore, leading to higher latencies while accessing the cache of the next level or the main memory, respectively. Similar measurements were performed using diakoptics instead of TLM, as depicted in Fig. 6.18. Again, the parallel processing scheduled by the OpenMPLevelScheduler and ThreadLevelScheduler shows a higher performance compared to

Figure 6.15: Task graph for simulation of the WSCC 9-bus system

sequential processing, with maximum speedups of around one order of magnitude. Unfortunately, the speedup from two to more threads is very limited. The diakoptics performance of all schedulers was measured with 8 threads and is shown in Fig. 6.19. Here as well, the parallel processing based on all schedulers leads to very similar execution times, but without the regular sharp increases seen in case of the parallel processing on decoupled systems using TLM.


[Plots: wall clock time per step [s] over the number of system copies for sequential execution and 2, 4, 8, and 10 threads; (1) OpenMPLevelScheduler, (2) ThreadLevelScheduler]

Figure 6.16: Performance for a varying number of copies of the WSCC 9-bus system using the decoupled line model


[Plot: wall clock time per step [s] on a logarithmic scale over the number of system copies for all schedulers]

Figure 6.17: Performance comparison of schedulers for a varying number of copies of the WSCC 9-bus system using the decoupled line model with 8 threads


[Plots: wall clock time per step [s] over the number of system copies for sequential execution and 2, 4, 8, and 10 threads; (1) OpenMPLevelScheduler, (2) ThreadLevelScheduler]

Figure 6.18: Performance for a varying number of copies of the WSCC 9-bus system using diakoptics


[Plot: wall clock time per step [s] on a logarithmic scale over the number of system copies for all schedulers]

Figure 6.19: Performance comparison of schedulers for a varying number of copies of the WSCC 9-bus system using diakoptics with 8 threads

6.2.4 Compiler Environments

The performance of the parallelization does not only depend on the scheduling methods but also on the parallelization paradigms (OpenMP and C++11 threads) of the used compiler environments and their optimizations. Table 6.2 lists three compiler environments that are nowadays often used in the scientific area, together with the applied optimization levels for comparable results (i. e. programs). The simulation was repeated with all three compilers, having a time step of 100 µs and a duration of 100 ms. The average execution time for a single time step was calculated over the execution times of 50 simulations, as presented in Fig. 6.20.


[Plot: wall clock time per step [s] over the number of threads for gcc, clang, and icc, each with the sequential, omp_level, and thread_level schedulers]

Figure 6.20: Performance comparison of compilers for 20 copies of the WSCC 9-bus system


Table 6.2: Overview of the tested compilers

Compiler                          Version      Flags               Reference
GNU Compiler Collection (gcc)     8.1.0        -O3 -march=native   [GCC]
Clang (clang)                     7.0.1        -O3 -march=native   [Cla]
Intel C++ Compiler (icc)          19.0.1.144   -O3 -xHost          [Int]

The gcc and icc compilers lead to a comparable performance for all schedulers which, in case of the small system to be simulated, is lower with parallelization than with sequential execution. The executable compiled with clang, however, has the lowest performance. Therefore, simulations with the same parameters as before were performed on a system model consisting of twenty interlinked 9-bus copies. The plots for all compilers are quantitatively similar to the ones in Fig. 6.13. For all compilers the parallel processing on this larger system model achieves lower execution times than sequential processing. Here again, gcc yields the highest performance.

6.3 Conclusion and Outlook

This chapter provides an overview of approaches exploiting parallelism in mathematical models. In addition to the three approach types known in the simulation area [Lun+09], it introduces a new Automatic Fine-Grained Parallelization of Mathematical Models which was implemented in DPsim, an existing power grid simulation software. After a presentation and formal definition of different scheduling methods, the implementation of the parallelization methods belonging to the new approach type, i. a. the task dependency analysis, and two system decoupling methods (TLM and diakoptics) are sketched. The subsequent analysis of the task parallelization methods implemented in DPsim for shared-memory systems has shown sublinear speedups for small system models, with execution times per simulation step increasing with the number of used CPUs. However, in case of larger system models (with more than 100 nodes) in combination with TLM, superlinear speedups have been achieved. Unfortunately, TLM has some restrictions on the simulation time steps as well as on the types of transmission lines for which it can be applied, and it also potentially introduces inaccuracies for higher frequencies. The utilization of diakoptics, which does not introduce such disadvantages, also leads to parallel speedups when applying the implemented parallelization methods.


On the 9-bus system model, the various scheduling algorithms showed almost no performance differences in many cases. Moreover, the existing differences are not caused by the different scheduling concepts but instead by the particular parallelization paradigm and the compiler environment. The reason is the general structure of the task dependency graphs, which leaves only little flexibility for the algorithms to generate strongly differing schedules. Nevertheless, as the task dependency graphs depend on the system models, a comprehensive analysis with different models could result in a variety of execution times depending on the parallel scheduling method.

The implemented task dependency analysis is general enough to introduce a finer-grained inner-task parallelization. For instance, GPUs on a heterogeneous architecture could be utilized as accelerators (e. g. for computations of complex component models) by porting task-related code for CPUs to GPU kernel code. Then, the schedulers could deploy tasks among CPUs and GPUs. A utilization of further parallel programming paradigms for distributed-memory architectures, such as the Message Passing Interface (MPI), could be considered, but only in case of very large system models because of the higher latencies usually introduced by interactions (i. e. memory accesses and synchronizations) between different computer nodes.

Furthermore, optimization efforts for the processing within tasks were begun by the usage of explicit single instruction, multiple data (SIMD) vectorization, where vector instructions (such as Advanced Vector Extensions (AVX)) of modern CPUs are utilized. With these, a higher performance can be achieved if the same CPU instructions are performed on vectors instead of scalars. Modern compilers already perform automatic vectorization, but only for parts of the code where they can assure correctness by a static code analysis, which they can do only for certain control flow patterns. In more complex computations, explicit vectorization can be enabled by the programmer, e. g. using OpenMP SIMD compiler directives or SIMD compiler intrinsics.


7 HPC Python Internals and Benefits

In the past decade, Python has developed into one of the most popular programming languages. In many rankings of the most widely used programming languages it is among the first three positions [Cas19]. Especially in the engineering sector it enjoys a steadily growing popularity, as it is said to be easy to learn because of its clear syntax and a relatively small set of keywords. Furthermore, there are several open-source Python implementations with a comprehensive standard library available for free. Python allows, e. g., object-oriented as well as functional programming, and the very portable Python implementations allow its use on many platforms. As Python programs are usually interpreted, they do not need to be compiled, which is why Python is often used as a scripting language for small tasks.

But besides the duration and simplicity of software development, also the time efficiency of a programming language is crucial, especially in scientific computing. Python's simple syntax and automatic memory management lead to short development times in comparison to other programming languages. However, the execution times of interpreted programming languages are usually considerably higher than those of compiled languages. Therefore, various language extensions, optimized interpreters, and compilers are developed to increase the time and memory efficiency of Python programs. Important representatives are the Python package NumPy [VCV11; Numb], the just-in-time (JIT) compilers numba [LPS15; Numa] and PyPy as well as the language extension Cython [Beh+11; Cyt]. But if an engineer, for instance, developed a software project in Python with all

needed features but insufficient performance, the question arises which of the mentioned solutions should be used for which kind of algorithms. Around these efforts, a scientific community has grown in the past years, with conferences on Python for High-Performance and Scientific Computing [Ger]. However, no systematic comparative analysis of the methods improving Python's runtime performance has been accomplished so far.

In a blog post [Pug16] an execution time comparison between Python 3, C, and Julia, based just on an LU decomposition, was shown in combination with the (JIT) compilers Cython and numba as well as the modules NumPy and SciPy [Bre12; Scid], the latter containing numerical algorithms based on NumPy. The result of this benchmark was that the execution of conventional Python was one order of magnitude slower than C and Julia. With the applied improvements, however, the performance of the Python solution was similar to C and Julia. The SciPy-based implementation was even more performant than the ones in C and Julia when using precompiled functions of the SciPy and NumPy modules. Except for the conventional Python 3 solution, each implementation was optimized for vector CPU instructions. In [Rog16] a benchmark of Python runtime environments was presented. The comparison was accomplished with the conventionally used reference C implementation CPython [Pytb] and the Java implementation Jython of the Python interpreter on the one hand, as well as PyPy and Cython as compilers on the other hand. The results are interesting as Jython achieves a higher performance than CPython and Cython is as fast as CPython. The latter is the case because the Cython version was not adapted to make use of Cython's features, which will be introduced later in this chapter. Furthermore, for this benchmark only Python 2 was used, which is now deprecated, and Python 3 is not backward compatible to Python 2.

The available benchmarks focus on the execution time only. For a holistic view of the solutions, the memory consumption must be taken into account as well, which has not been considered in the previous analyses. Therefore, this chapter presents a comparative analysis of the currently most popular performance improvement solutions for Python programs on different kinds of standard algorithms from the area of numerical methods and operations on common abstract data types (ADTs). These algorithm implementations based on the various Python solutions are compared with reference implementations in C++, which is considered a time and memory efficient object-oriented programming language. The comparative analysis presented here does not only compare the execution times of the programs but also their memory consumption. It shall provide Python programmers an overview of current solutions to improve the performance of their Python programs. Moreover, it shall provide them information on

how much effort is required for the application of a certain solution on the one hand and which gain can be expected on the other hand. The chapter gives an introduction to HPC-relevant properties of the Python language and its reference implementation CPython. A short introduction of the aforementioned Python runtime environments follows, with a focus on their different approaches. Hereafter, the benchmarking methodology based on representative algorithms is presented. The algorithms are used for the comparative analysis of the execution times and memory consumption of the various Python environments, presented in the following section. Finally, a conclusion on the comparative analysis is given with an outlook on future work. This chapter presents outcomes of the supervised thesis [Kas17].

7.1 HPC Python Fundamentals

Before the available Python environments are presented, a short overview of the HPC relevant peculiarities of Python is given. Usually high-level languages (HLLs) like Python are structured in a way that humans are able to read and maintain them easily and reuse certain parts of the program, and so forth. Hence, before such programs can be executed on a central processing unit (CPU), the source code must be transformed into a sequence of instructions of the actual CPU. This can be accomplished, for instance, with an interpreter or compiler.

Interpreter

An interpreter processes the source code at the run time of a program. It reads the program's source code, analyzes or even preprocesses it, and executes the statements by translating them successively into instructions of the target CPU. In case of Python programs interpreted by the CPython environment, the preprocessing consists of a transformation of the Python code into an intermediate format, the bytecode (stored in .pyc files), for a virtual machine [Ben]. The Python interpreter is an implementation of that virtual machine. The successive execution of source code by an interpreter makes the programming language usable for scripting and usually allows a better error diagnosis [Aho03]. However, this has the disadvantage of a generally slower execution speed of interpreted programs in comparison with compiled programs.


Compiler

A compiler for HLLs usually translates the whole relevant source code to executable machine code (i. e. instructions of the target CPU). It can also generate intermediate codes from the source code but the main difference of this approach, in contrast to an interpreter, is that the program after the compilation process is available in a form that can be executed on the CPU directly. The direct execution of the program instructions on the CPU leads to a high execution speed but disadvantages are, e. g., that the machine code is CPU architecture dependent and must be compiled again for different computer platforms. The same applies in case of source code changes. Such compilers are therefore also called ahead-of-time (AOT) compilers.

Just-in-Time Compiler

In contrast to AOT compilers, JIT compilers translate the source code mostly during run time. Only those parts of the program that need to be executed are compiled. JIT compilers can increase the execution speed of interpreted programs when the execution of the compiled part of the source code is so much faster than its interpretation that the compilation process of the JIT compiler does not have a negative effect on the whole execution time. Parts that have been compiled once do not need to be compiled again in case of multiple executions, such as in loops.

Tracing Just-in-Time Compiler

A tracing just-in-time (TJIT) compiler makes use of the assumption that most of a typical program’s run time is spent in loops [Bol+09]. Therefore, a TJIT compiler tries to identify often executed paths within loops. The instruction sequences of such execution paths are called traces. After their identification, the traces are usually optimized and translated to machine code.

7.1.1 Classical Python

Python is continuously being developed further. Currently Python is available in version 3, which has many new features breaking backward compatibility to version 2 [Tad][Rosb]. The Python version numbers refer to the major version numbers of the reference Python interpreter implementation CPython [Pytb]. After around 20 years of development, Python 2 was retired and the last CPython 2.7.18 was released in April 2020 [Pytd].


Nevertheless, Python 2 was considered in this dissertation since there is still much Python 2 code that has not been ported to Python 3.

Data Types

In Python, variables are not declared and can be used without a data type definition. Everything is an object in Python and associated with a certain data type [Mon]. A Python variable can reference different objects of different types, and the type of an object is determined dynamically at run time with the aid of its attributes and methods, which is called Duck Typing [FM09].

There are so-called mutable and immutable objects. Objects that are, e. g., of the type int, float, bool, or tuple are immutable. An instance of an immutable data type has a constant value which cannot be changed. Multiple variables with the same value do not necessarily reference multiple instances; instead, the same instance may be referenced by all these variables. In contrast, a mutable instance can change its value during run time, which is why a new mutable object is created in memory each time one is requested. Mutable objects are, e. g., of the type list, dict, or set [Cara].

Python 3 distinguishes between several types for numbers. Integers are of the type int and have an arbitrary precision. In Python 2 an int represents an integer value with 64 bit and the type long corresponds to int of Python 3. An instance of type list is a sequence of objects that can have an arbitrary type. The content of the list can be changed during run time and the objects can be mutable or immutable. Unlike a list, the content of a tuple cannot be modified during run time. An object of type dict is an associative array which consists of key-value pairs. The keys, which can only be of an immutable type, refer to objects of an arbitrary type.
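The difference between mutable and immutable objects can be illustrated with a small snippet; the object sharing shown in the first lines depends on CPython's caching of small integers and is therefore implementation-specific.

    a = 42
    b = 42
    print(a is b)     # True in CPython for small integers: both names refer to one shared instance

    t = (1, 2, 3)
    # t[0] = 5        # would raise TypeError, since tuples are immutable

    l1 = [1, 2, 3]
    l2 = [1, 2, 3]
    print(l1 is l2)   # False: equal values, but two distinct mutable objects
    l1.append(4)      # mutable objects can be changed in place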

Parameter Passing

The two most common evaluation strategies for parameters during a function call in HLLs are call by value and call by reference. In the first case, the value of the given expression (passed to the function) is assigned to the function's parameter. In the second case, the object that is referenced by the given expression is also referenced by the function's parameter within the function. The latter leads to the fact that the object's value changes in the calling part of the code when it is changed within the called function.


Python, however, uses the mechanism referred to as call by object (reference) [Kle]. If a variable x in main is passed to a function as parameter y, then x and y refer to the same object. This behavior corresponds to call by reference. If, subsequently, another object is assigned to y within the function, y refers to the new object and x in main stays untouched which corresponds to a call by value behavior.

Side Effects

The call by object reference principle can cause side effects. If a mutable object, e. g. of the type list, referenced in main by the variable l, is passed to a function with a parameter m, all modifications to m within the function also apply to the list in main. To avoid this, a copy of the list can be passed with the aid of the slicing operation, which can be used by writing l[:] instead of l as the argument in the function call.
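A short example of this side effect and of the slice-based copy:

    def append_item(m):
        m.append(42)      # modifies the object the caller passed in

    l = [1, 2, 3]
    append_item(l)        # side effect: l is now [1, 2, 3, 42]
    print(l)

    l = [1, 2, 3]
    append_item(l[:])     # a copy is passed, so l stays [1, 2, 3]
    print(l)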

NumPy Module

NumPy, standing for Numerical Python, is a package for scientific computing with Python [Numb]. It contains an N-dimensional array object implementation (called ndarray), functions which can work on such arrays, tools for integrating C/C++ and Fortran code, and linear algebra, Fourier transform, as well as random number capabilities. The ndarray can be used for numerical computations instead of the normal Python list. All elements of the ndarray must be of the same data type, as the NumPy package is implemented in C and therefore can benefit from static typing at compile time for higher run time efficiency. Possible data types that can be used are, e. g., bool_, int_, float_, and complex_ for equivalent C types as shown in [SWD15]. A one-dimensional ndarray with n 64 bit floating-point numbers containing zeros can be created as follows:

    numpy.zeros(n, float)

An ndarray provides a Python object around a functionally extended C array. The following Python code shows a matrix-matrix multiplication:

    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]

A usage of ndarrays for the matrices would lead to an additional overhead in the innermost loop. The overhead would occur at the border between the pure Python code around the +=-statement (i. e. the three loops) and the NumPy code executed during the evaluation of the statement. In the

case of a 10 × 10 matrix, the border within the three loops would be passed 10³ = 1000 times. That could make the program execution slower than with the normal Python list, which is why NumPy functions should only be called for sufficiently long processing on the provided data. Instead of applying pure Python operations over the entries of an ndarray, it is recommended to apply operations over the whole ndarray in C code. For this purpose, NumPy provides precompiled functions implemented in C, such as the following one that can be used for the multiplication of two matrices:

    numpy.dot(A, B)

With this function the border between Python and NumPy code is passed just once. Moreover, the precompiled functions of NumPy for linear algebra make use of BLAS [Uni17] and LAPACK [Uni19]. In comparison to Python lists, the ndarray generates less overhead with regard to execution time and memory usage [Coh], as it consists of contiguous memory blocks (at least in virtual memory), whereas a Python list consists of pointers to memory blocks which can be randomly distributed in memory, which is unfavorable for CPU caches as depicted in Fig. 7.1.
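The impact of crossing the Python/NumPy border can be checked with a small timing experiment; the matrix size is arbitrary and the absolute numbers depend on the machine and library versions.

    import time
    import numpy as np

    n = 100
    A = np.random.rand(n, n)
    B = np.random.rand(n, n)
    C = np.zeros((n, n))

    t0 = time.perf_counter()
    for i in range(n):                       # pure Python loops over ndarray elements:
        for j in range(n):                   # the Python/NumPy border is crossed n^3 times
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    t1 = time.perf_counter()
    D = np.dot(A, B)                         # one precompiled call (BLAS underneath)
    t2 = time.perf_counter()

    print(f"loops: {t1 - t0:.2f} s, numpy.dot: {t2 - t1:.4f} s")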

Array Module Python’s array module defines an object type which can compactly repre- sent an array of basic values of one C data type such as, e. g., char, int, float, double, etc. Hence, the module is also implemented using C but is not as powerful as ndarray because only one-dimensional arrays can

[Diagram: a NumPy array header (data, dimensions, strides) pointing to one contiguous data block vs. a Python list header (length, items) pointing to separately allocated objects scattered in memory]

Figure 7.1: NumPy ndarray vs. Python list [Van]

be defined and there are no precompiled functions. More on this can be found in [Pyta].

Memory Management in CPython

The reference implementation CPython comes with an automatic memory management based on so-called reference counting and a garbage collector (GC) [Dig]. Each Python object has a reference counter which is increased when the object is referenced once more and decreased when a reference is removed. If the reference count equals zero, the memory allocated for the object can be freed. However, the reference counting of CPython cannot detect reference cycles, which can occur, for instance, when one or more objects are referencing each other [Glo]. Therefore, CPython has a generational cyclic GC that runs periodically, determining reference cycles in order to free the memory occupied by objects which are only referencing each other. As the garbage collection interrupts the execution of the Python program, there are certain thresholds that can be adjusted. More on that can be read in [Debb].
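The reference counter and the cyclic GC can be inspected from Python itself; the exact counts in this small illustration depend on the interpreter state.

    import sys
    import gc

    l = [1, 2, 3]
    print(sys.getrefcount(l))   # at least 2: the variable l plus the temporary function argument

    m = [l]
    print(sys.getrefcount(l))   # one more reference, held through m

    l.append(l)                 # create a reference cycle
    del l, m
    print(gc.collect())         # the cyclic GC reports the unreachable objects it found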

Architecture of the CPython Environment

The software architecture of the CPython environment is depicted in Fig. 7.2. Before CPython can be used, it must be compiled from the CPython source code by a proper C compiler. The resulting python program can then be applied to the Python code to be executed, which is translated to bytecode and interpreted by the bytecode interpreter into CPU instructions of the target CPU. The bytecode interpreter is implemented in form of a stack-based virtual machine (VM) [Ben]. For the Python function

    def add(a, b):
        z = a + b
        return z

the following sequence of bytecode instructions is executed by the VM. First, the two operands a and b are pushed on the stack by LOAD_FAST instructions. Then the BINARY_ADD instruction pops the two operands from the stack, performs the addition, and pushes the result onto the stack. A STORE_FAST instruction stores the result in z, which is then pushed again on the stack by a further LOAD_FAST, to be returned by a RETURN_VALUE instruction. The data types of the objects (here: a, b, and z) are determined at the execution of each bytecode instruction. Therefore, a BINARY_ADD during one call can be performed on two integer values as well as on two lists

during another call. This makes the interpretation process very flexible but also much more time-consuming than the direct execution of machine instructions from a compiled program. For example, the call of BINARY_ADD on two integer values consists of the following steps:

1. Determine data type of a

2. a is an int: get value of a

3. Determine data type of b

4. b is an int: get value of b

5. Call C function int binary_add(int, int) on values of a and b

6. Result of type int will be stored in result
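The bytecode described above can be inspected with the dis module of the standard library; the exact instruction set varies between CPython versions (in newer releases, BINARY_ADD has been replaced by BINARY_OP).

    import dis

    def add(a, b):
        z = a + b
        return z

    dis.dis(add)
    # typical output for the function body (CPython 3.x):
    #   LOAD_FAST     a
    #   LOAD_FAST     b
    #   BINARY_ADD
    #   STORE_FAST    z
    #   LOAD_FAST     z
    #   RETURN_VALUE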

Parallel Processing in CPython

CPython allows multithreading with the aid of the threading module [Pyte], which is based on POSIX threads on a Portable Operating System Interface (POSIX) conformant operating system [IEE18] and maps the Python threads to native threads of the operating system. However, because of a global interpreter lock (GIL), the Python threads within one

[Diagram: the CPython source (C) is compiled by a C compiler into the python program; Python code is translated to bytecode, which the bytecode interpreter executes as machine code on the computer platform]

Figure 7.2: Software architecture of CPython (python command)


CPython interpreter are not really running concurrently. The reason for the GIL is, i. a., the automatic memory management by reference counters as explained above. Without the GIL in CPython, multiple threads that are using the same Python object could increment and decrement its reference counter concurrently. This could lead to a race condition on the reference counter, resulting in a wrong value. Besides the memory management, also global variables as well as mutable objects cause issues for a thread-safe program execution: if a thread modifies a global variable, another thread could use an old value; the same applies to mutable objects. Therefore, a Python thread in CPython must hold the GIL to be able to execute bytecode instructions.

How the GIL is assigned to the threads depends on the CPython version. If multiple threads are created, one gets the GIL and the others wait (blocking) on it. In Python 2 a check is implemented which counts the ticks (bytecode instructions) since the creation of a new thread [Bea]. After 100 ticks the active thread yields the GIL and all inactive threads get a signal for requesting the GIL. One of them gets it and continues with the execution of its bytecode while the other threads wait on the GIL. In Python 3 each thread gets a time interval of 5 ms instead of ticks [Gir]. After each interval the GIL is yielded and assigned to the next thread in a row. This avoids a competition between the threads, leading to a fair task scheduling.

CPython also provides the multiprocessing module [Pytc], with which child processes can be created within a Python process. Each child has its own process memory that is independent of other processes. Hence, the memory management, global variables, and so forth are no issue for concurrently running processes belonging to the same process tree. The communication between such processes can be performed with the aid of a Manager object.

All previously presented Python peculiarities are important to understand what the Python environments other than CPython do differently to achieve a higher run time performance. These Python environments will be presented in the following.
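A minimal example of process-based parallelism that bypasses the GIL is shown below, using the Pool interface of the same multiprocessing module; the work function is an arbitrary CPU-bound placeholder.

    from multiprocessing import Pool

    def count(n):
        s = 0
        for i in range(n):    # CPU-bound work; with threads this would be serialized by the GIL
            s += i
        return s

    if __name__ == "__main__":
        with Pool(processes=4) as pool:
            results = pool.map(count, [10**7] * 4)   # runs in four separate processes
        print(sum(results))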

7.1.2 PyPy

Contrary to CPython, PyPy's Python interpreter, implementing the full Python language, is written in Restricted Python (RPython) rather than in C. RPython is a restricted subset of Python and therefore suitable for static analysis. For instance, variables should contain values of at most one type at each control flow point [Min]. The PyPy interpreter was

written in RPython as the language was designed for the development of implementations of dynamic languages. RPython code can be compiled by the RPython translation toolchain [PyPe], as it is done for the PyPy interpreter. Due to the separation between the specification of the dynamic language to be implemented and implementation aspects, the RPython toolchain can automatically generate a JIT compiler for the dynamic language. As a subset of Python, RPython can also be interpreted by an arbitrary Python interpreter [Min].

Architecture of the PyPy Environment

The software architecture of the PyPy environment is depicted in Fig. 7.3. Here, the program which runs the Python code to be executed is called pypy and must be compiled from the PyPy source code with the RPython toolchain. Similar to CPython, the Python program is first compiled to bytecode, which is also processed by a stack-based virtual machine [PyPa]. The important difference between the CPython and PyPy interpreters is that the latter delegates all actual manipulations of the user's Python objects to a so-called object space, which is some kind of library of built-in types [PyPc]. Hence, PyPy's interpreter treats the Python objects as black boxes.

[Diagram: the PyPy source (RPython) is compiled by the RPython toolchain into the pypy program, consisting of a bytecode interpreter and a tracing JIT; Python code is translated to bytecode and executed as machine code on the computer platform]

Figure 7.3: Software architecture of PyPy (pypy command)


The BINARY_ADD in PyPy is implemented as follows [BW12]:

    def BINARY_ADD(space, frame):
        object1 = frame.pop()                 # pop left operand off stack
        object2 = frame.pop()                 # pop right operand off stack
        result = space.add(object1, object2)  # perform operation
        frame.push(result)                    # record result on stack

The interpreter pops the two operand objects from the stack and passes them to the add method of the object space. In contrast to CPython, the PyPy interpreter does not determine the types of the objects, which is why it does not need to be adapted when new data types are to be supported. The TJIT compiler, automatically generated by the RPython toolchain, uses meta-tracing [Bol+09]. Therefore, at runtime of the actual Python program executed by the user, the PyPy interpreter, implemented as a stack-based VM in RPython, is traced and not the user program itself. Typically, a TJIT approach is based on a tracing VM which goes through the following phases [Cun10]:

Interpretation At first, the bytecode is interpreted as usual with the addition of a lightweight code for profiling the execution, to detect which loops run most frequently (i. e. hot loops). For this purpose, a counter is incremented at each backward jump. At a certain threshold, the VM enters the tracing phase.

Tracing The interpreter records all instructions of a whole hot loop iteration. This record is called a trace, which is passed to the JIT compiler. The trace is a list of instructions with their operands and results.

Compilation The JIT compiler turns a trace into efficient machine code that is immediately executable and can be reused for the next iteration of the hot loop.

Running The compiled machine code is executed.

The phases above represent only the nodes of a graph with many possible paths, which is not linear. For ensuring correctness, a trace contains a guard at each point where the path in the control flow graph (CFG) could have followed another branch, e. g., at conditional statements. If a guard fails, the VM falls back into interpretation mode. However, the meta-tracing approach of PyPy is different. As the traced program is the PyPy interpreter itself and not the interpreted program, a hot loop is the bytecode dispatch loop (and for many simple interpreters this is the only hot loop). Tracing one iteration of this loop means that the recorded trace corresponds to executing one opcode

instruction of the user program), and it is very unlikely that the same opcode is executed many times in a row. Therefore, the corresponding guard will fail, meaning that the performance is not improved. It would be better if the execution of several opcodes could be traced, which would effectively unroll the bytecode dispatch loop. Ideally, the bytecode dispatch loop should be unrolled exactly so far that the unrolled version corresponds to one loop in the interpreted user program. Such user loops can be recognized if the program counter (PC) of the PyPy interpreter VM has the same value several times. Since the JIT cannot know which part of the PyPy interpreter represents the PC of the VM, the developer of the interpreter needs to mark the relevant variables with a so-called hint. More on meta-tracing can be found in [Bol+09]. PyPy provides different parameters controlling the behavior of JIT compilation with some magic numbers [BL15]; they can also be adjusted by the user, as sketched after the following list:

Loop threshold Determines the number of times a loop must be iterated to be identified as a hot loop (default: 1619);

Function threshold Determines how often a function must be called to be traced from the beginning (default: 1039);

Trace eagerness If guard failures happen above this threshold, the TJIT attempts to translate the sub-path from the point of the guard failure to the loop's end, which is called a bridge (default: 200).
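These parameters can be changed without modifying PyPy itself, either on the command line or at runtime. The following is a minimal sketch, assuming the parameter names threshold, function_threshold, and trace_eagerness used by the PyPy documentation and the pypyjit module, which only exists when running under PyPy:

import pypyjit  # available in PyPy only, not in CPython

# make loops become hot earlier and keep the bridge threshold at its default
pypyjit.set_param("threshold=500,function_threshold=1000,trace_eagerness=200")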

Memory Management in PyPy

Since PyPy's initial release in 2007, several garbage collection methods, all without reference counting, have been implemented, such as Mark and Sweep, Semispace Copying Collector, Generational GC, Hybrid GC, Mark & Compact GC, and Minimark GC [PyPb]. Currently, the default one is Incminimark, a generational moving collector [PyPd]. Since Incminimark is an incremental GC, the major collection is incremental (i. e. there are different stages of collection). The goal is not to have any pause longer than 1 ms, but in practice this depends on the size and characteristics of the heap, and there can be pauses of 10 to 100 ms.

7.1.3 Numba

Numba is an open-source JIT compiler translating a subset of Python and NumPy into machine code [Numa] using the LLVM compiler infrastructure [LLV]. Most commonly, Numba is used through so-called decorators applied to code parts that shall be compiled instead of being interpreted by


CPython. Numba is therefore not an alternative to CPython but an extension to it, and it is available for Python 2 and Python 3.

Features of the Numba environment

Since code compilation can be time intensive, only code parts that have a high share in the total execution time should be compiled. There are two modes in which the compiler treats the code [Anad]:

Nopython mode Numba generates code which is independent of the Python C API, which is the interface for C programs to the Python interpreter. A function can be compiled in nopython mode only if a data type can be assigned to all objects accessed by the function. In nopython mode, atomic (i. e. thread-safe) reference counters are used [Anaa] instead of the ones in CPython, which are not thread-safe.

Object mode Numba generates code which declares all objects as Python objects on which operations are performed with the aid of the Python C API. Therefore, the performance improvement is lower than in nopython mode, unless so-called loop-jitting can be applied by Numba. In the latter case a loop is automatically extracted and compiled in nopython mode, which is possible if the loop contains nopython-supported operations only [Anae].

Numba supports standard data types such as int16, float32, and complex128 with a precision of up to 64 bit per value [Anaf]. For the compilation of a function by Numba, a decorator must be written before the function:

@jit
def f(x, y):
    return x + y

There are two possibilities for using the jit-decorator:

Lazy compilation Numba determines the function parameters' types as well as the result type at run time, compiling specialized code for different input data types.

Eager compilation The programmer specifies all data types manually; in case of the example above the type definition could be: @jit(int32(int32, int32))

Moreover, the following arguments can be set to True in the decorator [Anac]:

nopython Numba tries to compile the function in nopython mode and emits an error message if this is not possible, instead of automatically falling back to object mode.

cache Numba saves the machine code of the compiled function instead of compiling it at each call.

nogil Since atomic reference counters are used in nopython mode, the GIL can be disabled, leading to a truly concurrent execution of parallel running threads.

Supporting the NumPy module, Numba provides the possibility to build NumPy universal functions (ufuncs). A ufunc is a function that operates on ndarrays (for definition see Sect. 7.1.1) in an element-by-element fashion, supporting several standard features [Scic]. Hence, a ufunc is a vectorized wrapper for a function that takes a fixed number of specific inputs and produces a fixed number of specific outputs. The wrapper therefore enables applying the wrapped function to ndarrays of variable length. For the generation of a ufunc, the vectorize-decorator is used, which allows lazy and eager compilation. In case of lazy compilation, where no data types are defined, a dynamic universal function (DUFunc) is built, which behaves like a ufunc with the difference that machine code is compiled for new loops if the given data types cannot be cast to the types of the existing code. In case of ufuncs, an error is thrown if the provided data cannot be cast [Scib]. The advantage of ufuncs over functions compiled with the jit-decorator is the support of features like broadcasting. Basic operations on ndarrays are performed element-wise, which works on arrays of the same size. The broadcasting conversion, however, defines a way of applying operations to arrays of different sizes as specified in [Scia]. The vectorize-decorator supports functions operating on scalars only, while guvectorize allows multi-dimensional arrays as input and output. Unlike with vectorize, in guvectorize signatures also the dimensions of the inputs and their relations must be provided in a symbolic way. A guvectorize-decorator for the known matrix-matrix multiplication could be used as follows:

@guvectorize(["void(int32, float64[:,:], float64[:,:], float64[:,:])"],
             "(),(m,m),(m,m)->(m,m)", nopython=True)
def multiplication(n, A, B, C):
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k]*B[k][j]

In both decorators the nopython parameter can be specified to avoid a fallback to object mode. Numba does not support the whole Python language in nopython mode. Moreover, not all Standard Library modules of Python are supported. More on both can be read in [Anah]. However, NumPy is well integrated [Anag].
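As an illustration of the vectorize-decorator and broadcasting, the following sketch (a hypothetical function, not one of the benchmark codes) builds a ufunc from a scalar function and applies it to arrays of different shapes:

import numpy as np
from numba import vectorize, float64

@vectorize([float64(float64, float64)], nopython=True)
def weighted_sum(a, b):
    return 0.75*a + 0.25*b

x = np.linspace(0.0, 1.0, 4)                 # shape (4,)
Y = np.linspace(0.0, 1.0, 8).reshape(2, 4)   # shape (2, 4)
Z = weighted_sum(x, Y)                       # broadcasting yields shape (2, 4)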


General Procedure of Numba

Figure 7.4 shows the stages of the Numba compiler [Anab]:

1) Bytecode Analysis Numba analyzes the function bytecode to find the CFG.

2) Numba-IR Generation Based on the CFG and a data flow analysis, the bytecode is translated to Numba's intermediate representation (IR), which is better suited for analysis and translation, as it is not based on a stack representation (used by the Python interpreter) but on a register machine representation (used by LLVM).

3) Macro Expansion This step converts specific decorator attributes (e. g. CUDA intrinsics for grid, block, and thread dimension) into Numba-IR nodes representing function calls.

4) Untyped IR Rewriting Certain transformations on the untyped IR are performed, e. g., for the detection of certain kinds of statements.

5) Type Inference The data type determination is performed as explained for lazy and eager compilation with fallback to object mode or error in nopython mode.

6a) Typed IR Rewriting Optimizations like loop fusion are performed, where two loops with operations on the same array are merged together into one loop.

6b) Automatic Parallelization This stage is performed only if the parallel parameter is passed to a jit-decorator, for automatic exploitation of the parallelism present in the semantics of operations in Numba-IR.

7a) Nopython Mode LLVM-IR Generation If a Numba type was found for every intermediate variable, Numba can (potentially) generate specialized native code. This step is called lowering, as Numba-IR is an abstract high-level intermediate representation while LLVM-IR is a machine-dependent low-level representation. The LLVM toolchain is then able to optimize this into efficient code for the target CPU.

7b) Object Mode LLVM-IR Generation If type inference fails to find Numba types for all values inside a function, it is compiled in object mode, which generates significantly longer LLVM-IR, as calls to the Python C API are performed for basically all operations.

8) LLVM-IR Compilation The LLVM-IR is compiled to machine code by the LLVM JIT compiler.



Figure 7.4: Numba compilation stages

Numba itself does not implement a vectorization of the Numba-IR, but LLVM can apply automatic vectorization for single instruction, multiple data (SIMD) capable CPUs [Anai]. LLVM's behavior in this respect can be changed by Numba environment variables [Anaj].

7.1.4 Cython

Cython is the name of a compiled programming language and of an open-source project, written in Python and C, that implements a Cython compiler with static code optimization [Cyt]. The Cython language is intended to combine the simplicity of Python with the performance of C/C++, as sketched in Fig. 7.5 [Behc], using mostly the usual Python syntax with additional C-inspired constructs. It largely supports Python 2 as well as Python 3 and extends Python by C data types and structures. A detailed

documentation of the differences in semantics between the compiled code and Python is provided in [Beha].

Cython Extending Python

In Cython it is possible to optimize Python code by static variable declarations such as

cdef int i

as Cython supports all basic C data types as well as pointers, arrays, typedef-ed alias types, structs / unions, and function pointers. Furthermore, also Python types such as list and dict can be declared statically. Variables without a static declaration are handled by the Cython compiler, with the aid of the Python C API, as Python objects. Moreover, it is possible to let the Cython compiler type variables statically in an automatic way, in certain functions or even in the whole code, with the following compiler directive:

@cython.infer_types(True)

The compiler then tries to find the right data types based on the assignments in the related code. However, static typing is not intended for the whole program; only the variables within parts that are relevant for the performance should be statically typed. In any case, a conversion from Python objects to C or C++ types or objects is unavoidable, as will become apparent later. A Python integer, for instance, can be converted to char, int, long, etc., and a Python string can be converted to a C++ std::string [Smi15]. Python and C functions are similar in that they take arguments and return values, but Python functions are more powerful and flexible, which makes them potentially slower. Cython therefore supports Python as well as C functions, which can call each other.
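A minimal sketch of this automatic type inference (the function is hypothetical; it assumes the usual cimport of the cython module):

cimport cython

@cython.infer_types(True)
def geometric_sum(double q, int n):
    total = 0.0          # inferred as a C double
    term = 1.0           # inferred as a C double
    for i in range(n):   # i may be inferred as a C integer
        total += term
        term *= q
    return total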


Figure 7.5: Comparison of Cython with other programming languages


A Python function is valid Cython code and can contain static type definitions as introduced before. Such Python functions can be directly called by external Python code. A C function can be included via a wrapper or implemented directly in Cython, in which case it is declared with the keyword cdef instead of def. Contrary to Python code, C code is not processed by the Cython compiler, as will also become apparent later. A cdef function thus is a C function that is implemented in Python-based syntax. The types of the function arguments and the return type are defined statically. In cdef functions, C pointers as well as structs and further C types can be used. Moreover, the call of a cdef function is as performant as the call of a pure C function through a wrapper, and the call overhead is minimal. It is also possible to use Python objects as well as dynamically typed variables in cdef functions and to pass them to the function as arguments. A cdef function cannot be called from external Python code, but it is possible to write a Python function within the same module which is externally visible and calls the cdef function, as for example the following one:

def externally_visible_cfunction_wrapper(argument):
    return cfunction(argument)

A third possibility for implementing a function is provided by cpdef, which combines the accessibility of Python functions with the performance of C functions [Rosa]. There is a restriction on such Cython functions, as the data types of the arguments and the return value must be compatible with both C and Python. While each Python object can be represented in C, not every C type can be represented in Python, as for example C pointers and arrays. Cython provides a set of predefined Python and C/C++ related header files with the filename extension .pxd. Most important is the C standard library libc with the header files stdlib, stdio, math, etc. The same applies for the Standard Template Library (STL), with the option to make use of containers such as vector, list, map, etc. Cython allows an efficient access to NumPy's ndarrays (for definition see Sect. 7.1.1), which are declared in a separate .pxd file as NumPy is written in C. Besides ndarrays, also Python's array module can be used efficiently, as the underlying elementary C array is accessed directly. Since it is possible to access Python functions at runtime, they are not defined in header files; both their declarations and definitions are located in the implementation files with the filename extension .pyx.
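The different function kinds can be combined as in the following sketch (hypothetical functions, not taken from the benchmarks; integrate_midpoint is callable from Python, while f is only visible inside the module):

cdef double f(double x):
    # pure C helper function, not callable from external Python code
    return 4.0 / (1.0 + x*x)

cpdef double integrate_midpoint(double a, double b, int n):
    # callable from Python and, with C-level overhead only, from this module
    cdef double s = 0.0
    cdef double dx = (b - a) / n
    cdef int i
    for i in range(n):
        s += f(a + (i + 0.5)*dx)
    return s*dx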


Cython Compilation Pipeline

Cython produces a standard Python module, but in an unconventional manner that is depicted in Fig. 7.6. A script (here: setup.py) is used to start the setuptools build procedure, which translates the Cython implementation file(s) (here: hello.pyx) to optimized and platform independent C code (here: hello.c) with the aid of the Cython compiler. For instance, the mult function

def mult(a, b):
    return a * b

is compiled to several thousand lines of C code which mainly consist of defines for portability reasons:

__pyx_t_1 = PyNumber_Multiply(__pyx_v_a, __pyx_v_b);
if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 2, __pyx_L1_error)
__Pyx_GOTREF(__pyx_t_1);
__pyx_r = __pyx_t_1;
__pyx_t_1 = 0;

The generated code contains automatically generated variable names which make it hard to read. However, this is no problem, since no manual changes to it are expected. The first line invokes the function PyNumber_Multiply from the Python C API, which performs a multiplication between two Python objects that are passed in form of pointers to their addresses. The if-statement checks whether the multiplication was successful, and GOTREF implements the reference counting.


Figure 7.6: Cython’s workflow for Python module building [Dav]


Using Cython's static type declarations, the Cython code, assuming int was used for both operands, is translated to the following C code:

__pyx_t_1 = __Pyx_PyInt_From_int(__pyx_v_a * __pyx_v_b);
if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 2, __pyx_L1_error)
__Pyx_GOTREF(__pyx_t_1);
__pyx_r = __pyx_t_1;
__pyx_t_1 = 0;

Here, the multiplication is performed directly in the first line of the C code above, and the result is converted to a Python integer. It is also possible to let Cython generate C++ instead, but the default target language is C. The code emitted by the Cython compiler can also be adapted by some directives listed in [Carb]. Afterwards, the generated C code is compiled by a C compiler such as gcc [GCC] or Clang [Cla] to a shared library file (here: hello.so on POSIX systems and hello.pyd on Windows). These shared libraries are called C extension modules and can be used like pure Python modules after a usual import. Depending on the setuptools script that is used, an extension module for the particular Python environment is generated. Therefore, Cython is not self-contained, as it depends on a Python environment such as CPython or PyPy.
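A minimal build script for the workflow of Fig. 7.6 could look as follows (a sketch; the file name hello.pyx is taken from the figure, and cythonize is the build helper shipped with the Cython package):

# setup.py
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("hello.pyx"))

Running python setup.py build_ext --inplace in the desired Python environment then produces hello.c and the extension module (hello.so or hello.pyd), which can be imported like any other Python module.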

Parallel Programming in Cython

With the nogil keyword after a function definition, the GIL is released:

cdef int function(double x) nogil:

After the return, the GIL is active again. Also external C/C++ functions can be used for concurrent processing by multithreading with nogil:

cdef extern from "header.h":
    double function(double var) nogil

This is possible only if no Python objects are used within the function. Building on this, OpenMP can also be used in an efficient manner [Behb].
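A minimal sketch of such an OpenMP-backed loop using Cython's prange (the function is hypothetical; the extension module must be compiled and linked with OpenMP support, e. g. with -fopenmp):

from cython.parallel import prange

def parallel_sum(double[:] data):
    cdef double total = 0.0
    cdef Py_ssize_t i
    for i in prange(data.shape[0], nogil=True):
        total += data[i]   # recognized by the Cython compiler as a reduction
    return total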

7.2 Benchmarking Methodology

The benchmarking of the different environments for a high performance execution of Python programs was performed with the aid of the following algorithms:

Quicksort A divide-and-conquer based sorting algorithm [Cor+01].


Dijkstra Finds the shortest paths between a start node and all other nodes in a graph [Cor+01].

AVL Tree Insertion Insertion of values into an Adelson-Velsky and Landis (AVL) tree, which is a self-balancing binary search tree [Cor+01].

Matrix-Matrix Multiplication Performs the multiplication of two quadratic matrices of the same size.

Gauss-Jordan Elimination Solves a system of linear equations by row reductions [Sto+13].

Cholesky Decomposition Computes the decomposition of a symmetric and positive definite matrix into a lower left triangular matrix and its transpose. Their product equals the original matrix [Sto+13].

PI Calculation Iterative algorithm for the approximation of π based on an integration using the rectangle rule [Qui03].

These algorithms were chosen to represent different algorithm categories: classical data processing on common ADTs such as lists, graphs, and trees on the one side and numerical mathematics on the other side. All algorithms except the PI calculation were implemented sequentially. The PI algorithm was chosen as a known example of a perfectly parallel workload. As such, it can be utilized to benchmark how well the individual environments perform when the workload can be parallelized in an optimal way (i. e. with no synchronization / communication between the parallel processors). All algorithms are implemented in an iterative (not recursive) manner and in the following languages:

• C++, as a time and memory efficient object-oriented compiled programming language

• Pure Python 2

• Pure Python 3

• Pure Python 3 with NumPy

• Pure Python 3 with NumPy and Numba decorators

• Python 3 with Cython


For some algorithms, certain implementations are not available, for instance because there was no reasonable use of a NumPy ndarray in the AVL tree implementation. Furthermore, no additional PyPy-specific implementations are needed. The source code can be obtained by contacting the author.

Matrix-Matrix Multiplication Implementation as Example

For the matrix-matrix multiplication in Python, the code presented in Sect. 7.1.1 is used. In C++ the matrix is implemented based on a struct with an elementary two-dimensional array for which memory is allocated dynamically from the heap:

struct Matrix { int n; double** doublePtr; };

Matrix newMatrix(int n) {
    Matrix mat;
    mat.n = n;
    mat.doublePtr = new double*[n];
    for (int i = 0; i < n; i++)
        mat.doublePtr[i] = new double[n];
    return mat;
}

In pure Python the matrices are represented by lists and initialized with the aid of a list comprehension in Python 2 and Python 3:

def newMatrix(n):
    return [[0 for x in range(n)] for y in range(n)]

For the ndarray the same data types as in the C++ version are used:

def newMatrix(n):
    return np.zeros(shape=(n, n), dtype='float_')

Relevant for the execution time are the three nested loops, which is why the corresponding function in the Numba version is based on the ndarray and has the following decorator:

@guvectorize(["void(int32, float64[:,:], float64[:,:], float64[:,:])"],
             "(),(m,m),(m,m)->(m,m)", nopython=True)
def multiplication(n, A, B, C):

In Cython, the function that performs the actual multiplication was equipped with static type definitions, and the ndarray was used for efficiency reasons as well:

def multiplication(int n,
                   np.ndarray[np.float64_t, ndim=2] A,
                   np.ndarray[np.float64_t, ndim=2] B):
    cdef np.ndarray[np.float64_t, ndim=2] C = newMatrix(n)
    cdef:
        int i
        int j
        int k
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i,j] += A[i,k]*B[k,j]

Here again, the data types are of the same precision as in the C++ version. The further algorithms are implemented in a similar way to the matrix-matrix multiplication presented here.

Realization of the Measurements

The execution time (wall clock time) of each runtime environment, or of each executable in the case of C++, was measured with the shell command time [IEE18]. For the memory usage measurements, libmemusage.so was used, and the captured value was the heap peak as defined in [Man]. To make sure that a given algorithm is executed with the same values, pseudo random generators were implemented to obtain the same input data in each run and runtime environment. For comparison reasons, automatic vectorization for CPUs with SIMD instructions was disabled in all Python environments as well as during the Cython and C++ compilation.

7.3 Comparative Analysis

A comparative analysis of the different Python runtimes against C++ was accomplished on the following computer system.

Measurement Environment

All measurements in this section were accomplished on a server with 2 sockets, each with an Intel Xeon X7550 2.0 GHz (2.4 GHz Turbo) 8-core CPU with Hyper-Threading (HT); 256 GB DDR3 main memory; running an x86_64 Scientific Linux 6. The following software packages were used:

• CPython2 v2.7

• CPython3 v3.6.0 in combination with
  – NumPy 1.12.0
  – Cython 0.25.2
  – Numba 0.31.0

• PyPy2 v5.6.0 in combination with


  – NumPy 1.12.0
  – Cython 0.25.0

• PyPy3 v5.5.0

• Clang v3.9.1 with LLVM v3.9.1

PyPy's NumPyPy module was not considered as it was too incomplete at the time of the analysis. The legends of the plots show the different measured cases with the following meanings:

CPython2 Implementation in pure Python 2 and executed by CPython2

CPython3 Implementation in pure Python 3 and executed by CPython3

C++ Implemented in C++ and compiled by Clang

Cython (Pure Python) Implementation in pure Python 3, translated to C and compiled by Clang. The thus generated extension module is imported by a Python file

Cython (Optimized) Implementation in Cython with C data types, translated to C and compiled by Clang. The thus generated extension module is imported by a Python file

CythonPyPy (Optimized) Analogous to "Cython (Optimized)". The extension module is utilized with the aid of cpyext, PyPy's subsystem which provides a compatibility layer to compile and run CPython C extensions inside PyPy [Cun]

Numba Implementation in Python 3

Numba + NumPy Implementation in Python 3 with the aid of ndarray

PyPy2 (Pure Python) Implementation in pure Python 2, executed by PyPy 2

PyPy3 (Pure Python) Implementation in pure Python 3, executed by PyPy 3

PyPy2 + NumPy Implementation in Python 2 with ndarray, utilized with the aid of the cpyext subsystem, and executed by PyPy 2

CPython3 + Array Implementation in Python 3 with array module and executed by CPython3



Figure 7.7: Execution times for Quicksort

Quicksort Analysis as Example

In Fig. 7.7 the execution times of the various Quicksort implementations are plotted with a logarithmic y-scale. Due to much higher execution times than in the other cases, the measurement of PyPy2 with NumPy was aborted when 2 million elements were to be sorted. The other measurements can be divided into a slower and a faster group. The slower group contains, among others, both reference Python environments, CPython2 and CPython3. Over the whole measurement, the CPython2 solution was around 30 % faster than CPython3, but the Cython-compiled version of the pure Python 3 code is even faster than both CPython versions. The usage of the arrays from the NumPy and the array module has no positive effect; CPython3 with NumPy is even slower than pure Python on CPython3. In the faster group, PyPy2 and PyPy3 have similar execution times: for a size of 10 million elements they are around 35 times faster than pure Python on CPython3. Even faster are both Numba cases (pure Python and with NumPy) and the optimized Cython module built for PyPy2 as runtime environment. Only the optimized Cython module for CPython3 has a higher performance than all other versions and is almost as fast as the C++ implementation. For more execution time measurements see Appendix B.1.



Figure 7.8: Memory consumption (maximum heap peak) for Quicksort

In Fig. 7.8 the memory consumption (i. e. maximum heap peak) of all previously mentioned Quicksort implementations is plotted, now with a linear y-scale. There, PyPy2 with NumPy and CPython2 show a markedly higher memory consumption than the other cases, which is why they are plotted separately in Fig. 7.9. Since the execution times of the three cases PyPy2 + NumPy, CPython3 + Array, and CPython3 + NumPy were very high, their memory measurements were aborted at 2 million elements. The resulting plot lines can be divided into three groups. The group with the pure Python implementation on Numba and the array module based implementation on CPython3 shows the highest memory consumption. The second group consists of PyPy2, PyPy3, CPython3, and the Cython module for CPython3 (all four pure Python). In the most memory efficient group are Numba + NumPy and CPython3 + NumPy. Moreover, the optimized Cython implementation for CPython3 has the lowest memory consumption, which corresponds to that of the C++ implementation. For more memory consumption measurements see Appendix B.2.



Figure 7.9: Memory consumption (maximum heap peak) for Quicksort of selected runtime environments

Parallel Processing

For the PI calculation, the best-of-three execution times measured for each case are plotted in Fig. 7.10. The integration of the PI calculation was performed with 1 trillion rectangles. Depending on the number of threads to be forked for a measurement, a range of rectangles was assigned to each thread to parallelize the work, comparing multithreading and multiprocessing. PyPy was not considered as it has a GIL like CPython. As expected, there are therefore no speedups gained from multithreading in case of CPython3 and CPython2. Instead, the JIT compiled Numba case (pure Python 3 plus jit-decorator with nopython = True, nogil = True in its signature) and the AOT compiled Cython case (optimized with static types) achieve speedups through multithreading. The C- and Cython-based solutions with four threads have an execution time of 2.5 s, while 10 s were needed with one thread, which equals a parallel speedup of 4 and an efficiency of 100 %. The multiprocessing case (pure Python 3) also leads to speedups, but these are much lower.
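A minimal sketch of this approach (not the thesis code; it assumes the common rectangle-rule formulation of π as the integral of 4/(1+x²) over [0, 1] and uses hypothetical function names):

import threading
from numba import jit

@jit(nopython=True, nogil=True)   # nogil releases the GIL inside the compiled code
def partial_pi(start, stop, n):
    # sum the midpoint rectangles of the sub-range [start, stop) out of n rectangles
    h = 1.0 / n
    s = 0.0
    for i in range(start, stop):
        x = (i + 0.5) * h
        s += 4.0 / (1.0 + x * x)
    return s * h

def pi_threads(n=10_000_000, num_threads=4):
    results = [0.0] * num_threads
    def worker(t):
        lo = t * n // num_threads
        hi = (t + 1) * n // num_threads
        results[t] = partial_pi(lo, hi, n)
    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return sum(results)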



Figure 7.10: Execution times for PI calculations with multiple threads

7.4 Conclusion and Outlook

In this chapter, a comparative analysis of various High Performance Python environments was presented. For this, benchmark algorithms from different categories were chosen and also implemented in C++ as the reference language. Furthermore, great care was taken to ensure equivalent implementations in order to achieve similar conditions for the different cases. Hence, only one C/C++ compiler (i. e. Clang) with the same options was used for the compilation of the Cython and C++ solutions. Among the sequential Python-based solutions, Cython optimized by static data type declarations and built for CPython3 has shown the shortest execution times in all test cases and could always compete with the C++ implementations. Even the unoptimized Cython builds for CPython3 have led to performance gains of 40 to 50 %. The execution times for PyPy2 builds have shown great variations. Numba solutions have also led to high performance gains despite the longest startup times of the runtime environment. Both PyPy versions have had similar execution times, which in case of pure Python were always at least one order of magnitude faster than the CPython environments.


The memory consumption measurements regarding the heap peak during runtime have shown the highest consumption for CPython2. Both PyPy versions have also shown a higher memory consumption than CPython3. For particularly large data structures, it has been shown that a very efficient memory consumption can be achieved through the usage of static data types in Cython built for CPython3. In case of Cython built for PyPy2, the results have shown great variations depending on the algorithm. Numba has also led to a higher memory efficiency than CPython. The multithreading benchmark has shown no parallel speedup for both CPython versions because of the GIL. In case of Numba and Cython, where the GIL could be disabled, multithreading led to high parallel speedups. Especially in case of Cython, the perfectly parallelizable PI calculation led to a parallel efficiency of around 100 %. The usage of multiprocessing in case of CPython led to low speedups only. The presented comparative analysis gives Python programmers an overview of the analyzed solutions for improving the performance of their Python programs. Moreover, it provides information on how much effort is required to apply a certain solution on the one hand and which gain can be expected on the other. In future work, a closer look should be taken at Numba's precompiled functions that are vectorized. These were not considered in this work in order to achieve comparable implementations between the various solutions. Furthermore, besides multithreading and multiprocessing as parallel paradigms suitable for shared-memory computer architectures, paradigms that are suitable for distributed-memory machines, such as the Message Passing Interface (MPI), should also be included in future considerations. An implementation of MPI for Python is mpi4py [Dal]. Since all Python runtime environments and the language itself are under active development, a framework for automated, ongoing comparative analyses and result presentation would be useful.

8 HPC Network Communication for Hardware-in-the-Loop and Real-Time Co-Simulation

A digital real-time simulator (DRTS) for power grids reproduces, with a desired accuracy, voltage and current waveforms that represent the behavior of the real power grid being simulated. To be RT-capable, the DRTS needs to solve the power grid model equations for each time step within the time passed in the real world (i. e. according to the wall clock time) [Far+15; BDD91]. To achieve this, outputs are generated at discrete time intervals while the system states are computed at discrete points in time with a fixed time step. In [Far+15], two classes of digital real-time (RT) simulations are defined: full digital RT simulations, which are modeled completely in the DRTS, and (power) Hardware-in-the-Loop (HiL) RT simulations, which can exchange simulation data with real hardware through I/O interfaces. Besides improving DRTSs, e. g., with the aid of more performant numerical algorithms, to be able to simulate increasingly complex models of Smart Grids in real time, it is also possible to distribute a simulation among multiple DRTSs. An approach for coupling DRTSs in laboratories even from different countries is presented in [Ste+17]. The coupling of this so-called geographically distributed real-time simulation (GD-RTS) was performed with the VILLASframework [Vog+17], abbreviated in the following as VILLAS, which was chosen for the integration of InfiniBand (IB).


In the following section, the fundamentals of VILLAS are covered to motivate its choice for the integration of IB. The fundamentals of IB are introduced in the subsequent section. Then, the concept of the IB support in VILLAS is presented and compared with the other interconnection methods available in VILLAS. Finally, the chapter is concluded and an outlook on future work is given. This chapter presents outcomes of the supervised thesis [Pot18].

8.1 VILLAS Fundamentals

The VILLASframework is a collection of open-source software packages for local and geographically distributed RT (co-)simulations. VILLASnode is the part of the collection that can be used as a gateway for simulation data. It supports several interfaces (called node-types) from three classes:

internal communication such as file for logging and replay, shmem for shared-memory communication, signal for test signal generation, etc.;

server-server communication such as socket for UDP/IP communication, mqtt for Message Queue Telemetry Transport (MQTT) communication, websocket for WebSocket based communication, etc.;

simulator-server communication such as opal for connections to OPAL-RT devices, gtwif for connections to RTDS devices, fpga for connections to VILLASfpga PCI-e cards, etc.

The instance of a node-type is called a node. In Fig. 8.1, besides VILLASnode, also the VILLASweb component of the whole framework is depicted. As sketched in the figure, a lab(-oratory) can contain multiple nodes which are used as gateways between software (SW) and hardware (HW) solutions. The interconnected nodes can run on the same or on different host systems in one or multiple labs. Some of the nodes are hard or only soft RT capable, depending on their node-type. While there are hard RT capable node-types for internal and simulator-server communication, there was no such node-type for server-server communication, because these all depend on the Internet Protocol (IP), which is mostly used with Ethernet based interconnects for local area networks (LANs). One problem of Ethernet based solutions is the relatively high latency of data transfers, caused in part by the network protocol stack of the operating system [Lar+09]. Another problem of Ethernet based solutions is that quality of service (QoS) support is very limited.


That is why the latencies of the data transfers have a relatively high variability, which is a disadvantage for hard RT applications. To achieve hard RT between different computer hosts, IB was used as an alternative technology designed for low-latency and high-throughput server-server and device-server communication (e. g. for interconnecting storage solutions with computer clusters). The following section introduces how these properties of IB are achieved.

8.2 InfiniBand Fundamentals

Before an introduction to IB with its benefits, the main difference to the classical utilization of network interface controllers (NICs) must be explained. Usually, NICs are utilized through sockets (also called Berkeley or BSD sockets), an Application Programming Interface (API) for inter-process communication (IPC) that originates from the Unix-like Berkeley Software Distribution (BSD) operating system (OS) [Tan09] and was standardized with little modification in the Portable Operating System Interface (POSIX) specification. However, socket API implementations are not only part of POSIX-conformant OSs but, e. g., also of Windows and Linux. The focus in this chapter is on the latter, as the approach presented here was implemented based on Linux. A POSIX socket is a user space abstraction


Figure 8.1: Overview of the VILLASframework

of network communication, which mainly uses the operating system kernel based TCP/IP or UDP/IP stack (Transmission Control Protocol (TCP), User Datagram Protocol (UDP), IP) [Ker]. The network communication through sockets is accomplished via function calls on the socket. As in many other OSs, these user space calls are mapped onto system calls (i. e. OS kernel function calls), which generate so-called traps (a type of synchronous interrupt) or sysenter instructions on modern computer architectures, letting the central processing unit (CPU) switch from user to kernel mode [Ker10; Tan09]. The switches between user and kernel mode (and back) can be expensive in relation to the data transfer through the NIC itself. This and other drawbacks were the reason for the development of the virtual interface architecture (VIA) [Com97]. Some of the VIA characteristics mentioned in [Dun+98] are the avoidance of system calls whenever possible, data transfers with zero-copy, no interrupts for initializing and completing data transport, and a simple set of instructions for exchanging data. Therefore, some of the tasks which in case of standard sockets (i. e. such that are mapped onto standard kernel sockets) are handled by the IP stack, such as data transfer scheduling, must in the VIA be handled by the NIC. Contrary to standard sockets, the VIA provides virtual interfaces (VIs), which are direct interfaces to the NIC; each process can assume that it owns the interface and that no system calls are needed during data transfers. Each such VI consists of a send and a receive (work) queue, which can hold descriptors that contain all information needed for data transfers, such as the destination address, the transfer mode, and the location of the payload in main memory. After completed transfers (with or without an error), the descriptors are marked by the NIC. Usually, the so-called VI consumer, residing in user space, is responsible for removing processed descriptors from the work queues. Alternatively, on creation, a VI can be bound to a Completion Queue (CQ) to which notifications on completed transfers are directed. Each CQ has to be bound to at least one work queue, which means that notifications of multiple work queues can be directed to a single CQ. The VIA supports the two following asynchronously operating data transfer models:

Send and receive messaging model (channel semantics) The receiving computer node (in this section referred to as node) specifies where in its local memory received data shall be saved by submitting a descriptor to the receive work queue. Afterwards, a sending node acts analogously with its data to be sent and the send work queue.


Remote Direct Memory Access (RDMA) model (memory semantics) The so-called active node specifies the local memory region and the remote memory region of the so-called passive node. There are two possible operations in the RDMA model: In case of an RDMA write transfer, the active node specifies with the local memory region the data to be sent, while with the remote memory region it specifies where the data shall be stored. In case of an RDMA read transfer, the active node makes the analogous specifications. To initiate an RDMA transfer, the active node specifies the local and remote memory addresses as well as the operation mode in a descriptor and submits it to the send work queue. The operating system and other software on the passive node do not actively participate in the (write or read) transfer. Therefore, no descriptors are submitted to the receive queue at the passive node.

8.2.1 InfiniBand Architecture

The InfiniBand Architecture (IBA) makes use of the abstract VIA design decisions [Pfi01]. The InfiniBand Trade Association (IBTA), founded in 1999 by more than 180 companies, describes the IBA in [Inf07] and the physical implementation of IB in [Inf16].

Network Stack

In Fig. 8.2 the IBA is depicted in the form of a network stack, which consists of a physical, link, network, and transport layer. Hints for the IBA realizations are given to the right of the various layers.

Endnodes and Channel Adapters

The communication within an IB network takes place between (end)nodes, which can be, e. g., a server node or a storage system in a computer cluster. A Channel Adapter (CA) is the interface between a node and a link. It can be either a Host Channel Adapter (HCA), which is used in computer hosts and supports certain software features defined by so-called verbs, or a Target Channel Adapter (TCA), which has no defined software interface and is normally used in devices such as storage systems.


Service Types

InfiniBand supports the following service types:

Reliable Connection (RC) A connection between nodes is established and messages are reliably transferred between them (optional for TCAs). One Queue Pair (QP), which is IB's equivalent to a VI, on a local node is connected to one QP on a remote node.

Reliable Datagram (RD) Single packet messages are transferred reliably without a one-to-one connection. A local QP can communicate with any other RD QP. This is optional and not implemented in the OFED stack (see Sect. 8.2.2).

Unreliable Connection (UC) Analogous to RC but unreliable (i. e. packets can get lost).

Unreliable Datagram (UD) Analogous to RD but unreliable.

Raw Datagram Packets are sent without IBA specific headers.

Message Segmentation

The payload is divided into messages between 0 B and 2 GiB for all service types except for UD, which supports messages between 0 B and 4096 B, depending on the maximum transmission unit (MTU). Messages bigger


Figure 8.2: Network stack of the InfiniBand Architecture (IBA)

than the MTU are segmented into smaller packets by the IB hardware, which therefore should not affect the performance in the way software based segmentation does [CDZ05]. In the following, QPs are explained further.

Queue Pairs and Completion Queues

Figure 8.3 shows an abstract model of the IBA. Like VIs, QPs have Send Queues (SQs) and Receive Queues (RQs), which enable processes to communicate directly with the HCA. Like descriptors in the VIA, Work Requests (WRs) are submitted to a work queue before a message transfer, resulting in Work Queue Elements (WQEs) in the queue. In case of a send WR, the WQE contains the address of the memory location containing the data to be sent. In case of a receive WR, the WQE contains the address of the memory location where received data shall be stored. Not every QP can access every memory location, due to memory protection mechanisms


Figure 8.3: InfiniBand Architecture (IBA) model

that also control which locations can be accessed by remote hosts and the HCA. A WQE in the SQ also contains the network address of the remote node and the transfer model (i. e. send messaging or RDMA).

Data Transmission Example

Figure 8.4 shows a sending and a receiving node, each with three QPs. Each QP is always initialized with a send and a receive queue, but for the sake of clarity the unused empty queues are not depicted. Before a transmission, the receiving node submits WRs to its RQs. In the figure, the receiving node's consumer is submitting a WR to the red RQ. Afterwards, WRs can be submitted to the SQs and will then be processed by the CA. While the processing order between queues depends on the priority of the services, on congestion control, and on the HCA, WQEs within a certain queue are processed in first in – first out (FIFO) order. In the figure, the sending node's consumer is submitting a WR to the black SQ and the HCA is processing a WQE from the blue SQ. After the HCA has processed a WQE, it places a Completion Queue Entry (CQE) in the completion queue, which, i. a., contains information about the processed WQE and the status of the operation, indicating a successful transmission or an error if, e. g., the queue concerned was full. A CQE is posted as soon as a WQE is processed, which depends on the used service type. For instance, in case of an unreliable type, a CQE is posted as soon as the HCA sends the data belonging to a send WR. Instead, in case of


Figure 8.4: InfiniBand data transmission example

a reliable type, the CQE is not posted until the message has been successfully received by the remote node. In the figure, the receiving node's HCA is consuming a WQE from the blue receive queue. After consuming a WQE, the HCA will write the received message into the memory location contained in the WQE and post a CQE. If the sending node's consumer has included so-called immediate data in the message, it will be present in the CQE of the receiving node.

Work Queue Entry Processing

After the submission of a WR to a queue by a process, the HCA starts processing the resulting WQE. In Fig. 8.3 it can be seen that an internal Direct Memory Access (DMA) engine accesses the memory location contained in the WQE and copies the data from that location to a local buffer of the HCA. Every HCA port has several such buffers, called Virtual Lanes (VLs). After this step, the arbiter of each port decides from which VL packets will be sent through the physical link. More on that and further details on the InfiniBand Architecture can be found in [Pot18].

8.2.2 OpenFabrics Software Stack

The IBA does not include a full API specification, in order to allow vendor specific APIs. In 2004 the nonprofit OpenIB Alliance was founded and later renamed to the OpenFabrics Alliance, which releases the open-source OpenFabrics Enterprise Distribution (OFED). OFED is a software stack including, i. a., software drivers, kernel code, and user-level interfaces such as verbs. Most InfiniBand vendors provide OFED based software, with little adaptions and enhancements, together with their hardware solutions. Figure 8.5 shows a sketch of the OFED stack [Mel18] in which the user and the kernel verbs can be seen; in this work, verbs always refers to user verbs.

Submitting Work Requests to Queues

The submission of WRs to the work queues allows user space processes to initiate data transfers through an HCA without intervention of the operating system kernel. As mentioned before, WQEs contain memory locations for data read and written by the HCA. A WR contains a pointer to a list with at least one scatter/gather element (sge), containing the memory address and length of a memory location as well as a local key for access control. Besides a list of sges, the receive WR structure contains only a few further data elements, such as a pointer to the next

receive WR. The send WR structure additionally contains further elements by which various (sometimes optional) features of HCAs can be enabled. The opcode element defines the operation used to send the associated message(s). Which operations are allowed depends on the QP the WR is submitted to and the chosen service type. Furthermore, send_flags can hold various flags defining how the send WR shall be processed. One of the flags is IBV_SEND_INLINE, which causes the data pointed to by the sge to be copied directly into the WQE by the CPU. This avoids a copy, performed by the HCA's DMA engine, from the host's main memory to the internal buffer of the HCA. The inline send mode is not defined in the original IBA and is therefore not supported by every HCA.


Figure 8.5: An overview of the OFED stack


Since it potentially leads to lower latencies and the buffers can be released for re-use immediately after submission of the send WR, the InfiniBand integration presented here makes use of the inline mode. More details about the OFED can be found in [Pot18].

8.3 Concept of InfiniBand Support in VILLAS

The InfiniBand support was implemented in the VILLASnode sub-project of the VILLASframework. Therefore, the VILLASnode component is introduced in the following.

8.3.1 VILLASnode Basics

As already mentioned in Sect. 8.1, VILLASnode supports different node-types. One VILLASnode instance, called super-node, can have multiple nodes that are sources and / or sinks of simulation data. Hence, a super-node can serve as a gateway for simulation data. In the context of VILLAS, a node is defined as an instance of a node-type from one of the three categories introduced in Sect. 8.1. The connections within a super-node are realized with paths between input and output nodes. A path starts from an input node obtaining data that can, optionally, be sent through a hook to modify the data (e. g. by a filter). Subsequently, the data is written to a FIFO queue (for buffering) before it can be sent through a register which can multiplex and mask it. After this, it can be manipulated by further hooks before it is passed to the output queue, which holds the data until the output node is ready. The data is transmitted as samples holding the payload (e. g. simulation data) together with metadata (timestamps and a sequence number). As a sample is the internal format for payload exchange between nodes of arbitrary types, its structure is kept simple to avoid overhead. Figure 8.6 depicts a super-node with five node-type instances: opal, file, socket, mqtt, and the additionally implemented IB node presented in this chapter. The paths (1 to 3) connect the nodes (n1 to n5) through hooks (h1 to h6), registers (r1 to r3), input queues (qi,1 to qi,5), and output queues (qo,1 to qo,4).

8.3.2 Original Read and Write Interface

For the interoperability between nodes of different types, various functions such as start(), stop(), read(), and write() must be provided by the implementation of a new node-type, in the form of assignments of the implemented functions' addresses to the specified function pointers, as for instance:

int (*read)(struct node *n, struct sample *smps[], unsigned cnt);
int (*write)(struct node *n, struct sample *smps[], unsigned cnt);

Some of the functions are optional and will be omitted if no implementation is available for a certain node-type.

Read Function in General

Figure 8.7 depicts the general procedure of the read function of an arbitrary node-type. During the call of a read function, the super-node passes the address of a field of sample addresses (*smps[]) of length cnt ≥ 1 for the data the super-node wants to read from the node. A sample contains, i. a., a sequence number, a reference counter (refcnt), and a field for the actual signal (i. e. payload such as, e. g., 64 bit integers and floating-point numbers). During the allocation of a sample by the


Figure 8.6: An example super-node with three paths connecting five nodes of different node-types

super-node, its refcnt is set to 1, and its memory will not be freed while refcnt > 1. Releasing a sample means decreasing its refcnt. Within the read function, the node (i. e. its receive module) is instructed to store at most cnt received samples in the passed list of samples. When the receiving module has finished copying ret ≤ cnt samples, the read function returns ret. After that, the super-node can process the samples by hooks before forwarding them to another node. Finally, all samples are released, usually resulting in the freeing of their memory.

Write Function in General

The general procedure of the write function of an arbitrary node-type is similar to that of the read function. Here the super-node passes a field with cnt samples to the function, which are copied to the send module within the write function. The send module tries to send all samples, which blocks the return of the write function. After the sending has finished, the number of sent samples, ret, is returned. If ret is not equal to cnt, the super-node handles the sending error appropriately. In any case, the refcnt of all cnt samples is decremented.


Figure 8.7: General read function working principle in VILLAS


8.3.3 Requirements on InfiniBand Node-Type Interface

InfiniBand with its zero-copy principle, inherited from the VIA, requires that the receive / send modules do not copy any data between their local buffers and the super-node's buffers. Instead, pointers to the super-node's buffers and their lengths should be passed to the HCA, which uses them directly for received data and for data to be sent. In the following, the desired behavior of the read and write functions is sketched.

Read Function of InfiniBand Node-Type

Figure 8.8 depicts the procedure of the IB node-type's read function. The QP is instructed to receive data by a WQE in its RQ. Therefore, a receive WR pointing to buffers of the super-node must be submitted to the RQ. For compatibility with existing node-types, the further steps were implemented with as few changes as possible. Within the read function, the addresses of the samples (passed as the *smps[] parameter) are therefore assigned to sges and inserted into WRs, which are then submitted to the RQ.

Figure 8.8: InfiniBand node-type read function working principle

As a result, the HCA stores received data directly in the super-node's sample buffers, avoiding any data copying. Furthermore, the return behavior of the read function differs considerably from other node-types. If the CQ contains no CQEs, the HCA has received no data and, thus, the ret value should be 0. However, the sample buffers must not be released (i. e. no refcnt may be decreased) as they are still submitted to the RQ of the HCA. If the CQ contains CQEs, the addresses of the buffers holding the received data are assigned to the pointers of the smps[] array that was passed to the read function, and ret is set to the number of pointers that have been replaced. The buffers containing the received data must be released after being processed by the super-node. This approach requires the super-node to call the read function once during initialization without reading any data, since only after this first call does the HCA know where to store received data.
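A minimal sketch of this zero-copy receive path, based on the standard ibverbs API, is shown below. It is not the VILLAS source code; the helper functions and the Sample type are assumptions of the example, only the ibv_* calls belong to the actual verbs interface.

#include <infiniband/verbs.h>
#include <cstddef>
#include <cstdint>

struct Sample;  // payload buffer owned by the super-node

// Hand the super-node's buffers directly to the HCA: one receive WR per
// sample, with wr_id carrying the sample address for the later CQE mapping.
int post_sample_buffers(ibv_qp *qp, ibv_mr *mr, Sample *smps[], unsigned cnt,
                        std::size_t sample_size) {
    for (unsigned i = 0; i < cnt; i++) {
        ibv_sge sge {};
        sge.addr   = reinterpret_cast<std::uintptr_t>(smps[i]);
        sge.length = static_cast<std::uint32_t>(sample_size);
        sge.lkey   = mr->lkey;  // buffer must lie in a registered memory region

        ibv_recv_wr wr {}, *bad = nullptr;
        wr.wr_id   = reinterpret_cast<std::uint64_t>(smps[i]);
        wr.sg_list = &sge;
        wr.num_sge = 1;
        if (ibv_post_recv(qp, &wr, &bad))  // submit the WR to the RQ
            return -1;
    }
    return 0;
}

// Poll the CQ: every CQE identifies a sample that now holds received data.
int collect_received(ibv_cq *cq, Sample *out[], int max) {
    ibv_wc wc[32];
    int n = ibv_poll_cq(cq, max < 32 ? max : 32, wc);
    for (int i = 0; i < n; i++)
        out[i] = reinterpret_cast<Sample *>(wc[i].wr_id);
    return n;  // corresponds to the ret value of the read function
}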

Write Function of InfiniBand Node-Type

The procedure of the IB node-type's write function must be similar to that of the read function in order to achieve zero-copy. When the addresses of the sample buffers that are passed to the write function are submitted via send WRs to the SQ, ret must be set to the number of submitted pointers. If the CQ is empty, none of the passed buffers may be released, as the HCA still has to send the data they contain. If the CQ is not empty, previously submitted WRs have finished and the buffers they point to can be released. Therefore, the addresses of the buffers that were passed in a previous call of the write function are assigned to the pointers of the smps[] array that was passed with the current call of the write function. Furthermore, the super-node must be notified to release the sample buffers that were yielded by the corresponding CQEs.
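The corresponding send path can be sketched with the ibverbs API as follows. Again, this is only an illustration of the behavior described above and not the VILLAS implementation; max_inline and the Sample type are assumptions of the example.

#include <infiniband/verbs.h>
#include <cstddef>
#include <cstdint>

struct Sample;  // sample buffer passed in by the super-node

// Submit every sample via a send WR. Payloads up to max_inline bytes are sent
// inline (copied by the CPU into the HCA), so their buffers can be released
// immediately; all other buffers must be kept until their CQE appears.
int submit_samples(ibv_qp *qp, ibv_mr *mr, Sample *smps[], unsigned cnt,
                   std::size_t size, std::size_t max_inline) {
    unsigned submitted = 0;
    for (unsigned i = 0; i < cnt; i++) {
        ibv_sge sge {};
        sge.addr   = reinterpret_cast<std::uintptr_t>(smps[i]);
        sge.length = static_cast<std::uint32_t>(size);
        sge.lkey   = mr->lkey;

        ibv_send_wr wr {}, *bad = nullptr;
        wr.wr_id      = reinterpret_cast<std::uint64_t>(smps[i]);
        wr.sg_list    = &sge;
        wr.num_sge    = 1;
        wr.opcode     = IBV_WR_SEND;
        wr.send_flags = IBV_SEND_SIGNALED;
        if (size <= max_inline)
            wr.send_flags |= IBV_SEND_INLINE;  // buffer reusable right away

        if (ibv_post_send(qp, &wr, &bad))
            break;          // remaining samples are marked as bad and released by the super-node
        submitted++;
    }
    return static_cast<int>(submitted);  // ret reported back to the super-node
}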

Adapted Read and Write Interface

The original interface could be adapted to also return the number of samples that must be released by the super-node, since the super-node cannot predict this number itself – especially in the case of inline sending, where buffers can be released immediately after submission of the send WR, or in the case where a WR could not be submitted to the SQ successfully. This information could be passed to the super-node via an additional integer pointer in the signatures of the read and write functions, as sketched below.
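A possible form of such an adapted interface is sketched here. It is merely an illustration of the proposal and not part of the current VILLAS interface.

struct node;
struct sample;

/* Adapted signatures (sketch): *release tells the super-node how many of the
   passed samples it must release itself after the call has returned. */
int (*read )(struct node *n, struct sample *smps[], unsigned cnt, unsigned *release);
int (*write)(struct node *n, struct sample *smps[], unsigned cnt, unsigned *release);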

8.3.4 Memory Management of InfiniBand Node-Type

VILLASnode provides memory allocation that is optimized for real-time processing. The implemented alloc() function can allocate huge pages,

which leads to a faster mapping between virtual and physical memory [Deba]. Furthermore, it can lead to fewer page faults, and with page pinning enabled, the pages remain in main memory (i. e. are not swapped out), which avoids delays in the execution of the program that could cause real-time violations. These and some other memory types are, however, not sufficient for the IB node-type, as the HCA directly accesses the buffers that are allocated by the super-node and referenced by WRs. Therefore, every node-type defines what kind of memory allocation is performed by alloc() and whether the memory should be registered as a memory region (as needed for IB). This also allows the implementation of local key acquisition for samples that are passed to the read / write functions. The definition of a preferred memory type per node-type allows the super-node to use a proper memory allocation for input and output buffers that are connected to nodes of that type.
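The following sketch shows what such an allocation might look like for the IB node-type: huge-page backed, pinned memory that is additionally registered with the HCA. It is an illustration under these assumptions, not the VILLAS alloc() implementation.

#include <infiniband/verbs.h>
#include <sys/mman.h>
#include <cstddef>

struct IbBuffer { void *addr; ibv_mr *mr; };

// Allocate a huge-page backed, pinned buffer and register it as a memory
// region so that WRs may reference it (error handling reduced to nullptr).
IbBuffer alloc_ib_buffer(ibv_pd *pd, std::size_t len) {
    void *p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED)
        return {nullptr, nullptr};
    mlock(p, len);  // pin the pages in main memory to avoid page faults

    // registration yields the local key (lkey) used in every scatter/gather element
    ibv_mr *mr = ibv_reg_mr(pd, p, len, IBV_ACCESS_LOCAL_WRITE);
    if (!mr) {
        munmap(p, len);
        return {nullptr, nullptr};
    }
    return {p, mr};
}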

8.3.5 States of InfiniBand Node-Type

Before the implementation of the IB node-type, a node could be in one of six states, depicted in Fig. 8.9 as circles with solid lines. If, for example, the _start() function of the node-type interface is called successfully, the transition checked→started is performed.

Figure 8.9: VILLASnode state diagram with newly implemented states

According to the VIA, a node can be initiated but not yet connected (i. e. the node is not able to send data). Therefore, the started state of VILLASnode is not sufficient and was extended by the new state connected. Moreover, before any data can be received, WQEs must be present in the corresponding RQ. These circumstances lead to the finite-state machine in Fig. 8.9, whose new states are drawn with dashed lines. If a node is in one of these states, the super-node treats it as if it were in the started state. This finite-state machine can also be used for future node-types other than IB that are based on the VIA.
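For illustration, the extended state set from Fig. 8.9 could be represented as follows; the identifier names are assumptions of this sketch.

// Sketch of the node states, including the newly introduced VIA-related ones.
enum class NodeState {
    Initialized,
    Parsed,
    Checked,
    Started,         // _start() succeeded
    PendingConnect,  // new: connection requested but not yet established (UC/RC)
    Connected,       // new: connection established, data can be transmitted
    Stopped,
    Destroyed
};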

8.3.6 Implementation of InfiniBand Node-Type

An overview of the implemented IB node-type is shown in Fig. 8.10. The most important aspects, i. e. the read and write functions which enable the kernel bypass offered by InfiniBand, are explained in the following.

Figure 8.10: Components of InfiniBand node-type

The whole source code is open source and part of the VILLAS project [FEIf].

Start Function

The start function is called by the super-node for initialization purposes. First, RDMA communication event channels are created so that the node can resolve the remote address (as an active node) or place itself into a listening state (as a passive node). Whether a node is active or passive is defined by its configuration. In case of a successful start, the node transitions into the started state.

Communication Management Thread

The communication management thread is spawned by the start function. It waits on the blocking rdma_get_cm_event() function for events such as connection requests, errors, rejections, and connection establishment. Depending on the node, the thread acts as follows:

Active node As the node tries to connect to another node, RDMA_CM_EVENT_ADDR_RESOLVED signals that the address could be resolved. After a subsequent initialization of various structures, the RDMA route is resolved, which should end with an RDMA_CM_EVENT_ROUTE_RESOLVED event, followed by an RDMA_CM_EVENT_ESTABLISHED event if the remote node accepts the connection, which results in a transition to the connected state of the node. In this state, data can be transmitted.

Passive node As the node listens for connection requests from other nodes, the RDMA_CM_EVENT_CONNECT_REQUEST event occurs when another node performs such a request. If the service type is UC or RC, the node transitions to the pending connect state; in case of the unconnected service type UD, it transitions directly to the connected state. An RDMA_CM_EVENT_ESTABLISHED event occurs after a successfully established connection, which lets the node transition to the connected state.
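The core of such a thread can be sketched with the RDMA CM API as follows. The switch only hints at the reactions described above; it is a simplified illustration, not the VILLAS implementation.

#include <rdma/rdma_cma.h>

// Simplified communication management loop: block on the event channel and
// react to the RDMA CM events described in the text.
void cm_thread(rdma_event_channel *ec) {
    rdma_cm_event *ev = nullptr;
    while (rdma_get_cm_event(ec, &ev) == 0) {
        switch (ev->event) {
        case RDMA_CM_EVENT_ADDR_RESOLVED:    // active node: address resolved
            break;                           // -> initialize structures, resolve route
        case RDMA_CM_EVENT_ROUTE_RESOLVED:   // active node: route resolved
            break;                           // -> connect to the remote node
        case RDMA_CM_EVENT_CONNECT_REQUEST:  // passive node: peer requests a connection
            break;                           // -> UC/RC: pending connect; UD: connected
        case RDMA_CM_EVENT_ESTABLISHED:      // connection established
            break;                           // -> transition to the connected state
        default:                             // errors, rejections, disconnects
            break;
        }
        rdma_ack_cm_event(ev);               // every event must be acknowledged
    }
}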

Read Function

The read function's behavior differs from the principle depicted in Fig. 8.7 because it can happen that samples could not be submitted successfully and must therefore be released again. For this purpose, a threshold number can be defined in the node's configuration to ensure that at least

threshold samples can be received. If the threshold is reached, the CQ is polled until it contains enough CQEs, which intentionally blocks the further execution of the read function. Moreover, entries in *smps[] are freed, as the array can only hold a certain number of values.

Write Function

When the super-node calls the write function, it tries to submit all passed samples to the SQ. Iterating through the samples, the node decides dynamically in which manner each sample has to be sent:

1. samples are submitted normally and are not released by the super-node until a CQE with the proper address appears;

2. samples are submitted normally but some are marked as bad and must be released by the super-node;

3. samples will be sent inline (i. e. are copied by the CPU directly into the HCA’s buffers) and must be released by the super-node.

More on the implementation of the InfiniBand node-type can be found in [Pot18].

8.4 Analysis of the InfiniBand Support in VILLAS

The performance of the newly implemented IB node-type is evaluated in the following in comparison to other already existing node-types of VILLASnode.

Measurement Environment

All measurements in this section were performed on a DELL T630 server with an NT78X mainboard providing 2 sockets, each with an Intel Xeon E5-2643 v4 CPU at 3.4 GHz (3.7 GHz Turbo) with 6 cores and Hyper-Threading (HT); 32 GB DDR4 main memory at 2400 MHz; 2x Mellanox ConnectX-4 MT27700 HCAs with 100 Gbit/s, interconnected via a 0.5 m Mellanox MCP100-E00A passive copper cable; running an x86_64 Fedora Linux with kernel 4.13.9-200 and MLNX OFED Linux 4.4-2.0.7.0. Moreover, the system was optimized for real-time processing.


Real-Time Optimizations

The following real-time optimizations were applied [Pot18]:

Memory optimizations Achieved through the utilization of huge pages, aligned memory allocations, and memory pinning.

CPU isolation and affinity Achieved by using the isolcpus kernel parameter, which excludes processor cores from the general balancing and scheduling mechanisms, so that no process is scheduled onto the excluded CPUs unless it is explicitly assigned to one by sched_setaffinity() (see the sketch after this list). Moreover, cpusets are used to allow threads that are forked by processes on an excluded CPU to be scheduled among all available excluded CPUs instead of being assigned only to the CPU of their forking process.

Non-movable kernel threads During system booting, kernel threads are created for tasks of the kernel and pinned to CPUs. This can be prevented so that no kernel threads run on the excluded CPUs.

Interrupt affinity Used to re-route interrupts that would disturb a CPU performing time-critical operations (e. g. busy polling on a signal variable for a certain event) to a CPU that is not assigned to time-critical processing.

Tuned daemon Red Hat based systems such as the used Fedora Linux support the tuned daemon for monitoring devices and adjusting system settings for higher performance. Supported tuning plugins are, e. g., cpu, net, and sysctl. tuned offers many predefined profiles, such as latency-performance for low-latency applications. For instance, this profile sets the CPU frequency governor to performance.
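The following minimal sketch shows how a time-critical process can pin itself to one of the isolated CPUs; the chosen CPU number is only an example matching the test system in Fig. 8.11.

#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>
#include <cstdio>

// Pin the calling process to a single (isolated) CPU.
int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // pid 0 = calling process; it will only be scheduled on the given CPU
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        std::perror("sched_setaffinity");
        return -1;
    }
    return 0;
}

int main() {
    return pin_to_cpu(16);  // e. g. a CPU from the real-time-0 cpuset (assumption)
}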

Figure 8.11 shows the distribution of the CPUs among the cpusets. CPUs in the two real-time cpusets are restricted to the memory of their non-uniform memory access (NUMA) node. This leads to lower memory access latencies, since in NUMA computer architectures the main memory is distributed among the nodes (here: processors) of the system, as shown in Fig. 8.11 for the described test system. The restricted memory locations are also used by the respective HCAs for writing received data and reading data to be sent. Therefore, all time-critical processes using the HCAs (i. e. mlx5_0 and mlx5_1) were executed on the CPUs 16, 18, 20, and 22 as well as 17, 19, 21, and 23 (see Fig. 8.11).


Figure 8.11: Computer system with NUMA architecture used for measurements (Dell PowerEdge T630 with two NUMA nodes, cpusets system and real-time-0/1, and the two Mellanox ConnectX-4 HCAs mlx5_0 / mlx5_1 connected by a 0.5 m cable)

VILLASnode Node-Type Benchmark

The VILLASnode node-type benchmark was used to compare the performance of different node-types. Its working principle is depicted in Fig. 8.12. First, the already existing signal node generates samples with timestamps, which are forwarded to a file node that stores them in a comma-separated values (CSV) file called in. Concurrently, the samples are sent to a sending node of the type to be benchmarked. A receiving node gets the samples and writes them, together with the timestamps of their reception, to a CSV file called out. Therefore, the benchmark

Figure 8.12: VILLASnode node-type benchmark working principle

is utilized for measuring the transfer latencies between nodes. Although the out file contains the generation and receive timestamps, the in file is needed to determine which samples were lost. Since the signal node can miss steps during sample generation at high rates, it can thus be determined whether samples are missing because of the signal node or because they were lost between the nodes. The reason for the missed steps is explained in the following.

Sample Generation

For the payload generation at different rates, the signal node was configured accordingly. It can make use of two different timers: the timerfd, which relies on waiting on a file descriptor used for notifications, and the time-stamp counter (TSC), a 64 bit CPU register that is incremented mainly depending on the CPU's maximum core frequency. In separate latency benchmarks [Pot18] with a rate of 100 kHz, ten 64 bit floating-point numbers per sample, and RC as the service type of an IB node, it was determined that the timerfd had a higher effect than the TSC on the median t̃lat of the measured latencies. With the TSC, on the other hand, steps were missed at relatively small rates below 2500 Hz. For example, the fraction of missed steps at 100 Hz was around 8 %. Since using the timerfd at these slow rates would have skewed the results, and since a deviation of 8 Hz at a rate of 100 Hz hardly influences any latency results, the TSC was chosen.
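As an illustration of TSC-based pacing, the following sketch busy-waits on the TSC between sample emissions. The calibration of tsc_hz and the emit callback are assumptions of the example; this is not the signal node's actual implementation.

#include <x86intrin.h>
#include <cstdint>

// Emit `count` samples at `rate_hz`, pacing with the time-stamp counter.
void generate_samples(std::uint64_t tsc_hz, double rate_hz, unsigned count,
                      void (*emit)(unsigned seq)) {
    const std::uint64_t period = static_cast<std::uint64_t>(tsc_hz / rate_hz);
    std::uint64_t next = __rdtsc() + period;
    for (unsigned seq = 0; seq < count; seq++) {
        emit(seq);                  // hand the sample to the node's write path
        while (__rdtsc() < next) {
            // busy-wait; a step is "missed" if emitting took longer than one period
        }
        next += period;
    }
}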

8.4.1 Service Types of InfiniBand Node-Type

In the following, different service types of the IB node-type are compared, namely RC and UD, since these are officially supported by the RDMA communication manager (CM) (see Fig. 8.5) and therefore do not require any modifications of the RDMA library. All measurements in this section were performed with 250'000 samples.

Various Sample Generation Rates

In these measurements, the samples contained 8 random 64 bit floating-point numbers and were generated at rates between 100 Hz and 100 kHz. For RC, with 24 B of metadata, such a message has 88 B; for UD, with an additional Global Routing Header (GRH) of 40 B, a message has 128 B – all messages were sent inline. Figure 8.13 shows the results, which are relatively similar for both modes (RC and UD) over all rates. This is typical for InfiniBand, as the reliability is implemented in the HCA, which causes less overhead than

an implementation in the network stack (e. g. TCP/IP) of the operating system. In both cases, t̃lat decreases with higher frequencies and, thus, shorter periods between sample transmissions. Assuming one-way transmission times of 1 µs [Pot18], transmission rates of up to approximately 1 MHz should be possible. However, rates higher than 100 kHz were not measured, as the signal node of VILLAS missed even more steps; a higher rate could not be achieved despite optimizations of the file node. Another limitation is that the rate at which the IB node clears the CQ and refills the RQ depends on the rate of the read function calls. If the RQ size is sufficient, it can absorb short message peaks, but not continuously high rates.

Various Sample Sizes

For a measurement over various sample sizes, the rate was fixed to 25 kHz and the messages contained 1 to 64 values per sample, resulting in messages of 32 B to 536 B for the RC and 74 B to 576 B for the UD type. Messages of 188 B or smaller were sent inline by the used HCAs. In Fig. 8.14, an increasing median latency can be seen when the message size exceeds about 128 B, which is in accordance with the findings presented in [MR12]. Furthermore, the variability of the latencies with the UD type is higher than with the RC type. Moreover, the RC type shows lower latencies than the UD type, which can be explained by the remote node's address handle (AH) that must be added to every send WR and the GRH that is added to every message, both of which are not needed in case of the RC type.

Figure 8.13: Median latencies t̃lat over various sample rates (RC and UD)


Various Generation Rates and Sample Sizes

The results of a combined measurement with various generation rates and sample sizes are shown in Fig. 8.15 for the RC service type only. The findings of both previous measurements are reflected in this overall measurement diagram. The percentage of missed steps is also shown in Fig. 8.15 and colored in red if the signal node missed more than 10 % of the steps.

Figure 8.14: Median latencies t̃lat over various sample sizes (RC and UD)

Figure 8.15: Median latencies t̃lat over various sample rates and sample sizes (min tlat: 1.706 µs, max tlat: 4.915 µs; the percentages give the share of samples missed by the signal generator)

With these results, the data rate T can be calculated as

T = \left(1 - \frac{p_{\mathrm{missed}}}{100\,\%}\right) \cdot s_{\mathrm{sample}} \cdot f_{\mathrm{signal}} \qquad (8.1)

with pmissed being the percentage of missed samples, ssample the size of a sample, and fsignal the sample generation rate. In the measurements, the data rate was approximately 20 MiB/s, which shows that the file node was not able to process large amounts of data.

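For illustration (the concrete values are assumed for the sake of the example and only roughly taken from Fig. 8.15): with ssample = 64 · 8 B = 512 B, fsignal = 50 kHz, and pmissed = 22 %, Eq. (8.1) yields

T = (1 - 0.22) \cdot 512\,\mathrm{B} \cdot 50\,000\,\mathrm{s^{-1}} \approx 19\,\mathrm{MiB/s},

which is in the order of the observed data rate of about 20 MiB/s.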
8.4.2 InfiniBand vs. Zero-Latency Node-Type

For the comparison of the IB node-type with a zero-latency node-type, the shmem node-type was chosen, as it uses the POSIX shared-memory API for the communication between VILLAS nodes; the latency between two shmem nodes therefore essentially corresponds to accessing the shared-memory region used by both of them. Again, 250'000 samples were sent at rates between 100 Hz and 100 kHz, each containing 8 random 64 bit floating-point numbers. Figure 8.16 shows the results for both node-types. The latency differences between the node-types can be assumed to be caused by the IB communication. Both the median latency of the IB node and that of the shmem node decreased with higher frequencies; this effect can therefore not be caused by the PCI-e bus or the IB node implementation itself.

Figure 8.16: Median latencies t̃lat of IB vs. shmem node-type over various sample rates (the percentages give the share of samples missed by the signal generator)


Furthermore, the IB node only missed a negligible number of steps more than the shmem node. This implies that the write function of the IB node returned fast enough and did not influence the signal generation too much. With median latencies of around 2.5 µs, transmission rates of up to ~400 kHz could be possible.

8.4.3 InfiniBand vs. Existing Server-Server Node-Types

One reason for the integration of IB into VILLASnode was the lack of a hard real-time capable server-server node-type. Therefore, this section compares the IB and shmem node-types with existing node-types commonly used for server-server communication, zeromq and nanomsg. Once more, 250'000 samples were sent at rates between 100 Hz and 100 kHz, each containing 8 random 64 bit floating-point numbers.

Loopback vs Physical Link

First, the loopback IP 127.0.0.1 was used for the IP-based node-types zeromq and nanomsg; the measurements were then repeated on a real physical link. Such a physical link for IP-based node-types is usually based on Ethernet technology. However, to avoid skewing the results by another link technology, the IB HCAs were also used as the physical link for the communication between the IP-based node-types. This was realized with the Internet Protocol over InfiniBand (IPoIB) driver, which provides an IP-based interface that can be used by the IP-based node-types. Figure 8.17 shows the results for the IP-based node-types. For rates below 25 kHz, there were no relevant latency deviations between the loopback and the physical link. Above 25 kHz, the latencies on the physical links increased, especially with zeromq. The percentage of missed steps at 100 Hz and 2500 Hz was the same for both IP-based nodes as for the IB and shmem nodes, again indicating the TSC to be the reason. In Fig. 8.18, the results of the IP-based nodes on physical links are compared to those of the IB and shmem nodes. It can be seen that the latencies of the hard real-time capable IB node were at least one order of magnitude lower than those of the IP-based node-types, which are only soft real-time capable. Also, the variability of the latencies was very low for IB compared to the IP-based types, especially for rates above 25 kHz, where the IP-based types showed increasing latencies.


8.5 Conclusion and Outlook

The results presented in this chapter show that the integration of InfiniBand into the VILLAS framework enables the transmission of samples at relatively high rates, with latencies of a few microseconds, and under hard real-time requirements. These low latencies were achieved by strict compliance with the principles of the VIA, such as zero-copy, and by utilizing InfiniBand's capability to initiate data transmissions without system calls or the active participation of a CPU.

Figure 8.17: Median latencies of the nanomsg and zeromq node-types over various sample rates via loopback (lo) and physical link

Figure 8.18: Median latencies of nanomsg and zeromq via physical links as well as of the IB and shmem node-types over various sample rates

For the same reason, the IB node-type can be adapted to other VIA-based interfaces. While the IB node-type showed median latencies of around 1.7 µs for small messages at high rates, the median latencies for larger message sizes at low rates were around 4.9 µs. Compared to the – almost zero-latency – shmem node-type, the median latencies were only 1.5–2.5 µs higher, which is of high value in the area of real-time processing, since shmem only allows communication between the nodes of a shared-memory system, which are typically located on the same computer. Moreover, existing VILLAS node-types for communication among different systems over IP showed median latencies that were one to two orders of magnitude larger than in case of IB. The latter can furthermore be used for much higher sample rates. With the IB node-type, VILLASnode can be used for the hard real-time capable coupling of simulators running on conventional and inexpensive computer hardware in academia and industry. Moreover, in the future, HiL setups are possible where devices to be connected to a computer host running the simulation are equipped with an InfiniBand TCA for low-latency data transfers between the device and the simulation. The same setup can be used for real-time operation. The IB node-type implementation could be further improved for real-time processing with the aid of an RT_PREEMPT-capable Linux kernel. Further performance analyses, e. g. based on a profiling of the node-type's read and write functions, could support code optimizations leading to even lower latencies.

9

Conclusion

In the following, the conclusions of all previous chapters are summarized and discussed.

9.1 Summary and Discussion

This dissertation presents various methods from high-performance computing (HPC) in support of power system simulation. These methods shall help other power system simulation users and developers in their undertakings. Therefore, all presented approaches were implemented in open-source software projects.

In Chapter 2, a data model for multi-domain smart grid topologies based on the Common Information Model (CIM) was presented. CIM was chosen even though it lacks all object classes for communication networks, many classes for energy markets, and some classes for power grids. The CIM extensions that thus become necessary can lead to various extensions by different organizations. Of course, CIM is not the only possible information model, but it provides the biggest well-specified subset of the classes needed for a holistic representation of smart grid topologies from a high to a relatively detailed level. Moreover, the CIM User Group (CIMug) keeps extending CIM with new classes to achieve an increasingly holistic model of smart grid topologies. The developed SINERGIEN_CIM data model was furthermore used for the successful validation of the automated CIM (de-)serializer generation in Chap. 3 with ontologies that extend CIM, ensuring a general applicability of the approach.


Chapter 3 introduced an approach for an automated (de-)serializer generation based on a UML-to-C++ code generation, a subsequent code adaptation and extension, and a template based (un-)marshalling code generation with the aid of a C++ compiler front-end. Alternatively, instead of making use of a UML editor such as Enterprise Architect (EA), one could save the UML specification in an open document format such as XML Metadata Interchange (XMI). A code generator (to be developed) could then generate the (un-)marshalling code directly by traversing the XMI document. This would make the code adaptation as well as the compiler front-end processing unnecessary. In fact, this procedure is currently applied for the integration of the Common Grid Model Exchange Standard (CGMES) into CIM++. However, instead of an XMI document representing the CGMES UML specification, the generator for CGMES (un-)marshalling code for CIM++ makes use of Resource Description Framework Schema (RDFS) documents, which define the structure of a concrete RDF based document type.

Furthermore, the compilation of more than a thousand CIM classes into libcimpp is not needed for every project using it. To reduce its size, e. g. for the application in embedded systems with very limited main memory and program storage, an approach will be implemented that makes it possible to choose a certain subset of CIM classes. Of course, all superclasses of a given subset must be automatically integrated into libcimpp as well. Despite all that, libcimpp is already utilized not only in Institute for Automation of Complex Power Systems (ACS) software projects but also by a Swiss and a Czech company and potentially by other GitHub users who remain anonymous. This indicates that it is usable not only in academia but also in enterprise applications.

Chapter 4 presents a template based translation method from CIM to simulator-specific system models as implemented in the CIMverter project. One could argue that the template based approach is too inflexible in comparison to a domain-specific language (DSL) based approach, as more complex mappings must be implemented in C++. A contrary indication is that further component translations from CIM to Modelica required hardly any or even no changes in the C++ Modelica Workshop. Also, the integration of the DistAIX system model format into CIMverter was accomplished within a couple of person-days, as it could be performed mainly with new templates. Besides these examples, it must be assumed that complex mappings would also require a comprehensive DSL. This would lead to higher efforts for learning the DSL and implementing it in CIMverter. Furthermore, the presented mapping from CIM to simulator-specific system models covers so-called bus-branch models only. Therefore, CIMverter is not able to handle node-breaker models, but this follows the

chosen UNIX philosophy of developing one program for one task [MPT78; Ray03], as a node-breaker model does not provide all information needed for a final system model. In fact, it provides a set of topologies which can differ depending on the configuration of the breakers. Hence, the mapping of a node-breaker model to a bus-branch model should be handled by a separate topology processor with respect to the given breaker configuration.

Chapter 5 presents modern LU decomposition methods for circuit simulation that have been parallelized for current parallel processors. It shows a comparative analysis with the state-of-the-art LU decomposition KLU, based on benchmark matrices as well as on real simulations performed by Dynaωo. It could be regarded as a disadvantage that only the solving of linear systems was considered, although in power system simulation usually non-linear systems are solved. One reason is that solving a non-linear system is usually implemented by linearization and a subsequent solving of a linear system. Furthermore, in case of large-scale power grid models, Réseau de Transport d'Électricité (RTE), the French transmission system operator (TSO), found that during the solving of DAEs (e. g. with the aid of IDA) most of the CPU time is spent in the LU factorization (i. e. KLU). RTE and other partners of the PEGASE research project conducted a comprehensive analysis of different solvers. The outcomes of that analysis are another reason why only LU decompositions, and thus direct solvers, have been analyzed and no iterative ones.

Chapter 6 introduces a new type of approach, the automatic fine-grained parallelization of mathematical models, which was implemented in the new power grid simulator DPsim. It is about exploiting parallelism in mathematical power system models in order to make use of multi-core processors with shared-memory architectures, which are common in today's computer systems. The MNA solver itself, which is based on the SparseLU method of the Eigen library implementing the supernodal LU factorization [Sup] for sparse non-symmetric systems of linear equations, has not been improved. Obviously, at this point other LU decompositions could also be integrated and improved, analogously to the work already done for OpenModelica and Dynaωo in Chap. 5. This would improve the performance of the task processing itself rather than the parallel processing of the tasks, which was the goal of this work. From an HPC point of view, the power grid simulations that were performed, e. g., by OpenModelica, Dynaωo, and DPsim never led, because of the sparsity of the linear systems, to matrices of a size large enough for an efficient use of distributed-memory systems or even supercomputers. Even in case of large-scale static phasor power grid simulations with more than 7500 nodes, the matrices indeed had a size of 200000 × 200000, but with up to 700000 nonzeros their memory consumption was only around 5 MiB.


Hence, no parallelization approaches for distributed-memory architectures were needed or implemented. This could change in case of large-scale dynamic simulations, but such simulations have not been considered yet.

Chapter 7 addresses different approaches for increasing the performance of Python programs. After an introduction to the ideas and internals of the Python runtime environments implementing these approaches, a comparative analysis based on algorithms from different algorithm classes is presented. The analysis helps programmers to understand how to adapt Python programs to achieve a better runtime performance in a certain environment, for instance based on just-in-time (JIT) compilation. Thus, it also helps to estimate the efforts and benefits of the development. The analysis mainly focused on sequential processing, with shared-memory parallelization considered only via multithreading; achieving better performance for sequential Python code, however, was precisely the focus of the analysis. There are also other JIT compilation based programming languages such as Julia. Julia's syntax is similar to MATLAB and Python, and the language provides automatic memory management, making it easy to learn for programming beginners. However, Julia was developed as a language for scientific computing, and Python is much more popular in engineering.

Chapter 8 presents the implementation of InfiniBand (IB) support in the VILLASframework for Hardware-in-the-Loop (HiL) setups and the real-time (RT) coupling of simulators. The implemented IB communication shows transmission latencies that are one order of magnitude lower than the corresponding latencies of Internet Protocol (IP) based communication with nanomsg and zeromq. Furthermore, the IB latencies are only slightly higher than in case of a shared-memory based data exchange, which is limited to the same computer host. The InfiniBand latencies also show a very low variability, which is important for RT requirements. Even though InfiniBand based communication is not suitable for wide area networks (WANs), distances above 15 m can be bridged by active fiber optic cables with lengths of hundreds of meters. Therefore, with InfiniBand interconnects, a widely-used HPC network technology can be applied for hardware-server and server-server communication, even with hard RT requirements, via the VILLASnode software gateway for simulation data exchange.

Taken as a whole, it can be concluded that the work presented in this dissertation already improved or can improve the performance of different (co-)simulation environments. Furthermore, it enables the use of CIM topologies in different power system simulators, allowing the simulation of large-scale real-world power systems. Also, many findings and approaches can be used for improving further software from the area of electrical engineering and beyond. The implemented open-source

software projects can be used and improved by scientists and developers in academia and industry.

9.2 Outlook

Some concrete improvements of the developed concepts and approaches have already been mentioned in the discussion above. One of today's most important goals in the area of HPC for smart grid simulation is a solver for the linear systems of equations arising during simulations performed by power grid simulation environments which scales with the cores of modern multi-core processors. At least in case of state-of-the-art steady-state simulations, it has been seen that there is no need for parallel architectures with distributed memory. Workstations and servers with a shared-memory architecture can cope with steady-state simulations, but larger and increasingly complex system models, resulting from component model improvements, more elaborate models of new equipment, and more grid nodes, require an efficient utilization of the available computer hardware. Hence, if simulations with more complex system models are not allowed to take longer, the software must make use of new hardware developments. Therefore, more research and development on the power system simulation environments and their numerical back-ends is needed to make use of the wider vector units of today's processors and of accelerators such as graphic processing units (GPUs) and field-programmable gate arrays (FPGAs).

Further scientific work related to HPC in power system simulation is also needed in the area of dynamic security assessment (DSA) based on dynamic grid simulation. In DSA systems, different scenario computations can be triggered by certain events such as the outage of grid equipment. Then, dynamic (n-1)-computations must be performed, which can provide information on the voltage stability, small-signal stability, and transient stability of the system. DSA systems can make use of expert systems, for example on the basis of neural networks, that can derive grid operation improvements from the mentioned analyses. Since the real-time requirements on DSA systems can be very challenging and dynamic computations can be very time-consuming, the application of high-throughput computing (HTC) on distributed-memory systems can be the method of choice, where HTC denotes a computing paradigm that focuses on the efficient execution of a large number of loosely-coupled tasks [Eur].

In the context of dynamic (n-1)-computations, a topology processor which generates bus-branch topologies from node-breaker models with a given breaker setting could be helpful. The (n-1)-computation control could make use of such a topology processor providing all topologies to

be considered for the scenarios which have to be calculated for an event that triggers DSA computations. These topologies could be used for the aforementioned stability calculations as well as for additional dynamic and static protection simulations.

Besides multithreading paradigms for shared-memory parallelization in Python programs, paradigms for distributed-memory parallelization could also be investigated, e. g. with the aid of the Message Passing Interface (MPI). The benchmarking of an MPI implementation itself should not be performed by examining the performance of a set of MPI based applications, but more systematically for different kinds of communication operations (e. g. individual, collective, one-sided, etc.), for various communication patterns (one-to-one, one-to-many, all-to-all), and for multiple communication modes. For this purpose, an approach such as the one implemented in the special Karlsruher MPI-benchmark (SKaMPI), which performs various measurements of different MPI functions in a customizable way, could be followed.


A

Code Listings

A.1 Exploiting Parallelism in Power Grid Simulation

Listing A.1: step method of the OpenMP-based level scheduler

void OpenMPLevelScheduler::step(Real time, Int timeStepCount) {
    size_t i = 0, level = 0;

    #pragma omp parallel shared(time, timeStepCount) \
        private(level, i) num_threads(mNumThreads)
    for (level = 0; level < mLevels.size(); level++) {
        #pragma omp for schedule(static)
        for (i = 0; i < mLevels[level].size(); i++) {
            mLevels[level][i]->execute(time, timeStepCount);
        }
    }
}


B

Python Environment Measurements

B.1 Execution Times

Figure B.1: Execution times for AVL Tree Insertion


Figure B.2: Execution times for Dijkstra

Figure B.3: Execution times for Gauss-Jordan Elimination


B.2 Memory Space Consumption

Figure B.4: Memory consumption (maximum heap peak) for Cholesky


Figure B.5: Memory consumption (maximum heap peak) for Gauss-Jordan Elimination

Figure B.6: Memory consumption (maximum heap peak) for Gauss-Jordan Elimination of selected runtime environments


Figure B.7: Memory consumption (maximum heap peak) for Matrix-Matrix Multiplication

Figure B.8: Memory consumption (maximum heap peak) for Matrix-Matrix Multiplication of selected runtime environments


List of Acronyms

ACS Institute for Automation of Complex Power Systems 4, 97, 186 ADT abstract data type ...... 128 AH address handle...... 179 AMD Approximate Minimum Degree...... 80 AOT ahead-of-time...... 130 API Application Programming Interface ...... 109, 159 ARM Advanced RISC Machines...... 6 AST abstract syntax tree ...... 33, 58 AVL Adelson-Velsky and Landis ...... 148 AVX Advanced Vector Extensions ...... 125

BDF backward differentiation formula ...... 75 BSD Berkeley Software Distribution...... 159 BTF block triangular form ...... 80

CA Channel Adapter ...... 161 CASE Computer-Aided Software Engineering ...... 53 CFG control flow graph ...... 138 CGMES Common Grid Model Exchange Standard ...... 186 CiL Control-in-the-Loop...... 5 CIM Common Information Model . iv, viii, 8, 14, 31, 55, 75, 97, 185 CIMug CIM User Group...... 30, 185 CLI command line interface ...... 62 CM communication manager ...... 178 COLAMD Column Approximate Minimum Degree ...... 81 CP critical path ...... 106 CPU central processing unit ...... 6, 80, 100, 129, 160


CQ Completion Queue...... 160 CQE Completion Queue Entry ...... 164 CSV comma-separated values ...... 177

DAE differential-algebraic system of equations ...... 9, 75 DAG directed acyclic graph ...... 102 DER distributed energy resource ...... 4, 13 DES discrete event simulation...... 16 DES Discrete Event System Specification...... 17 DistAIX Distributed Agent-Based Simulation of Complex Power Systems ...... 56 DMA Direct Memory Access ...... 165 DMS distribution management system ...... 16 DOM Document Object Model ...... 31 DP dynamic phasor...... 97 DPsim Dynamic Phasor Real-Time Simulator ...... 97 DRTS digital real-time simulator ...... 5, 157 DSA dynamic security assessment ...... 3, 189 DSL domain-specific language ...... 56, 186 DSO distribution system operator ...... 11, 15 DUFunc dynamic universal function ...... 141

EA Enterprise Architect ...... 186 EHV extra high voltage ...... 84 EMS energy management system ...... 15 EMT electromagnetic transient simulation ...... 4 ENTSO-E European Network of Transmission System Operators for Electricity ...... 63

FIFO first in – first out ...... 164 FPGA field-programmable gate array...... 6, 189

GC garbage collector ...... 134 GD-RTS geographically distributed real-time simulation . . . . 5, 157 GIL global interpreter lock ...... 135 GMRES Generalized Minimal Residual Algorithm ...... 78 GP Gilbert/Peierls’ ...... 81 GPU graphic processing unit ...... 6, 76, 99, 189 GRH Global Routing Header ...... 178

HCA Host Channel Adapter ...... 161 HiL Hardware-in-the-Loop...... iv, viii, 5, 97, 157, 188


HLFET Highest Level First with Estimated Times ...... 106, 210 HLFNET Highest Level First with No Estimated Times ...... 106 HLFNET Highest Level First with No Estimated Times...... 110 HLL high-level language ...... 129 HPC high-performance computing...... vii, 6, 97, 185 HT Hyper-Threading ...... 85, 112, 150, 175 HTC high-throughput computing...... 189 HV high voltage ...... 84 HW hardware ...... 158

IB InfiniBand...... 10, 26, 157, 188 IBA InfiniBand Architecture ...... 161, 211 IBTA InfiniBand Trade Association ...... 161 ICT information and communications technology . . . . vii, 1, 13 IEC International Electrotechnical Commission ...... 29 IP Internet Protocol ...... 158, 188 IPC inter-process communication ...... 159 IPoIB Internet Protocol over InfiniBand ...... 182 iPSL iTesla Power System Library ...... 75 IR intermediate representation ...... 142 ISO International Organization for Standardization ...... 29 IVP initial value problem ...... 76

JIT just-in-time ...... 10, 127, 188

LAN local area network ...... 158 LSE linear system of equations...... 9

MD minimum degree ...... 79 MDA model-driven architecture ...... 30 MNA modified nodal analysis ...... 97 MPI Message Passing Interface ...... 125, 156, 190 MPS ModPowerSystems ...... 75 MQTT Message Queue Telemetry Transport...... 158 MTU message transmission unit ...... 162

ND nested dissection...... 79 NIC network interface controller ...... 159 NR Newton-Raphson ...... 94 NUMA non-uniform memory access...... 176

ODE ordinary differential equation ...... 3, 98


OFED OpenFabrics Enterprise Distribution ...... 165 OMG Object Management Group ...... 51 OOP object-oriented programming ...... 36 OS operating system ...... 159

PC program counter ...... 139 POSIX Portable Operating System Interface ...... 41, 135, 159

QoS quality of service ...... 158 QP Queue Pair ...... 162 QVT Query/View/Transformation...... 51

RC Reliable Connection ...... 162 RD Reliable Datagram...... 162 RDF Resource Description Framework ...... 16, 31 RDFS Resource Description Framework Schema ...... 186 RDMA Remote Direct Memory Access ...... 161 RPython Restricted Python ...... 136 RQ Receive Queue...... 163 RT real-time ...... 4, 85, 97, 157, 188 RTE Réseau de Transport d’Électricité ...... 75, 187 RTTI runtime type information ...... 62

SAX Simple API for XML ...... 33 SCADA supervisory control and data acquisition ...... 15 SGAM Smart Grid Architecture Model ...... 14 sge scatter/gather element...... 165 SiL Software-in-the-Loop...... 5 SIMD single instruction multiple data (stream) ...... 125, 143 SKaMPI special Karlsruher MPI-benchmark ...... 190 SL simplified load ...... 84 SQ Send Queue ...... 163 SSA steady-state security assessment...... 3 STL Standard Template Library ...... 36, 145 SUNDIALS SUite of Nonlinear and DIfferential/ALgebraic Equation Solvers ...... 76, 98 SW software ...... 158

TCA Target Channel Adapter ...... 161 TCP Transmission Control Protocol ...... 26, 160 TJIT tracing just-in-time ...... 130


TLM Transmission Line Modeling ...... 99 TSC time-stamp counter ...... 178 TSO transmission system operator...... 4, 187

UC Unreliable Connection ...... 162 UD Unreliable Datagram ...... 162 UDP User Datagram Protocol ...... 160 ufunc universal function ...... 141 UML Unified Modeling Language ...... 8, 19, 30

VDL voltage dependent load ...... 84 VI virtual interface ...... 160 VIA virtual interface architecture ...... 160 VL Virtual Lanes...... 165 VM virtual machine...... 134 VPP virtual power plant...... 2, 21

WAN wide area network ...... 188 WQE Work Queue Element ...... 163 WR Work Request ...... 163 WSCC Western System Coordinating Council ...... 112, 210

XMI XML Metadata Interchange ...... 31, 186 XML Extensible Markup Language ...... 32


Glossary

barrier A barrier is a synchronization primitive, e. g. among a set of threads or processes, for which holds that each thread / process of the regarding set must execute all instructions before the barrier before any of them continues with the instructions after the barrier...... 104

driver In the context of numerical software (i. e. not a hardware driver): a program which applies numerical methods, e. g. implemented in libraries that are linked to the program, with all needed initializations and parameters, to a particular problem to be solved...... 58, 83

flat model The semantics of the Modelica language is specified by means of a set of rules for translating any class described in the Modelica language to a flat Modelica structure (i. e. a flat model). A class must have additional properties in order that its flat Modelica structure can be further transformed into a set of differential, algebraic and discrete equations (i. e. a flat hybrid DAE). [Mod]...... 58

Modelica Modelica is a free object-oriented multi-domain modeling language for component-oriented modeling...... 16, 98

OpenModelica An open-source Modelica-based modeling and simulation environment intended for industrial and academic usage. . 12, 76, 98

pivoting The pivot element of a row or column of a matrix is the first element selected by an algorithm (e. g. during a Gaussian elimination) before a certain calculation step. Finding this element is called pivoting. In Gaussian elimination with pivoting, usually the element with the highest absolute value is chosen...... 78

preordering Computation of permutation matrices which are applied to the matrix to be factorized before the actual factorization step...... 78

race condition A race condition is a condition where the result of concurrently executed program statements is dependent on the (uncontrollable) execution order of the CPU instructions belonging to the statements...... 136

thread A thread (of execution) is a set of instructions associated with a process. A multi-threaded process has multiple threads. If the computer system allows these threads to run concurrently, the process can benefit from a higher computational power...... 81, 100

thread-safe A part of a program is thread-safe if multiple threads can execute the part concurrently, always generating results as if the threads had executed the part in a sequential order (i. e. one thread executes the whole part, the next thread executes the whole part, and so forth until all threads have finished). The sequential order can vary between the executions of the program...... 136

wall clock time The wall clock time is the time which elapses in reality during the measured process...... 84, 111, 150

List of Figures

1.1 Contribution overview of this work ...... 7

2.1 Exemplary topology including components of (1) all do- mains and (2) domain-specific topologies ...... 19 2.2 Inter-domain connections between classes of power grid, communication network and market ...... 20 2.3 Communication network class association example .... 22 2.4 Overall SINERGIEN architecture for simulation setup .. 23 2.5 Synchronization scheme of simulators at co-simulation time steps ...... 24 2.6 Scheme of runtime interaction between co-simulation com- ponents ...... 25

3.1 Overall concept of the CIM++ project ...... 34 3.2 UML diagram of HydroPowerPlant class which instances can be associated with no more than one Reservoir instance 36 3.3 UML diagram of the class MyASTVisitor ...... 38 3.4 Section of collaboration diagram for BatteryStorage gen- erated by Doxygen on the automated adapted CIM C++ codebase. The entire diagram can be found in [FEIb] ... 52

4.1 Template engine example with HTML code ...... 59 4.2 Overall concept of the CIMverter project ...... 60 4.3 Mapping at second level between CIM and Modelica objects 63 4.4 Connections with zero, one, and two middle points between the endpoints. The endpoints are marked with circles .. 68 4.5 Medium-voltage benchmark grid [Rud+06] converted from CIM to a system model in Modelica based on the ModPow- erSystems and PowerSystems library ...... 72


5.1 Sparsity patterns of benchmark matrices ...... 86 5.2 Total (preprocessing+factorization) times ...... 87 5.3 Preprocessing times ...... 88 5.4 Factorization times ...... 88 5.5 Execution times on generic vs. RT kernel ...... 89 5.6 (Re-)factorization times ...... 90 5.7 NICSLU’s scaling over multiple threads (T ) ...... 91 5.8 Basker’s scaling over multiple threads (T) ...... 91 5.9 Total times with different preorderings ...... 92 5.10 Factorization times with different preorderings ...... 93

6.1 Categories of parallel task scheduling ...... 101 6.2 Example task graph ...... 102 6.3 Example task graph including levels ...... 104 6.4 Schedule for task graph in Fig. 6.2 with p = 2 using level scheduling ...... 104 6.5 Schedule for task graph in Fig. 6.2 with p = 2 using level scheduling considering execution times ...... 105 6.6 Example task graph including b-levels ...... 106 6.7 Schedule for task graph in Fig. 6.2 with p = 2 using Highest Level First with Estimated Times (HLFET) ...... 107 6.8 Example circuit ...... 108 6.9 Task graph resulting from Fig. 6.8 ...... 108 6.10 Western System Coordinating Council (WSCC) 9-bus trans- mission benchmark network ...... 113 6.11 Schematic representation of the connections between system copies ...... 114 6.12 Performance comparison of schedulers for the WSCC 9-bus system ...... 114 6.13 Performance comparison of schedulers for 20 copies of the WSCC 9-bus system ...... 115 6.14 Performance comparison of schedulers for a varying number of copies of the WSCC 9-bus system ...... 116 6.15 Task graph for simulation of the WSCC 9-bus system .. 117 6.16 Performance for a varying number of copies of the WSCC 9-bus system using the decoupled line model ...... 119 6.17 Performance comparison of schedulers for a varying number of copies of the WSCC 9-bus system using the decoupled line model with 8 threads ...... 120 6.18 Performance for a varying number of copies of the WSCC 9-bus system using diakoptics ...... 121


6.19 Performance comparison of schedulers for a varying number of copies of the WSCC 9-bus system using diakoptics with 8 threads ...... 122 6.20 Performance comparison of compilers for 20 copies of the WSCC 9-bus system ...... 123

7.1 NumPy ndarray vs. Python list [Van] ...... 133 7.2 Software architecture of CPython (python command) ... 135 7.3 Software architecture of PyPy (pypy command) ...... 137 7.4 Numba compilation stages ...... 143 7.5 Comparison of Cython with other programming languages 144 7.6 Cython’s workflow for Python module building [Dav] ... 146 7.7 Execution times for Quicksort ...... 152 7.8 Memory consumption (maximum heap peak) for Quicksort 153 7.9 Memory consumption (maximum heap peak) for Quicksort of selected runtime environments ...... 154 7.10 Execution times for PI calculations with multiple threads 155

8.1 Overview of the VILLASframework ...... 159
8.2 Network stack of the InfiniBand Architecture (IBA) ...... 162
8.3 InfiniBand Architecture (IBA) model ...... 163
8.4 InfiniBand data transmission example ...... 164
8.5 An overview of the OFED stack ...... 166
8.6 An example super-node with three paths connecting five nodes of different node-types ...... 168
8.7 General read function working principle in VILLAS ...... 169
8.8 InfiniBand node-type read function working principle ...... 170
8.9 VILLASnode state diagram with newly implemented states ...... 172
8.10 Components of InfiniBand node-type ...... 173
8.11 Computer system with NUMA architecture used for measurements ...... 177
8.12 VILLASnode node-type benchmark working principle ...... 177
8.13 Median latencies etlat over various sample rates ...... 179
8.14 Median latencies etlat over various sample sizes ...... 180
8.15 Median latencies etlat over various sample rates and sample sizes ...... 180
8.16 Median latencies etlat of IB vs. shmem node-type over various sample rates ...... 181
8.17 Median latencies of nanomsg and zeromq node-type over various sample rates via loopback (lo) and physical link ...... 183
8.18 Median latencies of nanomsg and nanomsg via physical links as well as IB and shmem node-type over various sample rates ...... 183


B.1 Execution times for AVL Tree Insertion ...... 195
B.2 Execution times for Dijkstra ...... 196
B.3 Execution times for Gauss-Jordan Elimination ...... 196
B.4 Memory consumption (maximum heap peak) for Cholesky ...... 197
B.5 Memory consumption (maximum heap peak) for Gauss-Jordan Elimination ...... 198
B.6 Memory consumption (maximum heap peak) for Gauss-Jordan Elimination of selected runtime environments ...... 198
B.7 Memory consumption (maximum heap peak) for Matrix-Matrix Multiplication ...... 199
B.8 Memory consumption (maximum heap peak) for Matrix-Matrix Multiplication of selected runtime environments ...... 199

List of Tables

4.1 CIM PowerTransformer to Modelica Workshop Transformer mapping ...... 67
4.2 Excerpt of further important mappings from CIM to ModPowerSystems as implemented in the Modelica Workshop ...... 68
4.3 Excerpt from the numerical results for node phase-to-phase voltage magnitude and angle regarding the medium-voltage benchmark grid ...... 73

5.1 Characteristics of square matrices of size N × N with K nodes, sorted by the number of nonzeros NNZ, and with density factor d = NNZ/(N·N) in % ...... 84
5.2 Total execution times and numbers C of calls of the corresponding routines within the fixed time step solver, with Jacobian JF and residual function vector F ...... 94
5.3 Accumulated execution times for the listed steps of the variable time step solver, with D LU decompositions and a factorization ratio f = #Fact./#Refact. ...... 95

6.1 Overview of the implemented schedulers ...... 109
6.2 Overview of the tested compilers ...... 124


Bibliography

[Abu+18] A. Abusalah et al. “CPU based parallel computation of elec- tromagnetic transients for large power grids”. In: Electric Power Systems Research 162 (Sept. 2018), pp. 57–63. [ACD74] T. L. Adam, K. M. Chandy, and J. Dickson. “A comparison of list schedules for parallel processing systems”. In: Commu- nications of the ACM 17.12 (1974), pp. 685–690. [ADD04] P. R. Amestoy, T. A. Davis, and I. S. Duff. “Algorithm 837: AMD, an Approximate Minimum Degree Ordering Al- gorithm”. In: ACM Trans. Math. Softw. 30.3 (Sept. 2004), pp. 381–388. issn: 0098-3500. doi: 10.1145/1024074.1024081. [Adr19] Adrien Guironnet. GitHub - dynawo/dynawo. 2019. url: htt ps://github.com/dynawo/dynawo (visited on 10/21/2019). [AH11] D. Allemang and J. Hendler. Semantic Web for the Working Ontologist: Effective Modeling in RDFS and OWL. Elsevier, 2011. [Aho03] A. Aho. Compilers: Principles, Techniques and Tools (for Anna University),2/e. Pearson Education, 2003. isbn: 978-8- 13176-234-9. [AIA19] AIAitesla. GitHub - itesla/ipsl. 2019. url: https://github. com/itesla/ipsl (visited on 10/21/2019). [Åke+10] J. Åkesson et al. “Modeling and optimization with Optim- ica and JModelica. org—Languages and tools for solving large-scale dynamic optimization problems”. In: Computers & Chemical Engineering 34.11 (2010), pp. 1737–1749. [Ale01] A. Alexandrescu. Modern C++ design: generic programming and design patterns applied. Addison-Wesley, 2001.


[Anaa] Anaconda, Inc. Notes on Numba Runtime. url: http:// numba . pydata . org / numba - doc / dev / developer / numba - runtime. (visited on 02/09/2020). [Anab] Anaconda, Inc. Numba architecture. url: http : / / numba . pydata . org / numba - doc / dev / developer / architecture . html (visited on 02/10/2020). [Anac] Anaconda, Inc. Numba: Compilation Options. url: http : //numba.pydata.org/numba-doc/latest/user/jit.html# compilation-options (visited on 02/09/2020). [Anad] Anaconda, Inc. Numba: Just-in-Time compilation. url: http: //numba.pydata.org/numba-doc/latest/reference/jit- compilation.html (visited on 02/09/2020). [Anae] Anaconda, Inc. Numba: LoopJitting. url: http://numba. pydata.org/numba-doc/dev/developer/numba-runtime. html (visited on 02/09/2020). [Anaf] Anaconda, Inc. Numba: Numbers. url: http://numba.pydat a.org/numba-doc/latest/reference/types.html#numbers (visited on 02/09/2020). [Anag] Anaconda, Inc. Numba: Supported NumPy features. url: http: //numba.pydata.org/numba-doc/dev/reference/numpysup ported.html (visited on 02/10/2020). [Anah] Anaconda, Inc. Numba: Supported Python features. url: http: //numba.pydata.org/numba-doc/dev/reference/pysuppor ted.html (visited on 02/10/2020). [Anai] Anaconda, Inc. Numba: Why my loop is not vectorized? url: http : / / numba . pydata . org / numba - doc / dev / user / faq . html#does-numba-vectorize-array-computations-simd (visited on 02/10/2020). [Anaj] Anaconda, Inc. Numba: Why my loop is not vectorized? url: http : / / numba . pydata . org / numba - doc / 0 . 30 . 1 / reference/envvars.html#compilation-options (visited on 02/10/2020). [Apa] Apache Jena. Apache Jena - Home. url: http : / / jena . apache.org (visited on 12/23/2019). [Aro06] P. Aronsson. “Automatic Parallelization of Equation-Based Simulation Programs”. PhD thesis. Institutionen för dataveten- skap, 2006.

216 [BCP96] K. Brenan, S. Campbell, and L. Petzold. Numerical Solution of Initial-Value Problems in Differential-Algebraic Equations. Classics in Applied Mathematics. Society for Industrial and Applied Mathematics, 1996. isbn: 9780898713534. [BDD91] T. Berry, A. Daniels, and R. Dunn. “Real time simulation of power system transient behaviour”. In: 1991 Third Interna- tional Conference on Power System Monitoring and Control. IET. 1991, pp. 122–127. [Bea] D. Beazley. Understanding the Python GIL. url: http:// www.dabeaz.com/python/UnderstandingGIL.pdf (visited on 02/09/2020). [Bec] Beckett, Dave. Redland RDF Libraries. url: http://librdf. org (visited on 12/23/2019). [Bec10] D. Becker. “Harmonizing the International Electrotechnical Commission Common Information Model (CIM) and 61850”. In: Electric Power Research Institute (EPRI), Tech. Rep 1020098 (2010). [Beha] S. Behnel. Limitations – Cython 3.0a0 documentation. url: http://www.behnel.de/cython200910/talk.html (visited on 02/10/2020). [Behb] S. Behnel. Using Parallelism – Cython 3.0a0 documentation. url: https://cython.readthedocs.io/en/latest/src/ userguide/parallelism.html (visited on 02/10/2020). [Behc] S. Behnel. Using the Cython Compiler to write fast Python code. url: http : / / www . behnel . de / cython200910 / talk . html (visited on 02/10/2020). [Beh+11] S. Behnel et al. “Cython: The Best of Both Worlds”. In: Computing in Science & Engineering 13.2 (2011), p. 31. [Ben] J. Bennett. An introduction to Python bytecode. url: https: //opensource.com/article/18/4/introduction-python- bytecode (visited on 02/09/2020). [BL15] M. Barros and Y. Labiche. Search-Based Software Engineer- ing: 7th International Symposium, SSBSE 2015, Bergamo, Italy, September 5-7, 2015, Proceedings. Lecture Notes in Computer Science. Springer International Publishing, 2015. isbn: 9783319221830.


[Bol+09] C. F. Bolz et al. “Tracing the Meta-Level: PyPy’s Trac- ing JIT Compiler”. In: Proceedings of the 4th Workshop on the Implementation, Compilation, Optimization of Object- Oriented Languages and Programming Systems. ICOOOLPS ’09. Genova, Italy: Association for Computing Machinery, 2009, pp. 18–25. isbn: 9781605585413. [Bos11] P. Bose. “Power Wall”. In: Encyclopedia of Parallel Computing. Ed. by D. Padua. Boston, MA: Springer US, 2011, pp. 1593– 1608. isbn: 978-0-387-09766-4. url: https://doi.org/10. 1007/978-0-387-09766-4_499 (visited on 12/24/2019). [Bou13] J.-L. Boulanger. Static Analysis of Software: The Abstract Interpretation. John Wiley & Sons, 2013. [Bra+97] T. Bray et al. “Extensible Markup Language (XML)”. In: World Wide Web Journal 2.4 (1997), pp. 27–66. [Bre12] E. Bressert. SciPy and NumPy: An Overview for Developers. O’Reilly Media, 2012. isbn: 9781449361631. url: https:// books.google.de/books?id=c-xzkDMDev0C. [BRT16] J. D. Booth, S. Rajamanickam, and H. Thornquist. “Basker: A Threaded Sparse LU Factorization Utilizing Hierarchical Parallelism and Data Layouts”. In: 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). May 2016, pp. 673–682. doi: 10.1109/IPDPSW. 2016.92. [BS14a] B. Buchholz and Z. Styczynski. Smart Grids: Grundlagen und Technologien der elektrischen Netze der Zukunft. VDE-Verlag, 2014. isbn: 9783800735624. [BS14b] B. M. Buchholz and Z. Styczynski. Smart grids-fundamentals and technologies in electricity networks. Vol. 396. Springer, 2014. [Büt16] F. Bütow. Zeitgeistwandel: Vom Aufbruch der Neuzeit zum Aufbruch ins planetarische Zeitalter. Books on Demand, 2016. isbn: 9783734741074. [BW12] A. Brown and G. Wilson. PyPy. The Architecture of Open Source Applications. Creative Commons, 2012. isbn: 978-1- 10557-181-7. url: http://aosabook.org/en/pypy.html (visited on 02/09/2020).

218 [Cao+15] J. Cao et al. “A flexible model transformation to link BIM with different Modelica libraries for building energy perfor- mance simulation”. In: Proceedings of the 14th IBPSA Con- ference. 2015. [Cara] A. Carattino. Mutable and Immutable Objects. url: https:// www.pythonforthelab.com/blog/mutable-and-immutable- objects/ (visited on 02/09/2020). [Carb] C. Carey. Why Python is Slow: Looking Under the Hood | Pythonic Perambulations. url: https://github.com/cython /cython/wiki/enhancements-compilerdirectives (visited on 02/10/2020). [Cas13] F. Casella. “A Strategy for Parallel Simulation of Declara- tive Object-Oriented Models of Generalized Physical Net- works”. In: Proceedings of the 5th International Workshop on Equation-Based Object-Oriented Modeling Languages and Tools; April 19; University of Nottingham; Nottingham; UK. 084. Linköping University Electronic Press. 2013, pp. 45–51. [Cas19] S. Cass. “The 2018 Top Programming Languages”. In: IEEE Spectrum (2019). url: https://spectrum.ieee.org/at- work/innovation/the-2018-top-programming-languages (visited on 12/10/2019). [CDZ05] D. Crupnicoff, S. Das, and E. Zahavi. Deploying Quality of Service and Congestion Control in InfiniBand-based Data Center Networks. Tech. rep. 2379. 2005. [Cha+08] B. Chapman et al. Using OpenMP: Portable Shared Mem- ory Parallel Programming. Scientific Computation Series. Books24x7.com, 2008. isbn: 9780262533027. [Che+15] X. Chen et al. “GPU-Accelerated Sparse LU Factorization for Circuit Simulation with Performance Modeling”. In: IEEE Transactions on Parallel and Distributed Systems 26.3 (Mar. 2015), pp. 786–795. doi: 10.1109/tpds.2014.2312199. [Chu01] W. Chun. Core Python Programming. Vol. 1. Prentice Hall Professional, 2001. [CIM] CIM User Group. Home - CIMug. url: http://cimug.ucai ug.org (visited on 12/22/2019). [Cla] Clang community. Clang C Language Family Frontend for LLVM. url: https://clang.llvm.org (visited on 12/22/2019).


[Coh] O. Cohen. Is your Numpy optimized for speed? url: https: //towardsdatascience.com/is-your-numpy-optimized- for-speed-c1d2b2ba515 (visited on 02/09/2020). [Com97] Compaq, Intel, Microsoft. Virtual Interface Architecture Spec- ification. Version 1.0. Compaq, Intel, Microsoft. Dec. 1997. [Con07] Congress, 110th United States. Energy Independence and Security Act of 2007. 2007. url: https://www.govinfo.gov/ content/pkg/PLAW-110publ140/html/PLAW-110publ140. htm (visited on 10/21/2019). [Cor+01] T. Cormen et al. Introduction To Algorithms. MIT Press, 2001. isbn: 9780262032933. [CRS+11] CRSA et al. D4.1: Algorithmic requirements for simulation of large network extreme scenarios. Tech. rep. tech. rep., PE- GASE Consortium, 2011. [Cum] M. Cumming. libxml++ – An XML Parser for C++. url: http://libxmlplusplus.sourceforge.net (vis- ited on 12/23/2019). [Cun] A. Cuni. PyPy Status Blog. url: https://morepypy.blogsp ot.com/2018/09/inside-cpyext-why-emulating-cpython- c.html (visited on 02/10/2020). [Cun10] A. Cuni. “High performance implementation of Python for CLI/.NET with JIT compiler generation for dynamic lan- guages”. PhD thesis. Dipartimento di Informatica e Scienze dell’Informazione, 2010. [CWY12] X. Chen, Y. Wang, and H. Yang. “An Adaptive LU Factor- ization Algorithm for Parallel Circuit Simulation”. In: 17th Asia and South Pacific Design Automation Conference. Jan. 2012, pp. 359–364. doi: 10.1109/ASPDAC.2012.6164974. [CWY13] X. Chen, Y. Wang, and H. Yang. “NICSLU: An Adaptive Sparse Matrix Solver for Parallel Circuit Simulation”. In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 32.2 (Feb. 2013), pp. 261–274. doi: 10.1109/tcad.2012.2217964. [Cyt] Cython community. Cython: C-Extensions for Python. url: https://cython.org (visited on 12/11/2019). [Dal] L. Dalcin. MPI for Python – MPI for Python 3.0.3 documen- tation. url: https://mpi4py.readthedocs.io/en/stable/ (visited on 02/10/2020).

220 [Dav] DavidBrooksPokorny. Cython. url: https://en.wikipedia. org / wiki / Cython # /media / File : Cython _ CPython _ Ext _ Module_Workflow.png (visited on 02/10/2020). [Dav+04] T. A. Davis et al. “Algorithm 836: COLAMD, a Column Approximate Minimum Degree Ordering Algorithm”. In: ACM Trans. Math. Softw. 30.3 (Sept. 2004), pp. 377–380. issn: 0098-3500. [Dav03] F. S. David. Model driven architecture: applying MDA to enterprise computing. 2003. [Daw] Dawes, Beman. Filesystem Home - Boost.org. url: http:// www.boost.org/libs/filesystem (visited on 12/23/2019). [Deba] Debian Wiki team. Hugepages – Debian Wiki. url: https: //wiki.debian.org/Hugepages (visited on 02/14/2020). [Debb] A. Debrie. Python Garbage Collection: What It Is and How It Works. url: https://stackify.com/python-garbage- collection/ (visited on 02/09/2020). [Die07] S. Diehl. Software visualization: visualizing the structure, be- haviour, and evolution of software. Springer Science & Busi- ness Media, 2007. [Dig] Digi International Inc. Python garbage collection. url: ht tps://www.digi.com/resources/documentation/digido cs/90001537/references/r_python_garbage_coll.htm (visited on 02/09/2020). [Din+18] J. Dinkelbach et al. “Hosting Capacity Improvement Unlocked by Control Strategies for Photovoltaic and Battery Storage Systems”. In: 2018 Power Systems Computation Conference (PSCC). IEEE. 2018, pp. 1–7. [DK12] P. Dutta and M. Kezunovic. “Unified representation of data and model for sparse measurement based fault location”. In: Power and Energy Society General Meeting, 2012 IEEE. IEEE. 2012, pp. 1–8. [DM95] F.-N. Demers and J. Malenfant. “Reflection in logic, func- tional and object-oriented programming: a short comparative study”. In: Proceedings of the IJCAI. Vol. 95. 1995, pp. 29–38. [DMS14] E. B. Duffy, B. A. Malloy, and S. Schaub. “Exploiting the Clang AST for Analysis of C++ Applications”. In: Proceedings of the 52nd Annual ACM Southeast Conference. 2014.


[DP10] T. A. Davis and E. Palamadai Natarajan. “Algorithm 907: KLU, A Direct Sparse Solver for Circuit Simulation Problems”. In: ACM Trans. Math. Softw. 37.3 (Sept. 2010), 36:1–36:17. issn: 0098-3500. doi: 10.1145/1824801.1824814. [DR08] W. Dahmen and A. Reusken. Numerik für Ingenieure und Naturwissenschaftler. Springer-Lehrbuch. Springer Berlin Hei- delberg, 2008. isbn: 9783540764939. [DRM20] S. Dähling, L. Razik, and A. Monti. “OWL2Go: Auto-genera- tion of Go data models for OWL ontologies with integrated serialization and deserialization functionality”. In: To appear in SoftwareX (2020). [Dun+98] D. Dunning et al. “The Virtual Interface Architecture”. In: IEEE micro 18.2 (Mar. 1998), pp. 66–76. issn: 0272-1732. [Eat19] J. Eaton. GNU Octave. 2019. url: https://www.gnu.org/ software/octave (visited on 11/25/2019). [ecm19] ecma International. Standard ECMA-404 – The JSON Data Interchange Syntax. 2019. url: http://www.ecma-interna tional.org/publications/files/ECMA-ST/ECMA-404.pdf (visited on 10/23/2019). [ECT17] D. Efnusheva, A. Cholakoska, and A. Tentov. “A Survey OF Different Approaches for Overcoming the Processor-Memory Bottleneck”. In: International Journal of Computer Science & Information Technolog (2017). doi: 10.5121/ijcsit.2017. 9214. [EFF15] L. Exel, F. Felgner, and G. Frey. “Multi-domain modeling of distributed energy systems-The MOCES approach”. In: Smart Grid Communications (SmartGridComm), 2015 IEEE International Conference on. IEEE. 2015, pp. 774–779. [Eig19] Eigen Developers. Eigen. Aug. 2019. url: http://eigen. tuxfamily.org (visited on 10/21/2019). [Ell+01] J. Ellson et al. “Graphviz—open source graph drawing tools”. In: International Symposium on Graph Drawing. Springer. 2001, pp. 483–484. [ENT] ENTSO-E. COMMON INFORMATION MODEL (CIM) – MODEL EXCHANGE PROFILE 1. url: https://docstore. entsoe.eu/Documents/CIM_documents/Grid_Model_CIM/ 140610 _ ENTSO - E _ CIM _ Profile _ v1 _ UpdateIOP2013 . pdf (visited on 05/13/2018).

222 [ENT16] ENTSO-E. Common Grid Model Exchange Specification (CGMES) – Version 2.5. 2016. url: https : / / docstore . entsoe.eu/Documents/CIM_documents/IOP/CGMES_2_5_ TechnicalSpecification_61970-600_Part%201_Ed2.pdf (visited on 11/21/2019). [Eur] European Grid Infrastructure community. Glossary V1 - EGI- Wiki. url: https://wiki.egi.eu/wiki/Glossary_V1#High_ Throughput_Computing (visited on 12/24/2019). [Fab+11] D. Fabozzi et al. “On simplified handling of state events in time-domain simulation”. In: Proc. of the 17th Power System Computation Conference PSCC. 2011. [Far+15] M. O. Faruque et al. “Real-Time Simulation Technologies for Power Systems Design, Testing, and Analysis”. In: IEEE Power and Energy Technology Systems Journal 2.2 (2015), pp. 63–73. [FC09] D. Fabozzi and T. V. Cutsem. “Simplified time-domain simu- lation of detailed long-term dynamic models”. In: 2009 IEEE Power & Energy Society General Meeting. IEEE, July 2009. [FEIa] FEIN Aachen e. V. DistAIX – Scalable simulation of cyber- physical power distribution systems. url: https://fein-aac hen.org/en/projects/distaix/ (visited on 12/26/2019). [FEIb] FEIN Aachen e. V. Doxygen generated webpages of CIM++ Adapted CIM_SINERGIEN Codebase: BatteryStorage Class Reference. url: https://cim.fein-aachen.org/libcimpp/ doc/IEC61970_16v29a_SINERGIEN_20170324//classSinerg ien_1_1EnergyGrid_1_1EnergyStorage_1_1BatteryStorag e.html (visited on 12/23/2019). [FEIc] FEIN Aachen e. V. Doxygen generated webpages of CIM++ Adapted CIM_SINERGIEN Codebase: PowerTransformer Class Reference. url: http : / / cim . fein - aachen . org / libcimpp/doc/IEC61970_16v29a_IEC61968_12v08/class IEC61970_1_1Base_1_1Wires_1_1PowerTransformer.html (visited on 05/31/2018). [FEId] FEIN Aachen e. V. Doxygen generated webpapes of CIM++ Adapted CIM_SINERGIEN Codebase. url: https://cim. fein-aachen.org/libcimpp/doc/IEC61970_16v29a_SINERG IEN_20170324/ (visited on 12/23/2019).


[FEIe] FEIN Aachen e. V. IEC61970 16v29a - IEC61968 12v08: Class List. url: https://cim.fein-aachen.org/libcimpp/ doc/IEC61970_16v29a_IEC61968_12v08/annotated.html (visited on 12/26/2019). [FEIf] FEIN Aachen e. V. VILLAS. url: https://villas.fein- aachen.org/website (visited on 02/14/2020). [FEIg] FEIN Aachen e. V. VILLASframework: Node-types. url: h ttps://villas.fein-aachen.org/doc/node-types.html (visited on 02/14/2020). [FEI19a] FEIN Aachen e.V. CIM++. 2019. url: https : / / www . f ein-aachen.org/projects/modpowersystems (visited on 11/21/2019). [FEI19b] FEIN Aachen e.V. ModPowerSystems. 2019. url: https:// www.fein-aachen.org/projects/modpowersystems (visited on 10/21/2019). [Fin+08] R. Finocchiaro et al. “ETHOS, a generic Ethernet over Sock- ets Driver for Linux”. In: Proceedings of the 20th IASTED International Conference. Vol. 631. 017. 2008, p. 239. [Fin+09a] R. Finocchiaro et al. “ETHOM, an Ethernet over SCI and DX Driver for Linux”. In: Proceedings of 2009 International Conference of Parallel and Distributed Computing (ICPDC 2009), London, UK. 2009. [Fin+09b] R. Finocchiaro et al. “Low-Latency Linux Drivers for Ethernet over High-Speed Networks”. In: IAENG International Journal of Computer Science 36.4 (2009). [Fin+10] R. Finocchiaro et al. “Transparent Integration of a Low- Latency Linux Driver for Dolphin SCI and DX”. In: Electronic Engineering and Computing Technology. Ed. by S.-I. Ao and L. Gelman. Dordrecht: Springer Netherlands, 2010, pp. 539– 549. isbn: 978-90-481-8776-8. doi: 10.1007/978-90-481- 8776-8_46. [FM09] M. Foord and C. Muirhead. IronPython in Action. Manning Pubs Co Series. Manning, 2009. isbn: 9781933988337. url: http://www.voidspace.org.uk/python/articles/duck_ typing.shtml#duck-typing (visited on 02/09/2020). [FO08] H.-r. Fang and D. P. O’Leary. “Modified Cholesky algorithms: a catalog with new approaches”. In: Mathematical Program- ming 115.2 (Oct. 2008), pp. 319–349. issn: 1436-4646.

224 [Fou+07] L. Fousse et al. “MPFR: A Multiple-Precision Binary Floating- Point Library With Correct Rounding”. In: ACM Transactions on Mathematical Software (TOMS) 33.2 (2007), p. 13. [Fow02] M. Fowler. Patterns of Enterprise Application Architecture. Addison-Wesley Longman Publishing Co., Inc., 2002. [Fra19] Fraunhofer IEE and University of Kassel Revision. PYPOWER. 2019. url: https://pandapower.readthedocs.io (visited on 11/25/2019). [Fre+09] J. Fremont et al. “CIM extensions for ERDF information sys- tem projects”. In: Power & Energy Society General Meeting, 2009. PES’09. IEEE. IEEE. 2009, pp. 1–5. [Fri+06] P. Fritzson et al. “OpenModelica-A free open-source envi- ronment for system modeling, simulation, and teaching”. In: Computer Aided Control System Design, 2006 IEEE Inter- national Conference on Control Applications, 2006 IEEE International Symposium on Intelligent Control, 2006 IEEE. IEEE. 2006, pp. 1588–1595. [Fri15a] P. Fritzson. Principles of Object-Oriented Modeling and Sim- ulation with Modelica 3.3: A Cyber-Physical Approach. Wiley, 2015. isbn: 9781118859162. [Fri15b] P. A. Fritzson. Principles of object oriented modeling and simulation with Modelica 3.3. 2nd ed. Hoboken: John Wiley & Sons, 2015. [Fri16] J. Friesen. Java XML and JSON. Apress, 2016. [FW14] R. Franke and H. Wiesmann. “Flexible modeling of electri- cal power systems–the Modelica PowerSystems library”. In: Proceedings of the 10 th International Modelica Conference; March 10-12; 2014; Lund; Sweden. 096. Linköping University Electronic Press. 2014, pp. 515–522. [Gag+19] F. Gagliardi et al. “The international race towards Exas- cale in Europe”. In: CCF Transactions on High Performance Computing (2019), pp. 1–11. [GCC] GCC team. GCC, the GNU Compiler Collection. url: https: //gcc.gnu.org (visited on 12/22/2019). [GDD+06] D. Ga, D. Djuric, V. Deved, et al. Model Driven Architec- ture and Ontology Development. Springer Science & Business Media, 2006.


[Geb+12] M. Gebremedhin et al. “A Data-Parallel Algorithmic Mod- elica Extension for Efficient Execution on Multi-Core Plat- forms”. In: Proceedings of the 9th International MODEL- ICA Conference; September 3-5; 2012; Munich; Germany. 76. Linköping University Electronic Press; Linköpings universitet, 2012, pp. 393–404. [Geo73] A. George. “Nested Dissection of a Regular Finite Element Mesh”. In: SIAM Journal on Numerical Analysis 10.2 (1973), pp. 345–363. [Ger] German Aerospace Center (DLR). DLR – Simulation and Software Technology – 8th Workshop on Python for High- Performance and Scientific Computing. url: https://www. dlr.de/sc/en/desktopdefault.aspx/tabid-12954/22625_ read-52397/ (visited on 12/11/2019). [Gir] C. Giridhar. Understanding Python GIL. url: https://callh ub.io/understanding-python-gil/ (visited on 02/09/2020). [Glo] A. Gloubin. Garbage collection in Python: things you need to know. url: https://rushter.com/blog/python-garbage- collector (visited on 02/09/2020). [Gra69] R. L. Graham. “Bounds on multiprocessing timing anoma- lies”. In: SIAM journal on Applied Mathematics 17.2 (1969), pp. 416–429. [Gre+16] F. Gremse et al. “GPU-accelerated adjoint algorithmic differ- entiation”. In: Computer Physics Communications 200 (2016), pp. 300–311. issn: 0010-4655. doi: 10.1016/j.cpc.2015.10. 027. [Gui+18] A. Guironnet et al. “Towards an open-source solution using Modelica for time-domain simulation of power systems”. In: Proc. 8th IEEE PES ISGT Europe. Sarajevo, Bosnia and Herzegovina, Oct. 2018. [Haq+11] E. Haq et al. “Use of Common Information Model (CIM) in electricity market at California ISO”. In: Power and Energy Society General Meeting, 2011 IEEE. IEEE. 2011, pp. 1–6. [Har+12] W. E. Hart et al. Pyomo – Optimization Modeling in Python. Vol. 67. Springer, 2012. [Hee] D. van Heesch. Doxygen: Main Page. url: http : / / www . doxygen.org (visited on 12/23/2019).

226 [Heg+01] P. Heggernes et al. The Computational Complexity of the Minimum Degree Algorithm. Tech. rep. Lawrence Livermore National Lab., CA (US), 2001. [Hen] Henney, Kevlin. Chapter 5. Boost.Any. url: http : / / ww w . boost . org / doc / libs / release / libs / any (visited on 12/23/2019). [Hig] J. Higgins. Arabica. url: https://github.com/RWTH-ACS/ arabica (visited on 12/23/2019). [Hin+05] A. C. Hindmarsh et al. “SUNDIALS: Suite of Nonlinear and Differential/Algebraic Equation Solvers”. In: ACM Trans. Math. Softw. 31.3 (Sept. 2005), pp. 363–396. issn: 0098-3500. doi: 10.1145/1089014.1089020. [Hop+06] K. Hopkinson et al. “EPOCHS: a platform for agent-based electric power and communication simulation built from com- mercial off-the-shelf components”. In: IEEE Transactions on Power Systems 21.2 (2006), pp. 548–558. [HR07] S. C. Haw and G. R. K. Rao. “A Comparative Study and Benchmarking on XML Parsers”. In: Advanced Communica- tion Technology, The 9th International Conference on. Vol. 1. IEEE. 2007, pp. 321–325. [HSC19] A. C. Hindmarsh, R. Serban, and A. Collier. User Documen- tation for IDA v4.1.0. 2019. url: https://computing.llnl. gov/sites/default/files/public/ida_guide.pdf (visited on 10/21/2019). [IEC] IEC. IEC Smart Grid - IEC Standards. url: http://www. iec.ch/smartgrid/standards (visited on 12/22/2019). [IEC06] IEC. IEC 61970-501:2006 Energy management system appli- cation program interface (EMS-API) – Part 501: Common Information Model Resource Description Framework (CIM RDF) schema. 2006. [IEC12a] IEC. IEC 61968-11:2013 Application integration at electric utilities - System interfaces for distribution management – Part 11: Common information model (CIM) extensions for distribution. 2012. [IEC12b] IEC. IEC 61970-301:2012 Energy management system appli- cation program interface (EMS-API) – Part 301: Common Information Model (CIM) base. 2012.


[IEC14] IEC. IEC 62325-301:2014 Framework for energy market com- munications – Part 301: Common information model (CIM) extensions for markets. 2014. [IEC16a] IEC. IEC 61970-552:2016 Energy management system appli- cation program interface (EMS-API) - Part 552: CIMXML Model exchange format. 2016. [IEC16b] IEC. IEC/TR 62357-1:2016 Power systems management and associated information exchange - Part 1: Reference architec- ture. 2016. [IEC17] IEC. IEC TS 62361-102 ED1 Power systems management and associated information exchange - Interoperability in the long term - Part 102: CIM - IEC 61850 harmonization. 2017. [IEE18] IEEE and The Open Group. The Open Group Base Specifi- cations Issue 7 – IEEE Std 1003.1, 2018 Edition. New York, NY, USA: IEEE, 2018. url: http://pubs.opengroup.org/ onlinepubs/9699919799. [Inf07] InfiniBand Trade Association. InfiniBand Architecture Specifi- cation, Volume 1. Release 1.2.1. InfiniBand Trade Association et al. Nov. 2007. [Inf16] InfiniBand Trade Association. InfiniBand Architecture Specifi- cation Volume 2. Release 1.3.1. InfiniBand Trade Association et al. Nov. 2016. [Int] Intel Corporation. Intel C++ Compiler. url: https : / / software . intel . com / en - us / c - compilers (visited on 12/22/2019). [ISO14] ISO. ISO/IEC JTC 1/SC 22/WG 21 N4100 Programming Languages – C++ – File System Technical Specification. 2014. [Jos12] N. Josuttis. The C++ Standard Library: A Tutorial and Reference. Addison-Wesley, 2012. isbn: 9780321623218. [KA99] Y.-K. Kwok and I. Ahmad. “Static Scheduling Algorithms for Allocating Directed Task Graphs to Multiprocessors”. In: ACM Comput. Surv. 31.4 (Dec. 1999), pp. 406–471. issn: 0360-0300. [Kas17] S. Kaster. Runtime Analysis of Python Programs. 2017. [Ker] Kernel development community. Networking – The Linux Kernel documentation. url: https://linux-kernel-lab s.github.io/master/labs/networking.html (visited on 02/14/2020).

228 [Ker10] M. Kerrisk. The Linux Programming Interface: a Linux and UNIX System Programming Handbook. No Starch Press, 2010. isbn: 978-1-59327-220-3. [KH02] J. Kovse and T. Härder. “Generic XMI-based UML model transformations”. In: Object-Oriented Information Systems (2002), pp. 183–190. [KH14] G. Krüger and H. Hansen. Java-Programmierung – Das Hand- buch zu Java 8. O’Reilly Germany, 2014. [Kha+18] S. Khayyamim et al. “Railway System Energy Management Optimization Demonstrated at Offline and Online Case Stud- ies”. In: IEEE Transactions on Intelligent Transportation Systems 19.11 (Nov. 2018), pp. 3570–3583. issn: 1524-9050. doi: 10.1109/TITS.2018.2855748. [KK04] W. Kocay and D. Kreher. Graphs, Algorithms, and Optimiza- tion. Discrete Mathematics and Its Applications. CRC Press, 2004. isbn: 978-0-20348-905-5. [KK95] G. Karypis and V. Kumar. METIS – Unstructured Graph Partitioning and Sparse Matrix Ordering System, Version 2.0. Tech. rep. University of Minnesota, Department of Computer Science, 1995. [Kle] B. Klein. Python3-Tutorial: Parameterübergabe. url: https: //www.python-kurs.eu/python3_parameter.php (visited on 02/09/2020). [KMS92] M. S. Khaira, G. L. Miller, and T. J. Sheffler. Nested Dissec- tion: A survey and comparison of various nested dissection algorithms. Carnegie-Mellon University. Department of Com- puter Science, 1992. [Kol+02] R. Kollmann et al. “A Study on the Current State of the Art in Tool-Supported UML-Based Static Reverse Engineering”. In: Reverse Engineering, 2002. Proceedings. Ninth Working Conference on. IEEE. 2002, pp. 22–32. [Kol+18] S. Kolen et al. “Enabling the Analysis of Emergent Behavior in Future Electrical Distribution Systems Using Agent-Based Modeling and Simulation”. In: Complexity 2018 (2018). [Kor09] R. E. Korf. “Multi-Way Number Partitioning”. In: Twenty- First International Joint Conference on Artificial Intelligence. 2009.


[KWS07] U. Kastens, W. M. Waite, and A. M. Sloane. Generating Software from Specifications. Jones & Bartlett Learning, 2007. [Lar+09] S. Larsen et al. “Architectural breakdown of end-to-end la- tency in a TCP/IP network”. In: International Journal of Parallel Programming 37.6 (Dec. 2009), pp. 556–571. issn: 1573-7640. doi: 10.1007/s10766-009-0109-6. [LE] T. Lefebvre and H. Englert. IEC TC57 Power system man- agement and associated information exchange. url: https: //www.iec.ch/resources/tcdash/Poster_IEC_TC57.pdf (visited on 02/22/2020). [Lee+15] B. Lee et al. “Unifying data types of IEC 61850 and CIM”. In: IEEE Transactions on Power Systems 30.1 (2015), pp. 448– 456. [Li+14] W. Li et al. “Cosimulation for Smart Grid Communications”. In: IEEE Transactions on Industrial Informatics 10.4 (2014), pp. 2374–2384. [Lin] R. Lincoln. PyCIM – Python implementation of the Common Information Model. url: https://github.com/rwl/pycim (visited on 12/23/2019). [Lin+12] H. Lin et al. “GECO: Global event-driven co-simulation frame- work for interconnected power system and communication network”. In: IEEE Transactions on Smart Grid 3.3 (2012), pp. 1444–1456. [Lin19a] R. Lincoln. PYPOWER. 2019. url: https://pypi.org/ project/PYPOWER/ (visited on 11/25/2019). [Lin19b] R.-T. Linux. realtime:start [Wiki]. 2019. url: https : / / wiki.linuxfoundation.org /realtime/start (visited on 10/21/2019). [LK17] B. Lee and D.-K. Kim. “Harmonizing IEC 61850 and CIM for connectivity of substation automation”. In: Computer Standards & Interfaces 50 (2017), pp. 199–208. [LLV] LLVM Foundation. The LLVM Compiler Infrastructure Project. url: http://www.llvm.org (visited on 12/23/2019). [LPS15] S. K. Lam, A. Pitrou, and S. Seibert. “Numba: a LLVM-based Python JIT compiler”. In: Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC. ACM. 2015, p. 7.

230 [Lun+09] H. Lundvall et al. “Automatic Parallelization of Simulation Code for Equation-based Models with Software Pipelining and Measurements on Three Platforms”. In: SIGARCH Comput. Archit. News 36.5 (June 2009), pp. 46–55. issn: 0163-5964. [Mae12] K. Maeda. “Performance Evaluation of Object Serialization Libraries in XML, JSON and Binary Formats”. In: Digital Information and Communication Technology and it’s Appli- cations (DICTAP), 2012 Second International Conference on. IEEE. 2012, pp. 177–182. [Man] Man-Pages Authors. memusage(1) - Linux manual page. url: http://man7.org/linux/man-pages/man1/memusage.1. html (visited on 02/10/2020). [MAT19] MATPOWER Developers. GNU Octave. 2019. url: https: //matpower.org (visited on 11/25/2019). [McM07] A. W. McMorran. “An Introduction to IEC 61970-301 & 61968-11: The Common Information Model”. In: University of Strathclyde 93 (2007), p. 124. [MDC09] A. Mercurio, A. Di Giorgio, and P. Cioci. “Open-Source Implementation of Monitoring and Controlling Services for EMS/SCADA Systems by Means of Web Services – IEC 61850 and IEC 61970 Standards”. In: IEEE Transactions on Power Delivery 24.3 (2009), pp. 1148–1153. [Mel18] Mellanox Technologies. Mellanox OFED for Linux User Man- ual. 2877. Rev 4.3. Mellanox Technologies. Mar. 2018. [Min] S. Mingshen. Getting Started. url: http://mesapy.org/rp ython-by-example/getting-started/index.html (visited on 02/09/2020). [Mir+17] M. Mirz et al. “Dynamic phasors to enable distributed real- time simulation”. In: 2017 6th International Conference on Clean Electrical Power (ICCEP). June 2017, pp. 139–144. [Mir+18] M. Mirz et al. “A Cosimulation Architecture for Power System, Communication, and Market in the Smart Grid”. In: Hindawi Complexity 2018 (Feb. 2018). doi: 10.1155/2018/7154031. [Mir+19] M. Mirz et al. “DPsim—A dynamic phasor real-time simulator for power systems”. In: SoftwareX 10 (2019), p. 100253. issn: 2352-7110. doi: https://doi.org/10.1016/j.softx.2019. 100253. url: http://www.sciencedirect.com/science/ article/pii/S2352711018302760.


[Mir20] M. Mirz. “A Dynamic Phasor Real-Time Simulation Based Digital Twin for Power Systems”. PhD thesis. RWTH Aachen University, 2020. [MMS13] N. V. Mago, J. D. Moseley, and N. Sarma. “A methodol- ogy for modeling telemetry in power systems models using IEC-61968/61970”. In: Innovative Smart Grid Technologies- Asia (ISGT Asia), 2013 IEEE. IEEE. 2013, pp. 1–6. [MNM16] M. Mirz, L. Netze, and A. Monti. “A multi-level approach to power system Modelica models”. In: Control and Modeling for Power Electronics (COMPEL), 2016 IEEE 17th Workshop on. IEEE. 2016, pp. 1–7. [Mod] Modelica Association. Introduction – Modelica Language Spec- ification 3.3 Revision 1. url: https : / / modelica . readt hedocs . io / en / latest / introduction . html (visited on 12/26/2019). [Mol+14] C. Molitor et al. “MESCOS–A Multienergy System Cosimula- tor for City District Energy Systems”. In: IEEE Transactions on Industrial Informatics 10.4 (2014), pp. 2247–2256. [Mon] Montaigne, Michel de. Native datatypes – Dive Into Python 3. url: https://diveintopython3.net/native-datatypes. html (visited on 02/09/2020). [Mon+18] A. Monti et al. “A Global Real-Time Superlab: Enabling High Penetration of Power Electronics in the Electric Grid”. In: IEEE Power Electronics Magazine 5.3 (Sept. 2018), pp. 35– 44. [MPT78] M. D. McIlroy, E. Pinson, and B. Tague. “UNIX Time-Sharing System: Foreword”. In: Bell Labs Technical Journal 57.6 (1978), pp. 1899–1904. [MR12] P. MacArthur and R. D. Russell. “A Performance Study to Guide RDMA Programming Decisions”. In: High Perfor- mance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems (HPCC-ICESS), 2012 IEEE 14th International Conference on. IEEE, 2012, pp. 778–785. [Mül03] M. S. Müller. “An OpenMP Compiler Benchmark”. In: Sci- entific Programming 11.2 (2003), pp. 125–131.

232 [MV11] A. Meister and C. Vömel. Numerik linearer Gleichungssys- teme: Eine Einführung in moderne Verfahren. Mit MAT- LAB®-Implementierungen von C. Vömel. Vieweg+Teubner Verlag, 2011. isbn: 9783834881007. [Nic+00] U. A. Nickel et al. “Roundtrip engineering with FUJABA”. In: Proceedings of the 2nd Workshop on Software-Reengineering (WSR), August. Citeseer. 2000. [Numa] Numba community. Numba: A High Performance Python Compiler. url: http : / / numba . pydata . org (visited on 12/11/2019). [Numb] NumPy developers. NumPy. url: https://numpy.org (vis- ited on 12/11/2019). [Ope19a] OpenModelica Developers. Major OpenModelica Releases. 2019. url: https://www.openmodelica.org/doc/OpenMo delicaUsersGuide/latest/tracreleases.html#release- notes-for-openmodelica-1-11-0 (visited on 10/21/2019). [Ope19b] OpenMP Architecture Review Board. Home – OpenMP. 2019. url: https://www.openmp.org (visited on 10/21/2019). [Pan09] J. Z. Pan. “Resource Description Framework”. In: Handbook on Ontologies. Springer, 2009, pp. 71–90. [Par04] T. J. Parr. “Enforcing strict model-view separation in tem- plate engines”. In: Proceedings of the 13th international con- ference on World Wide Web. ACM. 2004, pp. 224–233. [Pet82] L. Petzold. Description of DASSL: a differential/algebraic system solver. Tech. rep. Sandia National Labs., Livermore, CA (USA), Sept. 1982. [Pfi01] G. F. Pfister. “An Introduction to the Infiniband Architec- ture”. In: High Performance Mass Storage and Parallel I/O 42 (2001), pp. 617–632. [Pic+16] S. Pickartz et al. “Migrating LinuX Containers Using CRIU”. In: High Performance Computing. Ed. by M. Taufer, B. Mohr, and J. M. Kunkel. Cham: Springer International Publishing, 2016, pp. 674–684. isbn: 978-3-319-46079-6. [Pot18] D. Potter. Implementation and Analysis of an InfiniBand based Communication in a Real-Time Co-Simulation Frame- work. 2018.


[Pra+11] Y. Pradeep et al. “CIM-Based Connectivity Model for Bus- Branch Topology Extraction and Exchange”. In: IEEE Trans- actions on Smart Grid 2.2 (June 2011), pp. 244–253. issn: 1949-3061. doi: 10.1109/TSG.2011.2109016. [Pre12] J. Preshing. A Look Back at Single-Threaded CPU Perfor- mance. 2012. url: https : / / preshing . com / 20120208 / a - look-back-at-single-threaded-cpu-performance/ (vis- ited on 10/21/2019). [Pug16] J. F. Puget. A Speed Comparison Of C, Julia, Python, Numba, and Cython on LU Factorization. 2016. url: https://www. ibm.com/developerworks/community/blogs/jfp/entry/ A_Comparison_Of_C_Julia_Python_Numba_Cython_Scip y_and_BLAS_on_LU_Factorization?lang=en (visited on 02/10/2020). [PyPa] PyPy community. Bytecode Interpreter. url: http://doc. pypy.org/en/latest/interpreter.html#introduction- and-overview (visited on 02/09/2020). [PyPb] PyPy community. Garbage Collection in PyPy. url: https: //doc.pypy.org/en/release-2.4.x/garbage_collection. html (visited on 02/09/2020). [PyPc] PyPy community. Goals and Architecture Overview. url: http://doc.pypy.org/en/latest/architecture.html#id1 (visited on 02/09/2020). [PyPd] PyPy community. Incminimark. url: https://doc.pypy. org/en/latest/gc_info.html#incminimark (visited on 02/09/2020). [PyPe] PyPy community. RPython Documentation. url: https:// rpython.readthedocs.io/en/latest/index.html#index (visited on 02/09/2020). [Pyta] Python Software Foundation. array — Efficient arrays of numeric values. url: https://docs.python.org/3/library/ array.html#module-array (visited on 02/09/2020). [Pytb] Python Software Foundation. CPython. url: https://www. python.org (visited on 12/12/2019). [Pytc] Python Software Foundation. multiprocessing – Process-based parallelism. url: https://docs.python.org/3.6/library/ multiprocessing.html (visited on 02/09/2020).

234 [Pytd] Python Software Foundation. Python Software Foundation: Press Release 20-Dec-2019. url: https://www.python.org/ psf/press-release/pr20191220/ (visited on 02/09/2020). [Pyte] Python Software Foundation. threading – Thread-based par- allelism. url: https://docs.python.org/3.6/library/ threading.html (visited on 02/09/2020). [Qui03] M. Quinn. Parallel Programming in C with MPI and OpenMP. McGraw-Hill, 2003. isbn: 9780071232654. [Ray03] E. S. Raymond. The art of Unix programming. Addison- Wesley Professional, 2003. [Raz+18a] L. Razik et al. “Automated deserializer generation from CIM ontologies: CIM++—an easy-to-use and automated adaptable open-source library for object deserialization in C++ from documents based on user-specified UML models following the Common Information Model (CIM) standards for the energy sector”. In: Computer Science - Research and Development 33.1 (Feb. 2018), pp. 93–103. issn: 1865-2042. doi: 10.1007/ s00450-017-0350-y. [Raz+18b] L. Razik et al. “CIMverter—a template-based flexibly ex- tensible open-source converter from CIM to Modelica”. In: Energy Informatics 1.1 (Oct. 2018), p. 47. issn: 2520-8942. doi: 10.1186/s42162-018-0031-5. [Raz+19a] L. Razik et al. “A comparative analysis of LU decomposition methods for power system simulations”. In: 2019 IEEE Milan PowerTech. June 2019, pp. 1–6. [Raz+19b] L. Razik et al. “REM-S-–Railway Energy Management in Real Rail Operation”. In: IEEE Transactions on Vehicular Technology 68.2 (Feb. 2019), pp. 1266–1277. doi: 10.1109/ TVT.2018.2885007. [Rei19] G. Reinke. Development of a Dependency Analysis between Power System Simulation Components for their Parallel Pro- cessing. 2019. [Reu+16] R. H. Reussner et al. Modeling and Simulating Software Ar- chitectures: The Palladio Approach. MIT Press, 2016. [Ris+16] S. Ristov et al. “Superlinear speedup in HPC systems: Why and when?” In: 2016 Federated Conference on Computer Science and Information Systems (FedCSIS). IEEE. 2016, pp. 889–898.


[RJB04] J. Rumbaugh, I. Jacobson, and G. Booch. Unified Modeling Language Reference Manual, The (2Nd Edition). Pearson Higher Education, 2004. isbn: 0321245628. [Rog16] A. Roghult. “Benchmarking Python Interpreters”. In: KTH Royal Institute of Technology School (2016). url: http://kth. diva-portal.org/smash/get/diva2:912464/FULLTEXT01. pdf. [Roo99] S. Roosta. Parallel Processing and Parallel Algorithms: The- ory and Computation. Springer New York, 1999. isbn: 978-0- 38798-716-3. [Rosa] P. Ross. Cython Function Declarations – Cython def, cdef and cpdef functions 0.1.0 documentation. url: https : / / notes-on-cython.readthedocs.io/en/latest/function_ declarations.html (visited on 02/10/2020). [Rosb] G. van Rossum. What’s New In Python 3.0. url: https: / / docs . python . org / 3 / whatsnew / 3 . 0 . html (visited on 02/09/2020). [Rud+06] K. Rudion et al. “Design of benchmark of medium voltage distribution network for investigation of DG integration”. In: Power Engineering Society General Meeting, 2006. IEEE. IEEE. 2006, 6–pp. [Sad+09] A. Sadovykh et al. “On Study Results: Round Trip Engineer- ing of Space Systems”. In: European Conference on Model Driven Architecture-Foundations and Applications. Springer. 2009, pp. 265–276. [Sch+15] F. Schloegl et al. “Towards a classification scheme for co- simulation approaches in energy systems”. In: Smart Electric Distribution Systems and Technologies (EDST), 2015 Inter- national Symposium on. IEEE. 2015, pp. 516–521. [Sch11] S. Schütte. “A Domain-Specific Language For Simulation Composition”. In: ECMS. 2011, pp. 146–152. [Sch19] S. Scherfke. mosaik Documentation — Release 2.5.1. June 2019. url: https://media.readthedocs.org/pdf/mosaik/ latest/mosaik.pdf (visited on 10/21/2019). [Scia] SciPy community. Broadcasting. url: http://scipy-lect ures.org/intro/numpy/operations.html#broadcasting (visited on 02/09/2020).

236 [Scib] SciPy community. Casting Rules. url: https://docs.scipy. org/doc/numpy/reference/ufuncs.html#casting-rules (visited on 02/09/2020). [Scic] SciPy community. Universal functions (ufunc). url: https: //docs.scipy.org/doc/numpy/reference/ufuncs.html (visited on 02/09/2020). [Scid] SciPy developers. SciPy.org. url: https://www.scipy.org (visited on 12/11/2019). [ŠD12] V. Štuikys and R. Damaševičius. Meta-Programming and Model-Driven Meta-Program Development: Principles, Pro- cesses and Techniques. Vol. 5. Springer Science & Business Media, 2012. [Sjö+10] M. Sjölund et al. “Towards Efficient Distributed Simulation in Modelica using Transmission Line Modeling”. In: 3rd In- ternational Workshop on Equation-Based Object-Oriented Modeling Languages and Tools; Oslo; Norway; October 3. 047. Linköping University Electronic Press. 2010, pp. 71–80. [Slo01] A. Slominski. Design of a Pull and Push Parser System for Streaming XML. Tech. rep. Technical Report TR-550, Indiana University, 2001. [Sma12] Smart Grid Coordination Group, CEN-CENELEC-ETSI. Smart Grid Reference Architecture. Nov. 2012. url: https: //ec.europa.eu/energy/sites/ener/files/documents/ xpert_group1_reference _architecture.pdf (visited on 10/21/2019). [Smi15] K. Smith. Cython: A Guide for Python Programmers. O’Reilly Media, 2015. isbn: 9781491901755. url: https : // books . google.de/books?id=ERFkBgAAQBAJ. [Spe] O. van der Spek. C++ CTemplate system. url: https://gi thub.com/OlafvdSpek/ctemplate (visited on 12/23/2019). [SRS10] R. Santodomingo, J. Rodrıguez-Mondéjar, and M. Sanz-Bobi. “Ontology Matching Approach to the Harmonization of CIM and IEC 61850 Standards”. In: Smart Grid Communications (SmartGridComm), 2010 First IEEE International Confer- ence on. IEEE. 2010, pp. 55–60.


[SST11] S. Schütte, S. Scherfke, and M. Tröschel. “Mosaik: A frame- work for modular simulation of active components in Smart Grids”. In: Smart Grid Modeling and Simulation (SGMS), 2011 IEEE First International Workshop on. IEEE. 2011, pp. 55–60. [Ste+17] M. Stevic et al. “Multi-site European framework for real- time co-simulation of power systems”. In: IET Generation, Transmission & Distribution 11.17 (2017), pp. 4126–4135. issn: 1751-8687. doi: 10.1049/iet-gtd.2016.1576. [STM10] L. Surhone, M. Timpledon, and S. Marseken. Template Pro- cessor. Betascript Publishing, 2010. isbn: 9786130536886. [Sto+13] J. Stoer et al. Introduction to Numerical Analysis. Texts in Applied Mathematics. Springer New York, 2013. isbn: 9781475722727. [Sup] SuperLU developers. SuperLU: Home Page. url: https:// portal.nersc.gov/project/sparse/superlu (visited on 12/24/2019). [SV01] Y. Saad and H. A. Van Der Vorst. “Iterative Solution of Linear Systems in the 20th Century”. In: Numerical Analysis: Historical Developments in the 20th Century. Elsevier, 2001, pp. 175–207. [SWD15] R. Sedgewick, K. Wayne, and R. Dondero. Introduction to Programming in Python: An Interdisciplinary Approach. Pear- son Education, 2015. isbn: 9780134076522. url: https:// introcs.cs.princeton.edu/python/appendix_numpy/ (vis- ited on 02/09/2020). [Tad] Tadeck. Why is Python 3 not backwards compatible? url: htt ps://stackoverflow.com/questions/9066956/why-is-pyt hon-3-not-backwards-compatible (visited on 02/09/2020). [Tan09] A. Tanenbaum. Modern Operating Systems. Pearson Prentice Hall, 2009. isbn: 9780138134594. [Thi19] B. Thiele. GitHub - modelica-3rdparty/Modelica_Device Drivers. 2019. url: https://github.com/modelica-3rd party/Modelica_DeviceDrivers (visited on 10/23/2019). [Til01] M. Tiller. Introduction to physical modeling with Modelica. Boston: Kluwer Academic Publishers, 2001. [Tri] Trilinos developers. GitHub - trilinos/Trilinos. url: https: //github.com/trilinos/Trilinos (visited on 02/23/2020).

238 [TW67] W. F. Tinney and J. W. Walker. “Direct Solutions of Sparse Network Equations by Optimally Ordered Triangular Fac- torization”. In: Proceedings of the IEEE 55.11 (Nov. 1967), pp. 1801–1809. issn: 0018-9219. doi: 10.1109/PROC.1967. 6011. [Ull75] J. D. Ullman. “NP-Complete Scheduling Problems”. In: Jour- nal of Computer and System sciences 10.3 (1975), pp. 384– 393. [Umw19] Umweltbundesamt, German. Erneuerbare Energien in Deutschland – Daten zur Entwicklung im Jahr 2018. 2019. url: https://www.umweltbundesamt.de/sites/default/ files/medien/1410/publikationen/uba_hgp_eeinzahlen_ 2019_bf.pdf (visited on 10/21/2019). [Uni17] University of Tennessee, Knoxville. BLAS (Basic Linear Alge- bra Subprograms). 2017. url: http://www.netlib.org/blas/ (visited on 10/21/2019). [Uni19] University of Tennessee, Knoxville et al. LAPACK – Linear Algebra PACKage. 2019. url: http : / / www . netlib . org / lapack/ (visited on 10/21/2019). [Usl+12] M. Uslar et al. The Common Information Model CIM: IEC 61968/61970 and 62325 – A practical introduction to the CIM. Power Systems. Springer Berlin Heidelberg, 2012. isbn: 9783642252150. url: https://books.google.de/books?id= cdw6gtzwc-QC. [Van] J. VanderPlas. Why Python is Slow: Looking Under the Hood | Pythonic Perambulations. url: https://jakevdp.github. io/blog/2014/05/09/why-python-is-slow (visited on 02/10/2020). [Var+11] E. Varnik et al. “Fast Conservative Estimation of Hessian Sparsity”. In: Fifth SIAM Workshop on Combinatorial Sci- entific Computing, May 19–21, 2011, Darmstadt, Germany. May 2011, pp. 18–21. [VCV11] S. Van Der Walt, S. C. Colbert, and G. Varoquaux. “The NumPy Array: A Structure for Efficient Numerical Compu- tation”. In: Computing in Science & Engineering 13.2 (2011), p. 22.


[Vir+17] R. Viruez et al. “A Modelica-based Tool for Power System Dynamic Simulations”. In: Proceedings of the 12th Interna- tional Modelica Conference, Prague, Czech Republic, May 15-17, 2017. 132. Linköping University Electronic Press. 2017, pp. 235–239. [Vog+17] S. Vogel et al. “An Open Solution for Next-generation Real- time Power System Simulation”. In: 2017 IEEE Conference on Energy Internet and Energy System Integration (EI2). Nov. 2017, pp. 1–6. doi: 10.1109/EI2.2017.8245739. [Wal+14] M. Walther et al. “Equation based parallelization of Modelica models”. In: Proceedings of the 10 th International Modelica Conference; March 10-12; 2014; Lund; Sweden. 096. Linköping University Electronic Press. 2014, pp. 1213–1220. [WB87] M. Wolfe and U. Banerjee. “Data dependence and its appli- cation to parallel processing”. In: International Journal of Parallel Programming 16.2 (Apr. 1987), pp. 137–178. issn: 1573-7640. [Wei+07] S. Wei et al. “Multi-Agent Architecture of Energy Manage- ment System Based on IEC 61970 CIM”. In: Power Engineer- ing Conference, 2007. IPEC 2007. International. IEEE. 2007, pp. 1366–1370. [WGG10] K. Wehrle, M. Günes, and J. Gross. Modeling and tools for network simulation. Springer Science & Business Media, 2010. [WH16] Z. Wang and Y. He. “Two-stage optimal demand response with battery energy storage systems”. In: IET Generation, Transmission & Distribution 10.5 (2016), pp. 1286–1293. [Wil19] A. Williams. C++ Concurrency in Action. Manning Publica- tions Company, 2019. isbn: 9781617294693. [Yan81] M. Yannakakis. “Computing the Minimum Fill-In is NP- Complete”. In: SIAM Journal on Algebraic Discrete Methods 2.1 (1981), pp. 77–79. [ZCN11] K. Zhu, M. Chenine, and L. Nordstrom. “ICT architecture impact on wide area monitoring and control systems’ relia- bility”. In: IEEE transactions on power delivery 26.4 (2011), pp. 2801–2808. [ZMT11] R. D. Zimmerman, C. E. Murillo-Sánchez, and R. J. Thomas. “MATPOWER: Steady-state operations, planning, and analy- sis tools for power systems research and education”. In: IEEE Transactions on Power Systems 26.1 (2011), pp. 12–19.

240 [ZPK00] B. P. Zeigler, H. Praehofer, and T. G. Kim. Theory of Model- ing and Simulation: Integrating Discrete Event and Continu- ous Complex Dynamic Systems. Academic press, 2000.



E.ON ERC Band 72 E.ON ERC Band 77 Musa, A. Heesen, F. Advanced Control Strategies An Interdisciplinary Analysis for Stability Enhancement of of Heat Energy Consumption Future Hybrid AC/DC in Energy-Efficient Homes: Networks Essays on Economic, Technical 1. Auflage 2019 and Behavioral Aspects ISBN 978-3-942789-71-4 1. Auflage 2019 ISBN 978-3-942789-76-9 E.ON ERC Band 73 Angioni, A. E.ON ERC Band 78 Uncertainty modeling for Möller, R. analysis and design of Untersuchung der monitoring systems for Durchschlagspannung von dynamic electrical distribution Mineral-, Silikonölen und grids synthetischen Estern bei 1. Auflage 2019 mittelfrequenten Spannungen ISBN 978-3-942789-72-1 1. Auflage 2020 ISBN 978-3-942789-77-6 E.ON ERC Band 74 Möhlenkamp, M. E.ON ERC Band 79 Thermischer Komfort bei Höfer, T. Quellluftströmungen Transition Towards a 1. Auflage 2019 Renewable Energy ISBN 978-3-942789-73-8 Infrastructure: Spatial Interdependencies and Stake- E.ON ERC Band 75 holder Preferences Voss, J. 1. Auflage 2020 Multi-Megawatt Three-Phase ISBN 978-3-942789-78-3 Dual-Active Bridge DC-DC Converter E.ON ERC Band 80 1. Auflage 2019 Freitag, H. ISBN 978-3-942789-74-5 Investigation of the Internal Flow Behavior in Active E.ON ERC Band 76 Chilled Beams Siddique, H. 1. Auflage 2020 The Three-Phase Dual-Active ISBN 978-3-942789-79-0 Bridge Converter Family: Modeling, Analysis, Optimization and Comparison of Two-Level and Three-Level Converter Variants 1. Auflage 2019 ISBN 978-3-942789-75-2

This dissertation deals with established and newly developed methods from the fields of high-performance computing (HPC) and computer science, implemented in existing and new software for the simulation of large-scale power systems. The motivation is the transformation from conventional power grids to smart grids, driven by the growing share of renewable energy sources, which requires more complex power grid management. The presented HPC methods exploit the potential of modern computer hardware, which offers, for example, an ever-increasing number of parallel computing units and decreasing network communication latencies that can be decisive, especially for real-time applications. Besides measures for optimizing hardware utilization, the dissertation also addresses the representation of power systems. In the simulation of smart grids, this includes not only the power grid itself but also, for instance, the associated communication network and the energy market. Therefore, a data model for smart grid topologies based on existing standards is introduced and validated in a co-simulation environment. In addition, an approach is presented that automatically generates a software library from the specification of the data model. Subsequently, an approach is shown that uses this library to convert topological data into various simulator-specific system models. All presented approaches were implemented in open-source software projects that are publicly accessible.
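To illustrate the conversion step mentioned above, the following minimal Python sketch shows how a class generated from a data-model specification might be used to turn topological data into a simulator-specific input format. It is not taken from the dissertation's code base; the class ACLineSegment, the function to_simulator_model and the output format are purely hypothetical.

from dataclasses import dataclass
from typing import List


@dataclass
class ACLineSegment:
    # Hypothetical topology class as it might be generated from a data-model specification.
    name: str
    r: float  # series resistance in Ohm
    x: float  # series reactance in Ohm


def to_simulator_model(lines: List[ACLineSegment]) -> str:
    # Convert topology objects into a fictitious simulator-specific text format.
    return "\n".join(f"line {line.name} r={line.r} x={line.x}" for line in lines)


if __name__ == "__main__":
    topology = [ACLineSegment("L1", 0.05, 0.2), ACLineSegment("L2", 0.08, 0.3)]
    print(to_simulator_model(topology))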

ISBN 978-3-942789-80-6

81