BCL: A Cross-Platform Distributed Data Structures Library


BCL: A Cross-Platform Distributed Data Structures Library

Benjamin Brock, Aydın Buluç, Katherine Yelick
University of California, Berkeley
Lawrence Berkeley National Laboratory
{brock,abuluc,yelick}@cs.berkeley.edu

ABSTRACT

One-sided communication is a useful paradigm for irregular parallel applications, but most one-sided programming environments, including MPI's one-sided interface and PGAS programming languages, lack application-level libraries to support these applications. We present the Berkeley Container Library, a set of generic, cross-platform, high-performance data structures for irregular applications, including queues, hash tables, Bloom filters and more. BCL is written in C++ using an internal DSL called the BCL Core that provides one-sided communication primitives such as remote get and remote put operations. The BCL Core has backends for MPI, OpenSHMEM, GASNet-EX, and UPC++, allowing BCL data structures to be used natively in programs written using any of these programming environments. Along with our internal DSL, we present the BCL ObjectContainer abstraction, which allows BCL data structures to transparently serialize complex data types while maintaining efficiency for primitive types. We also introduce the set of BCL data structures and evaluate their performance across a number of high-performance computing systems, demonstrating that BCL programs are competitive with hand-optimized code, even while hiding many of the underlying details of message aggregation, serialization, and synchronization.

CCS CONCEPTS

• Computing methodologies → Parallel programming languages.

KEYWORDS

Parallel Programming Libraries, RDMA, Distributed Data Structures

ACM Reference Format:
Benjamin Brock, Aydın Buluç, Katherine Yelick. 2019. BCL: A Cross-Platform Distributed Data Structures Library. In Proceedings of ACM Conference (Conference'17). ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

arXiv:1810.13029v2 [cs.DC] 23 Apr 2019

1 INTRODUCTION

Writing parallel programs for supercomputers is notoriously difficult, particularly when they have irregular control flow and complex data distribution; however, high-level languages and libraries can make this easier. A number of languages have been developed for high-performance computing, including several using the Partitioned Global Address Space (PGAS) model: Titanium, UPC, Coarray Fortran, X10, and Chapel [8, 11, 12, 27, 31, 32]. These languages are especially well-suited to problems that require asynchronous one-sided communication, or communication that takes place without a matching receive operation or outside of a global collective. However, PGAS languages lack the kind of high-level libraries that exist in other popular programming environments. For example, high-performance scientific simulations written in MPI can leverage a broad set of numerical libraries for dense or sparse matrices, or for structured, unstructured, or adaptive meshes. PGAS languages can sometimes use those numerical libraries, but lack the kind of data structures that are important in some of the most irregular parallel programs.

In this paper we describe a library, the Berkeley Container Library (BCL), that is intended to support applications with irregular patterns of communication and computation and data structures with asynchronous access, for example hash tables and queues, that can be distributed across processes but manipulated independently by each process. BCL is designed to provide a complementary set of abstractions for data analytics problems, various types of search algorithms, and other applications that do not easily fit a bulk-synchronous model. BCL is written in C++ and its data structures are designed to be coordination free, using one-sided communication primitives that can be executed using RDMA hardware without requiring coordination with remote CPUs. In this way, BCL is consistent with the spirit of PGAS languages, but provides higher-level operations such as insert and find in a hash table, rather than low-level remote read and write. As in PGAS languages, BCL data structures live in a global address space and can be accessed by every process in a parallel program. BCL data structures are also partitioned to ensure good locality whenever possible and allow for scalable implementations across multiple nodes with physically disjoint memory.

BCL is cross-platform, and is designed to be agnostic about the underlying communication layer as long as it provides one-sided communication primitives. It runs on top of MPI's one-sided communication primitives, OpenSHMEM, and GASNet-EX, all of which provide direct access to low-level remote read and write primitives to buffers in memory [6, 9, 17]. BCL provides higher-level abstractions than these communication layers, hiding many of the details of buffering, aggregation, and synchronization from users that are specific to a given data structure. BCL also has an experimental UPC++ backend, allowing BCL data structures to be used inside another high-level programming environment. BCL uses a high-level data serialization abstraction called ObjectContainers to allow the storage of arbitrarily complex datatypes inside BCL data structures. BCL ObjectContainers use C++ compile-time type introspection to avoid introducing any overhead in the common case that types are byte-copyable.
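To make the byte-copyable fast path concrete, here is a small sketch of the kind of compile-time dispatch such an abstraction can use. It is not BCL's actual ObjectContainer implementation; the class name, its members, and the std::string specialization are assumptions for illustration only.

    #include <string>
    #include <type_traits>
    #include <vector>

    // Primary template: byte-copyable types are stored inline, so moving the
    // container to or from remote memory is a plain put/get of sizeof(T) bytes.
    template <typename T>
    struct ObjectContainer {
      static_assert(std::is_trivially_copyable<T>::value,
                    "non-byte-copyable types need a serializing specialization");
      T value;
      static ObjectContainer pack(const T& v) { return {v}; }
      T unpack() const { return value; }
    };

    // Specialization for a complex type: std::string is flattened into a byte
    // buffer before being placed in the global address space, and rebuilt on read.
    template <>
    struct ObjectContainer<std::string> {
      std::vector<char> bytes;
      static ObjectContainer pack(const std::string& s) {
        return {std::vector<char>(s.begin(), s.end())};
      }
      std::string unpack() const { return std::string(bytes.begin(), bytes.end()); }
    };

Because the dispatch happens at compile time, the common case (integers and other fixed-size, trivially copyable structs) pays no serialization cost, which is the property the paper attributes to ObjectContainers.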
We present the design of BCL with an initial set of data structures and operations. We then evaluate BCL's performance on ISx, an integer sorting mini-application, Meraculous, a mini-application taken from a large-scale genomics application, and a collection of microbenchmarks examining the performance of individual data structure operations. We explain how BCL's data structures and design decisions make developing high-performance implementations of these benchmarks more straightforward and demonstrate that BCL is able to match or exceed the performance of both specialized, expert-tuned implementations as well as general libraries across three different HPC systems.

[Figure 1: Organizational diagram of BCL. The MPI, OpenSHMEM, GASNet-EX, and UPC++ backends sit beneath the internal DSL (the BCL Core), on top of which the BCL containers are built.]

1.1 Contributions

(1) A distributed data structures library that is designed for high performance and portability by using a small set of core primitives that can be executed on four distributed memory backends
(2) The BCL ObjectContainer abstraction, which allows data structures to transparently handle serialization of complex types while maintaining high performance for simple types
(3) A distributed hash table implementation that supports fast insertion and lookup phases, dynamic message aggregation, and individual insert and find operations
(4) A distributed queue abstraction for many-to-many data exchanges performed without global synchronization
(5) A distributed Bloom filter which achieves fully atomic insertions using only one-sided operations
(6) A collection of distributed data structures that offer variable levels of atomicity depending on the call context using an abstraction called concurrency promises
(7) A fast and portable implementation of the Meraculous benchmark built in BCL
(8) An experimental analysis of irregular data structures across three different computing systems along with comparisons between BCL and other standard implementations.

2 BACKGROUND AND HIGH-LEVEL DESIGN

Several approaches have been used to address programmability issues in high-performance computing, including parallel languages like Chapel, template metaprogramming libraries like UPC++, and embedded DSLs like STAPL. These environments provide core language abstractions that can boost productivity, and some of them have sophisticated support for multidimensional arrays. However, […]

[…] inserts, and finds, are executed purely in RDMA without requiring coordination with remote CPUs.

Portability. BCL is cross-platform and can be used natively in programs written in MPI, OpenSHMEM, GASNet-EX, and UPC++. When programs only use BCL data structures, users can pick whichever backend's implementation is most optimized for their system and network hardware.

Software Toolchain Complexity. BCL is a header-only library, so users need only include the appropriate header files and compile with a C++14-compliant compiler to build a BCL program. BCL data structures can be used in part of an application without having to re-write the whole application or include any new dependencies.
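As a sketch of how little scaffolding that implies, a complete program can be a single translation unit. The header path and the BCL::init, BCL::rank, BCL::nprocs, and BCL::finalize calls below are assumed here and may differ from the library's actual API; the build line in the trailing comment is likewise hypothetical.

    #include <iostream>
    #include <bcl/bcl.hpp>   // assumed header path for the header-only library

    int main() {
      BCL::init();           // each process creates its shared memory segment
      std::cout << "Hello from rank " << BCL::rank()
                << " of " << BCL::nprocs() << std::endl;
      BCL::finalize();       // collective teardown of the shared segments
      return 0;
    }

    // Hypothetical build/run line with the MPI backend and a C++14 compiler:
    //   mpicxx -std=c++14 -I<path-to-bcl> hello.cpp -o hello && mpirun -n 4 ./hello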
3 BCL CORE

The BCL Core is the cross-platform internal DSL we use to implement BCL data structures. It provides a high-level PGAS memory model based on global pointers, which are C++ objects that allow the manipulation of remote memory. During initialization, each process creates a shared memory segment of a fixed size, and every process can read and write to any location within another node's shared segment using global pointers. A global pointer is a C++ object that contains (1) the rank of the process on which the memory is located and (2) the particular offset within that process' shared memory segment that is being referenced. Together, these two values uniquely identify a global memory address. Global pointers are regular data objects and can be passed around between BCL processes using communication primitives or stored in global memory.
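A global pointer can therefore be thought of as a (rank, offset) pair with ordinary pointer semantics layered on top. The sketch below is illustrative only and is not BCL's actual class layout; the GlobalPtr name, field types, and the backend_get call in the comment are assumptions.

    #include <cstddef>
    #include <cstdint>

    // Illustrative global pointer for a PGAS memory model: it names a location
    // in another process's shared segment by (owning rank, byte offset).
    template <typename T>
    struct GlobalPtr {
      std::uint64_t rank;    // rank of the process that owns the memory
      std::uint64_t offset;  // byte offset within that process's shared segment

      // Pointer arithmetic moves within the owning rank's segment.
      GlobalPtr operator+(std::ptrdiff_t n) const {
        return {rank, offset + static_cast<std::uint64_t>(n) * sizeof(T)};
      }
    };

    // A remote read built on such a pointer hands the (rank, offset) pair to the
    // backend's one-sided get primitive, e.g. conceptually:
    //   T rget(GlobalPtr<T> p) {
    //     T result;
    //     backend_get(&result, p.rank, p.offset, sizeof(T));  // hypothetical call
    //     return result;
    //   }

Because both fields are plain integers, a global pointer is itself byte-copyable, which is what allows it to be passed between processes or stored inside other distributed data structures, as noted above.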