Dissecting FLOPs along input dimensions for GreenAI cost estimations

Andrea Asperti (1), Davide Evangelista (2), and Moreno Marzolla (1)

(1) University of Bologna, Department of Informatics: Science and Engineering (DISI)
(2) University of Bologna, Department of Mathematics

Abstract. The term GreenAI refers to a novel approach to Deep Learning that is more aware of the ecological impact and the computational efficiency of its methods. The promoters of GreenAI suggested the use of Floating Point Operations (FLOPs) as a measure of the computational cost of Neural Networks; however, that measure does not correlate well with the energy consumption of hardware equipped with massively parallel processing units such as GPUs or TPUs. In this article, we propose a simple refinement of the formula used to compute floating point operations for convolutional layers, called α-FLOPs, which explains and corrects the traditional discrepancy with respect to different layers and yields estimates closer to reality. The notion of α-FLOPs relies on the crucial insight that, for inputs with multiple dimensions, there is no reason to believe that the speedup offered by parallelism will be uniform along all the different axes.

1 Introduction

Artificial Intelligence, especially in its modern incarnation of Deep Learning, has achieved remarkable results in recent years, matching, and frequently surpassing, human capabilities in a number of different tasks. These techniques usually require the deployment of massive computational resources, with huge implications in terms of energy consumption. To give a couple of examples: the hyper-realistic Generative Adversarial Network for face generation in [19] required training on 8 Tesla V100 GPUs for 4 days, and the training of BERT [12], a well-known language model for NLP, takes about 96 hours on 64 TPU2 chips. Researchers at the University of Massachusetts [26] have recently performed a life cycle assessment of the training of large state-of-the-art AI models, discovering that the process can emit a quantity of carbon dioxide roughly equivalent to the lifetime emissions of five medium cars. Other authors reached similar conclusions [20].

Until a few years ago, the ecological impact of artificial intelligence was entirely neglected by researchers and industry, who were mostly focused on improving performance at any cost. This has changed in recent years, with a growing awareness that this trend of research is no longer sustainable [28] and an increased attention towards energy efficiency [27].

The GreenAI paper [25] summarizes well the goals and objectives of the new philosophy: it promotes a new practice in Deep Learning that is more focused on the social costs of training and running models [2,7,15], encouraging the investigation of increasingly efficient models [21,5]. To this aim, it is essential to identify widely acceptable and reliable metrics to assess and compare the cost and efficiency of different models. Several metrics are investigated and discussed in [25]; in conclusion, the number of Floating Point Operations (FLOPs) is advocated and promoted, since it is easily computed for Neural Networks while offering a hardware-independent, schematic but meaningful indication of the actual computational cost of a model [20].
Unfortunately, the mere computation of FLOPs does not cope well with the massively parallel architectures (GPUs and TPUs) typically used in Deep Learning [17]. Efficient implementation of neural networks on these architectures depends both on complex algorithms for General Matrix Multiplication (GEMM) [18] and on sophisticated load balancing techniques [13] that split the workload across the different execution units. As we shall see, these algorithms usually perform better for specific layers and, especially, along specific axes of the input dimensions of these layers.

Our claim is that it is possible to study the performance of neural layers (especially convolutions) as "black boxes", measuring the execution time for a number of different configurations and separately investigating the execution time for increasing dimensions along different axes. As a result, we propose a simple correction to the formula used to compute FLOPs for convolutional layers that provides better estimations of their actual cost and helps to explain the discrepancy with respect to the cost of different layers.

Organization of the article. This paper has the following structure. In Section 2 we briefly discuss some possible metrics for measuring the efficiency of models; we particularly focus on FLOPs, discussing their computation for some basic operations relevant to Neural Networks. In Section 3 we introduce the GEMM (GEneral Matrix Multiply) operation, which underlies the canonical computation of FLOPs for convolutional layers. In Section 4 we present experiments showing that, when convolutions are executed on a GPU, FLOPs are not a good measure of efficiency. This motivates the introduction of a correction, which we call α-FLOPs, defined and discussed in Section 5. Section 6 offers more experimental results, validating the formula with respect to growing input dimensions along specific axes.

2 Measures of Efficiency

In this section we review some of the metrics that can be used to measure the efficiency of an AI algorithm, following the discussion of [25].

Carbon Emission. As already remarked in the introduction, the present work is motivated by the need to reduce the energy consumption of training large state-of-the-art AI models. Unless a significant fraction of such energy comes from renewable sources, reducing the power required for AI training means that less carbon dioxide is released into the atmosphere. Unfortunately, precise quantification of the carbon emission associated with a computational task is impractical, since it depends both on the hardware hosting the computation and on the local energy production and distribution infrastructure.

Number of parameters. The number of parameters of a Deep Learning model is an interesting and hardware-independent measure of the complexity of models. Unfortunately, the number of parameters alone is poorly correlated with the total training time, since parameters may refer to different operations. For example, Convolutional Layers have relatively few parameters, corresponding to the kernel of the convolution; this count does not take into account the actual cost of convolving the kernel over the input, as the sketch below illustrates.
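To make the mismatch concrete, the following Python sketch contrasts the parameter count of a convolutional layer with its classical FLOPs count (bias additions are ignored); all sizes are illustrative assumptions, not values taken from this paper.

    # Parameter count vs. classical FLOPs for a 2D convolution.
    # Sizes (224x224 RGB input, 64 filters, 3x3 kernel, stride 1,
    # "same" padding) are illustrative assumptions.

    def conv2d_params(k_h, k_w, c_in, c_out, bias=True):
        """Trainable parameters: one k_h x k_w x c_in kernel per output
        channel, plus one bias per output channel."""
        return k_h * k_w * c_in * c_out + (c_out if bias else 0)

    def conv2d_flops(h_out, w_out, k_h, k_w, c_in, c_out):
        """Each output element needs k_h * k_w * c_in multiply-accumulates;
        counting multiplication and addition separately gives 2 FLOPs per
        multiply-accumulate."""
        return 2 * h_out * w_out * c_out * k_h * k_w * c_in

    p = conv2d_params(3, 3, 3, 64)            # 1,792 parameters
    f = conv2d_flops(224, 224, 3, 3, 3, 64)   # about 173 million FLOPs
    print(f"{p} parameters, {f / 1e6:.0f} MFLOPs")

A layer with fewer than two thousand parameters thus accounts for hundreds of millions of floating point operations per forward pass, which is why parameter counts alone are a poor cost estimate for convolutional models.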
Execution time. The total running time is a natural measure of efficiency: faster algorithms are better. Execution time depends on the number of instructions executed and is hence strictly correlated with the total energy consumption [24]; therefore, it is a good proxy of power usage when direct energy measurement is impractical. There are a couple of important considerations to be made when using execution time as a metric: (i) it requires an implementation of the algorithm being measured, which may take time and effort to develop; (ii) execution time is hardware- and language-dependent, since it depends both on the underlying hardware and on the efficiency of the compiler/interpreter.

FLOPs. The number of FLoating Point OPerations (FLOPs) is a metric widely used in the context of numerical computations [23,14,29,22]. It is defined as the total count of elementary machine operations (floating point additions and multiplications) executed by a program. Floating point operations have a latency of several CPU cycles on most current processor architectures [10,9,3], although the use of pipelining, multiple issue, and SIMD instructions significantly increases the throughput. In general, floating point operations have higher latency than most other CPU instructions (apart from loads/stores from/to main memory, where memory access is the bottleneck); therefore, they tend to dominate the execution time of numerical algorithms. For this reason, the number of floating point operations is used as a proxy for the execution time of a program.

As an example, suppose that v and w are n-dimensional arrays. Then the inner product between v and w,

\[
\langle v, w \rangle = \sum_{i=1}^{n} v_i w_i \tag{1}
\]

requires n multiplications and n − 1 additions, for a total of 2n − 1 FLOPs. Similarly, the matrix-vector product between an m × n matrix A and an n-dimensional vector v requires m inner products, for a total of 2mn − m FLOPs.

Since operations similar to (1), where a sequence of multiplications is accumulated into a sum, are very common, modern CPUs support FMA (Fused Multiply-Add) instructions, where a multiplication followed by an addition is executed as a single operation and requires less time than two separate instructions. For this reason, the definition of FLOPs is sometimes modified to be the total number of FMA operations needed for a full iteration of an algorithm. With this definition (followed by some authors), the inner product of two n-dimensional arrays requires n FLOPs, while the product between an m × n matrix and an n-dimensional vector requires nm FLOPs. Nonetheless, since we are interested in measuring performance on massively parallel architectures, throughout this paper we will follow the classical definition of FLOPs.

3 Computation of FLOPs for basic layers

The basic operation that dominates the training of Neural Network models is the dense matrix-matrix multiplication (GEMM).
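Following the classical counting convention above, a small Python self-check (shapes are illustrative assumptions) tallies the FLOPs of the inner product, the matrix-vector product and, by the same reasoning, the dense matrix-matrix product:

    # Classical FLOPs counts: every addition and every multiplication
    # is counted separately, as in the text.

    def inner_product_flops(n):
        return 2 * n - 1              # n multiplications + (n - 1) additions

    def matvec_flops(m, n):
        return m * inner_product_flops(n)       # = 2mn - m

    def matmul_flops(m, n, p):
        # An (m x n) by (n x p) product is m*p inner products of length n.
        return m * p * inner_product_flops(n)   # = 2mnp - mp

    assert inner_product_flops(1024) == 2 * 1024 - 1
    assert matvec_flops(512, 1024) == 2 * 512 * 1024 - 512
    assert matmul_flops(64, 128, 256) == 2 * 64 * 128 * 256 - 64 * 256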

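To make the black-box methodology described in the introduction concrete, here is a minimal timing sketch that grows the input of a single convolutional layer along one axis at a time. It assumes PyTorch and a CUDA-capable GPU; the framework, device, and sizes are assumptions of this sketch, not necessarily the setup used for the experiments in this paper.

    import time
    import torch

    # Black-box timing of one convolutional layer, one input axis at a time.
    device = torch.device("cuda")
    conv = torch.nn.Conv2d(64, 64, kernel_size=3, padding=1).to(device)

    @torch.no_grad()
    def time_conv(batch, side, reps=50):
        x = torch.randn(batch, 64, side, side, device=device)
        for _ in range(5):            # warm-up: exclude cuDNN autotuning
            conv(x)
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(reps):
            conv(x)
        torch.cuda.synchronize()
        return (time.perf_counter() - t0) / reps

    # FLOPs grow linearly in the batch size and quadratically in the
    # spatial side, but measured times rarely follow either curve exactly.
    for b in (1, 2, 4, 8, 16):
        print(f"batch={b:2d}, side=128: {time_conv(b, 128) * 1e3:.2f} ms")
    for s in (64, 128, 256, 512):
        print(f"batch= 1, side={s:3d}: {time_conv(1, s) * 1e3:.2f} ms")

If FLOPs were an accurate proxy, running time would scale with the FLOPs count regardless of which axis grows; the experiments in Section 4 show that the speedup offered by parallelism is not uniform across axes, which is what the α-FLOPs correction accounts for.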