
Energy and Policy Considerations for Deep Learning in NLP

Emma Strubell, Ananya Ganesh, Andrew McCallum
College of Information and Computer Sciences, University of Massachusetts Amherst
{strubell, aganesh, mccallum}@cs.umass.edu

Abstract

Recent progress in hardware and methodology for training neural networks has ushered in a new generation of large networks trained on abundant data. These models have obtained notable gains in accuracy across many NLP tasks. However, these accuracy improvements depend on the availability of exceptionally large computational resources that necessitate similarly substantial energy consumption. As a result these models are costly to train and develop, both financially, due to the cost of hardware and electricity or cloud compute time, and environmentally, due to the carbon footprint required to fuel modern tensor processing hardware. In this paper we bring this issue to the attention of NLP researchers by quantifying the approximate financial and environmental costs of training a variety of recently successful neural network models for NLP. Based on these findings, we propose actionable recommendations to reduce costs and improve equity in NLP research and practice.

Consumption | CO2e (lbs)
Air travel, 1 passenger, NY↔SF | 1984
Human life, avg, 1 year | 11,023
American life, avg, 1 year | 36,156
Car, avg incl. fuel, 1 lifetime | 126,000
Training one model (GPU) |
NLP pipeline (parsing, SRL) | 31
  w/ tuning & experimentation | 52,909
Transformer (big) | 192
  w/ neural architecture search | 626,155

Table 1: Estimated CO2 emissions from training common NLP models, compared to familiar consumption.[1]

1 Introduction

Advances in techniques and hardware for training deep neural networks have recently enabled impressive accuracy improvements across many fundamental NLP tasks (Bahdanau et al., 2015; Luong et al., 2015; Dozat and Manning, 2017; Vaswani et al., 2017), with the most computationally-hungry models obtaining the highest scores (Peters et al., 2018; Devlin et al., 2019; Radford et al., 2019; So et al., 2019). As a result, training a state-of-the-art model now requires substantial computational resources which demand considerable energy, along with the associated financial and environmental costs. Research and development of new models multiplies these costs by thousands of times by requiring retraining to experiment with model architectures and hyperparameters. Whereas a decade ago most NLP models could be trained and developed on a commodity laptop or server, many now require multiple instances of specialized hardware such as GPUs or TPUs, therefore limiting access to these highly accurate models on the basis of finances.

Even when these expensive computational resources are available, model training also incurs a substantial cost to the environment due to the energy required to power this hardware for weeks or months at a time. Though some of this energy may come from renewable or carbon credit-offset resources, the high energy demands of these models are still a concern since (1) energy is not currently derived from carbon-neutral sources in many locations, and (2) when renewable energy is available, it is still limited to the equipment we have to produce and store it, and energy spent training a neural network might better be allocated to heating a family's home. It is estimated that we must cut carbon emissions by half over the next decade to deter escalating rates of natural disaster, and based on the estimated CO2 emissions listed in Table 1, model training and development likely make up a substantial portion of the greenhouse gas emissions attributed to many NLP researchers.

[1] Sources: (1) Air travel and per-capita consumption: https://bit.ly/2Hw0xWc; (2) car lifetime: https://bit.ly/2Qbr0w1.
To heighten the awareness of the NLP community to this issue and promote mindful practice and policy, we characterize the dollar cost and carbon emissions that result from training the neural networks at the core of many state-of-the-art NLP models. We do this by estimating the kilowatts of energy required to train a variety of popular off-the-shelf NLP models, which can be converted to approximate carbon emissions and electricity costs. To estimate the even greater resources required to transfer an existing model to a new task or develop new models, we perform a case study of the full computational resources required for the development and tuning of a recent state-of-the-art NLP pipeline (Strubell et al., 2018). We conclude with recommendations to the community based on our findings, namely: (1) Time to retrain and sensitivity to hyperparameters should be reported for NLP models; (2) academic researchers need equitable access to computational resources; and (3) researchers should prioritize developing efficient models and hardware.

Consumer | Renew. | Gas | Coal | Nuc.
China | 22% | 3% | 65% | 4%
Germany | 40% | 7% | 38% | 13%
United States | 17% | 35% | 27% | 19%
Amazon-AWS | 17% | 24% | 30% | 26%
Google | 56% | 14% | 15% | 10%
Microsoft | 32% | 23% | 31% | 10%

Table 2: Percent energy sourced from: Renewable (e.g. hydro, solar, wind), natural gas, coal and nuclear for the top 3 cloud compute providers (Cook et al., 2017), compared to the United States,[4] China[5] and Germany (Burger, 2019).

2 Methods

To quantify the computational and environmental cost of training deep neural network models for NLP, we perform an analysis of the energy required to train a variety of popular off-the-shelf NLP models, as well as a case study of the complete sum of resources required to develop LISA (Strubell et al., 2018), a state-of-the-art NLP model from EMNLP 2018, including all tuning and experimentation.

We measure energy use as follows. We train the models described in §2.1 using the default settings provided, and sample GPU and CPU power consumption during training. Each model was trained for a maximum of 1 day. We train all models on a single NVIDIA Titan X GPU, with the exception of ELMo which was trained on 3 NVIDIA GTX 1080 Ti GPUs. While training, we repeatedly query the NVIDIA System Management Interface[2] to sample the GPU power consumption and report the average over all samples. To sample CPU power consumption, we use Intel's Running Average Power Limit interface.[3]

We estimate the total time expected for models to train to completion using training times and hardware reported in the original papers. We then calculate the power consumption in kilowatt-hours (kWh) as follows. Let $p_c$ be the average power draw (in watts) from all CPU sockets during training, let $p_r$ be the average power draw from all DRAM (main memory) sockets, let $p_g$ be the average power draw of a GPU during training, and let $g$ be the number of GPUs used to train. We estimate total power consumption as combined GPU, CPU and DRAM consumption, then multiply this by Power Usage Effectiveness (PUE), which accounts for the additional energy required to support the compute infrastructure (mainly cooling). We use a PUE coefficient of 1.58, the 2018 global average for data centers (Ascierto, 2018). Then the total power $p_t$ required at a given instance during training is given by:

$$p_t = \frac{1.58\, t\, (p_c + p_r + g\, p_g)}{1000} \qquad (1)$$

The U.S. Environmental Protection Agency (EPA) provides average CO2 produced (in pounds per kilowatt-hour) for power consumed in the U.S. (EPA, 2018), which we use to convert power to estimated CO2 emissions:

$$\mathrm{CO_2e} = 0.954\, p_t \qquad (2)$$

This conversion takes into account the relative proportions of different energy sources (primarily natural gas, coal, nuclear and renewable) consumed to produce energy in the United States. Table 2 lists the relative energy sources for China, Germany and the United States compared to the top three cloud service providers. The U.S. breakdown of energy is comparable to that of the most popular cloud compute service, Amazon Web Services, so we believe this conversion to provide a reasonable estimate of CO2 emissions per kilowatt hour of compute energy used.

[2] nvidia-smi: https://bit.ly/30sGEbi
[3] RAPL power meter: https://bit.ly/2LObQhV
[4] U.S. Dept. of Energy: https://bit.ly/2JTbGnI
[5] China Electricity Council; trans. China Energy Portal: https://bit.ly/2QHE5O3
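As a rough illustration of this procedure (a minimal sketch, not the authors' released code), the snippet below samples GPU power with nvidia-smi, averages the samples, and applies equations (1) and (2). The function names, sampling interval, and the example CPU/DRAM draws are our own assumptions; in practice $p_c$ and $p_r$ would be measured via RAPL in the same fashion.

```python
# Sketch of the energy-estimation procedure in Section 2 (equations 1 and 2).
# Assumptions: nvidia-smi is on PATH; p_c and p_r are supplied by the caller
# (e.g. from Intel RAPL); all names here are illustrative, not from the paper.
import subprocess
import time


def sample_gpu_power_watts():
    """Query nvidia-smi once and return the current draw (W) of each GPU."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=power.draw", "--format=csv,noheader,nounits"],
        text=True,
    )
    return [float(line) for line in out.strip().splitlines()]


def average_gpu_power(duration_s=60.0, interval_s=1.0):
    """Average per-GPU power draw (W) over repeated nvidia-smi samples."""
    samples = []
    end = time.time() + duration_s
    while time.time() < end:
        samples.extend(sample_gpu_power_watts())
        time.sleep(interval_s)
    return sum(samples) / len(samples)


def kwh_pue(hours, p_c, p_r, p_g, g, pue=1.58):
    """Equation (1): total energy in kWh over `hours`, scaled by PUE."""
    return pue * hours * (p_c + p_r + g * p_g) / 1000.0


def co2e_lbs(kwh, lbs_per_kwh=0.954):
    """Equation (2): estimated CO2 emissions (lbs) using the U.S. average mix."""
    return lbs_per_kwh * kwh


if __name__ == "__main__":
    # Example with made-up CPU/DRAM draws for a hypothetical 8-GPU training run.
    p_g = average_gpu_power(duration_s=10.0)
    energy = kwh_pue(hours=84.0, p_c=100.0, p_r=30.0, p_g=p_g, g=8)
    print(f"{energy:.0f} kWh, approx. {co2e_lbs(energy):.0f} lbs CO2e")
```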
2.1 Models

We analyze four models, the computational requirements of which we describe below. All models have code freely available online, which we used out-of-the-box. For more details on the models themselves, please refer to the original papers.

Transformer. The Transformer model (Vaswani et al., 2017) is an encoder-decoder architecture primarily recognized for efficient and accurate machine translation. The encoder and decoder each consist of 6 stacked layers of multi-head self-attention. Vaswani et al. (2017) report that the Transformer base model (65M parameters) was trained on 8 NVIDIA P100 GPUs for 12 hours, and the Transformer big model (213M parameters) was trained for 3.5 days (84 hours; 300k steps). This model is also the basis for recent work on neural architecture search (NAS) for machine translation and language modeling (So et al., 2019), and the NLP pipeline that we study in more detail in §4.2 (Strubell et al., 2018). So et al. (2019) report that their full architecture search ran for a total of 979M training steps, and that their base model requires 10 hours to train for 300k steps on one TPUv2 core. This equates to 32,623 hours of TPU or 274,120 hours on 8 P100 GPUs.

ELMo. The ELMo model (Peters et al., 2018) is based on stacked LSTMs and provides rich word representations in context by pre-training on a large amount of data using a language modeling objective. Replacing context-independent pre-trained word embeddings with ELMo has been shown to increase performance on downstream tasks such as named entity recognition, semantic role labeling, and coreference. Peters et al. (2018) report that ELMo was trained on 3 NVIDIA GTX 1080 GPUs for 2 weeks (336 hours).

BERT. The BERT model (Devlin et al., 2019) provides a Transformer-based architecture for building contextual representations similar to ELMo, but trained with a different language modeling objective. BERT substantially improves accuracy on tasks requiring sentence-level representations such as question answering and natural language inference. Devlin et al. (2019) report that the BERT base model (110M parameters) was trained on 16 TPU chips for 4 days (96 hours). NVIDIA reports that they can train a BERT model in 3.3 days (79.2 hours) using 4 DGX-2H servers, totaling 64 Tesla V100 GPUs (Forster et al., 2019).

GPT-2. This model is the latest edition of OpenAI's GPT general-purpose token encoder, also based on Transformer-style self-attention and trained with a language modeling objective (Radford et al., 2019). By training a very large model on massive data, Radford et al. (2019) show high zero-shot performance on question answering and language modeling benchmarks. The large model described in Radford et al. (2019) has 1542M parameters and is reported to require 1 week (168 hours) of training on 32 TPU v3 chips.[6]

[6] Via the authors on Reddit.
3 Related work

There is some precedent for work characterizing the computational requirements of training and inference in modern neural network architectures in the computer vision community. Li et al. (2016) present a detailed study of the energy use required for training and inference in popular convolutional models for image classification in computer vision, including fine-grained analysis comparing different neural network types. Canziani et al. (2016) assess image classification model accuracy as a function of model size and gigaflops required during inference. They also measure average power draw required during inference on GPUs as a function of batch size. Neither work analyzes the recurrent and self-attention models that have become commonplace in NLP, nor do they extrapolate power to estimates of carbon and dollar cost of training.

Analysis of hyperparameter tuning has been performed in the context of improved algorithms for hyperparameter search (Bergstra et al., 2011; Bergstra and Bengio, 2012; Snoek et al., 2012). To our knowledge there exists to date no analysis of the computation required for R&D and hyperparameter tuning of neural network models in NLP.

[7] GPU lower bound computed using pre-emptible P100/V100 U.S. resources priced at $0.43–$0.74/hr, upper bound uses on-demand U.S. resources priced at $1.46–$2.48/hr. We similarly use pre-emptible ($1.46/hr–$2.40/hr) and on-demand ($4.50/hr–$8/hr) pricing as lower and upper bounds for TPU v2/3; cheaper bulk contracts are available.

Model | Hardware | Power (W) | Hours | kWh·PUE | CO2e | Cloud compute cost
Transformer (base) | P100x8 | 1415.78 | 12 | 27 | 26 | $41–$140
Transformer (big) | P100x8 | 1515.43 | 84 | 201 | 192 | $289–$981
ELMo | P100x3 | 517.66 | 336 | 275 | 262 | $433–$1472
BERT (base) | V100x64 | 12,041.51 | 79 | 1507 | 1438 | $3751–$12,571
BERT (base) | TPUv2x16 | — | 96 | — | — | $2074–$6912
NAS | P100x8 | 1515.43 | 274,120 | 656,347 | 626,155 | $942,973–$3,201,722
NAS | TPUv2x1 | — | 32,623 | — | — | $44,055–$146,848
GPT-2 | TPUv3x32 | — | 168 | — | — | $12,902–$43,008

Table 3: Estimated cost of training a model in terms of CO2 emissions (lbs) and cloud compute cost (USD).[7] Power and carbon footprint are omitted for TPUs due to lack of public information on power draw for this hardware.
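As a check on how the GPU rows are derived, the Transformer (big) entry follows directly from equations (1) and (2), interpreting the Power column as the combined measured draw $p_c + p_r + g\,p_g$ of 1515.43 W over 84 hours:

$$\mathrm{kWh \cdot PUE} = \frac{1.58 \times 84 \times 1515.43}{1000} \approx 201, \qquad \mathrm{CO_2e} = 0.954 \times 201 \approx 192 \ \text{lbs},$$

matching the values reported in the table.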

4 Experimental results

4.1 Cost of training

Table 3 lists CO2 emissions and estimated cost of training the models described in §2.1. Of note is that TPUs are more cost-efficient than GPUs on workloads that make sense for that hardware (e.g. BERT). We also see that models emit substantial carbon emissions; training BERT on GPU is roughly equivalent to a trans-American flight. So et al. (2019) report that NAS achieves a new state-of-the-art BLEU score of 29.7 for English to German machine translation, an increase of just 0.1 BLEU at the cost of at least $150k in on-demand compute time and non-trivial carbon emissions.

4.2 Cost of development: Case study

To quantify the computational requirements of R&D for a new model we study the logs of all training required to develop Linguistically-Informed Self-Attention (Strubell et al., 2018), a multi-task model that performs part-of-speech tagging, labeled dependency parsing, predicate detection and semantic role labeling. This model makes for an interesting case study as a representative NLP pipeline and as a Best Long Paper at EMNLP 2018.

Model training associated with the project spanned a period of 172 days (approx. 6 months). During that time 123 small hyperparameter grid searches were performed, resulting in 4789 jobs in total. Jobs varied in length ranging from a minimum of 3 minutes, indicating a crash, to a maximum of 9 days, with an average job length of 52 hours. All training was done on a combination of NVIDIA Titan X (72%) and M40 (28%) GPUs.[8]

The sum GPU time required for the project totaled 9998 days (27 years). This averages to about 60 GPUs running constantly throughout the 6 month duration of the project. Table 4 lists upper and lower bounds of the estimated cost in terms of Google Cloud compute and raw electricity required to develop and deploy this model.[9] We see that while training a single model is relatively inexpensive, the cost of tuning a model for a new dataset, which we estimate here to require 24 jobs, or performing the full R&D required to develop this model, quickly becomes extremely expensive.

Models | Hours | Cloud compute | Electricity
1 | 120 | $52–$175 | $4
24 | 2880 | $1238–$4205 | $75
4789 | 239,942 | $103k–$350k | $6655

Table 4: Estimated cost (USD) in terms of cloud compute and electricity for training: (1) a single model, (2) a single tune and (3) all models trained during R&D.

[8] We approximate cloud compute cost using P100 pricing.
[9] Based on average U.S. cost of electricity of $0.12/kWh.
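The cloud-compute bounds in Table 4 can be approximated with a short calculation. The sketch below is our own illustration, assuming the pre-emptible ($0.43/hr) and on-demand ($1.46/hr) P100 prices from footnote 7 and the $0.12/kWh electricity rate from footnote 9; the per-GPU power draw used for the electricity column is a placeholder, not the measured Titan X/M40 draw used in the paper.

```python
# Illustrative reconstruction of the Table 4 cloud-compute bounds (not the
# authors' code). Assumed prices: pre-emptible P100 at $0.43/hr, on-demand at
# $1.46/hr (footnote 7); electricity at $0.12/kWh (footnote 9).
PUE = 1.58
USD_PER_KWH = 0.12
AVG_GPU_WATTS = 200.0  # placeholder assumption, not a measured value


def cloud_cost_bounds(gpu_hours, low=0.43, high=1.46):
    """Lower/upper cloud compute cost (USD) for a given number of GPU-hours."""
    return gpu_hours * low, gpu_hours * high


def electricity_cost(gpu_hours, watts=AVG_GPU_WATTS):
    """Rough raw electricity cost (USD), scaling the assumed draw by PUE."""
    return gpu_hours * watts * PUE / 1000.0 * USD_PER_KWH


if __name__ == "__main__":
    for label, hours in [("1 model", 120), ("1 tune (24 jobs)", 2880),
                         ("full R&D (4789 jobs)", 239_942)]:
        lo, hi = cloud_cost_bounds(hours)
        # e.g. full R&D: ~$103k-$350k cloud compute, consistent with Table 4.
        print(f"{label}: ${lo:,.0f}-${hi:,.0f} cloud, "
              f"~${electricity_cost(hours):,.0f} electricity")
```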

5 Conclusions

Authors should report training time and sensitivity to hyperparameters.

Our experiments suggest that it would be beneficial to directly compare different models to perform a cost-benefit (accuracy) analysis. To address this, when proposing a model that is meant to be re-trained for downstream use, such as re-training on a new domain or fine-tuning on a new task, authors should report training time and computational resources required, as well as model sensitivity to hyperparameters. This will enable direct comparison across models, allowing subsequent consumers of these models to accurately assess whether the required computational resources are compatible with their setting. More explicit characterization of tuning time could also reveal inconsistencies in time spent tuning baseline models compared to proposed contributions. Realizing this will require: (1) a standard, hardware-independent measurement of training time, such as gigaflops required to convergence, and (2) a standard measurement of model sensitivity to data and hyperparameters, such as variance with respect to hyperparameters searched.

Academic researchers need equitable access to computation resources.

Recent advances in available compute come at a high price not attainable to all who desire access. Most of the models studied in this paper were developed outside academia; recent improvements in state-of-the-art accuracy are possible thanks to industry access to large-scale compute.

Limiting this style of research to industry labs hurts the NLP research community in many ways. First, it stifles creativity. Researchers with good ideas but without access to large-scale compute will simply not be able to execute their ideas, instead constrained to focus on different problems. Second, it prohibits certain types of research on the basis of access to financial resources. This even more deeply promotes the already problematic "rich get richer" cycle of research funding, where groups that are already successful and thus well-funded tend to receive more funding due to their existing accomplishments. Third, the prohibitive start-up cost of building in-house resources forces resource-poor groups to rely on cloud compute services such as AWS, Google Cloud and Microsoft Azure.

While these services provide valuable, flexible, and often relatively environmentally friendly compute resources, it is more cost effective for academic researchers, who often work for non-profit educational institutions and whose research is funded by government entities, to pool resources to build shared compute centers at the level of funding agencies, such as the U.S. National Science Foundation. For example, an off-the-shelf GPU server containing 8 NVIDIA 1080 Ti GPUs and supporting hardware can be purchased for approximately $20,000 USD. At that cost, the hardware required to develop the model in our case study (approximately 58 GPUs for 172 days) would cost $145,000 USD plus electricity, about half the estimated cost to use on-demand cloud GPUs. Unlike money spent on cloud compute, however, that invested in centralized resources would continue to pay off as resources are shared across many projects. A government-funded academic compute cloud would provide equitable access to all researchers.

Researchers should prioritize computationally efficient hardware and algorithms.

We recommend a concerted effort by industry and academia to promote research of more computationally efficient algorithms, as well as hardware that requires less energy. An effort can also be made in terms of software. There is already a precedent for NLP software packages prioritizing efficient models. An additional avenue through which NLP and machine learning software developers could aid in reducing the energy associated with model tuning is by providing easy-to-use APIs implementing more efficient alternatives to brute-force grid search for hyperparameter tuning, e.g. random or Bayesian hyperparameter search techniques (Bergstra et al., 2011; Bergstra and Bengio, 2012; Snoek et al., 2012). While software packages implementing these techniques do exist,[10] they are rarely employed in practice for tuning NLP models. This is likely because their interoperability with popular deep learning frameworks such as PyTorch and TensorFlow is not optimized, i.e. there are not simple examples of how to tune TensorFlow Estimators using Bayesian search. Integrating these tools into the workflows with which NLP researchers and practitioners are already familiar could have notable impact on the cost of developing and tuning in NLP.

[10] For example, the Hyperopt Python library.
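To illustrate the kind of API referred to above, the sketch below tunes a hypothetical training routine with Hyperopt's Tree-structured Parzen Estimator (the Bayesian-style method of Bergstra et al., 2011) instead of grid search. The training function and search space are our own placeholders, not part of the paper.

```python
# Sketch: Bayesian-style hyperparameter search with Hyperopt (TPE) as an
# alternative to brute-force grid search. train_and_evaluate() is a hypothetical
# stand-in for a real training loop that returns a validation loss.
import math

from hyperopt import Trials, fmin, hp, tpe


def train_and_evaluate(learning_rate, dropout, hidden_size):
    """Placeholder for an actual training run; returns a synthetic loss so the
    example runs end-to-end. Replace with real training plus validation."""
    return (math.log10(learning_rate) + 4) ** 2 + (dropout - 0.2) ** 2 + hidden_size * 1e-4


# Search space: continuous, log-scaled, and categorical hyperparameters.
search_space = {
    "learning_rate": hp.loguniform("learning_rate", math.log(1e-5), math.log(1e-2)),
    "dropout": hp.uniform("dropout", 0.0, 0.5),
    "hidden_size": hp.choice("hidden_size", [128, 256, 512]),
}


def objective(params):
    # Hyperopt minimizes the returned scalar.
    return train_and_evaluate(**params)


trials = Trials()
best = fmin(fn=objective, space=search_space, algo=tpe.suggest,
            max_evals=50, trials=trials)  # 50 adaptive trials instead of a full grid
print("Best hyperparameters found:", best)
```

Fifty adaptively chosen trials will typically cover a space like this far more efficiently than an exhaustive grid over the same ranges, which is the point of the recommendation above.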
Acknowledgements

We are grateful to Sherief Farouk and the anonymous reviewers for helpful feedback on earlier drafts. This work was supported in part by the Centers for Data Science and Intelligent Information Retrieval, the Chan Zuckerberg Initiative under the Scientific Knowledge Base Construction project, the IBM Cognitive Horizons Network agreement no. W1668553, and National Science Foundation grant no. IIS-1514053. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.

References

Rhonda Ascierto. 2018. Uptime Institute Global Data Center Survey. Technical report, Uptime Institute.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In 3rd International Conference for Learning Representations (ICLR), San Diego, California, USA.

James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305.

James S. Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. 2011. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, pages 2546–2554.

Bruno Burger. 2019. Net Public Electricity Generation in Germany in 2018. Technical report, Fraunhofer Institute for Solar Energy Systems ISE.

Alfredo Canziani, Adam Paszke, and Eugenio Culurciello. 2016. An analysis of deep neural network models for practical applications.

Gary Cook, Jude Lee, Tamina Tsai, Ada Kong, John Deans, Brian Johnson, Elizabeth Jardim, and Brian Johnson. 2017. Clicking Clean: Who is winning the race to build a green internet? Technical report, Greenpeace.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL.

Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In ICLR.

EPA. 2018. Emissions Generation Resource Integrated Database (eGRID). Technical report, U.S. Environmental Protection Agency.

Christopher Forster, Thor Johnsen, Swetha Mandava, Sharath Turuvekere Sreenivas, Deyu Fu, Julie Bernauer, Allison Gray, Sharan Chetlur, and Raul Puri. 2019. BERT Meets GPUs. Technical report, NVIDIA AI.

Da Li, Xinbo Chen, Michela Becchi, and Ziliang Zong. 2016. Evaluating the energy efficiency of deep convolutional neural networks on CPUs and GPUs. 2016 IEEE International Conferences on Big Data and Cloud Computing (BDCloud), Social Computing and Networking (SocialCom), Sustainable Computing and Communications (SustainCom) (BDCloud-SocialCom-SustainCom), pages 477–484.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421. Association for Computational Linguistics.
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In NAACL.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. 2012. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959.

David R. So, Chen Liang, and Quoc V. Le. 2019. The Evolved Transformer. In Proceedings of the 36th International Conference on Machine Learning (ICML).

Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-Informed Self-Attention for Semantic Role Labeling. In Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In 31st Conference on Neural Information Processing Systems (NIPS).