PyTorch Lightning Documentation
Release 1.1.7
William Falcon et al.
Feb 03, 2021

GETTING STARTED

1 Lightning in 2 steps
2 How to organize PyTorch into Lightning
3 Rapid prototyping templates
4 Style guide
5 Fast performance tips
6 Benchmark with vanilla PyTorch
7 LightningModule
8 Trainer
9 Accelerators
10 Callback
11 LightningDataModule
12 Logging
13 Metrics
14 Plugins
15 Step-by-step walk-through
16 API References
17 Bolts
18 Pytorch Ecosystem Examples
19 Community Examples
20 AWS/GCP training
21 16-bit training
22 Computing cluster (SLURM)
23 Child Modules
24 Debugging
25 Loggers
26 Early stopping
27 Fast Training
28 Hyperparameters
29 Learning Rate Finder
30 Multi-GPU training
31 Multiple Datasets
32 Saving and loading weights
33 Optimization
34 Performance and Bottleneck Profiler
35 Single GPU Training
36 Sequential Data
37 Training Tricks
38 Transfer Learning
39 TPU support
40 Test set
41 Inference in Production
42 Conversational AI
43 Contributor Covenant Code of Conduct
44 Contributing
45 How to become a core contributor
46 PyTorch Lightning Governance | Persons of interest
47 Changelog
48 Indices and tables
Python Module Index
Index

CHAPTER ONE

LIGHTNING IN 2 STEPS

In this guide we’ll show you how to organize your PyTorch code into Lightning in 2 steps.
Organizing your code with PyTorch Lightning keeps all the flexibility (this is all pure PyTorch) while removing a ton of boilerplate. It also makes your code:

• More readable by decoupling the research code from the engineering
• Easier to reproduce
• Less error-prone by automating most of the training loop and tricky engineering
• Scalable to any hardware without changing your model

Here’s a 3-minute conversion guide for PyTorch projects:

1.1 Step 0: Install PyTorch Lightning

You can install using pip:

    pip install pytorch-lightning

Or with conda:

    conda install pytorch-lightning -c conda-forge

You could also use conda environments:

    conda activate my_env
    pip install pytorch-lightning

Import the following:

    import os
    import torch
    from torch import nn
    import torch.nn.functional as F
    from torchvision import transforms
    from torchvision.datasets import MNIST
    from torch.utils.data import DataLoader, random_split
    import pytorch_lightning as pl

1.2 Step 1: Define LightningModule

    class LitAutoEncoder(pl.LightningModule):

        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Linear(28 * 28, 64),
                nn.ReLU(),
                nn.Linear(64, 3)
            )
            self.decoder = nn.Sequential(
                nn.Linear(3, 64),
                nn.ReLU(),
                nn.Linear(64, 28 * 28)
            )

        def forward(self, x):
            # in lightning, forward defines the prediction/inference actions
            embedding = self.encoder(x)
            return embedding

        def training_step(self, batch, batch_idx):
            # training_step defines the train loop.
            # It is independent of forward
            x, y = batch
            x = x.view(x.size(0), -1)
            z = self.encoder(x)
            x_hat = self.decoder(z)
            loss = F.mse_loss(x_hat, x)
            # Logging to TensorBoard by default
            self.log('train_loss', loss)
            return loss

        def configure_optimizers(self):
            optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
            return optimizer

SYSTEM VS MODEL

A LightningModule defines a system, not a model.
Examples of systems are:

• Autoencoder
• BERT
• DQN
• GAN
• Image classifier
• Seq2seq
• SimCLR
• VAE

Under the hood a LightningModule is still just a torch.nn.Module that groups all research code into a single file to make it self-contained:

• The Train loop
• The Validation loop
• The Test loop
• The Model or system of Models
• The Optimizer

You can customize any part of training (such as the backward pass) by overriding any of the 20+ hooks found in Available Callback hooks:

    class LitAutoEncoder(pl.LightningModule):

        def backward(self, loss, optimizer, optimizer_idx):
            loss.backward()

FORWARD vs TRAINING_STEP

In Lightning we separate training from inference. The training_step defines the full training loop. We encourage users to use forward to define inference actions.

For example, in this case we could define the autoencoder to act as an embedding extractor:

    def forward(self, x):
        embeddings = self.encoder(x)
        return embeddings

Of course, nothing is stopping you from using forward from within the training_step:

    def training_step(self, batch, batch_idx):
        ...
        z = self(x)

It really comes down to your application. We do, however, recommend that you keep both intents separate.

• Use forward for inference (predicting).
• Use training_step for training.

More details in LightningModule docs.

1.3 Step 2: Fit with Lightning Trainer

First, define the data however you want. Lightning just needs a DataLoader for the train/val/test splits.

    dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor())
    train_loader = DataLoader(dataset)

Next, init the LightningModule and the PyTorch Lightning Trainer, then call fit with both the data and model.
    # init model
    autoencoder = LitAutoEncoder()

    # most basic trainer, uses good defaults (auto-tensorboard, checkpoints, logs, and more)
    # trainer = pl.Trainer(gpus=8) (if you have GPUs)
    trainer = pl.Trainer()
    trainer.fit(autoencoder, train_loader)

The Trainer automates:

• Epoch and batch iteration
• Calling of optimizer.step(), backward, zero_grad()
• Calling of .eval(), enabling/disabling grads
• Saving and loading weights
• TensorBoard (see Loggers options)
• Multi-GPU training support
• TPU support
• 16-bit training support

Tip: If you prefer to manually manage optimizers, you can use the Manual optimization mode (e.g. RL, GANs, etc.).

That’s it! These are the main 2 concepts you need to know in Lightning. All the other features of Lightning are either features of the Trainer or LightningModule.

1.4 Basic features

1.4.1 Manual vs automatic optimization

Automatic optimization

With Lightning, you don’t need to worry about when to enable/disable grads, do a backward pass, or update optimizers. As long as you return a loss with an attached graph from the training_step, Lightning will automate the optimization.

    def training_step(self, batch, batch_idx):
        loss = self.encoder(batch[0])
        return loss

Manual optimization

However, for certain research like GANs, reinforcement learning, or something with multiple optimizers or an inner loop, you can turn off automatic optimization and fully control the training loop yourself.

First, turn off automatic optimization:

    trainer = pl.Trainer(automatic_optimization=False)

Now you own the train loop!

    def training_step(self, batch, batch_idx, optimizer_idx):
        # access your optimizers; use_pl_optimizer defaults to True
        (opt_a, opt_b, opt_c) = self.optimizers(use_pl_optimizer=True)

        loss_a = self.generator(batch[0])

        # use this instead of loss.backward so we can automate half precision, etc...
        self.manual_backward(loss_a, opt_a, retain_graph=True)
        self.manual_backward(loss_a, opt_a)
        opt_a.step()
        opt_a.zero_grad()

        loss_b = self.discriminator(batch[0])
        self.manual_backward(loss_b, opt_b)
        ...

1.4.2 Predict or Deploy

When you’re done training, you have 3 options to use your LightningModule for predictions.

Option 1: Sub-models

Pull out any model inside your system for predictions.

    # ----------------------------------
    # to use as embedding extractor
    # ----------------------------------
    autoencoder = LitAutoEncoder.load_from_checkpoint('path/to/checkpoint_file.ckpt')
    encoder_model = autoencoder.encoder
    encoder_model.eval()

    # ----------------------------------
    # to use as image generator
    # ----------------------------------
    decoder_model = autoencoder.decoder
    decoder_model.eval()

Option 2: Forward

You can also add a forward method to do predictions however you want.

    # ----------------------------------
    # using the AE to extract embeddings
    # ----------------------------------
    class LitAutoEncoder(pl.LightningModule):
        def forward(self, x):
            embedding = self.encoder(x)
            return embedding

    autoencoder = LitAutoEncoder()
    # keep the model and its output in separate variables
    embedding = autoencoder(torch.rand(1, 28 * 28))

    # ----------------------------------
    # or using the AE to generate images
    # ----------------------------------
    class LitAutoEncoder(pl.LightningModule):
        def forward(self):
            z = torch.rand(1, 3)
            image = self.decoder(z)
            image = image.view(1, 1, 28, 28)
            return image

    autoencoder = LitAutoEncoder()
    image_sample = autoencoder()

Option 3: Production

For production systems, ONNX or TorchScript are much faster. Make sure you have added a forward method or trace only the sub-models you need.
    import tempfile

    # ----------------------------------
    # torchscript
    # ----------------------------------
    autoencoder = LitAutoEncoder()
    torch.jit.save(autoencoder.to_torchscript(), "model.pt")
    assert os.path.isfile("model.pt")  # check that the scripted model file exists

    # ----------------------------------
    # onnx
    # ----------------------------------
    with tempfile.NamedTemporaryFile(suffix='.onnx', delete=False) as tmpfile:
        autoencoder = LitAutoEncoder()
        input_sample = torch.randn((1, 28 * 28))
        autoencoder.to_onnx(tmpfile.name, input_sample, export_params=True)
        assert os.path.isfile(tmpfile.name)  # check that the export file exists

1.4.3 Using CPUs/GPUs/TPUs
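As a minimal sketch of how hardware selection stays out of the model code: Step 2 already hints at pl.Trainer(gpus=8), so the choice of device can be expressed purely as Trainer keyword arguments. The helper below, trainer_hardware_kwargs, is hypothetical (it is not part of Lightning) and assumes the gpus and tpu_cores arguments of the Trainer in this release. It only builds the arguments, so it runs without Lightning installed.

```python
from typing import Any, Dict


def trainer_hardware_kwargs(target: str, count: int = 1) -> Dict[str, Any]:
    """Map a hardware target to Trainer keyword arguments.

    Hypothetical helper for illustration only; `gpus` and `tpu_cores`
    are assumed to be the Trainer flags of this Lightning release.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    if target == "cpu":
        # No device flag: the Trainer is assumed to fall back to CPU.
        return {}
    if target == "gpu":
        return {"gpus": count}
    if target == "tpu":
        # Assumption: an integer tpu_cores value must be 1 or 8.
        if count not in (1, 8):
            raise ValueError("tpu core count is assumed to be 1 or 8")
        return {"tpu_cores": count}
    raise ValueError(f"unknown target: {target!r}")


if __name__ == "__main__":
    for target, count in [("cpu", 1), ("gpu", 8), ("tpu", 8)]:
        print(target, trainer_hardware_kwargs(target, count))
    # With Lightning installed, the result would be passed on unchanged:
    #   trainer = pl.Trainer(**trainer_hardware_kwargs("gpu", 8))
    #   trainer.fit(autoencoder, train_loader)
```

The LitAutoEncoder definition is the same in all three cases; only the arguments handed to the Trainer differ, which is what "scalable to any hardware without changing your model" amounts to in practice.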