딥러닝 모델 병렬 처리 Deep Neural Network & Object Recognition 특집 Deep Learning Model Parallelism Ⅰ

Total Page:16

File Type:pdf, Size:1020Kb

딥러닝 모델 병렬 처리 Deep Neural Network & Object Recognition 특집 Deep Learning Model Parallelism Ⅰ https://ettrends.etri.re.kr 2018 Electronics and Telecommunications Trends 딥러닝 모델 병렬 처리 Deep Neural Network & Object Recognition 특집 Deep Learning Model Parallelism Ⅰ. 머리말 Ⅱ. 딥러닝 모델 병렬 처리 개요 Ⅲ. 딥러닝 프레임워크의 모델 병렬 처리 기술 박유미 (Y.M. Park, [email protected]) 고성능컴퓨팅연구그룹 책임연구원 Ⅳ. 딥러닝 프레임워크의 안신영 (S.Y. Ahn, [email protected]) 고성능컴퓨팅연구그룹 책임연구원 모델 병렬 처리 기술 임은지 (E.J. Lim, [email protected]) 고성능컴퓨팅연구그룹 책임연구원 비교 분석 최용석 (Y.S. Choi, [email protected]) 고성능컴퓨팅연구그룹 책임연구원 Ⅴ. 맺음말 우영춘 (Y.C. Woo, [email protected]) IDX 원천기술연구실 책임연구원 최 완 (W. Choi, [email protected]) IDX 원천기술연구실 책임연구원 Deep learning (DL) models have been widely applied to AI applications such image recognition and language translation with big data. Recently, DL models have becomes larger and more complicated, and have merged together. For the accelerated training of a large-scale deep learning model, model parallelism that partitions the model parameters for non-shared parallel access and updates across multiple machines was provided by a few distributed deep learning frameworks. Model parallelism as a training acceleration method, however, is not as commonly used as data parallelism owing to the difficulty of efficient model parallelism. This paper provides a comprehensive survey of the state of the art in model parallelism by comparing the implementation technologies in several deep learning frameworks that support model parallelism, and suggests a future research directions for improving model parallelism technology. * DOI: 10.22648/ETRI.2018.J.330401 *이 논문은 2016년도 정부(과학기술정보통신부)의 재원으로 정보통신기술진흥센터의 지원을 받 아 수행된 연구임[No.2016-0-00087, 대규모 딥러닝 고속 처리를 위한 HPC 시스템 개발]. 본 저작물은 공공누리 제4유형 출처표시+상업적이용금지+변경금지 조건에 따라 이용할 수 있습니다. © 2018 한국전자통신연구원 1 Ⅰ. 머리말 Ⅱ. 딥러닝 모델 병렬 처리 개요 1. 필요성 불과 십여 년 전 암흑 속의 AI 분야를 되살리고 꽃피 우는 데 가장 큰 역할을 한 딥러닝 기술은 최근 4차 산 최근 딥러닝 모델들은 응용의 인식 성능을 높이기 위 업 혁명의 핵심 기술로 자리 잡으며 사람의 지능을 모사 해 모델의 계층이 깊어지고, 고려해야 할 특징이 많아지 하는 다양한 응용에 적용되고 있다. 이미지 인식, 얼굴 는 대규모 네트워크로 구축되고 있다. 대규모 딥러닝 네 인식, 음성 인식 등 인식 분야를 비롯하여 주식 시장에 트워크를 효율적이고 빠르게 트레이닝하기 위해 학계와 서는 주가를 예측하고, 의료계에서는 질병 진단 지원과 산업계에서는 다중 GPGPU(General-Purpose Com- 발병률 예측을 시도하며 심지어 음악, 미술 분야에서는 puting on Graphics Processing Units)를 활용한 딥러 새로운 작품을 창조해 내는 단계에 이르고 있다. 닝 분산 처리 기술을 딥러닝 프레임워크 개발에 적용하 이와 같이 응용의 범위가 넓고 다양해짐에 따라 응용 고 있다[6]-[18]. 현재까지 연구된 딥러닝 분산 처리 기술은 크게 트레 들의 목표 성능을 높이기 위해 다양한 방식으로 딥러닝 이닝 데이터를 다수의 컴퓨터에서 분산 처리하는 데이 모델이 진화하고 있다. 모델의 규모가 커지고 있고 터 병렬 처리(Data Parallelism)와 딥러닝 네트워크를 (Large Scale Model)[1], 모델들을 결합하여(Ensemble 다수의 컴퓨터에서 분산 처리하는 모델 병렬 처리 Model) 목표 성능을 높이고 있고[2], 이질적 형태의 특 (Model Parallelism)로 나뉜다[1], [8]. 징을 함께 처리하기도 하며(Multimodal Model)[3], 기 이중 딥러닝 모델 병렬 처리는 딥러닝 네트워크를 하 존의 모델을 다른 분야에 적용시키기도 한다(Transfer 나의 컴퓨터(또는 GPGPU, FPGA 등 계산 디바이스)에 Learning)[4]. 서 처리하지 못할 경우, 네트워크를 분할하여 다수의 컴 이러한 발전 추세 중 딥러닝 모델의 규모가 커진다는 퓨터에서 처리하는 방법이다. 대표적인 예로 딥러닝 네 것은 모델을 구성하는 뉴런이 많아지고 뉴런을 연결하 트워크 트레이닝 과정에서 계산을 위해 메모리에 상주 는 은닉 계층이 깊어짐을 뜻한다. 이는 SW 엔지니어링 해야 하는 데이터인 파라미터(가중치, 그래디언트 등을 관점에서 볼 때, 딥러닝의 인퍼런스나 트레이닝 과정에 포함하여 트레이닝 과정에 학습되는 값들)의 개수가 늘 계산량이 많아지고 메모리 요구량이 커지면서 한 컴퓨터 어나 계산 가속용으로 이용하는 GPGPU 메모리에 한 (또는 계산 디바이스)의 한계를 넘는 상황이 발생할 수 있 번에 상주할 수 없는 경우에 모델 병렬 처리가 필요하 기 때문에 다수의 컴퓨터가 이를 효율적으로 나누어 처 다. [1]의 예시와 같이 학습 모델은 점점 대규모화되어 리하기 위한 분산 처리 기술이 요구됨을 의미한다[5]. 파라미터 개수가 전장 유전체 분석 모델의 경우 수억에 본고에서는 딥러닝 분산 처리 기술 중 하나인 모델 병 서 수십억 개, 구글 브레인의 이미지 분석 딥러닝 모델 렬 처리 기술을 고찰한다. Ⅱ장에서는 딥러닝 모델 병 에는 수십억 개에서 수백억 개, 뉴스 분석을 위한 토픽 렬 처리의 필요성과 기술적 추세 및 이슈를 설명하고, 모델에서는 1조 개에 이르고 있다. Ⅲ장에서는 딥러닝 분산 프레임워크들의 모델 병렬 처 <표 1>은 BVLC Caffe(v1.0.0)[19]에서 NVIDIA Titan 리 기술을 소개하며, IV장에서 이를 비교 분석한다. 마 X(12GB 메모리)를 이용하여 대규모 CNN 모델들의 지막으로 V장에서 향후 연구가 필요한 방향을 점검하며 ImageNet 트레이닝 중 실측한 데이터이다[20]. 총 6개 결론을 맺는다. 의 딥러닝 모델에 다양한 해상도의 ImageNet 데이터에 2 전자통신동향분석 제33권 제4호 2018년 8월 <표 1> 대규모 CNN 모델의 파라미터 분석[20] Inception_v1 Inception_v3 Resnet_50 Inception_resnet_v2 Resnet_152 VGG16 (googlenet) 레이터 개수(컴포넌트수) 19(223) 13(417) 50(223) 467(1071) 152(674) 16(44) Imagenet 데이터 크기 256×256 300×300 256×256 299×299 299×299 256×256 가중치 개수 13,393,411 24,721,100 31,754,269 56,119,101 66,488,387 138,357,544 가중치 저장 메모리 필요량(MB) 51 94 121 214 254 264 계산시간(ms) 207 289 205 267 176 220 미니배치 크기 64 22 18 6 7 80 총 메모리 사용량(MB) 11,695 11,643 11,651 11,785 11,999 11,591 대해 1회 반복(1 미니배치로 1회 반복) 시 메모리에 상 모델의 분산 병렬 처리 기술에 대해 살펴본다. 주하는 가중치 개수와 규모, 계산 시간, 총 메모리 사용 가. 딥러닝 모델 표현(Representation) 량을 측정하였다. 각 모델들에서의 총 메모리 사용량은 NVIDIA Titan X의 메모리 크기 12GB에 근접한다. 모 사람의 신경망을 모사한 딥러닝 모델은 주로 방향성 델마다 미니배치의 크기를 달리하여 측정한 이유는 모 이 있는 비순환 그래프(DAG: Directed Acyclic Graph) 델의 특성과 ImageNet 입력 데이터의 크기에 따라 필 로 표현된다. 노드로 표현되는 뉴런은 활성화 함수를 의 요한 메모리 요구량이 달라져 이를 GPGPU의 메모리를 미하며 입력이 뉴런을 통해 전달되어 가는 동안 동일한 최대한 이용할 수 있는 최대 미니배치를 적용한 결과이 단계에서 동일한 활성화 함수를 실행하는 뉴런들을 묶 다. 만약 <표 1>의 미니배치보다 더 큰 미니배치를 적용 어 계층이라 부른다. 뉴런들을 연결하는 에지는 입력 계 할 경우, 예를들어 BVLC Caffe에서는 Titan X의 메모 층으로부터 출력 계층으로 나아가는 포워드 방향과 출 리 용량을 넘는 미니배치로 설정한 경우 트레이닝 실행 력 계층으로부터 입력 계층으로 오류가 역전파되는 백 이 불가능하다. SGD(Stochastic Gradient Descent) 기 워드 방향을 내포하며 연결된 뉴런 사이의 가중치를 포 법을 이용하는 분산 딥러닝 모델의 경우, 미니배치 크기 함하고 있다. 가 적을수록 정답에 늦게 수렴하거나 정확도가 저하되 딥러닝 모델 트레이닝과 인퍼런스 과정에서 에지는 기 때문에 가능한 미니배치를 크게 설정하게 된다[8], 뉴런 사이에 데이터 전달을 의미하는데, 이전 뉴런의 계 [21], [22]. 위 실험과 동일한 실행 환경에서 딥러닝 모 델 병렬 처리를 적용한다면 한 컴퓨터당 미니배치를 더 크게 설정할 수 있다. 2. 기술적 이슈 본 절에서는 딥러닝 모델 병렬 처리의 기술 추세와 이 슈를 정리한다. 첫째, 모델 분할을 설명하기에 앞서 현 재 프레임워크들에서 제공하는 모델의 표현 방법을 고 찰하고, 둘째, 모델 병렬 처리를 위한 모델 분할 방법과 이슈를 점검하며, 셋째, 스케줄링과 통신 방법 등 분할 박유미 외 / 딥러닝 모델 병렬 처리 3 산 결과에 가중치를 곱하여 벡터, 매트릭스, 텐서 형태 로 다음 뉴런으로 전달한다. (그림 1)의 예에서 입력 계 층의 벡터값 I와 가중치(입력 계층 I와 은닉 계층 H1 사 이) 매트릭스 WIHI를 곱한 결과가 은닉 계층 H1에 입력 으로 전달된다. (그림 1)의 H11 뉴런은 I×WIH1 중 첫 번 째 열(A)의 합을 입력으로 받고, H12 뉴런은 I×WIH1 중 두 번째 열(B)의 합을 입력으로 받는다. 모델 병렬 처리 시 분할된 모델이 참조하는 가중치는 포워드 시와 백워 드 시 직교하지 않을 수 있기[14], [16] 때문에 발생하는 이슈들을 본 장의 ‘다’ 절에서 설명한다. 나. 딥러닝 모델 분할(Partitioning) 딥러닝 모델은 인공신경망(Artificial Neural Net- work)의 한 종류로 딥러닝 네트워크 모델이라고도 불린 다. 네트워크는 구조적 연결성을 부각하는 용어이고 모 델은 네트워크 구조를 포함한 동작까지 바라보는 포괄 적인 용어로 쓰인다. 본 절에서는 딥러닝 네트워크의 연 (그림 2)는 [9]의 그림을 재구성하여 DNN(Deep 결성 측면에서 분할 방법을 기술하고자 하나 네트워크 Neural Network)의 계층별 모델 분할의 사례를 보였다. 분할 보다 모델 분할이라는 용어가 더 널리 쓰이고 있으 모델을 구성하는 다중 계층들을 한 계층씩 또는 여러 계 므로 모델 분할이라 칭하겠다. 층 묶음으로 나누어(세로 방향으로 분할) 각기 다른 컴 모델 분할 방법으로는 뉴럴 네트워크의 구조적 관점 퓨터 워커(Worker1, Worker2, Worker3, Worker 4)에 에서 계층별 분할[6], [9], [23]과 입력 피처 별 분할 할당하여 처리한다. 딥러닝 트레이닝의 피드 포워드 시 [23], [25], 하이브리드 방법들[8], [16] 그리고 뉴런 단 에는 i번째 계층에서 계산한 값들을 i+1번째 계층에 전 위의 자유 분할 방법들이 연구되고 있다[16], [18]. 달하고, 역전파 시에는 i번째 계층에서 오류로 편미분 된 값들을 i-1번째 계층으로 전달한다. 따라서 계층 분 할로 각 계층의 계산 작업을 처리하는 워커 들 간 데이 터 전송이 필요하다. 두 번째 (그림 3)은 입력 피처를 그룹핑하는 분할 방 법으로(가로 방향으로 분할) 초기 피처들을 두 그룹으로 나누어 서로 다른 컴퓨터 워커(Worker1과 Worker 2)에 할당하여 처리하는 사례이다. 입력 피처 별 분할은 i번 째 계층에서 i+1번째 계층으로 넘어갈 때 Worker1이 로컬 가중치와 계산된 특징값을 Worker 2에 전달해야 4 전자통신동향분석 제33권 제4호 2018년 8월 하고, Worker 2도 마찬가지로 로컬 가중치와 계산된 특 터를 이용하고, 서로 다른 분할이라 해도 일부 파라미터 징값을 Worker 1에 전달해야 한다. 즉, 분산 워커들 사 를 공유할 수 있다. 이는 분할 모델 간 직교성이 없다는 이에 로컬 파라미터를 매 계층, 매 피처 별 분할마다 전 의미이다. 일례로 ‘가’절에서 설명한 (그림 1)의 모델을 달해야 하는 의존성이 생기고 이로 인한 데이터 교환도 계층별로 분할했을 경우, 피드 포워드 시에는 파라미터 불가피하다. 벡터를 열(column) 단위로 참조하지만, 역전파 시에는 세째로 (그림 4)와 같이 계층별 분할과 피처별 분할을 행(row) 단위로 참조한다. 분할된 모델 간 종속적인 데 융합한 하이브리드 분할이 가능하다([9] 그림 재구성).
Recommended publications
  • SOL: Effortless Device Support for AI Frameworks Without Source Code Changes
    SOL: Effortless Device Support for AI Frameworks without Source Code Changes Nicolas Weber and Felipe Huici NEC Laboratories Europe Abstract—Modern high performance computing clusters heav- State of the Art Proposed with SOL ily rely on accelerators to overcome the limited compute power API (Python, C/C++, …) API (Python, C/C++, …) of CPUs. These supercomputers run various applications from different domains such as simulations, numerical applications or Framework Core Framework Core artificial intelligence (AI). As a result, vendors need to be able to Device Backends SOL efficiently run a wide variety of workloads on their hardware. In the AI domain this is in particular exacerbated by the Fig. 1: Abstraction layers within AI frameworks. existance of a number of popular frameworks (e.g, PyTorch, TensorFlow, etc.) that have no common code base, and can vary lines of code to their scripts in order to enable SOL and its in functionality. The code of these frameworks evolves quickly, hardware support. making it expensive to keep up with all changes and potentially We explore two strategies to integrate new devices into AI forcing developers to go through constant rounds of upstreaming. frameworks using SOL as a middleware, to keep the original In this paper we explore how to provide hardware support in AI frameworks without changing the framework’s source code in AI framework unchanged and still add support to new device order to minimize maintenance overhead. We introduce SOL, an types. The first strategy hides the entire offloading procedure AI acceleration middleware that provides a hardware abstraction from the framework, and the second only injects the necessary layer that allows us to transparently support heterogenous hard- functionality into the framework to enable the execution, but ware.
    [Show full text]
  • Deep Learning Frameworks | NVIDIA Developer
    4/10/2017 Deep Learning Frameworks | NVIDIA Developer Deep Learning Frameworks The NVIDIA Deep Learning SDK accelerates widely­used deep learning frameworks such as Caffe, CNTK, TensorFlow, Theano and Torch as well as many other deep learning applications. Choose a deep learning framework from the list below, download the supported version of cuDNN and follow the instructions on the framework page to get started. Caffe is a deep learning framework made with expression, speed, and modularity in mind. Caffe is developed by the Berkeley Vision and Learning Center (BVLC), as well as community contributors and is popular for computer vision. Caffe supports cuDNN v5 for GPU acceleration. Supported interfaces: C, C++, Python, MATLAB, Command line interface Learning Resources Deep learning course: Getting Started with the Caffe Framework Blog: Deep Learning for Computer Vision with Caffe and cuDNN Download Caffe Download cuDNN The Microsoft Cognitive Toolkit —previously known as CNTK— is a unified deep­learning toolkit from Microsoft Research that makes it easy to train and combine popular model types across multiple GPUs and servers. Microsoft Cognitive Toolkit implements highly efficient CNN and RNN training for speech, image and text data. Microsoft Cognitive Toolkit supports cuDNN v5.1 for GPU acceleration. Supported interfaces: Python, C++, C# and Command line interface Download CNTK Download cuDNN TensorFlow is a software library for numerical computation using data flow graphs, developed by Google’s Machine Intelligence research organization. TensorFlow supports cuDNN v5.1 for GPU acceleration. Supported interfaces: C++, Python Download TensorFlow Download cuDNN https://developer.nvidia.com/deep­learning­frameworks 1/3 4/10/2017 Deep Learning Frameworks | NVIDIA Developer Theano is a math expression compiler that efficiently defines, optimizes, and evaluates mathematical expressions involving multi­dimensional arrays.
    [Show full text]
  • 1 Amazon Sagemaker
    1 Amazon SageMaker 13:45~14:30 Amazon SageMaker 14:30~15:15 Amazon SageMaker re:Invent 15:15~15:45 Q&A | 15:45~17:00 Amazon SageMaker 20 SmartNews Data Scientist, Meng Lee Sagemaker SageMaker - FiNC FiNC Technologies SIGNATE Amazon SageMaker SIGNATE CTO 17:00~17:15© 2018, Amazon Web Services, Inc. or itsQ&A Affiliates. All rights reserved. Amazon Confidential and Trademark Amazon SageMaker Makoto Shimura, Solutions Architect 2019/01/15 © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark • • • • ⎼ Amazon Athena ⎼ AWS Glue ⎼ Amazon SageMaker © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark • • Amazon SageMaker • Amazon SageMasker • SageMaker SDK • [ | | ] • Amazon SageMaker • © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark 開発 学習 推論推論 学習に使うコードを記述 大量の GPU 大量のCPU や GPU 小規模データで動作確認 大規模データの処理 継続的なデプロイ 試行錯誤の繰り返し 様々なデバイスで動作 © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark 開発 学習 推論推論 エンジニアがプロダク データサイエンティストが開発環境で作業 ション環境に構築 開発と学習を同じ 1 台のインスタンスで実施 API サーバにデプロイ Deep Learning であれば GPU インスタンスを使用 エッジデバイスで動作 © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark & • 開発 学習 推論推論 • エンジニアがプロダク データサイエンティストが開発環境で作業 • ション環境に構築 開発と学習を同じ 1 台のインスタンスで実施 API サーバにデプロイ • Deep Learning であれば GPU インスタンスを使用 エッジデバイスで動作 • API • • © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark Amazon SageMaker © 2018, Amazon Web Services, Inc.
    [Show full text]
  • Intel® Optimized AI Frameworks
    Intel® optimized AI frameworks Dr. Fabio Baruffa & Shailen Sobhee Technical Consulting Engineers, Intel IAGS Visit: www.intel.ai/technology Speed up development using open AI software Machine learning Deep learning TOOLKITS App Open source platform for building E2E Analytics & Deep learning inference deployment Open source, scalable, and developers AI applications on Apache Spark* with distributed on CPU/GPU/FPGA/VPU for Caffe*, extensible distributed deep learning TensorFlow*, Keras*, BigDL TensorFlow*, MXNet*, ONNX*, Kaldi* platform built on Kubernetes (BETA) Python R Distributed Intel-optimized Frameworks libraries * • Scikit- • Cart • MlLib (on Spark) * * And more framework Data learn • Random • Mahout optimizations underway • Pandas Forest including PaddlePaddle*, scientists * * • NumPy • e1071 Chainer*, CNTK* & others Intel® Intel® Data Analytics Intel® Math Kernel Library Kernels Distribution Acceleration Library Library for Deep Neural Networks for Python* (Intel® DAAL) (Intel® MKL-DNN) developers Intel distribution High performance machine Open source compiler for deep learning optimized for learning & data analytics Open source DNN functions for model computations optimized for multiple machine learning library CPU / integrated graphics devices (CPU, GPU, NNP) from multiple frameworks (TF, MXNet, ONNX) 2 Visit: www.intel.ai/technology Speed up development using open AI software Machine learning Deep learning TOOLKITS App Open source platform for building E2E Analytics & Deep learning inference deployment Open source, scalable,
    [Show full text]
  • Theano: a Python Framework for Fast Computation of Mathematical Expressions (The Theano Development Team)∗
    Theano: A Python framework for fast computation of mathematical expressions (The Theano Development Team)∗ Rami Al-Rfou,6 Guillaume Alain,1 Amjad Almahairi,1 Christof Angermueller,7, 8 Dzmitry Bahdanau,1 Nicolas Ballas,1 Fred´ eric´ Bastien,1 Justin Bayer, Anatoly Belikov,9 Alexander Belopolsky,10 Yoshua Bengio,1, 3 Arnaud Bergeron,1 James Bergstra,1 Valentin Bisson,1 Josh Bleecher Snyder, Nicolas Bouchard,1 Nicolas Boulanger-Lewandowski,1 Xavier Bouthillier,1 Alexandre de Brebisson,´ 1 Olivier Breuleux,1 Pierre-Luc Carrier,1 Kyunghyun Cho,1, 11 Jan Chorowski,1, 12 Paul Christiano,13 Tim Cooijmans,1, 14 Marc-Alexandre Cotˆ e,´ 15 Myriam Cotˆ e,´ 1 Aaron Courville,1, 4 Yann N. Dauphin,1, 16 Olivier Delalleau,1 Julien Demouth,17 Guillaume Desjardins,1, 18 Sander Dieleman,19 Laurent Dinh,1 Melanie´ Ducoffe,1, 20 Vincent Dumoulin,1 Samira Ebrahimi Kahou,1, 2 Dumitru Erhan,1, 21 Ziye Fan,22 Orhan Firat,1, 23 Mathieu Germain,1 Xavier Glorot,1, 18 Ian Goodfellow,1, 24 Matt Graham,25 Caglar Gulcehre,1 Philippe Hamel,1 Iban Harlouchet,1 Jean-Philippe Heng,1, 26 Balazs´ Hidasi,27 Sina Honari,1 Arjun Jain,28 Sebastien´ Jean,1, 11 Kai Jia,29 Mikhail Korobov,30 Vivek Kulkarni,6 Alex Lamb,1 Pascal Lamblin,1 Eric Larsen,1, 31 Cesar´ Laurent,1 Sean Lee,17 Simon Lefrancois,1 Simon Lemieux,1 Nicholas Leonard,´ 1 Zhouhan Lin,1 Jesse A. Livezey,32 Cory Lorenz,33 Jeremiah Lowin, Qianli Ma,34 Pierre-Antoine Manzagol,1 Olivier Mastropietro,1 Robert T. McGibbon,35 Roland Memisevic,1, 4 Bart van Merrienboer,¨ 1 Vincent Michalski,1 Mehdi Mirza,1 Alberto Orlandi, Christopher Pal,1, 2 Razvan Pascanu,1, 18 Mohammad Pezeshki,1 Colin Raffel,36 Daniel Renshaw,25 Matthew Rocklin, Adriana Romero,1 Markus Roth, Peter Sadowski,37 John Salvatier,38 Franc¸ois Savard,1 Jan Schluter,¨ 39 John Schulman,24 Gabriel Schwartz,40 Iulian Vlad Serban,1 Dmitriy Serdyuk,1 Samira Shabanian,1 Etienne´ Simon,1, 41 Sigurd Spieckermann, S.
    [Show full text]
  • Comparative Study of Deep Learning Software Frameworks
    Comparative Study of Deep Learning Software Frameworks Soheil Bahrampour, Naveen Ramakrishnan, Lukas Schott, Mohak Shah Research and Technology Center, Robert Bosch LLC {Soheil.Bahrampour, Naveen.Ramakrishnan, fixed-term.Lukas.Schott, Mohak.Shah}@us.bosch.com ABSTRACT such as dropout and weight decay [2]. As the popular- Deep learning methods have resulted in significant perfor- ity of the deep learning methods have increased over the mance improvements in several application domains and as last few years, several deep learning software frameworks such several software frameworks have been developed to have appeared to enable efficient development and imple- facilitate their implementation. This paper presents a com- mentation of these methods. The list of available frame- parative study of five deep learning frameworks, namely works includes, but is not limited to, Caffe, DeepLearning4J, Caffe, Neon, TensorFlow, Theano, and Torch, on three as- deepmat, Eblearn, Neon, PyLearn, TensorFlow, Theano, pects: extensibility, hardware utilization, and speed. The Torch, etc. Different frameworks try to optimize different as- study is performed on several types of deep learning ar- pects of training or deployment of a deep learning algorithm. chitectures and we evaluate the performance of the above For instance, Caffe emphasises ease of use where standard frameworks when employed on a single machine for both layers can be easily configured without hard-coding while (multi-threaded) CPU and GPU (Nvidia Titan X) settings. Theano provides automatic differentiation capabilities which The speed performance metrics used here include the gradi- facilitates flexibility to modify architecture for research and ent computation time, which is important during the train- development. Several of these frameworks have received ing phase of deep networks, and the forward time, which wide attention from the research community and are well- is important from the deployment perspective of trained developed allowing efficient training of deep networks with networks.
    [Show full text]
  • Comparative Study of Caffe, Neon, Theano, and Torch
    Workshop track - ICLR 2016 COMPARATIVE STUDY OF CAFFE,NEON,THEANO, AND TORCH FOR DEEP LEARNING Soheil Bahrampour, Naveen Ramakrishnan, Lukas Schott, Mohak Shah Bosch Research and Technology Center fSoheil.Bahrampour,Naveen.Ramakrishnan, fixed-term.Lukas.Schott,[email protected] ABSTRACT Deep learning methods have resulted in significant performance improvements in several application domains and as such several software frameworks have been developed to facilitate their implementation. This paper presents a comparative study of four deep learning frameworks, namely Caffe, Neon, Theano, and Torch, on three aspects: extensibility, hardware utilization, and speed. The study is per- formed on several types of deep learning architectures and we evaluate the per- formance of the above frameworks when employed on a single machine for both (multi-threaded) CPU and GPU (Nvidia Titan X) settings. The speed performance metrics used here include the gradient computation time, which is important dur- ing the training phase of deep networks, and the forward time, which is important from the deployment perspective of trained networks. For convolutional networks, we also report how each of these frameworks support various convolutional algo- rithms and their corresponding performance. From our experiments, we observe that Theano and Torch are the most easily extensible frameworks. We observe that Torch is best suited for any deep architecture on CPU, followed by Theano. It also achieves the best performance on the GPU for large convolutional and fully connected networks, followed closely by Neon. Theano achieves the best perfor- mance on GPU for training and deployment of LSTM networks. Finally Caffe is the easiest for evaluating the performance of standard deep architectures.
    [Show full text]
  • Toolkits and Libraries for Deep Learning
    J Digit Imaging DOI 10.1007/s10278-017-9965-6 Toolkits and Libraries for Deep Learning Bradley J. Erickson1 & Panagiotis Korfiatis1 & Zeynettin Akkus1 & Timothy Kline 1 & Kenneth Philbrick 1 # The Author(s) 2017. This article is published with open access at Springerlink.com Abstract Deep learning is an important new area of machine the algorithm learns, deep learning approaches learn the impor- learning which encompasses a wide range of neural network tant features as well as the proper weighting of those features to architectures designed to complete various tasks. In the medical make predictions for new data. In this paper, we will describe imaging domain, example tasks include organ segmentation, le- some of the libraries and tools that are available to aid in the sion detection, and tumor classification. The most popular net- construction and efficient execution of deep learning as applied work architecture for deep learning for images is the to medical images. convolutional neural network (CNN). Whereas traditional ma- chine learning requires determination and calculation of features How to Evaluate a Toolkit from which the algorithm learns, deep learning approaches learn the important features as well as the proper weighting of those There is not a single criterion for determining the best toolkit for features to make predictions for new data. In this paper, we will deep learning. Each toolkit was designed and built to address the describe some of the libraries and tools that are available to aid in needs perceived by the developer(s) and also reflects their skills the construction and efficient execution of deep learning as ap- and approaches to problems.
    [Show full text]
  • Chainer and Chainerx
    The Frontier of Define-by-Run Deep Learning Frameworks GTC 2019 @ San Jose. Mar. 20, 2019 Seiya Tokui, Preferred Networks, Inc. S9380 Deep Learning Framework for fast iterative research/development 2 Define-by-Run frameworks by default from 2.0 3 x = numpy.array(…) h1 = layer1(x, W1) Write forward prop h2 = layer2(h1, W2) as a plain Python script. loss = loss_func(h2) loss.backward() Variables hold how they W1.array -= lr * W1.grad were computed. Use it to W2.array -= lr * W2.grad compute the gradient. 4 Deep learning framework optimized for the Define-by-Run API design 5 ✓ Model description ✓ Distributed training ✓ Serialization, export …… Everything is optimized for Define-by-Run style programming 6 class Linear(chainer.Link): Tie parameters to the def __init__(self, n_in, n_out): forward code using OOP. super().__init__() with self.init_scope(): self.W = chainer.Parameter(I.HeNormal(), (n_in, n_out)) self.b = chainer.Parameter(0, (n_out,)) def forward(self, x): return x @ self.W + self.b 7 class MLP(chainer.Chain): def __init__(self): super().__init__() with self.init_scope(): Object structure = self.l1 = Linear(784, 200) composition of NN fragments self.l2 = Linear(200, 100) self.l3 = Linear(100, 10) def forward(self, x): h1 = F.relu(self.l1(x)) h2 = F.relu(self.l2(h1)) return self.l3(h2) 8 for batch in iterator: # fetch the next minibatch x, t = converter(batch) # concat, transfer to the device loss = loss_fun(x, t) # forward prop loss.backward() # backprop optimizer.update() # update parameters model.cleargrad() # cleanup gradients
    [Show full text]
  • Which Deep Learning Framework Is Growing Fastest? Integrations
    Which Deep Learning Framework is Growing Fastest? TensorFlow vs. PyTorch Jeff Hale Follow Apr 1 · 8 min read In September 2018, I compared all the major deep learning frameworks in terms of demand, usage, and popularity in this article. TensorFlow was the undisputed heavyweight champion of deep learning frameworks. PyTorch was the young rookie with lots of buzz. ] How has the landscape changed for the leading deep learning frameworks in the past six months? To answer that question, I looked at the number of job listings on Indeed, Monster, LinkedIn, and SimplyHired. I also evaluated changes in Google search volume, GitHub activity, Medium articles, ArXiv articles, and Quora topic followers. Overall, these sources paint a comprehensive picture of growth in demand, usage, and interest. Integrations and Updates We’ve recently seen several important developments in the TensorFlow and PyTorch frameworks. PyTorch v1.0 was pre-released in October 2018, at the same time fastai v1.0 was released. Both releases marked major milestones in the maturity of the frameworks. TensorFlow 2.0 alpha was released March 4, 2019. It added new features and an improved user experience. It more tightly integrates Keras as its high-level API, too. Methodology In this article, I include Keras and fastai in the comparisons because of their tight integrations with TensorFlow and PyTorch. They also provide scale for evaluating TensorFlow and PyTorch. I won’t be exploring other deep learning frameworks in this article. I expect I will receive feedback that Caffe, Theano, MXNET, CNTK, DeepLearning4J, or Chainer deserve to be discussed. While these frameworks each have their virtues, none appear to be on a growth trajectory likely to put them near TensorFlow or PyTorch.
    [Show full text]
  • Chainer: a Next-Generation Open Source Framework for Deep Learning
    Chainer: a Next-Generation Open Source Framework for Deep Learning Seiya Tokui Kenta Oono Shohei Hido Preferred Networks Preferred Networks Preferred Networks America Tokyo, Japan. Tokyo, Japan. San Mateo, CA. [email protected] [email protected] [email protected] Justin Clayton Preferred Networks America San Mateo, CA. [email protected] Abstract Software frameworks for neural networks play key roles in the development and application of deep learning methods. However, as new types of deep learning models are developed, existing frameworks designed for convolutional neural net- works are becoming less useful. In this paper, we introduce Chainer, a Python- based, standalone open source framework for deep learning models. Chainer pro- vides a flexible, intuitive, and high performance means of implementing a full range of deep learning models, including state-of-the-art models such as recurrent neural networks and variational autoencoders. 1 Introduction Deep learning is driving the third wave of artificial intelligence research [13]. Recent papers indicate that deep learning is moving beyond its early successes in pattern recognition and towards new applications in diverse domains and industries. In order to put these research ideas into practice, a software framework for deep learning is needed. Implementing neural networks (NNs) requires a set of specialized building blocks, including mul- tidimensional arrays, activation functions, and autonomous gradient computation. To avoid dupli- cating these tools, many developers use open source deep learning frameworks such as Caffe [9] or Torch [6]. Because deep learning was first used successfully in the areas of computer vision and speech recognition, existing deep learning frameworks were designed mainly for feed-forward net- works such as convolutional neural networks (CNNs), which are effective for analyzing data samples of fixed length, such as images.
    [Show full text]
  • Ezgi Korkmaz Outline
    KTH ROYAL INSTITUTE OF TECHNOLOGY BigDL: A Distributed Deep Learning Framework for Big Data ACM Symposium on Cloud Computing 2019 [Acceptance Rate: 24%] Jason Jinquan Dai, Yiheng Wang, Xin Qiu, Ding Ding, Yao Zhang, Yanzhang Wang, Xianyan Jia, Cherry Li Zhang, Yan Wan, Zhichao Li, Jiao Wang, Sheng- sheng Huang, Zhongyuan Wu, Yang Wang, Yuhao Yang, Bowen She, Dongjie Shi, Qi Lu, Kai Huang, Guoqiong Song. [Intel Corporation] Presenter: Ezgi Korkmaz Outline I Deep learning frameworks I BigDL applications I Motivation for end-to-end framework I Drawbacks of prior approaches I BigDL framework I Experimental Setup and Results of BigDL framework I Critique of the paper 2/16 Deep Learning Frameworks I Big demand from organizations to apply deep learning to big data I Deep learning frameworks: I Torch [2002 Collobert et al.] [ C, Lua] I Caffe [2014 Berkeley BAIR] [C++] I TensorFlow [2015 Google Brain] [C++, Python, CUDA] I Apache MXNet [2015 Apache Software Foundation] [C++] I Chainer [2015 Preferred Networks] [ Python] I Keras [2016 Francois Chollet] [Python] I PyTorch [2016 Facebook] [Python, C++, CUDA] I Apache Spark is an open-source distributed general-purpose cluster-computing framework. I Provides interface for programming clusters with data parallelism 3/16 BigDL I A library on top of Apache Spark I Provides integrated data-analytics within a unifed data analysis pipeline I Allows users to write their own deep learning applications I Running directly on big data clusters I Supports similar API to Torch and Keras I Supports both large scale distributed training and inference I Able to run across hundreds or thousands servers efficiently by uses underlying Spark framework 4/16 BigDL I Developed as an open source project I Used by I Mastercard I WorldBank I Cray I Talroo I UCSF I JD I UnionPay I GigaSpaces I Wide range of applications: transfer learning based image classification, object detection, feature extraction, sequence-to-sequence prediction, neural collaborative filtering for reccomendation etc.
    [Show full text]