Intel® Software Template Overview


BigDL Tutorial
Zhichao Li ([email protected]), Qiu Xin ([email protected])
Big Data Technologies, Software and Service Group, Intel

About us
• Software engineers at Intel and developers of BigDL
• Focus areas:
  – Large-scale machine learning and deep learning implementation and optimization
  – Machine learning / deep learning applications on big data

Agenda
• Run things locally
  – Spark basics
  – Run BigDL (Keras model support)
  – Customer case
• Run in a distributed way (on Baiduyun)
  – LeNet
  – ResNet
  – Wide & Deep

After this training, you will be able to
• Install and run BigDL
• Get familiar with the BigDL APIs
• Build deep learning models (LeNet, Wide & Deep, and ResNet) on Apache Spark with BigDL and Keras

Spark Basics

The big data problem
• One machine cannot process, or even store, all the data.
• The solution is to distribute the data over a cluster of machines.

Apache Spark
Apache Spark is a fast and general engine for large-scale data processing.
• Up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
• A unified engine/interface for complete data applications
• SQL, streaming, ML, and graph processing in the same framework
• Write applications quickly in Java, Scala, Python, or R
• Runs on Hadoop, Mesos, standalone, or in the cloud (Kubernetes support is work in progress)
• Accesses diverse data sources, including HDFS, Cassandra, HBase, and S3

Apache Spark components
SQL, Streaming, MLlib, and GraphX, all built on top of Spark Core.

How does Apache Spark work?
A dataset (held in memory, HDFS, S3, or a database) is partitioned across the servers of the cluster, and tasks (written in Python, Scala, ...) are dispatched to run on those partitions.
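To make that picture concrete, here is a minimal PySpark sketch of the dataset/task model; it is illustrative only (the dataset and partition count are made up, not taken from the deck):

from pyspark import SparkContext

# The dataset is split into partitions that live across the cluster,
# and small tasks run against each partition in parallel.
sc = SparkContext.getOrCreate()
data = sc.parallelize(range(1, 101), numSlices=4)  # 4 partitions
squares = data.map(lambda x: x * x)                # a task applied to each element
total = squares.reduce(lambda a, b: a + b)         # partial results combined on the driver
print(total)                                       # 338350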
Hands on
• Source code: https://github.com/zhichao-li/act
• Machines: –

BigDL Introduction – Why BigDL

Why BigDL?
• Big data boosts deep learning (Andrew Ng, Baidu).
• A production ML/DL system is complex (NIPS 2015 paper).

Practical challenges
• Large-scale datasets
• Elasticity
• Fault tolerance
• Performance and scalability
• Dynamic resource sharing
• Integration with the big data ecosystem
• Programming tools / languages
• Productivity
...

Hadoop/Spark ecosystem – productivity

Apache Spark and BigDL – productivity
BigDL sits alongside SQL, Streaming, MLlib, and GraphX as a library on top of Spark Core.

Productivity: an end-to-end machine learning pipeline
• Large-scale data management
• Data analytics
• Data cleaning and preprocessing
• Feature engineering
• Hyper-parameter tuning
Example – Fintech: transaction fraud detection (powered by Apache Spark and BigDL)

Ease of use
• A friendly API for model definition, training, evaluation, and prediction
• Supports Scala and Python
• Runs on Apache Spark out of the box
• Easy integration with other Apache Spark components
• Scalable development
• Easy deployment
• Visualization

Rich deep learning features
• Tensors and layers – more than 100 (Linear, Conv2D, Conv3D, Embedding, Recurrent, ...)
• Loss functions – dozens (CrossEntropy, SmoothL1, DiceCoefficient, ...)
• Optimization algorithms – SGD, Adagrad, Adam, ...
• Saving and loading of model files – including Torch / Caffe / TensorFlow formats

High performance on your servers
• Powered by the Intel Math Kernel Library (MKL)
• Extremely high performance on Xeon CPUs – an order of magnitude faster than out-of-the-box Caffe / Torch / TensorFlow
• Good scalability – hundreds of nodes
  https://www.cray.com/blog/scalable-deep-learning-bigdl-urika-xc-software-suite/

How BigDL runs on Apache Spark
Your Python program is launched through spark-submit together with the BigDL jar (bigdl.jar), the BigDL Python .zip, and the MKL native libraries.

BigDL Introduction – How to get BigDL

Get the BigDL packages
• Pip install – recommended for Python users (only supports BigDL on Spark 2.2)
• Download – if your Spark is another version
• Maven / SBT – for Java/Scala users
• Build from source code – for BigDL developers

Pip install
$ pip install BigDL

Download
https://bigdl-project.github.io/0.5.0/#release-download/

Maven / SBT
<dependencies>
  <dependency>
    <groupId>com.intel.analytics.bigdl</groupId>
    <artifactId>bigdl-SPARK_(1.5/1.6/2.1/2.2)</artifactId>
    <version>0.5.0</version>
  </dependency>
</dependencies>

Build from source code
$ git clone https://github.com/intel-analytics/BigDL.git
$ cd BigDL
$ ./make-dist.sh               # for Spark 1.5/1.6
$ ./make-dist.sh -P spark_2.x  # for Spark 2.0/2.1/2.2

Run a BigDL program (pip install)
from bigdl.util.common import *
from pyspark import SparkContext
from bigdl.nn.layer import *

# Create a SparkContext with the BigDL configuration
sc = SparkContext.getOrCreate(conf=create_spark_conf().setMaster("local[*]"))
init_engine()          # prepare the BigDL environment
linear = Linear(2, 3)  # try to create a Linear layer

Then run it with:
$ python your_python_file.py

Run a BigDL program (on the cluster)
spark-submit \
  --master xxx \
  --jars path_to_big_dl_jar \
  --py-files path_to_big_dl_python_zip \
  your_python_file ...

BigDL Introduction – Define, train and evaluate models

Define a model
• Sequential API – the user adds layers into containers to build the model.
• Functional API – the model is described as a graph.

LeNet-5

Define LeNet-5 in the Sequential API
model = Sequential()
model.add(Reshape([1, 28, 28]))
model.add(SpatialConvolution(1, 6, 5, 5))
model.add(Tanh())
model.add(SpatialMaxPooling(2, 2, 2, 2))
model.add(Tanh())
model.add(SpatialConvolution(6, 12, 5, 5))
model.add(SpatialMaxPooling(2, 2, 2, 2))
model.add(Reshape([12 * 4 * 4]))
model.add(Linear(12 * 4 * 4, 100))
model.add(Tanh())
model.add(Linear(100, 10))
model.add(LogSoftMax())

Define LeNet-5 in the Functional API
reshape1 = Reshape([1, 28, 28])()
conv1 = SpatialConvolution(1, 6, 5, 5)(reshape1)
tanh1 = Tanh()(conv1)
pool1 = SpatialMaxPooling(2, 2, 2, 2)(tanh1)
tanh2 = Tanh()(pool1)
conv2 = SpatialConvolution(6, 12, 5, 5)(tanh2)
pool2 = SpatialMaxPooling(2, 2, 2, 2)(conv2)
reshape2 = Reshape([12 * 4 * 4])(pool2)
linear1 = Linear(12 * 4 * 4, 100)(reshape2)
tanh3 = Tanh()(linear1)
linear2 = Linear(100, 10)(tanh3)
softmax = LogSoftMax()(linear2)
model = Model(reshape1, softmax)

Keras support (Keras 1.2.2)
Keras models map onto BigDL's layers API, either by loading a saved Keras model or through the Keras-like API; TensorFlow saved models map onto the ops API.

Load Keras model

Keras-like API

Caffe support
Load a Caffe model:
model = Model.load_caffe_model(caffe.prototxt, caffe.model)
Load Caffe model weights into a predefined BigDL model:
model = Model.load_caffe(bigdlModel, caffe.prototxt, caffe.model, match_all=True)

Environment setup
Before starting this course, you should:
• Find a Linux/Mac machine
• Install Python 2.7 and JDK 8 (these versions are required by this course; BigDL also supports Python 3.5 and JDK 7)
• Run:
  pip install numpy scipy pandas matplotlib jupyter BigDL
  export JAVA_HOME=jdk_path
  jupyter notebook --notebook-dir=./ --ip=* --no-browser

Build a handwritten-digit classifier with different models
Train different neural network models on the MNIST dataset:
• Introduction to the MNIST dataset
• Logistic regression (a minimal sketch follows the notebook links below)
• CNN model

Notebooks
https://github.com/zhichao-li/act/blob/master/notebooks/part1/introduction_to_mnist.ipynb
https://github.com/zhichao-li/act/blob/master/notebooks/part1/cnn.ipynb
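As referenced above, here is a minimal sketch of the logistic-regression variant, built only from layers the deck has already introduced (Reshape, Linear, LogSoftMax); the 28x28 input size and 10 output classes are the usual MNIST assumptions, and this is not the notebook's exact code:

from bigdl.nn.layer import Sequential, Reshape, Linear, LogSoftMax

# Logistic regression for MNIST: flatten each 28x28 image and apply a single
# linear layer followed by log-softmax over the 10 digit classes.
model = Sequential()
model.add(Reshape([28 * 28]))
model.add(Linear(28 * 28, 10))
model.add(LogSoftMax())

Such a model can then be handed to the same Optimizer pattern shown in the training section below, with a classification criterion in place of MSECriterion.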
Continue...

Join the on-site interactive activities at the conference to win prizes. You are welcome to visit Intel's exhibition, booth No. 100.

Train a model
The Optimizer needs:
• A model
• Data – an RDD[Sample], where a Sample is essentially an array of NumPy ndarrays
• A loss function
• A batch size

Pipeline
RDD[raw data] → transform (Python) → RDD[Sample(ndarray, ndarray)] → train (Python model)

Train a model
from bigdl.nn.layer import Linear
from bigdl.util.common import *
from bigdl.nn.criterion import MSECriterion
from bigdl.optim.optimizer import Optimizer, MaxIteration
import numpy as np

# Define a model
model = Linear(2, 1)

# Produce some data
samples = [
    Sample.from_ndarray(np.array([5, 5]), np.array([2.0])),
    Sample.from_ndarray(np.array([-5, -5]), np.array([-2.0])),
    Sample.from_ndarray(np.array([-2, 5]), np.array([1.3])),
    Sample.from_ndarray(np.array([-5, 2]), np.array([0.1])),
    Sample.from_ndarray(np.array([5, -2]), np.array([-0.1])),
    Sample.from_ndarray(np.array([2, -5]), np.array([-1.3]))
]
train_data = sc.parallelize(samples, 1)

# Define the training process
init_engine()
optimizer = Optimizer(model, train_data, MSECriterion(), MaxIteration(100), 4)
trained_model = optimizer.optimize()

Define when to end the training
optimizer = Optimizer(model, train_data, MSECriterion(), MaxEpoch(100), 4)

Change the optimization algorithm
optimizer = Optimizer(model, train_data, MSECriterion(), MaxIteration(100), 4, optim_method=Adam())

Validate your model during training
optimizer.set_validation(batch_size, val_rdd, trigger, validationMethod)
• trigger – how often to run validation, e.g. every few iterations or epochs
• val_rdd – the separate dataset used for validation
• validationMethod – how to evaluate the model, e.g. Top-1 accuracy
• batch_size – how many samples to evaluate at a time

Checkpointing
optimizer.set_checkpoint(path, trigger, isOverWrite=True)
• path – the directory in which to save the snapshots
• trigger – how often to save a checkpoint

Visualization
optimizer = Optimizer(...)
...
log_dir = 'mylogdir'
app_name = 'myapp'
train_summary = TrainSummary(log_dir=log_dir, app_name=app_name)
val_summary = ValidationSummary(log_dir=log_dir, app_name=app_name)
optimizer.set_train_summary(train_summary)
optimizer.set_val_summary(val_summary)
...
trainedModel = optimizer.optimize()

Model evaluation
from bigdl.nn.layer import *
from bigdl.util.common import *
from bigdl.optim.optimizer import *
from pyspark import SparkContext
import numpy as np

sc = SparkContext.getOrCreate(conf=create_spark_conf())
init_engine()
samples = [Sample.from_ndarray(np.array([1.0, 2.0]), np.array([2.0]))]
testSet = sc.parallelize(samples, 1)

# You can train a model or load an existing model before evaluation.
model = Linear(2, 1)
evaluateResult = model.evaluate(testSet, 1, [Top1Accuracy()])
print(evaluateResult[0])

Model prediction
from bigdl.nn.layer import *
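The deck's prediction snippet breaks off above. As a rough sketch of how prediction with a trained BigDL model typically looks (assuming the predict and predict_class methods of the BigDL 0.5 Python API; the input data here is made up), the pattern mirrors the evaluation example:

from bigdl.nn.layer import *
from bigdl.util.common import *
import numpy as np

# Assumes an existing SparkContext `sc`, init_engine() already called,
# and a trained classification model `model` (e.g. the LeNet-5 defined earlier).
samples = [Sample.from_ndarray(np.random.rand(28 * 28), np.array([1.0]))]
predictSet = sc.parallelize(samples, 1)

predictions = model.predict(predictSet)   # RDD of raw model outputs (ndarrays)
print(predictions.take(1))
labels = model.predict_class(predictSet)  # RDD of predicted class indices
print(labels.take(1))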