
BigDL Tutorial
Zhichao Li ([email protected]), Qiu Xin ([email protected])
Big Data Technologies, Software and Service Group, Intel

About us
. Software engineers at Intel and developers of BigDL
. Focus areas:
  – Large-scale machine learning and deep learning implementation and optimization
  – Machine learning / deep learning applications on big data

Agenda
. Run things locally
  – Spark basics
  – Run BigDL (Keras model support)
  – Customer case
. Run in a distributed way (on Baidu Yun)
  – LeNet
  – ResNet
  – Wide & Deep

After this training, you can
. Know how to install and run BigDL
. Get familiar with the BigDL APIs
. Build deep learning models (LeNet, Wide & Deep and ResNet) on Apache Spark with BigDL and Keras

Spark basics

The Big Data Problem
. One machine cannot process, or even store, all the data!
. The solution is to distribute the data over a cluster of machines

Apache Spark
Apache Spark is a fast and general engine for large-scale data processing.
• Up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
• Unified engine/interface for complete data applications
• SQL, Streaming, ML and Graph in the same framework
• Write applications quickly in Java, Scala, Python or R
• Runs on Hadoop, Mesos, standalone, or in the cloud (K8S support is work in progress)
• Access diverse data sources including HDFS, Cassandra, HBase and S3

Apache Spark Components
(diagram: SQL, Streaming, MLlib and GraphX, all built on top of Spark Core)

How does Apache Spark work
(diagram: tasks written in Python, Scala, etc. run across the servers of a cluster, each processing a partition of a dataset held in memory, HDFS, S3 or a database)

Hands on
. Source code: https://github.com/zhichao-li/act
. Machines: -

BigDL introduction – Why BigDL

Why BigDL?
. Big data boosts deep learning
. Production ML/DL systems are complex (Andrew Ng, Baidu; NIPS 2015 paper)

Practical challenges
• Large-scale datasets
• Elasticity
• Fault tolerance
• Performance and scalability
• Dynamic resource sharing
• Integration with the big data ecosystem
• Programming tools / languages
• Productivity
• …

Hadoop/Spark Ecosystem – Productivity

Apache Spark and BigDL – Productivity
(diagram: BigDL sits alongside SQL, Streaming, MLlib and GraphX on top of Spark Core)

Productivity
End-to-end machine learning pipeline:
• Large-scale data management
• Data analytics
• Data cleaning and preprocessing
• Feature engineering
• Hyper-parameter tuning
• …
Example – Fintech: transaction fraud detection (powered by Apache Spark and BigDL)

Ease of Use
. A friendly API to define, train, evaluate and predict with models
. Supports Scala and Python
. Runs out of the box on Apache Spark
. Integrates easily with other Apache Spark components
. Scalable development
. Easy to deploy
. Visualization

Rich deep learning features
• Tensors and layers – more than 100 (Linear, Conv2D, Conv3D, Embedding, Recurrent…)
• Loss functions – dozens (Cross Entropy, SmoothL1, DiceCoefficient…)
• Optimization algorithms – SGD, Adagrad, Adam…
• Save and load model files – including Torch / Caffe / TensorFlow formats

High performance on your server
. Powered by the Intel Math Kernel Library (MKL)
. Extremely high performance on Xeon CPUs
  – An order of magnitude faster than out-of-the-box Caffe / Torch / TensorFlow
. Good scalability
  – Hundreds of nodes
  – https://www.cray.com/blog/scalable-deep-learning-bigdl-urika-xc-software-suite/

How BigDL runs on Apache Spark
(diagram: spark-submit launches your Python file together with the BigDL jar, the BigDL Python .zip and the MKL native library)
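In command form, the diagram above amounts to a spark-submit invocation along the following lines. This is a minimal sketch: the master URL is only an example, and the jar/zip paths are placeholders whose exact names depend on the BigDL version you downloaded (the general pattern appears again on the "Run BigDL program (on the cluster)" slide later).

    spark-submit \
      --master local[4] \
      --jars path/to/bigdl.jar \
      --py-files path/to/bigdl-python-api.zip \
      your_python_file.py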
BigDL introduction – How to get BigDL

Get BigDL packages
. Pip install – recommended for Python users (only supports BigDL on Spark 2.2)
. Download – if you are on another Spark version
. Maven / SBT – for Java/Scala users
. Build from source code – for BigDL developers

Pip install

    $ pip install BigDL

Download
https://bigdl-project.github.io/0.5.0/#release-download/

Maven / SBT

    <dependencies>
      <dependency>
        <groupId>com.intel.analytics.bigdl</groupId>
        <artifactId>bigdl-SPARK_(1.5/1.6/2.1/2.2)</artifactId>
        <version>0.5.0</version>
      </dependency>
    </dependencies>

Build from source code

    $ git clone https://github.com/intel-analytics/BigDL.git
    $ cd BigDL
    $ ./make-dist.sh                # For Spark 1.5/1.6
    $ ./make-dist.sh -P spark_2.x   # For Spark 2.0/2.1/2.2

Run BigDL program (pip install)

    from bigdl.util.common import *
    from pyspark import SparkContext
    from bigdl.nn.layer import *

    # Create a SparkContext with the BigDL configuration
    sc = SparkContext.getOrCreate(conf=create_spark_conf().setMaster("local[*]"))
    init_engine()          # prepare the BigDL environment
    linear = Linear(2, 3)  # try to create a Linear layer

    $ python your_python_file.py

Run BigDL program (on the cluster)

    spark-submit \
      --master xxx \
      --jars path_to_big_dl_jar \
      --py-files path_to_big_dl_python_zip \
      your_python_file ...

BigDL introduction – Define, train and evaluate models

Define A Model
. Sequential API – the user adds layers into a container to build the model
. Functional API – the model is described as a graph

LeNet-5

Define LeNet-5 in the Sequential API

    model = Sequential()
    model.add(Reshape([1, 28, 28]))
    model.add(SpatialConvolution(1, 6, 5, 5))
    model.add(Tanh())
    model.add(SpatialMaxPooling(2, 2, 2, 2))
    model.add(Tanh())
    model.add(SpatialConvolution(6, 12, 5, 5))
    model.add(SpatialMaxPooling(2, 2, 2, 2))
    model.add(Reshape([12 * 4 * 4]))
    model.add(Linear(12 * 4 * 4, 100))
    model.add(Tanh())
    model.add(Linear(100, 10))
    model.add(LogSoftMax())

Define LeNet-5 in the Functional API

    reshape1 = Reshape([1, 28, 28])()
    conv1 = SpatialConvolution(1, 6, 5, 5)(reshape1)
    tanh1 = Tanh()(conv1)
    pool1 = SpatialMaxPooling(2, 2, 2, 2)(tanh1)
    tanh2 = Tanh()(pool1)
    conv2 = SpatialConvolution(6, 12, 5, 5)(tanh2)
    pool2 = SpatialMaxPooling(2, 2, 2, 2)(conv2)
    reshape2 = Reshape([12 * 4 * 4])(pool2)
    linear1 = Linear(12 * 4 * 4, 100)(reshape2)
    tanh3 = Tanh()(linear1)
    linear2 = Linear(100, 10)(tanh3)
    softmax = LogSoftMax()(linear2)
    model = Model(reshape1, softmax)

Keras Support
(diagram: a Keras 1.2.2 model maps onto BigDL's Layers API – either load the Keras model directly or use the Keras-like API; a TensorFlow saved model maps onto the Ops API)

Load Keras model

Keras-like API

Caffe Support
Load a Caffe model:

    model = Model.load_caffe_model(caffe.prototxt, caffe.model)

Load Caffe model weights into a predefined BigDL model:

    model = Model.load_caffe(bigdlModel, caffe.prototxt, caffe.model, match_all=True)

Environment Setup
Before starting this course, you should:
• Find a Linux/Mac machine
• Install Python 2.7 and JDK 8 (the versions required by this course; BigDL also supports Python 3.5 and JDK 7)
• Run pip install numpy scipy pandas matplotlib jupyter BigDL

    export JAVA_HOME=jdk_path
    jupyter notebook --notebook-dir=./ --ip=* --no-browser

Build a digit classifier with different models
Train different neural network models on the MNIST dataset:
. Introduce the MNIST dataset
. Logistic regression
. CNN model

Notebooks
https://github.com/zhichao-li/act/blob/master/notebooks/part1/introduction_to_mnist.ipynb
https://github.com/zhichao-li/act/blob/master/notebooks/part1/cnn.ipynb
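For the logistic-regression model listed above (the full walk-through lives in the notebooks), a minimal definition in BigDL's Sequential API could look like the sketch below. It only reuses layers already introduced in the LeNet example; the 784-feature input and 10-class output simply reflect MNIST's 28x28 images and ten digit classes.

    from bigdl.util.common import *
    from pyspark import SparkContext
    from bigdl.nn.layer import Sequential, Reshape, Linear, LogSoftMax

    # Same setup as in "Run BigDL program (pip install)"
    sc = SparkContext.getOrCreate(conf=create_spark_conf().setMaster("local[*]"))
    init_engine()

    # Flatten each 28x28 image into 784 features, apply a single linear layer,
    # and output log-probabilities over the 10 digit classes
    model = Sequential()
    model.add(Reshape([28 * 28]))
    model.add(Linear(28 * 28, 10))
    model.add(LogSoftMax())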
Continue…
(Break: join the on-site activities to win gifts, and visit the Intel exhibit at booth #100.)

Train A Model
The Optimizer needs:
. A model
. Data – an RDD[Sample], where a Sample is an array of numpy ndarrays (features and label)
. A loss function
. A batch size

Pipeline
(diagram: RDD[raw data] → transform (Python) → RDD[Sample(ndarray, ndarray)] → train the model (Python))

Train A Model

    from bigdl.nn.layer import Linear
    from bigdl.util.common import *
    from bigdl.nn.criterion import MSECriterion
    from bigdl.optim.optimizer import Optimizer, MaxIteration
    import numpy as np

    # Define a model
    model = Linear(2, 1)

    # Produce some data
    samples = [
        Sample.from_ndarray(np.array([5, 5]), np.array([2.0])),
        Sample.from_ndarray(np.array([-5, -5]), np.array([-2.0])),
        Sample.from_ndarray(np.array([-2, 5]), np.array([1.3])),
        Sample.from_ndarray(np.array([-5, 2]), np.array([0.1])),
        Sample.from_ndarray(np.array([5, -2]), np.array([-0.1])),
        Sample.from_ndarray(np.array([2, -5]), np.array([-1.3]))
    ]
    train_data = sc.parallelize(samples, 1)

    # Define the training process
    init_engine()
    optimizer = Optimizer(model, train_data, MSECriterion(), MaxIteration(100), 4)
    trained_model = optimizer.optimize()

Define when to end the training

    optimizer = Optimizer(model, train_data, MSECriterion(), MaxEpoch(100), 4)

Change the optimization algorithm

    optimizer = Optimizer(model, train_data, MSECriterion(), MaxIteration(100), 4,
                          optim_method=Adam())

Validate your model in training

    optimizer.set_validation(batch_size, val_rdd, trigger, validation_method)

• trigger: how often to run validation, e.g. every few iterations or epochs
• val_rdd: the separate dataset used for validation
• validation_method: how to evaluate the model, e.g. Top1Accuracy
• batch_size: how many records to evaluate at a time

Checkpointing

    optimizer.set_checkpoint(path, trigger, isOverWrite=True)

• path: the directory in which to save the snapshots
• trigger: how often to save a checkpoint

Visualization

    optimizer = Optimizer(...)
    ...
    log_dir = 'mylogdir'
    app_name = 'myapp'
    train_summary = TrainSummary(log_dir=log_dir, app_name=app_name)
    val_summary = ValidationSummary(log_dir=log_dir, app_name=app_name)
    optimizer.set_train_summary(train_summary)
    optimizer.set_val_summary(val_summary)
    ...
    trainedModel = optimizer.optimize()

Model Evaluation

    from bigdl.nn.layer import *
    from bigdl.util.common import *
    from bigdl.optim.optimizer import *
    import numpy as np

    sc = SparkContext.getOrCreate(conf=create_spark_conf())
    init_engine()

    samples = [Sample.from_ndarray(np.array([1.0, 2.0]), np.array([2.0]))]
    testSet = sc.parallelize(samples, 1)

    # You can train a model or load an existing model before evaluation.
    model = Linear(2, 1)
    evaluateResult = model.evaluate(testSet, 1, [Top1Accuracy()])
    print(evaluateResult[0])

Model Prediction

    from bigdl.nn.layer import *
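A minimal prediction sketch, assuming the BigDL 0.5 Python API in which a layer's predict method takes an RDD of Samples and returns an RDD of output ndarrays; the Linear model and the single test Sample here are only placeholders for a trained or loaded model and real data.

    from bigdl.nn.layer import *
    from bigdl.util.common import *
    from pyspark import SparkContext
    import numpy as np

    sc = SparkContext.getOrCreate(conf=create_spark_conf())
    init_engine()

    # Placeholder data and model; in practice, train or load a model first
    samples = [Sample.from_ndarray(np.array([1.0, 2.0]), np.array([2.0]))]
    testSet = sc.parallelize(samples, 1)
    model = Linear(2, 1)

    # predict returns an RDD with one output ndarray per input Sample
    predictions = model.predict(testSet)
    print(predictions.take(1))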