Keras’s Phylanx Backend

Bita Hasheminezhad, STE||AR GROUP

Outline

– Available Platforms
– What's Special about Keras
– Keras Backends
– Inference Example 1: Multi-Class Classification
– Inference Example 2: Sentiment Analysis
– Keras in Future
– Conclusion

Deep Learning Platforms

– Spark (Apache)
– Caffe (Berkeley AI Research)
– DistBelief (Google)
– Caffe2 (Facebook)
– TensorFlow (Google)
– PyTorch (Facebook)
– CNTK (Microsoft)
– SINGA (National University of Singapore)
– Project Adam (Microsoft)
– Chainer (Preferred Networks)
– MXNet (Apache)
– CoreML (Apple)
– Theano (Université de Montréal)

Deep Learning Platforms

– Spark (Apache)
– Caffe -> Caffe2 -> PyTorch (Facebook)
– DistBelief -> TensorFlow (Google)
– SINGA (National University of Singapore)
– CNTK (Microsoft)
– Chainer (Preferred Networks)
– Project Adam (Microsoft)
– MXNet (Apache)
– CoreML (Apple)
– Theano (Université de Montréal)

Of these, TensorFlow, CNTK, Theano, and MXNet support Keras.

What is Keras?

– Keras is a high-level neural networks API, written in Python and capable of running on top of a deferred execution backend.

– User friendly
– Modular
– Easily extensible
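Switching backends requires no model changes: Keras reads the KERAS_BACKEND environment variable (or ~/.keras/keras.json) at import time. A minimal sketch; the backend name "phylanx" is an assumption, taken from the "Using Phylanx backend." banner in the later examples:

    import os
    # Select the backend before keras is imported; "phylanx" assumes the
    # Phylanx backend module is installed and registered under that name.
    os.environ["KERAS_BACKEND"] = "phylanx"
    import keras   # prints "Using Phylanx backend."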

Fig 1. Number of publications during the last decade having the name of the DL platform in their full text [1]

[1] https://app.dimensions.ai/discover/publication

Deep Learning Platforms

Imperative or Eager Style
– Caffe -> Caffe2 -> PyTorch (Facebook)

Deferred Style
– TensorFlow (Google)
– CNTK (Microsoft)
– Theano (Université de Montréal)
– MXNet (Apache)
– CoreML (Apple)

– Deferred Execution has two distinct phases: the first phase defines the program as a symbolic graph; the second phase executes an optimized version of the program on the set of available devices. [2]
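For illustration (not from the slides), the two phases look like this in the TensorFlow 1.x API:

    import numpy as np
    import tensorflow as tf  # TensorFlow 1.x API

    # Phase 1: define the program as a symbolic graph; nothing runs yet.
    x = tf.placeholder(tf.float32, shape=(None, 784))
    w = tf.Variable(tf.zeros((784, 10)))
    logits = tf.matmul(x, w)

    # Phase 2: execute an optimized version of the graph on the available devices.
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        out = sess.run(logits, feed_dict={x: np.random.rand(32, 784)})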

[2] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., ... & Kudlur, M. (2016). TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) (pp. 265-283).

Keras different backends

Table 1. Investigating parallelism in the deep learning platforms supported by Keras

Platform   | Data Parallelism                                      | Model Parallelism
TensorFlow | Synchronous or asynchronous through parameter servers | Supported using greedy heuristics
CNTK       | Bounded asynchronous through a parameter server model | –
Theano     | –                                                     | Not on multiple nodes
MXNet      | Synchronous or asynchronous through parameter servers | Not on multiple nodes

– “When gradient nodes are automatically added to the graph, the user has less control, and the heuristics may break down.” [2]


The solution to the problem

Problem: On a single node, training ResNet-50 on the ImageNet data set on an M40 GPU takes 14 days! [3]

Solution: A high-performance Keras backend which
– Is deferred style; it can optimize the expression graph
– Is distributed; it can run on multiple nodes
– Uses asynchronous computations; it avoids the straggler problem

Let’s use HPX!

[3] Zhang, Z., Yin, L., Peng, Y., & Li, D. (2018, December). A Quick Survey on Large Scale Distributed Deep Learning Systems. In 2018 IEEE 24th International Conference on Parallel and Distributed Systems (ICPADS) (pp. 1052-1056). IEEE.

HPX as a backend for Keras

Phylanx (Python frontend, C++ backend) bridges Keras (Python) and HPX (C++).

– Using hints from the user and the optimization step, the expression graph is passed to the HPX runtime, which schedules work and infers the data layout on each compute locality. [4]
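For a taste of the frontend, here is a minimal sketch built from the decorator pattern the inference examples below use; treat the exact API surface as an assumption:

    from phylanx import Phylanx
    import numpy as np

    # The decorator lifts the NumPy-style function into a Phylanx
    # expression tree that the HPX runtime schedules and executes.
    @Phylanx
    def dot_sum(a, b):
        return np.sum(np.dot(a, b))

    print(dot_sum(np.random.rand(64, 64), np.random.rand(64, 64)))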

[4] http://phylanx.stellar-group.org/

How to implement a Keras Backend

[Slide: a table of the Keras backend functions grouped by category, each marked "up to 4D", "late binding", or "not yet implemented". Categories and sample entries: Keras related (epsilon, set_epsilon, floatx, image_data_format, get_uid, ...); basic math (dot, batch_dot, transpose, sum, prod, cumsum, argmax, ...); activations and losses (relu, elu, tanh, softmax, categorical_crossentropy, binary_crossentropy, ...); inference and training (variable, placeholder, gradients, stop_gradient, update, learning_phase, ...); convolutional (conv1d, conv2d, conv3d, pool2d, pool3d, ...); batch related (batch_get_value, batch_normalization, in_train_phase, ...); recurrent (rnn, ctc_decode, ctc_batch_cost, ...).]
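A Keras backend is ultimately a Python module exporting these functions. As an illustrative sketch only, a few of the simplest entries, with NumPy standing in for the corresponding Phylanx primitives:

    import numpy as np

    _EPSILON = 1e-7   # fuzz factor used in numeric expressions

    def epsilon():
        # Keras queries this for numerically safe divisions and logs
        return _EPSILON

    def set_epsilon(e):
        global _EPSILON
        _EPSILON = e

    def ones(shape, dtype=None, name=None):
        # Instantiates an all-ones variable; the real backend would
        # return a Phylanx tensor handle instead of a NumPy array.
        return np.ones(shape, dtype=dtype or "float32")

    def dot(x, y):
        return np.dot(x, y)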
Inference Example 1: Multi-Class Classification

    from keras import backend as K
    from keras.datasets import mnist
    from keras.utils import to_categorical
    import numpy as np
    import pandas as pd

    (_, y_train), (x_test, y_test) = mnist.load_data()
    num_classes = len(np.unique(y_train))

    # convert class vectors to binary class matrices
    print("y_train shape:", y_train.shape)
    y_train = to_categorical(y_train, num_classes)
    print("y_train shape, after one_hot encoding:", y_train.shape)

    # class probabilities predicted by a pre-trained model
    df = pd.read_csv('class_pred.csv')
    class_pred = df.values
    print("class_predict shape:", class_pred.shape)
    print("A sample of class_predict:", class_pred[0])

    labels_pred = K.argmax(class_pred, axis=1)
    print("Predicted labels:", K.get_value(labels_pred))
    print("What we have on y_test", K.eval(y_test))

    corrects = K.equal(labels_pred, y_test)
    corrects = K.cast(corrects, 'int64')
    print("Correct labels:", K.get_value(corrects))
    number_of_corrects = K.get_value(K.sum(corrects))
    print("Number of corrects predictions: %d" % number_of_corrects)

    corrects = K.expand_dims(corrects, axis=0)
    num_images = K.int_shape(corrects)[1]
    print("Accuracy: %.2f%%" % ((number_of_corrects * 100) / num_images))

    # Misclassified
    incorrects = K.not_equal(corrects, 1)
    incorrects = K.eval(incorrects)
    labels_error = (lambda x: x[0] * x[1])([K.eval(labels_pred), incorrects])
    labels_true = (lambda x: x[0] * x[1])([K.eval(y_test), incorrects])
    labels_true_slice = K.slice(K.squeeze(K.variable(labels_true), 0), [0], [500])
    labels_error_slice = K.slice(K.flatten(K.variable(labels_error)), [0], [500])
    for i, j in zip(K.get_value(labels_true_slice), K.get_value(labels_error_slice)):
        if i != 0:
            print("Label", i, "is misclassified as", j)

Output:

    Using Phylanx backend.
    y_train shape: (60000,)
    y_train shape, after one_hot encoding: (60000, 10)
    class_predict shape: (10000, 10)
    A sample of class_predict: [3.27987540e-37 1.93442800e-25 5.78854500e-25 1.94946260e-21
     3.15305600e-31 1.03375155e-32 0.00000000e+00 1.00000000e+00
     4.98417950e-32 3.93246830e-21]
    Predicted labels: [7 2 1 ... 4 5 6]
    What we have on y_test [7 2 1 ... 4 5 6]
    Correct labels: [1 1 1 ... 1 1 1]
    Number of corrects predictions: 9837
    Accuracy: 98.37%
    Label 4 is misclassified as 2
    Label 2 is misclassified as 7
    Label 5 is misclassified as 3
    Label 3 is misclassified as 7
    Label 6 is misclassified as 0
    Label 9 is misclassified as 3
    Label 8 is misclassified as 2
    Label 2 is misclassified as 7
    Label 8 is misclassified as 4

Inference Example 2: Sentiment Analysis

    from keras import backend as K
    from keras.datasets import imdb
    from phylanx import Phylanx
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    # NumPy functions wrapped as lazily evaluated Phylanx primitives
    @Phylanx
    def unique_eager(x):
        return np.unique(x)
    unique = Phylanx.lazy(unique_eager)

    @Phylanx
    def argsort_eager(x):
        return np.argsort(x)
    argsort = Phylanx.lazy(argsort_eager)

    @Phylanx
    def where_eager(x):
        return np.where(x)
    where = Phylanx.lazy(where_eager)

    (x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)
    classes = unique(K.variable(y_test))
    print("classes:", K.get_value(classes))

    # probabilities predicted by a pre-trained model
    df = pd.read_csv('labels_pred.csv')
    labels_pred = K.squeeze(df.values, 1)
    print("Predicted labels:", K.get_value(labels_pred))

    corrects = K.less(K.abs(labels_pred - K.variable(y_test)), .5)
    print("Accuracy:", K.get_value(K.sum(K.cast(corrects, "int32"))) * 100 /
          K.int_shape(corrects)[0])

    largest_index = y_test.size - 1

    # sort scores and corresponding truth values
    indices = argsort(labels_pred)
    desc_score_indices = K.eval(K.reverse(indices, 0))
    print("desc_score_indices", desc_score_indices)

    y_true = K.equal(K.variable(y_test), 1.)
    y_true = K.gather(y_true, desc_score_indices)
    y_true = K.cast(y_true, "int64")
    print("y_true", K.get_value(y_true))
    y_score = K.gather(labels_pred, desc_score_indices)
    print("y_score", K.get_value(y_score))

    diff = np.diff(K.eval(y_score))
    distinct_value_indices = where(K.not_equal(diff, 0))
    distinct_value_indices = K.get_value(distinct_value_indices)[0]
    print("distinct_value_indices", distinct_value_indices)

    threshold_idxs = K.eval(K.concatenate([K.variable(distinct_value_indices),
                                           K.variable(np.array([largest_index]))], 0))

    # accumulate the true positives with decreasing threshold
    tps = K.get_value(K.gather(K.cumsum(y_true), threshold_idxs))
    print("True Positives:", tps)
    fps = 1 + threshold_idxs - tps
    print("False Positives:", fps)
    thresholds = K.get_value(K.gather(y_score, threshold_idxs))
    print("Decreasing Threshold:", thresholds)

    plot_roc_curve(tps, fps, thresholds)

Output:

    Using Phylanx backend.
    classes: [0 1]
    Predicted labels: [9.2614290e-03 9.9999920e-01 9.9997926e-01 ... 6.2763690e-05
     3.3009052e-03 6.0482204e-01]
    Accuracy: 86.704
    desc_score_indices [12420 1594 2351 ... 11280 13389 18853]
    y_true [1 1 1 ... 0 0 0]
    y_score [1. 1. 1. ... 0. 0. 0.]
    distinct_value_indices [ 484 593 981 ... 23598 23794 23973]
    True Positives: [ 483 591 975 ... 12493 12495 12500]
    False Positives: [ 2 3 7 ... 11302 11479 12500]
    Decreasing Threshold: [1.0000000e+00 9.9999994e-01 9.9999990e-01 ... 5.9604645e-08
     2.9802322e-08 0.0000000e+00]

(From the slide's confusion-matrix diagram: TPR = TP/P, FPR = FP/N.)
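The slide calls plot_roc_curve but does not define it. A plausible sketch, using the slide's TPR = TP/P and FPR = FP/N definitions; the name and signature come from the call site above, everything else is assumed:

    import matplotlib.pyplot as plt

    def plot_roc_curve(tps, fps, thresholds):
        # The last entries are the totals: tps[-1] = P, fps[-1] = N.
        tpr = tps / tps[-1]   # TPR = TP/P
        fpr = fps / fps[-1]   # FPR = FP/N
        # thresholds is kept to match the call site; the curve itself
        # only needs the rates.
        plt.plot(fpr, tpr)
        plt.xlabel("False Positive Rate")
        plt.ylabel("True Positive Rate")
        plt.title("ROC curve")
        plt.show()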

Keras in Future

[5] https://github.com/keras-team/keras/releases

TensorFlow Eager (2.0)

Deferred Style
– TensorFlow 1.0
– CNTK
– Theano
– MXNet
– CoreML

Imperative or Eager Style
– PyTorch
– TensorFlow 2.0

[6] https://www.tensorflow.org/guide/effective_tf2

Performance of TF Eager

Fig 6. Examples per second training ResNet-50 on a GPU
Fig 7. Examples per second training L2HMC on a CPU

– “We expect most real-world models to fall somewhere between these two, and to be able to recover performance by staging as required.” [7]
– “TensorFlow Eager is an evolving technology and closing the gap between imperative and staged performance is being worked on.” [7]
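"Staging" here means tracing a Python function into a graph with tf.function; a minimal sketch of the two styles (not from the slides):

    import tensorflow as tf  # TensorFlow 2.x

    # Eager: runs op by op, immediately; flexible but slower per call.
    def step_eager(x, w):
        return tf.reduce_sum(tf.matmul(x, w))

    # Staged: traced once into a graph, then run as optimized graph code.
    @tf.function
    def step_staged(x, w):
        return tf.reduce_sum(tf.matmul(x, w))

    x = tf.random.normal((128, 256))
    w = tf.random.normal((256, 10))
    print(step_eager(x, w).numpy(), step_staged(x, w).numpy())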

[7] Agrawal, A., Modi, A. N., Passos, A., Lavoie, A., Agarwal, A., Shankar, A., ... & Cai, S. (2019). TensorFlow Eager: A multi-stage, Python-embedded DSL for machine learning. arXiv preprint arXiv:1903.01855.

Where we are now

– We have made good progress on the Phylanx backend for Keras:
  – Many of the needed primitives are implemented in Phylanx [8]
  – BlazeTensor has acceptable support for 3D and 4D arrays [9]
– We need higher dimensionalities, since DL platforms usually add batch and channel dimensions on top of the data dimensions (see the sketch after this list).
– “As one part of the development of TensorFlow, our team has extended the open source Eigen library with support for arbitrary dimensionality tensor operations.” [10]

– The majority of the Keras backend tests pass [11]
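To illustrate the dimensionality point referenced above: a batch of images is already 4D before any model-internal dimensions appear. A small sketch with the standard Keras backend API (shapes illustrative):

    from keras import backend as K
    import numpy as np

    # 32 RGB images of 28x28: (batch, rows, cols, channels) is a 4D tensor.
    x = K.variable(np.random.rand(32, 28, 28, 3))
    k = K.variable(np.random.rand(3, 3, 3, 16))   # (rows, cols, in_ch, out_ch)
    y = K.conv2d(x, k, padding="same", data_format="channels_last")
    print(K.int_shape(y))  # (32, 28, 28, 16)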

[8] https://github.com/STEllAR-GROUP/phylanx
[9] https://github.com/STEllAR-GROUP/blaze_tensor
[10] Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., ... & Ghemawat, S. (2016). TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.
[11] https://github.com/STEllAR-GROUP/keras

Thank you for your attention