Arxiv:1612.09007V1 [Stat.ML] 28 Dec 2016 E Erig(K)[,4.Teie St Fetvl Exploit Effectively to Is Idea the 4]

arXiv:1612.09007v1 [stat.ML] 28 Dec 2016 e erig(K)[,4.Teie st fetvl exploit effectively to is idea The 4]. [3, K Multiple (MKL) as Learning the to nel referred minimizing is and objective) (SVM combination inferrin risk convex simultaneously structural the of for process weights The the kernels. combina- the convex of a consider tion to is straightforwar strategy is adopted modalities) commonly sensing or differen descriptors (e.g. ture sources multiple from methods trans kernels fusing kernel the that of is in property important interest Another of space. function formed any effi- of enabling approximation thereby cient (RKHS), Space Hilbert Re Kernel a to semidefinit ducing (transformation) positive lifting valid a defines A inherently unsuper- kernel 2]. suc- and [1, been supervised problems of has learning range approach vised wide this a matrix. trick, to similarity applied kernel kernel cessfully the the of as terms to in Referred problem dual repo a be into can boundaries decision non-linear clas- obtain binary to Support building sifiers in of problem example the For (SVM), Machines Vector models. de non-linear the effective enable of they sign since formulations sev- learning extend machine to eral framework powerful a provide methods Kernel recognition Activity regularization, fusing in effective is performance. architecture state-of-the-art achieves and proposed kernels data the recognition Ex- activity that kernels. real-world show composition a of on set we results expanded periment an network, of coup this use strategy the of regularization with effectiveness dropout kernel the the improve introduce embeddings to the order fusing In for architecture network adopt and neural similarities deep kernel dens the creates using that data for approach th embeddings alternative In an propose procedures. we paper, optimization ker- sophisticated the through combining for nels parameters the learn Learni to Kernel aspe attempt Multiple (MKL) different for com- approaches characterize Traditional for data. that approach of effective features and multiple popular bining a is fusion Kernel hsrsac a upre npr yteSnI center. SenSIP the by part in supported was research This ne Terms Index ⋆ B ..Wto eerhCne,10 icaa od Yorkt Road, Kitchawan 1101 Center, Research Watson T.J. IBM — EPLANN PRAHT UTPEKRE FUSION KERNEL MULTIPLE TO APPROACH LEARNING DEEP A .INTRODUCTION 1. ‡ enlfso,De erig Dropout learning, Deep fusion, Kernel arneLvroeNtoa as 00Es vne Liverm Avenue, East 7000 Labs, National Livermore Lawrence ABSTRACT unSong Huan † atiea aea Ramamurthy Natesan Karthikeyan eSPCne,EE,AioaSaeUiest,Tme AZ Tempe, University, State Arizona ECEE, Center, SenSIP † aaaa .Thiagarajan J. Jayaraman , fea- t .A d. pro- sed led set cts er- a s ng is g e e - - . hw xetoa oe hndaigwt ope,high- have complex, with learning dealing re deep when sophisticated power as particular, exceptional such shown In paradigms ro- learning be outliers. resentation can and representations noise compact to such bust fusing fe since raw from tures representations effective and compact building reduc promotes vectors. and support of data number in char localities to underlying able is the inpu function the terize gating the treating variable, By a as kernel. challenge sample each data This for intro- function which gating space. [5], a input (LMKL) duces MKL whole Localized the using alleviated over is restri same is the kernel a be for to weight crit- the most since the vectors, and support weights ical global algorithms for MKL solving [5], when suffer by Despi may out pointed the as functions. use, and kernel wide-spread their features different different of the power of representation nature complementary the dif of optimizatio use the improves the kernels and (sum) layer kernel composition The fusion ferent network. the deep at data connected regularization for fully dropout embeddings a in dense them create fuse to and matrices kernel the use 1 Fig. eea xsigapoce o etr uinbgnby begin fusion feature for approaches existing Several ⋆ n nra Spanias Andreas and rpsdapoc o utpekre uin We fusion. kernel multiple for approach Proposed : Embedding Kernel 1 Kernel ‡ Dense rsnaSattigeri Prasanna , Representation Learning Representation Fusion Fusion Layer with Kernel Dropout Embedding Softmax w egt,NY Heights, own Kernel Dense † r,CA ore, M ⋆ , Composition KernelsComposition Embedding Dense cted ac- n. ed p- a- te - t dimensional data. In [6], the authors focused on feature learn- sample xj and all other samples xi and it can be treated as ing for different multimodal settings and showed that in the an embedding for xj . This viewpoint is very similar to the multimodal fusion case, the fused feature exploits comple- construction of dense word embeddings using the PMI in text mentary information from each modality. In [7], Zhao et.al. processing [11]. In the ideal case, sj has large values for the built sub-networks for each heterogeneous feature and relied samples that come from the same class with xj and zeros for on the Stacked Denoising Autoencoders to learn high-level other samples. The sparsity in these embeddings makes them homogeneous representations for feature integration. Note unsuitable for inference tasks [10]. To alleviate this, we ob- that, both methods start with the raw features directly and did tain a dense embedding of the kernel similarities using Prin- not exploit the expressive power of similarity kernels. Exist- ciple Component Analysis (PCA), which projects the origi- ing works on incorporating the advantages of deep learning nal kernel feature to a low-dimensional space. Note that, this into kernel methods either develop novel kernel constructions can be easily replaced by other dense embedding techniques to mimic the large neural computation [8] or apply similar including manifold learning [12], Word2Vec [13] or random neural network structure to combine kernels and optimize at projection [14]. Besides providing dense embeddings, this each layer [9]. step also helps to significantly improve the network training In this paper, we proposeto exploit the advantagesof deep speed. architectures in feature learning to build a new approach to On top of each dense feature set obtained by PCA, we multiple kernel fusion. First, we adopt a novel viewpoint build a fully connected neural network. The goal is to use to kernels by treating the similarities encoded in the kernel back-propogation in a large network to learn a concise repre- matrix as a valid embedding of the data. This is similar to sentation which will be more effective for inference tasks. To the approaches in the natural language processing literature, achieve this, the size of the network needs to be adequately wherein relevance measures such as Pointwise Mutual Infor- large. In our application, we build a 4−layer network sepa- mation (PMI) of a word with respect to other words in the vo- rately for each embedding. At each hidden layer, dropout reg- cabulary is treated as a word embedding [10]. Since the ker- ularization [15] is used to prevent overfitting and batch nor- nel matrix can be inherently sparse and low-rank, we propose malization [16] to accelerate training. After the representa- to apply an additional dense embedding layer (e.g. Singular tions are learned, we stack another layer which is responsible Value Decomposition) to the columns of the kernel matrix. for fusing the features and obtaining the classification result Consequently, the problem of kernel fusion is transformed to with a softmax activation. The most straightforward approach fusing their dense embeddings. To this end, we build a deep for the feature merging is to simply concatenate all the inputs architecture for kernel fusion, coupled with novel training to the layer. However various other merge modes can be eas- strategies: (a) to emulate the convex combination approach in ily applied too including summation, averaging, multiplica- MKL, we expand the set of input kernels by considering com- tion etc. The flexibility of the merge layer facilitates a wide binations of different subsets of the base kernels, and (b) we range of kernel combination forms. perform kernel dropout in the fusion layer for improved regularization. The proposed architecture replaces the complex 2.2. Using Composition Kernels optimization procedure in MKL by efficient representation learning and straightforward feature merging. This makes our An important property of MKL is the various parameteriza- fusion approach easily scalable to a large number of kernels. tion forms for mixing kernels such as convex combination, Hadamard product or mixtures of polynomials [3]. We emulate this property by including all possible combinations 2. PROPOSED APPROACH of base kernels (namely the composition kernels in Fig- ure 1) to the architecture input. Given the base kernels set 2.1. Architecture {K1, K2, ..., KM }, the whole input kernel set Φ to our architecture will have size M˜ : The proposed approach considers the similarity information q encoded in a kernel as an embedding of the data, and poses K K K the problem of MKL as fusing these embeddings in a deep Φ= { | = i, ∀p, q ∈{1, 2, 3, ..., M}, q ≥ p} learning architecture. In this section, we start by presenting Xi=p M the general architecture and then describe strategies for im- M M˜ = proving the performance. m mX=1 As shown in Figure 1, the proposed architecture consists of three components: (a) obtain dense embeddings for data Simple kernel summation proves to be highly effective in using kernel similarities, (b) representation learning using a practical recognition problems and the derived representation deep architecture, and (c) feature fusion. Let us denote the corresponding to the summed kernel is often very different kernel Gram matrix as K, where Ki,j = k(xi, xj ). Each from either alone. Note that other formulations with base column sj of the Gram matrix encodes the relevance between kernels can also be used. Paired with the flexible merge and deep feature representation, our architecture covers a large 3.1.2. Shape Kernel number of kernel combination scenarios without explicitly formulating them. In lieu of building conventional state-space models, Time De- lay Embeddings (TDE) provide an effective way to recon- struct the underlying dynamical system from the observed 2.3. Kernel Dropout Regularization data. Given a time-series data, the phase space is the set of In dropout regularization [15] for training large neural net- states which contain all the necessary information to predict works, neurons are randomly chosen to be removed from the the future of the system [18]. The TDEs of a time-series network along with their incoming and outgoing connections. data x can be defined in matrix form O whose ith column The process can be viewed as sampling a large set of pos- is ot = [xt, xt+τ , xt+2τ , ..., xt+(n−1)τ ]. sible network architectures with shared weights. Given the The n time-delayed observation samples can be consid- large kernel set Φ, a more effective regularization mechanism ered as points in Rn, which is referred to as the delay em- is needed to prevent the network training from overfitting cer- bedding space. In our application, the delay parameter τ is tain kernels. More specifically, we propose to regularize the fixed to 10 and embedding dimension n to 8. Following the fusion layer by dropping the entire representations learned approach in [18], we use PCA to project the embedding to 3- from some randomly chosen kernels. Denoting the represen- D for noise reduction. We extract a simple shape function ˜ based on the geometric distances, and use it to derive our tations learned for all kernels as Φ = {r1, r2, ..., rM˜ } and a vector t associated with M˜ independent Bernoulli trials, the feature. The shape function we consider measures the pair- representation rm is dropped from the fusion layer if tm is 0. wise distances between samples in the TDE space, calculated The feed-forward operation can be expressed as: as Sij = koi − oj k2 [19]. A histogram is calculated using these distances with a pre-specified bin size to build the tm ∼ Bernoulli(p) feature. Following this, we construct an intersection kernel ′ ′ ′ Φ=ˆ {r | r ∈ Φ˜ and tm > 0} k(h, h ) = i min(hi,hi) [20], where h,h are the com- puted histograms.P ˜r = (ri), ri ∈ Φˆ z = w ˜r + b ,y = f(z ) i i i i i 3.1.3. Correlation Kernel where w are the weights for hidden unit i, (·) denotes vector i Correlation measures the dependence between two time- concatenation and f is the softmax activation function. series signals and has been widely used in electroencephalo- gram (EEG) signal analysis. We calculate the absolute 3. SYSTEM SETUP value of Pearson correlation coefficient. To account for shift between the two signals, the maximum absolute co- In this section, we apply the proposed architecture to the im- efficient for a range of shift values is identified. The cor- portant problem of sensor-based activity recognition. Recent relation matrix R defined in this way does not guarantee advances in activity recognition have shown promising results the required positive semi-definite condition of kernel. To in the applications of fitness monitoring and assistive living correct this, we remove the negative eigenvalues from the [17]. However, problem still exists on how to effectively deal matrix. Given the eigen-decomposition of the correlation with the measurement inaccuracy and noise. One popular ap- T matrix R = QΛQ , where Λ = diag(λ1, λ2, ..., λn) and proach to the problem is utilizing various features and kernels λ1 ≥ λ2 ≥ ... ≥ λr ≥ 0 ≥ λr+1 ≥ ... ≥ λn, the that characterize salient aspects of the data and develop effi- correlation kernel is constructed as K = QΛQˆ T , where cient fusion mechanisms to combine them. In this paper, we ˆ Λ = diag(λ1, λ2, ..., λr, 0, ..., 0). construct kernels which describe the statistical property, peri- odic structure and inter-sample relations for the accelerometer signals. 3.2. Dataset The dataset used in our experiments is obtained from [21] 3.1. Feature Extraction and Kernel Construction and corresponds to 12 different daily activities for 14 sub- jects. Each activity is repeated in trials for each subject. 3.1.1. Statistics Kernel 5 The 3-axis accelerometer measurements were obtained at Statistical features have been known to be useful for activity a sampling rate of 100Hz. We consider 5 seconds of non- recognition [17]. The features we use include mean, median, overlapping frames and as a result there are 5353 frames. In standard deviation, kurtosis, skewness and total acceleration. our experiment, 80% randomly chosen samples were used for In addition, we extract the mean-crossing rate and dominant training and the rest for testing. Putting together the 3 base frequency to capture the frequency-domain information. We kernels described in Section 3.1 with the combination kernels construct a Gaussian kernel and the best γ parameter is deter- (K1 + K2, K1 + K3 etc.) makes the total number of input mined by cross validation on the training set. kernels for our architecture to 7. 1.6 Table 1: Classification Performance Comparison

Input Kernel Accuracy (%) 1.2 Statistics 79.3 Shape 73.3 0.8 Correlation 75.3 Loss Value Architecture Accuracy (%) 0.4 (a): Standard Feature Fusion 82.3

0 (b): (a) + Dense Embedding 86.6 0 100 200 300 400 500 Number of Epochs (c): (b) + Composition Kernels 88.1 MKL Method Accuracy (%) Fig. 2: Convergence behavior of the proposed method. UNIFORM 88.5 SMO-MKL [4] N/A SwMKL [22] 88.5 Proposed: (c) + Kernel Dropout 90.2

learned representations). Dense embedding gives around 4% improvement. This demonstrates the necessity of this pre- processing stage when treating kernel values as embeddings. The inclusion of composition kernels and kernel dropout regularization each provides further improvements. Although the improvement is not tremendous in this case, it is significant. We argue that each step is beneficial and expect much more usefulness of them in more complex problems when a large Fig. 3: Confusion matrix based on the classification result. number of descriptors and kernels are needed. From the vi- sualization of confusion matrix in Figure 3 we can see the 4. PERFORMANCE EVALUATION AND overall classification model is highly effective to this problem CONCLUSION and most of the confusion happens only between very related activities (e.g. elevator up versus elevator down). The proposed approach is tested on the described activity We compare our approach to MKL methods including recognition dataset. In the dense embedding stage, the di- combination with uniform weights (denoted as UNIFORM), a mension of the kernel feature is reduced to 500. The 4-layer popular MKL algorithm SMO-MKL [4] and a recent LMKL neural network for representation learning has size 256-1024- approach SwMKL [22]. UNIFORM provides a decent perfor- 512-64. At each hidden layer the dropout rate is fixed at 0.5. mance and this justifies our utilization of the composition In the fusion layer, the kernel dropout rate is set to 0.6. We kernels. SMO-MKL applies to binary classification natively use Keras library with TensorFlow backend to build and and as pointed out by [23], the extension to multi-class clas- perform optimization for the architecture. sification is not trivial. We find this to be true for many The training convergencecurve shown in Figure 2 demon- existing MKL formulations. SwMKL relies on a regression strates that the architecture is able to reduce the loss value method to learn the gating function which characterizes the and achieve convergence quickly. We report the classification discriminative capabilities of kernels on local data regions. performance in Table 1. In our case, the accuracy is defined However, in our case each base kernel classifies training data as the averaged fraction of correctly predicted labels for all fairly well, causing a highly imbalanced regression problem. classes. We make comparison of our proposed architecture to This prevents the Support Vector Regressor from obtaining other 3 setups: (1) single kernel performance, which is ob- a meaningful gating function, thereby resulting in a perfor- tained without the fusion layer, (2) different deep architecture mance similar to that of UNIFORM fusion. The proposed settings, and (3) existing MKL methods. approach achieves the best performance and more impor- First, we observe that all 3 kernels achieve accuracies in tantly, provides a reliable way to fuse a large number of a similar range, while the fusion of them provides a signifi- kernels in a multi-class setting using powerful numerical and cant improvement. In the best case, the improvement is over computational backends that are available for generic neu- 10% compared to the best of single kernel. Second, we com- ral networks. The architecture is also general so that more pare each of the proposed training strategies to the standard advanced techniques can be easily incorporated at certain deep architecture feature fusion (by simple concatenation of stages. ertain stages. 5. REFERENCES in 2004 IEEE Conference on Computer Vision and Pat- tern Recognition (CVPR). IEEE, 2004, vol. 2, pp. II– [1] Shuicheng Yan, Dong Xu, Benyu Zhang, and Hong- 681. Jiang Zhang, “Graph embedding: A general framework for dimensionality reduction,” in 2005 IEEE Conference [13] T Mikolov and J Dean, “Distributed representations on Computer Vision and Pattern Recognition (CVPR). of words and phrases and their compositionality,” Ad- IEEE, 2005, vol. 2, pp. 830–837. vances in neural information processing systems, 2013. [14] Ella Bingham and Heikki Mannila, “Random projection [2] AndréElisseeff and Jason Weston, “A kernel method in dimensionality reduction: applications to image and for multi-labelled classification,” in Advances in neural text data,” in 7th SIGKDD International Conference on information processing systems, 2001, pp. 681–687. Knowledge Discovery and Data Mining. ACM, 2001, pp. 245–250. [3] Ashesh Jain, Swaminathan VN Vishwanathan, and Manik Varma, “Spf-gmkl: generalized multiple kernel [15] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, learning with a million kernels,” in 18th SIGKDD Inter- Ilya Sutskever, and Ruslan Salakhutdinov, “Dropout: national Conference on Knowledge Discovery and Data a simple way to prevent neural networks from overfit- Mining. ACM, 2012, pp. 750–758. ting.,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014. [4] Zhaonan S., Nawanol A., Manik V., and Svn V., “Mul- tiple kernel learning and the smo algorithm,” in Ad- [16] Sergey Ioffe and Christian Szegedy, “Batch nor- vances in neural information processing systems, 2010, malization: Accelerating deep network training by pp. 2361–2369. reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015. [5] M. Gönen and E. Alpaydin, “Localized multiple kernel learning,” in Proceedings of the 25th international con- [17] Mi Zhang and AlexanderA Sawchuk, “Human daily ac- ference on Machine learning. ACM, 2008, pp. 352–359. tivity recognition with sparse representation using wearable sensors,” IEEE journal of Biomedical and Health [6] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Informatics, vol. 17, no. 3, pp. 553–560, 2013. Nam, Honglak Lee, and Andrew Y Ng, “Multimodal [18] J. Frank, S. Mannor, and D. Precup, “Activity and gait deep learning,” in Proceedings of the 28th ICML-11, recognition with time-delay embeddings.,” in AAAI. 2011, pp. 689–696. Citeseer, 2010. [7] Lei Zhao, Qinghua Hu, and Yucan Zhou, “Hetero- [19] V. Venkataraman and P. Turaga, “Shape descriptions geneous features integration via semi-supervised multi- of nonlinear dynamical systems for video-based infer- modal deep networks,” in International Conference on ence,” IEEE Transactions on Pattern Analysis and Ma- Neural Information Processing. Springer, 2015, pp. 11– chine Intelligence, vol. PP, no. 99, pp. 1–1, 2016. 19. [20] Subhransu Maji, Alexander C Berg, and Jitendra Malik, [8] Youngmin Cho and Lawrence K Saul, “Kernel methods “Classification using intersection kernel support vector for deep learning,” in Advances in neural information machines is efficient,” in 2008 IEEE Conference on processing systems, 2009, pp. 342–350. CVPR. IEEE, 2008, pp. 1–8.

[9] Eric V Strobl and Shyam Visweswaran, “Deep multi- [21] Mi Zhang and AlexanderA Sawchuk, “Usc-had: a daily ple kernel learning,” in 2013 12th ICMLA. IEEE, 2013, activity dataset for ubiquitous activity recognition using vol. 1, pp. 414–417. wearable sensors,” in Proceedings of the 2012 ACM Conference on Ubiquitous Computing. ACM, 2012, pp. [10] Omer Levy and Yoav Goldberg, “Neural word embed- 1036–1043. ding as implicit matrix factorization,” in Advances in [22] Raghvendra Kannao and Prithwijit Guha, “Tv neural information processing systems, 2014, pp. 2177– news commercials detection using success based lo- 2185. cally weighted kernel combination,” arXiv preprint arXiv:1507.01209, 2015. [11] Gerlof Bouma, “Normalized (pointwise) mutual information in collocation extraction,” Proceedings of GSCL, [23] Alexander Zien and Cheng Soon Ong, “Multiclass mul- pp. 31–40, 2009. tiple kernel learning,” in Proceedings of the 24th international conference on Machine learning. ACM, 2007, [12] Ahmed Elgammal and Chan-Su Lee, “Inferring3d body pp. 1191–1198. pose from silhouettes using activity manifold learning,”