Thesis

Nonlinear transform learning: model, applications and algorithms

KOSTADINOV, Dimche


Reference

KOSTADINOV, Dimche. Nonlinear transform learning: model, applications and algorithms. Thèse de doctorat : Univ. Genève, 2018, no. Sc. 5335

URN : urn:nbn:ch:unige-1185338 DOI : 10.13097/archive-ouverte/unige:118533

Available at: http://archive-ouverte.unige.ch/unige:118533


UNIVERSITÉ DE GENÈVE
FACULTÉ DES SCIENCES
Département d'Informatique
Professeur S. Voloshynovskiy

Nonlinear Transform Learning: Model, Applications and Algorithms

THÈSE

présentée à la Faculté des sciences de l’Université de Genève pour obtenir le grade de Docteur ès sciences, mention informatique

par

Dimche Kostadinov de Strumica (Macedonia)

Thèse no 5335

GENÈVE Repro-Mail - Université de Genève 2018

NONLINEAR TRANSFORM LEARNING: MODEL, APPLICATIONS AND ALGORITHMS

A DISSERTATION SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE OF THE UNIVERSITY OF GENEVA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Dimche Kostadinov
May 2019

© Copyright by Dimche Kostadinov 2019
All Rights Reserved

In memory of my father

To all that I care about, with love and eternal appreciation

Acknowledgements

I would like to thank my supervisor Prof. Sviatoslav Voloshynovskiy for providing me the opportunity to work on this PhD thesis, for his encouragement, consideration and involvement at all times, the discussions, the insights and all the rest of his support. I would like to thank my jury members Prof. Karen Egiazarian, Prof. Teddy Furon, Prof. Sylvain Sardy and Prof. Stéphane Marchand-Maillet for their careful reading, valuable suggestions and comments. I would like to thank Taras Holotyak for the time he spent in discussions and in reading my draft concepts, as well as for providing valuable comments. I would like to thank Sohrab Ferdowsi for taking the time to participate in discussions regarding my presentations, elaborations and outlines of many of my ideas on the board in our office. I am thankful to Maurits Diephuis for being involved in providing comments on the English writing over many papers. In addition, I would like to thank Behrooz Razeghi for his interest and enthusiasm in some of the concepts over several works and for getting involved in providing comments towards certain details and clarifications. I want to mention all of the rest of my colleagues from our Stochastic Information Processing Group, who have directly or indirectly provided support in terms of discussions, comments and suggestions related to the work in this thesis. I would like to thank the head of the Computer Vision and Multimedia Laboratory, Prof. Thierry Pun, for his supportive attitude and for enabling a great working environment, as well as Prof. Stéphane Marchand-Maillet and Prof. Alexandros Kalousis for providing me with insights and suggestions related to some machine learning aspects. I would like to thank Fokko Beekhof and Farzad Farhadzadeh for their initial help during my relocation to Geneva. I would like to thank Boris Petrov Lambrev for his help with the translation of the abstract. I would also like to thank the rest of the colleagues of the Computer Vision and Multimedia Laboratory, with whom I have spent a wonderful time and made professional as well as personal bonds. I am thankful to the many colleagues with whom I had many coffee breaks with interesting and re-energizing conversations. I would like to thank Edgar Francisco Roman, Sohrab Ferdowsi, Ke Sun, Majid Yazdani and Michal Muszynski for the delightful company and the great hangouts. Finally, I would like to thank my mother, my two sisters and their families for their never-ending support, and I would like to thank my wife.

Abstract

Modeling of nonlinearities is essential for many real-world problems, where its treatment plays a central role and impacts not only the quality of the solution but also the computational complexity. It is highly relevant to a variety of applications, including active content fingerprinting, image restoration, supervised and unsupervised discriminative representation learning for image recognition tasks, and clustering. In this thesis, we introduce and study a novel generalized nonlinear transform model. In particular, our main focus and core element is the nonlinear transform that is expressible by a two-step operation consisting of a linear mapping followed by an element-wise nonlinearity. To that end, depending on the considered application, we unfold probabilistic interpretations, propose generalizations and extensions, and take into account special cases. An approximation to the empirical likelihood of our nonlinear transform model provides a learning objective, where we not only identify and analyze the corresponding trade-offs, but also give information-theoretic as well as empirical risk connections for the addressed objectives in the respective problem formulations. We introduce a generalization that extends an integrated maximum marginal principle over the approximation to the empirical likelihood, which allows us to address the optimal parameter estimation. In this scope, depending on the modeled assumptions w.r.t. an application objective, the implementation of the maximum marginal principle enables us to efficiently estimate the model parameters, where we propose approximate and exact closed-form solutions as well as iterative algorithms with convergence guarantees. Numerical experiments empirically validate the nonlinear transform model, the learning principle, and the algorithms for active content fingerprinting, image denoising, estimation of robust and discriminative nonlinear transform representations for image recognition tasks, and our clustering method that is performed in the nonlinear transform domain. At the time of thesis preparation, our numerical results demonstrate advantages in comparison to the state-of-the-art methods of the corresponding category, regarding the learning time, the run time and the quality of the solution.

Résumé

Les principes de la modélisation de non-linéarités sont essentiels pour maints problèmes de la vie réelle. Leur traitement joue un rôle central et influence non seulement la qualité de la solution, mais aussi la complexité computationnelle et les gains dans les compromis possiblement impliqués, qui sont tous hautement demandés dans une variété d'applications, comme la prise du contenu des empreintes digitales active, la reconstitution des images, l'apprentissage supervisé et non-supervisé des représentations discriminatives pour des tâches de reconnaissance d'image et les méthodes de regroupement. Dans la thèse présente un modèle de transformation non-linéaire généralisé novateur est proposé et étudié. Notre intérêt principal et élément de base est la transformation non linéaire exprimée par une double opération qui consiste en une modélisation linéaire suivi d'une non-linéarité par éléments. Pour ce faire, selon l'application considérée, des interprétations probabilistes sont développées et des généralisations et des cas particuliers sont proposées et considérées. Une approximation à la probabilité empirique de la transformation non-linéaire assure l'objectif d'apprentissage où non seulement les compromis correspondants sont identifiés et analysés, mais les connexions à risque d'un point de vue informative-théorique, ainsi qu'empirique sont proposé en considérant les objectifs adressés dans les formulations respectives du problème. L'introduction d'une généralisation qui étend un principe maximal intégré marginal sur l'approximation de la probabilité empirique permet d'adresser l'estimation optimale du paramètre. Dans cet esprit, selon les hypothèses modelées par rapport à un objectif d'application la réalisation du principe marginal maximal permet d'estimer de manière efficace les paramètres du modèle où des solutions analytiques approximatives et exactes sont proposées, ainsi que des algorithmes itératifs avec des garanties convergentes. Des expériences numériques confirment la validité de notre modèle NT, le principe d'apprentissage, les algorithmes pour la prise du contenu des empreintes digitales active, l'enlèvement du bruit des images, l'estimation d'une représentation de transformation non-linéaire robuste et discriminative pour des tâches de reconnaissance d'image et la méthode de regroupement exécuté dans le domaine de transformation non-linéaire. Lors de la préparation de la thèse nos résultats numériques montrent des avantages, comparés aux méthodes de pointe correspondants, concernant le temps d'apprentissage, la durée de fonctionnement et la qualité de la solution.

Table of contents

List of figures
List of tables

1 Introduction
  1.1 Scope of the Thesis
  1.2 Thesis Outline
  1.3 Main Contributions

2 Modeling and Estimation of Nonlinear Transform
  2.1 Sparse Synthesis Model vs Nonlinear Transform Model
    2.1.1 Sparse Synthesis Model
    2.1.2 Nonlinear Transform Model
    2.1.3 Application Perspectives of the NT Model

3 Estimation and Learning of NT Based Modulation for ACFP
  3.1 Passive Content Fingerprinting
  3.2 Active Content Fingerprint
  3.3 ACFP with Predefined Linear Feature Map
    3.3.1 Contributions
    3.3.2 Reduction to Constrained Projection and Closed Form Solution
    3.3.3 Giving Up Distortion or Exact Feature Descriptor Properties
  3.4 Joint ACFP and Linear Feature Map Learning
    3.4.1 Contributions
    3.4.2 Problem Formulation
    3.4.3 Learning Algorithm (AFIL)
    3.4.4 Local Convergence Analysis
  3.5 ACFP Using Latent Data Representation, Extractor and Reconstructor
    3.5.1 Contributions
    3.5.2 ACFP-LR vs ACFP
    3.5.3 PCFP, ACFP and ACFP-LR: Feature Generation
    3.5.4 ACFP-LR: Reconstruction from Latent Variables
    3.5.5 Problem Formulation
    3.5.6 Reduced Problem Under Linear Feature Extraction and Reconstruction
  3.6 Computer Simulations
    3.6.1 Numerical Experiments Setup
    3.6.2 Measures
    3.6.3 ACFP
    3.6.4 AFIL
    3.6.5 ACFP-LR
  3.7 Summary

4 Learning NT for Image Denoising
  4.1 Contributions
  4.2 Related Work
    4.2.1 Sparse Models for Image Denoising
  4.3 Learning an Overcomplete and Sparsifying Transform with Exact and Approximate Closed Form Solutions
    4.3.1 Problem Formulation
    4.3.2 Two Step Iteratively Alternating Algorithm
    4.3.3 Local Convergence
    4.3.4 Image Denoising With εCAT
  4.4 Numerical Evaluation
  4.5 Summary

5 Learning Robust and Discriminative NT Representation for Image Recognition
  5.1 Decision Aggregation Over Local NT Representations for Robust Face Recognition
    5.1.1 Contributions
    5.1.2 Related Work
    5.1.3 Decisions Aggregation Over Local NT Representation
    5.1.4 Asymptotic Analysis of Computational Cost and Memory Usage
    5.1.5 Numerical Evaluation
  5.2 Supervised and Unsupervised Learning of Sparse and Discriminative NT for Image Recognition
    5.2.1 Contributions
    5.2.2 Related Work
    5.2.3 Nonlinear Transform Model
    5.2.4 Discriminative Prior
    5.2.5 Learning Nonlinear Transform with Priors
    5.2.6 A Solution by Iterative Alternating Algorithm
    5.2.7 Evaluation of Algorithm Properties, Discrimination Quality and Recognition Accuracy
  5.3 Unsupervised Learning of Discrimination Specific, Self-Collaborative and NT Model for Image Recognition
    5.3.1 Motivations, NT Model and Learning Strategy Outline
    5.3.2 Contributions
    5.3.3 Related Work
    5.3.4 Target Specific Self-Collaboration Model
    5.3.5 Joint Learning of NTs with Discrimination Specific Self-Collaboration
    5.3.6 The Learning Algorithm
    5.3.7 Evaluation of the Proposed Approach
  5.4 Summary

6 Clustering with NT and Discriminative Min-Max Assignment
  6.1 Approach Outline and Contributions
    6.1.1 Joint Modeling of NTs with Priors
    6.1.2 Simultaneous Cluster and NT Representation Assignment
    6.1.3 Contributions
  6.2 Related Work
  6.3 Joint Modeling of Nonlinear Transforms With Priors
    6.3.1 Nonlinear Transforms Modeling
    6.3.2 Priors Modeling
  6.4 Problem Formulation and Learning Algorithm
    6.4.1 Problem Formulation
    6.4.2 The Learning Algorithm
  6.5 Evaluation of the Proposed Approach
    6.5.1 Data Sets, Algorithm Setup and Performance Measures
    6.5.2 Numerical Experiments
  6.6 Summary

7 Conclusions
  7.1 NT Model and IMM Principle
  7.2 NT for ACFP
  7.3 NT for Image Denoising
  7.4 NT for Image Recognition
  7.5 Clustering with NT

Appendix A
  A.1 Proof of Theorem 1
  A.2 Proof of Theorem A.1

Appendix B
  B.1 Proof for the Global Optimal Solution
  B.2 Proof for the Approximate ε-Close Closed Form Solution

Appendix C
  C.1 Proof of Theorem 3
  C.2 Proof of Theorem 4
  C.3 Proposition of Likelihood w.r.t. Similarity and Dissimilarity
  C.4 Proof that Constrained Likelihood Estimation is MAP
  C.5 Proof Regarding the Discrimination Density Interpretation
  C.6 Proof of the Implicit Form in Supervised Case and the Closed Form Solution
  C.7 Proof of the Implicit Form in Unsupervised Case and the Closed Form Solution

Appendix D
  D.1 Proof for the Closed Form Solution w.r.t. NT Representation
  D.2 Proof for the Closed Form Solution w.r.t. Similarity Related Parameter
  D.3 Proof for the Solution w.r.t. Dissimilarity Related Parameter

References

List of figures

1.1 The problems addressed in this thesis are based on the unified concept of nonlinear transform modeling and learning.

3.1 Local ACFP framework.
3.2 Local ACFP framework with approximation to the linear map.
3.3 A general scheme for joint ACFP modulation and linear feature map learning.
3.4 A general scheme for ACFP-LR using a latent representation, extractor and reconstructor functions.

4.1 The evolution of a) the transform error $\|AX - Y\|_F^2$, b) $\mathrm{Tr}\{-AXY^T\}$ and its lower bound approximation $\mathrm{Tr}\{-AG\}$, c) the conditioning number and d) the expected mutual coherence $\mu(A)$ while learning the transform matrix $A$ on overlapping $8 \times 8$ noisy image blocks (equivalently $N = 64$) from the Cameraman image, where $M$ was set to 80 and the sparsity level was set to 36.
4.2 The evolution of the normalized transform error $\|AX - Y\|_F^2 / L$, where $L$ is the total number of samples $x_l$, $l \in \{1, \ldots, L\}$, under a) sparsity levels $s \in \{4, 10, 16, 22, 28, 34, 40, 46, 52, 58, 64, 70\}$ and b) amounts of data expressed as a percentage of the total amount of data, while learning the transform matrix $A$ on overlapping $8 \times 8$ noisy image blocks (equivalently $N = 64$) from the Cameraman image, where $M$ was set to 80 and the sparsity level was set to 36.

5.1 The illustration of the clustering over local blocks.
5.2 An illustration of the code $s_{j,k}$ construction for subject $k$ at image patch location $j$.
5.3 An illustration of the recognition based on aggregation over local bag-of-word decisions that use the local NT representations.
5.4 Recognition results under basic fusion with $\ell_1/\ell_2$ constrained projection, soft thresholding and hard thresholding, and weighted fusion with $\ell_1/\ell_2$ constrained projection, soft thresholding and hard thresholding.
5.5 Recognition results under varying number of training samples and varying number of codebook codes.
5.6 Comparative recognition results using Extended Yale B and AR.
5.7 Comparative recognition results using PUT and FARET data sets.
5.8 Comparative recognition results under random corruption and continuous occlusion.
5.9 An illustration of the idea about our NT transform, where we used different colors to denote the spaces of the data samples from different classes in the original and transform domain. The goal of our NT is to achieve discrimination by taking into account a minimum information loss on the linear map and a discrimination prior with a discrimination measure defined on the support intersection for the NT representations.
5.10 The evolution of the approximations $C_1 = R_{P_{\ell_1}}(X)$ and $C_2 = D_{P_{\ell_1}}(X)$, their ratio $C_1/C_2$ and the discrimination power $\log(C_1/C_2) = I^t$ during the learning of the nonlinear transform with transform dimension $M = 9100$.
5.11 The conditioning number $\kappa_n(A) = C_n(A)$ and the expected mutual coherence $\mu(A)$ for the learned linear map $A$ in the NT at different transform dimensionality $M \in Q$.
5.12 The approximation $C_2 = R_{P_{\ell_1}}(X)$ and the discrimination power $I^t$ on a subset of the transform data using the learned NT at different transform dimensionality $M \in Q$.
5.13 The recognition results and the discrimination power on the Extended Yale B and MNIST databases, respectively, using a NT with different dimensionality $M$ and a linear SVM classifier on top of the transform representation.
5.14 The expected loss $E[\|z_{c,k}\|_2^2 / M] = E[\|A x_{c,k} - y_{c,k}\|_2^2 / M]$ and the discrimination power on the Extended Yale B and MNIST databases, respectively. The transform representation $Y$ is obtained by using a nonlinear transform $T_P$ with different dimensionality.
5.15 The expected mutual coherence $\mu(A)$ and the conditioning number $\kappa_n(A) = \lambda_{max}/\lambda_{min}$ for the learned transform matrix $A$ at dimensionality $M \in Q_U$.
5.16 The evolution of the discrimination power $I^t$ for 100 algorithm iterations on a subset of the transform data using UNT at transform dimension $M = 5884$.
5.17 An illustration of the idea about our NT transform with self-collaboration relations that takes a discrimination specific objective into account.

6.1 An illustration of the cluster assignment based on a similarity measure $d(\cdot,\cdot)$ between $x_i$ and the clusters $d_j$, $j \in \{1, \ldots, 8\}$, i.e., $\hat{j} = 7 = \arg\min_j m(x_i, d_j)$.
6.2 An illustration of the proposed simultaneous cluster and NT representation assignment. $q_i = A x_i$ is the linear transform representation, $y_{\{c_1,c_2\}}$ is the NT representation, and $\{\tau_{c_1}, \nu_{c_2}\}$ are element-wise nonlinearity parameters with a discrimination role. There are in total 4 NT representations, determined by all pairs $\{c_1, c_2\} \in \{1,2\} \times \{1,2\}$. Simultaneously, the data point $x_i$ is assigned to cluster index $c = 2(c_1 - 1) + c_2 = (2-1)2 + 2 = 4$ and the NT representation is estimated as $y_i = y_{\{2,2\}}$ based on the discriminating min-max similarity/dissimilarity score.
6.3 The evolution of a) the objective related to the problem of simultaneous cluster and NT representation assignment, b) the expected NT error and c) the expected discrimination min-max functional score per iteration for the proposed algorithm on the ORL [132], COIL [105], E-YALE-B [47] and AR [96] databases.

List of tables

2.1 The nonlinear transform model and the applications considered in this Thesis.

3.1 The $p_e$ using PCFP under varying AWGN noise, varying JPEG compression levels and projective transformation with QF level of 5.
3.2 DWR and $p_e$ using varying ACFP modulation under varying AWGN noise, varying JPEG compression levels and affine transformation with QF level of 5.
3.3 DWR and $p_e$ under PCFP using varying AWGN noise, JPEG quality factor and affine transformation with QF=5 for the feature maps $F$, $F^{\dagger}$, $F^{r}$ and $F_I$.
3.4 DWR and $p_e$ using varying ACFP modulation under varying AWGN noise, JPEG quality factor and projective transformation with QF=5 for the feature maps $F$ and $F_I$.
3.5 DWR and $p_e$ using varying ACFP modulation under varying AWGN noise, JPEG quality factor and affine transformation with QF=5 for the feature maps $F$, $F^{\dagger}$ and $F^{r}$.
3.6 DWR and $p_e$ using varying ACFP modulation under varying AWGN noise, JPEG quality factor and affine transformation with QF=5.
3.7 DWR and $p_e$ under different Additive White Gaussian Noise (AWGN) levels.
3.8 DWR and $p_e$ under different JPEG Quality Factor (QF).
3.9 DWR and $p_e$ under JPEG quality factor QF=5 and affine transformation.
3.10 DWR and $p_e$ for ACFP-LR under affine transform and extremely low QF and high AWGN levels.

4.1 Denoising performance in PSNR, where $\sigma$ is the noise standard deviation.
4.2 The execution time in minutes and the percentage of the used image data.
4.3 The PSNR for the εCAT algorithm learned on a percentage of the available noisy image data with noise level $\sigma = 10$.

5.1 Computational complexity in big $O(\cdot)$ notation.
5.2 Memory usage.
5.3 a) The conditioning number $\kappa_n(A)$ and the expected mutual coherence $\mu(A)$ for the learned linear map $A$ in the NT, and the execution time $t_e$ [min] of the proposed algorithm for 28 iterations using NT with dimensionality $M = 19000$. b) The discrimination power in the original domain $I^O$, after a transform with a random linear map $I^{RT}$, after using a learned sparsifying transform $I^{ST}$ and after using the learned NT $I^{NT}$.
5.4 a) The conditioning number $\kappa_n(A)$ and the expected mutual coherence $\mu(A)$ for the learned linear map $A$ in the NT, and the execution time $t_e$ [min] of the proposed algorithm for 28 iterations using NT with dimensionality $M = 19000$. b) The discrimination power in the original domain $I^O$, after a transform with a random linear map $I^{RT}$, after using a learned sparsifying transform $I^{ST}$ and after using the learned NT $I^{NT}$.
5.5 a) The discrimination power $I$ for the methods DLSI [118], FDDL [162], COPAR [148], LRSDL [149] and the proposed NT, and the recognition results using a nonlinear transform with different dimensionality $M$ and a linear SVM classifier on top of the transform representation for the Extended Yale B and MNIST databases.
5.6 The recognition results and the learning time in hours on all databases, respectively. We show the k-NN accuracy on the original data (OD) representation and the k-NN accuracy on the NT representation, where the UNT is learned in the unsupervised case and has dimension $M = 5884$.
5.7 a) The discrimination power for the sparse representations of the methods DLSI [118], FDDL [162], COPAR [148] and LRSDL [149] and the proposed method UNT, b), c) The recognition results on the Extended Yale B and MNIST for the methods DLSI [118], FDDL [162], COPAR [148] and LRSDL [149] compared to the kNN results on the UNT representations learned in the unsupervised setup.
5.8 Recognition accuracy comparison between state-of-the-art methods and 1) k Nearest Neighbor (kNN) search and 2) linear SVM [61] (l-svm) that use the Sparsifying Nonlinear Transform (sNT) representations from our model on extracted HOG [32] image features. We use our algorithm to learn the model on the HOG features. Then we get the sNT representations with dimensionality 7300 for the respective training and test sets. The training sNT representations are used to estimate the SVM parameters and the recognition is performed using the learned SVM on the test sNT representations.
5.9 The cumulative expected mutual coherence $\frac{1}{L}\sum_l \mu(A_l)$ and the cumulative conditioning number $\frac{1}{L}\sum_l \kappa_n(A_l)$ for the linear maps $A_l$, $l \in \{1, \ldots, 6\}$, with dimensions $6570 \times N$, where $N$ is the dimensionality of the input data.
5.10 The learning time in hours on the databases AR, YALE B, COIL20 and NORB using our model with dimension $LM = 6570$, number of self-collaboration components $L = 9$, and dimension per self-collaboration component $M = 730$.
5.11 The discrimination power in the original domain, after a random transform, after a learned sparsifying transform and after the learned self-collaborating target specific nonlinear transform with dimension $M = 6570$.
5.12 The recognition results on the databases AR, YALE B, COIL20 and NORB, using k-NN on the raw image data (raw) and the sparse representations from our model (p) with dimension $M = 6570$.
5.13 The discrimination power and the recognition results on the Extended Yale B and MNIST databases for the methods DLSI [118], FDDL [162], COPAR [148], LRSDL [149], the proposed model on raw image data (p) and the proposed model on extracted HOG [32] image features (HOG-p).
5.14 Recognition accuracy comparison between state-of-the-art methods and 1) k Nearest Neighbor (k-nn) search and 2) linear SVM [61] (l-svm) that use the Sparsifying NT (sNT) representations from our model on extracted HOG [32] image features. We use our algorithm to learn the model on the HOG features. Then we get the sNT representations with dimensionality 9800 for the respective training and test sets. Considering the obtained result for the SVHN database, we note that the unlabeled training data from the respective database was not used during the learning of the corresponding model.

6.1 The computational efficiency per iteration $t$ [sec] for the proposed algorithm, the conditioning number $\kappa_n(A)$ and the expected mutual coherence $\mu$ for the linear map $A$.
6.2 The clustering performance over the databases COIL, ORL, E-YALE-B and AR evaluated using the Cluster Accuracy (CA) and the Normalized Mutual Information (NMI) metrics.
6.3 Comparative results between the state-of-the-art methods [89], [168], [164], [54] and [73], and the proposed method (*).
6.4 The k-NN accuracy results using the assigned NT representations and the original data (OD) representation.

Chapter 1

Introduction

"A perspective from a transform trough projection, one may see it as a formal notion, others, as the essence of life..."

D. Kostadinov

Nowadays, in many areas, including signal processing, machine learning, artificial intelligence, computer vision, etc., due to the inevitable imperfections in the data acquisition process, a commonly encountered issue is the presence of data uncertainty in the form of noise or data variability. To that end, convenient data representations, mappings and transforms that allow us to qualitatively and efficiently process, analyze, modulate, recognize, classify and cluster the data are essential. In general, we can distinguish two main types of mappings and transforms. The first type characterizes transforms that, when applied to the data, introduce small changes with respect to defined constraints. This approach covers a range of applications in the case when content modulation is appropriate prior to the content distribution/reproduction, such as content authentication, identification and recognition. The second type describes transforms that do not introduce changes to the original data, but when applied to the data result in transform representations that satisfy task specific objectives. Commonly, the latter type of transforms is widely used in applications like image denoising, recognition, classification and clustering. Nonetheless, in both cases, one aims to efficiently express the original data representation with another convenient data representation, which can result from:

- A carefully and appropriately chosen analytic and predefined transform, or
- A data adaptive learned transform.

The advantage of the latter compared to the former is the ability to adapt to the given data, since for the use of the former more strict statistical data properties have to be known in advance, which might be restrictive for practical usability. Nonetheless, in order to accordingly model and allow efficient estimation of a task-relevant, useful and information preserving transform representation that satisfies certain properties, usually, prior knowledge in the form of more "loose" assumptions has to be taken into account. One of the fundamental concepts that was widely exploited in the past decade, addressing data adaptive processing and data analysis, is the sparse data representation. That is, given a data sample $x \in \Re^N$ and a set of vectors $D = [d_1, \ldots, d_M] \in \Re^{N \times M}$ (formally known as a frame¹), a sparse representation $y \in \Re^M$ for $x$ over $D$ is one that uses a sparse (small) number of vectors $d_i \in \Re^N$ from $D$ to represent $x$. Although sparsity is crucial for the modeling and solving of many inverse problems that are encountered across different signal processing, machine learning and artificial intelligence tasks, the sparsity assumption alone is not enough to encompass the full extent of requirements in applications like active content fingerprinting (ACFP) and recognition. On the other hand, even if we rely only on the mostly used synthesis [2] sparse model, a disadvantage might be the computational complexity, since the synthesis sparse model can have high computational complexity when the input data dimension or the sparse representation dimension is high. To address the aforementioned challenges, extensions and alternative models have to be taken into account, and additional priors and assumptions on the representation properties have to be considered, modeled and explored in order to fulfill task specific demands like:

- Low complexity estimate
- Optimal trade-offs
- Robustness
- Low estimate variation w.r.t. a task specific objective
- Discrimination.

All of which are very important for active content fingerprinting (ACFP), image denoising, the estimation of discriminative representations in image recognition/classification tasks and clustering methods. In this thesis, to address the above open issues:

- We introduce a novel generalized nonlinear transform (NT) model
- We demonstrate the usefulness of the NT model across several applications.

¹A set of M orthonormal vectors with vector dimensionality N equal to M is said to form a basis set for that vector space. A frame of an inner product space is a generalization of a basis of a vector space to sets that may be linearly dependent.

Fig. 1.1 The problems addressed in this thesis are based on the unified concept of nonlinear transform modeling and learning.

Our model allows not only to address different task specific constraints on the transform representations, but also offers a probabilistic interpretation, provides information-theoretic connections, and enables the considered approximation of the model log likelihood to be related to the empirical risk in the corresponding learning objective. Our parametric model will vary depending on the different assumptions and prior constraints, which are application driven. At the basic level, a common component is the low complexity nonlinear transform, which is expressible by a linear mapping that is followed by an element-wise nonlinearity. Regarding the estimation of the NT representation, the key difference of our NT model compared to the commonly used synthesis model with constraints is that we do not explicitly address the reconstruction of the data by a sparse linear combination. Rather, we address a constrained projection problem and estimate the NT representation as its solution. Our approach has a number of advantages that will be presented and explained in this Thesis.

1.1 Scope of the Thesis

In the scope of this Thesis, using special cases and extensions of our nonlinear transform model, we address:

- The active content fingerprinting (ACFP) problem
- Image denoising, as one particular representative of the restoration problems
- The estimation of sparse and discriminative NT representations useful for image recognition/classification tasks
- Nonlinear transform domain clustering.

The problems addressed in this Thesis are summarized in Figure 1.1. In ACFP, we use a model that represents a special case of our generalized nonlinear transform model. Since the focus is on a special type of ACFP modulation, the NT representation appears in the linear system, which has to be solved in order to estimate the optimal distortion component that has to be added to the original data. In image denoising, the sparsifying transform model is another special case of our generalized nonlinear transform model. To study the optimal solution for the sparsifying transform model with a non-structured overcomplete transform matrix, we focus on a problem formulation that addresses a trade-off between (a) the alignment of the gradients of the approximative objective and the original objective and (b) the tightness of the lower bound to the original objective. The usage of the aforementioned trade-off offers an acceleration in the local convergence of the solution, next to leading to a satisfactory solution under a small amount of training data. Sparsity alone does not guarantee that the resulting representation will be discriminative. To the best of our knowledge, we provide the first work that extends the sparsifying model for learning sparse and discriminative representations while offering a high degree of freedom in modeling² and imposing constraints other than the sparsity constraint on the representation. Considering an estimate with low variability w.r.t. a discrimination specific objective that is found through the use of self-collaboration, we extend our generalized nonlinear transform model and explore a discrimination centered, collaboration structured and sparse modeling. In the final part of this Thesis, we jointly model and learn multiple NT transforms with explicit consideration of discrimination specific parameters. We propose a novel clustering principle, where we focus on measures that reflect a notion of a joint similarity and dissimilarity score between a data point and a set of data points. Finally, we develop a concept that allows unsupervised discrimination and clustering not in the original data domain, but instead in a nonlinear transform domain, where a nonlinear transform model is used.

²Many nonlinearities, i.e., ReLU, p-norms, elastic net-like, the $\ell_1/\ell_2$-norm ratio, binary encoding, ternary encoding, etc., can be modeled as a generalized nonlinear transform representation.

1.2 Thesis Outline

In Chapter 2, we present the commonly used synthesis model, give its probabilistic interpretation and provide the related inverse problem for the estimation of the synthesis representation. Afterwards, we introduce our generalized nonlinear transform model, which is a unified base for all our NT modeling across the considered applications in this thesis. In addition, we introduce the related direct problem, i.e., the constrained projection problem, which has a central role in the estimation of the NT representation. In Chapter 3, we introduce and describe the active content fingerprinting concept and explain the differences compared to passive content fingerprinting (PCFP). Then, by taking an approximation of the negative logarithm of a special case of our NT model, we introduce the generalized problem formulation as a form of min-max problem. Under a linear modulation and a predefined linear feature map, we show a reduction to a constrained projection problem and provide the optimal solution. We also address an approximation of the predefined linear feature map in order to find appropriate trade-offs between the modulation distortion and the feature robustness. Afterwards, we address a problem formulation where we jointly learn the linear map and estimate the modulation distortion in order to attain a low modulation distortion and high feature robustness. In our numerical evaluation, our efficient solution demonstrates significant improvements compared to PCFP. Finally, we extend the basic concept of ACFP by focusing on a redundant content representation and include extractor and reconstructor functions, which we name ACFP-LR. In the latter, we present the problem formulation and show a reduction to a constrained projection problem, which has an efficient solution. Our numerical evaluation shows that ACFP-LR has superior performance compared to the rest of the analyzed schemes. In Chapter 4, we address the learning problem for the data adaptive transform that provides a sparse representation in a space with dimensions larger than or equal to the dimensions of the original space. We show that the sparsifying transform model represents one reduced form of the generalized NT model. We present an iterative, alternating algorithm that has two steps: (i) transform update and (ii) sparse coding. In the transform update step, we focus on a novel problem formulation based on a lower bound of the objective function that addresses a trade-off between (a) how much the gradients of the approximative objective and the original objective are aligned and (b) how close the lower bound is to the original objective. This allows us not only to propose an approximate closed form solution, but also gives a possibility to find an update that can lead to an accelerated local convergence and enables us to estimate an update that achieves a satisfactory solution under a small amount of data. Since in the transform update the approximate closed form solution preserves the gradient, and in the sparse coding step we use the exact closed form solution, we show that the resulting algorithm is convergent. On the practical side, we evaluate our algorithm in an image denoising application. We demonstrate promising performance together with advantages in training data requirements, accelerated local convergence and computational complexity. Chapter 5 consists of three major sections. In the first section of Chapter 5, we consider the face recognition problem from both machine learning and information coding perspectives, adopting an alternative way of visual information encoding and decoding through estimation of a robust NT representation.
Our model for recognition is based on multilevel vector quantization (MVQ) that is conceptually equivalent to a bag-of-words method and bears similarity to a convolutional neural network (CNN). We introduce an alternative aggregation method over local bag-of-words decisions from locally estimated robust NT representations w.r.t. learned centroids over local blocks. Moreover, we relate the local NT representation with a corresponding likelihood vector. We present a generalization of a sparse likelihood approximation, give connections to the Maximum a Posteriori (MAP) estimate, and show connections to common techniques such as hard and soft thresholding. We evaluate our approach by extensive numerical simulation on face image databases, where we show improvements and competitive results w.r.t. the state-of-the-art methods, while having low computational complexity. In the second section of Chapter 5, we describe and explain our NT model, where we take into consideration the modeling and learning of a nonlinear transform that is parameterized by a linear map and a generalized element-wise nonlinearity. In our modeling of the NT, we introduce the minimum information loss and discriminative priors for the respective linear map and sparse representations. During training, we estimate the model parameters by minimizing an approximation of the negative logarithm of the model. We propose an efficient iterative algorithm with a convergence guarantee that alternates between two steps, which have approximate and exact closed form solutions. Given a test data sample, we estimate a sparse representation using the learned model parameters, which represents a solution to a low complexity constrained projection problem. The efficiency of the proposed approach, together with the potential usefulness of the NT representations, is validated by numerical experiments in supervised and unsupervised image recognition setups. The evaluation demonstrates advantages in comparison to the state-of-the-art methods of the same category, regarding the learning time, the discriminative quality and the recognition accuracy. In the third section of Chapter 5, we present an extension of our base NT model to another NT model for learning collaboration structured, discriminative and sparse representations. The idea is to model a collaboration corrective functionality between multiple nonlinear transforms in order to reduce the uncertainty in the estimate. The focus is on joint estimation of data-adaptive NTs that take into account a collaboration component w.r.t. a discrimination target. The joint model includes the minimum information loss, collaboration corrective and discriminative priors. The model parameters are learned by minimizing the negative logarithm of the learning model, where we propose an efficient solution by an iterative, coordinate descent algorithm. Numerical experiments validate the potential of the proposed learning principle. The preliminary results show advantages in comparison to the state-of-the-art methods. In Chapter 6, we present a novel clustering concept based on (i) jointly learned NTs with minimum information loss and discriminative priors and (ii) min-max assignment over NT representations. In common clustering algorithms, a data point in the original data space is assigned to clusters based on a similarity correspondence.
In contrast, we propose a simultaneous cluster and NT representation assignment principle based on evaluating a min-max score that approximates a discriminative log likelihood in the transform domain. Numerical experiments on an image clustering task validate the potential of the proposed approach. The evaluation shows advantages in comparison to the state-of-the-art clustering methods regarding the learning time and the used clustering performance measures. In Chapter 7, the conclusions summarize this Thesis.

1.3 Main Contributions

The main contributions of this thesis are summarized as follows:

- We introduce a generalized nonlinear transform model that, contrary to the synthesis model, which is based on data reconstruction, relies on a constrained data projection

- We generalize the integrated maximum marginal (IMM) principle by taking into consideration the negative logarithm of the learning model for estimation of the NT model parameters, which enables efficient solutions for a number of applications

- We propose several novel active content fingerprinting (ACFP) schemes under linear modulation and linear feature maps, where we show optimal closed form solutions and efficient algorithms with convergence guarantees by utilizing a special case of our NT model

- We study the sparsifying transform model with an overcomplete transform matrix for an image denoising application. The considered model represents a reduction and a special case of our NT model. We propose not only an alternating algorithm with approximate and exact closed form solutions and convergence guarantees, but also introduce a novel problem formulation that addresses a trade-off between accelerated local convergence and a satisfactory solution under a small amount of data

- We propose novel strategies for learning discriminative and robust NT representations that are useful for image recognition tasks in supervised and unsupervised setups. In addition, we consider task-centric self-collaboration. The NT model parameter estimation is based on our generalized IMM principle, which allows an efficient solution with convergence guarantees to be implemented by an iterative alternating algorithm

- We present a novel clustering concept based on (i) jointly learned NTs with minimum information loss and discriminative priors and (ii) min-max assignment over NT representations, where we introduce the simultaneous cluster and NT representation assignment principle, which is based on evaluating a score that approximates a discriminative log likelihood in the transform domain.

Chapter 2

Modeling and Estimation of Nonlinear Transform

In this chapter, we outline the well-known synthesis model and its corresponding inverse problem. Then, we introduce our nonlinear transform model and its corresponding direct problem. Along the way, we also highlight the differences in the modeling approaches and the corresponding problems.

2.1 Sparse Synthesis Model vs Nonlinear Transform Model

2.1.1 Sparse Synthesis Model

As the name suggests, in many areas the main idea behind this model is to synthesize a data vector from a set of defined vectors that represent some dictionary.

Deterministic Formulation. In the most general case, according to the synthesis model, a data sample $x_i \in \Re^N$ of dimensionality $N$ is approximated by a linear combination $y_i \in \Re^M$ (referred to as a sparse data representation) of a few words (frame vectors), $\|y_i\|_0 \ll M$, from a dictionary (frame¹) $D \in \Re^{N \times M}$, as:

$x_i = D y_i + v_i, \quad (2.1)$

where $v_i \in \Re^N$ denotes the approximation error, which is usually assumed to be Gaussian.

¹A matrix $D \in \Re^{N \times M}$ is said to be overcomplete if $M > N$. Equivalently, if the number $M$ of columns $d_m \in \Re^N$ in $D$ is bigger than the dimensionality $N$ of $d_m$, i.e., $M > N$, one might also say that the set of vectors $\{d_1, d_2, \ldots, d_M\}$ is linearly dependent and that this set forms a frame.

Probabilistic Formulation. In a probabilistic sense, we consider that $x_i$, $y_i$ and $D$ are random vectors and a random matrix, respectively. The conditional probability distribution of $x_i$ given the dictionary $D$ can be expressed as:

$p(x_i \mid D) = \int_{y_i \in \Re^M} p(x_i, y_i \mid D)\, dy_i = \int_{y_i \in \Re^M} p(x_i \mid y_i, D)\, p(y_i \mid D)\, dy_i, \quad (2.2)$

where $p(x_i \mid y_i, D)$ models the relation (2.1), i.e.:

$p(x_i \mid y_i, D) \propto \exp\left(-\tfrac{1}{\beta_0}\|x_i - D y_i\|_2^2\right), \quad (2.3)$

where $\beta_0$ is a scaling parameter. In the prior term $p(y_i \mid D)$, it is usually assumed that $y_i$ is independent of $D$, i.e., $p(y_i) = p(y_i \mid D)$. Moreover, assuming that the entries in the representation $y_i$ are i.i.d. and follow a Laplace distribution, we have that:

$p(y_i) \propto \exp\left(-\tfrac{1}{\beta_1}\|y_i\|_1\right), \quad (2.4)$

where $\|\cdot\|_1$ denotes the $\ell_1$-norm and $\beta_1$ is a scaling parameter.

Learning the Model Parameters. Given $CK$ data samples $X = [x_1, \ldots, x_{CK}]$, we model the conditional probability $p(X \mid D)$, which under the independence assumption between the data samples $x_i$ decomposes as:

$p(X \mid D) = \prod_{i=1}^{CK} p(x_i \mid D). \quad (2.5)$

Moreover, instead of just working with $p(X \mid D)$, we can use Bayes' rule and consider an approximative posterior:

$p(D \mid X) \propto p(X \mid D)\, p(D), \quad (2.6)$

where we disregard the prior $p(X)$, while the prior $p(D)$ on the dictionary $D$ is defined as:

$p(D) \propto \exp(-\Omega_S(D)), \quad (2.7)$

where $\Omega_S(\cdot)$ is the prior measure that defines the properties of the dictionary $D$. Under the above considerations, the Maximum a Posteriori (MAP) estimates of $D$ and $Y = [y_1, \ldots, y_{CK}]$ can be expressed as:

$\{\hat{Y}, \hat{D}\} = \arg\max_{Y,D}\, p(D \mid X) \simeq \arg\max_{Y,D}\, p(X \mid D)\, p(D), \quad (2.8)$

or equivalently, taking the negative logarithm of $p(X \mid D)\, p(D)$, the problem reduces to:

$\{\hat{Y}, \hat{D}\} \simeq \arg\min_{Y,D}\left[-\log p(X \mid D) - \log p(D)\right] = \arg\min_{Y,D}\left[-\sum_{i=1}^{CK}\log p(x_i \mid D) - \log p(D)\right] = \arg\min_{Y,D}\left[-\sum_{i=1}^{CK}\log \int_{y_i \in \Re^M} p(x_i \mid y_i, D)\, p(y_i)\, dy_i - \log p(D)\right]. \quad (2.9)$

The estimation of $\hat{Y}$ and $\hat{D}$ is still difficult to compute due to the integration over $y_i$. If we replace $p(y_i, x_i \mid D)$ with its extreme value, then we end up with the following problem:

$\{\hat{Y}, \hat{D}\} = \arg\min_{D} \sum_{i=1}^{CK} \min_{y_i}\left[\tfrac{1}{2}\|x_i - D y_i\|_2^2 + \lambda_1\|y_i\|_1\right] + \Omega(D), \quad (2.10)$

where we assumed that $\{\tfrac{1}{\beta_0}, \tfrac{1}{\beta_1}\} = \{\tfrac{1}{2}, \lambda_1\}$.

Sparse Representation Estimation. Assuming that the dictionary $D$ is given, then (2.10), per individual sparse representation $y_i$, reduces to:

$y_i = \arg\min_{y_i} \tfrac{1}{2}\|x_i - D y_i\|_2^2 + \lambda_1\|y_i\|_1, \quad (2.11)$

which represents an inverse problem w.r.t. $y_i$ that is also known as a constrained regression problem.
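For concreteness, the inverse problem (2.11) is typically solved with iterative schemes such as proximal gradient descent. The following is a minimal, generic sketch using the iterative shrinkage-thresholding algorithm (ISTA); it is an illustration only and not an algorithm proposed in this thesis, and the dictionary, step size and iteration count are placeholder choices.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1 (element-wise soft thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista_sparse_code(x, D, lam=0.1, n_iter=200):
    """Approximately solve min_y 0.5*||x - D y||_2^2 + lam*||y||_1, i.e. the problem in (2.11)."""
    y = np.zeros(D.shape[1])
    step = 1.0 / np.linalg.norm(D, 2) ** 2      # 1/L with L the Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = D.T @ (D @ y - x)                # gradient of the quadratic data-fit term
        y = soft_threshold(y - step * grad, step * lam)
    return y

# toy usage with a random overcomplete dictionary (M > N)
rng = np.random.default_rng(0)
D = rng.standard_normal((16, 32))
x = D @ (rng.standard_normal(32) * (rng.random(32) < 0.1))   # data synthesized from a sparse code
y_hat = ista_sparse_code(x, D)
```

Note that the per-sample cost is dominated by the repeated multiplications with $D$, which is exactly the computational burden of the synthesis model mentioned in Chapter 1.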

2.1.2 Nonlinear Transform Model

In this thesis, we focus on a model that describes a generalized nonlinear transform representation $y_i$ for the data sample $x_i$.

Deterministic Formulation. We express our nonlinear transform model as:

$A x_i = y_i + z_i, \quad (2.12)$

where $A \in \Re^{M \times N}$ is the linear map of the nonlinear transform, $y_i$ is the nonlinear transform representation and $z_i \in \Re^M$ is the nonlinear transform error vector. In contrast to the synthesis model, in the nonlinear model one assumes that the nonlinear transform representation $y_i$ results from applying a generalized element-wise nonlinearity to $A x_i$ that is parameterized by $\theta$, i.e.,

$y_i = f_{\theta}(A x_i), \quad (2.13)$

where $\theta$ are parameters, which allow us not only to consider a notion of sparsity, but also to take into account robustness or discrimination.

Examples. One simple example of such a transform is the sparsifying transform model [119], where the parameter is $\theta = \lambda\mathbf{1} \in \Re^M$, $\lambda \in \Re$, and a hard thresholding function $f_h$ acts as the nonlinear transform, i.e.:

$f_h(y_i(m)) = \begin{cases} y_i(m), & \text{if } |y_i(m)| > \lambda, \\ 0, & \text{otherwise}. \end{cases} \quad (2.14)$

Another example is the soft thresholding function, i.e.:

$f_s(y_i(m)) = \begin{cases} y_i(m) - \lambda, & \text{if } y_i(m) > \lambda, \\ y_i(m) + \lambda, & \text{if } y_i(m) < -\lambda, \\ 0, & \text{otherwise}, \end{cases} \quad (2.15)$

which can be compactly expressed as:

$f_s(y_i) = \mathrm{sign}(A y_i) \odot \max(|A y_i| - \lambda\mathbf{1}, 0), \quad (2.16)$

where $\mathrm{sign}$ is the sign function and $\odot$ is the Hadamard product. The third example is the ternary encoding:

$f_t(y_i) = \mathrm{sign}(\max(|A y_i| - \lambda\mathbf{1}, 0)). \quad (2.17)$

Another interesting example is the ReLU activation function, which is commonly used in deep neural networks, i.e.:

$f_{\mathrm{ReLU}}(y_i) = \max(A y_i, 0). \quad (2.18)$
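All of these nonlinearities are element-wise and therefore cheap to evaluate once the linear part $q = A x$ has been computed. Below is a minimal NumPy sketch of the thresholding examples above; the function names, threshold value and toy dimensions are ours, chosen only for illustration.

```python
import numpy as np

def hard_threshold(q, lam):
    """(2.14): keep entries whose magnitude exceeds lam, set the rest to zero."""
    return np.where(np.abs(q) > lam, q, 0.0)

def soft_threshold(q, lam):
    """(2.15)/(2.16): shrink magnitudes by lam and clip the result at zero."""
    return np.sign(q) * np.maximum(np.abs(q) - lam, 0.0)

def relu(q):
    """(2.18): rectified linear unit."""
    return np.maximum(q, 0.0)

# toy usage: the two-step operation q_i = A x_i followed by an element-wise nonlinearity
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 4))      # M x N linear map
x = rng.standard_normal(4)
q = A @ x
y_sparse = hard_threshold(q, lam=0.5)
y_shrunk = soft_threshold(q, lam=0.5)
y_relu = relu(q)
```

The binary and ternary codes follow the same pattern, keeping only the signs of the entries that survive the threshold.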

Probabilistic Formulation. To introduce a probabilistic interpretation of the nonlinear transform model, we will consider the marginal probability distribution, which we express as:

$p(y_i \mid x_i, A) = \int_{\theta} p(y_i, \theta \mid x_i, A)\, d\theta. \quad (2.19)$

Furthermore, we can use the chain rule, which leads to:

$p(x_i, y_i, \theta \mid A) = p(y_i, \theta \mid x_i, A)\, p(x_i \mid A) = p(x_i \mid \theta, y_i, A)\, p(y_i, \theta \mid A), \quad (2.20)$

and assume that $p(x_i \mid A) = p(x_i)$.

We are interested in modeling $p(\theta, y_i \mid x_i, A)$, where under Bayes' rule we focus on the proportional form that we express as:

$p(y_i, \theta \mid x_i, A) \propto p(x_i \mid \theta, y_i, A)\, p(y_i, \theta \mid A). \quad (2.21)$

In the simplest case, $p(x_i \mid \theta, y_i, A)$ models the residual vector $z_i = A x_i - y_i$ as:

$p(x_i \mid \theta, y_i, A) \propto \exp\left(-\tfrac{1}{\beta_0}\|A x_i - y_i\|_2^2\right), \quad (2.22)$

where $\beta_0$ is a scaling parameter. We note that any additional knowledge about the residual vector $z_i$ can be used and added in the model $p(x_i \mid \theta, y_i, A)$. In order to simplify the consideration, we neglect the dependence on $A$ by assuming that:

$p(y_i, \theta \mid A) = p(y_i, \theta) \propto \exp\left(-\tfrac{1}{\beta_1} m(\theta, y_i)\right), \quad (2.23)$

where $m(\cdot, \cdot): \Re^M \times \Re^M \to \Re$ is a measure and $\beta_1$ is a scaling parameter. The motivation behind the use of such a parametric prior $p(y_i, \theta)$ on $y_i$ is to accommodate a class of assumptions related to sparsity, robustness and/or discrimination.

Learning the Model Parameters. Given $CK$ data samples $X = [x_1, \ldots, x_{CK}]$, we consider the following learning model:

$p(Y, A \mid X) = p(Y \mid A, X)\, p(A \mid X) = \prod_{i=1}^{CK}\left[\int_{\theta} p(\theta, y_i \mid x_i, A)\, d\theta\right] p(A \mid x_i) \propto \prod_{i=1}^{CK}\left[\int_{\theta} p(x_i \mid \theta, y_i, A)\, p(y_i, \theta)\, d\theta\right] p(A \mid x_i), \quad (2.24)$

where we use a simplification for the prior on the linear map $A$, i.e., $p(A \mid x_i) = p(A)$, and we define it as:

$p(A) \propto \exp(-\Omega(A)), \quad (2.25)$

with $\Omega(\cdot)$ denoting the prior measure which defines the properties of the rows of $A$. Minimizing the exact negative logarithm of our learning model (2.24) over $Y$, $\theta$ and $A$ is difficult, since we have to integrate in order to compute the marginal and the partitioning function of the prior $p(y, \theta)$. Instead of minimizing the exact negative logarithm of the marginal $\int_{\theta} p(x_i \mid \theta, y_i, A)\, p(y_i, \theta)\, d\theta$, we consider minimizing the negative logarithm of its maximum point-wise estimate, i.e.,

$\int_{\theta_{est}} p(x_i \mid \theta_{est}, y_i, A)\, p(y_i, \theta_{est})\, d\theta_{est} \leq D\, p(x_i \mid \theta, y_i, A)\, p(y_i, \theta), \quad (2.26)$

where we assume that $\theta$ are the parameters for which $p(x_i \mid \theta_{est}, y_i, A)\, p(y_i, \theta_{est})$ has the maximum value and $D$ is a constant. Furthermore, we use the proportional relation (2.21) and, by disregarding the partitioning function related to the prior $p(y_i, \theta)$, we end up with the following problem formulation:

$\{\hat{Y}, \hat{\theta}, \hat{A}\} = \arg\min_{Y, \theta, A} \sum_{i=1}^{CK}\Big[\underbrace{\tfrac{1}{2}\|A x_i - y_i\|_2^2}_{-\log p(x_i \mid \theta, y_i, A)} + \underbrace{\lambda_1 m(\theta, y_i)}_{-\log p(y_i, \theta)}\Big] + \underbrace{\Omega(A)}_{-\log p(A)}, \quad (2.27)$

where $\{2, \tfrac{1}{\lambda_1}\} = \{\beta_0, \beta_1\}$.

We note that, in general, depending on the used measures that describe $p(x_i \mid \theta, y_i, A)$ and $p(\theta, y_i)$, even the exact minimization w.r.t. the point-wise estimate might still be difficult to compute, since in order for $p(x_i \mid \theta, y_i, A)$ and $p(\theta, y_i)$ to be properly factored probabilities, they have to contain partitioning functions, which can be exactly evaluated only by integrating over the involved parameters $x_i$, $y_i$, $\theta$ and $A$. Alternatively, the maximization of our learning model $\prod_{i=1}^{CK} \int_{\theta} p(x_i \mid \theta, y_i, A)\, p(y_i, \theta)\, p(A \mid x_i)\, d\theta$ over any of the variables $y_i$, $\theta$ and $A$ can be seen as an approximative form of integrated marginal maximization (IMM) [116] over the respective $y_i$, $\theta$ or $A$, which can be summarized by the following steps (a toy sketch of the resulting alternating scheme is given after the list):

- Approximative maximization of $p(x_i \mid y_i, \theta, A)$ with prior $p(\theta, y_i)$ over $y_i$,
- Approximative maximization of $p(x_i \mid y_i, \theta, A)$ with prior $p(\theta, y_i)$ over $\theta$, and
- Approximative maximization of $\prod_i p(x_i \mid y_i, \theta, A)$ with prior $p(A)$ over $A$.
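The sketch below illustrates this alternating structure under simplifying assumptions that are ours and not the thesis': the measure $m(\theta, y_i)$ is taken to be a plain $\ell_1$ penalty with a fixed threshold, so the $\theta$-update is trivial, and $\Omega(A)$ is a simple Frobenius-norm penalty, so the $A$-update has a ridge-regression closed form. With such a simplistic $\Omega(A)$ the objective can be driven down by shrinking $A$ towards zero, which is precisely why the later chapters impose minimum information loss and conditioning-related priors on $A$; the code only shows the mechanics of the alternation.

```python
import numpy as np

def soft_threshold(q, lam):
    """Closed-form minimizer of 0.5*||q - y||_2^2 + lam*||y||_1 over y."""
    return np.sign(q) * np.maximum(np.abs(q) - lam, 0.0)

def learn_nt_toy(X, M, lam=0.1, gamma=1e-2, n_iter=50, seed=0):
    """Toy alternating minimization of (2.27) with m(theta, y) = ||y||_1 and Omega(A) = gamma*||A||_F^2.

    X : (N, L) data matrix whose columns are the samples x_i.
    Returns the linear map A (M x N) and the NT representations Y (M x L).
    """
    N, L = X.shape
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((M, N)) / np.sqrt(N)
    for _ in range(n_iter):
        # step 1: update the NT representations (closed form, per sample)
        Y = soft_threshold(A @ X, lam)
        # step 2: the nonlinearity parameters theta are kept fixed at lam in this sketch
        # step 3: update the linear map, argmin_A ||A X - Y||_F^2 + gamma*||A||_F^2 (ridge closed form)
        A = Y @ X.T @ np.linalg.inv(X @ X.T + gamma * np.eye(N))
    return A, Y

# toy usage on random data
rng = np.random.default_rng(1)
X = rng.standard_normal((16, 200))
A, Y = learn_nt_toy(X, M=32)
```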

NT Representation Estimation. Assuming that the linear map $A$ and the parameter $\theta$ are given, the exact estimation of $y_i$ w.r.t. our model is equivalent to computing the minimum of the negative logarithm of $p(x_i \mid \theta, y_i, A)\, p(\theta, y_i)$, i.e.:

$\hat{y}_i = \arg\min_{y_i}\left[-\log p(x_i \mid y_i, \theta, A) - \log p(\theta, y_i)\right], \quad (2.28)$

where again we point out that it might be difficult to compute, depending on the chosen measures that describe $p(x_i \mid y_i, \theta, A)$ and $p(\theta, y_i)$, which involve a possible integration in the corresponding partitioning functions for $p(x_i \mid y_i, \theta, A)$ and $p(\theta, y_i)$.

On the other hand, estimating $y_i$ in $p(x_i \mid y_i, A)\, p(\theta, y_i)$ w.r.t. the approximative IMM principle reduces to the following constrained projection problem:

$\hat{y}_i = \arg\min_{y_i}\left[\lambda_0\|q_i - y_i\|_2^2 + \lambda_1 m(\theta, y_i)\right], \quad (2.29)$

where $q_i = A x_i$ and $\{\lambda_0, \lambda_1\}$ are inversely proportional to $\{\beta_0, \beta_1\}$. In contrast to the corresponding inverse problem (2.10) that is related to the synthesis model, (2.29) represents a direct problem that is related to the nonlinear transform model. Moreover, one can efficiently solve (2.29) for a broad class of measures $m(\cdot, \cdot)$, including min and min-max measures. Finally, it should be noted that (2.29) can also be expressed in an alternative form, i.e.:

$\hat{y}_i = \arg\min_{y_i} \tfrac{1}{2}\|q_i - y_i\|_2^2, \quad \text{subject to } m(\theta, y_i) \in \mathcal{M}, \quad (2.30)$

where the constraint on $m(\theta, y_i)$ is satisfied if it belongs to the set $\mathcal{M}$. The alternative formulation (2.30) is very useful for several problems, which we address in this PhD Thesis.
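To see why the direct problem is cheap, consider two simple instances of the measure, chosen here purely for illustration: for $m(\theta, y) = \|y\|_1$, the solution of (2.29) is element-wise soft thresholding of $q_i$, and for the constrained form (2.30) with $\mathcal{M} = \{y : \|y\|_0 \leq s\}$ it amounts to keeping the $s$ largest-magnitude entries of $q_i$. Neither step requires solving a linear system, in contrast to the inverse problem (2.11).

```python
import numpy as np

def nt_project_l1(q, lam0=1.0, lam1=0.1):
    """Closed form of (2.29) for m(theta, y) = ||y||_1: soft thresholding of q."""
    t = lam1 / (2.0 * lam0)
    return np.sign(q) * np.maximum(np.abs(q) - t, 0.0)

def nt_project_l0(q, s):
    """Closed form of (2.30) for M = {y : ||y||_0 <= s}: keep the s largest magnitudes of q."""
    y = np.zeros_like(q)
    idx = np.argsort(np.abs(q))[-s:]
    y[idx] = q[idx]
    return y

# toy usage: both estimates cost O(M) or O(M log M) per sample
q = np.array([0.9, -0.2, 1.5, 0.05, -0.7])
print(nt_project_l1(q, lam0=1.0, lam1=0.4))   # soft thresholding with threshold 0.2
print(nt_project_l0(q, s=2))                  # keeps the two largest-magnitude entries
```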

2.1.3 Application Perspectives of the NT Model

In this Thesis, we focus on the applications of the nonlinear model (2.12) and its probabilistic counterpart (2.19) for several problems covering: active content fingerprinting, image denoising, image recognition and clustering. To accommodate the different applications, we use different variations of our nonlinear transform model that result from taking into account different assumptions about $p(x_i \mid y_i, \theta, A)$, $p(y_i, \theta \mid A)$ and $p(A)$, as shown in Table 2.1.

$$A x_i = f_\theta(A x_i) + z_i, \quad (2.31)$$

to the ACFP model by taking into account that $A = I$, which leads to:

$$x_i = f_\theta(x_i) + z_i, \quad (2.32)$$

where $y_i = f_\theta(x_i)$. Furthermore, we show that:

$$y_i = x_i - z_i, \quad (2.33)$$

with

$$z_i = \theta^{\dagger}(\varphi(\theta x_i)), \quad (2.34)$$

where $\varphi(\cdot)$ is the nonlinear feature mapping and $\theta^{\dagger}$ is the inverse map of $\theta$. The term $z_i$ represents the considered ACFP modulation component that is added to the original data content $x_i$, which is similar to the modulation in digital watermarking.

Image Denoising In the image denoising problem, the nonlinear transform model (2.12) takes the form:

$$A x_i = f_h(A x_i) + z_i, \quad (2.35)$$

where $x_i$ is the noisy image patch. In this model, the nonlinear transform representation, i.e.:

$$y_i = f_h(A x_i), \quad (2.36)$$

is actually the sparsifying transform representation, since $\theta = \lambda_1$ and $f_h(\cdot)$ is the hard thresholding function as defined in the previous section. $A$ is the transform dictionary, while $z_i$ is the sparsifying transform error vector.
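A minimal sketch of the hard-thresholding sparsifying transform representation (2.36) follows; the transform $A$, the noisy patch and the threshold value are placeholders for illustration.

```python
import numpy as np

def hard_threshold(u, lam1):
    """Element-wise hard thresholding f_h(.): keep coefficients whose
    magnitude exceeds lam1 and zero out the rest."""
    return np.where(np.abs(u) > lam1, u, 0.0)

# usage on a (placeholder) noisy patch with an assumed overcomplete transform
A = np.random.randn(128, 64)      # transform dictionary (assumed shape)
x_i = np.random.randn(64)         # vectorized noisy patch
y_i = hard_threshold(A @ x_i, lam1=1.0)   # sparsifying NT representation
z_i = A @ x_i - y_i                       # sparsifying transform error vector
```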

Image Recognition In order to tackle an image recognition problem, we will use a common model that we express as follows:

$$q_i = y_i + z_i, \qquad y_i = f_\theta(q_i), \quad (2.37)$$

where qi is the base representation, yi is the nonlinear transform representation and zi is its nonlinear transform error vector.

Part 1: In the first part of Chapter 5, we use local image blocks $x_i$, clusters $D = [d_1, \ldots, d_C] \in \Re^{N \times C}$ over the local image blocks and a function $f_e(\cdot)$ to define the representation $q_i$ as:

$$q_i = f_e(D, x_i), \quad (2.38)$$

while targeting a robust nonlinear transform representation $y_i = f_\theta(q_i)$, where $f_\theta(q_i)$ is a solution to an $\ell_1/\ell_2$-norm ratio constrained projection problem.

Part 2: In the second part of Chapter 5, we define the representation $q_i$ as:

qi = Axi. (2.39)

During learning, we consider a discrimination measure for our discrimination prior while we

estimate $A$ and $y_i$, and we take into account an implicit form of the parameters $\theta$. The nonlinear transform representation takes the form:

$$y_i = f_\theta(A x_i) = \mathrm{sign}(A x_i) \odot \max(|A x_i| - f_1(\theta), 0) \oslash f_2(\theta), \quad (2.40)$$

Table 2.1 The nonlinear transform model and the applications considered in this Thesis.

The addressed applications   p(x_i | y_i, θ, A)                        p(y_i, θ | A)                                  p(A)
ACFP (Chapter 3)             p(x_i | y_i, θ, A) = p(x_i | y_i)         p(y_i, θ | A) = p(y_i, θ)                      -
Denoising (Chapter 4)        p(x_i | y_i, θ, A) = p(x_i | y_i, A)      p(y_i, θ | A) = p(y_i)                         p(A)
Recognition (Chapter 5)      p(x_i | y_i, θ, A) = p(x_i | y_i)         p(y_i, θ | A) = p(y_i)                         -
                             p(x_i | y_i, θ, A) = p(x_i | y_i, A)      p(y_i, θ | A) = p(y_i, θ)                      p(A)
                             p(x_i | {Y}_i, θ, A) = p(x_i | {Y}_i, A)  p({Y}_i, θ | A) = ∏_{l=1}^L p(y_{l,i}, θ_l)    p(A) = ∏_{l=1}^L p(A_l)
Clustering (Chapter 6)       p(x_i | y_i, θ, A)                        p(y_i, θ | A) = p(y_i, θ)                      p(A)

where $f_1(\cdot): \Re^M \to \Re^M$ and $f_2(\cdot): \Re^M \to \Re^M$ are the generalized element-wise thresholding and normalization functions, respectively, and $\oslash$ denotes the Hadamard division.

Part 3: In the third part of Chapter 5, we extend our nonlinear transform model by considering a relation over the errors in $L$ nonlinear transform representations $\{Y\}_i = [y_{1,\{i\}}, \ldots, y_{L,\{i\}}] \in \Re^{M \times L}$ that represents a form of discrimination specific self-collaboration. In addition, $A = [A_1^{\top}, \ldots, A_L^{\top}]^{\top} \in \Re^{LM \times N}$. This results in an additional collaboration component $c_{l,i} \in \Re^M$, $1 \le l \le L$, which is added in the definition of the representation $q_{l,i}$, i.e.:

$$q_{l,i} = A_l x_{l,i} - c_{l,i}, \quad (2.41)$$

for the $l$-th nonlinear transform representation $y_{l,i}$ out of the total of $L$ that are used during learning. In this model, $y_{l,i}$ has the following form:

$$y_{l,i} = \mathrm{sign}(A_l x_{l,i} - c_{l,i}) \odot \max(|A_l x_{l,i} - c_{l,i}| - f_1(\theta_l), 0) \oslash f_2(\theta_l), \quad (2.42)$$

where $\theta_l$ are discrimination parameters related to the $l$-th nonlinear transform representation.
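A minimal sketch of the generalized element-wise thresholding and normalization nonlinearity in (2.40)/(2.42) is given below; the concrete parametrizations $f_1(\theta)$ and $f_2(\theta)$ are learned in Chapter 5, so here the threshold and normalization vectors are placeholder assumptions.

```python
import numpy as np

def generalized_nt(u, thresholds, norms):
    """Generalized NT nonlinearity: element-wise soft thresholding with
    thresholds f_1(theta), followed by Hadamard division with f_2(theta)."""
    return np.sign(u) * np.maximum(np.abs(u) - thresholds, 0.0) / norms

# usage with placeholder parameters
A = np.random.randn(128, 64)
x = np.random.randn(64)
t = 0.3 * np.ones(128)             # assumed thresholds f_1(theta)
n = np.linalg.norm(A, axis=1)      # assumed per-row normalization f_2(theta)
y = generalized_nt(A @ x, t, n)
```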

Clustering In our clustering method, we consider a model that takes the generalized form:

$$A x_i = f_\theta(A x_i) + z_i. \quad (2.43)$$

Compared to the previous models, the difference in this model is how θ are learned during

the estimation of the model parameters as well as the definition of the function fθ (.), i.e.:

$$f_\theta: \; y_i = y_{|\hat{c}}, \quad \hat{c} = \arg\min_{1 \le c \le C} m(y_{|c}, \theta), \quad \text{with} \quad y_{|c} = \mathrm{sign}(A x_i) \odot \max(|A x_i| - f_{c,t}(\theta), 0) \oslash f_{c,n}(\theta), \quad (2.44)$$

where $C$ is the number of clusters and $1 \le c \le C$. Note that (2.44) represents a form of an assignment based on a discrimination measure $m(y_{|c}, \theta)$ over the parameters $\theta$ and a number of $C$ candidate nonlinear transform representations. A single candidate nonlinear transform representation $y_{|c}$ is estimated using $A x_i$, where $f_{c,t}(\cdot): \Re^M \to \Re^M$ and $f_{c,n}(\cdot): \Re^M \to \Re^M$ are the generalized element-wise thresholding and normalization functions, respectively, per the cluster related nonlinear transforms.
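The following is a minimal sketch of the cluster assignment in (2.44): one candidate NT representation per cluster is built with cluster-specific thresholding and normalization, and the candidate with the smallest discrimination measure is kept. The concrete measure $m(y_{|c}, \theta)$ is defined in Chapter 6; the negative-norm score used here is only an assumed stand-in for illustration.

```python
import numpy as np

def cluster_assign(Ax, thresholds, norms, measure):
    """Build C candidate NT representations and keep the one that minimizes
    the (assumed) discrimination measure, as in (2.44)."""
    candidates = [np.sign(Ax) * np.maximum(np.abs(Ax) - t, 0.0) / n
                  for t, n in zip(thresholds, norms)]
    c_hat = int(np.argmin([measure(y_c) for y_c in candidates]))
    return candidates[c_hat], c_hat

# usage with C = 3 placeholder clusters
A = np.random.randn(128, 64); x = np.random.randn(64)
thr = [0.2 * np.ones(128), 0.4 * np.ones(128), 0.6 * np.ones(128)]
nrm = [np.ones(128)] * 3
y_i, c_hat = cluster_assign(A @ x, thr, nrm, lambda y: -np.linalg.norm(y))
```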

In the next chapters, we will present and elaborate the above applications of the nonlinear transform model in more detail, give the corresponding algorithms for the parameter estimation/learning and compare their performance with respect to the state-of-the-art methods.

Chapter 3

Estimation and Learning of NT Based Modulation for ACFP

Active Content Fingerprinting (ACFP) has emerged as a synergy between digital watermarking (DWM) and passive content fingerprinting (PCFP) [1]. This alternative approach covers a range of applications in which content modulation is appropriate prior to the content distribution/reproduction, such as content authentication, identification and recognition. It was also theoretically demonstrated that the identification capacity of ACFP [2] under additive white Gaussian channel distortions and an ℓ2-norm embedding distortion is considerably higher than those of DWM and PCFP. Interestingly, the optimal modulation of ACFP produces a modulation correlated with the content, in contrast to the optimal modulation of DWM, where the watermark is independent from the host. Several scalar and vector modulation schemes for ACFP were proposed [3], [4] and tested on synthetic signals and collections of images. Despite the attractive theoretical properties of ACFP, the practical implementation of ACFP modulation with an acceptable complexity, capable of jointly withstanding signal processing distortions such as additive white Gaussian noise (AWGN), lossy JPEG compression, histogram modifications, etc., and geometrical distortions (affine and projective transforms), remains an open and challenging problem.

3.1 Passive Content Fingerprinting

In recent years, local, i.e., patch-based, compact, geometrically robust, binary descriptors such as SIFT [5], BRIEF [6], BRISK [7], ORB [8] and the family of LBP [9] have become a popular tool in image processing, computer vision and machine learning. We refer to these local descriptors as local passive content fingerprints (PCFP), since the original data is not subject to change or modification. Given an original image, we assume that a local image patch $x \in \Re^{N\times 1}$ is extracted around a local key point. Usually the patch extraction is performed according to the patch orientation, defined for example by a patch gradient (shown as a red arrow in Figure 3.1). In the most general case, given a patch $x$, the local features are extracted using a mapping function, i.e.,

f = f2(x), (3.1)

where $f_2: \Re^{N\times 1} \to \Re^{M\times 1}$ is a feature mapping function and $M$ is the length of the descriptor. Under a linear mapping function, we have that $f_2(x) = Fx$, where $F \in \Re^{M\times N}$ is a linear map. We point out that the map can be either predefined, data independent and analytic, or learned, data dependent and adaptive. The mapping, followed by a quantization, results in the local fingerprint, i.e.,

$$b = Q(f_2(x)) = Q(Fx), \quad (3.2)$$

where $Q: \Re^M \to \Re^M$ is a quantization function. We note that the differences between the existing classes of local descriptors are determined by the defined mapping $f_2(\cdot)$ and the type of the quantization $Q(\cdot)$.
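A minimal sketch of local PCFP extraction (3.1)-(3.2) under a linear mapping is shown below; the random map and the sign quantizer are placeholders, since the concrete map and quantizer differ between descriptor families.

```python
import numpy as np

def local_pcfp(x, F):
    """Linear feature mapping f_2(x) = F x followed by binary quantization Q(.)."""
    f = F @ x
    return (f > 0).astype(np.uint8)

# usage on a vectorized 31x31 patch (placeholder dimensions)
N, M = 31 * 31, 256
F = np.random.randn(M, N)
b = local_pcfp(np.random.randn(N), F)
```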

3.2 Active Content Fingerprint

The ACFP framework consists of content modulation, prior to its reproduction, and descriptor extraction that includes feature mapping and quantization. The core idea behind the ACFP modulation [43], [78] is based on the observation that the magnitude of the feature coefficients before the quantization influences the probability of bit error in the descriptor bits. Descriptor bit flipping is more likely for low magnitude coefficients. Therefore, it is natural to modify the original content by an appropriate modulation and to increase these magnitudes subject to distortion constraints. We take into account our NT model, where we assume that:

$$p(y \mid x, A)\big|_{A=I} = p(y \mid x) = \int_\theta p(y, \theta \mid x)\,d\theta \;\propto\; \int_\theta p(x \mid y, \theta)\,p(\theta, y)\,d\theta, \quad (3.3)$$

and furthermore, we assume that the modeling of $p(x \mid y, \theta)$ is independent of $\theta$, i.e.,

$$p(x \mid y, \theta) = p(x \mid y), \quad (3.4)$$

where $p(x \mid y)$ models the modulation distortion and $p(\theta, y)$ models feature robustness. In the most general case, we define:

$$p(x \mid y) \propto \exp\Big(-\frac{1}{\beta_0}\,\varphi(f_1(x), x)\Big), \quad (3.5)$$

where the variable $y = f_1(x)$ is the ACFP modulation that modifies the local data, while $\varphi(\cdot)$ is a measure that penalizes the modulation distortions and $\beta_0$ is a normalization variable. We define:

$$p(\theta, y) \propto \exp\Big(-\frac{1}{\beta_1}\,\psi(f_2(y, \theta), \tau)\Big), \quad (3.6)$$

where $\psi(\cdot)$ penalizes non-robust feature components that do not satisfy the modulation level

$\tau$ in the transform domain, while $f_2(y, \theta)$ transforms the modified local data $y$ using the parameters $\theta \in \Re^{M\times N}$ and $\beta_1$ is a normalization variable.

If we disregard the partitioning function of $p(x \mid y)\,p(y, \theta)$, then the minimum over the negative logarithm of our model gives us the following general problem formulation.

Proposition 1: The generalized ACFP can be seen as a solution to a problem of function estimation:

$$\{\hat{f}_1, \hat{f}_2\} = \arg\min_{f_1, f_2} \varphi(f_1(x), x) + \lambda_1 \psi(f_2(f_1(x), \theta), \tau), \quad (3.7)$$

where in our case $y = f_1(x)$, and $\{\lambda_0, \lambda_1\}$ are variables inversely proportional to $\{\beta_0, \beta_1\}$.

In general, (3.7) addresses the trade-off between two conflicting requirements: increasing the feature coefficient magnitudes in order to reduce the probability of bit error, and keeping the modulation distortion low. In the following, we will address different variants of the general problem formulation (3.7) by taking into account different functions $f_1$ and $f_2$, which include cases with a fixed linear map $\theta$, a data adaptive linear map $\theta$, as well as modulation on the original data and on a redundant latent representation of the original data.

3.3 ACFP with Predefined Linear Feature Map

An explicit regularization of the trade-off between modulation distortion and feature robustness in (3.7) can be introduced in order to impact its solution quality. Several constraints can be considered, such as: the distribution of the modulated modifications, the distribution of the feature descriptor modifications, and constraints on the range of values (compactness) of the optimal solution.

Fig. 3.1 Local ACFP framework (original image patch x, modulation min_y ϕ(y, x) + λ_c ψ(CTy, τ) under distortion constraints, modulated image patch y, feature extraction CTy and quantization Q(CTy) into the fingerprint bits).

3.3.1 Contributions

In this section, we focus on a predefined and given linear feature map and present the following contributions:
(i) We show the reduction of the generalized ACFP problem to a constrained projection problem under linear modulation and linear feature extraction;
(ii) We give the optimal closed form solution under invertible linear feature maps;
(iii) In order to attain a low modulation distortion, we address an approximation to the linear feature map, covering two cases: 1) linear feature maps where the number of rows is larger than the number of columns; 2) linear feature maps where the number of rows is smaller than the number of columns.

3.3.2 Reduction to Constrained Projection and Closed Form Solution

In the following, we consider the following constraints:
1. The first mapping function is linear and it is parametrized by a matrix $M \in \Re^{N\times N}$, i.e.:

$$f_1(x) = Mx, \quad (3.8)$$

and the second mapping function $f_2(f_1(x), \theta)$ is linear and parametrized by a matrix $\theta = F \in \Re^{L\times N}$:

$$f_2(f_1(x), \theta) = f_2(f_1(x), F) = F f_1(x). \quad (3.9)$$

The functions $\psi(\cdot)$ and $\varphi(\cdot)$ are a priori defined. Furthermore, the linear map factors as:

$$F = CT \in \Re^{L\times N}, \quad (3.10)$$

where $T \in \Re^{M\times N}$ denotes a transformation matrix and $C \in \{-1, 0, +1\}^{L\times M}$ denotes a constraint matrix.
2. A variable reduction is defined as $y = Mx$, which leads to

$$\varphi(Mx, x) = \|x - Mx\|_2^2 = \|x - y\|_2^2, \quad (3.11)$$

3. The function $\psi(\cdot)$ is replaced by a constraint $\phi(\theta, y) \ge_e 0$, where $\ge_e$ denotes element-wise inequality and $0 \in \Re^{L\times 1}$ is a zero vector. Concretely, we define $\phi(F, y)$ as:

$$\phi(F, y) = |Fy| - \tau 1, \quad (3.12)$$

where $|Fy|$ is the element-wise absolute value of the vector $Fy$.

Under the above constraints, in the following we give the reduced problem.

Proposition 2: Given the constraints 1,2 and 3, ACFP with linear modulation and a linear feature map reduces to a constrained projection problem:

$$\hat{y} = \arg\min_y \tfrac{1}{2}\|y - x\|_2^2, \quad \text{subject to } |Fy| \ge_e \tau 1, \quad (3.13)$$

where $y$ is the modulated data, $x$ is the original data representation, whereas $Fy$ are the robustified features. The main result about the globally optimal solution of (3.13) is stated by the following theorem.

Theorem 1: If $\exists\, x_e \in \Re^N$ such that $F x_e = t_e$, where:

$$t_e = \mathrm{sign}(Fx) \odot \max\{\tau 1 - |Fx|, 0\}, \quad (3.14)$$

then the solution to (3.13) is $y = x + x_e$, where $\odot$ represents the Hadamard (element-wise) product. Moreover, if $F$ is invertible or pseudo-invertible, then the closed form solution to (3.13) is:

$$y = x + F^{\dagger} t_e, \quad (3.15)$$

where $F^{\dagger}$ is the inverse (or pseudo-inverse) of $F$. Proof: See Appendix A.1.

In our solution (3.15), $y$ is the modulated data, $x$ is the original data, $F$ is the linear map, while $x_e$ is the modulation component that is added to $x$ in order to have robustified features

Fig. 3.2 Local ACFP framework with an approximation F† ≈ F to the linear map (modulation min_y ϕ(y, x) + λ_c ψ(F†y, τ) under distortion constraints, followed by feature extraction F†y and quantization Q(F†y)).

$Fy = F(x + x_e)$. We note that in the ACFP modulation (3.15), the NT representation $t_e$ (3.14) appears in the linear system $F x_e = t_e$, which has to be solved in order to estimate the optimal distortion component $x_e$ that has to be added to $x$.
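A minimal sketch of the closed form ACFP modulation (3.14)-(3.15) is given below; the invertible map used here is a random placeholder and the modulation level is an assumed value.

```python
import numpy as np

def acfp_modulate(x, F, tau):
    """Boost low-magnitude feature coefficients of Fx up to the level tau and
    map the required change back to the data domain via the pseudo-inverse."""
    Fx = F @ x
    t_e = np.sign(Fx) * np.maximum(tau - np.abs(Fx), 0.0)   # (3.14)
    x_e = np.linalg.pinv(F) @ t_e                           # solve F x_e = t_e
    return x + x_e                                          # (3.15)

# usage: the modulated features F y then satisfy |F y| >= tau
N = 64
F = np.linalg.qr(np.random.randn(N, N))[0]   # an invertible map for the sketch
x = np.random.randn(N)
y = acfp_modulate(x, F, tau=0.5)
```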

3.3.3 Giving Up Distortion or Exact Feature Descriptor Properties

In the general case, for any linear map and without taking into account its properties, it is possible to use the closed form solution (3.15). In that case, robust feature descriptors will be achieved, but the level of modulation distortion is not regularized explicitly and might be unacceptably high. On the other hand, by considering an arbitrary linear feature map, we can address the trade-off between modulation distortion and feature robustness by modifying the linear map. Therefore, instead of using the exact linear map $F$, it is possible to use an approximate one, such that the distortions are as small as possible and the approximate map is as close as possible to the true map. In the following, we address two general cases: when the linear feature map $F$ is undercomplete or overcomplete.

Proposition 2: The closest¹ orthogonal matrix $B$ to the matrix $F \in \Re^{L\times N}$, $L \le N$, is the solution to the following problem:

$$\hat{B} = \arg\min_B \|F - B\|_F, \quad \text{subject to } B B^{\top} = I, \quad (3.16)$$

where the optimal solution of (3.16) is $B = U I_{L\times N} V^{\top}$ and $U\Sigma V^{\top}$ is a singular value decomposition (SVD) of $F$. Then (3.15) takes the form:

$$y = x + V I_{N\times L} U^{\top}\big(\mathrm{sign}(U I_{L\times N} V^{\top} x) \odot \max\{\tau 1 - |U I_{L\times N} V^{\top} x|, 0\}\big), \quad (3.17)$$

where $y$ is the modulated data sample and $x$ is the original data sample.

Proposition 3: The closest incoherent matrix $P \in \Re^{L\times N}$ to the matrix $F \in \Re^{L\times N}$, $L > N$, is the solution to the following problem:

$$\hat{P} = \arg\min_P \|F - P\|_F, \quad \text{subject to } \mu(P) \le \varepsilon_\mu, \quad (3.18)$$

where:

$$\mu(P) = \max_{\substack{i \neq j,\; i, j \in \{1, \ldots, L\}}} \frac{|p_i p_j^{\top}|}{\|p_i\|_2^2\,\|p_j\|_2^2}, \quad (3.19)$$

and $p_i$ is the $i$-th row of $P$. Given any incoherent matrix $P \in \{P \in \Re^{L\times N}, \mu(P) \le \varepsilon_\mu\}$, the solution of (3.18) is equivalent to a product of a rotation matrix $R \in \Re^{L\times L}$ and the incoherent matrix $P$. This decomposition is not unique. Nevertheless, the rotation matrix $R$ is a solution to the following problem [12]:

$$\hat{R} = \arg\min_R \|RP - F\|_F, \quad \text{subject to } R R^{\top} = I. \quad (3.20)$$

The optimal solution is $R = U V^{\top}$, where $U\Sigma V^{\top}$ is the SVD of $F P^{\top}$. Therefore $B = U V^{\top} P$ and, using (3.15), the final solution is:

$$y = x + P^{\dagger} V U^{\top}\big(\mathrm{sign}(U V^{\top} P x) \odot \max\{\tau 1 - |U V^{\top} P x|, 0\}\big). \quad (3.21)$$

It is important to note that in the above two solutions we approximated the linear feature map and changed its exact properties in order to achieve as small as possible a modulation distortion.

¹Under the Frobenius norm.
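A minimal sketch of the closest orthogonal approximation (3.16) used above is given below; the wide random map is a placeholder.

```python
import numpy as np

def closest_row_orthogonal(F):
    """Closest (Frobenius norm) row-orthogonal matrix to a wide map F (L <= N):
    replace the singular values of F with ones, i.e. B = U I_{LxN} V^T."""
    U, _, Vt = np.linalg.svd(F, full_matrices=False)
    return U @ Vt

# usage: B B^T = I, so the modulation (3.17) reduces to y = x + B^T t_e
L, N = 128, 256
B = closest_row_orthogonal(np.random.randn(L, N))
assert np.allclose(B @ B.T, np.eye(L), atol=1e-8)
```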


Fig. 3.3 A general scheme for joint ACFP modulation and linear feature map learning (original data, modulated data, quantized features and target features).

3.4 Joint ACFP and Linear Feature Map Learning

In the previous section the ACFP with a linear modulation subject to a convex constraint on the properties of the resulting local descriptors was proposed, along with the optimal solution when the feature map is invertible. The main open issues with the proposed optimal solution are related to the assumptions about the linear feature map. Therefore, we also addressed a case from a perspective of direct approximation to the linear feature map. Nonetheless, although the linear feature maps are crucial for achieving a small modulation distortion and high feature robustness, in the previous section, we used only a predefined linear feature map.

3.4.1 Contributions

In this section, we address the learning problem of the linear feature map with linear ACFP modulation in order to attain a low modulation distortion and explicitly regularize the feature properties. The contributions of this section are the following:
(i) We introduce a novel problem formulation for joint linear map learning and linear data modulation;
(ii) We propose an iterative alternating algorithm with optimal solutions in the corresponding iterating steps;
(iii) We provide a convergence result for the iterating sequence of objective function values generated by the iterating steps of the proposed algorithm.

3.4.2 Problem Formulation

Assume that a data set $X = [x_1, \ldots, x_{CK}] \in \Re^{N\times CK}$ and a corresponding target feature matrix $L = [l_1, \ldots, l_{CK}] \in \{-1, 1\}^{L\times CK}$ are given, with a total of $CK$ available data samples. The learning of a linear feature map for ACFP with linear modulation is addressed by considering the following problem formulation:

$$\{\hat{Y}, \hat{F}\} = \arg\min_{Y, F} g(Y, F), \quad \text{subject to } (FY) \odot L \ge_e \tau \mathbf{1}\mathbf{1}^{\top}, \quad (3.22)$$

where $g(Y, F) = \Omega_1(X, Y) + \Omega_2(F)$, $\odot$ denotes the Hadamard product, $\ge_e$ denotes element-wise inequality, $Y$ are the modulated data and the matrix $F$ is the linear map. The terms $\Omega_1(X, Y)$, $\Omega_2(F)$ and the inequality $(FY) \odot L \ge_e \tau \mathbf{1}\mathbf{1}^{\top}$ induce constraints on the modulation error, the properties of the linear map $F$ and the modulated features $FY$, respectively.

The penalty Ω1(X,Y) is defined as:

$$\Omega_1(X, Y) = \frac{\lambda_1}{2}\sum_{i=1}^{CK}\|x_i - y_i\|_2^2 + \frac{\lambda_2}{2}\sum_{i=1}^{CK}\|F(x_i - y_i)\|_2^2, \quad (3.23)$$

where we also regularize, with $\|F(x_i - y_i)\|_2^2$, the bound on the changes of the extracted linear features from the original content to the modulated content, while $\Omega_2(F)$ is defined as:

$$\Omega_2(F) = \frac{\lambda_3}{2}\|F\|_F^2 - \lambda_4 \log|\det(F^{\top} F)|, \quad (3.24)$$

where $\lambda_k$ are Lagrangian multipliers, $\forall k \in \{1, 2, 3, 4\}$. The $\|F\|_F$ penalty helps regularize the scale ambiguity in the solution of (3.22). The $\log|\det(F^{\top} F)|$ and $\|F\|_F^2$ terms are functions of the singular values of $F$ and together help regularize the conditioning of $F$.

Problem (3.22) is non-convex in the variables $F$ and $Y$. If the variable $F$ is fixed, (3.22) is convex; conversely, if $Y$ is fixed, (3.22) is convex. We propose an alternating algorithm with two steps that solves (3.22) by iteratively updating $F$ and $Y$. In step 1 (linear modulation), given the linear map $F$, the modulated data $Y$ are estimated by a globally optimal solution. In step 2 (linear map estimate), given $Y$, the linear map $F$ is estimated by a globally optimal solution. An illustration of the proposed learning scheme is given in Figure 3.3.

3.4.3 Learning Algorithm (AFIL)

Step 1: Linear Modulation Given the linear map $F^{(t-1)}$ at iteration number $(t-1)$, note that (3.22) is separable for all $y_i^t$ at iteration number $t$. Therefore, per individual $y_i^t$, (3.22) reduces to the following problem:

$$\hat{y}_i^t = \arg\min_{y_i^t} \frac{\lambda_1}{2}\|x_i - y_i^t\|_2^2 + \frac{\lambda_2}{2}\|F^{(t-1)}(x_i - y_i^t)\|_2^2, \quad \text{subject to } (F^{(t-1)} y_i^t) \odot l_i \ge_e \tau 1, \quad (3.25)$$

where $y_i^t$ is the modulated data that has to be estimated and the modulation distortion is defined by the error $\|x_i - y_i^t\|_2^2$. We introduce an auxiliary variable $v \in \Re^L$ and accordingly define an element-wise indicator function:

$$\mathbb{I}(v(l)) = \begin{cases} +\infty, & \text{if } v(l) < 0, \\ 0, & \text{otherwise}. \end{cases} \quad (3.26)$$

Then (3.25) is equivalent to:

$$\{\hat{y}_i^t, \hat{v}\} = \arg\min_{y_i^t, v} \frac{\lambda_1}{2}\|x_i - y_i^t\|_2^2 + \frac{\lambda_2}{2}\|F^{(t-1)}(x_i - y_i^t)\|_2^2 + \sum_l \mathbb{I}(v(l)), \quad \text{subject to } (F^{(t-1)} y_i^t) \odot l_i - \tau 1 - v =_e 0. \quad (3.27)$$

The augmented Lagrangian of (3.27) is $\mathcal{L}(y_i^t, v, s) = \frac{\lambda_1}{2}\|x_i - y_i^t\|_2^2 + \frac{\lambda_2}{2}\|F^{(t-1)}(x_i - y_i^t)\|_2^2 + \sum_l \mathbb{I}(v(l)) + s^{\top}\big((F^{(t-1)} y_i^t) \odot l_i - \tau 1 - v\big) + \frac{\rho}{2}\|(F^{(t-1)} y_i^t) \odot l_i - \tau 1 - v\|_2^2$, where $s$ is the dual Lagrangian variable related to the equality constraint in (3.27) and $\rho$ is the parameter of the augmented Lagrangian. For clarity and simplicity, denote $x = x_i$, $y = y_i^t$, $l = l_i$ and $F = F^{(t-1)}$. The Alternating Direction Method of Multipliers (ADMM) [110] is used for (3.27) and the problem is solved by iterating the following three steps:

$$\begin{aligned} y^k &= \arg\min_{y^k} \frac{\rho}{2}\|(F y^k) \odot l - \tau 1 - v^{k-1} + s^{k-1}\|_2^2 + \frac{\lambda_1}{2}\|x - y^k\|_2^2 + \frac{\lambda_2}{2}\|F(x - y^k)\|_2^2, \\ v^k &= \max\big((F y^k) \odot l - \tau 1 + s^{k-1}, 0\big), \\ s^k &= s^{k-1} + (F y^k) \odot l - \tau 1 - v^k. \end{aligned} \quad (3.28)$$

Note that the problem related to $y^k$ has a closed form solution:

$$y^k = B^{\dagger}\big(\rho F^{\top}((\tau 1 + v^{k-1} - s^{k-1}) \odot l) + (\lambda_1 I + \lambda_2 F^{\top} F)x\big), \quad (3.29)$$

where $B = (\rho + \lambda_2) F^{\top} F + \lambda_1 I$. The matrix $B$ and the pseudo-inverse $B^{\dagger} = (B B^{\top})^{-1} B^{\top}$ are computed only once and reused in the solutions for all $y^k$ and for all $y_i^t$, $i \in \{1, \ldots, CK\}$. Since

(3.27) is an equality constrained convex optimization problem over the non-negative orthant $\Re_+^L$, the proposed ADMM algorithm (3.28) gives the optimal solution to (3.27).
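The following is a minimal sketch of the Step 1 ADMM iterations (3.28) with the closed form y-update (3.29) for a single sample, under the reconstruction of $B$ given above; parameter values, iteration count and data are placeholder assumptions.

```python
import numpy as np

def afil_step1_modulation(x, l, F, tau, lam1, lam2, rho, n_iter=50):
    """ADMM iterations (3.28)-(3.29) for one sample x with target signs l."""
    L, N = F.shape
    B = (rho + lam2) * F.T @ F + lam1 * np.eye(N)   # assumed form of B
    B_inv = np.linalg.inv(B)                        # computed once, reused
    v, s = np.zeros(L), np.zeros(L)
    for _ in range(n_iter):
        rhs = rho * F.T @ ((tau + v - s) * l) + (lam1 * np.eye(N) + lam2 * F.T @ F) @ x
        y = B_inv @ rhs                             # (3.29)
        v = np.maximum((F @ y) * l - tau + s, 0.0)  # projection onto v >= 0
        s = s + (F @ y) * l - tau - v               # dual update
    return y

# usage with placeholder data
N, L = 49, 64
F = np.random.randn(L, N); x = np.random.randn(N)
l = np.sign(F @ x)                                  # target feature signs
y = afil_step1_modulation(x, l, F, tau=0.2, lam1=1.0, lam2=1.0, rho=1.0)
```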

Step 2: Linear Map Estimation Let the original data $X$ and the modulated data $Y^t$ at iteration number $t$ be given; then (3.22) reduces to the following problem:

$$\hat{F}^t = \arg\min_{F^t} \frac{\lambda_2}{2}\|F^t(X - Y^t)\|_F^2 + \frac{\lambda_3}{2}\|F^t\|_F^2 - \lambda_4 \log|\det((F^t)^{\top} F^t)|, \quad \text{subject to } (F^t Y^t) \odot L \ge_e \tau \mathbf{1}\mathbf{1}^{\top}. \quad (3.30)$$

Again, by introducing an auxiliary variable $W \in \Re^{L\times CK}$, problem (3.30) equivalently has the following form:

$$\{\hat{F}^t, \hat{W}\} = \arg\min_{F^t, W} \frac{\lambda_2}{2}\|F^t(X - Y^t)\|_F^2 + \frac{\lambda_3}{2}\|F^t\|_F^2 - \lambda_4 \log|\det((F^t)^{\top} F^t)| + \sum_{m,l}\mathbb{I}(W(m,l)), \quad \text{subject to } (F^t Y^t) \odot L - \tau \mathbf{1}\mathbf{1}^{\top} - W =_e \mathbf{0}. \quad (3.31)$$

Problem (3.31) is addressed similarly as in the previous subsection. We first denote $F = F^t$ and $Y = Y^t$; then the augmented Lagrangian of (3.31) is evaluated as $\mathcal{L}(F, W, S) = \frac{\lambda_2}{2}\|F(X - Y)\|_F^2 + \sum_{m,l}\mathbb{I}(W(m,l)) + \frac{\lambda_3}{2}\|F\|_F^2 - \lambda_4 \log|\det(F^{\top} F)| + \mathrm{Tr}\big(S^{\top}((FY) \odot L - \tau \mathbf{1}\mathbf{1}^{\top} - W)\big) + \frac{\rho}{2}\|(FY) \odot L - \tau \mathbf{1}\mathbf{1}^{\top} - W\|_F^2$, where $S$ is the Lagrangian dual variable. We denote:

$$G = \frac{\lambda_3}{2} I + \frac{\lambda_2}{2}(X - Y)(X - Y)^{\top} + \frac{\rho}{2}(Y \odot L)(Y \odot L)^{\top}, \qquad Z^{k-1} = \tau \mathbf{1}\mathbf{1}^{\top} + W^{k-1} - S^{k-1}, \quad (3.32)$$

and in the following we give the ADMM steps for the solution of (3.31):

$$\begin{aligned} F^k &= \arg\min_{F^k} \mathrm{Tr}\{F^k G (F^k)^{\top} - F^k Y (Z^{k-1})^{\top}\} - \lambda_4 \log|\det((F^k)^{\top} F^k)|, \\ W^k &= \max\big((F^k Y) \odot L - \tau \mathbf{1}\mathbf{1}^{\top} + S^{k-1}, 0\big), \\ S^k &= S^{k-1} + (F^k Y) \odot L - \tau \mathbf{1}\mathbf{1}^{\top} - W^k. \end{aligned} \quad (3.33)$$

Note that the problem related to $F^k \in \Re^{L\times N}$ in (3.33), when $L = N$, has a closed form solution:

$$F^k = \frac{1}{2} U\Big(\Sigma_N + \big(\Sigma_N^2 + \lambda_4 I\big)^{1/2}\Big) V^{\top} R^{-1}, \quad (3.34)$$

where $U\Sigma_N V^{\top}$ is the singular value decomposition of $R^{-1} Y (Z^{k-1})^{\top}$ and $R R^{\top}$ is the Cholesky factorization of $G$. The complete proof is given in [123]. Since (3.31) is a convex problem, the

iterative sequence generated by the solutions of the proposed ADMM method (3.33) converges to the optimal solution of (3.31).
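A minimal sketch of the closed form F-update (3.34) for the square case $L = N$ follows; it mirrors the reconstruction above, and the exact constants depend on how $G$ and $Z$ are scaled, so it should be read as illustrative only, with placeholder inputs.

```python
import numpy as np

def afil_step2_map_update(Y, Z, G, lam4):
    """Closed form F-update sketch for L = N, following (3.34)."""
    R = np.linalg.cholesky(G)                     # G = R R^T
    R_inv = np.linalg.inv(R)
    U, sig, Vt = np.linalg.svd(R_inv @ Y @ Z.T)   # SVD of R^{-1} Y Z^T
    Sigma = np.diag(sig + np.sqrt(sig**2 + lam4))
    return 0.5 * U @ Sigma @ Vt @ R_inv

# usage with placeholder matrices (N = L)
N, CK = 32, 200
Y = np.random.randn(N, CK); Z = np.random.randn(N, CK)
G = np.eye(N) + 0.5 * Y @ Y.T                     # an assumed SPD matrix
F_new = afil_step2_map_update(Y, Z, G, lam4=1.0)
```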

3.4.4 Local Convergence Analysis

This section presents the result on the local convergence of the proposed alternating algorithm.

Lemma 1: Given initial $\{X, L\}$, the sequence $g(F^t, Y^t)$ generated by the proposed algorithm is monotone decreasing, i.e., $g(F^t, Y^t) \le g(F^{(t-1)}, Y^t) \le g(F^{(t-1)}, Y^{(t-1)})$, and the function $g(F, Y)$ is lower bounded. Therefore, the alternating algorithm converges to a finite value denoted as $g^*$.

Proof: Given $F^{(t-1)}$, $g(F^{(t-1)}, Y^t)$ is convex and its global optimal solution is given by the iterative solution (3.28). Therefore, $g(F^{(t-1)}, Y^t) \le g(F^{(t-1)}, Y^{(t-1)})$. Given $Y^t$, again $g(F^t, Y^t)$ is convex and its global optimal solution is given by the iterative solution (3.33). Combining both results, we have that $g(F^t, Y^t) \le g(F^{(t-1)}, Y^t) \le g(F^{(t-1)}, Y^{(t-1)})$, implying that the sequence $\{g(F^t, Y^t)\}$ is monotone decreasing. The result that the function $g(F^t, Y^t)$ is lower bounded is given in [123]. Since any lower bounded, monotone decreasing sequence is convergent, the proposed alternating algorithm converges to a locally optimal value denoted as $g^*$.

3.5 ACFP Using Latent Data Representation, Extractor and Reconstructor

In the previous sections, we performed modulation on the original data representation. In this section, we propose a slightly different concept in order to estimate a constrained latent representation that describes the content. We are interested not in any latent representation of the content, but in a redundant one that allows us to be robust to noise. In addition, by our general principle, we consider an estimate of a latent representation such that when it is perturbed by the noise: (i) the extractor function should provide the original features; (ii) the reconstructor function should recover the original data.

The general scheme of this method is shown in Figure 3.4 and we refer to it as ACFP-LR in order to reflect the use of a latent representation.

Compared to the ACFP concept, our extension focuses only on one specific case with two constraints. The first one is related to item (i) mentioned in the paragraph above under noise perturbations of the latent representation. The second is similar to item (ii), but now the reconstructor function is applied only to the latent description and should provide the original data. The latter is also known as modulation. Nonetheless, the motivation is to increase the redundancy by adding one more element that compensates in the trade-off between modulation distortion and feature robustness. In other words, we try to add redundancy in order to achieve both constraints: small modulation distortion and high feature robustness.

Fig. 3.4 A general scheme for ACFP-LR using a latent representation, extractor and reconstructor functions (original data x, latent components h_{m,s}, reconstruction y = Σ_s Z_s z_s(B_s h_{m,s}) under distortion constraints, and fingerprint Q(Σ_s F_s f_s(P_s h_{m,s})) under robustness constraints).

3.5.1 Contributions

In this section, we give the following contributions:
(i) We introduce an extension of the core idea behind ACFP by focusing on latent representation estimation with constraints imposed by the extractor and reconstructor function pair;
(ii) We propose a generalized problem formulation with explicit regularization of the trade-off between the distortion and the robustness of the local features by considering constraints on the distribution of (a) the data modifications, (b) the feature modifications and (c) the actual latent representation of the content;
(iii) We show a reduction of the generalized problem formulation to a constrained projection problem under linear feature mapping and linear modulation, which leads to an efficient solution.

3.5.2 ACFP-LR vs ACFP

Note that the core idea behind the ACFP modulation [43], [78] was based on the observation that the magnitude of the feature coefficients before the quantization influences the probability of bit error in the descriptor bits. Descriptor bit flipping is more likely for low magnitude coefficients. Therefore, it is natural to modify the original content by an appropriate modulation and to increase these magnitudes subject to distortion constraints. The main idea behind the proposed ACFP-LR is to produce a data representation resilient to noise, such that after applying a feature generator function the resulting features are robust. At the same time, the reconstructor (modulation) function applied on the latent representation should give the original data. To fulfill these requirements, ACFP-LR consists of two operational modes: (i) modulation and (ii) verification. The modulation estimates the latent representations that describe the content. During verification, the features from the noise perturbed latent representation are extracted and the fingerprint is computed.

3.5.3 PCFP, ACFP and ACFP-LR: Feature Generation

A shared component in PCFP, ACFP and ACFP-LR is the feature extraction. Assume that from the original image a patch is obtained, which is denoted as $x \in \Re^N$. We consider a generalized feature compositional case (with or without nonlinearity), where the extraction of the local features is defined as follows:

$$f_o = \sum_{s=1}^S F_s f_s(P_s x), \quad (3.35)$$

where $P_s \in \Re^{M\times N}$ and $F = [F_1, \ldots, F_S]$, $F_s \in \Re^{L\times M}$, are linear maps and $f_s: \Re^{M\times 1} \to \Re^M$ are functions describing an element-wise nonlinearity; $L$, $M$ and $N$ are the lengths of the final, intermediate and input data representation, respectively, and $s \in \{1, \ldots, S\}$. The feature extraction is followed by quantization $Q(\cdot)$ that results in a quantized local descriptor denoted

as $b_o = Q(f_o) \in \{0, 1\}^L$, where

$$Q(f_o(i)) = \begin{cases} 1, & \text{if } f_o(i) > 0, \\ 0, & \text{otherwise}, \end{cases} \quad \forall i \in \{1, \ldots, L\}. \quad (3.36)$$
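A minimal sketch of the compositional feature generation (3.35)-(3.36) follows; the branch maps are random placeholders and the tanh nonlinearity is an assumed stand-in for the element-wise functions $f_s$.

```python
import numpy as np

def compositional_features(x, P_list, F_list, nonlinearity=np.tanh):
    """Sum of S branches, each with an intermediate map P_s, an element-wise
    nonlinearity f_s and a final map F_s, followed by binary quantization."""
    f_o = sum(F_s @ nonlinearity(P_s @ x) for P_s, F_s in zip(P_list, F_list))
    b_o = (f_o > 0).astype(np.uint8)        # quantizer Q(.) from (3.36)
    return f_o, b_o

# usage with S = 3 random branches (placeholder dimensions)
N, M, L, S = 49, 64, 256, 3
P_list = [np.random.randn(M, N) for _ in range(S)]
F_list = [np.random.randn(L, M) for _ in range(S)]
f_o, b_o = compositional_features(np.random.randn(N), P_list, F_list)
```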

3.5.4 ACFP-LR: Reconstruction from Latent Variables

In ACFP, the concept is based around the modulated data $y \in \Re^N$ that should be close to the original data representation $x$. ACFP-LR takes into account the reconstruction from $S$ latent representations $h_s \in \Re^L$, $s \in \{1, \ldots, S\}$, to $x$. Assuming that the $S$ latent variables $h_s$ are given, a general reconstruction function is defined as follows:

$$\hat{x} = \sum_{s=1}^S Z_s z_s(B_s h_s), \quad (3.37)$$

where $B_s \in \Re^{M\times L}$ and $Z = [Z_1, \ldots, Z_S]$, $Z_s \in \Re^{N\times M}$, are linear maps and $z_s: \Re^{M\times 1} \to \Re^M$ are functions describing an element-wise nonlinearity.

3.5.5 Problem Formulation

The general problem formulation for the estimation of the underlying data representation that describes the content is similar to the one in the previous sections, except that now the modulated data is replaced with a redundant representation $h$. This leads to the following optimization problem:

$$\hat{h} = \arg\min_h \varphi(v(h), x) + \lambda_1 \psi(g(h), \tau) + \lambda_2 r(h), \quad (3.38)$$

where $x$ is the original data, $h = [h_1, \ldots, h_S]$ is the latent data representation, $v(h)$ is the reconstructor function, $g(h)$ is the generator function and the function $r(h)$ imposes constraints on the properties of $h$. The modulation level and the Lagrangian variables are denoted as $\tau$,

$\lambda_1$ and $\lambda_2$, respectively. The first mapping function $v(h)$ is the reconstructor function that is applied to the latent representation $h$ in order to match $v(h)$ to the original data representation $x$, where $\varphi(v(h), x)$ is a function that penalizes the distortions in the original data domain. The second function $g(h)$ transforms $h$ into features and tries to make $g(h)$ robust, where $\psi(g(h), \tau)$ is a function that penalizes non-robust feature components. In ACFP-LR, the focus is not the actual data, but rather the latent representation $h$.

3.5.6 Reduced Problem Under Linear Feature Extraction and Reconstruction

ACFP-LR considers a setup of linear modulation under the assumptions given below. The function $\varphi(v(h), x)$ is defined as $\varphi(v(h), x) = \|x - v(h)\|_2^2$. The function $\psi(g(h), \tau)$ is replaced by an explicit inequality constraint $|Fh| \ge_e \tau 1$, where $F \in \Re^{M\times SN}$ and $\ge_e$ represents an element-wise inequality. It is assumed that there is no element-wise nonlinearity, i.e., $z_s(h_s) = h_s$, and no specific function $r(h)$ is defined other than the explicit inequality constraint $|Fh| \ge_e \tau 1$.

Linear Generator (Feature Extraction) The feature extraction function is defined as the linear version of (3.35). Assuming the modulated data $h_s$ are given, the features are extracted as follows:

$$f = g(h) = \sum_{s=1}^S F_s P_s h_s, \quad (3.39)$$

where the linear maps $P_s$ are generated at random.

The linear maps $F_s$ are defined using a constraint matrix $C \in \{-1, 0, +1\}^{L\times N}$, data samples (or their clusters) and random sampling. We describe the construction of the linear maps $F_s$ in the following. Let $C$ be the matrix that encodes the $m$-wise (pair-wise, triple-wise, etc.) constraints that describe the geometrical configuration of the considered data (pixel) interactions. Given a data set of patches, assume that $M$ centroids $[w_1, w_2, \ldots, w_M]$, $w_s \in \Re^N$, are estimated using a k-means algorithm. Denote a transformation matrix as $T_s \in \Re^{N\times N}$, where $T_s = R_s [w_s w_s^{\top}]^{-1} \in \Re^{N\times N}$. A key point here is how we construct $R_s \in \Re^{N\times N}$. We proceed as follows. First, we quantize the matrix $w_s w_s^{\top}$ in $J$ levels, and for every quantization level $q \in \{1, 2, 3, \ldots, J\}$ we build a set $L_q$ of indexes to the elements in $w_s w_s^{\top}$. Then, for every index set $L_q$, the corresponding elements of $R_s$ are generated from a uniform distribution with support $[0, 1]$. The main idea is to try to have an equal contribution of the elements of $w_s w_s^{\top}$ in the linear feature map $C T_s$. Having $C T_s$, where $T_s = R_s [w_s w_s^{\top}]^{-1}$, the linear map $F_s$ is estimated as follows:

$$F_s = U \begin{bmatrix} \Sigma_p & 0 \\ 0 & 0 \end{bmatrix} V^{\top}, \quad (3.40)$$

where $U\Sigma V^{\top}$ is the SVD of $(C T_s)^{\top}$ and $\Sigma_p$ is a diagonal matrix having $p$ non-zero diagonal elements equal to the largest $p$ singular values from $\Sigma$.

Linear Reconstruction (Modulator) Let Bs = I and Zs = I, then the simplest modulation on multiple latent variables is defined as:

$$y = v(h) = \sum_s z_s(h_s). \quad (3.41)$$

Note that by using (3.41) and ϕ (v(h),x) the modulation and the reconstruction are seen as equivalent. In the following, we present the problem formulation under the previous assumptions.

Proposition 3: The ACFP-LR under linear modulation, linear feature map and convex constraints on the properties of the latent representations is a constrained projection problem:

$$\hat{h} = \arg\min_h \tfrac{1}{2}\|x - [I_{N\times N}, \ldots, I_{N\times N}]h\|_2^2, \quad \text{subject to } |Fh| \ge_e \tau 1, \quad (3.42)$$

where $I_{N\times N}$ is an $N\times N$ identity matrix, $F = [F_1 P_1, \ldots, F_{S-1} P_{S-1}, 0_{L\times N}]$, $0_{L\times N}$ is a zero matrix with dimensions $L\times N$, $h = [h_1^{\top}, \ldots, h_S^{\top}]^{\top}$, $h_s \in \Re^N$ are the latent representations, while

$$y = [I_{N\times N}, \ldots, I_{N\times N}]h = \sum_{s=1}^S h_s, \quad (3.43)$$

is the reconstructed (modulated) data, $F_s P_s$ are the linear feature maps and $s \in \{1, \ldots, S-1\}$. The goal is to allow an arbitrarily large distortion between $x$ and $h_i$, but a very small distortion between $x$ and $y = \sum_{s=1}^S h_s$, and robust features $f = [F_1 P_1, \ldots, F_{S-1} P_{S-1}][h_1^{\top}, \ldots, h_{S-1}^{\top}]^{\top}$ that satisfy the constraints in (3.42). The key is to estimate all $h_s$ and use every modulated component $y_s$ independently, rather than the original data $x$. In this way, the additional element that is added to the concept of ACFP is the redundancy. Note that, in the case of (3.42), the trade-off is between modulation distortion, feature robustness and the amount of redundancy.

Verification Only the first $S-1$ latent representations $h_s$ from $h$ are involved in the constraint in (3.42). The inequality $|[F_1 P_1, \ldots, F_{S-1} P_{S-1}, 0_{L\times N}]h| \ge_e \tau 1$ enforces these first $S-1$ latent representations $h_s$ to be sparse. There is no constraint on the variable $h_S$ that appears in the cost function $\|x - \sum_s h_s\|_2^2$ of (3.42). Therefore, $h_S$ is not sparse and it will always have a larger $\ell_2$-norm than the $\ell_2$-norm of the rest $h_s$, $s \in \{1, \ldots, S-1\}$.

This is important at the verification stage, since even if we have all $h_s$ but we do not know which component is $h_S$, given the prior knowledge that the $\ell_2$-norm of $h_S$ is greater than any of the $\ell_2$-norms of the rest of the $h_s$, the component can simply be estimated by computing the following:

$$\hat{s} = \arg\max_{1 \le s \le S} \|h_s\|_2. \quad (3.44)$$

At the verification stage, it is assumed that noise is independently added to every latent component $h_s$. Afterwards, the noise-corrupted latent component is $v_s = h_s + n_{\mathcal{N},s}$, while the fingerprint is estimated as:

$$b_y = Q\Big(\sum_{s=1}^{S-1} F_s P_s v_s\Big). \quad (3.45)$$
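A minimal sketch of the ACFP-LR verification stage (3.44)-(3.45) follows: the non-sparse component is identified by its largest norm and, for this illustration, simply excluded when the fingerprint is computed from the remaining components; the noise level and all data are placeholder assumptions.

```python
import numpy as np

def acfp_lr_verify(h_list, FP_list, noise_std=0.1):
    """Detect the non-sparse component via (3.44), add independent noise to
    every latent component, and compute the fingerprint as in (3.45)."""
    s_hat = int(np.argmax([np.linalg.norm(h) for h in h_list]))      # (3.44)
    v_list = [h + noise_std * np.random.randn(*h.shape) for h in h_list]
    others = [v for i, v in enumerate(v_list) if i != s_hat]
    f = sum(FP @ v for FP, v in zip(FP_list, others))
    return (f > 0).astype(np.uint8), s_hat                           # (3.45)

# usage with S = 4 placeholder latent components and maps F_s P_s
S, N, L = 4, 64, 128
h_list = [np.random.randn(N) for _ in range(S)]
FP_list = [np.random.randn(L, N) for _ in range(S - 1)]
b_y, s_hat = acfp_lr_verify(h_list, FP_list)
```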

3.6 Computer Simulations

This section validates the proposed approaches by numerical experiments and demonstrates the advantages of ACFP, AFIL and ACFP-LR. The performance is evaluated under several signal processing distortions, including AWGN, lossy JPEG compression and the projective geometrical transform. The UCID [135] image database was used to extract local image patches. The ORB detector [128] was run on all images, and $\sqrt{N}\times\sqrt{N}$ pixel patches, $\sqrt{N} = 31$, were extracted around each detected feature point. The features were sorted by scale-space and 30 patches were used from each individual image.

3.6.1 Numerical Experiments Setup

In the following text, we describe the distortions that are used in our numerical simulation.

AWGN The results for a single patch were obtained as the average over 100 AWGN realizations. Four different noise levels were used, defined as $\mathrm{PSNR} = 10\log_{10}\frac{255^2}{\sigma^2}$: 0 dB, 5 dB, 10 dB and 20 dB. The used modulation level (mL) is 60 for PCFP, ACFP and ACFPL, and mL = 100 for ACFP-LR.

Lossy JPEG Compression Three JPEG quality factor (QF) levels 0, 5 and 10 were used. The used modulation level (mL) is 30 for PCFP, ACFP and ACFPL, and mL = 100 for ACFP-LR.

Affine Transform with Lossy JPEG Compression An affine transformation $P \in \Re^{3\times 3}$ was used, where:

$$P = \begin{bmatrix} 1.0763 & 0.0325 & 0 \\ 0.0119 & 1.09 & 0 \\ -24.32 & -70.37 & 1 \end{bmatrix},$$

followed by a lossy JPEG compression with QF = 5. The used modulation level is 60 for PCFP, ACFP and ACFPL, and mL = 100 for ACFP-LR.

3.6.2 Measures

In the following text, we define the three measured quantities that are used in our evaluation.

Modulation Level Define $t = Fx$ for ACFP and AFIL, or $t = F_T x = \sum_{s=1}^{S-1} F_s x$ with $F_T = \sum_{s=1}^{S-1} F_s$ for ACFP-LR. Then let $s$, $|s(i)| \le |s(j)|$, $\forall i \le j$, $i, j \in \{1, 2, 3, \ldots, L\}$, be a sorted $t$ vector and let $F_{T1}$ be the row-reordered $F$ (or $F_T$) such that $F_{T1}x = s$. The modulation level mL is defined as a percentage, $mL = \frac{K}{L}100$, $1 \le K \le L$, and it represents the fraction of coefficients of $s$ that are modified. At a single modulation level, the modulation threshold $\tau$ for the ACFP-LR method is defined as $\tau = 100\max_{1\le i\le K}|s(i)|$, and for ACFP and ACFPL as $\tau = \max_{1\le i\le K}|s(i)|$.

Modulation Distortion The modulation distortion for ACFP and ACFPL is defined as $\mathrm{DWR} = 10\log_{10}\frac{255^2}{\Delta^2}$, where $\Delta = \frac{1}{N}\|x - y\|_2$, and the modulation distortion for ACFP-LR is defined as $\mathrm{DWR} = 10\log_{10}\frac{255^2}{\Delta^2}$, where $\Delta = \frac{1}{N}\|x - \sum_s y_s\|_2$.

Probability of Bit Error The probability of bit error is defined through the probability of a correct bit: $p_e = 1 - p_c$, $p_c = \frac{1}{L}\sum_{i=1}^L \mathbb{I}\{b_x(i), b_y(i)\}$ with $L = 256$ bits, where $b_x = Q(Fx)$ for ACFP and AFIL, and $b_x = Q((\sum_s F_s)x)$ for ACFP-LR; $b_y = Q(F(y + n_{\mathcal{N}}))$, where $n_{\mathcal{N}}$ is the introduced distortion for ACFP and AFIL, and $b_y = Q(\sum_s F_s(h_s + n_{\mathcal{N},s}))$, where $n_{\mathcal{N},s}$ is the introduced distortion for ACFP-LR, and $\mathbb{I}$ is an indicator function defined as:

$$\mathbb{I}\{a, b\} = \begin{cases} 1, & \text{if } a = b, \\ 0, & \text{otherwise}. \end{cases} \quad (3.46)$$
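A minimal sketch of the two evaluation measures defined above is given below; it simply transcribes the DWR and bit-error definitions for 8-bit pixel data.

```python
import numpy as np

def dwr(x, y):
    """Modulation distortion measure with Delta = (1/N)||x - y||_2."""
    delta = np.linalg.norm(x - y) / x.size
    return 10 * np.log10(255.0**2 / delta**2)

def bit_error_probability(b_x, b_y):
    """Probability of bit error p_e = 1 - p_c over the descriptor bits."""
    return 1.0 - np.mean(b_x == b_y)
```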

3.6.3 ACFP

In this section, we evaluate the proposed ACFP solutions over different linear feature maps, which we summarize into two different cases.

ACFP Basic Case In this section, we evaluate PCFP and ACFP. Two matrices $F_0 \in \Re^{L\times N}$ and $F_1 \in \Re^{L\times N}$ are used in the PCFP scenario. One matrix $F_1$ is used in the ACFP scenario. A square matrix $T_i$, $i \in \{0, 1\}$, is used ($M = N$). The matrix $F_0 = C T_0$, where $T_0 \in \Re^{N\times N}$ represents a low pass filter with an $11\times 11$ window. The matrix $F_1 = U I_{L\times M} V^{\top}$, where $U, V$ are obtained by singular value decomposition (SVD) of $(C T_1)^{\top}$. The matrix $T_1$

PCFP, p_e
                     F0      F1
AWGN    0 dB        .26     .15
        5 dB        .17     .12
        10 dB       .11     .09
        20 dB       .04     .03
QF      0           .05     .03
        5           .03     .02
        10          .02     .01
Projective, QF=5    .08     .05

Table 3.1 The p_e using PCFP under varying AWGN noise, varying JPEG compression levels and projective transformation with a QF level of 5.

$= R[x x^{\top}]^{-1} \in \Re^{N\times N}$, where $R \in \Re^{N\times N}$ is a random matrix generated from a uniform distribution with support $[0, 1]$. The matrix $F_1$ is the closest orthogonal matrix to $(C T_1)^{\top}$, satisfying $F_1^{\top} F_1 = I$ and thus easily invertible. The PCFP results are shown in Table 3.1 and the ACFP results are shown in Table

3.2. Overall, the ACFP scenario brings improvement. The results produced by F1 have consistently lower pe than the results produced by F0. The ACFP under AWGN has the greatest reduction in pe of 0.15, achieved at an AWGN level of 0dB and a modulation level mL = 60. Otherwise, in the same ACFP scenario, when comparing the results produced by F1 to the ones produced by F1 with zero modulation, the greatest reduction is .08, achieved at an AWGN level of 10dB and modulation level mL = 60. Considering the ACFP under lossy JPEG compression, the modulation improves the results even at a small modulation level like mL = 30, and the greatest reduction in pe is .05, achieved at QF=0. The results produced by F1 have lower pe compared to the results produced by F0; the reduction is .05 relative to the result produced by the map F0. Higher ACFP modulation results in lower pe, as shown in Table 3.2. At modulation level 60, the pe is lower than the one at modulation level 10.

ACFP with Approximative Linear Feature Map Two cases are simulated: an informed and a non-informed one. Consider that a local image block is transmitted through a noisy channel.

Informed Case In the informed case, we assume that the information about the original image is present at the receiver end. One matrix $F_I$ is used in the ACFP scenario, $F_I = U I_{L\times M} V^{\top}$, where $U, V$ are obtained by singular value decomposition (SVD) of $(C T_I)^{\top}$.

ACFP, p_e

AWGN               mL = 10 (DWR 51)   mL = 60 (DWR 28)
        0 dB             .15                .11
        5 dB             .11                .05
        10 dB            .08                .01
        20 dB            .02                 0

QF                 mL = 10 (DWR 51)   mL = 30 (DWR 42)
        0                .02                 0
        5                .01                 0
        10                0                  0

Projective, QF=5   mL = 10 (DWR 51)   mL = 60 (DWR 28)
                         .05                .03

Table 3.2 DWR and p_e using varying ACFP modulation under varying AWGN noise, varying JPEG compression levels and affine transformation with a QF level of 5.

The matrix $T_I = R[x x^{\top}]^{-1} \in \Re^{N\times N}$, where $R \in \Re^{N\times N}$ is a random matrix that is generated as follows. First, the matrix $x x^{\top}$ is quantized in $J$ levels, where for every quantization level $q \in \{1, 2, 3, \ldots, J\}$ there exists a set $L_q$ of indexes to the elements in $x x^{\top}$. All of the sets $L_q$ have the same cardinality. Then, for every set of indexes $L_q$, the corresponding elements of $R$ are generated from a uniform distribution with support $[0, 1]$. The main idea is to allow the elements of $x x^{\top}$ to contribute equi-likely to the linear feature map $C T_I$. The matrix $F_I$ is the closest orthogonal matrix to $(C T_I)^{\top}$, satisfying $F_I^{\top} F_I = I$.

Non-Informed Case In the non-informed case, information about the original image is not present. Three different matrices are used. The matrix $F = CT$, where $T \in \Re^{N\times N}$ represents a low pass filter with an $11\times 11$ window, for which $F F^{\top} \neq I$. The matrix $F^{\dagger}$ is the closest orthogonal matrix to $F$, i.e., $F^{\dagger} \simeq F$ with $F^{\dagger}(F^{\dagger})^{\top} = I$, while $F^r$ is a random matrix, where $\forall i, j$, $F^r(i, j) \sim \mathcal{N}(0, 1)$, with $F^r(F^r)^{\top} = I$. These three linear feature maps are used in the PCFP and in the ACFP scenario.

The results are shown in Tables 3.3, 3.4 and 3.5. In summary, the results confirm that the modulation distortion and the probability of bit error depend on the ability of the linear map to produce robust features and on the properties of the linear map related to the optimal linear modulation. The results produced by the proposed linear modulation demonstrate that a small $p_e$ is achievable under different and severe signal processing distortions, however at the cost of introducing modulation distortions.

PCFP results over the non-informed and informed case

p_e                 F        F†       F^r      F_I
AWGN    0 dB      .224     .422     .3485     .15
        5 dB      .150     .373     .2661     .12
        10 dB     .095     .310     .1795     .09
        20 dB     .034     .160     .0640     .03
QF      0         .082     .244     .072      .03
        5         .051     .190     .056      .02
        10        .028     .144     .044      .01
Proj., QF=5       .058     .233     .0769     .05

Table 3.3 DWR and p_e under PCFP using varying AWGN noise, JPEG quality factor and affine transformation with QF=5 for the feature maps F, F†, F^r and F_I.

3.6.4 AFIL

We consider three scenarios for this computer simulation: PCFP, ACFP [77] and AFIL. In order to make a fair comparison between PCFP, ACFP and AFIL, we use one predefined matrix F for the PCFP and ACFP scenarios. For AFIL, we use the same matrix F to initialize the proposed algorithm and define the target labels as L = sign(FX). Half of the total 1000 image patches are used for AFIL learning with target feature L. The results are shown in Table 3.6. The results demonstrate that the proposed algorithm achieves a small pe under different and severe signal processing distortions. More importantly, the introduced modulation distortion is smaller when using a data adapted linear feature map compared to the modulation distortion with a linear feature map without data adaptation.

3.6.5 ACFP-LR

This section validates the proposed approach by numerical experiments and demonstrates the advantages of ACFP-LR. The performance is evaluated under several signal processing distortions, including AWGN, lossy JPEG compression and the affine geometrical transform. The results are compared with those of the PCFP, ACFP and AFIL schemes. Given the extracted image patches, PCFP, ACFP, AFIL and ACFP-LR used the following linear maps for feature extraction. In PCFP and ACFP we use $F_1$. It is defined as $F_1 = U I_{L\times M} V^{\top}$, where $U, V$ are obtained by singular value decomposition (SVD) of $(C T_1)^{\top}$. The matrix $T_1 = R[x x^{\top}]^{-1} \in \Re^{N\times N}$, where $R \in \Re^{N\times N}$ is a random matrix generated from a uniform distribution with support $[0, 1]$. The matrix $F_1$ is the closest orthogonal matrix to

ACFP results over the informed case (feature map F_I)

p_e                 mL = 10 (DWR 8)   mL = 60 (DWR −14)
AWGN    0 dB             .15               .11
        5 dB             .11               .05
        10 dB            .08               .01
        20 dB            .02                0
QF      0                .02                0
        5                .01                0
        10                0                 0
Projective, QF=5         .05               .03

ACFP results over the non-informed case (feature map F)

p_e                 mL = 10 (DWR 20.0)   mL = 60 (DWR −6.9)
AWGN    0 dB             .220                 .121
        5 dB             .145                 .045
        10 dB            .086                 .010
        20 dB            .019                   0
QF      0                .082                 .253
        5                .049                 .217
        10               .022                 .204
Proj., QF=5              .053                 .263

Table 3.4 DWR and p_e using varying ACFP modulation under varying AWGN noise, JPEG quality factor and projective transformation with QF=5 for the feature maps F and F_I.

ACFP results over the non-informed case using row orthogonal linear feature maps and the original feature map

p_e               F†, mL=10   F†, mL=60   F^r, mL=10   F^r, mL=60
(DWR)              (52.4)      (27.9)      (43.9)       (19.6)
AWGN    0 dB        .421        .400        .348         .297
        5 dB        .374        .337        .264         .184
        10 dB       .310        .248        .176         .072
        20 dB       .157        .047        .053         .001
QF      0           .243        .218        .070         .050
        5           .190        .154        .054         .027
        10          .143        .088        .041         .009
Proj., QF=5         .232        .210        .075         .049

ACFP results over the non-informed case

p_e               F, mL=10   F, mL=60   F†, mL=10   F†, mL=60   F^r, mL=10   F^r, mL=60
(DWR)              (33.8)      (4.7)      (52.4)      (27.9)      (43.9)       (19.6)
AWGN    0 dB        .217       .064        .421        .401        .348         .297
        5 dB        .142       .022        .374        .338        .264         .184
        10 dB       .084       .005        .310        .249        .177         .073
        20 dB       .018        0          .157        .047        .053         .001
QF      0           .074       .025        .242        .219        .070         .050
        5           .040       .015        .190        .153        .054         .026
        10          .015       .012        .143        .088        .041         .009
Proj., QF=5         .049       .048        .232        .209        .075         .046

Table 3.5 DWR and p_e using varying ACFP modulation under varying AWGN noise, JPEG quality factor and affine transformation with QF=5 for the feature maps F, F† and F^r.

p_e               PCFP   ACFP, mL=10   ACFP, mL=60   AFIL, mL=10   AFIL, mL=60
(DWR)                      (33.8)         (4.7)         (36.1)        (6.9)
AWGN    0 dB      .224      .217           .064           .216          .063
        5 dB      .150      .142           .022           .14           .022
        10 dB     .095      .084           .005           .085          .004
        20 dB     .034      .018            0             .019           0
QF      0         .082      .074           .025           .075          .024
        5         .082      .040           .015           .041          .015
        10        .028      .015           .012           .015          .011
Proj., QF=5       .058      .049           .048           .049          .047

Table 3.6 DWR and p_e using varying ACFP modulation under varying AWGN noise, JPEG quality factor and affine transformation with QF=5.

$(CT)^{\top}$, satisfying $F_1^{\top} F_1 = I$, and thus easily invertible. For AFIL, we use the matrix $F_L$ that is learned from half of the total available patches. The remaining half is used for evaluation. In

ACFP-LR we use the matrix $F = [F_1 P_1, \ldots, F_S P_S]$, where $F_s$ and $P_s$ are defined as in Section 3.5.6, $s \in \{1, \ldots, S-1\}$, and the number of redundant representations is set to $S = 12$. The solution of (3.42) is found using the publicly available CVX [51] solver. The results are shown in Tables 3.7, 3.8, 3.9 and 3.10. In summary, these evaluations show that, using the proposed approach, it is possible to achieve a small modulation error and very high robustness of the features, even under severe noise levels. However, this performance comes at the cost of adding redundancy in the underlying representation and relying on the ability to differentiate $h_S$ from the rest $h_s$, $s \in \{1, \ldots, S-1\}$, under noise perturbations. More importantly, compared to the PCFP, ACFP and ACFPL schemes, ACFP-LR demonstrated superior performance under AWGN, lossy JPEG compression and affine geometrical transform distortions.

3.7 Summary

In this chapter, we introduced a generalized ACFP problem formulation and presented a reduction of the ACFP problem to a constrained projection problem under linear modulation and a linear feature map. Considering the reduced problem, under invertible linear feature maps we presented the closed form solution, in addition to addressing approximations to the linear feature map in order to attain a low modulation distortion.

p_e               pCF     ACFP    ACFPL   ACFP-LR
AWGN    0 dB     .149    .064    .063    .009
        5 dB     .121    .022    .022    .008
        10 dB    .092    .005    .004    .007
        20 dB    .028     0       0       0

                  pCF     ACFP    ACFPL   ACFP-LR
mL                 0       60      60      100
DWR                ∞       4.7     6.9     291

Table 3.7 DWR and p_e under different Additive White Gaussian Noise (AWGN) levels.

p_e              pCF     ACFP    ACFPL   ACFP-LR
QF      0       .082    .025    .024    .012
        5       .082    .015    .012    .010
        10      .028    .012    .011    .007

                 pCF     ACFP    ACFPL   ACFP-LR
mL                0       30      30      100
DWR               ∞       4.7     6.9     291

Table 3.8 DWR and p_e under different JPEG Quality Factors (QF).

Furthermore, we proposed active content fingerprint learning, named AFIL, with the objective of estimating a data adaptive linear map that provides a small ACFP modulation distortion and features with the targeted properties. We presented a novel problem formulation that jointly addresses the fingerprint learning and the content modulation. We proposed a solution with an iterative alternating algorithm with globally optimal solutions for the respective iterative steps and a convergence guarantee to a locally optimal solution.

Finally, we introduced the concept of ACFP-LR and proposed a novel general problem formulation described by a latent representation and a pair of extractor and reconstructor functions. A linear modulation was addressed on the latent data representation with constraints on the modulation distortion, while using a linear feature map. We provided a computer simulation using local image patches extracted from a publicly available data set. We highlight that ACFP-LR demonstrated superior performance under AWGN, lossy JPEG compression and affine geometrical transform distortions compared to the PCFP, ACFP and ACFPL schemes.

p_e                    pCF     ACFP    ACFPL   ACFP-LR
Projec. QF=5          .058    .048    .047    .04

                       pCF     ACFP    ACFPL   ACFP-LR
mL                      0       60      60      100
DWR                     ∞       4.7     6.9     291

Table 3.9 DWR and p_e under JPEG quality factor QF=5 and affine transformation.

p_e for ACFP-LR (mL = 100, DWR = 291)

AWGN    −10 dB   .015        AWGN    −40 dB   .071
        −20 dB   .019                −50 dB   .301
        −30 dB   .020                −60 dB   .321

Proj., QF     3   .042       Proj., AWGN   −10 dB   .061
              2   .054                     −20 dB   .074
              1   .061                     −30 dB   .123

Table 3.10 DWR and p_e for ACFP-LR under affine transform and extremely low QF and high AWGN levels.

Chapter 4

Learning NT for Image Denoising

The image denoising problem is traditionally considered as a benchmark for the comparison of the basic properties of different sparse models. The sparsifying transform model can be considered as a special case of our NT model. Therefore, in this chapter, we consider the learning of this special case of our NT model for image denoising. In the past, three main models for sparse signal representations were proposed: the synthesis model [90], the analysis model (noisy signal analysis model [117]) and the sparsifying transform model [123]. Learning any one of the models for sparse signal representation is challenging, especially when the model matrix is overcomplete¹. Several algorithms [56], [102] and [120] were proposed for learning analysis and sparsifying models with a well conditioned, non-structured and overcomplete matrix. To find a solution, these algorithms typically alternate between an update on the transform matrix and an estimate of the sparse representations. Usually, the transform update step is based on the gradient of the objective, where a solution is obtained by iteratively taking one or several gradient steps. Depending on the algorithm used for the update, this might add computational complexity. On the other hand, the existence of a closed form solution (unique or not) w.r.t. the optimization objective or its approximation in the transform update step, and the algorithm convergence for that case, is not fully explored.

4.1 Contributions

In this section, we address the sparsifying transform model with an overcomplete matrix and present the following main contributions:

¹A matrix $A \in \Re^{M\times N}$ is said to be overcomplete if $M > N$. Equivalently, if the number $M$ of columns $a_m \in \Re^N$ in $A^{\top}$ is bigger than the dimensionality $N$ of $a_m$, i.e., $M > N$, we might also say that the set of vectors $\{a_1, a_2, \ldots, a_M\}$ is linearly dependent and that this set forms a frame.

(i) We propose an iterative, alternating algorithm for learning an overcomplete sparsifying transform with two steps: a transform update step and a sparse coding step;
(ii) We introduce a constrained problem formulation for the transform update step with an objective that represents a lower bound approximation to the original objective of the related transform estimation problem;
(iii) We propose an approximate closed form solution that addresses a trade-off between (a) how much the gradients of the approximative objective and the original objective are aligned and (b) how close the lower bound is to the original objective;
(iv) We give a convergence result for the iterating sequence of objective function values generated by the iterating steps of the proposed algorithm with exact and approximate closed form solutions;
(v) We present an evaluation by computer simulation for an image denoising application, showing competitive performance while using a small amount of the noisy data for learning.

4.2 Related Work

Image denoising methods can be grouped into two broad categories: (i) single image based denoising; (ii) image denoising using an external image data set.

This grouping is based on whether only the noisy image or additional external data are used. In recent years, deep neural networks have been used extensively to model external image knowledge in the form of a mapper to a denoised estimate or in the form of a "deep" prior, which is used in order to perform the actual denoising. While these methods have provided good results, they require a large amount of external data, usually need long training times and are difficult to analyze w.r.t. the optimal network structure and parameters. Among the single image based denoising methods, the most prominent ones are based on self-collaborative filtering, non-local means and the sparsity based models that use a dictionary. The main differences between them are not only the fact that the self-collaborative filtering and the non-local means based methods use predefined filters, but also that the former two incorporate information about the similarity match between the local image patches in the denoised estimate. In contrast, in the sparsity based models a dictionary is learned using the noisy image, which allows a data adaptation of the dictionary words. In general, the advantage of single image based denoising compared to the image denoising methods that use an external image data set is not only the learning and denoising efficiency, but also the ability to interpret and characterize the optimal solution. An additional potential is the possibility to highlight the involved trade-offs w.r.t. the amount of noisy images used and the quality of the solution, which has a great significance in a practical application.

4.2.1 Sparse Models for Image Denoising

In line with our work, we introduce the common sparse signal models that can be used not only for image denoising, but also for other image restoration problems.

Synthesis model As the name suggests, the synthesis model synthesizes a data sample $x \in \Re^N$ as an approximation by a linear combination $y \in \Re^M$ (referred to as a sparse data representation) of a few words (frame vectors), $\|y\|_0 \ll M$, from a dictionary (frame) $D \in \Re^{N \times M}$, i.e., $x = Dy + v$, where $v \in \Re^N$ denotes the approximation error. With the synthesis model approach the data reconstruction is addressed explicitly. This model assumes that the data $x$ lies in the column space of the dictionary $D$, with the error vector $v$ defined in the original data domain. The two main open issues with this model are the high computational complexity of learning the dictionary $D$ and of estimating the sparse representation $y$. The problem of learning a synthesis dictionary is NP-hard. Nonetheless, in recent years many algorithms have been developed for the solution of the dictionary learning problem [90], [94], [29], [49] and [2].

Analysis model This model uses a dictionary $\Phi \in \Re^{M \times N}$ with $M > N$ to analyze the signal $x \in \Re^N$. This model assumes that the product of $\Phi$ and $x$ is sparse, i.e., $y = \Phi x$ with $\|y\|_0 = M - s$, where $0 \le s \le M$ is the number of zeros in $y \in \Re^M$ [127], [56]. The vector $y$ is the analysis sparse representation of the data $x$ w.r.t. $\Phi$. If the data sample $x$ is known, its analysis representation w.r.t. a given $\Phi$ can be obtained by multiplying $x$ by $\Phi$. However, when the observed signal is contaminated by noise, the clean signal $x$ has to be estimated first in order to obtain its analysis representation, which leads to the analysis pursuit problem [127]. Several algorithms have been proposed for analysis dictionary learning [127], [56], [161], [102] and [136]. The authors in [36] give a comprehensive overview of different learning methods for the analysis model. For this class of algorithms the computational complexity is even higher than for the previous model if the analysis pursuit problem [127] is considered jointly with the estimation of the dictionary.

Transform model In contrast to the synthesis model and similarly to the analysis model, the sparsifying transform model does not target the data reconstruction explicitly. This model assumes that the data sample $x$ is approximately sparsifiable under a linear transform $A \in \Re^{M \times N}$, i.e., $Ax = y + z$, $z \in \Re^M$, where $y$ is sparse, $\|y\|_0 \ll M$. The error vector $z$ is defined in the transform domain, which is different compared to the two previous models. Note that the first advantage of the sparsifying transform model is that it extends and represents a generalization of the analysis model [121], since there is no explicit assumption on the sparse representation $y$ or on the data sample $x$. The sparse encoding in this model is a direct problem, which is the converse of the inverse problem in the synthesis model. The sparsifying transform model was introduced in [119]. The sparsifying transform with a square matrix was studied in [121], and the sparsifying transform with a structured set of square matrices and a non-structured overcomplete matrix $A \in \Re^{M \times N}$, $M \ge N$, was studied in [120], [152] and [153].

4.3 Learning an Overcomplete and Sparsifying Transform with Exact and Approximate Closed Form Solutions

According to our NT model, which was introduced in Chapter 2, we have that:

$$p(y_l \mid x_l, A) \propto \int_{\theta} p(x_l \mid y_l, \theta, A)\, p(y_l, \theta \mid A)\, d\theta, \quad (4.1)$$
where $l \in \{1, \ldots, L\}$. In order to reduce it to a sparsifying nonlinear transform (sNT) model, we disregard the parameter $\theta$, thus we have:

$$\int_{\theta = \{\}} p(x_l \mid y_l, \theta, A)\, p(y_l, \theta \mid A)\, d\theta = p(x_l \mid y_l, A)\, p(y_l \mid A), \quad (4.2)$$
where $p(x_l \mid y_l, A) \propto \exp(-\frac{1}{\beta_0}\|A x_l - y_l\|_2^2)$ models the sparsifying transform error and $p(y_l \mid A)$ models a sparsity inducing prior, which we further simplify by assuming that:

$$p(y_l \mid A) = p(y_l), \quad (4.3)$$
while $p(A) \propto \exp(-\Omega(A))$ is the prior on the linear map $A \in \Re^{M \times N}$ with prior measure $\Omega(\cdot)$, $y_l \in \Re^{M \times 1}$ is the sparsified transform representation, $x_l \in \Re^{N \times 1}$ is the original data representation and $\beta_0$ is a scaling parameter.

Under our model assumptions, we minimize the negative logarithm of $p(Y, A \mid X) = p(Y \mid X, A)\, p(A \mid X)$, which is proportional to $\left[\prod_{l=1}^{CK} p(x_l \mid y_l, A)\, p(y_l \mid A)\right] p(A)$ under the assumption that $p(A) = p(A \mid X)$. The resulting problem formulation is the following:
$$\{\hat{Y}, \hat{A}\} = \arg\min_{Y, A} \sum_{l=1}^{CK} \left[-\log p(x_l \mid y_l, A) - \log p(y_l)\right] - \log p(A), \quad (4.4)$$

where $X = [x_1, \ldots, x_L] \in \Re^{N \times L}$ and $Y = [y_1, \ldots, y_L] \in \Re^{M \times L}$. As noted in Chapter 2, depending on the chosen measure for the factored probabilities, the solution to (4.4) might be difficult to compute, since the partition function might have to be evaluated by integration. In the following section we describe a problem formulation that overcomes this potential difficulty of (4.4). Specifically, we address a problem formulation where we disregard the partition functions in the factored probabilities of our model and, instead of the sparsity prior, we impose an equivalent explicit sparsity constraint.

4.3.1 Problem Formulation

Assume a data matrix $X \in \Re^{N \times L}$ is given that has as columns the data samples $x_l \in \Re^N$, where $l \in \mathcal{L} = \{1, \ldots, L\}$ and $L$ is the number of data samples. We address the learning of an approximately sparsifying transform with an overcomplete transform matrix $A \in \Re^{M \times N}$, $M > N$, by the following problem formulation:

$$\{\hat{A}, \hat{Y}\} = \arg\min_{A, Y} \frac{1}{2}\|AX - Y\|_F^2 + \Omega(A), \quad \text{subject to } \|y_l\|_0 \le s, \ \forall l \in \mathcal{L}, \quad (4.5)$$
where $\|\cdot\|_F$ and $\|\cdot\|_0$ denote the Frobenius norm and the $\ell_0$-"norm", respectively, and $Y = [y_1, \ldots, y_L] \in \Re^{M \times L}$ has as columns the transform representations $y_l$. The first term in (4.5) is the sparsification error [123]. It represents the deviation of the linear transform representations $AX$ from the exact sparse representations $Y$ in the transform domain. The penalty $\Omega(A)$ on the transform matrix $A$ is defined as:

$$\Omega(A) = \frac{\lambda_1}{2}\|A\|_F^2 + \frac{\lambda_2}{2}\|AA^T - I\|_F^2 - \lambda_3 \log|\det A^T A|, \quad (4.6)$$
where the $\lambda_k$ are Lagrangian multipliers, $\forall k \in \{1,2,3\}$. The second term $\Omega(A)$ of (4.5) and the penalty $\|y_l\|_0$, $\forall l \in \mathcal{L}$, induce constraints on the properties of the matrix $A$ and of the transform representations $Y$, respectively. The $\|A\|_F^2$ penalty helps regularize the scale ambiguity in the solution of (4.5), which occurs when the data samples have representations with zero valued components. The $\log|\det(A^T A)|$ and $\|A\|_F^2$ terms are functions of the singular values of $A$ and together help regularize the conditioning of $A$ [121], [120], [153], [122]. Assuming that the expected coherence $\mu^2(A)$ between the rows $a_m$ of $A$, i.e., $A = [a_1, \ldots, a_M]^T$, is defined as:

$$\mu^2(A) = \frac{2}{M(M-1)} \sum_{m_1, m_2 \in \{1, \ldots, M\},\ m_1 \ne m_2} |a_{m_1}^T a_{m_2}|^2, \quad (4.7)$$
then the penalty $\|AA^T - I\|_F^2$ helps enforce a minimum expected coherence $\mu^2(A)$ and unit $\ell_2$-norm for the rows of $A$. The transform data $y_l$ are constrained to have at most $s$ non-zero elements by the sparsity inducing $\ell_0$-"norm" constraint $\|y_l\|_0 \le s$, $\forall l \in \mathcal{L}$, which in our model is related to the sparsity prior $p(y_l)$.
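As a concrete illustration, the following minimal NumPy sketch evaluates the penalty $\Omega(A)$ of (4.6) and the expected coherence $\mu^2(A)$ of (4.7); the function names and the choice to average over unordered row pairs are illustrative assumptions.

```python
import numpy as np

def omega(A, lam1, lam2, lam3):
    """Penalty (4.6): Frobenius, near-tight-frame and log-determinant terms."""
    M, _ = A.shape
    frob = 0.5 * lam1 * np.linalg.norm(A, 'fro') ** 2
    frame = 0.5 * lam2 * np.linalg.norm(A @ A.T - np.eye(M), 'fro') ** 2
    # log|det(A^T A)| acts as a barrier against rank deficiency of A
    logdet = np.linalg.slogdet(A.T @ A)[1]
    return frob + frame - lam3 * logdet

def expected_coherence(A):
    """Expected squared coherence (4.7): average of |a_m1^T a_m2|^2 over row pairs."""
    M = A.shape[0]
    G = A @ A.T                       # Gram matrix of the rows of A
    iu = np.triu_indices(M, k=1)      # unordered pairs m1 < m2
    return 2.0 / (M * (M - 1)) * np.sum(G[iu] ** 2)
```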

4.3.2 Two Step Iteratively Alternating Algorithm

Problem (4.5) is non-convex in the variables $\{A, Y\}$. If the variable $A$ is fixed, (4.5) is convex. However, if $Y$ is fixed, (4.5) remains non-convex, because the penalty function $\Omega(\cdot)$ contains the term $\|AA^T - I\|_F^2$, in which $AA^T$ appears to the power of 2, and the penalty $-\log|\det A^T A|$. To solve (4.5) we use an iterative, alternating algorithm that has two steps: transform estimate and sparse coding. In the transform estimate step, given $Y^t$ that is estimated at iteration $t$, we use an approximate closed form solution to estimate the transform matrix $A^{t+1}$ at iteration $t+1$. In the sparse coding step, given $A^{t+1}$, the sparse codes $y_l^{t+1}$ are estimated by a closed form solution. Note that the steps of the algorithm are equivalent to 1) approximative maximization of $\prod_l p(x_l \mid y_l, A)$ with prior $p(A)$ over $A$ and 2) approximative maximization of $p(x_l \mid y_l, A)$ with prior $p(y_l)$ over $y_l$.
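A minimal sketch of the two step alternating scheme, where a plain gradient step (using the gradient expression (4.15) given below) stands in for the approximate closed form transform update derived in this section, and the sparse coding step keeps the $s$ largest magnitudes per column as in (4.22). All parameter values are illustrative.

```python
import numpy as np

def grad_g(A, X, Y, lam1, lam2, lam3):
    """Gradient of g(A, Y) w.r.t. A, following (4.15); pinv stands in for the
    inverse of A^T since A is rectangular."""
    M, _ = A.shape
    return ((A @ X - Y) @ X.T + lam1 * A
            + lam2 * (A @ A.T - np.eye(M)) @ A
            - lam3 * np.linalg.pinv(A.T))

def sparse_code(A, X, s):
    """Sparse coding step: keep the s largest-magnitude entries of each A x_l (cf. (4.22))."""
    Q = A @ X
    Y = np.zeros_like(Q)
    idx = np.argsort(-np.abs(Q), axis=0)[:s, :]        # indices of the s largest magnitudes
    cols = np.arange(Q.shape[1])
    Y[idx, cols] = Q[idx, cols]
    return Y

def learn_transform(X, M, s, iters=50, step=1e-4, lam1=1e-3, lam2=1e-3, lam3=1e-3, seed=0):
    """Alternating scheme of Section 4.3.2; the transform update here is a simple
    gradient step, used only as a stand-in for the approximate closed form
    solution (4.12)-(4.13)."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    A = rng.standard_normal((M, N))
    Y = sparse_code(A, X, s)
    for _ in range(iters):
        A = A - step * grad_g(A, X, Y, lam1, lam2, lam3)   # transform estimate (stand-in)
        Y = sparse_code(A, X, s)                           # sparse coding (closed form)
    return A, Y
```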

Transform estimate Let the transform data $Y^t$ at iteration $t$ be known; then problem (4.5) reduces to the problem of estimating the transform matrix $A^{t+1}$, which is defined as follows:

$$(P1): \quad \hat{A}^{t+1} = \arg\min_{A^{t+1}} \overbrace{\frac{1}{2}\|A^{t+1}X - Y^t\|_F^2 + \Omega(A^{t+1})}^{g(A^{t+1}, Y^t)}. \quad (4.8)$$

Alternative Problem Formulation for Transform Estimate - Instead of addressing problem (P1), in this work we introduce a constrained problem and focus on an objective that is a lower bound on the objective in problem (P1), i.e.,

$$\hat{A}^{t+1} = \arg\min_{A^{t+1}} \overbrace{\frac{1}{2}\,\mathrm{Tr}\{A^{t+1}XX^T(A^{t+1})^T - 2A^{t+1}G + (Y^t)^T Y^t\} + \Omega(A^{t+1})}^{g_\varepsilon(A^{t+1},Y^t) \le g(A^{t+1},Y^t)}$$
$$\text{subject to } A^{t+1} = V S^T \Sigma_A \Sigma^{-1} U^T, \quad \mathrm{Tr}\{\Sigma_A\} \ge \frac{1}{\beta\lambda_{\min}}\mathrm{Tr}\{\Sigma_C\}, \quad (4.9)$$
where:

$$g_\varepsilon(A^{t+1}, Y^t) = \mathrm{Tr}\{A^{t+1}XX^T(A^{t+1})^T - 2A^{t+1}G + (Y^t)^T Y^t\} + \Omega(A^{t+1}), \quad (4.10)$$
is the lower bound for:

$$g(A^{t+1}, Y^t) = \mathrm{Tr}\{A^{t+1}XX^T(A^{t+1})^T - 2A^{t+1}X(Y^t)^T + (Y^t)^T Y^t\} + \Omega(A^{t+1}), \quad (4.11)$$
with $G \in \Re^{N \times M}$ an approximation to $X(Y^t)^T$. We assume that $A^{t+1}$ decomposes as $A^{t+1} = V S^T \Sigma_A \Sigma^{-1} U^T$, where $V \in \Re^{M \times N}$, $S \in \Re^{N \times N}$ and $U \in \Re^{N \times N}$ are a tall column-wise orthonormal matrix and square orthonormal matrices, respectively, $\Sigma_A \in \Re^{N \times N}$, $\Sigma \in \Re^{N \times N}$ and $\Sigma_C \in \Re^{N \times N}$ are diagonal matrices, $\lambda_{\min} \in \Re$ and $\beta \ge 0$.

We derive the matrix $\Sigma_C$ and $\lambda_{\min}$ based on the matrices $V$, $S$, $\Sigma$ and $U$ and the first order derivative of $g(A^t, Y^t)$ w.r.t. $A^t$ estimated at iteration $t$. We use the constraint $\mathrm{Tr}\{\Sigma_A\} \ge \frac{1}{\beta\lambda_{\min}}\mathrm{Tr}\{\Sigma_C\}$ to ensure that the gradient of the approximated objective $g_\varepsilon(A^{t+1}, Y^t)$ is preserved w.r.t. the gradient of the original objective $g(A^{t+1}, Y^t)$. In the following, we present the approximate closed form solution for (P1), which in fact represents the solution to (4.9). In addition, we explain how $V$, $S$, $U$, $\Sigma$ and $\Sigma_A$ are estimated, and how $G$, $\Sigma_C$ and $\lambda_{\min}$ are constructed.

Approximate Closed Form Solution - Given $Y^t \in \Re^{M \times L}$, $X \in \Re^{N \times L}$, $M \ge N$, $\lambda_1 \ge 0$, $\lambda_2 \ge 0$, $\lambda_3 \ge 0$, let the eigenvalue decomposition $U\Sigma^2 U^T$ of $XX^T + \lambda_1 I$ and the singular value decomposition $SDV^T$ of $U^T X(Y^t)^T$ exist; then, if and only if $\Sigma(n,n) = \sigma(n) > 0$, $\forall n \in \mathcal{N} = \{1, \ldots, N\}$, the original problem (P1) has an approximate closed form solution in the form of:
$$A^{t+1} = V S^T \Sigma_A \Sigma^{-1} U^T, \quad (4.12)$$
where $\Sigma_A$ is a diagonal matrix, $\Sigma_A(n,n) = \sigma_A(n) \ge 0$, and the $\sigma_A(n)$ are solutions to:

$$\hat{\sigma}_A(n) = \arg\min_{\sigma_A(n)} \ c_4(n)\sigma_A(n)^4 + c_2(n)\sigma_A(n)^2 - \frac{\sigma_\Gamma(n)}{\sigma(n)}\sigma_A(n) - f(\sigma_A(n)),$$
$$\text{subject to } \sigma_A(n) \ge \frac{1}{\beta\lambda_{\min}}\sigma_C(n), \quad (4.13)$$

with $c_4(n) = \frac{\lambda_2}{\sigma(n)^4}$, $c_2(n) = \frac{\sigma(n)^2 - 2\lambda_2}{\sigma(n)^2}$, $f(\sigma_A(n)) = 2\lambda_3 \log\frac{\sigma(n)}{\sigma_A(n)}$, $\sigma_\Gamma(n) = T(n,n)$, $T = SD$ and $\sigma_C(n) = \Sigma_C(n,n)$, $\forall n \in \mathcal{N}$. The diagonal matrix $\Sigma_C$ contains the singular values of:

$$C = \left(\frac{\partial g(A^t, Y^t)}{\partial A^t}\right)^T A^t, \quad (4.14)$$
whereas:

$$\frac{\partial g(A^t, Y^t)}{\partial A^t} = (A^t X - Y^t)X^T + \lambda_1 A^t + \lambda_2 (A^t (A^t)^T - I)A^t - \lambda_3 ((A^t)^T)^{-1}, \quad (4.15)$$
is the first order derivative of $g(A^t, Y^t)$ w.r.t. the estimate $A^t$ at iteration $t$. The variable

$\lambda_{\min}$ is the smallest singular value of:

$$F = \Sigma^{-1} U^T \left(\frac{\partial g(A^t, Y^t)}{\partial A^t}\right)^T V S^T, \quad (4.16)$$
and $\beta \ge 0$.

Proof: The matrix $G = \Sigma_\Gamma V^T \simeq X(Y^t)^T$. In the trace form, this approximation results in the lower bound inequality $-\mathrm{Tr}\{\Sigma_A \Sigma^{-1} \Sigma_\Gamma\} \le -\mathrm{Tr}\{\Sigma_A \Sigma^{-1} T\}$ and ensures that $g_\varepsilon(A^{t+1}, Y^t) \le g(A^{t+1}, Y^t)$. The proof is given in Appendices B.1 and B.2, together with the proof of the existence of a closed form solution in the form of (4.12) with a solution for $\Sigma_A$ without the constraint in (4.13).

The constraint in (4.13) is important in order to guarantee the preservation of the gradient $\frac{\partial g_\varepsilon(A^{t+1}, Y^t)}{\partial A^{t+1}}$ of the approximative objective $g_\varepsilon(A^{t+1}, Y^t)$ w.r.t. the gradient $\frac{\partial g(A^{t+1}, Y^t)}{\partial A^{t+1}}$ of the original objective $g(A^{t+1}, Y^t)$.

We assume that the solutions $A^t$ and $Y^t$, and the gradient $\frac{\partial g_\varepsilon(A^t, Y^t)}{\partial A^t}$ at iteration $t$, are known. In order to preserve the gradient of the approximative objective $g_\varepsilon(A^{t+1}, Y^t)$, the solution for $A^{t+1}$ should be estimated such that it holds:

$$\mathrm{Tr}\left\{\left(\frac{\partial g(A^t, Y^t)}{\partial A^t}\right)^T \left(A^t - \beta A^{t+1}\right)\right\} \le 0, \quad (4.17)$$
where $A^t - \beta A^{t+1}$ is a descent direction only if (4.17) holds true and $\beta \ge 0$. We denote $C = \left(\frac{\partial g(A^t, Y^t)}{\partial A^t}\right)^T A^t$ and, by using (4.12), we denote $F = \Sigma^{-1} U^T \left(\frac{\partial g(A^t, Y^t)}{\partial A^t}\right)^T V S^T$. To further simplify, we use $C$ and $F$ and express the left hand side of (4.17) as $\mathrm{Tr}\{C - \beta F \Sigma_A\}$. By using the smallest singular value $\lambda_{\min}$ of the matrix $F$ we have the following bound:

$$\mathrm{Tr}\{C - \beta\lambda_{\min}\Sigma_A\} \ge \mathrm{Tr}\{C - \beta F\Sigma_A\}, \quad (4.18)$$
which represents an upper bound on the condition for the preservation of the gradient (4.17). When we reorder the left hand side of (4.18), impose that the upper bound is nonpositive, i.e., $0 \ge \mathrm{Tr}\{C - \beta\lambda_{\min}\Sigma_A\}$, and consider a stricter, element-wise condition, then we arrive at:
$$\sigma_A(n) \ge \frac{1}{\beta\lambda_{\min}}\sigma_C(n), \quad (4.19)$$
where $\sigma_C(n) = \Sigma_C(n,n)$ and $\Sigma_C$ is a diagonal matrix with the diagonal elements equal to the singular values of $C$.

Trade-Off Between Gradient Alignment and Lower Bound Tightness - We use bounds of the form:
$$-\mathrm{Tr}\{\Sigma_A\Sigma^{-1}\Sigma_\Gamma\} \le -\mathrm{Tr}\{\Sigma_A\Sigma^{-1}T\},$$
$$0 \ge \mathrm{Tr}\{C - \beta\Sigma_F\Sigma_A\} \ge \mathrm{Tr}\{C - \beta F\Sigma_A\}, \quad (4.20)$$
where $\Sigma_F(n,n) = \lambda_{\min}$, $\forall n \in \mathcal{N}$. The first bound is related to the approximated objective $g_\varepsilon(A^{t+1}, Y^t)$; that is, its left hand side is related to $G$, which appears in $g_\varepsilon(A^{t+1}, Y^t)$. The second bound is related to the constraint for the preservation of the gradient. The bounds (4.20) address a trade-off between how much the gradients $\frac{\partial g(A^{t+1}, Y^t)}{\partial A^{t+1}}$ and $\frac{\partial g_\varepsilon(A^{t+1}, Y^t)}{\partial A^{t+1}}$ of the original objective and the approximative objective, respectively, are aligned, and how close the lower bound $g_\varepsilon(A^{t+1}, Y^t)$ is to the objective $g(A^{t+1}, Y^t)$.

The bounds (4.20) offer three advantages in the proposed solution. First, their use results in an approximate closed form solution expressed by (4.12) with (4.13). Second, they allow us to estimate $A^{t+1}$ in a way that can lead to accelerated convergence. That is, they allow us to estimate a descent direction $A^t - \beta A^{t+1}$ such that the original objective $g(A^{t+1}, Y^t)$ is rapidly decreased. Third, the bounds (4.20) enable us to estimate an $A^{t+1}$ that can lead to a satisfactory solution under a small amount of data, where the key is again the trade-off between the lower bound approximation and the alignment of the gradient that is addressed with (4.20). While (4.20) only describes the trade-off, its limits and its optimal characterization w.r.t. the acceleration and the required minimum amount of data for an acceptable solution are out of the scope of this work. Nonetheless, using the proposed bounds (4.20), we empirically demonstrate that the proposed solution indeed exhibits the aforementioned advantages in an image denoising application.
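A rough sketch of how the closed form structure (4.12) with the per-index problems (4.13) can be evaluated numerically; the scalar minimizations are done by a crude grid search and the constraint of (4.13) is omitted, which Section 4.4 reports to be sufficient in practice. The variable names and the grid are illustrative choices.

```python
import numpy as np

def transform_update(X, Y, lam1, lam2, lam3, grid=np.linspace(1e-3, 10.0, 2000)):
    """Approximate closed form update A = V S^T Sigma_A Sigma^{-1} U^T (cf. (4.12), (4.13))."""
    N = X.shape[0]
    # eigenvalue decomposition U Sigma^2 U^T of X X^T + lam1 I
    evals, U = np.linalg.eigh(X @ X.T + lam1 * np.eye(N))
    sigma = np.sqrt(np.clip(evals, 1e-12, None))
    # singular value decomposition S D V^T of U^T X Y^T
    S, D, Vt = np.linalg.svd(U.T @ X @ Y.T, full_matrices=False)
    sigma_gamma = np.diag(S @ np.diag(D))            # sigma_Gamma(n) = T(n, n), T = S D
    sigma_A = np.empty(N)
    for n in range(N):
        c4 = lam2 / sigma[n] ** 4
        c2 = (sigma[n] ** 2 - 2.0 * lam2) / sigma[n] ** 2
        # per-index objective of (4.13), evaluated on a grid of candidate sigma_A(n)
        obj = (c4 * grid ** 4 + c2 * grid ** 2
               - (sigma_gamma[n] / sigma[n]) * grid
               - 2.0 * lam3 * np.log(sigma[n] / grid))
        sigma_A[n] = grid[np.argmin(obj)]
    V = Vt.T
    return V @ S.T @ np.diag(sigma_A) @ np.diag(1.0 / sigma) @ U.T
```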

Sparse coding Given $A^{t+1}$, for $x_l$, $\forall l \in \mathcal{L}$, the sparse coding problem is formulated as follows:
$$\hat{y}_l = \arg\min_{y_l} \|A^{t+1}x_l - y_l\|_2^2, \quad \text{subject to } \|y_l\|_0 \le s, \quad (4.21)$$
where we denote $q_l^{t+1} = A^{t+1}x_l$ and use the optimal solution as proposed in [120], i.e.,

$$\hat{y}_l^{t+1}(m) = \begin{cases} q_l^{t+1}(m), & \text{if } |q_l^{t+1}(m)| \text{ is among the } s \text{ largest magnitudes of } q_l^{t+1}, \\ 0, & \text{otherwise}, \end{cases} \quad \forall m \in \{1, \ldots, M\}. \quad (4.22)$$

4.3.3 Local Convergence

Since the approximate closed form solution in the transform update preserves the gradient, and the sparse coding step uses an exact closed form solution, we can state and prove the following local convergence result.

Theorem 2 Given data $X$ and a pair of initial transform and sparse data $\{A^0, Y^0\}$, let $\{A^t, Y^t\}$ denote the iterative sequence generated by solution (4.12) with (4.13) and the closed form solution of (4.21). Then, the sequence of the objective function values $g(A^t, Y^t)$ is a monotone decreasing sequence, satisfying $g(A^{t+1}, Y^{t+1}) \le g(A^{t+1}, Y^t) \le g(A^t, Y^t)$, and converges to a finite value denoted as $g^*$.

Proof: In the transform update step, $Y^t$ is fixed and an approximative minimizer is obtained w.r.t. $A^{t+1}$, with $g_\varepsilon(A^{t+1}, Y^t) \le g(A^{t+1}, Y^t)$. Therefore, $g(A^{t+1}, Y^t) \le g(A^t, Y^t)$. In the sparse coding step an exact solution is obtained for $Y^{t+1}$ with fixed $A^{t+1}$. Therefore, $g(A^{t+1}, Y^{t+1}) \le g(A^{t+1}, Y^t)$ holds trivially. Combining the results for the two steps, we have $g(A^{t+1}, Y^{t+1}) \le g(A^t, Y^t)$ for any $t$. Since the function $g(A^t, Y^t)$ is lower bounded [123], the sequence of the objective function values $g(A^t, Y^t)$ is monotone decreasing and lower bounded, and therefore it converges.

Given the convergence proof, we can claim that our iterative algorithm, which is based on integrated marginal maximization, allows us to find only a joint local maximum in $A$ and $Y$ for $\prod_{l=1}^{L} p(x_l \mid y_l, A)\, p(y_l)\, p(A)$. Since we use an $\varepsilon$-Close Approximative solution in the Transform estimate step, we name our algorithm $\varepsilon$CAT.

4.3.4 Image Denoising With εCAT

We denote an original and a noisy image with dimensions $S_1 \times S_2$, represented in vector format by $x \in \Re^{S_1 S_2}$ and $q = x + g \in \Re^{S_1 S_2}$, where $g \in \Re^{S_1 S_2}$ is the noise. A clean and a noisy image block are denoted as:

$$x_l = E_l x \in \Re^N, \ \forall l \in \mathcal{L}, \quad \text{and} \quad q_l = E_l q \in \Re^N, \ \forall l \in \mathcal{L}, \quad (4.23)$$
respectively, where the matrix $E_l \in \{0,1\}^{N \times S_1 S_2}$ is used to extract a clean (or noisy) image block at location $l$ and $\mathcal{L}$ is the index set of all image block locations.
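A minimal sketch of the block extraction in (4.23), where the selection matrices $E_l$ are realized implicitly by slicing; the block size and stride are illustrative.

```python
import numpy as np

def extract_blocks(image, b=8, stride=1):
    """Return an N x L matrix whose columns are vectorized b x b blocks (N = b*b),
    i.e. the action of the selection matrices E_l of (4.23) without forming them."""
    S1, S2 = image.shape
    blocks = []
    for i in range(0, S1 - b + 1, stride):
        for j in range(0, S2 - b + 1, stride):
            blocks.append(image[i:i + b, j:j + b].reshape(-1))
    return np.stack(blocks, axis=1)

# usage sketch with a hypothetical noisy image array q:
# Q = extract_blocks(q, b=8)   # Q[:, l] plays the role of E_l q
```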

The extension of (4.5) for block level image denoising is formulated as:

$$\{\hat{X}, \hat{Y}, \hat{A}\} = \arg\min_{X, Y, A} \sum_{l=1}^{L} \|Ax_l - y_l\|_2^2 + \tau\|x_l - q_l\|_2^2 + \Omega(A), \quad \text{subject to } \|y_l\|_0 \le s_l, \ l \in \mathcal{L}, \quad (4.24)$$
where $A \in \Re^{M \times N}$ is the transform matrix, $x_l \in \Re^N$ is the estimated original image block, $y_l \in \Re^M$ is the sparse transform representation with sparsity level $s_l$ and $\tau$ is a parameter inversely proportional to the noise variance $\sigma^2$. Note that, by using (4.12), the pseudo-inverse of $A$ exists as:

$$A^\dagger = U \Sigma \Sigma_A^{-1} S V^T. \quad (4.25)$$

Furthermore, given $A^\dagger$ and $y_l$, (4.24) approximately reduces to the constrained projection problem:
$$(PD): \quad \hat{x}_l = \arg\min_{x_l} \|x_l - A^\dagger y_l\|_2^2 + \tau\|x_l - q_l\|_2^2, \quad (4.26)$$
for the variable $x_l$, and its closed form solution for an individual image block $x_l$ can be computed as:
$$\hat{x}_l = \begin{bmatrix}\sqrt{\tau}I \\ I\end{bmatrix}^\dagger \begin{bmatrix}\sqrt{\tau}q_l \\ A^\dagger y_l\end{bmatrix} = e_1\sqrt{\tau}q_l + e_2 A^\dagger y_l, \quad (4.27)$$
where $\begin{bmatrix}\sqrt{\tau}I \\ I\end{bmatrix}^\dagger$ is the pseudo-inverse of $\begin{bmatrix}\sqrt{\tau}I \\ I\end{bmatrix}$ (two concatenated diagonal matrices $\sqrt{\tau}I$ and $I$) and the solution is easily computed as $[e_1, e_2] = [\sqrt{\tau}, 1]^\dagger$.² The denoising problem (4.24) is non-convex in the variables $x_l$, $y_l$ and $A$ together. Similarly to [120] and [2], we use an iterative procedure that has two steps. In the first step (Transform estimate update), $x_l = E_l q$ is fixed, the initial sparsity is set to $s_l = s$ and the overcomplete transform matrix $A$ is estimated using the proposed approximate closed form solution. In the second step

(Sparse coding update), given $A$, the remaining variables $y_l$ and $x_l$ are updated similarly as proposed in [120]. Commonly, a sparsity level $s_l$ for the sparse code $y_l$ is chosen such that the denoising error term $\|q_l - x_l\|_2^2$ is bounded from above by a constant. The usual bound is $\|q_l - x_l\|_2^2 \le CN\sigma^2$, where $C$ is a constant, $\sigma^2$ is the noise variance, $x_l = e_1\sqrt{\tau}q_l + e_2 A^\dagger y_l$ and $y_l$ is estimated as a solution to (P2). Here, instead, we upper bound just the inner product of the estimate $x_l$, i.e., $x_l^T x_l \le C_0 CN\sigma^2$, where $C_0$ is an additional constant which we empirically observed should be set to $\frac{1}{2}$. The new estimates for the sparsity levels $s_l$, $\forall l$, are used in the next Transform estimate update, and the procedure is iterated between the Transform estimate and Sparse coding updates until the predefined number of iterations is reached.

² Note that the coefficients $[e_1, e_2]$ have to be computed only once, stored and then reused in later computations.

[Figure 4.1: four panels a)-d) plotted against the iteration number.]

Fig. 4.1 The evolution of a) the transform error $\|AX - Y\|_F^2$, b) the term $-\mathrm{Tr}\{AXY^T\}$ and its lower bound approximation $-\mathrm{Tr}\{AG\}$, c) the conditioning number and d) the expected mutual coherence $\mu(A)$ while learning the transform matrix $A$ on overlapping $8 \times 8$ noisy image blocks (equivalently $N = 64$) from the Cameraman image, where $M$ was set to 80 and the sparsity level was set to 36.

In the final iteration, only those $x_l$ that satisfy $\|q_l - x_l\|_2^2 \le CN\sigma^2$ are considered as the actual denoised image patches. Given the final estimates, the denoised image $x$ is obtained in the same fashion as in [2], [119] and [120].
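A condensed sketch of the per-block estimate (4.27) and of re-assembling the denoised image by averaging overlapping blocks; `A_pinv` stands for the pseudo-inverse (4.25), and the sparsity-level update logic described above is omitted here.

```python
import numpy as np

def denoise_blocks(A_pinv, Y, Q, tau):
    """Closed form block estimates (4.27): x_l = e1*sqrt(tau)*q_l + e2*A_pinv y_l."""
    e = np.array([np.sqrt(tau), 1.0])
    e = e / (e @ e)                       # [e1, e2] = pseudo-inverse of [sqrt(tau); 1]
    return e[0] * np.sqrt(tau) * Q + e[1] * (A_pinv @ Y)

def assemble_image(X_hat, shape, b=8, stride=1):
    """Average the overlapping denoised blocks back into an image."""
    S1, S2 = shape
    acc = np.zeros(shape)
    cnt = np.zeros(shape)
    l = 0
    for i in range(0, S1 - b + 1, stride):
        for j in range(0, S2 - b + 1, stride):
            acc[i:i + b, j:j + b] += X_hat[:, l].reshape(b, b)
            cnt[i:i + b, j:j + b] += 1.0
            l += 1
    return acc / np.maximum(cnt, 1.0)
```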

4.4 Numerical Evaluation

This section validates the proposed approach by numerical experiments and demonstrates its advantages.

Data and Algorithm Initialization To evaluate the potential of the proposed approach we used the Peppers, Cameraman, Barbara, Lena and Man images at image resolutions of $256 \times 256$, $256 \times 256$, $512 \times 512$, $512 \times 512$ and $512 \times 512$, respectively. The following

[Figure 4.2: two panels, a) and b), plotting the normalized transform error against the iteration number for varying sparsity levels and varying amounts of data.]

Fig. 4.2 The evolution of the normalized transform error $\frac{\|AX - Y\|_F^2}{L}$, where $L$ is the total number of samples $x_l$, $l \in \{1, \ldots, L\}$, under a) sparsity levels $s \in \{4, 10, 16, 22, 28, 34, 40, 46, 52, 58, 64, 70\}$ and b) amounts of data expressed as a percentage of the total amount of data, while learning the transform matrix $A$ on overlapping $8 \times 8$ noisy image blocks (equivalently $N = 64$) from the Cameraman image, where $M$ was set to 80 and the sparsity level was set to 36.

algorithm parameters were used: $N = 64$, $M = 80$, $\lambda_1 = \lambda_2 = \lambda_3 = 10 \times 10^7$, $C = 1.08$, $C_0 = 1/2$ and $\tau = 0.01/\sigma$. The algorithm is initialized with a random matrix having i.i.d. Gaussian (zero mean, variance one) entries.

Denoising Setup The denoising recovery performance is evaluated at noise levels $\sigma = 10$ and $\sigma = 20$, and the sparsity is set to 25 and 19, respectively. The transform is learned by executing 300 iterations. The results are obtained as an average of 3 runs. We use a non-optimized Matlab implementation running on a PC with an Intel® Xeon(R) 3.60 GHz CPU and 32 GB of RAM. For each of the noisy images a sparsifying transform matrix $A$ is learned using only 1%-15% of the total amount of its noisy patches. The result of $\varepsilon$CAT is compared with the results of the algorithms proposed in [119] (TL-S), [120] (TL-O), [2] (K-SVD), [152] (OCTOBOS), [6] (FRIST) and [28] (BM3D).

Results The results are shown in Figures 4.1 and 4.2, and Tables 4.2, 4.1 and 4.3. Our empirical validation suggested that in our algorithm the solution for $A$ expressed by (4.12) with (4.13), when using the bounds (4.20), is equivalent to the solution (4.12) with (4.13) but without the constraint in (4.13). We noticed that the resulting $\sigma_A(n)$ are higher than $\frac{1}{\beta\lambda_{\min}}\sigma_C(n)$, $\forall n \in \mathcal{N}$, which implies that the constraint is implicitly satisfied and that the proposed solution without that constraint in (4.13) preserves the gradient. Therefore, we present results using the solution for $A$ without the explicit inequality constraint in (4.13). The evolution of the transform error, the term $\mathrm{Tr}\{AXY^T\}$, its lower bound $\mathrm{Tr}\{AG\}$, the conditioning number and the expected mutual coherence $\mu(A)$ while learning the transform matrix $A$ on the Cameraman image are shown in Figure 4.1.

Image       σ    TL-S    TL-O    K-SVD   εCAT    FRIST   OCTOBOS   BM3D
Peppers     10   34.45   34.49   34.2    34.44   34.68   34.57     34.66
Peppers     20   29.98   30.60   30.82   30.63   31.02   30.97     31.29
Cameraman   10   33.93   33.83   33.72   33.93   34.16   34.15     34.18
Cameraman   20   29.93   29.95   29.82   30.12   30.33   30.24     30.48
Barbara     10   34.45   34.55   34.42   34.60   34.57   34.64     34.98
Barbara     20   30.53   30.90   30.82   30.91   /       31.05     31.78
Lena        10   34.48   34.68   35.28   34.96   35.67   35.64     35.93
Lena        20   32.12   32.00   32.43   32.51   /       32.59     33.05
Man         10   32.96   32.16   32.71   32.75   33.06   32.98     33.98
Man         20   29.57   28.63   29.40   29.41   29.76   29.74     30.03

Table 4.1 Denoising performance in PSNR (dB), where σ is the noise standard deviation.

The transform error rapidly decreases, the lower bound approximation $\mathrm{Tr}\{AG\}$ is well below $\mathrm{Tr}\{AXY^T\}$, while the conditioning number and the expected mutual coherence $\mu(A)$ decrease from their initial values and remain low. This suggests that the proposed solution efficiently reduces the transform error while satisfying the regularization constraints on $A$. In Figure 4.2 a) we present the evolution of the transform error across varying sparsity levels while using an $80 \times 64$ transform dimension and 25% of the total amount of data. The transform error decreases for all sparsity levels, and for higher sparsity levels the rate of decrease of the transform error is faster. In the same figure under b) we show the evolution of the transform error across a varying amount of data while the sparsity level is fixed to 25, the transform dimension is set to $80 \times 64$, and the parameters $\{\lambda_1, \lambda_2, \lambda_3\}$ are set to a value of $10 \times 10^7$. We see that the actual error decreases, that the rate of decrease increases as we increase the amount of data, and that it saturates around 14% and 15%. This confirms that the proposed algorithm with the introduced update for $A$ can attain a satisfactory solution with a low transform error while using a small amount of data. Considering the results shown in Tables 4.3 and 4.1, only 15% of the total amount of available patches were used for learning the transform matrix of the $\varepsilon$CAT algorithm, whereas the rest of the algorithms use 25%-100%. This is reflected in the resulting execution time, which (as shown in Table 4.2) is around 4×, 2×, 9×, 3× and 3× faster than TL-S, TL-O, K-SVD, FRIST and OCTOBOS, respectively.

              TL-S     TL-O     K-SVD   εCAT    FRIST   OCTOBOS   BM3D
l_data [%]    25-100   25-100   100     3-15    100     100       100
t_e [min]     4.6      2.9      9.8     1.24    3.1     3.3       /

Table 4.2 The execution time in minutes and the percentage of the used image data.

l_data [%]               1      2      3      5      7      10
Peppers     t_e [min]    0.17   0.22   0.25   0.37   0.45   0.70
            PSNR         31.5   33.8   34.1   34.3   34.4   34.4
Cameraman   t_e [min]    0.17   0.22   0.26   0.5    0.8    1.01
            PSNR         31.5   33.3   33.4   33.7   33.8   33.9
Barbara     t_e [min]    0.22   0.34   0.49   0.66   0.83   1.13
            PSNR         31.5   33.3   33.4   33.7   34.1   34.3
Lena        t_e [min]    0.25   0.40   0.61   0.70   0.86   1.14
            PSNR         31.5   33.3   33.4   34.4   34.4   34.7
Man         t_e [min]    0.27   0.41   0.61   0.73   0.87   1.17
            PSNR         30.0   31.8   31.9   32.5   32.6   32.7

Table 4.3 The PSNR (dB) for the εCAT algorithm learned on a percentage of the available noisy image data with noise level σ = 10.

Table 4.1 shows the evaluation across different images and comparisons with several algorithms. The proposed algorithm has slightly better overall denoising results for the used noise levels $\sigma \in \{10, 20\}$ compared to the TL-S, TL-O and K-SVD algorithms. On the other hand, w.r.t. the rest of the algorithms the results are competitive, but have slightly lower PSNR. We explain this by the fact that flipping and rotation invariance, grouping and block similarity priors were used in the FRIST, OCTOBOS and BM3D algorithms, respectively. In the current version of our algorithm these priors were not considered. Nonetheless, some of the benefits of using $\varepsilon$CAT are notable in Table 4.3. Even when using only 3%-7% of the noisy image patches during learning, there is no large degradation in the final results and they remain competitive compared to the TL-S, TL-O and K-SVD algorithms.

4.5 Summary

In summary, the results confirm that the main advantage of the current version of the proposed algorithm, when using the bounds (4.20) for updating $A$ by (4.12) with (4.13) but without the constraint in (4.13), is the implicit preservation of the gradient and the ability to rapidly decrease the transform error, and thereby the objective, per iteration. This results in fast convergence and a small amount of data required to learn the model parameters. In this chapter, we considered the transform model and presented an iterative, alternating algorithm that has two steps: (i) transform update and (ii) sparse coding. In the transform update step, we focused on a novel problem formulation that was based on a lower bound of the objective and addressed a trade-off between (a) how much the gradients of the approximative objective and the original objective are aligned, and (b) how close the lower bound is to the original objective. This led to three advantages. First, an approximate closed form solution. Second, the possibility to find an update that can lead to accelerated local convergence. Third, an update estimate that can lead to a satisfactory solution under a small amount of training noisy image patches. Since the approximate closed form solution in the transform update preserved the gradient, and the sparse coding step used an exact closed form solution, we proved a guaranteed convergence for the resulting algorithm, which was also validated by the numerical experiments.

Chapter 5

Learning Robust and Discriminative NT Representation for Image Recognition

This chapter consists of three major sections. In the first section, we estimate sub-optimal, robust and local NT representations for the image face recognition problem. In the second one, we introduce a generalized nonlinear transform for the estimation of a generic sparse and discriminative image representation useful for an image recognition task. In the third section, we extend our generalized nonlinear transform by considering a discrimination specific structuring together with a self-collaboration component for an image recognition task.

5.1 Decision Aggregation Over Local NT Representations for Robust Face Recognition

In recent years, many lines of research related to sparse data representation (considering both estimation and reconstruction) were treated [142], [44], [14], [38], [17], [143], [103], [104], [37], [59], [58], [3], [111], [30], [133], [9], [165], [39], [23], [64], [24] and [55]. Among the many applications, sparse data representations were extensively used for the face image recognition problem.

Most commonly, the sparse representation (SR) based face recognition algorithms used convexly relaxed approaches [142], [44], [14], [38] and [17]. Assuming that a probe $x \in \Re^{N \times 1}$ and a codebook $D \in \Re^{N \times KM}$ are given, where $K$ is the number of subjects and $M$ is the number of samples per subject, the sparse representation is a solution to the following problem:
$$\hat{y} = \arg\min_{y} \frac{1}{2}\|x - \phi(D, y)\|_2^2 + \lambda m(y), \quad (5.1)$$
where $\phi(\cdot)$ is the "mapping" function and $m(\cdot)$ is the prior that enforces the sparseness properties on the representation coefficients $y \in \Re^{KM}$.

Obtaining an approximate sparse signal representation as in [156], [88], [99], [167], [42], [62], [66], [33], [75], [150], [68], [81] and [155], by solving the convexly relaxed inverse problem (5.1), has high computational complexity when the data sample $x$ has high dimensionality or when the codebook $D$ has a large number of codewords. Moreover, depending on the imposed sparsity prior, the exact solution might not even be feasible. On the other hand, even if an approximate, feasible solution to (5.1) is achievable, the robustness of the estimate under noise remains a sensitive issue.

At the same time, the images taken by devices in an unconstrained environment usually have various distortions and degradations. Different human facial expressions, poses and illumination conditions affect the quality of face images, causing occlusion, translation and scale errors. In this line, there are several open issues related to:

(i) The observation conditions caused by light variability, which usually lead to different appearances

(ii) The unknown model of distortions and a suitable similarity metric that captures these types of deviations

(iii) The lack of a precise global measure that models the geometrical occlusions, like facial expressions and emotions, that lead to local mismatches between two images even under perfect alignment.

Furthermore, handling these problems in a high dimensional feature space over large image databases, or over an undersampled data set, raises memory usage issues and makes the task of face recognition even more challenging. While the recent advent of deep learning has drawn a lot of attention to the recognition problem, where many researchers have reported impressive results, the optimal structure of the recognition system as well as the nature of the learned features remain open issues due to the difficulty in the interpretation and in the understanding of the learning dynamics. At the same time, deep neural networks were shown to be vulnerable to adversarial attacks [48], raising open questions concerning the privacy and security aspects of a real world system. Additionally, several researchers address the interpretation of a deep CNN in the scope of a synthesis model [90], where the role of local nonlinearities such as the ReLU is considered in connection to an explicit sparsity constraint similar to the one given in (5.1).

5.1.1 Contributions

As we have shown in Chapter 4, one of the main advantages of the sparsifying transform model w.r.t. the synthesis model is the closed form solution for the sparse representation estimation, which can also be seen as an analogy to the feed-forward mapping at a single layer of a feed-forward deep network. Therefore, to address the aforementioned issues, we propose a method that is based on a single level decomposition, which is rooted in a multilevel vector quantization (MVQ) approach [7]. We adopt a simple image patch based description. The decomposition of images into local blocks is explained by the necessity to cope with the non-stationary nature of the distortions, which can be approximated by block-wise stationary ones using local block decompositions. The idea is not to use the actual local blocks, but rather a robust NT representation w.r.t. the learned centroids over the corresponding available local blocks, in order to add a more reliable contribution to the aggregation over local decisions that use the local NT representations. Along this line, the local NT representation is central in our approach. Given a codebook $D^T$ and a data sample $x$, our robust encoding of $x$ w.r.t. the codebook $D^T$ represents a constrained sparse likelihood approximation:

$$\hat{y} = \arg\min_{y} \frac{1}{2}\|\phi(D^T, x) - y\|_2^2 + \lambda m(y), \quad (5.2)$$
where $\phi(\cdot)$ is a non-linear function that represents the likelihood of $x$ with respect to the codebook $D^T$, and $m(y)$ constrains $y \in \Re^{KM}$ to be sparse and robust. Note that (5.2) represents a direct problem, i.e., a constrained projection problem, while its solution gives the sparse approximation of $\phi(D^T, x)$.

In this section, we provide the following main contributions:

(i) We propose an aggregation method over local bag-of-words decisions from locally estimated robust NT representations w.r.t. learned centroids over local blocks

(ii) We give a generalization of the sparse likelihood approximation that provides connections to the Maximum a Posteriori (MAP) estimate

(iii) We present a connection to the common nonlinear thresholding techniques and propose an estimation of the local and robust NT representations

(iv) We address the solution to a low complexity constrained projection problem with norm ratio constraints and provide its convergence guarantee

(v) We give an extensive numerical validation on face image databases.

5.1.2 Related Work

In general, face recognition algorithms usually use face image descriptions that are grouped into two main types: holistic and component based ones. The first type adopts a holistic model that identifies the label of a face image using a global representation. The second type utilizes local image representations: the face image is first divided into patches and then features are extracted from each patch.

Rigid, holistic sparse representation based face recognition In the past, the Nearest Neighbour (NN) [25] and Nearest Feature Subspace (NFS) [137] methods were used for face image recognition, where a very loose form of sparsity was exploited. NN classifies the query image by only using its nearest neighbor. It utilizes the local structure of the training data and is therefore easily affected by noise. Exploiting the subspace structure of the data, NFS approximates the query image by using all the images belonging to an identical class. Class prediction is achieved by selecting the class of images that minimizes the reconstruction error. NFS might fail in the case that classes are highly correlated with each other. Several crucial issues, such as robustness to noise and image variability, were addressed by a generalization of NN and NFS called Sparse Representation based Classification (SRC) [156]. The underlying idea is to represent a query sample $x$ as a linear combination over the codes in the codebook $D$. The image is considered to be represented by a sparse representation $y$ that is composed of coefficients that linearly reconstruct the image via the codebook. Let $\phi(D, y) = Dy$ and $m(y) = \|y\|_1$; then (5.1) is equivalent to the SRC formulation for SR:

$$\hat{y} = \arg\min_{y} \frac{1}{2}\|x - Dy\|_2^2 + \lambda\|y\|_1. \quad (5.3)$$

It is expected that only those codes in the codebook $D$ that truly match the class of the query sample contribute to the sparse code $y$. The authors in [156] exploited this by computing a class-specific similarity measure. More specifically, they computed the reconstruction error of a query image for class $k$ by considering only the sparse codes associated with the codebook codes of the $k$-th class. The class that results in the minimum reconstruction error specifies the label of the query.

SRC [156] has been an inspiration for many other extensions and improvements [88], [99], [167], [42], [62]. SRC assumes that the face image has a sparse representation and exploits its discriminative nature to perform classification. Qinfeng et al. [130] argue that the lack of sparsity in the data means that the compressive sensing approach cannot be guaranteed to recover the exact signal, and therefore that sparse approximations may not deliver the desired robustness and performance. It has also been shown [22] that in some cases the locality of the dictionary codewords is more essential than the sparsity. An extension of SRC, denoted as Weighted Sparse Representation based Classification (WSRC) [88], integrates the local structure of the data into a sparse representation in a unified formulation.

While the previous methods can only promote independent sparsity [141], one can partition variables into disjoint groups and promote group sparsity using the so-called group Lasso regularization [99]. To induce more sophisticated structured sparsity patterns, it becomes essential to use structured sparsity-inducing norms built on overlapping groups of variables [167], [66]. In a direction related to group sparsity, [42] proposed a more robust classification method using a structured sparse representation, while [33] introduced a kernelized version of SRC. The authors in [75] improve SRC by constructing a group structured dictionary by concatenating the sub-dictionaries of all classes. The authors in [62] introduced a class of structured sparsity inducing norms into the SRC framework to model various corruptions in face images caused by misalignment, shadow (due to illumination change) and occlusion, and developed an automatic face alignment method based on minimizing the structured sparsity norm. Different from these techniques, the authors in [150] proposed a framework called adaptive sparse representation-based classification (ASRC), in which sparsity and correlation are jointly considered. Specifically, when the samples are of low correlation, ASRC selects the most discriminative samples for representation, like SRC. When the training samples are highly correlated, ASRC selects most of the correlated and discriminative samples for representation, rather than choosing some related samples randomly. Two critical issues were addressed in [68]: the first, when there is a large, but sufficient, number of representative identities, and the second, when uncorrupted training images cannot be guaranteed for every identity. A sparse- and dense-hybrid representation (SDR) framework was proposed to alleviate these problems of SRC. The authors in [68] further propose a procedure of supervised low-rank (SLR) dictionary decomposition to facilitate the proposed SDR framework. In addition, they addressed the problem of corrupted training data with their proposed SLR dictionary decomposition.

Component sparse representation based face recognition. In [81] the image is first divided into modules and each module is processed separately to determine its reliability. A reconstructed image from the modules, weighted by their reliability, is formed for robust recognition. The modular sparsity and the residual are used jointly to determine the modular reliability. In comparison to related state-of-the-art methods, experimental results on benchmark face databases confirmed an improvement. The authors in [155] proposed that the SR encoding be performed on local image patches rather than on the entire face. The obtained sparse signals are pooled via averaging to form multiple region descriptors, which then form an overall face descriptor. Owing to the deliberate loss of spatial relations within each region (caused by averaging), the resulting descriptor is robust to misalignment and various image deformations up to a certain degree. Within the proposed framework, the same authors evaluated several SR encoding techniques: $\ell_1$ minimization, sparse autoencoder neural networks (SANN) and an implicit probabilistic technique based on Gaussian mixture models. Their experiments demonstrated that $\ell_1$ minimization based encoding has a considerably higher computational cost compared with SANN-based and probabilistic encoding, but provides higher recognition rates.

5.1.3 Decisions Aggregation Over Local NT Representation

Our aggregation concept relies on recognition decisions that use the local NT representation, and it consists of the two usual phases: training and testing. In the following, we describe the respective phases.

Training Phase Overall, our training phase consists of three components:

(a) Codebook construction using local image blocks

(b) Block encoding, which represents the estimation of the local NT representation w.r.t. the local codebooks

(c) Construction of local bag-of-words codes using the training local NT representations.


Fig. 5.1 The illustration of the clustering over local blocks.

Codebook Construction - Given a training set of images $X = [x_1, \ldots, x_{CK}] \in \Re^{N \times CK}$, we denote the local training set, consisting of all the image patches at location $j \in \mathcal{B} = \{1, \ldots, B\}$, as:

$$B_j = [b_{j,1}, \ldots, b_{j,CK}] \in \Re^{L \times CK}, \ j \in \mathcal{B}, \quad b_{j,i} = E_j x_i \in \Re^L, \ i \in \{1, \ldots, CK\}, \quad (5.4)$$

where the matrix $E_j \in \{0,1\}^{L \times N}$ is used to extract the image patch $b_{j,i}$ at location $j$. The local codebook learning aims at finding a set of codes:

$$D_j = [d_{j,1}, \ldots, d_{j,S}] \in \Re^{L \times S}, \quad (5.5)$$
that covers the space of the available local data patches $B_j$ and, since $S \ll CK$, coarsely describes the space of the available local data variability. The codebook is a solution to the following problem:

$$\{\hat{D}_j, \{\hat{w}_{j,1}, \ldots, \hat{w}_{j,CK}\}\} = \arg\min_{D_j, \{w_{j,1}, \ldots, w_{j,CK}\}} \sum_{i=1}^{CK} \|b_{j,i} - D_j w_{j,i}\|_1$$
$$\text{subject to } \|w_{j,i}\|_0 = 1, \ w_{j,i} \in \{0,1\}^S, \ \forall i \in \{1, \ldots, CK\}, \quad (5.6)$$
where $w_{j,i}$ is the synthesis sparse representation. Note that under the constraints $\|w_{j,i}\|_0 = 1$ and $w_{j,i} \in \{0,1\}^S$, (5.6) becomes equivalent to the k-means algorithm [26]. Nonetheless, the solution results in a set of $S$ basis vectors $D_j = [d_{j,1}, \ldots, d_{j,S}]$. We illustrate the codebook construction process in Figure 5.1.
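A minimal sketch of the local codebook construction (5.6) via plain k-means over the blocks $B_j$ at one location; the initialization and the number of iterations are illustrative and need not match the exact clustering settings used in the experiments.

```python
import numpy as np

def learn_local_codebook(Bj, S, iters=50, seed=0):
    """k-means over the columns of Bj (L x CK); returns D_j with S codewords (L x S)."""
    rng = np.random.default_rng(seed)
    CK = Bj.shape[1]
    Dj = Bj[:, rng.choice(CK, size=S, replace=False)].copy()   # random initialization
    for _ in range(iters):
        # assignment step: nearest codeword per block (the one-hot w_{j,i} of (5.6))
        d2 = ((Bj[:, None, :] - Dj[:, :, None]) ** 2).sum(axis=0)   # S x CK distances
        assign = d2.argmin(axis=0)
        # update step: each codeword becomes the mean of its assigned blocks
        for s in range(S):
            members = Bj[:, assign == s]
            if members.shape[1] > 0:
                Dj[:, s] = members.mean(axis=1)
    return Dj
```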

A Sparse Likelihood Approximation Expressed by a NT Representation - The core of the proposed approach is to use a low complexity encoding based on a likelihood function while aiming at a robust estimate. Therefore, in the following subsection, we first define a likelihood vector w.r.t. a codebook. Then, a generalized view of the sparse likelihood approximation is given and the correspondence to maximum a posteriori estimation is shown. Afterwards, we present the probabilistic equivalents of hard and soft thresholding. We propose and describe the $\frac{\ell_1}{\ell_2}$-norm ratio constrained likelihood approximation, next to the main result about its convergence. At the end of the subsection, we explain the local subject code estimation.

We assume that a codebook $D_j \in \Re^{L \times S}$, consisting of $S$ codewords $[d_{j,1}, \ldots, d_{j,S}]$, and an image patch $b_{j,i}$ are given. We denote the probability (likelihood) of a single codeword $d_{j,s}$ given $b_{j,i}$ as:
$$p(d_{j,s} \mid b_{j,i}) \sim \frac{1}{\eta}\, e^{-\frac{d(d_{j,s},\, b_{j,i})}{\sigma}}, \quad s \in \{1, 2, \ldots, S\}, \quad (5.7)$$
and define the likelihood vector as:

$$l_{j,i} = \phi(D_j^T, b_{j,i}) = \left[p(d_{j,1} \mid b_{j,i}), \ldots, p(d_{j,S} \mid b_{j,i})\right] \simeq \frac{1}{\eta}\left[e^{-\frac{d(d_{j,1},\, b_{j,i})}{\sigma}}, \ldots, e^{-\frac{d(d_{j,S},\, b_{j,i})}{\sigma}}\right] \in \Theta, \quad (5.8)$$

where $d(\cdot,\cdot): \Re^N \times \Re^N \to \Re^+$ is a similarity measure, $\eta = \sum_{s=1}^{S} e^{-\frac{d(d_{j,s},\, b_{j,i})}{\sigma}}$ and $\sigma$ is a scaling parameter. The set of probability vectors $\Theta$ (equivalently the probability simplex) is denoted as:

$$\Theta = \left\{l_{j,i} : \sum_{s=1}^{S} l_{j,i}(s) = 1, \ \forall i, \ l_{j,i}(s) \ge 0, \ S < \infty\right\}. \quad (5.9)$$
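The likelihood vector (5.7)-(5.8) is essentially a softmax over negative distances to the codewords. A minimal sketch, assuming the squared Euclidean distance as the similarity measure $d(\cdot,\cdot)$ (an illustrative choice):

```python
import numpy as np

def likelihood_vector(Dj, b, sigma=1.0):
    """Likelihood vector l_{j,i} of (5.8) for a block b w.r.t. codebook D_j (L x S)."""
    d = ((Dj - b[:, None]) ** 2).sum(axis=0)         # d(d_{j,s}, b_{j,i}) for all s
    z = -d / sigma
    z -= z.max()                                      # numerical stabilization
    e = np.exp(z)
    return e / e.sum()                                # lies on the probability simplex Theta
```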

Theorem 3 Given the likelihood vector $l_{j,i} = [p(d_{j,1} \mid b_{j,i}), \ldots, p(d_{j,S} \mid b_{j,i})] \in \Theta$, the maximum a posteriori probability vector $y_{j,i} = [p(b_{j,i} \mid d_{j,1}), \ldots, p(b_{j,i} \mid d_{j,S})] \in \Theta$ is equal to a solution of a constrained projection problem:

$$\Pi_m : \quad \hat{y}_{j,i} = \arg\min_{y_{j,i} \in \Theta} \rho(y_{j,i}, l_{j,i}), \quad \text{subject to } m(y_{j,i}) = 0, \quad (5.10)$$
where $m(y_{j,i})$ is a constraint on the properties of $y_{j,i}$. For the proof please see Appendix C.1.

In (5.10), the divergence measure $\rho(\cdot)$ is interpreted as a cost function. In the case of the Kullback-Leibler divergence, (5.10) is a constrained information projection [115]. In the case of the squared divergence ($\|\cdot\|_2^2$ norm), (5.10) is a constrained Euclidean projection. The latter is related to the Kullback-Leibler divergence through the direct [129] and reverse [134] Pinsker inequality. More importantly, it has low computational complexity.

Let $\rho(\cdot) = \frac{1}{2}\|\cdot\|_2^2$ and $m(\cdot) = \|\cdot\|_0 - k$; then the hard thresholding formulated as a constrained projection is:

$$\Pi_H : \quad \hat{y}_{j,i} = \arg\min_{y_{j,i} \in \Theta} \frac{1}{2}\|l_{j,i} - y_{j,i}\|_2^2, \quad \text{subject to } \|y_{j,i}\|_0 - k = 0, \ y_{j,i} \in \left\{0, \tfrac{1}{k}\right\}^S. \quad (5.11)$$
The solution is:
$$\hat{y}_{j,i}(s) = \begin{cases} 0, & \text{if } s \in \{\mathcal{S} \setminus \mathcal{S}_k\}, \\ \frac{1}{k}, & \text{if } s \in \mathcal{S}_k, \end{cases} \quad (5.12)$$
where $\mathcal{S} = \{1, \ldots, S\}$ is the set of all the indexes, $\mathcal{S}_k \subset \mathcal{S}$ is the set of the indexes of the $k$ largest values of $l_{j,i}$ and $\{\mathcal{S} \setminus \mathcal{S}_k\}$ is the complement of the set $\mathcal{S}_k$.
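A direct transcription of the hard thresholding solution (5.12): the $k$ largest likelihoods receive the uniform mass $1/k$ and all other entries are set to zero.

```python
import numpy as np

def hard_threshold_projection(l, k):
    """Solution (5.12) of the constrained projection Pi_H in (5.11)."""
    y = np.zeros_like(l)
    top_k = np.argsort(-l)[:k]      # indices of the k largest values of l
    y[top_k] = 1.0 / k
    return y
```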

Let $\rho(\cdot) = \frac{1}{2}\|\cdot\|_2^2$ and $m(\cdot) = \|\cdot\|_1 - k$; then the $\ell_1$ soft thresholding has the following constrained projection formulation:
$$\Pi_{\ell_1} : \quad \hat{y}_{j,i} = \arg\min_{y_{j,i} \in \Theta} \frac{1}{2}\|l_{j,i} - y_{j,i}\|_2^2, \quad \text{subject to } \|y_{j,i}\|_1 - k = 0. \quad (5.13)$$

The solution is:

$$\hat{y}_{j,i}(s) = \begin{cases} 0, & \text{if } l_{j,i}(s) - k \le 0, \\ \dfrac{l_{j,i}(s) - k}{\sum_{s_1 \in \{s_2 :\, l_{j,i}(s_2) - k \ge 0\}} \left(l_{j,i}(s_1) - k\right)}, & \text{otherwise}. \end{cases} \quad (5.14)$$
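Similarly, a transcription of the soft thresholding solution (5.14); the uniform fallback for an empty support is an added guard, not part of (5.14).

```python
import numpy as np

def soft_threshold_projection(l, k):
    """Solution (5.14) of the constrained projection Pi_l1 in (5.13)."""
    r = np.maximum(l - k, 0.0)        # l_{j,i}(s) - k where positive, 0 otherwise
    total = r.sum()
    if total <= 0:
        return np.full_like(l, 1.0 / l.size)   # guard: no component exceeds the threshold
    return r / total
```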

Denote an $\frac{\ell_1}{\ell_2}$-norm sparsity level $\kappa_D$ defined through $\frac{\sqrt{S} - \frac{\|l_{j,i}\|_1}{\|l_{j,i}\|_2}}{\sqrt{S} - 1} \le \kappa_D \le 1$ [60]. Let $\rho(\cdot) = \frac{1}{2}\|\cdot\|_2^2$ and $m(\cdot) = \|\cdot\|_2 - k_2$; then the $\frac{\ell_1}{\ell_2}$-norm ratio constrained projection problem has the following form:

$$\Pi_{\frac{\ell_1}{\ell_2}} : \quad \hat{y}_{j,i} = \arg\min_{y_{j,i} \in \Theta} \frac{1}{2}\|l_{j,i} - y_{j,i}\|_2^2, \quad \text{subject to } \|y_{j,i}\|_2 = k_2. \quad (5.15)$$

The constraint $y_{j,i} \in \Theta$ implies that the $\ell_1$-norm of $y_{j,i}$, i.e., $\|y_{j,i}\|_1$, has to be equal to 1. Furthermore, using the definition of the sparseness $\kappa_D$, the value $k_2$ is estimated as $k_2 = \frac{1}{\sqrt{S} - \kappa_D(\sqrt{S} - 1)}$. The main result about the general case [60], which also holds for (5.15), is given in the following theorem.

Theorem 4 For any vector $l_{j,i} \in \Upsilon$, where $\Upsilon$ is the space of $\frac{\ell_1}{\ell_2}$ sparsity reducible vectors defined as $\Upsilon = \{l_{j,i} : l_{j,i} \in \Theta, |l_{j,i}(r_1)| \ne |l_{j,i}(r_2)|, r_1 \ne r_2, \{r_1, r_2\} \subset \{1, \ldots, S\}, S < \infty\}$, and a sparsity level $\kappa$ smaller than the desired sparsity level $\kappa_D$, the solution of (5.15) converges to the optimal point in a finite number of iterations.

The proof is given in Appendix C.2. Problem (5.15) is solved iteratively. Considering a probability vector (e.g., a likelihood vector $l_{j,i}$), the previous theorem holds and (5.15) has a unique solution for all the probability vectors that lie in the intersection of the probability simplex and the space of $\frac{\ell_1}{\ell_2}$ sparsity reducible vectors. On the other hand, any vector that lies outside this vector space has several components with identical absolute values. If these components are also the highest values, and if the $\frac{\ell_1}{\ell_2}$ sparsity level is such that the solution of (5.15) lies in the subset of these components, then (5.15) will not have a unique solution.
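For the $\frac{\ell_1}{\ell_2}$-norm ratio constrained projection (5.15), a rough sketch of a Hoyer-style alternating procedure in the spirit of [60]: it keeps the $\ell_1$ norm at 1 (the simplex constraint), pushes the $\ell_2$ norm to $k_2$ and zeroes coordinates that turn negative. The implementation details may differ from the solver analyzed in Theorem 4.

```python
import numpy as np

def l1l2_projection(l, kappa_d, max_iter=100):
    """Approximate solution of (5.15): ||y||_1 = 1, ||y||_2 = k2, y >= 0."""
    S = l.size
    k2 = 1.0 / (np.sqrt(S) - kappa_d * (np.sqrt(S) - 1.0))   # target l2 norm from kappa_D
    s = l + (1.0 - l.sum()) / S                               # project onto the l1 = 1 plane
    zero = np.zeros(S, dtype=bool)
    for _ in range(max_iter):
        m = np.where(zero, 0.0, 1.0 / (S - zero.sum()))       # center of the restricted plane
        w = s - m
        a, b, c = w @ w, 2.0 * (m @ w), m @ m - k2 ** 2
        alpha = (-b + np.sqrt(max(b * b - 4.0 * a * c, 0.0))) / (2.0 * a + 1e-12)
        s = m + alpha * w                                     # hit the sphere ||s||_2 = k2
        if (s >= -1e-12).all():
            break
        zero |= s < 0                                         # fix negative coordinates at zero
        s[zero] = 0.0
        s[~zero] += (1.0 - s.sum()) / (~zero).sum()           # restore the l1 = 1 constraint
    return np.maximum(s, 0.0)
```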

Supervised Local Subject Code Construction - Given the local training set $B_j$ that was used for local codebook learning, the local subject codes are constructed in two steps. In the first step, we perform an encoding w.r.t. the learned local codebook $D_j$ by sparse likelihood approximation. This encoding results in a number of sets equal to the number of classes $K$:

$$\{Y_{j,1}, \ldots, Y_{j,K}\}, \quad (5.16)$$

Fig. 5.2 An illustration of the construction of the code $s_{j,k}$ for subject $k$ at image patch location $j$.

of M sparse likelihood approximations:

$$Y_{j,k} = [y_{j,(k-1)M+1}, \ldots, y_{j,kM}]. \quad (5.17)$$

In the second step, the local subject code $s_{j,k}$ for subject $k \in \mathcal{K} = \{1, \ldots, K\}$ is estimated using the set $Y_{j,k}$ as:

$$s_{j,k} = E_{f(y)}\{Y_{j,k}\}, \quad (5.18)$$

where $E_{f(y)}\{\cdot\}$ represents the expectation operator with respect to the function $f(y)$. However, since the type of the distortion in the training data is in general unknown, it is often assumed to be a uniform probability distribution, which leads to the so-called average-pooling:

$$s_{j,k} = \frac{1}{M}\sum_{z=(k-1)M+1}^{kM} y_{j,z}. \quad (5.19)$$

The set of local subject codes at position $j$ is represented by a matrix $S_j = [s_{j,1}, \ldots, s_{j,K}]$. We illustrate the scheme in Figure 5.2.
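A minimal sketch of the average-pooled subject codes (5.19) at one block location, assuming the NT representations are ordered class by class with $M$ samples per subject.

```python
import numpy as np

def subject_codes(Y, K, M):
    """Average-pool the NT representations into S_j = [s_{j,1}, ..., s_{j,K}] (cf. (5.19)).
    Y has shape S x (K*M), with its columns ordered class by class."""
    Sj = np.zeros((Y.shape[0], K))
    for k in range(K):
        Sj[:, k] = Y[:, k * M:(k + 1) * M].mean(axis=1)
    return Sj
```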

Test Phase The test phase consists of two parts: (a) block encoding, which represents the estimation of the local NT representation w.r.t. the local codebooks, and (b) aggregation of the local bag-of-words recognition decisions using the NT representations. The goal here is to find and fuse a reliable set of similar local blocks. Therefore, at this stage local scores are estimated and the recognition is performed based on the local score fusion. An illustration of the local score fusion is given in Figure 5.3.

Given a query image xq, a local image block is extracted as:

$$p_j = E_j x_q, \quad j \in \mathcal{B}. \quad (5.20)$$

Given $p_j$ and using the local codebook $D_j$ for the corresponding image patch location $j$, the respective likelihood vector $l_{j,q}$ is constructed using (5.8). Then the robust NT representation $y_{j,q}$ is estimated using (5.15).

Local Score - Having $y_{j,q}$ and the local subject codes $s_{j,k}$, $k \in \mathcal{K}$, the local score is computed as:

$$t_j = S_j^T y_{j,q} - \frac{1}{2}\Sigma_S \mathbf{1}, \quad (5.21)$$
where $\Sigma_S$ is a diagonal matrix with diagonal elements equal to the diagonal elements of the matrix $S_j^T S_j$ and $\mathbf{1} \in \Re^{K \times 1}$ is a unit vector.

Basic Fusion - The global fusion score used here is based on the maximum likelihood principle, which under the Gaussian assumption represents maximum cross-correlation. Moreover, the final score by basic fusion corresponds to the most likely subject $k \in \mathcal{K}$ that obtains the majority of the local decoding scores:

$$\hat{k} = \arg\max_{1 \le k \le K} \sum_{j=1}^{B} t_j(k). \quad (5.22)$$

Weighted Fusion - Given $t_j$, $\forall j \in \mathcal{B}$, denote its sparsity level as $h(j) = \frac{\sqrt{K} - \frac{\|t_j\|_1}{\|t_j\|_2}}{\sqrt{K} - 1}$. Define the vector of weight coefficients as $h = [h(1), \ldots, h(B)]$ and its NT representation $h_{\frac{\ell_1}{\ell_2}}$, estimated using (5.15) as $h_{\frac{\ell_1}{\ell_2}} = \Pi_{\frac{\ell_1}{\ell_2}}(h)$. Then, the final score by weighted fusion corresponds to the most likely subject $k$ that obtains the $h_{\frac{\ell_1}{\ell_2}}$-weighted majority of the votes:

$$\hat{k} = \arg\max_{1 \le k \le K} \sum_{j=1}^{B} h_{\frac{\ell_1}{\ell_2}}(j)\, t_j(k). \quad (5.23)$$
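A minimal sketch of the local score (5.21) and the two fusion rules (5.22)-(5.23); the weights passed to the weighted fusion are expected to be the $\frac{\ell_1}{\ell_2}$-projected sparsity levels $h$, computed beforehand.

```python
import numpy as np

def local_score(Sj, y_q):
    """Local score (5.21): S_j^T y_q - 0.5 * diag(S_j^T S_j)."""
    return Sj.T @ y_q - 0.5 * np.diag(Sj.T @ Sj)

def sparsity_level(t):
    """Hoyer-type sparsity h(j) of a local score vector t (used by the weighted fusion)."""
    K = t.size
    return (np.sqrt(K) - np.linalg.norm(t, 1) / np.linalg.norm(t, 2)) / (np.sqrt(K) - 1.0)

def basic_fusion(scores):
    """Basic fusion (5.22); scores is a list of B length-K local score vectors."""
    return int(np.argmax(np.sum(scores, axis=0)))

def weighted_fusion(scores, weights):
    """Weighted fusion (5.23); weights is the length-B vector of (projected) sparsity levels."""
    return int(np.argmax(np.sum(np.asarray(weights)[:, None] * np.asarray(scores), axis=0)))
```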

5.1.4 Asymptotic Analysis of Computational Cost and Memory Usage

This section gives an analysis of the computational cost and the memory usage for the two operation modes, training and recognition, of the proposed face recognition architecture. The training consists of local codebook learning, encoding with respect to the local codebook and subject code estimation. Since we use the k-means [26] algorithm, the computational cost of running k-means once for $i$ iterations is $O(iSKML)$, where $S$ is the number of clusters, $KM$ is the number of data samples and $L$ is the local block length. The encoding cost is $BSL$ addition operations and $BS$ multiplication operations for computing the distance measure. The sparse likelihood approximation per sample has a cost of $O(\log S)$.

Fig. 5.3 An illustration of the recognition based on aggregation over local bag-of-words decisions that use the local NT representations.

The cost of estimating one subject code is $SM$ addition operations, and for $K$ subjects the total cost is $KSM$ addition operations. The recognition consists of encoding and subject code matching. The encoding cost is $BSL$ addition operations and $BS$ multiplication operations for the distance computation. The sparse likelihood approximation has a cost of $O(\log S)$. The matching has a cost of $KS$ multiplications. The comparison of the computational cost using the approximate sparse representation and the sparse code likelihood approximation is shown in Table 5.1. In Table 5.1, L2 stands for a least squares solution, LLC stands for Locality constrained Linear Coding [151], SRC for Sparse Representation based face Classification [156], WSRC for Weighted Sparse Representation based face Classification [88], L1 stands for a least absolute deviation solution, 2L1 stands for a least absolute deviation with an L1 norm regularizer, BP for Basis Pursuit [156] and SLAR is the proposed method. The variable $p$ denotes the number of non-zero values of the vector $y$ and $d$ is the number of simplex vertices in the solution space of the sparsity constrained inverse problem. The memory usage depends on the size of the codebook, the length of the codebook codes and the number of subject codes. The proposed algorithm typically uses a smaller number of codebook codes than the number of training data samples.

Computational cost for the baseline, the approximate SR and the SLAR method

Method   Problem formulation                                                                        Computational cost
L2       $\hat{y} = \arg\min_y \|x - Wy\|_2^2$                                                       $O(N^2KM)$
LLC      $\hat{y} = \arg\min_y \|y\|_2^2$ subject to $\|x - Wy\|_2^2 \le N\varepsilon$               $O(N^2KM)$
SRC      $\hat{y} = \arg\min_y \|y\|_1$ subject to $\|x - Wy\|_2^2 \le N\varepsilon$                 $O(NKMp^2)$
WSRC     $\hat{y} = \arg\min_y \|Zy\|_1$ subject to $\|x - Wy\|_2^2 \le N\varepsilon$                $O(NKMp^2)$
L1       $\hat{y} = \arg\min_y \|x - Wy\|_1$                                                         $O((KM)^2N) + e^{\sqrt{d}\log d}$
2L1      $\hat{y} = \arg\min_y \|y\|_1$ subject to $\|x - Wy\|_1 \le N\varepsilon$                   $O((KM)^2N) + e^{\sqrt{d}\log d}$
BP       $\hat{y} = \arg\min_y \|y\|_1$ subject to $x = Wy$                                          $O((KM)^2N) + e^{\sqrt{d}\log d}$
SLAR     $\hat{y} = \arg\min_{y \in \Theta} \rho(\ell_x, y)$ subject to $\Omega(y) = 0$              $O(BSL)$

Table 5.1 Computational complexity in big O(.) notation

Memory usage for the baseline, the approximate SR and the SLAR method

Algorithm        L2, LLC, SRC, WSRC, L1, 2L1, BP     SLAR
Codebook size    KM                                  S < KM

Table 5.2 Memory usage.

The memory requirement for the subject codes is low, since they are typically sparse. The comparison of the memory usage using the approximate sparse representation and the sparse code likelihood approximation is shown in Table 5.2.

5.1.5 Numerical Evaluation

A computer simulation is performed to compare the recognition rate of the proposed method with several state-of-the-art methods that are based on a solution of the inverse problem (5.1).

Face Image Databases Four publicly available face datasets were used:

− Extended Yale B [46] The Extended Yale B database consists of 2414 frontal face images of 38 subjects acquired under various laboratory-controlled but extreme lighting variations. All the images from this database were cropped and normalized to 192x168 pixels.
− AR [97] The AR database consists of over 4,000 frontal images of 126 individuals. For each individual, 26 pictures were taken in two separate sessions. These images include a variety of facial variations, including illumination change, expression variability and facial disguises. In the experiments here, we chose a subset of the data set consisting of 100 subjects. For each subject, 14 images with only illumination change and expressions were selected. The images were cropped and normalized to 165x120 pixels.
− PUT [1] The PUT database consists of high-resolution images of 100 people. The images were taken in controlled conditions under various pose variations. In our setup, we use a total of 2200 cropped images, normalized to 178x178 pixels.
− FERET [113] The FERET database consists of 13,539 facial images corresponding to 1,565 subjects, who are diverse across ethnicity, gender and age. In our experiments, we used two subsets, FERET-1 and FERET-2, from the FERET database. More precisely, we used 546 frontal face images from these sets, which we downscaled to a resolution of 128x128.

The facial part of each image in the above datasets was manually cropped and aligned according to the eye positions. In all of the computer simulations the face images were converted to grayscale. Raw image pixel values were used as basic elementary features.

Evaluation Setups Six different computer simulations were performed to demonstrate the recognition abilities of the proposed method. We present the recognition rate for the following set-ups: 1) using basic and weighted fusion with ℓ1/ℓ2 constrained projection, soft thresholding and hard thresholding; 2) under a varying number of training samples; 3) under a varying number of codebook codes; 4) under different databases and varying dimensionality; 5) under continuous occlusion; 6) under random pixel corruption.

In all of the experiments we used trained local codebooks D_j, j ∈ {1, 2, 3, ..., B}. The number of local codebooks is equal to the number of block locations B. Every local codebook D_j is learned using the k-means algorithm. In setups 1, 2, 3 and 4 we use a block size of L = 9, while in setups 5 and 6 we use different block sizes L ∈ {9, 16, 25, 225} in different simulations.

Fig. 5.4 Recognition results under basic fusion with ℓ1/ℓ2 constrained projection, soft thresholding and hard thresholding, and weighted fusion with ℓ1/ℓ2 constrained projection, soft thresholding and hard thresholding.

At training, over all setups 1, 2, 4, 5 and 6, we set the number of codebook codes to be equal to half of the available training data. Accordingly, in setup 2 the number of codebook codes varies. To be unbiased in the validation of the results, we report the average over 5 runs. Considering setups 1, 2, 4, 5 and 6, at a single run, for every subject, half of the images are selected at random for training and the remainder for testing. Considering setup 3, a varying number of training samples is used in the different simulations. At training, during the subject code estimation, we set the sparseness coefficient κ_D to a high value, i.e., κ_D = 0.96. At recognition, during the sparse activation estimation, we set the sparseness coefficient to a lower value, κ_D = 0.91.

Recognition Results In the following text, we discuss our results.
− Recognition Using Basic and Weighted Fusion with ℓ1/ℓ2 Constrained Projection, Soft and Hard Thresholding The database used for this experiment was Yale B and the down-sampling ratio was set to 1/16. The results are shown in Figure 5.4. They highlight several important points. Firstly, SLAR-ℓ1/ℓ2 has a consistently better performance compared to SLAR-SOFT and SLAR-HARD. Secondly, the weighted fusion improves the recognition accuracy; therefore, in all of the later experiments we use the weighted fusion. Finally, the recognition accuracy improves with the increase of the feature dimension.
− Recognition Under a Varying Number of Training Samples The database used was Yale B and the down-sampling ratio was set to 1/16. Here, a random subset of the images was selected for training and the remainder was used for testing.

Fig. 5.5 Recognition results under a varying number of training samples and a varying number of codebook codes.

Fig. 5.6 Comparative recognition results using Extended Yale B and AR.

Fig. 5.7 Comparative recognition results using the PUT and FERET data sets.

Fig. 5.8 Comparative recognition results under random corruption and continuous occlusion.

The following sizes of training subsets were used:

152, 266, 380, 494, 608, 760, 874, 988 and 1102,

while, correspondingly, the test results were obtained using subsets sampled at random with the following sizes: 1064, 950, 836, 722, 608, 456, 342, 228 and 114. The results are shown in Figure 5.5. They show that when the true data distribution and the data distortion are unknown, it is important to use more data samples for higher accuracy.
− Recognition Under a Varying Number of Codebook Codes The database used was Yale B, while the image down-sampling ratio was set to 1/16. Codebooks with the following numbers of codes were used: 32, 64, 128, 256, 512 and 1216. The results are shown in Figure 5.5. They demonstrate that the method offers a trade-off between memory usage and recognition accuracy: a small amount of memory can be used while preserving high recognition accuracy, since the reduction in accuracy is not drastic when the number of codebook codes is relatively small.
− Comparative Recognition Results Under Varying Dimensionality In this experiment, the down-sampling ratios used for the Yale B database are 1/32, 1/24, 1/16 and 1/8; for the AR and FERET databases they are 1/24, 1/18, 1/12 and 1/6; and for the PUT database they are 1/28, 1/21, 1/16 and 1/8. The results are shown in Figures 5.6 and 5.7. As we may see from these results, the proposed method has consistently higher recognition rates on all databases over all of the chosen dimensions, except for the two smallest dimensions on the Yale B database.
− Comparative Recognition Results Under Random Pixel Corruption Here the setup is equivalent to the one defined in [156]. The various levels of random noise, from 0 to 90 percent, are simulated by corrupting a percentage of randomly chosen pixels from each of the test images, replacing their values with independent and identically distributed samples from a uniform distribution. The results are shown in Figure 5.8. Higher recognition rates under random noise corruption are achievable, but at the cost of using bigger image blocks. This may be explained by the fact that the bigger blocks are less affected by uniform noise.
− Comparative Recognition Results Under Continuous Occlusion The various levels of contiguous occlusion, from 0 to 50 percent, are simulated by replacing a randomly located square block of each test image with an unrelated image. The results are shown in Figure 5.8. In the case of block occlusion, the blocks that come from the non-occluded regions are crucial for reliable and accurate results.

5.2 Supervised and Unsupervised Learning of Sparse and Discriminative NT for Image Recognition

In the past, sparsity and structured sparsity models were used for various practical problems. Applications in different areas were considered, including model-based compressive sensing [11], [4], [40], signal processing [8], [5], [101], discriminative dictionary learning for image classification [52], [69], [19], [45], [86], [15], [67] and [149], computer vision [65], [71], [70], [159], bio-informatics [154], [131], [72], and recommendation systems [76], [140], [124], [166]. Commonly, a sparse representation is attained using a synthesis model [2], [117] (that is, regression with regularization). This means that a data sample x_{c,k} ∈ ℜ^N is synthesized by a sparse linear combination (sparse representation) y_{c,k} ∈ ℜ^M over a dictionary D ∈ ℜ^{N×M}, where y_{c,k} is estimated by solving an inverse, reconstruction problem with a sparsity constraint ‖y_{c,k}‖₁ on y_{c,k}, i.e.,

    ŷ_{c,k} = argmin_{y_{c,k}} ‖x_{c,k} − D y_{c,k}‖₂² + λ₁ ‖y_{c,k}‖₁.   (5.24)

With the synthesis model, the computational complexity for estimating the sparse representation y_{c,k} can be high if the dictionary D is large or the dimensionality M is high. Besides the synthesis model, two other sparse models were proposed, i.e., the analysis model [127] and [56], and the sparsifying transform model [126], [127], [125] and [122].

In the sparsifying transform model, the sparse representation y_{c,k} is a nonlinear transform representation that is estimated using a linear mapping A x_{c,k} (with map A ∈ ℜ^{M×N}), which is then followed by thresholding (an element-wise nonlinearity); it represents a solution to a direct, constrained projection problem, i.e.,

    ŷ_{c,k} = argmin_{y_{c,k}} ‖A x_{c,k} − y_{c,k}‖₂² + λ₁ ‖y_{c,k}‖₁,   (5.25)

where the computational complexity for estimating the sparse representation is low, but, to the best of our knowledge, this model was not considered as a basis, nor extended or generalized, for learning sparse and discriminative image representations. An additional advantage of (5.25) is that it offers a high degree of freedom in modeling^1 and in imposing constraints other than the sparsity constraint on the representation y_{c,k}. This is useful since sparsity alone does not guarantee that the resulting representation will be discriminative.

^1 Many nonlinearities, e.g., ReLU, p-norms, elastic net-like, ℓ1/ℓ2-norm ratio, binary encoding, ternary encoding, etc., can be modeled as a generalized nonlinear transform representation.
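For concreteness, a minimal sketch of the solution of (5.25): the sparse transform representation is obtained by a linear map followed by element-wise soft thresholding of q = Ax (with threshold λ₁/2 for the objective as written, or λ₁ if the quadratic term carries a factor 1/2); the function name is ours.

```python
import numpy as np

def transform_sparse_code(A, x, lam):
    """Closed-form solution of (5.25): y = argmin ||A x - y||_2^2 + lam * ||y||_1,
    i.e. element-wise soft thresholding of q = A x at level lam / 2."""
    q = A @ x
    return np.sign(q) * np.maximum(np.abs(q) - lam / 2.0, 0.0)

# toy usage with a random map; names and sizes are illustrative only
rng = np.random.default_rng(1)
A = rng.normal(size=(32, 16))   # M x N linear map
x = rng.normal(size=16)
y = transform_sparse_code(A, x, lam=1.0)
print(np.count_nonzero(y), "non-zeros out of", y.size)
```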

In the past, in order to enforce a discriminative and sparse representation, the key was to use a synthesis model with constraints [163], [149], [53] and [86]. With the synthesis model, learning an overcomplete dictionary (M >> N) with or without constraints is challenging. On the other hand, the constraints on the sparse representations were shown to be effective [149], [53] and [86]. However, whether the used metric is optimal, as well as the interpretation and the probabilistic modeling connections, were not studied. In addition, the relations to the empirical risk and the generalization capabilities of the model were not fully explored. Also, in most of the proposed sparse and discriminative dictionary learning approaches [52], [69], [19], [45], [86], [15], [67] and [149], a formal notion that measures the discriminative properties is not addressed. Therefore, there are limited means that provide a quantitative evaluation of the quality of the representation, other than the recognition accuracy of a classifier that uses the representations.

5.2.1 Contributions

In this section, we study a nonlinear transform which, when applied to a data sample, produces a sparse and discriminative representation. In particular, we focus on a nonlinear transform which is expressible by a two-step operation consisting of (i) a linear mapping followed by (ii) a generalized element-wise nonlinearity. As the linear mapping consists of a number of linear projections, our motivation comes from the fact that we can restrict a unique subset of them to be aligned to the data samples from one class and to be orthogonal to the data samples from the rest of the classes. In this way, coupled with the generalized element-wise nonlinearity, we can obtain in the resulting NT representations sparse, non-zero patterns that are unique per class. An illustration of the main idea is given in Figure 5.9.
Nonlinear Model with Priors We introduce a generalized nonlinear transform model, where we use a minimum information loss prior to avoid trivially unwanted linear maps and model a novel discriminative prior to discriminate between the NT representations from different classes. A parametric min-max composite measure for dissimilarity, similarity and strength contribution that is defined on the support intersection describes our prior, while allowing a discrimination constraint without any assumption about the metric/measure and space/manifold in the transform domain. The key difference of our nonlinear transform model compared with the synthesis model with constraints is that we do not explicitly address the reconstruction of the data by a sparse linear combination. Rather than that, we not only address a low complexity constrained projection problem, where we estimate the sparse and discriminative representation as its solution, but we also offer interpretation, explanation and connections of the discrimination prior measure to the empirical risk of the model.

Learning Strategy In general, estimating the model parameters under the Maximum a Posteriori (MAP) criterion is difficult. Therefore, given observed data, the parameters of the model are learned by minimizing an approximation to the empirical expectation of the negative log-likelihood. While the empirical approximation to the expected negative log-likelihood of the discriminative prior provides an interpretation of the model's empirical risk, it also unfolds the optimization cost that has to be minimized during learning. We rely on the integrated marginal maximization principle that addresses the discrimination empirical risk and give an efficient solution that is based on an iterative, alternating algorithm with two stages. In the first stage, we propose an exact closed form solution for the estimation of the sparse and discriminative NT representations. In the second stage, we propose an approximate closed form solution for the linear map update.
Numerical Evaluation We validate our approach with numerical experiments, where we learn a NT in a supervised and an unsupervised setup, and use the corresponding NT representations to perform supervised and unsupervised image recognition. In the unsupervised case, we use a k-NN classifier, while in the supervised case, we train a linear classifier over the NT representations from our learned NT. Overall, the evaluation of our approach considers face identification, object recognition, digit recognition and fashion clothing recognition tasks. In summary, the results demonstrate improvements and competitive performance in comparison to the state-of-the-art methods of the same category, regarding the learning time, the run time, the discriminative quality and the recognition accuracy.

5.2.2 Related Work

In the following, we consider related work in several research communities.

Matrix Factorization Models Factor analysis [20] and matrix factorization [60] rely on the decomposition into hidden features, with or without constraints. When discrimination constraints are present, they act as regularizers that are mainly defined using labels.

Discriminative Dictionary Learning (DDL) Methods The DDL methods address three main elements: (i) a synthesis data model, (ii) a prior defined on the representation and (iii) a prior determining the relations between the representations under the knowledge of class labels. Usually, the discrimination is enforced by replacing the priors explained in (ii) and (iii) with a structural constraint on the dictionary or by imposing a discriminative term on the sparse representations. A comprehensive overview of DDL methods is given in [19], [15] and [45]. Concerning the specifics of the discriminative constraints, a synthesis model with a discriminative fidelity term and a Fisher discriminant constraint was proposed in [163], where the within-class scatter and the between-class scatter of the representation are minimized and maximized, respectively. The authors in [149] proposed an extension considering a low-rank constraint on the dictionary. The approach in [53] used a synthesis model with a constraint on the pair-wise relation between the sparse representations expressed by the ℓ2-distance metric. In the definition of the scatter and the pair-wise relations, the previous three methods ([163], [149] and [53]) take into account an assumption about the metric. Therefore, they constrain the space of the representation, which essentially is determined by the dictionary. However, these works do not consider whether the used metric is optimal w.r.t. the sparse representation. The method proposed by [86] finds a dictionary under which the representations of the data samples from the same class c have a common sparse structure, by minimizing the size of the support overlap for the representations from different classes. Assuming y_{c1,k1} ∈ ℜ^M and y_{c2,k2} ∈ ℜ^M are two sparse representations for two data samples x_{c1,k1} ∈ ℜ^N and x_{c2,k2} ∈ ℜ^N from two classes c1 and c2, they proposed a similarity measure defined by the empirical expectation of ‖y_{c1,k1} ⊙ y_{c2,k2}‖₀, where ⊙ represents the Hadamard product. Note that two transform data samples y_{c1,k1} and y_{c2,k2} that have a small support overlap ‖y_{c1,k1} ⊙ y_{c2,k2}‖₀ = s, s << M, might not necessarily be similar or dissimilar, e.g., y_{c1,k1} = y_{c2,k2} and y_{c1,k1} = −y_{c2,k2} with ‖y_{c1,k1}‖₀ = ‖y_{c2,k2}‖₀ = s.
Self-Supervision and Self-Organization In self-supervised learning [35], [112], the input data determines the labels. In self-organization [74], [145], a neighborhood function is used to preserve the topological properties of the input space. Both approaches leverage implicit discrimination using the data.

Discriminative Clustering In [160], clustering with maximum margin constraints was proposed. The authors in [7] proposed linear clustering based on a linear discriminative cost function with convex relaxation. In [80] regularized information maximization was proposed and simultaneous clustering and classifier training was performed. The above methods rely on kernels and have high computational complexity.

Auto-Encoders The single layer auto-encoder [10] and its denoising extension [147] consider robustness to noise and reconstruction. While the idea is to encode and decode the data using a reconstruction loss, an explicit constraint that enforces discrimination is not addressed.

5.2.3 Nonlinear Transform Model

Our approach is centered on two components:
(i) a nonlinear transform model with minimum information loss;

Fig. 5.9 An illustration of the idea of our NT transform, where we used different colors to denote the spaces of the data samples from different classes in the original and transform domains. The goal of our NT is to achieve discrimination by taking into account a minimum information loss prior on the linear map and a discrimination prior with a discrimination measure defined on the support intersection of the NT representations.

(ii) a novel discriminative prior that is described using a functional over a parametric measure.

The model describes a generalized nonlinear transform representation y_{c,k} that explains the data x_{c,k}, which we express as:

    p(y_{c,k} | x_{c,k}, A) = ∫_θ p(y_{c,k}, θ | x_{c,k}, A) dθ ∝ ∫_θ p(x_{c,k} | y_{c,k}, θ, A) p(y_{c,k}, θ | A) dθ,   (5.26)

where A ∈ ℜ^{M×N} is the linear map of the transform and θ is a parameter that allows the notion of discrimination to be formalized. Using the Bayes rule, we neglect the prior p(x | A) and focus on modeling the proportional form, i.e.,

    p(y_{c,k}, θ | x_{c,k}, A) ∝ p(x_{c,k} | θ, y_{c,k}, A) p(θ, y_{c,k} | A).   (5.27)

We further assume that the modeling of p(x_{c,k} | y_{c,k}, θ, A) does not depend on θ, i.e.:

    p(x_{c,k} | θ, y_{c,k}, A) = p(x_{c,k} | y_{c,k}, A),   (5.28)

while in the modeling of p(θ, y_{c,k} | A), in order to simplify, we neglect the dependencies w.r.t. A, i.e.:

    p(θ, y_{c,k} | A) = p(θ, y_{c,k}).   (5.29)

By our model (5.27), we assume that the data sample x_{c,k}, indexed by k from a group c, is approximately sparsifiable under a linear transform A ∈ ℜ^{M×N}. The parametric prior p(θ, y_{c,k}) is responsible for discriminating the sparse transform representation y_{c,k} ∈ ℜ^M that belongs to class c from the rest of the transform representations coming from a class d ≠ c.

Nonlinear Transform Error The term z_{c,k} = A x_{c,k} − y_{c,k} is the nonlinear transform error vector that represents the deviation of A x_{c,k} from the targeted transform representation y_{c,k} in the transform domain. In the simplest form, we model the prior p(x_{c,k} | y_{c,k}, A) for the residual vector z_{c,k} = A x_{c,k} − y_{c,k} as p(x_{c,k} | y_{c,k}, A) ∝ exp( −‖A x_{c,k} − y_{c,k}‖₂² / β₀ ), where β₀ is a scaling parameter. Nevertheless, additional knowledge about the residual vector z_{c,k} can be used in order to model a measure other than the ℓ2-norm.

Minimum Information Loss Prior To allow adequate coherence and conditioning of the transform matrix A, a prior p(A) ∝ exp(−Ω(A)) is used, which is equivalent to the one defined in Chapter 4. It penalizes the information loss in order to avoid trivially unwanted matrices A, i.e., matrices that have repeated or zero rows. We define the prior measure as follows:

    Ω(A) = (1/β₃) ‖A‖_F² + (1/β₄) ‖AAᵀ − I‖_F² − (1/β₅) log |det AᵀA|,   (5.30)

where ‖A‖_F² helps to regularize the scale ambiguity, log |det(AᵀA)| and ‖A‖_F² together help regularize the conditioning of A, and {β₃, β₄, β₅} are scaling parameters. Assume that the expected coherence µ²(A) between the rows a_m of A (i.e., A = [a₁, ..., a_M]ᵀ) is defined as

    µ²(A) = ( 2 / (M(M−1)) ) Σ_{m1 ≠ m2, m1,m2 ∈ {1,...,M}} |a_{m1}ᵀ a_{m2}|².   (5.31)

Then ‖AAᵀ − I‖_F² measures the expected coherence µ²(A) and the ℓ2 norms of the rows of A.
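A minimal numpy sketch of the prior measure Ω(A) in (5.30) and the expected coherence µ²(A) in (5.31); the function names are ours and the scaling parameters β₃, β₄, β₅ are passed explicitly.

```python
import numpy as np

def min_info_loss_prior(A, beta3, beta4, beta5):
    """The prior measure Omega(A) from (5.30): scale regularization,
    coherence/row-norm regularization and a log-det barrier against
    rank deficiency (A is assumed to have full column rank)."""
    M = A.shape[0]
    frob = np.linalg.norm(A, 'fro') ** 2
    coh = np.linalg.norm(A @ A.T - np.eye(M), 'fro') ** 2
    _, logdet = np.linalg.slogdet(A.T @ A)
    return frob / beta3 + coh / beta4 - logdet / beta5

def expected_coherence_sq(A):
    """The expected coherence mu^2(A) between the rows of A, as in (5.31)."""
    G = A @ A.T                     # pairwise inner products of the rows
    M = A.shape[0]
    off = G - np.diag(np.diag(G))   # zero out the m1 == m2 terms
    return 2.0 / (M * (M - 1)) * np.sum(off ** 2)

# toy usage with illustrative sizes
rng = np.random.default_rng(2)
A = rng.normal(size=(20, 10))
print(min_info_loss_prior(A, 1e6, 1e6, 1e6), expected_coherence_sq(A))
```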

5.2.4 Discriminative Prior

In this section, we consider the joint probability p(θ, y_{c,k}) = p(θ | y_{c,k}) p(y_{c,k}) and model the discriminative prior p(θ | y_{c,k}) as:

    p(θ | y_{c,k}) ∝ exp( −(1/β₂) l(θ, y_{c,k}) ),   (5.32)

where l(y_{c,k}, θ) is a discriminative, functional measure over the parameter θ = {θ₁, θ₂} = {{τ₁, ..., τ_C}, {ν₁, ..., ν_C}}, τ_c, ν_c ∈ ℜ^M, and β₂ is a scaling parameter. We use l(y_{c,k}, θ) to measure the similarity contribution of y_{c,k} w.r.t. the parameters θ₁ as well as the dissimilarity contributions of y_{c,k} w.r.t. the parameters θ₂. The parameters θ play a key role in describing our formal notion of discrimination. As we will see in the following subsection, a pair {τ_{c1}, ν_{c2}} from θ appears in the generalized element-wise nonlinear thresholding and normalization vectors that are applied over the linear transform representation A x_{c,k} in order to obtain the corresponding NT representation y_{c,k}. Whereas:

    p(y_{c,k}) ∝ exp( −(1/β₁) ‖y_{c,k}‖₁ ),   (5.33)

is an improper² sparsity prior on y_{c,k} and β₁ is a scaling parameter.
Prior Measure We assume that l(y_{c,k}, θ) is determined by a relation on the support intersection between the vectors y_{c,k} and θ.
− Basic Vector Support Intersection Measures We introduce two measures, ρ and ς, that are defined on the support intersection of vectors:

    ρ(y_{c,k}, τ_{c1}) = ‖y⁻_{c,k} ⊙ τ⁻_{c1}‖₁ + ‖y⁺_{c,k} ⊙ τ⁺_{c1}‖₁,
    ς(y_{c,k}, τ_{c1}) = ‖y_{c,k} ⊙ τ_{c1}‖₂²,   (5.34)

where y_{c,k} = y⁺_{c,k} − y⁻_{c,k}, y⁺_{c,k} = max(y_{c,k}, 0) and y⁻_{c,k} = max(−y_{c,k}, 0). Both ρ and ς allow more freedom in imposing regularization on the discriminative properties without taking into account any additional assumptions. By using ρ, the focus is on the contributing components for similarity, ‖y⁻_{c,k} ⊙ τ⁻_{c1}‖₁ + ‖y⁺_{c,k} ⊙ τ⁺_{c1}‖₁, since, when y_{c,k}ᵀ τ_{c1} is considered, ‖y⁺_{c,k} ⊙ τ⁺_{c1}‖₁ + ‖y⁻_{c,k} ⊙ τ⁻_{c1}‖₁ captures only the contribution for the similarity, whereas ‖y⁻_{c,k} ⊙ τ⁺_{c1}‖₁ + ‖y⁺_{c,k} ⊙ τ⁻_{c1}‖₁ captures only the contribution for the dissimilarity between the vectors y_{c,k} and τ_{c1}. On the other hand, ς measures only the strength on the support intersection and does not take into account the contributions for similarity or dissimilarity.
− Unsupervised Prior Measure In the most general form, we propose the following unsupervised measure:

    l(y_{c,k}, θ) |_{under unknown label} = min_{c1} max_{c2} [ ρ(y_{c,k}, τ_{c1}) + ς(y_{c,k}, τ_{c1}) + ρ(y_{c,k}, ν_{c2}) ].   (5.35)

Assuming τ_{c1} and ν_{c2} are spread far apart in the transform domain, the minimum over ρ(y_{c,k}, τ_{c1}) + ς(y_{c,k}, τ_{c1}) ensures that y_{c,k} in the transform domain will be located at the closest τ_{c1} w.r.t. the additive composition of the measures ρ(y_{c,k}, τ_{c1}) + ς(y_{c,k}, τ_{c1}). At the same time, the maximum over ρ(y_{c,k}, ν_{c2}) ensures that y_{c,k} in the transform domain will be located at the farthest ν_{c2} w.r.t. the measure ρ(y_{c,k}, ν_{c2}).
− Supervised Prior Measure Given the class label, the discriminative prior (5.32) reduces to p(θ | y_{c,k}) = p(τ_c, ν_c | y_{c,k}), that is, the prior measure reduces to:

    l(y_{c,k}, θ) |_{under known label} = ρ(y_{c,k}, τ_c) − ρ(y_{c,k}, ν_c) + ς(y_{c,k}, τ_c),   (5.36)

where we used the fact that:

    min_{c1} max_{c2} [ ρ(y_{c,k}, τ_{c1}) + ς(y_{c,k}, τ_{c1}) + ρ(y_{c,k}, ν_{c2}) ] = min_{c1} min_{c2} [ ρ(y_{c,k}, τ_{c1}) − ρ(y_{c,k}, ν_{c2}) + ς(y_{c,k}, τ_{c1}) ].   (5.37)

² An improper prior is essentially a prior probability distribution that is infinitesimal over an infinite range, in order to add to one.

Reduction to Sparsifying Transform Model Note that if the similarity, dissimilarity and strength contributions between y_{c,k} and θ w.r.t. the support intersection based measure l(y_{c,k}, θ) are zero, then p(θ | y_{c,k}) has no influence in our NT model. In that case, our NT model reduces to the sparsifying transform model, i.e., p(x_{c,k} | y_{c,k}, A) p(y_{c,k}, θ) = p(x_{c,k} | y_{c,k}, A) p(y_{c,k}), with p(x_{c,k} | y_{c,k}, A) modeling the sparsifying transform error and p(y_{c,k}) modeling the sparsity prior. This shows that the NT model can also be seen as an extension of the sparsifying transform model [126], [125] and [122].
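A small sketch, under our own naming, of the support-intersection measures ρ and ς from (5.34) and of the supervised prior measure (5.36):

```python
import numpy as np

def rho(y, tau):
    """Support-intersection measure rho from (5.34): l1 mass of the
    sign-agreeing (positive-positive and negative-negative) overlap."""
    yp, ym = np.maximum(y, 0), np.maximum(-y, 0)
    tp, tm = np.maximum(tau, 0), np.maximum(-tau, 0)
    return np.sum(ym * tm) + np.sum(yp * tp)

def varsigma(y, tau):
    """Strength measure from (5.34): squared l2 norm of the Hadamard product."""
    return np.sum((y * tau) ** 2)

def supervised_measure(y, tau_c, nu_c):
    """The supervised prior measure (5.36) for a sample with known class c."""
    return rho(y, tau_c) - rho(y, nu_c) + varsigma(y, tau_c)

# toy usage
rng = np.random.default_rng(4)
y, tau_c, nu_c = rng.normal(size=(3, 8))
print(supervised_measure(y, tau_c, nu_c))
```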

5.2.5 Learning Nonlinear Transform with Priors

In this section, we introduce our problem formulation and unveil the empirical risk as the discriminative objective for the learning problem. Minimizing the exact negative logarithm of our learning model:

    p(Y, A | X) = p(Y | X, A) p(A | X) = ∏_{c=1}^{C} ∏_{k=1}^{K} [ ∫_θ p(y_{c,k}, θ | x_{c,k}, A) dθ ] p(A | x_{c,k}) ∝ ∏_{c=1}^{C} ∏_{k=1}^{K} [ ∫_θ p(x_{c,k} | y_{c,k}, A) p(θ | y_{c,k}) p(y_{c,k}) dθ ] p(A | x_{c,k}),   (5.38)

over Y, θ and A is difficult, since we have to integrate in order to compute the marginal and the partition function of the discrimination prior.

Instead of minimizing the exact negative logarithm of the marginal probability p(y_{c,k} | x_{c,k}, A) = ∫_θ p(x_{c,k} | y_{c,k}, A) p(θ | y_{c,k}) p(y_{c,k}) dθ, we consider minimizing the negative logarithm of its maximum point-wise estimate, i.e.,

    ∫_θ p(x_{c,k} | y_{c,k}, A) p(θ | y_{c,k}) p(y_{c,k}) dθ ≤ D p(x_{c,k} | y_{c,k}, A) p(θ^{est} | y_{c,k}) p(y_{c,k}),   (5.39)

where we assume that θ^{est} are the parameters for which the factor p(x_{c,k} | y_{c,k}, A) p(θ | y_{c,k}) p(y_{c,k}) attains its maximum value and D is a constant. Furthermore, we take the logarithm, use the proportional relation (5.27) and, by disregarding the partition function related to the prior (5.32), we end up with the following problem formulation:

    {Ŷ, θ̂, Â} = argmin_{Y,θ,A} v(Y, A)
               = argmin_{Y,θ,A} Σ_{c=1}^{C} Σ_{k=1}^{K} ( ‖A x_{c,k} − y_{c,k}‖₂² + λ₁ ‖y_{c,k}‖₁ )      [ −log p(x_{c,k} | y_{c,k}, A) − log p(y_{c,k}) ]
                 + Σ_{c=1}^{C} Σ_{k=1}^{K} λ₀ l(y_{c,k}, θ)      [ −log p(θ | y_{c,k}): the discriminative cost ]
                 + Ω(A),      [ −log p(A) ]   (5.40)

where {λ₀, λ₁} are inversely proportional to the scaling parameters {β₂, β₁}.

Problem Formulation with Discriminative Cost in Implicit Form In the following, instead of using the parameters θ, we use a more appealing approximation, which not only provides a simplification in the learning stage, but also gives the possibility for an efficient solution and for interpretations. Our problem formulation with a discriminative cost in implicit form is the following:

    {Ŷ, Â} = argmin_{Y,A} v(Y, A) = argmin_{Y,A} Σ_{c=1}^{C} Σ_{k=1}^{K} ( (1/2) ‖A x_{c,k} − y_{c,k}‖₂² + λ₁ ‖y_{c,k}‖₁ ) + λ₀ L_P(Y) + Ω(A),   (5.41)

where λ₀ L_P(Y) takes the role of Σ_{c,k} l(y_{c,k}, θ); in the supervised case, L_P(Y) = D^P_{ℓ1}(X) − R^P_{ℓ1}(X) + S^P_{ℓ2}(X), and in the unsupervised case, L_P(Y) = 𝒟^P_{ℓ1}(X) − ℛ^P_{ℓ1}(X) + 𝒮^P_{ℓ2}(X).
− Supervised Discrimination Prior Likelihood in Implicit Form Using the labels and all of the nonlinear transform representations, we can implicitly express the parameters θ such that they are eliminated from (5.40). To do so, note that we use an empirical approximation, i.e.:

    E[ −log p(θ | y_{c,k}) ] ≃ E[ l(y_{c,k}; θ) ] ∼ (1/(CK)) Σ_{c=1}^{C} Σ_{k=1}^{K} l(y_{c,k}, θ),   (5.42)

which can also be expressed as:

    (1/(CK)) Σ_{c=1}^{C} Σ_{k=1}^{K} l(y_{c,k}, θ) |_{under known labels} ∼ (1/(CK)) ( D^P_{ℓ1}(X) − R^P_{ℓ1}(X) + S^P_{ℓ2}(X) ),   (5.43)

where:

    D^P_{ℓ1}(X) = Σ_{c=1}^{C} Σ_{k=1}^{K} Σ_{c1 ∈ {1,...,C}\{c}} Σ_{k1=1}^{K} ρ(y_{c,k}, y_{c1,k1}),
    R^P_{ℓ1}(X) = Σ_{c=1}^{C} Σ_{k=1}^{K} Σ_{k1 ∈ {1,...,K}\{k}} ρ(y_{c,k}, y_{c,k1}),
    S^P_{ℓ2}(X) = Σ_{c=1}^{C} Σ_{k=1}^{K} Σ_{c1 ∈ {1,...,C}\{c}} Σ_{k1=1}^{K} ς(y_{c,k}, y_{c1,k1}),   (5.44)

which approximate the expected dissimilarity contribution, the expected similarity contribution and the expected strength contribution, respectively. In general, assuming A is known, the term (5.43) establishes a link between a parametric and a non-parametric modeling view (the proof is given in Appendix C.6). Note that the Fisher discriminant constraint [163], the pairwise constraint [53] and the support intersection constraint [86] are all approximations of a discriminative prior. The advantage of using (5.43) is that it is without any prior on the probability distributions p(θ) and without any explicit assumption about the metric/measure or space/manifold in the transform domain.
− Unsupervised Discrimination Prior Likelihood in Implicit Form In the unsupervised case, we use constrained-likelihood-based vectors e_{c,k}, ē_{c,k} ∈ ℜ^{CK−1} (Appendix C.4) and all of the NT representations in order to implicitly express the parameters θ such that they are eliminated from (5.40). Similarly as in the supervised case, we can use the empirical approximation (5.42), where now we also take into account a constrained likelihood (Appendix C.4) as an encoding function, and express the expected logarithm of the prior as:

    (1/(CK)) Σ_{c=1}^{C} Σ_{k=1}^{K} l(y_{c,k}, θ) |_{under unknown labels} ∼ (1/(CK)) ( 𝒟^P_{ℓ1}(X) − ℛ^P_{ℓ1}(X) + 𝒮^P_{ℓ2}(X) ),   (5.45)

where:

    𝒟^P_{ℓ1}(X) = (1/(CK−1)) Σ_{c=1}^{C} Σ_{k=1}^{K} Σ_{c1=1}^{C} Σ_{k1=1}^{K} ē_{c,k}(i_{c1,k1}) ρ(y_{c,k}, y_{c1,k1}),
    ℛ^P_{ℓ1}(X) = (1/(CK−1)) Σ_{c=1}^{C} Σ_{k=1}^{K} Σ_{c1=1}^{C} Σ_{k1=1}^{K} e_{c,k}(i_{c1,k1}) ρ(y_{c,k}, y_{c1,k1}),
    𝒮^P_{ℓ2}(X) = (1/(CK−1)) ( Σ_{c=1}^{C} Σ_{k=1}^{K} Σ_{c1=1}^{C} Σ_{k1=1}^{K} Σ_{c2=1}^{C} Σ_{k2=1}^{K} e_{c,k}(i_{c1,k1}) e_{c,k}(i_{c2,k2}) ς( y_{c,k}, (y_{c1,k1} ⊙ y_{c2,k2})^{√⊙} )
                  + Σ_{c=1}^{C} Σ_{k=1}^{K} Σ_{c1} Σ_{k1} e²_{c,k}(i_{c1,k1}) ς(y_{c,k}, y_{c1,k1}) ),   (5.46)

while √⊙ denotes the Hadamard square root, i.e., (y_{c,k} ⊙ y_{c,k})^{√⊙} = y_{c,k}, and i_{c1,k1} = k1 + (c1 − 1)K and i_{c2,k2} = k2 + (c2 − 1)K. The proof is given in Appendix C.7.

Interpretations and Connections In the following, we provide interpretations of the discriminative cost in implicit form and show the connection to the discrimination related empirical risk.
− Connection to Discrimination Density To add more to the interpretation of the discrimination component L_P(Y) in our objective (5.41), we note that (5.43) (or (5.45)) also represents an approximation to a regularized discrimination density E[ −log ( p(θ₁, y_{c,k}) / p(θ₂, y_{c,k}) ) ], i.e.,

    E[ −log ( p(θ₁, y_{c,k}) / p(θ₂, y_{c,k}) ) ] ∼ D^P_{ℓ1}(X) − R^P_{ℓ1}(X),   (5.47)

where the difference D^P_{ℓ1}(X) − R^P_{ℓ1}(X) (or 𝒟^P_{ℓ1}(X) − ℛ^P_{ℓ1}(X)) can be seen as a finite sample approximation of E[ −log ( p(θ₁, y_{c,k}) / p(θ₂, y_{c,k}) ) ] (the proof is given in Appendix B), while S^P_{ℓ2}(X) (or 𝒮^P_{ℓ2}(X)) represents the regularization, i.e., the expected strength on the support intersection.
− Connection to Discrimination Related Empirical Risk Our main goal is to estimate a NT that produces discriminative representations, which, w.r.t. our model, boils down to estimating the model parameters that minimize the empirical expectation of the discrimination prior, i.e.,

    E[ −log p(θ | y_{c,k}) ] = E[ l(θ, y_{c,k}) ] + E[ z(θ, y_{c,k}) ],   (5.48)

where E[ z(θ, y_{c,k}) ] is the partition function, while E[ l(θ, y_{c,k}) ] ∼ (1/(CK)) Σ_{c,k} l(y_{c,k}; τ_c). Furthermore, we showed that by considering (5.43) (or (5.45)) and having the corresponding class labels (or the constrained-likelihood-based vectors) for the training set X, the term Σ_{c,k} l(y_{c,k}; τ_{c1}) exactly equals D^P_{ℓ1}(X) − R^P_{ℓ1}(X) + S^P_{ℓ2}(X) (or 𝒟^P_{ℓ1}(X) − ℛ^P_{ℓ1}(X) + 𝒮^P_{ℓ2}(X)).
On the other hand, the minimum of D^P_{ℓ1}(X) − R^P_{ℓ1}(X) + S^P_{ℓ2}(X) (or of 𝒟^P_{ℓ1}(X) − ℛ^P_{ℓ1}(X) + 𝒮^P_{ℓ2}(X)) can be reached in two cases. The first one is when the NT is estimated such that the resulting NT representations have zero dissimilarity contribution, zero similarity contribution and zero strength contribution on the support intersection, which leads to their distinguishability and discrimination. The second case is when the NT is estimated such that the resulting NT representations contribute towards a balance between the dissimilarity contribution and the similarity contribution, which is a form of trade-off. In both cases, D^P_{ℓ1}(X) − R^P_{ℓ1}(X) + S^P_{ℓ2}(X) (or 𝒟^P_{ℓ1}(X) − ℛ^P_{ℓ1}(X) + 𝒮^P_{ℓ2}(X)) can be seen as the empirical risk [144] of the model, which also gives connections to its generalization capabilities [144], [95], and it indicates the objective that is responsible for the discrimination functionality of the nonlinear transform.
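To make the empirical terms concrete, the following sketch evaluates D^P_{ℓ1}(X), R^P_{ℓ1}(X) and S^P_{ℓ2}(X) from (5.44) for a set of labeled representations, together with the discrimination power log R − log(D + ε) used later in Section 5.2.7; the vectorized helpers and all names are ours.

```python
import numpy as np

def rho_matrix(Y):
    """Pairwise rho(y_i, y_j) from (5.34) for all columns of Y (M x n)."""
    Yp, Ym = np.maximum(Y, 0), np.maximum(-Y, 0)
    return Yp.T @ Yp + Ym.T @ Ym                   # n x n

def varsigma_matrix(Y):
    """Pairwise ||y_i (Hadamard) y_j||_2^2 for all columns of Y."""
    return (Y ** 2).T @ (Y ** 2)                   # n x n

def supervised_terms(Y, labels):
    """Empirical D, R and S from (5.44) for labeled NT representations.
    Y is M x n with one representation per column; labels has length n."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    diff = ~same
    off_diag_same = same & ~np.eye(labels.size, dtype=bool)
    P, W = rho_matrix(Y), varsigma_matrix(Y)
    D = P[diff].sum()                              # dissimilarity contribution
    R = P[off_diag_same].sum()                     # similarity contribution
    S = W[diff].sum()                              # strength contribution
    return D, R, S

# toy usage: discrimination power in the spirit of (5.60)
rng = np.random.default_rng(3)
Y = rng.normal(size=(50, 30))
labels = rng.integers(0, 3, size=30)
D, R, S = supervised_terms(Y, labels)
print(np.log(R) - np.log(D + 1e-12))
```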

5.2.6 A Solution by Iterative Alternating Algorithm

Jointly solving (5.41) over {A, Y} is again difficult, even with the discrimination cost in implicit form. Alternatively, the solution of (5.41) per any of the variables A or Y can be seen as an integrated marginal maximization (IMM) of p(Y, A | X) = p(Y | X, A) p(A | X) that is approximated by (5.74) and (5.39), which is equivalent to:

1) approximately maximizing p(x_{c,k} | y_{c,k}, A) together with the implicit form of the prior p(θ, y_{c,k}) = p(θ | y_{c,k}) p(y_{c,k}) over y_{c,k};
2) approximately maximizing ∏_{c=1}^{C} ∏_{k=1}^{K} p(x_{c,k} | y_{c,k}, A) together with the prior p(A) = p(A | x_{c,k}) over A.

In this sense, based on the IMM principle, we propose an iterative, alternating algorithm that has two stages: (i) NT representation y_{c,k} estimation, and (ii) linear map A update.

NT Representation Estimation Note that our model (5.63) has no restrictions on the transform representation y_{c,k} w.r.t. the transform matrix A. However, the estimation of y_{c,k} depends on the approximation to the prior measure l(θ, y_{c,k}). Nonetheless, given A and the class labels (or using an encoding function), note that (5.41) decomposes per y_{c,k}. Moreover, we show that the approximation (5.43) (or (5.45)) leads to an efficient solution.

Given A and Y_{\{c,k\}} = [y_{1,1}, ..., y_{c,k−1}, y_{c,k+1}, ..., y_{C,K}], denote q_{c,k} = A x_{c,k}; then, for any y_{c,k}, (5.41) reduces to the following constrained projection problem:

    ŷ_{c,k} = min_{y_{c,k}} (1/2) ‖A x_{c,k} − y_{c,k}‖₂² + λ₀ ( g_cᵀ |y_{c,k}| − v_cᵀ |y_{c,k}| + s_cᵀ (y_{c,k} ⊙ y_{c,k}) ) + λ₁ 1ᵀ |y_{c,k}|,   (5.49)

and if (λ₀(g_c − v_c) + λ₁ 1)ᵀ |y_{c,k}| ≥ 0, it has a closed form solution:

    y_{c,k} = sign(q_{c,k}) ⊙ max( |q_{c,k}| + λ₀ (v_c − g_c) − λ₁ 1, 0 ) ⊘ (1 + 2 λ₀ s_c),   (5.50)

where ⊘ denotes the Hadamard (element-wise) division and

    g_c = sign(max(q_{c,k}, 0)) ⊙ d₁⁺ + sign(max(−q_{c,k}, 0)) ⊙ d₁⁻,
    v_c = sign(max(q_{c,k}, 0)) ⊙ d₂⁺ + sign(max(−q_{c,k}, 0)) ⊙ d₂⁻.   (5.51)

− Supervised Case In the supervised case,

    d₁⁺ = Σ_{c1 ∈ {1,...,C}\{c}} Σ_{k1=1}^{K} y⁺_{c1,k1} = Y⁺ e_{1,c},
    d₁⁻ = Σ_{c1 ∈ {1,...,C}\{c}} Σ_{k1=1}^{K} y⁻_{c1,k1} = Y⁻ e_{1,c},
    d₂⁺ = Σ_{k1 ∈ {1,...,K}\{k}} y⁺_{c,k1} = Y⁺ e_{2,c},
    d₂⁻ = Σ_{k1 ∈ {1,...,K}\{k}} y⁻_{c,k1} = Y⁻ e_{2,c},
    s_c = Σ_{c1 ∈ {1,...,C}\{c}} Σ_{k1=1}^{K} y_{c1,k1} ⊙ y_{c1,k1} = (Y ⊙ Y) e_{1,c},   (5.52)

where |y_{c,k}| denotes the vector having as elements the absolute values of the corresponding elements of y_{c,k}, and e_{1,c} ∈ ℜ^{CK} and e_{2,c} ∈ ℜ^{CK} are the respective encoding vectors. The proof is given in Appendix C.6.
− Unsupervised Case In the unsupervised case, we first denote U_{\{c,k\}} = (Y^{t−1}_{\{c,k\}})ᵀ and u_{c,k} = y^{t−1}_{c,k} as the estimates at iteration t−1 and construct all r_{c,k} and r̄_{c,k} as follows:

    r_{c,k} = U⁻_{\{c,k\}} u⁻_{c,k} + U⁺_{\{c,k\}} u⁺_{c,k},    r̄_{c,k} = U⁻_{\{c,k\}} u⁺_{c,k} + U⁺_{\{c,k\}} u⁻_{c,k}.   (5.53)

Furthermore, we compute the corresponding e_{c,k} and ē_{c,k} using a constrained likelihood as given in Appendix C.4, and then evaluate

    d₁⁺ = Σ_{c1=1}^{C} Σ_{k1=1}^{K} ē_{c,k}(i_{c1,k1}) y⁺_{c1,k1} = Y⁺ ē_{c,k},
    d₁⁻ = Σ_{c1=1}^{C} Σ_{k1=1}^{K} ē_{c,k}(i_{c1,k1}) y⁻_{c1,k1} = Y⁻ ē_{c,k},
    d₂⁺ = Σ_{c1=1}^{C} Σ_{k1=1}^{K} e_{c,k}(i_{c1,k1}) y⁺_{c1,k1} = Y⁺ e_{c,k},
    d₂⁻ = Σ_{c1=1}^{C} Σ_{k1=1}^{K} e_{c,k}(i_{c1,k1}) y⁻_{c1,k1} = Y⁻ e_{c,k},
    s_c = Σ_{c1=1}^{C} Σ_{k1=1}^{K} Σ_{c2=1}^{C} Σ_{k2=1}^{K} e_{c,k}(i_{c1,k1}) e_{c,k}(i_{c2,k2}) y_{c1,k1} ⊙ y_{c2,k2} + Σ_{c1} Σ_{k1} e²_{c,k}(i_{c1,k1}) y_{c1,k1} ⊙ y_{c1,k1},   (5.54)

where i_{c1,k1} = k1 + (c1 − 1)K and i_{c2,k2} = k2 + (c2 − 1)K. The proof is given in Appendix C.7. Note that by (5.50) all y_{c,k} can be computed in parallel.
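A minimal numpy sketch of the supervised closed-form representation update (5.50)-(5.52), using the previous iterate in place of Y_{\{c,k\}}; the unsupervised variant additionally requires the encoding vectors of Appendix C.4 and is not reproduced here, and all names are ours.

```python
import numpy as np

def update_representations(A, X, Y_prev, labels, lam0, lam1):
    """One pass of the closed-form update (5.50)-(5.52), supervised case.
    X is N x n, Y_prev is M x n (previous iterate), labels has length n."""
    Q = A @ X                                     # q_{c,k} = A x_{c,k}
    Yp, Ym = np.maximum(Y_prev, 0), np.maximum(-Y_prev, 0)
    labels = np.asarray(labels)
    Y_new = np.empty_like(Q)
    for i in range(X.shape[1]):
        other = labels != labels[i]               # samples of the other classes
        same = (labels == labels[i]) & (np.arange(labels.size) != i)
        d1p, d1m = Yp[:, other].sum(1), Ym[:, other].sum(1)
        d2p, d2m = Yp[:, same].sum(1), Ym[:, same].sum(1)
        s = (Y_prev[:, other] ** 2).sum(1)        # strength term s_c
        q = Q[:, i]
        pos, neg = (q > 0).astype(float), (q < 0).astype(float)
        g = pos * d1p + neg * d1m                 # dissimilarity penalty (5.51)
        v = pos * d2p + neg * d2m                 # similarity reward (5.51)
        y = np.sign(q) * np.maximum(np.abs(q) + lam0 * (v - g) - lam1, 0.0)
        Y_new[:, i] = y / (1.0 + 2.0 * lam0 * s)  # Hadamard division in (5.50)
    return Y_new
```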

Linear Map Estimation Given the available data X and the corresponding transform representations Y, (5.40) reduces to a problem related to the estimation of the linear map A, i.e.,

    Â = argmin_A ‖AX − Y‖_F² + (λ₂/2) ‖A‖_F² + (λ₃/2) ‖AAᵀ − I‖_F² − λ₄ log |det AᵀA|,   (5.55)

where {λ₂, λ₃, λ₄} are inversely proportional to the scaling parameters {β₃, β₄, β₅}. At this step, we propose an approximate closed form solution. Given Y ∈ ℜ^{M×CK}, X ∈ ℜ^{N×CK} and M ≥ N, ∀λ₂ ≥ 0, λ₃ ≥ 0 and λ₄ ≥ 0, let the eigenvalue decomposition U_X Σ_X V_Xᵀ of XXᵀ + λ₂ I and the singular value decomposition U_{U_XᵀXYᵀ} Σ_{U_XᵀXYᵀ} V_{U_XᵀXYᵀ}ᵀ of U_Xᵀ X Yᵀ exist; then, if and only if Σ_X(n,n) > 0, ∀n ∈ N = {1,...,N}, (5.55) has an approximate solution:

    A = V_{U_XᵀXYᵀ} U_{U_XᵀXYᵀ}ᵀ Σ_A Σ_X⁻¹ U_Xᵀ,   (5.56)

where Σ_A is a diagonal matrix, Σ_A(n,n) = σ_A(n) ≥ 0, and the σ_A(n) are solutions to quartic polynomials (the proof is given in Chapter 4, Appendix B.1 and B.2).
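The full approximate solution (5.56) requires the per-direction quartic roots derived in Chapter 4; as a hedged simplification, the sketch below keeps only the data-fit and the λ₂ scale term of (5.55), whose minimizer is the ridge-regression map A = YXᵀ(XXᵀ + (λ₂/2) I)⁻¹.

```python
import numpy as np

def update_linear_map_simplified(X, Y, lam2):
    """Hedged stand-in for the linear map update: minimizes only
    ||A X - Y||_F^2 + (lam2 / 2) ||A||_F^2 from (5.55), dropping the
    coherence and log-det terms handled by the full solution (5.56)."""
    N = X.shape[0]
    return Y @ X.T @ np.linalg.inv(X @ X.T + (lam2 / 2.0) * np.eye(N))
```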

Convergence In the following, we give the algorithm convergence result.

Theorem 5 Given data X, the initial linear map A⁰ and the NT representations Y⁰, let {Aᵗ, Yᵗ} denote the iterative sequence generated by the approximate closed form solution (Chapter 4, Appendix B.1 and B.2) and the closed form solution (5.50). For any encoding function that produces all of the corresponding encoding vectors eᵗ_{1,c}, eᵗ_{2,c} (or eᵗ_{c,k}, ēᵗ_{c,k}) in order to construct the implicit discriminative objective L_P(Y^{t+1}), if it holds that:

    L_P(Y^{t+1}) ≤ L_P(Yᵗ), and   (5.57)
    ∃ λ₀, λ₁ such that (λ₀(g_c − v_c) + λ₁ 1)ᵀ |y_{c,k}| ≥ 0, ∀c ∈ {1,...,C}, k ∈ {1,...,K},   (5.58)

then the sequence of objective function values generated by the algorithm is a monotonically decreasing sequence:

    v(A^{t+1}, Y^{t+1}) ≤ v(A^{t+1}, Yᵗ) ≤ v(Aᵗ, Yᵗ),   (5.59)

and converges to a finite value, denoted as v*.

Proof: Let all e_{1,c} = e⁰_{1,c}, e_{2,c} = e⁰_{2,c} (or e_{c,k} = e⁰_{c,k} and ē_{c,k} = ē⁰_{c,k}) be fixed. Note that in the linear map update we use the approximate closed form solution that preserves the gradient (Chapter 4, Appendix B.1 and B.2), and in the NT representation estimation step we use the closed form solution (5.50) under the constraint (5.58). Therefore, we have a guaranteed decrease of the objective after one iteration, whether we update the linear map or the NT representation. Otherwise, if per every iteration we compute eᵗ_{1,c}, eᵗ_{2,c} (or eᵗ_{c,k} and ēᵗ_{c,k}) and (5.57) holds, then we have a guaranteed decrease of the objective from one iteration to the next, since we use the approximate closed form solution (Chapter 4, Appendix B.1 and B.2) and the exact closed form solution (5.50) with the constraint (5.58) satisfied. □

We can now claim that our iterative algorithm based on the IMM principle allows us to find only a joint local maximum in A and Y for ∏_{c=1}^{C} ∏_{k=1}^{K} p(x_{c,k} | y_{c,k}, A) p(θ, y_{c,k}), such that the conditional probabilities p(x_{c,k} | y_{c,k}, A) and the joint probabilities p(θ, y_{c,k}) are maximized.
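Putting the two stages together, a schematic alternating loop in the spirit of the IMM-based algorithm, reusing the update_representations and update_linear_map_simplified helpers sketched above; the stopping rule, the encoding-vector recomputation and the objective bookkeeping of Theorem 5 are omitted.

```python
import numpy as np

def learn_nt(X, labels, M, iters=28, lam0=0.1, lam1=0.1, lam2=1e6, seed=0):
    """Schematic alternation between the representation update and the
    (simplified) linear map update; a sketch, not the full algorithm."""
    rng = np.random.default_rng(seed)
    N, n = X.shape
    A = rng.normal(size=(M, N))     # random i.i.d. Gaussian initialization
    Y = np.zeros((M, n))            # first pass reduces to soft thresholding
    for _ in range(iters):
        Y = update_representations(A, X, Y, labels, lam0, lam1)
        A = update_linear_map_simplified(X, Y, lam2)
    return A, Y
```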

5.2.7 Evaluation of Algorithm Properties, Discrimination Quality and Recognition Accuracy

In this section, we introduce a measure that quantifies the discriminative properties of a data set under a transform. Then we give our evaluation and show a comparison to the state-of-the-art methods of the same category.
Quantifying the Discrimination Quality Using a relation between the two concentrations D^P_{ℓ1}(X) and R^P_{ℓ1}(X), we define the discriminative properties of a data set under a transform with parameter set P = {A ∈ ℜ^{M×N}, τ ∈ ℜ^M} as follows:

    I^t = log( R^P_{ℓ1}(X) ) − log( D^P_{ℓ1}(X) + ε ),   (5.60)

where ε > 0 is a small constant. By letting A = I and τ = 0 in P, (5.60) allows us to measure the discrimination power of the original data representations in a given data set X. The advantage of the measure (5.60) is that it logarithmically signifies the discrimination density, that is, the difference between R^P_{ℓ1}(X) and D^P_{ℓ1}(X) (the proof is given in Appendix C.5). The following numerical evaluation shows that this measure is important because it is implicitly related to the recognition capabilities. At the same time, it gives insight into the learning dynamics of the proposed algorithm. In the following text, we provide our numerical evaluation.
Data Sets The used data sets are Extended YALE B [47], AR [96], Norb [83], Coil-20 [105], Caltech101 [84], UKB [108], MNIST [82], F-MNIST [157] and SVHN [106]. All the images from the respective data sets were downscaled to resolutions of 21x21, 32x28, 24x24, 20x25, 21x21, 20x25, 28x28, 28x28 and 28x28, respectively, and were normalized to unit variance.
Evaluation Summary The numerical experiments consist of three parts.
− NT Properties In the first series of experiments, we evaluate the computational efficiency as the run time te[min] of the algorithm. In addition, we report the conditioning number κ_n(A) and the expected mutual coherence µ(A) of the linear map in the NT. The conditioning number is defined as κ_n(A) = σ_max / σ_min, where σ_min and σ_max are the smallest and the largest singular values of A, respectively, and the expected coherence µ(A) is computed by (5.31) (or as defined in Chapter 4) for the linear map A. We show the discrimination power for the NT representations that result from using the learned NT over the corresponding data set.
− NT Discrimination Quality In the second part, we present a comparison of the discrimination power under different transforms. We consider a linear mapping with an identity matrix, which corresponds to the original data domain, a random linear transform mapping using a random matrix as a linear map (with a transform dimension of M = 19000), and a learned NT with transform dimension M = 19000, without and with the discriminative prior. We denote these cases as I^O, I^RT, I^{ST*} and I^NT, respectively.
− Recognition Accuracy Comparison: Proposed vs. State-of-the-art The third part evaluates the discrimination power and the recognition accuracy using the representations from our model as features and compares them to several state-of-the-art methods, including: 1) supervised discriminative dictionary learning (DDL) methods [118], [162], [148] and [149];

Fig. 5.10 The evolution of the approximations C1 = R^P_{ℓ1}(X) and C2 = D^P_{ℓ1}(X), their ratio C1/C2 and the discrimination power log(C1/C2) = I^t during the learning of the nonlinear transform with transform dimension M = 9100.

Fig. 5.11 The conditioning number κ_n(A) and the expected mutual coherence µ(A) for the learned linear map A in the NT at different transform dimensionality M ∈ Q.

2) unsupervised feature learning methods [106], [63], [109]; 3) different classifiers [157]; and 4) deep neural networks [50], [93], [85], [139]. In this comparison, the used data are divided into a training and a test set. We use the NT representations in a k-NN classifier and also to train an SVM classifier, where the learning is performed on the training set and the evaluation is performed on the test set. In the comparison w.r.t. the DDL methods, we consider a setup where half of the data samples from the Extended YALE B data set, sampled at random, are used for learning and the remaining half are used for evaluation.
Algorithm Setup An on-line variant is used for the update of A w.r.t. a subset of the available training set. It has the form A^{t+1} = Aᵗ − ρ(Aᵗ − Âᵗ), where Âᵗ is the solution in the transform update step and ρ is a predefined step size. The used batch size is equal to 12% of the total amount of the available training data.
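The on-line relaxation of the linear map mentioned above is a one-line rule; a sketch, with the step size ρ as a free parameter:

```python
def online_update(A_t, A_hat, rho=0.1):
    """On-line relaxation of the linear map toward the per-batch solution
    A_hat: A^{t+1} = A^t - rho * (A^t - A_hat)."""
    return A_t - rho * (A_t - A_hat)
```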


             YALE B   AR     NORB   COIL20   MNIST   F-MNIST
SNT te[h]    .30      .33    .38    .53      .41     .42
UNT te[h]    .48      .69    .71    .46      1.9     1.9

Table 5.6 The execution time te[h] in hours of the supervised NT (SNT) and unsupervised NT (UNT) learning for the respective data sets.

The parameters λ0 and λ1 are set such that the resulting NT representation has a very small number of non-zeros w.r.t. the transform dimension; in the experiments this number is set to 15. The rest of the parameters are set as {λ2, λ3, λ4} = {1000000, 1000000, 1000000}. The algorithm is initialized with a random matrix having i.i.d. Gaussian (zero mean, variance one) entries and is terminated after the 28th iteration. The results are obtained as the average of 3 runs. The implementation from [149] was used to learn the dictionaries and estimate the sparse codes for the respective DDL methods [118], [162], [148] and [149].

Results in the Supervised Case In this subsection, we present the results of our learning algorithm that is trained under label knowledge.

− NT Properties Results The running time te, measured in minutes, for a NT with M = 19000 is shown in Table 5.4. The learned NTs for all the data sets have a relatively low execution time, despite the very high transform dimension. Table 5.4 and Figure 5.11 show the values κ_n(A) and µ(A) evaluated for the linear maps A of the corresponding learned NTs at the different transform dimensions M ∈ Q = {100, 1150, 2200, 3250, 4300, 5350, 6400, 7450, 8500, 9550, 10600, 11650, 12700, 13750, 14800, 15850, 16900, 17950, 19000} for all of the used data sets. All linear maps of the NTs estimated at all dimensions M ∈ Q over all the data sets have good conditioning numbers and low expected coherence. Moreover, as we can see in Figure 5.11, both the conditioning number and the coherence reduce as the transform dimension increases, which confirms the effectiveness of the used minimum information loss constraint.

a)
             κn(A)   µ(A)   te[min]
YALE B       2.21    0.03   5.10
AR           1.80    0.02   5.45
Norb         2.12    0.02   6.55
Coil20       0.08    0.02   8.92
Caltech101   6.01    0.01   12.8
UKB          33.1    0.02   30.1
MNIST        1.60    0.02   5.00

b)
             I^O    I^RT   I^{ST*}   I^NT
YALE B       0.03   0.18   0.68      1.98
AR           0.02   0.10   1.30      1.79
Norb         0.00   0.01   0.71      1.61
Coil20       0.08   0.61   0.89      1.89
Caltech101   0.01   0.16   1.02      2.12
UKB          0.06   0.53   1.36      3.36
MNIST        0.13   0.63   1.06      1.96

Table 5.4 a) The conditioning number κn(A) and the expected mutual coherence µ(A) for the learned linear map A in the NT, and the execution time te[min] in minutes of the proposed algorithm for 28 iterations using a NT with dimensionality M = 19000. b) The discrimination power in the original domain I^O, after a transform with a random linear map I^RT, after using a learned sparsifying transform I^{ST*} and after using the learned NT I^NT.

Fig. 5.12 The approximation C1 = R^P_{ℓ1}(X) and the discrimination power I^t on a subset of the transform data using the learned NT at different transform dimensionality M ∈ Q.

− NT Discrimination Quality Results Table 5.4 and Figures 5.10 and 5.12 show the results. The discrimination power is significantly increased in the transform domain (I^NT) compared to the one in the original domain (I^O), and it is higher than I^{ST*} and I^RT. In Figure 5.10, we show the evolution of the approximations C1 = R^P_{ℓ1}(X) and C2 = D^P_{ℓ1}(X), their ratio C1/C2 and the discrimination power log(C1/C2) = I^t for subsets of the used databases after applying a nonlinear transform with transform dimension M = 9100. It can be noticed that the approximations C1 and C2 are decreasing, meaning that there is a loss of information. However, how this loss affects the resulting similarity concentration is crucial for the discrimination properties. As shown in Figure 5.10, the slope of decrease for C2 is stronger and the discrimination power increases per iteration. For the Coil-20 data set there is a fluctuation. This is explained by the fact that during learning we used a small number of data samples from this data set and that there is high variability in the data.

Discrimination power I     Yale B   MNIST
DLSI                       0.71     0.67
FDDL                       0.87     0.63
COPAR                      0.57     0.54
LRSDL                      0.42     0.40
NT                         0.98     0.77

Yale B, Acc. [%]:   DLSI 96.5    FDDL 97.5    COPAR 98.3    LRSDL 98.7   NT[4K] 99.7
MNIST, Acc. [%]:    DLSI 98.74   FDDL 96.31   COPAR 96.41   LRSDL −      NT[8K] 98.75

Table 5.5 The discrimination power I for the methods DLSI [118], FDDL [162], COPAR [148], LRSDL [149] and the proposed NT, and the recognition results using a nonlinear transform with different dimensionality M and a linear SVM classifier on top of the transform representation, for the Extended Yale B and MNIST databases.

The approximation C1 and the discrimination power I^t for subsets of the used databases after applying our NT with transform dimensions M ∈ Q are shown in Figure 5.12. We can see a similar behavior as before, that is, C1 and C2 are decreasing, but the slope of decrease for C2 is stronger and the discrimination power increases as the transform dimension increases.
− NT vs DDL Discrimination Power Results In all of the comparing algorithms, the dictionary size of the comparing methods is set to be equal to {150, 1515, 570, 300} for the used YALE B and MNIST databases. The discrimination power of the comparing methods is denoted as I^DLSI, I^FDDL, I^COPAR and I^LRSDL. The results are shown in Table 5.5. The discrimination power of the proposed NT is higher than the discrimination power of the other methods.
− NT vs DDL Recognition Accuracy Results The recognition results for the methods DLSI, FDDL, COPAR and LRSDL on the data sets Extended YALE B and MNIST were not computed here; rather, we use the best reported results from the respective papers [118], [162], [148] and [149]. The NT was learned for the transform dimensions M = {100, 500, 1500, 4000} and M = {1000, 4000, 6000, 8000}, respectively, for the used data sets. After the NT was learned, the NT representations were computed for the respective training and test sets. Then, the training NT representations were used as features to learn a linear SVM classifier in a one-against-all regime. The results are shown in Table 5.5, where for our method we can see higher recognition accuracy at higher dimensionality. In comparison, our method outperforms the DDL methods, where we report higher recognition accuracy at dimensionality 4000 and 8000 for the respective YALE B and MNIST data sets.
In Figures 5.13 and 5.14 we show the recognition accuracy and the expected loss, measured as E[‖z_{c,k}‖₂²/M] = E[(1/M)‖A x_{c,k} − y_{c,k}‖₂²], as functions of the discrimination power for the data sets YALE B and MNIST, using the learned NT at different transform dimensionality.

Fig. 5.13 The recognition results and the discrimination power on the Extended Yale B and MNIST databases, respectively, using a NT with different dimensionality M and a linear SVM classifier on top of the transform representation.

Extended Yale B:   M          100     500     1.5K    4K
                   Acc. [%]   89.1    94.3    97.4    99.7
                   I^t        0.21    0.73    0.93    1.23

MNIST:             M          1K      4K      6K      8K
                   Acc. [%]   92.49   94.24   96.01   98.75
                   I^t        1.18    1.35    1.42    1.55

2 zc,k 2 Axc,k yc,k 2 Fig. 5.14 The expected loss E[ M 2] = E[ ∥ M− ∥ ] and the discrimination power on the Extended Yale B and MNIST∥ databases,∥ respectively. The transform representation Y is obtained by using a nonlinear transform TP with different dimensionality. the data sets YALE B and MNIST using learning NT at different transform dimentional- ity. Moreover, we evaluated the discrimination power, the recognition accuracy and the the expected loss for learned NT at transform dimensions M = 100,500,1500,4000 and { } M = 1000,4000,6000,8000 . It is interesting to highlight that with the increase in dis- { } crimination power there is also an increase in the accuracy of recognition over the NT representations. Moreover, the results on these two data sets show that this increase is approximately linear. On the other hand the expected loss decreases with the increase of the transform dimension.

Results in the Unsupervised Case − In this subsection, we present the results for the unsupervised variant of our learning algorithm, where label knowledge is not used.

Fig. 5.15 The expected mutual coherence µ(A) and the conditioning number κ_n(A) = λ_max/λ_min for the learned transform matrix A at dimensionality M ∈ Q_U, for the data sets AR, E-YALE-B, Coil20, NORB and MNIST (conditioning number and expected mutual coherence as functions of the transform dimension).
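As a rough illustration of the two diagnostics plotted in Figure 5.15, the following NumPy sketch computes a conditioning number as the ratio of the extreme singular values and an averaged row coherence. The exact definition of the expected mutual coherence µ(A) is given by (5.31) and in Chapter 4 (not reproduced here), so the averaging over unit-normalized rows below is only an assumed stand-in, and the matrix sizes are illustrative.

import numpy as np

def conditioning_number(A):
    # Ratio of the largest to the smallest singular value of A.
    s = np.linalg.svd(A, compute_uv=False)
    return s.max() / s.min()

def expected_row_coherence(A):
    # Average absolute inner product between distinct, unit-normalized rows of A
    # (an assumed stand-in for the expected mutual coherence mu(A) of the text).
    R = A / np.linalg.norm(A, axis=1, keepdims=True)
    G = np.abs(R @ R.T)
    M = G.shape[0]
    return (G.sum() - np.trace(G)) / (M * (M - 1))

# Example: a random 300 x 100 linear map (expanding 100-dimensional data to 300 dimensions).
rng = np.random.default_rng(0)
A = rng.standard_normal((300, 100))
print(conditioning_number(A), expected_row_coherence(A))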

Fig. 5.16 The evolution of the discrimination power I^t over 100 algorithm iterations on a subset of the transform data using the UNT at transform dimension M = 5884 (data sets AR, E-YALE-B, MNIST, F-MNIST, Coil20 and NORB).

UNT Properties − The running time t_e, measured in hours, for unsupervised nonlinear transform (UNT) learning with M = 5884 is shown in Table 5.6. Similarly as in the supervised case, the learned UNTs for all the data sets have a relatively low execution time, despite the very high transform dimension.

The conditioning number C_n and the expected coherence µ(A) for the learned UNT at different transform dimensions M ∈ Q_U = {300, 1417, 2534, 3650, 4767, 5884} for all of the used data sets are shown in Figure 5.15. The linear maps A in the UNTs at all dimensions and for all of the data sets have good conditioning numbers and low expected coherence. In Figure 5.15, we see that the conditioning number first increases and then decreases as the transform dimension increases. We explain this by the fact that at transform dimension 300 we actually shrink the dimensionality, and in that case the constraint on the conditioning is in a different regime, while for the rest of the cases, where we expand the dimensionality, the constraint is effective. The behavior is also similar w.r.t. the expected mutual coherence, which decreases while the transform dimension increases.

UNT Discrimination Quality Results − Table 5.7 and Figure 5.16 show the results. Figure 5.16 shows the evolution of the discrimination power I^t of the UNT representations. The transform dimensionality is set to M = 2534, and we used subsets from the data sets AR, Extended Yale, Coil20, NORB and MNIST. As shown in Figure 5.16, the discrimination power increases per iteration, but there are fluctuations.

                              AR      YALE B   COIL20   NORB    MNIST   F-MNIST
k-NN Acc. using OD [%]        96.1    95.4     96.8     97      96.4    85.5
k-NN Acc. using UNT [%]       96.9    96.3     97.4     97      97.7    87.1
learning time [h]             .484    .692     .711     .461    1.93    1.87

Table 5.6 The recognition results and the learning time in hours on all databases, respectively. We show the k-NN accuracy on the original data (OD) representation and the k-NN accuracy on the NT representation, where the UNT is learned in the unsupervised case and has dimension M = 5884.

              I                      Acc. [%]
          YALE B  MNIST          YALE B                 MNIST
DLSI       0.71   0.67            96.5                   98.74
FDDL       0.87   0.63            97.5                   96.31
COPAR      0.57   0.54            98.3                   96.41
LRSDL      0.42   0.40            98.7                   -
UNT        0.89   0.71            96.3 (unsup. k-NN)     97.7 (unsup. k-NN)
        a)                       b)                     c)

Table 5.7 a) The discrimination power for the sparse representations of the methods DLSI [118], FDDL [162], COPAR [148] and LRSDL [149] and the proposed method UNT; b), c) the recognition results on the Extended Yale B and MNIST databases for the methods DLSI [118], FDDL [162], COPAR [148] and LRSDL [149] compared to the k-NN results on the UNT representations learned in the unsupervised setup.

This behavior appears because we used a batch variant of the algorithm and because, in the unsupervised case, we do not have labels and use a constrained likelihood based encoding, which has errors.

UNT vs State-of-the-Art Results − Considering the evaluation of the discrimination power, in all the algorithms the dictionary size (transform dimension M) is set to be equal to {150, 300} for the used databases, respectively. The results are shown in Table 5.7, where we see that the discrimination power for the UNT representations is higher than the discrimination power of the comparing methods on the Yale B and MNIST databases. At the same time, the unsupervised k-NN recognition accuracy is also comparable with the state-of-the-art DDL methods that use supervision. Table 5.6 shows the recognition results on all databases using k-NN as a classifier, where we see improvements w.r.t. k-NN on the data representations in the original data domain, while having low computational complexity in the training phase.

SVHN                      MNIST                     F-MNIST
Method         Acc.       Method         Acc.       Method         Acc.
HOG [106]      85.10      LIF-CNN [63]   98.37      LOG-REG [157]  84.00
SSAE [106]     89.70      S-CW-A [93]    98.62      RF-C [157]     87.70
C-KM [106]     90.60      REG-L [109]    99.08      SVC [157]      89.98
S-CW-A [93]    93.10      F-MAX [50]     99.65      CNN [139]      92.10
TMA [85]       98.31      proposed kNN   97.70      proposed kNN   88.10
proposed kNN   86.41      l-svm          98.92      l-svm          91.62
l-svm          89.08

Table 5.8 Recognition accuracy comparison between state-of-the-art methods and 1) k nearest neighbor (kNN) search and 2) a linear SVM [61] (l-svm) that use the sparsifying nonlinear transform (sNT) representations from our model on extracted HOG [32] image features. We use our algorithm to learn the model on the HOG features. Then we get the sNT representations with dimensionality 7300 for the respective training and test sets. The training sNT representations are used to estimate the SVM parameters and the recognition is performed using the learned SVM on the test sNT representations.

Table 5.8 shows the comparison between the recognition results of the proposed method and the state-of-the-art. Although the deep neural networks achieve the highest accuracy on MNIST, F-MNIST and SVHN, our approach, in comparison to the methods for unsupervised feature learning as well as to the deep neural networks [50], [139] and [85], demonstrates competitive performance while having low computational complexity in the training and the testing phase. This is especially pronounced on the MNIST and F-MNIST databases, where the recognition accuracy of our method is 0.73% and 0.48% lower than the state-of-the-art methods for the respective data sets.

Concerning the SVHN data set, we note that while the state-of-the-art methods use the labeled and the extra unlabeled training data sets in the learning stage, we only use the labeled training data set, which is ∼20× less data used for learning. In addition, we also highlight that in our approach we used NT representations as features for a simple classifier (k-NN or a trained linear SVM), whereas most of the deep neural networks [63], [93], [109], [50], [139], [106] and [85] take into account modeling by a multi-layer architecture, local image content relations and/or aggregation and nonlinearities.

In summary, in this section, the numerical evaluation on the used data sets showed that our approach based on NT learning outperforms the comparing DDL methods w.r.t. the used discrimination measure and the recognition accuracy. Furthermore, the used linear classifiers learned on the respective NT representations achieve competitive performance w.r.t. the state-of-the-art deep learning based methods, while having low computational complexity. Moreover, we showed that, when we expand with a NT to a high dimensional space, how the loss of information in the sparse NT representations is reflected in the similarity and dissimilarity approximations D^P_{ℓ1}(X) and R^P_{ℓ1}(X) is not only crucial for the discriminative properties, but also that the proposed discrimination measure is strongly correlated with the recognition accuracy.
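To make the evaluation protocol above concrete, the following is a minimal sketch of how precomputed NT (or sNT) representations can be used as features for a k-NN and a one-vs-rest linear SVM classifier. The array names, shapes and the scikit-learn classifiers are illustrative choices under stated assumptions, not the exact implementation used in this thesis.

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier

def evaluate_nt_features(F_train, y_train, F_test, y_test):
    # Train k-NN and a one-vs-rest linear SVM on the NT features and report accuracy.
    knn = KNeighborsClassifier(n_neighbors=1).fit(F_train, y_train)
    svm = LinearSVC(C=1.0).fit(F_train, y_train)  # one-vs-rest by default
    return {"knn_acc": knn.score(F_test, y_test),
            "linear_svm_acc": svm.score(F_test, y_test)}

# Example with random placeholder features (stand-ins for the real sNT representations).
rng = np.random.default_rng(0)
F_tr, F_te = rng.standard_normal((200, 512)), rng.standard_normal((50, 512))
y_tr, y_te = rng.integers(0, 10, 200), rng.integers(0, 10, 50)
print(evaluate_nt_features(F_tr, y_tr, F_te, y_te))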

5.3 Unsupervised Learning of Discrimination Specific, Self-Collaborative and NT Model for Image Recognition

In recent years, artificial intelligence and machine learning have made significant progress. Crucial to many approaches is the estimation/learning of a task-relevant, useful and information preserving representation. To differentiate the data that originate from different groups, many unsupervised learning methods [98], [145], [7], [80], [147], [10], [48], [57] were proposed. Their primary target is to describe and identify the underlying explanatory groups within the data with (or without) data priors. Usually, a data representation expressed with respect to the groups is used as an unsupervised feature.

Many discriminative descriptions were offered by sparse (or structured sparse) models [11], [71], [70], [8], [76], the discriminative clustering approaches [160], [80] and the discriminative dictionary learning methods [118], [162], [148] and [149], where it is assumed that the true data exhibit a form of sparse structure. However, to the best of our knowledge, a discrimination centered, collaboration structured form of sparse modeling was not explored. Due to the ambiguities in specifying a notion of discrimination and task focused collaboration, the related learning problem is challenging. The main open issues are the data model and the appropriate priors, which delimit the problem formulation and the definition of a suitable objective.

5.3.1 Motivations, NT Model and Learning Strategy Outline

In order to produce a sparse and discriminative representation that is useful for recognition applications, we study joint learning of multiple nonlinear transforms (NTs).

Motivations − In general, the motivation follows from the fact that for any modeling we usually assume an error distribution that reflects the model correctness w.r.t. the true data distribution and the task at hand, which is also significant for the robustness of the estimate. At the same time, the right or wrong error assumption is crucial, since it leads to accumulation or removal of uncertainties related to the target-specific goal. Under an a priori unknown error distribution of the model, its estimate might have high variability. A joint model of multiple NTs, where a relation between their errors is addressed, might be more suitable. In this section, we explore such a relation in order to add to the discrimination of the resulting representations from the NTs: even if the errors in the NT representations w.r.t. a discriminative objective have high variability, their joint treatment has the possibility to compensate for them and to increase the discriminative properties of a composition of NT representations.

Discrimination Specific Self-Collaboration Model − We introduce a novel target-specific, self-collaborative NT model, where each NT is expressible by:
(i) a linear mapping,
(ii) an additive self-collaboration component, which is followed by
(iii) a generalized element-wise nonlinearity.

With the linear mapping we have a number of linear projections that equals the linear map dimension. We can restrict a unique subset of them, coupled with the additive collaboration component, to be aligned to a particular subset of data samples³ and orthogonal to the rest of the data samples. In order to be able to compactly cover the transform space and exclude trivially unwanted linear maps, we impose a prior on the linear maps in the NTs, which we name the information loss prior. Moreover, to allow sparse and unique patterns in the NT representation, we impose a discriminative prior on the NT representations, and we add a collaboration corrective component by modeling a relation between the errors of the NTs. Regarding the estimation of the NT representation, the key difference of our NT model compared with the commonly used sparse models [90] and [125] with constraints is that we do not explicitly address the reconstruction of the data by a sparse linear combination. Instead, we address a constrained projection problem and estimate the NT representation as its solution.
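The following NumPy sketch illustrates the three ingredients above for a single data sample: a linear mapping, an additive collaboration term built from the errors of the other transforms, and an element-wise nonlinearity. It is only an illustrative single pass under assumed choices (soft thresholding as the nonlinearity, a 1/L-averaged collaboration term); the thesis estimates the representations as solutions of a constrained projection problem, not by this simple rule.

import numpy as np

def soft_threshold(q, tau):
    # Element-wise nonlinearity: sign(q) * max(|q| - tau, 0).
    return np.sign(q) * np.maximum(np.abs(q) - tau, 0.0)

def self_collaborative_nt(A_list, x, tau):
    # Apply L NTs to one sample x; each representation uses the linear mapping
    # plus an additive collaboration term aggregated from the other NTs' errors.
    L = len(A_list)
    y = [soft_threshold(A @ x, tau) for A in A_list]       # initial pass, no collaboration
    z = [A_list[l] @ x - y[l] for l in range(L)]           # per-NT sparsification errors
    y_collab = []
    for l in range(L):
        c_l = sum(z[l1] for l1 in range(L) if l1 != l) / L  # aggregated collaboration term
        y_collab.append(soft_threshold(A_list[l] @ x + c_l, tau))
    return y_collab

rng = np.random.default_rng(0)
A_list = [rng.standard_normal((64, 32)) for _ in range(3)]
x = rng.standard_normal(32)
print([v.shape for v in self_collaborative_nt(A_list, x, tau=0.5)])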

Learning Strategy Given observed data, the model parameters are learned by minimizing an approximation to the empirical expectation of the negative log likelihood of the model. An approximation to the expected negative log likelihood of the collaboration component and the discriminative prior is interpreted as the discrimination empirical risk of the model. It also unfolds the actual discrimination objective in the optimization problem which we minimize during learning.

5.3.2 Contributions

Our contributions are the following:
(i) We introduce a target-specific self-collaboration NT model. The main novelty in our model is that instead of focusing on a particular error distribution per NT, we focus on explicitly modeling the relationship between the NT errors, where our target-specific goal is in line with the self-collaboration, discriminative and information loss priors. Along this line,

³In the supervised case, this subset is the set of all the available data samples that come from a particular class.

we introduce a self-collaboration functionality, which, to the best of our knowledge, is the first research work that extends and generalizes the sparsifying transform model [126] and [119]. The model offers a high degree of freedom in specifying a single objective or a combination of an arbitrary number of objectives for the NT representations. It allows modeling of heterogeneous data as well as structuring per target-specific objectives, while providing a possibility to operate under supervised, unsupervised and semi-supervised centric self-collaboration.
(ii) We propose an efficient learning strategy that we implement by an iterative alternating algorithm with two steps. At each step, we give an exact and an approximate closed form solution. In addition, we provide a convergence result for the sequence of objective function values generated by the iterating steps of the proposed algorithm with exact and approximate closed form solutions.
(iii) We present numerical experiments that validate our model and learning principle on several publicly available data sets. Our preliminary results on an image recognition task demonstrate advantages in comparison to the state-of-the-art methods, w.r.t. the computational efficiency in training and test time, the discriminative quality and the recognition accuracy.

5.3.3 Related Work

In the subsequent subsections, we first describe the common sparse models, including the sparsifying transform model [126], [125] and [121] that is the basis of the model we propose. Then, we give the related work in the line of discriminative and sparse representations.

Sparse Models − Although in Chapter 4 we introduced the related sparse models, in the following we give only a short summary.

Synthesis Model − A synthesis model [90] and [125] (or regression model with a sparsity regularized penalty) synthesizes a data sample x_{c,k} ∈ ℜ^N as an approximation by a sparse linear combination y_{c,k} ∈ ℜ^M (‖y_{c,k}‖_0 << M) of a few vectors d_m ∈ ℜ^N from a dictionary D = [d_1, ..., d_M] ∈ ℜ^{N×M}, i.e., x_{c,k} = D y_{c,k} + z_{c,k}, where z_{c,k} ∈ ℜ^N denotes the error vector defined in the original data domain.

Analysis Model − It uses a dictionary Φ ∈ ℜ^{M×N} with M > N to analyze the data x_{c,k} ∈ ℜ^N. This model assumes that the product of Φ and x_{c,k} is sparse, i.e., y_{c,k} = Φ x_{c,k} with ‖y_{c,k}‖_0 = M − s, where 0 ≤ s ≤ M is the number of zeros in y ∈ ℜ^M [127] and [56]. The vector y_{c,k} is the analysis sparse representation of the data x_{c,k} w.r.t. Φ.

Transform Model − In contrast to the synthesis model and similar to the analysis model [90], [119], [121] and [79], the sparsifying transform model does not explicitly target the data reconstruction.

This model assumes that the data sample x_{c,k} is approximately sparsifiable under a linear transform A ∈ ℜ^{M×N}, i.e., A x_{c,k} = y_{c,k} + z_{c,k}, z_{c,k} ∈ ℜ^M, where y_{c,k} is sparse, ‖y_{c,k}‖_0 << M, and the error vector z_{c,k} is defined in the transform domain.

Structured Discrimination Constraints − To address the estimation of discriminative sparse representations (in a supervised or unsupervised setup), the previous models are usually learned with constraints. The commonly used penalties are set on (i) the dictionary, (ii) the sparse representation (e.g., pairwise similarity, encoding w.r.t. a graph and structured sparsity) and (iii) the cost w.r.t. a classifier. In the following, we indicate the primary portion of related work.

Structured Sparsity Methods − Structured sparsity models were widely used in many practical problems, including model-based compressive sensing [11], signal processing [90], [119], [121] and [56], computer vision [71], bio-informatics [154] and recommendation systems [76]. In this section, we use structuring w.r.t. a targeted collaboration to reduce the uncertainty in the estimate w.r.t. the discriminative properties of the NT representations.

Discriminative Dictionary Learning (DDL) − Discrimination constraints were mainly defined by exploiting labels. The class of algorithms is known as discriminative dictionary learning methods (DDL) [118], [162], [148] and [149]. Our approach addresses the unsupervised case with discrimination constraints.

Discriminative Clustering and Auto-encoders − There are also connections to the discriminative clustering methods [160], [7], [80] that we highlighted in the previous section, next to the ones by the single layer auto-encoder [10] and its denoising extension [147].

A scalar, a vector and a matrix are denoted using standard, lower bold and upper bold case symbols as x, x and X, respectively. A set of L data representations is denoted as Y = [Y_1, ..., Y_L], where each set of data representations Y_l = [y_{l,{1,1}}, ..., y_{l,{C,K}}] ∈ ℜ^{M×CK}. For every l ∈ {1,...,L}, every class c ∈ C = {1,...,C} has K samples, i.e., [y_{l,{c,1}}, ..., y_{l,{c,K}}] ∈ ℜ^{M×K}. The set of L components for index {c,k} is denoted as Y_{c,k} = [y_{1,{c,k}}, ..., y_{L,{c,k}}]. The ℓ_p norm, the Hadamard product and the Hadamard division are denoted as ‖·‖_p, ⊙ and ⊘, respectively.

5.3.4 Target Specific Self-Collaboration Model

Our model is centered around three components:
(i) self-collaboration nonlinear transform modeling,
(ii) an unsupervised and collaborative discriminative prior,
(iii) a min-max prior measure which includes a notion of similarity and dissimilarity contributions.

Fig. 5.17 An illustration of the idea of our NT with self-collaboration relations that takes a discrimination specific objective into account.

The model describes a generalized structured nonlinearity of L data representations y_{l,{c,k}} ∈ ℜ^M, ∀l ∈ {1,...,L}, that explain the data x_{c,k} ∈ ℜ^N with a collaborative uncertainty reduction term and priors. The joint model is expressed as

p(Y_{c,k} | x_{c,k}, A) = ∫_{θ_1} ... ∫_{θ_L} p(Y_{c,k}, θ_1, ..., θ_L | x_{c,k}, A) dθ_1 ... dθ_L
∝ ∫_{θ_1} ... ∫_{θ_L} p(x_{c,k} | Y_{c,k}, θ_1, ..., θ_L, A) p(Y_{c,k}, θ_1, ..., θ_L | A) dθ_1 ... dθ_L,   (5.61)

where Y_{c,k} = [y_{1,{c,k}}, ..., y_{L,{c,k}}] ∈ ℜ^{M×L}, θ = {θ_1, ..., θ_L}, θ_l = {θ_{l,1}, θ_{l,2}}, θ_{l,1} = {τ_{l,1}, ..., τ_{l,C_1}}, θ_{l,2} = {ν_{l,1}, ..., ν_{l,C_2}}, A = [A_1; ...; A_L] and A_l ∈ ℜ^{M×N}.

In order to simplify, we assume that:

p(x_{c,k} | Y_{c,k}, θ_1, ..., θ_L, A) = p(x_{c,k} | Y_{c,k}, A),
p(θ_1, ..., θ_L, Y_{c,k} | A) = p(θ, Y_{c,k} | A) = p(θ, Y_{c,k}).   (5.62)

An illustration of the main concept is given in Figure 5.17.

Joint Modeling with Collaboration − We model multiple nonlinear transforms with a collaboration component as follows:

p(x_{c,k} | Y_{c,k}, A) ∝ ∏_{l=1}^{L} exp( −(1/β_0) [ z_{l,{c,k}}^T z_{l,{c,k}} + f_{TSC}(z_{l,{c,k}}, g_A(Z_{{c,k}\l})) ] ),   (5.63)

where z_{l,{c,k}} ∈ ℜ^M and Z_{{c,k}\l} = [z_{1,{c,k}}, ..., z_{l−1,{c,k}}, z_{l+1,{c,k}}, ..., z_{L,{c,k}}] ∈ ℜ^{M×(L−1)}. The term f_{TSC}(z_{l,{c,k}}, g_A(Z_{{c,k}\l})) : ℜ^M × ℜ^M → ℜ denotes a target specific collaboration function, that we define as:

f_{TSC}(z_{l,{c,k}}, g_A(Z_{{c,k}\l})) = z_{l,{c,k}}^T g_A(Z_{{c,k}\l}),   (5.64)

where g_A(Z_{{c,k}\l}) : ℜ^M × ... × ℜ^M → ℜ^M denotes the collaboration aggregation function, which we define as:

g_A(Z_{{c,k}\l}) = ∑_{l1≠l, l1∈{1,..,L}} z_{l1,{c,k}},   (5.65)

ending up with

f_{TSC}(z_{l,{c,k}}, g_A(Z_{{c,k}\l})) = z_{l,{c,k}}^T ∑_{l1≠l, l1∈{1,..,L}} z_{l1,{c,k}}.   (5.66)

Model Interpretation − The model (5.63) assumes that the data sample x_{c,k}, indexed by k from group c, is approximately sparsifiable under any of the linear transforms A_l ∈ ℜ^{M×N}, l ∈ {1,...,L}, the target specific collaboration function f_{TSC} and the collaboration aggregation function g_A, i.e.,

A_l x_{c,k} − ∑_{l1≠l, l1∈{1,..,L}} z_{l1,{c,k}} = y_{l,{c,k}} + v_{l,{c,k}},   (5.67)

where y_{l,{c,k}} ∈ ℜ^M is the NT representation, v_{l,{c,k}} ∈ ℜ^M is the corrected sparsifying error vector and z_{l,{c,k}} = A_l x_{c,k} − y_{l,{c,k}}.

Self-Collaboration − The uncertainty reduction by (5.63) is w.r.t. a target described by the relation on the error terms z_{l1,{c,k}}. The terms z_{l,{c,k}}^T ∑_{l1≠l, l1∈{1,..,L}} z_{l1,{c,k}} act as a self-regularization that tries to compensate for the unknown distribution of z_{l,{c,k}} by reducing the uncertainty w.r.t. the linear combination c_{l,{c,k}} = ∑_{l1≠l, l1∈{1,..,L}} z_{l1,{c,k}} ∈ ℜ^M of the rest of the unknown distributions for z_{l1,{c,k}}. The goal is to allow the individual z_{l,{c,k}} to describe only a portion of the targeted, but unknown, probability distribution of z_{l,{c,k}} in order to reduce the uncertainty and give a reliable and robust estimate w.r.t. a composition and aggregation function, which in our case is simply a concatenation [y_{1,{c,k}}, ..., y_{L,{c,k}}].

Prior on z_l − The transform representation y_{l,{c,k}} = T_{P_{l,c}}(x_{c,k}) takes into account a nonlinearity, and A_l x_{c,k} − c_{l,c} is only seen as its linear approximation. In the simplest form we model z_{l,{c,k}} to be Gaussian distributed. Additional knowledge about z_{l,{c,k}} can be used in its modeling.

Prior on A_l − We assume that:

p(A) = ∏_{l=1}^{L} p(A_l) ∝ ∏_{l=1}^{L} exp(−Ω(A_l)).   (5.68)

The prior on A_l penalizes the information loss in order to avoid trivially unwanted matrices A_l, i.e., matrices that have repeated or zero rows. This prior measure Ω(A_l) is defined equivalently as in Section 5.2, and similarly, we use it to regularize the conditioning and the expected coherence of A_l.

Discriminative Prior − A joint probability p(θ, Y_{c,k}) expresses the unsupervised discriminative prior, where we assume that:

θ = {θ_1, ..., θ_L}, where
θ_l = {θ_{l,1}, θ_{l,2}}, ∀l ∈ {1,...,L},   (5.69)
θ_{l,1} = {τ_{l,1}, ..., τ_{l,C_1}},
θ_{l,2} = {ν_{l,1}, ..., ν_{l,C_2}},

cover similarity and dissimilarity regions in the transform space and take into account a form of collaboration. This allows us to explicitly model dependencies between:

(i) θ and Y_{c,k}, where an additional self-collaboration component can be introduced,
(ii) θ and y_{l,{c,k}}, where each y_{l,{c,k}} independently collaborates with θ,
(iii) θ_l and y_{l,{c,k}}, where only the relation between θ_l and y_{l,{c,k}} is considered.

In this work, we consider the simplest case and model our prior by considering only

dependences between θ_l and y_{l,{c,k}}, i.e.:

p(θ, Y_{c,k}) = ∏_{l=1}^{L} p(θ_l | y_{l,{c,k}}) p(y_{l,{c,k}}).   (5.70)

Furthermore, in our independent modeling per θ_l and y_{l,{c,k}} we consider that:

p(θ_l | y_{l,{c,k}}) ∝ exp( −(1/β_I) l_I(θ_l, y_{l,{c,k}}) ),   (5.71)

where l_I(θ_l, y_{l,{c,k}}) is a discriminative measure for similarity contributions over the parameters θ, and:

p(y_{l,{c,k}}) ∝ exp( −(1/β_{l,1}) ‖y_{l,{c,k}}‖_1 ),   (5.72)

is a sparsity inducing prior, while β_{l,1} and β_I are scaling parameters. The use of the measure l_I(θ_l, y_{l,{c,k}}) allows us to introduce a concept that relies on collaboration corrective covering of similarity and dissimilarity regions by taking into account the relations between θ_l and Y_{c,k}.

Prior Measures − To define l_I(θ_l, y_{l,{c,k}}) it is assumed that:
(i) p(θ_l) = p(θ_{l,1}) p(θ_{l,2}) = ∏_{c1=1}^{C_d} p(τ_{l,c1}) ∏_{c2=1}^{C_s} p(ν_{l,c2}),

(ii) l_I(θ_l, y_{l,{c,k}}) is determined by relations on the support intersection of the vectors y_{l,{c,k}}, τ_{l,c1} and ν_{l,c2}, and
(iii) the description is decomposable w.r.t. τ_{l,c1} and ν_{l,c2}, ∀{c1, c2} ∈ {{1,...,C_d}, {1,...,C_s}}.

We use the two prior measures as in Subsection 5.2, but now we define them w.r.t. y_{l,{c,k}}, i.e., ρ(y_{l,{c,k}}, y_{l,{c1,k1}}) = ‖y⁻_{l,{c,k}} ⊙ y⁻_{l,{c1,k1}}‖_1 + ‖y⁺_{l,{c,k}} ⊙ y⁺_{l,{c1,k1}}‖_1 and ς(y_{l,{c,k}}, y_{l,{c1,k1}}) = ‖y_{l,{c,k}} ⊙ y_{l,{c1,k1}}‖²_2. The motivation is again similar as in Section 5.2. We use these measures in order to impose a discrimination constraint without any explicit assumption about the space/manifold in the transform domain.

We note that a discriminative assignment w.r.t. the parameters θ_l that describe regions of similarity and dissimilarity in a collaborative way represents a trade-off between three elements:
(i) a similarity contribution,
(ii) a dissimilarity contribution,
(iii) an uncertainty corrective contribution w.r.t. a measure.

Our min-max functional l_I(θ_l, y_{l,{c,k}}) is one particular form that can be used to measure the score of this trade-off, and we define it as follows:

l_I(θ_l, y_{l,{c,k}}) = min_{1≤c1≤C_1} max_{1≤c2≤C_2} ( ρ(y_{l,{c,k}}, τ_{l,c1}) + ρ(y_{l,{c,k}}, ν_{l,c2}) + ς(y_{l,{c,k}}, τ_{l,c1}) ).   (5.73)

By assumption, τ_{l,c1} and ν_{l,c2} are spread far apart and cover the corresponding similarity and dissimilarity regions in the transform domain. The min-max cost l_I(θ_l, y_{l,{c,k}}) ensures that y_{l,{c,k}} in the transform domain will be located at the point where (i) the dissimilarity contribution w.r.t. τ_{l,c1} is smallest, measured w.r.t. ρ, (ii) the strength of the support intersection w.r.t. τ_{l,c1} is smallest, measured w.r.t. ς, and (iii) the similarity contribution w.r.t. ν_{l,c2} is largest, measured w.r.t. ρ.
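The following NumPy sketch evaluates the min-max trade-off score (5.73) for one representation vector. It assumes that y⁻ and y⁺ in the definition of ρ denote the negative and positive parts of the vectors (an interpretation of the restated Section 5.2 measures); the parameter vectors are random placeholders.

import numpy as np

def rho(y, t):
    # Support-overlap measure: l1 overlap of the negative parts plus l1 overlap of the positive parts.
    yp, yn = np.maximum(y, 0), np.maximum(-y, 0)
    tp, tn = np.maximum(t, 0), np.maximum(-t, 0)
    return np.sum(yn * tn) + np.sum(yp * tp)

def sigma(y, t):
    # Squared l2 norm of the element-wise (Hadamard) product.
    return np.sum((y * t) ** 2)

def l_I(y, taus, nus):
    # Min-max trade-off (5.73): minimize over the dissimilarity parameters tau_c1,
    # maximize over the similarity parameters nu_c2.
    return min(max(rho(y, t) + rho(y, n) + sigma(y, t) for n in nus) for t in taus)

rng = np.random.default_rng(0)
y = rng.standard_normal(16)
taus = [rng.standard_normal(16) for _ in range(4)]   # dissimilarity region parameters
nus = [rng.standard_normal(16) for _ in range(3)]    # similarity region parameters
print(l_I(y, taus, nus))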

5.3.5 Joint Learning of NTs with Discrimination Specific Self-Collaboration

Minimizing the exact negative logarithm of our self-collaborative learning model:

p(Y, A | X) = p(Y | X, A) p(A | X) = ∏_{c=1}^{C} ∏_{k=1}^{K} [ ∫_θ p(Y_{l,{c,k}}, θ | x_{c,k}, A) dθ ] p(A | x_{c,k})
∝ ∏_{c=1}^{C} ∏_{k=1}^{K} [ ∫_θ p(x_{c,k} | Y_{l,{c,k}}, A) p(θ | Y_{l,{c,k}}) p(Y_{l,{c,k}}) dθ ] p(A | x_{c,k}),   (5.74)

over Y, θ and A is difficult, since we have to integrate in order to compute the marginal and the partition function of the discrimination prior. Similarly as in the previous subsection, instead of minimizing the exact negative logarithm

of the marginal probability p(Y_{l,{c,k}} | x_{c,k}, A) = ∫_θ p(x_{c,k} | Y_{l,{c,k}}, A) p(θ | Y_{l,{c,k}}) p(Y_{l,{c,k}}) dθ, if we consider minimizing the negative logarithm of its maximum point-wise approximation over Y, θ and A, we end up with the following problem formulation:

{Ŷ, θ̂, Â} = arg min_{Y,θ,A} ∑_{l=1}^{L} { (1/2)‖A_l X − Y_l‖²_F + ∑_{c=1}^{C} ∑_{k=1}^{K} [ λ_{l,0} l_I(θ_l, y_{l,{c,k}}) + λ_{l,1} ‖y_{l,{c,k}}‖_1 ] + (1/L) Tr[ (A_l X − Y_l)^T ∑_{l1∈{1,...,L}\l} (A_{l1} X − Y_{l1}) ] + Ω(A_l) },   (5.75)

where Y = [Y_1, ..., Y_L], Y_l = [y_{l,{1,1}}, ..., y_{l,{C,K}}], and λ_{l,1} and λ_{l,0} are inversely proportional to the scaling parameters β_{l,1} and β_I. As we will show later on, (5.76) is more appealing not only due to the simplification, but also due to the possibility for an efficient solution and interpretations.

Approximative Discrimination Prior Likelihood in Implicit Form − Using a simple encoding function (as in the previous Sections 5.1 and 5.2 and Appendices C.3 and C.4) and all of the NT representations, we can implicitly express the parameters θ_l = {θ_{l,1}, θ_{l,2}}, such that they are eliminated from (5.75). To do so, we first note that we use an empirical approximation, i.e.,

E[−log p(θ_l | y_{l,{c,k}})] ≃ E[l_I(y_{l,{c,k}}, θ_l)] ∼ (1/CK) ∑_{c=1}^{C} ∑_{k=1}^{K} l_I(y_{l,{c,k}}, θ_l),   (5.76)

that can also be expressed as:

E[l_I(y_{l,{c,k}}, θ_l)] ∼ L_I(Y_l) = (1/CK) ( D^{P_l}_{ℓ1}(X) − R^{P_l}_{ℓ1}(X) + S^{P_l}_{ℓ1}(X) ),   (5.77)

where, per each Y_l, D^{P_l}_{ℓ1}(X), R^{P_l}_{ℓ1}(X) and S^{P_l}_{ℓ1}(X) are defined equivalently as in the previous Section 5.2 (the proof is also equivalent to the one given in Appendix C.7). Since we would like to estimate the model parameters that minimize the discriminative objective (5.77), we end up with the following problem formulation:

{Ŷ, Â} = arg min_{Y,A} ∑_{l=1}^{L} { (1/2)‖A_l X − Y_l‖²_F + λ_{l,0} L_I(Y_l) + ∑_{c=1}^{C} ∑_{k=1}^{K} λ_{l,1} ‖y_{l,{c,k}}‖_1 + (1/L) Tr[ (A_l X − Y_l)^T ∑_{l1∈{1,...,L}\l} (A_{l1} X − Y_{l1}) ] + Ω(A_l) }.   (5.78)

We highlight that for our model, the solution to (5.78) is not equivalent to the maximum a posteriori (MAP) solution⁴, which would be difficult to compute, as it involves integration over x_{c,k}, y_{l,{c,k}} and θ. Instead, we perform an integrated marginal minimization that is addressed with (5.78) and solved by iteratively marginally maximizing ∏_{c=1}^{C} ∏_{k=1}^{K} p(x_{c,k}, Y_{c,k}, θ, A) in A and Y_{c,k}. This is equivalent to 1) maximizing the conditionals p(x_{c,k} | Y_{c,k}, A) with an approximation to the prior p(θ_l | y_{l,{c,k}}) over y_{l,{c,k}}, and 2) approximatively maximizing the conditional ∏_{c=1}^{C} ∏_{k=1}^{K} p(x_{c,k} | Y_{c,k}, A) with the prior p(A) = ∏_{l=1}^{L} p(A_l) over A.

5.3.6 The Learning Algorithm

As a solution to (5.78), we propose an iterative, alternating algorithm with two distinct stages:

(i) NT representation y_{l,{c,k}} estimation and
(ii) linear map A_l estimation.
At the same time, we show 1) that the problems at the corresponding stages have an exact and an approximate closed form solution, respectively, and 2) the convergence of the algorithm.

Representation y_{l,{c,k}} Estimation − Given the available data samples X and the current estimate of A_l, the discriminative representation estimation problem per Y_l is decoupled. Moreover,

⁴The MAP estimation problem for our model is identical to (5.78), but has additional terms that are related to the partition functions of p(x_{c,k} | Y_{c,k}, A) and p(θ, Y_{c,k}).

given all Y_l except y_{l,{c,k}}, denote:

c_{l,{c,k}} = (1/L) ∑_{l1∈{1,...,L}\l} (y_{l1,{c,k}} − A_{l1} x_{c,k}),   (5.79)
q = q_{l,{c,k}} = A_l x_{c,k} + c_{l,{c,k}},

then for any y = y_{l,{c,k}}, problem (5.78) reduces to a constrained projection:

ŷ = arg min_y (1/2)‖q − y‖²_2 + p^T |y| + λ_{l,0} s^T (y ⊙ y).   (5.80)

Assuming that p^T |y| ≥ 0, it has a closed form solution:

y = sign(q) ⊙ max(|q| − p, 0) ⊘ n,   (5.81)

where:

p = λ_{l,0}(g − v) + λ_{l,1} 1,
n = 1 + 2 λ_{l,0} s,   (5.82)
g = sign(max(q, 0)) ⊙ d⁺_1 + sign(max(−q, 0)) ⊙ d⁻_1,
v = sign(max(q, 0)) ⊙ d⁺_2 + sign(max(−q, 0)) ⊙ d⁻_2.

We note that d⁻_1, d⁺_1, d⁻_2, d⁺_2 and s are constructed similarly as in the previous section using (5.54), but now different d⁻_1, d⁺_1, d⁻_2, d⁺_2 and s have to be estimated for all l ∈ {1,...,L}, c ∈ {1,...,C}, k ∈ {1,...,K} (the proof is equivalent to the one given in Appendix C.7).

Let y_{l,{c,k}} be given; we denote its cumulative cost w.r.t. the discrimination and collaboration corrective parameters as:

ξ_{l,{c,k}} = p^T |y_{l,{c,k}}| + c_{l,{c,k}}^T y_{l,{c,k}} + s^T (y_{l,{c,k}} ⊙ y_{l,{c,k}}),   (5.83)

then the empirical expectation over ξ_{l,{c,k}}, i.e.:

E[ξ_{l,{c,k}}] ≃ (1/CK) ∑_{c=1}^{C} ∑_{k=1}^{K} ξ_{l,{c,k}},   (5.84)

can be seen as an empirical risk for the proposed NT model w.r.t. the used self-collaboration component and discrimination prior measure. Therefore, we can also say that the learning objective of (5.78) is to estimate the model parameters that minimize the empirical risk (5.84).
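A minimal NumPy sketch of the per-sample representation update (5.79)-(5.82) follows. The vectors d⁻_1, d⁺_1, d⁻_2, d⁺_2 and s are assumed to be given (their construction via (5.54) is not reproduced here), and the random inputs are placeholders.

import numpy as np

def nt_representation_update(A_list, Y_list, x, l, d1m, d1p, d2m, d2p, s, lam0, lam1):
    # One representation update y_l following (5.79)-(5.82); d1m, d1p, d2m, d2p, s
    # are assumed precomputed as in Section 5.2.
    L = len(A_list)
    c_l = sum(Y_list[l1] - A_list[l1] @ x for l1 in range(L) if l1 != l) / L  # (5.79)
    q = A_list[l] @ x + c_l
    pos, neg = np.sign(np.maximum(q, 0)), np.sign(np.maximum(-q, 0))
    g = pos * d1p + neg * d1m                      # (5.82)
    v = pos * d2p + neg * d2m
    p = lam0 * (g - v) + lam1 * np.ones_like(q)
    n = 1.0 + 2.0 * lam0 * s
    return np.sign(q) * np.maximum(np.abs(q) - p, 0.0) / n   # (5.81)

rng = np.random.default_rng(0)
M, N, L = 32, 16, 3
A_list = [rng.standard_normal((M, N)) for _ in range(L)]
x = rng.standard_normal(N)
Y_list = [rng.standard_normal(M) for _ in range(L)]           # current per-sample representations
aux = [np.abs(rng.standard_normal(M)) for _ in range(5)]      # placeholders for d1-, d1+, d2-, d2+, s
print(nt_representation_update(A_list, Y_list, x, 0, *aux, lam0=0.1, lam1=0.05))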

                          AR       YALE B   COIL20   NORB
(1/L) ∑_l µ(A_l)          2.1e-4   1e-4     1.9e-4   3.1e-4
(1/L) ∑_l κ_n(A_l)        16.1     26.3     18       19.1

Table 5.9 The cumulative expected mutual coherence (1/L) ∑_l µ(A_l) and the cumulative conditioning number (1/L) ∑_l κ_n(A_l) for the linear maps A_l, l ∈ {1,...,6}, with dimensions 6570 × N, where N is the dimensionality of the input data.

Linear Map A_l Estimation − Given the data samples X, all Y = [Y_1, ..., Y_L], and all A_{l1} except A_l, denote Y_T = Y_l − (1/L) ∑_{l1∈{1,...,L}\l} (A_{l1} X − Y_{l1}); then the problem related to the estimation of the linear map A_l reduces to:

Â_l = arg min_{A_l} (1/2)‖A_l X − Y_T‖²_2 + (λ_{l,2}/2)‖A_l‖²_F + (λ_{l,3}/2)‖A_l A_l^T − I‖²_F − λ_{l,4} log |det A_l^T A_l|,   (5.85)

where {λ_{l,2}, λ_{l,3}, λ_{l,4}} are inversely proportional to the scaling parameters {β_{l,3}, β_{l,4}, β_{l,5}}, and we use an approximate closed form solution as in Chapter 4 and Section 5.2 (see also Appendix B.1 and B.2).
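For clarity, the objective in (5.85) can be evaluated directly for a candidate linear map, as in the NumPy sketch below. The approximate closed form minimizer used in the thesis (Appendix B.1 and B.2) is not reproduced here; the matrix sizes and penalty weights are illustrative, with M ≥ N so that det(A^T A) is nonzero.

import numpy as np

def linear_map_objective(A, X, Y_T, lam2, lam3, lam4):
    # Data-fit term plus Frobenius, near-orthogonality and log-determinant penalties;
    # the latter two control the conditioning and penalize information loss.
    M = A.shape[0]
    fit = 0.5 * np.linalg.norm(A @ X - Y_T) ** 2
    frob = 0.5 * lam2 * np.linalg.norm(A) ** 2
    ortho = 0.5 * lam3 * np.linalg.norm(A @ A.T - np.eye(M)) ** 2
    _, logabsdet = np.linalg.slogdet(A.T @ A)   # log |det(A^T A)|
    return fit + frob + ortho - lam4 * logabsdet

rng = np.random.default_rng(0)
N, M, K = 16, 32, 40
A = rng.standard_normal((M, N))
X = rng.standard_normal((N, K))
Y_T = rng.standard_normal((M, K))
print(linear_map_objective(A, X, Y_T, lam2=0.1, lam3=0.1, lam4=0.1))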

Convergence − Due to the approximate closed form solution that preserves the gradient in the linear map update (Chapter 4, Appendix B.1 and B.2) and the exact closed form solution in the NT representation estimation step (Appendix C.7), the derivation of the convergence proof for our algorithm is similar to the one given in the previous Section 4 from Chapter 5. The difference is that instead of one linear map and one NT representation, we have L linear maps and L NT representations, which we estimate using approximate and exact closed form solutions, respectively. In addition, the condition related to the discrimination objective can be loosened further, because in (5.78) the discrimination objective is a sum of the discrimination objectives over the l nonlinear transforms, i.e., L_p(Y) = ∑_{l=1}^{L} L_I(Y_l), while we use an analogous reasoning as for the proof given in Chapter 5, Section 4.

In this section, again, by proving convergence of our algorithm, we can claim that, under the approximation of θ, the algorithm allows us to find a joint local maximum in A and Y for p(Y, A | X) = p(Y | X, A) p(A | X) = ∏_{c=1}^{C} ∏_{k=1}^{K} [ ∫_θ p(Y_{l,{c,k}}, θ | x_{c,k}, A) dθ ] p(A | x_{c,k}), such that the discrimination and collaboration specific prior probability is maximized.

5.3.7 Evaluation of the Proposed Approach

In this section, we evaluate the algorithm properties, the discriminative quality, and the recognition accuracy.

                     AR      YALE B   COIL20   NORB
learning time [h]    .612    .901     .333     .712

Table 5.10 The learning time in hours on the databases AR, YALE B, COIL20 and NORB using our model with dimension LM = 6570, number of self-collaboration components L = 9, and dimension per self-collaboration component M = 730.

Quantifying a Discrimination Quality − The discriminative properties of a data set under a transform with parameter set P_t = {A = [A_1; ...; A_L] ∈ ℜ^{LM×N}, τ_1 ∈ ℜ^{LM}} are defined identically as in the previous section; the only difference is how we define the two concentrations D^{P_t}_{ℓ1}(X) and R^{P_t}_{ℓ1}(X). Using the measures (5.44), defined over all of the vectors y_{c,k} = [y_{1,{c,k}}^T, ..., y_{L,{c,k}}^T]^T, we define the discrimination power as:

I^t = log(R^{P_t}_{ℓ1}(X)) − log(D^{P_t}_{ℓ1}(X) + ε),   (5.86)

where ε > 0 is a small constant. In the following text, we first state the algorithm set-up, then we explain our numerical simulation and finally we provide the results.

Data Sets and Algorithm Setup − The used data sets are AR [96], Extended YALE B [47], COIL20 [105], NORB [83], MNIST [82], F-MNIST [158] and SVHN [106]. All the images from the respective data sets were downscaled to resolutions of 32×28, 21×21, 20×25, 24×24, 28×28 and 28×28, and are normalized to unit variance.

An on-line variant is used for the update of A w.r.t. a subset of the available training set. It has the form Â^{t+1} = A^t − ρ(A^t − A^{t+1}), where A^t and A^{t+1} are the solutions in the transform update step at iterations t and t+1, which is equivalent to having the additional constraint ‖A^t − Â^{t+1}‖²_F in the related problem. The used batch size is equal to 87%, 85%, 90%, 87%, 5%, 5% and 5% of the total amount of the available training data for the respective data sets AR, Extended YALE B, COIL20, NORB, MNIST, F-MNIST and SVHN.

Considered Experiments − In the numerical experiments we learn our model using the proposed algorithm. Then, we construct a sparsifying nonlinear transform (sNT) representation using our learned model by (i) computing the sparsifying transforms u_{l,{c,k}} = sign(A_l x_{c,k}) ⊙ max(|A_l x_{c,k}| − τ_1, 0) and (ii) concatenating them, u_{c,k} = [u_{1,{c,k}}^T, ..., u_{L,{c,k}}^T]^T. The numerical experiments are performed using the sNT representations and consist of three parts.
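The sNT construction just described amounts to per-transform soft thresholding followed by concatenation, as in the short NumPy sketch below. The input dimensionality and a scalar threshold are illustrative assumptions; L = 9 maps of size 730 × N match the overall dimension LM = 6570 reported in Table 5.10.

import numpy as np

def snt_representation(A_list, x, tau1):
    # Sparsifying NT representation of a sample x: u_l = sign(A_l x) * max(|A_l x| - tau1, 0),
    # concatenated over the L transforms.
    parts = []
    for A in A_list:
        q = A @ x
        parts.append(np.sign(q) * np.maximum(np.abs(q) - tau1, 0.0))
    return np.concatenate(parts)

rng = np.random.default_rng(0)
A_list = [rng.standard_normal((730, 128)) for _ in range(9)]  # L = 9 maps of size M x N
x = rng.standard_normal(128)
u = snt_representation(A_list, x, tau1=0.5)
print(u.shape)  # (6570,) = L * M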

         AR     YALE B   COIL20   NORB
I^o      2.13   1.45     1.18     0.41
I^R      2.41   1.66     1.61     0.40
I^S      2.71   1.76     1.92     0.40
I^∗      3.04   2.14     2.63     0.42

Table 5.11 The discrimination power in the original domain, after a random transform, after a learned sparsifying transform and after the learned self-collaborating target specific nonlinear transform with dimension M = 6570.

k-NN Acc.   AR     YALE B   COIL20   NORB
raw         96.1   95.4     96.8     97
p           97.1   97.1     97.8     96.8

Table 5.12 The recognition results on the databases AR, YALE B, COIL20 and NORB, using k-NN on the raw image data (raw) and the sparse representations from our model (p) with dimension M = 6570.

Model Properties − In the first series of experiments, we evaluate the cumulative expected mutual coherence ∑_{l=1}^{L} µ(A_l), where µ(A_l) is computed as defined by (5.31) (or as defined in Chapter 4). Also, we evaluate the cumulative conditioning number ∑_{l=1}^{L} κ_n(A_l), where κ_n(A_l) = σ_max/σ_min, with σ_min and σ_max the smallest and the largest singular values of A_l, respectively, and the computational efficiency as the run time t[h].

Discrimination Power of the sNT Representation − A comparison is presented between the discrimination power I under different transforms. The discrimination power is estimated in the original domain (I^o), after a transform by a Gaussian random matrix (I^R), and after a learned nonlinear transform with transform dimension M = 6570, with only the sparsity prior (I^S) and with the discriminative and sparsity priors (I^∗).

Recognition Accuracy Comparison, Proposed vs State-of-the-Art − The third part evaluates the discrimination power and the recognition accuracy using the representations from our model as features and compares them to several state-of-the-art methods, including:
1) supervised dictionary learning methods [118], [162], [148] and [149],
2) unsupervised feature learning methods [106], [63], [109],
3) different classifiers [157],
4) deep neural networks [50], [93], [85], [139].
This comparison considers a setup where the used data are divided into a training and a test set. The learning is performed on the training set and the evaluation is performed on the test set. The training sNT representations are used to estimate the SVM parameters and the recognition is performed using the learned SVM on the test sNT representations.

                     I                    Acc. [%]
                YALE B  MNIST         YALE B           MNIST
DLSI [118]       0.71   0.67           96.5             98.74
FDDL [162]       0.87   0.63           97.5             96.31
COPAR [148]      0.57   0.54           98.3             96.41
LRSDL [149]      0.42   0.40           98.7             -
p                0.90   0.81           97.1 (k-nn)      97.32 (k-nn)
p                0.90   0.81           98.8 (l-svm)     98.45 (l-svm)
HOG-p            1.20   1.11           99.2 (l-svm)     99.10 (l-svm)

Table 5.13 The discrimination power and the recognition results on the Extended Yale B and MNIST databases for the methods DLSI [118], FDDL [162], COPAR [148], LRSDL [149], the proposed model on raw image data (p) and the proposed model on extracted HOG [32] image features (HOG-p).

Results − We present the results in the following.

Model Properties − The cumulative conditioning number (1/L) ∑_{l=1}^{L} C_n(A_l) and the cumulative expected coherence (1/L) ∑_{l=1}^{L} µ(A_l) for the learned transforms using the databases AR, YALE B and COIL20 are shown in Table 5.9. All linear maps for all the databases have good conditioning numbers and low expected coherence. This confirms the effectiveness of the conditioning and the coherence constraints. The running time t[h], measured in hours, for learning the model parameters with M = 6570 is shown in Table 5.10. The learned transforms for all the data sets have a relatively low execution time, despite the very high transform dimension.

Discrimination Quality − In Table 5.11 we show the results. The discrimination power I^∗ is significantly increased in the transform domain compared to the one in the original domain I^o and is higher than I^R and I^S.

Unsupervised Classification Recognition Accuracy − Table 5.12 shows the recognition results on the databases AR, YALE B, COIL20 and NORB using k-nn as a classifier, where we compare the results w.r.t. the baseline, that is, k-nn in the original domain, and we see improvements. For the results shown in Tables 5.13 and 5.14 we see comparable results.

Proposed vs State-Of-The-Art − Considering the evaluation of the discrimination power, in all the algorithms the dictionary size (transform dimension M) is set to be equal to {150, 300} for the used databases, respectively. The discrimination power is compared with the methods DLSI, FDDL, COPAR and LRSDL. The results are shown in Table 5.13. The discrimination power of the sNT representation is higher than the discrimination power of the comparing methods.

MNIST                      F-MNIST                   SVHN
Method          Acc.       Method         Acc.       Method         Acc.
LIF-CNN [63]    98.37      LOG-REG [157]  84.00      SSAE [106]     89.70
S-CW-A [93]     98.62      RF-C [157]     87.70      C-KM [106]     90.60
REG-L [109]     99.08      SVC [157]      89.98      S-CW-A [93]    93.10
F-MAX [50]      99.65      CNN [139]      92.10      TMA [85]       98.31
k-nn            97.11      k-nn           88.10      k-nn           86.41
l-svm           99.10      l-svm          92.22      l-svm          90.28

Table 5.14 Recognition accuracy comparison between the state-of-the-art methods and 1) k nearest neighbor (k-nn) search and 2) a linear SVM [61] (l-svm) that use the sparsifying NT (sNT) representations from our model on extracted HOG [32] image features. We use our algorithm to learn the model on the HOG features. Then we get the sNT representations with dimensionality 9800 for the respective training and test sets. Considering the obtained result for the SVHN database, we note that the unlabeled training data from the respective database was not used during the learning of the corresponding model.

For the YALE B and MNIST databases the recognition accuracy is comparable and higher, respectively, w.r.t. the state-of-the-art and the DDL methods. The results shown in Table 5.14 demonstrate improvements and competitive performance w.r.t. the comparing unsupervised feature learning methods. Considering the comparison w.r.t. deep neural networks, [50], [139] and [85] achieve the highest accuracy on MNIST, F-MNIST and SVHN. We highlight that, although we learn a target specific self-collaboration model with discriminative and sparsity priors, during testing we use a simple sNT representation, whereas in most of the deep neural networks [63], [93], [109], [50], [139], [106] and [85], modeling by a multi-layer architecture, the local image content and/or aggregation and nonlinearities were taken into account.

5.4 Summary

In the first subsection, we considered the face recognition problem from both a machine learning and an information coding perspective, adopting an alternative way of visual information encoding and decoding. Our model for recognition was based on multilevel vector quantization (MVQ), which is conceptually equivalent to bag-of-words, while the multilevel scheme bears similarity to a CNN. A key ingredient in our approach was the robust estimate of the local NT representation, where we gave a generalization result by a solution of a constrained projection problem and showed the correspondence of the sparse likelihood approximation to maximum a posteriori estimation. In addition, we provided a probabilistic interpretation of the hard thresholding, soft thresholding and ℓ1/ℓ2-norm ratio constrained likelihood approximations. The computer simulation demonstrated improvements and competitive performance of the proposed method over the sparse representation based recognition on several popular face image databases in terms of recognition accuracy, computational complexity and memory usage.

In the second subsection, we presented a novel approach for learning discriminative and sparse representations for image recognition. A novel discriminative prior was proposed and the properties of the models with the prior were studied. A low complexity learning algorithm was presented. The obtained results w.r.t. the introduced measures and the recognition accuracy on the used databases showed promising performance. Moreover, it was highlighted how the loss of information is reflected in the similarity concentrations when expanding to a high dimensional space with a nonlinear transform. A study of the recognition capabilities on other databases is among our next steps. An extension considering the maximization of the discrimination power in the transform domain, under the supervised and unsupervised cases, is left for future work.

In the third subsection, we introduced a novel collaboration structured model with minimum information loss, collaboration corrective and discriminative priors for the joint learning of multiple nonlinear transforms for an image recognition task. The model parameters were learned by addressing an integrated marginal maximization that corresponds to minimizing an unnormalized empirical log likelihood of the model. An efficient solution was proposed by an iterative coordinate descent algorithm with convergence guarantees. The preliminary results w.r.t. the introduced measure and the recognition accuracy on the used databases showed promising performance and advantages w.r.t. the state-of-the-art methods.

Chapter 6

Clustering with NT and Discriminative Min-Max Assignment

Clustering is one of the most important unsupervised learning tasks in the areas of signal processing, machine learning, computer vision and artificial intelligence, and it has been extensively studied for decades. Commonly, data clustering algorithms [27], [60], [52], [69], [19], [138], [160], [7] and [80] address the problem of identification and description of the underlying clusters that explain the data in the original data domain. Among the various types of clustering algorithms, the k-means and matrix decomposition based methods are among the most popular and practically useful approaches.

Given a data set, in the most common case, the objective of a clustering algorithm is to minimize the within-cluster cost, i.e., the measured similarity between the data cluster and the data points that belong to that cluster, and to maximize the out-of-cluster cost, i.e., the measured similarity between the data cluster and the data points that do not belong to that cluster.

A data factorization/decomposition model [27], [60], [80], [146] with constraints summarizes a general problem formulation that also subsumes the previously explained basic case. We express it as:

{Ŷ, D̂} = arg min_{Y,D} ∑_{i=1}^{CK} g(x_i, D y_i) + λ_0 f_s(y_i) + λ_1 f_c(y_i, θ) + λ_2 f_d(D),   (6.1)

where D = [d_1, ..., d_C] ∈ ℜ^{N×C} are the clusters, x_i ∈ ℜ^N is the i-th data point, y_i ∈ ℜ^C is its sparse data representation, θ are parameters responsible for a task specific functionality, g(·,·) is the similarity measure between the data points and the clusters, f_c(·,·) and f_s(·) are the task specific and sparsity penalty functions, f_d(·) is a penalty on the cluster properties and {λ_0, λ_1, λ_2} are Lagrangian parameters.

Usually, the cluster assignment in such a factorization/decomposition (6.1) boils down to solving an inverse problem, where x_i is reconstructed and represented by a sparse linear combination y_i over the clusters D as x_i ≃ D y_i. The crucial element in any clustering which is based on a point-to-cluster assignment score is the used measure g(·,·) for the similarity between that data point and the data clusters. In addition, the penalty functions f_c(·,·), f_s(·) and f_d(·) play a significant role, e.g., structured sparsity, pairwise constraints, subspace modeling, graph induced penalties and manifold curvature preservation, in the cluster estimation and impact the resulting cluster assignment.

In the past, although many data clustering measures were studied, little attention was dedicated to measures that reflect a notion of a joint similarity and dissimilarity score between a data point and a set of data points. Data assignment concepts that allow unsupervised discrimination and clustering not necessarily in the original data domain, but instead in some nonlinear transform domain where a transform model is used, were not addressed. The main open issues are the data model and the appropriate priors, which delimit the problem formulation and the definition of a suitable objective. In addition, the data variabilities and uncertainties make the definition of discrimination even more challenging.
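As a point of reference for the general model (6.1), the NumPy sketch below shows the classical k-means-style special case: g(x, Dy) = ‖x − Dy‖²_2 with y restricted to one-hot vectors and no additional penalties, so each point is simply assigned to its nearest cluster centre. This is only the baseline that (6.1) subsumes, not the nonlinear transform based assignment developed in this chapter.

import numpy as np

def kmeans_style_assignment(X, D):
    # Hard cluster assignment: nearest column of D for every column of X.
    d2 = (np.sum(X ** 2, axis=0)[:, None] - 2 * X.T @ D + np.sum(D ** 2, axis=0)[None, :])
    labels = np.argmin(d2, axis=1)
    Y = np.zeros((D.shape[1], X.shape[1]))
    Y[labels, np.arange(X.shape[1])] = 1.0     # one-hot sparse representations y_i
    return labels, Y

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 100))              # N x CK data matrix
D = rng.standard_normal((5, 4))                # N x C cluster matrix
labels, Y = kmeans_style_assignment(X, D)
print(labels[:10], Y.sum() == X.shape[1])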

6.1 Approach Outline and Contributions

In this section, we address some of the previously mentioned open issues and challenges. In order to produce a sparse and discriminative NT representation that is useful for recognition applications, as well as to enable data clustering in a transform domain based on a similarity/dissimilarity measure, we study the joint modeling and learning of multiple nonlinear transforms (NTs).

6.1.1 Joint Modeling of NTs with Priors

Our joint modeling of NTs addresses the problem of estimating the parameters that model the probability p(y_i | x_i, A) = ∫_θ p(y_i, θ | x_i, A) dθ, where A ∈ ℜ^{M×N} is the linear projection map, y_i ∈ ℜ^M is the assigned NT representation and θ are NT specific parameters. After applying the linear mapping A x_i, the parameter pair {τ_{c1}, ν_{c2}} from θ = {θ_1, θ_2} = {{τ_1, ..., τ_{C_d}}, {ν_1, ..., ν_{C_s}}} ∈ ℜ^{M×(C_d+C_s)} is used for the generalized element-wise nonlinearity to get the corresponding NT representation and plays a key role in describing our formal notion of discrimination as well as the cluster assignment setup.

As a learning model, we consider p(Y, A | X) = p(Y | X, A) p(A | X) = ∏_{i=1}^{CK} [ ∫_θ p(y_i, θ | x_i, A) dθ ] p(A | x_i). Its maximization over Y, θ and A is difficult. Therefore, we address a point-wise

approximation to the marginal ∫_θ p(y_i, θ | x_i, A) dθ, which allows us to derive an efficient algorithm for model parameter estimation and clustering.

The fundamental difference of our joint modeling of multiple NTs, which we address with our learning and clustering algorithm, compared to the factorization/decomposition based clustering methods (6.1) consists in the used data model. The factorization/decomposition model addresses the problem of joint data reconstruction and cluster estimation with constraints. Our nonlinear transform modeling addresses the problem of joint learning of discriminative data projections, i.e., nonlinear transforms, and linear map estimation with constraints, without any explicit data reconstruction.

6.1.2 Simultaneous Cluster and NT Representation Assignment

In our cluster assignment, we do not rely on the usual description with clusters. Instead, we rely on NTs. Compactly, we estimate a single NT representation as a solution to a direct problem that represents a constrained projection problem. The NTs have a similar role as the clusters; the difference is that for x_i we would have one representation w.r.t. the clusters, as in (6.1), whereas we apply multiple NTs to the data point x_i, which results in different NT representations of x_i that depend on the discrimination parameters θ of our NTs.

A common clustering assignment that can be derived from (6.1) is based on the similarity score between a point and the clusters. Analogously, we introduce an assignment based on a min-max similarity/dissimilarity score over the estimated NT representations and the corresponding discrimination parameters from the NTs. It represents one form of approximation to a discrimination log likelihood. What is interesting to note is that the actual NT representation is also obtained in the form of an assignment from the set of available NT representations. That is, we simultaneously assign the cluster and estimate the representation y_i.
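The following sketch only illustrates the shape of such a simultaneous assignment: every parameter pair (τ_{c1}, ν_{c2}) yields a candidate representation of x_i, each candidate is scored, and the best pair gives both a cluster index (indexed as in Fig. 6.2) and the returned representation. The candidate construction and the score below are illustrative placeholders; the exact score and estimation procedure are defined later in this chapter.

import numpy as np

def soft_threshold(q, t):
    return np.sign(q) * np.maximum(np.abs(q) - t, 0.0)

def simultaneous_assignment(A, x, taus, nus, score):
    # Enumerate candidate NT representations per (tau_c1, nu_c2) pair and keep the
    # best-scoring one, returning (cluster index, representation) simultaneously.
    q = A @ x
    Cs = len(nus)
    best = None
    for c1, tau in enumerate(taus, start=1):
        for c2, nu in enumerate(nus, start=1):
            y = soft_threshold(q, np.abs(tau))            # placeholder candidate representation
            s = score(y, tau, nu)
            if best is None or s < best[0]:
                best = (s, Cs * (c1 - 1) + c2, y)
    return best[1], best[2]

# Placeholder score: small when y overlaps nu (similarity) and avoids tau (dissimilarity).
score = lambda y, tau, nu: np.sum(np.abs(y * tau)) - np.sum(np.abs(y * nu))

rng = np.random.default_rng(0)
A = rng.standard_normal((24, 12))
x = rng.standard_normal(12)
taus = [rng.standard_normal(24) for _ in range(2)]
nus = [rng.standard_normal(24) for _ in range(2)]
print(simultaneous_assignment(A, x, taus, nus, score))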

6.1.3 Contributions

In the following, we outline our main contributions:
(i) We introduce a novel discriminative clustering and assignment principle that is centered around two elements: (1) a joint modeling and learning of nonlinear transforms (NTs) with priors and (2) a simultaneous cluster and NT representation assignment based on an approximation to a discrimination prior log likelihood. To the best of our knowledge, our novel discriminative assignment principle is the first of this kind that:
(a) addresses a trade-off between robustness in the cluster assignment and the NT representation compactness by allowing a reduction or extension of the NT dimensionality while increasing or decreasing the number of the discrimination parameters,
(b) offers discriminative cluster assignment over a wide class of similarity score functions, including a min-max score, while enabling efficient estimation of the NT representation,
(c) allows a rejection option and cluster grouping over continuous, discontinuous and overlapping regions in the transform domain.

(ii) We propose an efficient learning strategy that jointly estimates the parameters of the NTs. We implement it by an iterative alternating algorithm with three steps. At each step we give an exact and an approximate closed form solution.

(iii) We present numerical experiments that validate our model and learning principle on several publicly available data sets. Our preliminary results on an image clustering task demonstrate advantages in comparison to the state-of-the-art methods, w.r.t. the computational efficiency in training and test time and the used clustering performance measures.

6.2 Related Work

In the following, we describe the related prior work.

K-means, Subspace and Manifold Clustering − The basic k-means [27] algorithm for clustering considers only a "hard" assignment. In addition, with the intention to capture a nonlinear structure of the data with outliers and noise, the kernel k-means (KKM) algorithms [34] and [21] have been proposed. Many different subspace clustering methods were proposed [146], [92], [91], [87], [41] and [16]. Commonly, they consist of (i) subspace learning via matrix factorization and (ii) grouping of the data into clusters in the learned subspace. Some authors [31] even include a graph regularization into the subspace clustering. In our approach, we address the clustering in the transform domain using learned nonlinear transforms with priors.

Discriminative Clustering − In [160], clustering with maximum margin constraints was proposed. The authors in [7] proposed linear clustering based on a linear discriminative cost function with a convex relaxation. In [80], regularized information maximization was proposed and simultaneous clustering and classifier training was performed. The above methods rely on kernels and incur a high computational complexity. Our method learns to reduce or extend the dimensionality through a nonlinear transform and simultaneously (i) estimates the nonlinear transform representation and (ii) assigns the cluster by evaluating the likelihood of the discrimination prior.

d1

d6 d7

xi d8 d2

d5 d4 d3 Data Sample Clusters Assigned Cluster

Fig. 6.1 An illustration of the cluster assignment based on a similarity measure d(.,.) between xi and the clusters d j, j = 1,..,8 , i.e., jˆ = 7 = argmin j m(xi,d j). { } nonlinear transform representation and (ii) assigns the cluster by evaluating the likelihood of the discrimination prior.

Matrix Factorization Models and Dictionary Learning Factor analysis [20] and matrix factorization [60] rely on the decomposition on the hidden features, with or without con- straints. When discrimination constraints are present they act as regulizers that were mainly defined using labels in the discriminative dictionary learning methods (DDL)[52], [69], [19] and [138]. We address unsupervised cluster evaluation and assignment with discrimination constraints.

Self-Supervision, Self-Organization and Auto-Encoders In self-supervised learning [35], [112] the input data determine the labels. In self-organization [74], [145] a neighborhood function is used to preserve the topological properties of the input space. Both of the approaches leverage implicit discrimination using the data. In the commonly used single layer auto-encoders [10] and [147] robustness to noise and reconstruction was considered, which can also be considered as one form of the principle related to identification and description of the underling encoding elements that explain the data.

6.3 Joint Modeling of Nonlinear Transforms With Priors

We propose novel discriminative clustering and assignment principle that is centered around two elements: (i) a joint modeling and learning of nonlinear transforms (NTs) with priors and (ii) clustering and assignment based on approximation to a discrimination likelihood. 128 Clustering with NT and Discriminative Min-Max Assignment

ν y 2,2 2 ν2 |{ } y 1,2 |{ } qi =Axi qi

τ2 τ2 y 2,1 |{ } ν1 ν1 y 1,1 τ1 |{ } τ1

Pulling Force Candidate NTs: y c ,c , c1, c2 1, 2 |{ 1 2} ∈ { } Pushing Force Simultaneous assignment of: Linear Transform Representation cluster index & NT representation a) b)

Fig. 6.2 An illustration of the proposed simultaneous cluster and NT representation as-

signment. qi = Axi is the linear transform representation, y c1,c2 is NT representation, |{ } τ c ,ν c are element-wise nonlinearity parameters with discrimination role. There are in { 1 2 } total of 4 NT representations, determined by all pairs c1,c2 1,2 1,2 . Simultane- { } ∈ { } × { } ously, the data point xi is assigned to cluster index c = 2(c1 1) + c2 = (2 1)2 + 2 = 4 − − and the NT representation is estimated as yi = y 2,2 based on the discriminating min-max similarity/dissimilarity score. |{ }

6.3.1 Nonlinear Transforms Modeling

We consider the marginal probability:

p(yi xi,A) = p(yi xi,θ ,A)dθ (6.2) | θ | Zθ

where we model the nonlinear transform representation yi given the data representation xi and the parameters A and θ . Furthermore, we use the Bayesian rule and focus on the factor proportional relation:

p(yi,θ xi,A) ∝ p(xi yi,θ ,A)p(yi,θ A). (6.3) | | |

The probability p(xi yi,θ ,A) takes into account the NT errors and the discrimination param- | eter adjustment error. In addition, we simplify our discrimination prior as:

p(θ ,yi A) = p(θ ,yi), (6.4) | 6.3 Joint Modeling of Nonlinear Transforms With Priors 129 where: θ = θ 1,θ 2 , { } θ 1 = τ 1,...,τC , (6.5) { d } θ 2 = ν 1,...,νC . { s } Joint Nonlinear Transform and Parameters Adjustment Model Our nonlinear transform together with their corresponding discrimination parameters is modeled using p(xi yi,θ ,A) | and p(θ ,yi). We define p(xi yi,θ ,A) using two measures as follows: | 1 1 p(xi yi,θ ,A) ∝ exp( ur(Axi,yi) ua(Axi,θ )), (6.6) | −β0 − βa where ur(Axi,yi) and ua(Axi,θ ) take into account the NT and the discrimination parameter adjustment errors, respectively and β0 and βa are scaling parameters. Assignment Based Nonlinear Transform A compact description of our assignment − model can be defined as:

fθ : yi =y cˆ1,cˆ2 , wherec ˆ1,cˆ2 = argminsP(c1,c2), (6.7) | c1,c2 and sP(.) is a score over the discrimination prior measure, which we will explain in more details in the following text, while a single nonlinear transform is defined as:

Axi =yc ,c + zi, where : 1 2 (6.8) yc ,c =T (xi), 1 2 Pc1,c2

N M and T (xi) : ℜ ℜ is the parametric nonlinear function that produces yi, by using Pc1,c2 → the set of parameters Pc1,c2 . In this way, alternately, we also say that we model CdCs nonlinear transforms, TT =

TP ,..., TP , are defined by the corresponding set of parameters PT = P1,1,..., { 1,1 Cd,Cs } { PC ,C , where: d s } M N M M Pc ,c = A ℜ × ,τ c ℜ ,ν c ℜ , c1,c2 Cd Cs. (6.9) 1 2 { ∈ 1 ∈ 2 ∈ } ∀{ } ∈ × where Cd = 1,...,Cd and Cs = 1,...,Cs . All nonlinear transforms T in the set { } { } Pc1,c2 TT share the linear map A and have distinct τ c and ν c . A single T from the set 1 2 Pc1,c2 TT is indexed using the index pair c1,c2 or using the single index computed as c = { } c2 + (c1 1)Cs. − Note that yi is evaluated using an assignment over one of the NT representations y c,k that | result from applying the nonlinear transform T (xi) on Axi. Therefore, we can say that Pc1,c2 130 Clustering with NT and Discriminative Min-Max Assignment

the term zi = Axi yi is the nonlinear transform error vector that represents the deviation of − Axi from the targeted transform representation yi. In the simplest form, we assume zi to be Gaussian distributed and we model:

2 ur(Axi,yi) = Axi yi . (6.10) ∥ − ∥2 NT Parameters Adjustment We assume that the adjustment of any of the NT discrimi- − nation parameters is w.r.t. the measure ua(.) that we define as:

2 ua(Axi,θ ) = minmin Axi τ c1 ν c2 2. (6.11) c1 c2 ∥ − − ∥

As we will see in the latter subsection, this measure coupled with the discrimination parameter

prior measure enforces the parameters τ c1 and ν c2 to decompose the linear projections Axi into two distinct parts, one related to similarity and the other to dissimilarity. It is crucial for a proper adjustment to the linear mapping and the ability of the NT to discriminate in the transform domain based on θ .

6.3.2 Priors Modeling

A prior p(A) is used to allow adequate regularization of the coherence and conditioning [79],

[125] and [122] on the transform matrix A, whereas the joint modeling of the CsCd NTs is enabled by using the prior p(θ ,yi).

Minimum Information Loss Prior We have a prior on A, i.e., p(A) ∝ exp( Ω(A)), that − we explained in details in Chapter4. This prior penalizes the information loss in order to avoid trivially unwanted matrices A, i.e., matrices that have repeated or zero rows by regularizing the conditioning and the expected coherence of A.

Discrimination Prior The discrimination prior is modeled as:

1 1 1 p(θ ,yi) ∝ exp fc(θ ,yi) up(θ ) yi 1 , (6.12) −β − β − β ∥ ∥  d E 1 

where fc(θ ,yi) is a NT representation and up(θ )is NT parameter discrimination measures, respectively, while yi 1 is our sparsity measure and β1,βd and βE are scaling parameters. ∥ ∥ NT Representation Discrimination Measures To define fc(θ ,yi) we assume that: − (i) The relation between θ and yi is determined on the support intersection between yi,

τ c1 and ν c2

(ii) The min-max description is decomposable w.r.t. τ c1 and ν c2 6.3 Joint Modeling of Nonlinear Transforms With Priors 131

(iii) The support intersection relation is specified based on two measures defined onthe support intersection. We use the same two measures ρ and ς that were introduced in Section 5.2, but now + + 2 we define them w.r.t. yi as ρ(yi,y j) = yi− y−j 1 + yi y j 1 and ς(yi,y j) = yi y j 2. + + + ∥ ⊙ ∥ ∥ ⊙ ∥ ∥ ⊙ ∥ where yi = y y−,y j = y y−, y = max(yi,0) and y− = max( yi, 0). i − i j − j i i − Based on the above (i), (ii) and (iii), a functional fc(yi,θ ) is defined as follows:

ρ(yi,τ c1 ) fc(yi,θ )=min + ς(yi,τ c1 ) , (6.13) c1 max ρ(y ,ν )  c2 i c2 

The measure (6.13) ensures that yi in the transform domain will be located at the point where: The dissimilarity contribution w.r.t. τ c is the smallest measured w.r.t. ρ(.) − 1 The strength of the support intersection w.r.t. τ c is the smallest measured w.r.t. ς(.) − 1 The similarity contribution w.r.t. ν c is the largest measured w.r.t. ρ(.). − 2 To add to the understanding of the prior dynamics we explain it as a physical field of conflicting forces ρ(yi,τ c1 ) and ρ(yi,ν c2 ) that act on the point (transform representation) yi. The force minc1 ρ(yi,τ c1 ) with the smallest contribution for similarity measured by ρ(.) is pushing away from τ c1 and the force maxc2 ρ(yi, ν c2 ) with the largest contribution for similarity measured by ρ(.) is pulling towards ν c2 . The score fc(yi,θ ) is the equilibrium between the strongest force that pulls and strongest force that pushes. An illustration is given in Figure 6.2 b). Parameters Discrimination Measure The measure up(θ ) is defined as: −

Cd Cs u (θ ) = f (τ , θ ) + f (ν , θ ), p θ ∑ c τ c1 θ c1 ∑ c ν c2 θ c2 (6.14) \ \ c1=1 c2=1 where: θ = θ 1,θ 2 , while: { } (6.15) θ c1 = τ 1,...,τ c1 1,τ c1+1,...,τCd ,θ 2 , \ {{ − } }

θ c2 = θ 1, ν 1,...,ν c2 1,ν c2+1,...,νCs . \ { { − }} The advantage of using (6.14) is that: (i) it allows non-uniform cover of the transform space in arbitrarily coarse or dense way, (ii) it gives a possibility to represents a wide range of transform space regions, including non-continues, continues and overlapping regions (iii) at the same time it enables θ to describe and to be concentrated on the most important part of the transform space related to discrimination. 132 Clustering with NT and Discriminative Min-Max Assignment

6.4 Problem Formulation and Learning Algorithm

Minimizing the exact negative logarithm of our learning model p(Y,A X) = p(Y X,A)p(A X) = CK | | | ∏ [ p(yi,θ xi,A)dθ ] p(A xi) over Y,θ and A is difficult since we have to integrate in i=1 θ | | order toR compute the marginal and the partitioning function of the discrimination prior.

6.4.1 Problem Formulation

Instead of minimizing the exact negative logarithm of the marginal p(yi,θ est xi,A)dθ est, θ est | we consider minimizing the negative logarithm of its maximumR point-wise estimate, i.e., p(yi,θ est xi,A)dθ est Dp(yi,θ xi,A), where we assume that θ are the parameters for θ est | ≤ | which p(yi,θ est xi,A) has the maximum value and D is a constant. Furthermore, we use the R | proportional relation (6.3) and by disregarding the partitioning function related to the prior (6.4), we end up with the following problem formulation: Yˆ ,Aˆ ,θˆ = { } log p(xi yi,θ ,A) − | CK 1  2 arg min ∑ Axi yi 2+ λ2ua(Axi,θ )+ Y,A,θ i= 2∥ − ∥ 1 z }| { (6.16)    log p(yi,θ ) log p(A) − − λ0 fc(yi,θ ) + λ1 yi 1 + λEup(θ ) + fd(A) , ∥ ∥  z }| { z }| {   where 2,λ0,λ1,λ2,λE are parameters inversely proportional to β0,βd,β1,βa,βE . { } { }

6.4.2 The Learning Algorithm

Note that, solving (6.16) jointly over A, θ and Y is again challenging. Alternately, the solution of (6.16) per any of the variables A,θ and Y can be seen as an integrated marginal maximization (IMM) of p(Y,A X) = p(Y X,A)p(A X) that is approximated by the factored CK | | | one, i.e., ∏ p(xi yi,θ ,A)p(yi,θ )p(A xi), which is equivalent to: i=1 | | 1) Approximately maximizing with p(xi yi,θ ,A) and the prior p(θ ,yi) = p(θ yi)p(yi) | | over yi CK 2) Approximately maximizing with ∏ p(xi yi,θ ,A) and the prior p(θ ,yi) = p(yi θ )p(θ ) i=1 | | over θ CK 3) Approximately maximizing with ∏ p(xi yi,θ ,A) and the prior p(A) = p(A xi) over i=1 | | A. 6.4 Problem Formulation and Learning Algorithm 133

In this sense, based on the IMM principle, we propose an iterative, alternating algorithm that has three stages: (i) cluster and NT representation yi assignment, (ii) discrimination parameters θ update and (iii) linear map A update. Stage 1: Simultaneous Cluster and NT Representation Assignment Given the data sam- ples X and the current estimates of A and θ , the NT representation estimation problem is formulated as:

CK ˆ 1 2 Y = argmin AX Y F + ∑(λ0 fc(yi,θ ) + λ1 yi 1), (6.17) Y 2∥ − ∥ i=1 ∥ ∥ where λ0,λ1 are inversely proportional to the scaling parameters β0,β1 . { } { } We propose a solution for this stage that consists of two steps: (i) NT representations estimation and (ii) cluster index and NT representation assignment based on a min-max discrimination score.

NT Representations Estimation Given A,Y i = [y1,...,yi 1,yi+1,...,yCK] and θ , de- − \ − noting qi = Axi for any yi, problem (6.17) reduces to a constrained projection

1 2 T (PDRE R) : yˆi = argmin qi yi 2 + λ0 fc(yi,θ ) + λ11 yi . (6.18) − yi 2∥ − ∥ | |

T Using fc(yi,θ ) and assuming that v y = 0, per each pair τ c ,ν c , c1,c2 Cd Cs , | | ̸ { 1 2 } { } ∈ { × } (6.18) has a closed form solution as:

(6.19) y c1,c2 = sign(qi) max( qi t,0) k, |{ } ⊙ | | − ⊘ where e 1 t = λ0( v + g) λ11, h2 h − k = (1 + 2λ0τ c τ c ), 1 ⊙ 1 (6.20) + g = sign(max(qi,0)) τ + sign(max( qi,0)) τ − , ⊙ c1 − ⊙ c1 + v = sign(max(qi,0))ν + sign(max( qi,0))ν − . c2 − c2 The variable e is:

2 (h + cs) T 1 T T e = g qk λ0 g gk λ1g lk cs , (6.21) h2 + λ gT v | | − h + c − − 0 k  s  while cs = 0, vk = v k,gk = g k, qk = q k and h is a solution to a quartic polynomial ⊘ ⊘ | | | |⊘ (the proof is given in Appendix D.1). Discriminative Assignment This step consists of two parts. − 134 Clustering with NT and Discriminative Min-Max Assignment

Part 1 Given the estimated y c1,c2 , c1,c2 Cd Cs, the first part evaluates a |{ } { } ∈ × score related to fc(yi,θ ) as follows:

ρ(y c1,c2 ,τ c1 ) |{ } sP(c1,c2) = + ς(y c1,c2 ,τ c1 ). (6.22) ρ(y c1,c2 ,ν c2 ) | } |{ }

Part 2 Based on the score (6.22), the second part simultaneously assigns the cluster index and the NT representation yi using:

cˆ1,cˆ2 =argminsP(c1,c2), (6.23) { } c1,c2

cˆ =cˆ2 + (cˆ1 1)Cs, −

yˆi =y cˆ1,cˆ2 . (6.24) |{ }

Note that the approximation on the maximum log likelihood for discrimination w.r.t. fc(yi,θ ) is equivalent to computing a minimum score over sP as in (6.24).

Stage 2: Parameters θ Update Given the estimated NT representations yi and the linear map A, the problem related to update of the parameters θ reduces to the following form:

CK ˆ θ = argmin ∑ [ua(Axi,θ )+λ0 fc(yi,θ )]+λEup(θ ), (6.25) θ i=1

where λE is inversely proportional to βE and up(θ ) is the measures described in section 6.3.2.

Simplification Under Known yi Note that in the discriminative assignment step − (Stage 1, part 2), for each yi the corresponding τ c1 and ν c2 are known,

(Ass) : yi, y c ,c ,τ c ,ν c . (6.26) { { | 1 2 1 2 }}

In the following, we explain how we use relation (Ass), to simplify problem (6.25) and update the parameters θ .

First, using Ass, we denote the indexes c1 and c2 of the corresponding parameters τ c1 and

ν c2 that are used in the evaluation and the assignment of Axi to yi, as follows:

z1(i) = c1 : yi, y c ,c ,τ c ,ν c , { { | 1 2 1 2 }} (6.27) z2(i) = c2 : yi, y c ,c ,τ c ,ν c , i 1,...,CK . { { | 1 2 1 2 }} ∀ ∈ { } 6.4 Problem Formulation and Learning Algorithm 135

Second, knowing Ass, we do not evaluate the terms fc(yi,θ ) and minc1 minc2 Axi τ c1 2 2 ∥ − − ν c . Moreover, we use Ass, to express minc minc Axi τ c ν c as: 2 ∥2 1 2 ∥ − 1 − 2 ∥2

1 2 1 2 minmin Axi τ c1 ν c2 2 = Axi τ z1(i) ν z2(i) 2, (6.28) 2 c1 c2 2 ∥ − − ∥ (Ass) ∥ − − ∥

Update Per Single τ c1 Given A, Y, θ c1 , and using (Ass) and (6.28), problem (6.25), − \ per τ c1 reduces to: 1 ρ(y ,τ ) τˆ = argmin Ax τ ν 2 + i c1 + (y , τ ) + τ c1 ∑ i τ c1 ν z2(i) 2 λ0 ς i τ c1 τ c 2∥ − − ∥ ρ(y ,ν ) 1 i: " i z2(i) !# (6.29) z1(i)==∀ c1

λE fc(τ c1 ,θ c1 ), \ The solution for (6.29) is similar to the solution for (6.19) given by (6.22) and (6.24). The difference is that in (6.29) both of the respective thresholding and normalization vectors have additional terms (the proof is given in Appendix D.2).

Update Per Single ν c2 Given A, Y, θ c2 and using (Ass) and (6.28) problem (6.25), − \ per ν c2 reduces to:

1 ρ(yi,τ z (i)) νˆ = argmin Ax τ ν 2 + 1 + ν c2 ∑ i τ z1(i) ν c2 2 λ0 ν c 2∥ − − ∥ ρ(yi,ν c ) 2 i:  2  (6.30) z2(i)==∀ c2

λE fc(ν c2 ,θ c2 ). \

In this update, (6.30) is solved iteratively, where the solution per each iteration is identical to the solution for (6.19) given by (6.22) and (6.24). However, the difference is that ν c2 is estimated w.r.t. a thresholding vector that has an additional term (the proof is in Appendix D.3). Stage 3: Linear Map A Estimation Given the data samples X, the corresponding transform representations Y and the discrimination parameters θ , the problem related to the estimation of the linear map A, reduces to:

CK ˆ 1 2 1 2 A = argmin AX Y F + Axi τ z (i) ν z (i) 2+ A 2 2 ∑ 1 2 ∥ − ∥ i=1 ∥ − − ∥ (6.31) λ2 2 λ3 T 2 T A + AA I λ4 log detA A , 2 ∥ ∥F 2 ∥ − ∥F − | | where λ2,λ3,λ4 are inversely proportional to the scaling parameters β3,β4,β5 . We { } { } use the approximate closed form solution which is similar to the one given in Chapter4 136 Clustering with NT and Discriminative Min-Max Assignment

a)

b) c) Fig. 6.3 The evolution of a) the objective related to the problem of simultaneous cluster and NT representation assignment, b) the expected NT error and c) the expected discrimination min-max functional score per iteration for the proposed algorithm on the ORL [132], COIL [105], E-YALE-B [47] and AR [96] database.

1 2 (please see also Appendix B.1 and B.2). The only difference is the term 2 AX Y F + 1 CK 2 ∥ 2− ∥ ∑ Axi τ ν which can also be expressed in a forms as AX YT , where: 2 i=1 ∥ − z1(i) − z2(i)∥2 ∥ − ∥F (6.32) YT = y1 + τ z1(1) + ν z2(1),...,yCK + τ z1(CK) + ν z2(CK) ,   while τ z1(i) and ν z2(i) denote denote the corresponding τ c1 and ν c2 that appear in the NT, which is used to estimate yi, i 1,...,CK . ∀ ∈ { } 6.5 Evaluation of the Proposed Approach This section evaluates the advantages and the potential of the proposed algorithm and compares its clustering performance to the state-of-the-art methods. 6.5 Evaluation of the Proposed Approach 137

COIL ORL E-YALE-B AR κn(A)µ(A) t κn(A)µ(A) t κn(A)µ(A) t κn(A)µ(A) t * 16 .2e-5 46 21 .3e-5 48 31 .1e-5 51 28 .3e-5 69 Table 6.1 The computational efficiency per iteration t[sec] for the proposed algorithm, the conditioning number κn(A) and the expected mutual coherence µ for the liner map A.

6.5.1 Data Sets, Algorithm Setup and Performance Measures Data Sets The used data sets are E-YALE-B [47], AR [96], ORL [132] and COIL [105]. All the images from the respective datasets were downscaled to resolutions 21 21, 32 28, × × 24 24 and 20 25, respectively, and are normalized to unit variance. × × Algorithm and Clustering Setup The used setup is described in the following text. On-Line Version An on-line variant is used for the update of A w.r.t. a subset of the − available training set. It has the following form At+1 = At ρ(At Aˆ ) where Aˆ and At are − − the the solutions in the transform update step at iterations t + 1 and t, which is equivalent to having the additional constraint At Aˆ 2 in the related problem. The used batch size is ∥ − ∥F equal to 87%,85%,90% and 87% of the total amount of the available training data from the respective datasets E-YALE-B, AR, ORL and COIL. Clustering Setup, Cluster Index and NT Estimation We assume that the number of − clusters C per database is known. We set the number of parameters that are related to the dissimilarity τ c ,c1 1,...,Cd to be close to the number of actual clusters C, i.e., Cd = C 1 ∈ { } and we set the number of parameters ν c ,c2 1,...,Cs related to the similarity to be small, 2 ∈ { } i.e., Cs is small. The cluster index c and the NT are estimated based on the minimum score of the discriminative functional measure as explained in Section 6.4.2. As an evaluation metric for the clustering performance we use the cluster accuracy (CA) and the normalized mutual information (NMI) [18].

Algorithm Parameters, Initialization and Termination The parameters λ0 = λ1 = 0.03, − λE = 0.001, λ2 = λ3 = λ4 = 16, the transform dimension is M = 2100. The algorithm is initialized with A and θ having i.i.d. Gaussian (zero mean, variance one) entries and is terminated after the 100th iteration. The results are obtained as the average of 5 runs.

6.5.2 Numerical Experiments

Summary Our experiments consist of three parts. NT Properties In the first series of the experiments, we investigate the properties − of the proposed algorithm. We measure the run time t of the proposed algorithm, the conditioning number κ (A) = σmax (σ and σ are the smallest and the largest singular n σmin min max 138 Clustering with NT and Discriminative Min-Max Assignment

COIL ORL E-YALE-B AR CA % 89.2 75.4 96.8 94.8 NMI % 91.2 84.1 95.3 94.1 Table 6.2 The clustering performance over the databases COIL, ORL, E-YALE-B and AR evaluated using the Cluster Accuracy (CA) and the Normalized Mutual Information (NMI) metrics.

CA % NMI % COIL ORL YALE-B COIL ORL YALE-B CASS [89] 59.1 68.8 81.9 CASS [89] 64.1 78.1 78.1 GSC [168] 80.9 61.5 74.2 GSC [168] 87.5 76.2 75.0 NSLRR [164] 62.8 55.3 / NSLRR [164] 75.6 74.5 / SDRAM [54] 86.3 70.6 92.3 SDRAM [54] 89.1 80.2 89.1 RGRSC [73] 88.1 76.3 95.2 RGRSC [73] 89.3 86.1 94.2 ( ) 89.2 75.4 96.8 ( ) 91.2 84.1 95.3 ∗ ∗ Table 6.3 A comparative results between state-of-the-art [89], [168], [164], [54] and [73], and the proposed method ( ). ∗

values of A, respectively) and the expected mutual coherence µ(A), which we compute as described in Chapters4 and5 of the shared linear map A in the learned NTs. In addition, we show the evolution of the objective related to the problem of simultaneous cluster and NT representation assignment, the expected NT error and the expected min-max functional score per iteration for the proposed algorithm. Clustering and k-NN Classification Performance In the second part, we measure the − cluster performance across all databases and report the CA and NMI. In this part, we also split every databases on training and test set and learn NTs with the proposed algorithm on the training set. We use the learned NTs to assign a representation for the test data. Then we preform a k-NN [27] search using the test NT representation on the training NT representation. Proposed Method vs State-Of-The-Art This part compares the proposed method w.r.t. − results reported by five state-of-the-art methods, including: GSC[168], NSLRR [164], SDRAM [54] and RGRSC[73].

COIL ORL E-YALE-B AR acc. NT 97.1 96.9 96.8 96.0 acc. OD 94.0 94.5 93.4 91.6 Table 6.4 The k-NN accuracy results using assigned NT representations and original data (OD) representation. 6.6 Summary 139

Evaluation Results The results are shown in Tables 6.1, 6.2, 6.3 and 6.4, and Figure 6.3. NT Properties The learned NTs for all the data sets have relatively low computational − time per iteration. All NT have good conditioning numbers and low expected coherence. In Figure 6.3 is shown the evolution of the objective values where we note a decreasing behavior per iteration on all 3 Figures. The learning algorithm is able to estimate the parameters in the NTs that minimize the score of the discriminative measure. Clustering Performance The results of the clustering performance over the databases − E-YALE-B [47], AR [96], ORL [132] and COIL [105] are shown in Figure 6.2. We see that both the CA and the NMI measures have high values. The highest performance is reported on the E-YALE-B [47] databases where the CA and NMI are 96.8% and 95.3%, respectively. k-NN Classification Performance The results of the k-NN performance on all databases − is shown in Figure 6.4. As a baseline we use k-NN on the original data and report improve- ments of 3.1%, 2.4%, 3.3% and 4.4% over the baseline results for the respective databases. Proposed vs State-Of-The-Art Clustering The results are shown on Figure 6.3. As we − see the proposed algorithm outperforms the state-of-the-art methods CASS [89], GSC [168], NSLRR [164], SDRAM [54] and RGRSC[73]. The highest gain in CA and NMI w.r.t. the state-of-the-art is 1.6% and 1.9%, respectively, that is achieved on the E-YALE-B [47] and the COIL [105] databases, respectively.

6.6 Summary

A novel clustering concept was introduced where we (i) jointly learn the NTs with priors and (ii) simultaneously assign the cluster and the NT representation based on the maximum likelihood over functional measure. Given the observed data, an empirical approximation to the maximum likelihood of the model gives the corresponding problem formulation. We proposed an efficient solution for learning the model parameters by a low complexity iterative alternating algorithm. The proposed algorithm was evaluated on publicly available databases. The preliminary results showed promising performance. In a clustering regime w.r.t. the used CA and NMI measures, the algorithm gives improvements compared to the state-of-the-art methods. In unsupervised k-NN classification regime, it demonstrated high classification accuracy. Performance evaluation on other data collections, together with comparative evaluation for other similarity dissimilarity measures is left for future work.

Chapter 7

Conclusions

In this chapter, we summarize the main conclusions for the presented work based on the nonlinear transform model that allowed us to cover a number of applications. These include ACFP, image denosing, learning of NT for sparse and discriminative representations that are useful for image recognition tasks and clustering.

7.1 NT Model and IMM Principle

In our work, contrary to the synthesis model, which was based on data reconstruction, we introduced the novel generalized nonlinear transform model, which is based on constrained data projection. In order to accordingly estimate the parameters of the different versions of our NT model, we proposed the generalization to an integrated maximum marginal principle. It is based on an approximation to the empirical expectation of the model negative log likelihood. We showed an approximate and exact closed form solutions as well as that the implementations by the iterative alternating algorithms enables efficient solutions across the considered applications. Moreover, we provided the theoretical guarantee regarding the convergence to a local minimum for the majority of the proposed algorithms.

7.2 NT for ACFP

We considered several practical ACFP setups. We introduced and analyzed the generalized problem formulation. Along this way, under linear modulation and linear feature map, we presented a reduction of the ACFP problem to a low complexity constrained projection problem and under invertible linear feature maps, we proposed the closed form solution 142 Conclusions

as well as, we addressed approximations to the linear feature map in order to attain low modulation distortion. Furthermore, in order to estimate the data adaptive linear map that provides a small ACFP modulation distortion and features with the targeted properties, we presented a novel problem formulation that jointly addresses the fingerprint learning and content modulation. We proposed a solution with an iterative alternating algorithm with global optimal solutions for the respective iterative steps and we provided a convergence guarantee to a local optimal solution. Finally, we introduced the concept of ACFP-LR. We proposed the novel general problem formulation described by a latent representation, extractor and reconstructor functions. The simplified problem formulation with linear modulation addresses the estimation ofalatent data representation with constraints on the modulation distortion, while using linear feature maps. Since we only considered predefined fixed linear maps, one future extension ofthis approach can be seen in the direction, where the linear maps are learned. Based on the provided computer simulation using local image patches extracted from pub- licly available data set we highlight that ACFP-LR demonstrated superior performance under AWGN, lossy JPEG compression and affine geometrical transform distortions compared to the PCFP, ACFP and ACFPL schemes.

7.3 NT for Image Denoising

We showed that by discarding the discrimination parameters, our NT model reduces to the sparsifying transform model. In the same chapter, we considered the learning problem for the sparsifying transform model with a non-structured overcomplete transform matrix. We presented an iterative, alternating algorithm that has two steps: (i) linear transform update and (ii) sparse coding. The key was the linear transform update step, where we introduced a novel problem formulation, which itself was a lower bound of the original objective for the same update step. This allowed us to not only propose an approximate closed form solution, but also gave us the possibility to find the update that can lead to accelerated local convergence. Atthe same time it enabled us to estimate an update that provides a satisfactory solution under a small amount of training noisy image patches. Although, we proved a convergence guarantee for the iterative algorithm, we leave the exact characterization and the fundamental limit in the trade-offs of the transform update step 7.4 NT for Image Recognition 143 for future work. We also point out that we analyzed the lower bound only w.r.t. a trade-off. Another possibly useful analysis is towards a measure for robustness. We validated the algorithm with numerical experiments, where the results confirm that the main advantage of the current version of the proposed algorithm is the implicit preservation of the gradient and the ability to rapidly decrease the transform error and thereby the objective per iteration. This resulted in the fast convergence and small data requirements to learn the model parameters, while achieving competitive denoising performance w.r.t. the compared methods from the same class. Finally, another interesting direction is towards learning a sparsifying transform model with denoising specific self-collaboration. This idea, is similar in spirit to the approach proposed in Chapter5, Section 3. However, the difference is that the target now would be robust denoising estimate with averaging over multiple estimates form the self-collaborating sparsifying transform models.

7.4 NT for Image Recognition

In the first subsection, we considered the face recognition problem from both machine learn- ing and information coding perspective, adopting an alternative way of visual information encoding and decoding. Our model for recognition was based on multilevel vector quan- tization that is conceptually equivalent to bag-of-words while the multilevel scheme bears similarity to CNN. A key ingredient in our approach was the robust estimate of the local NT representation, where we gave a generalization result with a solution of a constrained projec- tion problem and showed the correspondence of the sparse likelihood approximation to the maximum a posterior estimation. In addition, we provided a probabilistic interpretation to the hard thresholding, soft thresholding and ℓ1 -norm ratio constrained likelihood approximation. ℓ2 The computer simulation demonstrated improvements and competitive performance of the proposed method over the sparse representation based recognition on several popular face image databases in terms of recognition accuracy, computational complexity and memory usage. In the second subsection, we presented the novel approach for learning discriminative and sparse representations by introducing the nonlinear transform model with priors. We addressed the estimation of the model parameters by minimizing an approximation to the empirical log likelihood of our model. Concerning our novel discriminative prior, we analyzed the relation to the empirical risk and gave connections to a discriminative learning objective, that we addressed by our learning problem formulation. Based on an integrated marginal maximization principle, we proposed a low complexity learning strategy, that we 144 Conclusions

implemented using an iterative alternating algorithm with two steps that have exact and approximate closed form solutions. The efficiency of our approach, together with the potential usefulness of the nonlinear transform representations, was validated by numerical experiments in a supervised and unsupervised image recognition setup. The preliminary results w.r.t. the introduced measures and the recognition accuracy on the used databases showed promising performance. A study on the recognition capabilities for large scale databases is our next step. An extension considering maximization of the discrimination power in the transform domain, under supervised and unsupervised case is left for our future work. Another interesting direction would be to consider explicit estimation of the discrimination parameters while addressing an evaluation of the discrimination prior measure, which would correspond to a multi-class clarifier, that we also leave for future work. In the third subsection, we introduced a novel collaboration structured model with mini- mum information loss, collaboration corrective and discriminative priors for joint learning of multiple nonlinear transforms for image recognition task. The model parameters were learned by addressing an integrated marginal maximization that corresponds to minimizing an unnormalized empirical log likelihood of the model. An efficient solution was proposed by an iterative, coordinate descend algorithm with convergence guarantees. The preliminary results w.r.t. the introduced measure and the recognition accuracy on the used databases showed promising performance and advantages w.r.t. state-of-the-art methods.

7.5 Clustering with NT

Taking into account the full expressive power of our NT model, we introduced a novel discriminative clustering and assignment principle that was centered around two elements: a joint modeling and learning of nonlinear transforms with priors and, simultaneous cluster and NT representation assignment based on the approximation to a discrimination prior log likelihood. To the best of our knowledge, our novel discriminative assignment principle is the first of this kind that: (a) Addresses a trade-off between robustness in the cluster assignment and the NT repre- sentation compactness by allowing a reduction or extension of the NT dimensionality, while increasing or decreasing the number of the discrimination parameters (b) Offers a discriminative cluster assignment concept that can be implemented over a wide class of similarity score functions including a min-max while enabling efficient estimation of the NT representation 7.5 Clustering with NT 145

(c) Allows a rejection option and cluster grouping over continues discontinues and over- lapping regions in the transform domain.

Given the observed data, an empirical approximation to the maximum likelihood of the model gives the corresponding objective in the addressed problem formulation. To that end, we proposed an efficient solution for learning the model parameters by a low complexity iterative alternating algorithm. We evaluated the proposed principle on several publicly available image data sets. The preliminary results showed promising performance. In a clustering regime w.r.t. the used CA and NMI measures, the algorithm gives improvements compared to the state-of-the-art methods. In an unsupervised k-NN classification regime, it demonstrated high classification accuracy. We leave the analysis of the trade-off between the number of clusters, the number of NT and the transform dimension as well as of its fundamental limits as one future direction. In addition, we note that we only presented the principle based on hard assignment. Another interesting extension includes the principle of soft assignment based on aggregation function over the ”softly” selected NT representations, that we leave for future work.

Appendix A

A.1 Proof of Theorem 1

By definition, problem (3.22) implies that y = x + xe where xe is the error term. A multipli- cation from left side by the matrix F results in Fy = Fx +Fxe. This expression in term of the error is: Fxe = Fy Fx. − Define ts = Fy and to = Fx, the closest vector (in Euclidean sense) ts to to is a solution of the problem:

1 T tˆs = argmin (to ts) (to ts) ts 2 − − subject to (A.1)

ts e τ1. | | ≥ Theorem A.1: The optimal solution to (A.1) is:

tˆs = sign(to) max to ,τ1 . (A.2) ⊙ {| | } For the proof please see Appendix A.2. Replace the solution of (A.1) in the error term, use sign magnitude decomposition of Fx and reorder:

Fxe = sign(Fx) max Fx ,τ1 Fx ⊙ {| | } − (A.3) = sign(Fx) max τ1 Fx ,0 = te, ⊙ { − | | } if the pseudo inverse F† of F exists than the closed form solution to (3.22) is:

† y = x + F te  (A.4) 148

A.2 Proof of Theorem A.1

An equivalent problem representation to (A.1) is:

1 T tˆs = argmin (to ts) (to ts) ts 2 − − subject to (A.5) ts I (to 0) e τ1 I (to 0) ⊙ ≥ ≥ ⊙ ≥ ts I (to < 0) e τ1 I (to < 0). ⊙ ≤ − ⊙

Let λ 1 = λ 1,+ I (to 0) and λ 2 = λ 2, I (to < 0) then the Lagrangian for (A.5) is: ⊙ ≥ − ⊙ 1 T l (ts,λ 1,λ 2) = (to ts) (to ts) 2 − − (A.6) T T λ ( ts + τ1) λ (ts + τ1). − 1 − − 2

By setting the first order derivative of (A.6) with respect to ts, the optimal tˆs can be expressed ˆ ˆ in terms of the optimal dual variables λ 1 and λ 2, and this expression is: ˆ ˆ tˆs = to + λ 1 λ 2. (A.7) − Substituting (A.7) in (A.5), we have the dual problem:

ˆ ˆ 1 T λ 1,λ 2 = arg max ( λ 1 + λ 2) ( λ 1 + λ 2) { } λ 1,λ 2 − 2 − − − T T λ 1 ( to + τ1) λ 2 (to + τ1) − − (A.8) subject to

λ 1 e 0 ≥ λ 2 e 0. ≥ ˆ ˆ ˆ ˆ The optimal λ 1 and λ 2 must satisfy λ 1 λ 2 = 0, implies that i if λ1(i) = 0 then λ2(i) = 0 ⊙ ∀ or λ2(i) = 0 and conversely, under the constraints λ1(i) 0 and λ2(i) 0. Therefore, (A.8) ̸ ≥ ≥ can be split and solved independently for λ 1 and λ 2 under the constraint λ 1 λ 2 = 0. The ⊙ two sub-problems are: + ˆ 1 T T P : λ 1 = argmax λ 1 λ 1 λ 1 ( to + τ1) λ 1 − 2 − − (A.9) subject to

λ 1 e 0. ≥ A.2 Proof of Theorem A.1 149

ˆ 1 T T P− : λ 2 = argmax λ 2 λ 2 λ 2 (to + τ1) λ 2 − 2 − (A.10) subject to

λ 2 e 0. ≥

By taking the first order derivative in (A.9) and (A.10) with respect to λ 1 and λ 2 respectively ˆ ˆ and equaling to zero, the optimal solution for (A.8) that satisfes the constraint λ 1 λ 2 = 0, ˆ ˆ ⊙ λ 2 e 0 and λ 1 e 0 is: ≥ ≥

ˆ λ 1 = I (to 0) max τ1 to,0 ≥ ⊙ { − } (A.11) ˆ λ 2 = I (to < 0) max τ1 + to,0 , ⊙ { } substituting (A.11) back in (A.7), reordering and using the sign magnitude decomposition of to we have: ˆ ˆ tˆs = to + λ 1 λ 2 = sign(to)max to ,τ1 , (A.12) − {| | } that gives Theorem 2 

Appendix B

B.1 Proof for the Global Optimal Solution

Given X and the current estimate of Y, the estimate of the transform is a solution to the following problem:

ˆ 2 λ2 2 λ3 T 2 T A = argmin AX Y 2 + A F + AA I F λ4 log detA A . (B.1) A ∥ − ∥ 2 ∥ ∥ 2 ∥ − ∥ − | |

Theorem 2 (global optimal solution) Given X ℜN CK and Y ℜM CK, if and only if the ∈ × ∈ × joint decomposition T 2 T XX = UX ΣX UX (B.2) T T XY = UX ΣXY VXY , N N M N exists, where UX ℜ × is orthonormal, VXY ℜ × is per columns orthonormal and N N ∈ ∈ ΣX ,ΣXY ℜ are diagonal matrices with positive diagonal elements, then (B.1) has a ∈ × global minimum as 1 T A = VXY ΣAΣX− UX , (B.3)

ΣA(n,n) = σA(n), n,σA(n) 0 and σA(n) are positive solutions to ∀ ≥ 2 λ3 4 σX (n) 2λ3 2 σˆA(n) = arg min σA(n) + − σA(n) σ (n) σ 4(n) σ 2(n) − A X X (B.4) σXY (n) σX (n) σA(n) 2λ4 log . σX (n) − σA(n)

Proof of Theorem 2 Consider the equivalent trace form of (B.1)

T T Aˆ = argminTr (AX Y) AX Y + λ2Tr A A + A { − − } { } (B.5) T T T T λ3Tr (AA I) (AA I) λ4 log A A . { − − } − | | 152

T Note that since λ2 0,XX +λ2I is a symmetric positive definite matrix with all eigenvalues ≥ non-negative, therefore, it decomposes as

2 T T T T UX ΣX UX = UX ΣX UX UX ΣX UX = XX + λ2I. (B.6)

Let 1 T A = BD,D = UX Σ− UX , (B.7) Define T T T T T T T g1 = BDXY ,g2 = BB , g3 = (BDD B )(BDD B ) , (B.8) T T T T g4 = (BDD B ), g5 = log detBDD B , | | Then (B.1) equivalently is

Bˆ = argmin Tr g1 + Tr g2 + λ3Tr g3 g4 λ4g5. (B.9) B − { } { } { − } −

Asumme that B decomposes as T UBΣBVB, (B.10)

where ΣB is a diagonal matrix with positive diagonal elements, UB is column orthogonal and T VB is orthogonal square matrix. Moreover, let the following decomposition on XY exists

T T XY = UX ΣXY VXY , and substitute as UB = VXY ,VB = UX , (B.11)

then T 1 T T 1 Tr g1 = Tr UBΣBV UX Σ− U XY = Tr ΣBΣ− ΣXY . (B.12) { } { B X X } { X } The term T T T T 2 Tr g2 = BB = Tr (UBΣBV )(UBΣBV ) = Tr Σ , (B.13) { } { } { B B } { B} and T T 1 1 2 2 Tr g4 = BDD B = Tr ΣBΣ− Σ− ΣB = Tr Σ− Σ . (B.14) { } { } { X X } { X B} T T T T T 4 4 Tr g3 = (BDD B )(BDD B ) = Tr Σ− Σ , (B.15) { } { } { X B} T T T g5 =log detAA = log detD B BD = | | | | (B.16) 1 2 1 T 2 2 log detUX Σ− Σ Σ− U = log detΣ− Σ . | X B X X | | X B| B.2 Proof for the Approximate ε-Close Closed Form Solution 153

Finally, (B.1) is reduced to

N λ σ 2(n) 2λ ˆ ˆ 3 4 X 3 2 σB(1),...,σB(N) = arg min ∑ 4 σB(n) + 2 − σB(n) { } σB(1),...,σB(N) σ (n) σ (n) − n=1 X X (B.17) σXY (n) σX (n) σB(n) 2λ4 log , σX (n) − σA(n) equaling to zero the first order derivative of the objective (B.16) w.r.t. σB(n) and multiplying by σB(n) gives

2 λ3 4 σX (n) 2λ3 2 σXY (n) 4 4 σB(n) + 2 2 − σB(n) σB(n) 2λ4 = 0. (B.18) σX (n) σX (n) − σX (n) −

A closed form solution to (B.17) exists and depends on the discriminant of the quartic λ3 polynomial. Moreover, since 4 4 is positive a global minimum to (B.1) exists if and only σX (n) if the decompsition (B.2) exists 

B.2 Proof for the Approximate ε-Close Closed Form Solu- tion

T Consider the trace (B.9) of (B.1) . Note that since λ2 0,XX + λ2I is a symmetric positive ≥ definite matrix with all eigenvalues non-negative. Therefore it decomposes as:

2 T T T T UX ΣX UX = UX ΣX UX UX ΣX UX = XX + λ2I. (B.19)

Let the following decomposition exists

T T (B.20) UUX XY ΣUX XY VUX XY = UX XY .

Define 1 T A = BD, where D = UX ΣX− UX . (B.21) T Assume that B decomposes as UBΣBVB = B, where ΣB is a diagonal matrix with positive diagonal elements, UB is column orthogonal and VB is orthogonal square matrix and let

T T (B.22) UB = (UUX XY VUX XY ) ,VB = UX , 154

then T 1 T T Tr AXY =Tr BUX Σ− UX XY = { } { X } (B.23) T 1 T Tr VU XY U ΣBΣ− UU XY UU XY Σ . { X UX XY X X X VX XY } T Consider the decomposition UBΣBVB of B, use [100] and [107] and note that

T 1 T T 1 min max Tr UBΣBVBUX ΣX− UX XY minTr ΣBΣX− ΣΓ , (B.24) ΣB UB,VB { } ≤ ΣB { }

where ΣΓ is a diagonal matrix, having diagonal elements ΣΓ(n,n) = σΓ(n) = T(n,n), n N ∀ ∈ and T = U Σ VT . UX XY ΣUX XY UX XY Note that the term

Tr (AX)(AX)T = Tr BBT = Tr Σ2 , (B.25) { } { } { B} and as in the subsection C.0

T T T 2 2 Tr AA =Tr BDD B = Tr ΣBΣ− , { } { } { X } (B.26) T T T T T T T 4 4 Tr (AA )(AA ) =Tr (BDD B )(BDD B ) = Tr Σ Σ− , { } { } { B X } and log detAAT = log detDT BT BD = | | | | (B.27) 1 2 1 T 2 2 log detUX Σ− Σ Σ− U = log detΣ− Σ . | X B X X | | X B| Finally, the aproximation of (B.1) using the bound (B.24) is reduced to

N λ σ 2(n) 2λ ˆ ˆ 3 4 X 3 2 σB(1),...,σB(N) = arg min ∑ 4 σB(n) + 2 − σB(n) { } σB(1),...,σB(N) σ (n) σ (n) − n=1 X X (B.28) σΓ(n) σX (n) σB(n) 2λ4 log , σX (n) − σA(n)

equaling to zero the first order derivative of the objective (B.28) w.r.t. σB(n) and multiplying by σB(n) gives

2 λ3 4 σX (n) 2λ3 2 σΓ(n) 4 4 σB(n) + 2 2 − σB(n) σB(n) 2λ4 = 0. (B.29) σX (n) σX (n) − σX (n) −

A closed form solution to (B.29) exists and depends on the discriminant of the quartic λ3 polynomial. Moreover, since 4 4 is positive a global minimum to (B.28) exists. σX (n) Therefore, having the decomposition U Σ VT = B, the substitutions U = (U VT )T BΣB B B UX XY UX XY and VB = UX with the solution of (B.28) gives the ε-close closed form approximative solution B.2 Proof for the Approximate ε-Close Closed Form Solution 155 to problem (B.1) as T 1 T (B.30) A = VUX XY UUX XY ΣBΣX− UX , where (B.24) implies that the ε-close closed form approximative solution is a lower bound to the solution of (B.1) 

Appendix C

C.1 Proof of Theorem 3

Given the likelihood p(d j,s b j,i), using the Bayesian rule, the posterior probability is |

p(b j,i d j,s) ∝ p(d j,s b j,i)p(b j,i) (C.1) | | Consider the tilted likelihood probability:

exp(λd(b j,i d j,s)) pλ (d j,s b j,i) = p(d j,s b j,i) | (C.2) | | E exp(λd(b j,i b j,s)) |   as the posterior p (d j,s b j,i) = p(b j,i d j,s). A proof that tilting trades the mean for divergence λ | | is given by [115].

Given the available estimate p(d j,s b j,i), for any λ ℜ, p(b j,i d j,s) = p (d j,s b j,i) | ∈ | λ | diverges from the initial estimate p(d j,s b j,i), ρ p p(d j,s p(b j,i d j,s) = 0 and ρ(.) is a | λ | | ̸ divergence measure. Equivalently since the prior: 

exp(λd(b j,i d j,s)) p b j,i = | (C.3) E exp(λd(b j,i d j,s))  |   is a constraint on the posterior p(b j,i d j,s) = p (d j,s b j,i), the prior p(b j,i) induces diver- | λ | gence on p(d j,s b j,i) towards p(b j,i d j,s). | | Define the likelihood vector:

l j,i = p(d j,1 b j,i), p(d j,2 b j,i),..., p(d j,S b j,i) (C.4) | | | S   ∑ p(d j,s b j,i) = 1, p(d j,s b j,i), s 1,...,S 0,S < ∞ then the interpretation of the s=1 | | ∀ ∈ { } ≥ maximum a posterior estimate can be considered as a minimum divergence under the prior 158

constraints that equals to a solution of a constrained projection problem:

yˆ j,i =arg min ρ l j,i,y j,i y j,i Θ ∈ (C.5)  subject to m y j,i = 0  C.2 Proof of Theorem 4

A single iteration consist of an inner projection on a plane, then an outer projection on a ball and an inner projection on the positive quadrant [60], i.e.,

q+1 q x = Π+ ( Πl ( Πl (x ))), (C.6) ↓ ↑ 2 ↓ 1 where:

q q 1 q l1 ∑ j Λq x ( j) q Πl (x ) : z = x − ∈ i , (C.7) 1 − Nq  

1 2 1 q Πl2 (z ) : z = z + αd , q 1 l1 q d = z q i − N (C.8) αˆ = argminα2 (dq)T dq 2(dq)T z1 α 0 − ≥ T + z1 z1 l2, − 2  where iq = 1 xq > 0 0,1 S 1 is a vector whose elements have values only if the corre- { } ∈ { } × sponding elements of xq have non-zero values, and

q+1 2 Π+ : x (s) = max z (s),0 , (C.9) s 1,...,S .  ∀ ∈ { } q 1 q Define a temporary variable z = l1 (α + 1)∑ j Λq x ( j) Nq , where N is the number of − ∈ q q non-zeros values of the vector x at iteration . Define a set of indexes that point tothe C.2 Proof of Theorem 4 159 non-zeros values at xq+1 as Λq+1 = j : iq+1( j) = 0 and expand xq+1 2: { ̸ } ∥ ∥2 q+1 2 x 2 = ∥ ∥ z (xq( j))2 +2α xq( j) xq( j) + + ∑ ∑ α j Λq+1 j Λq+1 ∈ ∈   A z 2 α| 2 {z xq(}j) + > ∑ α j Λq+1 a ∈   q 2 2 q 2 x 2 + α + 2α ∑|{z}(x ( j)) + (C.10) ∥ ∥ j Λq+1  ∈ (2α + 1) ∑ zxq( j) + 2Nq+1 Nq z2 = j Λq+1 − ∈ q 2 q 2 q 2 x 2 + ∑ (x ( j) γ) > x 2 ∥ ∥ j Λq+1 − ∥ ∥ ∈ b ⇒ |{z} xq+1 2 > xq 2 > ... > x1 2. ∥ ∥2 ∥ ∥2 ∥ ∥2 In the inequality a in the third row of equation (C.10) we use expression about xq+1 in terms of it value at the previous iteration xq plus additional terms. Define the intersection set of indexes that point to the non-zeros values at xq and xq+1 as Γ = Λq Λq+1 and express A { ∩ } using the indexes that point to the non-zero values at iteration q minus the sum of values that are put to zero at iteration q + 1:

q 2 q 2 q 2 A : ∑ (x ( j)) = ∑ (x ( j)) ∑ (x ( j)) . (C.11) j Λq+1 j Λq − j Γ ∈ ∈ ∈ q q+1 By (C.9) i i xq ∞ < z , the second term in (C.11) has the following upper-bound: ∥ − ⊙ ∥ | |  (xq( j))2 max(xq( j))2 < z2 1. ∑ ∑ j ∑ (C.12) j Γ ≤ j Γ Γ j Γ ∈ ∈ ∈ ∈ q The sum ∑ j Γ 1 equals the difference of the number of non-zero values N at iteration q and ∈ the number of non-zero values Nq+1 at iteration q + 1.

q 2 2 q q+1 2 ∑ (x ( j)) > z ∑ 1 = N N z . (C.13) − j Γ − j Γ − − ∈ ∈  160

The inequality b uses the properties of the variable γ:

2 2 2Nq+1 Nq 2(α + 1)z 4z (α + 1) α(α + 2) q+− − − N 1 (C.14) γ = r  2α (α + 2)  

The projection on the positive quadrant Π+ and α > 0 constrains the number of non-zeros Nq+1 at iteration q + 1 to be smaller or equal to the number of non-zeros Nq at iteration q, Nq Nq+1. Rewrite the second term in the numerator in (C.14) and use Nq Nq+1: ≥ ≥ 2Nq+1 Nq 4z2 (α + 1)2 α(α + 2) − = − Nq+1    (C.15) Nq+1 Nq 4z2 α2 + 2α 1 2 + + 1 0, − Nq+1 Nq+1 ≥      therefore γ is a real number and (C.10) holds true. q+1 Similarity expand x 1 as: ∥ ∥ q+1 x 1 = ∥ ∥ q+1 q q 1 i (α + 1)x + l1 (α + 1) ∑ x ( j) q = − q N j Λ ! ! a ∈ |{z} q  q q q q q  1 s + es α + (l1 s )N (α + 1)N x ( j) − − ∑ Nq ≤  j Γ   ∈  b  B   z  |{z} q  q q q q Nq δ  1 (C.16) s + es α + (l1 s )N (α + 1)N e| {z− } − − | α + 1 | Nq ≤   c q q q q q z 1 s + es α + (l1 s )N N e δ |{z} − − |Nq − | Nq ≤   d q q q q q+1 q q s + (l1 s )(e + N ) δ N N s = x 1 − − − ≤ |{z}∥ ∥ e  ⇒ |{z} q+1 q 1 x 1 x 1 ... x 1. ∥ ∥ ≤ ∥ ∥ ≤ ≤ ∥ ∥ C.2 Proof of Theorem 4 161

q+1 q q q The equality (a) is obtained by defining e = N N , s = ∑ j Λq x ( j) and reordering. − ∈ The inequality (b) comes from the lower-bound on the B term in (C.16):

q q z δ x ( j) minx ( j) e − ,δ 0. (C.17) ∑ j j Γ ≥ Γ ≥ | α + 1| ≥ ∈ ∈ Since α 0 inequality (c) is used. ≥ The inequality (d) comes from the use of an inverse form of the Pythagorean theorem a,b R, a b a a and reordering. Further since a < 0 and b > 0 then a ∀ ∈ −| − | ≤ −|| | − | || −| − b a a = a b. The inequality (e) uses the following: | ≤ −|| | − | || −

l1 = l2 √N s(√N 1) = − − ≥ q   √ x 1 N ∥xq∥ ≥ ∥ ∥2 q x 1 |{z} l2 (1 s) ∥ q∥ + s − x 2 ≥ 1 2 q  ∥ ∥  x 2< x 2<...< x 2 ℓ2 1 1 1 2 1 2∥ ∥ ≤ q ≤ ∥ ∥ ≤ ∥ ∥ ∥ ∥ 2 1 x 1 x 1 x 1 ... > x 1 > x 2, then the sequence ∥ 1∥ , ∥ 2∥ ,..., ∥ q∥ is strictly increasing and ∥ ∥ ∥ ∥ {− x 2 − x 2 − x 2 } is a Cauchy sequence. Every strictly increasing∥ ∥ ∥ Cauchy∥ sequence∥ ∥ is finite, therefore the solution to the ℓ1 -norm constrained projection converges to the optimal point in finite number ℓ2 of iterations.  162

C.3 Proposition of Likelihood w.r.t Similarity and Dissimi- larity

M CK Let CK representations [Y c,k ,yc,k] ℜ × be given, where Y c,k denotes the matrix \{ } ∈ \{ } contain all yc1,k1 , except yc,k, c1 = c and k1 = k. Let the probability (likelihood) between two ̸ ̸ f (y ,y ) 1 c,k c1,k1 representations yc,k and yc ,k be p(yc,k yc ,k ) exp( ). 1 1 | 1 1 ∼ η − σ Likelihood vector The likelihood vector is defined as follows 1 rc,k p(yc,k yc ,k ),..., p(yC,K yc ,k ) , (C.19) ≃ η | 1 1 | 1 1   f (yc ,k ,yc,k) M M C K 1 1 where rc,k Θ, f (.,.) : ℜ ℜ ℜ+ is a distance measure [13], η = ∑c=1 ∑k=1 e− σ ∈ × → C K is normalization factor, σ is a scaling parameter and Θ rc,k : ∑ ∑ rc,k(ic ,k ) = ∈ { c1=1 k1=1 1 1 1,rc,k(ic,k) 0,ic ,k = c1 + (c1 1)k1, c1,k1 is the probability simplex. ≥ 1 1 − ∀ }

Likelihood w.r.t. Similarity and Dissimilarity We denote vectors r Θ (or r¯c,k Θ) c,k ∈ ∈ w.r.t. a notion for similarity (or dissimilarity) are exposed similarly as (C.19), but with the difference that in their definition:

w.r.t. the notion for dissimilarity, M M r uses fdiss : ℜ ℜ ℜ+, and (C.20) c,k × → w.r.t. the notion for similarity, M M r¯c,k uses fsim : ℜ ℜ ℜ+, (C.21) × →

with correspondingly defined η and η¯ , where fdiss represents any proper distance measure 1 [13] and fsim = fdiss . −

C.4 Proof that Constrained Likelihood Estimation is MAP

Given rc,k (or r¯c,k), the CL vector is the solution to:

eˆc,k = arg min ρ ec,k,rc,k , subject to m ec,k = 0, (C.22) ec,k Θ ∈   1 Note that fdiss and fsim can be defined also as fdiss(yc,k,yc1,k1 ) = ρ(yc,k,yc1,k1 ) and fsim(yc,k,yc1,k1 ) = ς(yc,k,yc ,k ) ρ(yc,k,yc ,k ). 1 1 − 1 1 C.5 Proof Regarding the Discrimination Density Interpretation 163

where the approximation of rc,k is the maximum a posterior estimation denoted as eˆc,k, ρ(.) is a divergence measure and m(.) is constraint induced by the prior on the likelihood vector. We give the proof in the following.

Given the likelihood p(yc ,k yc,k), using the Bayesian rule, the posterior probability is 1 1 | p(yc,k yc ,k ) = p(yc ,k yc,k)p(yc,k). Consider the tilted likelihood probability | 1 1 1 1 |

exp(λd(yc,k yc1,k1 )) pλ (yc1,k1 yc,k) = p(yc1,k1 yc,k) | (C.23) | | E exp(λd(yc,k yc ,k )) | 1 1   as the posterior p (yc ,k x) = p(yc,k yc ,k ). A proof that tilting trades mean for divergence λ 1 1 | | 1 1 is given by [114].

Given the available estimate p(yc ,k yc,k), for any λ ℜ, p(yc,k yc ,k ) = p (yc ,k yc,k) 1 1 | ∈ | 1 1 λ 1 1 | diverges from the initial estimate p(yc ,k yc,k), since in this case ρ p (yc,k yc ,k ), p(yc ,k yc,k) = 1 1 | λ | 1 1 1 1 | ̸ (.) 0 and ρ is a divergence measure. Equivalently since the prior: 

exp(λd(yc,k yc1,k1 )) p yc,k = | (C.24) E exp(λd(yc,k yc1,k1 ))  |   is a constraint on the posterior p(yc,k yc ,k ) = p (yc ,k yc,k), the prior p(yc,k) induces | 1 1 λ 1 1 | divergence on p(yc ,k yc,k) towards p(yc,k yc ,k ). Having a proper likelihood vector rc,k = 1 1 | | 1 1 p(yc ,k yc,k),..., p(yC,K yc,k) , where rc,k Θ, then the (CL) interpretation as a maximum 1 1 | | ∈ a posterior is seen as a minimum divergence under the prior constraints that equals to a constrained projection:

eˆc,k =argminρ ec,k,rc,k , subject to m ec,k = 0  (C.25) e Θ ∈   C.5 Proof Regarding the Discrimination Density Interpre- tation

To see that ιE l(yc,k,θ ) Dℓ (X) Rℓ ,c(X) + Sℓ (X), (C.26) ∼ 1 − 1 2 or  

ιE l(yc,k,θ ) Dℓ (X) Rℓ ,c(X) + Sℓ (X), (C.27) ∼ 1 − 1 2   164

represents a regularized discrimination density, we assume that we have priors with parame-

ters τ 1,...,τC , ν 1,...,νC , that are modeled as: { } { } 1 p(τ 1,...,τC,yc,k) ∝ exp min ρ(yc,k,τ c) , −β2 1 c1 C  ≤ ≤  (C.28) 1  p(ν 1,...,νC,yc,k) ∝ exp min ρ(yc,k,ν c) . −β2 1 c1 C  ≤ ≤   Then the difference Rℓ ,c(X) Dℓ (X) (or R (X) D (X)) between Rℓ ,c(X) and Dℓ (X) 1 − 1 ℓ1,c − ℓ1 1 1 R ( ) D ( ) discriminative (or ℓ1,c X and ℓ1 X ) can be seen as a finite sample approximation toa density, since: p(τ 1,...,τC,yc,k) E log − p(ν ,...,ν ,y ) ∼  1 C c,k  C K ∑ ∑ min ρ(yc,k,τ c) min ρ(yc,k,ν c) = 1 c1 C − 1 c2 C c=1 k=1 ≤ ≤  ≤ ≤  (C.29) C K ∑ ∑ min max (ρ(yc,k,τ c) + ρ(yc,k,ν c)) w 1 c1 C 1 c2 C c=1 k=1 ≤ ≤ ≤ ≤ Dℓ (X) Rℓ ,c(X)( or D (X) R (X)) 1 − 1 ℓ1 − ℓ1,c where the regularization is defined by the expected strength S (X) (or S (X)) on the support ℓ2 ℓ2 intersection 

C.6 Proof of the Implicit Form in Supervised Case and the Closed Form Solution

+ + M M P Let yc1,k1 = yc ,k + yc−,k ,yc ,k ℜ+ and yc−,k ℜ . Consider the measure Dℓ (X): 1 1 1 1 1 1 ∈ 1 1 ∈ − 1

C K K P + + D (X) = y y 1+ ℓ1 ∑ ∑ ∑ ∑ ∥ c1,k1 ⊙ c2,k2 ∥ c1=1 c2 1,...,C c1 k1=1 k2=1 ∈{{ }\ } C K K y− y− 1 = ∑ ∑ ∑ ∑ ∥ c1,k1 ⊙ c2,k2 ∥ c1=1 c2 1,...,C c1 k1=1 k2=1 ∈{{ }\ } (C.30) C K K y+ T y+ + ∑ ∑ ∑ ∑ | c1,k1 | | c2,k2 | c1=1 c2 1,...,C c1 k1=1 k2=1 ∈{{ }\ } C K K T y− y− . ∑ ∑ ∑ ∑ | c1,k1 | | c2,k2 | c1=1 c2 1,...,C c1 k1=1 k2=1 ∈{{ }\ } C.6 Proof of the Implicit Form in Supervised Case and the Closed Form Solution 165

Let A and Y c1,k1 be given then the problem related to estimation of the NT represen- \{ } tation has only one variable yc1,k1 . In (C.30), yc1,k1 is related with only a part of the NT representations in DP (X), the rest are constants for the reduced problem. Therefore, we ℓ1 have: K y+ T y+ + | c1,k1 | ∑ ∑ | c2,k2 | c2 1,...,C c1 k2=1 ∈{{ }\ } K (C.31) y− y− = | c1,k1 ∑ ∑ | c2,k2 | c2 1,...,C c1 k2=1 ∈{{ }\ } + T + T y d + y− d−, | c1,k1 | 1 | c1,k1 | 1 + K + K where d1 = ∑c2 1,...,C c1 ∑k =1 yc ,k , d1− = ∑c2 1,...,C c1 ∑k =1 yc−,k and we abuse ∈{{ }\ } 2 | 2 2 | ∈{{ }\ } 2 | 2 2 | notation by denoting yc ,k as the vector whose elements are the absolute values of the | 1 1 | elements in y . We have similar construction also for RP (X), where instead of the data c1,k1 ℓ1 samples form the rest of the classes the data samples only from the same class are used, i.e.,

+ + y y 1 + y− y− 1 = ∑ ∥ c,k1 ⊙ c,k2 ∥ ∑ ∥ c,k1 ⊙ c,k2 ∥ k2 1,...,K k1 k2 1,...,K k1 ∈{{ }\ } ∈{{ }\ } + T + T y y + y− y− = ∑ | c,k1 | | c,k2 | ∑ | c,k1 | | c,k2 | k2 1,...,K k1 k2 1,...,K k1 ∈{{ }\ } ∈{{ }\ } (C.32) + T + T y y + y− y− = | c,k1 | ∑ | 2,k2 | | c,k1 | ∑ | c,k2 | k2 1,...,K k1 k2 1,...,K k1 ∈{{ }\ } ∈{{ }\ } + T + T y d + y− d−, | c,k1 | 2 | c,k1 | 2 + + where d2 = ∑k2=k1 yc,k , d2− = ∑k2=k1 yc−,k . ̸ | 2 | ̸ | 2 | Note that:

C K K P 2 S (X) = yc ,k yc ,k = ℓ2 ∑ ∑ ∑ ∑ ∥ 1 1 ⊙ 2 2 ∥2 c1=1 c2 1,...,C c1 k1=1 k2=1 ∈{{ }\ } (C.33) C K K T (yc ,k yc ,k ) (yc ,k yc ,k ). ∑ ∑ ∑ ∑ 1 1 ⊙ 1 1 2 2 ⊙ 2 2 c1=1 c2 1,...,C c1 k1=1 k2=1 ∈{{ }\ } similarly as in (C.42), per one yc1,k1 , we have:

K T (yc ,k yc ,k ) (yc ,k yc ,k ) = ∑ ∑ 1 1 ⊙ 1 1 2 2 ⊙ 2 2 c2 1,...,C c1 k2=1 ∈{{ }\ } (C.34) K T T (yc ,k yc ,k ) ( yc ,k yc ,k ) = (yc ,k yc ,k ) sc, 1 1 ⊙ 1 1 ∑ ∑ 2 2 ⊙ 2 2 1 1 ⊙ 1 1 c2 1,...,C c1 k2=1 ∈{{ }\ } 166

K where sc = ∑c2 1,...,C c1 ∑k =1 yc2,k2 yc2,k2 . Denote qc1,k1 = Axc1,k1 and consider the ∈{{ }\ } 2 ⊙ problem:

2 yˆc,k = argminy qc ,k yc ,k + c1,k1 ∥ 1 1 − 1 1 ∥2 + T + T + T + T T λ0((y ) d + (y− ) d− (y ) d (y− ) d− + (yc ,k yc ,k ) sc)+ (C.35) c1,k1 1 c1,k1 1 − c1,k1 2 − c1,k1 2 1 1 ⊙ 1 1 λ1 yc ,k 1, ∥ 1 1 ∥

by taking the first order derivative w.r.t. yc1,k1 we have that

(yc ,k qc ,k )+ 1 1 − 1 1 + + λ0(sign(y ) d + sign(y− ) d−)+ c1,k1 ⊙ 1 c1,k1 ⊙ 1 + + λ0(sign(y ) d + sign(y− ) d−)+ (C.36) − c1,k1 ⊙ 2 c1,k1 ⊙ 2 λ0(yc ,k sc) 1 1 ⊙

λ1sign(yc1,k1 ) = 0,

take sign magnitude decomposition of yc ,k = sign(yc ,k ) yc ,k then we have 1 1 1 1 ⊙ | 1 1 |

\[
\begin{aligned}
\mathrm{sign}(\mathbf{y}_{c_1,k_1})\odot|\mathbf{y}_{c_1,k_1}|\odot(\mathbf{1}+2\lambda_0\mathbf{s}_c) - \mathrm{sign}(\mathbf{q}_{c_1,k_1})\odot|\mathbf{q}_{c_1,k_1}|
&+ \lambda_0\big(\mathrm{sign}(\mathbf{y}^+_{c_1,k_1})\odot\mathbf{d}^+_1 + \mathrm{sign}(\mathbf{y}^-_{c_1,k_1})\odot\mathbf{d}^-_1\big) \\
&- \lambda_0\big(\mathrm{sign}(\mathbf{y}^+_{c_1,k_1})\odot\mathbf{d}^+_2 + \mathrm{sign}(\mathbf{y}^-_{c_1,k_1})\odot\mathbf{d}^-_2\big) \\
&+ \lambda_1\,\mathrm{sign}(\mathbf{y}_{c_1,k_1}) = \mathbf{0}.
\end{aligned}
\tag{C.37}
\]

Let the sign of $\mathbf{y}_{c_1,k_1}$, i.e., $\mathrm{sign}(\mathbf{y}_{c_1,k_1})$, be equal to $\mathrm{sign}(\mathbf{q}_{c_1,k_1})$, and Hadamard multiply from the left side by $\mathrm{sign}(\mathbf{q}_{c_1,k_1})$; then we have

\[
\begin{aligned}
|\mathbf{y}_{c_1,k_1}|\odot(\mathbf{1}+2\lambda_0\mathbf{s}_c) - |\mathbf{q}_{c_1,k_1}|
&+ \lambda_0\big(\mathrm{sign}(\mathbf{q}_{c_1,k_1})\odot\mathrm{sign}(\mathbf{q}^+_{c_1,k_1})\odot\mathbf{d}^+_1 + \mathrm{sign}(\mathbf{q}_{c_1,k_1})\odot\mathrm{sign}(\mathbf{q}^-_{c_1,k_1})\odot\mathbf{d}^-_1\big) \\
&- \lambda_0\big(\mathrm{sign}(\mathbf{q}_{c_1,k_1})\odot\mathrm{sign}(\mathbf{q}^+_{c_1,k_1})\odot\mathbf{d}^+_2 + \mathrm{sign}(\mathbf{q}_{c_1,k_1})\odot\mathrm{sign}(\mathbf{q}^-_{c_1,k_1})\odot\mathbf{d}^-_2\big) \\
&+ \lambda_1\mathbf{1} = \mathbf{0},
\end{aligned}
\tag{C.38}
\]

note that $\mathrm{sign}(\mathbf{q}_{c_1,k_1})\odot\mathrm{sign}(\mathbf{q}^+_{c_1,k_1}) = \mathrm{sign}(\mathbf{q}^+_{c_1,k_1})$ and that $\mathrm{sign}(\mathbf{q}_{c_1,k_1})\odot\mathrm{sign}(\mathbf{q}^-_{c_1,k_1}) = \mathrm{sign}(-\mathbf{q}^-_{c_1,k_1})$; therefore we have

\[
\begin{aligned}
|\mathbf{y}_{c_1,k_1}|\odot(\mathbf{1}+2\lambda_0\mathbf{s}_c) = |\mathbf{q}_{c_1,k_1}|
&- \lambda_0\big(\mathrm{sign}(\mathbf{q}^+_{c_1,k_1})\odot\mathbf{d}^+_1 + \mathrm{sign}(-\mathbf{q}^-_{c_1,k_1})\odot\mathbf{d}^-_1\big) \\
&+ \lambda_0\big(\mathrm{sign}(\mathbf{q}^+_{c_1,k_1})\odot\mathbf{d}^+_2 + \mathrm{sign}(-\mathbf{q}^-_{c_1,k_1})\odot\mathbf{d}^-_2\big) - \lambda_1\mathbf{1}.
\end{aligned}
\tag{C.39}
\]
Since the magnitude can only be nonnegative, we have that $|\mathbf{y}_{c_1,k_1}|\odot(\mathbf{1}+2\lambda_0\mathbf{s}_c) = \max\big(|\mathbf{q}_{c_1,k_1}| - \lambda_0(\mathrm{sign}(\mathbf{q}^+_{c_1,k_1})\odot\mathbf{d}^+_1 + \mathrm{sign}(-\mathbf{q}^-_{c_1,k_1})\odot\mathbf{d}^-_1) + \lambda_0(\mathrm{sign}(\mathbf{q}^+_{c_1,k_1})\odot\mathbf{d}^+_2 + \mathrm{sign}(-\mathbf{q}^-_{c_1,k_1})\odot\mathbf{d}^-_2) - \lambda_1\mathbf{1},\,\mathbf{0}\big)$. Denote $\mathbf{g}_c = \mathrm{sign}(\mathbf{q}^+_{c_1,k_1})\odot\mathbf{d}^+_1 + \mathrm{sign}(-\mathbf{q}^-_{c_1,k_1})\odot\mathbf{d}^-_1$ and $\mathbf{v}_c = \mathrm{sign}(\mathbf{q}^+_{c_1,k_1})\odot\mathbf{d}^+_2 + \mathrm{sign}(-\mathbf{q}^-_{c_1,k_1})\odot\mathbf{d}^-_2$; then the closed form solution to (C.35) is:

\[
(\mathbf{S}_{\mathrm{supervised}}):\quad \mathbf{y}_{c_1,k_1} = \mathrm{sign}(\mathbf{A}\mathbf{x}_{c_1,k_1})\odot\max\big(|\mathbf{A}\mathbf{x}_{c_1,k_1}| - \lambda_0(\mathbf{g}_c-\mathbf{v}_c) - \lambda_1\mathbf{1},\,\mathbf{0}\big)\oslash(\mathbf{1}+2\lambda_0\mathbf{s}_c),
\tag{C.40}
\]
which completes the proof. $\blacksquare$
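A minimal sketch of the per-sample update (C.40), as reconstructed above, is given below. It is illustrative only: the helper name, the stacked-array interface and the parameter values are assumptions, not thesis code. The aggregates $\mathbf{d}^{\pm}_1$, $\mathbf{d}^{\pm}_2$, $\mathbf{s}_c$, $\mathbf{g}_c$ and $\mathbf{v}_c$ follow (C.31), (C.32), (C.34) and (C.39).

```python
import numpy as np

def supervised_nt_update(A, x, Y_other_cls, Y_same_cls, lam0, lam1):
    """Per-sample closed form update following (C.40) as reconstructed above.

    Y_other_cls : (N1, M) NT representations from all other classes
    Y_same_cls  : (N2, M) NT representations from the same class, k2 != k1
    """
    q = A @ x
    qp, qm = np.maximum(q, 0), np.minimum(q, 0)

    # aggregates d1 (other classes) and d2 (same class), split by sign, (C.31)-(C.32)
    d1p = np.sum(np.maximum(Y_other_cls, 0), axis=0)
    d1m = np.sum(np.abs(np.minimum(Y_other_cls, 0)), axis=0)
    d2p = np.sum(np.maximum(Y_same_cls, 0), axis=0)
    d2m = np.sum(np.abs(np.minimum(Y_same_cls, 0)), axis=0)

    # expected-strength vector s_c over the other classes, (C.34)
    s_c = np.sum(Y_other_cls ** 2, axis=0)

    # g_c and v_c as in (C.39)
    g_c = np.sign(qp) * d1p + np.sign(-qm) * d1m
    v_c = np.sign(qp) * d2p + np.sign(-qm) * d2m

    mag = np.maximum(np.abs(q) - lam0 * (g_c - v_c) - lam1, 0.0)
    return np.sign(q) * mag / (1.0 + 2.0 * lam0 * s_c)

# illustrative call with random data and hypothetical multipliers
A, x = np.random.randn(16, 10), np.random.randn(10)
Yo, Ys = np.random.randn(20, 16), np.random.randn(5, 16)
y_hat = supervised_nt_update(A, x, Yo, Ys, lam0=0.1, lam1=0.05)
```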

C.7 Proof of the Implicit Form in Unsupervised Case and the Closed Form Solution

Let $\mathbf{y}_{c,k} = \mathbf{y}^+_{c,k} + \mathbf{y}^-_{c,k}$, with $\mathbf{y}^+_{c,k} \in \Re^M_+$ and $\mathbf{y}^-_{c,k} \in \Re^M_-$. Consider the measure $D^P_{\ell_1}(\mathbf{X})$:

\[
D^P_{\ell_1}(\mathbf{X}) = \sum_{c=1}^{C}\sum_{k=1}^{K} \Big\|\mathbf{y}^+_{c,k}\odot\sum_{c_1=1}^{C}\sum_{k_1=1}^{K}\mathbf{e}_{c,k}(c_1+(k_1-1)C)\,\mathbf{y}^+_{c_1,k_1}\Big\|_1
+ \sum_{c=1}^{C}\sum_{k=1}^{K} \Big\|\mathbf{y}^-_{c,k}\odot\sum_{c_1=1}^{C}\sum_{k_1=1}^{K}\mathbf{e}_{c,k}(c_1+(k_1-1)C)\,\mathbf{y}^-_{c_1,k_1}\Big\|_1.
\tag{C.41}
\]

Let $\mathbf{A}$ and $\mathbf{Y}\setminus\{\mathbf{y}_{c,k}\}$ be given; then the problem related to the estimation of the NT representation has only one variable, $\mathbf{y}_{c,k}$. Consequently, in (C.41), $\mathbf{y}_{c,k}$ is related with only a part of the transform representations in $D^P_{\ell_1}(\mathbf{X})$; the rest are constants for the reduced problem. In particular, we have:

\[
\begin{aligned}
&\Big\|\mathbf{y}^+_{c,k}\odot\sum_{c_1=1}^{C}\sum_{k_1=1}^{K}\mathbf{e}_{c,k}(c_1+(k_1-1)C)\,\mathbf{y}^+_{c_1,k_1}\Big\|_1
+ \Big\|\mathbf{y}^-_{c,k}\odot\sum_{c_1=1}^{C}\sum_{k_1=1}^{K}\mathbf{e}_{c,k}(c_1+(k_1-1)C)\,\mathbf{y}^-_{c_1,k_1}\Big\|_1 \\
&+ \sum_{c_1=1}^{C}\sum_{k_1=1}^{K}\big\|\mathbf{y}^+_{c_1,k_1}\odot\mathbf{e}_{c_1,k_1}(c+(k-1)C)\,\mathbf{y}^+_{c,k}\big\|_1
+ \sum_{c_1=1}^{C}\sum_{k_1=1}^{K}\big\|\mathbf{y}^-_{c_1,k_1}\odot\mathbf{e}_{c_1,k_1}(c+(k-1)C)\,\mathbf{y}^-_{c,k}\big\|_1,
\end{aligned}
\tag{C.42}
\]

assuming $\mathbf{e}_{c_1,k_1}(c+(k-1)C) = \mathbf{e}_{c,k}(c_1+(k_1-1)C)$, the sum in (C.42) equals:
\[
\begin{aligned}
&2\Big\|\mathbf{y}^+_{c,k}\odot\sum_{c_1=1}^{C}\sum_{k_1=1}^{K}\mathbf{e}_{c,k}(c_1+(k_1-1)C)\,\mathbf{y}^+_{c_1,k_1}\Big\|_1
+ 2\Big\|\mathbf{y}^-_{c,k}\odot\sum_{c_1=1}^{C}\sum_{k_1=1}^{K}\mathbf{e}_{c,k}(c_1+(k_1-1)C)\,\mathbf{y}^-_{c_1,k_1}\Big\|_1 \\
&= 2\,|\mathbf{y}^+_{c,k}|^T\Big(\sum_{c_1=1}^{C}\sum_{k_1=1}^{K}\mathbf{e}_{c,k}(c_1+(k_1-1)C)\,|\mathbf{y}^+_{c_1,k_1}|\Big)
+ 2\,|\mathbf{y}^-_{c,k}|^T\Big(\sum_{c_1=1}^{C}\sum_{k_1=1}^{K}\mathbf{e}_{c,k}(c_1+(k_1-1)C)\,|\mathbf{y}^-_{c_1,k_1}|\Big) \\
&= 2\big(|\mathbf{y}^+_{c,k}|^T\mathbf{d}^+_1 + |\mathbf{y}^-_{c,k}|^T\mathbf{d}^-_1\big),
\end{aligned}
\tag{C.43}
\]
where $\mathbf{d}^+_1 = \sum_{c_1=1}^{C}\sum_{k_1=1}^{K}\mathbf{e}_{c,k}(c_1+(k_1-1)C)\,|\mathbf{y}^+_{c_1,k_1}|$ for the positive component and $\mathbf{d}^-_1 = \sum_{c_1=1}^{C}\sum_{k_1=1}^{K}\mathbf{e}_{c,k}(c_1+(k_1-1)C)\,|\mathbf{y}^-_{c_1,k_1}|$ for the negative component. We abuse notation by denoting $|\mathbf{y}_{c,k}|$ as the vector whose elements are the absolute values of the elements in $\mathbf{y}_{c,k}$. The vectors $\mathbf{d}^+_2$ and $\mathbf{d}^-_2$ are constructed similarly from $R^P_{\ell_1}(\mathbf{X})$. Concerning $S^P_{\ell_2}(\mathbf{X})$, note that:

\[
S^P_{\ell_2}(\mathbf{X}) = \sum_{c=1}^{C}\sum_{k=1}^{K}\Big\|\mathbf{y}_{c,k}\odot\sum_{c_1=1}^{C}\sum_{k_1=1}^{K}\mathbf{e}_{c,k}(c_1+(k_1-1)C)\,\mathbf{y}_{c_1,k_1}\Big\|_2^2,
\tag{C.44}
\]

similarly as in (C.42), given $\mathbf{Y}\setminus\{\mathbf{y}_{c,k}\}$, the corresponding problem has only one variable, $\mathbf{y}_{c,k}$. Consequently, in (C.44), $\mathbf{y}_{c,k}$ is related with only a part of the NT representations in $S^P_{\ell_2}(\mathbf{X})$; the rest are constants for the reduced problem. In particular, we have:

\[
\Big\|\mathbf{y}_{c,k}\odot\sum_{c_1=1}^{C}\sum_{k_1=1}^{K}\mathbf{e}_{c,k}(c_1+(k_1-1)C)\,\mathbf{y}_{c_1,k_1}\Big\|_2^2
+ \sum_{c_1=1}^{C}\sum_{k_1=1}^{K}\big\|\mathbf{y}_{c_1,k_1}\odot\mathbf{e}_{c_1,k_1}(c+(k-1)C)\,\mathbf{y}_{c,k}\big\|_2^2,
\tag{C.45}
\]
that equals to:

\[
\begin{aligned}
&(\mathbf{y}_{c,k}\odot\mathbf{y}_{c,k})^T\Big(\sum_{c_1=1}^{C}\sum_{k_1=1}^{K}\sum_{c_2=1}^{C}\sum_{k_2=1}^{K}\mathbf{e}_{c,k}(i_{c_1,k_1})\,\mathbf{e}_{c,k}(i_{c_2,k_2})\,\mathbf{y}_{c_1,k_1}\odot\mathbf{y}_{c_2,k_2}\Big) \\
&+ (\mathbf{y}_{c,k}\odot\mathbf{y}_{c,k})^T\Big(\sum_{c_1=1}^{C}\sum_{k_1=1}^{K}\mathbf{e}_{c,k}(i_{c_1,k_1})^2\,\mathbf{y}_{c_1,k_1}\odot\mathbf{y}_{c_1,k_1}\Big)
= (\mathbf{y}_{c,k}\odot\mathbf{y}_{c,k})^T\mathbf{s},
\end{aligned}
\tag{C.46}
\]
where $\mathbf{s} = \sum_{c_1=1}^{C}\sum_{k_1=1}^{K}\sum_{c_2=1}^{C}\sum_{k_2=1}^{K}\mathbf{e}_{c,k}(i_{c_1,k_1})\,\mathbf{e}_{c,k}(i_{c_2,k_2})\,\mathbf{y}_{c_1,k_1}\odot\mathbf{y}_{c_2,k_2} + \sum_{c_1=1}^{C}\sum_{k_1=1}^{K}\mathbf{e}_{c,k}(i_{c_1,k_1})^2\,\mathbf{y}_{c_1,k_1}\odot\mathbf{y}_{c_1,k_1}$, with $i_{c_1,k_1} = c_1+(k_1-1)C$ and $i_{c_2,k_2} = c_2+(k_2-1)C$.
Denote $\mathbf{q} = \mathbf{q}_{c,k} = \mathbf{A}\mathbf{x}_{c,k}$, use $\mathbf{g}_d = \mathrm{sign}(\mathbf{q}^+)\odot\mathbf{d}^+_1 + \mathrm{sign}(-\mathbf{q}^-)\odot\mathbf{d}^-_1$, $\mathbf{g}_s = \mathrm{sign}(\mathbf{q}^+)\odot\mathbf{d}^+_2 + \mathrm{sign}(-\mathbf{q}^-)\odot\mathbf{d}^-_2$ and $\mathbf{s}$; then we have the following problem:

\[
\hat{\mathbf{y}} = \arg\min_{\mathbf{y}} \frac{1}{2}\|\mathbf{q}-\mathbf{y}\|_2^2 + \lambda_0\big(\mathbf{g}_d^T|\mathbf{y}| - \mathbf{g}_s^T|\mathbf{y}| + \mathbf{s}^T(\mathbf{y}\odot\mathbf{y})\big) + \lambda_1\mathbf{1}^T|\mathbf{y}|,
\tag{C.47}
\]
note that if $\mathbf{g}_d^T|\mathbf{y}| - \mathbf{g}_s^T|\mathbf{y}| \ge 0$, problem (C.47) is convex. The first order derivative w.r.t. $\mathbf{y}$ is
\[
\mathbf{y} - \mathbf{q} + \lambda_0\big(\mathbf{g}_d\odot\mathrm{sign}(\mathbf{y}) - \mathbf{g}_s\odot\mathrm{sign}(\mathbf{y})\big) + \lambda_0\,\mathbf{y}\odot\mathbf{s} + \lambda_1\,\mathrm{sign}(\mathbf{y}) = \mathbf{0},
\tag{C.48}
\]
let $\mathbf{y} = |\mathbf{y}|\odot\mathrm{sign}(\mathbf{y})$, $\mathbf{q} = |\mathbf{q}|\odot\mathrm{sign}(\mathbf{q})$ and $\mathbf{k} = (\mathbf{1}+\lambda_0\mathbf{s})$, and assume that $\mathrm{sign}(\mathbf{y}) = \mathrm{sign}(\mathbf{q})$; then we have

\[
|\mathbf{y}|\odot\mathbf{k} - |\mathbf{q}| + \lambda_0(\mathbf{g}_d - \mathbf{g}_s) + \lambda_1\mathbf{1} = \mathbf{0},
\tag{C.49}
\]
Hadamard dividing from the left by $\mathbf{k}$, we have that

\[
|\mathbf{y}| - |\mathbf{q}|\oslash\mathbf{k} + \lambda_0(\mathbf{g}_d - \mathbf{g}_s)\oslash\mathbf{k} + \lambda_1\mathbf{1}\oslash\mathbf{k} = \mathbf{0},
\tag{C.50}
\]
since the magnitude can only be nonnegative, we have that $|\mathbf{y}| = \max\big(|\mathbf{q}|\oslash\mathbf{k} - \lambda_0(\mathbf{g}_d-\mathbf{g}_s)\oslash\mathbf{k} - \lambda_1\mathbf{1}\oslash\mathbf{k},\,\mathbf{0}\big)$. Note that if $\mathbf{k}^T(\mathbf{y}\odot\mathbf{y}) < 0$ then (C.47) is not convex; on the other hand, if none of the elements of $\mathbf{k}$ is negative, then $\mathbf{k}^T(\mathbf{y}\odot\mathbf{y}) \ge 0$, and together with $\mathbf{g}_d^T|\mathbf{y}| - \mathbf{g}_s^T|\mathbf{y}| \ge 0$ the closed form solution is:

\[
\mathbf{y} = \mathrm{sign}(\mathbf{q})\odot\max\big(|\mathbf{q}|\oslash\mathbf{k} - \lambda_0(\mathbf{g}_d-\mathbf{g}_s)\oslash\mathbf{k} - \lambda_1\mathbf{1}\oslash\mathbf{k},\,\mathbf{0}\big),
\tag{C.51}
\]
assuming that $\mathbf{k}\succeq\mathbf{0}$ and $\mathbf{g}_d^T|\mathbf{y}| - \mathbf{g}_s^T|\mathbf{y}| \ge 0$, the closed form solution is:
\[
(\mathbf{S}_{\mathrm{unsupervised}}):\quad \mathbf{y} = \mathrm{sign}(\mathbf{q})\odot\max\big(|\mathbf{q}| - \lambda_0(\mathbf{g}_d-\mathbf{g}_s) - \lambda_1\mathbf{1},\,\mathbf{0}\big)\oslash\mathbf{k}.
\tag{C.52}
\]
$\blacksquare$
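A sketch of the unsupervised update (C.52) follows the same pattern as in the supervised case. It is illustrative only: the remaining representations are stacked row-wise, and two generic weight vectors stand in for the exact construction of $\mathbf{d}^{\pm}_1$ in (C.43) and of the $R^P_{\ell_1}(\mathbf{X})$-related counterpart $\mathbf{d}^{\pm}_2$.

```python
import numpy as np

def unsupervised_nt_update(A, x, Y_all, e1, e2, lam0, lam1):
    """Closed form update following (C.52) as reconstructed above.

    Y_all  : (N, M) the remaining NT representations, one row per (c1, k1)
    e1, e2 : (N,) weight vectors standing in for e_{c,k}(i_{c1,k1}) in the
             D- and R-related aggregates (an assumption, see the text)
    """
    q = A @ x
    qp, qm = np.maximum(q, 0), np.minimum(q, 0)

    Yp, Ym = np.maximum(Y_all, 0), np.abs(np.minimum(Y_all, 0))
    d1p, d1m = Yp.T @ e1, Ym.T @ e1          # aggregates as in (C.43)
    d2p, d2m = Yp.T @ e2, Ym.T @ e2          # the R-related counterparts

    g_d = np.sign(qp) * d1p + np.sign(-qm) * d1m
    g_s = np.sign(qp) * d2p + np.sign(-qm) * d2m

    # s from (C.46): weighted cross terms plus squared-weight self terms
    w = Y_all.T @ e1
    s = w * w + (Y_all ** 2).T @ (e1 ** 2)
    k = 1.0 + lam0 * s

    mag = np.maximum(np.abs(q) - lam0 * (g_d - g_s) - lam1, 0.0)
    return np.sign(q) * mag / k
```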

Appendix D

D.1 Proof for the Closed Form Solution w.r.t. NT Representation

Let $\mathbf{y} = \mathbf{y}_i$ and $\mathbf{q} = \mathbf{A}\mathbf{x}_i$, and note that per pair $\{\boldsymbol{\tau}_{c_1},\boldsymbol{\nu}_{c_2}\}$ we have the following problem:
\[
\hat{\mathbf{y}} = \arg\min_{\mathbf{y}} \frac{1}{2}\|\mathbf{y}-\mathbf{q}\|_2^2 + \lambda_0\left(\frac{\boldsymbol{\tau}_{c_1}^T|\mathbf{y}| + c_d}{\boldsymbol{\nu}_{c_2}^T|\mathbf{y}| + c_s} + (\boldsymbol{\tau}_{c_1}\odot\boldsymbol{\tau}_{c_1})^T(\mathbf{y}\odot\mathbf{y})\right) + \lambda_1\mathbf{1}^T|\mathbf{y}|,
\tag{D.1}
\]
that represents a projection problem with linear fractional, quadratic and sparsity constraints.
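For later reference, a small helper evaluating the objective of (D.1) as reconstructed above can be used to check candidate solutions numerically; it is a sketch with illustrative names, not part of the derivation.

```python
import numpy as np

def d1_objective(y, q, tau, nu, c_d, c_s, lam0, lam1):
    """Objective of (D.1) as reconstructed above: quadratic fit plus
    linear-fractional, quadratic and sparsity terms."""
    frac = (tau @ np.abs(y) + c_d) / (nu @ np.abs(y) + c_s)
    quad = (tau * tau) @ (y * y)
    return (0.5 * np.sum((y - q) ** 2)
            + lam0 * (frac + quad)
            + lam1 * np.sum(np.abs(y)))
```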

Denote $\mathbf{d}_d = \boldsymbol{\tau}_{c_1}$, $\mathbf{d}_s = \boldsymbol{\nu}_{c_2}$ and $\mathbf{s} = \boldsymbol{\tau}_{c_1}\odot\boldsymbol{\tau}_{c_1}$.
First Order Derivative. The first order derivative w.r.t. $\mathbf{y}$ is

\[
\mathbf{y} - \mathbf{q} + \lambda_0\frac{1}{\mathbf{d}_s^T|\mathbf{y}| + c_s}\,\mathbf{d}_d\odot\mathrm{sign}(\mathbf{y}) + \lambda_0\frac{\mathbf{d}_d^T|\mathbf{y}| + c_d}{(\mathbf{d}_s^T|\mathbf{y}| + c_s)^2}\,\mathbf{d}_s\odot\mathrm{sign}(\mathbf{y}) + \lambda_0\,\mathbf{y}\odot\mathbf{s} + \lambda_1\,\mathrm{sign}(\mathbf{y}) = \mathbf{0},
\tag{D.2}
\]
let $\mathbf{y} = |\mathbf{y}|\odot\mathrm{sign}(\mathbf{y})$, $\mathbf{q} = |\mathbf{q}|\odot\mathrm{sign}(\mathbf{q})$ and $\mathbf{k} = (\mathbf{1}+\lambda_0\mathbf{s})$, and note that $\mathrm{sign}(\mathbf{y}) = \mathrm{sign}(\mathbf{q})$; then we have
\[
|\mathbf{y}|\odot\mathbf{k} - |\mathbf{q}| + \lambda_0\frac{1}{\mathbf{d}_s^T|\mathbf{y}| + c_s}\,\mathbf{d}_d + \lambda_0\frac{\mathbf{d}_d^T|\mathbf{y}| + c_d}{(\mathbf{d}_s^T|\mathbf{y}| + c_s)^2}\,\mathbf{d}_s + \lambda_1\mathbf{1} = \mathbf{0},
\tag{D.3}
\]
Hadamard divide by $\mathbf{k}$, denote $\mathbf{q}_k = |\mathbf{q}|\oslash\mathbf{k}$, $\mathbf{d}_{s,k} = \mathbf{d}_s\oslash\mathbf{k}$, $\mathbf{d}_{d,k} = \mathbf{d}_d\oslash\mathbf{k}$ and $\mathbf{l}_k = \mathbf{1}\oslash\mathbf{k}$; then we have

\[
|\mathbf{y}| - \mathbf{q}_k + \lambda_0\frac{\mathbf{d}_d^T|\mathbf{y}| + c_d}{(\mathbf{d}_s^T|\mathbf{y}| + c_s)^2}\,\mathbf{d}_{s,k} + \lambda_0\frac{1}{\mathbf{d}_s^T|\mathbf{y}| + c_s}\,\mathbf{d}_{d,k} + \lambda_1\mathbf{l}_k = \mathbf{0},
\tag{D.4}
\]

Solving for $a = \mathbf{d}_s^T|\mathbf{y}|$ and $b = \mathbf{d}_d^T|\mathbf{y}|$. Multiply (D.4) once with $\mathbf{d}_s^T$ and once with $\mathbf{d}_d^T$; then we have

\[
\mathbf{d}_s^T|\mathbf{y}| - \mathbf{d}_s^T\mathbf{q}_k + \lambda_0\frac{\mathbf{d}_d^T|\mathbf{y}| + c_d}{(\mathbf{d}_s^T|\mathbf{y}| + c_s)^2}\,\mathbf{d}_{s,k}^T\mathbf{d}_s + \lambda_0\frac{1}{\mathbf{d}_s^T|\mathbf{y}| + c_s}\,\mathbf{d}_{d,k}^T\mathbf{d}_s + \lambda_1\mathbf{l}_k^T\mathbf{d}_s = 0,
\tag{D.5}
\]
\[
\mathbf{d}_d^T|\mathbf{y}| - \mathbf{d}_d^T\mathbf{q}_k + \lambda_0\frac{\mathbf{d}_d^T|\mathbf{y}| + c_d}{(\mathbf{d}_s^T|\mathbf{y}| + c_s)^2}\,\mathbf{d}_{s,k}^T\mathbf{d}_d + \lambda_0\frac{1}{\mathbf{d}_s^T|\mathbf{y}| + c_s}\,\mathbf{d}_{d,k}^T\mathbf{d}_d + \lambda_1\mathbf{l}_k^T\mathbf{d}_d = 0,
\tag{D.6}
\]
denote $a = \mathbf{d}_s^T|\mathbf{y}|$ and $b = \mathbf{d}_d^T|\mathbf{y}|$ and reorder (D.5) and (D.6); then we have

\[
a - \mathbf{q}_k^T\mathbf{d}_s + \lambda_0\frac{b + c_d}{(a + c_s)^2}\,\mathbf{d}_{s,k}^T\mathbf{d}_s + \lambda_0\frac{1}{a + c_s}\,\mathbf{d}_{d,k}^T\mathbf{d}_s + \lambda_1\mathbf{l}_k^T\mathbf{d}_s = 0
\;\Rightarrow\;
b + c_d = \frac{(a + c_s)^2}{\lambda_0\,\mathbf{d}_{s,k}^T\mathbf{d}_s}\left(\mathbf{q}_k^T\mathbf{d}_s - a - \lambda_0\frac{1}{a + c_s}\,\mathbf{d}_{d,k}^T\mathbf{d}_s - \lambda_1\mathbf{l}_k^T\mathbf{d}_s\right),
\tag{D.7}
\]
\[
b - \mathbf{q}_k^T\mathbf{d}_d + \lambda_0\frac{b + c_d}{(a + c_s)^2}\,\mathbf{d}_{s,k}^T\mathbf{d}_d + \lambda_0\frac{1}{a + c_s}\,\mathbf{d}_{d,k}^T\mathbf{d}_d + \lambda_1\mathbf{l}_k^T\mathbf{d}_d = 0,
\tag{D.8}
\]
replace (D.7) in (D.8) and reorder; then we have

\[
c_4 a^4 + c_3 a^3 + c_2 a^2 + c_1 a + c_0 = 0,
\tag{D.9}
\]

where

\[
\begin{aligned}
c_4 &= 1,\\
c_3 &= 3c_s - [v_1 - v_4],\\
c_2 &= 3c_s^2 - 3c_s(v_1 - v_4) + v_3 + v_6,\\
c_1 &= -\big(-c_s^3 + 3c_s^2(v_1 - v_4) - 2v_3 c_s - v_6 c_s + v_2 v_5 - v_2 c_s + v_1 v_6 - v_4 v_6\big),\\
c_0 &= -\big((v_1 - v_4)c_s^3 - v_3 c_s^2 + c_s\,[\,v_2 v_5 - v_2 c_s + v_2 v_8 + v_1 v_6 - v_4 v_6\,] + v_2 v_7 - v_3 v_6\big),
\end{aligned}
\tag{D.10}
\]

and
\[
\begin{aligned}
v_1 &= \mathbf{q}_k^T\mathbf{d}_d, & v_2 &= \lambda_0\,\mathbf{d}_{s,k}^T\mathbf{d}_d, & v_3 &= \lambda_0\,\mathbf{d}_{d,k}^T\mathbf{d}_d, & v_4 &= \lambda_1\,\mathbf{l}_k^T\mathbf{d}_d,\\
v_5 &= \mathbf{q}_k^T\mathbf{d}_s, & v_6 &= \lambda_0\,\mathbf{d}_{s,k}^T\mathbf{d}_s, & v_7 &= \lambda_0\,\mathbf{d}_{d,k}^T\mathbf{d}_s, & v_8 &= \lambda_1\,\mathbf{l}_k^T\mathbf{d}_s.
\end{aligned}
\tag{D.11}
\]

Closed Form Solution. Let (D.9) have a global minimum for $a$, and use it with (D.7) to resolve $b$. Assuming the solutions for $a$ and $b$ are nonnegative, $a \ge 0$ and $b \ge 0$, we use (D.4) and the fact that $\mathrm{sign}(\mathbf{y}) = \mathrm{sign}(\mathbf{q})$; then the global minimum solution to (D.1) is:

\[
\mathbf{y} = \mathrm{sign}(\mathbf{q})\odot\max\Big(|\mathbf{q}| - \lambda_0\Big(\frac{1}{a + c_s}\,\mathbf{d}_d + \frac{b + c_d}{(a + c_s)^2}\,\mathbf{d}_s\Big) - \lambda_1\mathbf{1},\,\mathbf{0}\Big)\oslash\mathbf{k},
\tag{D.12}
\]
where $\mathbf{k} = (\mathbf{1}+\lambda_0\mathbf{s})$. $\blacksquare$
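The sketch below applies (D.12) after resolving $a$ and $b$. Instead of expanding the quartic (D.9) with the coefficients (D.10), it solves the two scalar equations (D.5) and (D.6) numerically; this is a stand-in procedure for illustration, not the one used in the thesis, and the function and parameter names are assumptions.

```python
import numpy as np
from scipy.optimize import fsolve

def d1_closed_form(q, d_d, d_s, c_d, c_s, lam0, lam1, s):
    """Apply (D.12) after resolving a = d_s^T|y| and b = d_d^T|y|.

    The two scalar equations (D.5)-(D.6) are solved numerically here,
    as a substitute for expanding the quartic (D.9)-(D.10)."""
    k = 1.0 + lam0 * s
    qk, dsk, ddk, lk = np.abs(q) / k, d_s / k, d_d / k, 1.0 / k

    def residuals(ab):
        a, b = ab
        f1 = lam0 * (b + c_d) / (a + c_s) ** 2   # multiplies the d_{s,k} terms
        f2 = lam0 / (a + c_s)                    # multiplies the d_{d,k} terms
        r1 = a - d_s @ qk + f1 * (dsk @ d_s) + f2 * (ddk @ d_s) + lam1 * (lk @ d_s)
        r2 = b - d_d @ qk + f1 * (dsk @ d_d) + f2 * (ddk @ d_d) + lam1 * (lk @ d_d)
        return [r1, r2]

    a, b = fsolve(residuals, x0=[d_s @ qk, d_d @ qk])
    shrink = lam0 * (d_d / (a + c_s) + (b + c_d) / (a + c_s) ** 2 * d_s) + lam1
    return np.sign(q) * np.maximum(np.abs(q) - shrink, 0.0) / k
```

The objective helper given after (D.1) can be used to compare the returned vector against other candidate solutions.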

D.2 Proof for the Closed Form Solution w.r.t. Similarity Related Parameter

Given $\mathbf{A}$, $\mathbf{Y}$, $\boldsymbol{\theta}\setminus\{\boldsymbol{\tau}_{c_1}\}$ and using (Ass), the corresponding problem, per $\boldsymbol{\tau}_{c_1}$ and a pair $\{\boldsymbol{\tau}_{c_3},\boldsymbol{\nu}_{c_4}\}$, reduces to:
\[
\hat{\boldsymbol{\tau}}_{c_1} = \arg\min_{\boldsymbol{\tau}_{c_1}} \frac{1}{2}\sum_{\forall i:\, z_1(i)=c_1}\|\mathbf{A}\mathbf{x}_i - \boldsymbol{\nu}_{z_2(i)} - \boldsymbol{\tau}_{c_1}\|_2^2 + \sum_{\forall i:\, z_1(i)=c_1} r(i) + \lambda_E\left(\frac{\boldsymbol{\tau}_{c_3}^T|\boldsymbol{\tau}_{c_1}| + c_d}{\boldsymbol{\nu}_{c_4}^T|\boldsymbol{\tau}_{c_1}| + c_s} + (\boldsymbol{\tau}_{c_3}\odot\boldsymbol{\tau}_{c_3})^T(\boldsymbol{\tau}_{c_1}\odot\boldsymbol{\tau}_{c_1})\right),
\tag{D.13}
\]
where
\[
r(i) = \frac{1}{\rho(\mathbf{y}_i,\boldsymbol{\nu}_{z_2(i)})}\,\rho(\mathbf{y}_i,\boldsymbol{\tau}_{z_1(i)}) + \varsigma(\mathbf{y}_i,\boldsymbol{\tau}_{z_1(i)}).
\tag{D.14}
\]

Let $\mathbf{y} = \boldsymbol{\tau}_{c_1}$ and:
\[
\mathbf{q} = \frac{1}{\sum_{\forall i:\, z_1(i)=c_1} 1}\sum_{\forall i:\, z_1(i)=c_1}\big(\mathbf{A}\mathbf{x}_i - \boldsymbol{\nu}_{z_2(i)}\big).
\tag{D.15}
\]

Note that per pair $\{\boldsymbol{\tau}_{c_3},\boldsymbol{\nu}_{c_4}\}$ we have the following problem:
\[
\hat{\mathbf{y}} = \arg\min_{\mathbf{y}} \frac{1}{2}\|\mathbf{y}-\mathbf{q}\|_2^2 + \lambda_E\left(\frac{\boldsymbol{\tau}_{c_3}^T|\mathbf{y}| + c_d}{\boldsymbol{\nu}_{c_4}^T|\mathbf{y}| + c_s} + (\boldsymbol{\tau}_{c_3}\odot\boldsymbol{\tau}_{c_3})^T(\mathbf{y}\odot\mathbf{y})\right) + \lambda_0\sum_{\forall i:\, z_1(i)=c_1} r(i).
\tag{D.16}
\]

Denote $\mathbf{d}_d = \boldsymbol{\tau}_{c_3}$, $\mathbf{d}_s = \boldsymbol{\nu}_{c_4}$,

\[
\mathbf{s} = \lambda_E\,\boldsymbol{\tau}_{c_3}\odot\boldsymbol{\tau}_{c_3} + \lambda_0\sum_{\forall i:\, z_1(i)=c_1}\frac{\partial\varsigma(\mathbf{y}_i,\boldsymbol{\tau}_{c_1})}{\partial\boldsymbol{\tau}_{c_1}},
\tag{D.17}
\]
and
\[
\mathbf{t} = \sum_{\forall i:\, z_1(i)=c_1}\frac{1}{\rho(\mathbf{y}_i,\boldsymbol{\nu}_{z_2(i)})}\frac{\partial\rho(\mathbf{y}_i,\boldsymbol{\tau}_{c_1})}{\partial\boldsymbol{\tau}_{c_1}}.
\tag{D.18}
\]
The First Order Derivative. The first order derivative w.r.t. $\mathbf{y}$ is

\[
\mathbf{y} - \mathbf{q} + \lambda_E\frac{1}{\mathbf{d}_s^T|\mathbf{y}| + c_s}\,\mathbf{d}_d\odot\mathrm{sign}(\mathbf{y}) + \lambda_E\frac{\mathbf{d}_d^T|\mathbf{y}| + c_d}{(\mathbf{d}_s^T|\mathbf{y}| + c_s)^2}\,\mathbf{d}_s\odot\mathrm{sign}(\mathbf{y}) + \mathbf{y}\odot\mathbf{s} + \lambda_0\,\mathbf{t}\odot\mathrm{sign}(\mathbf{y}) = \mathbf{0}.
\tag{D.19}
\]
Closed Form Solution. Note that (D.2) is identical to (D.19); the difference is that in place of $\lambda_0$ and $\lambda_1$ we have $\lambda_E$ and $\lambda_0$, instead of $\mathbf{1}\odot\mathrm{sign}(\mathbf{y})$ we have $\mathbf{t}\odot\mathrm{sign}(\mathbf{y})$, and the vector $\mathbf{s}$ is different. Therefore, under assumptions similar to those for (D.1), the global minimum solution to (D.13) is:

\[
\mathbf{y} = \mathrm{sign}(\mathbf{q})\odot\max\Big(|\mathbf{q}| - \lambda_E\Big(\frac{1}{a + c_s}\,\mathbf{d}_d + \frac{b + c_d}{(a + c_s)^2}\,\mathbf{d}_s\Big) - \lambda_0\mathbf{t},\,\mathbf{0}\Big)\oslash\mathbf{k},
\tag{D.20}
\]
where $a$ and $b$ are estimated identically as in Appendix A, using the corresponding variables related to (D.19), and $\mathbf{k} = (\mathbf{1}+2\mathbf{s})$. $\blacksquare$
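The update (D.20) thus has the same structure as (D.12), with $\lambda_E$ and the vector $\lambda_0\mathbf{t}$ taking the roles of $\lambda_0$ and $\lambda_1\mathbf{1}$ and with $\mathbf{k} = (\mathbf{1}+2\mathbf{s})$; only the target $\mathbf{q}$ changes, per (D.15). A minimal sketch of computing this target (array shapes and names are assumptions):

```python
import numpy as np

def similarity_target(A, X_c, Nu_assigned):
    """q from (D.15): class-conditional average of A x_i - nu_{z_2(i)}.

    X_c         : (N_c, n) samples x_i with z_1(i) == c_1
    Nu_assigned : (N_c, M) the parameters nu_{z_2(i)} assigned to those samples
    """
    return np.mean(X_c @ A.T - Nu_assigned, axis=0)
```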

D.3 Proof for the Solution w.r.t. Dissimilarity Related Parameter

Given $\mathbf{A}$, $\mathbf{Y}$, $\boldsymbol{\theta}\setminus\{\boldsymbol{\nu}_{c_2}\}$ and using (Ass), the corresponding problem, per $\boldsymbol{\nu}_{c_2}$ and a pair $\{\boldsymbol{\tau}_{c_3},\boldsymbol{\nu}_{c_4}\}$, reduces to:

\[
\hat{\boldsymbol{\nu}}_{c_2} = \arg\min_{\boldsymbol{\nu}_{c_2}} \frac{1}{2}\sum_{\forall i:\, z_2(i)=c_2}\|\mathbf{A}\mathbf{x}_i - \boldsymbol{\tau}_{z_1(i)} - \boldsymbol{\nu}_{c_2}\|_2^2 + \lambda_0\sum_{\forall i:\, z_2(i)=c_2} r(i) + \lambda_E\left(\frac{\boldsymbol{\tau}_{c_3}^T|\boldsymbol{\nu}_{c_2}| + c_d}{\boldsymbol{\nu}_{c_4}^T|\boldsymbol{\nu}_{c_2}| + c_s} + (\boldsymbol{\tau}_{c_3}\odot\boldsymbol{\tau}_{c_3})^T(\boldsymbol{\nu}_{c_2}\odot\boldsymbol{\nu}_{c_2})\right),
\tag{D.21}
\]

where $r(i) = \frac{1}{\rho(\mathbf{y}_i,\boldsymbol{\nu}_{c_2})}\,\rho(\mathbf{y}_i,\boldsymbol{\tau}_{z_1(i)})$. Let $\mathbf{y} = \boldsymbol{\nu}_{c_2}$ and
\[
\mathbf{q} = \frac{1}{\sum_{\forall i:\, z_2(i)=c_2} 1}\sum_{\forall i:\, z_2(i)=c_2}\big(\mathbf{A}\mathbf{x}_i - \boldsymbol{\tau}_{z_1(i)}\big).
\tag{D.22}
\]
Note that per pair $\{\boldsymbol{\tau}_{c_3},\boldsymbol{\nu}_{c_4}\}$ we have the following problem:

\[
\hat{\mathbf{y}} = \arg\min_{\mathbf{y}} \frac{1}{2}\|\mathbf{y}-\mathbf{q}\|_2^2 + \lambda_0\sum_{\forall i:\, z_2(i)=c_2} r(i) + \lambda_E\left(\frac{\boldsymbol{\tau}_{c_3}^T|\mathbf{y}| + c_d}{\boldsymbol{\nu}_{c_4}^T|\mathbf{y}| + c_s} + (\boldsymbol{\tau}_{c_3}\odot\boldsymbol{\tau}_{c_3})^T(\mathbf{y}\odot\mathbf{y})\right).
\tag{D.23}
\]

Denote $\mathbf{d}_d = \boldsymbol{\tau}_{c_3}$, $\mathbf{d}_s = \boldsymbol{\nu}_{c_4}$, $\mathbf{s} = \boldsymbol{\tau}_{c_3}\odot\boldsymbol{\tau}_{c_3}$ and
\[
\mathbf{t} = \sum_{\forall i:\, z_2(i)=c_2}\frac{\partial r(i)}{\partial\boldsymbol{\nu}_{c_2}}.
\tag{D.24}
\]
The First Order Derivative. The first order derivative w.r.t. $\mathbf{y}$ is

\[
\mathbf{y} - \mathbf{q} + \lambda_E\frac{1}{\mathbf{d}_s^T|\mathbf{y}| + c_s}\,\mathbf{d}_d\odot\mathrm{sign}(\mathbf{y}) + \lambda_E\frac{\mathbf{d}_d^T|\mathbf{y}| + c_d}{(\mathbf{d}_s^T|\mathbf{y}| + c_s)^2}\,\mathbf{d}_s\odot\mathrm{sign}(\mathbf{y}) + \lambda_E\,\mathbf{y}\odot\mathbf{s} + \lambda_0\,\mathbf{t}\odot\mathrm{sign}(\mathbf{y}) = \mathbf{0}.
\tag{D.25}
\]
Iterative Solution. Note that

\[
\frac{\partial r(i)}{\partial\boldsymbol{\nu}_{c_2}} = \frac{\rho(\mathbf{y}_i,\boldsymbol{\tau}_{z_1(i)})}{(\mathbf{e}_i^T|\boldsymbol{\nu}_{c_2}| + c_s)^2}\,\mathbf{e}_i\odot\mathrm{sign}(\boldsymbol{\nu}_{c_2}),
\tag{D.26}
\]
where
\[
\mathbf{e}_i = \max(\mathbf{y}_i,\mathbf{0})\odot\mathrm{sign}(\boldsymbol{\nu}^+_{c_2}) + \max(-\mathbf{y}_i,\mathbf{0})\odot\mathrm{sign}(-\boldsymbol{\nu}^-_{c_2}).
\tag{D.27}
\]
Using a similar approach as in Appendix A, we can introduce new variables for each $\mathbf{e}_i^T|\boldsymbol{\nu}_{c_2}|$. In this way we will have a number of variables and nonlinear equations equal to 2 plus the number of $\mathbf{y}_i$ in the sum $\sum_{\forall i:\, z_2(i)=c_2} r(i)$ that have to be solved. One way to solve such a system of equations is by using an iterative alternating algorithm, where we fix all the variables except one and solve for that variable. Instead, we take all the currently estimated $\boldsymbol{\theta}_2^{t-1} = \{\boldsymbol{\nu}_1^{t-1},\ldots,\boldsymbol{\nu}_C^{t-1}\}$, compute all the respective $h_i = \frac{\rho(\mathbf{y}_i,\boldsymbol{\tau}_{z_1(i)})}{(\mathbf{e}_i^T|\boldsymbol{\nu}_{c_2}^{t-1}| + c_s)^2}$, and then approximate (D.26) as:
\[
\frac{\partial r(i)}{\partial\boldsymbol{\nu}_{c_2}} = h_i\,\mathbf{e}_i\odot\mathrm{sign}(\boldsymbol{\nu}_{c_2}),
\tag{D.28}
\]

setting $\boldsymbol{\nu}_{c_2} = \boldsymbol{\nu}_{c_2}^{t}$; we then iterate using the closed form approximate solution for $\boldsymbol{\nu}_{c_2}$ that is given in the following.
Closed Form Approximate Solution. Note that (D.2) is identical to the approximation of (D.25) with (D.28). The difference is that in place of $\lambda_0$ and $\lambda_1$ we have $\lambda_E$ and $\lambda_0$, and instead of $\mathbf{1}\odot\mathrm{sign}(\mathbf{y})$ we have $\lambda_0\,\mathbf{t}\odot\mathrm{sign}(\mathbf{y})$. Therefore, under assumptions similar to those for the solution of (D.1), the global minimum solution to the approximation of (D.21) with (D.28) is:
\[
\mathbf{y} = \mathrm{sign}(\mathbf{q})\odot\max\Big(|\mathbf{q}| - \lambda_E\Big(\frac{1}{a + c_s}\,\mathbf{d}_d + \frac{b + c_d}{(a + c_s)^2}\,\mathbf{d}_s\Big) - \lambda_0\mathbf{t},\,\mathbf{0}\Big)\oslash\mathbf{k},
\tag{D.29}
\]
where $a$ and $b$ are estimated identically as in Appendix A, using the corresponding variables related to (D.25) and the approximation (D.28), and $\mathbf{k} = (\mathbf{1}+2\lambda_E\mathbf{s})$. $\blacksquare$
