CS229T/STAT231: Statistical Learning Theory (Winter 2016)


Percy Liang
Last updated Wed Apr 20 2016 01:36

These lecture notes will be updated periodically as the course goes on. The Appendix describes the basic notation, definitions, and theorems.

Contents

1 Overview ..... 4
  1.1 What is this course about? (Lecture 1) ..... 4
  1.2 Asymptotics (Lecture 1) ..... 5
  1.3 Uniform convergence (Lecture 1) ..... 6
  1.4 Kernel methods (Lecture 1) ..... 8
  1.5 Online learning (Lecture 1) ..... 9
2 Asymptotics ..... 10
  2.1 Overview (Lecture 1) ..... 10
  2.2 Gaussian mean estimation (Lecture 1) ..... 11
  2.3 Multinomial estimation (Lecture 1) ..... 13
  2.4 Exponential families (Lecture 2) ..... 16
  2.5 Maximum entropy principle (Lecture 2) ..... 19
  2.6 Method of moments for latent-variable models (Lecture 3) ..... 23
  2.7 Fixed design linear regression (Lecture 3) ..... 30
  2.8 General loss functions and random design (Lecture 4) ..... 33
  2.9 Regularized fixed design linear regression (Lecture 4) ..... 40
  2.10 Summary (Lecture 4) ..... 44
  2.11 References ..... 45
3 Uniform convergence ..... 46
  3.1 Overview (Lecture 5) ..... 47
  3.2 Formal setup (Lecture 5) ..... 47
  3.3 Realizable finite hypothesis classes (Lecture 5) ..... 50
  3.4 Generalization bounds via uniform convergence (Lecture 5) ..... 53
  3.5 Concentration inequalities (Lecture 5) ..... 56
  3.6 Finite hypothesis classes (Lecture 6) ..... 62
  3.7 Concentration inequalities (continued) (Lecture 6) ..... 63
  3.8 Rademacher complexity (Lecture 6) ..... 66
  3.9 Finite hypothesis classes (Lecture 7) ..... 72
  3.10 Shattering coefficient (Lecture 7) ..... 74
  3.11 VC dimension (Lecture 7) ..... 76
  3.12 Norm-constrained hypothesis classes (Lecture 7) ..... 81
  3.13 Covering numbers (metric entropy) (Lecture 8) ..... 88
  3.14 Algorithmic stability (Lecture 9) ..... 96
  3.15 PAC-Bayesian bounds (Lecture 9) ..... 100
  3.16 Interpretation of bounds (Lecture 9) ..... 104
  3.17 Summary (Lecture 9) ..... 105
  3.18 References ..... 107
4 Kernel methods ..... 108
  4.1 Motivation (Lecture 10) ..... 109
  4.2 Kernels: definition and examples (Lecture 10) ..... 111
  4.3 Three views of kernel methods (Lecture 10) ..... 114
  4.4 Reproducing kernel Hilbert spaces (RKHS) (Lecture 10) ..... 116
  4.5 Learning using kernels (Lecture 11) ..... 121
  4.6 Fourier properties of shift-invariant kernels (Lecture 11) ..... 125
  4.7 Efficient computation (Lecture 12) ..... 131
  4.8 Universality (skipped in class) ..... 138
  4.9 RKHS embedding of probability distributions (skipped in class) ..... 139
  4.10 Summary (Lecture 12) ..... 142
  4.11 References ..... 142
5 Online learning ..... 143
  5.1 Introduction (Lecture 13) ..... 143
  5.2 Warm-up (Lecture 13) ..... 146
  5.3 Online convex optimization (Lecture 13) ..... 147
  5.4 Follow the leader (FTL) (Lecture 13) ..... 151
  5.5 Follow the regularized leader (FTRL) (Lecture 14) ..... 155
  5.6 Online subgradient descent (OGD) (Lecture 14) ..... 158
  5.7 Online mirror descent (OMD) (Lecture 14) ..... 161
  5.8 Regret bounds with Bregman divergences (Lecture 15) ..... 165
  5.9 Strong convexity and smoothness (Lecture 15) ..... 167
  5.10 Local norms (Lecture 15) ..... 171
  5.11 Adaptive optimistic mirror descent (Lecture 16) ..... 174
  5.12 Online-to-batch conversion (Lecture 16) ..... 180
  5.13 Adversarial bandits: expert advice (Lecture 16) ..... 183
  5.14 Adversarial bandits: online gradient descent (Lecture 16) ..... 185
  5.15 Stochastic bandits: upper confidence bound (UCB) (Lecture 16) ..... 188
  5.16 Stochastic bandits: Thompson sampling (Lecture 16) ..... 191
  5.17 Summary (Lecture 16) ..... 192
  5.18 References ..... 193
6 Neural networks (skipped in class) ..... 194
  6.1 Motivation (Lecture 16) ..... 194
  6.2 Setup (Lecture 16) ..... 195
  6.3 Approximation error (universality) (Lecture 16) ..... 196
  6.4 Generalization bounds (Lecture 16) ..... 198
  6.5 Approximation error for polynomials (Lecture 16) ..... 200
  6.6 References ..... 203
7 Conclusions and outlook ..... 204
  7.1 Review (Lecture 18) ..... 204
  7.2 Changes at test time (Lecture 18) ..... 206
  7.3 Alternative forms of supervision (Lecture 18) ..... 208
  7.4 Interaction between computation and statistics (Lecture 18) ..... 209
A Appendix ..... 211
  A.1 Notation ..... 211
  A.2 Linear algebra ..... 212
  A.3 Probability ..... 213
  A.4 Functional analysis ..... 215

[begin lecture 1]

1 Overview

1.1 What is this course about? (Lecture 1)

• Machine learning has become an indispensable part of many application areas, in both science (biology, neuroscience, psychology, astronomy, etc.) and engineering (natural language processing, computer vision, robotics, etc.). But machine learning is not a single approach; rather, it consists of a dazzling array of seemingly disparate frameworks and paradigms spanning classification, regression, clustering, matrix factorization, Bayesian networks, Markov random fields, etc. This course aims to uncover the common statistical principles underlying this diverse array of techniques.

• This class is about the theoretical analysis of learning algorithms. Many of the analysis techniques introduced in this class, which involve a beautiful blend of probability, linear algebra, and optimization, are worth studying in their own right and are useful outside machine learning. For example, we will provide generic tools to bound the supremum of stochastic processes. We will show how to optimize an arbitrary sequence of convex functions and do as well on average as an expert that sees all the functions in advance.

• Meanwhile, the practitioner of machine learning is hunkered down trying to get things to work. Suppose we want to build a classifier to predict the topic of a document (e.g., sports, politics, technology, etc.).
We train a logistic regression model with bag-of-words features and obtain 8% training error on 1000 training documents; the test error is 13% on 1000 test documents (a minimal code sketch of this setup appears at the end of this overview). There are many questions we could ask that could help us move forward.

  - How reliable are these numbers? If we reshuffled the data, would we get the same answer?
  - How much should we expect the test error to change if we double the number of examples?
  - What if we double the number of features? What if our features or parameters are sparse?
  - What if we double the regularization? Maybe use L1 regularization?
  - Should we change the model and use an SVM with a polynomial kernel or a neural network?

In this class, we develop tools to tackle some of these questions. Our goal isn't to give precise quantitative answers (just as the analysis of algorithms doesn't tell you exactly how many hours a particular algorithm will run). Rather, the analyses will reveal the relevant quantities (e.g., dimension, regularization strength, number of training examples) and how they influence the final test error.

• While a deeper theoretical understanding can offer a new perspective and can aid in troubleshooting existing algorithms, it can also suggest new algorithms which might have been non-obvious without the conceptual scaffolding that theory provides.

  - A famous example is boosting. The following question was posed by Kearns and Valiant in the late 1980s: is it possible to combine weak classifiers (that get 51% accuracy) into a strong classifier (that gets 99% accuracy)? This theoretical challenge eventually led to the development of AdaBoost in the mid-1990s, a simple and practical algorithm with strong theoretical guarantees.

  - In a more recent example, Google's latest 22-layer convolutional neural network that won the 2014 ImageNet Visual Recognition Challenge was initially inspired by a theoretically-motivated algorithm for learning deep neural networks with sparsity structure.

  There is obviously a large gap between theory and practice; theory relies on assumptions that can be simultaneously too strong (e.g., data are i.i.d.) and too weak (e.g., any distribution). The philosophy of this class is that the purpose of theory here is not to churn out formulas that you simply plug numbers into. Rather, theory should change the way you think.

• This class is structured into four sections: asymptotics, uniform convergence, kernel methods, and online learning. We will move from very strong assumptions (assuming the data are Gaussian, in asymptotics) to very weak assumptions (assuming the data can be generated by an adversary, in online learning). Kernel methods is a bit of an outlier
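To make the running example concrete, here is a minimal sketch of the practitioner's setup described above, assuming scikit-learn as the implementation (the notes do not prescribe any particular library) and hypothetical placeholders `train_docs`, `train_labels`, `test_docs`, `test_labels` for the 1000 labeled training and 1000 test documents.

```python
# Minimal sketch of the bag-of-words + logistic regression setup above.
# scikit-learn is an assumed choice; train_docs/train_labels/test_docs/
# test_labels are hypothetical placeholders for the labeled documents.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def topic_classifier_errors(train_docs, train_labels, test_docs, test_labels):
    # Bag-of-words features: each document becomes a sparse vector of word counts.
    vectorizer = CountVectorizer()
    X_train = vectorizer.fit_transform(train_docs)
    X_test = vectorizer.transform(test_docs)

    # Logistic regression; C is the inverse regularization strength, one of the
    # knobs the questions above ask about (doubling regularization ~ halving C).
    clf = LogisticRegression(C=1.0, max_iter=1000)
    clf.fit(X_train, train_labels)

    # score() returns accuracy; 1 - accuracy is the classification error
    # (the 8% / 13% figures in the running example are errors of this kind).
    train_error = 1.0 - clf.score(X_train, train_labels)
    test_error = 1.0 - clf.score(X_test, test_labels)
    return train_error, test_error
```

The questions in the list above then correspond to simple interventions on this sketch: doubling the number of examples changes the number of rows of `X_train`, doubling the regularization halves `C`, and switching to L1 regularization would use `penalty="l1"` with a compatible solver.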