Deep Learning 13.3. Transformer Networks


Deep learning / 13.3. Transformer Networks
François Fleuret — https://fleuret.org/dlc/

Vaswani et al. (2017) proposed to go one step further: instead of using attention mechanisms as a supplement to standard convolutional and recurrent operations, they designed a model combining only attention layers. They designed this "transformer" for a sequence-to-sequence translation task, but it is currently key to state-of-the-art approaches across NLP tasks.

François Fleuret — Deep learning / 13.3. Transformer Networks — 1 / 30

Notes

The standard practice is to train a transformer in a non-supervised manner on large unlabeled datasets such as Wikipedia — or to re-use a pre-trained transformer — and then to fine-tune it in a supervised manner for tasks which require a ground truth, such as sentiment analysis.

They first introduce a multi-head attention module.

[Figure 2 of Vaswani et al. (2017): (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several attention layers running in parallel.]

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(H_1, \dots, H_h)\, W^O
H_i = \mathrm{Attention}\!\left(Q W_i^Q,\, K W_i^K,\, V W_i^V\right), \quad i = 1, \dots, h

with

W_i^Q \in \mathbb{R}^{d_{model} \times d_k}, \quad W_i^K \in \mathbb{R}^{d_{model} \times d_k}, \quad W_i^V \in \mathbb{R}^{d_{model} \times d_v}, \quad W^O \in \mathbb{R}^{h d_v \times d_{model}}.

From Vaswani et al. (2017), section 3.2.1, "Scaled Dot-Product Attention":

"We call our particular attention 'Scaled Dot-Product Attention' (Figure 2). The input consists of queries and keys of dimension d_k, and values of dimension d_v. We compute the dot products of the query with all keys, divide each by √d_k, and apply a softmax function to obtain the weights on the values.

In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix Q. The keys and values are also packed together into matrices K and V. We compute the matrix of outputs as:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V \qquad (1)

The two most commonly used attention functions are additive attention [2], and dot-product (multiplicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor of 1/√d_k. Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.

While for small values of d_k the two mechanisms perform similarly, additive attention outperforms dot-product attention without scaling for larger values of d_k [3]. We suspect that for large values of d_k, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by 1/√d_k."

And from section 3.2.2, "Multi-Head Attention":

"Instead of performing a single attention function with d_model-dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys and values h times with different, learned linear projections to d_k, d_k and d_v dimensions, respectively. On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding d_v-dimensional output values. These are concatenated and once again projected, resulting in the final values, as depicted in Figure 2."

(Footnote 4 of the paper: to illustrate why the dot products get large, assume that the components of q and k are independent random variables with mean 0 and variance 1. Then their dot product, q \cdot k = \sum_{i=1}^{d_k} q_i k_i, has mean 0 and variance d_k.)

François Fleuret — Deep learning / 13.3. Transformer Networks — 2 / 30

Notes

The "scaled dot-product attention" (left) is very close to the attention module we saw in lecture 13.2 "Attention Mechanisms", with the addition of an optional masking (in pink). This may be useful when such a module is used for a generative auto-regressive operation and the attention should be causal, looking only to the past.

The attention is a function of the keys, queries, and values. The only difference with what was seen in the previous course is that the attention matrix is rescaled with the dimension of the embedding, which matters quite a lot.

In the multi-head attention, each of the h heads has its own processing of the input keys, queries, and values through respectively W_i^K, W_i^Q, and W_i^V, and there is one final processing W^O applied on the concatenated results of the multiple heads.
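As an illustration of the two formulas above, here is a minimal PyTorch sketch of scaled dot-product attention with the optional masking, and of a multi-head attention module. This is my own sketch, not the course's or the paper's reference implementation; the module and argument names are assumptions, and for simplicity it takes d_k = d_v = d_model / h, as in the paper's base configuration.

```python
import math
import torch
import torch.nn.functional as F
from torch import nn

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k: (..., T, d_k), v: (..., T, d_v)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)       # (..., T, T)
    if mask is not None:
        # Optional masking (shown in pink in Figure 2), e.g. a causal mask.
        scores = scores.masked_fill(mask, float('-inf'))
    return F.softmax(scores, dim=-1) @ v                    # (..., T, d_v)

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, h):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        # W^Q, W^K, W^V for all heads packed into single linear maps,
        # plus the final projection W^O.
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, q, k, v, mask=None):
        B = q.size(0)
        # Project and split into h heads: (B, T, d_model) -> (B, h, T, d_k).
        def split(x, w):
            return w(x).view(B, -1, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(q, self.w_q), split(k, self.w_k), split(v, self.w_v)
        y = scaled_dot_product_attention(q, k, v, mask)      # (B, h, T, d_k)
        # Concatenate the heads and apply W^O.
        y = y.transpose(1, 2).contiguous().view(B, -1, self.h * self.d_k)
        return self.w_o(y)

# Usage: self-attention over a batch of two sequences of length ten.
mha = MultiHeadAttention(d_model=512, h=8)
x = torch.randn(2, 10, 512)
y = mha(x, x, x)                                             # (2, 10, 512)
```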
Their complete model is composed of:

• An encoder that combines N = 6 modules, each composed of a multi-head attention sub-module and a [per-component] one-hidden-layer MLP, with residual pass-through and layer normalization.

• A decoder with a similar structure, but with causal attention layers to allow for regression training, and additional attention layers that attend to the layers of the encoder.

Positional information is provided through an additive positional encoding of the same dimension d_model as the internal representation, and is of the form

\mathrm{PE}_{t,2i} = \sin\!\left(\frac{t}{10{,}000^{\,2i/d_{model}}}\right)

\mathrm{PE}_{t,2i+1} = \cos\!\left(\frac{t}{10{,}000^{\,(2i+1)/d_{model}}}\right).

François Fleuret — Deep learning / 13.3. Transformer Networks — 3 / 30

Notes

Contrary to what we previously saw with the concatenated binary positional encoding, here the position is provided as an additive encoding, where t is the position in the sequence, and 2i and 2i+1 the dimension.
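Here is a minimal sketch of this additive sinusoidal positional encoding, written as a PyTorch module. It is my own illustration, following the formulas exactly as written on the slide (exponent equal to the dimension index); the class and argument names are assumptions.

```python
import torch
from torch import nn

class PositionalEncoding(nn.Module):
    """Additive sinusoidal positional encoding, as written on the slide:
    PE[t, j] = sin(t / 10000^(j / d_model)) for even j,
               cos(t / 10000^(j / d_model)) for odd j.
    (The Vaswani et al. (2017) paper uses the exponent 2i / d_model for both
    the sin and the cos components; the difference is a small frequency shift.)
    """

    def __init__(self, d_model, max_len=5000):
        super().__init__()
        t = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (T, 1)
        j = torch.arange(d_model, dtype=torch.float32).unsqueeze(0)   # (1, d_model)
        angles = t / torch.pow(10000.0, j / d_model)                  # (T, d_model)
        # Even dimensions get sin, odd dimensions get cos.
        pe = torch.where(torch.arange(d_model) % 2 == 0,
                         torch.sin(angles), torch.cos(angles))
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x: (B, T, d_model); the encoding is simply added to the embeddings.
        return x + self.pe[:x.size(1)]
```

In the model, this is applied right after the token-embedding look-up, so that the input to the first encoder module is the T × d_model tensor of embeddings plus positional encodings.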
[Figure 1 of Vaswani et al. (2017): The Transformer — model architecture.]

From Vaswani et al. (2017), section 3.1, "Encoder and Decoder Stacks":

"Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.

Decoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i."

And from section 3.2, "Attention":

"An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key."

François Fleuret — Deep learning / 13.3. Transformer Networks — 4 / 30

Notes

This is a depiction of the standard transformer architecture for sequence-to-sequence translation. It consists of an encoder (left part) and a decoder (right part). Both are a stack of N = 6 modules.

Each token (subword) of the input sequence is encoded with a look-up table to get its embedding of dimension d, so that the input is a tensor of size T × d. Then the positional encoding of the same size is added to it.

Each of the N modules of the encoder is composed of a multi-head self-attention operation followed by a "feed-forward" operation that applies a one-hidden-layer perceptron at every position of the sequence separately; this can be implemented with 1 × 1 convolutions. Both the self-attention and the feed-forward are combined with residual pass-through.

The decoder is an auto-regressive model: each of its modules has a multi-head self-attention operation, then an attention that attends to the encoder, and a feed-forward operation. The self-attention is masked to make it causal, i.e. it takes into account only the part of the sequence already generated. The attention to the encoder is not masked, but its keys and values are functions of the outputs of the corresponding module in the encoding stack.

The architecture is tested on English-to-German and English-to-French translation using the standard WMT 2014 datasets.

• English-to-German: 4.5M sentence pairs, 37k-token vocabulary.
• English-to-French: 36M sentence pairs, 32k-token vocabulary.
• 8 P100 GPUs (150 TFlops FP16), 0.5 day for the small model, 3.5 days for the large one.

François Fleuret — Deep learning / 13.3. Transformer Networks — 5 / 30

Table 2 (Vaswani et al., 2017): The Transformer achieves better BLEU scores than previous state-of-the-art models on the English-to-German and English-to-French newstest2014 tests at a fraction of the training cost.

Model                             BLEU            Training Cost (FLOPs)
                                  EN-DE   EN-FR   EN-DE         EN-FR
ByteNet [18]                      23.75
Deep-Att + PosUnk [39]                    39.2                  1.0 · 10^20
GNMT + RL [38]                    24.6    39.92   2.3 · 10^19   1.4 · 10^20
ConvS2S [9]                       25.16   40.46   9.6 · 10^18   1.5 · 10^20
MoE [32]                          26.03   40.56   2.0 · 10^19   1.2 · 10^20
Deep-Att + PosUnk Ensemble [39]           40.4                  8.0 · 10^20
GNMT + RL Ensemble [38]           26.30   41.16   1.8 · 10^20   1.1 · 10^21
ConvS2S Ensemble [9]              26.36   41.29   7.7 · 10^19   1.2 · 10^21
Transformer (base model)          27.3    38.1         3.3 · 10^18
Transformer (big)                 28.4    41.8         2.3 · 10^19

From the same paper: "Label Smoothing — During training, we employed label smoothing of value ε_ls = 0.1 [36]."
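To connect the excerpt and notes above with code, here is a minimal sketch of one encoder block, with its two residual pass-throughs and layer normalizations, together with the causal mask used by the decoder's self-attention. It is my own illustration, reusing the hypothetical MultiHeadAttention module sketched earlier; d_model = 512 and the feed-forward inner dimension d_ff = 2048 follow the paper's base configuration.

```python
import torch
from torch import nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, h=8, d_ff=2048):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, h)    # sketched earlier
        # Position-wise feed-forward: a one-hidden-layer MLP applied to every
        # position separately (equivalently, two 1x1 convolutions).
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff),
                                nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Residual pass-through + layer norm: LayerNorm(x + Sublayer(x)).
        x = self.norm1(x + self.self_attn(x, x, x))
        x = self.norm2(x + self.ff(x))
        return x

def causal_mask(T):
    # True above the diagonal: position t may not attend to positions > t.
    return torch.ones(T, T, dtype=torch.bool).triu(diagonal=1)

# In the decoder's self-attention, this mask is passed to the attention so
# that scores for future positions are set to -inf before the softmax, e.g.:
# y = self_attn(x, x, x, mask=causal_mask(x.size(1)))
```

A decoder block would additionally contain a cross-attention sub-layer whose keys and values come from the encoder stack, wrapped in the same residual-plus-layer-norm pattern.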