A Simple Algorithm for Semi-Supervised Learning with Improved Generalization Error Bound

Ming Ji∗‡ [email protected]
Tianbao Yang∗† [email protected]
Binbin Lin♮ [email protected]
Rong Jin† [email protected]
Jiawei Han‡ [email protected]

‡Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
†Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, 48824, USA
♮State Key Lab of CAD&CG, College of Computer Science, Zhejiang University, Hangzhou, 310058, China
∗Equal contribution

Appearing in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. Copyright 2012 by the author(s)/owner(s).

Abstract

In this work, we develop a simple algorithm for semi-supervised regression. The key idea is to use the top eigenfunctions of the integral operator derived from both labeled and unlabeled examples as the basis functions and learn the prediction function by a simple linear regression. We show that under appropriate assumptions about the integral operator, this approach is able to achieve an improved regression error bound better than existing bounds of supervised learning. We also verify the effectiveness of the proposed algorithm by an empirical study.

1. Introduction

Although numerous algorithms have been developed for semi-supervised learning (Zhu (2008) and references therein), most of them do not have a theoretical guarantee on improving the generalization performance of supervised learning. A number of theories have been proposed for semi-supervised learning, and most of them are based on one of two assumptions: (1) the cluster assumption (Seeger, 2001; Rigollet, 2007; Lafferty & Wasserman, 2007; Singh et al., 2008; Sinha & Belkin, 2009), which assumes that two data points should have the same class label or similar values if they are connected by a path passing through a high density region; (2) the manifold assumption (Lafferty & Wasserman, 2007; Niyogi, 2008), which states that the prediction function lives in a low dimensional manifold of the marginal distribution $P_X$.

It has been pointed out by several studies (Lafferty & Wasserman, 2007; Nadler et al., 2009) that the manifold assumption by itself is insufficient to reduce the generalization error bound of supervised learning. On the other hand, it was found in (Niyogi, 2008) that for certain learning problems, no supervised learner can learn effectively, while a manifold based learner (that knows the manifold or learns it from unlabeled examples) can learn well with relatively few labeled examples. Compared to the manifold assumption, theoretical results based on the cluster assumption appear to be more encouraging. In the early studies (Castelli & Cover, 1995; 1996), the authors show that under the assumption that the marginal distribution $P_X$ is a mixture of class conditional distributions, the generalization error will be reduced exponentially in the number of labeled examples if the mixture is identifiable. Rigollet (2007) defines the cluster assumption in terms of density level sets, and shows a similar exponential convergence rate given a sufficiently large number of unlabeled examples. Furthermore, Singh et al. (2008) show that the mixture components can be identified if $P_X$ is a mixture of a finite number of smooth density functions and the separation/overlap between different mixture components is significantly large. Despite the encouraging results, one major problem of the cluster assumption is that it is difficult to verify given a limited number of labeled examples. In addition, the learning algorithms suggested in (Rigollet, 2007; Singh et al., 2008; Zhang & Ando, 2005) are difficult to implement efficiently even if the cluster assumption holds, making them impractical for real-world problems.

In this work, we aim to develop a simple algorithm for semi-supervised learning that on one hand is easy to implement, and on the other hand is guaranteed to improve the generalization performance of supervised learning under appropriate assumptions. The main idea of the proposed algorithm is to estimate the top eigenfunctions of the integral operator from both the labeled and unlabeled examples, and learn from the labeled examples the best prediction function in the subspace spanned by the estimated eigenfunctions. Unlike the previous studies of exploring eigenfunctions for semi-supervised learning (Fergus et al., 2009; Sinha & Belkin, 2009), we show that under appropriate assumptions, the proposed algorithm achieves a better generalization error bound than supervised learning algorithms.

To derive the generalization error bound, we make a different set of assumptions from previous studies. First, we assume a skewed eigenvalue distribution and bounded eigenfunctions of the integral operator. The assumption of skewed eigenvalue distributions has been verified and used in multiple studies of kernel learning (Koltchinskii, 2011; Steinwart et al., 2006; Minh, 2010; Zhang & Ando, 2005), while the assumption of bounded eigenvectors was mostly found in the study of compressive sensing (Candès & Tao, 2006). Second, we assume that a sufficient number of labeled examples are available, which is also used by the other analysis of semi-supervised learning (Rigollet, 2007). It is the combination of these assumptions that allows us to derive a better generalization error bound for semi-supervised learning.

The rest of the paper is arranged as follows. Section 2 presents the proposed algorithm and verifies its effectiveness by an empirical study. Section 3 shows the improved generalization error bound for the proposed semi-supervised learning, and Section 4 outlines the proofs. Section 5 concludes with future work.

2. Algorithm and Empirical Validation

Let $\mathcal{X}$ be a compact domain or a manifold in the Euclidean space $\mathbb{R}^d$. Let $D = \{x_i, i = 1, \ldots, N \mid x_i \in \mathcal{X}\}$ be a collection of training examples. We randomly select $n$ examples from $D$ for labeling. Without loss of generality, we assume that the first $n$ examples are labeled by $y_l = (y_1, \ldots, y_n)^\top \in \mathbb{R}^n$. We denote by $y = (y_1, \ldots, y_N)^\top \in \mathbb{R}^N$ the true labels for all the examples in $D$. In this study, we assume $y = f(x)$ is decided by an unknown deterministic function $f(x)$. Our goal is to learn an accurate prediction function by exploiting both labeled and unlabeled examples. Below we first present our algorithm and then verify its empirical performance by comparing to the state-of-the-art algorithms for supervised and semi-supervised learning.

Algorithm 1 A Simple Algorithm for Semi-supervised Learning
1: Input
   • $D = \{x_1, \ldots, x_N\}$: labeled and unlabeled examples
   • $y_l = (y_1, \ldots, y_n)^\top$: labels for the first $n$ examples in $D$
   • $s$: the number of eigenfunctions to be used
2: Compute $(\hat{\varphi}_i, \hat{\lambda}_i), i = 1, \ldots, s$, the first $s$ eigenfunctions and eigenvalues for the integral operator $\hat{L}_N$ defined in (4).
3: Compute the prediction $\hat{g}(x)$ in (5), where $\gamma^* = (\gamma_1^*, \ldots, \gamma_s^*)^\top$ is given by solving the following regression problem

$$\gamma^* = \arg\min_{\gamma \in \mathbb{R}^s} \sum_{i=1}^{n} \Big( \sum_{j=1}^{s} \gamma_j \hat{\varphi}_j(x_i) - y_i \Big)^2 \tag{1}$$

4: Output prediction function $\hat{g}(\cdot)$

2.1. A Simple Algorithm for Semi-Supervised Learning

Let $\kappa(\cdot, \cdot): \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a Mercer kernel, and let $\mathcal{H}_\kappa$ be a Reproducing Kernel Hilbert space (RKHS) of functions $\mathcal{X} \to \mathbb{R}$ endowed with kernel $\kappa(\cdot, \cdot)$. We assume that $\kappa$ is a bounded function, i.e., $|\kappa(x, x)| \le 1, \forall x \in \mathcal{X}$. Similar to most semi-supervised learning algorithms, in order to effectively exploit the unlabeled data, we need to relate the prediction function $f(x)$ to the unlabeled examples (or the marginal distribution $P_X$). To this end, we assume there exists an accurate prediction function $g(x) \in \mathcal{H}_\kappa$ with $\|g\|_{\mathcal{H}_\kappa} \le R$. More specifically, we define

$$\varepsilon^2 = \min_{h \in \mathcal{H}_\kappa, \|h\|_{\mathcal{H}_\kappa} \le R} \mathrm{E}_x\big[(f(x) - h(x))^2\big], \tag{2}$$

$$g(x) = \arg\min_{h \in \mathcal{H}_\kappa, \|h\|_{\mathcal{H}_\kappa} \le R} \mathrm{E}_x\big[(f(x) - h(x))^2\big]. \tag{3}$$

Our basic assumption (A0) is that the regression error $\varepsilon^2 \ll R^2$ is small, and the maximum regression error of $g(x)$ for any $x \in \mathcal{X}$ is also small, i.e.,

$$\sup_{x \in \mathcal{X}} (f(x) - g(x))^2 \triangleq \varepsilon_{\max}^2 = O(n \varepsilon^2 / \ln N).$$

To present our algorithm, we define an integral operator over the examples in $D$:

$$\hat{L}_N(f)(\cdot) = \frac{1}{N} \sum_{i=1}^{N} \kappa(x_i, \cdot) f(x_i), \tag{4}$$

where $f \in \mathcal{H}_\kappa$. Let $(\hat{\varphi}_i(x), \hat{\lambda}_i), i = 1, 2, \ldots, N$ be the eigenfunctions and eigenvalues of $\hat{L}_N$ ranked in the descending order of eigenvalues, where $\langle \hat{\varphi}_i(\cdot), \hat{\varphi}_j(\cdot) \rangle_{\mathcal{H}_\kappa} = \delta(i, j)$ for any $1 \le i, j \le N$. According to (Guo & Zhou, 2011), the prediction function $g(x)$ can be well approximated by a function in the subspace spanned by the top eigenfunctions of $\hat{L}_N$. Hence, we propose to learn a target prediction function $\hat{g}(x)$ as a linear combination of the first $s$ eigenfunctions, i.e.,

$$\hat{g}(x) = \sum_{j=1}^{s} \gamma_j^* \hat{\varphi}_j(x), \tag{5}$$

Table 1. Statistics of datasets

Name         #Objects  #Features
insurance    9,822     85
wine         4,898     11
temperature  9,504     2

[...] Learning Repository (Frank & Asuncion, 2010), while the task of the last dataset is to predict the temperature based on the coordinates (latitude, longitude) on the earth surface. All three datasets are designed for regression tasks with real-valued outputs. We choose these three datasets because they fit in with our assumptions that will be elaborated in Section 3.2.

We randomly choose 90% of the data for training, and use the remaining 10% for testing.
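As a rough illustration of Algorithm 1, the NumPy sketch below estimates the top s eigenfunctions of the integral operator in (4) from all N examples, solves the least-squares problem (1) on the labeled ones, and returns the predictor in (5). Several parts are assumptions made for this example and are not taken from the paper: the Gaussian kernel and its bandwidth, the names `rbf_kernel` and `fit_top_eigenfunction_regression`, and the synthetic data in the demo. The eigenfunctions are built from the eigenvectors of the Gram matrix, a standard identity that the text does not spell out.

```python
import numpy as np


def rbf_kernel(A, B, bandwidth):
    """Gaussian kernel matrix between the rows of A and the rows of B.

    Assumed kernel choice: kappa(x, x) = 1, so |kappa(x, x)| <= 1 holds.
    """
    sq_dist = (
        (A ** 2).sum(axis=1)[:, None]
        + (B ** 2).sum(axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth ** 2))


def fit_top_eigenfunction_regression(X, y_labeled, s, bandwidth=0.2):
    """Sketch of Algorithm 1; the first len(y_labeled) rows of X are labeled.

    Returns (predict, coeffs, lam): a prediction function, the (N, s)
    expansion coefficients of the eigenfunctions, and their s eigenvalues.
    """
    N, n = X.shape[0], len(y_labeled)
    K = rbf_kernel(X, X, bandwidth)

    # Step 2: top-s eigenpairs of the Gram matrix K (eigh sorts ascending).
    mu, V = np.linalg.eigh(K)
    mu, V = mu[::-1][:s], V[:, ::-1][:, :s]
    if mu[-1] <= 0:
        raise ValueError("s is too large: need s strictly positive eigenvalues")

    # phi_k(.) = sum_i coeffs[i, k] * kappa(x_i, .) has unit RKHS norm and
    # is an eigenfunction of the operator in (4) with eigenvalue mu_k / N.
    coeffs = V / np.sqrt(mu)
    lam = mu / N

    # Step 3: least squares over the labeled examples, as in problem (1).
    Phi_labeled = K[:n] @ coeffs
    gamma, *_ = np.linalg.lstsq(Phi_labeled, y_labeled, rcond=None)

    def predict(X_new):
        # Equation (5): g_hat(x) = sum_j gamma_j * phi_j(x).
        return rbf_kernel(X_new, X, bandwidth) @ coeffs @ gamma

    return predict, coeffs, lam


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(0.0, 1.0, size=(200, 1))      # N = 200 examples in total
    y = np.sin(2.0 * np.pi * X[:, 0])             # stand-in for the unknown f
    predict, _, lam = fit_top_eigenfunction_regression(X, y[:50], s=6)
    print("eigenvalues:", lam)
    print("mean squared error on D:", np.mean((predict(X) - y) ** 2))
```

Dividing each Gram-matrix eigenvector by the square root of its eigenvalue is what gives the estimated eigenfunctions unit norm in the RKHS, matching the orthonormality condition stated for the eigenfunctions, and the operator eigenvalues are the Gram-matrix eigenvalues divided by N. For data sets the size of those in Table 1, a full eigendecomposition of the N × N Gram matrix may be costly, so a truncated or approximate eigensolver would likely be preferable in practice.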