Comparative Evaluation of Label-Agnostic Selection Bias in Multilingual Hate Speech Datasets

Nedjma Ousidhoum, Yangqiu Song, Dit-Yan Yeung
Department of Computer Science and Engineering
The Hong Kong University of Science and Technology

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 2532–2542, November 16–20, 2020. © 2020 Association for Computational Linguistics

Abstract

Work on bias in hate speech typically aims to improve classification performance while relatively overlooking the quality of the data. We examine selection bias in hate speech in a language and label independent fashion. We first use topic models to discover latent semantics in eleven hate speech corpora, then, we present two bias evaluation metrics based on the semantic similarity between topics and search words frequently used to build corpora. We discuss the possibility of revising the data collection process by comparing datasets and analyzing contrastive case studies.

1 Introduction

Hate speech in social media dehumanizes minorities through direct attacks or incitement to defamation and aggression. Despite its scarcity in comparison to normal web content, Mathew et al. (2019) demonstrated that it tends to reach large audiences faster due to dense connections between users who share such content. Hence, a search based on generic hate speech keywords or controversial hashtags may result in a set of social media posts generated by a limited number of users (Arango et al., 2019). This would lead to an inherent bias in hate speech datasets similar to other tasks involving social data (Olteanu et al., 2019) as opposed to a selection bias (Heckman, 1977) particular to hate speech data.

Mitigation methods usually point out the classification performance and investigate how to debias the detection given false positives caused by gender group identity words such as "women" (Park et al., 2018), racial terms reclaimed by communities in certain contexts (Davidson et al., 2019), or names of groups that belong to the intersection of gender and racial terms such as "black men" (Kim et al., 2020). The various aspects of the dataset construction are less studied though it has recently been shown, by looking at historical documents, that we may somehow neglect the data collection process (Jo and Gebru, 2020). Thus, in the present work, we are interested in improving hate speech data collection with evaluation before focusing on classification performance.

We conduct a comparative study on English, French, German, Arabic, Italian, Portuguese, and Indonesian datasets using topic models, specifically Latent Dirichlet Allocation (LDA) (Blei et al., 2003). We use multilingual word embeddings or word associations to compute the semantic similarity scores between topic words and predefined keywords and define two metrics that calculate bias in hate speech based on these measures. We use the same list of keywords reported by Ross et al. (2016) for German, Sanguinetti et al. (2018) for Italian, Ibrohim and Budi (2019) for Indonesian, Fortuna et al. (2019) for Portuguese; allow more flexibility in both English (Waseem and Hovy, 2016; Founta et al., 2018; Ousidhoum et al., 2019) and Arabic (Albadi et al., 2018; Mulki et al., 2019; Ousidhoum et al., 2019) in order to compare different datasets based on shared concepts that have been reported in their respective paper descriptions; and for French, we make use of a subset of keywords that covers most of the targets reported by Ousidhoum et al. (2019). Our first bias evaluation metric measures the average similarity between topics and the whole set of keywords, and the second one evaluates how often keywords appear in topics. We analyze our methods through different use cases which explain how we can benefit from the assessment.

Our main contributions consist of (1) designing bias metrics that evaluate hateful web content using topic models; (2) examining selection bias in eleven datasets; and (3) turning present hate speech corpora into an insightful resource that may help us balance training data and reduce bias in the future.

2 Related Work

Hate speech labeling schemes depend on the general purpose of the dataset. The annotations may include hateful vs. non hateful (Basile et al., 2019), racist, sexist, and none (Waseem and Hovy, 2016); as well as discriminating target attributes (ElSherief et al., 2018), the degree of intensity (Sanguinetti et al., 2018), and the annotator's sentiment towards the tweets (Ousidhoum et al., 2019). Besides English (Basile et al., 2019; Waseem and Hovy, 2016; Davidson et al., 2017; ElSherief et al., 2018; Founta et al., 2018; Qian et al., 2018), we notice a growing interest in the study of hate speech in other languages, such as Portuguese (Fortuna et al., 2019), Italian (Sanguinetti et al., 2018), German (Ross et al., 2016), Indonesian (Ibrohim and Budi, 2019), French (Ousidhoum et al., 2019), Dutch (Hee et al., 2015), and Arabic (Albadi et al., 2018; Mulki et al., 2019; Ousidhoum et al., 2019). Challenging questions being tackled in this area involve the way abusive language spreads online (Mathew et al., 2019), fast changing topics during data collection (Liu et al., 2019), user bias in publicly available datasets (Arango et al., 2019), bias in hate speech classification and different methods to reduce it (Park et al., 2018; Davidson et al., 2019; Kennedy et al., 2020).

Bias in social data is broad and addresses a wide range of issues (Olteanu et al., 2019; Papakyriakopoulos et al., 2020). Shah et al. (2020) present a framework to predict the origin of different types of bias including label bias (Sap et al., 2019), selection bias (Garimella et al., 2019), model overamplification (Zhao et al., 2017), and semantic bias (Garg et al., 2018). Existing work deals with bias through the construction of large datasets and the definition of social frames (Sap et al., 2020), the investigation of how current NLP models might be non-inclusive of marginalized groups such as people with disabilities (Hutchinson et al., 2020), mitigation (Dixon et al., 2018; Sun et al., 2019), or better data splits (Gorman and Bedrick, 2019). However, Blodgett et al. (2020) report a missing normative process to inspect the initial reasons behind bias in NLP without the main focus being on the performance which is why we choose to investigate the data collection process in the first place.

In order to operationalize the evaluation of selection bias, we use topic models to capture latent semantics. Regularly used topic modeling techniques such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003) have proven their efficiency to handle several NLP applications such as data exploration (Rodriguez and Storer, 2020), Twitter hashtag recommendation (Godin et al., 2013), authorship attribution (Seroussi et al., 2014), and text categorization (Zhou et al., 2009).

In order to evaluate the consistency of the generated topics, Newman et al. (2010) used crowdsourcing and semantic similarity metrics, essentially based on Pointwise Mutual Information (PMI), to assess the coherence; Mimno et al. (2011) estimated coherence scores using conditional log-probability instead of PMI; Lau et al. (2014) enhanced this formulation based on normalized PMI (NPMI); and Lau and Baldwin (2016) investigated the effect of cardinality on topic generation. Similarly, we use topics and semantic similarity metrics to determine the quality of hate speech datasets, and test on corpora that vary in language, size, and general collection purposes for the sake of examining bias up to different facets.

3 Bias Estimation Method

The construction of toxic language and hate speech corpora is commonly conducted based on keywords and/or hashtags. However, the lack of an unequivocal definition of hate speech, the use of slurs in friendly conversations as opposed to sarcasm and metaphors in elusive hate speech (Malmasi and Zampieri, 2018), and the data collection timeline (Liu et al., 2019) contribute to the complexity and imbalance of the available datasets. Therefore, training hate speech classifiers easily produces false positives when tested on posts that contain controversial or search-related identity words (Park et al., 2018; Sap et al., 2019; Davidson et al., 2019; Kim et al., 2020).

To claim whether a dataset is rather robust to keyword-based selection or not, we present two label-agnostic metrics to evaluate bias using topic models. First, we generate topics using Latent Dirichlet Allocation (LDA) (Blei et al., 2003). Then, we compare topics to predefined sets of keywords using a semantic similarity measure. We test our methods on different numbers of topics and topic words.

3.1 Predefined Keywords

In contrast to Waseem (2016), who legitimately questions the labeling process by comparing ama-

DATASET                                    KEYWORDS
Ousidhoum et al. (2019),                   ni**er, invasion, attack
Waseem and Hovy (2016),
Founta et al. (2018)
Ousidhoum et al. (2019)                    FR: migrant, sale, m*ng*l
                                           EN: migrant, filthy, mong****d
Albadi et al. (2018),                      AR: [Arabic text not recoverable]
Ousidhoum et al. (2019),                   EN: woman, camels, pig
Mulki et al. (2019)

DATASET                                    TOPIC WORDS
Founta et al. (2018)                       f***ing, like, know
Ousidhoum et al. (2019)                    ret***ed, sh*t**le, c***
Waseem and Hovy (2016)                     sexist, andre, like
Ousidhoum et al. (2019)                    FR: m*ng*l, gauchiste, sale
                                           EN: mon*y, leftist, filthy
Albadi et al. (2018)                       AR: [Arabic text not recoverable]
                                           EN: Shia, Jewish, Christianity
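
The two metrics named in the introduction (average similarity between topics and the whole keyword set, and how often keywords appear in topics) can be illustrated with a rough sketch. This excerpt does not give the paper's exact formulas, so everything below is an assumption made for illustration: cosine similarity as the semantic similarity measure, a hypothetical `threshold` for deciding that a topic word "matches" a keyword, the function names, and the toy two-dimensional vectors standing in for multilingual word embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def average_similarity(topics, keywords, emb):
    """Mean similarity over every (topic word, keyword) pair that has embeddings.

    Illustrative stand-in for the first metric; not the paper's exact formula.
    """
    scores = [
        cosine(emb[w], emb[k])
        for topic in topics
        for w in topic
        for k in keywords
        if w in emb and k in emb
    ]
    return sum(scores) / len(scores) if scores else 0.0


def keyword_topic_rate(topics, keywords, emb, threshold=0.9):
    """Fraction of topics with at least one word whose similarity to some keyword
    reaches the threshold.

    Illustrative stand-in for the second metric; the threshold is hypothetical.
    """
    def matches(topic):
        return any(
            w in emb and k in emb and cosine(emb[w], emb[k]) >= threshold
            for w in topic
            for k in keywords
        )

    return sum(matches(t) for t in topics) / len(topics) if topics else 0.0


# Hypothetical toy embeddings and topics, not taken from any dataset in the paper.
emb = {
    "migrant": [1.0, 0.0],
    "refugee": [0.9, 0.1],
    "like": [0.0, 1.0],
    "know": [0.1, 0.9],
}
topics = [["refugee", "like"], ["like", "know"]]
keywords = ["migrant"]

avg = average_similarity(topics, keywords, emb)
rate = keyword_topic_rate(topics, keywords, emb)
print(f"average similarity: {avg:.3f}, keyword-topic rate: {rate:.2f}")
```

Under these assumptions, a corpus whose topics sit close to its own search keywords would score high on both functions, which is the kind of signal the label-agnostic evaluation is looking for; in practice the topics would come from an LDA model and the vectors from pretrained multilingual embeddings.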