Applying Return Decomposition for Delayed Rewards (RUDDER) to Text-Based Games

Total Pages: 16

File Type: pdf, Size: 1020 KB
Applying Return Decomposition for Delayed Rewards (RUDDER) to Text-Based Games

Submitted by: Fabian Paischer
Submitted at: Institute for Machine Learning
Supervisor: Univ.-Prof. Dr. Sepp Hochreiter
Co-Supervisor: Jose Arjona-Medina, PhD
September 2020

Master Thesis to obtain the academic degree of Master of Science in the Master's Program Bioinformatics

JOHANNES KEPLER UNIVERSITY LINZ
Altenbergerstraße 69, 4040 Linz, Österreich
www.jku.at
DVR 0093696

Abstract

The advent of text-based games traces back to the invention of the first computers, which could only display and interact with text in the form of ASCII characters. With advancing technology in computer graphics, such games eventually fell into oblivion; however, they provide a great environment for machine learning algorithms to learn language understanding and common-sense reasoning simultaneously, based solely on interaction. A vast variety of text-based games has been developed, spanning multiple domains. Recent work has shown that navigating through text-based worlds is extremely cumbersome for reinforcement learning algorithms: state-of-the-art agents reach reasonable performance only on fairly easy quests. This work focuses on solving text-based games via reinforcement learning within the TextWorld framework [Côté et al., 2018]. A substantial part of this work builds on recent work by [Jain et al., 2019] and [Arjona-Medina et al., 2018]. First, a reproducibility study of the work by [Jain et al., 2019] is conducted, demonstrating that reproducibility remains a common problem in reinforcement learning. Since no prior work had applied return decomposition and reward redistribution to text-based environments, a feasibility study was conducted. Further, a hyperparameter search was performed to find the best possible parameters for continuous return prediction and regularization.
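The return-decomposition idea referred to above can be sketched in a few lines: a predictor emits, at every step, an estimate of the final episode return, and the differences of adjacent predictions become the redistributed per-step rewards. The following is a minimal hypothetical illustration of that mechanism, not the thesis code (which uses an LSTM predictor); the function name and the toy numbers are invented for the example.

```python
def redistribute(return_predictions):
    """Turn per-step return predictions g_0..g_T into per-step rewards.

    r_0 = g_0 and r_t = g_t - g_{t-1}, so the redistributed rewards
    sum (telescopically) to the final prediction g_T.
    """
    rewards = []
    previous = 0.0
    for g_t in return_predictions:
        rewards.append(g_t - previous)
        previous = g_t
    return rewards

# Toy episode: the predictor becomes certain of the final return (3.0)
# as soon as the decisive action happens at step 2.
predictions = [0.0, 0.1, 2.9, 3.0, 3.0]
rewards = redistribute(predictions)

# The redistribution conserves the predicted return ...
assert abs(sum(rewards) - predictions[-1]) < 1e-9
# ... and concentrates credit on the decisive step.
print([round(r, 6) for r in rewards])  # [0.0, 0.1, 2.8, 0.1, 0.0]
```

The point of the construction is credit assignment: a sparse or delayed end-of-episode return is moved to the steps where the predictor's expectation actually changed.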
Reward redistributions are exhibited for randomly sampled episodes by taking the difference between adjacent return predictions. Furthermore, a novel training procedure is presented for agents navigating text-based environments, incorporating return prediction as a critic network. The actor is pre-trained with deep Q-learning following [Jain et al., 2019] and fine-tuned with proximal policy optimization [Schulman et al., 2017], using the redistributed rewards of the critic as the advantage. The agent achieves results comparable to [Jain et al., 2019] on handcrafted benchmark games created within the TextWorld framework, establishing a baseline using policy gradients. On simpler benchmark games the agent performs comparably to [Jain et al., 2019]; moreover, fine-tuning with policy gradients enables the agent to recover from runs that performed particularly poorly under Q-learning. On more advanced games, fine-tuning with proximal policy optimization is capable of recovering when Q-learning stagnates, and shows improvements within a small number of training steps. A longer fine-tuning phase, or policy-gradient training from scratch, might yield even better performance on the benchmark games.

Zusammenfassung

Text-based games date back to a time when the first computers were developed and were only able to output text in the form of ASCII characters. With progress in computer graphics, these games fell into oblivion. They nevertheless offer an interesting and difficult environment for machine learning algorithms to learn language understanding as well as logical reasoning purely through interaction with the game environment. An enormous number of different text-based games has since been developed, across the most diverse domains. Recent work in this field confirms how difficult it is for reinforcement learning algorithms to navigate through such game environments.
Current state-of-the-art algorithms are only able to solve simple tasks. In this thesis I address solving text-based games with reinforcement learning in the TextWorld [Côté et al., 2018] framework. A large part of my work builds on recent publications by [Jain et al., 2019] and [Arjona-Medina et al., 2018]. First, I conduct a reproducibility study based on the publication of [Jain et al., 2019] and show that reproducing results in reinforcement learning is a widespread problem. I then turn to the prediction of returns in reinforcement learning and show that it is feasible in text-based settings. Return prediction introduces several new parameters for continuous prediction and regularization, which I tune for optimal results. Furthermore, I show that it is possible to redistribute the total return of a played episode via the differences of adjacent continuous predictions. I give suggestions for how the prediction and redistribution of returns could be improved further. Finally, I propose a new training method for reinforcement learning algorithms by realizing return prediction with a critic network. The actor network is trained with deep Q-learning following the training method of [Jain et al., 2019] and is further trained with policy gradients based on the proximal policy optimization [Schulman et al., 2017] update rule. My trained model achieves comparable results on benchmarks created with the TextWorld framework, whereby I establish a baseline based on policy gradients.
On the easier variants of the benchmarks, the model achieves performance comparable to the state-of-the-art model of [Jain et al., 2019], with fine-tuning via policy gradients able to improve runs that performed particularly poorly under Q-learning. On the harder benchmarks, fine-tuning is able to improve the poor runs of the Q-learning model even after a small number of training steps. A longer fine-tuning phase, or training with PPO from the start, could help achieve better performance on the benchmark games than previously reported.

Acknowledgments

First and foremost, I would like to thank my supervisor Jose Arjona-Medina for his great support and guidance throughout this entire work; his most recent publication [Arjona-Medina et al., 2018] forms a substantial part of it. I would also like to thank the Institute for Machine Learning for providing as many resources as possible to conduct experiments and collect results efficiently. Special thanks to Vishal Jain for providing most of the code from his prior work on algorithmic improvements for interactive fiction [Jain et al., 2019]. Further, I would like to thank the researchers from Google and Facebook for developing the open-source deep learning framework PyTorch [Paszke et al., 2019], which was used for training the neural networks. Thanks also to Michael Widrich from the Institute of Machine Learning for developing the Python package widis-lstm-tools (https://github.com/widmi/widis-lstm-tools.git), which greatly simplified implementing the LSTM architecture used for return prediction. [Hunter, 2007] developed the Python package matplotlib, which was used for most visualizations in this work. Finally, I would like to thank [Côté et al., 2018] for developing the Python package TextWorld, which enables easy handcrafting of text-based games and convenient interaction between an agent and text-based environments.
List of Figures

1. Introduction to Zork
2. Types of Text-Based Games [He et al., 2015]
3. Reinforcement Learning Paradigm [Sutton and Barto, 1998]
4. Unified View of Reinforcement Learning [Sutton and Barto, 1998]
5. TD Update for V(s_t)
6. Overview of the TextWorld Framework, taken from [Côté et al., 2018]
7. Logical Representation of States; P represents the player; taken from [Côté et al., 2018]
8. Logical Representation of the Transition Function; uppercase letters define types of objects (F: food type, C: container, S: supporter, R: room), P represents the player, I represents the player's inventory, and ( represents implication; taken from [Côté et al., 2018]
9. Logical Representation of the Action Space, taken from [Côté et al., 2018]
10. Benchmarks on curated games within TextWorld, taken from [Côté et al., 2018]
11. Sketch of the POMDP formulation
12. Score Contextualization Architecture
13. LSTM architecture used for return prediction
14. Possible scores for each level, taken from [Jain et al., 2019]
15. Graphical representation of levels 1 and 2, taken from [Jain et al., 2019]
16. Graphical representation of higher levels, taken from [Jain et al., 2019]
17. Sketch of the map of Zork 1
18. Results for level 1; right image shows results by [Jain et al., 2019] (red curve: Score Contextualization, blue curve: Score Contextualization + Action Gating, gray curve: baseline, re-implementation of [Yuan et al., 2018]); left image shows reproduced results (re-implementation of Score Contextualization)
19. Results for level 2; right image shows results by [Jain et al., 2019] (red curve: Score Contextualization, blue curve: Score Contextualization + Action Gating, gray curve: baseline, re-implementation of [Yuan et al., 2018]); left image shows reproduced results (re-implementation of Score Contextualization)
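The fine-tuning procedure summarized in the abstract, where PPO updates use the critic's redistributed rewards as per-step advantages, can be sketched with the clipped surrogate objective of [Schulman et al., 2017]. This is a minimal hypothetical illustration, not the thesis implementation; the function name `clipped_objective` and the numeric values are invented for the example.

```python
def clipped_objective(ratio, advantage, epsilon=0.2):
    """PPO clipped surrogate for one (state, action) sample.

    ratio = pi_new(a|s) / pi_old(a|s); here the advantage would be the
    redistributed reward derived from the critic's return predictions.
    """
    # Clip the probability ratio into [1 - epsilon, 1 + epsilon] ...
    clipped = max(min(ratio, 1.0 + epsilon), 1.0 - epsilon)
    # ... and take the pessimistic (lower) of the two surrogate values.
    return min(ratio * advantage, clipped * advantage)

# With a positive advantage, the gain from increasing an action's
# probability is capped once the ratio exceeds 1 + epsilon.
print(clipped_objective(1.5, 2.0))  # 2.4 (clipped at ratio 1.2)
print(clipped_objective(0.9, 2.0))  # 1.8 (unclipped)
```

Using the redistributed rewards in place of a discounted-return advantage is what lets the dense, per-step credit from the critic drive the policy-gradient fine-tuning.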