Methods for Efficient Supervision in Natural Language Processing

Lucas Sterckx

Dutch title: Methoden voor efficiënte supervisie van automatische taalverwerking

Advisors: prof. dr. ir. C. Develder, dr. ir. T. Demeester

Dissertation submitted to obtain the degree of Doctor of Computer Science Engineering
Department of Information Technology (Chair: prof. dr. ir. B. Dhoedt)
Faculty of Engineering and Architecture, Ghent University
Academic year 2017-2018

ISBN 978-94-6355-130-4
NUR 984
Legal deposit: D/2018/10.500/48

Ghent University
Faculty of Engineering and Architecture
Department of Information Technology
imec, Internet Technology and Data Science Lab (IDLab)

Examination board:
prof. C. Develder (advisor)
dr. ir. T. Demeester (advisor)
em. prof. D. De Zutter (chair)
prof. B. Dhoedt (secretary)
prof. I. Augenstein
prof. A. Bronselaer
prof. V. Hoste

Acknowledgments

During my PhD I was supported unconditionally, through thick and thin, by many people. I thank everyone who gave me every opportunity for personal and professional growth.

A word of thanks:

to prof. Chris Develder, who gave me his trust for five years and every opportunity for scientific research, together with the freedom to walk my own path;

to Thomas Demeester, who, despite his own busy schedule, always made time and stood ready with advice and a listening ear, always with the same enthusiasm and the same kindness;

to my parents, Godelieve and Paul; to my brother and sister-in-law, Thomas and Karolien; to my godmother, Maria;

to Johannes, who supported me with all his patience and knowledge from day one and taught me to be critical of my own and existing work;

to Nasrin, who, despite living abroad, was always cheerful and kind to everyone, all while producing an incredible amount of high-quality research;

to Giannis, or Lucas V2.0 as I like to call him, who picked up the torch and ran so fast with it, for making the future of research in NLP at IDLab look bright;

to all my research collaborators over the years, Cedric, Fréderic, Baptist, Klim, Thong, Steven, Matthias, Jason, Bill, Cornelia and Laurent, from whom I learned so much;

to the members of the examination board, em. prof. Daniël De Zutter, prof. Bart Dhoedt, prof. Isabelle Augenstein, prof. Antoon Bronselaer and prof. Veronique Hoste, for making time and providing me with nothing but constructive feedback on this thesis and my research in general;

to prof. Piet Demeester, the driving force behind the research group;

to the founders and attendees of the machine learning reading group, who gracefully shared their knowledge every week;

to the IDLab admins and support staff, for always standing by with technical support and providing me with the infrastructure to do research;

to all my friends and colleagues at IDLab, for providing me with such a pleasant working environment;

to all friends I made abroad, who made me feel at home when I was far from it;

to all colleagues and student workers with whom I worked on projects, for giving me the means to do research;

to the Research Foundation - Flanders (FWO), which supported my research through a travel grant;

to all my fellow musicians and the members of Concertband Theobaldus Groot-Pepingen;

and to all my friends from the Pajottenland and the Ghent area.

Ghent, June 2018
Lucas Sterckx

"The really important kind of freedom involves attention, and awareness, and discipline, and effort, and being able truly to care about other people and to sacrifice for them, over and over, in myriad petty little unsexy ways, every day."
— David Foster Wallace, 2005

Table of Contents

Acknowledgments (Dankwoord)
Samenvatting (Summary in Dutch)
Summary

1 Introduction
  1.1 The Data Bottleneck
  1.2 Beyond Traditional Supervision
    1.2.1 Semi-Supervised Learning
    1.2.2 Active Learning
    1.2.3 Multi-Task Learning
    1.2.4 Transfer Learning
    1.2.5 Weak Supervision
  1.3 Efficient Supervision for NLP
    1.3.1 Semi-Supervised Learning for NLP
    1.3.2 Distant Supervision
    1.3.3 Information Extraction using Weak Supervision
    1.3.4 Crowdsourcing
    1.3.5 Data Augmentation for NLP
    1.3.6 Transfer Learning for NLP
    1.3.7 Multi-Task Learning for NLP
  1.4 Research Contributions
  1.5 Publications
    1.5.1 Publications in international journals (listed in the Science Citation Index)
    1.5.2 Publications in international conferences (listed in the Science Citation Index)
    1.5.3 Publications in other international conferences
  References

2 Weak Supervision for Automatic Knowledge Base Population
  2.1 Introduction
  2.2 Related Work
    2.2.1 Supervised Relation Extraction
      2.2.1.1 Bootstrapping Models for Relation Extraction
      2.2.1.2 Distant Supervision
    2.2.2 Semi-supervised Relation Extraction
    2.2.3 TAC KBP English Slot Filling
    2.2.4 Active Learning and Feature Labeling
    2.2.5 Distributional Semantics
  2.3 Labeling Strategy for Noise Reduction
    2.3.1 Distantly Supervised Training Data
    2.3.2 Labeling High Confidence Shortest Dependency Paths
    2.3.3 Noise Reduction using Semantic Label Propagation
  2.4 Experimental Results
    2.4.1 Testing Methodology
    2.4.2 Knowledge Base Population System
    2.4.3 Methodologies for Supervision
    2.4.4 Pattern-based Restriction vs. Similarity-based Extension
    2.4.5 End-to-End Knowledge Base Population Results
    2.4.6 2015 TAC KBP Cold Start Slot Filling
  2.5 Conclusions
  2.A Using Active Learning and Semantic Clustering for Noise Reduction in Distant Supervision
    2.A.1 Introduction
    2.A.2 Related Work
    2.A.3 Semantic-Cluster-Aware Sampling
    2.A.4 Experiments and Results
    2.A.5 Conclusion
  References

3 Weakly Supervised Evaluation of Topic Models
  3.1 Introduction
  3.2 Experimental Setup
  3.3 Topic Model Assessment
    3.3.1 Measuring Topical Alignment
    3.3.2 Semantic Coherence
    3.3.3 Graphical Alignment of Topics
  3.4 Conclusion
  References

4 Creation and Evaluation of Large Keyphrase Extraction Collections with Multiple Opinions
  4.1 Introduction
  4.2 Test Collections
    4.2.1 Document Collection
    4.2.2 Collecting Keyphrases
    4.2.3 Annotation Tool
    4.2.4 Keyphrase Annotation Guidelines
    4.2.5 Annotator Disagreement
  4.3 Keyphrase Extraction Techniques
    4.3.1 Candidate Selection
    4.3.2 Unsupervised Keyphrase Extraction
    4.3.3 Supervised Keyphrase Extraction
      4.3.3.1 Feature Design and Classification
      4.3.3.2 Supervised Model
  4.4 Systematic Evaluation
    4.4.1 Experimental Set-up
    4.4.2 Evaluation Setup using Multiple Opinions
    4.4.3 Comparison of Different Techniques
    4.4.4 Comparison of Different Test Collections
    4.4.5 Training Set Size
    4.4.6 Training Data from Multiple Opinions
    4.4.7 Effect of Document Length
  4.5 Guidelines for Automatic Keyphrase Extraction
  4.A Supervised Keyphrase Extraction as Positive Unlabeled Learning
    4.A.1 Introduction
    4.A.2 Noisy Training Data for Supervised Keyphrase Extraction
    4.A.3 Reweighting Keyphrase Candidates
      4.A.3.1 Experiments and Results
    4.A.4 Conclusion
  References

5 Sequence-to-Sequence Applications using Weak Supervision
  5.1 Introduction
  5.2 Break it Down for Me: A Study in Automated Lyric Annotation
    5.2.1 Introduction
    5.2.2 The Genius ALA Dataset
    5.2.3 Context Independent Annotation
    5.2.4 Baselines
    5.2.5 Evaluation
      5.2.5.1 Data
    5.2.6 Measures
      5.2.6.1 Hyperparameters and Optimization
    5.2.7 Results
    5.2.8 Related Work
    5.2.9 Conclusion and Future Work
  5.3 Prior Attention for Style-aware Sequence-to-Sequence Models
    5.3.1 Introduction
    5.3.2 Generation of Prior Attention