Arxiv:1912.13337V2 [Cs.CL] 1 Sep 2020

Total Page:16

File Type:pdf, Size:1020Kb

Arxiv:1912.13337V2 [Cs.CL] 1 Sep 2020 What Does My QA Model Know? Devising Controlled Probes using Expert Knowledge Kyle Richardson and Ashish Sabharwal Allen Institute for AI, Seattle, WA, USA {kyler,ashishs}@allenai.org Abstract Benchmark Tasks 1.OpenBook QA (OBQA) (Mihaylov et al., 2018) Open-domain question answering (QA) in- Question: Which of the following is a [specific type of] learned behavior? A. thinking B. cooking C. hearing D. breathing volves many knowledge and reasoning chal- 2. ARC Challenge (Clark et al., 2018) lenges, but are successful QA models actu- Question: What is a worldwide increase in temperature called [definition]? ally learning such knowledge when trained A. greenhouse effect B. global warming C. ozone depletion D. solar heating on benchmark QA tasks? We investigate Train this via several new diagnostic tasks probing Question-Answering Model whether multiple-choice QA models know Evaluate Continue Training definitions and taxonomic reasoning—two Dataset Probes skills widespread in existing benchmarks Question : In ‘the toddler could Question : In ‘the toddler could and fundamental to more complex reason- count’, count is best defined as count to 100’, count is a type of ing. We introduce a methodology for auto- A. name or recite the numbers... A. recite event B.... C.... matically building probe datasets from ex- Generate: GEN(τ) pert knowledge sources, allowing for sys- Expert Knowledge (Triples T ) (type-of,count.v.03,recite.v.02) tematic control and a comprehensive eval- (ex,count.v.02, count, ’the toddler could count’) uation. We include ways to carefully (defined-as,count.v.02, ’name or recite the numbers...’ ) (defined-as, recite.v.02, ’read aloud from memory.’ ) control for artifacts that may arise dur- ing this process. Our evaluation confirms Figure 1: An illustration of our experimental setup and that transformer-based multiple-choice QA probing methodology. The gray box at the top shows models are already predisposed to recognize questions from existing open-domain QA benchmarks, certain types of structural linguistic knowl- requiring background knowledge. The yellow box edge. However, it also reveals a more nu- shows simple examples of multiple-choice questions in anced picture: their performance notably our proposed Definition and ISA probes. degrades even with a slight increase in the number of “hops” in the underlying taxo- Recent success in QA has been driven largely nomic hierarchy, and with more challeng- by new benchmarks (Zellers et al., 2018; Tal- ing distractor candidates. Further, existing models are far from perfect when assessed mor et al., 2019b; Bhagavatula et al., 2020; Khot at the level of clusters of semantically con- et al., 2020, etc.) and advances in model pre- nected probes, such as all hypernym ques- training (Radford et al., 2018; Devlin et al., 2019). tions about a single concept. This raises a natural question: Do state-of-the-art multiple-choice QA (MCQA) models that excel at arXiv:1912.13337v2 [cs.CL] 1 Sep 2020 standard benchmarks truly possess basic knowl- 1 Introduction edge and reasoning skills expected in these tasks? Automatically answering questions, especially in Answering this question is challenging due to the open-domain setting where minimal or no con- limited understanding of heavily pre-trained com- textual knowledge is explicitly provided, requires plex models and the way existing MCQA datasets considerable background knowledge and reason- are constructed. We focus on the second aspect, ing abilities. For example, answering the two which has two limitations: Large-scale crowd- questions in the top gray box in Figure1 requires sourcing leaves little systematic control over ques- identifying a specific ISA relation (that ‘cooking’ tion semantics or requisite background knowl- is a type of ‘learned behavior’) as well as recall- edge (Welbl et al., 2017), while questions from ing a concept definition (that ‘global warming’ is real exams tend to mix multiple challenges in defined as a ‘worldwide increase in temperature’). a single dataset, often even in a single ques- tion (Clark et al., 2018; Boratko et al., 2018). probes measure competence in various settings To address this challenge, we propose system- including hypernymy, hyponymy, and synonymy atically constructing model competence probes by detection, as well as word sense disambiguation. exploiting structured information contained in ex- Our exploration is closely related to the recent pert knowledge sources such as knowledge graphs work of Talmor et al.(2019a). However, a key dif- and lexical taxonomies. Importantly, these probes ference is that they study language models (LMs), are diagnostic tasks, designed not to impart new for which there is no clear a priori expectation of knowledge but to assess what models trained on specific knowledge or reasoning skills. In contrast, standard QA benchmarks already know; as such, we focus on models heavily trained for benchmark they serve as proxies for the types of questions QA tasks, where such tasks are known to require that a model might encounter in its original task, certain types of knowledge and reasoning skills. but involve a single category of knowledge under We probe whether such skills are actually learned various controlled conditions and perturbations. by QA models, either during LM pre-training or Figure1 illustrates our methodology. We start when training for the QA tasks. with a set of standard MCQA benchmark tasks D Recognizing the need for suitable controls in and a set of models M trained on D. Our goal is to any synthetic probing methodology (Hewitt and assess how competent these models are relative to Liang, 2019; Talmor et al., 2019a), we introduce a particular knowledge or reasoning skill S (e.g., two controls: (a) the probe must be challenging definitions) that is generally deemed important for for any model that lacks contextual embeddings, performing well on D. To this end, we systemati- and (b) strong models must have a low inoculation cally and automatically generate a set of dataset cost, i.e., when fine-tuned on a few probing ex- probes PS from information available in expert amples, the model should mostly retain its perfor- knowledge sources. Each probe is an MCQA ren- mance on its original task.2 This ensures that the dering of the target information (see examples in probe performance of a model, even when lightly Figure1, yellow box). We then use these probes inoculated on probing data, reflects its knowledge PS to ask two empirical questions: (1) How well as originally trained for the benchmark task, which do models in M already trained on D perform is precisely what we aim to uncover. on probing tasks PS? (2) With additional nudg- Constructing a wide range of systematic tests ing, can models be re-trained, using only a modest is critical for having definitive empirical evidence amount of additional data, to perform well on each of model competence on any given phenomenon. probing task PS with minimal performance loss Such tests should cover a broad set of concepts and on their original tasks D (thus giving evidence of question variations (i.e., systematic adjustments to prior model competence on S)? how the questions are constructed). When assess- While our methodology is general, our exper- ing ISA reasoning, not only is it important to rec- iments focus on probing state-of-the-art MCQA ognize in the question in Figure1 that cooking is a models in the domain of grade-school level sci- learned behavior, but also that cooking is a general ence, which is considered particularly challenging type of behavior or, through a few more inferential with respect to background knowledge and infer- steps, a type of human activity. Our automatic use ence (Clark, 2015; Clark et al., 2019; Khot et al., of expert knowledge sources allows constructing 2020). In addition, existing science benchmarks such high-coverage probes, circumventing pitfalls are known to involve widespread use of defini- of solicitation bias and reporting bias. tion and taxonomic knowledge (see detailed anal- Our results confirm that transformer-based QA ysis by Clark et al.(2018); Boratko et al.(2018)), models3 have a remarkable ability to recognize the which is also fundamental to deeper reasoning. types of knowledge captured in our probes—even Accordingly, we employ the most widely used lex- without additional fine-tuning (i.e., in a zero-shot ical ontology WordNet (Miller, 1995) and publicly setting). Such models can even outperform strong available dictionaries as sources of expert knowl- 2Standard inoculation (Liu et al., 2019a) is known to drop edge to construct our probes, WordNetQA (Sec- performance on the original task. We use a modified objec- 1 tion 3.1) and DictionaryQA (Section 3.2) . These tive (Richardson et al., 2020) to alleviate this issue. 3Different from Talmor et al.(2019a), we find BERT and 1All data and code are available at https://github. RoBERTa based QA models to be qualitatively similar, per- com/allenai/semantic_fragments forming within 5% of each other on nearly all probes. task-specific non-transformer models trained di- (2019), and other variants thereof, which are all rectly on our probing tasks (e.g., +26% compared based on the transformer architecture of Vaswani to a task-specific LSTM). We also show that the et al.(2017). There have been recent stud- same models can be effectively re-fine-tuned on ies into the types of relational knowledge con- small samples (even 100 examples) of probe data, tained in large-scale knowledge models (Schick and that high performance on the probes tends to and Schütze, 2020; Petroni et al., 2019; Jiang correlate with a smaller drop in the model’s per- et al., 2019), which also probe models using struc- formance on the original QA task. tured knowledge sources. These studies, how- Our comprehensive assessment also reveals im- ever, primarily focus on unearthing the knowl- portant nuances to the positive trend.
Recommended publications
  • Technical Reference Manual for the Standardization of Geographical Names United Nations Group of Experts on Geographical Names
    ST/ESA/STAT/SER.M/87 Department of Economic and Social Affairs Statistics Division Technical reference manual for the standardization of geographical names United Nations Group of Experts on Geographical Names United Nations New York, 2007 The Department of Economic and Social Affairs of the United Nations Secretariat is a vital interface between global policies in the economic, social and environmental spheres and national action. The Department works in three main interlinked areas: (i) it compiles, generates and analyses a wide range of economic, social and environmental data and information on which Member States of the United Nations draw to review common problems and to take stock of policy options; (ii) it facilitates the negotiations of Member States in many intergovernmental bodies on joint courses of action to address ongoing or emerging global challenges; and (iii) it advises interested Governments on the ways and means of translating policy frameworks developed in United Nations conferences and summits into programmes at the country level and, through technical assistance, helps build national capacities. NOTE The designations employed and the presentation of material in the present publication do not imply the expression of any opinion whatsoever on the part of the Secretariat of the United Nations concerning the legal status of any country, territory, city or area or of its authorities, or concerning the delimitation of its frontiers or boundaries. The term “country” as used in the text of this publication also refers, as appropriate, to territories or areas. Symbols of United Nations documents are composed of capital letters combined with figures. ST/ESA/STAT/SER.M/87 UNITED NATIONS PUBLICATION Sales No.
    [Show full text]
  • A Structural Analysis of Mide Chants
    A Structural Analysis of Mide Chants GEORGE FULFORD McMaster University Introduction In this paper I shall investigate the relationship between words and im­ agery in seven song scrolls used by members of an Ojibwa religious society known as the Midewiwin. These texts were collected in the late 1880s by W.J. Hoffman for the Bureau of American Ethnology and subsequently published in their seventh Annual Report (Hoffman 1891).1 All the pictographs which I shall discuss were inscribed on birch bark and used by members of the Midewiwin to record chants used in their ceremonies. According to Hoffman (1891:192) these chants consisted of only a few words or short phrases. They were sung by single individuals — never in chorus — and were repeated over and over again, usually to the accompaniment of a wooden kettle drum. In a previous study (Fulford 1989) I analyzed patterns of structural variation among these pictographs and outlined how three complex symbols — the otter, bear and bird — evolved from clan emblems into pictographic markers. The focus of my earlier study was purely iconographic; in this paper I shall explore the verbal structure of Midewiwin chants in order to show some of the ways in which they were pictographically encoded. For the sake of convenience, I have limited my discussion to song scrolls sharing the otter symbol. Six of the seven scrolls that I shall examine contain this marking device. Although one (designated Scroll C in the appendix) lacks an otter, it displays many other formal similarities with Hoffman published transcriptions and translations of 23 songs performed at the White Earth Reservation in northwestern Minnesota.
    [Show full text]
  • 5892 Cisco Category: Standards Track August 2010 ISSN: 2070-1721
    Internet Engineering Task Force (IETF) P. Faltstrom, Ed. Request for Comments: 5892 Cisco Category: Standards Track August 2010 ISSN: 2070-1721 The Unicode Code Points and Internationalized Domain Names for Applications (IDNA) Abstract This document specifies rules for deciding whether a code point, considered in isolation or in context, is a candidate for inclusion in an Internationalized Domain Name (IDN). It is part of the specification of Internationalizing Domain Names in Applications 2008 (IDNA2008). Status of This Memo This is an Internet Standards Track document. This document is a product of the Internet Engineering Task Force (IETF). It represents the consensus of the IETF community. It has received public review and has been approved for publication by the Internet Engineering Steering Group (IESG). Further information on Internet Standards is available in Section 2 of RFC 5741. Information about the current status of this document, any errata, and how to provide feedback on it may be obtained at http://www.rfc-editor.org/info/rfc5892. Copyright Notice Copyright (c) 2010 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
    [Show full text]
  • Etir Code Lists
    eTIR Code Lists Code lists CL01 Equipment size and type description code (UN/EDIFACT 8155) Code specifying the size and type of equipment. 1 Dime coated tank A tank coated with dime. 2 Epoxy coated tank A tank coated with epoxy. 6 Pressurized tank A tank capable of holding pressurized goods. 7 Refrigerated tank A tank capable of keeping goods refrigerated. 9 Stainless steel tank A tank made of stainless steel. 10 Nonworking reefer container 40 ft A 40 foot refrigerated container that is not actively controlling temperature of the product. 12 Europallet 80 x 120 cm. 13 Scandinavian pallet 100 x 120 cm. 14 Trailer Non self-propelled vehicle designed for the carriage of cargo so that it can be towed by a motor vehicle. 15 Nonworking reefer container 20 ft A 20 foot refrigerated container that is not actively controlling temperature of the product. 16 Exchangeable pallet Standard pallet exchangeable following international convention. 17 Semi-trailer Non self propelled vehicle without front wheels designed for the carriage of cargo and provided with a kingpin. 18 Tank container 20 feet A tank container with a length of 20 feet. 19 Tank container 30 feet A tank container with a length of 30 feet. 20 Tank container 40 feet A tank container with a length of 40 feet. 21 Container IC 20 feet A container owned by InterContainer, a European railway subsidiary, with a length of 20 feet. 22 Container IC 30 feet A container owned by InterContainer, a European railway subsidiary, with a length of 30 feet. 23 Container IC 40 feet A container owned by InterContainer, a European railway subsidiary, with a length of 40 feet.
    [Show full text]
  • Management of Stroke in Neonates and Children: a Scientific Statement from the American Heart Association/American Stroke Association
    UCSF UC San Francisco Previously Published Works Title Management of Stroke in Neonates and Children: A Scientific Statement From the American Heart Association/American Stroke Association. Permalink https://escholarship.org/uc/item/0sw5j7c1 Journal Stroke, 50(3) ISSN 0039-2499 Authors Ferriero, Donna M Fullerton, Heather J Bernard, Timothy J et al. Publication Date 2019-03-01 DOI 10.1161/str.0000000000000183 Peer reviewed eScholarship.org Powered by the California Digital Library University of California AHA/ASA Scientific Statement Management of Stroke in Neonates and Children A Scientific Statement From the American Heart Association/American Stroke Association The American Academy of Neurology affirms the value of this statement as an educational tool for neurologists. Donna M. Ferriero, MD, MS, FAHA, Co-Chair; Heather J. Fullerton, MD, MAS, Co-Chair; Timothy J. Bernard, MD, MSCS; Lori Billinghurst, MD, MSc, FRCPC; Stephen R. Daniels, MD, PhD; Michael R. DeBaun, MD, MPH; Gabrielle deVeber, MD; Rebecca N. Ichord, MD; Lori C. Jordan, MD, PhD, FAHA; Patricia Massicotte, MSc, MD, MHSc; Jennifer Meldau, MSN; E. Steve Roach, MD, FAHA; Edward R. Smith, MD; on behalf of the American Heart Association Stroke Council and Council on Cardiovascular and Stroke Nursing Purpose—Much has transpired since the last scientific statement on pediatric stroke was published 10 years ago. Although stroke has long been recognized as an adult health problem causing substantial morbidity and mortality, it is also an important cause of acquired brain injury in young patients, occurring most commonly in the neonate and throughout childhood. This scientific statement represents a synthesis of data and a consensus of the leading experts in childhood cardiovascular disease and stroke.
    [Show full text]
  • K191201 Trade/Device Name
    November 15, 2019 Hiossen, Inc. Peter Lee QA/RA Manager 85 Ben Fairless Drive Fairless Hills, Pennsylvania 19030 Re: K191201 Trade/Device Name: EM SA Implant System Regulation Number: 21 CFR 872.3640 Regulation Name: Endosseous Dental Implant Regulatory Class: Class II Product Code: DZE Dated: November 14, 2019 Received: November 15, 2019 Dear Peter Lee: We have reviewed your Section 510(k) premarket notification of intent to market the device referenced above and have determined the device is substantially equivalent (for the indications for use stated in the enclosure) to legally marketed predicate devices marketed in interstate commerce prior to May 28, 1976, the enactment date of the Medical Device Amendments, or to devices that have been reclassified in accordance with the provisions of the Federal Food, Drug, and Cosmetic Act (Act) that do not require approval of a premarket approval application (PMA). You may, therefore, market the device, subject to the general controls provisions of the Act. Although this letter refers to your product as a device, please be aware that some cleared products may instead be combination products. The 510(k) Premarket Notification Database located at https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpmn/pmn.cfm identifies combination product submissions. The general controls provisions of the Act include requirements for annual registration, listing of devices, good manufacturing practice, labeling, and prohibitions against misbranding and adulteration. Please note: CDRH does not evaluate information related to contract liability warranties. We remind you, however, that device labeling must be truthful and not misleading. If your device is classified (see above) into either class II (Special Controls) or class III (PMA), it may be subject to additional controls.
    [Show full text]
  • Partitions with Prescribed Hook Differences
    View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Elsevier - Publisher Connector Europ. J. Combinatorics (1987) 8, 341-350 Partitions with Prescribed Hook Differences GEORGE E. ANDREWS,· R. J. BAXTER, D. M. BRESSOUD,t W. H. BURGE, P. J. FORRESTER AND G. VIENNOT We investigate partition identities related to off-diagonal hook differences. Our results generalize previous extensions of the Rogers-Ramanujan identities. The identity of the related polynomials with constructs in statistical mechanics is discussed. 1. INTRODUCTION I. Schur's polynomial proof of the Rogers-Ramanujan identities [15] has been the starting point for several diverse studies. The polynomials Schur introduced were generalized to provide partition identities related to successive ranks of partitions [1], [2], [5], [10), [11], [12], [13]. Subsequently these same polynomials were found to be the key element in providing Rogers-Ramanujan type identities for Regime II of the hard hexagon model [4], [9; ch. 14]. In [6], three of us considered an extensive generalization of the hard hexagon model, and we found that polynomial approximations to certain sums of eigenvalues played an essen­ tial role. Again these polynomials were clearly of the same general structure as those introduced by Schur. In this paper we consider in full generality the families of polynomials that arose in [6]. The statistic of partitions that plays the main role is that of 'off-diagonal hook difference', a topic introduced in [13]. In Section 2 we provide the necessary combinatorial preliminaries. In Section 3 we prove a general partition theorem which generalized the results in [1], [2], [5], [10], [11].
    [Show full text]
  • BASHKIR MANUAL Nicholas Poppe (University of Washington), and Andreas Tietze (University of California, Los Angeles), Consulting Editors
    f,;7 . I i / Indiana University Publications Graduate School Uralic and Altaic Series THOMAS A. SEBEOK, Editor Billie Greene, Assistant to the Editor Fred W. Householder, Jr., John R. Krueger, Felix J. Oinas, Alo Raun, Denis Sinor, and Odan Schiltz, Associate Editors GyuIa Decsy (University of Hamburg), Lawrence Krader (Syracuse Univer­ sity), John Lotz (Columbia University), Samuel E. Martin (Yale University), BASHKIR MANUAL Nicholas Poppe (University of Washington), and Andreas Tietze (University of California, Los Angeles), Consulting Editors Vol. American Studies in Uralic Linguistics, Edited by the Indiana University Committee on Uralic Studies (1960) o.p. Vol. 2 Buriat Grammar, by Nicholas Poppe (1960) - $3.00 Vol. 3 The Structure and Development of the Finnish Language, by Lauri Hakulinen (trans. by John Atkinson) (1961) $7.00 Vol. 4 Dagur Mongolian: Grammar Texts, and Lexicon, by Samuel E. Martin (1961) $5.00 Vol. 5 An Eastern Cheremis Manual: Phonology, Grammar, Texts, and Glossary, by Thomas A. Sebeok and Frances J. Ingemann (1961) - $4.00 Vol. 6 The Phonology of Modern Standard Turkish, byR.B.Lees(1961) - $3.50 Vol. 7 Chuvash Manual: Introduction, Grammar, Reader, and Vocabulary, by John R. Krueger (1961) $5.50 Vol. 8 Buriat Reader, by James E. Bosson (supervised and edited by Nicholas Poppe) (1962) - $2.00 Vol. 9 Latvian and Finnic Linguistic Convergences, by Valdis J. Zeps (1962) - $5.00 Vol. 10 Uzhek Newspaper Reader (with Glossary), by Nicholas Poppe, Jr. (1962) - $2.00 Vol. 11 Hungarian Reader (Folklore and Literature) With Notes, Edited by John Lotz (1962) $1.00 (continued inside back cover) t._._ Vt5f ~ HL9~1 AMERICAN COUNCIL OF LEARNED SOCIETIES Research and Studies in Uralic and Altaic Languages BASHKIR MANUAL Descriptive Grammar and Texts Project No.
    [Show full text]
  • 2Nd Stage International Geopolitical Competition TEST
    2nd stage International Geopolitical Competition TEST 11 VII 2020 RULES • The test duration is 60 minutes and it contains 60 test questions. You shall put your answers in previously sent form and it shouldn’t be sent back later than midday (Warsaw time). • You are prohibited to use any help from other people. Your personal notes, books, or the Internet is allowed. • For correct response you will gain 1 point, for no answer there is no penalty, and for wrong one there is -1 point penalty. • You are not allowed to change previously given answers. To answer a question put X mark on a table in your printed form. 1. What does powermetrics as subdiscipline of geopolitics deal with? A. History of geopolitical doctrines B. Statistics of international agreements C. Biographies of outstanding geopoliticians D. Measuring the length of national borders E. Measuring the power of political units F. Measurement of the soldiers’ physical ability QA: prof.. dr hab.. M. Sułek 2. Who introduced the concept of soft power to international relations science? A. Niccoló Machiavelli B. Rudolf Kjellén C. John Mearsheimer D. Karl Haushofer E. Joseph Nye F. Henry Kissinger QA: prof.. dr hab.. M. Sułek 3. Smart power is the connected way of using of: A. Hard power and sharp power B. Hard power and soft power C. Hard power and sticky power D. Soft power and sticky power E. Sharp power and soft power F. Sharp power and sticky power QA: prof.. dr hab.. M. Sułek 4. The main determinant of the state’s international position is: A.
    [Show full text]
  • Delaware River Water Quality ^Ristol to Marcus Hook Pennsylvania August 1949 to December 1963 R«» W
    Delaware River Water Quality ^ristol to Marcus Hook Pennsylvania August 1949 to December 1963 r«» W. B. KEIGHTON ONTRIBUTIONS TO THE HYDROLOGY OF THE UNITED STATES GEOLOGICAL SURVEY WATER-SUPPLY PAPER 1809-O ^repared in cooperation with the :ity of Philadelphia '.JNITED STATES GOVERNMENT PRINTING OFFICE, WASHINGTON : 1965 UNITED STATES DEPARTMENT OF THE INTERIOR STEWART L. UDALL, Secretory GEOLOGICAL SURVEY Thomas B. Nolan, Director For sale by the Superintendent of Documents, U.S. Government Printing Office Washington, D.C. 20402 - Price 25 cents (paper cover) CONTENTS Page Abstract_______________________________________________________ 01 Introduction. _____________________________________________________ 1 Purpose and scope_____________________________________________ 4 Previous investigations and acknowledgments.____________________ 4 Some basic physical and hydrologic facts______.______________________ 5 Graphic summaries of water-quality facts__________________________ 9 Water temperature____________________________________________ 9 Hydrogen-ion concentration (pH)______________________________ 10 Dissolved-solids concentration_________________________________ 10 Composition of dissolved solids__________________________________ 19 Chloride___________________________________-_ 24 Dissolved oxygen._____________________________________________ 27 Suspended sediment.__________________________________________ 30 Tabular summaries of water-quality characteristics ____________________ 36 Local variations in water-quality patterns___________________-__--___-
    [Show full text]
  • Is Multihop QA in DIRE Condition? Measuring and Reducing Disconnected Reasoning
    Is Multihop QA in DIRE Condition? Measuring and Reducing Disconnected Reasoning Harsh Trivedi†∗ Niranjan Balasubramanian† Tushar Khot‡ Ashish Sabharwal‡ † Stony Brook University, Stony Brook, U.S.A. fhjtrivedi,[email protected] ‡ Allen Institute for AI, Seattle, U.S.A. ftushark,[email protected] Abstract Input Which country got independence when the cold war started? Has there been real progress in multi-hop The war started in 1950. question-answering? Models often exploit The cold war started in 1947. dataset artifacts to produce correct answers, France finally got its independence. without connecting information across multi- India got independence from UK in 1947. Set of Facts ple supporting facts. This limits our ability 30 countries were involved in World War 2. to measure true progress and defeats the pur- pose of building multi-hop QA datasets. We Disconnected Reasoning Output make three contributions towards addressing Answer SF Simple Combination this. First, we formalize such undesirable be- Answer havior as disconnected reasoning across sub- India India sets of supporting facts. This allows develop- No Interaction Supporting Facts (SF) ing a model-agnostic probe for measuring how much any model can cheat via disconnected reasoning. Second, using a notion of con- trastive support sufficiency, we introduce an Figure 1: Example of disconnected reasoning, a form automatic transformation of existing datasets of bad multifact reasoning: Model arrives at the answer that reduces the amount of disconnected rea- by simply combining its outputs from two subsets of soning. Third, our experiments1 suggest that the input, neither of which contains all supporting facts. there hasn’t been much progress in multifact From one subset, it identifies the blue supporting fact QA in the reading comprehension setting.
    [Show full text]
  • A Comparison of Circle and J Hook Performance Within the Grenadian Pelagic Longline Fishery Anthony G
    Nova Southeastern University NSUWorks HCNSO Student Theses and Dissertations HCNSO Student Work 4-25-2019 A Comparison of Circle and J Hook Performance within the Grenadian Pelagic Longline Fishery Anthony G. Burns Nova Southeastern University, [email protected] Follow this and additional works at: https://nsuworks.nova.edu/occ_stuetd Part of the Marine Biology Commons, and the Oceanography and Atmospheric Sciences and Meteorology Commons Share Feedback About This Item NSUWorks Citation Anthony G. Burns. 2019. A Comparison of Circle and J Hook Performance within the Grenadian Pelagic Longline Fishery. Master's thesis. Nova Southeastern University. Retrieved from NSUWorks, . (510) https://nsuworks.nova.edu/occ_stuetd/510. This Thesis is brought to you by the HCNSO Student Work at NSUWorks. It has been accepted for inclusion in HCNSO Student Theses and Dissertations by an authorized administrator of NSUWorks. For more information, please contact [email protected]. Thesis of Anthony G. Burns Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science M.S. Marine Biology Nova Southeastern University Halmos College of Natural Sciences and Oceanography April 2019 Approved: Thesis Committee Major Professor: David Kerstetter Committee Member: Amy C. Hirons, Ph.D. Committee Member: Paul T. Arena, Ph.D. This thesis is available at NSUWorks: https://nsuworks.nova.edu/occ_stuetd/510 HALMOS COLLEGE OF NATURAL SCIENCES AND OCEANOGRAPHY A COMPARISON OF CIRCLE AND J HOOK PERFORMANCE WITHIN THE GRENADIAN PELAGIC LONGLINE FISHERY By: Anthony G. Burns Submitted to the Faculty of Halmos College of Natural Sciences and Oceanography in partial fulfillment of the requirements for the degree of Master of Science with a specialty in: Marine Biology Nova Southeastern University May 2019 Thesis of Anthony G.
    [Show full text]