
PuzzLing Machines: A Challenge on Learning From Small Data

Gözde Gül Şahin†, Yova Kementchedjhieva^α, Phillip Rust†, Iryna Gurevych†
†Ubiquitous Knowledge Processing Lab (UKP), Department of Computer Science, Technical University of Darmstadt
^α Department of Computer Science, University of Copenhagen
†www.ukp.tu-darmstadt.de

arXiv:2004.13161v1 [cs.CL] 27 Apr 2020

Abstract

Deep neural models have repeatedly proved excellent at memorizing surface patterns from large datasets for various ML and NLP benchmarks. They struggle to achieve human-like thinking, however, because they lack the skill of iterative reasoning upon knowledge. To expose this problem in a new light, we introduce a challenge on learning from small data, PuzzLing Machines, which consists of Rosetta Stone puzzles from Linguistic Olympiads for high school students. These puzzles are carefully designed to contain only the minimal amount of parallel text necessary to deduce the form of unseen expressions. Solving them does not require external information (e.g., knowledge bases, visual signals) or linguistic expertise, but meta-linguistic awareness and deductive skills. Our challenge contains around 100 puzzles covering a wide range of linguistic phenomena from 81 languages. We show that both simple statistical algorithms and state-of-the-art deep neural models perform inadequately on this challenge, as expected. We hope that this benchmark, available at https://ukplab.github.io/PuzzLing-Machines/, inspires further efforts towards a new paradigm in NLP, one that is grounded in human-like reasoning and understanding.

   Chickasaw                  English
1. Ofi'at kowi'ã lhiyohli.    The dog chases the cat.
2. Kowi'at ofi'ã lhiyohli.    The cat chases the dog.
3. Ofi'at shoha.              The dog stinks.
4. Ihooat hattakã hollo.      The woman loves the man.
5. Lhiyohlili.                I chase her/him.
6. Salhiyohli.                She/he chases me.
7. Hilha.                     She/he dances.

Now you can translate the following into Chickasaw:
  The man loves the woman.
  The cat stinks.
  I love her/him.
Translate the following into English:
  Ihooat sahollo.
  Ofi'at hilha.
  Kowi'ã lhiyohlili.

Table 1: The "Chickasaw" puzzle (Payne, 2005)

1 Introduction

Kahneman (2011) discusses the two modes of human thinking which perfectly encapsulate the current (so-called System1) and the desired (System1+System2) state of the deep learning field. System1 handles tasks that humans consider fast, intuitive and automatic, such as object detection and document classification. Recent deep learning (DL) models have shown great promise at this type of task, thanks to large training datasets. Yet, it is through slow, rational and sequential mechanisms that human-like abstract reasoning happens, enabling learning from just a few examples. This System2-style modeling is still in its early stages in DL, but is recognized as a much-needed next step in the field (McClelland et al., 2019; Marcus, 2020; LeCun, 2020; Bengio, 2020). To foster research in this promising direction, we propose a unique challenge on "learning from small data": PuzzLing Machines, based on the Linguistic Olympiads, one of the 13 recognized International Science Olympiads targeted at high-school students.

The PuzzLing Machines challenge is based on one of the most common puzzle types in the Linguistic Olympiads: the Rosetta Stone puzzles (Bozhanov and Derzhanski, 2013), a.k.a. translation puzzles. An example is given in Table 1.¹ Although these puzzles take the form of a traditional "machine translation" task, they are different in many ways: Rosetta Stone puzzles contain a minimal, carefully designed set of parallel expressions (words, phrases or sentences) in a foreign and in a familiar language (e.g., Chickasaw-English). This minimal set is just enough to deduce the underlying translation model, which typically involves deriving mini-grammar rules, extracting a lexicon, and discovering morphological and phonological rules. The actual task then is to translate new expressions, generally in both directions, using the model deduced from the parallel data. The assignments are carefully designed so that the expressions cannot be generated through simple analogy, but only through the application of the discovered rules. These properties distinguish the PuzzLing Machines challenge from the modern MT task, as it relies on deductive reasoning with linguistic concepts that are central to System2, rather than on exploiting statistical properties of large datasets as in System1.

¹ Copyright University of Oregon, Department of Linguistics.

The lack of reasoning skills in statistical systems has recently gained a lot of attention. Various datasets that require a wide range of background knowledge and different types of reasoning abilities have been introduced, such as ARC (Clark et al., 2018), GQA (Hudson and Manning, 2019), the GLUE benchmarks (Wang et al., 2018) and SWAG (Zellers et al., 2018). Our challenge is distinguished from these previous benchmarks by some key properties. First, most of these reasoning tasks require external scientific or visual knowledge, which makes it hard to measure the actual reasoning performance. Our challenge, on the other hand, does not rely on any external, multimodal or expert-level information. Second, and more importantly, the PuzzLing challenge consists of the minimal set of examples required for a solution. That means there exists no extra training data, ensuring that exploiting surface patterns is not possible, unlike in some existing benchmarks (Gururangan et al., 2018).

In summary, this paper introduces a unique challenge, PuzzLing Machines, made up of ~100 Rosetta Stone (a.k.a. translation) puzzles covering 81 languages from 39 different language families, based on the Linguistic Olympiads. The challenge requires System2 skills: sequential reasoning and abstraction of linguistic concepts, discussed in detail in §2. We discuss the dataset and the linguistic phenomena it covers, supported with statistics and examples, in §3. In §4, we present the results of intuitive baseline methods and strong MT baselines such as the Transformer encoder-decoder (Vaswani et al., 2017) with integrated pretrained language models, as applied to these puzzles. We show that, unsurprisingly, the puzzles cannot be easily or robustly solved by currently existing methods. We hope that this benchmark will evoke the development of new deep MT/NLP models that operate in a human-like manner and reason upon linguistic knowledge, providing a new future research direction for NLP.

2 Meta-linguistics

Meta-linguistics is defined by Chomsky (1976) as "the knowledge of the characteristics and structures of language" as realised on the level of phonology, morphology, syntax and semantics. Any English speaker would likely have the linguistic capacity to produce the word undo when asked "What is the opposite of do?" Only a speaker with some level of meta-linguistic awareness, however, would further be able to reflect on the structure of the word they have produced: to identify un- as a unit that serves to negate words, and to spot its similarity in function to other units like dis- and de-. He/she would also be aware that un- is not interchangeable with dis- and de-, since it attaches to the front of verbs and adjectives but not to nouns.

Meta-linguistic awareness is especially useful (and often improved) in the process of learning a new language, as it allows the learner to compare and contrast the structure and characteristics of the new language with those that he/she is already familiar with. It is desirable that systems for natural language processing possess meta-linguistic awareness too, as that could hugely improve their cross-lingual generalizability, a problem that remains open after being approached from various engineering perspectives, often with little recourse to linguistics. However, measuring the meta-linguistic awareness of a system is not trivial. Existing probing techniques are mostly designed to measure how well neural models capture specific linguistic phenomena, e.g., whether a specific layer of an English language model can capture that undo is negative, instead of testing for meta-linguistic awareness. Our challenge takes a step further and tests whether a model can apply the underlying morphological processes, e.g., verbal negation through prefixing. In addition, our challenge spans a wide range of language families and covers a variety of linguistic phenomena (see §3.1), which qualifies it as a favorable testbed for measuring meta-linguistic awareness.

Let us demonstrate how meta-linguistic reasoning skills are used to solve the "Chickasaw" puzzle given in Table 1. The translation model is iteratively deduced as follows: (1) the word order in Chickasaw is Subject-Object-Verb (SOV), unlike the English SVO word order; (2) nouns take different suffixes when in subject or object position (-at and -ã, respectively); (3) verbs take an affix for a 1st person singular pronominal subject or object (the suffix -li and the prefix sa-, respectively). Notice that, crucially, it is not possible to learn the function of the prefix sa-, which corresponds to me in English, without deducing that lhiyohli corresponds to the verb chases and that third person agency in Chickasaw is not explicitly expressed.

… analysis. Expert solutions are available for most puzzles; we excluded the rest. In addition to the translation puzzle type shown in Table 1, we also collected 'matching' puzzles. These are two-step puzzles, in which the participants first align a shuffled set of sentences to obtain parallel data, and then translate a set of unseen sentences. We converted these puzzles to the translation puzzle format by referring to the solution files to align the training sentence pairs. Appendix A.1 describes how we selected the puzzles and how we transcribed them into a machine-readable format. The final dataset contains 96 unique puzzles from 81 languages that span 39 different language