Arxiv:2004.10151V3 [Cs.CL] 2 Nov 2020
Total Page:16
File Type:pdf, Size:1020Kb
Experience Grounds Language Yonatan Bisk* Ari Holtzman* Jesse Thomason* Jacob Andreas Yoshua Bengio Joyce Chai Mirella Lapata Angeliki Lazaridou Jonathan May Aleksandr Nisnevich Nicolas Pinto Joseph Turian Abstract Meaning is not a unique property of language, but a general characteristic of human activity ... We cannot Language understanding research is held back say that each morpheme or word has a single or central meaning, or even that it has a continuous or coherent by a failure to relate language to the physical range of meanings ... there are two separate uses and world it describes and to the social interactions meanings of language – the concrete ... and the abstract. it facilitates. Despite the incredible effective- ness of language processing models to tackle Zellig S. Harris (Distributional Structure 1954) tasks after being trained on text alone, success- ful linguistic communication relies on a shared trained solely on text corpora, even when those cor- experience of the world. It is this shared expe- pora are meticulously annotated or Internet-scale. rience that makes utterances meaningful. You can’t learn language from the radio. Nearly Natural language processing is a diverse field, every NLP course will at some point make this and progress throughout its development has claim. The futility of learning language from lin- come from new representational theories, mod- guistic signal alone is intuitive, and mirrors the eling techniques, data collection paradigms, belief that humans lean deeply on non-linguistic and tasks. We posit that the present success knowledge (Chomsky, 1965, 1980). However, as of representation learning approaches trained a field we attempt this futility: trying to learn lan- on large, text-only corpora requires the paral- guage from the Internet, which stands in as the lel tradition of research on the broader physi- cal and social context of language to address modern radio to deliver limitless language. In this the deeper questions of communication. piece, we argue that the need for language to attach to “extralinguistic events" (Ervin-Tripp, 1973) and Improvements in hardware and data collection the requirement for social context (Baldwin et al., have galvanized progress in NLP across many 1996) should guide our research. benchmark tasks. Impressive performance has been Drawing inspiration from previous work in NLP, achieved in language modeling (Radford et al., Cognitive Science, and Linguistics, we propose the 2019; Zellers et al., 2019b; Keskar et al., 2019) and notion of a World Scope (WS) as a lens through span-selection question answering (Devlin et al., which to audit progress in NLP. We describe five 2019; Yang et al., 2019b; Lan et al., 2020) through WSs, and note that most trending work in NLP massive data and massive models. With models operates in the second (Internet-scale data). arXiv:2004.10151v3 [cs.CL] 2 Nov 2020 exceeding human performance on such tasks, now We define five levels of World Scope: is an excellent time to reflect on a key question: WS1. Corpus (our past) WS2. Internet (most of current NLP) Where is NLP going? WS3. Perception (multimodal NLP) In this paper, we consider how the data and world WS4. Embodiment a language learner is exposed to define and con- WS5. Social strains the scope of that learner’s semantics. Mean- These World Scopes go beyond text to consider ing does not arise from the statistical distribution the contextual foundations of language: grounding, of words, but from their use by people to communi- embodiment, and social interaction. We describe a cate. Many of the assumptions and understandings brief history and ongoing progression of how con- on which communication relies lie outside of text. textual information can factor into representations We must consider what is missing from models and tasks. We conclude with a discussion of how 100 this integration can move the field forward. We be- Harris 1954 80 lieve this World Scope framing serves as a roadmap Firth 1957 60 Chomsky 1957 for truly contextual language understanding. 40 20 1 WS1: Corpora and Representations % of 2019 Citations 0 1960 1970 1980 1990 2000 2010 2020 Year The story of data-driven language research begins Academic interest in Firth and Harris increases dramatically with the corpus. The Penn Treebank (Marcus et al., around 2010, perhaps due to the popularization of Firth(1957) 1993) is the canonical example of a clean subset of “You shall know a word by the company it keeps." naturally generated language, processed and anno- tated for the purpose of studying representations. Such corpora and the model representations built ture linguistic intuitions without supervision, but from them exemplify WS1. Community energy they are constrained by the structure they impose was initially directed at finding formal linguistic with respect to the number of classes chosen. structure, such as recovering syntax trees. Recent The intuition that meaning requires a large con- success on downstream tasks has not required such text, that “You shall know a word by the company explicitly annotated signal, leaning instead on un- it keeps." – Firth(1957), manifested early via La- structured fuzzy representations. These representa- tent Semantic Indexing/Analysis (Deerwester et al., tions span from dense word vectors (Mikolov et al., 1988, 1990; Dumais, 2004) and later in the gen- 2013) to contextualized pretrained representations erative framework of Latent Dirichlet Allocation (Peters et al., 2018; Devlin et al., 2019). (Blei et al., 2003). LDA represents a document as Word representations have a long history predat- a bag-of-words conditioned on latent topics, while ing the recent success of deep learning methods. LSI/A use singular value decomposition to project Outside of NLP, philosophy (Austin, 1975) and lin- a co-occurrence matrix to a low dimensional word guistics (Lakoff, 1973; Coleman and Kay, 1981) vector that preserves locality. These methods dis- recognized that meaning is flexible yet structured. card sentence structure in favor of the document. Early experiments on neural networks trained with Representing words through other words is a sequences of words (Elman, 1990; Bengio et al., comfortable proposition, as it provides the illusion 2003) suggested that vector representations could of definitions by implicit analogy to thesauri and capture both syntax and semantics. Subsequent related words in a dictionary definition. However, experiments with larger models, documents, and the recent trends in deep learning approaches to corpora have demonstrated that representations language modeling favor representing meaning in learned from text capture a great deal of informa- fixed-length vectors with no obvious interpretation. tion about meaning in and out of context (Collobert The question of where meaning resides in “connec- and Weston, 2008; Turian et al., 2010; Mikolov tionist” systems like Deep Neural Networks is an et al., 2013; McCann et al., 2017). old one (Pollack, 1987; James and Miikkulainen, The intuition of such embedding representations, 1995). Are concepts distributed through edges or that context lends meaning, has long been acknowl- local to units in an artificial neural network? edged (Firth, 1957; Turney and Pantel, 2010). Ear- “... there has been a long and unresolved lier on, discrete, hierarchical representations, such debate between those who favor localist as agglomerative clustering guided by mutual in- representations in which each process- formation (Brown et al., 1992), were constructed ing element corresponds to a meaningful with some innate interpretability. A word’s position concept and those who favor distributed in such a hierarchy captures semantic and syntac- representations.” Hinton(1990) tic distinctions. When the Baum–Welch algorithm Special Issue on Connectionist Symbol Processing (Welch, 2003) is applied to unsupervised Hidden Markov Models, it assigns a class distribution to In connectionism, words were no longer defined every word, and that distribution is a partial rep- over interpretable dimensions or symbols, which resentation of a word’s “meaning.” If the set of were perceived as having intrinsic meaning. The classes is small, syntax-like classes are induced; tension of modeling symbols and distributed repre- if the set is large, classes become more semantic. sentations is articulated by Smolensky(1990), and These representations are powerful in that they cap- alternative representations (Kohonen, 1984; Hinton et al., 1986; Barlow, 1989) and approaches to struc- have made progress by substantially increasing the ture and composition (Erk and Padó, 2008; Socher number of model parameters to better consume et al., 2012) span decades of research. these vast quantities of data. Where Peters et al. The Brown Corpus (Francis, 1964) and Penn (2018) introduced ELMo with ∼108 parameters, Treebank (Marcus et al., 1993) defined context and Transformer models (Vaswani et al., 2017) have structure in NLP for decades. Only relatively re- continued to scale by orders of magnitude between cently (Baroni et al., 2009) has the cost of annota- papers (Devlin et al., 2019; Radford et al., 2019; tions decreased enough, and have large-scale web- Zellers et al., 2019b) to ∼1011 (Brown et al., 2020). crawls become viable, to enable the introduction of Current models are the next (impressive) step more complex text-based tasks. This transition to in language modeling which started with Good larger, unstructured context (WS2) induced a richer (1953), the weights of Kneser