
Extending Machine Language Models toward Human-Level Language Understanding

James L. McClelland (a,b,2), Felix Hill (b,2), Maja Rudolph (c,2), Jason Baldridge (d,1,2), and Hinrich Schütze (e,1,2)

aStanford University, Stanford, CA 94305, USA; bDeepMind, London N1C 4AG, UK; cBosch Center for Artificial Intelligence, Renningen, 71272, Germany; dGoogle Research, Austin, TX 78701, USA; eLMU Munich, Munich, 80538, Germany

This manuscript was compiled on July 7, 2020

Language is crucial for human intelligence, but what exactly is its role? We take language to be a part of a system for understanding and communicating about situations. The human ability to understand and communicate about situations emerges gradually from experience and depends on domain-general principles of biological neural networks: connection-based learning, distributed representation, and context-sensitive, mutual constraint satisfaction-based processing. Current artificial language processing systems rely on the same domain general principles, embodied in artificial neural networks. Indeed, recent progress in this field depends on query-based attention, which extends the ability of these systems to exploit context and has contributed to remarkable breakthroughs. Nevertheless, most current models focus exclusively on language-internal tasks, limiting their ability to perform tasks that depend on understanding situations. These systems also lack memory for the contents of prior situations outside of a fixed contextual span. We describe the organization of the brain's distributed understanding system, which includes a fast learning system that addresses the memory problem. We sketch a framework for future models of understanding drawing equally on cognitive neuroscience and artificial intelligence and exploiting query-based attention. We highlight relevant current directions and consider further developments needed to fully capture human-level language understanding in a computational system.

Natural Language Understanding | Deep Learning | Situation Models | Cognitive Neuroscience | Artificial Intelligence

J.M., F.H., M.R., J.B., and H.S. wrote the paper. The authors declare no conflict of interest. 1 J.B. and H.S. contributed equally. 2 To whom correspondence should be addressed. E-mail: [email protected], [email protected], [email protected], [email protected] or [email protected].

Striking recent advances in machine intelligence have appeared in language tasks. Machines better transcribe speech and respond in ever more natural sounding voices. Widely available applications allow one to say something in one language and hear its translation in another. Humans perform better than machines in most language tasks, but these systems work well enough to be used by billions of people every day. What underlies these successes? What limitations do they face? We argue that progress has come from exploiting principles of neural computation employed by the human brain, while a key limitation is that these systems treat language as if it can stand alone. We propose that language works in concert with other inputs to understand and communicate about situations. We describe key aspects of human understanding and key components of the brain's understanding system. We then propose initial steps toward a model informed both by cognitive neuroscience and artificial intelligence and point to extensions addressing more abstract cases.

Principles of Neural Computation

The principles of neural computation are domain general, inspired by the human brain and human abilities. They were first articulated in the 1950s (1) and further developed in the 1980s in the Parallel Distributed Processing (PDP) framework for modeling cognition (2). This work introduced the idea that structure in cognition and language is emergent: it is captured in learned connection weights supporting the construction of context-sensitive representations whose characteristics reflect a gradual, input-statistics dependent, learning process (3).

Classical linguistic theory and most computational linguistics employ discrete symbols and explicit rules to characterize language structure and relationships. In neural networks, these symbols are replaced by continuous, multivariate patterns called distributed representations or embeddings, and the rules are replaced by continuous, multi-valued arrays of connection weights that map patterns to other patterns.

Since its introduction (3), debate has raged about this approach to language processing (4). Protagonists argue it supports nuanced, context- and similarity-sensitive processing that is reflected in the quasi-regular relationships between phrases and their sounds, spellings, and meanings (5, 6). These models also capture subtle aspects of human performance in language tasks (7). However, critics note that neural networks often fail to generalize beyond their training data, blaming these failures on the absence of explicit rules (8–10).

Another key principle is mutual constraint satisfaction (11). For example, interpreting a sentence requires resolving both syntactic and semantic ambiguity. If we hear A boy hit a man with a bat, we tend to assume with a bat attaches to the verb (syntax) and is thereby the instrument of hitting (semantics). However, if beard replaces bat, then with a beard is attached to man (syntax) and describes the person affected (semantics) (12). Even segmenting language into elementary units depends on meaning and context (Fig. 1). Rumelhart (11) envisioned a model in which estimates of the probability of all aspects of an input constrain estimates of the probabilities of all others, motivating a model of context effects in perception (13) that launched the PDP approach.
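To make the mutual constraint satisfaction principle concrete, here is a minimal sketch in Python/NumPy, assuming a toy network of our own construction (the units, weights, and inputs are hypothetical and are not taken from the models cited above). Units standing for competing hypotheses iteratively settle under symmetric, bidirectional connections, so that each estimate constrains all the others, in the spirit of the interactive activation model (13):

```python
import numpy as np

# Toy mutual constraint satisfaction: units excite consistent hypotheses
# and inhibit inconsistent ones via symmetric (bidirectional) weights.
# Units: 0 = "bat is an animal", 1 = "bat is a baseball bat",
#        2 = "context mentions hitting a ball".
W = np.array([[ 0.0, -1.0, -0.5],   # animal reading inhibited by the others
              [-1.0,  0.0,  1.0],   # baseball reading supported by context
              [-0.5,  1.0,  0.0]])  # symmetric: constraints are mutual

external = np.array([0.2, 0.2, 1.0])  # bottom-up input: the hitting context is present
a = np.zeros(3)                       # activations start neutral

for _ in range(50):                   # iterate until the network settles
    net = W @ a + external            # each unit's input from all the others
    a += 0.1 * (np.tanh(net) - a)     # move each estimate toward its constrained value

print(a.round(2))  # the "baseball bat" unit wins; the "animal" unit is suppressed
```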

Fig. 1. Context influences the identification of letters in written text: the visual input we read as went in the first sentence and event in the second is the same bit of Rumelhart's handwriting, cut and pasted into each context. Reprinted from (11).

Neural Language Modeling

Initial steps. Elman (14) introduced a simple recurrent neural network (RNN) (Fig. 2a) that captured key characteristics of language structure through learning, a feat once considered impossible (15). It was trained to predict the next word in a sequence (w(t+1)) based on the current word (w(t)) and its own hidden (that is, learned internal) representation from the previous time step (h(t−1)). Each of these inputs is multiplied by a matrix of connection weights (arrows labeled Whi and Whh in Fig. 2a) and the results are added to produce the input to the hidden units. The elements of this vector pass through a function limiting the range of their values, producing the hidden representation. This in turn is multiplied with the weights to the output layer from the hidden layer (Woh) to generate a vector used to predict the probability of each of the possible successor words. Learning is based on the discrepancy between the network's output and the actual next word; the values of the connection weights are adjusted by a small amount to reduce the discrepancy. The network is recurrent because the same connection weights (denoted by arrows in the figure) are used to process each successive word; a minimal code sketch of this update appears at the end of this section.

Fig. 2. (a) Elman's (1990) simple recurrent network and (b) his hierarchical clustering of the representations it learned, reprinted from (14).

Elman showed two things. First, after training his network to predict the next word in sentences like man eats bread, dog chases cat, and girl sleeps, the network's representations captured the syntactic distinction between nouns and verbs (14). They also captured interpretable subcategories, as shown by a hierarchical clustering of the hidden representations of the different words (Fig. 2b). This illustrates a key feature of learned representations: they capture specific as well as general or abstract information. By using a different learned representation for each word, its specific predictive consequences can be exploited. Because representations for words that make similar predictions are similar, and because neural networks exploit similarity, the network can share knowledge about predictions among related words.

Second, Elman (16) used both simple sentences like boy chases dogs and more complex ones like boy who sees girls chases dogs. In the latter, the verb chases must agree with the first noun (boy), not the closest noun (girls), since the sentence contains a main clause (boy chases dogs) interrupted by a reduced relative clause (boy [who] sees girls). The model learned to predict the verb form correctly despite the intervening clause, showing that it acquired sensitivity to the syntactic structure of language, not just local co-occurrence statistics.

Scaling up to natural text. Elman's task of predicting words based on context has been central to neural language modeling. However, Elman trained his networks with tiny, toy languages. For many years, it seemed they would not scale up, and language modeling was dominated by simple n-gram models and systems designed to assign explicit structural descriptions to sentences, aided by advances in probabilistic computations (17). Over the past 10 years, breakthroughs have allowed networks to predict and fill in words in huge natural language corpora.

One challenge is the large size of a natural language's vocabulary. A key step was the introduction of methods for learning word representations (now called embeddings) from co-occurrence relationships in large text corpora (18, 19). These embeddings exploit both general and specific predictive relationships of all the words in the corpus, improving generalization: task-focused neural models trained on small data sets better generalize to infrequent words (e.g., settee) based on frequent words (e.g., couch) with similar embeddings.

A second challenge is the indefinite length of the context that might be relevant for prediction. Consider this passage:

    John put some beer in a cooler and went out with his friends to play volleyball. Soon after he left, someone took the beer out of the cooler. John and his friends were thirsty after the game, and went back to his place for some beers. When John opened the cooler, he discovered that the beer was ___.

Here a reader expects the missing word to be gone. Yet if we replace took the beer with took the ice, the expected word is warm. Any amount of additional text between beer and gone does not change the predictive relationship, challenging RNNs like Elman's. An innovation called long short-term memory (LSTM) (20) partially addressed this problem by augmenting the recurrent network architecture with learned connection weights that gate information into and out of a network's internal state. However, LSTMs did not fully alleviate the context bottleneck problem (21): a network's internal state was still a fixed-length vector, limiting its ability to capture contextual information.
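Here is the minimal sketch of the simple recurrent update described under Initial steps, written in Python/NumPy under our own assumptions: the dimensions and the random weights are illustrative only, and the names W_hi, W_hh, and W_oh simply mirror the labels in Fig. 2a; Elman's trained weights are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 10, 8                       # toy vocabulary size and hidden size (assumed)
W_hi = rng.normal(0, 0.1, (H, V))  # input-to-hidden weights
W_hh = rng.normal(0, 0.1, (H, H))  # hidden-to-hidden (recurrent) weights
W_oh = rng.normal(0, 0.1, (V, H))  # hidden-to-output weights

def srn_step(word_onehot, h_prev):
    """One step of a simple recurrent network in the style of Elman (14)."""
    # Both inputs are multiplied by weight matrices and summed; a squashing
    # function limits the range of the values, producing the new hidden state.
    h = np.tanh(W_hi @ word_onehot + W_hh @ h_prev)
    # The hidden representation projects to the output layer, and a softmax
    # yields a predicted probability for each possible successor word.
    logits = W_oh @ h
    p_next = np.exp(logits) / np.exp(logits).sum()
    return h, p_next

h = np.zeros(H)
for w in [0, 3, 7]:                # indices of words in a toy sentence
    x = np.zeros(V); x[w] = 1.0
    h, p_next = srn_step(x, h)
print(p_next.round(3))             # predicted distribution over the next word
```

Note that h is the only conduit from past words to the current prediction: whatever the sentence, the entire preceding context must be squeezed into this one fixed-length vector, which is the context bottleneck discussed above.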
Query-based attention. Recent breakthroughs depend on an innovation we call query-based attention (QBA) (21). It was used in the Google Neural Machine Translation system (22), a system that attained a sudden leap in performance and attracted widespread public interest (23).

We illustrate QBA in Fig. 3 with the sentence John hit the ball with the bat. Context is required to determine whether bat refers to an animal or a baseball bat. QBA addresses this by issuing queries for relevant information. A query might ask 'is there an action and relation in the context that would indicate which kind of bat fits best?' The embeddings of words that match the query then receive high weightings in the weighted attention vector. In our example, the query matches the embedding of hit closely and of with to some extent; the returned attention vector captures the content needed to determine that a baseball bat fits in this context. There are many variants of QBA; the figure presents a simple version, sketched in code below.
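The following Python/NumPy sketch implements the simple QBA variant spelled out in the caption of Fig. 3: cosine similarities are converted to weightings by a softmax, and the weightings scale and sum the context representations. As in the figure, the vectors here are made up for illustration; the dimensionality and the scale factor g are assumptions.

```python
import numpy as np

def qba(query, rep_vs, g=4.0):
    """Simple query-based attention, following the math given with Fig. 3.

    query:  vector issued by the word being interpreted (e.g., 'bat')
    rep_vs: learned representation vectors (RepVs), one row per context word
    g:      scale factor applied to similarities before the softmax
    """
    # Similarity score for each context word j: s_j = cos(q, v_j)
    sims = rep_vs @ query / (np.linalg.norm(rep_vs, axis=1)
                             * np.linalg.norm(query))
    # Weightings w_j = exp(g*s_j) / sum_j' exp(g*s_j'): positive, summing to 1
    ws = np.exp(g * sims)
    ws /= ws.sum()
    # Weighted attention vector: element-wise sum of the scaled RepVs
    return ws, ws @ rep_vs

# Illustrative 4-dimensional embeddings for 'John hit the ball with the ...'
rep_vs = np.array([[0.1, 0.9, 0.0, 0.2],   # John
                   [0.8, 0.1, 0.9, 0.1],   # hit
                   [0.4, 0.3, 0.0, 0.5],   # with
                   [0.0, 0.2, 0.3, 0.9]])  # ball
query = np.array([0.9, 0.0, 0.8, 0.1])     # query generated from 'bat'

ws, attention = qba(query, rep_vs)
print(ws.round(2))         # 'hit' matches closely, 'with' to some extent
print(attention.round(2))  # content favoring the baseball-bat reading
```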

One important QBA model called BERT (25) uses both preceding and following context. In BERT, the network is trained to correct missing or randomly replaced words in blocks of text, typically spanning two sentences. BERT relies on multiple attention heads, each employing QBA (26), concatenating the returned weighted vectors to form a composite vector. The process is iterated across several stages, so that contextually constrained representations of each word computed at intermediate stages in turn constrain the representations of every other word in later stages. In this way, BERT employs mutual constraint satisfaction, as Rumelhart (11) envisioned. The contextualized embeddings that result from QBA can capture gradations within the set of meanings a language maps to a given word, aiding translation and other down-stream tasks. For example, in English, we use ball for many types of balls, whereas in French, some are balles and others ballons. Subtly different embeddings for ball in different English sentences aid selecting the correct French translation.

In QBA architectures, the vectors all depend on learned connection weights, and analysis of BERT's representations shows that they capture syntactic structure well (24). Different attention heads capture different linguistic relationships, and the similarity relationships among the contextually-shaded word representations can be used to reconstruct a sentence's syntactic structural description. These representations capture this structure without building it in, supporting the emergence principle. That said, careful analysis (27) indicates the deep network's sensitivity to grammatical structure is still imperfect, and only partially understood.

Some attention-based models (28) proceed sequentially, predicting each word using QBA over prior context, while BERT uses parallel processing or mutual QBA simultaneously on all the words in a pair of sentences. Humans appear to exploit past context and a limited window of subsequent context (29), suggesting a hybrid strategy. Some machine models adopt this approach (30), and below we adopt a hybrid approach as well.

Attention-based models have produced remarkable improvements on a wide range of language tasks. The models can be pre-trained on massive text corpora, providing useful representations for subsequent fine tuning to perform other tasks. A recent model called GPT-3 achieves impressive gains on several benchmarks without requiring fine-tuning (28). However, this model still falls short of human performance on tasks that depend on what the authors call "common sense physics" and on carefully crafted tests of its ability to determine if a sentence follows from a preceding text (31). Further, the text corpora these models rely on are far larger than a human learner could process in a lifetime. Gains from further increases may be diminishing, and human learners appear to be far more data-efficient. The authors of GPT-3 note these limitations and express the view that further improvements may require more fundamental changes.

Fig. 3. Query-based attention (QBA). To constrain the interpretation of the word bat in the context John hit the ball with the __, a query generated from bat is used to construct a weighted attention vector which shapes the word's interpretation. The query is compared to each of the learned representation vectors (RepVs) of the context words; this creates a set of similarity scores (Sims) which in turn produce a set of weightings (Ws, a set of positive numbers summing to 1). The Ws are used to scale the RepVs of the context words, creating Scaled RepVs. The weighted attention vector is the element-wise sum of the Scaled RepVs. The Query, RepVs, Sims, Scaled RepVs and weighted attention vector use red color intensity for positive magnitudes and blue for negative magnitudes. Ws are shown as green color intensity. White = 0 throughout. The Query and RepVs were made up for illustration, inspired by (24). Mathematical details: For query q and representation vector v_j for context word j, the similarity score is s_j = cos(q, v_j). The s_j are converted into weightings w_j by the softmax function, w_j = e^{g s_j} / Σ_{j'} e^{g s_{j'}}, where the sum in the denominator runs over all words in the context span, and g is a scale factor.

Language in an Integrated Understanding System

Where should we look for further progress addressing the limitations of current language models? In concert with others (32), we argue that part of the solution will come from treating language as part of a larger system for understanding and communicating.

Situations. We adopt the perspective that the targets of understanding are situations. Situations are collections of entities, their properties and relations, and patterns of change in them. A situation can be static (e.g., a cat on a mat). Situations include events (e.g., a boy hitting a ball). Situations can embed within each other; the cat may be on a mat inside a house on a particular street in a particular town, and the same applies to events like baseball games. A situation can be conceptual, social or legal, such as one where a court invalidates a law. A situation may even be imaginary. The entities participating in a situation or event may be real or fictitious physical objects or locations; animals, persons, groups or organizations; beliefs or other states of mind; sets of objects (e.g., all dogs); symbolic objects such as symbols, tokens or words; or even contracts, laws, or theories. Situations can even involve changes in beliefs about relationships among classes of objects (e.g., biologists' beliefs about the genus a species of trees belongs in).

What it means for an agent to understand a situation is to construct a representation of it that captures aspects of the participating objects, their properties, relationships and interactions, and resulting outcomes. We emphasize that the understanding should be thought of as a construal or interpretation that may be incomplete or inaccurate. The construal will depend on the culture and context of the agent and the agent's purpose. When other agents are the source of the input, the target agent's construal of the knowledge and purpose of these other agents also plays an important role. As such, the construal process must be considered to be completely open ended and to potentially involve interaction between the construer and the situation, including exploration of the world and discourse between the agent and participating interlocutors.

Within this construal of understanding, we emphasize that language should be seen as a component of an understanding system. This idea is not new, but historically it was not universally accepted. Chomsky, Fodor and others (8, 33, 34) argued that grammatical knowledge sits in a distinct, encapsulated subsystem. Our proposal to focus on language as part of a system representing situations builds on a long tradition in linguistics (35), human memory research (36), psycholinguistics (37), philosophy (38) and artificial intelligence (39).

The approach was adopted in an early neural network model (40) and aligns with other current perspectives in cognitive neuroscience (41) and artificial intelligence (32).

People construct situation representations. A person processing language constructs a representation of the described situation in real time, using both the stream of words and other available information. Words and their sequencing serve as clues to meaning (42) that jointly constrain the understanding of the situation (40). Consider this passage:

    John spread jam on some bread. The knife had been dipped in poison.

We make many inferences: the jam was spread with the poisoned knife and poison has been transferred to the bread. If John eats it he may die! Note the entities are objects, not words, and the situation could be conveyed by a silent movie. Evidence that humans construct situation representations from language comes from classic work by Bransford and colleagues (36, 43). This work demonstrates that (1) we understand and remember texts better when we can relate the text to a familiar situation; (2) relevant information can come from a picture accompanying the text; (3) what we remember from a text depends on the framing context; (4) we represent objects in memory that were not explicitly mentioned; and (5) after hearing a sentence describing spatial or conceptual relationships, we remember these relationships, not the language itself. For example, given Two frogs rested beside a floating log and a fish swam under it, the situation changes if it is replaced by them. After hearing the original sentence, people reject the variant with it in it as the sentence they heard before, but if the initial sentence said the frogs rested on the log, the situation is unchanged by replacing it with them, and people accept this variant.

Evidence from eye movements shows that people use linguistic and non-linguistic input jointly and immediately (44). Just after hearing The man will drink ..., participants look at a full wine glass rather than an empty beer glass (45). After hearing The man drank, they look at the empty beer glass. Understanding thus involves constructing, in real time, a representation conveyed jointly by vision and language.

Fig. 4. Sketch of the brain's understanding system. Ovals in the blue box stand for neocortical brain areas representing different kinds of information. Arrows in the neocortex stand for connections allowing representations to constrain each other. The medial temporal lobe (red box) stores an integrated representation of the neocortical system state arising from an experience. The red arrow represents fast-learning connections that store this pattern for later reactivation and use. Green arrows stand for gradually learned connections supporting bidirectional influence between the MTL and neocortex. (A) and (B) are two example inputs discussed in the main text.

The compositionality of situations. An important debate in cognitive science and AI centers on compositionality. Fodor and Pylyshyn (8) argued that our cognitive systems must be compositional by design to allow language to express arbitrary relationships, and noted that early neural network models failed tests of compositionality. Such failures are still reported (46), leading some to propose building compositionality in (10); yet, as we have seen, the most successful language models avoid doing so. We suggest that a focus on situations may enhance compositionality because situations are themselves compositional. Suppose a person picks an apple and gives it to someone. A small number of objects and persons are focally involved, and the effects on other persons and objects are likely to be local. A sentence like John picked an apple and gave it to Mary could describe this situation, capturing the most relevant participants and their relationships. We emphasize that compositionality is predominant and approximate, not universal or absolute, so it is best to allow for these matters of degree. Letting situation representations emerge through experience will help our models to achieve greater systematicity, while leaving them with the flexibility that has led to their successes to date.

Language informs us about situations. Situations ground the representations we construct from language; equally importantly, language informs us about situations. Language tells us about situations we have not witnessed and describes aspects that we cannot observe. Language also communicates folk or scientific construals that shape listeners' construals, such as the idea that an all-knowing being took six days to create the world or the idea that natural processes gave rise to our world and ultimately ourselves over billions of years. Language can be used to communicate information about properties that only arise in a social setting, such as ownership, or that have been identified by a culture as important, such as exact number. Language thus enriches and extends the information we have about situations and provides the primary medium conveying properties of many kinds of objects and many kinds of relationships.

Toward a Brain and AI Inspired Model of Understanding

Capturing the full range of situations is clearly a long-term challenge. We return to this later, focusing first on concrete situations involving animate beings and physical objects. We seek to integrate insights from cognitive neuroscience and artificial intelligence toward the goal of building an integrated understanding model. We start with our construal of the understanding system in the human brain and then sketch aspects of what an artificial implementation might look like.

McClelland et al.: Extending Machine Language Models toward Human-Level Language Understanding 4 The understanding system in the brain. Our construal of the fine-grained static situations and short time scale events (here- human integrated understanding system builds on the princi- after micro-situations) that can be conveyed by a sentence like ples of mutual constraint satisfaction and emergence and with the boy hit the ball with a bat. These context representations the idea that understanding centers on the construction of arise in a set of interconnected areas primarily within the pari- situation representations. It is consistent with a wide range of etal lobes (41, 47). In recent work, brain imaging data is used evidence, some of which we review, and is broadly consistent to analyze the time-varying patterns of neural activity while with recent characterizations in cognitive neuroscience (41, 47). processing a temporally extended narrative. The brain activity However, researchers hold diverse views about the details of patterns that represent scenes extending over tens of seconds these systems and how they work together. (e.g., a detective searching a suspect’s apartment for evidence) We focus first on the part of the system located primarily are largely the same, whether the information comes from in the neocortex of the brain, as schematized in the large watching a movie, hearing or reading a narrative description, blue box of Fig.4. Together with input and output systems, or recalling the movie after having seen it (52, 53). Activa- this allows a person to combine linguistic and visual input tions in different brain areas track information at different to understand the situation referred to upon hearing a sen- time scales. Activity in modality-specific areas associated with tence, such as one containing the word bat, while observing a speech and visual processing follows the moment-by-moment corresponding situation in the world. It is important to note time course of spoken and/or visual information while activ- that the neocortex is very richly structured, with on the order ity in the network associated with situation representations of 100 well-defined anatomical subdivisions. However, it is fluctuates on a much longer time scale. During processing common and useful to group these divisions into subsystems. of narrative information, activations in these regions tends The ones we focus on here are each indicated by a blue oval in to be relatively stable within a scene and punctuated with the figure. One subsystem subserves the formation of a visual larger changes at boundaries between these scenes, and these representation of the given situation, and another subserves patterns lose their coherence when the narrative structure the formation of an auditory representation capturing the is scrambled (41, 53). Larger-scale spatial transitions (e.g., spatiotemporal structure of the co-occurring spoken language. transitions between rooms) also create large changes in neural The three ovals above these provide representations of more activity (47). integrative/abstract types of information (see below). Within each subsystem, and between each connected pair The role of language. Where in the brain should we look for of subsystems, the neurons are reciprocally interconnected via representations of the relations among the objects participating learning-dependent pathways allowing mutual constraint satis- in a micro-situation? 
The effects of brain damage suggest that faction among all of the elements of each of the representation representations of relations may be integrated, at least in part, types, as indicated by the looping blue arrows from each oval with the representation of language itself. Injuries affecting to itself and by the bi-directional blue arrows between these the lateral surface of the frontal and parietal lobes produce ovals. Brain regions for representing visual and auditory inputs profound deficits in the production of fluent language, but are well-established, and the evidence for their involvement in can leave the ability to read and understand concrete nouns a mutual constraint satisfaction process with more integrative largely intact. Such lesions produce intermediate degrees brain areas has been reviewed elsewhere (48, 49). Here we of impairment to abstract nouns, verbs, and modifiers, and consider the three more integrative subsystems. profound impairment to words like if and by that capture grammatical and spatial relations (54). This striking pattern Object representations. A brain area near the front of the tempo- is consistent with the view that language itself is intimately ral lobe houses neurons whose activity provides an embedding tied to the representation of relations and changes in relations capturing the properties of objects (50). Damage to this area (information conveyed by verbs, prepositions, and grammatical impairs the ability to name objects, to grasp them correctly markers). Indeed, the frontal and parietal lobes are associated for their intended use, to match them with their names or with representation of space and action (which causes change in the sounds they make, and to pair objects that go together, relations), and patients with lesions to the frontal and parietal either from their names or from pictures. This brain area is language-related areas have profound deficits in relational itself an inter-modal area, receiving visual, language and other reasoning tasks (55). We therefore tentatively suggest that the information about objects such as the sounds they make and understanding of micro-situations depends jointly on the object how they feel to the touch. Models capturing these findings and language systems, and that the language is intimately link (51) treat this area as the hidden layer of an interactive, re- to representation of spatial relationships and actions. current network with bidirectional connections to other layers representing different types of object properties, including the Complementary learning systems. The brain systems described object’s name. In these models, an input to any of these above support understanding of situations that draw on gen- other layers activates the corresponding pattern in the hidden eral knowledge as well as oft-repeated personal knowledge, but layer, which in turn activates the corresponding patterns in they do not support the formation of new that can the other layers. This supports, for example, the ability to be accessed and used at an arbitrary later time. This ability produce the name of an object from visual input. Damage depends on structures that include the hippocampus in the me- (simulated by removing neurons in the hidden layer) degrades dial temporal lobes (MTL; Fig.4, red box). While these areas the model’s representations, capturing the patterns of errors are critical for new learning, damage to them does not affect made by patients with the condition. 
general knowledge, acquired skills, or the ability to process language and other forms of input to understand a situation— Representation of context. There is a network of areas in the except when this depends on remote information experienced brain that capture the large-scale spatiotemporal context of briefly outside the immediate current context (56). These

These findings are captured in the neural-network based complementary learning systems (CLS) theory (57–59), which holds that connections within the neocortex acquire the knowledge that allows a human to understand objects and their properties, to link words and objects, and to understand and communicate about generic situations as these are conveyed through language and other forms of experience. According to this theory, the learning process is necessarily gradual, allowing it to capture the nuanced statistical structure of experience. The MTL provides a complementary fast-learning system supporting the formation of new arbitrary associations, linking the elements of an experience together, including the objects and language encountered in a situation and the co-occurring spatiotemporal context, as might arise in the situation depicted in Fig. 4B, where a person encounters the word numbat from both visual and language input.

It is generally accepted that knowledge that depends initially on the MTL can be integrated into the neocortex through a consolidation process (56). In CLS (58), the neocortex learns arbitrary new information gradually through interleaved presentations of new and familiar items, weaving it into the fabric of knowledge and avoiding interference with existing knowledge (60). The details are subjects of current debate and ongoing investigation (61).

As in our example of the beer John left in the cooler, understanding often depends on remote information. People exploit such information during language processing (62), and patients with MTL damage have difficulty understanding or producing extended narratives (63); they are also profoundly impaired in learning new words for later use (64). Neural language models, including those using QBA, also lack these capabilities. In BERT and related models, bi-directional attention operates within a span of a couple of sentences at a time. Other models (28) employ QBA over longer spans of prior context, but there is still a finite limit. These models learn gradually like the human neocortical system, allowing them to capture structure in experience and acquire knowledge of word meaning. GPT-3 (28) is impressive in its ability to use a word encountered for the first time within its contextual span appropriately in a subsequent sentence, but this information is lost forever when the context is re-initialized, as it would be in a patient without the medial temporal lobe. Including an MTL-like system in future understanding models would address this limitation.

The brain's complementary learning systems may provide a means to address the challenge of learning to use a word encountered in a single context appropriately across a wide range of contexts. Deep neural networks that learn gradually through many repetitions of an item that occurs in a single context, interleaved with presentations of other items occurring in a diversity of contexts, do not show this ability (46). We attribute this failure to the fact that the distribution of training experiences they receive conveys the information that the target item is in fact restricted to its single context. Further research should explore whether augmenting a model like GPT-3 with an MTL-like memory would enable more human-like extension of a novel word encountered just once to a wider range of contexts.

Next steps toward a brain and AI-inspired model. Given the construal we have described of the human understanding system, we now ask, what might an implementation of a model consistent with it be like? This is a long-term research question – addressing it will benefit from a greater convergence of cognitive neuroscience and AI. Toward this goal, we sketch a proposal for a brain and AI-inspired model. We rely on the principles of mutual constraint satisfaction and emergence, the query-based attention architecture from AI and deep learning, and the components and their interconnections in the understanding system in the brain, as illustrated in Fig. 4.

We treat the system as one that receives sequences of picture-description (PD) pairs grouped into episodes that in turn form story-length narratives, with the language conveyed by text rather than speech. Our sketch leaves important issues open and will require substantial development. Steps toward addressing some of these issues are already being taken: mutual QBA is already being used, e.g., in (65), to exploit audio and visual information from movies.

In our proposed model, each PD pair is processed by interacting object and language subsystems, receiving visual and text input at the same time. Each subsystem must learn to restore missing or distorted elements (words in the language subsystem, objects in the object subsystem) by using mutual QBA as in BERT, allowing each element in each subsystem to be constrained by all of the elements in both subsystems. Additionally, these systems will query the context and memory subsystems, as described below.

The context subsystem encodes a sequence of compressed representations of the previous PD pairs within the current episode. Processing in this subsystem would be sequential over pairs, allowing the network constructing the current compressed representation to query the representations of past pairs within the episode. Within the processing of a pair, the context system would engage in mutual QBA with the object and language subsystems, allowing the language and object subsystems to indirectly exploit information from past pairs within the episode.

Our system also includes an MTL-like memory to allow it to use remote information beyond the current episode. A neural network with learned connection weights constructs a reduced description of the state of the object, language, and context modules along with their inputs from vision and text after processing each PD pair. Building on existing artificial systems with external memory (66, 67), this compressed vector is stored in a slot in an external memory. These states are then accessible to the cortical subsystems via QBA. The system would employ a flexible querying scheme, such that any subset of the object, language, or context representations of an input currently being processed would contribute to accessing relevant MTL representations. Their contents would then be available to all of the cortical subsystems using QBA. Thus, the appearance of a numbat in a visual scene would retrieve the corresponding language and context information containing its name and the fact that it eats termites, based on prior storage of the compressed representation formed previously from the inputs in Fig. 4B. A minimal code sketch of such a memory appears at the end of this section.

Ultimately, our model may benefit from a subsystem that guides processing in all of the subsystems we have described. The brain has such a system in its frontal lobes; damage to this system leads to impairments in guiding behavior and cognition according to current task demands, and others advocate including such a system in neural AI systems (68). We leave it to future work to consider how to integrate such a subsystem into the model we have described here.
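To make the MTL-like memory concrete, here is a minimal Python/NumPy sketch under our own assumptions: the slot count, vector sizes, encoding, and the MTLMemory name are illustrative choices of ours, not specifications from the proposal above. Compressed state vectors are written to slots by fast one-shot storage and later retrieved by QBA from a partial cue, as in the numbat example:

```python
import numpy as np

class MTLMemory:
    """Slot-based external memory accessed by query-based attention (QBA)."""

    def __init__(self, dim):
        self.dim = dim
        self.slots = []                  # one compressed state vector per experience

    def store(self, state_vec):
        """Fast learning: write a compressed system state to a new slot."""
        self.slots.append(state_vec / np.linalg.norm(state_vec))

    def retrieve(self, cue, g=10.0):
        """QBA retrieval: a partial cue (e.g., only the visual part of a
        state) attends over stored slots; returns their weighted sum."""
        mem = np.stack(self.slots)
        sims = mem @ cue / np.linalg.norm(cue)   # cosine similarities
        ws = np.exp(g * sims); ws /= ws.sum()    # softmax weightings
        return ws @ mem                          # blended retrieved state

# A "state" here is a concatenation of object, language, and context parts.
rng = np.random.default_rng(1)
memory = MTLMemory(dim=12)
numbat_state = rng.normal(size=12)   # formed when seeing/reading about a numbat
memory.store(numbat_state)
memory.store(rng.normal(size=12))    # other, unrelated experiences
memory.store(rng.normal(size=12))

# Later, the visual appearance alone (first 4 components) cues retrieval:
cue = np.zeros(12); cue[:4] = numbat_state[:4]
retrieved = memory.retrieve(cue)
# The retrieved vector is typically dominated by the stored numbat state,
# making its linguistic content available to the cortical subsystems.
print(np.corrcoef(retrieved, numbat_state / np.linalg.norm(numbat_state))[0, 1])
```

Because retrieval is content-addressed rather than position-addressed, any subset of the state can serve as the cue, matching the flexible querying scheme described above.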

Enhancing understanding by incorporating interaction with the physical and social world. A complete model of the human understanding system will require integration of many additional information sources. These include sounds, touch and force-sensing, and information about one's own actions. Every source provides opportunities to predict information of each type, relying on every other type. Information salient in one source can bootstrap learning and inference in the other, and all are likely to contribute to enhancing compositionality and addressing the data-inefficiency of learning from language alone. This affords the human learner an important opportunity to experience the compositional structure of the environment through their own actions. Ultimately, an ability to link one's actions to their consequences as one behaves in the world should contribute to the emergence of, and appreciation for, the compositional structure of events, and provide a basis for acquiring notions of cause and effect, of agency, and of object permanence (69).

These considerations motivate recent work on agent-based language learning in simulated interactive 3D environments (70–73). In (74), an agent was trained to identify, lift, carry and place objects relative to other objects in a virtual room, as specified by simplified language instructions. At each time step, the agent received a first-person visual observation and processed the pixels to obtain a representation of the scene. This was concatenated to the final state of an LSTM that processed the instruction, and then passed to an integrative LSTM whose output was used to select a motor action. The agent gradually learned to follow instructions of the form find a pencil, lift up a basketball and put the teddy bear on the bed, encompassing 50 objects and requiring up to 70 action steps. Such instructions require constructing representations based on language stimuli that support identification of objects and relations across space and time and the integration of this information to inform motor behaviors.

Importantly, without building in explicit object representations, the learned system was able to interpret novel instructions. For instance, an agent trained to lift each of 20 objects, but only trained to put 10 of those in a specific location, could place the remaining objects in the same location on command with over 90% accuracy, demonstrating a degree of compositionality in its behavior. Notably, the agent's ego-centric, multimodal and temporally-extended experience contributed to this outcome; both an alternative agent with a fixed perspective on a 2D grid world and a static neural network classifier that received only individual still images exhibited significantly worse generalization. This underscores how affording neural networks access to rich, multi-modal interactive environments can stimulate the development of capacities that are essential for language learning, and contribute toward emergent compositionality.

Beyond concrete situations. How might our approach be extended beyond concrete situations to those involving relationships among objects like laws, belief systems, and scientific theories? Basic word embeddings themselves capture some abstract relations via vector similarity, e.g., encoding that justice is closer to law than peanut. Words are uttered in real world contexts and there is a continuum between grounding and language-based linking for different words and different uses of words. For example, career is not only linked to other abstract words like work and specialization but also to more concrete ones like path and its extended metaphorical use as the means to achieve goals (75). Embodied, simulation-based approaches to meaning (76, 77) build on this observation to bridge from concrete to abstract situations via metaphor. They posit that understanding words like grasp is linked to neural representations of the action of grabbing and that this circuitry is recruited for understanding contexts such as grasping an idea. We consider situated agents as a critical catalyst for learning about how to represent and compose meanings pertaining to spatial, physical and other perceptually immediate phenomena—thereby providing a grounded edifice that can connect both to brain circuitry for motor action and to representations derived primarily from language.

Conclusion

Language does not stand alone. The understanding system in the brain connects language to representations of objects and situations and enhances language understanding by exploiting the full range of our multi-sensory experience of the world, our representations of our motor actions, and our memory of previous situations. We believe next generation language understanding systems should emulate this system, and we have sketched an approach that incorporates recent machine learning breakthroughs to build a jointly brain and AI inspired understanding system. We emphasize understanding of concrete situations and argue that understanding abstract language should build upon this foundation, pointing toward the possibility of one day building artificial systems that understand abstract situations far beyond concrete, here-and-now situations. In sum, combining insights from neuroscience and AI will take us closer to human-level language understanding.

ACKNOWLEDGMENTS. This article grew out of a workshop organized by HS at Meaning in Context 3, Stanford University, September 2017. We thank Janice Chen, Chris Potts, and Mark Seidenberg for discussion. HS was supported by ERC Advanced Grant #740516.

1. Rosenblatt F (1961) Principles of neurodynamics. Perceptrons and the theory of brain mechanisms. (Spartan Books, Cornell University, Ithaca, New York).
2. Rumelhart DE, McClelland JL, the PDP research group (1986) Parallel Distributed Processing. Explorations in the Microstructure of Cognition. Volume 1: Foundations. (MIT Press, Cambridge, MA).
3. Rumelhart DE, McClelland JL (1986) On learning the past tenses of English verbs in Parallel Distributed Processing. Explorations in the Microstructure of Cognition. Volume 2: Psychological and Biological Models, eds. McClelland JL, Rumelhart DE, the PDP Research Group. (MIT Press, Cambridge, MA), pp. 216–271.
4. Pinker S, Mehler J (1988) Connections and symbols. (MIT Press, Cambridge, MA).
5. MacWhinney B, Leinbach J (1991) Implementations are not conceptualizations: Revising the verb learning model. Cognition 40(1-2):121–157.
6. Bybee J, McClelland JL (2005) Alternatives to the combinatorial paradigm of linguistic theory based on domain general principles of human cognition. The Linguistic Review 22(2-4):381–410.
7. Seidenberg MS, Plaut DC (2014) Quasiregularity and its discontents: The legacy of the past tense debate. Cognitive Science 38(6):1190–1228.
8. Fodor JA, Pylyshyn ZW, et al. (1988) Connectionism and cognitive architecture: A critical analysis. Cognition 28(1-2):3–71.
9. Marcus G (2001) The algebraic mind. (MIT Press, Cambridge, MA).
10. Lake BM, Ullman TD, Tenenbaum JB, Gershman SJ (2017) Building machines that learn and think like people. Behavioral and Brain Sciences 40:e253.
11. Rumelhart DE (1977) Toward an interactive model of reading in Attention & Performance VI, ed. Dornic S. (LEA, Hillsdale, NJ), pp. 573–603.
12. Taraban R, McClelland JL (1988) Constituent attachment and thematic role assignment in sentence processing: Influences of content-based expectations. Journal of Memory and Language 27(6):597–632.
13. McClelland JL, Rumelhart DE (1981) An interactive activation model of context effects in letter perception: I. An account of basic findings. Psychological Review 88(5):375.
14. Elman JL (1990) Finding structure in time. Cognitive Science 14:179–211.
15. Gold EM (1967) Language identification in the limit. Information and Control 10(5):447–474.
16. Elman JL (1991) Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning 7(2/3):195–225.
17. Manning CD, Schütze H (1999) Foundations of Statistical Natural Language Processing. (MIT Press, Cambridge, MA).

18. Collobert R, et al. (2011) Natural language processing (almost) from scratch. Journal of Machine Learning Research 12(Aug):2493–2537.
19. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality in Proceedings of Neural Information Processing Systems. pp. 3111–3119.
20. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Computation 9(8):1735–1780.
21. Bahdanau D, Cho K, Bengio Y (2015) Neural machine translation by jointly learning to align and translate in 3rd International Conference on Learning Representations, ICLR 2015.
22. Wu Y, et al. (2016) Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR abs/1609.08144.
23. Lewis-Kraus G (2016) The great AI awakening. The New York Times Magazine. Published on-line December 14, 2016; accessed May 23, 2020.
24. Manning CD, Clark K, Hewitt J, Khandelwal U, Levy O (2020) Emergent linguistic structure in artificial neural networks trained by self-supervision. Proceedings of the National Academy of Sciences.
25. Devlin J, Chang MW, Lee K, Toutanova K (2019) BERT: Pre-training of deep bidirectional transformers for language understanding in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171–4186.
26. Vaswani A, et al. (2017) Attention is all you need in Advances in Neural Information Processing Systems 30. (Curran Associates, Inc.), pp. 5998–6008.
27. Linzen T, Baroni M (2020) Syntactic structure from deep learning. Annual Review of Linguistics 2020.
28. Brown TB, et al. (2020) Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
29. Warren RM (1970) Perceptual restoration of missing speech sounds. Science 167(3917):392–393.
30. Yang Z, et al. (2019) XLNet: Generalized autoregressive pretraining for language understanding. CoRR abs/1906.08237.
31. Nie Y, et al. (2019) Adversarial NLI: A new benchmark for natural language understanding. arXiv preprint arXiv:1910.14599.
32. Bisk Y, et al. (2020) Experience grounds language. arXiv preprint arXiv:2004.10151.
33. Chomsky N (1971) Deep structure, surface structure, and semantic interpretation in Semantics: An interdisciplinary reader in philosophy, linguistics, and psychology, eds. Steinberg D, Jakobovits LA. (Cambridge University Press, Cambridge), pp. 183–216.
34. Fodor JA (1983) The Modularity of Mind. (MIT Press).
35. Lakoff G (1987) Women, Fire, and Dangerous Things. (The University of Chicago Press).
36. Bransford JD, Johnson MK (1972) Contextual prerequisites for understanding: Some investigations of comprehension and recall. Journal of Verbal Learning and Verbal Behavior 11(6):717–726.
37. Crain S, Steedman M (1985) Context and the psychological syntax processor in Natural Language Parsing, eds. Dowty, Karttunen, Zwicky. (Cambridge University Press).
38. Montague R (1973) The proper treatment of quantification in ordinary English in Approaches to natural language: Proceedings of the 1970 Stanford workshop on grammar and semantics, eds. Hintikka J, Moravcsik J, Suppes P. (Riedel, Dordrecht), pp. 221–242.
39. Schank RC (1983) Dynamic memory: A theory of reminding and learning in computers and people. (Cambridge University Press).
40. St. John MF, McClelland JL (1990) Learning and applying contextual constraints in sentence comprehension. Artificial Intelligence 46(1):217–257.
41. Hasson U, Egidi G, Marelli M, Willems RM (2018) Grounding the neurobiology of language in first principles: The necessity of non-language-centric explanations for language comprehension. Cognition 180:135–157.
42. Rumelhart DE (1979) Some problems with the notion that words have literal meanings in Metaphor and Thought, ed. Ortony A. (Cambridge Univ. Press, Cambridge, UK), pp. 71–82.
43. Barclay J, Bransford JD, Franks JJ, McCarrell NS, Nitsch K (1974) Comprehension and semantic flexibility. Journal of Verbal Learning and Verbal Behavior 13(4):471–481.
44. Tanenhaus M, Spivey-Knowlton M, Eberhard K, Sedivy J (1995) Integration of visual and linguistic information in spoken language comprehension. Science 268(5217):1632–1634.
45. Altmann GT, Kamide Y (2007) The real-time mediation of visual attention by language and world knowledge: Linking anticipatory (and other) eye movements to linguistic processing. Journal of Memory and Language 57(4):502–518.
46. Lake B, Baroni M (2018) Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks in International Conference on Machine Learning. pp. 2873–2882.
47. Ranganath C, Ritchey M (2012) Two cortical systems for memory-guided behaviour. Nature Reviews Neuroscience 13:713.
48. McClelland JL, Mirman D, Bolger DJ, Khaitan P (2014) Interactive activation and mutual constraint satisfaction in perception and cognition. Cognitive Science 38(6):1139–1189.
49. Heilbron M, Richter D, Ekman M, Hagoort P, de Lange FP (2020) Word contexts enhance the neural representation of individual letters in early visual cortex. Nature Communications 11(1):1–11.
50. Patterson K, Nestor PJ, Rogers TT (2007) Where do you know what you know? The representation of semantic knowledge in the human brain. Nature Reviews Neuroscience 8(12):976.
51. Rogers TT, et al. (2004) Structure and deterioration of semantic memory: A neuropsychological and computational investigation. Psychological Review 111(1):205–235.
52. Zadbood A, Chen J, Leong Y, Norman K, Hasson U (2017) How we transmit memories to other brains: Constructing shared neural representations via communication. Cerebral Cortex 27(10):4988–5000.
53. Baldassano C, et al. (2017) Discovering event structure in continuous narrative perception and memory. Neuron 95(3):709–721.
54. Morton J, Patterson K (1980) Little words – No! in Deep Dyslexia. (Routledge and Kegan Paul, London), pp. 270–285.
55. Baldo JV, Bunge SA, Wilson SM, Dronkers NF (2010) Is relational reasoning dependent on language? A voxel-based lesion symptom mapping study. Brain and Language 113(2):59–64.
56. Milner B, Corkin S, Teuber HL (1968) Further analysis of the hippocampal amnesic syndrome: 14-year follow-up study of H.M. Neuropsychologia 6(3):215–234.
57. Marr D (1971) Simple memory: A theory for archicortex. Philosophical Transactions of the Royal Society of London B: Biological Sciences 262(841):23–81.
58. McClelland JL, McNaughton BL, O'Reilly RC (1995) Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory. Psychological Review 102(3):419–457.
59. Kumaran D, Hassabis D, McClelland JL (2016) What learning systems do intelligent agents need? Complementary learning systems theory updated. Trends in Cognitive Sciences 20(7):512–534.
60. McClelland JL, McNaughton BL, Lampinen AK (2020) Integration of new information in memory: New insights from a complementary learning systems perspective. Philosophical Transactions of the Royal Society B 375(1799):20190637.
61. Yonelinas A, Ranganath C, Ekstrom A, Wiltgen B (2019) A contextual binding theory of episodic memory: Systems consolidation reconsidered. Nature Reviews Neuroscience 20:364–375.
62. Menenti L, Petersson KM, Scheeringa R, Hagoort P (2009) When elephants fly: Differential sensitivity of right and left inferior frontal gyri to discourse and world knowledge. Journal of Cognitive Neuroscience 21(12):2358–2368.
63. Zuo X, et al. (2020) Temporal integration of narrative information in a hippocampal amnesic patient. NeuroImage 213:116658.
64. Gabrieli JD, Cohen NJ, Corkin S (1988) The impaired learning of semantic knowledge following bilateral medial temporal-lobe resection. Brain and Cognition 7(2):157–177.
65. Sun C, Baradel F, Murphy K, Schmid C (2019) Contrastive bidirectional transformer for temporal representation learning. CoRR arXiv:1906.05743.
66. Weston J, Chopra S, Bordes A (2015) Memory networks in International Conference on Learning Representations.
67. Graves A, et al. (2016) Hybrid computing using a neural network with dynamic external memory. Nature 538(7626):471–476.
68. Russin J, O'Reilly RC, Bengio Y (2020) Deep learning needs a prefrontal cortex. ICLR Workshop on Bridging AI and Cognitive Science.
69. Piaget J (1952) The Origins of Intelligence in Children (M. Cook, Trans.). New York, NY, US.
70. Hermann KM, et al. (2017) Grounded language learning in a simulated 3D world. arXiv preprint arXiv:1706.06551.
71. Das R, Zaheer M, Reddy S, McCallum A (2017) Question answering on knowledge bases and text using universal schema and memory networks in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. pp. 358–365.
72. Chaplot DS, Sathyendra KM, Pasumarthi RK, Rajagopal D, Salakhutdinov R (2018) Gated-attention architectures for task-oriented language grounding in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), the 30th Innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, eds. McIlraith SA, Weinberger KQ. (AAAI Press), pp. 2819–2826.
73. Oh J, Singh S, Lee H, Kohli P (2017) Zero-shot task generalization with multi-task deep reinforcement learning in Proceedings of the 34th International Conference on Machine Learning, Volume 70. (JMLR.org), pp. 2661–2670.
74. Hill F, et al. (2020) Environmental drivers of systematicity and generalization in a situated agent in International Conference on Learning Representations, ICLR.
75. Bryson J (2008) Embodiment vs. memetics. Mind and Society 7(1):77–94.
76. Lakoff G, Johnson M (1980) Metaphors We Live By. (University of Chicago Press, Chicago, IL).
77. Feldman J, Narayanan S (2004) Embodied meaning in a neural theory of language. Brain and Language 89:385–392.
