
Extending Machine Language Models toward Human-Level Language Understanding

James L. McClelland (a,b,2), Felix Hill (b,2), Maja Rudolph (c,2), Jason Baldridge (d,1,2), and Hinrich Schütze (e,1,2)

aStanford University, Stanford, CA 94305, USA; bDeepMind, London N1C 4AG, UK; cBosch Center for Artificial Intelligence, Renningen, 71272, Germany; dGoogle Research, Austin, TX 78701, USA; eLMU Munich, Munich, 80538, Germany

This manuscript was compiled on July 7, 2020

Language is crucial for human intelligence, but what exactly is its role? We take language to be a part of a system for understanding and communicating about situations. The human ability to understand and communicate about situations emerges gradually from experience and depends on domain-general principles of biological neural networks: connection-based learning, distributed representation, and context-sensitive, mutual constraint satisfaction-based processing. Current artificial language processing systems rely on the same domain general principles, embodied in artificial neural networks. Indeed, recent progress in this field depends on query-based attention, which extends the ability of these systems to exploit context and has contributed to remarkable breakthroughs. Nevertheless, most current models focus exclusively on language-internal tasks, limiting their ability to perform tasks that depend on understanding situations. These systems also lack memory for the contents of prior situations outside of a fixed contextual span. We describe the organization of the brain's distributed understanding system, which includes a fast learning system that addresses the memory problem. We sketch a framework for future models of understanding drawing equally on cognitive neuroscience and artificial intelligence and exploiting query-based attention. We highlight relevant current directions and consider further developments needed to fully capture human-level language understanding in a computational system.

Natural Language Understanding | Deep Learning | Situation Models | Cognitive Neuroscience | Artificial Intelligence

J.M., F.H., M.R., J.B., and H.S. wrote the paper. The authors declare no conflict of interest. 1 J.B. and H.S. contributed equally. 2 To whom correspondence should be addressed. E-mail: [email protected], [email protected], [email protected], [email protected] or [email protected].

Striking recent advances in machine intelligence have appeared in language tasks. Machines better transcribe speech and respond in ever more natural sounding voices. Widely available applications allow one to say something in one language and hear its translation in another. Humans perform better than machines in most language tasks, but these systems work well enough to be used by billions of people every day. What underlies these successes? What limitations do they face? We argue that progress has come from exploiting principles of neural computation employed by the human brain, while a key limitation is that these systems treat language as if it can stand alone. We propose that language works in concert with other inputs to understand and communicate about situations. We describe key aspects of human understanding and key components of the brain's understanding system. We then propose initial steps toward a model informed both by cognitive neuroscience and artificial intelligence and point to extensions addressing more abstract cases.

Principles of Neural Computation

The principles of neural computation are domain general, inspired by the human brain and human abilities. They were first articulated in the 1950s (1) and further developed in the 1980s in the Parallel Distributed Processing (PDP) framework for modeling cognition (2). This work introduced the idea that structure in cognition and language is emergent: it is captured in learned connection weights supporting the construction of context-sensitive representations whose characteristics reflect a gradual, input-statistics dependent, learning process (3).

Classical linguistic theory and most computational linguistics employ discrete symbols and explicit rules to characterize language structure and relationships. In neural networks, these symbols are replaced by continuous, multivariate patterns called distributed representations or embeddings, and the rules are replaced by continuous, multi-valued arrays of connection weights that map patterns to other patterns.

Since its introduction (3), debate has raged about this approach to language processing (4). Protagonists argue it supports nuanced, context- and similarity-sensitive processing that is reflected in the quasi-regular relationships between phrases and their sounds, spellings, and meanings (5, 6). These models also capture subtle aspects of human performance in language tasks (7). However, critics note that neural networks often fail to generalize beyond their training data, blaming these failures on the absence of explicit rules (8–10).

Another key principle is mutual constraint satisfaction (11). For example, interpreting a sentence requires resolving both syntactic and semantic ambiguity. If we hear A boy hit a man with a bat, we tend to assume with a bat attaches to the verb (syntax) and is thereby the instrument of hitting (semantics). However, if beard replaces bat, then with a beard is attached to man (syntax) and describes the person affected (semantics) (12). Even segmenting language into elementary units depends on meaning and context (Fig. 1). Rumelhart (11) envisioned a model in which estimates of the probability of all aspects of an input constrain estimates of the probabilities of all others, motivating a model of context effects in perception (13) that launched the PDP approach.
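To make the mutual constraint satisfaction principle concrete, here is a minimal sketch in Python/NumPy, assuming a toy network of our own construction (the units, weights, and inputs are hypothetical and are not taken from the models cited above). Units standing for competing hypotheses iteratively settle under symmetric, bidirectional connections, so that each estimate constrains all the others, in the spirit of the interactive activation model (13):

```python
import numpy as np

# Toy mutual constraint satisfaction: units excite consistent hypotheses
# and inhibit inconsistent ones via symmetric (bidirectional) weights.
# Units: 0 = "bat is an animal", 1 = "bat is a baseball bat",
#        2 = "context mentions hitting a ball".
W = np.array([[ 0.0, -1.0, -0.5],   # animal reading inhibited by the others
              [-1.0,  0.0,  1.0],   # baseball reading supported by context
              [-0.5,  1.0,  0.0]])  # symmetric: constraints are mutual

external = np.array([0.2, 0.2, 1.0])  # bottom-up input: the hitting context is present
a = np.zeros(3)                       # activations start neutral

for _ in range(50):                   # iterate until the network settles
    net = W @ a + external            # each unit's input from all the others
    a += 0.1 * (np.tanh(net) - a)     # move each estimate toward its constrained value

print(a.round(2))  # the "baseball bat" unit wins; the "animal" unit is suppressed
```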

Fig. 1. Context influences the identification of letters in written text: the visual input we read as went in the first sentence and event in the second is the same bit of Rumelhart's handwriting, cut and pasted into each context. Reprinted from (11).

Neural Language Modeling

Initial steps. Elman (14) introduced a simple recurrent neural network (RNN) (Fig. 2a) that captured key characteristics of language structure through learning, a feat once considered impossible (15). It was trained to predict the next word in a sequence (w(t+1)) based on the current word (w(t)) and its own hidden (that is, learned internal) representation from the previous time step (h(t−1)). Each of these inputs is multiplied by a matrix of connection weights (arrows labeled Whi and Whh in Fig. 2a) and the results are added to produce the input to the hidden units. The elements of this vector pass through a function limiting the range of their values, producing the hidden representation. This in turn is multiplied with the weights to the output layer from the hidden layer (Woh) to generate a vector used to predict the probability of each of the possible successor words. Learning is based on the discrepancy between the network's output and the actual next word; the values of the connection weights are adjusted by a small amount to reduce the discrepancy. The network is recurrent because the same connection weights (denoted by arrows in the figure) are used to process each successive word; a minimal code sketch of this update appears at the end of this section.

Fig. 2. (a) Elman's (1990) simple recurrent network and (b) his hierarchical clustering of the representations it learned, reprinted from (14).

Elman showed two things. First, after training his network to predict the next word in sentences like man eats bread, dog chases cat, and girl sleeps, the network's representations captured the syntactic distinction between nouns and verbs (14). They also captured interpretable subcategories, as shown by a hierarchical clustering of the hidden representations of the different words (Fig. 2b). This illustrates a key feature of learned representations: they capture specific as well as general or abstract information. By using a different learned representation for each word, its specific predictive consequences can be exploited. Because representations for words that make similar predictions are similar, and because neural networks exploit similarity, the network can share knowledge about predictions among related words.

Second, Elman (16) used both simple sentences like boy chases dogs and more complex ones like boy who sees girls chases dogs. In the latter, the verb chases must agree with the first noun (boy), not the closest noun (girls), since the sentence contains a main clause (boy chases dogs) interrupted by a reduced relative clause (boy [who] sees girls). The model learned to predict the verb form correctly despite the intervening clause, showing that it acquired sensitivity to the syntactic structure of language, not just local co-occurrence statistics.

Scaling up to natural text. Elman's task of predicting words based on context has been central to neural language modeling. However, Elman trained his networks with tiny, toy languages. For many years, it seemed they would not scale up, and language modeling was dominated by simple n-gram models and systems designed to assign explicit structural descriptions to sentences, aided by advances in probabilistic computations (17). Over the past 10 years, breakthroughs have allowed networks to predict and fill in words in huge natural language corpora.

One challenge is the large size of a natural language's vocabulary. A key step was the introduction of methods for learning word representations (now called embeddings) from co-occurrence relationships in large text corpora (18, 19). These embeddings exploit both general and specific predictive relationships of all the words in the corpus, improving generalization: task-focused neural models trained on small data sets better generalize to infrequent words (e.g., settee) based on frequent words (e.g., couch) with similar embeddings.

A second challenge is the indefinite length of the context that might be relevant for prediction. Consider this passage:

    John put some beer in a cooler and went out with his friends to play volleyball. Soon after he left, someone took the beer out of the cooler. John and his friends were thirsty after the game, and went back to his place for some beers. When John opened the cooler, he discovered that the beer was ___.

Here a reader expects the missing word to be gone. Yet if we replace took the beer with took the ice, the expected word is warm. Any amount of additional text between beer and gone does not change the predictive relationship, challenging RNNs like Elman's. An innovation called long short-term memory (LSTM) (20) partially addressed this problem by augmenting the recurrent network architecture with learned connection weights that gate information into and out of a network's internal state. However, LSTMs did not fully alleviate the context bottleneck problem (21): a network's internal state was still a fixed-length vector, limiting its ability to capture contextual information.
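Here is the minimal sketch of the simple recurrent update described under Initial steps, written in Python/NumPy under our own assumptions: the dimensions and the random weights are illustrative only, and the names W_hi, W_hh, and W_oh simply mirror the labels in Fig. 2a; Elman's trained weights are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 10, 8                       # toy vocabulary size and hidden size (assumed)
W_hi = rng.normal(0, 0.1, (H, V))  # input-to-hidden weights
W_hh = rng.normal(0, 0.1, (H, H))  # hidden-to-hidden (recurrent) weights
W_oh = rng.normal(0, 0.1, (V, H))  # hidden-to-output weights

def srn_step(word_onehot, h_prev):
    """One step of a simple recurrent network in the style of Elman (14)."""
    # Both inputs are multiplied by weight matrices and summed; a squashing
    # function limits the range of the values, producing the new hidden state.
    h = np.tanh(W_hi @ word_onehot + W_hh @ h_prev)
    # The hidden representation projects to the output layer, and a softmax
    # yields a predicted probability for each possible successor word.
    logits = W_oh @ h
    p_next = np.exp(logits) / np.exp(logits).sum()
    return h, p_next

h = np.zeros(H)
for w in [0, 3, 7]:                # indices of words in a toy sentence
    x = np.zeros(V); x[w] = 1.0
    h, p_next = srn_step(x, h)
print(p_next.round(3))             # predicted distribution over the next word
```

Note that h is the only conduit from past words to the current prediction: whatever the sentence, the entire preceding context must be squeezed into this one fixed-length vector, which is the context bottleneck discussed above.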
Query-based attention. Recent breakthroughs depend on an innovation we call query-based attention (QBA) (21). It was used in the Google Neural Machine Translation system (22), a system that attained a sudden leap in performance and attracted widespread public interest (23).

We illustrate QBA in Fig. 3 with the sentence John hit the ball with the bat. Context is required to determine whether bat refers to an animal or a baseball bat. QBA addresses this by issuing queries for relevant information. A query might ask 'is there an action and relation in the context that would indicate which kind of bat fits best?' The embeddings of words that match the query then receive high weightings in the weighted attention vector. In our example, the query matches the embedding of hit closely and of with to some extent; the returned attention vector captures the content needed to determine that a baseball bat fits in this context. There are many variants of QBA; the figure presents a simple version, sketched in code below.
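The following Python/NumPy sketch implements the simple QBA variant spelled out in the caption of Fig. 3: cosine similarities are converted to weightings by a softmax, and the weightings scale and sum the context representations. As in the figure, the vectors here are made up for illustration; the dimensionality and the scale factor g are assumptions.

```python
import numpy as np

def qba(query, rep_vs, g=4.0):
    """Simple query-based attention, following the math given with Fig. 3.

    query:  vector issued by the word being interpreted (e.g., 'bat')
    rep_vs: learned representation vectors (RepVs), one row per context word
    g:      scale factor applied to similarities before the softmax
    """
    # Similarity score for each context word j: s_j = cos(q, v_j)
    sims = rep_vs @ query / (np.linalg.norm(rep_vs, axis=1)
                             * np.linalg.norm(query))
    # Weightings w_j = exp(g*s_j) / sum_j' exp(g*s_j'): positive, summing to 1
    ws = np.exp(g * sims)
    ws /= ws.sum()
    # Weighted attention vector: element-wise sum of the scaled RepVs
    return ws, ws @ rep_vs

# Illustrative 4-dimensional embeddings for 'John hit the ball with the ...'
rep_vs = np.array([[0.1, 0.9, 0.0, 0.2],   # John
                   [0.8, 0.1, 0.9, 0.1],   # hit
                   [0.4, 0.3, 0.0, 0.5],   # with
                   [0.0, 0.2, 0.3, 0.9]])  # ball
query = np.array([0.9, 0.0, 0.8, 0.1])     # query generated from 'bat'

ws, attention = qba(query, rep_vs)
print(ws.round(2))         # 'hit' matches closely, 'with' to some extent
print(attention.round(2))  # content favoring the baseball-bat reading
```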

One important QBA model called BERT (25) uses both preceding and following context. In BERT, the network is trained to correct missing or randomly replaced words in blocks of text, typically spanning two sentences. BERT relies on multiple attention heads, each employing QBA (26), concatenating the returned weighted vectors to form a composite vector. The process is iterated across several stages, so that contextually constrained representations of each word computed at intermediate stages in turn constrain the representations of every other word in later stages. In this way, BERT employs mutual constraint satisfaction, as Rumelhart (11) envisioned. The contextualized embeddings that result from QBA can capture gradations within the set of meanings a language maps to a given word, aiding translation and other down-stream tasks. For example, in English, we use ball for many types of balls, whereas in French, some are balles and others ballons. Subtly different embeddings for ball in different English sentences aid selecting the correct French translation.

In QBA architectures, the vectors all depend on learned connection weights, and analysis of BERT's representations shows that they capture syntactic structure well (24). Different attention heads capture different linguistic relationships, and the similarity relationships among the contextually-shaded word representations can be used to reconstruct a sentence's syntactic structural description. These representations capture this structure without building it in, supporting the emergence principle. That said, careful analysis (27) indicates the deep network's sensitivity to grammatical structure is still imperfect, and only partially understood.

Some attention-based models (28) proceed sequentially, predicting each word using QBA over prior context, while BERT uses parallel processing or mutual QBA simultaneously on all the words in a pair of sentences. Humans appear to exploit past context and a limited window of subsequent context (29), suggesting a hybrid strategy. Some machine models adopt this approach (30), and below we adopt a hybrid approach as well.

Attention-based models have produced remarkable improvements on a wide range of language tasks. The models can be pre-trained on massive text corpora, providing useful representations for subsequent fine tuning to perform other tasks. A recent model called GPT-3 achieves impressive gains on several benchmarks without requiring fine-tuning (28). However, this model still falls short of human performance on tasks that depend on what the authors call "common sense physics" and on carefully crafted tests of its ability to determine if a sentence follows from a preceding text (31). Further, the text corpora these models rely on are far larger than a human learner could process in a lifetime. Gains from further increases may be diminishing, and human learners appear to be far more data-efficient. The authors of GPT-3 note these limitations and express the view that further improvements may require more fundamental changes.

Fig. 3. Query-based attention (QBA). To constrain the interpretation of the word bat in the context John hit the ball with the __, a query generated from bat is used to construct a weighted attention vector which shapes the word's interpretation. The query is compared to each of the learned representation vectors (RepVs) of the context words; this creates a set of similarity scores (Sims) which in turn produce a set of weightings (Ws, a set of positive numbers summing to 1). The Ws are used to scale the RepVs of the context words, creating Scaled RepVs. The weighted attention vector is the element-wise sum of the Scaled RepVs. The Query, RepVs, Sims, Scaled RepVs and weighted attention vector use red color intensity for positive magnitudes and blue for negative magnitudes. Ws are shown as green color intensity. White = 0 throughout. The Query and RepVs were made up for illustration, inspired by (24). Mathematical details: For query q and representation vector v_j for context word j, the similarity score is s_j = cos(q, v_j). The s_j are converted into weightings w_j by the softmax function, w_j = e^{g s_j} / Σ_{j'} e^{g s_{j'}}, where the sum in the denominator runs over all words in the context span, and g is a scale factor.

Language in an Integrated Understanding System

Where should we look for further progress addressing the limitations of current language models? In concert with others (32), we argue that part of the solution will come from treating language as part of a larger system for understanding and communicating.

Situations. We adopt the perspective that the targets of understanding are situations. Situations are collections of entities, their properties and relations, and patterns of change in them. A situation can be static (e.g., a cat on a mat). Situations include events (e.g., a boy hitting a ball). Situations can embed within each other; the cat may be on a mat inside a house on a particular street in a particular town, and the same applies to events like baseball games. A situation can be conceptual, social or legal, such as one where a court invalidates a law. A situation may even be imaginary. The entities participating in a situation or event may be real or fictitious physical objects or locations; animals, persons, groups or organizations; beliefs or other states of mind; sets of objects (e.g., all dogs); symbolic objects such as symbols, tokens or words; or even contracts, laws, or theories. Situations can even involve changes in beliefs about relationships among classes of objects (e.g., biologists' beliefs about the genus a species of trees belongs in).

What it means for an agent to understand a situation is to construct a representation of it that captures aspects of the participating objects, their properties, relationships and interactions, and resulting outcomes. We emphasize that the understanding should be thought of as a construal or interpretation that may be incomplete or inaccurate. The construal will depend on the culture and context of the agent and the agent's purpose. When other agents are the source of the input, the target agent's construal of the knowledge and purpose of these other agents also plays an important role. As such, the construal process must be considered to be completely open ended and to potentially involve interaction between the construer and the situation, including exploration of the world and discourse between the agent and participating interlocutors.

Within this construal of understanding, we emphasize that language should be seen as a component of an understanding system. This idea is not new, but historically it was not universally accepted. Chomsky, Fodor and others (8, 33, 34) argued that grammatical knowledge sits in a distinct, encapsulated subsystem. Our proposal to focus on language as part of a system representing situations builds on a long tradition in linguistics (35), human memory research (36), psycholinguistics (37), philosophy (38) and artificial intelligence (39).

The approach was adopted in an early neural network model (40) and aligns with other current perspectives in cognitive neuroscience (41) and artificial intelligence (32).

People construct situation representations. A person processing language constructs a representation of the described situation in real time, using both the stream of words and other available information. Words and their sequencing serve as clues to meaning (42) that jointly constrain the understanding of the situation (40). Consider this passage:

    John spread jam on some bread. The knife had been dipped in poison.

We make many inferences: the jam was spread with the poisoned knife and poison has been transferred to the bread. If John eats it he may die! Note the entities are objects, not words, and the situation could be conveyed by a silent movie. Evidence that humans construct situation representations from language comes from classic work by Bransford and colleagues (36, 43). This work demonstrates that (1) we understand and remember texts better when we can relate the text to a familiar situation; (2) relevant information can come from a picture accompanying the text; (3) what we remember from a text depends on the framing context; (4) we represent objects in memory that were not explicitly mentioned; and (5) after hearing a sentence describing spatial or conceptual relationships, we remember these relationships, not the language itself. For example, given Two frogs rested beside a floating log and a fish swam under it, the situation changes if it is replaced by them. After hearing the original sentence, people reject the variant with it in it as the sentence they heard before, but if the initial sentence said the frogs rested on the log, the situation is unchanged by replacing it with them, and people accept this variant.

Evidence from eye movements shows that people use linguistic and non-linguistic input jointly and immediately (44). Just after hearing The man will drink ..., participants look at a full wine glass rather than an empty beer glass (45). After hearing The man drank, they look at the empty beer glass. Understanding thus involves constructing, in real time, a representation conveyed jointly by vision and language.

Fig. 4. Sketch of the brain's understanding system. Ovals in the blue box stand for neocortical brain areas representing different kinds of information. Arrows in the neocortex stand for connections allowing representations to constrain each other. The medial temporal lobe (red box) stores an integrated representation of the neocortical system state arising from an experience. The red arrow represents fast-learning connections that store this pattern for later reactivation and use. Green arrows stand for gradually learned connections supporting bidirectional influence between the MTL and neocortex. (A) and (B) are two example inputs discussed in the main text.

The compositionality of situations. An important debate in cognitive science and AI centers on compositionality. Fodor and Pylyshyn (8) argued that our cognitive systems must be compositional by design to allow language to express arbitrary relationships, and noted that early neural network models failed tests of compositionality. Such failures are still reported (46), leading some to propose building compositionality in (10); yet, as we have seen, the most successful language models avoid doing so. We suggest that a focus on situations may enhance compositionality because situations are themselves compositional. Suppose a person picks an apple and gives it to someone. A small number of objects and persons are focally involved, and the effects on other persons and objects are likely to be local. A sentence like John picked an apple and gave it to Mary could describe this situation, capturing the most relevant participants and their relationships. We emphasize that compositionality is predominant and approximate, not universal or absolute, so it is best to allow for these matters of degree. Letting situation representations emerge through experience will help our models to achieve greater systematicity, while leaving them with the flexibility that has led to their successes to date.

Language informs us about situations. Situations ground the representations we construct from language; equally importantly, language informs us about situations. Language tells us about situations we have not witnessed and describes aspects that we cannot observe. Language also communicates folk or scientific construals that shape listeners' construals, such as the idea that an all-knowing being took six days to create the world or the idea that natural processes gave rise to our world and ultimately ourselves over billions of years. Language can be used to communicate information about properties that only arise in a social setting, such as ownership, or that have been identified by a culture as important, such as exact number. Language thus enriches and extends the information we have about situations and provides the primary medium conveying properties of many kinds of objects and many kinds of relationships.

Toward a Brain and AI Inspired Model of Understanding

Capturing the full range of situations is clearly a long-term challenge. We return to this later, focusing first on concrete situations involving animate beings and physical objects. We seek to integrate insights from cognitive neuroscience and artificial intelligence toward the goal of building an integrated understanding model. We start with our construal of the understanding system in the human brain and then sketch aspects of what an artificial implementation might look like.

McClelland et al.: Extending Machine Language Models toward Human-Level Language Understanding 4 The understanding system in the brain. Our construal of the fine-grained static situations and short time scale events (here- human integrated understanding system builds on the princi- after micro-situations) that can be conveyed by a sentence like ples of mutual constraint satisfaction and emergence and with the boy hit the ball with a bat. These context representations the idea that understanding centers on the construction of arise in a set of interconnected areas primarily within the pari- situation representations. It is consistent with a wide range of etal lobes (41, 47). In recent work, brain imaging data is used evidence, some of which we review, and is broadly consistent to analyze the time-varying patterns of neural activity while with recent characterizations in cognitive neuroscience (41, 47). processing a temporally extended narrative. The brain activity However, researchers hold diverse views about the details of patterns that represent scenes extending over tens of seconds these systems and how they work together. (e.g., a detective searching a suspect’s apartment for evidence) We focus first on the part of the system located primarily are largely the same, whether the information comes from in the neocortex of the brain, as schematized in the large watching a movie, hearing or reading a narrative description, blue box of Fig.4. Together with input and output systems, or recalling the movie after having seen it (52, 53). Activa- this allows a person to combine linguistic and visual input tions in different brain areas track information at different to understand the situation referred to upon hearing a sen- time scales. Activity in modality-specific areas associated with tence, such as one containing the word bat, while observing a speech and visual processing follows the moment-by-moment corresponding situation in the world. It is important to note time course of spoken and/or visual information while activ- that the neocortex is very richly structured, with on the order ity in the network associated with situation representations of 100 well-defined anatomical subdivisions. However, it is fluctuates on a much longer time scale. During processing common and useful to group these divisions into subsystems. of narrative information, activations in these regions tends The ones we focus on here are each indicated by a blue oval in to be relatively stable within a scene and punctuated with the figure. One subsystem subserves the formation of a visual larger changes at boundaries between these scenes, and these representation of the given situation, and another subserves patterns lose their coherence when the narrative structure the formation of an auditory representation capturing the is scrambled (41, 53). Larger-scale spatial transitions (e.g., spatiotemporal structure of the co-occurring spoken language. transitions between rooms) also create large changes in neural The three ovals above these provide representations of more activity (47). integrative/abstract types of information (see below). Within each subsystem, and between each connected pair The role of language. Where in the brain should we look for of subsystems, the neurons are reciprocally interconnected via representations of the relations among the objects participating learning-dependent pathways allowing mutual constraint satis- in a micro-situation? 
The effects of brain damage suggest that faction among all of the elements of each of the representation representations of relations may be integrated, at least in part, types, as indicated by the looping blue arrows from each oval with the representation of language itself. Injuries affecting to itself and by the bi-directional blue arrows between these the lateral surface of the frontal and parietal lobes produce ovals. Brain regions for representing visual and auditory inputs profound deficits in the production of fluent language, but are well-established, and the evidence for their involvement in can leave the ability to read and understand concrete nouns a mutual constraint satisfaction process with more integrative largely intact. Such lesions produce intermediate degrees brain areas has been reviewed elsewhere (48, 49). Here we of impairment to abstract nouns, verbs, and modifiers, and consider the three more integrative subsystems. profound impairment to words like if and by that capture grammatical and spatial relations (54). This striking pattern Object representations. A brain area near the front of the tempo- is consistent with the view that language itself is intimately ral lobe houses neurons whose activity provides an embedding tied to the representation of relations and changes in relations capturing the properties of objects (50). Damage to this area (information conveyed by verbs, prepositions, and grammatical impairs the ability to name objects, to grasp them correctly markers). Indeed, the frontal and parietal lobes are associated for their intended use, to match them with their names or with representation of space and action (which causes change in the sounds they make, and to pair objects that go together, relations), and patients with lesions to the frontal and parietal either from their names or from pictures. This brain area is language-related areas have profound deficits in relational itself an inter-modal area, receiving visual, language and other reasoning tasks (55). We therefore tentatively suggest that the information about objects such as the sounds they make and understanding of micro-situations depends jointly on the object how they feel to the touch. Models capturing these findings and language systems, and that the language is intimately link (51) treat this area as the hidden layer of an interactive, re- to representation of spatial relationships and actions. current network with bidirectional connections to other layers representing different types of object properties, including the Complementary learning systems. The brain systems described object’s name. In these models, an input to any of these above support understanding of situations that draw on gen- other layers activates the corresponding pattern in the hidden eral knowledge as well as oft-repeated personal knowledge, but layer, which in turn activates the corresponding patterns in they do not support the formation of new that can the other layers. This supports, for example, the ability to be accessed and used at an arbitrary later time. This ability produce the name of an object from visual input. Damage depends on structures that include the hippocampus in the me- (simulated by removing neurons in the hidden layer) degrades dial temporal lobes (MTL; Fig.4, red box). While these areas the model’s representations, capturing the patterns of errors are critical for new learning, damage to them does not affect made by patients with the condition. 
general knowledge, acquired skills, or the ability to process language and other forms of input to understand a situation— Representation of context. There is a network of areas in the except when this depends on remote information experienced brain that capture the large-scale spatiotemporal context of briefly outside the immediate current context (56). These

These findings are captured in the neural-network based complementary learning systems (CLS) theory (57–59), which holds that connections within the neocortex acquire the knowledge that allows a human to understand objects and their properties, to link words and objects, and to understand and communicate about generic situations as these are conveyed through language and other forms of experience. According to this theory, the learning process is necessarily gradual, allowing it to capture the nuanced statistical structure of experience. The MTL provides a complementary fast-learning system supporting the formation of new arbitrary associations, linking the elements of an experience together, including the objects and language encountered in a situation and the co-occurring spatiotemporal context, as might arise in the situation depicted in Fig. 4B, where a person encounters the word numbat from both visual and language input.

It is generally accepted that knowledge that depends initially on the MTL can be integrated into the neocortex through a consolidation process (56). In CLS (58), the neocortex learns arbitrary new information gradually through interleaved presentations of new and familiar items, weaving it into the fabric of knowledge and avoiding interference with existing knowledge (60). The details are subjects of current debate and ongoing investigation (61).

As in our example of the beer John left in the cooler, understanding often depends on remote information. People exploit such information during language processing (62), and patients with MTL damage have difficulty understanding or producing extended narratives (63); they are also profoundly impaired in learning new words for later use (64). Neural language models, including those using QBA, also lack these capabilities. In BERT and related models, bi-directional attention operates within a span of a couple of sentences at a time. Other models (28) employ QBA over longer spans of prior context, but there is still a finite limit. These models learn gradually like the human neocortical system, allowing them to capture structure in experience and acquire knowledge of word meaning. GPT-3 (28) is impressive in its ability to use a word encountered for the first time within its contextual span appropriately in a subsequent sentence, but this information is lost forever when the context is re-initialized, as it would be in a patient without the medial temporal lobe. Including an MTL-like system in future understanding models would address this limitation.

The brain's complementary learning systems may provide a means to address the challenge of learning to use a word encountered in a single context appropriately across a wide range of contexts. Deep neural networks that learn gradually through many repetitions of an item that occurs in a single context, interleaved with presentations of other items occurring in a diversity of contexts, do not show this ability (46). We attribute this failure to the fact that the distribution of training experiences they receive conveys the information that the target item is in fact restricted to its single context. Further research should explore whether augmenting a model like GPT-3 with an MTL-like memory would enable more human-like extension of a novel word encountered just once to a wider range of contexts.

Next steps toward a brain and AI-inspired model. Given the construal we have described of the human understanding system, we now ask, what might an implementation of a model consistent with it be like? This is a long-term research question – addressing it will benefit from a greater convergence of cognitive neuroscience and AI. Toward this goal, we sketch a proposal for a brain and AI-inspired model. We rely on the principles of mutual constraint satisfaction and emergence, the query-based attention architecture from AI and deep learning, and the components and their interconnections in the understanding system in the brain, as illustrated in Fig. 4.

We treat the system as one that receives sequences of picture-description (PD) pairs grouped into episodes that in turn form story-length narratives, with the language conveyed by text rather than speech. Our sketch leaves important issues open and will require substantial development. Steps toward addressing some of these issues are already being taken: mutual QBA is already being used, e.g., in (65), to exploit audio and visual information from movies.

In our proposed model, each PD pair is processed by interacting object and language subsystems, receiving visual and text input at the same time. Each subsystem must learn to restore missing or distorted elements (words in the language subsystem, objects in the object subsystem) by using mutual QBA as in BERT, allowing each element in each subsystem to be constrained by all of the elements in both subsystems. Additionally, these systems will query the context and memory subsystems, as described below.

The context subsystem encodes a sequence of compressed representations of the previous PD pairs within the current episode. Processing in this subsystem would be sequential over pairs, allowing the network constructing the current compressed representation to query the representations of past pairs within the episode. Within the processing of a pair, the context system would engage in mutual QBA with the object and language subsystems, allowing the language and object subsystems to indirectly exploit information from past pairs within the episode.

Our system also includes an MTL-like memory to allow it to use remote information beyond the current episode. A neural network with learned connection weights constructs a reduced description of the state of the object, language, and context modules along with their inputs from vision and text after processing each PD pair. Building on existing artificial systems with external memory (66, 67), this compressed vector is stored in a slot in an external memory. These states are then accessible to the cortical subsystems via QBA. The system would employ a flexible querying scheme, such that any subset of the object, language, or context representations of an input currently being processed would contribute to accessing relevant MTL representations. Their contents would then be available to all of the cortical subsystems using QBA. Thus, the appearance of a numbat in a visual scene would retrieve the corresponding language and context information containing its name and the fact that it eats termites, based on prior storage of the compressed representation formed previously from the inputs in Fig. 4B. A minimal code sketch of such a memory appears at the end of this section.

Ultimately, our model may benefit from a subsystem that guides processing in all of the subsystems we have described. The brain has such a system in its frontal lobes; damage to this system leads to impairments in guiding behavior and cognition according to current task demands, and others advocate including such a system in neural AI systems (68). We leave it to future work to consider how to integrate such a subsystem into the model we have described here.
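To make the MTL-like memory concrete, here is a minimal Python/NumPy sketch under our own assumptions: the slot count, vector sizes, encoding, and the MTLMemory name are illustrative choices of ours, not specifications from the proposal above. Compressed state vectors are written to slots by fast one-shot storage and later retrieved by QBA from a partial cue, as in the numbat example:

```python
import numpy as np

class MTLMemory:
    """Slot-based external memory accessed by query-based attention (QBA)."""

    def __init__(self, dim):
        self.dim = dim
        self.slots = []                  # one compressed state vector per experience

    def store(self, state_vec):
        """Fast learning: write a compressed system state to a new slot."""
        self.slots.append(state_vec / np.linalg.norm(state_vec))

    def retrieve(self, cue, g=10.0):
        """QBA retrieval: a partial cue (e.g., only the visual part of a
        state) attends over stored slots; returns their weighted sum."""
        mem = np.stack(self.slots)
        sims = mem @ cue / np.linalg.norm(cue)   # cosine similarities
        ws = np.exp(g * sims); ws /= ws.sum()    # softmax weightings
        return ws @ mem                          # blended retrieved state

# A "state" here is a concatenation of object, language, and context parts.
rng = np.random.default_rng(1)
memory = MTLMemory(dim=12)
numbat_state = rng.normal(size=12)   # formed when seeing/reading about a numbat
memory.store(numbat_state)
memory.store(rng.normal(size=12))    # other, unrelated experiences
memory.store(rng.normal(size=12))

# Later, the visual appearance alone (first 4 components) cues retrieval:
cue = np.zeros(12); cue[:4] = numbat_state[:4]
retrieved = memory.retrieve(cue)
# The retrieved vector is typically dominated by the stored numbat state,
# making its linguistic content available to the cortical subsystems.
print(np.corrcoef(retrieved, numbat_state / np.linalg.norm(numbat_state))[0, 1])
```

Because retrieval is content-addressed rather than position-addressed, any subset of the state can serve as the cue, matching the flexible querying scheme described above.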

Enhancing understanding by incorporating interaction with the physical and social world. A complete model of the human understanding system will require integration of many additional information sources. These include sounds, touch and force-sensing, and information about one's own actions. Every source provides opportunities to predict information of each type, relying on every other type. Information salient in one source can bootstrap learning and inference in the other, and all are likely to contribute to enhancing compositionality and addressing the data-inefficiency of learning from language alone. This affords the human learner an important opportunity to experience the compositional structure of the environment through their own actions. Ultimately, an ability to link one's actions to their consequences as one behaves in the world should contribute to the emergence of, and appreciation for, the compositional structure of events, and provide a basis for acquiring notions of cause and effect, of agency, and of object permanence (69).

These considerations motivate recent work on agent-based language learning in simulated interactive 3D environments (70–73). In (74), an agent was trained to identify, lift, carry and place objects relative to other objects in a virtual room, as specified by simplified language instructions. At each time step, the agent received a first-person visual observation and processed the pixels to obtain a representation of the scene. This was concatenated to the final state of an LSTM that processed the instruction, and then passed to an integrative LSTM whose output was used to select a motor action. The agent gradually learned to follow instructions of the form find a pencil, lift up a basketball and put the teddy bear on the bed, encompassing 50 objects and requiring up to 70 action steps. Such instructions require constructing representations based on language stimuli that support identification of objects and relations across space and time and the integration of this information to inform motor behaviors.

Importantly, without building in explicit object representations, the learned system was able to interpret novel instructions. For instance, an agent trained to lift each of 20 objects, but only trained to put 10 of those in a specific location, could place the remaining objects in the same location on command with over 90% accuracy, demonstrating a degree of compositionality in its behavior. Notably, the agent's ego-centric, multimodal and temporally-extended experience contributed to this outcome; both an alternative agent with a fixed perspective on a 2D grid world and a static neural network classifier that received only individual still images exhibited significantly worse generalization. This underscores how affording neural networks access to rich, multi-modal interactive environments can stimulate the development of capacities that are essential for language learning, and contribute toward emergent compositionality.

Beyond concrete situations. How might our approach be extended beyond concrete situations to those involving relationships among objects like laws, belief systems, and scientific theories? Basic word embeddings themselves capture some abstract relations via vector similarity, e.g., encoding that justice is closer to law than peanut. Words are uttered in real world contexts and there is a continuum between grounding and language-based linking for different words and different uses of words. For example, career is not only linked to other abstract words like work and specialization but also to more concrete ones like path and its extended metaphorical use as the means to achieve goals (75). Embodied, simulation-based approaches to meaning (76, 77) build on this observation to bridge from concrete to abstract situations via metaphor. They posit that understanding words like grasp is linked to neural representations of the action of grabbing and that this circuitry is recruited for understanding contexts such as grasping an idea. We consider situated agents as a critical catalyst for learning about how to represent and compose meanings pertaining to spatial, physical and other perceptually immediate phenomena—thereby providing a grounded edifice that can connect both to brain circuitry for motor action and to representations derived primarily from language.

Conclusion

Language does not stand alone. The understanding system in the brain connects language to representations of objects and situations and enhances language understanding by exploiting the full range of our multi-sensory experience of the world, our representations of our motor actions, and our memory of previous situations. We believe next generation language understanding systems should emulate this system, and we have sketched an approach that incorporates recent machine learning breakthroughs to build a jointly brain and AI inspired understanding system. We emphasize understanding of concrete situations and argue that understanding abstract language should build upon this foundation, pointing toward the possibility of one day building artificial systems that understand abstract situations far beyond concrete, here-and-now situations. In sum, combining insights from neuroscience and AI will take us closer to human-level language understanding.

ACKNOWLEDGMENTS. This article grew out of a workshop organized by HS at Meaning in Context 3, Stanford University, September 2017. We thank Janice Chen, Chris Potts, and Mark Seidenberg for discussion. HS was supported by ERC Advanced Grant #740516.

1. Rosenblatt F (1961) Principles of neurodynamics. Perceptrons and the theory of brain mechanisms. (Spartan Books, Cornell University, Ithaca, New York).
2. Rumelhart DE, McClelland JL, the PDP research group (1986) Parallel Distributed Processing. Explorations in the Microstructure of Cognition. Volume 1: Foundations. (MIT Press, Cambridge, MA).
3. Rumelhart DE, McClelland JL (1986) On learning the past tenses of English verbs in Parallel Distributed Processing. Explorations in the Microstructure of Cognition. Volume 2: Psychological and Biological Models, eds. McClelland JL, Rumelhart DE, the PDP Research Group. (MIT Press, Cambridge, MA), pp. 216–271.
4. Pinker S, Mehler J (1988) Connections and symbols. (MIT Press, Cambridge, MA).
5. MacWhinney B, Leinbach J (1991) Implementations are not conceptualizations: Revising the verb learning model. Cognition 40(1-2):121–157.
6. Bybee J, McClelland JL (2005) Alternatives to the combinatorial paradigm of linguistic theory based on domain general principles of human cognition. The Linguistic Review 22(2-4):381–410.
7. Seidenberg MS, Plaut DC (2014) Quasiregularity and its discontents: The legacy of the past tense debate. Cognitive Science 38(6):1190–1228.
8. Fodor JA, Pylyshyn ZW, et al. (1988) Connectionism and cognitive architecture: A critical analysis. Cognition 28(1-2):3–71.
9. Marcus G (2001) The algebraic mind. (MIT Press, Cambridge, MA).
10. Lake BM, Ullman TD, Tenenbaum JB, Gershman SJ (2017) Building machines that learn and think like people. Behavioral and Brain Sciences 40:e253.
11. Rumelhart DE (1977) Toward an interactive model of reading in Attention & Performance VI, ed. Dornic S. (LEA, Hillsdale, NJ), pp. 573–603.
12. Taraban R, McClelland JL (1988) Constituent attachment and thematic role assignment in sentence processing: Influences of content-based expectations. Journal of Memory and Language 27(6):597–632.
13. McClelland JL, Rumelhart DE (1981) An interactive activation model of context effects in letter perception: I. An account of basic findings. Psychological Review 88(5):375.
14. Elman JL (1990) Finding structure in time. Cognitive Science 14:179–211.
15. Gold EM (1967) Language identification in the limit. Information and Control 10(5):447–474.
16. Elman JL (1991) Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning 7(2/3):195–225.
17. Manning CD, Schütze H (1999) Foundations of Statistical Natural Language Processing. (MIT Press, Cambridge, MA).

18. Collobert R, et al. (2011) Natural language processing (almost) from scratch. Journal of Machine Learning Research 12(Aug):2493–2537.
19. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality in Proceedings of Neural Information Processing Systems. pp. 3111–3119.
20. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Computation 9(8):1735–1780.
21. Bahdanau D, Cho K, Bengio Y (2015) Neural machine translation by jointly learning to align and translate in 3rd International Conference on Learning Representations, ICLR 2015.
22. Wu Y, et al. (2016) Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR abs/1609.08144.
23. Lewis-Kraus G (2016) The great AI awakening. The New York Times Magazine. Published on-line December 14, 2016; accessed May 23, 2020.
24. Manning CD, Clark K, Hewitt J, Khandelwal U, Levy O (2020) Emergent linguistic structure in artificial neural networks trained by self-supervision. Proceedings of the National Academy of Sciences.
25. Devlin J, Chang MW, Lee K, Toutanova K (2019) BERT: Pre-training of deep bidirectional transformers for language understanding in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171–4186.
26. Vaswani A, et al. (2017) Attention is all you need in Advances in Neural Information Processing Systems 30. (Curran Associates, Inc.), pp. 5998–6008.
27. Linzen T, Baroni M (2020) Syntactic structure from deep learning. Annual Review of Linguistics 2020.
28. Brown TB, et al. (2020) Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
29. Warren RM (1970) Perceptual restoration of missing speech sounds. Science 167(3917):392–393.
30. Yang Z, et al. (2019) XLNet: Generalized autoregressive pretraining for language understanding. CoRR abs/1906.08237.
31. Nie Y, et al. (2019) Adversarial NLI: A new benchmark for natural language understanding. arXiv preprint arXiv:1910.14599.
32. Bisk Y, et al. (2020) Experience grounds language. arXiv preprint arXiv:2004.10151.
33. Chomsky N (1971) Deep structure, surface structure, and semantic interpretation in Semantics: An interdisciplinary reader in philosophy, linguistics, and psychology, eds. Steinberg D, Jakobovits LA. (Cambridge University Press, Cambridge), pp. 183–216.
34. Fodor JA (1983) The Modularity of Mind. (MIT Press).
35. Lakoff G (1987) Women, Fire, and Dangerous Things. (The University of Chicago Press).
36. Bransford JD, Johnson MK (1972) Contextual prerequisites for understanding: Some investigations of comprehension and recall. Journal of Verbal Learning and Verbal Behavior 11(6):717–726.
37. Crain S, Steedman M (1985) Context and the psychological syntax processor in Natural Language Parsing, eds. Dowty, Karttunen, Zwicky. (Cambridge University Press).
38. Montague R (1973) The proper treatment of quantification in ordinary English in Approaches to natural language: Proceedings of the 1970 Stanford workshop on grammar and semantics, eds. Hintikka J, Moravcsik J, Suppes P. (Riedel, Dordrecht), pp. 221–242.
39. Schank RC (1983) Dynamic memory: A theory of reminding and learning in computers and people. (Cambridge University Press).
40. St. John MF, McClelland JL (1990) Learning and applying contextual constraints in sentence comprehension. Artificial Intelligence 46(1):217–257.
41. Hasson U, Egidi G, Marelli M, Willems RM (2018) Grounding the neurobiology of language in first principles: The necessity of non-language-centric explanations for language comprehension. Cognition 180:135–157.
42. Rumelhart DE (1979) Some problems with the notion that words have literal meanings in Metaphor and Thought, ed. Ortony A. (Cambridge Univ. Press, Cambridge, UK), pp. 71–82.
43. Barclay J, Bransford JD, Franks JJ, McCarrell NS, Nitsch K (1974) Comprehension and semantic flexibility. Journal of Verbal Learning and Verbal Behavior 13(4):471–481.
44. Tanenhaus M, Spivey-Knowlton M, Eberhard K, Sedivy J (1995) Integration of visual and linguistic information in spoken language comprehension. Science 268(5217):1632–1634.
45. Altmann GT, Kamide Y (2007) The real-time mediation of visual attention by language and world knowledge: Linking anticipatory (and other) eye movements to linguistic processing. Journal of Memory and Language 57(4):502–518.
46. Lake B, Baroni M (2018) Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks in International Conference on Machine Learning. pp. 2873–2882.
47. Ranganath C, Ritchey M (2012) Two cortical systems for memory-guided behaviour. Nature Reviews Neuroscience 13:713.
48. McClelland JL, Mirman D, Bolger DJ, Khaitan P (2014) Interactive activation and mutual constraint satisfaction in perception and cognition. Cognitive Science 38(6):1139–1189.
49. Heilbron M, Richter D, Ekman M, Hagoort P, de Lange FP (2020) Word contexts enhance the neural representation of individual letters in early visual cortex. Nature Communications 11(1):1–11.
50. Patterson K, Nestor PJ, Rogers TT (2007) Where do you know what you know? The representation of semantic knowledge in the human brain. Nature Reviews Neuroscience 8(12):976.
51. Rogers TT, et al. (2004) Structure and deterioration of semantic memory: A neuropsychological and computational investigation. Psychological Review 111(1):205–235.
52. Zadbood A, Chen J, Leong Y, Norman K, Hasson U (2017) How we transmit memories to other brains: Constructing shared neural representations via communication. Cerebral Cortex 27(10):4988–5000.
53. Baldassano C, et al. (2017) Discovering event structure in continuous narrative perception and memory. Neuron 95(3):709–721.
54. Morton J, Patterson K (1980) Little words – No! in Deep Dyslexia. (Routledge and Kegan Paul, London), pp. 270–285.
55. Baldo JV, Bunge SA, Wilson SM, Dronkers NF (2010) Is relational reasoning dependent on language? A voxel-based lesion symptom mapping study. Brain and Language 113(2):59–64.
56. Milner B, Corkin S, Teuber HL (1968) Further analysis of the hippocampal amnesic syndrome: 14-year follow-up study of H.M. Neuropsychologia 6(3):215–234.
57. Marr D (1971) Simple memory: A theory for archicortex. Philosophical Transactions of the Royal Society of London B: Biological Sciences 262(841):23–81.
58. McClelland JL, McNaughton BL, O'Reilly RC (1995) Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory. Psychological Review 102(3):419–457.
59. Kumaran D, Hassabis D, McClelland JL (2016) What learning systems do intelligent agents need? Complementary learning systems theory updated. Trends in Cognitive Sciences 20(7):512–534.
60. McClelland JL, McNaughton BL, Lampinen AK (2020) Integration of new information in memory: New insights from a complementary learning systems perspective. Philosophical Transactions of the Royal Society B 375(1799):20190637.
61. Yonelinas A, Ranganath C, Ekstrom A, Wiltgen B (2019) A contextual binding theory of episodic memory: Systems consolidation reconsidered. Nature Reviews Neuroscience 20:364–375.
62. Menenti L, Petersson KM, Scheeringa R, Hagoort P (2009) When elephants fly: Differential sensitivity of right and left inferior frontal gyri to discourse and world knowledge. Journal of Cognitive Neuroscience 21(12):2358–2368.
63. Zuo X, et al. (2020) Temporal integration of narrative information in a hippocampal amnesic patient. NeuroImage 213:116658.
64. Gabrieli JD, Cohen NJ, Corkin S (1988) The impaired learning of semantic knowledge following bilateral medial temporal-lobe resection. Brain and Cognition 7(2):157–177.
65. Sun C, Baradel F, Murphy K, Schmid C (2019) Contrastive bidirectional transformer for temporal representation learning. CoRR arXiv:1906.05743.
66. Weston J, Chopra S, Bordes A (2015) Memory networks in International Conference on Learning Representations.
67. Graves A, et al. (2016) Hybrid computing using a neural network with dynamic external memory. Nature 538(7626):471–476.
68. Russin J, O'Reilly RC, Bengio Y (2020) Deep learning needs a prefrontal cortex. ICLR Workshop on Bridging AI and Cognitive Science.
69. Piaget J (1952) The Origins of Intelligence in Children (M. Cook, Trans.). New York, NY, US.
70. Hermann KM, et al. (2017) Grounded language learning in a simulated 3D world. arXiv preprint arXiv:1706.06551.
71. Das R, Zaheer M, Reddy S, McCallum A (2017) Question answering on knowledge bases and text using universal schema and memory networks in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. pp. 358–365.
72. Chaplot DS, Sathyendra KM, Pasumarthi RK, Rajagopal D, Salakhutdinov R (2018) Gated-attention architectures for task-oriented language grounding in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), the 30th Innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, eds. McIlraith SA, Weinberger KQ. (AAAI Press), pp. 2819–2826.
73. Oh J, Singh S, Lee H, Kohli P (2017) Zero-shot task generalization with multi-task deep reinforcement learning in Proceedings of the 34th International Conference on Machine Learning, Volume 70. (JMLR.org), pp. 2661–2670.
74. Hill F, et al. (2020) Environmental drivers of systematicity and generalization in a situated agent in International Conference on Learning Representations, ICLR.
75. Bryson J (2008) Embodiment vs. memetics. Mind and Society 7(1):77–94.
76. Lakoff G, Johnson M (1980) Metaphors We Live By. (University of Chicago Press, Chicago, IL).
77. Feldman J, Narayanan S (2004) Embodied meaning in a neural theory of language. Brain and Language 89:385–392.
