Computational approaches to mind and brain:

How language is profoundly shaped by its neural substrate

Paul Smolensky

Cognitive Science Department

Faculty of Human Sciences, Macquarie University, 9 Dec 2014

Cognitive Science comes of age

Cognitive Science: Characterize cognition – the function of the brain – precisely

A very exciting time for Cogsci: transitioning to a hard science!

✦ critical to technology that changes the world
✦ critical to an established hard science

Artificial Intelligence — AI

Watson 2010

Watson: IBM’s AI system beat 2 all-time human champions in the classic American quiz-game show Jeopardy! (2011).

Input: English text; output: English speech.
Example clue: “William Wilkinson’s ‘An Account of the Principalities of Wallachia and Moldavia’ inspired this author's most famous novel.” Watson: “Who is Bram Stoker?”

2,880 POWER7 processor cores

16 TB RAM [$3M]

500 GB ~ 1M books/sec


Deep Neural Network (DNN) capabilities, 12/2014:
✦ Handwritten Chinese character recognition (3,755 characters): nearly human-level accuracy (human error rate 3.87% vs. 4.21% for the DNN)
✦ DeepFace: 13,323 web photos of 5,749 celebrities; photo pairs: same or different? 97.35% correct [Facebook AI Research; Taigman et al. 2014]

✦ Google uses DNNs in its Android speech recognizer, Google Goggles, Google+ photo search service, ...
✦ Microsoft: speech recognition service

“Nvidia builds CUDA GPU programming library for machine learning – so you don't have to … Craft a deep neural network on a graphics chipset”



Today’s topics:
Cognitive Science
  ➊ 1950 Symbolic models (language)
  ➋ 1980 Neural models (no symbols)
  ➌ 1990 Neural models (with symbols)
  ➍ Optimality-based grammars
  ➎ 2010 Gradient Symbolic Computing
Artificial Intelligence — AI
  2010 Watson (IBM)
  2010 Deep Neural Networks
Brain reading
  2000 gross functions
  2010 single objects
  2020 complex events



Symbolic cognitive models 1: Reasoning

The notion at the heart of symbolic cognitive modeling has a long and distinguished pedigree, reaching at least as far back as Aristotle (4th c. BCE).
◆ If I tell you, “it’s a law of science that if p is true, then q is also true,” and then I tell you, “it so happens that p is true,” then you can respond: “then q is true.” But you have no idea what p and q ARE. How can you know q is true?
It’s amazing that we can understand “if p is true, then q is also true” without knowing anything about p and q — understand it well enough to reason with.
Humans are symbolic animals. Our minds eat symbols for breakfast, lunch & dinner. So do computers. Can we put our mental symbolic world into a computer? That is the objective of Symbolic Artificial Intelligence — Symbolic AI. Its success is part of what brought us Watson.
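To see how little machinery the inference itself requires, here is a toy sketch of symbol manipulation in this spirit (purely illustrative, not a fragment of any actual symbolic-AI system):

```python
# A minimal sketch of symbol manipulation in the spirit of modus ponens: the inference
# goes through without the system "knowing" anything about what p and q stand for.
rules = [("p", "q")]        # "if p is true, then q is also true"
facts = {"p"}               # "it so happens that p is true"

def forward_chain(facts, rules):
    """Repeatedly apply if-then rules to the known facts until nothing new follows."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for antecedent, consequent in rules:
            if antecedent in derived and consequent not in derived:
                derived.add(consequent)
                changed = True
    return derived

print(forward_chain(facts, rules))   # {'p', 'q'} -- "then q is true"
```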

Symbolic cognitive models 2: Language (grammar)

In language, as in reasoning, the historical foundation of symbolic cognitive modeling is deep. As far back as the Sanskrit grammarian Pāṇini (पाणिनि) in ancient India (4th c. BCE), symbolic rules have been used to state grammars with mathematical precision. (He had 3,959 rules just for constructing words.) In the modern era (since 1950), a primary founder of the symbolic theory of grammar is Noam Chomsky. In this approach, a grammar is a machine in a speaker’s mind that manipulates symbols like John, Noun, Verb Phrase, … to literally construct the sentences we produce as speakers. The following is a concrete example from the part of grammar linguists call phonology: here, the rules govern how basic speech sounds like a, m, k, … can be assembled into possible words.


Nerd slide — Symbolic grammar: phonology of Lardil

These (few) gray slides make more technical points that are not critical for following the remainder of the talk.


Lardil: How to say ‘wooden axe’?

If an object of the verb (“make an axe”): muŋkumuŋkun = muŋkumuŋku-n (stem-suffix).
If a subject of the verb (“the axe broke”): muŋkumu.
From what general knowledge can the speaker compute this, and all the other forms?

The subject form, derived by a sequence of symbol-manipulating rules (à la Wilkinson 1983):
  1. Syllable-construction rules parse the stem into syllables (Onset, Nucleus, Coda): muŋ.ku.muŋ.ku
  2. Subject rule: delete the final Vowel → muŋkumuŋk
  3. Syllable-destructing rule (no Nucleus, no σ): the final k is left unsyllabified
  4. Delete unsyllabified segments → muŋkumuŋ
  5. Syllable-destructing rule (ŋ in Coda only if following k): the final ŋ is left unsyllabified
  6. Delete unsyllabified segments → muŋkumu
A grammar: a sequence of rules that manipulate symbols.
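As a sketch of what “a sequence of rules that manipulate symbols” means computationally, here is a toy string-rewriting version of the derivation above; the syllable structure is collapsed into simple conditions on the final segment, and only the steps needed for this one stem are modeled (an illustration, not the actual rule system):

```python
# A toy sketch: the Lardil subject form derived as a sequence of symbolic rewrites.
VOWELS = set("aiu")

def subject_form(stem):
    form = stem
    # Subject rule: delete the final vowel.
    if form[-1] in VOWELS:
        form = form[:-1]
    # Syllable-destructing rule + deletion of unsyllabified segments:
    # a final k left without a nucleus is deleted.
    if form.endswith("k"):
        form = form[:-1]
    # Coda condition (ŋ in coda only if following k): a now-final ŋ is deleted too.
    if form.endswith("ŋ"):
        form = form[:-1]
    return form

print(subject_form("muŋkumuŋku"))   # -> muŋkumu
```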

Watson’s inputs and outputs are in English: the success of symbolic linguistic theories was also important in bringing us Watson.



Neural models

Box 1. An example connectionist network

Output: Pronunciation (segment string, 61 units): p, f, ts, …, ey, …

Hidden layer (100 units)

Plaut, McClelland, Seidenberg & Patterson (1996), Psychological Review

Input: Orthography (letter string, 105 units): P, H, PH, …, EA, …, CH; grouped into Onset, Vowel, Coda

Used to explain detailed behaviors observed in the laboratory, showing, e.g., that words are produced more slowly when there are a number of other frequent words with similar spelling but dissimilar pronunciations: GOWN (vs. GROWN, BLOWN, ...) is slower than MINT (vs. PINT).

This network (Plaut et al. 1996) learns to pronounce monosyllabic English words. In brief (and simplifying somewhat), the network works as follows. The spelling of a word is presented to the network by setting (clamping) the values of its 105 input units, which are divided into three groups. The first group represents the orthographic onset of the word: the consonants preceding the first vowel-letter. For each letter sequence spelling a single sound (a grapheme) that can appear in an English onset, there is a corresponding unit. So if the first two letters are PH, the activation value of the PH grapheme unit is set to 1 (active), as are the activations of the P and the H units. The second group of input units provides a corresponding graphemic representation of the first vowel sequence, and the third input group, a corresponding representation of the orthographic coda, the final portion of the word (primarily consonants, but also possibly including a final E). The layer of input units supports a local representation of individual letters and a (modestly) distributed representation of letter strings: the representation of PHONE involves the activity of multiple units, each of which participates in the representation of multiple strings. Once the input activation values—constituting the input vector—have been clamped, activation flows from them up to a layer of 100 hidden units; this layer functions as internal scratch paper for the network’s calculations. Activation then flows up from the hidden units to the output units, which host a representation of a pronunciation for the word. The output representation is similar to the input representation, except that in the place of letters there are units of speech, segments (e.g., the sound spelled CH in CHURCH).
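As an illustration of the kind of input encoding the box describes, here is a toy sketch; the unit inventory is an invented miniature, not the actual 105 input units of Plaut et al. (1996):

```python
# A toy sketch of grapheme-based input clamping: every onset, vowel, or coda grapheme
# present in the spelling has its unit set to 1 (the inventories below are invented).
ONSET_UNITS = ["P", "H", "PH", "T", "S", "SH"]
VOWEL_UNITS = ["A", "E", "O", "EA", "O_E"]
CODA_UNITS  = ["N", "NE", "T", "NT"]

def clamp_input(onset, vowel, coda):
    """Return the clamped input vector for one monosyllable, given its grapheme sets."""
    groups = [(ONSET_UNITS, onset), (VOWEL_UNITS, vowel), (CODA_UNITS, coda)]
    return [1.0 if unit in graphemes else 0.0
            for inventory, graphemes in groups
            for unit in inventory]

# E.g., a word whose spelling begins with PH: the PH unit, and also P and H, are active.
x = clamp_input(onset={"P", "H", "PH"}, vowel={"O_E"}, coda={"N", "NE"})
print(x)
```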


Deep Learning (DL)

Basic learning technique: Back-propagation — supervised
• Activation flows from input layer to output layer
• Error values flow from output layer to input layer
Increasing the network depth (number of hidden layers) can dramatically improve learning, given
• enough training data &
• enough computing power
(the only change since the 1980s!)
➤ Google trains networks with billions of connections
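For concreteness, a minimal back-propagation sketch for a tiny feed-forward network on a toy task (illustrative only; not the Plaut et al. model or any network mentioned in the talk):

```python
# A minimal sketch of supervised back-propagation: forward pass, error pass, weight update.
import numpy as np

rng = np.random.default_rng(0)

# Toy task: learn XOR with one hidden layer.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(0, 1, (2, 8))   # input -> hidden weights
W2 = rng.normal(0, 1, (8, 1))   # hidden -> output weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 1.0
for step in range(5000):
    # Forward pass: activation flows from the input layer to the output layer.
    H = sigmoid(X @ W1)
    out = sigmoid(H @ W2)

    # Backward pass: error values flow from the output layer back toward the input layer.
    err_out = (out - Y) * out * (1 - out)        # delta at the output layer
    err_hid = (err_out @ W2.T) * H * (1 - H)     # delta propagated to the hidden layer

    W2 -= lr * H.T @ err_out
    W1 -= lr * X.T @ err_hid

print(np.round(out, 2))   # typically approaches [[0], [1], [1], [0]]
```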



Deep Neural Networks: essentially back-prop. Brain reading: simple models that classify brain activation patterns into categories.
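To make “classify brain activation patterns into categories” concrete, here is a toy sketch with synthetic data standing in for fMRI voxel patterns (nothing here comes from an actual study):

```python
# A toy "brain reading" sketch: noisy synthetic activation patterns are assigned to one
# of two hypothetical mental states by a nearest-centroid (correlation) classifier.
import numpy as np

rng = np.random.default_rng(1)
n_voxels, n_train, n_test = 200, 40, 10

# Two hypothetical mental states, each with its own (unknown) mean activation pattern.
prototypes = {"imagine_tennis": rng.normal(0, 1, n_voxels),
              "imagine_navigation": rng.normal(0, 1, n_voxels)}

def sample(label, n):
    return prototypes[label] + rng.normal(0, 2.0, (n, n_voxels))   # noisy trials

# "Train": estimate a centroid per category from labeled scans.
centroids = {label: sample(label, n_train).mean(axis=0) for label in prototypes}

# "Test": classify new scans by correlation with each centroid.
correct = 0
for label in prototypes:
    for scan in sample(label, n_test):
        guess = max(centroids, key=lambda c: np.corrcoef(scan, centroids[c])[0, 1])
        correct += (guess == label)
print(f"accuracy: {correct / (2 * n_test):.2f}")
```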

Vegetative or locked-in?

An early diagnostic test (one patient; Adrian Owen, 2006, Science)
• Ask the patient, inside an MRI machine, personal questions to which the scientists do not know the answer
  ➤ do you have a brother?
• The patient is told to respond, when hearing “Answer!”:
  ➤ for yes, “imagine playing tennis”
  ➤ for no, “imagine navigating your apartment”
  and to stop responding when hearing “Relax!”
• Look for activity in either of two of the patient’s brain regions previously shown to associate with these two different mental activities
Results
• 5 of 6 questions: measurable response — all correct


✦ Human thought can cope intelligently with an unlimited set of situations: productivity.
✦ Human knowledge allows (i) a novel situation to be broken down into familiar parts, (ii) the parts to be processed separately, & (iii) the results reassembled.
☞ The combinatorial strategy for productivity.
✦ The ‘parts’ can be
  ‣ symbols arranged into structures
  ‣ distributed patterns of neural activity.
A mathematical isomorphism links levels in the human mind/brain: the high-level virtual symbolic computer and the low-level neural network computer.

Cognition is computation over combinatorial structures.

Combinatorial structure of activation patterns? (In the figures, darker = more active.)

Nerd slide — Combinatorial structure in activation patterns
[Figure: at the macro level Ⓜ, the symbol structure “Frodo lives”; at the micro level ⓜ, its activation pattern is the superposition (sum) of the patterns for its constituents.]

Nerd slide — Constituent construction
Tensor product representations: a filler pattern ⊗ a role pattern = the pattern for one filler/role constituent (e.g., the filler Frodo, f2, in role r3: f2⊗r3).
The isomorphism:
  Ⓜ  s = {fk/rk}   (a symbol structure is a set of filler/role bindings)
  ⇅
  ⓜ  s = Σk fk⊗rk  (its activation pattern is the superposition of the tensor products of filler and role patterns)
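A small numpy sketch of the tensor-product scheme just defined (the filler vectors are arbitrary stand-ins, and the role vectors are kept orthonormal so that unbinding is exact):

```python
# A minimal sketch of tensor-product representations: s = Σk fk ⊗ rk.
import numpy as np

rng = np.random.default_rng(0)
dim_f, n_roles = 8, 4

fillers = {name: rng.normal(0, 1, dim_f) for name in ["Frodo", "lives"]}
roles = {f"r{k}": np.eye(n_roles)[k] for k in range(n_roles)}   # orthonormal role vectors

def encode(bindings):
    """Superpose the tensor (outer) products of each filler/role pair."""
    return sum(np.outer(fillers[f], roles[r]) for f, r in bindings)

# The structure "Frodo lives" as the set of bindings {Frodo/r0, lives/r1}.
s = encode([("Frodo", "r0"), ("lives", "r1")])

# Unbinding: multiplying by a role vector recovers the filler bound to that role.
print(np.allclose(s @ roles["r0"], fillers["Frodo"]))   # True
print(np.allclose(s @ roles["r1"], fillers["lives"]))   # True
```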

Nerd slide — The isomorphism: Ⓜ symbolic virtual machine ⇅ ⓜ neural network

Example: the function ƒ maps a passive sentence to its meaning (Logical Form, LF): “Few leaders are admired by George Bush” ↦ admire(George Bush, few leaders).
  Ⓜ  ƒ(s) = cons(ex1(ex0(ex1(s))), cons(ex1(ex1(ex1(s))), ex0(s)))
  ⓜ  [Figure: a neural network over tensor-product representations that exactly computes ƒ, mapping the activation pattern of the passive-sentence structure (Agent, Patient, Aux, V, by) to the activation pattern of its LF.]
  Ⓜ s = {fk/rk}  ⇅  ⓜ s = Σk fk⊗rk

On the horizon (2020): Deep Nets with symbols, i.e., Natural Language Processing with DNNs containing symbols; and brain reading of complex events, i.e., modeling neuroimages (fMRI) corresponding to sentence meanings.


✦ Computation in certain neural networks builds an activation pattern with maximal micro-Harmony (conforming maximally to the micro-desiderata encoded in the connection weights): the Hmax principle. [Legendre, Miyata, & Smolensky 1990, 1991; Pater et al. 2008 et seq.]
  Example micro-desiderata: a connection weight of –2 between units ⓐ and ⓑ says “ⓐ and ⓑ should not both be active simultaneously” (if they are ⇒ Harmony penalty 2); a weight of +3 between ⓑ and ⓒ says “ⓑ and ⓒ should both be active simultaneously” (if they are ⇒ Harmony reward 3). If ⓐ & ⓒ are both active, then at ⓑ there is a conflict: impossible to satisfy both. Conflict-resolution is needed: weighting.
✦ For tensor-product representations, this mathematically entails that outputs maximize macro-Harmony:
✦ the macro-desiderata in macro-Harmony define a Harmonic Grammar.

Computation in grammar is optimization: knowledge = desiderata, not procedure ⇒ Radical new theory of grammar.

Macro desiderata for sentence structure/syntax (Grimshaw & Samek-Lodovici 1995, 1998):
  SUBJECT: A sentence has a subject.
  FULLINTERP: No meaningless words.
These conflict too: impossible to satisfy both. Need a conflict-resolution method. How to express piove (‘it rains’)? Typology by re-weighting:

English (weights: SUBJECT –3, FULLINTERP –2)
  piove          SUBJECT    FULLINTERP    H
  rains            *!                    –3
  ☜ it rains                   *         –2
  it rains it                  **!       –4

Italian (weights: FULLINTERP –3, SUBJECT –2)
  piove          FULLINTERP    SUBJECT    H
  ☜ rains                         *      –2
  it rains           *!                  –3
  it rains it         *!*                –6
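A minimal sketch of this weighted evaluation (my own toy code, using the violation counts and weights from the tableaux above):

```python
# Harmonic Grammar evaluation for the 'piove' example: each candidate's Harmony is the
# weighted sum of its (negative) constraint violations; the output maximizes Harmony.
violations = {
    "rains":       {"SUBJECT": 1, "FULLINTERP": 0},
    "it rains":    {"SUBJECT": 0, "FULLINTERP": 1},
    "it rains it": {"SUBJECT": 0, "FULLINTERP": 2},
}

def optimal(weights):
    harmony = {cand: sum(weights[c] * n for c, n in v.items())
               for cand, v in violations.items()}
    return max(harmony, key=harmony.get), harmony

# English: SUBJECT outweighs FULLINTERP; Italian: the reverse.
print(optimal({"SUBJECT": -3, "FULLINTERP": -2}))   # -> ('it rains', ...)
print(optimal({"SUBJECT": -2, "FULLINTERP": -3}))   # -> ('rains', ...)
```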

Note: the weaker constraint is not ‘turned off’, as in other theories. A realistically complex case (in phonology) will be shown below.

Key variant: Optimality Theory (OT)
‣ macro-desiderata = constraints, universal: the same in all languages
‣ constraints are ranked, not weighted (Engl: SUBJECT ≫ FULLINTERP; Ital: reverse)
☞ can formally compute the typology of possible human grammars from all rankings of the universal constraints

OT is a general formalism for grammar; applied in
✦ phonetics, phonology, morphology, syntax, semantics, pragmatics, discourse
✦ acquisition, language change, bilingualism, code-switching
Many computational theorems and software for computing outputs and typologies, and for learning (actually used by working linguists). Often enables more insightful or explanatory analysis (not necessarily simpler).
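For comparison, a sketch of OT’s ranked (lexicographic) evaluation of the same candidates, plus the factorial typology over all rankings (again my own illustrative code, not any of the OT software alluded to above):

```python
# OT evaluation: constraints are ranked, not weighted, and candidates are compared
# lexicographically -- a violation of a higher-ranked constraint is fatal no matter
# what happens on lower-ranked constraints.
from itertools import permutations

violations = {
    "rains":       {"SUBJECT": 1, "FULLINTERP": 0},
    "it rains":    {"SUBJECT": 0, "FULLINTERP": 1},
    "it rains it": {"SUBJECT": 0, "FULLINTERP": 2},
}

def optimal(ranking):
    """Return the candidate whose violation profile is best under the given ranking."""
    def profile(cand):
        return tuple(violations[cand][c] for c in ranking)   # compared lexicographically
    return min(violations, key=profile)

print(optimal(["SUBJECT", "FULLINTERP"]))   # English ranking -> 'it rains'
print(optimal(["FULLINTERP", "SUBJECT"]))   # Italian ranking -> 'rains'

# The factorial typology: the set of grammars predicted by all rankings.
print({ranking: optimal(list(ranking))
       for ranking in permutations(["SUBJECT", "FULLINTERP"])})
```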


OT gives new and formally precise answers to the most fundamental questions of grammatical theory:
  What, exactly, is it that all human grammars share? The constraints: desiderata for combinatorial structure.
  How, exactly, may human grammars differ? Only in the ranking (or weighting) of the constraints.
From sequential symbol-manipulating procedures to achieving (simultaneous) desiderata: optimization (a real example).

From sequential symbol-rewriting to Harmony maximization

The Lardil derivation again, à la Wilkinson 1983: a grammar as a sequence of rules that manipulate symbols. What form of computation? And now for something completely different …

Violable constraints, ranked left to right: {undominated: *COMPLEX, *CODACOND, *MINWD, *ONSET} ≫ *INSERT(V) ≫ NOM ≫ ALIGN ≫ *INSERT(C) ≫ *DELETE. A violation (*!) is fatal only where a better option exists; the winner’s violations (⊛) are tolerated where there is no better option.

Stem muŋkumuŋku ‘wooden axe’ (Nominative):
  muŋ.ku.muŋ.ku    fatal violation (*!)
  muŋ.ku.muŋk      fatal: *COMPLEX
  muŋ.ku.muŋ       fatal: *CODACOND
  muŋ.ku.mu.ŋa     fatal: *INSERT(V) (fatal here: there is a better option)
  ☞ muŋ.ku.mu      winner; its violations are tolerated (⊛)
  muŋ.ku           fatal (excessive deletion)

Stem wite ‘inside’:
  ☞ wi.te          winner (one tolerated violation ⊛)
  wit              fatal: *MINWD

Stem ŗil ‘neck’:
  ŗil              fatal: *MINWD
  ŗil.a            fatal: *ONSET
  ŗi.la            fatal
  ☞ ŗil.ta         winner; insertion tolerated here: there is no better option (⊛)
  ŗil.tat          fatal

✦ Gradient Symbol Structures: symbols in structures have continuous activation values.
  ‣ A symbol in a role blend: A in 0.7*rleft + 0.4*rright (0.7A in the left role, 0.4A in the right role)
  ‣ A role (e.g., the left-child role) filled by a blend of symbols, e.g., 0.7A + 0.2B + 0.9C
✦ Example (gradient input, discrete output): French liaison
  petit ami vs. petit copain vs. petite copine → pe.ti.ta.mi vs. pe.ti_.co.pain vs. pe.tit.co.pine
  input: peti(0.5*t) + (0.3*t + 0.3*n + 0.3*z)ami   ⇓ optimize ⇓   output: pe.ti.ta.mi
  input: peti(0.5*t) + copain                        ⇓ optimize ⇓   output: pe.ti.co.pain
  input: peti(1*t) + copine                          ⇓ optimize ⇓   output: pe.tit.co.pine
GSC: novel continuous neural computation yields (discrete or gradient) optimal abstract combinatorial outputs. ⇒ Radical new theory of how abstract information is represented in the mind.
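A toy sketch of the “gradient input, discrete output” idea for liaison: a partially active final /t/ plus Harmony optimization over discrete candidates reproduces the pattern above (the numerical values are invented for illustration and are not the actual GSC analysis of French):

```python
# Toy GSC-style optimization: decide whether the gradient final /t/ is pronounced by
# maximizing Harmony = faithfulness to the gradient input + simple markedness terms.
def harmony(pronounce_t, t_activity, next_word_starts_with_vowel):
    h = 0.0
    if pronounce_t:
        h += 2 * t_activity - 1          # faithfulness: reward realizing the gradient /t/
        if not next_word_starts_with_vowel:
            h -= 0.8                     # markedness: the /t/ would sit in a coda
    else:
        if next_word_starts_with_vowel:
            h -= 0.5                     # markedness: next syllable is left without an onset
    return h

def output(t_activity, next_word_starts_with_vowel):
    return max([True, False],
               key=lambda p: harmony(p, t_activity, next_word_starts_with_vowel))

print(output(0.5, True))    # petit ami     -> True  (t pronounced: pe.ti.ta.mi)
print(output(0.5, False))   # petit copain  -> False (t silent:     pe.ti.co.pain)
print(output(1.0, False))   # petite copine -> True  (t pronounced: pe.tit.co.pine)
```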


Computational approaches to mind and brain:

How language is profoundly shaped by its neural substrate

That’s all folks!

Thank you for your attention.
