
Biases in Segmenting Non-concatenative Morphology

by

Michelle Alison Fullwood

B.A., Cornell University (2004)

Submitted to the Department of Linguistics and Philosophy in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Linguistics at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY

September 2018

© Massachusetts Institute of Technology 2018. All rights reserved.

Author: (signature redacted)
Department of Linguistics and Philosophy
September 1, 2018

Certified by: (signature redacted)
Adam Albright
Professor of Linguistics
Thesis Supervisor

Accepted by: (signature redacted)
Adam Albright
Professor of Linguistics
Linguistics Section Head

Biases in Segmenting Non-concatenative Morphology

by

Michelle Alison Fullwood

Submitted to the Department of Linguistics and Philosophy on September 1, 2018, in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Linguistics

Abstract

Segmentation of words containing non-concatenative morphology into their component morphemes, such as Arabic /kita:b/ 'book' into root /ktb/ and vocalism /i-a:/ (McCarthy, 1979, 1981), is a difficult task due to the size of its search space of possibilities, which grows exponentially as word length increases, versus the linear growth that accompanies concatenative morphology.

In this dissertation, I investigate, via computational and typological simulations as well as an artificial grammar experiment, the task of morphological segmentation in root-and-pattern morphology, as well as the consequences for majority-concatenative languages such as English when we do not presuppose concatenative segmentation and its smaller hypothesis space.

In particular, I examine the necessity and sufficiency conditions of three biases that may be hypothesised to govern the learning of such a segmentation: a bias towards a parsimonious morpheme lexicon with a power-law (Zipfian) distribution over tokens drawn from this lexicon, as has successfully been used in Bayesian models of word segmentation and morphological segmentation of concatenative languages (Goldwater et al., 2009; Poon et al., 2009, et seq.); a bias towards concatenativity; and a bias against interleaving morphemes that are mixtures of consonants and vowels.

I demonstrate that while, computationally, the parsimony bias is sufficient to segment Arabic verbal stems into roots and residues, typological considerations argue for the existence of biases towards concatenativity and towards separating consonants and vowels in root-and-pattern-style morphology. Further evidence for these as synchronic biases comes from the artificial grammar experiment, which demonstrates that languages respecting these biases have a small but significant learnability advantage.

Thesis Supervisor: Adam Albright Title: Professor of Linguistics

Acknowledgments

I write these acknowledgments with minutes to go to the official deadline, as one does, but I've been composing these in my head long before I ever wrote a word of this dissertation, because it took the help and kindness of many different people to get me to the point where I could even write a word. So here goes:

Thanks go first and foremost to Adam Albright, who has been my advisor from my first day at MIT. I'm not sure how someone's worst quality can be that they're too nice, but Adam has managed it. After hearing innumerable horror stories about other people's advisors¹, I've come to realise how lucky I am to have experienced his kindness, guidance, and wide-ranging knowledge of the minutiae of so many languages. I can only hope that as I go on to supervise other people in my working life, I can be half as patient and understanding as he has been with me through the years.

Thanks also to the other members of my committee, Donca Steriade and Edward Flemming, who were always there when I needed them, and provided much advice on linguistics, statistics, teaching, and life along the way. Michael Kenstowicz taught me so much about phonology through the classes I took from him and our one-on-one meetings. Tim O'Donnell was the one who started me on the project that grew into this dissertation, and I'm very grateful to him for teaching me almost everything I know about Bayesian statistics and their applications to linguistics.

I'll not hide that I burned out, pretty badly, for a couple of years while working on my dissertation. I'd like to thank MIT Mental Health and Counselling for giving me the tools to work through my anxiety and depression, and to all of my professors for being understanding and patient during the dark times. I'd also like to thank in this regard everyone at department HQ, particularly Jen, Matt, Mary and Chrissy, for bailing me out when I missed deadlines because things felt too impossible, and for always being there with smiles and chocolate. To any very lost

¹All from other departments, natch.

grad students who might stumble upon this dissertation, if sitting down to work feels incapacitating, please talk to someone. No one I know has enjoyed writing their dissertation, but it shouldn't be an object of terror.

Back to the good times: I loved sharing an office with Suyeon Yun, Yusuke Imanishi, Hrayr Khanjian, and Amanda Swenson, and the inimitable members of ling-10: Ayaka Sugawara, Coppe van Urk, Gretchen Kern, Isaac Gould, Ryo Masuda, Sam Steddy, Ted Levin, and Wataru Uegaki. Thanks also go to my extended family of officemates in Masako Imanishi, Yuika-chan, Joetaro-kun, and Sally Yun.

Another bright spot was TAing for Donca Steriade and David Pesetsky. Those courses, and the bright undergrads who participated in them, rekindled my love for linguistics when it was flagging.

It may be a trope that the real treasure is the friends you made along the way, but I am very glad that my rocky journey through MIT led, directly or indirectly, to me knowing Abdul-Razak Sulemana, Alya Abbott, Anna Jurgensen, Benjamin Storme, Brian Hsu, Bronwyn Bjorkman, Chew Lin Kay, Chrissy Wheeler, Christine Riordan, Claire Halpert, Danfeng Wu, Despina Oikonomou, Diviya Sinha, Erin Olson, Eva Csipak, Fen Tung, Gaja Jarosz, Giorgio Magri, Grace Chua, Hadas Kotek, Ishani Guha, Janice Lee, Jiahao Chen, Jonah Katz, Laine Stranahan, Lauren Eby Clemens, Lilla Magyar, Louis Tee, Luka Crnic, Maria Giavazzi, Michelle Yuan, Milena Sisovics, mitcho Erlewine, Nina Topintzi, Pamela Pan, Patrick Grosz, Patrick Jones, Pooja Paul, Presley Pizzo, Pritty Patel-Grosz, Ruth Brillman, Sam Al Khatib, Sam Zukoff, Sarah Ouwayda, Sixun Chen, Snejana Iovtcheva, Sunghye Cho, Yang Lin, Yasutada Sudo, Yimei Xiang, Yoong Keok, Youngah Do, Yujing Huang, Yukiko Ueda, among many others.

Thanks also to the many who, wittingly or unwittingly, helped me procrastinate on my dissertation and kept me sane: my puzzle hunt crew, Chelsea, Tim, Jit Hin, Ahmed, Fabian, Nicholas, and Eric; fellow organisers and members of Boston Python, PyLadies Boston, and the Python community at large, including Adam Palay, Lynn Cherny, Liza Daly, Alex Dorsk, Eric Ma, Frederick Wagner, Janet Riley, Jennifer Konikowski, Jenny Cheng, Laura Stone, Leena, Lina Kim, Marianne

Foos, and Thao Nguyen; my fellow mapmakers at Maptime Boston, particularly Jake Wasserman and Ray Cha; and fellow volunteers at the Prison Book Program. Muchas gracias also to colleagues at Vista Higher Learning, Cobalt, and Spirit AI, who not only helped me procrastinate but paid me to do it.

Ever since I came to Boston, I've gone back home to Singapore once a year, where my friends have helped me decompress. I'd like to thank in particular the Cornell contingent (Yann Fang, Clara, Peishan, Pris, Ray, Wenshan), my CSIT "bros" (CKL, GKC, TYL, KKH, BS, OYS), and Janet.

In Singapore also are my parents, to whom eternal thanks go for simply everything. They had to leave school at 16 and 18 to start working and supporting their families, but that didn't stop them from having a start-up before start-ups were ever a thing, or from producing two daughters with PhDs either. Thank you for always ensuring we had opportunities to read and educate ourselves, while at the same time showing through your own example that you don't need higher education to be wise, and that you definitely don't need a degree to be loved.

To my sister Melissa, I send absolutely no thanks at all for finishing her PhD in three years, though I guess at least that means that between us, we have a respectable average time to completion.

Much love also to "the mob", i.e. my extended family in Singapore: Der Yao, Mema, UTH, Eleanor, Veronica, Kelvin, AWW, AMM, Uncle James, Aunty Siew Lan, Gabriel and Christine; as well as my in-laws, Candy, Dave, Amy, and all the Padowski aunts, uncles, and cousins.

Lastly, we come to my boys: Greg and our purr machines Salem and Morris, who have been here for me through my best days and my worst days, and made all of them brighter just by being there. My love for you is monotonically non-decreasing.

This material is based upon work supported by the National Science Founda- tion Graduate Research Fellowship Program under Grant No. 1122374.

Contents

1 Introduction
  1.1 The problem of non-concatenative morphological segmentation
    1.1.1 What about meaning?
  1.2 Outline
  1.3 Models of Arabic morphology
    1.3.1 Root-based accounts
    1.3.2 Non-root-based accounts
    1.3.3 Evidence for the Semitic root
    1.3.4 Evidence for other morphemes in Semitic
    1.3.5 Our model

2 Computational Modelling
  2.1 Bayesian morphological segmentation
    2.1.1 The Goldwater-Griffiths-Johnson model
    2.1.2 Inference
  2.2 Non-concatenative generative model
    2.2.1 Inference
  2.3 Related work
  2.4 Experiments
    2.4.1 Data
    2.4.2 Results
    2.4.3 Error analysis: Arabic
    2.4.4 Error analysis: English
  2.5 Discussion and Future Work

3 Artificial Grammar Experiment
  3.1 Pathological predictions
    3.1.1 Predictions of Prosodic Morphology
    3.1.2 Replacing gradient ALIGN with categorical ANCHOR
    3.1.3 Predictions of Mirror Alignment-induced fixed rankings
    3.1.4 Previous work
  3.2 Experiment
    3.2.1 Task
    3.2.2 Stimuli
    3.2.3 Controlling for alternative testing strategies
    3.2.4 Participants
  3.3 Results
    3.3.1 Descriptive analysis
    3.3.2 Statistical analysis
  3.4 Discussion
  3.5 Conclusion

4 Towards a new model
  4.1 The concatenative bias
    4.1.1 Implementing the concatenative bias in the model
    4.1.2 Implementing the concatenative bias in Optimality Theory
    4.1.3 Implementing the concatenative bias in MaxEnt models
  4.2 CV bias
    4.2.1 Implementing the CV bias in the model
    4.2.2 Implementing the CV bias in Optimality Theory
    4.2.3 Biases: Implications and Future Work
  4.3 Spreading and truncation
    4.3.1 Implementing spreading and truncation in the model
    4.3.2 Inference over this model
  4.4 Memory limitations
  4.5 Conclusion

A Formal specification of the generative model

B Artificial Grammar Experimental Stimuli

Chapter 1

Introduction

One of the many tasks faced by language learners is morphological segmentation, the task of breaking down complex words into their component morphemes. In the vast majority of languages, morphology is mostly concatenative, meaning that morphemes are joined together linearly, as in the example of learners → learn + er + s. One class of languages, the Semitic family being the most prominent, displays a different kind of morphology, in which morphemes can be discontinuous and are interleaved together to form a stem. The following list of words from Arabic is a classic example. There is no contiguous sequence of segments that all the words share, yet they are all related. Instead, they share a discontinuous subsequence ktb.

(1) Partial list of Arabic words containing the discontinuous subsequence ktb

kataba 'he wrote'                  kutiba 'it was written'
yaktaba 'he is writing'            yuktabu 'it is being written'
kattaba 'he caused to write'       kuttiba 'it was caused to be written'
kaataba 'he corresponded (with)'   kuutiba 'it was corresponded (with)'
kitaab 'book'                      kutub 'books'
kaatib 'writer'                    kaatibuun 'writers'
maktab 'office'

Traditionally, this triconsonantal subsequence has been analysed as a morpheme in its own right, called the "root" (though some theories disagree; this will be discussed in further detail in section 1.3.2).

If it is a morpheme, then the learner needs to be able to segment it out. However, morphological segmentation with non-concatenative morphology is an exponentially more difficult task than with concatenative morphology, as the following section will demonstrate.

1.1 The problem of non-concatenative morphological segmentation

Since the seminal work of Saffran et al. (1996a), which showed that infants could use transition probabilities between phones to segment words, one prominent algorithm for how sentences can be segmented into words and words into morphemes has been to count these transition probabilities (TPs) and postulate boundaries where the TP is low.

Counting TPs between adjacent segments cannot account for non-concatenative morphology, however. Follow-up experiments by Newport and Aslin (2004) suggest that infants and adults can use vowel-to-vowel and consonant-to-consonant transition probabilities to perform word segmentation. Given that in the Semitic languages the root is consonantal and interleaved with a vocalism, this is a possible escape hatch for morphological segmentation. Such counting presupposes, however, knowledge about consonants and vowels. But to what extent is this knowledge necessary to perform successful morphological segmentation in non-concatenative languages?

An alternative approach for word and morphological segmentation is not to concentrate on identifying boundaries, but to optimise the lexicon of words and morphemes that results. This is the general approach used in Bayesian and Minimum Description Length approaches to segmentation (Goldwater et al., 2009;

Goldsmith, 2000). Such approaches look to place biases on the type of lexicon that results, requiring it to be parsimonious and, in the case of Bayesian segmentation, to have a power-law distribution over tokens. Experiments comparing the predictions of the Bayesian approach to the TP-based approach have shown that the predictions of Bayesian word segmentation better accord with the performance of human learners (Frank et al., 2010a,b).

Most experiments in Bayesian morphological segmentation have concentrated on concatenative languages such as English, Mandarin Chinese, and Japanese (e.g. Xu et al. (2008); Mochihashi et al. (2009)). When Semitic languages have been examined (e.g. Poon et al. (2009)), morphological segmentation was applied to undiacriticised written forms of the language, which do not render short vowels, indicators of consonant doubling, or case endings. The words above will thus be presented as:

(2) Partial list of Arabic words in (1), undiacriticised representation

ktb 'he wrote'                ktb 'it was written'
yktb 'he is writing'          yktb 'it is being written'
ktb 'he caused to write'      ktb 'it was caused to be written'
ktb 'he corresponded (with)'  ktb 'it was corresponded (with)'
ktab 'book'                   ktb 'books'
katb 'writer'                 katban 'writers'
mktb 'office'

Undiacriticised texts make Semitic languages look mostly concatenative. Where long vowels do intervene in the root as in ktab, this is not an issue for engineering-oriented approaches to morphological segmentation as these words have different base meanings.

Now, suppose we want to tackle the full problem of morphological segmentation on the Semitic languages, and we choose to take the Bayesian approach of using biases on the lexicon to optimise our segmentation. Then we come up against a fundamental problem of non-concatenative morphology, which is that it presents

an extraordinarily large search space of solutions to segmentation.

Suppose we know that a word with n segments consists of two morphemes, and that it has been formed purely concatenatively. Then the problem of morpheme segmentation reduces to figuring out where to place the boundary between the two morphemes. There are only n - 1 possibilities for its placement.

(3) c ↑ a ↑ t ↑ s   (the arrows mark the n - 1 = 3 possible boundary placements in a four-segment word)

Let us be a little more concrete by casting the search space in terms of templates interspersing segments of morpheme 1 (represented by r) and morpheme 2 (represented by s).

(4) Search space for n = 4, assuming concatenative morphology: there are 4 - 1 = 3 possibilities

a. rrrs
b. rrss
c. rsss

When we can choose to interleave two morphemes in any order, however, the number of possibilities for how to segment the word goes up from n - 1 to 2^(n-1) - 1.

(5) Templatic search space for n = 4: there are 2^(4-1) - 1 = 7 possibilities.

a. rrrs
b. rrss
c. rsss
d. rrsr
e. rsrr
f. rsrs
g. rssr

Not only are the non-concatenative templates a superset of the concatenative templates,¹ but as the length of the word increases, the number of possible segmentations of the word into morphemes increases exponentially if we assume non-concatenative morphology, versus linearly if the morphemes are concatenated.

¹Due to symmetry, sssr = rrrs.

(6) Size of the search space when the mechanism of word formation is concatenation, versus interleaving.

Length   Concatenation   Interleaving
  4            3               7
  5            4              15
  6            5              31
  7            6              63
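To make the counts in (6) concrete, the following short Python sketch (illustrative only, not the dissertation's own code; the function names are mine) enumerates the possible two-morpheme analyses of an n-segment word under each assumption:

    from itertools import product

    def concatenative_templates(n):
        # All r slots precede all s slots: one template per boundary placement, n - 1 in total.
        return ["r" * k + "s" * (n - k) for k in range(1, n)]

    def interleaved_templates(n):
        # Any mix of r and s slots that uses both morphemes; the first slot is fixed to r
        # so that mirror images (e.g. sssr vs. rrrs) are not counted twice: 2^(n-1) - 1 in total.
        return ["r" + "".join(rest)
                for rest in product("rs", repeat=n - 1)
                if "s" in rest]

    for n in range(4, 8):
        print(n, len(concatenative_templates(n)), len(interleaved_templates(n)))
    # 4 3 7 / 5 4 15 / 6 5 31 / 7 6 63, matching the table in (6)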

As in the transition probability-based approach, we can reduce this non-concatenative search space using our knowledge that roots are mostly consonantal. But how necessary is this assumption to successful morphological segmentation?

A separate question that arises is what searching the larger non-concatenative space of solutions would do to languages with concatenative morphology, which can successfully be segmented to a high degree of accuracy when we restrict their search space to concatenation only. Will the Bayesian approach then fail, and if so, how can we implement a morphological learner that is robust enough to acquire both concatenative and non-concatenative root-and-template morphology? We will examine all these issues in this dissertation.

1.1.1 What about meaning?

Our models and experiments will involve no reference to meaning. All words are presented as individual units, with no information about possible relations between them such as shared meaning, shared features, or being members of the same paradigm. The only information available is the segments in each word.

Firstly, this is because I am working within a tradition of unsupervised morphological segmentation. So far, this work has largely concentrated on concatenative languages, and has, as mentioned above, been quite successful. In particular, Bayesian word and morphological segmentation represent the state of the art in the field. If the bias towards parsimonious lexicons yields largely correct segmentations in concatenative languages, then we are interested in seeing if the same holds true in non-concatenative languages. If not, what additional biases do we need to incorporate for segmentation to be successful?

Secondly, evidence from artificial grammar experiments in infants and adults shows that word and morphological segmentation occur early and are possible without access to meaning. In addition to the word segmentation results reviewed above, Mintz (2004) found that 15-month-old infants acquiring English are able to separate the verbal suffix -ing from pseudo-roots, while Marquis and Shi (2012) found that 11-month-olds acquiring French could segment an artificial suffix -u from a set of nonce roots, among others.

Lastly, I note that there is a growing trend towards simultaneous modelling of multiple levels of linguistic structure, such as joint phonetic category induction and lexical learning in Feldman et al. (2013), with the addition of modelling phonetic variability in Elsner et al. (2013). Such modelling almost always shows that joint learning boosts the performance of learning at each level, and thus a future direction for non-concatenative morphological learning would be to add meaning into the mix. However, for now, my focus is on what the necessary and sufficient biases are for segmenting roots and patterns in non-concatenative morphology.

1.2 Outline

This dissertation examines the problem of non-concatenative morphological segmentation, with a focus on root-and-pattern morphology of the sort most famously seen in the Semitic languages. Particular questions I aim to answer are:

(7) a. What biases do we need to incorporate into our learners to handle Arabic morphology?
    b. In particular, is a Bayesian learner that implements a parsimony bias sufficient to learn to segment Arabic morphology into roots and non-root material?
       (i) If not, what elements does it learn?
       (ii) Does it require a bias towards certain morpheme shapes, such as all-consonant roots and all-vocalic residues?
    c. What happens to English when we allow a morphological learner to explore non-concatenative segmentations?
       (i) Does an English learner require a bias towards concatenativity?
       (ii) Do English speakers in an artificial grammar experiment display biases towards concatenativity and against interleaving morphemes that consist of mixtures of consonants and vowels?

In Chapter 2, which is extended and updated from previously published work with Timothy O'Donnell (Fullwood and O'Donnell, 2013), I present results from a computational simulation of unsupervised segmentation of Arabic verbal stems. We adapt the Bayesian segmentation approach of Goldwater et al. (2009) to the non-concatenative case by switching out its generative model, a procedure by which words are generated, for one that can handle non-concatenative morphology.

This model is informed by existing accounts of Arabic morphology, which is the subject of the remainder of Chapter 1. After exploring both root- and non-root-based accounts of how Arabic words are structured, we introduce a simplified, pared-down root-based model, which is then adapted into a hierarchical Bayesian model in Chapter 2. The reason for this simplification is two-fold: one, to render the model computationally tractable; and two, to investigate what assumptions are necessary to learn templatic morphology. This is where we answer the question of the sufficiency of the parsimony bias and the question of a bias towards particular morpheme shapes. We will find that there is sufficient statistical information in our corpus of Arabic verbs to learn to segment out the root, without requiring any additional biases in the form of additional linguistic knowledge such as the consonant-vowel distinction.

In Chapter 2, we not only look at how well the model does at learning Arabic morphology, but also at how well it does on an equivalent set of English verbal stems, when the hypothesis space is expanded to include non-concatenative segmentations. We find that the model performs poorly on English, despite theoretically being capable of handling both concatenative and non-concatenative morphology. This suggests that for English, the assumption that the primary mode of word formation is concatenation is crucial for proper segmentation, and that learners may have a bias towards concatenativity.

In Chapter 3, I look further at the problem of concatenativity, as well as questions raised by overgeneration of unattested templates by the model. I show that these pathological predictions are not merely the result of a linguistically deficient model, but surface in Prosodic Morphology (McCarthy and Prince, 1993b) as well. I report an Artificial Grammar experiment designed to test for synchronic biases for concatenativity and preference for consonantal roots and vocalic residues in root-and-pattern morphology via three "Martian languages".

(8) a. English-like: CVC-CV
    b. Arabic-like: CVCCV where root = CCC, residue = VV
    c. Unattested: CVCCV where root = CCV (1st, 3rd, 5th segments) and residue = VC (2nd, 4th segments)

[Association diagrams for (8a-c): each stem is C1V1C2C3V2, with the root tier linked to C1V1C2 in (8a), to C1C2C3 in (8b), and to C1C2V2 in (8c); the remaining segments link to the residue tier.]

We find evidence from this experiment for synchronic biases towards concatenativity and towards division of the stem into consonantal roots and vocalic residues, which would reduce the degree of pathological overgeneration of unattested patterns.

The final chapter, Chapter 4, discusses linguistic issues that arise from the results of Chapters 2 and 3, and looks at how to integrate the findings of Chapter 3 into the model, specifically, how the model may be modified to incorporate the proposed biases.

1.3 Models of Arabic morphology

Descriptively, Arabic verbs are formed from a set of consonants, called the "root", which conveys a semantic field such as "writing", and a vocalism that conveys voice (active/passive) and aspect (perfect/imperfect). The verbal stem is formed by interleaving the root and vocalism, and sometimes adding further derivational prefixes or infixes. To this stem, inflectional affixes indicating the subject's person, number and gender are concatenated.

There are nine common forms of the Arabic verbal stem, also known by the Arabic grammatical term wazn and the Hebrew grammatical term binyan. In (9), /fʕl/ represents the triconsonantal root. Only the perfect forms are given.

(9) List of common Arabic verbal binyanim

Form    Active       Passive
I       faʕal        fuʕil
II      faʕʕal       fuʕʕil
III     faaʕal       fuuʕil
IV      ʔafʕal       ʔufʕil
V       tafaʕʕal     tufuʕʕil
VI      tafaaʕal     tufuuʕil
VII     ʔinfaʕal     —
VIII    ʔiftaʕal     ʔiftuʕil
X       ʔistafʕal    ʔistufʕil

Each of these forms has traditionally been associated with a particular meaning.

For example, Form II verbs are generally causatives of Form I verbs, as in kattab "to cause to write" (cf. katab "to write"). However, as is often the case with derivational morphology, these semantic associations are not completely regular: many forms have been lexicalized with alternative or more specific meanings.

1.3.1 Root-based accounts

The traditional Arab grammarians' account of the Arabic verb was as follows: each form was associated with a template with slots labelled C1, C2 and C3, traditionally represented with the consonants f, ʕ, l, as described above. The actual root consonants were slotted into these gaps. Thus the template of the Form V active perfect verb stem was taC1aC2C2aC3. This, combined with the triconsonantal root, made up the verbal stem.

(10) Traditional analysis of the Arabic Form V verb

Root:      f  ʕ  l

Template:  t a C1 a C2 C2 a C3

The first generative linguistic treatment of Arabic verbal morphology (McCarthy, 1979, 1981) adopted the notion of the root and template, but split off the derivational prefixes and infixes and the vocalism from the template. Borrowing from the technology of autosegmental phonology (Goldsmith, 1976), the template now comprised C(onsonant) and V(owel) slots. Rules governing the spreading of segments ensured that consonants and vowels appeared in the correct positions within a template.

Under McCarthy's model, the analysis for [tafaʕʕal] would be as follows:

(11) McCarthy analysis of Arabic Form V verb

Prefix:       t
Root:         f ʕ l
CV template:  C V C V C C V C
Vocalism:     a

While increasing the number of morphemes associated with each verb, the McCarthy approach economized on the variety of such units in the lexicon. The inventory of CV templates was limited to those generated by the following rules:

(12) Rules generating (and limiting) verbal templates (McCarthy, 1981)

a. (C(V))CV([+seg])CVC
b. V → ∅ / [CVC __ CVC]

Further, there were only three vocalisms, corresponding to active and passive voice intersecting with perfect and imperfect aspect, and only four derivational prefixes (/ʔ/, /n/, /t/, /st/), one of which becomes an infix via a morphophonological rule in Form VIII.

1.3.2 Non-root-based accounts

Hockett (1954) divides morphological theories into two categories, item-and-arrangement and item-and-process. The theories above fall more or less into the item-and-arrangement side of the divide. However, there are also theories of word formation in Semitic that fall more into the item-and-process category, wherein some word in a paradigm is the base, which is then morphed into other members of the paradigm by some grammatical process. Because the relation here is word-to-word, there generally is no need to posit the existence of a root.

23 I will discuss two such non-root accounts of Semitic morphology in particular, Bat-El (1994), in which the grammatical process is stem modification and vowel overwriting, and McCarthy and Prince (1990), in which iambic prosodic templates are overlaid onto singular Arabic nouns to produce broken plurals.

These accounts are both motivated by transfer phenomena, in which a property of the base transfers over to the derived form. Bat-El (1994) looks at cluster transfer in Hebrew denominal verbs. When such verbs have five or more consonants, there are several possible CV templates they can map to. However, which one is chosen cannot be determined from the root alone. Instead, the template that maintains clusters present in the base noun is selected.

(13) Clusters are maintained between noun bases and derived denominal verbs in Hebrew (Bat-El, 1994)

a. /praklit/ 'lawyer' → /priklet/ 'to practice law'
b. /traklin/ 'salon, parlour' → /triklen/ 'to make something neat'
c. /xantarif/ 'nonsense' → /xintref/ 'to talk nonsense'
d. /ʔabstrikti/ 'abstract' → /ʔibstrekt/ 'to abstract'
e. /stenograf/ 'stenographer' → /stingref/ 'to take shorthand'

Bat-El (1994) proposes the following process for converting nouns into denominal verbs: first, a bisyllabic template is imposed on the noun. As much of the segmental material of the noun as can be mapped onto those syllables, working from the outside edge in, is syllabified. Afterwards, the vocalic pattern /i,e/ overwrites the original vowels. Lastly, any unsyllabified material is erased.

(14) Bat-El (1994)'s three-step process for deriving Hebrew denominal verbs

a. Maximal syllabification, working edge-in:  x a n t a r i f  (the outermost syllables are built first; material that does not fit remains unsyllabified)
b. Vocalic overwriting:  the pattern /i, e/ replaces the original vowels
c. Delete unsyllabified segments:  → x i n t r e f

Clusters are maintained under this framework because there is no mechanism for vowels to be epenthesised between them at any step of the process. No reference to a root is necessary at any point in the derivation.

Similarly, McCarthy and Prince (1990) are concerned with the transfer of vowel length in the final syllable of Arabic nouns in the singular and plural form.

(15) a. /xaatam/ → /xawaatim/ 'signet ring(s)'
     b. /jaamuus/ → /jawaamiis/ 'buffalo'

McCarthy and Prince (1990) propose that in the majority of Arabic broken plurals, an iambic template is imposed on the initial syllable of the singular noun, and a vowel overwriting process imposes the vocalic melody /i/ on the final syllable, maintaining the existing length. Again, no reference to the root is necessary at any point in this derivation.

In and of themselves, these are not necessarily arguments against the root. In Optimality Theory, there are input-output faithfulness constraints and output-output faithfulness constraints (Benua, 1997). The transfer phenomena described here could be the result of phonological output-output faithfulness constraints, applied to words formed morphologically from a root-and-template-based interleaving. Furthermore, Aronoff (1994) studies several base-derivative relations in Hebrew and concludes that it is impossible to stipulate any given form as the base, disputing accounts of transfer phenomena that rely on derivations from a base form.

1.3.3 Evidence for the Semitic root

Despite being contested in Ratcliffe (1997); Bat-El (1994, 2003); Ussishkin (2003) and others, the root retains a healthy status in modern linguistic theory, because of a considerable body of evidence, largely external, that has built up in favour of its existence. Prunet (2006) gives a comprehensive overview of this corpus of evidence; here I will briefly sketch a few key pieces.

Much evidence has come from priming experiments, showing that a word with a given root primes other words sharing the same root, to a greater degree than would be expected due to phonological or semantic similarity (Frost et al., 1997; Boudelaa and Marslen-Wilson, 2004, et seq.).

Furthermore, evidence from metatheses in language games (Bagemihl, 1989; McCarthy, 1981, et seq.), slips of the tongue (Berg and Abd-El-Jawad, 1996), and aphasic errors (Prunet et al., 2000, et seq.) shows that when the consonants of a Semitic stem are metathesised, it is invariably the root consonants that get metathesised and not consonants belonging to the template or affix.

All this suggests that roots are psychologically real units of the mental lexicon, on a separate level from the template and affixes. On the basis of this evidence, the linguistic model we will adopt is derived from the root-based approaches outlined in 1.3.1.

1.3.4 Evidence for other morphemes in Semitic

Much of the processing literature addressing Semitic morphology has focused on investigating the psychological reality of the root, as opposed to the other morphemes proposed by McCarthy (1979, 1981), namely the vocalism and the CV-template.

One exception to this is Boudelaa and Marslen-Wilson (2004), a psycholinguistic study using the priming paradigm that tests the place of these two morphemes in the organisation of the Arabic mental lexicon. Boudelaa and Marslen-Wilson find that words with a certain CV-skeleton primed other words sharing the same CV-skeleton, even when they have few to no segments in common. Conversely, sharing a vocalism does not cause priming.

(16) a. /nuuqiʃ/ "be discussed" (CV-skeleton: CVVCVC) primes /saaʕad/ "help" (CV-skeleton: CVVCVC)
     b. /ʒaazaf/ "act blindly" (vocalism: /a/) does not prime /ʔindaθar/ "be wiped out" (vocalism: /a/)
     (Boudelaa and Marslen-Wilson, 2004)

Faust (2012), working in the framework of Distributed Morphology, suggests that the vocalism, at least in Modern Hebrew, should be split into two positions, V1 between the first two consonants of the root and V2 between the last two consonants of the root, with each position having different DM realisation rules.

Our model will remain agnostic with respect to the status of the vocalism, and indeed will not attempt to model affixation either. Instead, we will combine all non-root elements of the stem into what we term the "residue", as the following section describes. We retain the notion of a template that dictates how the root and residue are interleaved, though it will not be a CV-skeleton; nevertheless, it serves the same purpose and could induce the same priming effect found in Boudelaa and Marslen-Wilson (2004).

1.3.5 Our model

Our model of root-and-template morphology adopts McCarthy's notion of an abstract template, but reduces the number of segment-bearing tiers to just two: the root and the residue. Their interleaving is dictated by a template with slots for root and residue segments. We also consider this template a morpheme.

(17) Breakdown of the Form VIII verb ʔiftaʕal under our approach:

Root:      f ʕ l
Template:  s s r s s r s r
Residue:   ʔ i t a a

Clearly, there are deficiencies to this rather simplistic model of Arabic morphology. Not only are prefixes and suffixes not modelled, but other crucial elements of the McCarthy analysis, such as spreading, when there are more consonantal slots than consonants and more vocalic slots than vowels, are not modelled. Thus, while under the McCarthy analysis the vocalism of katab and kaatab is /a/, with spreading of the single segment to all two or three vocalic slots of the CV template, in this model they are /aa/ and /aaa/ respectively, with no sharing taking place. Similarly, Arabic doubled roots such as smm, which are biconsonantal under McCarthy's analysis, have to be modelled as underlyingly triconsonantal. Nor do we model segmental alternations, which occur when one or more of the consonants of the root is a glottal stop or semi-vowel. Thus, in this first pass at Arabic morphology, we deal only with the "sound" roots of Arabic - those with three consonants that do not induce segmental alternations - and we leave the modelling of spreading to future models.

Furthermore, though this model was inspired by root-based approaches, there is nothing in particular forcing the root to be one of the morphemes that is found. Since there is no concept of consonants or vowels built into the model, in learning structure from words, a learner may posit a "root" that consists of a mixture of consonants and vowels. What is found in our computational experiment in Chapter 2 will be driven entirely by the statistical strength of various possible morphological segmentations.

While this model is clearly inadequate to capture the nuances of Arabic or any other language's morphology - lacking the additional prefixes and suffixes that transform a stem into a word, for example, or any concept of spreading, which is crucial to the analysis in McCarthy (1981) - this is a deliberate choice: our goal in performing this simulation is to evaluate how much machinery a model needs to successfully learn the commonly-accepted units of Arabic words.

Though the model above is inspired by templatic morphology, concatenative morphology can theoretically also be captured in this framework by grouping all the root segments on one side and all the residue segments on the other.

(18) Breakdown of English verb cooked [kukt] under our approach:

Root:      k u k
Template:  r r r s
Residue:   t

We will use this in Chapter 2 to examine what happens to English morphological segmentation when we allow for non-concatenative segmentations.
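As a concrete illustration of how a template of r and s slots interleaves the two tiers, the following minimal Python sketch (my own illustration, not the dissertation's implementation) rebuilds the stems in (17) and (18):

    def intercalate(template, root, residue):
        # Fill each 'r' slot with the next root segment and each 's' slot with the next residue segment.
        root_iter, residue_iter = iter(root), iter(residue)
        return "".join(next(root_iter) if slot == "r" else next(residue_iter)
                       for slot in template)

    # Arabic Form VIII stem from (17): root f-ʕ-l, residue ʔ-i-t-a-a
    print(intercalate("ssrssrsr", "fʕl", "ʔitaa"))   # ʔiftaʕal

    # English cooked [kukt] from (18): root k-u-k, residue t
    print(intercalate("rrrs", "kuk", "t"))           # kukt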

Chapter 2

Computational Modelling

In this chapter, we perform a computational modelling experiment on unsupervised morphological segmentation on Arabic and English verbs, based on the theoretical model of Arabic morphology presented in 1.3.5.

The overall goal of this computational experiment is to figure out whether this somewhat impoverished model, which lacks any prior knowledge of possible structures of existing templates, or even about the distinction between consonants and vowels, can successfully segment Arabic, or whether some amount of prior knowledge is necessary. In addition, we test whether English can be successfully segmented when we expand its hypothesis space to include non-concatenation.

This experiment is situated within the general approach of Bayesian segmentation (Goldwater et al., 2009), which recasts segmentation as the task of finding an optimal lexicon, where optimality is defined by a probabilistic model that we build up in accordance with our intuitions about what constitutes a "good" segmentation.

2.1 Bayesian morphological segmentation

Suppose we are given the following dataset of words and a choice of three morphological segmentations:

(1) Three possible segmentation hypotheses for dataset D

Hypothesis 1         Hypothesis 2   Hypothesis 3
t-r-a-i-n-e-d        trained        train-ed
t-e-s-t-e-d          tested         test-ed
t-r-a-i-n-i-n-g      training       train-ing
t-e-s-t-i-n-g        testing        test-ing
u-n-t-r-a-i-n-e-d    untrained      un-train-ed

The core insight of Bayesian segmentation is that a segmentation automatically induces a morpheme lexicon, and some lexicons are more likely than others.

(2) Three morpheme lexicons L induced by the segmentations in (1)

Lexicon 1   Lexicon 2   Lexicon 3
t           trained     train
r           tested      test
a           training    ed
i           testing     ing
n           untrained   un

We can define a hierarchical generative model of how morphemes and words are constructed. This is a probabilistic model, with a probability distribution placed on every choice we make in constructing each morpheme and word. Each hypothesis will then have a concrete probability that is the product of the probability of each of these choices. Hypothesised lexicons associated with over- and under-generation, as in Hypothesis 1 and Hypothesis 2, will receive a lower probability under the model, while lexicons that more closely match our intuitions about segmentation, as in Segmentation/Lexicon 3, will receive a higher probability.

32 2.1.1 The Goldwater-Griffiths-Johnson model

Let us begin with the generative model of the Goldwater-Griffiths-Johnson (henceforth GGJ) approach to word segmentation (Goldwater et al., 2009), adapted to concatenative morphological segmentation by Lee (2012).¹

¹Aside from replacing utterances with words and words with morphemes, the primary change is the addition of a Poisson distribution on the number of morphemes, based on work by Xu et al. (2008); Poon et al. (2009); Snyder and Barzilay (2008); Mochihashi et al. (2009); Lee et al. (2011).

We build up the morpheme lexicon and word dataset as follows. For each word, we decide how many morphemes n it will contain by drawing a value from a Poisson distribution with a small λ parameter. This has the following shape:

(3) Poisson distribution P(x) = λ^x e^(-λ) / x! for λ = 1.

[Plot: the Poisson(λ = 1) probability mass function over the number of morphemes in a word (x-axis: 0-6 morphemes; y-axis: probability, 0-0.4).]

Notice how quickly the probability drops as the number of morphemes gets large. This reflects our assumption that words should not consist of a large number of morphemes, putting lower probability on segmentations that over-segment the dataset.

Suppose we sample n as 2. Our word will then be composed of two morphemes, and we now need to decide what forms these will take. For each of the n morphemes that will make up the word, we draw a lexical entry from a morpheme lexicon. Every time we draw a token from the lexicon, we will have a choice of either (a) drawing an existing morpheme from the current lexicon, or (b) creating an entirely new morpheme and adding it to the existing lexicon. Each of these is associated with a probability:

(4) a. Draw an existing morpheme m_i with probability (y_i - a)/(N + b), where:
       (i) y_i is the number of times the morpheme has previously been drawn, i.e. its token frequency
       (ii) N is the total number of morphemes already drawn
       (iii) a, b are hyperparameters on the distribution; 0 < a < 1 and b > -a.
    b. Draw a new morpheme with probability (aK + b)/(N + b), where:
       (i) K is the existing size of the morpheme lexicon
       (ii) N, a, b are as defined above.

What I have just described is a probability distribution commonly used in Bayesian non-parametric statistics called the Pitman-Yor Process (Pitman and Yor, 1995). Let us briefly verify that this is indeed a probability distribution in that the sum of these probabilities adds up to 1:

(5) Pitman-Yor probabilities sum up to 1:

    Σ_{i=1}^{K} (y_i - a)/(N + b) + (aK + b)/(N + b)
  = (Σ_{i=1}^{K} y_i - Ka)/(N + b) + (aK + b)/(N + b)
  = (N - aK + aK + b)/(N + b)
  = (N + b)/(N + b)
  = 1,

since the token frequencies y_i sum to N.

As a concrete example of drawing a morpheme from the Pitman-Yor process, consider the fourth word in our dataset D, testing. Suppose we have already chosen our first three words as in Segmentation 3, so that the existing state of the morpheme lexicon is as follows:

34 (6) State of the morpheme lexicon having already selected the words train-ed, test-ed, train-ing (N = 6, K = 4).

Morpheme     Existing tokens      Current probability of drawing this morpheme
train        trained, training    (2 - a)/(6 + b)
ed           trained, tested      (2 - a)/(6 + b)
test         tested               (1 - a)/(6 + b)
ing          training             (1 - a)/(6 + b)
(new type)                        (4a + b)/(6 + b)

With probability (1 - a)/(6 + b), we choose the existing morpheme test. This updates the morpheme lexicon to the following:

(7) State of the morpheme lexicon having already selected the words train-ed, test-ed, train-ing, and the first morpheme of test-ing (N = 7, K = 4).

Morpheme     Existing tokens       Current probability of drawing this morpheme
train        trained, training     (2 - a)/(7 + b)
ed           trained, tested       (2 - a)/(7 + b)
test         tested, test-??       (2 - a)/(7 + b)
ing          training              (1 - a)/(7 + b)
(new type)                         (4a + b)/(7 + b)

With probability (1 - a)/(7 + b), we choose the existing morpheme ing, and, having reached the number of desired morphemes, n = 2, we concatenate them to form the word testing.
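The lexicon states in (6) and (7) can be reproduced with a small sketch of the Pitman-Yor draw probabilities. This is an illustrative re-implementation of the definitions in (4), not the code used for the dissertation's simulations, and the hyperparameter values are arbitrary:

    import random

    class PitmanYorLexicon:
        def __init__(self, a=0.5, b=1.0):
            self.a, self.b = a, b
            self.counts = {}                      # morpheme -> token frequency y_i

        def probabilities(self):
            # Probability of re-drawing each existing morpheme, plus of creating a new one, per (4).
            N, K = sum(self.counts.values()), len(self.counts)
            probs = {m: (y - self.a) / (N + self.b) for m, y in self.counts.items()}
            probs["(new type)"] = (self.a * K + self.b) / (N + self.b)
            return probs

        def draw(self):
            probs = self.probabilities()
            items, weights = zip(*probs.items())
            return random.choices(items, weights=weights)[0]

    lex = PitmanYorLexicon()
    for m in ["train", "ed", "test", "ed", "train", "ing"]:   # the six tokens in state (6)
        lex.counts[m] = lex.counts.get(m, 0) + 1
    print(lex.probabilities())   # e.g. P(test) = (1 - a)/(6 + b), P(new type) = (4a + b)/(6 + b)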

Now let us take a closer look at the probabilities associated with drawing morphemes from the lexicon and figure out what assumptions they reflect.

The probability of choosing an existing morpheme, (y_i - a)/(N + b), has the existing token frequency of that token in the numerator. Thus, the larger the token frequency, the more likely it is to be chosen again - the "rich-get-richer" effect we see in power laws. This biases the model towards choosing lexicons that have a roughly power-law (Zipfian) distribution (Goldwater et al., 2009), as we observe in natural language lexicons.

Furthermore, as N grows larger, the probability of drawing a new morpheme, (aK + b)/(N + b), grows smaller, reflecting our confidence that as we observe more and more of the dataset, we have seen a large subset of the morphemes, and require more extraordinary evidence to propose a new lexical entry in our morpheme lexicon. Overall, this means that more compact morpheme lexicons will have a higher probability than very expansive ones.

Next, we generate the word untrained. Having picked n = 3, as this will be generated as un-train-ed, we now go down the choice point of creating an entirely new morpheme. We then have to generate the form of the new morpheme, which we do by drawing each segment from a uniform distribution over the segments. If the size of the segment inventory is |S|, the probability of drawing any one segment is 1/|S|. This is repeated as many times as we have segments l in the morpheme, and thus:

(8) Probability of drawing a morpheme of length l = (1/|S|)^l

The longer the morpheme, the smaller its probability; this disfavours under-segmentation, which results in very long morphemes such as untrained.

Once we complete the process of drawing every morpheme and composing them into words, we will have a morpheme lexicon L and a dataset of words D. The joint probability of this generative process is the product of each and every one of the choices we made in generating our words: POISSON(n) for each word that was split into n morphemes (tokens), the Pitman-Yor probabilities for each choice of token, and the probability of generating the actual segmental sequence of each lexical entry in the morpheme lexicon. This probability is thus the product of the probability of the lexicon, P(L), and the probability of drawing our dataset D from the lexicon, P(D|L). Since the lexicon is equivalent to our segmentation hypothesis H, we'll re-label these P(H) and P(D|H).

Looking back at the three segmentation hypotheses at the start of this chapter, we see that the over- and under-segmentation hypotheses are each disfavoured by two elements of the model:

Over-segmentation, such as decomposing training into t-r-a-i-n-i-n-g, is penalised by the Poisson process used to pick the number of morphemes in each word, as it would require selecting a high n for each word every time. In addition, every time we pick a morpheme to append to our word, we incur the cost of drawing yet another token. Since we have to draw from our lexicon for many morphemes in each word, the overall probability P(D|H) is small.

Meanwhile, the under-segmentation hypothesis is disfavoured by the form of the Pitman-Yor probability of choosing a new morpheme, (aK + b)/(N + b). With such a small number of words, this might seem to favour the over-segmentation hypothesis, but consider what happens when we extend the dataset to a larger number of words. For instance, the CELEX database (Baayen et al., 1995) contains only 54 phones but 160,595 words. Under-segmented hypotheses will result in overly large lexicons, which are disfavoured by this generative model. Furthermore, under-segmented hypotheses will result in long morphemes, which are disfavoured by how we modelled the selection of segments that comprise a new lexical entry.

Ultimately, because we have defined our generative model such that it is biased in favour of more compact lexicons, each item of which is reused many times, but at the same time so that each word does not contain inordinately many morphemes, a lexicon like the one in Hypothesis 3 will prove to be best, in the tradition of Goldilocks: not too many or too few morphemes in the lexicon (there are approximately 17,500 unique morphemes in CELEX); words whose individual morphemes are neither too long nor too short; and words that are not composed of too many morphemes (on average, just 2 per word, again according to CELEX). Such a lexicon will have the highest P(H|D), threading the line between over- and under-segmentation, optimally balancing novelty and reuse.

2.1.2 Inference

Now, our goal is really to obtain hypotheses H with high posterior probability P(H|D): the probability of a segmentation hypothesis, given our dataset of words. But by Bayes' rule:

(9) P(H|D) = P(D|H)P(H) / P(D)

We can drop the denominator as our dataset is supplied to us and so its probability is irrelevant. Thus:

(10) P(H|D) ∝ P(D|H)P(H)

The optimal segmentation is the H that maximises P(H|D):

(11) argmax_H P(H|D) = argmax_H P(D|H)P(H)

the right-hand side of which is simply the joint probability defined by our generative model.

But how do we calculate this probability over all the possible segmentation hypotheses H? The short answer is that we cannot: for any dataset of a reasonable size, there are so many possible segmentations that this is prohibitively computationally expensive. Instead, we will conduct inference by drawing samples from the distribution P(H|D). For this we will employ a class of sampling algorithms called Markov Chain Monte Carlo (MCMC). Since this is an implementational detail (though by far the most challenging part of the simulation, engineering-wise), I will only give a brief, informal overview of the two most important algorithms in this class as applied to the problem of concatenative morphological segmentation.

The first algorithm is Gibbs sampling (Geman and Geman, 1984), which was applied by Goldwater et al. (2009) to perform inference on the GGJ model.

We start with a completely random segmentation. This automatically induces a morpheme lexicon with tokens assigned to each lexical entry. Next, for each word, we consider each boundary between two segments and decide whether or not to place a morpheme boundary at that spot. Let us call this variable b_ij when we are considering the ith word of the dataset and the jth phone boundary of word i: b_ij = 1 when a morpheme boundary is placed there; b_ij = 0 otherwise. For example, for the word untrained, we have the following possible boundaries:

(12) Variables to be considered for word 5, untrained, currently segmented as un-t-rain-ed.

u    n    t    r    a    i    n    e    d
  b_51 b_52 b_53 b_54 b_55 b_56 b_57 b_58
  = 0  = 1  = 1  = 0  = 0  = 0  = 1  = 0

Let's say that we are at a stage where the other words have all already been segmented correctly. Thus our current morpheme lexicon looks like this:

(13) Current state of the morpheme lexicon (N = 12, K = 7)

Morpheme   Existing tokens
train      trained, training
ed         trained, tested, untrained
test       tested, testing
ing        training, testing
un         untrained
t          untrained
rain       untrained

We can calculate the probabilities of all the choices that led up to this state H and obtain the probability P(H|D).

Now consider b_53. It is currently 1, meaning there is a hypothesised morpheme boundary between t and r. We contrast the current hypothesis H with a hypothesis H' where b_53 = 0, and calculate its associated probability P(H'|D).

39 (14) State of the morpheme lexicon H' (N = 11, K = 5)

Morpheme   Existing tokens
train      trained, training, untrained
ed         trained, tested, untrained
test       tested, testing
ing        training, testing
un         untrained

We then calculate whether we should switch from hypothesis H to H' by flipping a coin with probability:

(15) P(H → H') = P(D|H')P(H') / [P(D|H)P(H) + P(D|H')P(H')]

We flip a coin with this weight to determine whether to move to this new hypothesis or not. Thus if the posterior probability of H' is 3 times that of H, then there is a 3/4 chance we move from the morpheme lexicon in H to the morpheme lexicon in H', along with its hypothesised segmentation.

We do this cyclically, passing through the dataset several times, flipping a coin for every b_ij and adding the new hypothesised lexicon and segmentation to our Markov chain of posterior samples ("Markov" because each sample depends only on the previous one). Though at the beginning the samples are completely random, eventually, as we switch from model to model, we begin to spend more time in the high-probability regions of the posterior distribution. In fact, the Gibbs sampling algorithm is guaranteed to converge to the correct posterior distribution from any random starting point.

Finally, we take our Markov chain, discard the first m "burn-in" samples, since it is unlikely that we randomly start in a good area of probability space, and, since samples close to each other are highly correlated, retain every nth sample and discard the rest. We now have a good representation of the posterior distribution P(H|D), and can find the optimal segmentation H* = argmax_H P(H|D).
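A minimal sketch of this resampling step for a single boundary variable b_ij follows. It assumes a function joint_probability(H) that returns P(D|H)P(H) under the generative model above and a helper toggle_boundary that flips one boundary; both, like the hypothesis representation itself, are placeholders of mine rather than the dissertation's implementation:

    import random

    def resample_boundary(hypothesis, i, j, joint_probability, toggle_boundary):
        # Gibbs step: compare the current hypothesis H against H' with b_ij flipped,
        # and move to H' with probability P(H'|D) / (P(H|D) + P(H'|D)), as in (15).
        h_prime = toggle_boundary(hypothesis, i, j)
        p_h = joint_probability(hypothesis)        # proportional to P(H|D)
        p_h_prime = joint_probability(h_prime)     # proportional to P(H'|D)
        if random.random() < p_h_prime / (p_h + p_h_prime):
            return h_prime
        return hypothesis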

Under the second algorithm, Metropolis-Hastings sampling (Hastings, 1970), the jumps we make between hypotheses are not limited to changing a single variable. New hypotheses h' are generated from the existing hypothesis h by a proposal distribution g(h → h') of one's own choosing. For example, the proposal distribution could be: pick a number of b_ij's according to some distribution and toggle each of the b_ij's picked.

At each step, we sample a new hypothesis from the proposal distribution, and decide whether to accept the new proposal by calculating the Metropolis-Hastings criterion:

(16) A(h → h') = min(1, [P(h'|D) g(h' → h)] / [P(h|D) g(h → h')])

We flip a coin with this weight to decide whether to move to the new hypothesis. The resultant chain is once again a Markov chain of samples that provably reflects the posterior distribution once enough samples have been accepted, and as in the Gibbs case, we can use these samples to find the optimal segmentation.
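The acceptance decision in (16) can be sketched as a standalone helper; the probability and proposal-density arguments are stand-ins for whatever model is being sampled, so this is illustrative rather than the dissertation's code:

    import random

    def metropolis_hastings_accept(p_h, p_h_prime, g_forward, g_backward):
        # Accept the proposed hypothesis h' with probability
        # min(1, [P(h'|D) * g(h' -> h)] / [P(h|D) * g(h -> h')]), as in (16).
        acceptance = min(1.0, (p_h_prime * g_backward) / (p_h * g_forward))
        return random.random() < acceptance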

The more general Metropolis-Hastings inference algorithm is useful under certain conditions, for example when the equivalent of b_ij, unlike the binary choice above, has too many possibilities to efficiently enumerate, or when we wish to jump between hypotheses that differ by more than a single variable. This will be the case when we move towards sampling from our generative model for non-concatenative morphology, where the word formation mechanism is intercalation between root and residue.

A good choice of proposal distribution is the key to the efficient use of this technique. While the algorithm is guaranteed to return a Markov chain that faithfully reflects the posterior distribution no matter what proposal distribution is used, even the uniform distribution, choosing a proposal distribution as close as possible to the true posterior will result in more of the proposals being accepted, yielding more samples in a shorter amount of time than a less informed proposal distribution.

41 2.2 Non-concatenative generative model

In this section, we re-cast the model introduced in 1.3.5 as a hierarchical generative model. This model has the following steps:

(17) Generative model of non-concatenative morphology

a. For each word we wish to generate, pick a template length in terms of segments n for that word from a Poisson distribution.²
b. Draw a template of the appropriate length n from a template lexicon governed by a Pitman-Yor process.
   (i) If we decide to draw a new template type, then for each of the n slots in the template, flip a coin to determine whether it will be a root r or residue s slot.
c. Count the number of r and s slots to determine how long the root and residue for this word will be.
d. Draw a root of the appropriate length from a root lexicon governed by a Pitman-Yor process.
   (i) If we decide to draw a new root, then draw a segment from a uniform distribution over all segments until we reach the desired length.
e. Repeat step (17-d) with residues, drawing a residue of appropriate length from a residue lexicon.
f. Intercalate the root and residue according to the template.

As a concrete example, we might first pick a length of 5 for the template. We then go to the template lexicon, where we have a choice of one existing length-5 template rsrsr, or a completely new length-5 template. Drawing according to the Pitman-Yor probabilities, we happen to pick rsrsr. This means that our root will

2For the purposes of inference, this is unnecessary as each word in our dataset has a given length. However, in Chapter 3 we will run this generative model "forward" to obtain a typological distribution, which requires generation of word lengths; hence I make this step explicit here.

be of length 3 and the residue of length 2.³ We then move to the root lexicon and choose among the existing length-3 roots, or have the option to draw a new root of that length; we pick a root ktb. Similarly, we pick a length-2 residue ui. We then intercalate according to the template and obtain the word kutib. This is illustrated in the following diagram.

(18) Pictorial representation of our generative model for Arabic verb stems

[Diagram: the template lexicon (rsrsr, ..., new) supplies the template r s r s r, which fixes a root of length 3 and a residue of length 2; the root lexicon (ktb, drs, new) supplies ktb and the residue lexicon (aa, ui, new) supplies ui; intercalation yields kutib.]
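A minimal sketch of running this generative process forward is given below. It is illustrative only: the Pitman-Yor lexicons are approximated by simple Chinese-restaurant-style caches, the segment inventory and all names are invented for the example, and the Poisson draw is implemented directly to keep the sketch self-contained.

```python
import math
import random
from collections import Counter, defaultdict

SEGMENTS = "ktbdrsfqlmnaiu"   # toy segment inventory, purely illustrative

def crp_draw(cache, make_new, alpha=1.0):
    """Simplified Chinese-restaurant draw standing in for a Pitman-Yor lexicon:
    reuse an existing type proportional to its count, or create a new type
    with probability alpha / (n + alpha)."""
    n = sum(cache.values())
    if n and random.random() < n / (n + alpha):
        item = random.choices(list(cache), weights=list(cache.values()))[0]
    else:
        item = make_new()
    cache[item] += 1
    return item

def sample_poisson(lam):
    """Poisson draw via Knuth's method (avoids external dependencies)."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while p > threshold:
        k += 1
        p *= random.random()
    return k - 1

def generate_word(tmpl_lex, root_lex, res_lex, lam=5):
    n = max(1, sample_poisson(lam))                                   # step (17-a)
    template = crp_draw(tmpl_lex[n],                                  # step (17-b)
                        lambda: "".join(random.choice("rs") for _ in range(n)))
    r_len, s_len = template.count("r"), template.count("s")           # step (17-c)
    root = crp_draw(root_lex[r_len],                                  # step (17-d)
                    lambda: "".join(random.choice(SEGMENTS) for _ in range(r_len)))
    residue = crp_draw(res_lex[s_len],                                # step (17-e)
                       lambda: "".join(random.choice(SEGMENTS) for _ in range(s_len)))
    r_it, s_it = iter(root), iter(residue)                            # step (17-f)
    return "".join(next(r_it) if slot == "r" else next(s_it) for slot in template)

template_lex, root_lex, residue_lex = (defaultdict(Counter) for _ in range(3))
words = [generate_word(template_lex, root_lex, residue_lex) for _ in range(20)]
```

Because templates, roots, and residues are drawn from cached lexicons, reusing an existing morpheme becomes increasingly cheap as its token count grows, which is the source of the parsimony bias discussed earlier.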

2.2.1 Inference

Inference over this model was performed with Metropolis-Hastings sampling. We started with random templates being assigned to each word in the dataset. For each word in the corpus, the algorithm sampled a new template, root, and residue for the word by removing its current template, root, and residue from the three respective lexicons. Then we considered all the templates of the appropriate length currently in the template lexicon, plus one new, randomly sampled template.

³This is a design decision; we could also pick any root or residue of any length, and discard extra segments if too long or iterate to fill the template if too short, as is proposed in McCarthy (1979, 1981). More on this in Chapter 4.

For example, if the word in question was katab, we would consider all the length-5 templates in the template lexicon; for example, there might be three existing templates of length 5: rsrsr, rssss and rsrrr. Besides this, we consider the possibility of adding a fourth template of length 5, and sample an actual template for it, say rssrs.

We now need a distribution over this set of four hypothesised templates: our proposal distribution. In order to bias our choice towards templates that are more likely, we consider the probability of adding the template, the resultant root, and the resultant residue to the three respective lexicons, according to the Pitman-Yor distributions, including the probability of creating the template, root, or residue if entirely new.

The sum of these probabilities will not be 1, so we renormalise to get a probability distribution, then sample a template from this distribution. This will be our new hypothesis, and the renormalised probability with which it was picked is g(h → h'). We calculate g(h' → h) in the reverse way, calculating the likelihood of proceeding from the new hypothesis to the old hypothesis with which we began.

We then calculate the Metropolis-Hastings criterion and flip a coin with that probability as to whether to accept the new hypothesis. If we do, we add the new template, resultant root and residue to their respective lexicons. If not, we revert to the old template, root and residue, and proceed to the next word. At the end of each "sweep" through the data, we capture a snapshot of the current segmentation and add that to our posterior distribution.

As a result of incorporating downstream information into the proposal distribution, we boost the probability of trying out hypotheses that have a higher P(h') and thus are more likely correct. For instance, if ktb has a high token frequency in the current lexicon, the probability of picking the template rsrsr from the proposal distribution will be higher for the word katab. This reduces the average number of proposals we discard before accepting a sample, reducing processing time for the algorithm.
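The construction of this proposal distribution for a single word can be sketched as follows. The scoring function is left abstract: lexicon_score is a hypothetical stand-in for the Pitman-Yor probabilities of adding the template, root, and residue to their lexicons; the point of the sketch is only the renormalisation and the computation of g(h → h').

```python
import random

def propose_template(word, candidate_templates, lexicon_score):
    """Sample a new template for `word` from the informed proposal distribution.

    `candidate_templates` holds the existing templates of the right length plus
    one newly sampled template; `lexicon_score(word, template)` is assumed to
    return the (unnormalised) probability of adding that template and the
    root/residue it induces to their respective lexicons.
    """
    weights = [lexicon_score(word, t) for t in candidate_templates]
    total = sum(weights)
    probs = [w / total for w in weights]            # renormalise to a distribution
    index = random.choices(range(len(candidate_templates)), weights=probs)[0]
    g_forward = probs[index]                        # g(h -> h') for the MH criterion
    return candidate_templates[index], g_forward
```

The backward probability g(h' → h) is obtained the same way, by scoring the word's previous template against the lexicons as they stand after the proposed change.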

2.3 Related work

There has been considerable work on unsupervised morphological learning and segmentation (see Hammarström and Borin (2011); Goldsmith et al. (2017) for recent overviews), as well as work on computational representations of Semitic templatic morphology, such as Kataja and Koskenniemi (1988)'s pioneering work showing that Semitic roots and patterns could be described using regular languages, implemented by Beesley (1991) and others.

However, until recently, most unsupervised learning has been restricted to concatenative morphology, with generative models within the Bayesian approach presupposing concatenation as the only mode of word formation, and inference techniques relying on the assumption of concatenation to search the space of possible segmentations. Where Bayesian morphological segmentation techniques have been applied to non-concatenative languages like Arabic (e.g. Lee et al. (2011); Narasimhan et al. (2015)), they have been applied to unvowelled text, since standard Arabic orthography drops short vowels. This has the effect of reducing the problem largely to one of concatenative morphology.

A specific area relevant to the present study is the automatic extraction of roots from Arabic and Hebrew words, vowelled and unvowelled depending on the study. This has been of particular interest as a preprocessing step to aid in lexicography and information retrieval (see Al-Shawakfa et al. (2010) for an overview). Initial techniques presupposed the existence of a lexicon (e.g. Buckwalter (2002)), or induced constraints from a training set of words and their corresponding roots (e.g. Daya et al. (2004); Rodrigues and Cavar (2007)).

A more unsupervised approach to this problem comes from De Roeck and Al-Fares (2000), which clusters together words of the same root by taking advantage of the increased degree of overlap between their sets of character n-grams, after one strips out the affixes. The only hard-coded information was a list of affixes and some phonological knowledge to help specifically with weak roots, neither of which we tackled in our current model.

The broader problem of joint morphological learning over templates, roots, and all other morphemes has only been tackled more recently, in the work reported in Fullwood and O'Donnell (2013) and in this dissertation, as well as in Botha and Blunsom (2013), which also uses the framework of Bayesian segmentation of Goldwater et al. (2009) and additionally implements a richer grammatical formalism, simple range concatenation grammars, which are capable of capturing both templatic and affixal morphology. However, their method relies on having a limited set of non-concatenative templates that is supplied to the learner, unlike our model in which the templates, roots, and residues (though not affixes) are all learned in an unsupervised manner.

An alternative morphological learning framework to the Bayesian one is the Minimum Description Length-based (MDL) tradition, of which Linguistica (Goldsmith, 2000, 2001) is the foremost exemplar. Though Linguistica is still primarily geared towards concatenative morphology, work such as Xanthos (2018) seeks to extend the approach to templatic morphology. Xanthos' Arabica program takes a two-step approach, first partitioning the segment inventory into consonants and vowels, then using this information to iteratively decompose singular and plural nouns into roots (the consonants) and patterns (the vowels), using the MDL metric and other, more local, heuristics to drive the decompositions into an optimal state. Therefore, while this work does not have an explicitly-encoded consonant and vowel inventory, it does rely somewhat on the assumption that roots are primarily consonantal and patterns primarily vocalic.

Other work entirely outside these two traditions includes Khaliq and Carroll (2013), which employs a technique called contrastive rescoring to score each hypothetical root based on the number of patterns that contain it, and conversely scoring hypothetical patterns based on the number of roots that occur with them, calculating this over every possible hypothesis of root and pattern. Xanthos (2018) notes that this algorithm, which was originally applied to undiacriticised (i.e. unvowelled) text, tends to assign short vowels to roots rather than patterns as a result. Lastly, Meyer and Dickinson (2016) use a neural network-based formalism called

Multiple Cause Mixture Models to learn the underlying abstract "morphomes" that cause the surfacing of each segment in a word. This model is general enough to capture both concatenative and non-concatenative morphology; however, in their experiments Meyer and Dickinson (2016) test this technique only on unvowelled Arabic text, and thus do not fully exploit the templatic possibilities of their system.

2.4 Experiments

We applied the morphological model and inference procedure described in 2.2 to two datasets, one of Arabic and one of English.

2.4.1 Data

The Arabic corpus for this experiment consisted of verbal stems derived from the verb concordance of the Quranic Arabic Corpus (Dukes, 2011). All possible active, passive, perfect and imperfect fully-vowelled verbal stems for Forms I-X, excluding the relatively rare Form IX, were generated. We used this corpus rather than a lexicon as our starting point to obtain a list of relatively high-frequency verbs. This list was then filtered through the Buckwalter stem lexicon (Buckwalter, 2002) to obtain only stems that occur in Modern Standard Arabic. This list of stems was further filtered so that only triconsonantal "strong" roots were included. The so-called "weak" roots of Arabic either include a semi-vowel or a doubled consonant and undergo segmental changes in various environments, which cannot be handled by our current generative model.

This process yielded 1563 verbal stems, comprising 427 unique roots, 26 residues, and 9 templates. Each root occurred between 1 and 13 times in the dataset, with the median frequency being 3. The stems were supplied to the sampler in the Buckwalter transliteration.

The English corpus was constructed to be as similar as possible to the Arabic

corpus. All verb forms related to the 299 most frequent lemmas in the Penn Treebank (Marcus et al., 1999) were used, excluding auxiliaries such as might or should. Each lemma thus had up to 5 verbal forms associated with it: the bare form (forget), the third person singular present (forgets), the gerund (forgetting), past tense (forgot), and past participle (forgotten).

This resulted in 1549 verbal forms, comprising 295 unique roots, 108 residues, and 55 templates. The median frequency of roots in this database was 5. CELEX (Baayen et al., 1995) pronunciations for these words were supplied to the sampler in CELEX's DISC transliteration.

Determining what we would count as the canonical, gold-standard templates for the English verbs was less straightforward than in the Arabic case. The following convention was used: the root was the longest subsequence of segments shared by all the forms related to the same lemma. Thus, for the example lemma of forget, the correct template, root and residue were deemed to be:

(19) Correct analyses under the root/residue model for the lemma forget

Word        Transliteration   Template   Root   Residue
forget      f@gEt             rrrsr      f@gt   E
forgets     f@gEts            rrrsrs     f@gt   Es
forgot      f@gQt             rrrsr      f@gt   Q
forgetting  f@gEtIN           rrrsrss    f@gt   EIN
forgotten   f@gQtH            rrrsrs     f@gt   QH

This is a simple extension to the non-contiguous case of the convention in computational morphology that the stem is the longest substring (contiguous subsequence) shared by all words in a paradigm, for example in Goldsmith (2001)'s Linguistica program and Albright and Hayes (2003)'s Minimal Generalization Learner. Using this criterion, the English data contained 55 templates, of which 37 templates were concatenative and 18 non-concatenative. The latter were necessary to accommodate 46 irregular lemmas associated with 254 forms. The following table summarises the key statistics for the two datasets.

(20) Key statistics for the Arabic and English datasets

                        Arabic   English
Number of stems         1563     1549
Roots                   427      295
Residues                26       108
Templates               9        55
Median root frequency   3        5
Mean word length        5.7      6.4
Median word length      5        6
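The gold-standard convention in (19) can be made concrete with a short sketch. The code below is only an illustration of that convention, not the procedure actually used to build the dataset: the pairwise longest-common-subsequence reduction is a heuristic that yields a subsequence common to all forms but is not guaranteed to be the longest one in every paradigm, and the left-to-right template assignment is a similarly simple heuristic.

```python
from functools import reduce

def lcs(a, b):
    """Longest common subsequence of two transliterated forms (dynamic programming)."""
    table = [[""] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a):
        for j, cb in enumerate(b):
            if ca == cb:
                table[i + 1][j + 1] = table[i][j] + ca
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j], key=len)
    return table[-1][-1]

def paradigm_root(forms):
    """A subsequence of segments shared by all forms related to one lemma."""
    return reduce(lcs, forms)

def template_for(word, root):
    """Label each segment r (root) or s (residue), consuming the root left to right."""
    labels, i = [], 0
    for seg in word:
        if i < len(root) and seg == root[i]:
            labels.append("r"); i += 1
        else:
            labels.append("s")
    return "".join(labels)

# The forget paradigm from (19), in the DISC transliteration:
forms = ["f@gEt", "f@gEts", "f@gQt", "f@gEtIN", "f@gQtH"]
root = paradigm_root(forms)                   # -> "f@gt"
print(root, template_for("f@gEt", root))      # -> f@gt rrrsr
```

Each DISC symbol is a single character here, so treating characters as segments is safe for this example.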

2.4.2 Results

We ran 10 instances of the sampler for 200 sweeps through the data. For the Arabic training set, this number of sweeps typically resulted in the sampler finding a local mode of the posterior, making few further changes to the state during longer runs. Evaluation was performed on the final state of each sampler instance.

The correctness of the sampler's output was measured in terms of the accuracy of the templates it predicted for each word. The word-level accuracy indicates the proportion of words that had their entire template correctly sampled, while the segment-level accuracy metric gives partial credit by considering the average number of correct bits (r versus s) in each sampled template.

(21) Example calculation:

                        Word 1      Word 2
Correct Template        rsrsr       rsrsr
Sampled Template        rsrsr       rssrr
Word-level accuracy     1.0         0.0
Segment-level accuracy  5/5 = 1.0   3/5 = 0.6

The accuracy for the MH sampler was then averaged over the ten samples and weighted by each sample's joint probability.

As a baseline, I calculated the accuracy of a set of random templates, each chosen by flipping a fair coin for the choice of root or residue for each slot in the template.

(22) Average accuracy: Arabic

Accuracy         Word-level   Segment-level
MH Sampler       92.3%        98.2%
Random Sampler   4.8%         67.6%

The value of 67.6% for segment-level accuracy in the random case is greater than the expected 50% because our main concern is with correct segmentations: rsrsr and srsrs are both considered correct segmentations for a word with "canonical" template rsrsr. Both the word-level and segment-level accuracies were calculated accordingly.

An identical experimental set-up was used for English, with the addition of a second baseline in the form of applying the GGJ model for word segmentation to English concatenative morphological segmentation, supplying individual words in place of sentences.⁴

(23) Average accuracy: English

Accuracy                      Word-level   Segment-level
MH Sampler (non-concat)       43.9%        85.3%
Random Sampler (non-concat)   1.5%         55.0%
GGJ Sampler (concat)          71.6%        89.7%
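A small sketch of how the two accuracy metrics can be computed, including the root/residue label swap described above (rsrsr and srsrs counting as equivalent), is given below; this is an illustration of the scoring convention rather than the evaluation script itself.

```python
def _swap(template):
    """Exchange the r and s labels, since the two segmentations are equivalent."""
    return template.translate(str.maketrans("rs", "sr"))

def word_accuracy(gold, sampled):
    """1.0 if the sampled template matches the gold template up to a label swap."""
    return 1.0 if sampled in (gold, _swap(gold)) else 0.0

def segment_accuracy(gold, sampled):
    """Fraction of slots labelled correctly, taking the better of the sampled
    template and its label-swapped version."""
    def frac(t):
        return sum(a == b for a, b in zip(gold, t)) / len(gold)
    return max(frac(sampled), frac(_swap(sampled)))

# Word 2 from the example calculation in (21):
print(word_accuracy("rsrsr", "rssrr"))      # 0.0
print(segment_accuracy("rsrsr", "rssrr"))   # 0.6
```

For a length-5 template this symmetric scoring pushes the expected random segment-level accuracy to roughly 0.69 rather than 0.5, which is why the random baseline in (22) sits well above 50%.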

2.4.3 Error analysis: Arabic

The graph in (24) shows the average unweighted accuracy with which each of the nine Arabic templates was sampled.

⁴Using code made available by Sharon Goldwater at http://homepages.inf.ed.ac.uk/sgwater/software/dpseg-1.2.1.tar.gz.

(24) Unweighted accuracy with which each template was sampled

[Bar chart: average unweighted sampling accuracy (%) for each of the nine Arabic templates.]

This graph shows that rarer and longer templates were less likely to be correct. For instance, the performance on template rssrsr (second bar from left) is exceptionally low, but this is the result of there being only one instance of this template in the training set: /Euwqib/, the passive form of the Form III verb of root √Eqb, in the Buckwalter transliteration.⁵

⁵This is an artifact of the Buckwalter transliteration, which induces different templates for the active form /EAqab/ (rsrsr) versus the passive form /Euwqib/ (rssrsr). The passive template for Form III occurred only once in our sampling of the Quran verb concordance.

Furthermore, the longer the word, the poorer the performance of the model. This is likely the result of the difficulty of searching over the space of templates for longer forms. Since the number of potential templates increases exponentially with the length of the form, finding the correct template becomes increasingly difficult.

Interestingly, one frequent error made by the segmenter was to place all consonants, including consonants that theoretically belonged to the residue, in the root. With very few exceptions, segments belonging to the root were never split between the root and residue, but were always clustered into a single morpheme.

The problem of long template lengths can likely be addressed in future models by adopting an analysis similar to McCarthy's whereby the residue is further subdivided into vocalism, prefixes and infixes. When all affixes are removed,

the remaining stem would be relatively short and would result in better non-concatenative segmentation.

2.4.4 Error analysis: English

The dominant pattern of errors made by the non-concatenative sampler was in the direction of overgeneralisation of the concatenative templates to the irregular forms. Out of the 254 words related to a lemma with an irregular past form, 241 received incorrect templates, 232 of which were concatenative, often correctly splitting off the regular suffix where there was one. For example, sing and singing were parsed as sing and sing-ing, while sung was parsed as a separate root. Note that under an analysis of English irregulars as separate memorised lexical items, the sampler behaved correctly in such cases.

However, out of 1295 words related to perfectly regular lemmas, the sampler determined 628 templates incorrectly. Out of these, 325 were given concatenative templates, but with too much or too little segmental material allocated to the suffix. For example, the word invert was analyzed as inver-t, with its other forms following suit as inver-ted, inver-ting and inver-ts. This is likely due to subregularities in the word corpus: with many words ending with -t, this analysis becomes more attractive. Indeed, such reanalyses of morpheme boundaries, also known as "re-cutting", are known to occur historically (Campbell, 1998).

The remaining 303 regular verbs were given non-concatenative templates. For instance, identify was split up into dfy and ienti. No consistent pattern could be discerned from these cases.

Let us contrast this with the results of the GGJ-based concatenative sampler. This had a much higher accuracy overall, and reliably segmented out the common suffixes such as -ed (represented as /t/ and /Id/ in the DISC transliteration), -ing, and -s. Since it was capable of segmenting words into more than two morphemes, it also regularly segmented words like representing as re-present-ing. This was counted as incorrect with respect to the gold standard defined above, but is

actually a more reasonable parse. This also shows that once the words incorrectly identified as non-concatenative by the other sampler, such as identify above, were correctly parsed as concatenative, there was greater evidence found for the affixes -ing, -ed, etc., swinging the balance of the evidence back in favour of reusing the affixes rather than positing new ones.

Moving on to the words with ablaut, the concatenative sampler generally returned parses such as get, got, get-ing, got-en, and get-s, similar to the errors made by the non-concatenative sampler. These were counted as incorrect in the results table above, though these parses are eminently reasonable.

This sampler's only true errors were in the direction of over-segmentation: for example, in words such as bend, the final d was split off as if it were a past tense morpheme. This had the added benefit of allowing bend and bent to share the same "root", ben.

2.5 Discussion and Future Work

The success of Bayesian inference on the proposed computational model at separating roots from residues in Arabic showed that Arabic verbs carry sufficient statistical signal for a model with only a bias towards parsimonious morpheme lexicons to segment out roots from residues. This was in spite of the model having no information on likely templates, and having no knowledge of the consonant-vowel distinction.

The fact that the sampler identified the root as a morpheme in over 90% of the Arabic verbal stems is interesting, as this was not a foregone conclusion. There have been proposals for other possible breakdowns of Arabic words, chief among these being the "etymon" proposal of Bohas (1997), which suggests that roots can be further broken down into biconsonantal morphological units. For example, the following semantically-related words have different roots but share the two consonants {b,t}:

(25) a. batta 'cut off' (Root: bt)
     b. batara 'sever' (Root: btr)
     c. balata 'sever' (Root: blt)
     d. bataka 'separate' (Root: btk)
     e. sabata 'cut down' (Root: sbt)

Boudelaa and Marslen-Wilson (2001) show evidence of morphological priming of the etymon, though follow-up psycholinguistic experiments (Bentin and Frost, 2001; Idrissi and Kehayia, 2004; Mahfoudhi, 2007) have not found evidence of the psychological reality of the etymon.

The fact that triconsonantal roots were identified rather than, say, the biconsonantal etymon is therefore somewhat significant, though not a surprise, as any etymon-based breakdown of the verbal stems would result in a far larger number of possible templates than the very economical nine root-based templates seen in this corpus. A more naturalistic corpus, particularly one that encompasses weak roots, could conceivably shift the preponderance of evidence away from roots. An interesting follow-up experiment, then, would be to repeat this inference experiment on such a corpus. This more naturalistic corpus might also expand from merely the verbal stems to include the nominal stems of Arabic as well.

Unlike in Arabic, the non-concatenative segmenter failed to learn a principled set of English verbal stems and suffixes with a similar quantity of data and training time, despite English verbal morphology theoretically being representable with this generative model and, as seen from the results of the concatenative sampler, there being a robust statistical signal allowing for successful segmentation, but only when we allowed for concatenation to be the mode of word-building.

Overall, these results suggest that to handle morphological segmentation across these classes of languages, we will need to incorporate both concatenation and non-concatenative intercalation as primitive word-building operations into the same model. This would result in better morphological segmentation performance in

English verbs, while also having a mechanism for handling the minority ablaut cases.

Even in non-concatenative Arabic, this model would have the advantage of being able to handle Arabic morphology outside of the stem, which is largely affixal. This may also improve some of the (already quite excellent) segmentation results of the sampler, as the lengths of the templates on which non-concatenative segmentation has to be performed decrease, and the total number of non-concatenative templates that have to be learned decreases. For example, the template for the Form V verb [tafaʕʕal] might be reduced to the shorter template for the Form II verb [faʕʕal] plus an additional prefix.

A dual-mode morphological model raises the question: in a model where both modes of word formation are available, is the learner's default assumption that morphology is concatenative, non-concatenative, or is the learner simply agnostic, letting the data speak for itself? The question of whether human learners have a bias towards concatenativity is addressed in Chapter 3.

In addition, I investigate in Chapter 3 whether there is psycholinguistic evidence for a bias based on the consonant-vowel distinction to be incorporated into the model. We saw in this chapter that this knowledge was not necessary to successfully segment Arabic, but in examining the behaviour of human learners, do we find that they have a bias towards roots that are consonantal and residues that are vocalic, as we see in Arabic?

Chapter 3

Artificial Grammar Experiment

The computational model of Arabic non-concatenative morphology presented in the previous chapter is extremely simple compared with the linguistic models proposed in the morphology literature. It also makes predictions that are contradicted by the typology of morphological systems extant in the world's languages. In this chapter, I demonstrate this fact, and further show that Prosodic Morphology, a leading theory for handling non-concatenative morphology, suffers from similar typological problems.

I propose that this demonstrates the need to build in biases in favour of concatenative morphology as well as preferences for root-template morphology to be based on consonantal roots interleaved with vocalic morphemes. I then test for these biases via an artificial grammar experiment in which participants are trained on words from three "alien" languages: one concatenative, one Arabic-like, and one unattested non-concatenative language in which roots and residues are composed of a mixture of consonants and vowels. Participants are then tested on their recognition of new words, their performance reflecting the degree of morphological decomposition they undertook in each of the three languages.

3.1 Pathological predictions

A morphological system generated entirely by the proposed computational model has an extremely low probability of being entirely concatenative. The first step of the model involves selecting a template by tossing a coin for each segmental slot to decide whether it should be a root or residue segment. Concatenative templates would be those consisting entirely of a sequence of 'r's followed by 's's, or vice versa.

(1) Number of templates of length 5 (Note: only stems beginning with root segments are shown below; the "mirror images" of these are also possible, e.g. a prefixing counterpart to 'rrrss' would be 'ssrrr', and so on.)

Concatenative   Infixing   Interleaving
rrrrs           rrrsr      rrsrs
rrrss           rrsrr      rsrrs
rrsss           rsrrr      rsrsr
rssss           rrssr      rsrss
                rssrr      rssrs
                rsssr

Observe that for templates of length 5, the likelihood of drawing a non-concatenative template exceeds that of drawing a concatenative template. The imbalance towards non-concatenative templates becomes more acute as the number of segments grows:

(2) Number of templates of length 6

Concatenative   Infixing   Interleaving
rrrrrs          rrrrsr     rrrsrs
rrrrss          rrrsrr     rrsrrs
rrrsss          rrsrrr     rrsrsr
rrssss          rsrrrr     rrsrss
rsssss          rrrssr     rrssrs
                rrssrr     rsrrrs
                rssrrr     rsrrsr
                rrsssr     rsrrss
                rsssrr     rsrsrr
                rssssr     rsrsrs
                           rsrssr
                           rsrsss
                           rssrrs
                           rssrsr
                           rssrss
                           rsssrs

To illustrate this more concretely, I ran the computational model forward and generated 1000 systems of 20 tokens each, drawing a segmental length for each token from a Poisson distribution with λ = 5, indicating that the most frequent token length should be 5 segments. There was an equal probability for each slot in a template to be a root slot r or residue slot s. Of the 1000 systems generated, 998 produced at least one interleaving word, while in 142, more than half the words in the system were interleaving, a gross over-generation of root-and-pattern-style non-concatenativity that is contra the typology of known languages. Furthermore, the computational model has no built-in concept of consonants and vowels, unlike McCarthy (1979, 1981) in which this distinction was crucial, instead picking a random segment for each segmental slot in a root or residue. This was modelled in the simulation by picking a 'C' or 'V' to go into each slot by flipping a fair coin. As a result, the majority of the non-concatenative words generated were a mixture of C's and V's. All 998 systems with at least one non-

concatenative word had at least one "unattested" intercalation of this kind. In 115 systems, the majority of tokens were of this sort. To my knowledge, however, systems with a systematic interpolation of C's and V's are unattested. Non-concatenative systems, including ablaut, generally involve vocalic alternations amid a consonantal skeleton.
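A simplified version of this forward simulation can be sketched as follows. It ignores the Pitman-Yor reuse of templates (every token's template is an independent set of coin flips), so it is only an approximation of the model actually simulated; the classification of templates follows the three-way division in (1) and (2).

```python
import math
import random

def classify(template):
    """Concatenative: at most one r/s boundary. Infixing: one morpheme forms a
    single block strictly inside the other (two boundaries). Interleaving: more."""
    boundaries = sum(a != b for a, b in zip(template, template[1:]))
    return ("concatenative" if boundaries <= 1
            else "infixing" if boundaries == 2
            else "interleaving")

def sample_poisson(lam):
    threshold, k, p = math.exp(-lam), 0, 1.0
    while p > threshold:
        k += 1
        p *= random.random()
    return k - 1

with_interleaving = majority_interleaving = 0
for _ in range(1000):                                   # 1000 systems
    kinds = []
    for _ in range(20):                                 # 20 tokens per system
        n = max(1, sample_poisson(5))
        kinds.append(classify("".join(random.choice("rs") for _ in range(n))))
    with_interleaving += "interleaving" in kinds
    majority_interleaving += kinds.count("interleaving") > 10
print(with_interleaving, majority_interleaving)
```

Even this stripped-down version produces at least one interleaving word in nearly every system, illustrating the over-generation; its exact counts will differ from those reported above because template reuse is omitted.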

It is tempting to ascribe these pathological predictions to the lack of linguistic sophistication of the model in question, in particular its arbitrary generation of non-concatenative templates. Under a theory of morphology in which non-concatenativity must be motivated, such non-concatenativity should be limited. However, we shall see that even a state-of-the-art theory of non-concatenative morphology such as Prosodic Morphology (McCarthy and Prince, 1993b), which only permits non-concatenativity when concatenativity is overridden by phonotactic requirements, makes similar pathological predictions.

3.1.1 Predictions of Prosodic Morphology

In Prosodic Morphology (McCarthy and Prince, 1993b, et seq.), which is implemented under the rubric of Optimality Theory (henceforth OT; Prince and Smolensky (1993) et seq.), concatenativity is enforced by the constraint CONTIGUITY.

(3) a. CONTIGUITY: Assign a * for every pair of segments that are contiguous in an input morpheme that are not contiguous in the output

Just as with any OT constraint, CONTIGUITY is violable and can be over-ridden by a markedness constraint that is higher-ranked. For example, in Tagalog the morpheme um infixes into the verb after the first consonant (cluster), despite this violating CONTIGUITY.

(4) um infixes into the stem of Tagalog verbs (French, 1988; McCarthy and Prince, 1993a):

a. sulat → sumulat 'write'

b. gradwet → grumadwet 'graduate'

This is because ONSET, which requires that every syllable begin with a consonant, is undominated in Tagalog, overriding the violation that the infixed form incurs from CONTIGUITY.

(5) ONSET >> CONTIGUITY

/um, sulat/       ONSET   CONTIGUITY
a.    umsulat      *!
b. ☞  sumulat               *

This ranking currently does not distinguish between the unattested candidate *[sulumat] and the actual winning candidate [sumulat], which incur exactly the same set of violations. McCarthy and Prince (1993a) hypothesise that the infix um is really a prefix, and wants to be as close to the front of the word as possible. An alignment constraint applies to um, requiring it to be as close to the left edge of the stem as possible.

(6) ALIGN(um-, LEFT, WORD, LEFT): Assign a * for every segment that intervenes between the left edge of um- and the left edge of the word.

If this were unviolated, then *[umsulat] would be the winner. However, like CONTIGUITY, it too is dominated by ONSET. This ranking gives us the correct prediction:

(7) ONSET >> CONTIGUITY, ALIGN(um-, LEFT, WORD, LEFT)

/um, sulat/       ONSET   CONTIGUITY   ALIGN-um-LEFT
a.    umsulat      *!
b. ☞  sumulat               *            *
c.    sulumat               *            ***!

The um- alignment constraint is part of a larger family of ALIGN constraints of the following general form:

(8) ALIGN(Cat1, Edge1, Cat2, Edge2, SCat): Assign a * for every SCat (segment, syllable, etc.) that intervenes between Edge1 (Left or Right) of Cat1 and Edge2 (Left or Right) of Cat2. (McCarthy and Prince, 1993a)

This constraint ensures alignment to the edge of some prosodically-prominent Cat2 such as stem or prosodic word edges or adjacent to a stressed syllable, and can be applied not only to non-concatenative morphology such as infixation and reduplication, but also to other important linguistic phenomena such as metrical stress.

What kind of typology do the constraints that make up the typical Prosodic Morphology analysis actually predict? To answer this question, let us examine a factorial typology with the following six constraints: one enforcing contiguity, one phonotactic constraint, and four ALIGN constraints.

(9) a. CONTIGUITY: Assign a * for every pair of segments that are contiguous in an input morpheme that are not contiguous in the output (=(3-a))
    b. C//V: Assign a * for every consonant that is not adjacent to a vowel in the output
    c. ALIGN(ROOT, LEFT, STEM, LEFT): Assign a * for every segment that intervenes between the left edge of the root and the left edge of the stem.
    d. ALIGN(RESIDUE, LEFT, STEM, LEFT): Assign a * for every segment that intervenes between the left edge of the residue and the left edge of the stem.
    e. ALIGN(ROOT, RIGHT, STEM, RIGHT): Assign a * for every segment that intervenes between the right edge of the root and the right edge of the stem.
    f. ALIGN(RESIDUE, RIGHT, STEM, RIGHT): Assign a * for every segment that intervenes between the right edge of the residue and the right edge of the stem.

To keep the typology manageable, we will consider only the following possible inputs and candidates.

(10) a. Inputs: Combinations of roots of length 3 and residues of length 2 that, between them, consist of three consonants and two vowels, to match the stimuli in the experiment to be discussed below. There are (5 choose 2) = 10 unique inputs of this sort.
     b. Candidates: All interpolations of the input root and residue that respect MAX, DEP, and LINEARITY: no deletions, insertions, or re-ordering of segments within the root and residue, to match the intercalation scheme used in the computational model. There are (5 choose 3) = 10 candidates for each of the inputs.

Take the following sample tableau. Note that this constraint ranking, with a fairly innocuous pair of input morphemes of shapes CCV and VC, results in an unattested systematic non-concatenative intercalation of morphemes.

(11) Tableau: C//V, ALIGN-RT-L, ALIGN-RES-R >> CONTIGUITY, ALIGN-RT-R, ALIGN-RES-L (latter two omitted for space).

/C1C2V1/ + /V2C3/     C//V   ALIGN-RT-L   ALIGN-RES-R   CONTIG
a.    C1C2V1V2C3       *!
b. ☞  C1V2C2V1C3                                          **
c.    C1V2C2C3V1                            *!            ***
d.    V2C1C2V1C3               *!                         *

We can go beyond this particular example to compute the full factorial typology of 6! = 720 possible rankings of these constraints.¹ By the hypothesis of Richness of the Base (Prince and Smolensky, 1993), every grammar should be able to handle every possible input. We consider the ten inputs defined in (10-a). Each of these in turn has ten possible candidates defined in (10-b), intercalations of the root and residue that respect linear order within each input morpheme. Computing the winning candidate out of these ten, for each of the 10 inputs per ranking, we find that still:

(12) a. 444 of the 720 rankings (62%) resulted in at least one non-concatenative interleaving of the residue into the root (not counting infixing forms).
     b. 296 of the 720 rankings (41%) resulted in the majority of possible inputs into that ranking yielding a non-concatenative winner.
     c. 440 of the 444 non-concatenative rankings (99%) had at least one unattested interleaving of a residue consisting of a mixture of C's and V's into a root consisting of a mixture of C's and V's.
     d. 244 of the rankings resulted in the majority of interleavings being of the unattested type.

Thus, we find that despite concatenativity being the default in Prosodic Morphology, this framework still vastly over-generates non-concatenative morphologies, and, moreover, over-generates the "unattested" morphologies where mixtures of consonants and vowels are interleaved together.

¹Scripts and detailed results for all simulations are available online at https://github.com/michelleful/DissertationMaterials/overgeneration-pathologies.

This is because there is really only one constraint militating for concatenativity: CONTIGUITY. ALIGN only addresses edges and does not concern itself with maintaining the intactness of morphemes. Thus the exigencies of phonotactic constraints such as C//V, and the requirements of ALIGN constraints, can and do override CONTIGUITY to produce a wide array of non-concatenative and unattested candidates.
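The factorial typology count can be reproduced in outline with the sketch below. This is an illustrative re-implementation, not the scripts referenced in footnote 1: the constraint definitions follow (9), inputs and candidates follow (10), ties between candidates with identical violation profiles are broken arbitrarily, and the classification of interleaving winners mirrors the sketch given earlier in this section.

```python
from itertools import combinations, permutations

CONSTRAINTS = ["C//V", "ALIGN-RT-L", "ALIGN-RES-L", "ALIGN-RT-R", "ALIGN-RES-R", "CONTIG"]

def inputs():
    """Roots of length 3 and residues of length 2 containing 3 C's and 2 V's."""
    for v_slots in combinations(range(5), 2):
        segs = ["V" if i in v_slots else "C" for i in range(5)]
        yield "".join(segs[:3]), "".join(segs[3:])

def candidates(root, residue):
    """All intercalations preserving order within root and residue."""
    n = len(root) + len(residue)
    for root_slots in combinations(range(n), len(root)):
        tmpl = ["r" if i in root_slots else "s" for i in range(n)]
        r_it, s_it = iter(root), iter(residue)
        word = [next(r_it) if t == "r" else next(s_it) for t in tmpl]
        yield "".join(tmpl), "".join(word)

def violations(tmpl, word):
    def broken_pairs(label):       # CONTIGUITY violations for one morpheme
        idx = [i for i, t in enumerate(tmpl) if t == label]
        return sum(b != a + 1 for a, b in zip(idx, idx[1:]))
    def edge(label, left):         # gradient ALIGN violations
        idx = [i for i, t in enumerate(tmpl) if t == label]
        return min(idx) if left else len(tmpl) - 1 - max(idx)
    c_v = sum(word[i] == "C" and "V" not in word[max(0, i - 1):i + 2]
              for i in range(len(word)))
    return {"C//V": c_v, "CONTIG": broken_pairs("r") + broken_pairs("s"),
            "ALIGN-RT-L": edge("r", True), "ALIGN-RES-L": edge("s", True),
            "ALIGN-RT-R": edge("r", False), "ALIGN-RES-R": edge("s", False)}

def winner(ranking, root, residue):
    cands = list(candidates(root, residue))
    return min(cands, key=lambda c: [violations(*c)[con] for con in ranking])[0]

def interleaving(tmpl):
    return sum(a != b for a, b in zip(tmpl, tmpl[1:])) > 2

non_concat = sum(any(interleaving(winner(ranking, r, s)) for r, s in inputs())
                 for ranking in permutations(CONSTRAINTS))
print(non_concat, "of 720 rankings yield at least one interleaving winner")
```

Because tie-breaking differs, the numbers this sketch produces need not match (12) exactly; the qualitative over-generation is what it is meant to illustrate.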

When we have a case of over-generation in Optimality Theory, we can either dismiss it as an accidental gap, or try to constrain the over-broad typology to produce only the observed set of phenomena by redefining constraints or asserting fixed rankings between them.

In the next two subsections I look at two proposals from the literature addressing the domain of non-concatenative phenomena that potentially fall into the latter category: McCarthy's (2003) proposal to replace gradient ALIGN constraints with categorical equivalents; and Zukoff's (2017) Mirror Alignment Principle proposal, which says that when X c-commands Y in the syntax, ALIGN(X) >> ALIGN(Y) in the phonology. Let us see how these proposals affect the typology above.

3.1.2 Replacing gradient ALIGN with categorical ANCHOR

McCarthy (2003) proposes that all OT constraints should be categorical, that is, for every "instance of [...] marked structure or [...] unfaithful mapping that the constraint proscribes", exactly one violation mark is assigned.

The definition of alignment that we have been using thus far is not categorical, under this definition, because the structure whose markedness is under consideration is the position of the morpheme edge. When the morpheme edge is multiple segments away from the word edge, multiple violation marks are incurred, violating McCarthy's definition of categoricality.

McCarthy (2003) claims that gradient ALIGN over-generates infixation typologies. Returning to the problem of Tagalog um- infixation, there is a class of verbs for which such infixation is proscribed: those that begin with /(s)m/ and /w/:

(13) Verbs beginning with labial consonants do not permit um- infixation (Orgun and Sprouse, 1999)

a. mahal   *mumahal   'to become expensive'
b. smajl   *summajl ~ *smumajl   'to smile'
c. walow   *wumalow   'to wallow'

Orgun and Sprouse (1999) attribute this to the action of OCP[labial], which bans two successive sonorant labials. As both they and McCarthy (2003) point out, however, it should be possible for the prefix to move even deeper into the verb stem in order to avoid violating OCP[labial], for example:

(14) Deeper infixation is still not tolerated, despite not violating OCP[labial].

a. *mahumal
b. *mahalum

The null output ∅ has to win out over such candidates, yet it cannot while ALIGN is gradient:

(15) If ALIGN-um-L >> MPARSE (the only constraint violated by the null output ∅), then no infixation is possible:

/um, sulat/       ONSET   OCP(labial)   ALIGN-um-L   CONTIG   MPARSE
a.    umsulat      *!
b.    sumulat                             *!           *
c. ☞  ∅                                                         *

(16) Yet if MPARSE >> ALIGN-um-L, then *[mahumal] should be possible:

/um, mahal/       ONSET   OCP(labial)   MPARSE   ALIGN-um-L   CONTIG
a.    ummahal      *!
b.    mumahal               *!                     *
c. ☞  mahumal                                      ***          *
d.    ∅                                   *!

McCarthy's approach to resolving this ranking paradox is to argue that ALIGN is not gradient, but should be replaced with two anchor constraints: PREFIX-um-Seg, which requires that the prefix not be preceded by any segment, and PREFIX-um-σ, which requires that the prefix not be preceded by any syllable. Splitting the work of ALIGN into two categorical constraints that can be ranked on either side of MPARSE resolves the problem:

(17) PREFIX-um-σ >> MPARSE >> PREFIX-um-Seg permits the infix to move just inside the word boundary...

/um, sulat/       ONSET   OCP(labial)   PREFIX-um-σ   MPARSE   PREFIX-um-Seg
a.    umsulat      *!
b. ☞  sumulat                                                     *
c.    ∅                                                  *!

(18) ... and simultaneously allows the null candidate to win over deeper infixation:

/um, mahal/       ONSET   OCP(labial)   PREFIX-um-σ   MPARSE   PREFIX-um-Seg
a.    ummahal      *!
b.    mumahal               *!                                    *
c.    mahumal                             *!                      *
d. ☞  ∅                                                  *

More generally, the categorical correspondent to ALIGN is ANCHOR (originally defined in McCarthy and Prince (1995)), which is defined as follows:

(19) a. ANCHOR(ROOT, STEM, LEFT): Assign a * if the left edge of the root does not coincide with the left edge of the stem.
     b. ANCHOR(RESIDUE, STEM, LEFT): Assign a * if the left edge of the residue does not coincide with the left edge of the stem.
     c. ANCHOR(ROOT, STEM, RIGHT): Assign a * if the right edge of the root does not coincide with the right edge of the stem.
     d. ANCHOR(RESIDUE, STEM, RIGHT): Assign a * if the right edge of the residue does not coincide with the right edge of the stem.
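In terms of the factorial-typology sketch above, the move from gradient ALIGN to categorical ANCHOR amounts to capping each alignment constraint's violation count at one. A minimal, purely illustrative way to express this, reusing the hypothetical template representation from that sketch:

```python
def align(tmpl, label, left=True):
    """Gradient ALIGN: one * per segment between the morpheme edge and the stem edge."""
    positions = [i for i, t in enumerate(tmpl) if t == label]
    return min(positions) if left else len(tmpl) - 1 - max(positions)

def anchor(tmpl, label, left=True):
    """Categorical ANCHOR: a single * if the two edges fail to coincide at all."""
    return int(align(tmpl, label, left) > 0)
```

Swapping align for anchor in the four alignment constraints and re-running the 720 rankings corresponds to the computation behind the counts reported in (20).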

Ruling out gradient ALIGN does not merely make it possible to explain the facts of Tagalog infixation while forbidding infixes from moving arbitrarily deep into the stem to avoid incurring markedness violations; more generally, shifting to categorical ANCHOR-type constraints rules out unattested "slight" misalignments of morphology and prosody, which is "generally all or nothing" (McCarthy, 2003).

Turning back to our factorial typology and replacing the four gradient ALIGN constraints with their categorical ANCHOR counterparts, however, we find that:

(20) a. 390 of the 720 rankings (54%) result in at least one non-concatenative output.
     b. 136 of the 720 rankings (19%) result in more than half of their inputs producing non-concatenative outputs.
     c. 360 of the 390 non-concatenative rankings (92%) include at least one unattested output with mixed-consonant-and-vowel roots interleaving with mixed consonant-and-vowel residues.
     d. 60 rankings have more than half of their inputs producing unattested-type outputs.

While the percentages of "bad" rankings decreased, the replacement of gradient ALIGN with categorical ANCHOR still over-predicts the occurrence of non-concatenative morphologies, and is insufficient to exclude unattested interleavings of CV-roots and CV-residues, as in the following example:

(21) ANCHOR-RES-R, ANCHOR-RT-L >> C//V >> CONTIGUITY, ANCHOR-RES-L, ANCHOR-RT-R (latter two omitted for space)

/C1C2V1/ + /V2C3/     ANCHOR-RES-R   ANCHOR-RT-L   C//V   CONTIG
a.    C1C2V1V2C3                                     *!
b. ☞  C1V2C2V1C3                                             **
c.    C1V2C3C2V1       *!                                    *
d.    V2C1C2V1C3                       *!                    *

(Sidenote: the definition of CONTIGUITY was not altered, as it is categorical under McCarthy's definition. This faithfulness constraint refers to three separate contiguity relations between, in the case of this tableau, C1 // C2, C2 // V1 and V2 // C3, incurring only one possible violation mark for each.)

We have seen, then, that moving to categorical constraints does not rule out unattested typologies. We turn now to a proposal that fixes the ordering between some ALIGN constraints, to see if that fares any better.

3.1.3 Predictions of Mirror Alignment-induced fixed rankings

Zukoff (2017a,b) proposes the Mirror Alignment Principle (MAP), a formal mechanism within Optimality Theory that implements Baker (1985)'s Mirror Principle:

(22) a. The Mirror Principle: Morphological derivations must directly reflect syntactic derivations. (Baker, 1985)
     b. If α asymmetrically c-commands β, then ALIGN-α >> ALIGN-β.

With the MAP, Zukoff (2017a) seeks to explain the following ranking paradox in Arabic. Four of the Arabic verbal forms include the morpheme /t/, usually termed a 'reflexive' morpheme. In three of the four forms, /t/ is prefixal, but in one, Form VIII, it is infixal.

(23) a. Forms where /t/ is prefixal, with the example root √ktb 'to write':
        (i) Form V: takattaba (Reflexive + Causative)
        (ii) Form VI: takaataba (Reflexive + Applicative)
        (iii) Form X: (?i)staktaba (Causative + Reflexive)
     b. Form where /t/ is infixal:
        (i) Form VIII: ?iktataba (Reflexive)

In the former three forms, then, ALIGN-REFLEXIVE-L must dominate ALIGN-ROOT-L. However, in the infixal Form VIII, ALIGN-ROOT-L >> ALIGN-REFLEXIVE-L.

There are various proposals for how to bypass this ranking paradox: some authors assume the infixal /t/ is in fact a separate morpheme altogether (e.g. Ussishkin (2003)), dissolving the paradox; a different approach, as in Tucker (2010), is to assume that Form VIII is special, in line with McCarthy (1981)'s Eighth Binyan Flop rule, and introduce a special alignment constraint for when this morpheme occurs in Form VIII.

Zukoff claims instead that the positioning of the prefixes is a consequence of the syntactic derivation of these forms. In Forms V, VI, and X, the /t/ morpheme c-commands the root, and thus ALIGN-REFL-LEFT >> ALIGN-ROOT-LEFT. In Form

VIII, on the other hand, the root and reflexive are sisters, and so neither stands in an asymmetric c-command relation to the other.

(24) Zukoff's proposed syntactic structures for reflexive-containing words

a. Form V takattaba: [Tree: the Reflexive head dominates the Causative, which dominates the Root; /t/ asymmetrically c-commands /ktb/.]
b. Form VIII (?i)ktataba: [Tree: the Reflexive /t/ and the Root /ktb/ are sisters.]

In this case, Zukoff argues, there is a fallback to the following default ranking for Arabic:

(25) When the MAP provides no ranking statement, ALIGN-ROOT-L is top-ranked by default. (Zukoff, 2017a)

This resolves the ranking paradox, and also offers an avenue to narrow our typology by fixing some of the rankings.

Let us suppose first that the residue is prefixal and c-commands the root, meaning that ALIGN-RES-L is active and, per the MAP, dominates ALIGN-ROOT-L. ALIGN-RES-R is inactive. I assume that ALIGN-ROOT-R is still active and can occur anywhere within the ranking. With only 5 constraints active, we have 5! = 120 rankings, half of which respect ALIGN-RES-L >> ALIGN-ROOT-L, so there are 60 rankings to consider. Of these:

(26) a. 35 of the 60 rankings (58%) yield at least one non-concatenative output.
     b. 21 of the 60 (35%) have more than half of their inputs yielding non-concatenative outputs.
     c. All 35 non-concatenative rankings yield at least one unattested-type output.
     d. 15 rankings yield more than half unattested outputs.

Due to symmetry, we obtain the same numbers when the residue is assumed to be a suffix, with ALIGN-RES-R >> ALIGN-ROOT-R, and ALIGN-RES-L inactive.
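Restricting the typology to the MAP-compatible rankings is a simple filter over permutations of the five active constraints. A minimal sketch (constraint names as in the earlier sketches; illustrative only):

```python
from itertools import permutations

ACTIVE = ["CONTIG", "C//V", "ALIGN-RES-L", "ALIGN-RT-L", "ALIGN-RT-R"]

# Keep only rankings consistent with the MAP-derived statement
# ALIGN-RES-L >> ALIGN-RT-L (the prefixal residue c-commands the root).
map_rankings = [r for r in permutations(ACTIVE)
                if r.index("ALIGN-RES-L") < r.index("ALIGN-RT-L")]

print(len(map_rankings))   # 60 of the 5! = 120 rankings survive the filter
```

Evaluating the ten inputs under each of these 60 rankings, as in the earlier factorial-typology sketch, is how counts like those in (26) can be obtained.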

In this section, we found that despite having very different views on how non-concatenativity and consonant-vowel interactions arise, both the simple computational model of Chapter 2 and state-of-the-art Prosodic Morphology give rise to pathological predictions. Two attempts to correct the factorial typology, first by altering the definition of the ALIGN constraint, and second by fixing some of the rankings, did not resolve matters for Prosodic Morphology.

In this chapter, I argue instead that, first, there must be an external bias for concatenativity. Second, when the pressures of morphology result in interleaving morphemes as in Arabic, there is a bias for roots to be consonantal and for the interleaved material to be vocalic, henceforth called the CV-bias.

These biases may be diachronic or synchronic. If diachronic, or non-existent, then we would expect language learners to do equally well learning typologically-rare or unattested languages as typologically-common ones. If active and synchronic, on the other hand, the biases should predispose learners to more easily learn concatenative languages over non-concatenative languages, and CV-respecting languages over ones that have mixed CV roots and residues.

The remainder of this chapter presents a psycholinguistic experiment investigating how well participants decompose words in a novel language into morphemes; its results demonstrate evidence for both biases being present and synchronic.

3.1.4 Previous work

The pioneering work of Saffran et al. (1996a,b) and Aslin et al. (1998) showed that learners, both children and adults, could distinguish words from part-words presented in a continuous stream of syllables, showing that they had successfully performed word segmentation on the stream.

However, once non-adjacent dependencies were introduced into this experimental design, the task became demonstrably more difficult, even impossible. In Newport and Aslin (2004), learners were given the same task as above, but where the trisyllabic words were of the form 'rrssrr', where the first and third syllable consistently co-occurred, but with different intervening medial syllables. Despite prolonged exposure, and despite having similar statistical complexity to the concatenative stimuli in Saffran et al. (1996b), learners were unable to reliably segment words from this stream. This contrast adds support for a synchronic bias towards concatenativity, in the context of the task of word segmentation.

Another statistical learning experiment providing evidence for a synchronic bias towards concatenativity is Drake (2018), which contrasted participants' ability to segment words out of a stream, where the words were either generated from a concatenative morphological grammar or a non-concatenative one. Drake found that participants from a variety of language backgrounds (English, Arabic, and Maltese) learned the concatenative grammar more easily than the non-concatenative one.

There is less work directly examining the CV-bias of the sort we are looking for, but there are two relevant areas of research in the literature. The first is on learning non-adjacent dependencies among consonantal subsequences and vocalic subsequences.

For instance, Newport and Aslin (2004) went on to test learners' abilities to segment out trisyllabic words with non-adjacent dependencies between consonants, and separately, non-adjacent dependencies between vowels, designed to mimic Arabic- and Hebrew-style root-and-pattern morphology. This time learners were able to successfully learn and segment out words, in contrast to the artificial language with non-adjacent syllabic dependencies. Newport and Aslin (2004) suggest that the mechanism behind this success might be that learners were able to track transitional probabilities between segments on different autosegmental tiers like the consonantal tier and vocalic tier.

Other work has found evidence that French speakers find non-adjacent consonantal dependencies easier to track and recall than vocalic dependencies (Bonatti et al., 2005), though LaCross (2015) suggests that such imbalances may be due to L1 biases, showing that Khalkha Mongolian speakers, whose language displays vowel harmony, found it easier to track non-adjacent vocalic dependencies than did American English speakers. On the other hand, Mintz et al. (2018) found that English-learning infants rapidly segmented out words from a stream where the only cues to segmentation were vowel harmony patterns within each word, despite not having vowel harmony in their native language.

The second relevant line of research is on whether there is a consonant bias towards representing lexical information and a vocalic bias towards representing syntactic/prosodic information, as suggested by Nespor et al. (2003). If these biases exist, it would be natural for words in a non-concatenative morphological grammar to have a morpheme split along consonant-vowel lines, with the consonantal root representing a semantic field, as it does in Semitic, and with a vocalic residue representing voice, tense, and other syntactic information, again as it does in Semitic.

Much of the psycholinguistic work done in this area focuses on the lexical side of this equation, looking at how altering consonants and vowels affects word retrieval and identification. Some evidence comes from tasks in which participants are given a nonce word such as kebra and asked to alter it to form a known word; participants are on the whole more likely to change a vowel, for example to cobra, rather than a consonant, for example to zebra, and to alter vowels with greater accuracy and speed (Cutler et al., 2000). Further experimentation has demonstrated the consonant bias in infants as young as 8 months (Nishibayashi and Nazzi, 2016), though the earliest age at which infants demonstrate the bias seems to differ cross-linguistically, suggesting that it is learned rather than innate (Nazzi et al., 2016).

3.2 Experiment

In this artificial grammar experiment, three different "alien languages", all with the word skeleton CVCCV, were presented to participants:

(27) All words are of the form C1V1C2C3V2

a. Concatenative: words consisted of a C1V1C2 stem and C3V2 suffix (template: rrrss)
b. Arabic-like non-concatenative: words consisted of a consonantal root C1C2C3 and a vocalic residue V1V2 (template: rsrrs)
c. Unattested non-concatenative: words consisted of a mixed consonant-vowel "root" C1C2V2 and mixed consonant-vowel residue V1C3 (template: rsrsr)

[Diagram: root and residue spans within C1V1C2C3V2 for each of the three languages (a)-(c), as described above.]

If we assume the existence of learning biases for concatenativity and, if a language has root-and-pattern-style morphology, a bias towards consonants in the root and vowels in the residue, then we would expect the concatenative language (27-a) to be the most learnable, followed by the non-concatenative language (27-b) that separates out consonantal and vocalic material, followed by the unattested non-concatenative language with mixed consonant-vowel morphemes (27-c). On the other hand, the computational model treats all three exactly the same and predicts them to be equally learnable.

3.2.1 Task

The experiment was run online using the Experigen platform (Becker and Levine, 2013). Participants were told that they would be participating in a linguistic effort to make sense of the Martian language, which humans had just begun to intercept and of which little was yet known.

(28) Introductory screen to the artificial grammar experiment

[Screenshot text: "Exciting scientific news from Mars! After centuries of silence, we've finally managed to intercept transmissions made in the Martian language. We don't know yet what each word means, but we're working hard to figure out more about this language. [...] system of Martian and hearing a few intercepted words of the language. To hear the Martian words better, please put on your headphones." A Continue button follows, with the note: "If pressing buttons isn't your thing, throughout this experiment, you can press 'Space' to advance."]

They were then given explicit instruction on the phoneme inventory and orthographic representation of Martian, consisting of six consonants and six vowels.

(29) Vowel inventory training screen

[Screenshot: "These are the six vowels of Martian we've identified so far:" followed by the six vowels ah, ai, ee, ei, oh, oo, and a "Move on to consonants" button.]

This was followed by three rounds of training on 36 words, each round presented in randomized order. They were shown the orthographic representation of the word and heard an audio representation generated with MBROLA (Dutoit et al., 1996), before advancing to the next item in their own time.

(30) Training screen

During this phase, participants were instructed not to write down the words, but encouraged to pay attention as they would be periodically quizzed on what they had heard. At frequent intervals, the participants were asked to type in the word they had just seen and heard previously, to verify that they were paying attention.

(31) Quiz screen

They were given feedback as to whether they had correctly input each word. A total of 12 quiz questions were interspersed with the 36 x 3 = 108 individual training words.

(32) a. Feedback screen: correct

[Screenshot text: "Type in the word you just heard, then press 'Enter': kahlsah" followed by the feedback "Correct!" and a Continue button.]

b. Feedback screen: incorrect

[Screenshot text: "Type in the word you just heard, then press 'Enter': toomko" followed by the feedback "The word was toomkoh." and a Continue button.]

(33) Testing instructions

Uh oh, the Martians are springing a pop quizI They want to know how well you remember the words you listened In on earlier.

You'll see a number of words. Some you saw and some you didn't. Your job is to decide for each word whether you saw it earlier. Don't stress out - If you can't remember, just select "yes" or "no" at random. (The Martians may look tough, but they'm really softies at heart.)

Ready? Let's goI continue

Participants were shown 38 words in randomized order and asked to decide whether they had seen each word in the training phase. Only 4 words had been seen previously in their totality; another 15 words either shared a root, a residue, or both, with words previously seen. The remaining 19 words were distractor items, each consisting of a root and residue that had not been seen at all in training.

(34) Testing screen

[Screenshot text: "Listen to the word. Is this word among the ones you heard earlier? laimsah" with Yes and No buttons.]

No feedback was given in this section.

The experiment concluded with an optional demographics questionnaire.

The construction of the training and testing stimuli is described further in the next section.

3.2.2 Stimuli

Phoneme inventory

The phoneme inventory of "Martian" was as follows:

(35) Phoneme inventory of Martian (orthographic representation in angular brackets)

Consonants     Vowels
/p/ ⟨p⟩        /a/ ⟨ah⟩
/t/ ⟨t⟩        /ai/ ⟨ai⟩
/k/ ⟨k⟩        /ei/ ⟨ei⟩
/l/ ⟨l⟩        /i/ ⟨ee⟩
/m/ ⟨m⟩        /ou/ ⟨oh⟩
/s/ ⟨s⟩        /u/ ⟨oo⟩

(36) Sample Martian words, all of the form CVCCV

a. /matsi/ ⟨mahtsee⟩
b. /saimmei/ ⟨saimmei⟩
c. /kitkou/ ⟨keetkoh⟩

Train and test sets

To probe learnability, this experiment systematically manipulated the frequency profiles of roots and residues.

The fundamental variables manipulated were root and residue frequency. Roots and residues in the test set (all of which occurred exactly once in the test set) could have occurred 4, 1, or 0 times in the training set.

(37) Example of a frequency-4 root: occurs 4 times in the training set and once in the test set

Train        Test
R6-S15       R6-S7
R6-S2
R6-S5
R6-S10

(38) Example of a frequency-1 residue: occurs once in train set and once in test set

Train        Test
R10-S7       R6-S7

(39) Example of a frequency-0 root: the root (not numbered) occurs once in the test set but not in the train set

Train        Test
   -         R-S10

Additionally, all of the 19 distractor items were composed of roots and residues that did not occur in the train set and were thus frequency-0.

(40) Distractor items contained frequency-0 roots and residues.

[Diagram: distractor roots and residues appear only among the test items, never in the train set.]

Given the learning algorithm presented in the previous chapter, we would expect that high-frequency items would be more easily learned and recalled than low-frequency items.

Furthermore, morphemes that co-occur with high-frequency items should be more easily extractable. For example, if a residue co-occurs with a high-frequency root which is easy to identify, then the word template should be easier to guess and therefore the residue itself more easily learned and recalled, even if it itself has an inherently low frequency.

The frequency-1 morphemes were therefore divided into highly extractable (HE) and less extractable (LE) morphemes. Highly extractable roots were those that co-occurred with a frequency-4 residue in the training set. Less extractable roots were those that co-occurred only with a frequency-1 residue in the training set.

(41) Example: residues S7 and S9 each occur once in the train set and once in the test set. However, S7 occurs with a frequency-4 root (R10) in the train set and is thus highly extractable, while S9 occurs with a frequency-1 root (R8, which does not itself appear in the test set) in the train set, and is thus less extractable.

Train            Test
R10 - S7         R6 - S7
R8 - S9

The third variable manipulated was word frequency. Entire words in the test set could have occurred once or zero times in the train set.

(42) Example: root R2 has the same frequency profile as root R6 in (37), but the test word it occurs in also occurs in the train set.

Train                                   Test
R2 - S1, R2 - S8, R2 - ..., R2 - ...    R2 - S1 (identical to a train word)

The roots and residues are given numbers as they stand for abstract roots and residues whose phonological content differs in each of the three conditions.

(43) Example: R2 is realised as CVC (sohp) in Condition (27-a), CCC (ktp) in Condition (27-b), and as CCV (mtoo) in Condition (27-c) (not shown).

[Table: the train and test words containing R2 in their condition-specific realisations, with train and test columns for the English-like and Arabic-like conditions.]

Thus, all train and test sets share exactly the same combinatorial "backbone". Any given frequency-4 root occurs with the four equivalent residues in each of the three conditions. The full train/test set with its various realisations is given in Appendix B. A further design consideration was that roots and residues were symmetric.

(44) Roots and residues are fully symmetric: R2 has the same profile as S2, except that R2 is a root and S2 a residue.

[Table: the train/test profile of root R2 set alongside the identical train/test profile of residue S2 (train and test columns for each).]

Lastly, the test set crossed all licit combinations of the five factors. Some combinations were impossible to test: a word with frequency 1 or 4 could not possibly contain a root or residue with zero frequency. There were no low-extractability frequency-4 morphemes, as all frequency-4 morphemes co-occurred with three other frequency-4 morphemes in the train set and were thus considered high-extractability.

(45) Design of the test set

Item #     Word frequency    Root frequency    Residue frequency
1          1                 4                 4
2          1                 4                 1 HE
3          1                 1 HE              4
4          1                 1 LE              1 LE
5          0                 4                 4
6          0                 4                 1 HE
7          0                 4                 1 LE
8          0                 4                 0
9          0                 1 HE              4
10         0                 1 LE              4
11         0                 0                 4
12         0                 1 HE              1 HE
13         0                 1 HE              1 LE
14         0                 1 LE              1 HE
15         0                 1 LE              1 LE
16         0                 0                 1 HE
17         0                 0                 1 LE
18         0                 1 HE              0
19         0                 1 LE              0
20-38      0                 0                 0 (distractors)

3.2.3 Controlling for alternative testing strategies

Besides the MCMC sampling-based learning algorithm presented in the previous chapter for identifying templates and morphemes, it was entirely possible that participants would use alternative or additional measures of wordlikeness to answer the question of whether a test word had been observed in the training set. These possible confounds were controlled for to the extent permitted by the experiment design.

Phonotactic probability

Phonotactics are an important part of the linguistic knowledge of a speaker and have been shown to be instrumental in the performance of several psycholinguistic tasks surrounding unknown words, such as acceptability judgments of novel wordforms, the identification of unknown words in noise and embedded in longer sequences, etc. (Greenberg and Jenkins, 1964; Vitevitch et al., 1997; Frisch et al., 2000, e.g.). A participant approaching the experimental task of determining whether a test word had previously been observed might use their accumulated phonotactic knowledge of Martian, built up from the corpus of training words observed, to make their judgment.

In this experiment, the measures of phonotactic probability considered were positional unigram frequency and bigram frequency (Jusczyk and Luce, 1994; Vitevitch et al., 1997).

The training set was set up such that each consonant and vowel appeared exactly 6 times in each C and V slot in the set of 36 words. Therefore, the phonotactic probability of each word, based on positional unigram frequencies, was equal for every training and testing word.

Balancing the bigram frequencies was trickier given the specific combinations of roots and residues required by the design of the training set and the shapes of the words. For instance, by necessity, a frequency-4 root in the English-like condition (27-a) would have a C1V1 frequency of 4 and a V1C2 frequency of 4. On the other hand, at the syllable boundary between C2 and C3, there was more flexibility to introduce a greater number of combinations, and so every consonant-consonant combination could be, and was, included exactly once in the training set for this particular condition.

To control for the bigram frequency confound, therefore, every test item was balanced with a distractor item with exactly the same overall bigram frequency.

(46) Example bigram frequencies for two balanced words

Test word:   s oh p p ah    (bigram frequencies: 4, 4, 1, 1)
Distractor:  p ai m l ee    (bigram frequencies: 4, 1, 1, 4)

Bigram probability was then included as a factor in the statistical analysis, to account for any bias towards accepting a word as recalled due to its phonotactic probability.
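As a concrete illustration of the two measures, the following sketch computes positional unigram counts and a bigram-frequency score over a toy training corpus; the example words and their segmentation into orthographic segments are invented for illustration and are not the actual stimuli.

    from collections import Counter

    # Toy "training corpus": each word is a list of five orthographic segments
    # (CVCCV). The words are invented for illustration and are not the stimuli.
    train = [
        ["m", "ah", "t", "s", "ee"],
        ["s", "oh", "p", "p", "ah"],
        ["k", "ah", "l", "s", "ah"],
    ]

    # Positional unigram counts: how often each segment occupies each slot.
    positional_unigrams = Counter(
        (position, segment)
        for word in train
        for position, segment in enumerate(word)
    )

    # Bigram counts: how often each adjacent pair of segments occurs.
    bigrams = Counter(
        (word[i], word[i + 1]) for word in train for i in range(len(word) - 1)
    )

    def bigram_score(word):
        """Overall bigram frequency of a word: the summed training-set
        frequencies of its adjacent segment pairs."""
        return sum(bigrams[(word[i], word[i + 1])] for i in range(len(word) - 1))

    print(positional_unigrams[(1, "ah")])              # how often 'ah' fills the V1 slot
    print(bigram_score(["s", "oh", "p", "p", "ah"]))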

Lexical Neighbourhood

Another strong correlate of wordlikeness judgments is lexical neighbourhood density (Greenberg and Jenkins, 1964; Bailey and Hahn, 2001). The more existing words a novel wordform resembles, and the more closely it resembles them, the more likely it is to be judged a typical-sounding word of that language.

I used the Bailey-Hahn Generalized Neighborhood Model (GNM) as the metric for lexical neighbourhood density, which takes into consideration the similarity of the novel wordform to every word in the lexicon. In the context of this experiment, the "lexicon" was the set of words in the training corpus.

The Bailey-Hahn GNM further weights each lexical item by frequency. However, in this experiment, every training word occurred with exactly the same frequency. Therefore, the model reduces to the Generalized Context Model (GCM) of Nosofsky (1986), on which the GNM was based:

(47) S_i = Σ_j e^(-D · d_ij), where

a. S_i is the similarity of lexical item i to the lexicon {j};
b. D is a sensitivity parameter, whose value is to be empirically determined;
c. d_ij is the phonetic edit distance between lexical item i and a word j from the lexicon.

The phonetic edit distance d_ij was, in this case, straightforward to compute, as all words were of the form CVCCV: it was simply the sum of the phonetic distances of each pair of corresponding segments, using the natural class-based distance metric of Frisch et al. (1997) as implemented in Adam Albright's segmental similarity calculator², taking distance as (1 − similarity).

(48) Sample calculation

    m    ah    t     l     ah
    m    ah    p     s     oo
    0    0     .69   .81   .77

d = .69 + .81 + .77 = 2.27
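For concreteness, the calculation in (47)-(48) can be sketched as follows. The per-segment distances in the table below are the ones repeated from (48); everything else (the toy lexicon, the fallback distance of 1.0) is an assumption for illustration, not the values used in the actual analysis.

    import math

    # Per-segment distances (1 - similarity); the non-zero values are those
    # from the sample calculation in (48), the rest of the table and the
    # fallback value of 1.0 are assumptions for illustration.
    SEGMENT_DISTANCE = {("t", "p"): 0.69, ("l", "s"): 0.81, ("ah", "oo"): 0.77}

    def dist(a, b):
        if a == b:
            return 0.0
        return SEGMENT_DISTANCE.get((a, b), SEGMENT_DISTANCE.get((b, a), 1.0))

    def word_distance(w1, w2):
        # All words are CVCCV, so the edit distance reduces to a sum of
        # position-by-position substitution costs.
        return sum(dist(a, b) for a, b in zip(w1, w2))

    def gcm_similarity(test_word, lexicon, D=1.1):
        # S_i = sum over the training lexicon j of exp(-D * d_ij).
        return sum(math.exp(-D * word_distance(test_word, w)) for w in lexicon)

    lexicon = [["m", "ah", "p", "s", "oo"]]        # toy one-word training lexicon
    test = ["m", "ah", "t", "l", "ah"]
    print(word_distance(test, lexicon[0]))          # 2.27, as in (48)
    print(gcm_similarity(test, lexicon, D=1.1))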

Since no insertions or deletions were considered, there was no need to determine an insertion or deletion cost, and the substitution costs could be computed directly from the natural classes. However, there remained the questions of what the sensitivity parameter D should be set to, as well as whether the distances (computed as the number of shared natural classes divided by the number of shared plus unshared natural classes) should be computed based on the natural classes of the English segment inventory, or on the limited segment inventory of the "Martian language" introduced in this experiment.³

To determine the optimal setting of these parameters, I calculated the Pearson correlation coefficient between the mean S_i of a test item in relation to the lexicon of training items seen in a particular condition, and its mean acceptance rate in that condition, for a variety of sensitivity parameters D and over the two possible sets of natural classes induced by the English and Martian phoneme inventories. The highest correlation was found with a D of 1.1 using natural classes from the English phoneme inventory.

² Available at web.mit.edu/albright/www/software/SimilarityCalculator.zip
³ The features considered were the same in both cases; however, the different segment inventories induced different natural classes.

(49) Pearson correlation between similarity to the lexicon and mean acceptance rate of a test item, versus sensitivity parameter D, for English and Martian natural classes

[Figure: correlation plotted against D (from 1 to 3) for the English and Martian natural class sets; the y-axis spans roughly 0.4 to 0.5.]

These parameters were used in computing the GCM/GNM similarities, and these similarities were included as factors in the statistical analysis, giving neighbourhood density the best possible chance of explaining away the participants' recall judgments.

Resemblance to English words

Finally, participants might particularly recall, or simply identify as having been seen, words that bear an orthographic or phonetic resemblance to existing English words, such as the "Martian" word ⟨meekpee⟩, each syllable of which is an existing English word. To counteract this, 8 random permutations of consonants and vowels were used in each of the three conditions. The word ⟨meekpee⟩ in one permutation might become ⟨kahlsah⟩ in another permutation, for example, reducing the risk of any one word representing a particular combination of variables being identified as having been seen at a higher rate due to its chance resemblance to an English word.

(50) Example permutation

                   Consonants           Vowels
Permutation A:     p  t  k  l  m  s     ah  ai  ei  ee  oh  oo
Permutation B:     s  t  l  p  k  m     ai  ei  ee  ah  oh  oo

Sample word:       m ee k p ee (⟨meekpee⟩)  becomes  k ah l s ah (⟨kahlsah⟩)

3.2.4 Participants

The participants in this experiment were 120 users of Mechanical Turk, 40 per condition. There were 8 permutations of consonants and vowels per condition, and therefore each permutation was seen by 5 participants. Participants were remunerated for their time and effort. On average, participants took 24 minutes to complete the experiment.

All participants were self-reported native speakers of English. Several reported speaking other languages besides English. One spoke Hebrew and another spoke Arabic and Hebrew, languages with the non-concatenative morphology targeted in condition (27-b). However, as neither of these two participants was shown this condition, it is unlikely that their linguistic knowledge could have affected the outcome of this experiment, and therefore neither was excluded on this basis.

In order to be included, participants had to score at least 10 out of 12 correct on the quiz questions sprinkled into the training section. Some allowance was made for typographical errors by computing the Damerau-Levenshtein distance between the word entered by the participant and the desired orthographic form. The answer was counted correct so long as this distance was not greater than 1: that is, a single substitution, addition, deletion, or transposition of two adjacent letters was allowed. The most typical error was misspelling the vowels, e.g. *⟨matsai⟩ for ⟨mahtsai⟩.

After excluding participants with more than the permitted number of errors, 101 participants remained, with the following distribution across conditions:

(51) Distribution of included participants per condition

Condition                                       Participant count
Concatenative (27-a)                            32
Arabic-like non-concatenative (27-b)            36
Unattested CV-mix non-concatenative (27-c)      33
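For concreteness, the inclusion criterion described above (counting a typed answer as correct when its Damerau-Levenshtein distance from the target spelling is at most 1) can be sketched as below; this is a standard restricted Damerau-Levenshtein implementation, not the exact code used to score the experiment.

    def damerau_levenshtein(a, b):
        """Restricted Damerau-Levenshtein distance: substitution, insertion,
        deletion, and transposition of adjacent characters each cost 1."""
        d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            d[i][0] = i
        for j in range(len(b) + 1):
            d[0][j] = j
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + cost) # substitution
                if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                        and a[i - 2] == b[j - 1]):
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
        return d[len(a)][len(b)]

    def answer_correct(typed, target):
        return damerau_levenshtein(typed, target) <= 1

    print(answer_correct("matsai", "mahtsai"))   # True: one missing letter
    print(answer_correct("toomko", "toomkoh"))   # True: one missing letter
    print(answer_correct("matsee", "mahtsai"))   # False: distance greater than 1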

3.3 Results

3.3.1 Descriptive analysis

First, we establish a baseline: the rate of acceptance of distractor items. These were words for which root, residue, and word frequency were all zero.

(52) Rate of acceptance for distractor items in the three conditions

Condition        Acceptance rate
(1a) rrrss       0.45
(1b) rsrrs       0.53
(1c) rsrsr       0.46

We find that the acceptance rate hovers around 50%, suggesting that participants were at chance when guessing whether a genuinely unknown word was Martian or not.

Now let us examine the effect of root and residue frequency on the rate at which words were accepted as having been "Martian".

(53) Rate of acceptance for test words of root frequency 0, 1, and 4 in the three conditions.

Condition        Frequency-0 root    Frequency-1 root    Frequency-4 root
(1a) rrrss       0.57                -                   0.88
(1b) rsrrs       0.54                0.61                0.78
(1c) rsrsr       0.44                0.65                0.86

As predicted by the model, the rate of acceptance increases as we move from frequency-0 to -1 to -4 roots, in all three conditions.

(54) Rate of acceptance for test words of residue frequency 0, 1, and 4 in the three conditions.

Condition        Frequency-0 residue    Frequency-1 residue    Frequency-4 residue
(1a) rrrss       0.68                   0.61                   0.74
(1b) rsrrs       0.51                   0.65                   0.73
(1c) rsrsr       0.73                   0.68                   0.68

In the case of residue frequency, there appears to be an increase in acceptance rate for words containing frequency-4 residues in the English-like condition (27-a) and the Arabic-like condition (27-b), but no consistent pattern from frequency-0 to frequency-1 residues.

The next variables manipulated were root and residue extractability. The two graphs that follow include only words whose roots (respectively, residues) have frequency 1, as frequency-1 morphemes were the only ones for which high and low extractability were meaningfully defined.

(55) Rate of acceptance for test words of low and high root extractability in the three conditions.

Condition        Low extractability root    High extractability root
(1a) rrrss       0.53                       0.60
(1b) rsrrs       0.65                       0.57
(1c) rsrsr       0.64                       0.67

The relative ordering of acceptance rates for low and high extractability roots switches between conditions, suggesting that high extractability does not increase acceptance rate.

(56) Rate of acceptance for test words of low and high residue extractability in the three conditions.

Condition        Low extractability residue    High extractability residue
(1a) rrrss       0.56                          0.58
(1b) rsrrs       0.53                          0.68
(1c) rsrsr       0.67                          0.64

Similarly, we see that the ordering of low and high extractability residues also switches between conditions, contra the computational model.

The final variable was word frequency. The following graph shows the rate of acceptance for words that had been seen in their totality in the train set, versus words that had not, but for which both morphemes had been observed in the train set.

(57) Rate of acceptance for words not in train set versus words in train set

Condition        Not a training word    Training word
(1a) rrrss       0.60                   0.88
(1b) rsrrs       0.66                   0.83
(1c) rsrsr       0.67                   0.86

We see that there is a robust effect of having seen a word before on acceptance rate, as would be expected under any theory of word recognition.

3.3.2 Statistical analysis

The descriptive analysis above showed support for some aspects of the model and not for others. To properly tease out the effects of the variables in question, and to de-confound them from phonotactic probability and lexical neighbourhood density, I performed a mixed effects logistic regression analysis using the R library lme4 (Bates et al., 2015).

Each trial consisted of a response by a participant in the test phase to the question of whether the word was among the words in the train set that the participant had viewed earlier. The dependent variable was their binary yes/no response to this question. The main effects tested were the following:

(58) Experimental condition, Helmert-coded into two variables: one reflecting whether the condition is concatenative (concatenative), and one reflecting whether it groups consonants into one morpheme and vowels into another, as in Arabic (cv).

                           concatenative    cv
English-like (27-a)        2                0
Arabic-like (27-b)         -1               1
Unattested-like (27-c)     -1               -1

Our expectation based on the computational model was that neither of these, nor their interactions, would have a significant effect.

(59) Root and residue frequency of the test or distractor item, Helmert-coded into two variables each: one reflecting whether the morpheme was seen earlier (root-attested, res-attested) and one reflecting whether it was high or low frequency in the train set (root-frequency, res-frequency).

               r-attested    r-frequency
Frequency-0    -2            0
Frequency-1    1             -1
Frequency-4    1             1

Our expectation was that morpheme frequency and attestedness should both have a positive effect on item recognition.

(60) Root and residue extractability in the test or distractor item, dummy-coded.

        r-extract
High    0.5
Low     -0.5

Our expectation was that morpheme extractability should have a positive effect on item recognition.

(61) Whether the word as a whole was attested in the train set, dummy-coded.

       word-attested
Yes    0.5
No     -0.5

Our expectation was that word attestedness should have a positive effect on item recognition.

The two confound variables were also included as main effects and coded in the following way:

(62) a. Bigram frequency (bigram), re-centred around 0 and scaled by its standard deviation.

b. Lexical neighbourhood density (neighbourhood), re-centred around 0 and scaled by its standard deviation.

Our expectation was that bigram frequency and lexical neighbourhood density should have a positive effect on item recognition.

Two- and three-way interactions of the main effects were included, up to the maximal model that would converge, and random effects were included for participant and item. The result was the following mixed effects model with a binomial logit link function:

(63) response ~ (concatenative + cv) * (word-attested * (root-frequency + residue-frequency) + neighbourhood)
             + root-attested + residue-attested
             + root-extract + residue-extract + bigram
             + (1 | user) + (1 | item)

which was fit with the following coefficients:

(64) Fitted model coefficients (effects with p < 0.05 marked with *):

Main effects                                     Coefficient    Std. error    P(>|z|)
Condition
  concatenative (English-like vs rest)           0.011          0.095         0.90
  cv (Arabic-like vs unattested)                 0.05           0.15          0.75
Confounds
  neighbourhood                                  0.60           0.10          3.25e-09 *
  bigram                                         -0.11          0.054         0.055
Morpheme frequency
  root-attested (0 vs rest)                      0.12           0.071         0.098
  residue-attested (0 vs rest)                   0.017          0.058         0.77
  root-frequency (1 vs 4)                        0.29           0.12          0.017 *
  residue-frequency (1 vs 4)                     0.12           0.11          0.28
Others
  word-attested (0 vs 1)                         0.41           0.24          0.089
  root-extractability (Low vs High)              -0.042         0.17          0.80
  res-extractability (Low vs High)               -0.0027        0.18          0.99

Two-way interactions                             Coefficient    Std. error    P(>|z|)
Condition x Confounds
  concat:neighbourhood                           0.066          0.049         0.18
  cv:neighbourhood                               -0.064         0.073         0.38
Condition x Morpheme frequency
  concat:root-frequency                          0.071          0.073         0.33
  concat:residue-frequency                       0.15           0.072         0.035 *
  cv:root-frequency                              0.036          0.11          0.75
  cv:residue-frequency                           0.25           0.11          0.025 *
Others
  concat:word-attested                           -0.0060        0.18          0.97
  cv:word-attested                               -0.040         0.26          0.88
  root-freq:word-attested                        -0.17          0.20          0.39
  res-freq:word-attested                         0.11           0.56          0.58

Three-way interactions                           Coefficient    Std. error    P(>|z|)
  concat:word-attested:root-freq                 0.049          0.14          0.36
  cv:word-attested:root-freq                     0.037          0.22          0.87
  concat:word-attested:res-freq                  0.19           0.14          0.17
  cv:word-attested:res-freq                      0.39           0.22          0.074

There were four significant effects. First, a test word with higher lexical neighbourhood density was more likely to be recognised (p < 0.001). Second, a test word with a higher-frequency root was more likely to be recognised. Third, words containing higher-frequency residues in the concatenative language were more likely to be recognised than their equivalents in the non-concatenative languages (p < 0.05). Fourth, words containing higher-frequency residues in the Arabic-like language were more likely to be recognised than their equivalents in the unattested language (p < 0.05).

3.4 Discussion

Based on the outcome of the mixed effects model, the factor with the greatest effect on whether a word was accepted or not was lexical neighbourhood density (p < 0.001).

This is not surprising, firstly because lexical neighbourhood density is highly correlated with wordlikeness judgments (Greenberg and Jenkins, 1964; Bailey and Hahn, 2001), but also because many of the other factors in the model (root and residue frequency and attestedness, word attestedness, and bigram probability) correlate highly with neighbourhood density. For example, if a test word had been seen in the train set, then its distance to one member of the lexicon was 0, and its lexical neighbourhood density correspondingly higher. Similarly, having a high-frequency root meant that 3/5 of a word's segments were wholly shared with four members of the training lexicon.

Nevertheless, neighbourhood density alone could not account for the whole story. Word frequency still makes a small though non-significant contribution (p = 0.089), and root frequency (4 vs 1) a significant positive contribution (p < 0.05), to word acceptance, regardless of experimental condition. Only morpheme frequency (4 vs 1), as opposed to morpheme attestedness (1 vs 0), had significant effects throughout the model, possibly because far more repetitions were necessary for the existence of the root or residue to stick in memory.

So far, the computational model's predictions have more or less held up. However, ignoring the predictions of the computational model and going by typological frequency instead, we might expect that the concatenative morphology would have been the easiest to break down into morphemes and learn, over the non-concatenative morphology that conforms to the consonant-vowel distinction we find in Semitic and in pockets of English such as ablaut, over the wholly unattested morphology. Root and residue attestedness and frequency should thus have a greater positive effect on word acceptance in condition (27-a) vs condition (27-b), and in condition (27-b) vs condition (27-c).

This was the case with residue frequency (concat:residue-frequency: p < 0.05, cv:residue-frequency: p < 0.05). This contradicts the predictions of the computational model, which treats all three "languages" identically.4

Also contrary to the predictions of the computational model, root and residue extractability did not have any significant effect on word acceptance. This suggests either that the model is flawed, or possibly that human memory limitations muddle the effects of extractability. Though a high-frequency root might make it easier to extract the word template, it does not automatically follow that the low-frequency (though highly extractable) residue would correspondingly be easier to memorise.

Since the participants in this experiment were adult speakers of English, it could be that the proposed concatenative bias is not inherent to all learners, but was instead induced by their L1 knowledge and preference for concatenative morphology, since the dominant mode of word formation in English is concatenative, with only a small amount of ablaut in Germanic past tenses and plurals. However, Drake (2018) shows that Arabic speakers in a wug-style test also display a preference for producing concatenative over non-concatenative morphology. (Recall that Arabic employs concatenative inflectional morphology in addition to templatic stem-internal morphology.) An interesting follow-up to this experiment would be to repeat it with speakers of Arabic and Maltese, to see whether such speakers display the concatenative bias in this setting, and to what degree.

⁴ These factors were significant when calculated using the Wald statistic. An alternative method of testing for significance is the likelihood ratio test, removing a factor from the model and computing an ANOVA with respect to the fuller model. However, this requires removing all higher-level interactions involving that factor. When the concat:residue-frequency factor and its higher-order three-way interactions were removed, an ANOVA gave a non-significant result; similarly with cv:residue-frequency. It seems likely that the three-way interactions, though not themselves significant, are the cause of this difference between the results of the Wald test and the likelihood ratio tests.

3.5 Conclusion

In the previous chapter, we saw that the computational model successfully learned to break down Arabic verbs into roots and residues. However, it also predicted that, typologically, non-concatenative languages should be dominant, and that root-and-pattern morphologies should allow mixing consonants and vowels. Prosodic Morphology makes similar predictions. However, this is not the case. I therefore proposed that certain biases needed to be incorporated into the model, and tested them via the artificial grammar experiment above.

This artificial grammar experiment demonstrated (modulo the null results of the likelihood ratio test) that English speakers exposed to a concatenative language are more likely to successfully break down the words into morphemes and recognise new words containing those morphemes than in non-concatenative languages, and moreover that having consonants concentrated in roots and vowels concentrated in residues aided the participants in breaking down and recognising new words. This backs the existence of a synchronic bias towards concatenativity and, when that bias fails and a language does have root-and-pattern morphology, a synchronic bias towards having consonantal roots with interleaved vowels.

In the next chapter, we will look at ways to incorporate these results into the model.

Chapter 4

Towards a new model

In this chapter, I look at how to improve the model introduced in 1.3.5, incorporating the results of the artificial grammar experiment in Chapter 3, as well as how to extend the model to incorporate standard word formation operations, such as affixation and the spreading of vowels and consonants. These will serve as pointers for future research.

4.1 The concatenative bias

The model proposed in 1.3.5 dealt only with the stem, which could be monomorphemic or bimorphemic depending on whether the slots chosen were all r slots or a mixture of r and s slots. This model can be combined with the GGJ model by additionally generating any number of affixes and concatenating them to the stem, as described below. This would handle both English-like languages, where a (usually) monomorphemic stem is combined with affixes, and Arabic-like languages, where a stem consisting of a root and vocalism is further combined with derivational and inflectional affixes to produce a full word.

The original model had, if anything, a bias towards generating non-concatenative stems. However, both the distribution of non-concatenativity in the typology of the world's languages and the results of the artificial grammar experiment in Chapter 3 suggest that the model should instead be biased towards concatenativity.

103 4.1.1 Implementing the concatenative bias in the model

Currently, the key point at which the model "decides" whether something is monomorphemic or bimorphemic is when it decides whether each slot belongs to the root or the residue. If all the slots chosen are r, then the stem is monomorphemic; otherwise it is bimorphemic.

One way to modify this model is to have a choice point before this, determining whether the stem to be generated is monomorphemic or bimorphemic. If the latter, then we proceed as in the current model, choosing r and s slots; otherwise, stem generation proceeds as in the original GGJ-derived model presented in 2.1.1.

In concatenative languages, the probability Pmonomorphemic of generating a monomorphemic stem will be high; in non-concatenative languages, it will be low. Therefore, there should be another choice point at the beginning of the generation, deciding the value of Pmonomorphemic for the entire language. The bias for concatenativity is equivalent to a bias for this probability to be high.

Pmonomorphemic can be drawn from a beta distribution, the conjugate prior of the Bernoulli distribution. The beta distribution is governed by two shape parameters, a and b. Setting a to be low and b to be high, we obtain a suitable distribution for drawing the probability Pmonomorphemic:

(1) Draw Pmonomorphemic from BETA (1, 10).

[Figure: probability density function of the beta distribution from which Pmonomorphemic is drawn, plotted over the interval 0 to 1.]

This ensures that any inference over this enhanced model will be more likely to spend time exploring concatenative segmentations of the word into morphemes. If the language is non-concatenative in nature, attempting a concatenative segmentation will result in a large, improbable morpheme lexicon, which will then favour exploring the non-concatenative region of the posterior probability distribution. If the language is concatenative, however, the morpheme lexicon and segmentation will be reasonably probable, and so the majority of the posterior probability distribution will be occupied by concatenative segmentations.

To summarise, then, the new model has the following steps:

(2) a. For the whole language, pick a value of Pmonomorphemic.
    b. For each word, pick the number of morphemes that will go into this word.
    c. Generate a stem.
       (i) With probability Pmonomorphemic, choose whether this stem will be monomorphemic.
       (ii) If monomorphemic, draw an existing monomorphemic stem from the lexicon, or generate a new morpheme of the desired length.
       (iii) If bimorphemic, draw or generate a template, root, and residue as in the model in 1.3.5. Interleave them to form the stem.
    d. For each of the remaining morphemes:
       (i) Decide whether it is to be a prefix or a suffix with some probability Psuffix.
       (ii) If prefixal, draw an existing prefix from a prefix lexicon, or generate a new prefix.
       (iii) If suffixal, draw an existing suffix from a suffix lexicon, or generate a new suffix.
       (iv) Concatenate this affix onto the current word.
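A minimal sketch of this generative process follows. It simplifies the full model considerably: lexicon draws are replaced by fresh uniform draws rather than Pitman-Yor processes, the segment inventory and affix lengths are invented, and the Beta shape parameters are an illustrative choice made so that draws of Pmonomorphemic tend to be high, encoding the concatenative bias.

    import random

    CONSONANTS = list("ptklms")
    VOWELS = ["ah", "ai", "ei", "ee", "oh", "oo"]

    def new_morpheme(length):
        """Draw a brand-new morpheme as a list of segments (uniform draws,
        standing in for the base measure of the full model)."""
        return [random.choice(CONSONANTS + VOWELS) for _ in range(length)]

    def generate_word(p_monomorphemic, p_suffix=0.7, max_affixes=2):
        # (c) Generate a stem.
        if random.random() < p_monomorphemic:
            stem = new_morpheme(5)                               # monomorphemic stem
        else:
            template = [random.choice("rs") for _ in range(5)]   # r/s template
            root = iter(new_morpheme(template.count("r")))
            residue = iter(new_morpheme(template.count("s")))
            stem = [next(root) if slot == "r" else next(residue) for slot in template]
        # (d) Attach the remaining morphemes as prefixes or suffixes.
        word = stem
        for _ in range(random.randint(0, max_affixes)):
            affix = new_morpheme(2)
            word = word + affix if random.random() < p_suffix else affix + word
        return "".join(word)

    # (a) For the whole language, draw P_monomorphemic once, from a Beta
    #     distribution whose mass lies near 1 (illustrative shape parameters).
    p_mono = random.betavariate(10, 1)
    print([generate_word(p_mono) for _ in range(5)])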

This was not the only possible choice of model. For example, we could have directly modelled the number of times the stem switches from root to residue and back to root. If that number is 0, the stem is monomorphemic. If it is 2, then the stem has an ablaut-like form, such as goose-geese. If it is more, then the stem is of the Arabic root-and-pattern type.¹ The number of switches could be drawn from a Poisson distribution. Concatenative languages would lean towards a very small number of switches, while non-concatenative languages would lean towards a larger number of switches.

4.1.2 Implementing the concatenative bias in Optimality Theory

In 3.1.1 we saw that, within Optimality Theory, because there is only a single constraint, CONTIGUITY, militating for concatenation, the factorial typology predicted many more non-concatenative morphologies than are observed in the world's languages. In order to implement the concatenative bias in Optimality Theory, one can simply posit that CONTIGUITY is by default high-ranked, in which case a great deal of linguistic evidence would need to be presented to the learner in order to demote the constraint to the lower ranking required for non-concatenative morphology.

4.1.3 Implementing the concatenative bias in MaxEnt models

In a MaxEnt model of phonology (Goldwater and Johnson, 2003; Hayes and Wilson, 2008), CONTIGUITY could have a prior forcing its weight to be high by default. This is enforced by a regularization term over the whole model of the following form:

(3) Σ_i (w_i − μ_i)² / (2σ_i²), where

a. w_i is the weight assigned to constraint i by the model;
b. μ_i is the target a priori weight;
c. σ_i governs how easily the assigned weight can differ from the target prior weight.

A high value of μ_CONTIGUITY and a low value of σ_CONTIGUITY would force the assigned weight of CONTIGUITY to be high, unless setting it low yields a probability distribution over output candidates that better approximates the observed distribution of candidates; in other words, unless the data demands that CONTIGUITY be weighted low.

¹ A lingering question here is what would happen if the number of switches drawn is 1, implying that the stem is bimorphemic but of a concatenative nature: do we draw the affix from the prefix or suffix lexicon, or model it as being drawn from a separate residue lexicon?
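As a sketch of how such a prior penalises deviation from the target weights, the following computes the regularization term in (3); the constraint names, weights, and prior parameters are invented for illustration.

    def regularization_penalty(weights, mu, sigma):
        """Sum over constraints of (w_i - mu_i)^2 / (2 * sigma_i^2):
        the penalty added to the model's objective."""
        return sum((weights[c] - mu[c]) ** 2 / (2.0 * sigma[c] ** 2) for c in weights)

    # Illustrative values: CONTIGUITY is given a high target weight and a small
    # sigma, so a low fitted weight is heavily penalised; the markedness
    # constraint is left essentially free to move.
    weights = {"CONTIGUITY": 2.0, "*COMPLEX": 1.0}
    mu      = {"CONTIGUITY": 8.0, "*COMPLEX": 0.0}
    sigma   = {"CONTIGUITY": 1.0, "*COMPLEX": 10.0}
    print(regularization_penalty(weights, mu, sigma))   # (2-8)^2/2 + 1/200 = 18.005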

4.2 CV bias

The current computational model has no innate conception of consonants and vowels. Therefore, once we have chosen to have a bimorphemic stem, a root of the form, say, /CCV/ interdigitating with a residue of the form /VC/ is entirely plausible, despite such systematic intercalations of morphemes containing a mixture of Cs and Vs being, to my knowledge, unattested. Furthermore, as we saw in 3.1, the model was more likely than not to generate morphologies of this sort, rather than attested Arabic-like morphologies with a consonantal root interleaved with a vocalic residue.

The artificial grammar experiment of Chapter 3 asked whether learners would have more difficulty learning a language with intermixed consonant-vowel roots and residues than one of the Arabic kind. We found that indeed they did, which I interpret as pointing to the existence of a bias in templatic morphology for roots to be consonantal and interleaved with vocalic material.

Before we see how to incorporate such a bias into the computational model, let us first address the question of whether Semitic roots are, indeed, fully consonantal. So far in this study, I have entirely ignored the question of the so-called weak roots, which have traditionally been analysed as containing glides in one of their consonantal positions, glides which are dropped and/or replaced with vowels in some forms of the stem. For example:

(4) a. Assimilated roots (C1 semi-vocalic): √wtd
       (i) wadcad 'he appeared'
       (ii) ya-ctid 'he appears'
    b. Hollow roots (C2 semi-vocalic): √mwt
       (i) ma:t 'he died'
       (ii) ya-mu:t 'he dies'
       (iii) mi:t-a 'manner of death'
    c. Defective roots (C3 semi-vocalic): √rmy
       (i) ramay-tu 'I threw'
       (ii) rama: 'he threw'

Under the word-based school of thought, in which words are primary and roots are non-existent or irrelevant, the surfacing of the glides in certain forms is simply the result of a phonological requirement that a consonant be present in a particular slot in a template, a requirement which is fulfilled by the formation of a glide (Ratcliffe, 1998, p. 63). However, there is some evidence that the glides are present in the underlying representation. One major source of evidence is ZT, a bilingual French-Arabic-speaking individual with aphasia who occasionally metathesises root consonants (Idrissi et al., 2008). When given a word without a surface glide but containing a root with an underlying glide, ZT sometimes reads the word aloud with a surface glide due to metathesis.

(5) Examples of ZT producing surface glides under metathesis (Idrissi et al., 2008)

Target word (underlying root)      Word produced by ZT (metathesised root)
xafa:ʔ (√xfy)                      xa:yif (√xyf)
qa:ʕ (√qwʕ)                        wa:qiʕ (√wqʕ)

In the same paper, Idrissi et al. (2008) present further evidence from Arabic hypocoristics formed on names, as well as from ZT's mistakes in reading Arabic hypocoristics aloud, showing that even though names are commonly assumed to be unanalysed stems, the process of forming hypocoristics in Arabic seems to require the extraction of a root and its insertion into a hypocoristic template, which in some cases surfaces a glide that is not present in the underlying name.

The situation is murkier in General Modern Hebrew where, if one assumes that roots are morphemes, there is a class of roots that may be posited to contain an underlying vowel /a/, such as √ʃva. The segment in this position never surfaces as a consonant, instead surfacing as the vowel [a] or being elided. While some analyses (e.g. Bar-Lev (1977)) assume the presence of an underlying pharyngeal (which did exist historically and also exists in the related dialect of Sephardic Modern Hebrew), Faust (2005) and Pariente (2012) argue that it is unreasonable to expect a General Modern Hebrew speaker to posit an underlying pharyngeal when no pharyngeals ever surface in their dialect. A similar situation is found in Gurage, where Prunet (1996) argues that /a/s in roots function as guttural consonants.

For now, let us posit that a language will prefer to have roots that are consonantal and residues that are vocalic: a tendency rather than an absolute. A segmentation that favours such a distribution will have a higher probability than one with mixed C-V roots and residues, but when the data demands it, a mixed root or residue can be generated.

This would also help us account for certain examples of English ablaut. For example, in stand-stood, the residue in stand can be the vowel-plus-consonant sequence an. In a word like forgot, if we choose not to model for- as a derivational prefix and instead count it as part of the root, then the initial vowel in forgot can be included in the root.

4.2.1 Implementing the CV bias in the model

Implementing this preference in the hierarchical Bayesian model is fairly simple. It requires a division of the segment inventory into consonants and vowels, and the specification of a multinomial distribution, rather than a uniform distribution, over segments. When a new root is created, the multinomial distribution draws consonants with a much higher probability than vowels; mutatis mutandis, when a new residue is drawn, its segments are preferentially drawn from the vocalic inventory.

An interesting question here is whether this is an over-hypothesis that can and should be acquired, and what its connection is to the postulated consonant and vowel bias discussed in 3.1.4 (Nespor et al., 2003). Kemp et al. (2007) describe how over-hypotheses such as the shape bias, which is hypothesised to guide children's lexical learning by biasing them towards generalising object names to other objects with the same shape, can be implemented and learned in hierarchical Bayesian models much like the ones used in this dissertation, showing that a very small number of examples is sufficient to guide the acquisition of these high-level biases.

It may seem counter-intuitive to seek to implement this as a learnable bias, given our finding that a CV bias seems to be crucial to preventing pathological over-generation in the typology. However, the consonant and vowel bias literature has discussed possible mechanisms for the postulated biases, such as differences in the perception of consonants and vowels (e.g. Floccia et al. (2014)) and the ratio of consonants to vowels in the phonetic inventory (e.g. Van Ooijen (1996)), which is cross-linguistically generally tipped towards consonants. Joint modelling of the acquisition of phonetic inventories, phonetic categories, and morphological segmentation within a hierarchical Bayesian model may demonstrate that this is a learnable over-hypothesis that tips towards being a near-universal bias due to these phonetic factors.
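Returning to the implementation sketched at the start of this subsection, the modification amounts to something like the following; the inventories and the probability mass assigned to consonants versus vowels are illustrative assumptions, and segments within each class are drawn uniformly for simplicity.

    import random

    CONSONANTS = list("ptkbdgmnsrl")
    VOWELS = list("aiu")

    def draw_segment(consonant_weight):
        """Draw one segment; consonant_weight is the total probability mass
        assigned to the consonant inventory."""
        if random.random() < consonant_weight:
            return random.choice(CONSONANTS)
        return random.choice(VOWELS)

    def new_root(length, consonant_weight=0.9):
        # Roots are preferentially (but not obligatorily) consonantal.
        return [draw_segment(consonant_weight) for _ in range(length)]

    def new_residue(length, consonant_weight=0.1):
        # Residues are preferentially vocalic.
        return [draw_segment(consonant_weight) for _ in range(length)]

    print(new_root(3), new_residue(2))   # e.g. ['k', 't', 'b'] ['a', 'i']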

4.2.2 Implementing the CV bias in Optimality Theory

Implementing the CV bias directly in OT is problematic, as it has the form of a Morpheme Structure Condition (MSC) (Halle, 1959; Stanley, 1967; Chomsky and Halle, 1968), which is incompatible with OT's fundamental principle of Richness of the Base (Prince and Smolensky, 1993), which states that all inputs are possible and thus that there should be no restriction on possible morphemes. Furthermore, this MSC would have to be conditioned on the status of the morpheme as a root or vocalism.

The CV bias has an additional odd characteristic in that its markedness conditions are the opposite of the desired surface form. Languages in general prefer adjacency between consonants and vowels, seeking to minimise consonant clusters and vowel hiatus. However, the CV bias requires roots to be all-consonantal and vocalisms to be all-vocalic; this contrasts with most MSCs, which suffer from the Duplication Problem (Kenstowicz and Kisseberth, 1977), where the phonotactic restrictions of the MSC are duplicated by phonological rules. For example, in Tonkawa, there are no morphemes with a CCC cluster; separately, there exists a vowel deletion rule that fails to apply when such a cluster would result (Kenstowicz and Kisseberth, 1979). Both of these phenomena can be explained in OT by an undominated *CCC constraint, without any need to reference MSCs.

It is unclear what the path forward should be, should we seek to formally integrate this proposal with OT: should this be taken as a counterpoint to arguments against MSCs based on the Duplication Problem, joining recent arguments that have been made to re-incorporate MSCs into OT (see, e.g., Rasin and Katzir (2014) for an argument from learnability, and Rasin (submitted), who shows that MSCs can account for non-derived environment blocking)? Alternatively, if we return to there being more fundamental mechanisms at play in terms of the consonant and vowel bias, is this something external to the morphophonology altogether?

4.2.3 Biases: Implications and Future Work

The existence of a concatenative bias implies that a non-concatenative grammar may be reanalysed as concatenative under conditions of insufficient evidence. This might happen language-wide, as is arguably the case in Maltese, a Semitic language that has been relatively cut off from the Arab world since 1300, with its linguistic and political influences coming from the Romance world and, more recently, English (Hoberman and Aronoff, 2003). Maltese, Hoberman and Aronoff argue, "may contain relics of root-and-pattern morphology, but its productive verbal morphology is affixal" (p. 63). We can model this process of language change by supplying progressively fewer words of the Arabic sort and more of the Romance sort and modelling where the "tipping point" occurs, for different values of Pmonomorphemic, and matching these to the historical record or to psycholinguistic experiments.

Alternatively, this bias might apply to smaller pockets of morphology, such as the various ablaut patterns of English, most of which are unproductive (Berko, 1958) and some of which are marginally productive (Pinker, 1999). The prediction of the new model in (2-d) is that it would take more examples to learn to generalise an ablaut pattern than a comparable affixal rule. When few examples of an alternation such as goose-geese are available, the model would prefer to treat these words as monomorphemic.

We can test these predictions through artificial grammar experiments with lexicons containing differing numbers of words displaying concatenative versus non-concatenative morphology, as Schuler et al. (2016) do in seeking experimental support for the Tolerance Principle (Yang, 2005, 2016), which has a predicted tipping point derived from a model based on the time complexity of forming a productive rule versus storing lexical items individually.

(6) Tolerance Principle: a rule R defined over a set of N items is productive if and only if e, the number of items not supporting R, does not exceed θ_N = ⌊N / ln N⌋ (Yang, 2016)
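For concreteness, the threshold is easy to compute: a rule defined over 100 items, for example, tolerates at most 21 exceptions, since 100 / ln 100 ≈ 21.7. A quick sketch:

    import math

    def tolerance_threshold(n):
        """theta_N = N / ln N: the maximum number of exceptions a productive
        rule defined over n items can tolerate."""
        return n / math.log(n)

    for n in (10, 100, 1000):
        print(n, math.floor(tolerance_threshold(n)))
    # 10 -> 4, 100 -> 21, 1000 -> 144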

Though this principle is formulated in terms of negative rather than positive evidence, it is easy to re-state it in terms of the number of items supporting R. However, it says nothing about the form of the particular rule R. Schuler et al. (2016) tested only concatenative regular forms and concatenative exceptions, such as:

(7) a. Regular rule: add -ka.

b. Irregular exceptions: add -po, -tay, -lee, -bae, -muy, or -woo.

An interesting follow-up would be to test the same with the following different artificial languages as conditions:

(8) a. Non-concatenative "regular" forms + concatenative exceptions
    b. Concatenative "regular" forms + non-concatenative exceptions
    c. Non-concatenative "regular" forms + non-concatenative exceptions

This would give further experimental evidence for or against the concatenative bias and, if a bias is indeed found, allow us to estimate its magnitude more precisely.

Regarding the CV bias, more work needs to be done to establish the nature of the bias. The experiment in Chapter 3 indicated that mixing consonants and vowels in intercalated roots and residues led to impaired learning, but where did the difficulty truly lie: in roots that contain vowels, in residues that contain consonants, or specifically in the admixture of both? Establishing this could further elucidate its connection to the consonant and vowel bias, and how the model should be modified to incorporate it. We can then go on to explore predictions made by the enhanced model.

4.3 Spreading and truncation

Next, let us address the final part of the McCarthy (1979, 1981) model that remains outstanding: spreading, which applies to both consonants and vowels, and truncation, which applies to extra-long consonantal roots.

In our current model, the words /katab/ and /taka:tab/ have different vocalisms, /a/ and /aa(a)a/, while their passive forms /kutib/ and /tuku:tib/ have different vocalisms, /ui/ and /uu(u)i/. However, this is more elegantly modelled by McCarthy as being due to left-to-right spreading of the vocalism (after application of a special rule that anchors the final /i/ of the passive to the final vowel of the CV-skeleton):

(9) Spreading of vocalism /a/ in Forms I and VI (McCarthy, 1981)

[Autosegmental diagrams: the root tier k-t-b associates to the C slots of the Form I skeleton /CVCVC/ and the Form VI skeleton /CVCVVCVC/ (whose initial C hosts the binyan prefix /t/), while the single melody element /a/ spreads left to right across the V slots.]

(10) Spreading of vocalism /ui/ in Forms I and VI (McCarthy, 1981)

[Autosegmental diagrams: as in (9), but with the melody /u-i/, yielding /kutib/ and /tuku:tib/.]

The same thing occurs with the consonantal root. The second binyan, Form II, has the CV-skeleton /CVCCVC/, which has the same shape as the first quadriliteral binyan, QI. The following diagrams show how the consonantal slots in /CVCCVC/ are filled when the root is biliteral, triliteral, or quadriliteral:

(11) Biliteral root √sm filling the Form II CV-skeleton to form /sammam/

[Diagram: s and m associate to the C slots of /CVCCVC/, with m spreading to fill the remaining C slots; /a/ fills the V slots.]

(12) Triliteral root √ktb filling the Form II CV-skeleton to form /kattab/ (McCarthy, 1981)

[Diagram: k, t, b associate to the C slots of /CVCCVC/, with t doubly linked to the medial CC; /a/ fills the V slots.]

(13) Quadriliteral root √dḥrj filling the QI CV-skeleton to form /daḥraj/ (McCarthy, 1981)

[Diagram: d, ḥ, r, j each associate to one C slot of /CVCCVC/; /a/ fills the V slots.]

There are some complexities to this system: in (12), an additional rule has to enforce the gemination of C2. Contrast this with Form IX, where left-to-right spreading takes place unimpeded:

(14) The (relatively rare) Form IX also has the CV-skeleton /CVCCVC/, but consonants are directly associated left to right to form /katbab/ (McCarthy, 1981)

[Diagram: k, t, b associate left to right to the C slots of /CVCCVC/, with b doubly linked to the final CC; /a/ fills the V slots.]

The second and fifth binyanim thus require a special rule, in the same spirit as the Eighth Binyan Flop, that enforces the geminate nature of the second root consonant.

Conversely, a root may have too many consonants, as is the case in borrowed words that might be reanalysed as having five consonants in the root. In forms where the CV-skeleton has too few consonantal slots to accommodate all the consonants, the final C(s) may be dropped:

(15) Example of root truncation with a borrowed quinqueliteral root, √mɣnṭs, as in /maɣnaṭi:s/ 'magnet', truncated to fit the QI template as /maɣnaṭ/ (McCarthy, 1981)

[Diagram: m, ɣ, n, ṭ associate to the four C slots of /CVCCVC/, the final root consonant s remaining unassociated; /a/ fills the V slots.]

Again, our current model does not do anything to capture such phenomena.

4.3.1 Implementing spreading and truncation in the model

In the current model, when drawing roots and residues, we draw instances that are of exactly the right length to fit the number of r and s slots in the template. Instead, we want to be able to draw roots and residues of any length, and spread or truncate them as needed.

The main question now is how to assign a probability distribution over mappings of root segments to the r slots and residue segments to the s slots, and how restrictive we want this mapping to be a priori. As with the original model, I will construct this section of the model so as to balance keeping our assumptions as few as possible against maintaining a manageable discrete probability space, and leave the rest to inference over actually observed data.

The problem, then, is to assign, say, √ktb to four root slots rrrr. Following the principles of autosegmental phonology (Goldsmith, 1976), on which the model in McCarthy (1979, 1981) is based, we will assume that all the slots must be filled and that no crossings are permitted, but nothing about the directionality of association. Thus, we will sample uniformly from the following possible mappings:

(16) a. kktb
     b. kttb
     c. ktbb²

This is equivalent to the problem of selecting a random ordered partition of n = 4 into m = 3 parts: (1,1,2), (1,2,1), (2,1,1).

When there are more segments than slots, we will simply add empty slots to make up the difference. The placement of the empty slot(s) is again up to a random partitioning process, followed by the same random selection of a mapping via partition; thus the possible mappings of, say, √mɣnṭs to rrrr will be:

(17) a. mɣnṭ_
     b. mɣn_s
     c. mɣ_ṭs
     d. m_nṭs
     e. _ɣnṭs

Thus, we can allow for left-to-right, right-to-left, outside-in and inside-out directions of association (Yip, 1988), along with some very likely unattested patterns of association; whether this overgeneration is a significant barrier to successful segmentation requires actual computational modelling, as in Chapter 2, to be substantiated.
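A sketch of this enumeration follows, under the assumptions just stated (every slot filled, no crossing, no reordering, and spreading of a segment only into adjacent slots); the segments are plain ASCII stand-ins for the roots above, and unassociated segments are marked with an underscore, mirroring (17).

    from itertools import combinations

    def slot_mappings(segments, n_slots):
        """All ways to map an ordered sequence of segments onto n_slots slots.
        If there are at least as many slots as segments, each choice of
        boundary positions is one ordered partition (composition) of n_slots
        into positive parts, i.e. a spreading pattern. If there are more
        segments than slots, each dropped segment is shown as '_'."""
        n = len(segments)
        if n_slots >= n:
            for cut in combinations(range(1, n_slots), n - 1):
                bounds = (0,) + cut + (n_slots,)
                yield [segments[i]
                       for i in range(n)
                       for _ in range(bounds[i + 1] - bounds[i])]
        else:
            for kept in combinations(range(n), n_slots):
                yield [segments[i] if i in kept else "_" for i in range(n)]

    print(["".join(m) for m in slot_mappings(list("ktb"), 4)])
    # ['ktbb', 'kttb', 'kktb'], as in (16)
    print(["".join(m) for m in slot_mappings(list("mgnts"), 4)])
    # ['mgnt_', 'mgn_s', 'mg_ts', 'm_nts', '_gnts'], as in (17)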

4.3.2 Inference over this model

As with all complex discrete Bayesian models, implementing inference for this particular model is challenging. One way to approach this is to make use of existing grammatical formalisms for which efficient parsers exist and which subsume the model over which inference is necessary.

This is what Botha and Blunsom (2013) do in their approach to modelling non-concatenative morphology, in which they directly invoke the grammatical formalism of simple Range Concatenation Grammars (sRCGs) (Boullier, 2000). These are theoretically equivalent to Multiple Context-Free Grammars (MCFGs) (Seki et al., 1991) and have also been shown to be equivalent to Minimalist Grammars (Stabler, 1996). In the Chomsky hierarchy, these are equivalent in power to mildly context-sensitive grammars (Joshi et al., 1990).

² An assumption made here: mappings that drop consonants, such as kttt, are not allowed.

Botha and Blunsom (2013)'s generative model for Arabic words, written in terms of sRCG expansion rules, is as follows:

(18) a. Word → Pre* Stem Suf*
        This rule says simply that words can consist of any number of prefixes, a single stem, and any number of suffixes.
     b. Pre | Stem | Suf → Char+
        Prefixes, monomorphemic stems, and suffixes can expand into strings of segments.
     c. Multiple Stem expansion rules of the format:
        Stem(abcde) → R3(ace) T2(bd)³
        This particular rule models intercalations such as √ktb + /ui/ → /kutib/.
     d. Rn | Tm → Char+⁴
        Roots and vocalisms expand into strings of the appropriate number of segments.

A fully segmented word can thus be represented by a parse tree:

(19) Tree representation of the word /wa-kitab-i/ (Botha and Blunsom, 2013)

Word(wakitabi)
├── Pre(wa):           w a
├── Stm(kitab)
│   ├── Root(k,t,b):   k  t  b
│   └── Template(i,a): i  a
└── Suf(i):            i

³ Botha and Blunsom (2013) refer to the vocalism as the "template", hence T. The number following indicates the length of the root/vocalism to be expanded.
⁴ To be precise, in Botha and Blunsom (2013)'s model, this rewriting takes place in recursive expansion steps: for example, the expansion of R3 to √ktb takes place with these three rewrite rules: R3 → R2 b; R2 → R1 t; R1 → k.

Furthermore, each of these categories has its own sub-lexicon governed by a Pitman-Yor distribution. Just as in our original model, where we could pick any root in the existing root sub-lexicon with probability proportional to its existing frequency, or else an entirely new root, here, for each non-terminal node in this subtree, we can pick an existing expansion from its category sub-lexicon or choose to generate an entirely new morpheme in that category. This is called an adaptor grammar, originally defined by Johnson et al. (2007) over context-free grammars to model concatenative English morphology.
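To illustrate how a Stem expansion rule of the kind in (18-c), or the spreading rule introduced in (20) below, maps a root and a vocalism onto a surface stem, here is a small sketch in which a rule's left- and right-hand sides are represented as strings of variables; this is a paraphrase of the rule format, not Botha and Blunsom's implementation.

    def apply_stem_rule(stem_vars, root_vars, template_vars, root, vocalism):
        """Instantiate a rule like Stem(abcde) -> R3(ace) T2(bd): bind the root
        segments to root_vars, the vocalism segments to template_vars, and read
        the stem off stem_vars. A repeated variable (as in Stem(abcbd) ->
        R3(acd) T1(b)) models spreading of the corresponding segment."""
        binding = dict(zip(root_vars, root))
        binding.update(zip(template_vars, vocalism))
        return "".join(binding[v] for v in stem_vars)

    # Stem(abcde) -> R3(ace) T2(bd):  ktb + ui  ->  kutib
    print(apply_stem_rule("abcde", "ace", "bd", "ktb", "ui"))
    # Stem(abcbd) -> R3(acd) T1(b):   ktb + a   ->  katab  (spreading of /a/)
    print(apply_stem_rule("abcbd", "acd", "b", "ktb", "a"))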

Botha and Blunsom (2013) do not attempt to model spreading, but we can straightforwardly extend their rule set to cover such phenomena. To model /katab/ as the interleaving of √ktb with a vocalism consisting of a single underlying /a/ that spreads into the two vocalic slots, we would simply need a rule of the following sort:

(20) Stem(abcbd) → R3(acd) T1(b)

All vocalisms involving the spreading of /a/ can therefore be modelled by a single rule:

(21) T1 → a

These can then share probability mass, rather than dividing it up over rules such as:

(22) a. T1 → a
     b. T2 → aa
     c. T3 → aaa
     d. ...

Still, there remains an impediment to this fully approximating the McCarthy (1979, 1981) model, namely that rule (18-c), Stem(abcde) → R3(ace) T2(bd), and rule (20), Stem(abcbd) → R3(acd) T1(b), are not the same rule and do not share probability mass, as opposed to the McCarthy model, where both are represented by the identical CV-template /CVCVC/. We have conquered the problem of dividing probability mass over disparate instances of the same roots and residues, only to re-create the problem in the template sub-lexicon. Thus, further engineering to combine the probability mass over templatic rules during inference is still necessary.

Truncation can conceivably be handled using rules such as the following:

(23) Rules to cover truncation in /maɣnaṭ/:

a. Stem(abcdbe) → R5(acdef) T1(b)
b. R5 → mɣnṭs
c. T1 → a

We can thus have /maɣnaṭ/ and /maɣnaṭi:s/ share the same root. The danger here is that sRCG parsers will generally only postulate a rule expansion once they have located all the segments on the right-hand side of the rule, but here the segment /s/ will not be seen. Therefore, the parser has to be augmented to be able to postulate missing segments, potentially vastly increasing parsing time; this is another engineering hurdle to overcome.

4.4 Memory limitations

Lastly, let us address an unexpected outcome of the experiment in Chapter 3: the finding that extractability, contrary to the predictions of the model, did not have an effect on word acceptance. To recap, high-extractability morphemes were ones that occurred once but co-occurred with a high-frequency morpheme; low-extractability morphemes occurred with other low-frequency, low-extractability morphemes.

An ideal Bayesian learner would be more successful at segmenting words containing high-frequency morphemes than low-frequency ones, and indeed we saw effects of frequency on word recognition. However, an ideal learner who had successfully segmented out a high-frequency morpheme from a word should also have segmented out its corresponding low-frequency counterpart. This we did not see.

Pearl et al. (2010) provide a possible explanation: their suggestion is that human learners are not ideal Bayesian learners, but are constrained by memory limitations. The learner in Chapter 2 could store the entire wordstream in memory and make several passes over the same set of data. A human learner is more likely to forget, and to have only online access to the data. Pearl et al. (2010) found that, in fact, incorporating memory limitations into online learners extracted a better word lexicon in the task of word segmentation than an ideal Bayesian learner did.

A direction for future research, therefore, would be to contrast the performance of a memory-limited, online Bayesian learner against an ideal batch learner on the computational models proposed herein.

4.5 Conclusion

To conclude, while learning non-concatenative morphology of the root-and-pattern sort seen in Semitic languages may appear to be a difficult task, I have shown that, using inference over hierarchical Bayesian models, we can successfully segment the Arabic verbal stem lexicon, and I have discussed ways to augment this model to cover more aspects of Arabic morphology.

Hierarchical Bayesian models have the advantage of making explicit our linguistic assumptions about how words are formed, and of making testable predictions that can be proven or disproven through psycholinguistic experiments and through matching to statistics in lexicons and in historical and child language corpora. The main challenge to their wider application in linguistics is the difficulty of performing inference over such models; however, this challenge should become less daunting over time as more research is put into improving the power, performance, and usability of general inference algorithms.

Appendix A

Formal specification of the generative model

This is a formal specification of the generative model in 2.2, and is identical to the specification given in Fullwood and O'Donnell (2013).

Let G be a distribution over primitive elements of the lexicon (e.g., words, roots, residues, templates, morphemes, etc.). The behaviour of a Pitman-Yor process PYP(a, b, G) with base measure G and parameters a and b can be described as follows. The first time we sample from PYP(a, b, G), a new lexical item is sampled using G. On subsequent samples from PYP(a, b, G), we either reuse an existing lexical item i with probability (n_i − a) / (N + b), where N is the number of lexical items sampled so far, n_i is the number of times that lexical item i has been used in the past, and 0 < a < 1 and b > −a are parameters of the model (in our final experiments, we used values of a = 0.9 and b = 1); or, alternatively, we sample a new lexical item with probability (Ka + b) / (N + b), where K is the number of existing lexical items in the morpheme lexicon.
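A minimal sketch of sampling from such a Pitman-Yor process is given below, with the base measure passed in as a function; it mirrors the reuse/new-item probabilities above but is not the implementation used for the experiments, and the CVC base measure is purely illustrative.

    import random

    class PYP:
        def __init__(self, a, b, base):
            assert 0 <= a < 1 and b > -a
            self.a, self.b, self.base = a, b, base
            self.counts = {}          # lexical item -> number of times used
            self.total = 0            # N: total draws so far

        def draw(self):
            K = len(self.counts)
            # Probability of generating a brand-new item: (K*a + b) / (N + b).
            if random.random() < (K * self.a + self.b) / (self.total + self.b):
                item = self.base()
            else:
                # Reuse item i with probability proportional to (n_i - a).
                r = random.uniform(0, sum(n - self.a for n in self.counts.values()))
                for item, n in self.counts.items():
                    r -= n - self.a
                    if r <= 0:
                        break
            self.counts[item] = self.counts.get(item, 0) + 1
            self.total += 1
            return item

    # Illustrative base measure: a fresh random CVC morpheme.
    base = lambda: random.choice("ptk") + random.choice("aiu") + random.choice("ptk")
    lexicon = PYP(a=0.9, b=1.0, base=base)
    print([lexicon.draw() for _ in range(10)])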

In our model, we maintain three sub-lexicons for templates (L_Tp), roots (L_Rt), and residues (L_Rs), each drawn from a Pitman-Yor process with its own hyperparameters.

(1) L_X ~ PYP(a_X, b_X, G_X)

where X ∈ {Tp, Rt, Rs}.

Words are drawn by first drawing a template, then drawing a root and a residue (of the appropriate length) and inserting the segments from the root and residue in the appropriate positions in the word as indicated by the template. Our templates are strings in {Rt, Rs}* indicating for each position in a word whether that position is part of the word's root (Rt) or residue (Rs). These templates themselves are drawn from a base measure G_Tp which is defined as follows. To add a new template to the template lexicon, first draw a length for that template, l, from a Poisson distribution.

(2) l ~ POISSON(5)

In the example illustrated in (18), we sampled a template length of 5. We then sample a template of length l by drawing a Bernoulli random variable t_i for each position i ∈ 1..l, indicating whether position i is a root or residue position.

(3) t_i ~ BERNOULLI(θ)

The base measure over templates, G_Tp, is defined as the concatenation of the t_i's. The base distributions over roots and residues, G_Rt and G_Rs, are drawn in the following manner. Having drawn a template, T, we know the lengths of the root, l_Rt, and of the residue, l_Rs. For each position in the root or residue, r_i, where i ∈ 1..l_Rt/Rs, we sample a phone from a uniform distribution over phones.

(4) r_i ~ UNIFORM(alphabet)
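Assembling the pieces in (2)-(4), the sketch below generates a template, a root, and a residue and interleaves them into a word. It is again a minimal Python paraphrase under stated assumptions: the phone alphabet, the Bernoulli parameter of 0.5, and the minimum template length of 1 are illustrative choices, and in the full model the template, root, and residue would be drawn from the PYP lexicons L_Tp, L_Rt, and L_Rs, with these functions serving only as the base measures.

import math
import random

ALPHABET = "bdfghklmnpqrstwaiou"      # placeholder phone inventory (assumption)

def sample_poisson(lam):
    # Knuth's method for Poisson sampling, using only the standard library.
    threshold, k, p = math.exp(-lam), 0, 1.0
    while p > threshold:
        k += 1
        p *= random.random()
    return k - 1

def sample_template(mean_length=5, theta=0.5):
    # Base measure G_Tp: l ~ Poisson(mean_length); each position's t_i ~ Bernoulli(theta)
    # marks it as a root (Rt) or residue (Rs) slot.
    l = max(sample_poisson(mean_length), 1)   # at least one slot (simplification)
    return ["Rt" if random.random() < theta else "Rs" for _ in range(l)]

def sample_segments(length):
    # Base measures G_Rt / G_Rs: each segment r_i ~ Uniform(alphabet).
    return [random.choice(ALPHABET) for _ in range(length)]

def sample_word(template):
    # Fill Rt slots with root segments and Rs slots with residue segments, in order.
    root = iter(sample_segments(template.count("Rt")))
    residue = iter(sample_segments(template.count("Rs")))
    return "".join(next(root) if slot == "Rt" else next(residue) for slot in template)

template = sample_template()
print(template, sample_word(template))

A sampled template such as [Rt, Rs, Rt, Rt, Rs] then dictates which positions of the word are filled from the root and which from the residue, mirroring the length-5 example referred to above.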

Appendix B

Artificial Grammar Experimental Stimuli

The following are the stimuli used in the Artificial Grammar experiment of Chapter 3. For brevity, I will only present one of the eight permutations of segments for each of the three conditions in this appendix. Full sets of stimuli that were supplied as items.txt to Experigen (Becker and Levine, 2013) are available at https://github.com/michelleful/DissertationMaterials/martian_experiment.

(1) Legend for symbols:

Symbol                            Train set frequency         Test/Distractor set frequency
[high-frequency morpheme]         4                           1
[high-extractability morpheme]    1 (High Extractability)     1 (High Extractability)
[low-extractability morpheme]     1 (Low Extractability)      1 (Low Extractability)
Word frequency                    1                           1

(2) Train set, presented three times in randomised order

Item #   English-like   Arabic-like   Unattested C-V mix
1        mahtsee        teeksoo       leepmai
2        sohppah        kaitpoh       mahtkoo
3        pohsloo        peitkai       sahkpoo
4        taistee        taimtai       meiktee
5        kahlsah        seemlee       pahmmah
6        pootkoo        keilkoo       kailmoh
7        meekpee        mohkloo       teelkai
8        lahploh        kootsei       looslei
9        tohkmai        lahspai       sohpsoh
10       mooptei        pahpsah       mohtpee
11       toomlai        pooltee       paiklee
12       leimpai        looptoh       peitloo
13       kailmoh        meeslei       tooltoh
14       seilkoh        tohsmei       koopsai
15       keettah        saikpah       lohmkah
16       leessoo        seimkah       seempah
17       peikkei        lohpmoh       keissei
18       saimmei        mahlmee       taistei
19       mahtloo        teiksai       lahppai
20       sohpsee        keetpoo       meetmoo
21       mahtpai        tooksoh       leiplai
22       toomsee        peeltoo       peekmee
23       mahtmei        tahksee       laiptai
24       peiksee        leepmoo       keesmei
25       sohpkoh        kohtpei       mootsoo
26       kailloo        meislai       tahlpoh
27       sohpmei        kahtpee       maittoo
28       peikloo        leipmai       kahspei
29       toomkoh        pohltei       pooksee
30       kailpai        moosloh       teilloh
31       toomtah        pailtah       pohkkee
32       leespai        soomkoh       seimlah
33       kailtah        maislah       tohlkoh
34       leeskoh        sohmkei       soomsah
35       peiktah        laipmah       kohskei
36       leesmei        sahmkee       saimtah

(3) Test set, presented once in randomised order, interleaved with distractor set

Item # English-like Arabic-like Unattested C-V mix

1        mahtsee        teeksoo       leepmai
2        sohppah        kaitpoh       mahtkoo
3        pohsloo        peitkai       sahkpoo
4        taistee        taimtai       meiktee
5        toompai        pooltoh       peiklee
6        kailsoo        meislah       teelpoh
7        keetkoh        sohkpei       loomsah
8        peikkoo        leipmoo       kaismei
9        kahlmei        sahmlee       paimtah
10       leespoo        saimkee       seimkah
11       sooptah        tailkah       lohkkah
12       leimlai        looptee       paitloo
13       saimloh        moolmei       tooslei
14       meekkei        mohkloh       teilsai
15       tohktei        lahspah       sohppoh
16       seilmee        teismei       koopmai
17       mootmoh        seelmei       mooptoh
18       pootsai        kahlkoo       kahltoh
19       laimsah        keesmee       tahsmee

(4) Distractor set, presented once in randomised order, interleaved with test set

Item #   English-like   Arabic-like   Unattested C-V mix
20       meiktoo        keeltoh       seellah
21       paimlee        pohslai       pahssai
22       tahlmoo        meespah       koolpoo
23       teekpoo        peikloh       teipmei
24       seeskai        looksoo       kahmlah
25       mohpmah        taimkei       saiksee
26       lootmee        lohkpee       laippoh
27       soomsoh        mohtpah       meektai
28       paimsai        kaimloo       meilpah
29       kohspei        soopmah       lohmlei
30       kahtlei        laimtee       pohlmoo
31       laistoh        kahpsoo       kootpah
32       kooptai        meiptee       sohmtoh
33       teimlah        pahtsoh       soopkee
34       keilsei        tohtkee       peemsei
35       peekkah        meekpai       teessah
36       lohkpoh        soolkai       lohtmai
37       sahltai        toolmai       meitpah
38       peetkee        lahsmei       taikkoh

Bibliography

Al-Shawakfa, Emad, Amer Al-Badarneh, Safwan Shatnawi, Khaleel Al-Rabab'ah, and Basel Bani-Ismail. 2010. A comparison study of some Arabic root finding algorithms. Journal of the American Society for Information Science and Technology 61:1015-1024. URL http://dx.doi.org/10.1002/asi.21301.

Albright, Adam, and Bruce Hayes. 2003. Rules vs. analogy in English past tenses: A computational/experimental study. Cognition 90:119-161.

Aronoff, Mark. 1994. Morphology by itself: Stems and inflectional classes, volume 22. MIT Press.

Aslin, Richard N, Jenny R Saffran, and Elissa L Newport. 1998. Computation of conditional probability statistics by 8-month-old infants. Psychological science 9:321-324.

Baayen, Harald R., Richard Piepenbrock, and Leon Gulikers. 1995. The CELEX lexical database. Release 2 (CD-ROM). Philadelphia, Pennsylvania: Linguistic Data Consortium, University of Pennsylvania.

Bagemihl, Bruce. 1989. The crossing constraint and backwards languages. Natural Language & Linguistic Theory 7:481-549.

Bailey, Todd M, and Ulrike Hahn. 2001. Determinants of wordlikeness: Phonotactics or lexical neighborhoods? Journal of Memory and Language 44:568-591.

Baker, Mark. 1985. The Mirror Principle and morphosyntactic explanation. Linguistic Inquiry 16:373-415.

Bar-Lev, Z. 1977. Natural-abstract Hebrew phonology. Folia linguistica 11:259-272.

Bat-El, Outi. 1994. Stem modification and cluster transfer in Modern Hebrew. Natural Language and Linguistic Theory 12:571-593.

Bat-El, Outi. 2003. The fate of the consonantal root and the binyan in Optimality Theory. Recherches Linguistiques de Vincennes 32:31-60.

Bates, Douglas, Martin Mächler, Ben Bolker, and Steve Walker. 2015. Fitting linear mixed-effects models using lme4. Journal of Statistical Software 67:1-48.

Becker, Michael, and Jonathan Levine. 2013. Experigen: An online experiment platform. Available at http://becker.phonologist.org/experigen.

Beesley, Kenneth R. 1991. Computer analysis of Arabic morphology: A two-level approach with detours. In Perspectives on Arabic Linguistics III: Papers from the Third Annual Symposium on Arabic Linguistics, ed. Bernard Comrie and Mushira Eid, 155-172. John Benjamins. Read originally at the Third Annual Symposium on Arabic Linguistics, University of Utah, Salt Lake City, Utah, 3-4 March 1989.

Bentin, Shlomo, and Ram Frost. 2001. Linguistic theory and psychological reality: A reply to Boudelaa & Marslen-Wilson. Cognition 81:113-118.

Benua, Laura. 1997. Transderivational identity: Phonological relations between words. Doctoral Dissertation, UMass Amherst.

Berg, Thomas, and Hassan Abd-El-Jawad. 1996. The unfolding of suprasegmental representations: A cross-linguistic perspective. Journal of Linguistics 32:291-324.

Berko, Jean. 1958. The child's learning of English morphology. Word 14:150-177.

Bohas, Georges. 1997. Matrices, etymons, racines: elements d'une theorie lexicologique du vocabulaire arabe. Peeters Publishers.

Bonatti, Luca L, Marcela Peña, Marina Nespor, and Jacques Mehler. 2005. Linguistic constraints on statistical computations: The role of consonants and vowels in continuous speech processing. Psychological Science 16:451-459.

Botha, Jan A., and Phil Blunsom. 2013. Adaptor grammars for learning non-concatenative morphology. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 345-356. Seattle, Washington, USA: Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D13-1034.

Boudelaa, Sami, and William D Marslen-Wilson. 2001. Morphological units in the Arabic mental lexicon. Cognition 81:65-92.

Boudelaa, Sami, and William D Marslen-Wilson. 2004. Abstract morphemes and lexical representation: The CV-Skeleton in Arabic. Cognition 92:271-303.

Boullier, Pierre. 2000. A cubic time extension of context-free grammars. Grammars 3:111-131.

Buckwalter, Tim. 2002. Buckwalter Arabic morphological analyzer version 1.0. Technical Report LDC2002L49, Linguistic Data Consortium.

Campbell, Lyle. 1998. Historical Linguistics. Edinburgh University Press, 1 edition.

Chomsky, Noam, and Morris Halle. 1968. The sound pattern of English. New York: Harper and Row.

Cutler, Anne, Nuria Sebastián-Gallés, Olga Soler-Vilageliu, and Brit Van Ooijen. 2000. Constraints of vowels and consonants on lexical selection: Cross-linguistic comparisons. Memory & Cognition 28:746-755.

Daya, Ezra, Dan Roth, and Shuly Wintner. 2004. Learning Hebrew roots: Machine learning with linguistic constraints. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing.

De Roeck, Anne N, and Waleed Al-Fares. 2000. A morphologically sensitive clustering algorithm for identifying Arabic roots. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, 199-206. Association for Computational Linguistics.

Drake, Shiloh Nicole. 2018. L1 biases in learning root-and-pattern morphology. Doctoral Dissertation, University of Arizona.

Dukes, Kais. 2011. Quranic Arabic Corpus. URL http://corpus.quran.com/.

Dutoit, Thierry, Vincent Pagel, Nicolas Pierret, François Bataille, and Olivier Van der Vrecken. 1996. The MBROLA project: Towards a set of high quality speech synthesizers free of use for non commercial purposes. In Spoken Language, 1996. ICSLP 96. Proceedings., Fourth International Conference on, volume 3, 1393-1396. IEEE.

Elsner, Micha, Sharon Goldwater, Naomi Feldman, and Frank Wood. 2013. A joint learning model of word segmentation, lexical acquisition, and phonetic variability. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 42-54.

Faust, Noam. 2005. The fate of Hebrew gutturals. M.A. thesis, Tel Aviv University.

Faust, Noam. 2012. Non-concatenative realization in the verbal inflection of Modern Hebrew. Morphology 22:453-484.

Feldman, Naomi H, Thomas L Griffiths, Sharon Goldwater, and James L Morgan. 2013. A role for the developing lexicon in phonetic category acquisition. Psychological Review 120:751.

Floccia, Caroline, Thierry Nazzi, Claire Delle Luche, Silvana Poltrock, and Jeremy Goslin. 2014. English-learning one- to two-year-olds do not show a consonant bias in word learning. Journal of Child Language 41:1085-1114.

Frank, Michael, Harry Tily, Inbal Arnon, and Sharon Goldwater. 2010a. Beyond transitional probabilities: Human learners impose a parsimony bias in statistical word segmentation. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 32.

Frank, Michael C, Sharon Goldwater, Thomas L Griffiths, and Joshua B Tenenbaum. 2010b. Modeling human performance in statistical word segmentation. Cognition 117:107-125.

French, Koleen M. 1988. Insights into Tagalog: Reduplication, infixation, and stress from nonlinear phonology. Summer Institute of Linguistics & University of Texas at Arlington.

Frisch, Stefan, Michael Broe, and Janet Pierrehumbert. 1997. Similarity and phonotactics in Arabic. Rutgers Optimality Archive 223.

Frisch, Stefan A, Nathan R Large, and David B Pisoni. 2000. Perception of wordlikeness: Effects of segment probability and length on the processing of nonwords. Journal of Memory and Language 42:481.

Frost, Ram, Kenneth I Forster, and Avital Deutsch. 1997. What can we learn from the morphology of Hebrew? A masked-priming investigation of morphological representation. Journal of Experimental Psychology: Learning, Memory, and Cognition 23:829.

Fullwood, Michelle, and Tim O'Donnell. 2013. Learning non-concatenative morphology. In Proceedings of the Fourth Annual Workshop on Cognitive Modeling and Computational Linguistics (CMCL), 21-27. Sofia, Bulgaria: Association for Computational Linguistics. URL http://www.aclweb.org/anthology/W13-2603.

Geman, S., and D. Geman. 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-6:721-741.

Goldsmith, John. 2000. Linguistica: An automatic morphological analyzer. In Proceedings of 36th meeting of the Chicago Linguistic Society.

Goldsmith, John. 2001. Unsupervised Learning of the Morphology of a Natural Language. Computational Linguistics 27:153-189.

Goldsmith, John A, Jackson L Lee, and Aris Xanthos. 2017. Computational learning of morphology. Annual Review of Linguistics 3:85-106.

Goldsmith, John Anton. 1976. Autosegmental phonology. Doctoral Dissertation, Massachusetts Institute of Technology.

Goldwater, Sharon, Thomas L. Griffiths, and Mark Johnson. 2009. A Bayesian framework for word segmentation: Exploring the effects of context. Cognition 112:21-54.

Goldwater, Sharon, and Mark Johnson. 2003. Learning OT constraint rankings using a maximum entropy model. In Proceedings of the Stockholm workshop on variation within Optimality Theory, ed. Jennifer Spenader, Anders Eriksson, and Östen Dahl, 111-120. Stockholm: Stockholm University.

Greenberg, Joseph H, and James J Jenkins. 1964. Studies in the psychological correlates of the sound system of American English. Word 20:157-177.

Halle, Morris. 1959. The Sound Pattern of Russian: A linguistic and acoustical investigation, volume 1. The Hague: Mouton.

Hammarström, Harald, and Lars Borin. 2011. Unsupervised learning of morphology. Computational Linguistics 37:309-350.

Hastings, W. K. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57:97-109.

Hayes, Bruce, and Colin Wilson. 2008. A maximum entropy model of phonotactics and phonotactic learning. Linguistic Inquiry 39:379-440.

Hoberman, Robert D, and Mark Aronoff. 2003. The verbal morphology of Maltese. Language Acquisition and Language Disorders 28:61-78.

Hockett, Charles F. 1954. Two models of grammatical description. Word 10:210-234.

Idrissi, Ali, and Eva Kehayia. 2004. Morphological units in the Arabic mental lexicon: Evidence from an individual with deep dyslexia. Brain and Language 90:183-197.

Idrissi, Ali, Jean-François Prunet, and Renée Béland. 2008. On the mental representation of Arabic roots. Linguistic Inquiry 39:221-259.

Johnson, Mark, Thomas L Griffiths, and Sharon Goldwater. 2007. Adaptor grammars: A framework for specifying compositional nonparametric Bayesian models. In Advances in neural information processing systems, 641-648.

Joshi, Aravind K, K Vijay Shanker, and David Weir. 1990. The convergence of mildly context-sensitive grammar formalisms. Technical Reports (CIS) 539.

Jusczyk, Peter W, and Paul A Luce. 1994. Infants' sensitivity to phonotactic patterns in the native language. Journal of Memory and Language 33:630-645.

Kataja, Laura, and Kimmo Koskenniemi. 1988. Finite-state description of Semitic morphology: a case study of Ancient Akkadian. In Proceedings of the 12th conference on Computational linguistics - Volume 1, COLING '88, 313-315. Stroudsburg, PA, USA: Association for Computational Linguistics. URL http://dx.doi.org/10.3115/991635.991699.

Kemp, Charles, Amy Perfors, and Joshua B Tenenbaum. 2007. Learning overhypotheses with hierarchical Bayesian models. Developmental science 10:307-321.

Kenstowicz, Michael, and Charles Kisseberth. 1977. Topics in phonological theory. New York: Academic Press.

Kenstowicz, Michael, and Charles Kisseberth. 1979. Generative phonology: Description and theory. Academic Press.

Khaliq, Bilal, and John Carroll. 2013. Unsupervised induction of Arabic root and pattern lexicons using machine learning. In Proceedings of Recent Advances in Natural Language Processing, Hissar, Bulgaria, 7-13 September 2013, 350-356.

LaCross, Amy. 2015. Khalkha Mongolian speakers' vowel bias: L1 influences on the acquisition of non-adjacent vocalic dependencies. Language, Cognition and Neuroscience 30:1033-1047.

Lee, Yoong Keok. 2012. Unsupervised Bayesian Models of Morphological Segmen- tation. Slides presented at the Computational Morphology Working Group.

Lee, Yoong Keok, Aria Haghighi, and Regina Barzilay. 2011. Modeling syntactic context improves morphological segmentation. In Proceedings of the Conference on Natural Language Learning.

Mahfoudhi, Abdessatar. 2007. The place of the etymon and the phonetic matrix in the Arabic mental lexicon. Journal of Arabic and Islamic Studies 7:74-102.

Marcus, Mitchell P., Beatrice Santorini, Mary Ann Marcinkiewicz, and Ann Taylor. 1999. Treebank 3 technical report. Technical report, Linguistic Data Consortium, Philadelphia.

Marquis, Alexandra, and Rushen Shi. 2012. Initial morphological learning in preverbal infants. Cognition 122:61-66.

McCarthy, John J. 1979. Formal Problems in Semitic Phonology and Morphology. Doctoral Dissertation, Massachusetts Institute of Technology.

McCarthy, John J. 1981. A prosodic theory of nonconcatenative morphology. Linguistic Inquiry 12:373-418.

McCarthy, John J. 2003. OT constraints are categorical. Phonology 20:75-138.

McCarthy, John J., and Alan Prince. 1990. Foot and word in prosodic morphology: The Arabic broken plural. Natural Language and Linguistic Theory 8:209-283.

McCarthy, John J, and Alan Prince. 1993a. Generalized alignment. In Yearbook of morphology 1993, 79-153. Springer.

McCarthy, John J, and Alan Prince. 1995. Faithfulness and reduplicative identity. Linguistics Department Faculty Publication Series.

McCarthy, John J, and Alan S Prince. 1993b. Prosodic morphology: Constraint interaction and satisfaction. Linguistics Department Faculty Publication Series, 14.

Meyer, Anthony, and Markus Dickinson. 2016. A multilinear approach to the unsupervised learning of morphology. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, 131-140.

Mintz, Toben H. 2004. Morphological segmentation in 15-month-old infants. In Proceedings of the 28th Annual Boston University Conference on Language Development, volume 2, 363-374. Cascadilla Press Somerville, MA.

Mintz, Toben H, Rachel L Walker, Ashlee Welday, and Celeste Kidd. 2018. Infants' sensitivity to vowel harmony and its role in segmenting speech. Cognition 171:95-107.

Mochihashi, Daichi, Takeshi Yamada, and Naonori Ueda. 2009. Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling. In ACL/IJCNLP, 100-108.

Narasimhan, Karthik, Regina Barzilay, and Tommi Jaakkola. 2015. An unsupervised method for uncovering morphological chains. Transactions of the Association for Computational Linguistics 3.

Nazzi, Thierry, Silvana Poltrock, and Katie Von Holzen. 2016. The developmental origins of the consonant bias in lexical processing. Current Directions in Psychological Science 25:291-296.

Nespor, Marina, Marcela Peña, and Jacques Mehler. 2003. On the different roles of vowels and consonants in speech processing and language acquisition. Lingue e linguaggio 2:203-230.

Newport, Elissa L, and Richard N Aslin. 2004. Learning at a distance I. Statistical learning of non-adjacent dependencies. Cognitive psychology 48:127-162.

Nishibayashi, Léo-Lyuki, and Thierry Nazzi. 2016. Vowels, then consonants: Early bias switch in recognizing segmented word forms. Cognition 155:188-203.

Nosofsky, Robert M. 1986. Attention, similarity, and the identification-categorization relationship. Journal of Experimental Psychology: General 115:39.

Orgun, Cemil Orhan, and Ronald L Sprouse. 1999. From MPARSE to CONTROL: deriving ungrammaticality. Phonology 16:191-224.

Pariente, Itsik. 2012. Grammatical paradigm uniformity. Morphology 22:485-514.

Pearl, Lisa, Sharon Goldwater, and Mark Steyvers. 2010. Online learning mechanisms for Bayesian models of word segmentation. Research on Language and Computation 8:107-132.

Pinker, Steven. 1999. Words and rules: The ingredients of language. New York: HarperCollins.

Pitman, Jim, and Marc Yor. 1995. The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. Technical report, Department of Statistics, University of California, Berkeley.

Poon, Hoifung, Colin Cherry, and Kristina Toutanova. 2009. Unsupervised morphological segmentation with log-linear models. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, NAACL '09, 209-217. Stroudsburg, PA, USA: Association for Computational Linguistics.

Prince, Alan, and Paul Smolensky. 1993. Optimality Theory: constraint interaction in generative grammar. Ms, Rutgers University and University of Colorado, Boulder.

Prunet, Jean-François. 1996. Guttural vowels. Essays on Gurage language and culture 175-203.

Prunet, Jean-François. 2006. External evidence and the Semitic root. Morphology 16:41.

Prunet, Jean-François, Renée Béland, and Ali Idrissi. 2000. The mental representation of Semitic words. Linguistic Inquiry 31:609-648.

Rasin, Ezer. submitted. Morpheme structure constraints and blocking in non-derived environments. MIT.

Rasin, Ezer, and Roni Katzir. 2014. A learnability argument for constraints on underlying representations. In Proceedings of the 45th Annual Meeting of the North East Linguistic Society (NELS 45), Cambridge, Massachusetts.

Ratcliffe, Robert R. 1997. Prosodic templates in a word-based morphological analysis of Arabic. In Perspectives on Arabic linguistics X, ed. Mushira Eid and Robert R. Ratcliffe, 147-171. Amsterdam: John Benjamins.

Ratcliffe, Robert R. 1998. The broken plural problem in Arabic and comparative Semitic: allomorphy and analogy in non-concatenative morphology, volume 168. John Benjamins Publishing.

Rodrigues, Paul, and Damir Cavar. 2007. Learning Arabic morphology using statistical constraint-satisfaction models. In Perspectives on Arabic Linguistics: Proceedings of the 19th Arabic Linguistics Symposium, ed. Elabbas Benmamoun, 63-75. John Benjamins Publishing Company.

Saffran, Jenny R, Richard N Aslin, and Elissa L Newport. 1996a. Statistical learning by 8-month-old infants. Science 274:1926-8.

Saffran, Jenny R, Elissa L Newport, and Richard N Aslin. 1996b. Word segmentation: The role of distributional cues. Journal of Memory and Language 35:606-621.

Schuler, Kathryn D, Charles Yang, and Elissa L Newport. 2016. Testing the tolerance principle: Children form productive rules when it is more computationally efficient to do so. In Proceedings of the 38th Annual Conference of the Cognitive Science Society.

Seki, Hiroyuki, Takashi Matsumura, Mamoru Fujii, and Tadao Kasami. 1991. On multiple context-free grammars. Theoretical Computer Science 88:191-229.

Snyder, Benjamin, and Regina Barzilay. 2008. Unsupervised multilingual learning for morphological segmentation. In ACL, 737-745.

Stabler, Edward. 1996. Derivational minimalism. In International Conference on Logical Aspects of Computational Linguistics, 68-95. Springer.

Stanley, Richard. 1967. Redundancy rules in phonology. Language 393-436.

Tucker, Matthew A. 2010. Roots and prosody: the Iraqi Arabic derivational verb. Recherches linguistiques de Vincennes 31-68.

Ussishkin, Adam. 2003. Templatic effects as fixed prosody: The verbal system in Semitic. Amsterdam Studies in the Theory and History of Linguistic Science, series 4 511-530.

Van Ooijen, Brit. 1996. Vowel mutability and lexical selection in English: Evidence from a word reconstruction task. Memory & Cognition 24:573-583.

Vitevitch, Michael S, Paul A Luce, Jan Charles-Luce, and David Kemmerer. 1997. Phonotactics and syllable stress: Implications for the processing of spoken nonsense words. Language and speech 40:47-62.

Xanthos, Aris. 2018. A Phonological Approach to the Unsupervised Learning of Root-and-Pattern Morphology. Shaping Phonology 309-327.

Xu, Jia, Jianfeng Gao, Kristina Toutanova, and Hermann Ney. 2008. Bayesian Semi-Supervised Chinese Word Segmentation for Statistical Machine Translation. In COLING, ed. Donia Scott and Hans Uszkoreit, 1017-1024.

Yang, Charles. 2005. On productivity. Yearbook of Language Variation 5:333-370.

Yang, Charles. 2016. The price of linguistic productivity: How children learn to break the rules of language. MIT Press.

Yip, Moira. 1988. Template morphology and the direction of association. Natural Language & Linguistic Theory 6:551-577.

Zukoff, Sam. 2017a. Arabic Nonconcatenative Morphology and the Syntax-Phonology Interface. In NELS 47: Proceedings of the Forty-Seventh Annual Meeting of the North East Linguistic Society, ed. Andrew Lamont and Katerina Tetzloff, volume 3, 295-314. Graduate Linguistics Student Association.

Zukoff, Sam. 2017b. The Mirror Alignment Principle: Morpheme Ordering at the Morphosyntax-Phonology Interface. In Papers on Morphology, ed. Snejana Iovtcheva and Benjamin Storme, MIT Working Papers in Linguistics 81, 105-124. MITWPL.
