Towards a Psycholinguistically Motivated Dependency Grammar for Hindi
Samar Husain, Department of Linguistics, Universität Potsdam, Germany ([email protected])
Rajesh Bhatt, University of Massachusetts, Amherst, MA, USA ([email protected])
Shravan Vasishth, Department of Linguistics, Universität Potsdam, Germany ([email protected])

Abstract

The overall goal of our work is to build a dependency grammar-based human sentence processor for Hindi. As a first step towards this end, in this paper we present a dependency grammar that is motivated by psycholinguistic concerns. We describe the components of the grammar that have been automatically induced using a Hindi dependency treebank. We relate some aspects of the grammar to relevant ideas in the psycholinguistics literature. In the process, we also extract statistics and patterns for phenomena that are interesting from a processing perspective. We finally present an outline of a dependency grammar-based human sentence processor for Hindi.

1 Introduction

Human sentence processing proposals and modeling work are overwhelmingly based on phrase-structure parsing and constituent-based representation. This is because most modern linguistic theories (Chomsky, 1965), (Chomsky, 1981), (Chomsky, 1995), (Bresnan and Kaplan, 1982), (Sag et al., 2003) use phrase structure representation to analyze a sentence. There is, however, an alternative approach to sentential syntactic representation, known as dependency representation, that is quite popular in Computational Linguistics (CL). Unlike phrase structures, where the actual words of the sentence appear as leaves and the internal nodes are phrases, in a dependency grammar (Mel'čuk, 1988), (Bharati et al., 1995), (Hudson, 2010) a syntactic tree comprises the sentential words as nodes. These words/nodes are connected to each other with edges/arcs. The edges can be labeled to show the type of relation between a pair of nodes. For example, in the sentence "John kissed Mary", 'John' and 'Mary' are connected via arcs to 'kissed'; the former arc bears the label 'subject' and the latter arc the label 'object'. Taken together, these nodes with their connections form a tree. Figure 1 shows the dependency tree and the phrase structure tree for this sentence.

[Figure 1: Dependency tree and phrase structure tree]
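To make the dependency representation concrete, the following minimal sketch (our illustration, not code from the paper; the Arc class and its field names are assumptions) encodes the labeled tree of Figure 1 as a set of head-dependent arcs:

from dataclasses import dataclass

@dataclass(frozen=True)
class Arc:
    head: str       # the governing word
    dependent: str  # the word attached to the head
    label: str      # the grammatical relation carried by the edge

# Dependency tree for "John kissed Mary": both nouns attach to the verb.
tree = [
    Arc(head="kissed", dependent="John", label="subject"),
    Arc(head="kissed", dependent="Mary", label="object"),
]

for arc in tree:
    print(f"{arc.dependent} --{arc.label}--> {arc.head}")

Note that, unlike in the phrase structure tree, no phrasal nodes are introduced: the words themselves are the nodes, and the analysis is fully specified by the labeled word-to-word arcs.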
There have been some previous attempts to use lexicalized grammars such as LTAG, CCG, etc. in psycholinguistics. These lexicalized grammars have been independently shown to be related to dependency grammar (Kuhlmann, 2007). For example, Pickering and Barry (1991) used categorial grammar to handle the processing of empty categories. Similarly, Pickering (1994) used dependency categorial grammar to process both local and non-local dependencies. More recently, Tabor and Hutchins (2004) used a lexical grammar-based parser and Kim et al. (1998) used lexicalized tree-adjoining grammar (LTAG) to model certain processing results. Demberg (2010) has recently proposed a psycholinguistically motivated LTAG (P-LTAG).

Despite the success of the dependency paradigm in CL, it has remained largely unexplored in psycholinguistics. To our knowledge, the work by Boston and colleagues (Boston et al., 2011), (Boston et al., 2008) is the only such attempt. There are some very interesting open questions with respect to using dependency representations and dependency parsers while building a human sentence processing system. Can a processing model based on the dependency parsing paradigm account for classic psycholinguistic phenomena? Can one adapt a high-performance dependency parser for psycholinguistic research? If yes, then how? How will the differences between dependency parsing paradigms affect the predictive capacity of the models based on them?

This paper is arranged as follows: in Section 2 we mention some experimental works that have motivated the grammar design. Section 3 discusses the grammar induction process and lists the main components of the grammar. In Section 4 we present statistics on Hindi word order variation as found in the treebank and point out some patterns that are interesting from a processing perspective. We also discuss prediction rules, which are an important component of the grammar. Section 5 then presents a proposal for developing a human sentence processing system by adapting graph-based dependency parsing. Finally, in Section 6 we discuss some issues and challenges in using the dependency grammar paradigm for human sentence processing. We conclude in Section 7.

2 Motivation: Some relevant experimental work

Some crucial design decisions in our research have been inspired by psycholinguistic experimental work. In this section we mention these works. But before that, we briefly discuss Hindi, the language that we are working with.

Hindi is one of the official languages of India and the fourth most widely spoken language in the world.¹ It is a free word order language and is head-final. It has relatively rich morphology, with verb-subject² and noun-adjective agreement. Examples (2) to (6) below show some of the word order variations possible with (1). These permutations are not exhaustive; in fact, all 4! = 24 permutations are possible.

(1) malaya ne abhiisheka ko kitaaba dii
    Malaya ERG Abhishek DAT book gave
    'Malaya gave a book to Abhishek.' (S-IO-O-V)
(2) malaya ne kitaaba abhiisheka ko dii (S-O-IO-V)
(3) abhiisheka ko malaya ne kitaaba dii (IO-S-O-V)
(4) abhiisheka ko kitaaba malaya ne dii (IO-O-S-V)
(5) kitaaba abhiisheka ko malaya ne dii (O-IO-S-V)
(6) kitaaba malaya ne abhiisheka ko dii (O-S-IO-V)

¹ http://www.ethnologue.com/ethno_docs/distribution.asp?by=size
² This is the default agreement pattern. The complete agreement system is much more complex.

A great deal of experimental research has shown that working-memory limitations play a major role in sentence comprehension difficulty (e.g., Lewis and Vasishth (2005)). We find numerous instances in natural language where a word needs to be temporarily retained in memory before it can be integrated into a larger structure. Because working memory is limited, retaining a word for a longer period can make sentence processing difficult. Abstracting away from details, on this view one way in which processing complexity can be formulated is with metrics that incorporate dependent-head distance (Gibson, 2000), (Grodner and Gibson, 2005). This idea manifests itself in various forms in the psycholinguistics literature. For example, Gibson (2000) proposes integration cost and storage cost to account for processing complexity. Lewis and Vasishth (2005) have proposed a working memory-based theory that uses the notion of decay as one determinant of memory retrieval difficulty. Elements that sit in memory without being retrieved for a long time decay more than elements that have been retrieved recently or that entered memory recently. In addition to decay, the theory also incorporates the notion of interference: memory retrievals are feature-based, and feature overlap during retrieval will, in addition to decay, cause difficulty.

As opposed to the locality-based accounts mentioned above, expectation-based theories appeal to the predictive nature of the sentence processor. On this view, processing becomes difficult if the upcoming sentential material is less predictable. Surprisal (Hale, 2001), (Levy, 2008) is one such account. Informally, surprisal increases when the parser is required to build some low-probability structure.
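For reference, surprisal has a standard formal definition (our gloss of Hale (2001) and Levy (2008), not a formula reproduced from this paper): the surprisal of the k-th word is its negative log-probability given the preceding words,

\[ \mathrm{surprisal}(w_k) = -\log P(w_k \mid w_1, \ldots, w_{k-1}) \]

so material that forces the parser into a low-probability continuation has low conditional probability and hence high surprisal, predicting longer reading times at that word.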
Expectation-based theories can successfully account for so-called anti-locality effects. It has been noted that in some linguistic phenomena, increasing the distance between a dependent and its head speeds up reading time at the head (see Konieczny (2000) for such effects in German, and Vasishth and Lewis (2006) for Hindi). This is, of course, contrary to the predictions made by locality-based accounts, where such an increase should cause a slowdown at the head. There is considerable empirical evidence supporting the idea of predictive parsing in language comprehension (e.g. Staub and Clifton (2006)). There is also some evidence that the nature of predictive processing can be contingent on language-specific characteristics. For example, Vasishth et al. (2010) argue that the verb-final nature of German subordinate clauses leads the parser to maintain future predictions more effectively than in English. As Hindi is a verb-final language, these experimental results are pertinent for this paper. Locality-based results are generally formalized using a limited working memory model. Such a model enforces certain resource limitations within which the human sentence processing system operates. […]ing approach.

3 Inducing a grammar

To develop a dependency grammar we make use of an existing Hindi dependency treebank (Bhatt et al., 2009). The treebank data is a collection of news articles from a Hindi newspaper and has 400k words. The task of automatically inducing a grammar from a treebank can be thought of as making explicit the implicit grammar present in the treebank. This approach can be beneficial for a variety of tasks, such as complementing traditional hand-written grammars, comparing grammars of different languages, building parsers, etc. (Xia and Palmer, 2001), (Kolachina et al., 2010). Our task is much more focused: we want to bootstrap a grammar that can be used for a dependency-based human sentence processor for Hindi.

3.1 Lexicon

The lexicon comprises the syntactic properties of various heads (e.g. verbs). Based on a priori selection of certain argument relations (subject, object, indirect object, experiencer verb subject, goal, noun complements of subject for copula verbs)³