Using Sentence Type Information for Syntactic Category Acquisition
Stella Frank ([email protected])   Sharon Goldwater ([email protected])   Frank Keller ([email protected])
School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, UK

Abstract

In this paper we investigate a new source of information for syntactic category acquisition: sentence type (question, declarative, imperative). Sentence type correlates strongly with intonation patterns in most languages; we hypothesize that these intonation patterns are a valuable signal to a language learner, indicating different syntactic patterns. To test this hypothesis, we train a Bayesian Hidden Markov Model (and variants) on child-directed speech. We first show that simply training a separate model for each sentence type decreases performance due to sparse data. As an alternative, we propose two new models based on the BHMM in which sentence type is an observed variable which influences either emission or transition probabilities. Both models outperform a standard BHMM on data from English, Cantonese, and Dutch. This suggests that sentence type information available from intonational cues may be helpful for syntactic acquisition cross-linguistically.

1 Introduction

Children acquiring the syntax of their native language have access to a large amount of contextual information. Acquisition happens on the basis of speech, and the acoustic signal carries rich prosodic and intonational information that children can exploit. A key task is to separate the acoustic properties of a word from the underlying sentence intonation.
Infants become attuned to the pragmatic and discourse functions of utterances as signalled by intonation extremely early; in this they are helped by the fact that intonation contours of child and infant directed speech are especially well differentiated between sentence types (Stern et al., 1982; Fernald, 1989). Children learn to use appropriate intonational melodies to communicate their own intentions at the one word stage, before overt syntax develops (Snow and Balog, 2002).

It follows that sentence type information (whether a sentence is declarative, imperative, or a question), as signaled by intonation, is readily available to children by the time they start to acquire syntactic categories. Sentence type also has an effect on sentence structure in many languages (most notably on word order), so we hypothesize that sentence type is a useful cue for syntactic category learning. We test this hypothesis by incorporating sentence type information into an unsupervised model of part of speech tagging.

We are unaware of previous work investigating the usefulness of this kind of information for syntactic category acquisition. In other domains, intonation has been used to identify sentence types as a means of improving speech recognition language models. Specifically, Taylor et al. (1998) found that using intonation to recognize dialogue acts (which to a significant extent correspond to sentence types) and then using a specialized language model for each type of dialogue act led to a significant decrease in word error rate.

In the remainder of this paper, we first present the Bayesian Hidden Markov Model (BHMM; Goldwater and Griffiths, 2007) that is used as the baseline model of category acquisition, as well as our extensions to the model, which incorporate sentence type information. We then discuss the distinctions in sentence type that we used and our evaluation measures, and finally our experimental results. We perform experiments on corpora in four different languages: English, Spanish, Cantonese, and Dutch. Our results on Spanish show no difference between the baseline and the models incorporating sentence type, possibly due to the small size of the Spanish corpus. Results on all other corpora show a small improvement in performance when sentence type is included as a cue to the learner. These cross-linguistic results suggest that sentence type may be a useful source of information to children acquiring syntactic categories.

2 BHMM Models

2.1 Standard BHMM

We use a Bayesian HMM (Goldwater and Griffiths, 2007) as our baseline model. Like a standard trigram HMM, the BHMM assumes that the probability of tag $t_i$ depends only on the previous two tags, and the probability of word $w_i$ depends only on $t_i$. This can be written as

$$t_i \mid t_{i-1}=t,\ t_{i-2}=t',\ \tau \sim \mathrm{Mult}(\tau^{(t,t')}) \tag{1}$$
$$w_i \mid t_i=t,\ \omega \sim \mathrm{Mult}(\omega^{(t)}) \tag{2}$$

where $\tau^{(t,t')}$ are the parameters of the multinomial distribution over following tags given previous tags $(t,t')$, and $\omega^{(t)}$ are the parameters of the distribution over outputs given tag $t$. The BHMM assumes that these parameters are in turn drawn from symmetric Dirichlet priors with parameters $\alpha$ and $\beta$, respectively:

$$\tau^{(t,t')} \mid \alpha \sim \mathrm{Dirichlet}(\alpha) \tag{3}$$
$$\omega^{(t)} \mid \beta \sim \mathrm{Dirichlet}(\beta) \tag{4}$$

Using these Dirichlet priors allows the multinomial distributions to be integrated out, leading to the following predictive distributions:

$$P(t_i \mid \mathbf{t}_{-i}, \alpha) = \frac{C(t_{i-2}, t_{i-1}, t_i) + \alpha}{C(t_{i-2}, t_{i-1}) + T\alpha} \tag{5}$$

$$P(w_i \mid t_i, \mathbf{t}_{-i}, \mathbf{w}_{-i}, \beta) = \frac{C(t_i, w_i) + \beta}{C(t_i) + W_{t_i}\beta} \tag{6}$$

where $\mathbf{t}_{-i} = t_1 \ldots t_{i-1}$ and $\mathbf{w}_{-i} = w_1 \ldots w_{i-1}$; $C(t_{i-2}, t_{i-1}, t_i)$ and $C(t_i, w_i)$ are the counts of the trigram $(t_{i-2}, t_{i-1}, t_i)$ and the tag-word pair $(t_i, w_i)$ in $\mathbf{t}_{-i}$ and $\mathbf{w}_{-i}$; $T$ is the size of the tagset; and $W_{t_i}$ is the number of word types emitted by $t_i$.

Based on these predictive distributions, Goldwater and Griffiths (2007) develop a Gibbs sampler for the model, which samples from the posterior distribution over tag sequences $\mathbf{t}$ given word sequences $\mathbf{w}$, i.e., $P(\mathbf{t} \mid \mathbf{w}, \alpha, \beta) \propto P(\mathbf{w} \mid \mathbf{t}, \beta)\,P(\mathbf{t} \mid \alpha)$. This is done by using Equations 5 and 6 to iteratively resample each tag $t_i$ given the current values of all other tags.¹ The results show that the BHMM with Gibbs sampling performs better than the standard HMM using expectation-maximization. In particular, the Dirichlet priors in the BHMM constrain the model towards sparse solutions, i.e., solutions in which each tag emits a relatively small number of words, and in which a tag transitions to few following tags. This type of constraint allows the model to find solutions which correspond to true syntactic parts of speech (which follow such a sparse, Zipfian distribution), unlike the uniformly-sized clusters found by standard maximum likelihood estimation using EM.

In the experiments reported below, we use the Gibbs sampler described by Goldwater and Griffiths (2007) for the BHMM, and modify it as necessary for our own extended models. We also follow Goldwater and Griffiths (2007) in using Metropolis-Hastings sampling for the hyperparameters, which are inferred automatically in all experiments. A separate $\beta$ parameter is inferred for each tag.

¹ Slight corrections need to be made to Equation 5 to account for sampling tags from the middle of the sequence rather than from the end; these are given in Goldwater and Griffiths (2007) and are followed in our own samplers.
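To make the collapsed sampling step concrete, the sketch below shows in Python how Equations 5 and 6 combine when resampling a single tag. This is our own illustrative code, not the implementation used in the paper: all names are invented, sentence-boundary handling is simplified (it assumes $i \geq 2$ and skips transition terms that run past the end of the sequence), and the count corrections mentioned in the footnote are omitted.

```python
import random
from collections import defaultdict

# Collapsed count tables for the BHMM; a real sampler decrements the
# counts for position i before scoring and re-increments them afterwards.
trigram = defaultdict(int)     # C(t_{i-2}, t_{i-1}, t_i)
bigram = defaultdict(int)      # C(t_{i-2}, t_{i-1})
emission = defaultdict(int)    # C(t_i, w_i)
tag_total = defaultdict(int)   # C(t_i)
word_types = defaultdict(set)  # word types emitted by each tag

def trans_prob(t2, t1, t, num_tags, alpha):
    """Equation 5: predictive probability of tag t in context (t2, t1)."""
    return (trigram[(t2, t1, t)] + alpha) / (bigram[(t2, t1)] + num_tags * alpha)

def emit_prob(t, w, beta):
    """Equation 6: predictive probability of word w given tag t."""
    w_t = max(1, len(word_types[t]))  # W_{t_i}: word types emitted by t
    return (emission[(t, w)] + beta) / (tag_total[t] + w_t * beta)

def resample_tag(tags, words, i, num_tags, alpha, beta):
    """One Gibbs step for t_i: multiply the emission term by every
    transition term whose trigram window covers position i
    (simplified: assumes i >= 2 and counts at i already removed)."""
    scores = []
    for t in range(num_tags):
        p = emit_prob(t, words[i], beta)
        p *= trans_prob(tags[i - 2], tags[i - 1], t, num_tags, alpha)
        if i + 1 < len(tags):
            p *= trans_prob(tags[i - 1], t, tags[i + 1], num_tags, alpha)
        if i + 2 < len(tags):
            p *= trans_prob(t, tags[i + 1], tags[i + 2], num_tags, alpha)
        scores.append(p)
    return random.choices(range(num_tags), weights=scores)[0]
```

The product over overlapping trigram windows is what the footnote's corrections make exact; the sketch keeps only the basic structure of the conditional.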
2.2 BHMM with Sentence Types

We wish to add a sentence type feature to each timestep in the HMM, signalling the current sentence type. We treat sentence type ($s$) as an observed variable, on the assumption that it is observed via intonation or punctuation features (not part of our model), and that these features are informative enough to reliably distinguish sentence types (as speech recognition tasks have found to be the case; see Section 1).

In the BHMM, there are two obvious ways that sentence type could be incorporated into the generative model: either by affecting the transition probabilities or by affecting the emission probabilities. The first case can be modeled by adding $s_i$ as a conditioning variable when choosing $t_i$, replacing line 1 of the BHMM definition with the following:

$$t_i \mid s_i=s,\ t_{i-1}=t,\ t_{i-2}=t',\ \tau \sim \mathrm{Mult}(\tau^{(s,t,t')}) \tag{7}$$

We will refer to this model, illustrated graphically in Figure 1, as the BHMM-T. It assumes that the distribution over $t_i$ depends not only on the previous two tags, but also on the sentence type, i.e., that different sentence types tend to have different sequences of tags.

[Figure 1: Graphical model representation of the BHMM-T, which includes sentence type as an observed variable on tag transitions (but not emissions).]

In contrast, we can add $s_i$ as a conditioning variable for $w_i$ by replacing line 2 of the BHMM with

$$w_i \mid s_i=s,\ t_i=t,\ \omega \sim \mathrm{Mult}(\omega^{(s,t)}) \tag{8}$$

This model, the BHMM-E, assumes that different sentence types tend to have different words emitted from the same tag.

The predictive distributions for these models are given in Equations 9 (BHMM-T) and 10 (BHMM-E):

$$P(t_i \mid \mathbf{t}_{-i}, s_i, \alpha) = \frac{C(t_{i-2}, t_{i-1}, t_i, s_i) + \alpha}{C(t_{i-2}, t_{i-1}, s_i) + T\alpha} \tag{9}$$

$$P(w_i \mid t_i, s_i, \beta) = \frac{C(t_i, w_i, s_i) + \beta}{C(t_i, s_i) + W_{t_i}\beta} \tag{10}$$

Of course, we can also create a third new model, the BHMM-B, in which sentence type is used to condition both transition and emission probabilities. This model is equivalent to training a separate BHMM on each type of sentence (with shared hyperparameters).

Note that introducing the extra conditioning variable in these models has the consequence of splitting the counts for transitions, emissions, or both. The split distributions will therefore be estimated using less data, which could actually degrade performance if sentence types do not in fact differ with respect to the distributions in question.
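The difference between the variants comes down to which count tables are keyed by the sentence type. The following sketch (again our own illustrative code with invented names, not the paper's implementation) shows how one set of tables can realize the BHMM, BHMM-T, BHMM-E, and BHMM-B by optionally including $s$ in the transition and/or emission contexts:

```python
from collections import defaultdict

class SentenceTypeBHMM:
    """Count tables for the BHMM variants (Eqs. 5-6 and 9-10).

    (condition_trans, condition_emit) selects the model:
      (False, False) -> BHMM      (True, False) -> BHMM-T
      (False, True)  -> BHMM-E    (True, True)  -> BHMM-B
    """

    def __init__(self, num_tags, alpha, beta,
                 condition_trans=False, condition_emit=False):
        self.T, self.alpha, self.beta = num_tags, alpha, beta
        self.cond_t, self.cond_e = condition_trans, condition_emit
        self.tri = defaultdict(int)    # C(t_{i-2}, t_{i-1}, t_i [, s])
        self.bi = defaultdict(int)     # C(t_{i-2}, t_{i-1} [, s])
        self.emit = defaultdict(int)   # C(t_i, w_i [, s])
        self.tag = defaultdict(int)    # C(t_i [, s])
        self.types = defaultdict(set)  # word types per emission context

    def _tkey(self, t2, t1, s):
        # Sentence type joins the transition context in BHMM-T / BHMM-B.
        return (t2, t1, s) if self.cond_t else (t2, t1)

    def _ekey(self, t, s):
        # Sentence type joins the emission context in BHMM-E / BHMM-B.
        return (t, s) if self.cond_e else (t,)

    def observe(self, t2, t1, t, w, s):
        """Add one trigram-plus-emission observation to the tables."""
        self.tri[self._tkey(t2, t1, s) + (t,)] += 1
        self.bi[self._tkey(t2, t1, s)] += 1
        self.emit[self._ekey(t, s) + (w,)] += 1
        self.tag[self._ekey(t, s)] += 1
        self.types[self._ekey(t, s)].add(w)

    def trans_prob(self, t2, t1, t, s):
        """Equation 5, or Equation 9 when transitions condition on s."""
        k = self._tkey(t2, t1, s)
        return (self.tri[k + (t,)] + self.alpha) / (self.bi[k] + self.T * self.alpha)

    def emit_prob(self, t, w, s):
        """Equation 6, or Equation 10 when emissions condition on s."""
        k = self._ekey(t, s)
        w_t = max(1, len(self.types[k]))  # word types in this context
        return (self.emit[k + (w,)] + self.beta) / (self.tag[k] + w_t * self.beta)
```

Setting both flags keys every table by $s$, which is why the BHMM-B behaves like a separate BHMM per sentence type with shared hyperparameters; it also makes the count-splitting issue visible, since each cell is then estimated from the data of a single sentence type only.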