Analysing the Correspondence Between Automatic Prosodic Segmentation and Syntactic Structure
Total Page:16
File Type:pdf, Size:1020Kb
INTERSPEECH 2011 Analysing the correspondence between automatic prosodic segmentation and syntactic structure Gyorgy¨ Szaszak´ 1, Katalin Nagy1, Andras´ Beke2 1Department of Telecommunication and Media Informatics, Budapest University for Technology and Economics, Budapest, Hungary 2Research Institute for Linguistics, Hungarian Academy of Sciences, Budapest, Hungary [email protected], [email protected], [email protected] Abstract This prosodic structure is often represented as a tree or bracket- ing of the utterance. Prosody and syntax are highly related, even if the prosodic During the syntactic analysis of sentences, a similar hierar- structure cannot be directly mapped to the syntactic one and chy is supposed to exist, which connects the basic constituents vice versa. This paper presents an experiment for exploring in (e.g. words) of a sentence. Words form groups, called also what degree a powerful HMM-based automatic prosodic seg- phrases (XP). Phrases can embed other (daughter) phrases and mentation tool can recover the syntactic structure of an utter- in this way, a layered hierarchy exists among them. The phrase ance in speech understanding systems. Results show that the is usually named after its dominant word called head, which is approach is capable of recalling up to 92% of syntactic clause the word that determines the behaviour of the entire word group boundaries and up to 71% of embedded syntactic phrase bound- within the constituent situated one level higher in the hierarchy. aries based on the detection of phonological phrases. Recall In this way, one can distinguish noun phrases (NP) with a head rates do not depend further on the syntactic level (whether the of type noun, verbal phrases (VP), adverbial phrases (AdvP) phrase is multiply embedded or not), but clause boundaries can and so on. The syntactic analysis of a sentence usually ends in be well separated from lower level syntactic phrases based on a hierarchical tree-form representation. the type of the aligned phonological phrase(s). These findings In speech technology, syntactic analysis of a written sen- can be exploited in speech understanding systems, allowing for tence is a common task prior to speech synthesis. The first ini- the recovery of the skeleton of the syntactic structure, based tiatives date back even to 1980s. The underlying assumption for purely on the speech signal. this approach is that the required prosodic features of the sen- Index Terms: prosody, syntax, phonological phrase, syntactic tence to be synthesized can be well predicted relying on syntac- analysis, speech understanding tic analysis. In other words, one assumes that surface syntactic structure determines the prosodic structure. As this determina- 1. Introduction tion is only partial, applications were often restricted to well described domains, such as name and address synthesis [5]. Exploring the relationship between prosodic and syntactic All of the above presented applications of the syntax- structures of human speech has motivated much research ac- prosody interface are directed from syntax to prosody, which 10.21437/Interspeech.2011-399 tivity. However, a uniform model, capable of describing the means that it is usually the prosody that should be predicted by interface between syntactic representation of a sentence and its analysing the syntax. The inverse direction, e.g. deducing some phonological representation still does not exist. Of course, the syntactical attributes based on prosody is less frequently used. phenomenon is complex and probably the claim for such a uni- An interesting exploitation of the prosody to syntax mapping is form model is not realistic. Nevertheless, one common issue using prosody in syntactic disambiguation. Price et al. found in shared by the majority of theories dealing with the relationship perception tests that prosody helps syntactic disambiguation in syntax-prosody is that they are closely related, but this rela- many - even if not in all - cases [6]. They have also proposed tion cannot be expressed as a definite mapping between syn- an automatic prosodic labeller which used normalized duration tax and prosody. Selkirk apostrophes this approach as prosodic features to provide break classification. Disambiguation based structure hypothesis which holds that the prosodic structure of on prosody is also known in automatic speech recognition. a sentence is related to (but not fully dependent on) the sur- Disambiguation is often illustrated by literally identical face syntactic structure [1]. In contrast, some theories argue sentence pairs, which have two or even more syntactical anal- that prosody is directly governed by the surface syntactic phrase ysis hypotheses. Instead of creating such less or more artificial structure [2], but experience shows rather that the relationship sentence pairs, which usually provide only a moderate and of- syntax-prosody is more difficult, especially as we are approach- ten non-realistic corpus for analysing purposes, the authors de- ing to the lower levels of the prosodic structure hierarchy. cided to analyse the correspondence between the automatically The prosodic structure is hypothesized to be arranged in a aligned phonological phrase boundaries (these represent an au- hierarchy [1], [3] as follows (top-down): the utterance is com- tomatic prosodic analysis) and automatically generated and pre- posed of intonational phrases (IP), which can be divided into viously disambiguated syntactic boundaries (these represent the phonological phrases (PP). Phonological phrases are composed surface syntactic structure) on a big speech corpus. The basic of phonological words PW, called sometimes prosodic words. interest is to see and measure in what degree the different lev- The hierarchy can be further refined until the syllable level, but els (called also layers [1]) in the prosodic and syntactic hier- units inferior to PW (or even PP) will not be used in this paper. archy can be mapped to each other, and to analyse further if Copyright © 2011 ISCA 1057 28-31 August 2011, Florence, Italy any type of phonological or syntactic phrase exists which has proach, continuous tracking of prosody is implemented instead a special impact on syntactic or phonological structure, respec- of looking for discrete indices of breaks or tones. This pro- tively. An important outcome of the experiment is to evaluate vides a soft and more flexible framework, which is believed to in what degree an automatic prosodic segmentation for phono- be closer to human perception processes. logical phrases can reflect the underlying syntactic structure. This paper is organized as follows: first, the automatic 2.1. Acoustic-prosodic pre-processing prosodic segmentation approach is detailed and the syntactic analyser is briefly presented. The next section reviews the used Fundamental frequency (F0) is extracted by ESPS method us- data and the used approach. Finally, results are shown and dis- ing a 25 ms long window. Intensity is also computed with a cussed, then conclusions are drawn. window of 25 ms. The frame rate for both variables is set to 10 ms. The obtained F0 contour is firstly filtered with an anti- octave jump tool. This is followed by a smoothing with a 5 point 2. Automatic prosodic segmentation of mean filter. F0 is linearly interpolated in log domain. The in- speech terpolation is omitted for voiceless sections longer than 150 ms F For the recovery of the prosodic structure directly from the and also for 0-rises higher than 110% after an unvoiced part. Delta and acceleration coefficients are also appended to both F0 speech signal, automatic prosodic segmentation is used, de- and intensity streams (more details can be found in [4]). scribed in details in [4] and [7]. The prosodic segmenter is intended to recognize and align a sequence of phonological phrases (PP). This is done in a Hidden Markov Model (HMM) 2.2. Prosodic segmentation vs. word-boundaries based alignment and recognition framework, which uses as In earlier works of the authors [4] [7], prosodic segmentation acoustic models 7 different models of phonological phrases (see for phonological phrases was used to recover word-boundaries Table 1). The classification of PPs is performed based on their in Hungarian and Finnish languages. This was feasible, because associated intonation contour, represented by fundamental fre- (i) the prosodic-acoustic models of phonological phrases were quency and energy features (see section 2.1). PPs are defined trained on samples in which phonological phrase boundaries co- as prosodic units which have their own stress and intonation incided with word-boundaries; (ii) the languages analysed were contour [3]. The standard intonation contour of a PP shows a fixed stress language, this means that if a word or word group smart rise at the stress (the modelled Hungarian language has (syntactic phrase) is stressed, stress can be found always on the fixed first syllable stress), then follows a descending contour. first syllable; (iii) in high correlation with the first-syllable fixed However, as PPs are embedded into the utterance, the higher stress, Hungarian shows a left branching structure in syntactic level utterance intonation influences their characteristics. This analysis [10] (in contrast, English is right branching [1]). explains why 7 different models were trained. Clause onset (co) and clause ending (ce) usually alter the standard PP intonation contour, so does the focus (strongly stressed PP, ss) and the con- 3. Syntactic analysis tinuation rise