Combining Utterance-Boundary and Predictability Approaches To

Combining Utterance-Boundary and Predictability Approaches to Speech Segmentation Aris XANTHOS Linguistics Department, University of Lausanne UNIL - BFSH2 1015 Lausanne Switzerland [email protected] Abstract esize boundaries inside utterances (Aslin et al., This paper investigates two approaches to 1996; Christiansen et al., 1998; Xanthos, 2004). speech segmentation based on different heuris- The second approach is based on the predictabil- tics: the utterance-boundary strategy, and the ity strategy, which assumes that speech should predictability strategy. On the basis of for- be segmented at locations where some mea- mer empirical results as well as theoretical con- sure of the uncertainty about the next symbol siderations, it is suggested that the utterance- (phoneme or syllable for instance) is high (Har- boundary approach could be used as a prepro- ris, 1955; Gammon, 1969; Saffran et al., 1996; cessing step in order to lighten the task of the Hutchens and Adler, 1998; Xanthos, 2003). predictability approach, without damaging the Our implementation of the utterance- resulting segmentation. This intuition leads to boundary strategy is based on n-grams the formulation of an explicit model, which is statistics. It was previously found to perform empirically evaluated for a task of word segmen- a \safe" word segmentation, that is with a tation on a child-oriented phonemically tran- rather high precision, but also too conser- scribed French corpus. The results show that vative as witnessed by a not so high recall the hybrid algorithm outperforms its compo- (Xanthos, 2004). As regards the predictability nent parts while reducing the total memory load strategy, we have implemented an incremental involved. interpretation of the classical successor count (Harris, 1955). This approach also relies on the 1 Introduction observation of phoneme sequences, the length The design of speech segmentation1 methods of which is however not restricted to a fixed has been much studied ever since Harris' sem- value. Consequently, the memory load involved inal propositions (1955). Research conducted by the successor count algorithm is expected since the mid 1990's by cognitive scientists to be higher than for the utterance-boundary (Brent and Cartwright, 1996; Saffran et al., approach, and its performance substantially 1996) has established it as a paradigm of its own better. in the field of computational models of language The experiments presented in this paper were acquisition. inspired by the intuition that both algorithms In this paper, we investigate two boundary- could be combined in order to make the most based approaches to speech segmentation. Such of their respective strengths. The utterance- methods \attempt to identify individual word- boundary typicality could be used as a compu- boundaries in the input, without reference to tationally inexpensive preprocessing step, find- words per se" (Brent and Cartwright, 1996). ing some true boundaries without inducing too The first approach we discuss relies on the many false alarms; then, the heavier machinery utterance-boundary strategy, which consists in of the successor count would be used to accu- reusing the information provided by the occur- rately detect more boundaries, its burden be- rence of specific phoneme sequences at utter- ing lessened as it would process the chunks pro- ance beginnings or endings in order to hypoth- duced by the first algorithm rather than whole 1 utterances. We will show the results obtained To avoid a latent ambiguity, it should be stated that for a word segmentation task on a phonetically speech segmentation refers here to a process taking as input a sequence of symbols (usually phonemes) and pro- transcribed and child-oriented French corpus, ducing as output a sequence of higher-level units (usually focusing on the effect of the preprocessing step words). on precision and recall, as well as its impact on 93 memory load and processing time. The absolute frequency of an n-gram w 2 Sn T The next section is devoted to the formal def- in the corpus is given by n(w) := Pt=1 nt(w) inition of both algorithms. Section 3 discusses where nt(w) denotes the absolute frequency of w some issues related to the space and time com- in the t-th utterance of C. In the same way, we plexity they involve. The experimental setup define the absolute frequency of w in utterance- T as well as the results of the simulations are de- initial position as n(wjI) := Pt=1 nt(wjI) where scribed in section 4, and in conclusion we will nt(wjI) denotes the absolute frequency of w in summarize our findings and suggest directions utterance-initial position in the t-th utterance for further research. of C (which is 1 iff the utterance begins with w and 0 otherwise). Similarly, the absolute fre- 2 Description of the algorithms quency of w in utterance-final position is given T 2.1 Segmentation by thresholding by n(wjF) := Pt=1 nt(wjF). Many distributional segmentation algorithms Accordingly, the relative frequency of w ob- described in the literature can be seen as in- tains as f(w) := n(w)= Pw~2Sn n(w ~). Its stances of the following abstract procedure relative frequencies in utterance-initial and (Harris, 1955; Gammon, 1969; Saffran et al., -final position respectively are given by f(wjI) := n(wjI)= Pw~2Sn n(w ~jI) and f(wjF) := 1996; Hutchens and Adler, 1998; Bavaud and 2 Xanthos, 2002). Let S be the set of phonemes n(wjF)= Pw~2Sn n(w ~jF) . (or segments) in a given language. In the most Both algorithms described below process the general case, the input of the algorithm is an input incrementally, one utterance after an- utterance of length l, that is a sequence of l other. This implies that the frequency measures phonemes u := s1 : : : sl (where si denotes the defined in this section are in fact evolving all i-th phoneme of u). Then, for 1 ≤ i ≤ l − 1, we along the processing of the corpus. In general, insert a boundary after si iff D(u; i) > T (u; i), for a given input utterance, we chose to update where the values of the decision variable D(u; i) n-gram frequencies first (over the whole utter- and of the threshold T (u; i) may depend on both ance) before performing the segmentation. the whole sequence and the actual position ex- amined (Xanthos, 2003). 2.3 Utterance-boundary typicality The output of such algorithms can be evaluated in reference to the segmentation performed We use the same implementation of the by a human expert, using traditional measures utterance-boundary strategy that is described from the signal detection framework. It is usual in more details by Xanthos (2004). Intuitively, to give evaluations both for word and boundary the idea is to segment utterances where se- detection (Batchelder, 2002). The word preci- quences occur, which are typical of utterance sion is the probability for a word isolated by boundaries. Of course, this implies that the cor- the segmentation procedure to be present in pus is segmented in utterances, which seems a the reference segmentation, and the word recall reasonable assumption as far as language acqui- is the probability for a word occurring in the sition is concerned. In this sense, the utterance- true segmentation to be correctly isolated. Sim- boundary strategy may be viewed as a kind of ilarly, the segmentation precision is the proba- learning by generalization. bility that an inferred boundary actually occurs Probability theory provides us with a straightforward way of evaluating how much an in the true segmentation, and the segmentation n recall is the probability for a true boundary to n-gram w 2 S is typical of utterance end- be detected. ings. Namely, we know that events \occur- In the remaining of this section, we will use rence of n-gram w" and \occurrence of an n- this framework to show how the two algorithms gram in utterance-final position" are indepen- we investigate rely on different definitions of dent iff p(w \ F) = p(w)p(F) or equivalently D(u; i) and T (u; i). iff p(wjF) = p(w). Thus, using maximum- likelihood estimates, we may define the typical- 2.2 Frequency estimates ∗ Let U ⊆ S be the set of possible utterances in 2 n w Note that in general, Pw~2Sn ( ~jF) = the language under examination. Suppose we n w T~ T~ T T Pw~2Sn ( ~jI) = , where ≤ is the number are given a corpus C ⊆ U made of T successive of utterances in C that have a length greater than or utterances. equal to n. 94 ity of w in utterance-final position as: 2.4 Successor count The second algorithm we investigate in this pa- f(wjF) per is an implementation of Harris' successor t(wjF) := (1) f(w) count (Harris, 1955), the historical source of all predictability-based approaches to segmen- This measure is higher than 1 iff w is more likely tation. It relies on the assumption that in gen- to occur in utterance-final position (than in any eral, the diversity of possible phonemes tran- position), lower iff it is less likely to occur there, sitions is high after a word boundary and de- and equal to 1 iff its probability is independent creases as we consider transitions occurring fur- of its position. ther inside a word. The diversity of transitions following an n- In the context of a segmentation procedure, n this suggest a \natural" constant threshold gram w 2 S is evaluated by the successor T (u; i) := 1 (which can optionally be fine-tuned count (or successor variety), simply defined as in order to obtain a more or less conservative the number of different phonemes that can oc- result).

Combining Utterance-Boundary and Predictability Approaches To

An Analysis of Sentence Segmentation Features For

A Bayesian Framework for Word Segmentation: Exploring the Effects of Context

The Induction of Phonotactics for Speech Segmentation

Automatic Speech Segmentation

A Semi-Markov Model for Speech Segmentation with an Utterance-Break Prior

Seminar, Tue 4:15Pm

The Polysemy of the Words That Children Learn Over Time Arxiv

AUTOMATIC LINGUISTIC SEGMENTATION of CONVERSATIONAL SPEECH Andreas Stolcke Elizabeth Shriberg

Segmentation and Morphology

Unsupervised Segmentation of Audio Speech Using the Voting Experts Algorithm Matthew Im Ller Adam Miller Iowa State University

The Universal Dependencies Treebank of Spoken Slovenian

Sentence Boundary Detection Based on Parallel Lexical and Acoustic Models