
Catching Words in a Stream of Speech:

Computational Simulations of Segmenting Transcribed Child-Directed Speech

Çağrı Çöltekin
CLCG

The work presented here was carried out under the auspices of the School of Behavioural and Cognitive Neuroscience and the Center for Language and Cognition Groningen of the Faculty of Arts of the University of Groningen.

Groningen Dissertations in Linguistics 97. ISSN 0928-0030

© 2011, Çağrı Çöltekin

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/ or send a letter to Creative Commons, 444 Castro Street, Suite 900, Mountain View, California, 94041, USA.

Cover art, titled auto, by Franek Timur Çöltekin

Document prepared with LaTeX2ε and typeset by pdfTeX
Printed by Wöhrmann Print Service, Zutphen

RIJKSUNIVERSITEIT GRONINGEN

Catching Words in a Stream of Speech:
Computational Simulations of Segmenting Transcribed Child-Directed Speech

Dissertation

to obtain the doctorate in Letters at the Rijksuniversiteit Groningen, on the authority of the Rector Magnificus, Dr. E. Sterken, to be defended in public on Thursday, 8 December 2011, at 14:30

by

Çağrı Çöltekin
born on 28 February 1972 in Çıldır, Turkey

Promotor: Prof. dr. ir. J. Nerbonne

Assessment committee:
Prof. dr. A. van den Bosch
Prof. dr. P. Hendriks
Prof. dr. P. Monaghan

ISBN (electronic version): 978-90-367-5259-6
ISBN (print version): 978-90-367-5232-9

Preface

I started my PhD project with a more ambitious goal than what might have been achieved in this dissertation. I wanted to touch on most issues of language acquisition, developing computational models for a wider range of phenomena. In particular, I wanted to focus on models of learning linguistic ‘structure’, as it is typically observed in morphology and syntax. As a result, segmentation was one of the annoying tasks that I could not easily step over, because I was also interested in morphology. So, I decided to write a chapter on segmentation. Although segmentation is considered relatively easy (in comparison to learning syntax, for example) by many people, and it is relatively well studied, every step I took in modeling this task revealed another interesting problem I could not just gloss over. In the end, the initial ‘chapter’ became the dissertation you have in front of you. I believe I have a far better understanding of the problem now, but I also have many more questions than I started with.

The structure of the project, and my wanderings in the landscape of language acquisition, did not allow me to work with many other people. As a result, this dissertation has been completed in a more independent setting than most other PhD dissertations. Nevertheless, this thesis benefited from my interactions with others. I will try to acknowledge the direct or indirect help I received during this work, but it is likely that I will fail to mention everyone. I apologize in advance to anyone whom I might have unintentionally left out.

First of all, my sincere thanks go to my supervisor John Nerbonne. Here, I do not use the word sincere for stylistic reasons. All PhD students acknowledge their supervisor(s), but if you think they all mean it, you probably have not talked to many of them. Besides the valuable comments on the content of my work, I got the attention and encouragement I needed, when I needed it. He patiently read all my early drafts on short notice, even correcting my never-ending English mistakes. I would also like to thank Antal van den Bosch, Petra Hendriks and Padraic Monaghan for agreeing to read and evaluate my thesis. Their comments and criticisms improved the final draft, and made me look at the issues discussed in the thesis from different perspectives. In earlier stages of my PhD project, I also received valuable comments and criticisms from Kees de Bot and Tamás Biró. Although the focus of the project changed substantially, the benefit of their comments remains. Later, regular discussions with fellow PhD students Barbara Plank, Dörte Hessler and Peter Nabende kept me on track, and I got particularly valuable comments from Barbara on some of the content presented here. Barbara and Dörte also get additional thanks for agreeing to be my paranimphs during the defense.

A substantial part of completing a PhD requires writing. Writing at a reasonable academic level is a difficult task, writing in a foreign language is even more difficult, and writing in a language in which you barely understand the basics is almost impossible. First, thanks to Daniël de Kok for helping me with the impossible, and translating the summary into Dutch. Second, I shamelessly used some people close to me for proofreading, on short notice, without much display of appreciation. Many of my mistakes in earlier drafts of this dissertation were eliminated with the help of Joanna Krawczyk-Çöltekin, Arzu Çöltekin, Barbara Plank and Asena and Giray Devlet. I am grateful for their help, as well as their friendship.

My interest in language and language acquisition that led to this rather late PhD goes back to my undergraduate years at Middle East Technical University. I am likely to omit many people here because many years have passed since. However, the help I got and the things that I learned from Cem Bozşahin and Deniz Zeyrek are difficult to forget. I am particularly grateful to Cem Bozşahin for his encouragement and his patience in supervising my many MSc thesis attempts. This thesis also owes a lot to people whom I cannot all name here. I would like to thank those who shared their data, their code, and their wisdom. This thesis would not be possible without the many freely available sources of information, tools and data that we take for granted nowadays—just to name a few: GNU utilities, R, LaTeX, CHILDES.

There is more to life than research, and you realize it more if you move to a new city. Many people made life in Groningen more pleasant during my PhD time here. First, I feel fortunate to have been in Alfa-informatica. Besides my colleagues in the department, the people of the foreign guest club and the international service desk of the university made my life more pleasant and less painful. I am reluctant to name people individually because of the certainty of omissions. Nevertheless, here is an incomplete list of people that I feel lucky to have met during this time, sorted randomly: Ellen and John Nerbonne, Kostadin Cholakov, Laura Fahnenbruck, Dörte Hessler, Ildikó Berzlánovich, Gisi Cannizzaro, Tim Van de Cruys, Daniël de Kok, Martijn Wieling, Jelena Prokić, Aysa Arylova, Barbara Plank, Martin Meraner, Gideon Kotzé, Zhenya Markovskaya, Radek Šimík, Jörg Tiedemann, Tal Caspi, Jelle Wouda.

My parents, Hoşnaz and Selçuk Çöltekin, have always been supportive, and encouraged my curiosity from the very start. My interest in linguistics likely goes back to a description of Turkish morphology among my father’s notes. I still pursue the same fascination I felt when I realized there was a neat explanation for something I knew intuitively. My sister, Arzu, has always been there for me, not only as a supportive family member; I also benefited a lot from our discussions about my work, sometimes making me feel that she should have been doing what I do. Lastly, many thanks to the two people who suffered most from my work on the thesis by being closest to me, Aska and Franek, for their help, support and patience.

Contents

List of Abbreviations

1 Introduction

2 The Problem of Language Acquisition
  2.1 The nature–nurture debate
    2.1.1 Difficulty of learning languages
    2.1.2 How quick is quick enough?
    2.1.3 Critical periods
    2.1.4 Summary
  2.2 Theories of language acquisition
    2.2.1 Parametric theories
    2.2.2 Connectionist models
    2.2.3 Usage based theories
    2.2.4 Statistical models
  2.3 Summary and discussion

3 Formal Models
  3.1 Formal models in science
  3.2 Computational learning theory
    3.2.1 Chomsky hierarchy of languages
    3.2.2 Identification in the limit
    3.2.3 Probably approximately correct learning
  3.3 Learning theory and language acquisition
    3.3.1 The class of natural languages and learnability
    3.3.2 The input to the language learner
    3.3.3 Are natural languages provably (un)learnable?
  3.4 What do computational models explain?
  3.5 Computational simulations
  3.6 Summary

4 Lexical Segmentation: An Introduction
  4.1 Word recognition
  4.2 Lexical segmentation
    4.2.1 Predictability and distributional regularities
    4.2.2 Prosodic cues
    4.2.3 Phonotactics
    4.2.4 Pauses and utterance boundaries
    4.2.5 Words that occur in isolation and short utterances
    4.2.6 Other cues for segmentation
  4.3 Segmentation cues: combination and comparison

5 Computational Models of Segmentation
  5.1 The input
  5.2 Processing strategy
    5.2.1 Guessing boundaries
    5.2.2 Recognition strategy
  5.3 The search space
    5.3.1 Exact search using dynamic programming
    5.3.2 Approximate search procedures
    5.3.3 Search for boundaries
  5.4 Performance and evaluation
  5.5 Two reference models
    5.5.1 A random segmentation model
    5.5.2 A reference model using language-modeling strategy
  5.6 Summary and discussion

6 Segmentation using Predictability Statistics
  6.1 Measures of predictability for segmentation
    6.1.1 Boundary probability
    6.1.2 Transitional probability
    6.1.3 Pointwise mutual information
    6.1.4 Successor variety
    6.1.5 Boundary entropy
    6.1.6 Effects of phoneme context
    6.1.7 Predicting the past
    6.1.8 Predictability measures: summary and discussion
  6.2 A predictability based segmentation model
    6.2.1 Peaks in unpredictability
    6.2.2 Combining multiple measures and varying phoneme context
    6.2.3 Weighing the competence of the voters
    6.2.4 Two sides of a peak
    6.2.5 Reducing redundancy
  6.3 Summary and discussion

7 Learning from Utterance Boundaries
  7.1 Related work
  7.2 Do utterance boundaries provide cues for word boundaries?
  7.3 An unsupervised learner using utterance boundaries
  7.4 Combining measures and cues
  7.5 Summary

8 Learning From Past Decisions
  8.1 Using previously discovered words
    8.1.1 What makes a good word?
    8.1.2 Related work
  8.2 Using known words for segmentation: a computational model
    8.2.1 The computational model
    8.2.2 Performance
  8.3 Making use of lexical stress
    8.3.1 Related work
    8.3.2 An analysis of stress patterns in the BR corpus
    8.3.3 Segmentation using lexical stress
  8.4 Summary

9 An Overview of the Segmentation Strategies
  9.1 The measures and the models
  9.2 Performance
  9.3 Making use of prior information
  9.4 Variation in the input
  9.5 Qualitative evaluation
  9.6 The learning method
  9.7 Summary

10 Conclusions

Bibliography

Short Summary

Samenvatting in het Nederlands

A The Corpora

B Detailed Performance Results

List of Abbreviations

ANN Artificial Neural Network
APS Argument from poverty of stimulus
BF Boundary F-score
BP Boundary precision
BR Boundary recall
BTP Backwards or reverse TP
CDS Child Directed Speech
CHILDES Child language data exchange system
H Boundary entropy
IIL Identification in the limit (also used as ‘identifiable in the limit’)
LAD Language Acquisition Device
LF Lexical (or word-type) F-score
LP Lexical (or word-type) precision
LR Lexical (or word-type) recall
MDL Minimum description length
MI (Pointwise) Mutual Information
PAC Probably approximately correct (learning)
PM Predictability-based Segmentation model
PUM Segmentation model that uses predictability and utterance boundaries
PUWM Segmentation model that uses predictability, utterance boundaries and known words
PUWSM Segmentation model that uses all cues: predictability, utterance boundaries, known words and stress
PV Predecessor Value
RM Random Model
SM Stress-based Segmentation model
SRN Simple recurrent network
SV Successor Variety
TP Transitional probability
UG Universal Grammar
UM Utterance boundary-based Segmentation model
VC dimension Vapnik-Chervonenkis dimension
WF Word F-score
WM Lexicon-based Segmentation model
WP Word precision
WR Word recall

1 Introduction

A witty saying proves nothing. Voltaire

The aim of language acquisition research is to understand how children learn the languages spoken in their environment. This study contributes to this purpose by investigating one of the first steps of the language acquisition process, the discovery of words in the speech stream directed to children, by means of computational simulations.

We take words for granted: we identify them effortlessly when listening to others speaking a language we understand, and we use them to construct utterances possibly never uttered before. We learn the sound forms of words, associate them with meanings, and discover how to use them appropriately in the company of other words and in the presence of different people. Despite the apparent ease with which we process and learn words, learning a proper set of words, a lexicon, to communicate effectively with our environment is a challenging task. The challenge starts with identifying these words in a continuous speech stream. Unlike written text, where we typically put white space between the words, the speech signal does not contain analogous reliable markers for word boundaries. A competent language user is aided, to some extent, by his/her knowledge of words to extract them from a continuous stream: itisannoyingbutyouprobablycanfigureoutthewordsinthissequence. However, at the beginning of their journey to becoming competent speakers, children do not know the words of the language they are acquiring. As a result, they cannot make use of words. If this is not convincing, try to locate the word boundaries in this sentence: eğertürkçebilmiyorsanızbudizidekisözcükleribulmanızçokzor. This is approximately what happens when you hear an unfamiliar language (in this case, Turkish). Without knowing the words of the input language, discovering words in a continuous speech stream does not seem possible, which leads to a chicken-and-egg problem. In spoken language, we are not as helpless as in the written stream of letters. There are several acoustic cues that indicate word boundaries. However, although these cues correlate with the boundaries, they are known to be insufficient, noisy and sometimes in conflict with each other. Furthermore, the cues are language dependent; that is, one needs to know the boundaries to learn when and how these cues correlate with the boundaries. We are back to the chicken-and-egg problem again.

Fortunately, there are also some simple and general segmentation strategies that seem to work universally for all languages. As a byproduct of the fact that the natural speech stream is formed by concatenating words, the flow of basic units (such as syllables or phonemes) in an utterance follows certain statistical regularities. Particularly, the basic units within words predict one another in sequence, while units across boundaries do not. It is even more encouraging that children seem to be sensitive to these statistics at a very young age. Another source of information for word boundaries that does not require knowledge of words in advance comes from utterance boundaries. Utterance boundaries are also word boundaries, and words are formed with certain regularities; for example, they share common beginnings and endings. This provides another source for discovering words before knowing them. Once we start discovering words using these general strategies, we can also learn to use the language-specific cues.
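To make the predictability idea concrete, here is a minimal sketch in Python, an illustration of the general strategy rather than any of the models developed in this thesis: it estimates phoneme-to-phoneme transitional probabilities from a toy corpus of unsegmented utterances and posits word boundaries where predictability dips, that is, at local peaks in unpredictability. The toy corpus and the local-minimum rule are illustrative assumptions; Chapter 6 develops this strategy properly, with several predictability measures and longer phoneme contexts.

from collections import Counter

def train_bigrams(utterances):
    # Count how often each phoneme occurs (in non-final position)
    # and how often each phoneme pair occurs.
    unigrams, bigrams = Counter(), Counter()
    for utt in utterances:
        for a, b in zip(utt, utt[1:]):
            unigrams[a] += 1
            bigrams[a, b] += 1
    return unigrams, bigrams

def segment(utt, unigrams, bigrams):
    # Transitional probability P(b|a) for each adjacent pair.
    tps = [bigrams[a, b] / unigrams[a] for a, b in zip(utt, utt[1:])]
    # A boundary is hypothesized where TP is a local minimum,
    # i.e., at a peak in unpredictability.
    cuts = [i + 1 for i in range(1, len(tps) - 1)
            if tps[i] < tps[i - 1] and tps[i] < tps[i + 1]]
    return [utt[s:e] for s, e in zip([0] + cuts, cuts + [len(utt)])]

# A toy 'child-directed' corpus, already stripped of word boundaries.
corpus = ["lookatthedoggy", "thedoggyisbig", "lookatthat"]
uni, bi = train_bigrams(corpus)
print(segment("lookatthedoggy", uni, bi))  # hypothesized segmentation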

This is a good point at which to summarize the problem:

Given a list of unsegmented utterances formed using an unknown set of words, and a set of incomplete, noisy and sometimes conflicting cues that correlate with the word boundaries in unknown ways, find the word boundaries.

If you have ever taken a programming class, this might look familiar: it looks like a rather tricky programming exercise. And if you have taken a class in machine learning, you may already have some ideas on how to go about solving it. Regardless of whether it is solved by a human brain or a computer, this is a computational (or information processing) problem. This statement is true for many cognitive processes. As a result, a common method of studying cognitive processes, including language acquisition, is to model them formally, and to study the model using computational simulations. The methodology in this study follows this general practice. In a nutshell, a computational model helps us understand the natural phenomenon it models by (1) finding parallels between the natural phenomenon and the computational model, (2) testing hypotheses that are difficult or impossible to test directly, and (3) providing more insight into the problem by describing it in detail. I will present computational models of segmentation that offer solutions to the segmentation problem guided by the strategies mentioned in the discussion above, namely, predictability statistics and utterance boundaries, followed by cues that become available from previously discovered words. Special attention will be paid to being consistent with what we do know about child language acquisition. The models will be tested using transcriptions of actual child-directed speech. The modeling effort will follow the cues mentioned in the discussion above. I will start presenting models and results of computational simulations with language-neutral methods, or cues, and demonstrate their usefulness in combination with other language-specific cues.
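Since the models are tested against transcriptions whose word boundaries are known, their output can be scored directly. Segmentation studies typically report boundary precision, recall and F-score (the BP, BR and BF of the list of abbreviations). The following is a minimal sketch of such an evaluation, assuming boundaries are represented as sets of positions within an utterance; the representation is an illustrative choice, not the thesis's exact implementation.

def boundary_scores(predicted, gold):
    # `predicted` and `gold` are sets of boundary positions within one
    # utterance; utterance edges are excluded, since they come for free.
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 1.0
    recall = hits / len(gold) if gold else 1.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

# 'look|at|the|doggy' -> gold boundaries after positions 4, 6 and 9.
gold = {4, 6, 9}
predicted = {4, 9, 11}                   # one missed, one spurious boundary
print(boundary_scores(predicted, gold))  # approximately (0.67, 0.67, 0.67)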

Before presenting the computational models of segmentation outlined above, the next two chapters will discuss some general issues in the field of language acquisition and in computational modeling of language acquisition processes. The problem of language acquisition in general, and the debates and issues in the broader field, will be discussed in Chapter 2. After a discussion of a central debate in the field, the nature–nurture debate, the chapter will review a number of general theories of language acquisition and the solutions they offer for the problem. Chapter 3 will discuss the computational modeling practice in detail, and how this approach can be helpful in answering questions about cognitive phenomena in general, and language acquisition in particular. The chapter will discuss the differences and similarities between two separate but related methods of studying computational models, namely, mathematical analysis of the models and computational simulations. I will argue that these two methods are complementary, yet in some cases computational simulations may avoid the difficulties faced by analytic methods by adopting sometimes loose, and sometimes more accurate, formalizations of certain aspects of the problem being modeled. In studying language acquisition processes, computational simulations are at an advantage in modeling the input to the learner. Even though it is difficult to model the utterances a child hears during language acquisition with mathematical formulas, it is relatively easy and more accurate to model them using appropriately large child-directed speech corpora. There will be some discussion of the nature–nurture debate in this chapter as well, this time focusing on its formal aspects. Chapter 4 will focus on the problem of segmentation. I will demonstrate the problem in detail, review the relevant developmental psycholinguistic literature, and introduce the cues that are known or believed to be used by children in solving this problem. Chapter 5 discusses the computational problem of segmentation in more detail. Along with descriptions of different ways of modeling the segmentation problem computationally, I will review relevant previous studies. The chapter discusses general issues with computational models of segmentation, such as the questions they answer, how to evaluate their performance, and how to interpret their results. Furthermore, this chapter will present a reference computational model of segmentation that follows a successful strategy different from the strategy advocated in this study. This model will be used as a reference throughout the rest of the thesis for comparing the performance of the models developed during this study. Chapter 6 takes the first step towards the intended computational models of segmentation in this work. After a detailed analysis of a number of measures of predictability (or uncertainty), a predictability-based segmentation model will be presented. The model has two main components. First, given a certain predictability measure based on statistical information extracted from previous input utterances, the model uses an unsupervised method for finding word boundaries in the current utterance. Second, the model uses a method to combine the decisions obtained from a set of individual boundary indicators, or measures. These two components will be used in the following two chapters while incorporating additional indications of word boundaries.
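The second of these components, combining the decisions of several boundary indicators, can be pictured as a weighted vote over cues. The following is a minimal sketch under assumed cue names and an assumed decision threshold; the weighting scheme actually used is developed in Chapter 6.

def vote_boundary(cue_scores, weights, threshold=0.5):
    # `cue_scores` maps cue names to boundary votes in [0, 1] for one
    # candidate position; `weights` reflects how competent each cue
    # has been so far. Names and threshold are illustrative only.
    total = sum(weights[c] for c in cue_scores)
    combined = sum(weights[c] * s for c, s in cue_scores.items()) / total
    return combined > threshold, combined

decision, score = vote_boundary(
    {"tp": 0.8, "entropy": 0.6, "utterance_end": 0.3},
    weights={"tp": 2.0, "entropy": 1.0, "utterance_end": 1.0},
)
print(decision, round(score, 2))  # True 0.62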
Chapters 7 and 8 extend the model described in Chapter 6 using information from utterance boundaries and already discovered words, respectively. These two chapters demonstrate that information from different sources is useful in combination, and that language-specific cues may start being useful once we learn some words of the input language. Chapter 9 summarizes the segmentation models presented in the preceding three chapters, provides a qualitative analysis, compares the results among these models and the other models presented in the literature, and finally suggests possible extensions for future work. Chapter 10 gives a brief general summary and concludes.

2 The Problem of Language Acquisition

The most essential characteristic of scientific technique is that it proceeds from experiment, not from tradition. Bertrand Russell

A typical introduction in many books and articles on language acquisition starts by describing language acquisition with expressions like ‘the greatest intellectual achievement of one’s lifetime’, ‘endlessly fascinating’, ‘a snap’, ‘an astonishing process’, a ‘fascinating feat’, ‘a monumental achievement’, or ‘a great gift’.1 Clearly, we are impressed with the way children acquire the languages spoken around them. The difficulty of learning languages in general, and the apparent ease with which children acquire them, is what lies behind these big words of appreciation. For those of us who have tried learning a second language, it is clear that learning a language is a difficult task. Children, on the other hand, seem to learn languages, even multiple languages, spoken in their environment in an effortless way. They do not rehearse word lists, they do not need aid from teachers, they do not spend time in language labs, they do not do any grammar exercises, nor do they use any other training material that adult second language learners typically use. The difficulty of learning languages and the impressive performance children show in this task make research on language acquisition an interesting inquiry. Our knowledge about how children achieve this impressive task is limited, and the theories in the field disagree drastically. As well as providing a broader introduction to the main issues in the language acquisition literature, the aim of this chapter is to clarify the place of the present study with respect to influential theories and viewpoints in the broader field of language acquisition. The next section will take a closer look at the major disagreement about what enables children to learn languages quickly and effortlessly.

1 The phrases quoted above are only a few of the words of astonishment that recur in the literature with slight variation in wording. The first one can be sourced to Bloomfield (1933, p.29). The others appeared, at least, in Pinker (1995, p.175), Crain and Pietroski (2002), Guasti (2002, p.2) and Akmajian et al. (2010, p.481), in the order presented above, and Saxton (2010, p.3) successfully fits the last three into a single paragraph.


Section 2.2 will briefly summarize some of the popular theories of language acquisition, and Section 2.3 will conclude after a brief discussion.

2.1 The nature–nurture debate

Stating that ‘the aim of language acquisition research is understanding how children acquire languages’ may sound like a tautology. However, to a large extent, research in language acquisition focuses on providing evidence for or against the existence of an innate language capacity. The underlying purpose of this divergence is to support one of two philosophical viewpoints on human cognition: nativism or empiricism. These viewpoints have been under constant debate (also known as the nature–nurture debate) for as far back as known human intellectual history extends, and language acquisition research has been the battlefield of this debate for the last 50 years. This thesis takes no side in this debate, and I find it counter-productive to keep the debate at the main focus of the research agenda. Nevertheless, the debate is too central to the field to go unmentioned.2 This section presents a brief discussion of the nature–nurture debate in the context of language acquisition, and provides arguments against taking an a priori side in it.

Nativism is the view that certain skills, abilities or knowledge are innate, that they are not learned from the environment. The roots of nativism can be traced at least back to Plato, and Descartes was probably the most influential thinker for the modern nativist (or rationalist) standpoint. However, modern linguistic nativism gained popularity because of Chomsky’s ideas on language acquisition (Chomsky, 1959b, 1965). According to linguistic nativism, humans are born with an innate endowment specific to language, commonly referred to as the language faculty, language acquisition device (LAD) or universal grammar (UG). The UG enables acquisition of languages, while environmental factors are regarded as making a minor contribution. As Chomsky (1980) puts it,

[...] in certain fundamental respects we do not really learn language; rather, grammar grows in the mind. When the heart, or the visual system, or other organs of the body develop to their mature form, we speak of growth rather than of learning. [...] In both cases, it seems, the final structure attained and its integration into a complex system of organs is largely predetermined by our genetic program, which provides a highly restrictive schematism that is fleshed out and articulated through interaction with the environment (embryological or postnatal). (Chomsky, 1980, p.134)

2 The terms ‘acquisition’, ‘learning’ and ‘development’ often indicate the side a researcher has taken in this debate. In parallel with the arguments in this chapter and Chapter 3, this thesis does not make any clear distinctions between these terms.

Empiricism, on the other hand, is the opposing position that individuals are born without any innate knowledge (the term tabula rasa, or blank slate, is frequently used to describe this state), and that knowledge comes from perception and experience. This view can be traced as far back as Aristotle, and Locke is the most influential philosopher in this camp. However, current empiricist (or non-nativist) theories of language acquisition diverge from historical empiricism. In the language acquisition literature, connectionist models and theories of language acquisition (see, e.g., Elman et al., 1996) have been the main representatives of this viewpoint. Contemporary non-nativist theories do not exclude all forms of innate capacities. However, the role of the environment and domain-general learning mechanisms are regarded as more important from this perspective.

Indisputably, acquiring languages requires some biological mechanism that we are born with: any normally developing child learns the language(s) he/she is exposed to, but a kitten born and raised in the same environment does not. Likewise, languages are learned: children born in different language environments learn different languages. Hence, besides the domain-specificity of the innate capacity, the disagreement is on the degree—rather than the existence—of the innate knowledge or mechanisms. As can be guessed from phrases like ‘largely predetermined’ or ‘more important’, the distinction is a fuzzy one.3 In language acquisition, the role of genetic factors and the importance of the environment are not easily quantifiable. Even in their qualitative sense, they seem to be moving targets. For example, while earlier proposals by Chomsky (1981) suggested complex innate linguistic knowledge in the form of the principles and parameters theory (P&P, see Section 2.2.1), his later view seems to be reduced to recursion only (Hauser, Chomsky and Fitch, 2002). Even if we could establish how much and what type of innate knowledge would prove a certain point, our knowledge of language and how it is acquired is not sufficient to settle the debate: we know very little about the nature of our linguistic knowledge, and how we acquire it. This information should eventually come from neuroscience. However, we are a long way from a full characterization of the neurological processes involved in language acquisition and language use. Providing a detailed account of the debate is beyond the scope of this thesis.4 Besides presenting a short overview of common arguments of the debate, the main point of this section is to argue that taking the nativist–empiricist debate as the main focus of language acquisition research is often counter-productive.

3 There are a number of testable arguments as well, such as the ‘argument from poverty of stimulus’, to which we will return in Chapter 3.
4 A popular reference on the nativist side of the debate is Pinker (1994), and Sampson (1999) gives an accessible empiricist response to linguistic nativism. Most textbooks on language acquisition take clear sides in this debate. For a recent textbook that provides a balanced account of the debate along with the issues in language acquisition, see Saxton (2010).

2.1.1 Difficulty of learning languages

The perception that learning a natural language is a difficult task is hardly controversial. However, the difficulty of learning languages as an argument in the nature–nurture debate requires more scrutiny than it typically receives. In this section I will briefly review a few aspects of natural languages that are assumed to be difficult to learn, and relate them to the debate. Some points which will be discussed in detail in later chapters will also be briefly mentioned here. The discussion related to the input and formal learnability theory will be left for Chapter 3, and the segmentation problem will be discussed in depth in later chapters.

The difficulties of learning a language start with the very first step: segmenting fluent speech into discrete units is a difficult task (see Chapter 4 for a detailed discussion). Despite its difficulties, children take their first steps towards the solution of the segmentation problem as early as their first few months of life, and by their first birthdays they get very close to the solution (Jusczyk, 1999). The segmentation problem rarely makes its way into the nature–nurture debate. However, segmentation is a necessary step for identifying linguistic units like phonemes, syllables or words in a continuous stream of acoustic input. Learning to identify words, or lexical units, is the main focus of this thesis, and the problem will be discussed in detail in Chapter 4 and the chapters that follow.

Even if the segmentation problem is solved, learning the words of a language is a challenging task on its own. Words are arbitrary and ambiguous sound units. Nevertheless, children around six months of age start recognizing the words they hear frequently, such as their names (Bortfeld et al., 2005). With large individual variation, children start producing their first words around their first birthday, but it is estimated that they understand much more (about 80 words, Fenson et al., 1994). It is a common assumption that sometime between the ages of 1;6 and 2;6,5 an explosive growth of the lexicon, the so-called vocabulary spurt, starts (Bloom, 1976). Despite empirical evidence against certain forms of vocabulary spurt (Ganger and Brent, 2004), it is clear that children learn new words at an increasingly high rate. An estimate frequently cited in the literature for the rate of word learning in preschool children is 10 words per day (based on Carey, 1978). However, caution is needed in interpreting this number. Even if the number is accurate for the complete process, reporting a single number can be misleading: a two-year-old’s word learning speed is nowhere near ten words a day. The estimates in the literature suggest a slow start, about 1.6 words per day in the second year of life. The rate reaches its peak, 12.1 words per day, between the ages of eight and 10 (Saxton, 2010, p.146 presents estimates of the learning rate between ages one and 17). Estimates of lexicon size at age six vary between 10,000 (Bloom and Markson, 1998) and 14,000 (Clark, 1993; Templin, 1957) words. The estimated number of words in the lexicon of an 18-year-old is around 60,000 (Aitchison, 1994). The figures reported here are based on average values calculated for children learning English. Precise estimation of vocabulary size is far from trivial (Miller, 1996, pp.134–137), and the acquisition paths of children show large individual variation. As a result, the estimates in the literature tend to show large variation as well. However, it is clear that lexical acquisition starts before the first year of life, and that the rate of words learned increases until the school years, with an overall rate of eight to ten words per day (see the back-of-the-envelope check at the end of this section). Even with conservative estimates, the ability to learn words that quickly is indeed impressive.

5 The age notation follows the standard notation in the language acquisition literature. The ages of children are indicated using three numbers, ‘year;month.day’, separated by a semicolon and a dot in this order. For example, 1;3.10 means one year, three months and ten days.

However, since word learning also means learning meaning and usage, the problem is even more difficult than storing sequences of phonemes. An apparent difficulty related to word learning is referential uncertainty. It is claimed that when a learner hears a word (or any other linguistic unit), finding the correct referent of the unit in the real world is intractable. The philosophical discussion of this problem can be traced back to Quine (1960, Chapter 2). Quine discusses a hypothetical problem where a linguist who is trying to learn an indigenous language hears the word ‘gavagai’ referring to a rabbit. The question is ‘how can the linguist conclude that the word means rabbit?’ He argues that it may as well mean ‘the tail of the rabbit’, ‘this particular rabbit’, ‘any mammal’, ‘color of the rabbit’, ‘tasty!’, ‘nice day’ or (infinitely) many other possibilities. The problem of associating words with their meanings is a popular subject in the linguistic literature, and it has been addressed by a large number of researchers from a broad perspective (e.g., Bloom, 2000; Markman, 1989; Siskind, 1996; Tomasello, 2001; Xu and Tenenbaum, 2007).6 Quine’s original discussion, and many appearances of the problem in the literature, are rather informal. However, it is clear that in many circumstances the possible referents of a novel word are ambiguous. Despite this problem, people seem to learn words quickly. In most cases, only a few exposures, or even a single one, are enough for people to learn the meanings of newly heard words. The difficulty of word learning, particularly the problem of assigning meanings to words from the information available in the environment, is sometimes put forward as an argument for nativism. However, no concrete proposals exist for what sort of innate linguistic mechanisms may aid word learning. Words are, after all, arbitrary sound sequences, specific to the particular language being acquired.

6 Besides the attention it received in the linguistic literature, the appearance of the problem in SpecGram (van der Sandt, 2005) is probably a good indication of its popularity.

The main arena of the nature–nurture debate is learning syntax. Natural language sentences are not just random collections of words strung together. To use a language properly, one needs to learn how to combine words to form grammatical utterances. Certain assumptions about the nature of the utterances children hear during the acquisition process, the input, and negative learnability results from computational learning theory are frequently put together as an argument for linguistic nativism. The formal results from computational learning theory and their implications for the difficulty of learning languages are discussed in more detail in Chapter 3. For now, it suffices to note that these arguments are misguided because (1) the input is more structured and richer than portrayed by these arguments; and (2) the results from computational learning theory are based on restricted learning settings, such as a concept of learning that requires the ability to distinguish the learned language perfectly, and formal languages, neither of which is satisfied in the case of child language acquisition.
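As a rough back-of-the-envelope check on the vocabulary figures above (an illustration, not an estimate from the acquisition literature): a lexicon of roughly 60,000 words accumulated over 18 years works out to 60,000 / (18 × 365) ≈ 9.1 words per day on average, consistent with the overall rate of eight to ten words per day cited above, even though the day-to-day rate varies considerably with age.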

2.1.2 How quick is quick enough?

It is a common assumption that children learn languages very quickly. Most language acquisition studies cover only the first two or three years of life. The reason behind this is partially the fact that this period covers most of the interesting language acquisition phenomena. However, it is also widely assumed that by the age of three or four, children have acquired most of the language. The claims go even further, to assume that children show ‘adult competence’ by three or four (McGilvray, 2006). Even though the observation that a four-year-old child uses language effectively is hardly controversial, the stronger version of the claim, that they show adult competence, does not seem to hold. The facts about vocabulary learning presented in the previous section already indicate that the most active period for acquiring new lexical items is a lot later than this period (between the ages of eight and 10). Vocabulary learning aside, the language acquisition literature is full of examples of late-acquired linguistic phenomena. For example, it is well known that children acquiring Dutch show difficulties with the interpretation of pronouns until age six (see, for example, Hendriks et al., 2007; van Rij et al., 2010). Similarly, children acquiring German do not seem to show adult competence in the interpretation of case marking until the age of seven (Dittmar et al., 2008). Even more dramatically, Omar (1973) reports that children acquiring Egyptian Arabic still had difficulties with the acquisition of noun plurals at age 15. In the light of this evidence, it is difficult to dismiss the importance of the later acquisition process, and to claim (as, for example, Crain and Pietroski, 2002, do) that the language acquisition process is ‘a snap’. Language acquisition seems to span a larger time period than most researchers commonly assume. However, the general agreement is that by the age of three or four, children’s use of language resembles adult competence to a large extent.

Now we can return to the question of whether children learn languages in a short time or not. Claiming that a process takes a short time requires some reference amount of time it should take. In other words, how quick is quick enough to announce victory for nativism, and how slow is slow enough to announce victory for empiricism? A possible path to take is to compare children with adults. However, this comparison is problematic for at least two reasons. First, children have rather limited cognitive capabilities compared to adults. Second, the learning settings are very different. Children acquire languages while communicating with adults and other children in their environment, without explicit training. Adults, on the other hand, typically learn languages using various training aids and with hard work. If we attempt to compare children and adults despite these differences, it is doubtful that we would find it impressive that an adult gains a four-year-old’s competence in a non-native language in four years. For example, in some universities, students who do not speak the language of education are required to reach academic-level competence in a year or even a shorter time.7

It should be noted that neither argument puts adults in a more privileged position than children. There is no evidence that the methods used in second language acquisition are more effective than the child language acquisition setting.
Possible interferences from the adult’s first language aside, a second language learner rarely has the time and the motivation of a child trying to communicate with his/her environment. Furthermore, the limited capabilities of children may provide some ‘maturational constraints’ which may in turn help the acquisition of languages (Krueger and Dayan, 2009; Newport, 1988, 1990, 1993). Again, even if we could reliably establish that children are faster or slower language learners, it would not necessarily answer the question of innateness. Age affects many cognitive tasks in complicated ways, and even if we can isolate the effects of other cognitive functions, we return to a slightly modified version of the same question: how do faster or slower learning rates prove a certain viewpoint?

A possible criticism of the above comparison between adult and child learning is that we are comparing first and second language acquisition processes. Comparing child language acquisition to delayed first language acquisition is, in principle, what we need to do for a fair comparison. And there have been claims based on the delayed first language acquisition observed in so-called feral children.8 However, these cases are far from normal in other aspects of biological and cognitive development, and (fortunately) these cases are too few to serve for reliable conclusions.

2.1.3 Critical periods

The comparison between adult and child language learning brings us to another popular subject in the debate: critical periods. The existence of a critical period for language acquisition was popularized by Lenneberg (1967). The critical period hypothesis states that successful language acquisition is only possible if the child is exposed to language within an early time window.9 Critical periods are known to exist in other domains of biological development. A well-known example of this kind is filial imprinting. Members of many bird species attach to a moving figure they observe shortly (typically in the first few days) after they are born and follow it as their parent. Similarly, in their Nobel prize winning work, Hubel and Wiesel (1970) found that cats can develop normal vision only if they are exposed to visual stimuli in the first 10–12 weeks of life. If a cat is deprived of vision during this period, it becomes permanently blind.

Critical periods are typically used as an argument for nativism: since critical periods are biologically determined, if human language acquisition is also subject to critical periods, it must be biologically determined. The problem with this argument is that it is far from well established that language acquisition is subject to a critical period. As discussed previously, like many other cognitive functions, language learning ability is linked to age. However, second language acquisition creates an interesting case: people can learn languages even at later ages. Not only are people capable of learning languages during adulthood; limited use of the first language may even cause it to deteriorate and even lose its dominance, a process commonly called language attrition (Schmid, 2009). Unlike well-established cases of critical periods, the ability to learn languages seems to show a gradual deterioration, not a complete inability to learn after a certain age. The evidence from delayed learning cases, on the other hand, seems to be indecisive (for a thorough discussion see Saxton, 2010, chapter 3), and the data is interpreted differently depending on the inclinations of the researcher presenting it (Jones, 1995).

7 It should be noted, however, that adult second language learners normally do not achieve native-like performance in some aspects of the language.
8 Genie, who was deprived of normal human contact until age 13, is the most popular example documented in the literature (Curtiss, 1977; but see also Rymer, 1994).
9 According to Lenneberg (1967), before puberty. More recent proposals suggest even earlier ages.

2.1.4 Summary

The nature–nurture debate is an exciting philosophical debate which has gained a central position in the language acquisition literature. However, the empirical evidence put forward in favor of either theory is far from conclusive. Furthermore, most of these arguments are often not well defined, or very difficult (or sometimes impossible) to test concretely. Taking positions based on fuzzy philosophical viewpoints may cause theories to be put forward and data to be interpreted in heavily biased ways. For example, the main motivation of the popular P&P theory of language acquisition is largely based on accepting the nativist viewpoint from the beginning, rather than on available data. Decades of research tried hard to support the theory, yet it has largely been abandoned by its inventors and most of its supporters (see Lappin and Shieber, 2007, for a discussion). Meanwhile, statistical approaches which were popular among structural linguists of the 1950s (e.g., Harris, 1955) had been neglected, due to the dominant position of the nativist viewpoint, until the 1990s. Saffran, Aslin and Newport (1996a) and subsequent research showed that children are good statistical learners and use statistical learning methods in various tasks in language acquisition, rekindling interest in these methods.

Arguably, the debate can be fruitful as it may stimulate research and result in an active field. However, it also polarizes the field heavily, causing biased interpretation of scientific findings and making the field more susceptible to problems with current scientific practice, such as confirmation bias (Nickerson, 1997) and publication bias (Dickersin, 1990). Confirmation bias is the tendency of people to favor results that confirm their beliefs. The reflection of this psychological phenomenon in science occurs when scientists resist new discoveries or methods, or selectively cite evidence that favors their presumptions. Publication bias is the tendency to publish research with positive results. Studies that do not support the initial hypothesis tend to be neglected and stay unpublished.

The discussion above already touched on a number of these cases. For example, the case of the feral child known by the name Genie has been interpreted differently depending on who analyzed it. Genie was kept in a closed space and deprived of normal human contact until she was discovered at the age of 13. A summary of the mostly non-linguistic aspects of Genie’s case can be found in Rymer (1994), and Curtiss (1977) is the most comprehensive overview of the case from a linguistic perspective. Starting with Curtiss (1977), the linguistic development of Genie has been used as a strong argument in favor of critical periods in language development. However, Jones (1995) argues, based on the data presented in Curtiss (1977), that Genie’s development was not as bad as it was portrayed by research with a nativist inclination. Furthermore, the linguistic data collected and presented in the literature seems too scarce to serve as conclusive evidence. And last but not least, as Rymer (1994) clearly demonstrates, Genie’s psychological development was far from normal. She not only had a traumatic start to her life, but the trauma continued in the foster homes after she was discovered. These aspects are never mentioned when Genie is brought up as an example that proves the existence of critical periods.
Another, less traumatic, case is the ongoing debate about learning regular and irregular forms, such as the past tense forms of verbs in English. Since the study of Rumelhart and McClelland (1986) there has been a constant debate as to whether the irregular forms are learned using a different mechanism than the regular forms.10 The problem itself is interesting. However, the motivation in this debate is fueled by the bigger debate about innateness. Although there is no clear reason for symbolic systems to prove a nativist standpoint (or statistical learning mechanisms to prove an empiricist one), since Chomsky’s rejection of statistical methods (Chomsky, 1957), symbolic and rule-based methods of explaining linguistic phenomena have been favored in the nativist linguistic literature. The heated discussion has caused this problem to be investigated thoroughly. However, the reason behind the missing consensus on how people learn these forms is not a lack of data. A closer look at the research on the subject shows that most researchers start with one of the conclusions and aim at supporting it. Arguably, a more neutral approach would allow us to learn more about the issue. Another unfortunate effect of this debate is that it is commonly framed as the problem of learning morphology, causing a large number of interesting aspects of learning morphology to be overlooked behind the dominant interest in this particular problem.

10 For an analysis of child language data from an alternative approach in this debate, so-called single-route vs. dual-route learning, see Marcus et al. (1992).

There are numerous other questions in the field that have been affected by the biases of the researchers studying them. Not all of the conclusions are contested by opponents or less biased researchers. However, a number of the cases where disagreement has surfaced can be identified by reading Pinker (1994) and Sampson (1999) side by side. Having said all this, I do not claim that the nativism–empiricism debate is irrelevant to language acquisition research. The long-lasting debate indeed shows that this question about human nature is an intriguing one, but it is also clear that it is far from resolved (see Scholz and Pullum, 2006, for a relevant discussion). Like any other study of human cognition, language acquisition research may also contribute to this debate. However, it is fruitless to take an a priori side in this debate, or even a hasty one (as a ‘working hypothesis’). Instead of this largely philosophical question, it is more productive to focus on the specific questions and theories of the field.

2.2 Theories of language acquisition

Theorizing about nature plays an important role in our scientific inquiry. We typically build formal, testable theories of natural phenomena to explain, understand and predict the phenomena being modeled. Language acquisition is not an exception. Researchers have put forward a number of theories with the aim of explaining child acquisition data, providing more insight into the language acquisition process, and, hopefully, predicting yet unobserved, or unobservable, aspects of language acquisition. In this section I will review three influential theories, or rather frameworks, used in language acquisition research, namely the principles and parameters theory, connectionist networks, and usage-based theories. A fourth approach, statistical modeling, will be discussed next. Before describing these frameworks, a few notes are in order. First, in principle, the frameworks described here are not necessarily mutually exclusive; it is possible, and not uncommon, to see models that cross-cut this classification in some ways. Some examples of such models will be presented in Section 2.2.4. Second, it is well known that all learning systems have to start with certain initial assumptions about the nature of the problem. In other words, there is no ‘general-purpose’ learning algorithm (see the ‘no free lunch’ theorem, Wolpert and MacReady, 1997). However, initial assumptions or knowledge do not always entail innate knowledge. For this reason, unless there is a clear theoretical commitment that a certain aspect is innate, I will refer to the initial assumptions of a model, borrowing the term from machine learning, as its inductive bias.

2.2.1 Parametric theories The nativist conclusion that human languages are not learnable without rich innate linguistic knowledge led researchers to adopt theories that posit an innate UG. In these theories, the UG plays a central role by constraining the learning. The common path taken in these theories is reducing the acquisition process to adjusting a set of parameters. In parametric theories, the learner is assumed to (innately) know what the 2.2. Theories of language acquisition 15 parameters are. The learning task is setting these parameters to one of the allowed values by observing relevant aspects of the input. Probably the most influential theory of this form is the principles and parameters theory (P&P, Chomsky, 1981) that I will describe here. Another well known theory of this form is optimality theory (OT, Prince and Smolensky, 1993/2004). P&P and OT differ in their theoretical backgrounds, the linguistic questions they are typically applied to, and the nature of the parameters used.11 However, from a computational perspective, both theories view learning process as finding values for a number of linguistically motivated parameters. The solution to the learnability problem proposed by the P&P is based on a set of universal principles and parameters that all possible languages share. The principles are considered to be shared by all human languages. The parameters are also universal, however, the particular values the parameters take define a particular language. A frequently cited example of universal principle is that the rules of the grammar have structure sensitivity. For example, assuming that English interrogative sentences are formed from their declarative versions, the correct way to turn the declarative sentence the dog that is in the corner is hungry to an interrogative question is moving the second is to the front. That is, the correct question sentence is is the dog that is in the corner hungry? We move the second is instead of the first one, because we need to move the auxiliary in the main clause, and this can only be achieved by structure, e.g., clause, sensitive rules. Furthermore, it is claimed that this cannot be learned from the linguistic input, and hence, it must be an innate principle (Chomsky, 1965; Crain and Nakayama, 1987). Principles are not learned, and do not vary among different languages. Like the principles, the specifications of the parameters are also assumed to be innate and universal. However, their values are set during the acquisition process. Parameters are set in one particular way when the child learns one language, and they set another way when another language is learned. The parameters proposed are almost exclusively binary. So, for N parameters it is possible to hypothesize 2N grammars. A common example of parameters is null-subject, or pro-drop, parameter. This parameter defines whether it is allowed to skip the pronominal subjects in the language. This is true for languages like Italian, but not for English. So, children in an English speaking environment determine at some point that their language does not allow null-subjects and set the parameter to false, while children learning Italian set the parameter to true. Combination of many such parameters define different possible grammars. The language acquisition proceeds by setting these parameters based on the linguistic input received. Among attempts to formally define P&P-based language acquisition procedures are Gibson and Wexler(1994) and Fodor(1998). 
Both procedures are rule-based: they make parameter changes based on single examples. This allows the algorithms to generalize quickly using fewer examples. However, they are sensitive to noise, and it is not clear whether they would recover from an incorrectly learned grammar (see also Niyogi, 2006, for a formal analysis). Yang (2002) presents one of the rare P&P-based models that combine statistical techniques with the classical P&P approach. Yang's variational learner learns a set of parameters by a statistical system inspired by evolutionary selection. The variational learner alters the weights of all grammars in the space of possible grammars based on the input. The statistical augmentation makes the variational learner more robust against noise than the earlier rule-based approaches. However, the criticisms listed below are valid for Yang's (2002) variational learner as well.

From the perspective of language acquisition, the main motivation behind the P&P approach is that it makes learning easier. One can obtain a large number of possible languages defined by a relatively small set of parameters. Then, the learning task is reduced to setting this small number of parameters. However, if P&P learning procedures are analyzed more carefully, one observes that this particular form of parametrization is neither necessary nor sufficient for successful learning (see Clark and Lappin, 2011a; Lappin and Shieber, 2007). Even if we are convinced that a form of the P&P approach makes the learning problem easier, there are still a number of issues.

• Despite the popularity of the P&P theory in the literature for decades, there is no explicit list of established parameters. Even the highly popular and widely accepted parameters cannot be applied to all languages reliably. For example, another frequently cited parameter, perhaps the most cited of all, is the head-direction parameter, which is set to head-initial for languages like English, and head-final for languages like Japanese. However, the place of Dutch and German is not clear with regard to this parameter. As pointed out by others, for example Newmeyer (2004, 2006) and Trask (2002), the attempts to come up with a list of parameters, e.g., Baker (2001), did not succeed. In Trask's words, '. . . "all grammars leak", but the parametric approach begins to look uncomfortably like a sieve'.

• Even if a set of parameters is found to explain the differences between languages, it is not clear how these parameters lead to an actual language processing system. For example, there are no explanations for how one can arrive at a generative grammar, e.g., in the form of a phrase-structure grammar, from a set of parameter values. As a result, most work on language acquisition within the P&P framework focuses on showing that the learner can choose one sort of grammar rather than another, and detailed models following the P&P framework do not go beyond 'proof of concept'.

• Most of the P&P-style learning procedures learn from particular aspects of the languages, commonly called triggers. Why learners are sensitive to these particular aspects of the input is left unexplained.

• The parameters commonly listed in the literature generally depend on knowledge that is likely to be learned. For example, the head-direction parameter discussed above requires the abstract linguistic concept of head, on which even linguists are not in full agreement. Hence, the P&P approaches need to explain how this knowledge is obtained by the learners.

• P&P is typically applied to learning syntax. It is not clear how other aspects of language acquisition, such as word learning or segmentation, can be handled by a P&P learner.

These problems have led P&P theory to lose popularity over the last decade. However, it is still rather influential, particularly in the theoretical linguistics literature, and supported by many researchers (Boeckx, 2009; Niyogi, 2006; Yang, 2004, exemplify some recent work on P&P).

11 OT stems from an empiricist tradition, and researchers working in the OT framework do not typically share the nativist conclusion. The classification here signifies the similarities regarding parametric structure.
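To make the parametric view of learning concrete, the following is a minimal sketch of an error-driven, trigger-based learner in the spirit of Gibson and Wexler's Triggering Learning Algorithm. The parameter names, the toy licensing oracle and the '<NS>' input marking are all made up for illustration; a real implementation would need a grammar that actually parses sentences.

    import random

    random.seed(0)

    # A toy grammar space: each grammar is a vector of binary parameters.
    # 'generates' is a stand-in oracle deciding whether a parameter
    # setting licenses an input sentence (both hypothetical).
    PARAMETERS = ("head_initial", "null_subject")

    def generates(grammar, sentence):
        # Toy rule: a sentence marked '<NS>' (no overt subject) is only
        # licensed if the null-subject parameter is set to True.
        if sentence.startswith("<NS>") and not grammar["null_subject"]:
            return False
        return True  # everything else is licensed in this toy space

    def tla_step(grammar, sentence):
        # One step of a Triggering-Learning-Algorithm-style learner:
        # error-driven, so nothing changes if the sentence is licensed;
        # otherwise flip one randomly chosen parameter and keep the new
        # setting only if it licenses the input (greedy revision).
        if generates(grammar, sentence):
            return grammar
        candidate = dict(grammar)
        p = random.choice(PARAMETERS)
        candidate[p] = not candidate[p]
        return candidate if generates(candidate, sentence) else grammar

    grammar = {"head_initial": True, "null_subject": False}
    for sentence in ["the dog barks", "<NS> barks", "<NS> sleeps", "<NS> eats"]:
        grammar = tla_step(grammar, sentence)
    print(grammar)  # null_subject has very likely been flipped to True

Because a single input can flip a parameter, one misheard or ungrammatical sentence can move such a learner to a wrong grammar, which illustrates the noise sensitivity noted above.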

2.2.2 Connectionist models

Connectionist systems have been the typical representatives of empiricist models of language acquisition. Unlike P&P-like models of language acquisition, connectionist models do not assume a specialized UG. The learning is achieved through general-purpose learning mechanisms that are also useful in other domains of cognitive development. Connectionist systems, or artificial neural networks (ANNs), are inspired by the functioning of biological neural networks. An individual neuron in an ANN receives a number of activations (numerical inputs). The weighted sum of the inputs is sent to a threshold function, which determines whether the neuron should 'fire' or not. The learning in an ANN is achieved by algorithms (such as backpropagation) that adjust the weights of the connections. In a way, one might regard the weights as the parameters of a connectionist system. It should be noted that this is a very different concept of parameter from that of the P&P framework.

A typical ANN (a backpropagation network) contains three layers. The input layer consists of neurons that receive the input. The input layer is fully connected to another set of neurons in the hidden layer, which in turn are connected to the output layer. Each neuron acts according to a simple threshold function depending on the inputs it receives and the weights of the input connections. Learning updates the weights of the connections between the input layer and the hidden layer, as well as between the hidden layer and the output layer. Multiple hidden layers are possible, but rarely used in practice (however, see the discussion of simple recurrent networks below). The function of the hidden layer or layers can be conceptualized as forming an internal representation of the problem.

Probably the most common example of a connectionist language acquisition system is the English past-tense learning system of Rumelhart and McClelland (1986), a backpropagation network which learns the past-tense forms of English verbs. An interesting aspect of the system is that it shows some of the errors children acquiring English make. The model mimics the changes in the rate and the nature of the errors that children make during language acquisition.
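As a concrete illustration of the architecture just described, here is a minimal sketch of a three-layer backpropagation network. The XOR task and all sizes are arbitrary illustration choices, and a sigmoid replaces the hard threshold, since backpropagation requires a differentiable activation function.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Toy task: XOR, the classic function a network without a hidden
    # layer cannot learn. Inputs are augmented with a constant 1 so the
    # corresponding weight acts as a bias term.
    X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)

    W1 = rng.normal(size=(3, 4))  # input (+bias) -> hidden weights
    W2 = rng.normal(size=(4, 1))  # hidden -> output weights
    lr = 0.5

    for _ in range(20000):
        # Forward pass: each unit computes a squashed weighted sum.
        hidden = sigmoid(X @ W1)
        output = sigmoid(hidden @ W2)
        # Backward pass: propagate the error and adjust the weights.
        d_out = (output - y) * output * (1 - output)
        d_hid = (d_out @ W2.T) * hidden * (1 - hidden)
        W2 -= lr * hidden.T @ d_out
        W1 -= lr * X.T @ d_hid

    print(output.round(2))  # should approach [[0], [1], [1], [0]]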

One of the weaknesses of standard backpropagation networks is that they cannot generalize beyond simple associations. However, the addition of a recurrent layer allows ANNs to learn relations between inputs over time. A particular type of recurrent network, the simple recurrent network (SRN, Elman, 1990), has been popular in modeling various cognitive phenomena. SRNs will be discussed further in the context of modeling segmentation in Section 5.2.1.

Connectionist systems have been instrumental in modeling many cognitive phenomena, and diverse aspects of language acquisition, such as learning phonotactics (e.g., Stoianov and Nerbonne, 2000), speech segmentation (e.g., Christiansen et al., 1998), and learning grammatical structure (Elman, 1991). A short but more comprehensive review can be found in Elman (2005).

However, there are also a number of criticisms against the use of connectionist systems in modeling cognitive processes. The main points of criticism against connectionist systems as models of language acquisition (or other cognitive functions) are that (1) it is not easy to interpret what the networks learn, and (2) the amount of input needed to train these networks is too large for simulating some aspects of language acquisition (e.g., word learning). It should also be noted that, contrary to the common view, connectionist systems are not free of inductive bias. Most of the inductive bias of a connectionist system is encoded in the network structure and the input/output representation. The weight-update rules are also subject to manipulation, and provide additional inductive bias.
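The following sketch shows the defining feature of an SRN: a context layer that feeds a copy of the previous hidden state back into the hidden layer, letting the network relate inputs across time. Training is omitted, and all sizes and inputs are made-up illustration values.

    import numpy as np

    rng = np.random.default_rng(1)

    n_in, n_hid, n_out = 5, 8, 5
    W_in = rng.normal(size=(n_in, n_hid))
    W_ctx = rng.normal(size=(n_hid, n_hid))  # context -> hidden weights
    W_out = rng.normal(size=(n_hid, n_out))

    def step(x, context):
        # The hidden layer sees the current input plus the previous
        # hidden state (the 'context' layer).
        hidden = np.tanh(x @ W_in + context @ W_ctx)
        output = hidden @ W_out      # e.g., a next-input prediction
        return output, hidden        # hidden becomes the next context

    context = np.zeros(n_hid)
    for t in range(3):               # feed a toy sequence of one-hot inputs
        x = np.eye(n_in)[t]
        output, context = step(x, context)
        print(t, output.round(2))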

2.2.3 Usage-based theories

A non-nativist theory of language acquisition that has been gaining popularity for the last decade or so is the usage-based theory of language acquisition (Tomasello, 2003, 2009). The emphasis in the usage-based theory is on language use in its social and communicative context. Unlike nativist theories such as P&P, which claim that the major part of linguistic knowledge is innately specified, usage-based theory posits that linguistic structure emerges from language use. According to usage-based theory, what children bring to the language acquisition process is their willingness to communicate, particularly the ability to read the intentions of others, and the ability to find patterns in the linguistic input. The usage-based approaches to language acquisition assume that children learn whole utterances as communicative units at the beginning. In time, the pattern-finding mechanism kicks in, and children start learning linguistic constructions, which enables them to use language creatively. This aspect of usage-based theory blurs the border between the lexicon and the grammar, which is another aspect that sets it apart from the nativist theories of language acquisition. Starting from fixed expressions, children develop increasingly general or abstract constructions by observing similar patterns in the input. Typically, it is assumed that children use distributional information and analogy as tools to achieve this, and the role of input frequency is emphasized.

Compared to the other frameworks described in this section, research in usage-based theories tends more strongly to rely on empirical studies. Even though there are several formal flavors of the theory (e.g., Croft, 2001; Goldberg, 2006), and explicit computational models inspired by the theory exist (e.g., Solan et al., 2005), the mainstream usage-based theory is less formally oriented than the frameworks previously reviewed. Since the usage-based theory is more empirically oriented, it tends to explain the observed empirical data better. However, because the usage-based theory is not as formally oriented as the others, it is difficult to verify its predictions formally, and it is less suitable for computational studies.

2.2.4 Statistical models

The last two decades have witnessed a surge of models of language acquisition that do not necessarily follow any of the general theories discussed above. These new models have been influenced by the developments and methods in the fields of computational linguistics and machine learning, typically using statistical learning methods.

Related to the use of statistical learning methods in modeling language acquisition, there is another debate, between symbolic and statistical approaches to learning. The nativist approaches to language acquisition traditionally downplay the role of statistics in language acquisition and use (e.g., Chomsky, 1957, p. 17). Researchers in the empiricist tradition, on the other hand, downplay the role of symbolic, language-specific representations and domain-specific learning methods. The connectionist systems summarized in Section 2.2.2 are essentially statistical learners; their representations are distributed over the weights, and are not easily interpretable. Even though the combination of statistical learning with symbolic representations is possible, the polarization in the field led researchers to focus on only one or the other.

Before nativism gained its dominant status in linguistics, statistics, in combination with structured, linguistically informed representations, was one of the tools of the structural linguists (e.g., Harris, 1951, 1955). The dominant status of the nativist approach in linguistics caused statistical learning methods to be largely ignored until the 1990s. Since then, many researchers have investigated the problem of language acquisition using statistical methods (see Abney, 1996, for a thorough discussion of this shift).12 Furthermore, unlike connectionist systems, these statistical approaches do not reject the use of symbolic/structured or domain-specific knowledge and mechanisms. The models presented in this thesis can be placed in this relatively new tradition. The learning strategies used in this thesis are general-purpose statistical methods. However, language-specific mechanisms and representations are used where they help modeling the phenomenon of interest.

12 This shift affected the nativist approaches to language acquisition as well. Statistical learning methods and frameworks are used by some recent studies with clear statements of nativism (Niyogi, 2006; Yang, 2002).

2.3 Summary and discussion

Language acquisition is a long and complex process interacting with many other cognitive and social phenomena. As a result, defining and testing a precise theory of the complete process of language acquisition is not (yet) an attainable goal. In practice, we put forward theories and create models of parts of the language acquisition process. The general theories of language acquisition discussed in the previous section do not actually describe a complete formal model of the language acquisition process in detail. Rather, they define some general principles and convictions regarding the nature of the problem. Precise models of a particular aspect of language acquisition subscribing to one of these theories tend to follow these principles and convictions. The merits of a general theory are not testable directly, but we can examine how well it explains smaller, more concrete aspects of language acquisition.

A few examples of particular questions of language acquisition, and a few representative examples of research into these questions, are listed below.

• How do children segment fluent speech into lexical units? (Saffran et al., 1996a, and many others reviewed in Chapter 4; a small simulation in this spirit is sketched below)

• Does exposure to isolated words help word learning? (Brent and Siskind, 2001)

• Is Bayesian learning plausible for explaining word learning? (e.g., Xu and Tenenbaum, 2007)

• Is there a difference in learning regular and irregular aspects of the language? (Marcus et al., 1992, among others)

• Is it possible to learn morphology (e.g., Goldsmith, 2001), or syntax (Klein and Manning, 2004) in an unsupervised fashion?

The research cited in this list uses a diverse set of methods, including psycholinguistic experiments, corpus analysis, and computational modeling and simulation. Nevertheless, each of them tries to answer a particular question by testing a specific and well-defined theory. It should also be noted that these questions are interesting in their own right, and the answers to these questions do not necessarily support or oppose any of the general theories of language acquisition discussed in Section 2.2. Even though it is common in the literature to suggest conclusions about the nature–nurture debate based on research done on a particular aspect of language acquisition, these conclusions are, in principle, problematic. Any model of a part of a long and complex process has to make some assumptions about the other parts of the process that are not modeled. The inductive biases assumed in these models do not warrant a nativist conclusion: the knowledge or mechanisms may have been learned in a previous step. On the other hand, even if a certain aspect of language acquisition can be learned with minimal assumptions that seem to warrant an empiricist conclusion, this does not disprove nativism, since it does not show that the complete language acquisition process can be modeled with the same assumptions.
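The first question in the list above already hints at the style of modeling pursued in this thesis. The following is a minimal sketch of segmentation by transitional probabilities in the spirit of Saffran et al. (1996a); the four 'words', the syllable stream, and the fixed threshold are all made up for illustration.

    import random
    from collections import Counter

    random.seed(0)

    # A toy unsegmented stream: 200 tokens of four made-up trisyllabic
    # 'words' in random order, split into two-character 'syllables'.
    words = ["tupiro", "golabu", "bidaku", "padoti"]
    stream = [random.choice(words) for _ in range(200)]
    syllables = [w[i:i + 2] for w in stream for i in range(0, 6, 2)]

    pairs = Counter(zip(syllables, syllables[1:]))
    firsts = Counter(syllables[:-1])

    def tp(a, b):
        # Transitional probability P(b | a): within words it is high
        # (here 1.0), across word boundaries it is low (about 0.25).
        return pairs[(a, b)] / firsts[a]

    # Posit a word boundary wherever the transitional probability dips.
    segmented, current = [], [syllables[0]]
    for a, b in zip(syllables, syllables[1:]):
        if tp(a, b) < 0.9:
            segmented.append("".join(current))
            current = []
        current.append(b)
    segmented.append("".join(current))
    print(segmented[:8])  # recovers tupiro, golabu, bidaku, padoti

With a noisier, more realistic corpus the fixed threshold would have to be replaced by, for example, a comparison with the neighboring transitional probabilities.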

This chapter started with a review of the nature–nurture debate, a debate that has been very influential in the language acquisition literature for the last 50 years. After reviewing various arguments for each side, I argued (1) that despite early victories announced by the supporters of one side or the other, the debate is open, and unlikely to be resolved soon, and (2) that even though it has had a central role in shaping language acquisition research, it says little about the way children learn languages, and keeping this debate as the main focus of research may even have counterproductive effects (Section 2.1.4). As a result, the modeling practice that will be followed in this thesis is agnostic about what is innate and what is learned, taking no side in the nature–nurture debate. Section 2.2 briefly reviewed four approaches to language acquisition, namely the principles and parameters theory, connectionist systems, usage-based theory and statistical modeling. In closing, a few specific questions about language acquisition and common methods of studying these questions were exemplified. The next chapter will discuss the computational approaches for modeling language acquisition.

3 Formal Models of Language Acquisition

Essentially, all models are wrong, but some are useful. Box and Draper (1986, p. 424)

Modeling natural phenomena is one of the basic activities in science. We build models of a natural phenomenon to gain better insight into the phenomenon being modeled, and to predict aspects of it that we do not yet know. Models are used in a wide range of research in science, but they are also used in many disciplines that are not research oriented. To name a few well-known examples: the Galilean model of the solar system in astronomy; the Bohr model of the atom in physics; animal models in medicine, used for studying human diseases in non-human animals; econometric models in economics; the formal models of atmospheric conditions used for weather forecasts; and the scaled physical models of bridges, cars, and other objects that are used frequently in engineering.

Similar to the models listed above, this thesis makes use of models of some aspects of language acquisition. The particular type of model that is interesting for modeling language acquisition is the computational model. In this chapter I will identify two methods of studying such models. The first method is the mathematical, or analytical, study of formal models of learning. The second method is using computational simulations, which is the primary method that this thesis follows. Besides introducing both methods of study, I will pay special attention to the interaction of these methods and the division of labor between them in studying models of language acquisition. Throughout the chapter some of the themes and discussions introduced in Chapter 2 will be revisited, and some that were left for this chapter will be discussed in detail.

The organization of this chapter is as follows. The next section will provide a brief overview of the use of models in science, and place the models of language acquisition we are interested in within this broader picture. After an introduction to well-known frameworks of computational learning theory in Section 3.2, Section 3.3 will relate these frameworks and later work in the field to the question of the learnability of natural languages. Section 3.4 provides an introduction to a particular classification of computational models that will be instrumental in interpreting the results of the models described in this thesis. Section 3.5 compares two methods of studying computational models, namely the method of formal analysis employed in computational learning theory, and computational simulations. Here, I will describe the interactions of the methods, and argue that in certain cases computational simulations provide easier and possibly better answers. Section 3.6 provides a summary.

3.1 Formal models in science

As in many practical disciplines, models are indispensable tools in science. There is no clear recipe for building and using models to understand the phenomena of interest, and the use of models in science still keeps the philosophers of science busy (Frigg and Hartmann, 2009). However, modeling is a well-attested method of studying a broad range of subjects in science.

An important fact often overlooked is that models are not equivalent to the phenomena they model. There are always some simplifying assumptions (for example, physicists frequently assume no friction). As long as the effects of these assumptions are taken into consideration, the model can be used to gain useful insights into the phenomenon being modeled. If the relation between the model and the real phenomenon is well understood, we can draw conclusions from experiments that we do on the model instead of in the real world. Such experiments on models are particularly useful if the experiment in the real world is not feasible for economic reasons (for example, space shuttle launches) or ethical reasons (for example, testing the effects of a nuclear plant meltdown). Similarly, if our knowledge about language acquisition allowed us to build models that we are confident about, we would be in a better position to answer some questions that we otherwise cannot test with direct experiments, for example, determining the validity or nature of the critical periods discussed in Section 2.1.3.

We do not always need a highly tested model with a well-understood relationship to the phenomenon it models. Even with relatively naive models, the formal modeling effort may still provide further insight into the problem by forcing us to spell out our assumptions and theories1 about the phenomenon exactly, by systematic testing of alternative models, and sometimes by testing the model's predictions with the help of real-world experiments.

In modeling language acquisition phenomena, we are particularly interested in computational models. In other words, we model a particular cognitive process by a computational process. Not surprisingly, this is one of the main tenets of cognitive science (see e.g., Miller, 2003). Once we have a formally defined model of the phenomenon we are interested in, an obvious method of working with it is to investigate the relevant questions by analytical methods. A mathematical result (e.g., a formal proof) answering a certain question about the model may help us transfer this knowledge to the modeled phenomenon for making predictions or understanding it better. However, in many cases finding analytical solutions is difficult, either because of the complexity of the model, or because of aspects of the phenomenon that are difficult to formalize. A possible solution to these problems is to make some simplifying assumptions at the cost of reduced correspondence between the model and the phenomenon. An alternative approach is to use computational simulations. Computational methods can be used to obtain solutions to certain problems iteratively, where analytical solutions are not available. More importantly, however, computational simulations allow certain aspects of the phenomenon to be modeled using weaker assumptions. A relevant example we will discuss is the input to the language learner. An analytical solution would require strong idealizations, such as well-formed strings generated by a formal grammar with a probabilistic error rate. A computational simulation, however, can take a less idealized definition of the input: the transcription of utterances recorded during child–parent communication.

In the formal study of language acquisition, the analytical results come from the field of computational learning theory. Early results from computational learning theory, in particular the seminal work of Gold (1967), have been influential in the language acquisition literature. On the other hand, in parallel with the developments in natural language processing and machine learning, computational simulations have been gaining popularity for the past two decades (see Brent, 1996; MacWhinney, 2010, for snapshots of the state of the art in computational simulations of language acquisition).

1 The distinction between the terms theory and model is in general unclear (see, for example, Frigg and Hartmann, 2009), and their use in different disciplines varies. In linguistics, the term theory tends to refer to a relatively general set of principles with some underspecified elements. I will follow the same use in this thesis, while using the term model only for a precisely spelled out representation of the phenomenon of interest.

3.2 Computational learning theory

Computational learning theory, or learning theory for short, explores the limits of learnability. The field began with the seminal work by Gold (1967), in which language acquisition was the central motivation. Not surprisingly, the results from learnability theory have played an important role in the study of language acquisition. These results have often been used for arguing against the learnability of natural languages. Unfortunately, the results from learning theory have also often been misinterpreted and misused (see Clark and Lappin, 2011b, for elaboration).

The typical application of results from learning theory in the language acquisition literature relates to their implications for learning syntax. We assume that learning the syntax of a certain language amounts to inducing a mental grammar from the available input. Since we do not know exactly how grammars are represented in the human brain, we use formal grammars that represent certain aspects of natural language syntax adequately. In addition, since children are capable of learning any natural language, we expect all possible mental grammars to share some features, forming a class of grammars. However, since our knowledge is far from characterizing the class of natural languages exactly, we turn to formal classes of grammars whose members are expressive enough to capture the syntax of all known natural languages.

[Figure 3.1 shows the Chomsky hierarchy as nested classes, from largest to smallest: recursively enumerable, context-sensitive, context-free, and regular languages.]

Figure 3.1: Chomsky hierarchy of language classes. The dashed ellipse represents so-called mildly context-sensitive languages, a subset of context-sensitive and a superset of context-free languages, which is believed to be adequate for representing natural languages. The classes are also known as type-0 to type-3, from the largest class to the smallest.

This section presents the necessary formal apparatus for interpreting the results from learning theory. To make the discussion accessible to a wider audience, it is intentionally kept informal, despite the fact that computational learning theory is a formal field of study. Formal and more comprehensive discussions of learning theory can be found in a number of other sources (e.g., Clark and Lappin, 2011a,b; Jain et al., 1999; Kearns and Vazirani, 1994; Osherson et al., 1984). In the remainder of this section, I will give a brief informal review of two popular frameworks of learnability in relation to the problem of language acquisition. Before starting the discussion of formal models, however, a short digression to a related concept, the hierarchy of language classes defined by Chomsky (1959a), is necessary.

3.2.1 Chomsky hierarchy of languages

The Chomsky hierarchy of languages is one of Chomsky's most important contributions to both linguistics and computer science (Chomsky, 1959a). This hierarchy defines a set of formal language classes, depicted in Figure 3.1. Each class in this hierarchy is a proper subset of the larger class. The larger classes are more expressive, but their computational processing is more demanding. This particular classification of languages has some attractive formal properties, and each class corresponds to a certain type of abstract computational device that is capable of recognizing and generating the languages in the corresponding class. The details of these formal language classes and the abstract machines are not important for our purposes here. They are a well-established part of the theory of formal languages, and described in detail in textbooks such as Hopcroft et al. (2001) or Davis et al. (1994).

Two aspects of this hierarchy are crucial for our discussion here. First, even the smallest class in this hierarchy, the regular languages, includes an infinite number of finite and non-finite languages. That is, a regular language is a set of either a finite or an infinite number of strings (e.g., sentences), and the class of regular languages contains infinitely many such languages. Second, the class that is considered adequate for representing all (known) natural languages is a subset of the context-sensitive languages called the mildly context-sensitive languages (the dashed ellipse in Figure 3.1).2

Formalizing languages in this manner has proven to be fruitful both in linguistics and computer science, and this classification is central to the formal study of the learnability of languages. However, I want to close this brief description with two cautionary notes that we will return to in Section 3.3: (1) the Chomsky hierarchy is only one of many ways of classifying formal languages; there are many other ways to define similar hierarchies on formal languages that cross-cut this classification. (2) The Chomsky hierarchy is a hierarchy of formal languages. Even though it has been an important tool in the study of language, none of these classes exactly matches the class of natural languages.

2 Initially, context-free languages were considered to be adequate for representing natural languages. However, a small number of linguistic constructions cannot be represented using context-free languages. A typical example is the cross-serial dependencies found in Dutch (Bresnan et al., 1982) and Swiss-German (Shieber, 1985). Besides being able to represent these constructions, formalisms with higher than context-free power are often used when they match linguistic intuitions better.
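A small sketch may make the expressiveness steps tangible. The language (ab)^n below is regular and can be recognized by a regular expression, while a^n b^n is context-free but not regular: recognizing it requires unbounded counting, which finite-state devices lack. The example languages are, of course, toy choices.

    import re

    def in_regular_lang(s):
        # (ab)^n: recognizable by a finite-state device / regex.
        return re.fullmatch(r"(ab)*", s) is not None

    def in_context_free_lang(s):
        # a^n b^n: the recognizer must count the a's to match the b's,
        # which no finite-state device can do for unbounded n.
        n = len(s) // 2
        return s == "a" * n + "b" * n

    for s in ["abab", "aabb", "aab"]:
        print(s, in_regular_lang(s), in_context_free_lang(s))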

3.2.2 Identification in the limit

The most popular results from learning theory in the language acquisition literature are from Gold (1967). The framework of learning introduced by Gold (1967) is called identification in the limit (IIL). Although many developments have been suggested since, the building blocks of this framework are still used in many studies in learning theory. Furthermore, this paper is the most cited learning theory work in the language acquisition literature. In this framework, the learning task is viewed as identifying the correct language among a class of languages that the target language belongs to.

• The learner is presented with grammatically correct sentences from a target language he, she or it is supposed to identify.

• The learner knows the set (class) of languages that the target language is drawn from. It can test whether a given string belongs to any of these languages or not. However, it does not know which language is the target one.

• The learner receives one input sentence at a time. The only restriction regarding the presentation of the input is that all grammatical sentences have to be presented at least once. Repetitions are allowed, and the presentation of any grammatical sentence can be delayed indefinitely.

• After every input sentence, the learner re-evaluates his or her decision and selects a possibly different language from the class of languages.

• The learner needs to identify the target language exactly, and after it identifies the target language, no grammatical sentence should change its mind.

• The learner is expected to identify the target language with a finite amount of input sentences. However, there is no bound on the number of input sentences.

The most influential result from Gold (1967) is that even the smallest class in the Chomsky hierarchy is not identifiable in the limit from positive input only. More precisely, if the target class of languages includes all finite languages and one non-finite language, the class is not identifiable in the limit from positive input only. This is the point in the argumentation where many not-formally-oriented researchers get lost, ready to accept the conclusion that this means natural languages are not learnable. I will return to the interpretation of this finding in the context of language acquisition in Section 3.3. Here, I will give an informal example through which the construction and the consequences of this result can, hopefully, be understood better.

The first thing to note about IIL is that it is not specific to language acquisition. Classes of grammars can describe any sets of objects; languages, as a result, are nothing but sets of objects, not necessarily sentences as we take them to be in linguistics. Observing that, we formulate the following problem: at a boring cocktail party, a mathematician invites a linguist friend to play a game, and explains the rules. The mathematician picks a set of natural number sequences, something like 'all sequences formed by even numbers', and presents example sequences to the linguist, whose task is to guess the set of sequences that the examples are drawn from. To win, the linguist needs to find the correct set of sequences at some point, and no matter which sequence from the correct set is presented afterwards, he should not change his mind. Despite not being keen on games with numbers, with some hope that the game may have some linguistic consequences, the linguist accepts the challenge. The mathematician presents the following sequences:

• 7, 11, 13, 17

• 5, 7, 11, 13

• 13, 17, 19, 23

What is the best bet that the linguist can make at each step? As most mathematicians would be tempted to suggest, is 'ordered sequences of consecutive prime numbers' a good hypothesis? The answer is no; this is too general given the evidence. The target can, for example, be the sequences of odd prime numbers (prime numbers except two), or of ordered but not necessarily consecutive odd numbers, or just the set of sequences that were given so far. If the linguist prefers a conservative strategy, the last hypothesis is the right way to go. However, if the target set is the set of ordered sequences of consecutive prime numbers, then he will never guess it using the conservative approach: there are infinitely many finite subsets of the target set. On the other hand, if the linguist chooses a more general hypothesis, e.g., sequences of prime numbers, then there are infinitely many sets of sequences that are compatible with his hypothesis. If any of these is the target, the linguist will have no reason to narrow down his choice, and he will never guess the target hypothesis.
As a result, there is no guaranteed way for the linguist to guess which set of numeric sequences is the correct one. He cannot identify, or learn, the underlying set of sequences the input comes from. As this informal example demonstrates, the IIL framework does not model language acquisition narrowly, but much more general learning settings. The negative IIL result just demonstrated is due to the fact that the target class (the class of sets of number sequences) we chose had at least one infinite set (the set of sequences formed by ordered, consecutive prime numbers) and an infinite number of finite sets (particularly, all finite subsets of the infinite set). Since all linguistically interesting classes of languages in the Chomsky hierarchy have this property, they are not identifiable in the limit. A minimal simulation of the linguist's conservative strategy is sketched at the end of this subsection.

Moreover, IIL has a number of other problems when applied to modeling language acquisition. The following list summarizes these problems; some of them (items one through three) had already been raised by Gold (1967).

1. The class of natural languages is likely to be much smaller than the language classes studied by Gold (1967).

2. The child may receive (indirect) negative evidence.

3. In IIL, the learner is expected to learn from any sequence of input strings, including ones that are intentionally designed to trick the learner. The input children receive follows a much more restricted distribution.

4. The identification criterion is too restricted. All speakers of a certain language do not necessarily share the exact same grammar.

5. Accepting identification in the limit as success is unrealistic. Even if a class of grammars is identifiable in the limit, it may require an unrealistically large number of input sentences that cannot be observed in the time frame available for children to acquire the language.

The subsections that follow will discuss some of these problems, and review some of the solutions suggested in the literature.
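The sketch below implements the conservative strategy from the game: the hypothesis is always exactly the set of sequences seen so far. It is a toy illustration only; the point is that against an infinite target, every guess remains a finite subset, so the learner changes its mind forever.

    def conservative_learner(examples):
        # Guess, after each example, exactly the set of sequences seen
        # so far: the safest hypothesis consistent with the data.
        hypothesis = set()
        for seq in examples:
            hypothesis = hypothesis | {seq}
            yield sorted(hypothesis)

    # The mathematician's presentation from the text above.
    examples = [(7, 11, 13, 17), (5, 7, 11, 13), (13, 17, 19, 23)]
    for guess in conservative_learner(examples):
        print(guess)

    # If the target is the infinite set of all runs of consecutive
    # primes, the teacher can always present a run not yet in the
    # hypothesis, so the learner revises indefinitely and never
    # identifies the target.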

3.2.3 Probably approximately correct learning

Probably approximately correct (PAC) learning (Valiant, 1984) is another popular framework in the learning theory literature. The PAC framework differs from IIL in three major ways. First, the probably correct part of the criterion means that the learner is not required to learn from all possible input sequences; if the learner fails to learn from input sequences that have a low probability, this does not count as a failure. Second, the approximately correct part of the criterion relaxes the exact identification requirement of IIL: in PAC learning, it is enough for the learner to converge to a language that approximates the target language with a small error. The last major difference of PAC learning from IIL is that it requires learning to succeed with a bounded amount of input.

These properties make PAC learning more suitable for many learning applications. Furthermore, it turns out that the PAC-learnability of a problem can be reduced to a combinatorial measure called the Vapnik-Chervonenkis dimension (VC dimension, Vapnik and Chervonenkis, 1971): if the hypothesis space has a finite VC dimension, it is learnable in the PAC framework. This makes it easy to prove learnability results for certain applications. As a result, the PAC framework has been instrumental in developing and testing machine learning methods. Detailed explanations of the PAC learning framework and the VC dimension can be found in textbooks such as Kearns and Vazirani (1994). For our purposes here, it is important to note that the VC dimension can be characterized only with labeled input; for example, in the case of learning syntax, all sentences have to be labeled as grammatical or ungrammatical.

The improvements listed above make PAC learning more suitable for many practical learning scenarios than IIL. However, it is still difficult to arrive at conclusions regarding child language acquisition based on models studied in this framework. I will briefly mention the shortcomings of the framework for modeling language acquisition; a comprehensive discussion of the PAC framework from the perspective of the language acquisition problem can be found in Clark and Lappin (2011b, chapter 5). The PAC framework allows the learner to fail on some unlikely input sequences, but it does not restrict the input distribution. In language acquisition, one expects the distribution of the input to have certain restrictions. With respect to the efficiency of learning, the PAC framework imposes bounds on the input required: the learner is expected to converge using an amount of input proportional to some polynomial function of the complexity of the class of concepts, e.g., grammars, to be learned. However, how this bound relates to the child language acquisition timeline is rather unclear, and in general, a polynomial bound on the input does not necessarily guarantee efficient learnability. I will return to these limitations in Section 3.3.

Even though I pointed out that the PAC framework is not perfectly suitable for modeling language acquisition, the question whether the classes of languages in the Chomsky hierarchy are learnable in the PAC framework is interesting in its own right. The answer is no. This can be concluded easily from the fact that the class of all finite languages (a subset of the regular languages) has infinite VC dimension (see, for example, Niyogi, 2006, p. 79, for a more formal statement of this result).
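The notion of shattering behind the VC dimension can be illustrated with a brute-force check on a toy hypothesis class. In the sketch below the hypotheses are threshold classifiers on the real line (label a point positive iff x >= t); this class can shatter any single point but no pair of points, so its VC dimension is 1. The class and the test points are illustration choices, not anything from the learning theory literature on grammars.

    def labelings(points, thresholds):
        # All labelings of 'points' realized by classifiers x >= t.
        return {tuple(x >= t for x in points) for t in thresholds}

    def shattered(points):
        # A point set is shattered if every one of the 2^n labelings is
        # realized. For threshold classifiers it suffices to try one
        # threshold below all points, one between each pair of
        # neighbours, and one above all points.
        pts = sorted(points)
        ts = [pts[0] - 1.0]
        ts += [(a + b) / 2.0 for a, b in zip(pts, pts[1:])]
        ts += [pts[-1] + 1.0]
        return len(labelings(points, ts)) == 2 ** len(points)

    print(shattered([3.0]))        # True: one point gets both labels
    print(shattered([1.0, 2.0]))   # False: (True, False) is unrealizable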

3.3 Learning theory and the learnability of natural languages

In Chapter 2, I discussed the arguments against the learnability of natural languages, and argued that most of these arguments are informal in nature, and difficult to settle. However, it is common to argue for a nativist standpoint based on the results from learning theory. A recent example from a highly influential article by Hauser, Chomsky and Fitch (2002) suggests that

. . . there are in principle infinitely many target systems (potential I-languages) consistent with the data of experience, and unless the search space and acquisition mechanisms are constrained, selection among them is impossible. A version of the problem has been formalized by Gold (100) and more recently and rigorously explored by Nowak and colleagues (72–75).

The argument, in plainer words, is that the results of Gold (1967) and subsequent research in learning theory support the claim that human languages are not learnable (unless the learner is constrained in certain ways).3 This is just one of many examples of similar claims (Clark and Lappin, 2011b, present many more similar quotes from the literature). In this section, the question that I would like to return to is: do results from learning theory support one of the positions in the nature–nurture debate?

Section 3.2.2 pointed out that, valuable as it is as early work on learnability, the IIL framework of Gold (1967) is not a good model of practical learning. Later work in learning theory, such as the PAC framework, improves on some of the unrealistic assumptions of IIL, but is not unproblematic for modeling language acquisition either. Below, I will revisit these problems, and review some of the more relevant work from the learning theory literature.

The first problem is that IIL requires exact convergence to the target. In child language acquisition, this translates to the requirement that the child learns exactly the same grammar as his or her parents use. This raises the question of which parent, since it is no surprise that the child will generalize to a grammar somewhat different from that of the adults he or she communicates with. Furthermore, it is reasonable to expect that our grammars fluctuate even in adulthood. By analyzing Queen Elizabeth II's Christmas messages over a thirty-year period, Harrington et al. (2000) demonstrate that even people whose language is taken as a reference are subject to this fluctuation. This problem is not peculiar to language acquisition: in many other learning tasks, approximately correct learning is what we are happy to accept, and the PAC framework acknowledges this. In PAC learning the learner is expected to make a small number of mistakes, and the error rate is required to drop as more input is provided.

3 Incidentally, the research by Nowak and colleagues cited in this quote (Nowak and Komarova, 2001; Nowak et al., 2000, 2001, 2002) is concerned with the evolutionary conditions that would lead to a communication system with combinatorial syntax. These studies touch on some of the issues from learning theory, but they are not the most representative work from learning theory after Gold (1967).

A second problem is related to the efficiency of learning. The IIL framework does not impose bounds on the amount of input and time required for learning. This means that even if we have positive IIL results, if we care about learning in limited time with limited exposure to input, we cannot conclude that the result is relevant to child language acquisition. The PAC framework requires learning to succeed with an amount of input data proportional to the complexity of the class being learned. More precisely, the input is bounded by a polynomial function of the representation size of the target to be learned. This is an improvement over IIL. However, it should be noted that the adequacy of this bound depends on the learning problem at hand. The amounts of input indicated by different polynomial functions cover a wide range. For some problems any polynomial bound, even one given by a high-order polynomial, may be adequate; in other cases, the amount of input required by even a modest polynomial bound may not be available. Furthermore, the bound imposed by the PAC framework is on the input: it determines the informational complexity of the problem. Even if the informational complexity is low, learning may only be possible with computational resources, such as memory or processing power, that are not available to children. As a result, learnability results in either framework do not guarantee efficient learnability.

Another problem with both frameworks is the way they formalize the input. In IIL, the learner is expected to learn from any valid sequence of inputs, including sequences designed to mislead the learner. Clark and Lappin (2011b) discuss a made-up example of a child whose only input is repetitions of 'shut up'. Surely, no one would be surprised if the child failed to learn to speak by the age of three.4 The PAC framework does not require learning in such anomalous cases. However, it still requires learning to be distribution-free: the learner is expected to learn from any distribution of grammatical input sentences. This includes arbitrary distributions, such as samples of sentences of length six or more, or samples of sentences from the scientific literature. A model's assumptions about the input are important for its correspondence with the real-world learning experience. Both the standard IIL and PAC frameworks, in this respect, fail to constrain the input to a relevant set. The characterization of the input is important for the purposes of this dissertation, and I will revisit the problem in detail, and review some of the solutions offered in the literature, in Section 3.3.2.

Besides these inherent problems of the standard learning frameworks, the above discussion points out another important factor for the formal modeling approach: the way we model the learning setting matters. To be able to define a learning problem formally, we need to define at least two aspects of the problem:

1. The object to be learned.

2. The learning environment, particularly the input provided to the learner.

4 If we follow IIL strictly, though, we should not be worried. Three years may be too early, but all is fine as long as he or she acquires the language at some age less than infinity.

To prove that languages are learnable under these assumptions, we need to define an algorithm with an acceptable time and space complexity that learns the target object from the available input. A negative result can be obtained by proving that no such algorithm exists. The next two subsections discuss typical choices made in the learning theory literature with respect to these two aspects of language acquisition.

3.3.1 The class of natural languages and learnability

While introducing the Chomsky hierarchy in Section 3.2.1, I stressed that despite its usefulness in many areas of research, this hierarchy is most interesting because of its formal properties, rather than its relevance to human languages. Even though the class of mildly context-sensitive languages is considered to be adequate for representing the syntax of human languages, this does not mean that it is the only adequate class. There are many other ways of defining formal language classes with this property. Grammar formalisms such as lexical functional grammar (LFG, Bresnan, 2001), head-driven phrase structure grammar (HPSG, Pollard and Sag, 1987), and combinatory categorial grammar (CCG, Steedman, 2000) define examples of such classes. These formalisms deal with most (if not all) known syntactic phenomena in natural languages in their own way, but they are not equivalent.5

Besides the fact that there are many adequate formalisms, not all languages in the class of mildly context-sensitive languages, or even in the class of regular languages, are possible natural languages. For example, by definition, all finite languages are regular. We could thus define a long but finite language based on, say, a cryptographic function over the words in the lexicon; by virtue of being finite, this language is a regular language. Since both the IIL and PAC frameworks require all languages in the class to be learnable, we would require the learner to be capable of learning this type of language as well. However, we have no reason to assume that a language defined as a long but finite set whose elements are random sequences of words is learnable by humans. Taking these facts into consideration, if we could properly formalize the class of human languages, its relationship with the classes in the Chomsky hierarchy would not be a proper inclusion relation. Rather, we would expect a proper formalization of the class of human languages to cross-cut the Chomsky hierarchy. Figure 3.2 depicts the place of the class of human languages in the Chomsky hierarchy.

The conclusion so far is that the mismatch between the class of human languages and the classes in the Chomsky hierarchy makes the validity of arguments that depend on results obtained on the Chomsky hierarchy, at best, doubtful. The question is, then, whether there are other, linguistically interesting, classes of languages that are learnable. The answer to this question is positive; there are numerous

5 Some of these grammar formalisms are overly expressive; they require the power of Turing machines. However, the main point for our discussion is that these formalisms, and variations of them, do not necessarily fit into the Chomsky hierarchy.

[Figure 3.2 repeats the nested classes of Figure 3.1 (recursively enumerable, context-sensitive, context-free, regular), with a shaded area cross-cutting the hierarchy.]

Figure 3.2: The Chomsky hierarchy presented in Figure 3.1 with an indication of the place of the human languages (the shaded area) in this hierarchy.

positive results in the learning theory literature on interesting classes of languages. For example, using a probabilistic version of IIL, Horning (1969) proved that the class of probabilistic context-free languages is learnable. The results obtained by Shinohara (1994) and Kanazawa (1996) are examples of positive IIL results achieved by imposing rather reasonable constraints on more general grammar formalisms. For example, Kanazawa (1996) presents a positive IIL result for categorial grammars (a grammar formalism weakly equivalent to context-free grammars) if the number of categories assigned to a word has a finite bound. Another relevant strand of work by Angluin (1982, 1988a,b) presents positive results, using modified versions of IIL, on certain subsets of regular and context-free languages. The language classes studied by Angluin in these works are not powerful enough to represent human languages. However, more recently, a similar line of research by Clark and Eyraud (2007) and Clark (2010) has presented positive results for increasingly complex classes of languages.

The positive results obtained in some of these studies (for example, Horning, 1969; Kanazawa, 1996; Shinohara, 1994) are proofs about highly expressive sets of grammars. However, these are proofs of learnability in principle; they do not prove efficient learnability (see, for example, Costa Florêncio, 2003, for an analysis of Kanazawa's results). The line of research following Angluin (1982) on increasingly complex subsets of the language classes in the Chomsky hierarchy, on the other hand, provides proofs of efficient learnability. Admittedly, these proofs concern rather restricted classes of grammars, and even the classes analyzed in the latest studies (e.g., Clark, 2010) are probably too small to capture all syntactic phenomena in natural languages. However, this approach to grammar learning, i.e., defining a restricted class of languages and finding algorithms that learn the class efficiently, provides insights regarding what is learnable and which procedures are efficient at learning it.

Rather than relying on a formal characterization of the language class, another common nativist argument is formulated as follows: if the set of possible human languages is infinite, then choosing the correct language among them is impossible (for example, the quote on page 31). Note that the existence of infinitely many languages in a language class is not enough by itself to conclude that the class is not learnable according to IIL. This has to be accompanied by the further condition that some of the languages in the class are infinite. The claim that human languages contain infinitely many possible sentences serves this purpose.6 Indeed, these two conditions are enough to conclude that the class is not learnable in the IIL framework. However, they do not necessarily make the problem unlearnable in the PAC framework: neither an infinite number of sentences, nor an infinite number of hypotheses, entails an infinite VC dimension. In any case, as argued throughout this chapter, the learnability results obtained using the standard frameworks are far from conclusive. To be able to arrive at any conclusions, we need to model the language acquisition process in more detail.

6 That natural languages are infinite is a widely held assumption in the linguistic literature. However, it is not uncontested; see Pullum and Scholz (2010) for a critical treatment of this assumption.

3.3.2 The input to the language learner

In a formal model of learning, the characterization of the input affects our conclusions. Not surprisingly, the nativist claims of unlearnability generally come together with certain assumptions about the inadequacy of the input. This argument, known as the argument from the poverty of the stimulus (APS), has been the main motivation for many nativist theories of language acquisition, and a source of active discussion in the nature–nurture debate. I will only provide a selective discussion of the APS here. A nativist introduction can be found in Cowie (2010); in their critique of the argument, Pullum and Scholz (2002) also summarize various versions of the APS claim found in the literature; and a book-length treatment of the subject can be found in Clark and Lappin (2011b).

The APS claims in the literature come in a number of forms. First, it is claimed that the language children hear is 'degenerate in quality' (Chomsky, 1965, p. 31), which makes the learning task more difficult. Empirical evidence suggests that this claim is wrong: child-directed speech seems to contain few errors, and adults seem to adjust the complexity of their utterances to the child's level of language (Snow, 1972). Second, it is claimed that the evidence required for learning certain types of grammatical constructions is not available in the input. Pullum and Scholz (2002) show through corpus analysis that the evidence that is claimed to be missing can be found in real language data if one looks closer. Furthermore, a large number of computational simulations show that with modest assumptions about the learner's nature, these grammatical constructions can be learned without the evidence claimed to be necessary (e.g., Clark and Eyraud, 2006; Yao et al., 2009). Finally, it is claimed that children do not receive negative evidence. This is based on the observation that children only hear grammatical utterances (with some noise) in their language.

Nobody gives them examples of 'what not to say'. A possible source of negative evidence is the corrections children receive for the mistakes they make. However, it has been argued that children receive no reliable corrections for their grammatical mistakes, and when they do, they do not seem to pay attention to the corrections (Marcus, 1993; Pinker, 1989). This claim is not without controversy either (for example, Chouinard and Clark, 2003); however, the assumption that children do not get negative evidence is more widely accepted than the assumptions of degenerate and insufficient input.

The APS claim about the lack of negative evidence is used in connection with the result from Gold (1967) that even regular languages are not IIL from positive evidence only. With negative as well as positive input, powerful classes of languages become IIL. For example, the proof that the class of recursive languages (a superset of the context-sensitive languages) is IIL from positive and negative input was presented in the same article by Gold (1967). The standard PAC framework assumes the availability of both negative and positive evidence, and the learnability results from the IIL framework cannot be carried over to the PAC framework easily. However, it is clear that the availability of both positive and negative evidence facilitates learning. We may follow the common assumption that there is no direct negative evidence in the input to children; but is there another source of information that may function as negative evidence? If we can identify such an information source by formal means, then it is an empirical question to verify its use in language acquisition. The following is a selection of such sources of information discussed in the literature.

3.3.2.1 Queries

A group of formal studies allows the learner to query additional facts about the language. I will briefly mention two well-known query methods from the learning theory literature (due to Angluin, 1987, 1988a). The first, equivalence queries, allows the learner to verify his/her guess about the target language. After every input utterance, the learner guesses the language (by, e.g., writing down a set of grammar rules, or naming the language). The learner is allowed to query the validity of this guess (e.g., by asking 'is it English?'). If the guess is the target language, it is confirmed. If it is wrong, a counterexample on which the guess and the target language disagree is provided (for example, a string that is in the target language but not in the learner's hypothesis). The second, membership queries, allows the learner to generate a string and check its validity. Membership queries are interesting, since they model a more actively communicating learner. It is possible to obtain efficient learnability results for larger classes of languages under these assumptions. However, both query methods are of limited value for modeling the child language acquisition setting.7

7 Clark and Lappin (2011b, chapter 6) remark, however, that membership queries can, at least in principle, be replaced by probabilistic data. This may mean that, under the assumption that the input to the child is probabilistic, learning theory results with membership queries might represent child language acquisition better.
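To make the two query types concrete, the following toy sketch (mine, not from the formal literature) shows the interface a query learner is assumed to have. A small finite set is a hypothetical stand-in for the oracle of a real target language.

from typing import Optional, Tuple

# Hypothetical finite "target language" standing in for the oracle.
TARGET = {"a", "ab", "abb"}

def membership_query(string: str) -> bool:
    """Membership query: is the string in the target language?"""
    return string in TARGET

def equivalence_query(hypothesis: set) -> Tuple[bool, Optional[str]]:
    """Equivalence query: is the hypothesis the target language?
    If not, return a string on which the two disagree."""
    disagreements = TARGET ^ hypothesis   # symmetric difference
    if not disagreements:
        return True, None
    return False, next(iter(disagreements))

correct, counterexample = equivalence_query({"a", "ab"})
print(correct, counterexample)    # False 'abb'
print(membership_query("abb"))    # True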

3.3.2.2 Probabilistic input

Language data is probabilistic. Even though a precise characterization may be difficult, linguistic units follow certain probability distributions. For example, the Zipfian distribution (Zipf, 1935/1965; see also Miller, 1996) is a time-tested distribution that models the distribution of words in linguistic data quite well. The language children are exposed to during the acquisition process is no exception. Furthermore, an increasing number of studies in psycholinguistics show that children make use of statistical information in the input (e.g., Saffran et al., 1996a; Thompson and Newport, 2007).

The availability and awareness of probabilistic input may compensate for the need for negative evidence in the formal models described above. In other words, probabilistic data may serve as indirect negative evidence. We have already seen that with probabilistic assumptions the class of context-free languages is IIL (Horning, 1969). In general, from a formal perspective, probabilistic data allows stronger learnability results (see Clark and Lappin, 2011b, chapter 6, for a comprehensive discussion).8 Interpreting lack of evidence as negative evidence can also explain some language acquisition phenomena. For example, even though infants up to six months old seem to be sensitive to sound contrasts that do not exist in their language, they lose this sensitivity around a year of age (Werker and Tees, 1984). This is an indication that the unavailability of evidence is used by children as a source for generalization.

It is clear that input, and especially its probabilistic distribution, is modeled poorly by the formal models in the literature. Most results in the literature are based on distribution-free learning, which is clearly unrealistic for the child language acquisition case. There has been some work on restricting the possible distributions (e.g., Clark and Thollard, 2004). However, it is difficult to choose justifiable distributions for complex aspects of language data, and the choices of distributions, or families of distributions, used in the literature tend to be arbitrary. I will return to this discussion in Section 3.5, and argue that computational simulations offer a better method of modeling the input for child language acquisition: using real-world linguistic data.
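As an illustration of the distributional claim at issue, here is a minimal sketch (mine; the input file name is hypothetical) that checks Zipf's law on a word-frequency list. Under the Zipfian distribution, frequency is roughly proportional to 1/rank, so the product rank × frequency should stay roughly constant across ranks.

from collections import Counter

# Hypothetical input: a plain-text sample of child-directed speech.
words = open("childes_sample.txt", encoding="utf-8").read().lower().split()
counts = Counter(words)

# Under Zipf's law, rank * frequency stays roughly constant.
for rank, (word, freq) in enumerate(counts.most_common(15), start=1):
    print(f"{rank:3d}  {word:12s}  freq={freq:6d}  rank*freq={rank * freq}")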

3.3.3 Are natural languages provably (un)learnable?

This section reviewed a number of formal studies on the learnability of languages that relate to the question: based on results from learning theory, are natural languages provably learnable or unlearnable? The short answer is that there is no definitive proof for either position. The usual nativist claim that natural languages are unlearnable based on the results of Gold (1967) is unwarranted. To arrive at this conclusion we need to assume:

8 In computer science, it is well known that probabilistic assumptions often make otherwise hard computational tasks easier. Some problems for which no efficient deterministic algorithm was known at the time, for example testing whether a number is prime, are solvable (with high probability) in polynomial time on probabilistic Turing machines (Solovay and Strassen, 1977).

1. All users of a language share the same grammar, and children are required to learn this grammar exactly.
2. The input and time available to the child are unlimited.
3. The target grammar to be learned is a grammar from the Chomsky hierarchy (such as a mildly context-sensitive grammar).
4. Apart from their grammaticality, there is no restriction on the input utterances children receive. Children should learn the language even from misleading input sequences, as long as they are grammatical.

Some of these assumptions have been relaxed, and arguably drawn closer to the child language acquisition setting. However, the review provided in this section points out that we are still a long way from a reasonable formal characterization of the child language acquisition process. The formalization of the last two components, the class of natural languages and the input to the language learner, is particularly difficult.

First, we do not know how languages are represented in the human brain. Despite centuries of hard work, even for well-studied languages like English, the descriptive grammars we come up with are far from complete. As Sapir (1921) famously put it, 'all grammars leak'. That being said, a particular strand of work in learnability theory that seeks provably and efficiently learnable grammars (e.g., Angluin, 1982; Clark, 2010) may be very fruitful in finding good candidate grammar classes, or learnable properties of these classes. Other things being equal, i.e., if there is no empirical evidence against them, provably learnable representations should be favored for modeling language acquisition.

Second, formalizing the linguistic input is another big challenge for formal models. Linguistic data tends to be complex and messy (Halevy et al., 2009), and it is difficult to develop simple and mathematically attractive models of all its aspects. Expressing language data with formulas is, indeed, difficult, but we are not helpless. We can model the input to the child using real examples of child-directed speech, and study these models using computational simulations.

3.4 What do computational models explain?

Like Chapter 2, this chapter may appear to be dominated by the nature–nurture debate. This is partially due to the central position this debate occupies in the language acquisition literature. But it is also because some formal arguments, such as the argument from the poverty of the stimulus discussed in Section 3.3.2, are answered best with computational models. However, computational models can, in principle, answer more questions than the question of innateness. A selection of these questions was listed in Section 2.3. In this section I will briefly describe a general classification of models according to the type of questions they are designed to answer.

In modeling cognitive phenomena (and many other natural phenomena), we seek answers at different levels. The distinction made by Marr (1982) has been an influential way to classify computational models according to the level at which they provide explanations. Marr (1982) suggested three levels at which an information processing system can be studied. First, the computational level seeks answers to the questions what and why, focusing on the more abstract explanation of the computational system. Second, the algorithmic level is concerned with the question of how, focusing on the procedures and input/output representations used by the system. The third level is the implementation level, which is concerned with how the system is realized physically. Marr (1982) presented explanations of a cash register at the different levels as an example. At the computational level, what the cash register does can be explained as addition. Notice that this explanation is independent of the explanation at the algorithmic level, e.g., whether the addition is carried out using binary or decimal digits. In turn, the explanation at the algorithmic level is independent of the implementation level, for example, whether the hardware is realized using mechanical parts or electronic circuits.

In descriptions of models in the literature, it is often not clear at which level the model provides explanations. Most computational models of language acquisition fall into the computational or algorithmic level, even though connectionist systems are sometimes claimed to seek answers at the implementation level. The models that will be described in this thesis fit best into Marr's algorithmic level.
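The independence of the levels can be illustrated with a small sketch of my own (not from Marr, 1982): both functions below satisfy the same computational-level description of the cash register, namely addition, while differing at the algorithmic level (column-wise decimal addition versus binary carry propagation).

def add_decimal(a: str, b: str) -> str:
    """Column-wise decimal addition over digit strings."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)
    carry, digits = 0, []
    for da, db in zip(reversed(a), reversed(b)):
        carry, digit = divmod(int(da) + int(db) + carry, 10)
        digits.append(str(digit))
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits))

def add_binary(a: int, b: int) -> int:
    """Addition of non-negative integers via XOR and carry propagation."""
    while b:
        a, b = a ^ b, (a & b) << 1
    return a

print(add_decimal("19", "23"))  # '42'
print(add_binary(19, 23))       # 42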

3.5 Computational simulations

The methods discussed in Sections 3.2 and 3.3 are studies of language acquisition that stem from the field of computational learning theory. These studies typically define the computational system mathematically, and the questions of interest are investigated through a mathematical analysis of the model. A related method of study is to define a model formally, but to use computational simulations instead of studying it analytically. Like computational learning theory, early work on computational simulations started in the 1960s (e.g., Olivier, 1968). However, there has been an increasing number of computational simulations of various aspects of language acquisition over the last two decades (Brent, 1996; MacWhinney, 2010). While formal analyses are better suited to finding provably working models, computational simulations are better suited to modeling aspects of language acquisition for which it is difficult to find mathematical formalizations, such as the input to the child during language acquisition. A subset of the computational models related to the simulations reported in this thesis will be discussed in Chapter 5.

Both methods, mathematical analysis and computational simulation, require a careful description of the problem. Arguably, the analytical study of a model provides better insights into the problem being modeled. Computational simulations can treat certain parts of the model as a black box, and may not explain what the contents of the black box are, or how they relate to the problem (this is, for example, one of the criticisms leveled against artificial neural networks). However, this can also work to the modeler's advantage. For example, one can leave a certain aspect of the model, such as a particular parameter value, relatively underspecified, and let the simulation explore this aspect. Furthermore, if some data the model uses is difficult to formalize mathematically, computational simulations allow modeling the data with a sample. As I argued in Section 3.3.3, this particular property of computational simulations is useful in modeling language acquisition phenomena. We cannot easily express the linguistic input to children with mathematical formulas. However, we can easily take an appropriate sample for the problem at hand from the growing body of child-directed corpora, for example from CHILDES (MacWhinney and Snow, 1985). The results still depend on how representative the chosen sample is. However, in most cases the idealizations caused by samples taken from real-world data are not as restrictive as mathematical formulations of the data.

Besides the use of real-world data samples, models designed for computational simulations can afford to experiment with more complex learning algorithms. Naturally, learning algorithms with formal proofs of convergence should be used when available. However, for many complicated problems, obtaining these proofs is difficult.9 With the availability of cheap and powerful computation, it is easier, in most cases, to test the performance of the algorithms empirically using computational simulations.

Computational simulations of language acquisition are closely related to the natural language processing (NLP) applications designed for similar purposes. However, NLP applications do not have to be compatible with the child language acquisition process.
For example, it is typical in this field to use supervised learning methods that make use of informative annotations that are unrealistic for modeling language acquisition. Most learning models in the NLP literature are designed to solve an engineering problem without any interest in child language acquisition. However, solutions to similar problems in different fields often inform each other. Even though most NLP work is not suitable as a model of human language acquisition, the unsupervised models in this field can, in fact, provide explanations of the language acquisition process at Marr's computational level, described in the preceding section. An example relevant to the discussions in this chapter is the unsupervised learning of syntax. Klein and Manning (2004, 2005), for example, presented successful experiments on learning syntax in an unsupervised manner. These models do not model how children learn syntactic rules. However, they show that a certain level of success in learning the syntactic structure found in a corpus can be achieved using an unannotated corpus. If we accept the level of success reported in these examples, this already indicates that the popular negative identification in the limit (IIL) results are not relevant when a sample of real-world language data is taken as input.10

9 Formal proofs can also be discouraging, since typically these proofs are concerned with worst-case scenarios. There are examples of computational tasks with complex worst-case running times but rather good typical running times. Examples include the well-known sorting algorithm quicksort (Knuth, 1997, section 5.2.2) and the parse recovery algorithm in a chart parser (Jurafsky and Martin, 2008, chapter 13).

10 As a matter of fact, even supervised models of learning syntax (e.g., Clark and Curran, 2003; Collins, 1999) can be taken as an indication of the failure of negative IIL results. For the IIL framework, as long as negative evidence is not supplied, supervised learning does not change the conclusions: learning from positive examples of a set of sentences, or from positive examples of sentences paired with a representation of their syntax, does not make a difference.

Similar results can be derived from computational models that specifically model aspects of human language acquisition. For example, Clark and Eyraud (2006) present a model that learns the formation of the English interrogative (introduced in Section 2.2.1) from a set of sentences claimed to be insufficient for learning this phenomenon. In a similar study, Yao et al. (2009) presented a model that learns, besides English interrogative questions, another syntactic phenomenon, English auxiliary order, that is frequently claimed to be unlearnable from the data available to children. Crucially, both studies assume relatively simple inductive biases. Although the question of whether the inductive biases of these models support empiricism or nativism is difficult (if not impossible) to answer, they demonstrate clearly that once the phenomenon of interest is modeled carefully, the conclusions drawn from informal arguments may change drastically.

One last issue I would like to address here is skepticism towards the utility of computational models. I believe the models from other fields listed in the introduction to this chapter present a clear case for modeling practice in general. We rarely doubt the utility of models such as a mathematical model of the orbit of a planet, or an animal model for testing the effects of a particular chemical on humans. The parallel modeling practice in language acquisition research is computational modeling. Admittedly, we do not have computational models of language acquisition with similar predictive power. On the other hand, this is not a problem inherent to the computational modeling practice. If our knowledge of language acquisition were as precise as our knowledge of astronomy, for example, then we would be able to build computational models with highly precise predictions. This does not mean that we should stop the modeling practice until we have a level of knowledge adequate for a certain predictive accuracy. As discussed throughout this chapter, predictions are not the only function of modeling. Even predictions that arise from rather simple modeling practices can provide further insights, and raise useful research questions that might otherwise be overlooked. For example, a decision taken in the computational modeling of segmentation by Brent (1999a) was to insert complete utterances into the lexicon if the input utterance cannot be segmented. Questioning the relevance of this modeling decision, Dahan and Brent (1999) tested it experimentally and found parallels between the model and humans in this task.

The idealizations a computational model makes are another source of skepticism. Computational simulations generally make idealizations about certain aspects of the process, or of the input to the language learner. However, it should be noted that idealizations are part of all other methods of study as well. For example, since the use of artificial languages provides a better-controlled experimental setting, many experimental studies use artificial languages to test human linguistic performance. On a related note, experimental methods typically study a very specific question in a well-controlled setting.

Computational simulations in the literature tend to model relatively general phenomena (cf. the experimental study by Saffran et al. (1996a) and the computational models of segmentation reviewed in Chapter 5). As a result, studies in psycholinguistics tend to provide precise answers to specific questions, while most computational simulations model a more general question. This difference, rather than indicating a weakness of computational models, suggests a complementary use of the two methods.

The aim of this section has been to reflect on the role of computational simulations as a method for studying language acquisition. I argued that formal analysis and computational simulation are two complementary ways to study models of language acquisition. However, the simulation approach is at an advantage in modeling aspects of language acquisition that are particularly difficult to express by mathematical means. In general, despite their shortcomings, computational models are a valuable tool in studying language acquisition. I will return to the issues raised in this section throughout the rest of this thesis.

3.6 Summary

Modeling is a common practice in science. We build models of natural phenomena to learn about the phenomena, as well as to make predictions about what we do not yet know about them. The computational models that will be presented in this thesis are examples of this practice.

In this chapter, I distinguished two different ways to study computational models of language acquisition: first, the formal, analytical study of these models in computational learning theory, and second, the computational simulation practice that has become an indispensable method in modeling various cognitive phenomena. The bulk of the chapter reviewed the first method of study, along with its relation to the question of whether natural languages are learnable. Computational learning theory has advanced considerably since its inception with Gold (1967). However, with regard to the learnability question, arriving at any strong conclusion based on work in learning theory is not yet possible. On the other hand, the function of computational modeling is not only to answer questions of learnability. These models, as they mature, may provide more insight into other interesting questions about language acquisition, and provide valuable predictions.

The study of computational models by formal analysis and by computational simulation are complementary methods. However, as I argued in Section 3.5, the simulation approach is particularly useful when there are aspects of the phenomenon, such as the input in language acquisition, that are difficult to formalize mathematically. This difference, however, should not be understood to mean that computational simulations are based on less formal models. The difference emphasized in this chapter concerns the method of study. Both methods require models to be specified formally. However, the choice of method affects how certain idealizations, or approximations, can be made during the modeling practice. For example, noise in the input is likely to be modeled as a probabilistic bound if one chooses to study the model analytically. On the other hand, if simulations are to be used, one can explore a larger range of bounds in more detail, or, if real-world data is used, model it with the noise in the input sample. Section 3.4 described an influential classification of computational models (due to Marr, 1982) according to the type of questions they answer.

This chapter concludes our survey of the broader field of formal language acquisition research. The next chapter introduces the problem of segmentation, the problem on which the modeling effort in this thesis will focus.

4 Lexical Segmentation: An Introduction

Music is a hidden arithmetic exercise of the soul, which does not know that it is counting.
Gottfried Leibniz

Segmentation is crucial in language processing. In a large number of linguistic tasks one needs to segment a continuous stream into units such as words, morphemes, syllables and phonemes. The speech signal is continuous, and spoken language input does not include a single consistent marker comparable to the white spaces of most writing systems. However, we recognize discrete units at multiple levels, e.g., phonemes, syllables, morphemes and words.

bljuuz                              pubuztsujjbjujbeuzjltsdbtuofui
epzpvuijoljutbljuuz                 oijtptjz
zpvdbouijoljutbljuuz                cublitipztbojjbubpfjljufzlv
jtuibubljuuz                        upjzijmululzpijfjzu
ifsfljuuzljuuztmffqjohljuuz         odhvsfzfxjigtuzquuld
mpplzpvdbotujdlzpvsgjohfsjouifsf    msphflluivgluczjbsxobjijfepufjsf
tujdlzpvsgjohfsjouibuipmfsjhiu      htjupmvszfuz
hppehjsm                            bjupvuoouuldmumhplouozsphfih
opxmfuttffxibutuifcbcztbzjoh        hoofut

Figure 4.1: Two input sequences representing unsegmented linguistic input.

Fluent language users identify the lexical units in speech so effortlessly that it is difficult to imagine that segmentation is a problem at all. The problem becomes apparent when one listens to an unfamiliar language, where identifying word-like units becomes close to impossible. Given that the speech signal does not include reliable indicators of where one lexical unit ends and another begins, how do humans identify the lexical units, e.g., words?1

1 Commonly, words are considered to be the lexical units. However, models that allow sub-word units, i.e., morphemes, as well as multi-word units, are better models of the human lexicon (see, for example, the discussion in Davis, 2006, p. 12, and the references therein). In this thesis the terms word and lexical unit are used interchangeably; it is explicitly indicated where the difference matters.


To motivate the rest of the discussion and to get an impression of the problem that children face, we will first work through an informal example. Consider the sequences in Figure 4.1.

The strings on the left side of Figure 4.1 are nine consecutive child-directed utterances from the CHILDES database (MacWhinney and Snow, 1985). The transcriptions of the utterances are systematically (but simply!) modified to remove the advantage of knowing the lexical units of English: the letters are garbled by mapping each letter systematically to a different one. All word boundaries, except the utterance boundaries, are removed from the transcriptions. This gives a first impression of the problem faced by a learner without knowledge of lexical units, e.g., an infant acquiring language. The example includes some additional information, such as the distinction of letters (or phonemes, in the child's case), that an infant does not have at birth. On the other hand, it also lacks some cues to boundaries, such as prosody and possible pauses, which are found in the linguistic input to children. The strings on the right side of Figure 4.1 are formed from the same letters; however, the sequences of letters and the utterance boundaries are randomized.

Even though it is puzzling at first look, if we examine the left side of Figure 4.1 carefully enough, we can find some regularities. The first thing to note is that some substrings, such as ljuuz, bljuuz (which also occurs as an utterance by itself) and epz, repeat multiple times. Another regularity we can observe is that some characters or character sequences consistently follow (or precede) others. For example, character sequences like jo, ju and lj are very frequent (with frequencies of 8, 9 and 9, respectively), while some others are rare or not observed at all: 52 two-character sequences are observed only once, and only 91 of the 484 possible two-character sequences are observed at all. On the other hand, no matter how long we stare at the right side of Figure 4.1, we cannot observe similar regularities there, since this sequence is formed by random concatenation of the same letters.2

Even in this small sample, we can find a number of regularities that are typical of natural languages, such as repeated strings and principled sequences of basic units (e.g., letters). Furthermore, there are properties of the speech signal that are difficult to demonstrate on paper. The regularities demonstrated in this example are not all the possible regularities that one can utilize to discover lexical units in natural language input. We will discuss others in detail, and return to this example from time to time.

The task for the language user, then, is to spot the lexical units in the continuous speech stream by making use of a number of noisy and sometimes conflicting regularities, or cues.

2 Note that the sample on the right side is not completely arbitrary either. Although their locations are randomized, the letters and the lengths of the utterances are chosen to be identical to those of the real language sample. Hence, for example, finding that the letter 'j' is the most frequent letter on the right side, just as on the left, is not surprising.
For the benefit of those who have not yet discovered the cipher in Figure 4.1, the same sequences are repeated in Figure 4.2 without garbling the letters.

akitty                              otatysrtiiaitiadtyiksrcastneth
doyouthinkitsakitty                 nhisosiy
youcanthinkitsakitty                btakhshoysaniiataoeikiteyku
isthatakitty                        toiyhiltktkyohieiyt
herekittykittysleepingkitty         ncgureyewihfstypttkc
lookyoucanstickyourfingerinthere    lrogekkthufktbyiarwnaihiedoteire
stickyourfingerinthatholeright      gsitoluryety
goodgirl                            aitoutnnttkcltlgokntnyrogehg
nowletsseewhatsthebabysaying        gnnets

Figure 4.2: The original sequences of Figure 4.1, in which each letter had been replaced with the succeeding letter according to the ASCII code. The presentation is inspired by Cohen et al. (2007).
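The garbling used in Figure 4.1 is easy to reproduce. The following one-liner is a sketch of mine (not necessarily the exact procedure used to prepare the figures): it shifts each lowercase letter to its successor in ASCII order.

def garble(utterance: str) -> str:
    """Replace each lowercase letter with the next letter in ASCII order."""
    return "".join(chr(ord(c) + 1) if c.islower() else c for c in utterance)

print(garble("akitty"))  # 'bljuuz', the first utterance in Figure 4.1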

For a competent speaker of a language, the task is somewhat easier: the (implicit) knowledge of the language, such as the words or the possible phoneme sequences, is useful for segmentation. However, the task is more challenging for a learner who has only partial knowledge of the language to be learned, or none at all. The next section provides a brief review of the field of word recognition. The studies reviewed there assume that the words are already known; naturally, they are models of adult performance in this task. Section 4.2 reviews the psycholinguistic literature most relevant to the computational models that will be developed throughout the rest of this thesis.

4.1 Word recognition

Even though the models developed in this study do not assume that the learner starts the task of identifying words with a complete lexicon, the study of adult word recognition is a closely related field of research. Knowing the words of the input language helps considerably in recognizing them in the continuous speech stream. The informal example discussed at the beginning of this chapter already demonstrates this: although it is almost impossible to identify the words on the left side of Figure 4.1, for speakers of English the words on the left side of Figure 4.2 are easy to extract. However, the recognition of words in a real-world speech stream is more difficult than this example suggests. The difficulty of identifying words in continuous speech stems from two factors. First, the listener has to deal with a large number of acoustic hurdles, including noise and the variability of the speech signal due to speech rate, dialectal differences, and differences in individual speakers' voices.

[Figure 4.3 shows a word lattice over the phoneme sequence /r e k @ n ai s b ii ch/; the competing hypotheses include recognize, reckon, wreck on, an ice, a nice, speech, and beach.]

Figure 4.3: An automatic speech recognizer's attempt to segment the phrase recognize speech. Example reproduced from Shillcock (1995).

Considering, in addition, that the pronunciations (tokens) of a word differ at least slightly from each other acoustically, even when heard in isolation, recognizing words in the speech signal is a difficult task. Second, the input is generally compatible with multiple segmentations, each of which forms a complete segmentation of the input utterance. Figure 4.3 demonstrates these difficulties by presenting segmentations offered by an automatic speech recognition system for the phrase recognize speech. This is a popular example in the speech recognition literature where, among others, an alternative hypothesis during segmentation of the utterance recognize speech is wreck a nice beach.

A large body of literature exists on spoken word recognition (see Dahan and Magnuson, 2006; Davis, 2006, for comprehensive reviews). Most studies in the field are highly involved in the modularity debate. They focus on the manner in which low-level perception is affected by higher-level lexical knowledge. A number of phenomena demonstrate that word recognition is affected by other cognitive tasks or knowledge. In addition to lexical knowledge, higher-level linguistic processes (syntax, semantics and discourse) and non-linguistic context, e.g., related visual stimuli, affect the perception of sounds and the recognition of words. Some phenomena that show the effects of higher-level knowledge on the perception of sounds are listed below.

• The word superiority effect: target phonemes are detected more quickly in words than in non-words (Rubin et al., 1976).
• The phoneme restoration effect: when certain phonemes are masked, by a cough or a buzz, listeners report hearing phonemes that are consistent with the lexical context (Warren, 1970).
• The Ganong effect: an ambiguous sound in a phonetic continuum, such as between /t/ and /d/, is interpreted in the context that yields a lexical unit. When changed systematically from /d/ to /t/, the ambiguous sound is interpreted as /t/ earlier in the context -ask than in -ash (Ganong, 1980).

• Some phoneme changes (e.g., /bit/ and /pit/) that are detected by listeners when a word is presented in isolation go undetected when the word is embedded in a phrasal or sentential context (Cole, 1973).
• Helpful visual context facilitates ambiguity resolution (Tanenhaus et al., 1995).

Even though the form and timing of the interaction are debated, it is uncontroversial that both the high-level linguistic knowledge of the speaker, e.g., lexical units, and the low-level information from the acoustic signal interact in word recognition. All of the influential models of spoken word recognition, e.g., COHORT (Marslen-Wilson, 1987; Marslen-Wilson and Welsh, 1978), TRACE (McClelland and Elman, 1986) and Shortlist (Norris, 1994), integrate both bottom-up and top-down sources of information. Differences exist in the way the recognition process incorporates the information, as well as in the processing strategies. However, for our purposes here, these differences are irrelevant. All word recognition models base the identification of words in the input stream on lexical knowledge: words in the input stream are identified as the initial segments of the input are matched against already known words in the lexicon. These models provide varying degrees of success in predicting human performance.

The weakness of word recognition models for our purposes is that they require a relatively complete lexicon, and they do not explain how infants start spotting unknown words in the continuous speech stream without a lexicon. Even though they do not offer a way to bootstrap the lexicon, the insights and mechanisms offered by these models are still relevant after the lexicon has been populated by other means.

4.2 Lexical segmentation

While researchers studying word recognition were busy with the modularity debate, the focus of the speech segmentation literature for the last two decades has been a debate about the roles of prosodic and statistical cues in learning segmentation. Even though it is known that children are sensitive to a number of other cues, two cues have been studied extensively in the psycholinguistic literature: predictability based on statistical relationships between consecutive sound segments, and lexical stress. The former is commonly called statistics, statistical regularities or distributional statistics in the literature.3 The latter, lexical stress, is the most studied prosodic cue, although there are a number of other prosodic cues that are helpful for lexical segmentation.

In this section we will review some of the cues that are believed to be used by infants in constructing their initial lexicon. The cues are simply aspects of the linguistic input. They are typically low-level perceptual signals, such as pauses.

3 Although it is very common to see these terms used as if they were equivalent to the predictability-based cues, the statistics or distributional regularities that serve as predictability cues are just one of many uses of statistics in language acquisition.

phoneme   h   I   y   z   k   w   I   k   @   r
SV        9  14  29  29  11   6  10  28  14  28

Figure 4.4: Example successor values for the phrase /hIyzkwIk@r/ 'he is quicker', determined by Harris (1955).

However, there is evidence that a broader range of cues, derived over a longer time span, such as statistics over consecutive syllables, are also used in lexical segmentation. For a cue to be useful in extracting lexical units from continuous speech, we need to establish a number of facts:

1. Learners are sensitive to the cue.
2. The cue is useful for a learner, as it is available in the input.
3. The existence of the cue in the input facilitates the segmentation task.

Although all of these questions are addressed by experimental studies to some extent, the majority of the studies in developmental psycholinguistics are concerned with the first item above. The last two points provide a good case for computational simulations and corpus analysis.

4.2.1 Predictability and distributional regularities

Natural language utterances are not arbitrary strings of phonemes; instead they are produced by concatenating lexical units. Because of the way they are formed, natural language utterances exhibit certain statistical regularities. At least as early as Harris (1955), it was known that a simple property of natural language utterances can aid in identifying the lexical units that form an unsegmented utterance:

Predictability within the units is high, predictability between the units is low.

Harris operationalized this idea by introducing a measure called successor variety (SV). SV is simply the number of distinct phonemes that may follow a certain initial utterance segment. The higher the SV, the lower the predictability of the next phoneme, and the more likely it is that a boundary follows the segment. Figure 4.4 presents an example from Harris (1955), where successor values are given after each possible initial substring of the phrase /hIyzkwIk@r/ 'he is quicker'. According to this strategy, the higher the successor value, the higher the chance of a lexical boundary. There was some early work on using this approach in natural language processing applications (e.g., Hafer and Weiss, 1974). However, for a long time the idea was not investigated in developmental psycholinguistics as a possible source of information that children may use for segmentation. The influential study by Saffran, Aslin and Newport (1996a) showed that 8-month-old infants are indeed sensitive to this type of information, and that they use it to extract word-like units from an artificial language stream without any other cues to word boundaries. Saffran et al. (1996a) used another measure of predictability, which they called transitional probability (TP). The TP is defined over two successive units. It is simply the conditional probability of seeing a unit, e.g., a syllable, given the previous unit. Given two units l and r, the transitional probability TP(l, r), sometimes denoted TP(l → r), is defined as4

$$\mathrm{TP}(l, r) = P(r \mid l) = \frac{P(lr)}{P(l)} \approx \frac{\mathrm{frequency}(lr)}{\mathrm{frequency}(l)} \qquad (4.1)$$

For example, given the string 'bidakupadotigolabubidakugolabupadoti', the TP values for the syllable pairs bi–da and ku–pa can be calculated as

$$\mathrm{TP}(bi, da) = P(da \mid bi) = \frac{\mathrm{frequency}(bida)}{\mathrm{frequency}(bi)} = \frac{2}{2} = 1.0$$
$$\mathrm{TP}(ku, pa) = P(pa \mid ku) = \frac{\mathrm{frequency}(kupa)}{\mathrm{frequency}(ku)} = \frac{1}{2} = 0.5$$

suggesting that a boundary is more likely between ku–pa than between bi–da.

Saffran et al. (1996a) used artificial language stimuli constructed by concatenating three-syllable artificial word-like units. The stimuli were formed such that the word-internal syllable transitions were completely predictable (TP = 1), while predictability was lower between the words (TP = 1/3). There were no acoustic cues indicating the word boundaries. After two minutes of familiarization, the infants reacted significantly differently to a test stream formed with the same words, compared to a speech stream that included the same set of syllables with the same frequencies of occurrence, but was formed by concatenating parts of the words from the familiarization phase. After Saffran et al. (1996a), a large number of studies confirmed that predictability-based strategies are used by adults and children for learning different aspects of language (e.g., Aslin et al., 1998; Graf Estes et al., 2007; Newport and Aslin, 2004; Perruchet and Desaulty, 2008; Thiessen and Saffran, 2003; Thompson and Newport, 2007). An important feature of the predictability-based strategy is that it does not require any initial lexical knowledge, and it is completely language independent. A learner with knowledge of the basic units, e.g., phonemes, can readily use a predictability-based strategy to start extracting units from continuous speech. The measures discussed above, successor variety and transitional probability, are not the only measures of (un)predictability. There are a number of other measures for quantifying predictability, such as entropy and mutual information. We will return to alternative measures in Chapter 6.

4 To avoid notational clutter, a somewhat sloppy probability notation is followed throughout this thesis. The notation P(r | l) means the probability that the syllable in the next position is r, given that the current syllable is l, i.e., $P(s_{t+1} = r \mid s_t = l)$. The symbols l and r will consistently be used for the sequences of phonemes to the left and right of the position in question, respectively.
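Both measures are straightforward to compute. The following minimal sketch (mine, in Python) implements successor variety over a toy corpus of unsegmented utterances, and the transitional probability of Equation 4.1 over the syllable stream used in the example above.

from collections import Counter

def successor_variety(corpus, prefix):
    """Number of distinct symbols that follow the utterance-initial
    segment `prefix` in the corpus (Harris, 1955)."""
    return len({u[len(prefix)] for u in corpus
                if u.startswith(prefix) and len(u) > len(prefix)})

def transitional_probability(syllables, l, r):
    """TP(l, r) = frequency(lr) / frequency(l), as in Equation 4.1."""
    bigrams = Counter(zip(syllables, syllables[1:]))
    return bigrams[l, r] / Counter(syllables)[l]

# Toy corpus for SV: three distinct continuations after 'is'.
corpus = ["isthatakitty", "isit", "ishe"]
print(successor_variety(corpus, "is"))  # 3 (followed by 't', 'i', 'h')

# The stream from Saffran et al. (1996a), read as two-letter syllables.
stream = "bidakupadotigolabubidakugolabupadoti"
syllables = [stream[i:i + 2] for i in range(0, len(stream), 2)]
print(transitional_probability(syllables, "bi", "da"))  # 1.0
print(transitional_probability(syllables, "ku", "pa"))  # 0.5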

Positing word boundaries where there is an unexpected sequence of syllables or phonemes is one strategy in line with evidence of the sort provided by Saffran et al. (1996a). Another strategy compatible with the same type of analysis is identifying sequences that frequently occur together in varying environments as possible words. Even though it has not been studied extensively in the psycholinguistic literature, a large number of computational models of speech segmentation make use of this strategy (e.g., Brent, 1999a; Goldwater et al., 2009; Venkataraman, 2001, among others). The main driving force behind these models is the identification of sequences that co-occur frequently in varying contexts. This strategy is strongly related to predictability-based strategies. In their basic form, both strategies depend on the fact that the speaker generates utterances using units from his or her lexicon. Both the unpredictability of syllable sequences and the occurrence of frequent sequences in differing contexts are artifacts of this process. Most importantly, both strategies are language independent and do not require knowledge of any lexical units in advance.

4.2.2 Prosodic cues

It is known that even newborns a few days old show sensitivity to the overall prosodic structure of the language they were exposed to before birth (Mehler et al., 1988; Moon et al., 1993; Ramus, 2002). The term 'prosody' encompasses a relatively large set of acoustic phenomena that may be useful for segmentation, such as the pitch contour that marks some sub-clausal boundaries, and the lengthening of the final segments of a lexical unit. However, the only well-studied concrete prosodic segmentation strategy is based on lexical stress. Since most content words in English follow a strong–weak (trochaic) stress pattern, adults (Cutler and Butterfield, 1992) and 7.5-month-old children (Jusczyk, 1999; Jusczyk et al., 1999b) seem to use a segmentation strategy that proposes word onsets at strong syllables. Lexical stress is useful for the segmentation of English, as most English content words are stressed on their initial syllable.5 However, this strategy is not equally effective for all languages. Languages vary in where primary stress falls within words, as well as in how consistent the location of the primary stress is. The experimental evidence for the use of stress in different languages is also mixed. On the one hand, a similar trochaic-pattern preference was found for Dutch (Vroomen et al., 1996), and Canadian French speakers seem to make use of the weak–strong (iambic) stress pattern for segmentation (Polka and Sundara, 2003). On the other hand, infants learning (European) French and Spanish do not seem to make use of stress as a segmentation cue (Nazzi et al., 2006).

A more general version of this strategy, the so-called metrical segmentation strategy (MSS), is advocated particularly by Cutler and her colleagues as both a processing strategy for adults and a 'bootstrapping' method for children (Cutler, 1996; Cutler and Butterfield, 1990; Cutler and Carter, 1987; Cutler and Mehler, 1993).

5 Cutler and Carter (1987) report that 71% of all words, and 90% of the content words, in the MRC Psycholinguistic database (Coltheart, 1981; Wilson, 1988) have strong initial syllables. One should note, however, that the most frequent words, e.g., function words, are unstressed, and the numbers vary depending on how one counts (see Swingley, 2005, for a more careful analysis).

The MSS relies on the so-called rhythmic classes of languages (Pike, 1945). Despite contrary empirical evidence (Dauer, 1983; Roach, 1982), the intuition behind the rhythmic classes is that certain units are produced at approximately equal intervals. In stress-timed languages (e.g., English and Dutch) this unit appears to be the stressed syllable. In syllable-timed languages (e.g., French and Spanish) the syllable is assumed to be the rhythmic unit.6 The MSS then suggests that once infants tune into their language's rhythmic class, they use it for lexical segmentation. The way lexical stress can help lexical segmentation is well founded. However, the status of an MSS-like strategy for syllable-timed languages is rather uncertain. To my knowledge, the studies advocating syllable-based segmentation do not go further than establishing the syllable's role as a perceptual unit which shows cross-linguistic differences (Cutler et al., 1986). Additionally, even though lexical stress has been used in explicit models of segmentation (e.g., Christiansen et al., 1998), there is no explicitly stated method that uses a general MSS-like strategy based on syllables (see also the discussion of the units used in computational models of segmentation in Section 5.1).

The claim that the MSS can serve as a bootstrapping method to start extracting lexical units without the aid of a lexicon (Cutler, 1996; Cutler and Mehler, 1993) relies on the fact that children are sensitive to prosody very early in life. However, it is again unclear how this can be done. Even if it were a very reliable indicator of word boundaries, lexical stress patterns differ among languages. For example, in two languages observed to have highly regular stress patterns, Finnish words are stressed on their first syllable, while Polish words are stressed on the penultimate syllable. For learners to figure out the stress pattern of the ambient language, they first need to learn a relatively large number of lexical items. There are suggestions that the MSS can be bootstrapped from lexical units learned either by using statistical predictability (Swingley, 2005; Thiessen and Saffran, 2004) or possibly by first learning isolated words and short utterances (Johnson and Seidl, 2009).7 In any case, the stress-based segmentation strategy is not viable as the sole bootstrapping method.

Lexical stress seems to be the only viable rhythmic cue that is present in the input, and it is used at least by learners of some languages. The other rhythmic segmentation strategies may be useful for segmenting speech into basic perceptual units. However, these methods need explicit proposals of how they can be used for extracting lexical units from continuous speech.

6 Another category found in the literature is based on the mora (a sub-syllabic unit; see Mazuka, 2007, for a recent discussion relevant to segmentation). The only language so far claimed to be mora-timed seems to be Japanese.

7 Swingley (2005) presents an analysis of very short utterances in the Korman corpus (Korman, 1984) which largely undermines the second hypothesis. According to this analysis, the stress pattern of short utterances is dominated by the strong–strong pattern by a large margin, and for some counts the reverse (weak–strong) pattern is also more frequent than the expected strong–weak pattern.

The majority of studies investigating the role of prosody in lexical segmentation have focused on the rhythmic structure of the language. Another aspect of prosody that has received relatively little attention is the prosodic marking of sub-clausal units, such as the intonational phrase and the phonological phrase.8 Since phrase boundaries are also word boundaries, they can be useful in the same way pauses are useful for segmentation. Adults have been found to be sensitive to intonational boundaries, and they use this information in lexical segmentation (Shukla et al., 2007). It has also been found that 6-month-old infants show sensitivity to prosodic phrase boundaries (Soderstrom et al., 2003). The way phrase boundaries can aid lexical segmentation is clear: like utterance boundaries, they can constrain the hypothesized lexical units, and also give further hints about the beginnings and ends of lexical units.

In summary, certain prosodic cues seem to play a role in segmentation. Concrete proposals exist for the effects of lexical stress and of the sub-sentential units marked by prosody. However, proposals suggesting contributions of other prosodic cues (for example, the syllable-based rhythmic structure of some languages) are rather unclear.

4.2.3 Phonotactics

In the segmentation literature, phonotactic cues refer to the way the sounds of a natural language are organized to form lexical units. For a given language, some sound patterns do not occur, some tend to occur more at the beginnings of words, some word-internally, and some at the ends of words. For example, the sound sequence /Tm/ does not occur within English words. If one observes this sequence in an unsegmented utterance such as withmilk, it can serve as a cue that there is a word boundary between the two sounds. Commonly, these regularities are assumed to form constraints; that is, a certain sound sequence is either possible in the context of interest, or not. However, one can also expect softer tendencies, where some sequences are more likely than others. Combined with the findings that adults (Greenberg and Jenkins, 1964) and infants as young as 6 months old (Jusczyk et al., 1993) are sensitive to the sound regularities that form words in their language, it is reasonable to assume that phonotactics is a possible source of information that can aid segmentation.

Jusczyk et al. (1993) showed that 9-month-old English-learning infants distinguished between the sound sequences of English and Dutch, two relatively similar languages. However, 6-month-olds did not show the same distinction, but did distinguish more distinct sound patterns, e.g., the sound patterns of Norwegian vs. English. Furthermore, Jusczyk et al. (1994) showed that 9-month-olds learning English were also sensitive to the frequency with which certain sound sequences occur in English.

8 The intonational phrase refers to a segment of speech that occurs with a single prosodic (i.e., pitch or rhythmic) contour. A prosodic phrase is typically a content word and its associated function words (Nespor and Vogel, 1986).

Thus, the evidence suggests that infants develop a sense of the sound sequences of their language early on, and that they do so in a graded manner.

Although it is common to view phonotactics as a set of hard constraints, phonotactics is essentially based on the statistics of sound sequences. Hence, it is closely related to the sensitivity that infants show to the distributions of sound sequences that signal predictability. However, there are two major differences. First, the predictability cue does not require a repository of already identified words, while in order to learn phonotactic regularities one needs to identify some words first. On the other hand, once a set of words is available, phonotactics can make generalizations based on positions within the word, while the predictability cue cannot. For example, the phonotactics of English would capture the fact that the consonant cluster /kt/ does not occur word-initially. This sequence is not likely to be useful for predictability-based statistics, since it occurs relatively frequently in other contexts, for example in talked, as in many other past tense forms of verbs ending in /k/. Second, phonotactics is a language-specific cue. Even though /kt/ does not occur word-initially in English, it does in other languages, such as Polish. There is thus some overlap in the way predictability and phonotactics may provide cues for segmentation. However, there are differences in what they predict and in the conditions under which they function. As a result, the contributions of phonotactics and predictability together are likely to be more effective than either alone.
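As a minimal sketch of how phonotactic statistics can flag boundaries (mine, assuming a small already-identified seed lexicon; the word list is hypothetical), one can count which letter bigrams occur word-internally and propose a boundary wherever an unseen bigram appears in an unsegmented utterance.

from collections import Counter

lexicon = ["with", "milk", "think", "its", "a", "kitty"]  # toy seed lexicon

# Bigram counts over word-internal positions only.
internal = Counter(w[i:i + 2] for w in lexicon for i in range(len(w) - 1))

utterance = "withmilk"
for i in range(len(utterance) - 1):
    if internal[utterance[i:i + 2]] == 0:
        print("possible boundary:", utterance[:i + 1], "|", utterance[i + 1:])
# 'hm' never occurs word-internally in the lexicon, so a boundary is
# proposed between 'with' and 'milk'.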

4.2.4 Pauses and utterance boundaries

Intuitively, pauses are the most robust cues to word boundaries. However, fluent speech does not have consistent pauses between words. Even though their frequency of occurrence is unclear, it is also known that pauses may occur within words (Slis, 1970). Pauses alone are not sufficient to determine all word boundaries, nor are they completely reliable. Nevertheless, when pauses occur (for example at utterance boundaries) they can be utilized for lexical segmentation (Christiansen and Allen, 1997). Further evidence suggests that they often occur at phrase boundaries (Wightman et al., 1992), which appears to be characteristic of long child-directed utterances (Fernald et al., 1989). Since utterance and phrase boundaries are also word boundaries, pauses can be useful for segmentation in two ways: first, by restricting possible segmentations, i.e., eliminating candidate words that span pauses; second, by serving as indications of the sound patterns that occur at the beginnings and ends of words, hence complementing phonotactics.

4.2.5 Words that occur in isolation and short utterances

A possible step towards building a lexicon is attending to words that occur in isolation. The naive strategy of extracting words that are uttered in isolation is generally dismissed on the logical grounds that there is no reliable way to distinguish single-word utterances from multi-word utterances (Christophe et al., 1994). Further, it was found that words in isolation are not used consistently when mothers are asked to teach a new word to their children. Even though caregivers used other strategies (such as placing the new word in utterance-final position) to help the learners, they rarely used the word in isolation (Aslin et al., 1996). Despite these criticisms, Brent and Siskind (2001) demonstrated that exposure to isolated words facilitates lexical development.

Short utterances can be useful for segmentation in a number of ways. Because of the increased number of utterance boundaries, they contribute to forming better knowledge of the phonotactics of the language. The number of alternative segmentations is also smaller for short utterances, which increases the chances of successfully segmenting utterances that contain unknown words.
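As a toy illustration of this bootstrapping idea (my sketch, not Brent and Siskind's method), one can seed a lexicon with the shortest utterances, treating them as likely isolated words, and then strip known words from the edges of longer utterances to obtain new lexical candidates.

utterances = ["kitty", "look", "goodgirl", "isthatakitty", "kitty"]

# Crude proxy for words heard in isolation: very short utterances.
lexicon = {u for u in utterances if len(u) <= 5}

def strip_known(utterance: str, lexicon: set) -> str:
    """Peel known words off the edges of an utterance, leaving a
    residue that may itself be a new lexical candidate."""
    for word in sorted(lexicon, key=len, reverse=True):
        if utterance.startswith(word):
            utterance = utterance[len(word):]
        if utterance.endswith(word):
            utterance = utterance[:-len(word)]
    return utterance

print(strip_known("isthatakitty", lexicon))  # 'isthata'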

4.2.6 Other cues for segmentation

Allophonic cues refer to the fact that some phonemes are realized differently (by different allophones) depending on the lexical context. For example, in English, stop consonants tend to be aspirated when they are word-initial (Church, 1987). The phoneme /t/ in toy is likely to be uttered in an aspirated fashion, unlike the /t/ in bottle or cat. Like phonotactic cues, this acoustic variation depends on lexical context, and can be useful for segmentation. Indeed, infants around 10.5 months of age are able to use allophonic differences to extract words from speech (Jusczyk et al., 1999a). Jusczyk et al. (1999a) habituated infants to bi-syllabic sequences like nitrates and night rates. During the testing phase, 10.5-month-olds showed a preference for listening to the passages that contained the habituated words, but 9-month-olds did not show any preference.

Vowel harmony is another cue, found to be used by speakers of languages whose words follow some form of vowel harmony. Vowel harmony places restrictions on the vowel classes that can be found within a word. In such a language, a class mismatch between two consecutive syllables provides a cue for a word boundary between those syllables. Finnish adults (Suomi et al., 1997) and Turkish infants (van Kampen et al., 2008) have been shown to be sensitive to the vowel harmony of their language, and to use it as an additional cue for segmentation.

Coarticulation refers to the phenomenon that consecutive phonemes overlap during articulation, and are realized acoustically differently depending on their phonetic context. Since this is a source of variation that hearers need to cope with in order to identify the phonemes, it normally complicates the task of speech recognition. However, coarticulation turns out to be useful for segmentation, as the overlap between consecutive phonemes at word boundaries is smaller than the overlap between word-internal consecutive phonemes (Fougeron and Keating, 1997). It has also been observed that infants as young as 8 months old are sensitive to coarticulation (Johnson and Jusczyk, 2001), and that it outweighs other cues particularly in noisy listening conditions (Mattys, 2004).
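A sketch of the vowel harmony cue (mine; the syllabified example is constructed): in a language with front/back harmony, such as Turkish, a change of vowel class between consecutive syllables suggests a word boundary.

FRONT, BACK = set("eiöü"), set("aıou")   # Turkish vowel classes

def vowel_class(syllable: str):
    """Return the harmony class of the first vowel in the syllable."""
    for c in syllable:
        if c in FRONT:
            return "front"
        if c in BACK:
            return "back"
    return None

# 'bebek' (baby) + 'kapı' (door), unsegmented and syllabified.
syllables = ["be", "bek", "ka", "pı"]
for s1, s2 in zip(syllables, syllables[1:]):
    if vowel_class(s1) != vowel_class(s2):
        print(f"possible boundary between {s1!r} and {s2!r}")
# -> possible boundary between 'bek' and 'ka'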

The list of cues to speech segmentation reviewed here is by no means complete. Other cues, ranging from syntactic knowledge (Mattys et al., 2007) to the visual environment (Hollich et al., 2005), also seem to play a role in human segmentation of the speech signal. The review in this section provides a background for the type of cues that are commonly used in computational modeling, including the model of segmentation presented later in this chapter.

4.3 Segmentation cues: combination and comparison

The experimental studies reviewed so far have firmly established that adults and children are sensitive to a number of cues. Combining information from multiple sources is clearly advantageous:

• Using information from multiple sources provides redundancy, increasing reliability in case some of the sources are noisy.

• By integrating information from multiple sources, it may be possible to derive conclusions that cannot be derived from a single source. For example, using two eyes one can get reliable depth information through stereo vision.

• Using multiple sources of information also facilitates learning (Christiansen et al., 2005).

Cue combination and integration have been studied more thoroughly in other areas of cognitive science, such as depth perception and sensory-motor control (e.g., Ghahramani et al., 1997; Kording et al., 2007; Landy et al., 1995). However, most experimental studies in the segmentation literature are concerned with establishing an earlier or more prominent place for the cue of choice, particularly between statistics and prosody. Studies investigating the combination of these cues, and the mechanisms underlying the combination, are relatively few (Mattys et al., 2005; Sanders and Neville, 2000; Shukla et al., 2007). In the debate between prosody and statistics, the interest is in whether one of the cues dominates the other in case of conflict (e.g., Johnson and Seidl, 2009; Thiessen and Saffran, 2003), or whether initial bootstrapping is provided by a certain cue (e.g., Thiessen and Saffran, 2007). Some of these studies seem to be fueled by the nature–nurture debate: prosody is taken as the representative of innate, domain-specific knowledge, and statistics as an example of general-purpose learning procedures (e.g., Gervain and Mehler, 2010). As with other instances of this debate, many conclusions on the subject are rather stretched. Indeed, various sorts of statistical learning seem to be domain general; e.g., statistics similar to the ones discussed in Section 4.2 are used by humans in segmenting tone sequences (Saffran et al., 1999) and visual patterns (Fiser and Aslin, 2002; Kirkham et al., 2002). Statistical learning also appears not to be specific to humans: other mammals are also sensitive to statistical regularities in sound sequences (Hauser et al., 2001; Pons, 2006). Infants show sensitivity to the prosodic structure of the ambient language very early in life. However, the use of prosodic cues in segmentation requires learning. For example, the most studied cue, lexical stress, needs to be learned since it is language specific. Furthermore, it is likely that some statistics are involved in learning these patterns, since they are not inviolable rules but (statistical) tendencies. For example, on the most favorable count, the dominant trochaic stress pattern of English is valid for at most 90% of words. Whether innate or learned, an important finding is that infants use a number of cues in combination for segmentation. It is yet to be established how these cues integrate and interact.

5 Computational Models of Segmentation

Science is knowledge which we understand so well that we can teach it to a computer; and if we don't fully understand something, it is an art to deal with it.
Donald E. Knuth

Segmenting a continuous stream into linguistically useful units is a computational problem. Figure 4.1 on page 45 presented an example that approximates the linguistic input we segment in everyday life. Faced with an input stream like the one given in Figure 4.1, it is at first sight surprising that children can extract anything from it at all. The studies reviewed in Chapter 4 suggest that there are certain features of the linguistic input, or cues, that children are sensitive to. Furthermore, some of these cues have been shown to be used by adults and children in the segmentation task. Computational modeling and simulations provide a relatively easy way to investigate some of the questions regarding segmentation problems, particularly the availability of these cues in the input, and whether and how these cues may contribute to the solution of the segmentation problem. Not surprisingly, learning segmentation has been one of the most studied aspects of language acquisition from the computational perspective. Initial ideas go back to Harris (1955), and there have been explicit computational implementations as early as Olivier (1968). The fact that three out of six computational models presented in a recent special issue of the Journal of Child Language on computational models of language acquisition (MacWhinney, 2010) are models of learning segmentation (Blanchard et al., 2010; Monaghan and Christiansen, 2010; Rytting et al., 2010) indicates that solutions to the segmentation problem are far from settled. A review of the psychologically motivated computational models of segmentation up to 1999 can be found in Brent (1999b). Besides psychologically motivated models, there are a number of segmentation tasks in natural language processing applications that we can also learn from. One of these tasks is segmenting written text in languages that are written without white space or any other separators between words, e.g., Chinese and Japanese. Another arises in segmenting words into morphemes for various levels of morphological analysis.


This chapter introduces the computational problem of learning segmentation with a review of the state of the art in computational modeling of segmentation, focusing on psychologically motivated models. The criteria listed below for comparing computational models of segmentation will be considered during the discussions of individual models. These criteria are similar to, and partially overlap with, criteria suggested earlier in the literature (Batchelder, 2002; Brent, 1999b; Monaghan and Christiansen, 2010).

• The input. The way input is presented to a model is an important part of its specification. Particularly interesting aspects of the input include the basic units of the representation, whether it is naturally occurring speech or artificially generated input, and the aspects of the input that the model is sensitive to.

• Processing strategy. Two broad strategies of segmentation are found in the literature. Models that use the prediction strategy try to predict the boundaries directly, without extracting the lexical units explicitly. Models that use the recognition strategy try to extract frequently occurring substrings as lexical units from the input.

• Processing characteristics. In particular, whether the model needs a large number of utterances at once (batch), or whether it processes each utterance in turn (incremental). Brent (1999b) also distinguishes between incremental models and on-line models. He defines on-line models as models that do not wait until the utterance boundary to start extracting words. Further, he suggests that a predictive on-line model is a better candidate for modeling human performance, as humans tend to guess lexical units even before they have heard them completely.

• Search strategy. A brute-force search through all possible segmentations is intractable and implausible for realistic input (see Section 5.3 for a discussion). The search strategies vary in many ways. One aspect of interest is whether they start from the building blocks (e.g., phonemes) and combine them into lexical units (synthetic), or take the whole utterance and segment it into lexical units (analytic). Another aspect we are interested in is the way the search space is explored.

• Performance. We do not know exactly what the lexical units in our mental lexicons are, nor what the end product of the human segmentation process is. However, everything else being equal, we would like our models to perform similarly to a hypothesized gold standard.

• Computational complexity. Since the human cognitive system has limited computational resources (such as memory and processing power), a computational model of human performance is expected to use plausibly limited resources.

• External constraints. Most models utilize some hard constraints that help the segmentation algorithm in some way. A common example of such constraints is requiring all lexical items to have a vowel. Even though some of these constraints are indeed useful for segmenting the input, assuming the constraints still requires an explanation of how they are learned. Assuming these constraints are innate is a possible path, but for such hard constraints to be innately specified, they would have to be valid for all natural languages. Similarly, the free parameters of the models (parameters that are set by the model designer and not learned by the model) also form constraints that require explanation.

• Whether the model builds a lexicon or not. Even though all segmentation models can be argued to have some sense of lexical units, models that build an explicit lexicon are more attractive if we assume that the learner's interest is to assign meanings to the discovered lexical items.

5.1 The input

All the segmentation models we describe in this section are unsupervised in the sense that they are not given the correct boundary locations in the input. The representation of the input varies among the models, but one aspect is common to all of them: the input is a set of unsegmented strings, possibly with noise. The models are not given negative examples labeled as such, nor are they provided with corrections when they make mistakes.1

1 Note that the second assumption is not always true. Children may be correcting an initially wrong segmentation when they build the overall interpretation of the utterance, which is likely to be aided by their interaction with the environment.

For segmentation models that try to answer questions about language acquisition, it is natural to use the input that children receive during the acquisition of language. This makes child-directed speech, and representations that are close to the real speech signal, more attractive. However, some models developed on written text, or on artificially generated text, may also demonstrate certain concepts better, or offer insights that can be transferred to more relevant models. Except for a few (e.g., Elman, 1990; Goldsmith, 2001; Perruchet and Vinter, 1998), most models discussed in this section use some encoding of child-directed speech (CDS) for their simulations. The main body of work on infant segmentation in psycholinguistics considers the syllable to be the base unit (e.g., Saffran et al., 1996a). However, a few exceptions aside (Perruchet and Vinter, 1998; Swingley, 2005), most computational models in the literature are developed and tested using phonemes as the basic unit of the input stream. At first, this may seem to conflict with the findings that the syllable is a salient unit of perception in early infancy (Bijeljac-Babic et al., 1993; Jusczyk et al., 1995). However, this does not necessarily pose a problem for the computational models that use phonemes: the syllable's role as a salient perceptual unit does not rule out the phoneme as another unit that infants are sensitive to. Even though the experimental evidence supporting the phoneme's role as (another) perceptual unit is not as abundant, there are studies indicating that infants are sensitive to sub-syllabic differences as well. For example, Jusczyk and Thompson (1978) demonstrated that 2-month-old infants are sensitive to differences in place of articulation, and are capable of distinguishing sound sequences such as bada–gada, as well as differences in non-initial phonemes, as in daba–daga.

Another point in defense of using phonemes (or any other basic unit) is the fact that the choice of phonemes does not necessarily change the nature of the computation: for most algorithms, the principles are equally applicable to any basic unit of choice. One last point to elaborate here is that assuming syllables to be indivisible basic units has its own problems in lexical segmentation. First, unless the input is syllabified in an unnatural way, word boundaries in natural speech do not always correspond to syllable boundaries. For example, the natural syllabification of the utterance what's a kitty? will likely have /w2t/ and /z@/ as its first two syllables. Depending only on syllable-sized units, a hearer who perceives /w2t/ and /z@/, but represents these words in the lexicon as /w2tz/ (what's) and /@/ (a), will fail to recognize them. Similarly, without a method to segment the stream into sub-syllabic units, productive morphemes that do not contain a vowel cannot be extracted as possible lexical items. The input representations also vary in how the basic units are encoded. Phonemes are most commonly represented as individual symbols. However, there are also a few (connectionist) models that represent each phoneme as a set of phonetic features (Aslin et al., 1996; Cairns et al., 1994; Christiansen et al., 1998). The phonetic feature-based representation is particularly useful for models that guess word phonotactics from utterance boundaries, since it allows models to exploit information about the similarities of individual phonemes. For example, a feature-based representation that includes vowel and consonant features for each phoneme can learn that consonant-vowel-consonant (CVC) sequences are more likely than CCC sequences. A symbolic representation that assigns arbitrary symbols to each phoneme would require more data to arrive at similar generalizations. Another issue regarding the input is the representation of variability in the speech signal. Most work in computational modeling uses transcribed child-directed speech, which contains virtually no variability: in most cases, all words are transcribed using canonical phonetic or phonological forms. A number of models have attempted to introduce variability, though all such attempts create the variability in an artificial way. Cairns et al. (1994) introduced variability by flipping an unspecified number of random bits in their phonetic feature vector representation. In a more recent study, Monaghan and Christiansen (2010) processed orthographic transcriptions of child-directed speech with a speech synthesizer, which produced phonemic transcriptions of words that vary depending on the context. In another recent model, Rytting et al. (2010) ran an automatic speech recognizer (ASR) on raw audio input. The ASR produced a vector of probabilities for each phoneme. If the speech input is clear, the output indicates a single phoneme with very high probability, and the rest of the phonemes are assigned near-zero probabilities.
However, if the ASR is not able to determine the exact identity of the phoneme, more than one phoneme is likely to be assigned a non-zero, possibly approximately equal, probability.
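The contrast between these input representations can be made concrete with a small sketch. All feature names and numeric values below are invented for illustration; they are not the encodings used by any of the models cited above.

    # Three ways of presenting the same phoneme sequence to a model.
    # All feature names and probabilities here are illustrative assumptions.

    # 1. Symbolic: each phoneme is an atomic, arbitrary symbol.
    symbolic_input = ["k", "a", "t"]

    # 2. Feature-based: a phoneme is a bundle of phonetic features, so a
    #    model can generalize over classes (e.g., consonants vs. vowels).
    feature_input = [
        {"consonant": 1, "voiced": 0, "stop": 1},   # /k/
        {"consonant": 0, "voiced": 1, "stop": 0},   # /a/
        {"consonant": 1, "voiced": 0, "stop": 1},   # /t/
    ]

    # 3. ASR-style: each position is a probability distribution over
    #    phonemes; unclear speech spreads the probability mass.
    asr_input = [
        {"k": 0.97, "g": 0.02, "t": 0.01},          # clearly a /k/
        {"a": 0.99, "e": 0.01},
        {"t": 0.48, "d": 0.45, "k": 0.07},          # /t/ or /d/? unclear
    ]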

5.2 Processing strategy

As mentioned before, the computational models of learning segmentation can be divided into two broad groups according to the strategy they use. Some of the models search for boundaries, and lexical units are identified as a side effect of discovering boundaries. I will call this the prediction (or boundary-guessing) strategy. Brent (1999b) distinguishes two sub-groups, one using predictability as the source of prediction, and the other using utterance boundaries. These two sources of information are frequently combined in models that guess boundaries. Regardless of the source of information they use, all models that guess boundaries will be classified under this group. The second group of models tries to recognize lexical units; this strategy will be called the recognition strategy. A large subset of the models in this second group will be said to follow a language-modeling strategy, for reasons that will be explained shortly.

5.2.1 Guessing boundaries

In its simplest form, e.g., as used by Harris (1955), this strategy is strongly related to the strategies suggested by experimental studies like Saffran et al. (1996a). In natural language processing, similar methods have been used for morphological segmentation, i.e., segmenting words into their morphemes (e.g., Al-Shalabi et al., 2005; Bordag, 2005; Hafer and Weiss, 1974; Stein and Potthast, 2008), and for segmentation of words from texts that do not include white space (e.g., Ando and Lee, 2003). However, the use of this strategy in computational models of human segmentation is rather scarce. Except for a number of connectionist systems that implement it implicitly, the only use of a predictability-based strategy that I am aware of is by Brent (1999a), where a simple mutual-information-based segmentation model was presented as a baseline. If we include the models that guess boundaries using other means (Fleck, 2008; Monaghan and Christiansen, 2010), the count goes slightly up. However, the use of the prediction strategy in explicit statistical models is relatively unexplored. Starting with Elman's (1990) seminal work that introduced simple recurrent networks (SRNs), the major representatives of the prediction strategy have been connectionist systems (Aslin et al., 1996; Cairns et al., 1994; Christiansen et al., 1998; Elman, 1990). Elman (1990) used a simple segmentation task as a demonstration of the capabilities of SRNs. SRNs are standard feed-forward artificial neural networks except for a simple augmentation: additional so-called context units keep a copy of the hidden layer from the previous step, which is fed back to the hidden layer along with the current input. This allows SRNs to generalize from past input as well as the current input. Typically, the task of an SRN is to predict the next input in the sequence. Elman used a 5-bit representation for the input: each letter of the English alphabet was mapped to an arbitrary 5-bit binary string. The SRN was trained and tested on artificially generated English-like sentences. After training, the error rate (root mean square error calculated on the output units) was lower when the SRN had to guess the next letter within the same word, but higher when it had to guess the first letter of the next word. Cairns et al. (1994) trained another type of recurrent network using a similar prediction task, but using CDS as input. In the Cairns et al. (1994) study, the input was represented with the phonetic features of government phonology (Harris and Lindsey, 1995). An advantage of the boundary-guessing strategy is that it can include more sources of information in a natural way. This was demonstrated by another SRN model by Christiansen et al. (1998). In this model the network was trained using a prediction-based strategy. However, the input was marked for lexical stress, and an explicit utterance-boundary unit was included in the input and the output. The task for the system was to predict the input, including the existence of an utterance boundary, which was turned on at the input layer only for the last phoneme of the utterance. The study showed that the network's prediction of an utterance boundary was higher at word boundaries than at non-boundary locations. Aslin et al. (1996) present a non-SRN connectionist model: a standard feed-forward neural network that took a three-phoneme input window (each phoneme coded as 18 binary phonetic features) at each time step.
An additional input unit indicated the existence of an utterance boundary after the given three phonemes. There was only a single output unit, which indicated an utterance boundary. The network was trained to find utterance boundaries. During testing, the output unit showed higher activation levels at word boundaries than at word-internal phonemes. A recent model that uses explicit statistics for guessing boundaries was presented by Fleck (2008). This model learns the patterns that occur at the beginnings and ends of utterances in order to guess possible beginnings and ends of words, and hence the word boundaries. In a sense, the model learns a form of phonotactics to guess word boundaries. However, the simulations presented in this study used large prefix and suffix lengths, probably allowing the model to learn complete words and phrases. I will discuss this study further in Chapter 7. The last study worth noting here is that of Monaghan and Christiansen (2010). Unlike the models reviewed in this section so far, the model presented in this study uses previously learned words as the main source of information. However, unlike the models reviewed in the next section, it searches for boundaries in an incremental fashion. Further discussion of this model will be provided in Chapter 8. In principle, the models using the boundary-guessing strategy allow on-line processing of the input. That is, these models do not have to wait until the utterance boundary to posit a word boundary. In practice, this is somewhat more complicated. The actual threshold for deciding on a boundary tends to be set globally (for example, over the mean activation level in the connectionist models described here). These models also typically require a relatively large amount of training data before they achieve the levels of success reported in these studies. A truly unsupervised strategy is segmenting at the locations where there is a peak in unpredictability: an increase in unpredictability followed by a decrease. In principle, one may use such peaks as the boundary criterion, removing the need for threshold values. The use of peak-based boundary decisions will be investigated in detail in Chapter 6. It is also interesting that the only cue-combination approach tried in the literature uses connectionist models. Although the recognition-based models discussed below can be argued to incorporate information from different sources, the possibilities for integrating new cues are limited barring fundamental changes to the model architecture. The models of the sort reviewed in this section, on the other hand, provide a natural way of combining multiple cues. An apparent weakness of most of the models discussed in this section is that they do not build an explicit lexicon, which means that they cannot use information from already extracted lexical units. In augmenting the boundary-guessing models with a lexicon, the models with explicit representations seem to be at an advantage in comparison to the connectionist models. Even though modifications to connectionist models to include information from a lexicon or other higher-level information sources are possible, in practice these models seem to work best with lower-level perceptual cues.
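To make the peak criterion concrete, the sketch below segments unsegmented utterances at local dips in bigram transitional probability, i.e., at peaks in unpredictability. It is a minimal illustration of the general idea under maximum-likelihood estimates, not a reimplementation of any of the models cited above.

    from collections import Counter

    def segment_at_peaks(utterances):
        """Posit a boundary wherever unpredictability peaks: the bigram
        transitional probability dips below both of its neighbors."""
        unigrams, bigrams = Counter(), Counter()
        for u in utterances:                       # collect statistics
            unigrams.update(u[:-1])
            bigrams.update(zip(u, u[1:]))

        def tp(x, y):                              # P(y | x), maximum likelihood
            return bigrams[(x, y)] / unigrams[x] if unigrams[x] else 0.0

        result = []
        for u in utterances:
            tps = [tp(x, y) for x, y in zip(u, u[1:])]
            # A dip in transitional probability is a peak in unpredictability.
            cuts = {i + 1 for i in range(1, len(tps) - 1)
                    if tps[i] < tps[i - 1] and tps[i] < tps[i + 1]}
            words, start = [], 0
            for cut in sorted(cuts) + [len(u)]:
                words.append(u[start:cut])
                start = cut
            result.append(words)
        return result

Note that no threshold is needed: the decision is purely local, which is what makes the peak criterion attractive for on-line processing.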

5.2.2 Recognition strategy

Except for the few models discussed above, the majority of the computational models of segmentation in the literature use a recognition strategy. These models try to identify recurring strings in the input as candidate words. Even though the theoretical motivations differ, common to all these approaches is an objective function that indicates what a 'good word' is. The best segmentation is then defined as one that favors the use of words with higher likelihood. A common approach is to define a representation scheme in line with the minimum description length principle (MDL, Rissanen, 1978). The aim is then to find the representation that minimizes the 'code length' of both the lexicon and the input corpus. The method has been applied to the segmentation of transcribed child-directed speech (Brent and Cartwright, 1996), as well as to many problems in natural language processing (e.g., Goldsmith, 2001, 2006, for morphological segmentation and analysis). The MDL-based models are appealing as they can be considered a formalization of the well-known Occam's razor, in that they prefer simpler solutions. They also have strong ties to data compression, since finding the minimum representation scheme will compress the data. In practice, MDL-based approaches work best in batch systems, and in many ways they are similar to the probabilistic approaches I will describe in detail below. Another related approach in recent models of segmentation is to define a probabilistic generative model that is hypothesized to have generated the data. Finding the most probable segmentation under this model then tends to give rise to a good segmentation of the linguistic data. In principle, the model can be arbitrarily detailed, capturing all relevant aspects of the system being modeled. Once the generative model is defined, one can assign probabilities to the segmentations of a given sequence. As in the Bayesian models of Brent (1999a) and Goldwater et al. (2009), some models define the generative model explicitly and find the (approximately) highest-probability segmentation of the complete corpus. In doing so, one simply assigns probabilities to segmentations (sequences of words) under the generative probabilistic model. A number of models, on the other hand, assign probabilities to segmentations in a similar way, but without defining an explicit generative probabilistic model (e.g., Batchelder, 2002; Venkataraman, 2001). Below, I will go through the formulation of a simple segmentation model that assigns probabilities to the segmentations of a given string. The models in the literature vary in the way they model different aspects of the task. However, the use of so-called language models (Jurafsky and Martin, 2008, chapter 4) for sequences of words and phonemes forms the skeleton of many successful segmentation models that follow a recognition strategy. The model defined by Equations 5.1 and 5.2 below demonstrates this modeling practice. Assuming a segmentation s is composed of the lexical units w_1 ... w_n, the probability of the segmentation is calculated as

    P(s) = \prod_{i=1}^{n} P(w_i)                                          (5.1)

The probabilities of individual lexical units are calculated using

    P(w) = \begin{cases}
             (1 - \alpha)\, f(w)            & \text{if } w \text{ is known} \\
             \alpha \prod_{i=1}^{m} P(a_i)  & \text{if } w \text{ is unknown}
           \end{cases}                                                      (5.2)

where f(w) is the empirical probability, or relative frequency, of w; the sequence a_1 ... a_m is the sequence of phonemes that form an unknown lexical unit; and P(a_i) is typically estimated using the relative frequency of the phoneme a_i. This formulation means that if the proposed lexical unit is already known, the probability of each possible lexical unit P(w_i) is estimated using maximum likelihood estimation. More elaborate models tend to estimate this probability using higher-order n-grams, taking the effects of sentential context into consideration. Higher-order n-grams change the way individual word probabilities are calculated, but such models are still conceptually the same as the basic model defined above. In case the proposed lexical unit is unknown, the model falls back on the phonemes that form the lexical unit. Except for more elaborate treatments of phonotactics, which typically employ phoneme n-grams, the formulation here is identical to that of a large number of models in the literature. The coefficient α in Equation 5.2 determines the probability of seeing a new lexical unit. Most models decrease α as the model commits to more lexical units, weakening the model's preference for new units as more lexical units are seen. This formulation gives rise to two tendencies, simply because multiplying more probabilities (real numbers in the range [0, 1]) results in smaller numbers. First, Equation 5.1 sets a preference towards segmentations composed of a smaller number of lexical units.

If left uncontested, this probability assignment would prefer whole utterances as lexical items, causing extreme undersegmentation. Equation 5.2, on the other hand, prefers shorter lexical items: everything else being equal, it prefers segmentations where each word is a single phoneme, causing extreme oversegmentation. These two preferences, in effect, work similarly to the MDL-based approaches. The undersegmentation tendency imposed by Equation 5.1 has the same effect as the preference towards the shortest encoding of the corpus, while the oversegmentation tendency imposed by Equation 5.2 has the same effect as the preference towards the shortest encoding of the lexicon. In general, this type of modeling makes use of two n-gram language models: one models the sequences of words that form an utterance, and the other models the phonemes that form a word. For this reason, I will refer to this type of segmentation strategy as the language-modeling (LM) strategy. Despite the fact that language models have been useful in many natural language processing applications, these models have a number of shortcomings. In this work, we are particularly concerned with two problems of the n-gram models: one cannot incorporate arbitrary features of the input, and the search strategy typically used with these models is implausible as a model of human performance. A reference implementation of an LM-based segmentation model will be presented in Section 5.5.
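For concreteness, the probability assignment of Equations 5.1 and 5.2 can be transcribed into code directly. The sketch below works in log space to avoid numerical underflow; the function name and the representation of the lexicon as a dictionary of counts are my own assumptions, not taken from any of the models cited above.

    import math

    def segmentation_logprob(words, lexicon, phoneme_probs, alpha):
        """Log probability of a segmentation (a list of words) following
        Equations 5.1 and 5.2. `lexicon` maps known words to token counts;
        unknown words fall back on their phoneme probabilities."""
        total = sum(lexicon.values())
        logp = 0.0
        for w in words:
            if w in lexicon:   # known: (1 - alpha) * relative frequency f(w)
                logp += math.log((1 - alpha) * lexicon[w] / total)
            else:              # unknown: alpha * product of phoneme probabilities
                logp += math.log(alpha) + sum(
                    math.log(phoneme_probs[a]) for a in w)
        return logp

Because Equation 5.1 contributes one factor per word and Equation 5.2 one factor per phoneme, scoring, say, ["a", "kitty"] against ["akitty"] with such a function exhibits exactly the two opposing tendencies discussed above.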

5.3 The search space

Given a sequence (e.g., of phonemes), segmenting it can be characterized as finding the best among all possible segmentations. The brute-force solution to this problem is to enumerate all possible segmentations of the string, compare them according to a criterion, and pick the best-scoring segmentation. However, this brute-force approach faces a serious problem: the number of possible segmentations for a sequence composed of n units is 2^(n-1). Searching through all 2^(n-1) possible segmentations is not feasible except for short strings. To give an idea of what a brute-force hypothesis generation method is dealing with, Table 5.1 lists the number of possible segmentations for utterance lengths of 1, 10, 20, 30, 40, 50, and 100,000 (the approximate number of phonemes in the reference corpus used in this study). A brute-force approach needs to consider over 5.6 × 10^14 segmentations for a 50-phoneme utterance. Assuming that we can process (e.g., calculate a score for and compare to the best so far) a million segmentations per second, processing a 50-phoneme utterance would take over 17 years. Furthermore, adding a single phoneme doubles this time. Since batch segmentation algorithms search through all possible segmentations of a complete corpus instead of a single utterance at a time, they face an even more difficult problem. Clearly, a brute-force comparison of all possible segmentations of a given utterance, or corpus, is both computationally infeasible and cognitively unappealing.

    length (n)    substrings n(n+1)/2    segmentations 2^(n-1)
    1                         1                              1
    10                       55                            512
    20                      210                        524,288
    30                      465                    536,870,912
    40                      820                549,755,813,888
    50                     1275            562,949,953,421,312
    100,000       5,000,050,000               ≈ 5.0 × 10^30102

Table 5.1: Number of substrings and possible segmentations of an utterance of length n.

Since the brute-force search through all segmentations is computationally intractable, either algorithms that reduce the time complexity with some additional memory usage, or approximate search algorithms that do not guarantee the globally optimal solution, are commonly used in the field.
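The figures in Table 5.1 are easy to reproduce: each of the n − 1 positions between units either receives a boundary or not. The sketch below enumerates segmentations by exactly this device, which is only usable for short strings; that is the point.

    from itertools import product

    def all_segmentations(s):
        """Enumerate all 2**(len(s)-1) segmentations of a string: each of
        the len(s)-1 internal positions is either a boundary or not."""
        for cuts in product((False, True), repeat=len(s) - 1):
            words, word = [], s[0]
            for ch, cut in zip(s[1:], cuts):
                if cut:
                    words.append(word)
                    word = ""
                word += ch
            words.append(word)
            yield words

    print(sum(1 for _ in all_segmentations("akitty")))   # 32, i.e. 2**5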

5.3.1 Exact search using dynamic programming

Even though the number of possible segmentations is exponential in the size of the string, all possible contiguous substrings of a string of size n amount to only n(n+1)/2. For a string of size n, there is only one possible substring of length n, the string itself. For example, for the string akitty, the only substring of length n is akitty. There are two substrings of size n − 1, {akitt, kitty}, three substrings of size n − 2, {akit, kitt, itty}, four substrings of size n − 3, {aki, kit, itt, tty}, and so on. In general, for a string of size n there are \sum_{i=1}^{n} i = n(n+1)/2 possible substrings. This number grows much more slowly than the number of segmentations, which gives some indication that dynamic programming algorithms (algorithms that avoid re-calculating the same values at the expense of some additional memory use) can be useful. This search strategy has been used in a number of incremental models in the literature (e.g., Brent, 1999a; Venkataraman, 2001); we will briefly describe the first one. Brent's algorithm finds the best segmentation of an utterance of length n in time proportional to n^2, using memory proportional to 2n. The algorithm assigns a probability to all initial substrings of the utterance given past input and the model assumptions. As in Equation 5.1, the probability of a string is the product of the probabilities of the lexical units that form it. The dynamic programming algorithm first calculates these probabilities for each possible substring starting from the beginning of the utterance, storing the best probability value and the starting point of the last word of the best segmentation of each substring. Using the stored information, the algorithm then finds the best last word, accepts it as part of the segmentation, moves to the phoneme before the start of this word, and repeats the same exercise until the beginning of the utterance is reached. For example, upon seeing the string akitty, Brent's algorithm assigns probability values to all initial substrings: a, ak, aki, akit, akitt, akitty. For each substring, the best segmentation is found by considering all possible binary splits; the probability of a substring is the product of the best segmentation probability of the initial part and the lexical probability of the final part. The algorithm demonstrates the use of dynamic programming to find exact solutions without exponential time complexity. However, this comes with a number of additional limitations. First, the particular algorithm described here is possible because of the assumption that the words in an utterance are independent. It is possible to lift or weaken the independence assumption by introducing additional complexity, but the attempts so far (e.g., the bigram and trigram models of Venkataraman, 2001) did not show substantial improvements. This type of algorithm is also limited to finding the single best segmentation, although it can be modified to find the n-best segmentations by introducing additional time and memory complexity. Another complication arises from the learning setting. The algorithm finds the best segmentation according to the prior knowledge of the learner. Since the prior knowledge of the learner is incomplete, the details of the search procedure affect the performance of the learner.
For example, Brent’s search procedure performs best when words at the end of the utterance are known (assuming known words on average get higher probabilities than unknown words). This is probably an arbitrary choice, but incidentally it conflicts with the findings of Aslin(1993) that caregivers tend to place the words that they want to teach their children at the end of the utterances. The use of dynamic programming to reduce the exponential time complexity of the search problem to polynomial time complexity is attractive for the computational process. However, one expects humans to reduce the search space, rather than com- paring all possible segmentations in a more efficient way. For example, comparing all splits of the utterance and its substrings is likely to be a more laborious task than the task humans take during speech segmentation. The hypothesis space is likely to be restricted by the prior knowledge of the learner, and most segmentation hypotheses are likely to be not considered at all.

5.3.2 Approximate search procedures

The batch models benefit from having more data at once to decide on a segmentation that is consistent with the complete input. However, the size of the hypothesis space is even larger for batch algorithms that try to find a single best segmentation of the whole corpus. For incremental segmentation models, dynamic programming allows exact search with reasonable computational complexity at the expense of additional memory. For batch segmentation models, however, time and memory complexity become intractable for dynamic programming algorithms as well. The last line of Table 5.1 gives an impression of both the time and space complexity of a batch algorithm on a relatively small corpus. Hence, batch algorithms in particular need to resort to approximate search procedures. One way to do this is to use heuristics to explore a more relevant part of the complete search space (e.g., Brent and Cartwright, 1996). Generic search methods, such as the sampling techniques used in machine learning, are another way of finding good segmentations without exploring the complete search space (e.g., Goldwater, 2006; Goldwater et al., 2009). The insights offered by the batch algorithms are particularly suitable for answering what questions, giving relatively little insight into how questions, i.e., what information is in principle available, but not how a learner exploits it. This limitation is due to the fact that the batch models do not fit the observation that human language processing and acquisition is incremental. Humans do not even wait until the end of the current utterance to guess a word, or a word boundary. Assuming that a large corpus is necessary before taking any segmentation decisions is certainly not in accordance with what we know about human segmentation performance. Furthermore, neither the heuristics nor the general sampling-based methods of finding approximate solutions provide further insights into what sort of hypothesis space humans might be exploring. Having stated the shortcomings of batch models as psychologically plausible models of human performance, it is interesting to see that they can search through a space of linguistically relevant segmentation hypotheses using search procedures that have no notion of linguistics. This is possible because of explicitly stated assumptions about the generative model that produces the utterances and about the structure of the input. Searching through only a restricted part of the search space is common to batch segmentation models. However, it is also frequently used by some of the incremental models, which restrict the search space using certain heuristics (or constraints). The models that build up from primitives using synthetic algorithms limit the hypothesis space by considering only sequences that have occurred in the input before (e.g., Batchelder, 2002; de Marcken, 1996; Olivier, 1968; Perruchet and Vinter, 1998). At the beginning, these algorithms initialize the lexicon to contain only the primitive units, e.g., phonemes. The algorithms then consider as candidates only those (sub)strings that can be obtained by binary combination of the current lexical items. This way the lexicon, and in effect the previous input, constrains the hypothesis space to be explored, as the sketch below illustrates.
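The following is a minimal sketch of one pass of such a lexicon-driven learner. The acceptance criterion is deliberately left abstract, since it is what distinguishes the individual models; the function name and the left-to-right merging order are my own simplifying assumptions.

    def synthetic_pass(units, lexicon, accept):
        """One left-to-right pass of a synthetic learner: only binary
        combinations of adjacent items already in the lexicon are
        considered as new lexical candidates, so the current lexicon
        constrains the hypothesis space. `accept(candidate, lexicon)`
        stands in for a model-specific criterion (e.g., frequency)."""
        merged, i = [], 0
        while i < len(units):
            if i + 1 < len(units):
                candidate = units[i] + units[i + 1]
                if accept(candidate, lexicon):
                    lexicon.add(candidate)       # commit the new unit
                    merged.append(candidate)
                    i += 2
                    continue
            merged.append(units[i])
            i += 1
        return merged

Starting from a lexicon that contains only the primitive units (e.g., phonemes), repeated passes allow larger and larger chunks to enter the lexicon.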
The lexically driven hypothesis-space exploration in synthetic models seems plausible; yet the arbitrary and completely mechanical binary combination of lexical units into possible new lexical units again produces a large number of hypotheses to consider, and its psychological plausibility is questionable. Furthermore, the method conflicts with the evidence that adults and children tend to start with larger blocks (such as complete utterances) as lexical items and break them into smaller units when further evidence supports it (Bannard and Matthews, 2008; Dahan and Brent, 1999; MacWhinney, 1982; Tomasello, 2000).

5.3.3 Search for boundaries

The search strategies discussed in the previous subsections belong to models that try to find the best segmentation given an unsegmented utterance. However, some models, particularly the ones that follow the prediction strategy (e.g., Christiansen et al., 1998; Elman, 1990; Monaghan and Christiansen, 2010), do not search directly for the best segmentation of the given utterance. In essence, these models find an approximation to the best segmentation of the input utterance, but they do not view the segmentation problem as finding the best segmentation of an utterance. Instead, they search for boundaries in an on-line fashion, without considering multiple alternative segmentations of the input utterance. The obvious benefit of these models is their computational simplicity: they try to identify the boundaries without explicitly identifying lexical units. Since most of these models do not build and make use of a lexicon, they tend to perform worse than the other models (see Table 5.2 for a performance comparison). In this thesis, I will present a model that follows a similar search strategy while performing close to the state-of-the-art models that use the search strategies described above.

5.4 Performance and evaluation

As with other models of the acquisition of natural languages, we know rather little about our target, the human lexicon. However, everything else being equal, we prefer models that perform well against a theoretical gold standard. Even though some recent studies in the field have made advances that allow easier comparison, comparing the performance of segmentation models still faces a number of difficulties. The first difficulty is the lack of a standardized gold-standard corpus; a large number of different corpora have been used by different studies. The candidate standard corpus seems to be a version of the corpus collected by Bernstein Ratner (1987), phonologically transcribed and processed by Brent and Cartwright (1996) (see Appendix A for additional information on this corpus). This corpus will be used for the evaluation of the models developed in this study. The second difficulty arises from the lack of a common measure of performance, particularly in earlier studies. As in natural language processing and machine learning, reporting the performance scores traditionally used in information retrieval research, precision (P) and recall (R), is becoming the norm in computational models of cognition as well. These performance scores are based on four counts indicating success or failure of the model's output compared to a gold standard.

• True positives (TP) is the number of items of interest correctly identified. For example, if the model is guessing word boundaries, the number of correct boundaries the model found is the TP. This quantity is also called hits.

• False negatives (FN) is the number of items that the model failed to identify, for example, the word boundaries that the model could not find. This is also called misses.

• False positives (FP) is the number of items that the model incorrectly suggested; for boundaries, the number of times the model suggested a boundary in an incorrect location. This is also called false alarms.

• True negatives (TN) is the number of cases where the model was correct in not identifying an item, e.g., a boundary. This quantity is not used in the calculation of precision and recall; however, it will be relevant for our later discussion.

The precision and recall scores are calculated using the formulas:

    \text{precision} = \frac{TP}{TP + FP} \qquad\qquad \text{recall} = \frac{TP}{TP + FN}

Precision can be seen as a measure of exactness, and it is sometimes called accuracy in the cognitive science literature.2 Recall is a measure of completeness, and is sometimes called that in the cognitive science literature. In informal terms, high precision means that the model has found only correct items, but many relevant items may have been missed. High recall, on the other hand, means that the model has not missed much, but it may have suggested many irrelevant items. To give a balanced indication, a derived measure, the f1-score, is used: the harmonic mean of precision and recall,

    f_1\text{-score} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}

The subscript '1' indicates that the measure gives equal weight to precision and recall. In its more generic original formulation, the fα-score gives higher weight to recall for higher values of α, while lower values give higher weight to precision (van Rijsbergen, 1979).3 Since we have no reason to prefer precision or recall over the other, all f-scores presented in this thesis are f1-scores; the subscript is dropped in the remainder of the thesis, and the abbreviation 'F' is used to denote the f1-score. As in recent studies of computational segmentation, three different types of precision and recall values are distinguished in this thesis.

2 Unfortunately, accuracy is ambiguous in the cognitive science literature. The more common definition of accuracy in many branches of science is different from precision; the widely used definition of accuracy for our case is (TP + TN) / (TP + FP + TN + FN).
3 With regard to the f-score, an unfortunate typo is in common use in the computational segmentation literature: a number of papers use the term f0-score, or F0, instead of f1-score. Although it is unlikely to cause any confusion, according to the original definition the f0-score does exist, and it is equivalent to precision.

• Boundary precision (BP) and boundary recall (BR) calculations count a boundary that matches the gold-standard segmentation as a true positive; mistakenly proposed boundaries that do not exist in the gold standard are false positives, and boundaries that are in the gold standard but not spotted by the model are false negatives. Since utterance boundaries are clearly marked, they are not included in the calculation of the boundary scores, so that segmentation models get no credit for stating the obvious.4 The f-score calculated using BP and BR will be denoted BF.

• Token, or word, precision (WP) and token recall (WR) scores require both boundaries of a word to be found for it to count as a true positive. Likewise, words suggested by the model but not in the gold standard are FPs, and words that the model could not segment correctly are FNs. The token scores are naturally lower than the boundary scores. The f-score calculated from WP and WR will be denoted WF.

• Type, or lexicon, precision (LP), type recall (LR) and type f-score (LF) are similar to the token scores; however, the comparisons are made over the word types proposed by the model and the word types in the gold standard. These scores are typically lower than the token scores. If a model does a good job only at segmenting high-frequency words (e.g., function words), the type scores will be much lower than the token scores; if the model is good at segmenting low-frequency words as well, the type scores will be closer to the token scores. If the model is particularly bad at segmenting high-frequency words but good at segmenting low-frequency words, the type scores can even be higher than the token scores.

All the segmentation models we are interested in use unsupervised learning methods, in the sense that the algorithms do not have access to information about the real boundary locations. As a result, it is common practice to present results on a single data set without a training–test separation. In some cases, however, the same practice has been followed mistakenly. Even if a model is unsupervised, it collects information from the corpus during the learning process. Hence, if an algorithm makes multiple passes over the data, the information collected in previous passes affects, and likely improves, the scores obtained. The simulations used in this study do not make multiple passes over the data, or use any other source of information, without explicit notice. To give a rough idea of how well the models in the literature perform, Table 5.2 presents a few models and the associated measures found in the studies presenting them. Where a study reported multiple models, the model with the highest lexicon f-score is presented. Four of the models (Blanchard et al., 2010; Brent, 1999a; Goldwater et al., 2009; Venkataraman, 2001) were chosen because their use of the same corpus makes them easily comparable; these models also use relatively similar strategies and input representations. Two additional models (Christiansen et al., 1998; Rytting et al., 2010) that use a different strategy and a different input representation are also presented.

4 This is not always clear in the results reported in the literature.

                                          boundary            word              lexicon
    model                       corpus    P     R     F      P     R     F     P     R     F
    Christiansen et al. (1998)  Korman   70.2  73.7  71.9   42.7  44.9  43.8    –     –     –
    Brent (1999a)               BR       80.3  84.3  82.3   67.0  69.4  68.2   53.6  51.3  52.4
    Venkataraman (2001)         BR       81.7  82.5  82.1   68.1  68.6  68.3   54.5  57.0  55.7
    Goldwater et al. (2009)     BR       90.3  80.8  85.2   75.2  69.6  72.3   63.5  55.2  59.1
    Blanchard et al. (2010)     BR       81.4  82.5  81.9   65.8  66.4  66.1   57.2  55.4  56.3
    Rytting et al. (2010)       B&S      54.1  64.8  59.0   21.1  25.3  23.0   13.7  34.3  19.6

Table 5.2: Comparison of the performance of some computational models. 'BR' in the corpus column is the Bernstein Ratner (1987) corpus processed by Brent and Cartwright (1996); the Korman corpus (Korman, 1984) is another relatively popular corpus in the computational segmentation literature; B&S is the corpus collected by Brent and Siskind (2001). Where a study reported multiple models, the model with the highest lexicon f-score is presented.

Christiansen et al. (1998) exemplifies a connectionist (multiple-cue combination) model that uses phonetic feature vectors as input. The models presented in Rytting et al. (2010) are based on Christiansen et al. (1998), but they introduce variation in the input by processing acoustic input with an automatic speech recognizer. It should be noted, however, that these performance numbers may not be directly comparable because of the use of a different corpus. The use of different corpora is not the only source of difficulty in comparison. For example, the results from Christiansen et al. (1998) and Rytting et al. (2010) presented in Table 5.2 were obtained after a certain amount of training, whereas the rest of the studies report scores calculated without an initial training period. There is another, more hidden, difficulty in comparing incremental and batch models. The batch models process the complete corpus, typically many times, before they produce any output; hence, all output from a batch model reflects the final state of the model. The incremental models, however, start with little or no knowledge and learn during the course of segmentation. During the initial phases these models are expected to make mistakes, and as they learn, their performance improves. Hence, even if the final state of an incremental model performs better than a batch model, a simple comparison of the values in Table 5.2 will not show this. Some studies present the progression of the incremental models as more input is provided. However, the progression of performance improvement, or the performance in the final phases of the learning process, is not consistently reported in the literature; for this reason it was not included in Table 5.2. Performance reports of the models described in this thesis will include indications of the progression of model performance throughout the learning process. Precision, recall and f-score are standard measures that are well understood and have proven useful in the literature. However, it is often more insightful to study where a system fails. For this reason, I will describe two error measures relevant to segmentation, and report them along with the precision, recall and f-score values for the models developed in this study. A segmentation error can have one of two causes. First, the model may fail to detect a boundary, causing undersegmentation. Second, the model may insert a boundary where there is none, causing oversegmentation. Simple counts of oversegmentation and undersegmentation errors change with the size of the corpus, so they are not comparable across simulations run on different corpora. Furthermore, in a typical corpus there are more word-internal positions than boundaries; as a result, there are more chances to make an oversegmentation error than an undersegmentation error.
To overcome these difficulties we will use the following error measures for oversegmentation and undersegmentation, respectively:

    E_o = \frac{FP}{FP + TN} \qquad\qquad E_u = \frac{FN}{FN + TP}

where TP, FP, TN and FN are the true positives, false positives, true negatives, and false negatives, respectively. If a single measure is desired, a combined error measure, e.g., the harmonic mean of Eo and Eu, could also be defined, similar to the definition of the f-score. When there is no particular reason to prefer reducing one type of error, such a measure may be used for model selection. However, since reporting both measures is more informative, and a combined measure can be calculated trivially from the two, no combined measure will be reported in this thesis. In plain words, Eo is the number of false boundaries inserted by the model divided by the total number of word-internal positions in the corpus, and Eu is the ratio of missed boundaries to the total number of boundaries. The two error measures are related to precision and recall, but the quantities cannot be derived from each other directly. Undersegmentation reduces true positives, which in turn reduces both precision and recall. Oversegmentation, on the other hand, increases false positives, which affects precision adversely but has no effect on recall. As a result, good recall with bad precision is a typical sign of oversegmentation, while bad precision together with bad recall is likely to be due to undersegmentation. So the error measures are reflected to some extent in precision and recall, but it will be useful to examine them directly as well. A last note about the performance scores discussed in this section is that they all take values between zero and one. However, as in Table 5.2, it is common to present them as percentages; this way, the space available can be used more efficiently for significant digits. In this thesis, all values in the tables are percentages, and the values in the graphs are absolute scores (between zero and one).
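All of the scores in this section derive from the same four counts. The sketch below computes the boundary scores and both error measures from sets of boundary positions; following the convention described earlier, utterance-final boundaries are assumed to have been excluded beforehand.

    def boundary_scores(predicted, gold, n_positions):
        """Boundary precision, recall, f-score, and the error measures
        Eo and Eu, computed from sets of word-internal boundary positions.
        `n_positions` is the total number of candidate (word-internal)
        positions in the corpus, needed for the true negatives."""
        tp = len(predicted & gold)                # hits
        fp = len(predicted - gold)                # false alarms
        fn = len(gold - predicted)                # misses
        tn = n_positions - tp - fp - fn           # correctly left unsegmented
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        e_o = fp / (fp + tn) if fp + tn else 0.0  # oversegmentation
        e_u = fn / (fn + tp) if fn + tp else 0.0  # undersegmentation
        return p, r, f, e_o, e_u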

5.5 Two reference models

Ideally, the performance of a model of a human cognitive capacity should be evaluated based on its match with human performance. From this perspective, we should prefer models that segment as children do, including children's incorrect segmentations. However, we currently lack the theoretical understanding, the data, and the tools to do this in a realistic way. In any case, everything else being equal, we prefer models that perform better at the task in question. This is reasonable, since language learners eventually segment quite well. To evaluate our models, we need references to compare our models' performance against. A trivial way to show that a model does something relevant to the task at hand is to compare it with a model that makes random choices. A second method is to compare the model with a state-of-the-art alternative. In this section, I define two such models that will serve as references for the models developed in this study.

5.5.1 A random segmentation model

A trivial random model can be defined as one that makes a random boundary decision at each possible boundary location. For a boundary-guessing algorithm, performing consistently better than this model would already indicate that the algorithm is finding something relevant to the solution of the segmentation problem. However, it is customary in the speech segmentation literature (since Brent and Cartwright, 1996) to set the bar a little higher. The typical random baseline inserts boundaries with the probability of boundaries in the actual corpus; in other words, it inserts as many boundaries as the gold-standard segmentation, but at random locations. Throughout this thesis, performance scores obtained by this particular random model (RM) will be presented as a baseline reference. Note that the RM knows an important fact about the language that no other unsupervised model of segmentation knows: the average length of words (estimated from the corpus studied). Although the expected error rates Eo and Eu and the boundary scores are easy to calculate for the RM, direct calculation of the word and lexicon scores is not trivial. Table 5.3 presents all the performance scores discussed in Section 5.4 for both random procedures. Since the RM inserts boundaries at random, its performance varies; this variation is expected to be small for a large enough corpus. For additional reassurance, however, the results reported for the RM baseline are obtained by averaging 50 runs over the relevant corpus.
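The RM baseline itself takes only a few lines. The sketch below estimates the boundary probability from the gold segmentation and inserts boundaries at random word-internal positions with that probability; averaging the scores over repeated runs, as described above, is left to the caller. The function name and the corpus representation (a list of word lists) are illustrative assumptions.

    import random

    def rm_baseline(gold_corpus):
        """The RM baseline: insert boundaries at random word-internal
        positions, with the boundary probability estimated from the
        gold-standard segmentation of the same corpus."""
        n_boundaries = sum(len(words) - 1 for words in gold_corpus)
        n_positions = sum(len("".join(words)) - 1 for words in gold_corpus)
        p = n_boundaries / n_positions           # P(boundary) in the corpus

        segmented = []
        for words in gold_corpus:
            u = "".join(words)                   # the unsegmented utterance
            out, word = [], u[0]
            for ch in u[1:]:
                if random.random() < p:          # random boundary decision
                    out.append(word)
                    word = ""
                word += ch
            out.append(word)
            segmented.append(out)
        return segmented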

5.5.2 A reference model using the language-modeling strategy

Section 5.2.2 outlined a commonly used model of segmentation. Differing theoretical and practical motivations aside, most successful computational models assign probabilities to possible segmentations in a way closely resembling Equations 5.1 and 5.2 (repeated here as Equations 5.3 and 5.4 for convenience).

                 boundary           word             lexicon          error
    model       P     R     F     P     R     F     P     R     F    Eo    Eu
    random    27.4  50.0  35.4   8.6  13.6  10.5   7.4  38.1  12.4  50.0  50.0
    RM        27.4  27.0  27.2  12.6  12.5  12.5   6.0  43.6  10.5  27.1  73.0

Table 5.3: Performance scores of two random segmentation strategies. The scores in the first row are obtained by a random algorithm that decides on boundaries with probability 0.5. The RM algorithm, as described, inserts boundaries with the probability of observing boundaries in the reference BR corpus. The scores presented are averages of 50 runs; standard deviations for all scores were less than 0.01.

    P(s) = ∏_{i=1}^{n} P(w_i)                                        (5.3)

    P(w) = (1 − α) f(w)                if w is known
           α ∏_{i=1}^{m} P(a_i)        if w is unknown               (5.4)

where w_i is the i-th word in the sequence (utterance or corpus), a_i is the i-th sound in the word, f(w) is the relative frequency of w, and α is the only parameter of the model (discussed further below). For the incremental model defined here, a word is 'known' if it was used in a previous segmentation, and unknown otherwise (for a batch model these definitions depend on the definition of the generative model). The model accepts a whole utterance as a single word if the utterance does not contain any known words. The reason for this can easily be seen with an example. Assume that the input is a two-phoneme utterance /ab/, where the probabilities of the phonemes are Pa and Pb respectively. The probability of the two-word utterance /a b/ is αPa × αPb, while the probability of the single-word utterance /ab/ is αPaPb. Since α is a value between zero and one, the second probability will be higher, and the model will take the complete utterance /ab/ as a word. This result extends easily to longer sequences of phonemes: in general, this model will never segment an unknown sequence of phonemes. The major variants of this model include the use of larger word contexts, e.g., bigrams and trigrams, to calculate the known-word probabilities (e.g., Goldwater et al., 2009), or the use of more elaborate models of phonotactics (Blanchard et al., 2010). However, as can be seen in Table 5.2, the overall performances of these models are similar. The performance differences, when observed, are also likely to be due to processing and search strategies, as well as to the way the scores are calculated. In this modeling setup, α can be interpreted as the probability of seeing a novel word. If α is large, novel words get higher probability; if α is small, known words are preferred.
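The scoring scheme of Equations 5.3 and 5.4 can be sketched in Python as below. This illustrates only the probability calculation; a full model additionally needs a search over the possible segmentations of each utterance (e.g., by dynamic programming) and the incremental update regime described above. The class and its bookkeeping are assumptions of the sketch.

    from collections import Counter

    class LMSketch:
        """Word and segmentation probabilities as in Equations 5.3-5.4."""

        def __init__(self, alpha=0.5):
            self.alpha = alpha
            self.words = Counter()   # frequencies of previously used words
            self.phones = Counter()  # phoneme frequencies from those words

        def p_word(self, w):
            n = sum(self.words.values())
            if self.words[w] > 0:  # known word: (1 - alpha) * f(w)
                return (1 - self.alpha) * self.words[w] / n
            m = sum(self.phones.values())
            if m == 0:             # before any input; a real model would
                return self.alpha  # need e.g. uniform phoneme probabilities
            p = self.alpha         # novel word: alpha * product of P(a_i)
            for a in w:
                p *= self.phones[a] / m
            return p

        def p_segmentation(self, words):  # Equation 5.3
            p = 1.0
            for w in words:
                p *= self.p_word(w)
            return p

        def update(self, words):
            """Accept a segmentation, making its words 'known'."""
            for w in words:
                self.words[w] += 1
                self.phones.update(w)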

                                boundary           word             lexicon          error
    model                      P     R     F     P     R     F     P     R     F    Eo    Eu
    Brent (1999a)            80.3  84.3  82.3  67.0  69.4  68.2  53.6  51.3  52.4    –     –
    Venkataraman (2001)      81.7  82.5  82.1  68.1  68.6  68.3  54.5  57.0  55.7    –     –
    Goldwater et al. (2009)  90.3  80.8  85.2  75.2  69.6  72.3  63.5  55.2  59.1    –     –
    Blanchard et al. (2010)  81.4  82.5  81.9  65.8  66.4  66.1  57.2  55.4  56.3    –     –
    LM, α=0.1                87.2  66.4  75.4  66.2  55.0  60.1  34.5  66.9  45.6   3.7  33.6
    LM, α=0.3                85.1  79.8  82.3  71.8  68.6  70.2  45.8  63.4  53.2   5.3  20.2
    LM, α=0.5                84.1  82.7  83.4  72.0  71.2  71.6  50.6  61.0  55.3   5.9  17.3
    LM, α=0.7                82.1  84.7  83.3  70.1  71.6  70.8  52.6  57.3  54.9   7.0  15.3
    LM, α=0.9                78.2  85.9  81.9  64.9  69.4  67.1  48.2  47.6  47.9   9.0  14.1

Table 5.4: Performance scores of the baseline model with varying α, in comparison to other models using a similar strategy. All scores are obtained on the BR corpus.

This probability can be estimated from the type/token ratio (i.e., the ratio of the number of novel words seen so far to the number of all words seen so far). Some models in the literature (e.g., Venkataraman, 2001) use this intuition to remove the free parameter α. Even though a parameter-free model is indeed more desirable, as will be shown shortly, the relationship between the value of α and segmentation performance is not trivial.

The model defined by Equations 5.3 and 5.4 is used as a reference model for comparison with the other models developed in this study. The generic model with free parameter α (Equation 5.4) will be called LMα. The corpus used for testing the models in this thesis is the one used by many recent studies. It was collected by Bernstein Ratner (1987) and processed by Brent and Cartwright (1996); following the convention in the literature, it will be called the BR corpus. Some details about the corpus are presented in Appendix A.

Table 5.4 presents the performance of the LMα model on the BR corpus for varying α values, in comparison with other studies that used the same corpus (the results of the other models repeat those presented in Table 5.2). For most values of α, the performance of the LMα model is competitive with the recent models in the literature, performing better on some scores.5 Surprisingly, for a wide range of α values the differences in performance are rather small. To illustrate the effect of α on performance, the performance scores and the error rates of the model on the BR corpus for α values in the range [0, 1] are presented in Figure 5.1. The performance graphs confirm that over a wide range, changes in α affect the performance only slightly. The error graphs allow a better interpretation: increasing α decreases the undersegmentation errors while increasing the oversegmentation errors at a slower rate. This trend arises because, with higher values of α, the model gives more weight to novel words, and Equation 5.4 assigns higher weights to shorter words.

5Note that the variants of the other models are selected based on their LF.

Figure 5.1: Precision (P), recall (R) and f-score (F) values (for boundaries, word tokens and word types) and oversegmentation (Eo) and undersegmentation (Eu) error rates of the LMα on the BR corpus, plotted against changing values of α. Panels: (a) boundary performance, (b) token performance, (c) type performance, (d) error rate.

By increasing α, the model is encouraged to segment more, and this is clearly visible in the decrease of undersegmentation errors. On the other hand, since Equation 5.3 prefers fewer words, the model still does not make a large number of oversegmentation errors.

For the rest of this thesis, the LMα model with α = 0.5 will be used for comparing the various results obtained (the subscript will be dropped, and the model simply denoted 'LM'). The LM shares the basic structure of state-of-the-art segmentation models, and it achieves results competitive with other segmentation models on the known benchmark corpus. As a result, it serves as a good reference model. As an added benefit of reimplementing the reference model, Table 5.4 also reports the error scores described in Section 5.4. Furthermore, it enables us to investigate an incremental model's performance over time. Notice that the best-performing model in Table 5.2 is the batch Bayesian model presented by Goldwater et al. (2009). Besides the modeling practice used, there are two more reasons why this model can perform better than an incremental model. First, since it has access to the complete data, it can in principle arrive at generalizations that are consistent with the complete corpus. Second, the performance of the incremental models in the same table includes the initial output of the learning process, where errors are expected. A batch model, on the other hand, outputs its results after the learning process is completed.

To demonstrate the performance of the LM with increasing input, Figure 5.2 presents the performance scores plotted for each 500-utterance block during the learning process. The first value of each score in this graph is calculated using the first 500 utterances, the second value is calculated on the 501st to 1000th utterances, and each successive score is calculated on the next 500-utterance block. Since the corpus contains 9790 utterances, the last scores in this graph are calculated using the final 290 utterances. As expected, the performance scores increase and the errors drop as learning progresses. Learning also seems to be fast: after the third or fourth block, the scores stabilize. At the end of the last phase of learning from this corpus, the performance scores of the LM are substantially better than the scores calculated on the output of the model during the complete learning process (BF=89.0%, WF=80.6%, LF=74.0%, Eo=4.4%, Eu=11.1%). In fact, these scores are also higher than the performance scores reported by Goldwater et al. (2009). A possible objection to reporting the performance scores for the last 290 utterances is that they may reflect idiosyncrasies of this particular small sample. Figure 5.2 shows that, despite slight fluctuations, the scores obtained for earlier blocks of 500 utterances are similar, and Section 9.4 will provide further assurance that the results are not due to chance effects.

The LM and the related models set a high standard of performance. Instead of proposing a different system, improving the LM, as in a few other studies, could be another strategy to follow. However, this modeling strategy has a few shortcomings as a model of human performance.
First, as the review of the relevant psycholinguistic literature suggests, humans use a relatively large number of cues for segmentation. The LM-like modeling practice, on the other hand, evaluates the segmentations of an utterance based only on the probabilities of the lexical units used in the segmentation. The probabilities of the words are in turn estimated from their frequencies of occurrence in the input, if the words are known. If a word is novel, the probability estimation makes use of the frequencies of the phonemes. In other words, for these models the probability of a segmentation is based on (relative) frequency. The improvements of the standard model obtained by introducing word context, e.g., using bigram counts instead of word counts, change only the units whose frequency is calculated. Similarly, better modeling of phonotactics is generally done by using phoneme bi- or trigrams. In summary: the LM and its variations use the frequency of lexical units or phonemes as the sole quantity from which to estimate the probability of a segmentation.

Figure 5.2: (a) Boundary, word token and word type f-scores and (b) oversegmentation and undersegmentation rates of the LM on the BR corpus, for successive blocks of 500 utterances each.

Even though some cues discussed in the literature can be fitted into the phonotactics part of the standard model (Equation 5.4), it is difficult to integrate most of the cues into this type of model.

A second problem arises from the way probabilities are calculated by Equations 5.3 and 5.4. These equations prefer very short utterances and very short words. Assigning lower probabilities to longer sequences has some merit: lexical units formed by longer sequences of phonemes, and utterances formed by longer sequences of lexical units, do become less and less probable as the length of the sequence increases. However, the prediction of these models at the other end of the scale is wrong. The equations assign considerably higher probabilities to one-unit sequences than to two-unit sequences; in other words, they assign high probabilities to single-word utterances and to single-phoneme words. Table 5.5 lists a few example probabilities of utterances (assuming all words used in the utterances are known) and of words (assuming that they are novel). The calculation was carried out on the gold-standard segmentation of the BR corpus. As can be seen in Table 5.5, the length of the utterance or word is the main determining factor for the LM. The relatively rare utterance /yu/ 'you' gets a very high probability by virtue of being a high-frequency single word. For the same reason the word /In/ 'in', which never occurs as an utterance in the corpus, fares better than many highly frequent utterances, such as /WAts D&t/ 'what's that' and /WAt du yu want/ 'what do you want'.

    utterance probabilities                       word probabilities
    utterance        freq  rank        p          word   freq  rank       pc          p`
    yu                  4   165    0.05           t         0    NA    0.09        0.07
    In                  0    NA    0.01           6       895     3    0.04        0.02
    WAts D&t          208     2    0.0004         yu     1704     1    0.001       0.0002
    bItwin              0    NA    0.00003        Z         0    NA    0.00002     0.0001
    WAt du yu want     33    21    0.0000001      WAts    569     9    0.000002    0.0000003

Table 5.5: Example probabilities assigned to utterances by Equation 5.3 and to words by Equation 5.4. For word probabilities, pc is calculated using relative frequencies of the phonemes in the complete corpus, and p` is calculated using relative frequencies of the phonemes in the unique words, i.e., the lexicon.

Similarly, the word /bItwin/ 'between', which occurs only once in this corpus, is assigned a probability higher than the 21st most frequent utterance. Similar observations can be made for the high-frequency non-word /t/ (as in /6bQt/ 'about'), the low-frequency non-word /Z/ (as in /yuZw6li/ 'usually'), and words of varying frequency: /6/ 'a', /yu/ 'you' and /WAts/ 'what's'. Shorter sequences of phonemes, even when they are not words, get higher scores than real words formed by longer sequences of phonemes. In both cases this formulation results in a strong bias towards short sequences. Besides providing a more cognitively plausible method of modeling the learning of segmentation, the explicit cue-combination model that will be described next aims to improve segmentation performance by solving these problems.

5.6 Summary and discussion

This chapter provided a general overview of the state of the art in computational segmentation. First, a review of the previous models in the literature was presented, along with common modeling practices in computational models of segmentation. Second, the issue of evaluation was discussed; common problems in interpreting the performance of segmentation models were pointed out, and two new error measures were suggested. Finally, Section 5.5 introduced a model that follows the state-of-the-art strategy in the literature, as well as a random baseline. These models will be used as references in evaluating the performance of the models described in the rest of this thesis. Although the reference model described here performs quite well, the aim of this study is to develop a model that combines multiple cues using an incremental segmentation algorithm.

It is well established that children, as well as adults, are sensitive to multiple, overlapping and noisy cues in the speech input, and that they use these cues in discovering lexical units in continuous speech. As in experimental studies, however, computational mechanisms for combining multiple cues for speech segmentation are understudied: most computational models of segmentation focus on one particular cue. One of the reasons for this lack of interest in multiple-cue combination is practical. The computational models in the literature almost exclusively focus on information that can be extracted from transcribed speech, e.g., distributional regularities or utterance boundaries. This is mostly because we have neither corpora rich enough to include reliable acoustic cues, nor good standardized representation schemes for acoustic input. When researchers do attempt to use acoustic cues, they resort to synthesizing them from canonical forms in dictionaries (Christiansen et al., 1998) or to using automatic speech recognition systems (Rytting et al., 2010). The synthesized cues used in these studies may not exactly match real-world speech input. However, the models that combine multiple cues still provide insights into how these cues can be combined, and their predictions can be tested experimentally.

Another reason for not integrating multiple cues is related to the strategy used in most segmentation models in the literature. The typical modeling practice based on n-gram language modeling presented in Section 5.5 cannot easily be extended to incorporate multiple cues. Even though one can argue that these models combine phonotactics and distributional regularities, the mechanism of combination is highly restricted, and phonotactics comes into play only for unknown words, as part of a back-off mechanism. The same problem has been observed in natural language processing, where a similar modeling practice for probabilistic context-free grammars (PCFGs) has been found difficult to modify to incorporate arbitrary aspects, or features, of the input. The only computational models that emphasize cue combination are incarnations of the same connectionist system developed by Christiansen and colleagues (Allen and Christiansen, 1996; Christiansen et al., 1998, 2005; Rytting et al., 2010). Although these models clearly demonstrate that the combination of cues is useful, more explicit models of cue combination would provide better insights into the phenomenon.
For example, even though these systems integrate the cues, it is difficult to assess directly the effects of cue conflicts, or the relative importance of particular cues. Starting with the next chapter, a computational model of segmentation that attempts to combine arbitrary cues using explicit statistics will be presented in a number of incremental steps.

6 Segmentation using Predictability Statistics

Two quite opposite qualities equally bias our minds—habits and novelty.
Jean de La Bruyère

Predicting things to come is a natural activity of the human brain. We have predictions, or expectations, about things such as what the weather will be like tomorrow, what will happen in the next episode of our favorite TV show, who else will come to the birthday party we plan to attend, how safe it is to walk at night in our favorite city, how many more cups of coffee may be enough to work through the night, or how long it will take to finish a dissertation chapter. Nor are predictions limited to the high-level cognitive functions exemplified here. Many sensorimotor functions depend on predicting the state of the world at the next time step; for example, our success in sports depends on our predictions as well as on what we perceive. For most of these matters, unless asked, we would not even know we are making predictions about them. Of course the predictions are not always accurate, and the mechanisms behind human predictions are interesting for psychology in general. The point for the discussion here is that prediction is a fundamental part of our cognitive processes. During the time that we are conscious, the 'internal prediction machine' never stops, predicting the next state of the environment at many levels. An interesting aspect of human cognition is not only how we set expectations about the next step in a task, but how we react when expectations fail and we are surprised. We remember and learn most from surprising events. It seems that prediction is an important aspect of how human cognition works, and when it fails, this has further consequences for the cognitive system.

Returning from these informal common-sense statements about prediction and surprisal to facts more directly related to the research in this study, we know from the work reviewed in Chapter 4 that predictability has an important function for speech segmentation as well. Section 4.2.1 introduced a general observation about the speech stream that aids lexical segmentation: 'predictability within the lexical units is high, predictability between the lexical units is low'. This strategy has been found to be useful for computational models of segmentation (e.g., Christiansen et al., 1998; Cohen

et al., 2007; Elman, 1990; Hafer and Weiss, 1974; Harris, 1955). Moreover, it is known that a similar strategy is used by infants for segmenting continuous speech (Saffran et al., 1996a). In the upcoming sections of this chapter, I will lay out a computational model of segmentation that builds on this strategy. The next section investigates various formal measures of predictability and compares their effectiveness using statistical analysis of child-directed speech. Section 6.2 will introduce an unsupervised strategy that combines multiple measures to segment a continuous stream, and the simulations carried out using this strategy.

6.1 Measures of predictability for segmentation

It is clear from the psycholinguistics literature that predictability is used by humans for the task of segmentation. In particular, it seems that when consecutive units do not predict each other, even 8-month-olds tend to assume that there is a word boundary (Saffran et al., 1996a). To express the notion of predictability formally, Section 4.2.1 introduced two measures, transitional probability and successor variety. Besides these two, this section will formally introduce two other measures, pointwise mutual information and boundary entropy, and present an analysis of child-directed speech that investigates the usefulness of these measures as indicators of word boundaries. Before analyzing the measures listed above, I will first review a relevant study by Hockema (2006), which presents another indicator of word boundaries. This indicator is not suitable for unsupervised learning, but the study nevertheless uncovers an interesting property of speech sequences relevant to this research.

6.1.1 Boundary probability

Hockema (2006) analyzed a large corpus of child-directed speech according to a measure he called conditional boundary probability, defined as the probability of observing a word boundary given a phoneme pair lr:

    Pwb(l, r) = P(boundary | lr)

He transcribed all child-directed utterances in the American English section of the CHILDES that were available at the time using the Carnegie Mellon Pronouncing Dictionary (Carnegie Mellon University, 1998). For each possible phoneme pair lr, he estimated Pwb(l, r), and plotted the histogram of these probability values. The result showed that the distribution is strongly bimodal: phoneme pairs show a strong tendency to occur either word-internally or at word boundaries. Figure 6.1 presents graphs produced by the same procedure on the BR corpus. The differences between the data used for producing these graphs and Figure 2 in Hockema (2006) lie in the size of the corpus and the number of phonemes used for transcribing it. Hockema's data consisted of 8,078,540 phoneme pairs transcribed using a 39-phoneme alphabet, while the analysis used here is based on 86,019 phoneme pairs transcribed with a 50-phoneme alphabet.

Figure 6.1: (a) Histogram of Pwb(l, r) values. (b–c) Histograms of Pwb for all pairs that occur at word boundaries (b) and at word-internal positions (c). (d) Precision, recall and f-score values against changing threshold values.

Despite the differences, the same trends hold: the distribution of phoneme pairs is strongly bimodal. The presentation of the data is also different: all histograms presented in this section are, like Hockema's, normalized histograms; they count the relevant values as many times as the corresponding phoneme pair occurs in the corpus. As a result, compared to histograms based on phoneme-pair types, these histograms are better representations of the distributions a child would hear. Figure 6.1a presents the histogram of Pwb(l, r) for all phoneme pairs observed in the corpus. Large portions of the probability mass are lumped together either in the very first bin, where the probability of a word boundary is zero or close to zero, or at the opposite end of the scale, where the probability of a word boundary is one or close to one. This clearly shows that some phoneme pairs tend to appear only word-internally, and some others only at word boundaries. Figure 6.1b–c presents the two separate histograms of the same quantity: in Figure 6.1b only the probabilities of phoneme pairs that straddle word boundaries are shown, while in Figure 6.1c only word-internal phoneme pairs are considered. These histograms clearly show that the bimodality of the measure is indeed due to the differences between the phoneme pairs that occur at word boundaries and those at word-internal positions. Figure 6.1d presents the segmentation performance of a simple algorithm that segments between phoneme pairs for which Pwb(l, r) is greater than a threshold value. The graph presents precision, recall and f-score for varying threshold values. The results indicate that a very high level of performance is attainable for a large range of threshold values. For example, for a threshold of 0.5, we get 91.2% precision and 87.3% recall, which amounts to an f-score of 89.2%. These figures are somewhat lower (86.5% precision, 76.0% recall) for Hockema's larger CDS data set.

Considering that this segmentation performance is achieved using only statistics over phoneme pairs, it is an impressive result. However, there are two major problems with this analysis. First, the learner has no access to the information needed to build this distribution, namely the knowledge of word boundaries. As a result, even though it uncovers a nice regularity in the data, it is of little direct use for an unsupervised segmentation algorithm. Second, since the method is based only on phoneme pairs, there is no way of distinguishing between occurrences of a phoneme pair that occurs both word-internally and at word boundaries. This becomes particularly problematic for some of the frequent phoneme pairs. For example, /sI/1 occurs 153 times at a word boundary, as in what's it, and 163 times word-internally, as in sit, in the BR corpus. The method implies that either all occurrences of the word sit (and other words including the phoneme pair /sI/) will be oversegmented, or phrases like what's it will be undersegmented. The analysis provided above indicates that, given a correctly segmented corpus, one can achieve relatively accurate segmentation based on the likelihood that a phoneme pair occurs at a word boundary. Even though this is not immediately useful to a learner without access to an already segmented corpus, similar results may be obtained with various measures of predictability that do not require a segmented corpus. The rest of this section provides similar analyses for such measures.
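For concreteness, the analysis behind Figure 6.1d can be sketched in Python as follows. The estimation is supervised (it needs the gold-standard boundaries), so, as discussed above, it serves only as an upper-bound analysis; the function names are this sketch's own.

    from collections import Counter

    def estimate_pwb(segmented_corpus):
        """P(boundary | lr) for each phoneme pair, from a segmented corpus
        given as a list of utterances, each a list of words."""
        at_boundary, total = Counter(), Counter()
        for words in segmented_corpus:
            utt = "".join(words)
            bounds, pos = set(), 0
            for w in words[:-1]:
                pos += len(w)
                bounds.add(pos)
            for i in range(1, len(utt)):
                pair = (utt[i - 1], utt[i])
                total[pair] += 1
                if i in bounds:
                    at_boundary[pair] += 1
        return {pair: at_boundary[pair] / n for pair, n in total.items()}

    def segment_by_pwb(utt, pwb, threshold=0.5):
        """Insert a boundary wherever P(boundary | lr) exceeds the threshold."""
        words, start = [], 0
        for i in range(1, len(utt)):
            if pwb.get((utt[i - 1], utt[i]), 0.0) > threshold:
                words.append(utt[start:i])
                start = i
        words.append(utt[start:])
        return words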

6.1.2 Transitional probability

As a measure of predictability, most studies in the psycholinguistic literature use conditional probability—or transitional probability (TP), as it is known in this field (e.g., Saffran et al., 1996a). The conditional probability of syllable r, given that we have already observed syllable l, was defined in Equation 4.1, repeated here for convenience:

    TP(l, r) = P(r|l) = P(lr) / P(l) ≈ frequency(lr) / frequency(l)      (6.1)

Instead of syllables, throughout this chapter l and r will refer to phonemes (and later, sequences of phonemes). Intuitively, if the phoneme pair lr is highly probable, it is likely to be part of a word, for two reasons. First, words repeat, and that makes parts of words repeat as well. Second, since words are not formed randomly, certain sequences are more likely to occur within words. These observations indicate that the joint probability, P(lr), is a useful measure. However, if l is very frequent, lr may be frequent largely by chance. For example, since the phoneme /i/ is rather frequent in English, the sequence /iI/ occurs frequently even though it rarely occurs within words.

1The symbols used for phonemes in these examples and throughout the rest of this thesis follow the conventions used by Brent and Cartwright (1996) in transcribing the BR corpus. The transcription system is described in Appendix A.

Figure 6.2: (a) Distribution of transitional probabilities. (b–c) Distribution of TP for boundaries and word-internal positions, respectively. (d) Performance of algorithms that segment at locations where P(r|l) is lower than a threshold value. The solid gray line in (d) represents precision, recall and f-score of a pseudo-random segmentation method that inserts as many boundaries as the gold-standard segmentation.

On the other hand, even though the phoneme sequence /WI/ occurs exclusively within words in the BR corpus, the probability estimate of P(/iI/) is 3.67 times that of P(/WI/). As Equation 6.1 suggests, conditional probability is high when joint probability is high, and the division by P(l) in the definition of TP reduces this 'chance effect' to some extent. For the same example, TP(/i/, /I/), though still higher, is only 1.71 times TP(/W/, /I/). Figure 6.2a presents the distribution of conditional probability values. Unfortunately, there is no clear indication of a bimodal distribution. If we plot the histograms of the conditional probabilities at boundaries and at word-internal positions separately (Figure 6.2b–c), we can see that the distributions are somewhat different. As expected, the probability mass for boundaries lies more towards the lower end of the distribution. However, even though the distribution of word-internal conditional probabilities is spread more towards the higher values, there is still a large number of word-internal positions with low conditional probabilities. Figure 6.2d presents the performance scores for a strategy that segments at locations where the conditional probability is lower than a threshold. The gray line in this graph presents the performance of a segmentation strategy in which boundaries are inserted randomly, with the constraint that the number of boundaries inserted equals the number of boundaries in the gold-standard segmentation. This random segmentation model (the RM) was explained in Section 5.5. Since the precision and recall scores of the random segmentation are the same, the f-score is also the same; as a result they appear as a single line in Figure 6.2d. It should be noted that even though the boundaries are chosen at random, this particular segmentation strategy is a rather informed baseline: it knows the number of boundaries.

This analysis indicates that, even though it is not as impressive as the measure suggested by Hockema (2006), a naive segmentation strategy based on TP consistently performs better than random. Furthermore, this measure is more suitable for unsupervised methods, since the calculation of conditional probabilities does not require knowledge of word boundaries. Using threshold values for unsupervised segmentation is still problematic, because it requires a non-trivial way to set the threshold without knowing which value is a good option. This problem and possible solutions will be discussed further in Section 6.2, where explicit unsupervised algorithms for segmentation will be described. The analysis provided in Figure 6.2d serves as an indication that this measure is useful, and allows us to compare it with the others. Using the conditional probability measure as presented here has two other weaknesses. First, like the Pwb measure discussed above, TP calculated on only two consecutive phonemes cannot capture effects of larger sequences of phonemes or of non-adjacent phonemes. This is not an intrinsic property of the measure, and the use of larger phoneme contexts will be discussed in Section 6.1.6. Second, as also discussed in Brent (1999a), the conditional probability is asymmetric: P(l|r) is not the same as P(r|l), and P(l|r) can also provide useful information for segmentation. The utility of the backward version of the measure will be discussed in Section 6.1.7.
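All the threshold-based analyses in this chapter follow the same scheme, which might be sketched in Python as below. The helper is hypothetical: it takes a scoring function over a phoneme pair and a flag indicating whether low values (as for TP and MI) or high values (as for SV and H) signal a boundary.

    def segment_by_threshold(utt, score, threshold, boundary_if_low=True):
        """Segment an utterance by comparing a per-position measure to a
        threshold. score(l, r) returns the measure for the pair (l, r);
        measures that depend only on l (SV, H) can simply ignore r."""
        words, start = [], 0
        for i in range(1, len(utt)):
            low = score(utt[i - 1], utt[i]) < threshold
            if low == boundary_if_low:
                words.append(utt[start:i])
                start = i
        words.append(utt[start:])
        return words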

6.1.3 Pointwise mutual information

Pointwise (or specific) mutual information is an information-theoretic measure of association between two random variables. It is used in many natural language processing tasks, and its use in segmentation, albeit rare, is not exceptional (e.g., Brent, 1999a; Swingley, 2005). Pointwise mutual information (MI) is defined as:2

    MI(l, r) = log2 ( P(lr) / (P(l) P(r)) )

Neglecting the logarithm for now, in this definition the joint probability is divided by P(l) × P(r). As a result, the high association one would get by chance for highly frequent phonemes is reduced, just as in the case of TP. Unlike TP, the MI score is affected by the frequencies of both phonemes, and it is symmetrical. The logarithm defines the unit of the measure: the binary logarithm (base two) is commonly used in information theory, and the resulting unit is called the bit. There has been some work on computational modeling of segmentation using MI (Brent, 1999a; Swingley, 2005), but the measure is virtually unmentioned in the psycholinguistic literature.

2Mutual information is a related but different information-theoretic measure. However, in this thesis, following related work in computational models of segmentation, the term mutual information and the abbreviation MI always refer to the pointwise mutual information between two consecutive sequences.

Figure 6.3: (a) Distribution of MI. (b–c) Distribution of MI for boundaries and word-internal positions, respectively. (d) Performance of algorithms that segment at locations where MI is lower than a threshold value. The solid gray line in (d) represents precision, recall and f-score of a pseudo-random segmentation method that inserts as many boundaries as the gold-standard segmentation.

Figure 6.3 presents the same analysis for MI that Figure 6.2 presents for TP. The first difference to note is that the shape of the graph differs from that of TP. This is because the probability values are estimated from frequencies of phonemes and phoneme bigrams. Like many other frequency distributions of linguistic units, the distribution of probability values such as TP follows an exponential trend. MI, on the other hand, is the logarithm of a combination of probability values,3 and the logarithm transforms the exponential-like distribution into a roughly normal one. In addition, the distributions of MI values for boundary and non-boundary phoneme pairs seem to be slightly better separated. This is also evident from the differences between the performance graphs in Figure 6.2d and Figure 6.3d: the f-score for TP barely exceeds 50%, while the f-score for MI is well over 60% for some threshold values. Before providing a more detailed comparison, two more measures will be introduced.

6.1.4 Successor variety

Among the measures considered in this chapter, successor variety (SV) (Harris, 1955) is probably the earliest measure suggested for lexical segmentation. Section 4.2.1 has already introduced the SV measure informally. More formally, SV can be defined as shown below.

3It should be noted that the quantity P(lr) / (P(l)P(r)) is not a probability. For positively correlated phonemes this value is greater than one, and the MI score is positive.

    phoneme   h    i    z    k    w    I    k    R
    SV_BR    16   13   22    2    0    0    0    0

Figure 6.4: Successor variety values calculated from the BR corpus for the same utterance presented in Figure 4.4. Note that the phonemic transcriptions are different.

    SV(l) = Σ_{r∈A} c(l, r)

where

    c(l, r) = 1 if the substring lr occurs in the corpus, and 0 otherwise

and A is the set of phonemes (the alphabet). Unlike the measures discussed previously, SV is a function only of the initial sequence l. In Harris (1955), this sequence runs from the beginning of the utterance to the position being evaluated. Figure 6.4 presents the successor variety values for the utterance /hizkwIkR/ 'he is quicker'; the SV values for the same utterance as determined by Harris (1955) were given in Figure 4.4. The SV value after the word he's is the highest, and a reasonable algorithm based on SV would segment this utterance correctly. However, Figure 4.4 also points to a problem: as the initial sequence gets longer, the likelihood that it has never occurred before in the input increases. As a result, even for child-directed speech, which is characteristically repetitive, the SV values drop to 0 and become useless after a short initial sequence. A segmentation algorithm based on SV values calculated as in Figure 4.4 is therefore likely to fail to find boundaries after the first few. There are ways to solve this problem, but even in its simple form SV has been popular in the morphological segmentation literature (e.g., Al-Shalabi et al., 2005; Bordag, 2005, 2007; Déjean, 1998; Demberg, 2007; Goldsmith, 2006; Hafer and Weiss, 1974; Stein and Potthast, 2008). Morphological segmentation is the task of segmenting words into morphemes; it is useful in many natural language processing tasks, ranging from stemming to machine translation of agglutinative languages. Since words are more repetitive than utterances, the measure works better for morphological segmentation, although it may benefit from some improvements in this task as well (Çöltekin, 2010). To adapt the SV measure to the segmentation of utterances into lexical units, the discussion here is based on calculations made using a phoneme context of varying size. SV calculated on a single-phoneme context is not very useful as a segmentation measure: in the BR corpus, for example, SV after the phoneme /W/ is 7 and SV after the phoneme /t/ is 46, so a threshold value between these numbers will always segment after /t/ and never after /W/. However, to provide a comparison with the other measures, Figure 6.5 presents an analysis of SV values in which boundaries are classified using the SV value of a single preceding phoneme.
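A Harris-style computation of the values in Figure 6.4 might look as follows in Python, under the assumption that a prefix's successors are counted over the utterances that begin with that prefix; the function name is this sketch's own.

    def prefix_sv(utterance, corpus):
        """Successor variety for each utterance-initial prefix: the number
        of distinct phonemes that follow the prefix across all utterances
        (unsegmented phoneme strings) in the corpus that start with it."""
        for i in range(1, len(utterance) + 1):
            prefix = utterance[:i]
            followers = {utt[i] for utt in corpus
                         if utt.startswith(prefix) and len(utt) > i}
            yield prefix, len(followers)

Run over the BR corpus with the utterance /hizkwIkR/, such a procedure would yield values analogous to those in Figure 6.4.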

Figure 6.5: (a) Distribution of SV. (b–c) Distribution of SV for boundaries and word-internal positions, respectively. (d) Performance of algorithms that segment at locations where SV is higher than a threshold value. The solid gray line in (d) represents precision, recall and f-score of a pseudo-random segmentation method that inserts as many boundaries as the gold-standard segmentation.

Nevertheless, Figure 6.5 indicates that, even in this form, the measure performs as well as the others. Some improvements that make SV-like measures more effective will be discussed in Sections 6.1.6 and 6.1.7. The next section concludes the discussion of individual predictability measures with a similar but theoretically more attractive and better-studied measure.

6.1.5 Boundary entropy

Entropy (also called Shannon entropy when there is a need to distinguish it from entropy in thermodynamics) is the information-theoretic measure of average uncertainty.4 Entropy is also known as average surprisal, where surprisal (−log P(l)) is another information-theoretic measure suggested by Shannon (1948). As a result, it is one of the natural choices for measuring (un)predictability. In psychologically motivated models of segmentation, however, entropy is rarely mentioned. Its use is common in the segmentation of written text, particularly for languages like Chinese and Japanese, which are the typical examples of languages whose writing systems lack a word boundary marker (e.g., Huang and Powers, 2003; Kempe, 1999; Zhikov et al., 2010). As far as I can determine, Cohen et al. (2007) is the only study of entropy-based segmentation motivated by human (or human-like) performance.

4The inventor of the measure, Claude Shannon, initially named the quantity 'uncertainty', but on the suggestion of John von Neumann, another pioneer of the field, he named it entropy (Tribus and McIrvine, 1971).

The measure that will be used in this chapter, boundary entropy (H), is defined as:5

    H(l) = − Σ_{r∈A} P(r|l) log2 P(r|l)                                  (6.2)

where the sum ranges over all phonemes in the alphabet A. Given the sequence l, this formula measures how much uncertainty still exists. As for MI, the binary (base-2) logarithm makes the unit of the measure the bit. In more intuitive terms, this quantity measures how many yes/no questions are necessary, on average, to predict the next phoneme. Even though it may not be clear at first sight, the entropy measure has strong similarities with SV: both measure the promiscuity of l. That is, if l combines with many different phonemes, then both SV and entropy are high. The difference is that entropy is sensitive to the token frequencies of the sequences, while SV considers only types. The difference may be easier to grasp with an example. Assume we have a corpus consisting of three words xa, xb and xc, and we are interested in the unpredictability after x. Obviously SV is three, and calculating entropy using Equation 6.2 we find approximately 1.58 bits. If we had a corpus where xa occurred twice while the other two words occurred once each, that would make no difference for SV; it is still three. However, since the knowledge that a is a more probable phoneme after x reduces uncertainty, the new value of the entropy (1.5 bits) reflects this. Like SV, calculating entropy values conditioned on a single phoneme is not a good strategy. However, for the sake of completeness, Figure 6.6 presents for boundary entropy the same analysis presented for the other measures.
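All four measures can be estimated from simple phoneme and phoneme-bigram counts. The following Python sketch (with names of its own) computes the single-unit-context versions of TP, MI, SV and H, and reproduces the entropy example above; exact MI values depend on normalization choices, so they may differ slightly from those reported in this thesis.

    import math
    from collections import Counter, defaultdict

    def predictability_measures(corpus):
        """Return tp, mi, sv and h functions for single-unit contexts,
        estimated from a corpus given as a list of unsegmented sequences
        (phoneme strings, or lists of syllables)."""
        uni, bi = Counter(), Counter()
        succ = defaultdict(set)  # distinct successors of each unit
        for utt in corpus:
            uni.update(utt)
            for l, r in zip(utt, utt[1:]):
                bi[(l, r)] += 1
                succ[l].add(r)
        n_uni, n_bi = sum(uni.values()), sum(bi.values())

        def tp(l, r):  # transitional probability, Equation 6.1
            return bi[(l, r)] / uni[l]

        def mi(l, r):  # pointwise mutual information
            return math.log2((bi[(l, r)] / n_bi) /
                             ((uni[l] / n_uni) * (uni[r] / n_uni)))

        def sv(l):  # successor variety
            return len(succ[l])

        def h(l):  # boundary entropy, Equation 6.2
            total = sum(bi[(l, r)] for r in succ[l])
            return -sum(bi[(l, r)] / total * math.log2(bi[(l, r)] / total)
                        for r in succ[l])

        return tp, mi, sv, h

    # The example from the text: SV after x is three in both corpora, while
    # the entropy drops from about 1.58 bits to 1.5 bits when xa repeats.
    tp, mi, sv, h = predictability_measures(["xa", "xb", "xc"])
    print(sv("x"), round(h("x"), 2))  # 3 1.58
    tp, mi, sv, h = predictability_measures(["xa", "xa", "xb", "xc"])
    print(sv("x"), round(h("x"), 2))  # 3 1.5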

6.1.6 Effects of phoneme context

It is plausible to assume that humans use a predictability strategy based on a larger phoneme context. Many studies in psycholinguistics have shown that humans are sensitive to syllable transitions, and the syllable is typically a multi-phoneme unit. Furthermore, at least at some level, adults seem to be sensitive to expectations about longer and even discontinuous sequences of syllables (Dilley and McAuley, 2008). On the other hand, almost all computational models of segmentation use predictability measures calculated only on consecutive single phonemes. For example, although Brent (1999a) notes that calculating TP and MI values on a single-phoneme context does not reflect their full utility, he nevertheless calculates these values on the basis of a single-phoneme context. Here, I will extend the analysis carried out in the previous subsections and discuss the effect of calculating predictability measures on longer sequences of initial phonemes.

5The boundary entropy defined here is similar to, but different from, a well-known entropy measure, conditional entropy, which is defined as − Σ_{r∈A} P(r, l) log2 P(r|l). In preliminary experiments, the results obtained for the two measures in the segmentation task were similar. Boundary entropy is adopted here since it was used in previous research on segmentation (e.g., Hafer and Weiss, 1974).

Figure 6.6: (a) Distribution of entropy. (b–c) Distribution of entropy for boundaries and word-internal positions, respectively. (d) Performance of algorithms that segment at locations where entropy is higher than a threshold value. The solid gray line in (d) represents precision, recall and f-score of a pseudo-random segmentation method that inserts as many boundaries as the gold-standard segmentation.

Figure 6.7 presents a set of graphs that visualize the effect of increasing the length of the preceding phoneme context l to two and three. The figure also provides a direct comparison of the predictability measures discussed so far. The first three columns display the distributions of the measures for context sizes between one and three. The last column compares the performance of segmentation algorithms using a single measure with varying phoneme context size. The performance comparison is presented using precision/recall graphs, with recall on the horizontal axes and precision on the vertical axes. Perfect segmentation corresponds to the upper right corner, where both precision and recall are one; otherwise, the closer the curve to the upper right corner, the better the performance. In other words, a large area under the curve indicates a measure that performs well over a range of threshold values. The first four rows, separated by dotted lines, correspond to the measures TP, MI, SV and H, respectively. Each row contains two rows of histograms, the top one depicting the distribution of the measure at boundary locations and the bottom one the distribution at word-internal positions. The fifth row presents precision/recall graphs comparing measures that use the same context length. Figure 6.7 demonstrates that increasing the context size increases the separation between the distributions at boundary and non-boundary locations. This is particularly visible for context size two and the measures TP, SV and H; the separation is less clear for MI, and for a context size of three. Additionally, the increase of the phoneme context from two to three does not seem to have a dramatic effect on the performance. However, there is a general trend of increase with the context size. This trend is clearly visible from the area under the precision/recall curves.

Figure 6.7: The effect of context on predictability measures. The first three columns in the first four rows (rows are separated by dotted lines) present distributions of measure values for varying context size. The last column presents the precision/recall graphs for each context size. The last row presents the precision/recall values for each context size for all measures.

The precision/recall curves in the bottom row of Figure 6.7 demonstrate this especially clearly: the area under the curves in these graphs increases from left to right (i.e., with increasing phoneme context). Figure 6.7 shows that increasing the phoneme context affects how well all the predictability measures predict the word boundaries, making the measures more useful. An interesting question, however, is whether different context sizes give the same information. For example, does calculating TP conditioned on the previous two phonemes give us all the information we get from calculating it conditioned on a single previous phoneme? The question is important because, if different context sizes provide different information, then instead of using only the larger context size one can use both, achieving better performance than with the better of the two alone. Using multiple context sizes is appealing also because the unpredictability at word boundaries arises from dependencies between different linguistic units, such as words, syllables and phonemes; changing the phoneme context size may capture regularities that exist because of these different linguistic units. The relation between the phoneme context size and the linguistic units is, of course, not clear-cut. It is likely, however, that a context size of two or three captures more about relationships between syllables, while a context size of one mostly captures the relationships between single pairs of phonemes. If we expect regularities at both levels, then we expect a combination of different context sizes to be helpful. Section 6.2 will investigate the effect of varying context size on an unsupervised segmentation algorithm. Here, I will provide some evidence that different context sizes provide different information. The evidence comes from the fact that if two sources of information contribute independently in favor of a certain conclusion, their correlation is expected to be lower when we know the conclusion is correct. They correlate in the first place because they measure the same quantity; however, given the conclusion, they should not be correlated if they make errors independently. If they are not completely independent, but still provide some independent information, we expect the correlation to be lower when the conclusion is known. Returning to the segmentation problem: if two context sizes, say one and two, for the same measure provide independent information regarding word boundaries, we expect their correlation given that there is a boundary to be lower than their correlation computed independently of the word boundaries. Table 6.1 presents correlation coefficients for context sizes between one and three for all measures, both over all possible boundary positions and only at word boundaries. With the magnitude of the change varying by measure, the correlations at word boundaries are lower than the correlations for the overall corpus.
The results indeed indicate that the measures calculated with each phoneme-context size provide some information about the word boundaries that the other context lengths do not provide. This result (based on cases of genuine boundaries) gives some indication that the use of statistics between phoneme sequences of varying lengths may be useful for the segmentation task. The use of information from multiple measures calculated over varying lengths will be investigated empirically in Section 6.2.

    (a) TP, all            (b) MI, all            (c) SV, all            (d) H, all
          1     2     3          1     2     3          1     2     3          1     2     3
    1  1.00  0.58  0.40       1.00  0.67  0.52       1.00  0.63  0.44       1.00  0.61  0.43
    2        1.00  0.74             1.00  0.80             1.00  0.74             1.00  0.74
    3              1.00                   1.00                   1.00                   1.00

    (e) TP, boundaries     (f) MI, boundaries     (g) SV, boundaries     (h) H, boundaries
          1     2     3          1     2     3          1     2     3          1     2     3
    1  1.00  0.55  0.33       1.00  0.64  0.46       1.00  0.35  0.27       1.00  0.38  0.20
    2        1.00  0.64             1.00  0.74             1.00  0.58             1.00  0.56
    3              1.00                   1.00                   1.00                   1.00

Table 6.1: Correlation coefficients between different phoneme context sizes for each measure. The top row gives the correlation coefficients over all boundary locations; the bottom row presents the coefficients calculated only at boundary positions. The correlation coefficients are calculated after log-transforming TP, SV and H, since the log-transform makes these distributions roughly normal.
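The diagnostic behind Table 6.1 (and Table 6.3 below) can be sketched in Python as follows, assuming the measures have already been evaluated at every candidate boundary position of the corpus; the function is hypothetical and uses NumPy for the correlations.

    import numpy as np

    def conditional_correlation(x, y, is_boundary):
        """Pearson correlation of two per-position measure vectors, over
        all positions and restricted to genuine boundaries. TP, SV and H
        values should be log-transformed beforehand, as in Table 6.1."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        b = np.asarray(is_boundary, bool)
        return np.corrcoef(x, y)[0, 1], np.corrcoef(x[b], y[b])[0, 1]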

6.1.7 Predicting the past

Except for MI, the other three predictability measures discussed in this section are asymmetric: they take an initial sequence of phonemes and measure the predictability of the next phoneme. Moreover, the SV and entropy measures do so without actually seeing the next phoneme. It is clear that the reverse quantities, which measure the predictability of the previous phoneme given the current phoneme or phoneme sequence, provide some additional information. Taking TP as an example, we know that P(l|r) ≠ P(r|l); if both are useful for segmentation, using both measures is, in principle, better than using only one of them. This section will show empirically that the reverse versions of the measures discussed so far are also good measures for segmentation. For a truly online predictive system, predicting past events based on the current one may seem odd. The justification for using reverse predictability measures for segmentation comes from two sources. First, intuitively, what we hear at a particular moment changes our interpretation of past input, especially if the previous interpretation is uncertain in some way: it is not unusual, when reading some text or listening to someone, that things we have read or heard start making sense only after we read or hear more. The second, more concrete evidence comes from developmental psycholinguistics. Pelucchi et al. (2009) showed that 8-month-old infants (the same age as the infants in the Saffran et al. (1996a) study) were able to track statistical regularities that can only be detected if one is sensitive to some reverse predictability measure between successive syllables. Pelucchi et al. (2009) carefully selected words from a natural but unfamiliar language with sequences of syllables that differed only in their 'backward' transitional probabilities. The results were similar to those of Saffran et al. (1996a), confirming that infants do use backward predictability. Since direction does not make sense for MI,6 only the reverse versions of TP, SV and H will be analyzed in this section. Reverse measures will be indicated by a subscript 'r' here; the reverses of TP and SV are sometimes abbreviated as BTP (backward TP) and PV (predecessor variety) in the literature. It is easy to deduce the definitions of the reverse measures from their forward counterparts; they are provided here for the sake of completeness.

    TPr(l, r) = P(l|r) = P(lr) / P(r) ≈ frequency(lr) / frequency(r)     (6.3)

    SVr(r) = Σ_{l∈A} c(l, r)                                             (6.4)

where c(l, r) = 1 if the substring lr occurs in the corpus and 0 otherwise, and A is the set of phonemes (the alphabet).

    Hr(r) = − Σ_{l∈A} P(l|r) log2 P(l|r)                                 (6.5)

As can be seen in Figure 6.8, the reverse measures achieve segmentation performances similar to their forward counterparts. From their mathematical formulations it is clear that the forward and reverse versions of the measures are not equal to each other: P(l|r) ≠ P(r|l), and the SVr and Hr calculations do not even share with their forward counterparts the strings on which they are calculated. As in the analysis of varying phoneme-context length in Section 6.1.6, we can check whether the forward and reverse versions of these measures provide independent information. Since both are useful for detecting boundaries, they will naturally be correlated; however, if they provide some independent information, we would expect the correlation of the measures at boundary locations to be lower than the correlation over the complete corpus. Indeed, the correlation coefficients between TP, SV and H and the corresponding reverse measures on the BR corpus are 0.62, 0.15 and 0.21, respectively; calculated only at boundary locations, the same coefficients are 0.52, −0.01 and −0.06. The question of how to combine the forward and backward information efficiently still remains, and we will return to it in Section 6.2.

⁶This is not strictly true if phoneme sequences of unequal length are used for l and r. However, for ease of comparison, this section only considers measures calculated on single phonemes.
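To make the reverse measures concrete, the following minimal Python sketch implements Equations 6.3–6.5 over bigram counts. The toy utterances and all names are mine, not from the thesis, and the sketch omits details such as treating utterance edges as special phonemes and using contexts longer than one phoneme.

from collections import Counter
from math import log2

# Toy unsegmented utterances (hypothetical; any phoneme corpus would do).
utterances = ["izdatakiti", "datakiti", "izit"]

unigrams, bigrams = Counter(), Counter()
for u in utterances:
    unigrams.update(u)
    bigrams.update(u[i:i + 2] for i in range(len(u) - 1))

def tp_r(l, r):
    """Reverse transitional probability (Eq. 6.3): P(l|r) ~ freq(lr)/freq(r)."""
    return bigrams[l + r] / unigrams[r]

def sv_r(r):
    """Predecessor variety (Eq. 6.4): distinct phonemes l such that lr is attested."""
    return sum(1 for lr in bigrams if lr[1] == r)

def h_r(r):
    """Reverse boundary entropy (Eq. 6.5): entropy of the predecessor distribution of r."""
    freqs = [f for lr, f in bigrams.items() if lr[1] == r]
    total = sum(freqs)
    return sum(-(f / total) * log2(f / total) for f in freqs)

print(tp_r("a", "t"), sv_r("t"), h_r("t"))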


Figure 6.8: The precision/recall curves comparing the forward and reverse predictability measures: (a) TP and TPr, (b) SV and SVr, (c) H and Hr.

6.1.8 Predictability measures: summary and discussion

So far, this chapter has discussed four predictability measures: transitional probability, mutual information, successor variety and entropy. The reason all these measures work is that, in an unsegmented speech stream, predictability inside the lexical units is high and predictability at the lexical unit boundaries is low. Our analysis is based on two consecutive sequences of phonemes, l and r. In informal terms, TP measures how likely it is to observe r after l has been observed. If TP is high, we expect to be within a unit; if TP is low, it indicates a possible boundary. MI measures whether l and r are highly associated or not. Again, if MI is high, we expect lr to be a word-internal sequence, otherwise a boundary. The other two measures, SV and H, are measures of unpredictability (surprise); hence, high values of SV and H indicate word boundaries. Another difference is that these two measures are functions of l only. Informally, they try to answer the question 'how much do I (not) know about r after observing l?'. The difference between these two measures lies in their sensitivity to the distribution of the sequences that follow l: entropy is affected by the frequency of these sequences, while SV is oblivious to it.

All measures discussed in this section have some overlap in what they measure, but they are not the same. Most psycholinguistic studies consider TP as the measure of predictability, but the results from these experimental studies are compatible with all four. For example, given a sequence similar to the stimuli presented to the infants by Saffran et al. (1996b) and subsequent studies, Table 6.2 presents the values of all measures discussed so far for two syllable pairs. One of the syllable pairs, /bi-da/, is part of /bidaku/, one of the artificial words that form this sequence, while the other, /ku-pa/, is not. Table 6.2 shows that, as expected, all measures indicate a higher chance of a word boundary between /ku-pa/ than between /bi-da/. It would be interesting to see experimental results that would be compatible with only one of the measures but not the others; however, it is a difficult task to design such an experiment.

           TP    MI    SV    H     TPr   SVr   Hr
/bi-da/    1.0   3.4   1.0   0.0   1.0   1.0   0.0
/ku-pa/    0.5   2.4   2.0   1.0   0.5   2.0   1.0

Table 6.2: The predictability scores for the syllable sequences /bi-da/ and /ku-pa/, given that the sequence /bidakupadotigolabubidakugolabupadoti/ is observed. Note that for TP and MI lower values indicate word boundaries, while for SV and H higher values do.
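The TP, SV and H values in Table 6.2 can be reproduced with a few lines of Python over the syllabified stimulus stream; this is a sketch of mine, not code from the thesis, and the reverse measures are computed analogously. The MI values additionally depend on how the unigram and bigram probabilities are normalized, so they are not recomputed here.

from collections import Counter
from math import log2

# The stimulus stream behind Table 6.2, split into syllables (the units here).
stream = "bi da ku pa do ti go la bu bi da ku go la bu pa do ti".split()

uni = Counter(stream)
big = Counter(zip(stream, stream[1:]))

def tp(l, r):   # forward transitional probability P(r|l)
    return big[(l, r)] / uni[l]

def sv(l):      # successor variety: distinct syllables following l
    return len({b for (a, b) in big if a == l})

def h(l):       # boundary entropy of the successor distribution of l
    freqs = [f for (a, b), f in big.items() if a == l]
    total = sum(freqs)
    return sum(-(f / total) * log2(f / total) for f in freqs)

for l, r in [("bi", "da"), ("ku", "pa")]:
    print(l, r, tp(l, r), sv(l), h(l))
# /bi-da/: TP = 1.0, SV = 1, H = 0.0
# /ku-pa/: TP = 0.5, SV = 2, H = 1.0  -- matching Table 6.2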

(a) All phoneme pairs:

      TP     MI     SV     H
TP    1.00   0.77  -0.45  -0.40
MI           1.00  -0.51  -0.43
SV                  1.00   0.76
H                          1.00

(b) Boundaries:

      TP     MI     SV     H
TP    1.00   0.77  -0.10  -0.13
MI           1.00  -0.13  -0.09
SV                  1.00   0.82
H                          1.00

Table 6.3: Correlation coefficients of the predictability measures for all phoneme pairs in the BR corpus (a), and for the phoneme pairs that straddle a word boundary (b). The coefficients are calculated after log-transforming the TP, SV and H values.

The analysis in this section showed that all the measures discussed here do something relevant to segmentation, all scoring consistently above a random (but non-trivial) baseline. The performance analysis done by plotting precision/recall curves, or by plotting precision, recall and f-scores, gives an indication of the potential of a particular measure; the way the measures are used in an actual learning algorithm, in combination with other information, may result in different performance. Here, I will provide another way to look at the similarities and differences of these measures before switching to explicit models of segmentation with concrete algorithms.

Table 6.3 presents the correlation coefficients for all (forward) measures calculated on the BR corpus. Table 6.3a confirms that all four measures are correlated. However, TP and MI are more strongly correlated with each other than with SV and H; similarly, SV and H are more strongly correlated with each other. Hence, the four measures fall into two groups: TP and MI in one, and SV and H in the other. The correlations between the former and the latter group are negative, since the former two measure predictability and the latter two measure unpredictability. Table 6.3b gives the correlation coefficients at the locations where a boundary is observed. This reveals an interesting relationship between these groups of measures: given boundaries, the correlations between the groups drop substantially, while the correlations within the groups do not change much. This is an indication that the measures within the same group are highly dependent, while being relatively (conditionally) independent of the measures in the other group. As with the analysis provided for varying phoneme context size in Section 6.1.6, this indicates that a learning algorithm that combines measures from different groups will gain additional information, while an algorithm that uses measures of the same sort will not.

This section provided an analysis of four measures of (un)predictability for their use in lexical segmentation. All of them measure something relevant to segmentation, as they all perform better than a random segmentation baseline. The analysis also showed that the use of additional context improves their performance, and that it is useful to consider the reverses of the asymmetric measures. Further analysis showed the similarities and differences between these measures. The next section will lay out unsupervised computational models of learning segmentation that build on these measures.

6.2 A predictability based segmentation model

Existing predictability-based computational models of segmentation typically use a single measure of predictability calculated on single-phoneme (and rarely syllable) contexts. However, the analysis of child-directed utterances in Section 6.1 indicates that all four measures discussed (transitional probability, mutual information, successor variety and boundary entropy) are useful indicators of word boundaries. This analysis has also shown that even though these measures are similar in many ways, they measure different aspects of the input. As a result, the combination of these measures should help find boundaries better than any measure alone. Another aspect discussed during this analysis is the effect of the phoneme context, which was also shown to affect the performance of the measures. According to the analysis, increasing the number of phonemes that the measures are calculated on, and combining measures calculated on varying context sizes, are both expected to increase performance. Section 6.1 presented the effectiveness of each measure using a simple threshold-based algorithm, leaving the development of an unsupervised algorithm that combines information from multiple sources for later. This section aims to fulfill this promise by developing, in a number of incremental steps, an unsupervised algorithm for learning lexical units from continuous speech.

6.2.1 Peaks in unpredictability

Besides the non-trivial problem of choosing a threshold, segmentation algorithms based on thresholds do not fully exploit the relation between predictability and lexical units. Deciding on a boundary when an unpredictability measure exceeds a threshold (or, equivalently, when a predictability measure falls below a threshold) is in line with the idea that predictability is low between lexical units. However, thresholds do not directly utilize the fact that predictability is high within the units. In the following pages, a completely unsupervised strategy that explicitly attends to high predictability within units and low predictability between units will be discussed. That is, this strategy posits a boundary if an unpredictability measure at the position is greater than the measure before and after the position.

       #I     Iz     zD     D&     &t     t6     6k     kI     It     ti     i#
TP    0.05   0.17   0.15*  0.29   0.39   0.03*  0.06   0.08   0.25   0.03*  0.26
MI    0.19   2.80   2.20*  3.26   2.59   0.16*  0.81   0.90   1.94  -0.15*  1.71
SV      38     19     43*    10     30     46*    36     43*    19     46*    42
H     4.39   3.16   4.06*  2.50   2.90   4.31*  3.96   4.07*  3.16   4.31*  4.03
TPr   0.09   0.29   0.11*  0.31   0.18   0.07   0.05*  0.06   0.16   0.06*  0.09
SVr     42     34     38     40     41*    40     40     42*    41     33     39
Hr    4.47   3.57   3.80   3.37   3.87*  3.50   4.06   4.47*  3.87   4.12   4.41

(a) Example predictability scores. Columns are labeled by the phonemes surrounding each candidate boundary position; '#' marks the utterance edges.


(b) Graphical representation of MI and H values.

Figure 6.9: Predictability measures for the example utterance /IzD&t6kIti/ 'is that a kitty'. (a) presents all predictability measures discussed in this chapter, calculated on the BR corpus using single-phoneme context. The values where a measure indicates a boundary according to the peak criterion are marked with an asterisk. (b) is a graphical representation of the MI (solid line) and H (dashed line) values for the example utterance. Dotted vertical lines mark expected boundary locations, and triangles mark the positions where the measures indicate a boundary according to the peak criterion. Note that 'valleys' rather than peaks are indications of boundaries for MI.

Following previous research (e.g., Hafer and Weiss, 1974; Harris, 1955), I will call this strategy the peak-based predictability strategy. It should be stressed, however, that the term peak is valid only for unpredictability measures, such as SV and H. For predictability measures such as TP and MI, we look for 'valleys' rather than peaks. As well as reflecting the intuition 'high predictability within words, low predictability between words', the peak-based segmentation strategy is also completely unsupervised: we do not need to tune any parameters or use any labeled data where word boundaries are marked.

Figure 6.9a presents values for all the measures discussed in this chapter for each possible boundary position in the utterance /IzD&t6kIti/ 'is that a kitty'. The measures calculated at the beginning and the end of the utterance are useful for discovering peaks at neighboring positions, but, for a segmentation algorithm, there is no point in trying to discover boundaries at these locations. The values where the peak strategy suggests a boundary for each measure are indicated with an asterisk. Figure 6.9b represents the values for MI and H graphically.

The measures presented in Figure 6.9 are calculated using single-phoneme contexts. That is, the sequence l, and when required the sequence r, are taken to be single phonemes. As a result, the performance of a peak-based segmentation algorithm is bound to be adversely affected by the short context length.

Since SV, SVr, H, and Hr are functions of only l or only r, their performance is particularly low for the example in Figure 6.9. However, unlike the threshold strategy, which makes the same decision before or after a certain phoneme, the peak strategy considers the surrounding values as well. As a result, even with a short context used for calculating the measure, the segmentation decision is affected by a larger surrounding context.

Even though the benefits of the peak strategy for discovering boundaries are clear, there are a few weaknesses to note here. First, the peak-based boundary decision is rather conservative. It requires both sides of the boundary candidate to have the right kind of slope: even a very sharp increase on one side will be discarded unless it is followed by a fall. Considering that most of the measures discussed here are asymmetric, and their indications are stronger in one direction than the other, the problem certainly deserves some attention. This problem becomes more serious for single-phoneme words. Since the peak-based algorithm never makes two boundary decisions in a row, it never detects single-phoneme words. This problem will be revisited in Section 6.2.4. A second problem that I will leave relatively unexplored in this study is the fact that peaks do not take into account how steep the slopes are. Intuitively, the sharper the slope, the stronger the expected boundary indication. However, the peak-based boundary decisions used here ignore this fact.

The peak-based segmentation method that is demonstrated informally in Figure 6.9 can easily be implemented as an unsupervised segmentation algorithm for each measure alone. A possible realization of the peak-based segmentation is described in Algorithm 6.1. For all measures, the algorithm essentially follows the same steps. The predictability measures for each phoneme position in the utterance are calculated using the definitions given in Section 6.1. Unlike the values presented in Figure 6.9, the calculation of the measures is not done using the complete corpus. The frequencies of phonemes and phoneme pairs are updated in an incremental fashion, using only the corpus seen so far. The beginnings and ends of the utterances are treated as special phonemes for the calculation of the measures; otherwise, the utterance boundaries are not used as a separate cue.

Table 6.4 presents the results obtained on the BR corpus for each predictability measure by Algorithm 6.1, in comparison to the random baseline (RM) and the reference recognition algorithm (LM) described in Section 5.5. As described in Section 5.5, the random baseline is not completely random: it knows one important fact about the language, the probability of word boundaries, which is not available to any of the models presented in this thesis.

The results in Table 6.4 indicate clearly that the performance of the peak-based prediction strategy as used here is far behind the LM. However, the results also show that, for all of the measures, the algorithm performs consistently better than random. As will be discussed next, this is all we need to know about these measures for now. Using peaks in unpredictability, Algorithm 6.1 exemplifies a completely unsupervised method of segmentation. However, two other problems raised in Section 6.1, the combination of measures and the use of a larger phoneme context, are still left unanswered. The next subsection will offer solutions to these problems, starting with the former.

Algorithm 6.1: A peak-based segmentation algorithm.
Input: A sequence of utterances without word boundaries
Output: The sequence of utterances with boundaries
 1 foreach utterance u in the input do
 2   foreach phoneme position i in u do
 3     update frequencies of phoneme_i, phoneme_{i+1} and phoneme-pair_{i,i+1};
 4     P_i ← predictability value between i and i+1;
 5     if P_{i−2} > P_{i−1} and P_{i−1} < P_i then
 6       insert a boundary between phoneme_{i−1} and phoneme_i;
 7     end
 8   end
 9   output the segmented utterance;
10 end
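Algorithm 6.1 is straightforward to render in Python. The sketch below is my own, with hypothetical sample utterances; it uses TP as the predictability score and, for brevity, skips the special edge phonemes, so valleys are only detected utterance-internally.

from collections import Counter

def tp(uni, big, l, r):
    """Transitional probability P(r|l), used as the predictability score."""
    return big[l + r] / uni[l] if uni[l] else 0.0

def segment(utterances, measure=tp):
    """A rough Python rendering of Algorithm 6.1 with incremental counts."""
    uni, big = Counter(), Counter()
    for u in utterances:
        uni.update(u)                                     # line 3: update frequencies
        big.update(u[i:i + 2] for i in range(len(u) - 1))
        # line 4: predictability score at every gap g between u[g] and u[g+1]
        scores = [measure(uni, big, u[g], u[g + 1]) for g in range(len(u) - 1)]
        out = [u[0]] if u else []
        for g in range(len(scores)):
            # lines 5-6: a local minimum ('valley') in predictability marks a boundary
            if 0 < g < len(scores) - 1 and scores[g - 1] > scores[g] < scores[g + 1]:
                out.append(" ")
            out.append(u[g + 1])
        yield "".join(out)

print(list(segment(["izdat6kiti", "izit", "dats6kiti"])))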

                 boundary             word               lexicon           error
measure        P     R     F       P     R     F       P     R     F      Eo    Eu
TP           57.6  68.9  62.7    42.8  48.7  45.6    15.0  37.2  21.3    19.2  31.1
MI           66.3  74.1  70.0    52.2  56.6  54.3    18.5  42.5  25.8    14.3  25.9
SV           49.3  53.4  51.3    34.3  36.3  35.3    12.3  38.1  18.5    20.7  46.6
H            51.3  56.5  53.8    38.1  40.8  39.4    13.8  38.8  20.4    20.3  43.5
TPr          53.3  67.5  59.6    36.3  43.1  39.4    14.4  35.5  20.5    22.4  32.5
SVr          36.7  40.0  38.3    22.7  24.1  23.4     8.4  32.3  13.3    26.0  60.0
Hr           43.5  49.6  46.3    28.9  31.7  30.2    10.1  33.7  15.6    24.4  50.4
RM           27.4  27.0  27.2    12.6  12.5  12.5     6.0  43.6  10.5    27.1  73.0
LM           84.1  82.7  83.4    72.0  71.2  71.6    50.6  61.0  55.3     5.9  17.3

Table 6.4: Boundary, word and lexicon precision/recall/f-score values, and oversegmentation (Eo) and undersegmentation (Eu) errors, for the peak-based segmentation algorithm on the BR corpus. RM represents a pseudo-random segmentation that inserts a word boundary with the probability of word boundaries in the gold-standard segmentation. LM is the recognition-based reference model. Both models are described in Section 5.5. The performance and error scores are described in Section 5.4.

6.2.2 Combining multiple measures and varying phoneme context

The discussion so far supports the expectation that using multiple measures and varying context sizes may be beneficial for segmentation performance. Using multiple measures is expected to be better than using a single one since, even though they have a lot in common, each measure seems to capture some aspects of the input that the others do not. It was also shown in Section 6.1 that the phoneme context size makes a difference in the performance of all measures. Furthermore, combining measures calculated on varying phoneme context sizes was also conjectured to be useful. Here, Algorithm 6.1 will be extended to handle multiple sources of information coming from multiple measures calculated on varying phoneme-context lengths.

In essence, the peak-based segmentation method presented in Algorithm 6.1 is a binary classifier: it classifies each possible boundary position in an utterance as a boundary or a non-boundary. Using different measures results in multiple classifiers that do the same task. Combining a number of classifiers to achieve better performance than each individual classifier is a relatively well-studied problem in the machine learning literature, where such sets of classifiers are known as ensembles or committees (e.g., Bishop, 2006, chapter 14). For an effective combination, the classifiers should be accurate and diverse (Hansen and Salamon, 1990). Accuracy refers to the requirement that the individual classifiers perform better than random. Diversity is taken as the requirement that, to some extent, the classifiers are independent. Most combination methods in machine learning, such as bagging and boosting, are typically suited to supervised classifiers. However, the field offers a set of practical and theoretical tools for the problem at hand. Here, a simple and well-known method, majority voting, will be used for combining the multiple measures for segmentation.

As well as in machine learning applications, majority voting is also a common (and arguably effective) method in everyday social and political life. As a result, it has been well studied, and it is known to work well especially if the accuracy and diversity requirements are met. A theoretical justification of majority voting is given by the well-known 'Condorcet's jury theorem', which dates back to the late 18th century (Boland, 1989). Provided that each member's decision is better than random and the votes are cast independently, Condorcet's jury theorem states that the probability that a jury arrives at the correct decision by majority vote monotonically approaches one as the number of members is increased. Informally, this states that in the long run the decision of a large number of less competent individuals is better than the decision of a single individual with the greatest competence. In practice, even though the votes are almost never independent (especially in the social scene), majority voting is still an effective way of combining the outcomes of multiple classifiers (see Narasimhamurthy, 2005, for a recent review and a discussion of the effectiveness of the method).

Algorithm 6.2: The majority voting algorithm for multiple measures and multiple context sizes. The function m() at line 9 calculates the predictability score (hence, unpredictability measures are multiplied by −1) according to measure m on the given sequences of phonemes. If the required n-gram is not available, the algorithm backs off to the n-gram with the highest available rank.
Input: A sequence of utterances without word boundaries and the maximum context size M
Output: The sequence of utterances with boundaries
 1 foreach utterance u in the input do
 2   for n = 1 ... M + 1 do
 3     update n-gram frequencies for the n-grams in u;
 4   end
 5   foreach phoneme position i in u do
 6     votecount ← 0;
 7     foreach measure m do
 8       foreach context size n = 1 ... M do
 9         P_i ← m(n-gram ending at i−1, phoneme_i);
10         if P_{i−2} > P_{i−1} and P_{i−1} < P_i then
11           votecount ← votecount + 1;
12         else
13           votecount ← votecount − 1;
14         end
15       end
16     end
17     if votecount > 0 then
18       insert a boundary between phoneme_{i−1} and phoneme_i;
19     end
20   end
21   output the segmented utterance;
22 end
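The inner voting loop of Algorithm 6.2 (lines 6–19) can be sketched in Python as follows. The per-voter score sequences, one per measure and context size and with unpredictability scores pre-negated, are assumed to have been computed beforehand; the back-off to shorter n-grams is omitted, and the names are mine.

def majority_boundary(scores_by_voter, g):
    """Majority vote over voters at gap g (cf. Algorithm 6.2, lines 6-19).
    Each voter is one measure/context-size pair; a 'valley' in its
    (predictability) scores at g counts as a boundary vote."""
    votecount = 0
    for scores in scores_by_voter:
        valley = 0 < g < len(scores) - 1 and scores[g - 1] > scores[g] < scores[g + 1]
        votecount += 1 if valley else -1
    return votecount > 0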

Majority voting provides a simple way to incorporate the information from multiple, (somewhat) independent measures, together with the information provided by calculating these measures on varying context sizes. Instead of calculating a single value for one measure and one context size, we can calculate multiple values for multiple measures with multiple context sizes. Each measure–context size pair forms a voter. If there is a peak in unpredictability according to this pair, we get a boundary vote. If the majority of the voters vote for a boundary at a possible segmentation position, we insert a boundary at that position.

                 boundary             word               lexicon           error
max. context   P     R     F       P     R     F       P     R     F      Eo    Eu
1            73.6  68.2  70.8    57.3  54.3  55.8    16.7  49.5  24.9     9.2  31.8
2            86.6  72.9  79.2    70.7  62.8  66.6    22.0  58.6  32.0     4.3  27.1
3            89.7  77.5  83.1    75.6  68.3  71.8    27.7  63.3  38.6     3.4  22.5
4            93.4  73.6  82.3    76.4  65.0  70.2    26.1  63.1  36.9     2.0  26.4
5            94.1  72.2  81.7    76.3  63.7  69.4    26.2  64.0  37.2     1.7  27.8
6            94.9  66.1  77.9    73.4  57.7  64.6    22.8  61.7  33.3     1.4  33.9
7            95.1  63.5  76.2    72.4  55.5  62.8    21.4  60.1  31.5     1.2  36.5
8            95.4  58.7  72.7    70.3  51.2  59.2    19.6  58.3  29.3     1.1  41.3
RM           27.4  27.0  27.2    12.6  12.5  12.5     6.0  43.6  10.5    27.1  73.0
LM           84.1  82.7  83.4    72.0  71.2  71.6    50.6  61.0  55.3     5.9  17.3

Table 6.5: Performance and error scores for the peak-based majority voting algorithm with varying context. The two reference models, RM and LM, are defined in Section 5.5.

Algorithm 6.2 describes this version of the segmentation method using majority voting. For the forward measures, the context size defines the length of the sequence l, while for the reverse measures it defines the sequence r. For each boundary position, the number of votes Algorithm 6.2 considers is equal to 'the maximum context size' times 'the number of measures'. For example, assuming that we run the algorithm only for H and Hr with a maximum context size of two, and the algorithm is about to decide whether there is a boundary after ki in akitty, it checks each of the following conditions:

1. H(i) > H(k) and H(i) > H(t)
2. Hr(i) > Hr(k) and Hr(i) > Hr(t)
3. H(ki) > H(ak) and H(ki) > H(it)
4. Hr(ki) > Hr(ak) and Hr(ki) > Hr(it)

Then, the algorithm increases the vote count by one for each condition met. If the vote count is greater than half of the votes (two in this case), it inserts a boundary.

The results of combining all measures with varying context size using majority voting on the BR corpus are presented in Table 6.5. Each line in the table lists the common segmentation scores used in this chapter, for maximum context sizes between one and eight. A maximum context size of one means that the measures are calculated with single-phoneme context. As a result, the scores in the first line of Table 6.5 are obtained by the majority decision of seven voters (TP, MI, SV, H, TPr, SVr and Hr, all calculated on single-phoneme context), while the scores in the second line are obtained by the majority decision of 14 voters, each representing context size one or two for all seven measures.

The results certainly improve compared to the single-measure segmentation results presented in Table 6.4. Some of the scores also exceed the performance of the LM, the recognition-based reference model. The majority voting algorithm is good at spotting boundaries and words: the boundary and word precision scores are consistently better than the corresponding recall scores. When the maximum context parameter is increased, both precision and recall increase at first. This is expected, since we incorporate information from higher-level n-gram frequencies that are good predictors of the boundaries. After context length three, recall starts to go down, while precision still improves with increasing parameter values. Since the increased number of voters requires a higher consensus, it is natural that precision is high. However, the higher number of voters also means that disagreement on real boundaries will increase, so recall drops. With the decreased boundary recall, the word and lexicon precision start going down as well. One reason for this may be that higher-level n-grams suffer from data sparseness, so the voters that use higher-level n-grams become less competent. As a result, increasing the number of voters that calculate their scores on higher-level n-grams violates the requirement for successful combination that the individual voters perform better than random.

Despite being precise at spotting boundaries (and, as a result, words), the majority voting algorithm is still bad at lexical precision. The low lexical scores mostly stem from two causes. First, this algorithm does not build and use an explicit lexicon, so it does not get any reward for reusing previously discovered lexical items. Second, the algorithm starts with no prior knowledge at all, and it takes time to build useful n-gram statistics. Until a reasonable amount of statistics is collected, many wrong word types are inserted into the lexicon, and this affects the lexical precision adversely. The use of an explicit lexicon will be investigated in Chapter 8. The effects of the lack of information at the beginning of the segmentation process will be discussed in Chapter 9. Before concluding, the rest of this chapter will present some improvements to Algorithm 6.2.

6.2.3 Weighing the competence of the voters

The majority voting algorithm presented in Section 6.2.2 treats all the voters equally. Even though this may be a virtue in the social and political context, it is a shortcoming for a learner. A better learner is expected to identify the value of the information provided by each source, and to increase the weight of the sources that perform well consistently. Weighted majority voting is an extension of the majority voting algorithm which weighs the vote of each source according to its competence (Littlestone and Warmuth, 1994). For the particular instantiation of the weighted majority voting algorithm used here, we first assign a weight wi in the range [0, 1] to each voter. Second, instead of increasing or decreasing the vote count by one, we increase or decrease the vote count by wi. To do that, we replace line 11 in Algorithm 6.2 with 'votecount ← votecount + wi' and line 13 with 'votecount ← votecount − wi'. The rest of the segmentation algorithm is essentially the same. Note that if all weights are set to one, the algorithms are equivalent.

                 boundary             word               lexicon           error
max. context   P     R     F       P     R     F       P     R     F      Eo    Eu
1            72.1  71.7  71.9    57.0  56.8  56.9    17.4  48.3  25.6    10.5  28.3
2            83.7  77.6  80.5    70.3  66.6  68.4    25.1  59.4  35.3     5.7  22.4
3            89.3  78.2  83.4    75.6  68.9  72.1    28.0  62.8  38.8     3.5  21.8
4            92.7  76.0  83.5    77.2  67.4  72.0    28.4  65.1  39.6     2.3  24.0
5            94.1  71.4  81.2    75.8  62.8  68.7    26.3  64.8  37.4     1.7  28.6
6            94.7  66.8  78.3    73.9  58.5  65.3    23.5  63.2  34.3     1.4  33.2
7            95.1  62.1  75.2    71.9  54.3  61.8    21.1  60.6  31.3     1.2  37.9
8            95.1  58.5  72.4    70.2  51.1  59.1    19.6  58.5  29.4     1.1  41.5
RM           27.4  27.0  27.2    12.6  12.5  12.5     6.0  43.6  10.5    27.1  73.0
LM           84.1  82.7  83.4    72.0  71.2  71.6    50.6  61.0  55.3     5.9  17.3

Table 6.6: Performance and error scores for the peak-based weighted majority voting algorithm with varying context. The two reference models, RM and LM, are defined in Section 5.5.

So far we have described how to adjust the majority voting algorithm so that it can weigh its sources of information. However, we also need a way of setting the weights so that they reflect the usefulness of each particular voter's decisions. As in many examples in the literature, all weights are set to one at the beginning, and updated after each decision. In supervised models, where the exact error is known, one can adjust the weights so as to reduce the error. Here, we do not know the boundary locations, and we cannot be certain about which decisions are correct. Instead, we take the (weighted) majority decision as the correct decision: if a voter agrees with the majority decision, we count this as a correct decision, and if it disagrees, we assume that it made an error. To finalize our adjustments to Algorithm 6.2, we keep a count of the errors made by each voter i, ei, which is incremented when the voter does not agree with the majority decision. After every boundary decision, the error counts are first updated for each voter. Then, the weights wi of all voters are updated using

wi ← 2 (0.5 − ei / N)

where N is the number of boundary decisions so far, including the current one. This update rule sets the weight of a voter that is wrong half the time (a voter that votes at random) to zero, eliminating incompetent voters. If the votes of a voter are in accordance with the rest of the voters almost all the time, the weight stays close to one.

The performance scores of the weighted majority algorithm for maximum phoneme context parameters between one and eight on the BR corpus are presented in Table 6.6. In general, the weighted majority voting algorithm performs slightly better than the plain majority voting algorithm. The performance of the algorithm could be improved by further extensions, for example by using a better method for setting the weights, or by using modified versions of peak-based boundary detection.
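A minimal sketch of the weighted vote and of this update rule follows; the names are mine, and flooring the weights at zero is my assumption to keep them in the stated [0, 1] range, since the update itself can go negative.

def weighted_boundary(scores_by_voter, weights, g):
    """Algorithm 6.2 with lines 11/13 replaced by +/- w_i (Section 6.2.3)."""
    total = 0.0
    for scores, w in zip(scores_by_voter, weights):
        valley = 0 < g < len(scores) - 1 and scores[g - 1] > scores[g] < scores[g + 1]
        total += w if valley else -w
    return total > 0.0

def updated_weights(errors, n_decisions):
    """w_i <- 2*(0.5 - e_i/N): a voter wrong half the time gets weight zero."""
    return [max(0.0, 2 * (0.5 - e / n_decisions)) for e in errors]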

                 boundary             word               lexicon           error
max. context   P     R     F       P     R     F       P     R     F      Eo    Eu
1            52.5  89.2  66.1    34.2  51.2  41.0    24.9  30.3  27.3    30.5  10.8
2            63.7  92.5  75.4    49.6  65.4  56.4    34.3  39.7  36.8    19.9   7.5
3            72.4  92.7  81.3    60.5  72.5  66.0    36.8  50.8  42.7    13.3   7.3
4            79.8  90.3  84.7    68.5  74.9  71.5    38.1  60.6  46.8     8.7   9.7
5            84.0  85.6  84.8    71.8  72.8  72.3    34.8  65.9  45.5     6.2  14.4
6            86.2  80.2  83.1    72.5  69.0  70.7    30.2  66.0  41.4     4.8  19.8
7            87.5  75.1  80.9    72.2  64.9  68.4    26.2  63.6  37.1     4.0  24.9
8            88.1  70.8  78.5    71.2  61.3  65.9    23.8  61.7  34.3     3.6  29.2
RM           27.4  27.0  27.2    12.6  12.5  12.5     6.0  43.6  10.5    27.1  73.0
LM           84.1  82.7  83.4    72.0  71.2  71.6    50.6  61.0  55.3     5.9  17.3

Table 6.7: Performance and error scores for the peak-based weighted majority voting algorithm that incorporates the information from local changes on both sides of the boundary candidate. The two reference models, RM and LM, are defined in Section 5.5. The performance and error scores are defined in Section 5.4.

However, the purpose of the current work is not to find the best-performing segmentation algorithm, but rather to propose an explicit model of segmentation that combines information from multiple sources. The weighted version of the algorithm is more attractive in this regard. First, it makes it easy to include possibly irrelevant sources of information: if they are irrelevant, the weight update procedure will leave them out by reducing their weights to zero. Second, it may explain certain shifts during learning. For example, a weak cue that is not very useful before enough input is seen may become stronger in time, as its predictions become more effective with the additional information. In other words, a weak source of information may be bootstrapped by other sources if the information it collects is relevant.

6.2.4 Two sides of a peak

While describing the peak-based segmentation decision, Section 6.2.1 also pointed out a particular weakness of the peak criterion defined here: it is too conservative, and in some cases this is a serious problem. For example, since there cannot be two peaks in a row, it can never find single-phoneme words. This is also evident in the performance scores presented so far: all combinations presented have high precision (and low oversegmentation error), but low recall (and high undersegmentation error). The solution to this problem has been delayed up to this point because the combination methods described in the previous sections provide a natural approach. We can interpret the increase or decrease of uncertainty on either side of a boundary candidate as separate indications. The majority voting algorithm can easily incorporate these additional voters' decisions, and the weighted majority voting provides additional reassurance by eliminating useless votes.
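One way to realize this, sketched below under my own naming, is to let the two slopes of a candidate peak vote separately, so that a sharp change on one side can contribute even when the full peak criterion fails (for instance, around single-phoneme words).

def side_votes(scores, g):
    """Split the valley test at gap g into two independent voters:
    a drop on the left side and a rise on the right side each cast
    their own boundary vote (cf. Section 6.2.4)."""
    left = g > 0 and scores[g - 1] > scores[g]                 # falling into the valley
    right = g < len(scores) - 1 and scores[g] < scores[g + 1]  # rising out of it
    return left, right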

                 boundary             word               lexicon           error
max. context   P     R     F       P     R     F       P     R     F      Eo    Eu
1            51.6  87.0  64.8    32.6  48.4  39.0    23.7  31.6  27.1    30.8  13.0
2            63.1  92.0  74.8    48.9  64.8  55.7    33.3  40.2  36.4    20.4   8.0
3 (PM)       69.6  92.5  79.5    56.9  70.2  62.9    36.7  49.8  42.3    15.3   7.5
4            76.6  90.9  83.2    65.0  73.5  69.0    38.3  60.3  46.8    10.5   9.1
5            81.5  87.2  84.3    69.6  73.0  71.3    34.9  64.7  45.3     7.5  12.8
6            84.3  82.7  83.5    71.5  70.6  71.0    32.5  67.5  43.8     5.8  17.3
7            85.3  77.6  81.3    70.6  66.1  68.3    28.3  65.9  39.6     5.0  22.4
8            86.1  73.2  79.1    69.9  62.5  66.0    25.1  63.5  36.0     4.5  26.8
RM           27.4  27.0  27.2    12.6  12.5  12.5     6.0  43.6  10.5    27.1  73.0
LM           84.1  82.7  83.4    72.0  71.2  71.6    50.6  61.0  55.3     5.9  17.3

Table 6.8: Performance and error scores of the weighted majority algorithm considering the local changes on both sides of the boundary candidate, using the measures MI, H, and Hr with varying context. The two reference models, RM and LM, are defined in Section 5.5.

Furthermore, since most of the measures discussed here are asymmetric, their indication in one direction is stronger. For example, one expects TP to give better indications while processing the stream left-to-right, that is, on the left side of a boundary candidate. On the contrary, TPr should provide a better indication on the right side. Weighted combination will automatically discover the value of these decisions. As a result, the last improvement to the boundary discovery algorithm discussed here is to incorporate the local changes on the two sides of a boundary candidate as separate voters in the weighted majority voting algorithm. Table 6.7 presents these results for varying maximum context size. In comparison to the previously presented results, the benefit of this may not be immediately clear, except for the lexical scores: the approach seems to trade precision for increased recall. However, as well as in the increased lexical performance in Table 6.7, the benefit of this more eager segmentation approach will become clearer as other, more varied cues are added.

6.2.5 Reducing redundancy

In Section 6.1 we concluded that even though all the measures discussed so far measure something relevant to word boundaries, they are different. Further analysis showed, however, that not all of them are different from one another. In particular, they seem to form two groups: TP and MI in one, SV and H in the other. The analysis also indicated that if we pick one measure from each group, we do not miss much information. Since this option also simplifies the computational system, it is attractive to pick a subset of the measures instead of all of them. Table 6.8 presents the results obtained using only MI, H, and Hr. Compared to the results in Table 6.7, most scores go down slightly. It seems that SV and TP make only minor contributions to the performance, even in the presence of MI and H. However, for the sake of simplicity, I will take these measures as representative of predictability for the rest of this thesis.

6.3 Summary and discussion

It is time for the rather long discussion of a set of predictability measures and the predictability-based segmentation strategy in this chapter to be wrapped up. However, a last simplification will be made, and the model developed so far will be evaluated further, before concluding.

The previous sections presented sets of performance scores according to the varying context size that the measures are calculated with. Even though the effect of increasing the maximum context size may be insightful, it is not one of the main interests of this thesis. It is clear from the results presented so far that increasing the maximum context size increases precision and reduces oversegmentation errors, but it also decreases recall and increases undersegmentation errors. If we were to pick a 'maximum context size' based on the results in Table 6.8, values in the range three to six look like well-performing options. However, we should also note that using higher context sizes is likely to cause memorization of complete words or phrases; if we want more generalization, smaller context sizes are more appropriate. As a result, to simplify the discussion for the rest of this thesis, I will pick a maximum context size of three, and use the results obtained with this setting (Table 6.8, row three) as the representative results of the predictability-based segmentation strategy. In the rest of this thesis, the model using the three predictability measures MI, H and Hr with context sizes between one and three will be called the predictability-based segmentation model (PM). The PM will be used for all further simulations combining results from multiple cues and strategies that will be presented in the next two chapters. The decision for the number three is still somewhat arbitrary, and fine-tuning this parameter may lead to better performance values. However, for the sake of simplicity, I will stick to this setting for the other models presented in later chapters as well. The choice of the maximum context size parameter is further discussed in Chapter 9.

Now that we have picked a representative model, without space concerns about presenting too many graphs, we can present one more aspect of the model's performance: the change of performance with increasing input. Following the evaluation strategy described in Section 5.4, Figure 6.10 presents the change of f-score and error values for the PM for each 500-utterance block of the BR corpus. For the last 290 utterances, the performance scores are significantly better than the scores calculated for the complete corpus (BF=83.8%, WF=69.8%, LF=60.2%, Eo=13.7%, Eu=3.4%; cf. the scores in Table 6.8, row three).

This chapter described a set of well-known measures of predictability or uncertainty. After a careful analysis of these measures and their combination, I have described a completely unsupervised method to combine multiple measures calculated on varying phoneme-context sizes. Arguably, we could do with a single measure of predictability.


Figure 6.10: (a) Boundary, word token and word type f-scores and (b) oversegmentation and undersegmentation rates of the PM on the BR corpus for successive blocks of 500 utterances each.

The reason for the effort spent on combining these measures here is twofold. First, it seems that none of the measures alone performs as well as the combination of multiple measures. In the interest of getting the most out of predictability cues, it makes sense to combine them. Second, the methods developed here for combining multiple cues will be used for combining other cues in the next two chapters. For the simulations of multiple-cue combinations, the algorithms presented in this chapter will set an example.

The strategy defined in this chapter works, and certainly performs well in the segmentation task. This is very clear when we compare the performance of the final model with the baseline model RM. When compared to the performance of the state-of-the-art LM strategy, on the other hand, the performance of the PM is not that impressive; however, it is not too far behind either. The main question of interest in this study is not to achieve the best performance, but to investigate how multiple cues may contribute to the performance of a cognitively relevant cue combination model. In this respect, this chapter is only the first step, and the coming chapters will continue taking steps in this direction, increasing the performance of the model along the way.

7 Learning from Utterance Boundaries

You don’t understand anything until you learn it more than one way. Marvin Minsky

One of the attractive aspects of predictability-based segmentation is that it does not require any lexical knowledge in advance. Most of the other cues discussed in Section 4.2 need at least some lexical knowledge to be useful for discovering lexical unit boundaries. However, certain aspects of lexical unit boundaries, such as the regularities found at the beginnings and ends of words, can be induced from the boundaries already marked in the input, without the need for a lexicon. There are a number of acoustic cues, such as pauses, that are highly correlated with lexical unit boundaries. However, these are generally considered to be unreliable (see the discussion in Section 4.2.4), and they are rarely marked in available corpora. Utterance boundaries, on the other hand, are typically marked well during a conversation, and they are marked clearly in all available corpora. Even though the segmentation method that will be presented in this chapter can be used with boundaries marked by any cue, the results reported here are obtained using only the information collected at the utterance boundaries.

Section 4.2.4 suggested two possible uses for pauses in general, which naturally apply to utterance boundaries. First, one can use pauses to restrict the possible segmentations by disregarding possible words that straddle pauses. This use of utterance boundaries for segmentation has been utilized (implicitly) by almost all computational models of segmentation.¹ The predictability-based segmentation model discussed in the previous chapter is no exception. Here, I will investigate the second use: since utterance boundaries are also word boundaries, paying attention to the beginnings and ends of utterances can give some insight into the way words are built, i.e., some aspects of the phonotactics of the input language can be learned.

¹A notable exception is presented by Perruchet and Vinter (1998), which simulates the experimental setup of Saffran et al. (1996a), where the stimuli do not have any utterance boundaries.


7.1 Related work

The use of utterance boundaries for lexical segmentation is typical in connectionist models (e.g., Aslin et al., 1996; Christiansen et al., 1998; Stoianov and Nerbonne, 2000). These models try to predict whether there is an utterance boundary after a given input sequence or not. If the model indicates a high likelihood of an utterance boundary where there is none, this is taken as an indication of a word boundary. The non-connectionist models, even the ones that take phonotactics seriously (e.g., Blanchard et al., 2010), do not typically pay attention to utterance boundaries. Two exceptions are the models described by Fleck (2008) and Monaghan and Christiansen (2010). Both models make use of pauses and utterance boundaries to learn phonotactics and use them for lexical segmentation.

In segmentation models that use some form of language modeling (such as Brent, 1999a; Goldwater et al., 2009; Venkataraman, 2001), including the reference model LM described in Section 5.5, a simple model of phonotactics is used for estimating the probabilities of unknown words (Equation 5.4). More elaborate models use a similar phonotactics component but calculate it over higher-level phoneme n-grams (e.g., Blanchard et al., 2010). The probabilities of the phoneme n-grams are estimated from word tokens (Venkataraman, 2001) or from word types (Blanchard et al., 2010; Brent, 1999a). However, none of these models uses utterance boundaries explicitly to infer phonotactics. As far as I could determine, the only two non-connectionist models that make use of utterance boundaries explicitly are those presented by Monaghan and Christiansen (2010) and Fleck (2008). The model presented by Monaghan and Christiansen (2010) learns phonotactics from a lexicon learned in an on-line fashion using a heuristic algorithm. Thus, it uses a different source of information than the utterance boundaries, and it is more similar to the word-based segmentation model that will be described in Section 8.2. The model presented by Fleck (2008) is probably the most similar to the segmentation model that will be outlined in this chapter, and it will be described here briefly.

Fleck (2008) presents a batch model that guesses boundaries based on their left and right (phoneme) contexts. For any boundary position, the model considers its left (l) and right (r) phoneme contexts. A boundary decision is made if P(b|l, r) ≥ 0.5, where b denotes a boundary. That is, the model decides for a boundary if, given l and r, the probability of observing a boundary is higher than 0.5. Using Bayesian inversion and assuming independence between l and r,

P(b|l, r) = P(b) P(l, r|b) / P(l, r) = P(b) P(l|b) P(r|b) / (P(l) P(r)).    (7.1)

As a result, learning in this model can be considered as estimating five probability values: P(b), P(l|b), P(r|b), P(l) and P(r). Notice that the quantity P(b|l, r) estimated here is the measure presented by Hockema (2006), discussed in Section 6.1. As demonstrated in that section, the measure is rather accurate, and if the estimate is successful, it is expected to lead to high performance. For each candidate boundary position, Fleck (2008) uses variable-length left and right n-grams of up to five phonemes, such that the selected n-gram is the longest sequence that occurs at least 10 times in the corpus. The segmentation algorithm works in three steps, and requires multiple passes over the corpus. First, the model estimates P(l|b) and P(r|b) from the utterance boundaries; at this step, a segmentation decision is made if these probabilities are over a threshold value. In the second phase, a heuristic morphological analyzer fixes certain boundaries decided during the first phase. In the last step, the model estimates P(b), re-estimates P(l|b) and P(r|b) from the output of the second step, and, also estimating P(l) and P(r) from the corpus, uses Equation 7.1 to calculate the estimate of P(b|l, r). A segmentation decision is made if this value is greater than 0.5. Fleck (2008) reports relatively good results (BF=82.9%, WF=70.7%, LF=36.6% on the BR corpus; see Table 7.3 for details).
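As a concrete (if simplified) illustration of Equation 7.1, the sketch below estimates the five probabilities from a toy corpus in which boundaries are already marked, using single-phoneme contexts. This is my own simplification: Fleck's actual model instead bootstraps P(l|b) and P(r|b) from utterance boundaries, uses variable-length n-grams, and runs in the three batch steps just described.

from collections import Counter

def p_b_estimator(segmented_utterances):
    """Estimate P(b|l, r) of Equation 7.1 with single-phoneme l and r,
    from utterances with word boundaries marked by spaces (a toy setup)."""
    b_count, pos_count = 0, 0
    left, right = Counter(), Counter()
    left_b, right_b = Counter(), Counter()
    for utt in segmented_utterances:           # e.g. "Iz D&t 6 kIti"
        phones = utt.replace(" ", "")
        gaps, k = set(), 0
        for w in utt.split()[:-1]:
            k += len(w)
            gaps.add(k)                        # a boundary after the k-th phoneme
        for g in range(1, len(phones)):        # utterance-internal positions only
            l, r = phones[g - 1], phones[g]
            pos_count += 1
            left[l] += 1
            right[r] += 1
            if g in gaps:
                b_count += 1
                left_b[l] += 1
                right_b[r] += 1
    def p_b_given(l, r):
        # P(b|l,r) = P(b) P(l|b) P(r|b) / (P(l) P(r))
        p_b = b_count / pos_count
        return (p_b * (left_b[l] / b_count) * (right_b[r] / b_count)
                / ((left[l] / pos_count) * (right[r] / pos_count)))
    return p_b_given

# Toy call; with so little data the independence-based estimate can exceed 1.
p = p_b_estimator(["Iz D&t 6 kIti", "Iz It 6 kIti"])
print(p("z", "D"))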

7.2 Do utterance boundaries provide cues for word boundaries?

From previous studies (e.g., Christiansen et al., 1998; Fleck, 2008), it is clear that utterance boundary information is useful for finding word boundaries. In particular, utterance beginnings and endings can be used for guessing what word beginnings and endings look like. In other words, utterance boundaries help in learning phonotactics. This subsection provides an analysis of the BR corpus to demonstrate the usefulness of utterance boundaries in finding word boundaries, and introduces a measure that will be used by the unsupervised segmentation model described next.

One thing to note about Fleck's formulation of the problem (Equation 7.1) is that we could actually estimate P(b|l, r) directly from a corpus with pauses. This would probably assume a large number of pauses, but the required number of pauses is not necessarily unrealistic in comparison to the input children receive during language acquisition. Estimating P(b|l, r) from the utterance boundaries alone is problematic, since utterance boundaries provide information about P(b|r) and P(b|l) separately. Nevertheless, estimating the relation of boundaries with the left and right contexts separately is not a bad idea. In essence, what we are interested in is whether l is a good candidate for a word ending, and whether r is a good candidate for a word beginning. An accurate estimation of the joint indication of l and r together may help, but a word ending is relatively independent of the beginning of the next word, and the separate indications of l and r as possible word endings and word beginnings, respectively, are useful criteria for identifying boundaries.

To determine whether utterance beginnings and endings are good estimators of word beginnings and endings, Figure 7.1 presents the results of an analysis of the BR corpus. The first two columns in Figure 7.1 present familiar histograms for P(ub|r) and P(ub|l), respectively, where ub represents the presence of an utterance boundary.

Figure 7.1: For phoneme pairs l and r, the distribution of (a) P(ub|r) over word-initial phoneme pairs, (b) P(ub|r) over non-word-initial phoneme pairs, (c) P(ub|l) over word-final phoneme pairs, and (d) P(ub|l) over non-word-final phoneme pairs. The precision/recall curves in (e) present the performance of two simple algorithms that decide on a boundary if P(ub|l) or P(ub|r) is higher than a threshold (best f-scores: 78.5 and 73.6).

Figure 7.1a presents the distribution of P(ub|r) for all word-initial phoneme bigrams that are not utterance-initial, while Figure 7.1b presents the distribution of the same measure for phoneme bigrams found in non-word-initial positions. Similarly, Figures 7.1c–d present the distribution of P(ub|l) for word-final phoneme bigrams that are not utterance-final, and for non-word-final phoneme bigrams, respectively. The utterance boundaries themselves are excluded while calculating all the distributions presented in Figure 7.1.

The histograms in Figures 7.1a and 7.1c clearly show that the phoneme pairs that occur at utterance beginnings and utterance ends are more likely to occur at word beginnings and word ends, respectively. In particular, most of the non-word-initial phoneme bigrams never occur at the beginning of utterances, and most of the non-word-final phoneme bigrams never occur at the end of utterances. As a result, a large portion of the probability mass in the lower histograms is grouped together at the very low end of the scale. In general, the distributions are fairly different for target locations (i.e., word beginnings and ends) and non-target locations.

Figure 7.1e presents boundary precision/recall curves for two simple threshold-based segmentation algorithms: one segmenting before a phoneme bigram r such that P(ub|r) is higher than a threshold, and the other segmenting after a phoneme bigram l such that P(ub|l) is higher than a threshold. When used in this manner, the information from utterance beginnings seems to be slightly more useful compared to utterance endings (this will be discussed further in Section 7.3). In general, the overall performance is rather good, and the best f-scores for both are above 70%. However, as discussed before, an unsupervised learner faces the problem of determining a good threshold, or of using another method for an operational definition of how to decide for or against a boundary. The next subsection will address the problem of how to use the information collected at utterance boundaries in an unsupervised learning method. Before that, two additional notes are in order.

First, we use the conditional probabilities P(ub|r) and P(ub|l) as measures of association between the utterance boundaries and the phoneme sequences following or preceding them. An arguably better measure would be pointwise mutual information (MI, see Section 6.1.3). Indeed, preliminary experiments using MI (not reported here) resulted in slightly better performance. However, the only major addition MI brings in this particular task is correcting the probability value for the frequency of boundaries. Since the frequency of boundaries is relatively stable compared to other phoneme n-grams, conditional probability performs similarly. The choice of conditional probability here is mainly motivated by the fact that it is a simpler measure; hence, it makes the exposition more transparent and easier to follow.

Second, the statistical analysis presented in Figure 7.1 is based on phoneme pairs (or bigrams). Phoneme n-grams of varying length may provide different information. As the length of the n-gram increases, it is more likely to capture whole words, while shorter n-grams are likely to capture regularities shared by many words. The varying size of phoneme n-grams will be one of the points investigated with the unsupervised model that is described next.

7.3 An unsupervised learner using utterance boundaries

The analysis in Section 7.2 showed that P(ub|r) and P(ub|l) are useful measures for deciding for a boundary before r, or after l. In other words, if a sequence r occurs consistently at utterance beginnings, it is likely that it also follows word boundaries. Similarly, if l occurs consistently at the end of utterances, it is likely that it precedes word boundaries as well. For the rest of this section, we will refer to P(ub|r) as 'UBb' and to P(ub|l) as 'UBe', signifying that these quantities are utterance boundary cues learned from utterance beginnings and utterance ends, respectively. Additionally, where necessary, a superscript indicating the length of the relevant phoneme n-gram (l or r) will be used. For example, UBb³ is equal to P(ub|r) where r is a phoneme n-gram of length three, i.e., the probability of observing an utterance boundary before r, given r.

Following the conventions suggested in Section 6.2, this section reports the results of an unsupervised learner that uses a peak strategy. That is, the learner posits a boundary if the measure (UBe or UBb) calculated at the current position is higher than at the surrounding candidate boundary positions.

Figure 7.2 plots UBb² and UBe² values for each possible boundary position in the utterance /IzD&t6kIti/ 'is that a kitty'. As with the predictability values, the relevant score is calculated both at the beginning and at the end of the utterance. We do not look for boundaries at these positions; however, the values are used for detecting peaks. It should be noted that UBe is not defined at the beginning of the utterance, and similarly UBb is not defined at the end of the utterance. As a result, UBe can measure only the right side of the 'peak' after the first phoneme, and UBb can measure only the left side of the 'peak' before the last phoneme.
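The two measures can be estimated directly from utterance edges. The following minimal sketch (my own names; no back-off to shorter n-grams at the edges) computes UBb and UBe for all phoneme n-grams of a given length in a corpus:

from collections import Counter

def ub_measures(utterances, n=2):
    """Estimate UBb = P(ub|r) and UBe = P(ub|l) for phoneme n-grams:
    the fraction of an n-gram's occurrences that are utterance-initial
    (for UBb) or utterance-final (for UBe)."""
    total, initial, final = Counter(), Counter(), Counter()
    for u in utterances:
        for i in range(len(u) - n + 1):
            g = u[i:i + n]
            total[g] += 1
            if i == 0:
                initial[g] += 1            # n-gram starts the utterance
            if i == len(u) - n:
                final[g] += 1              # n-gram ends the utterance
    ub_b = {g: initial[g] / total[g] for g in total}
    ub_e = {g: final[g] / total[g] for g in total}
    return ub_b, ub_e

ub_b, ub_e = ub_measures(["IzD&t6kIti", "IzIt", "D&ts6kIti"])
print(ub_b.get("Iz"), ub_e.get("ti"))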


Figure 7.2: UBb² (solid line) and UBe² (dashed line) values for the example utterance /Iz D&t 6 kIti/ 'is that a kitty'. The dotted vertical lines mark expected boundary locations. The values are calculated on the BR corpus using bigrams, falling back to unigrams at the edges of the utterance where bigrams are not available.

                 boundary             word               lexicon           error
measure        P     R     F       P     R     F       P     R     F      Eo    Eu
UBb¹         61.0  67.9  64.2    46.5  50.2  48.3    15.6  40.8  22.6    16.4  32.1
UBb²         79.5  72.9  76.1    65.1  61.3  63.1    21.0  53.3  30.1     7.1  27.1
UBb³         89.2  63.1  73.9    67.7  53.7  59.9    16.3  49.1  24.5     2.9  36.9
UBe¹         58.0  60.5  59.2    44.5  45.9  45.2    16.5  43.4  23.9    16.5  39.5
UBe²         73.9  71.1  72.5    60.1  58.5  59.3    21.6  53.3  30.8     9.5  28.9
UBe³         81.3  61.1  69.7    61.4  50.6  55.5    18.7  55.8  28.0     5.3  38.9
RM           27.4  27.0  27.2    12.6  12.5  12.5     6.0  43.6  10.5    27.1  73.0
LM           84.1  82.7  83.4    72.0  71.2  71.6    50.6  61.0  55.3     5.9  17.3

Table 7.1: Performance scores for segmentation using utterance beginnings and utterance endings on the BR corpus. 'UBb' stands for segmentation using utterance beginnings, and 'UBe' for segmentation using utterance endings. The superscripts indicate the size of the phoneme n-gram used to calculate the relevant statistics. The two reference models, RM and LM, are defined in Section 5.5.

This limits each individual measure's utility. However, as the performance scores in Table 7.1 indicate, the measures are still quite useful as indications of word boundaries. Furthermore, the final aim of this study is to use these measures in combination with each other and with other measures, which will hopefully complement them and compensate for their shortcomings. Similar to the predictability scores, there is a tendency to observe peaks at real boundary positions. For example, in Figure 7.2, the UBe² measure catches two of the boundaries correctly, misses one, and gives no false positives. UBb², on the other hand, finds one of the boundaries correctly, misses two, and gives one false positive.

Table 7.1 presents the performance scores of the peak-based segmentation algorithm, using only the information from utterance beginnings and utterance endings.

                 1      2      3      4      5      6      7      8
UBb  novel      8.1    5.7    3.7    3.4    3.3    3.5    3.6    3.7
     novel>n  100.0   99.0   62.2   24.2    7.7    5.2    4.0    3.8
UBe  novel     11.7    7.8    3.4    2.6    1.9    1.9    1.9    1.9
     novel>n  100.0   99.8   59.4   17.9    6.1    3.6    2.4    2.0

Table 7.2: Percentage of novel words discovered by the peak-based algorithm using UBb and UBe with varying phoneme n-gram lengths. The rows labeled ‘novel’ present the percentage of correctly discovered words that were not previously seen at the utterance boundaries. The rows labeled ‘novel>n’ also count a word as novel if it has been seen before but is longer than the n-gram length.

The algorithm is the same as Algorithm 6.1 (on page 105), except that the UBb and UBe scores are used instead of the predictability scores. The segmentation performance is not impressive. However, all performance scores are consistently better than random, and some are competitive with the language modeling baseline.

Intuitively, the different n-gram lengths capture different aspects of utterance boundaries, and a combination of n-grams of varying lengths is expected to result in better performance than using each n-gram length alone. The effect of combining UBb and UBe scores of varying n-gram lengths will be presented shortly. However, there is another interesting question that is easier to investigate in this simple form: is the method learning words, or phonotactics? In other words, can the method discover previously unseen words by using the regularities that make up word beginnings and word endings? Calculating UBb and UBe scores using a shorter n-gram length is more likely to capture the regularities stemming from phonotactics. On the other hand, longer n-grams cause the method to learn frequent lexical items. Of course, both cases are useful for segmentation, but learning phonotactics allows the method to extract more novel words rather than memorizing individual words.

Table 7.2 presents percentages of correctly identified novel words, for two different definitions of ‘novelty’. For the first definition, we count the correctly identified words that were not seen at the utterance boundaries before. For the second definition, we count the correctly identified words that are either not observed at utterance boundaries, or are longer than the phoneme n-gram size in use. The first definition gives an indication of the algorithm’s success in discovering completely new words. The second definition is also interesting, because it shows another effect of generalization. Even though most of the words counted according to the second definition have been seen before, since they are longer than the n-gram length, there is no way for the model to completely memorize them. As a result, both definitions indicate whether the algorithm is segmenting words by generalizing over how words are built, or by memorizing frequent words or phrases.
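The two counts behind Table 7.2 can be made concrete with a small sketch (the function name and input structures are hypothetical):

    def novelty_rates(discovered, seen_at_edges, n):
        """Rates for the two notions of novelty: `discovered` is an
        iterable of correctly segmented words, `seen_at_edges` the set of
        strings observed at utterance boundaries so far, `n` the phoneme
        n-gram size in use."""
        novel = novel_or_long = total = 0
        for word in discovered:
            total += 1
            if word not in seen_at_edges:
                novel += 1
                novel_or_long += 1
            elif len(word) > n:  # seen before, but too long to memorize as one n-gram
                novel_or_long += 1
        return novel / total, novel_or_long / total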

              boundary           word              lexicon           error
model       P     R     F      P     R     F      P     R     F      Eo    Eu
UBb        71.9  77.3  74.5   57.1  60.0  58.5   23.9  50.6  32.4   11.4  22.7
UBe        73.3  78.6  75.8   57.5  60.4  58.9   22.0  51.1  30.8   10.8  21.4
UM         82.9  84.8  83.8   70.5  71.7  71.1   33.8  66.9  44.9    6.6  15.2
RM         27.4  27.0  27.2   12.6  12.5  12.5    6.0  43.6  10.5   27.1  73.0
LM         84.1  82.7  83.4   72.0  71.2  71.6   50.6  61.0  55.3    5.9  17.3
Fleck (2008) 94.6 73.7 82.9    –     –    70.7    –     –    36.6    –     –

Table 7.3: Performance scores for the combination of utterance boundary measures on the BR corpus. The first two rows present performance scores for the UBb and UBe measures, respectively, with n-grams of length one to three combined using weighted majority voting. The third row, labeled UM, combines both UBb and UBe using the same method and parameters. The two reference models, RM and LM, are defined in Section 5.5, and the last row reports the performance scores for the related model presented by Fleck (2008).

The results presented in Table 7.2 demonstrate the expected contrast between short and long phoneme n-grams used in the calculation of UBb and UBe. If the algorithm is used with short n-gram lengths, it learns more general aspects of phonotactics. On the other hand, if longer phoneme n-grams are used, the method starts memorizing frequent words or phrases.

Besides the expected change in generalization with varying phoneme n-gram length, another interesting observation in Table 7.2 is that the UBe measure generalizes better, especially for shorter n-gram lengths. This is mostly due to the fact that English is primarily a suffixing language, and the morphemes at the ends of words provide a better opportunity for identifying novel words which share previously encountered morphemes. However, this finding is not reflected well in the performance of the UBe-based models presented earlier. For all results presented in Figure 7.1c and Table 7.1, UBe-based models are outranked by the UBb-based models. This conflicts with the expectation that if the ends of words allow better opportunities for generalization, UBe-based segmentation should be more accurate. The reason UBe-based models perform worse than UBb-based models stems from another fact, at least valid for the BR corpus: utterance ends are more varied than utterance beginnings. In the BR corpus, only 579 of 1324 word types appear at the beginning of utterances, while 903 word types appear at the end of utterances.

7.4 Combining measures and cues

The overall aim of the segmentation measures and methods developed here is to be able to combine information from multiple sources. The first question for a segmentation strategy that makes use of utterance boundaries is whether the information from utterance beginnings and ends, as well as the information from varying phoneme n-gram lengths, can be combined for better performance. The second question is whether information from utterance boundaries can be combined with the information obtained using the predictability measures presented in Section 6.2. Both questions can be tested using a combination method similar to the one used for the predictability measures in Sections 6.2.2 and 6.2.3.

Table 7.3 presents the performance scores for three models that combine UBb, UBe and both, respectively. For all three, the simulations are run using the weighted majority voting algorithm (described in Section 6.2.3), combining phoneme n-gram lengths between one and three. For simplicity, UBb and UBe without superscripts denote this combination, i.e., the combination of the respective measure with the phoneme n-gram length varying between one and three. The combination of both UBb and UBe will be called UM, and it will be used as a representative model for the strategy developed in this chapter. Table 7.3 also presents the usual reference results, as well as the performance scores for the model presented by Fleck (2008) for comparison.

Compared to some of the scores presented earlier in Table 7.1, the performance gain from combining varying phoneme n-gram lengths is questionable for the individual measures (UBb and UBe) on some performance scores. Since the individual measures perform better than random segmentation, the low performance of the combined result is likely due to the fact that the voters in these cases are not independent enough. On the other hand, the combination of UBb and UBe yields a clear performance gain. The performance results for the UBb and UBe combination certainly show an improvement over the individual measures, and the word and lexical scores are better than the results reported by Fleck (2008), with only slightly worse boundary scores. It should also be noted that, unlike the segmentation model presented by Fleck (2008), these results are obtained using only a single pass through the corpus, with no ad hoc corrections and/or parameters. The only parameter used in this segmentation method is the maximum phoneme n-gram size (set to three in all instances).

To answer the second question, whether the utterance boundary cue is useful in combination with the predictability cue, Table 7.4 presents the results of combining the reference models for each cue, the UM with the PM. The combined model, PUM, combines all measures from both models for phoneme n-gram sizes one to three using the weighted majority voting algorithm (described in Section 6.2.3). The combination is flat, that is, all measures are considered equal, and no attempt is made to group their results. The usual baseline scores, and the performance scores of the predictability and the utterance boundary-based segmentation algorithms with the same phoneme context range, are repeated for comparison.

Combining the utterance boundary and the predictability measures results in better performance scores than the combinations utilizing each group of measures alone.
Furthermore, the combined model performs better than Fleck (2008) and the reference model LM on most scores.
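The following is a generic weighted-majority sketch of this kind of combination. The exact weight-update rule is the one given in Section 6.2.3; since no gold boundaries are available, the discounting scheme below (voters are penalized for disagreeing with the combined decision) should be read as an assumption, not as the thesis’s algorithm.

    class WeightedMajority:
        """Combine binary boundary voters by weighted vote; discount the
        weight of any voter that disagrees with the joint decision."""

        def __init__(self, n_voters, beta=0.9):
            self.w = [1.0] * n_voters
            self.beta = beta

        def decide(self, votes):
            # votes: one True/False per measure for 'boundary here'
            yes = sum(w for w, v in zip(self.w, votes) if v)
            no = sum(w for w, v in zip(self.w, votes) if not v)
            decision = yes > no
            for i, v in enumerate(votes):
                if v != decision:          # disagreeing voters lose weight
                    self.w[i] *= self.beta
            return decision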

              boundary           word              lexicon           error
model       P     R     F      P     R     F      P     R     F      Eo    Eu
PM         69.6  92.5  79.5   56.9  70.2  62.9   36.7  49.8  42.3   15.3   7.5
UM         82.9  84.8  83.8   70.5  71.7  71.1   33.8  66.9  44.9    6.6  15.2
PUM        81.3  87.1  84.1   69.2  72.7  70.9   36.9  65.8  47.3    7.6  12.9
RM         27.4  27.0  27.2   12.6  12.5  12.5    6.0  43.6  10.5   27.1  73.0
LM         84.1  82.7  83.4   72.0  71.2  71.6   50.6  61.0  55.3    5.9  17.3
Fleck (2008) 94.6 73.7 82.9    –     –    70.7    –     –    36.6    –     –

Table 7.4: Performance and error scores for the combination of the models UM and PM (PUM) on the BR corpus. The first two rows display previously presented results for the models PM and UM. The two reference models, RM and LM, and the performance scores for the related model by Fleck (2008) are also repeated here for ease of comparison.

This comparison may not be fair because the PUM uses information from both predictability and utterance boundaries. However, the comparison is also unfair in the other direction, because the model presented by Fleck (2008) is a batch model which makes multiple passes over the data. A relatively fair comparison is between the final state of the UM and the results reported by Fleck (2008). Figure 7.3 presents the progression of the performance of the UM and the PUM as more input is processed. Figure 7.3 demonstrates that both the UM and the PUM perform better at the later stages of learning, and the combined model’s performance is clearly better than the UM’s performance. Additionally, at the end of the BR corpus, the UM’s performance is substantially better (BF=88.3%, WF=77.7%, LF=72.1%, Eo=7.3%, Eu=6.7%) than the scores presented by Fleck (2008, BF=82.9%, WF=70.7%, and LF=36.6%). Compared to the LM, on the other hand, the lexical scores are still rather low, an issue to which we will return in Chapter 8.

7.5 Summary

This section described another segmentation strategy that does not require knowledge of lexical items, and as a result is suitable for bootstrapping the lexicon. The strategy presented here makes use of utterance boundaries, i.e., learning common beginnings and endings of utterances. Since utterance boundaries are also lexical unit boundaries, this information is useful for lexical segmentation in two ways. First, learning the beginnings and ends of words that appear at utterance boundaries is useful for spotting them later when they appear at utterance-internal positions. Second, and more interestingly, word beginnings and ends show some regularities, and learning these regularities from already existing boundaries helps in discovering novel words that conform to these regularities even if they are encountered at utterance-internal positions.

Figure 7.3: Boundary (BF), word token (WF), and word type (LF) f-scores and oversegmentation (Eo) and undersegmentation (Eu) error rates for the phonotactics model based on utterance boundaries, the UM, and its combination with the previously presented predictability-based model (PUM). Panels: (a) UM performance, (b) UM error, (c) PUM performance, (d) PUM error. The scores are calculated for each 500-utterance block in the BR corpus during the learning process.

This section first provided some evidence, through a statistical analysis of the BR corpus, showing that utterance boundaries provide information useful for spotting word boundaries. Second, an unsupervised segmentation algorithm that combines the information gathered at utterance beginnings and utterance ends has been developed. The performance results obtained using utterance boundaries (Table 7.1 and Table 7.3) are encouraging, and competitive with the other models presented in the literature. More importantly, the segmentation based on utterance boundaries and the predictability-based segmentation strategy described in Section 6.2 seem to complement each other. The combination of the two leads to better performance than each individual cue alone.

The results presented in Table 7.2 showed that the unsupervised algorithm presented in this section not only memorizes words and/or word beginnings and endings, but also gains some knowledge of phonotactics. As a result, it is also useful for discovering words that have not been observed at the utterance boundaries before. This generalization was possible even though the input representation treats phonemes as unrelated symbols. It is natural to expect better generalization if a representation based on phonetic features were used (e.g., as in Christiansen et al., 1998). A richer input representation would be sensitive to phonemic tendencies in the data: for example, a vowel–vowel sequence, e.g., /ae/, is less likely in English than a consonant–vowel sequence, /ba/, even if the bigrams in question have never been seen by the learner before. The investigation of the effects of a richer input representation is a possible direction for future research.

The method described in this chapter only learns phonotactics from utterance boundaries. In principle, more can be learned from utterance boundaries. A relevant cue that may be bootstrapped from utterance boundaries is lexical stress. At least for languages with word-initial or word-final lexical stress, it may be possible to learn these patterns from utterance boundaries. However, preliminary experiments (not reported here) with learning stress from utterance boundaries were not fruitful. This will be discussed further in Section 8.3, where a model of segmentation using lexical stress learned from previously discovered words will be presented.

The segmentation strategies based on predictability and utterance boundaries discussed so far have the advantage that they do not require prior lexical knowledge for discovering words or phrases. This is useful since it provides a way to bootstrap the process of learning lexical items. However, being ignorant about the lexicon is also a disadvantage. This is clearly visible in the difference between their boundary- and lexical-performance scores (e.g., in Table 7.4).
Even though the models discussed so far are good at spotting boundaries and words, they do not show the same proportional success in lexical scores. The next chapter deals with this issue, introducing an explicit lexicon, and using the information in the lexicon to the advantage of an incremental segmentation algorithm.

8 Learning From Past Decisions

Those who do not remember their past are condemned to repeat their mistakes.
George Santayana

We learn best from our mistakes, provided that we are aware of the fact that we made a mistake. The formal studies reviewed in Chapter 3 also confirm this common-sense statement. Strong learnability results in general settings are possible only with negative evidence. In Chapter 3, I argued that for most real-world learning problems, where the distribution of the input data is constrained, negative evidence is not strictly necessary. That is, if we always observe positive examples of what we try to learn, and if we know that they are accurate (to some degree), learning is still possible.

The models of segmentation discussed in this thesis fit in neither of these schemes. The learner does not know whether the segmentation he/she decided on is correct or not. In the real world, mistakes in segmentation would eventually be translated into communication failures or processing inefficiencies, which may provide feedback to the learner. However, the models discussed in this thesis do not have access to this type of feedback. Instead, the models presented so far act upon general guiding principles for detecting word boundaries. In Chapter 6 the principle was ‘predictability within the lexical units is high, predictability between the lexical units is low’. The models in Chapter 7 utilized the fact that utterance boundaries are also word boundaries, and that words share common beginnings and endings. These models learn statistical relationships between phoneme sequences, and they learn to weigh the individual boundary indicators. However, these models do not learn from negative or positive boundary labels in the input.

This chapter presents models that learn from positive examples. Since the input does not contain any labels (the boundaries), we will take earlier boundary decisions as positive examples. The learners in this chapter will use the boundaries decided in the past as a source of information, in order to contribute to decisions in the future. At first sight, this may look circular. However, it works because the data is structured, and we have relatively accurate indications (predictability and utterance boundaries). The type of learning investigated here is closely related to many bootstrapping proposals in the language acquisition literature.


I z D * En i TINElsD&ty u wanttubrINApIntuD6h 9 c *

is there anything else that you want to bring up into the highchair
is there anything else that you want to bring up into the high chair
is there anything else that you want to bring up in to the highchair
is there anything else that you want to bring up in to the high chair
is there any thing else that you want to bring up into the highchair
is there any thing else that you want to bring up into the high chair
is there any thing else that you want to bring up in to the highchair
is there any thing else that you want to bring up in to the high chair
is there N E thing else that you want to bring up into the highchair
is there N E thing else that you want to bring up into the high chair
is there N E thing else that you want to bring up in to the highchair
is there N E thing else that you want to bring up in to the high chair

Figure 8.1: One of the most ambiguous utterances according to the ‘gold standard segmentation’ of the BR corpus. ‘N’ and ‘E’ are letters of the alphabet.

This chapter presents two strategies, or cues, that make use of previously discovered word boundaries. The first model makes use of the frequencies and forms of previously discovered words. Section 8.1 informally discusses the use of already known words for segmentation in detail. Section 8.2 describes and presents results of simulations carried out by a segmentation model that uses information from previously discovered words. The second model, described in Section 8.3, utilizes lexical stress for segmentation. Section 8.4 provides a summary of the findings presented in this chapter.

8.1 Using previously discovered words

In the ideal case where the listener knows all the words in the input, the preferred segmentation of an utterance is one that spans the complete utterance with a non-overlapping sequence of words in the lexicon. For example, assuming that the lexicon contains the words is, that, a, and kitty, and the utterance isthatakitty is heard, then is that a kitty is such a segmentation. For a large number of utterances, the segmentation using the words in the lexicon is not as clear-cut as in this example. The word forms in the lexicon often suggest multiple segmentations, which makes the segmentation problem non-trivial even in this ideal scenario.¹ Figure 8.1 presents all 12 segmentations suggested for one of the most ambiguous utterances in the BR corpus, is there anything else that you want to bring up into the high chair.

Most of the ambiguous segmentations stem from embedded words. For example, the word /&t/ ‘at’ is part of the words /b&t/ ‘bat’ or /h&t/ ‘hat’. Furthermore, some words can be segmented completely into other words. For example, the word /bIgIn/ ‘begin’ can be segmented as /bIg In/ ‘big in’, and /6go/ ‘ago’ can be segmented as /6 go/ ‘a go’.

¹This is true despite the fact that the use of a phonemically transcribed corpus reduces the difficulties that stem from the variability in the speech signal. See Section 4.1 for a discussion of the broader problem, and Section 5.1 for a discussion of the input representation.

As well as the embedded words, there are other cases where utterances can be segmented in multiple ways. For example, /Itsnoz/ can be segmented as /Its noz/ ‘its nose’ or /It snoz/ ‘it snows’. Similarly, the utterance segment /&nd9v/ can be segmented as /&nd 9v/ ‘and I’ve’ or /&n d9v/ ‘Anne dive’. Of the 1320 word types in the BR corpus, 32% are parts of other words, and 15% can be segmented completely into two or more other words.

Even though the lexicon is not enough to solve the segmentation problem completely, its utility is clear. Constraining the boundaries to the locations that allow complete non-overlapping sequences formed by the words in the lexicon allows relatively good segmentation. If we take the word types in the gold standard segmentation of the BR corpus, this strategy assigns 84% of the utterances in the corpus a single correct segmentation. If we pick a segmentation at random for the ambiguous 16%, we can achieve f-scores of 99.0%, 98.4% and 98.8% for boundaries, words and lexical units, respectively. These scores are obtained using only the 1324 word types that are used in the gold standard segmentation of the BR corpus. With increasing lexicon size and longer utterance lengths, the number of ambiguous segmentations is expected to increase, and segmentation performance is likely to go down. Nevertheless, it is still possible to achieve very good segmentation performance if a comprehensive lexicon is available.

While the words in the lexicon are a great help for segmentation if one has a complete and accurate lexicon, the story is different for a learner. The learner neither knows all the words in the input, nor can he/she be as confident about his/her knowledge of words as a competent language user. However, this does not mean that a partial lexicon is useless for segmentation. Even a small repository of words can be useful for learning new words. This will be demonstrated through a simple strategy that uses previously discovered lexical units to extract novel lexical units from the input utterance. This strategy segments the utterance at locations that start and end at known lexical units. The sequences of phonemes between the boundaries but not in the lexicon are identified as new words and added to the lexicon. An informal demonstration of this strategy is given in Figure 8.2.

The example in Figure 8.2 starts with an empty lexicon. Since there is no matching part in the current lexicon, the first utterance cannot be segmented; it is analyzed as a single word and inserted into the lexicon. This is consistent with the findings that adults and children tend to take whole utterances as lexical units when they cannot segment them (Bannard and Matthews, 2008; Dahan and Brent, 1999). In the second step, the segment akitty matches an already known lexical unit. The input is segmented as thats akitty, and the newly discovered segment, thats, is added to the lexicon. The utterance in step three, kitty, cannot be segmented, and is inserted into the lexicon as it is. In step four, we use the lexical item kitty to segment the utterance as that kitty. The new word that is inserted into the lexicon, and we increase the frequency of the known word kitty. In the last step, using the known words is, that, a and kitty, we segment the input utterance as is that a kitty.

This segmentation strategy is admittedly naive, and it fails on real-world input. However, the strategy and its weaknesses provide insights into many successful computational models of segmentation, and they deserve some more elaboration here.
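As an aside, the ambiguity illustrated in Figure 8.1 is easy to reproduce. The following sketch (hypothetical names; memoization omitted for brevity) enumerates every covering of an utterance by known words:

    def segmentations(utt, lexicon):
        """Enumerate every covering of `utt` by a non-overlapping sequence
        of known words (exponential in the worst case)."""
        if not utt:
            return [[]]
        results = []
        for end in range(1, len(utt) + 1):
            word = utt[:end]
            if word in lexicon:
                for rest in segmentations(utt[end:], lexicon):
                    results.append([word] + rest)
        return results

    # segmentations("isthatakitty", {"is", "that", "a", "kitty", "akitty"})
    # -> [['is', 'that', 'a', 'kitty'], ['is', 'that', 'akitty']]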

step  input         output           lexicon
0                                    {}
1     akitty        akitty           {akitty(1)}
2     thatsakitty   thats akitty     {akitty(2), thats(1)}
3     kitty         kitty            {akitty(2), thats(1), kitty(1)}
4     thatkitty     that kitty       {akitty(2), thats(1), kitty(2), that(1)}
5     isthatakitty  is that a kitty  {akitty(2), thats(1), kitty(3), that(2), a(1), is(1)}

Figure 8.2: An informal example of segmentation using already discovered lexical units. The numbers in parentheses are the number of times the lexical unit has been used.

The first problem we encounter is related to the reliability of the lexical units in the lexicon. For example, in step five of the example presented in Figure 8.2, the decision to segment the utterance as is that a kitty is arbitrary. The lexicon equally supports the segmentation is that akitty. We can remedy this problem by preferring words formed by shorter sequences of phonemes (e.g., a and kitty) over longer sequences (e.g., akitty). However, this leads us to a second problem. In this example, after discovering the word a, the naive segmentation strategy has no reason not to segment that as th a t. On the one hand, we want to favor eager segmentation, since otherwise the lexicon would contain whole utterances or phrases like akitty in this example. On the other hand, we do not want to oversegment, as in segmenting that as th a t. These two apparently conflicting problems have a common solution: we need to define what makes a sequence of phonemes a good lexical unit.
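A minimal implementation of the naive strategy of Figure 8.2 might look as follows (a sketch; `lexicon` maps word to frequency, matching the counts in the figure). Note that the longest-match tie-break used here resolves step five as is that akitty, whereas the figure resolves it as is that a kitty; as just noted, the lexicon supports both, and the choice is arbitrary.

    def segment_utterance(utt, lexicon):
        """Split the utterance wherever a known word matches (longest
        match first); stretches that match nothing become new words, so
        an unsegmentable utterance is stored whole."""
        words, pending, i = [], "", 0
        while i < len(utt):
            match = max((w for w in lexicon if utt.startswith(w, i)),
                        key=len, default=None)
            if match:
                if pending:                 # an unknown stretch ends here
                    words.append(pending)
                    pending = ""
                words.append(match)
                i += len(match)
            else:
                pending += utt[i]
                i += 1
        if pending:
            words.append(pending)
        for w in words:                     # update usage counts
            lexicon[w] = lexicon.get(w, 0) + 1
        return words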

8.1.1 What makes a good word?

There are two properties of sequences of phonemes that make them good candidate words. First, the usage of the sequence in the language, and in communicative context, is an important property of words. There are many clues in the usage of a sequence that can indicate its ‘wordhood’. For example, hearing a certain sequence consistently in the presence of an object is a clue that this sequence may be the name of the object. In this section, we will consider only one usage-related property of a word that is directly observable from a transcribed corpus: its frequency. Returning to the example above, the decision to segment the utterance akitty as a kitty can be based on the observation that, in comparison with akitty, both a and kitty are more frequent in the input.

The second property that helps in identifying a sequence of phonemes as a word is its form. In natural languages, words are typically formed according to certain regularities. Observing these regularities, the phonotactics, in already known words helps in identifying new words. For example, one can reject the segmentation th a t in the above example by observing that words typically contain a vowel, and the sequences th and t are not good candidates because they do not contain one. As already introduced in Chapter 7, the models in this thesis use a rather restricted model of phonotactics that is based on regular phoneme sequences at the beginnings and ends of words. The phonotactics component in Chapter 7 used utterance boundaries to extract information about word boundaries. Having an inventory of lexical items makes these generalizations more direct.

8.1.2 Related work

Most of the state-of-the-art computational models of segmentation use the (relative) frequency of a sequence of phonemes as the main indication that the sequence is a good word candidate. The forms of the words are typically used only if the sequence in question is not already in the lexicon. For example, the segmentation models that are based on the language modeling strategy (e.g., Brent, 1999a; Goldwater et al., 2009; Venkataraman, 2001) use a word’s relative frequency as its probability if the word is already known. Hence, they favor frequently occurring sequences of phonemes as words. In these models, the form of the candidate word is evaluated only if the candidate word is not already in the lexicon. In most of these models, the word-form component serves to prevent oversegmentation by assigning lower probabilities to combinations of short words in comparison to using long sequences of phonemes as a single word (see Section 5.5 for a discussion of this type of modeling). As a result, even though some improvements are possible by modeling phonotactics more carefully (e.g., Blanchard et al., 2010), it is difficult to get much improvement in this setting. For example, as Goldwater et al. (2009) demonstrate, even an idealized phonotactic component that knows the words in the corpus does not improve the performance substantially compared to the assumption that all phonemes are uniformly distributed.

A segmentation model that primarily uses the frequency of previously discovered words, but does not strictly fit into the language modeling framework, was presented by Monaghan and Christiansen (2010). This model has many similarities to the model I will describe in Section 8.2. The segmentation algorithm in this model follows a strategy similar to the one informally described above (Figure 8.2). It searches for sequences of phonemes that are already in the lexicon, starting from the beginning of the utterance. The words in the lexicon are matched against the initial substring starting from the current location based on their frequency. When there are multiple alternative segmentations, this gives high-frequency sequences an advantage. When a match is found, the part of the utterance that did not match any lexical item is considered a new word, but only if its first and last two phonemes have been observed at the beginning or at the end of previously accepted words, respectively.

There are studies that model either the usage or the form of the words more carefully. For example, Goldwater et al. (2009) present a model that takes frequencies of higher-level word n-grams into account. Attempts at modeling phonotactics more carefully include using higher-level phoneme n-grams (Blanchard et al., 2010), or taking into account common beginnings and ends of words explicitly (Monaghan and Christiansen, 2010). However, in almost all current computational models of segmentation, frequency is the main measure of a word’s usage, and phonotactics is mainly used as a way to prevent oversegmentation.

8.2 Using known words for segmentation: a computational model

It is clear that the words we already know help in discovering new words. However, the problem is not trivial, especially in the absence of a complete and reliable lexicon. The discussion so far in this chapter identified two properties of words that are useful for deciding whether a string of phonemes is a good word or not. First, a good word is used over and over again in different contexts. Second, word formation follows certain regularities, and does not result in arbitrary sequences. Both properties are exploited to varying degrees by the computational models in the literature, and have proven to be useful.

The segmentation strategies we discussed in Chapter 6 and Chapter 7 already make use of these properties to some extent. The predictability strategy implicitly makes use of the fact that words consist of frequent phoneme sequences by requiring word-internal phoneme sequences to be highly associated with each other. The requirement that phoneme sequences across boundaries be unpredictable is equivalent to the idea that a word needs to occur in many contexts. However, in both cases, we do not make use of the boundaries, and hence the words, discovered so far.

Except for the model by Monaghan and Christiansen (2010), the related models in the literature reviewed above follow the language modeling strategy (discussed in detail in Section 5.5). In these models, the probability of a segmentation is the product of the probabilities estimated from the relative frequencies of the words used by that segmentation. The problem then becomes searching for the best segmentation in an efficient way. The segmentation models presented in this thesis follow an eager ‘boundary guessing’ strategy, evaluating all boundary candidates locally without reference to other boundaries. The model I will describe in this section also follows this strategy. In brief, the model assigns higher boundary scores to positions that are preceded and followed by previously detected words.

8.2.1 The computational model

The computational model, which I will call the word-based model (WM for short), uses a segmentation algorithm similar to the informally described example in Figure 8.2. It tries to segment given utterances using already known words, and when segmentation is not possible, it inserts the complete utterance into the lexicon.

To decide whether an input utterance should be segmented at a certain position, the model tries to maximize the use of strings that are word-like on both sides. As discussed above, the properties we seek in word-like units are that they are frequent, and that they share some features with already known words. In our usual majority voting framework, these form two measures.

isthatakitty
WFe  000001000002
WFb  001000220000
WPe  010001000102
WPb  002010110221

Figure 8.3: Lexicon measures for the example utterance is that a kitty. The measures are calculated assuming that the lexicon contains the words {akitty(2), thats(1), kitty(2), that(1)} prior to the input utterance. The numbers in parentheses are the frequencies of the words.

The first measure is word frequency (WF), simply the frequencies of the already known words beginning or ending at the position in question. I will call the second measure word phonotactics (WP). The WP is based on the number of times the phoneme sequences surrounding the boundary are found at the beginnings or ends of previously discovered words. It is essentially the same measure as discussed in Chapter 7, except that it is calculated using already known word types instead of all utterance boundaries. Similar to the other asymmetric measures discussed previously, we have two flavors of each measure: one indicating the existence of words to the right of the boundary candidate (words beginning at the boundary), and the other indicating the existence of words to the left of the boundary candidate (words ending at the boundary). When we need to distinguish them, the measures indicating words ending at the boundary will be suffixed with an ‘e’ (e.g., WFe), and the measures indicating words beginning at the boundary will be suffixed with a ‘b’ (e.g., WFb).

Figure 8.3 presents the measures calculated for the fifth step of the example in Figure 8.2. For the sake of demonstration, this example uses the orthographic forms of the words. The experiments reported below are, as before, run on a phonemically transcribed corpus of child-directed speech. For a demonstration of the calculation of the measures, consider the position after the segment isthata. There is no matching word that ends at that position (a is not a known word), hence WFe is zero. WFb, on the other hand, is two, since the frequency of the only word that begins at that position, kitty, is two. The phonotactics scores are zero and one, respectively, since there is no word in the lexicon that ends with an ‘a’, and there is only one word, kitty, in the lexicon that starts with a ‘k’. As this simple example demonstrates, there is a tendency for higher values where we would expect a boundary.

Before presenting results on real child-directed speech, a more precise description of these measures is in order. The WF measures are the frequencies of the words ending and beginning at the position to be evaluated. When there are multiple words that begin or end at the position, we take the sum of the frequencies of the words. There are other possible ways to combine the frequencies of multiple words. For example, as in the segmentation model described by Monaghan and Christiansen (2010), one can use the frequency of the most frequent word. Another possibility is to use the average frequency of the words. These two alternatives perform similarly, but slightly worse than the summation method. The reason summation works better in this setting has to do with the undersegmentation caused by inserting whole utterances or phrases into the lexicon. For example, with a lexicon that contains the lexical units a and akitty, the summation method reflects the role of a in both lexical units, while the other methods discount one completely or average out the effect of both. I will only report results obtained using the summation method.
The WP measure does not require much elaboration either, since it has already been discussed at length in Chapter 7. The WP is either P(l|wb) or P(r|wb) for phoneme sequences l and r, where wb denotes ‘word boundary’. The major difference here is that these conditional probabilities are calculated using the boundaries of word types, while in Chapter 7 they were calculated on partial word tokens observed at the utterance boundaries. As before, we can vary the length of the phoneme n-gram used for calculating this measure. For the results reported below, we use the combination of n-gram sizes one to three.
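A sketch of the two measures follows (hypothetical names). The WF values are the frequency sums just described; the WP values are raw counts in the style of Figure 8.3, and dividing them by the number of word types gives the relative-frequency estimates of P(l|wb) and P(r|wb).

    def wf_scores(utt, lexicon):
        """Word-frequency measures at every candidate position: WFe sums
        the frequencies of known words ending there, WFb of known words
        beginning there.  `lexicon` maps word -> frequency."""
        wfe = [0] * (len(utt) + 1)
        wfb = [0] * (len(utt) + 1)
        for word, freq in lexicon.items():
            start = utt.find(word)
            while start != -1:
                wfb[start] += freq
                wfe[start + len(word)] += freq
                start = utt.find(word, start + 1)
        return wfe, wfb

    def wp_counts(utt, lexicon, n=1):
        """Word-phonotactics counts: how many known word types end with
        the n phonemes before a position (WPe) or begin with the n
        phonemes after it (WPb); shorter sequences are used where the
        utterance edges leave fewer than n phonemes."""
        types = list(lexicon)
        wpe, wpb = [], []
        for pos in range(len(utt) + 1):
            l, r = utt[max(0, pos - n):pos], utt[pos:pos + n]
            wpe.append(sum(1 for w in types if w.endswith(l)) if l else 0)
            wpb.append(sum(1 for w in types if w.startswith(r)) if r else 0)
        return wpe, wpb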

8.2.2 Performance

Given the set of measures described above, the peak-based segmentation strategy (described in Section 6.2.1) for detecting boundaries, and the weighted majority voting algorithm (described in Section 6.2.3) to combine decisions, we are ready to present the performance scores. Table 8.1 presents the usual performance scores for the individual measures, and the scores obtained by combining them using the peak criterion and the weighted majority voting algorithm on the BR corpus. The row before the baseline performance scores, marked WM (for ‘word model’), is the combination of all measures, and will be used as the representative model of the strategy presented in this section. All models perform better than random, and combinations consistently give better results.

The reason for presenting the models that make use of only one side of the boundary (the models with subscripts) in Table 8.1 is to give a sense of how a strictly left-to-right model performs. Some models in the literature, especially the connectionist models, use only past information. At first sight, this may seem a better choice for modeling human language segmentation. However, for the rest of this chapter, the combined model will be used, on the grounds that people make use of information from both sides while processing utterances (see Section 6.1.7 for a discussion).

I started this chapter by stating that the strategies presented here learn from previously discovered words. However, the scores presented in Table 8.1 are based only on the measures introduced here. It may not be obvious, at first sight, where the words needed for using these measures come from. The answer is implicit in the general strategy used throughout this thesis: when the learner cannot segment an utterance, it takes the whole utterance as a word. Hence, the models listed in Table 8.1 are bootstrapped from utterance boundaries. However, having previously introduced other strategies, we are not limited to utterance boundaries alone. If we combine the model with the models introduced previously, we expect an improvement, and the performance scores presented in Table 8.2 confirm this expectation.

              boundary           word              lexicon           error
model       P     R     F      P     R     F      P     R     F      Eo    Eu
WFe        48.6  60.6  53.9   32.2  37.8  34.7   14.4  38.8  21.0   24.2  39.4
WFb        78.4  67.7  72.7   61.0  55.1  57.9   17.4  48.0  25.5    7.0  32.3
WF         77.5  71.3  74.3   60.6  57.2  58.9   18.3  47.7  26.4    7.8  28.7
WPe        65.3  67.6  66.4   49.5  50.8  50.1   18.7  47.7  26.8   13.6  32.4
WPb        76.2  70.3  73.1   60.3  57.0  58.6   18.9  51.2  27.6    8.3  29.7
WP         75.2  80.3  77.7   61.4  64.3  62.8   25.4  55.6  34.9   10.0  19.7
WMe        62.7  73.8  67.8   45.2  50.8  47.8   19.5  46.4  27.5   16.6  26.2
WMb        76.4  73.5  74.9   61.0  59.3  60.1   19.0  47.5  27.1    8.6  26.5
WM         77.4  86.0  81.5   65.1  70.2  67.6   30.6  57.6  40.0    9.5  14.0
RM         27.4  27.0  27.2   12.6  12.5  12.5    6.0  43.6  10.5   27.1  73.0
LM         84.1  82.7  83.4   72.0  71.2  71.6   50.6  61.0  55.3    5.9  17.3

Table 8.1: Performance scores for models that use already known words. WF stands for word frequency, WP for word phonotactics, and WM is the combination of both. The models with subscript e make use only of words that end at the boundary, and the models with subscript b only of words that begin at the boundary. The two reference models, RM and LM, are the same as in Table 6.4.

The upper block in Table 8.2 repeats the results presented in Chapter 6 and Chapter 7. The middle row is the combination of all strategies presented so far, followed by the repeated baseline results for comparison. The combination provides a consistent improvement on all performance measures. The combined model (PUWM) performs better than the reference LM model on all scores except boundary precision and lexical precision, and makes slightly more oversegmentation mistakes.

As before, the results in Table 8.1 and Table 8.2 are calculated over the complete corpus. To demonstrate the development of the learner with increasing input, as well as to give an indication of its final state, Figure 8.4a presents the performance scores for the word-based model (WM), and Figure 8.4c presents the performance of the combined model (PUWM) for each successive 500 utterances. Towards the end of the corpus, the f-score measures BF, WF and LF for the word-based model are over 80%, 70% and 60%, respectively. The oversegmentation error is close to 10%, and the undersegmentation error is close to 8%. The combined model, on the other hand, achieves over 90.9%, 81.9% and 75.6% for BF, WF and LF, respectively; its oversegmentation error drops to 7.2%, and its undersegmentation error to 2.0%. Like the overall scores, these scores are an improvement over the combined model (PUM) presented in Chapter 7.

              boundary           word              lexicon           error
model       P     R     F      P     R     F      P     R     F      Eo    Eu
WM         77.4  86.0  81.5   65.1  70.2  67.6   30.6  57.6  40.0    9.5  14.0
PM         87.9  77.4  82.3   74.2  67.9  70.9   28.0  65.6  39.3    4.0  22.6
UM         82.9  84.8  83.8   70.6  71.7  71.1   33.7  67.0  44.9    6.6  15.2
PUM        82.6  90.7  86.5   72.4  77.4  74.8   42.8  65.3  51.7    7.2   9.3
PUWM       83.7  91.2  87.3   74.1  78.8  76.4   43.9  67.7  53.2    6.7   8.8
RM         27.4  27.0  27.2   12.6  12.5  12.5    6.0  43.6  10.5   27.1  73.0
LM         84.1  82.7  83.4   72.0  71.2  71.6   50.6  61.0  55.3    5.9  17.3

Table 8.2: Performance scores for the combination of the WM with the models discussed previously: predictability (PM) in Chapter 6 and utterance boundaries (UM) in Chapter 7. PUM denotes the combination of PM and UM, and PUWM is the combination of all cues discussed so far.

8.3 Making use of lexical stress

As discussed in Section 4.2.2, lexical stress is one of the cues for segmentation that is well supported by psycholinguistic research (e.g., Cutler and Butterfield, 1992; Jusczyk, 1999; Jusczyk et al., 1999b). Lexical stress is used in many languages for marking the prominent syllable in a word. As a result, in combination with regularities regarding the location of the stress, it is useful for deciding where word boundaries are. However, as is also noted by Christiansen et al. (1998), what we perceive as lexical stress is a combination of multiple acoustic and physical cues, including amplitude, duration, segmental quality and pitch contours. Furthermore, the features that correlate with lexical stress have different functions in different languages. For example, in tone languages (such as Chinese), the difference in pitch indicates meaning differences rather than marking the prominent syllable in the word. Hence, the notion of lexical stress itself has to be learned. Nevertheless, once the learner is aware of the notion of stress, it can be useful for segmenting words.

Despite the prominence of stress as a cue for segmentation, there are relatively few computational studies that incorporate this cue. There are a number of reasons for neglecting stress in computational models of segmentation. First, there is no natural way of including the stress cue in the most popular and successful segmentation strategy in the computational modeling literature, the language modeling strategy. Second, most computational studies use phonemes as the basic unit of the input. Since stress is strongly related to the syllable as a unit, it is generally difficult, or unnatural, to mark individual phonemes for stress. Last, and most importantly, we currently do not have child-directed speech corpora that mark stress realistically and reliably. As a result, computational methods that study the use of stress for segmentation tend to make relatively unnatural assumptions regarding stress assignment.

Figure 8.4: Boundary (BF), word token (WF), and word type (LF) f-scores and oversegmentation (Eo) and undersegmentation (Eu) error rates for the word-based segmentation model, the WM, and its combination with the previous strategies, the PUWM. Panels: (a) WM performance, (b) WM error, (c) PUWM performance, (d) PUWM error. The scores are calculated for each 500-utterance block in the BR corpus during the learning process.

8.3.1 Related work

A well-known computational model of segmentation that does incorporate stress as a cue was presented by Christiansen, Allen and Seidenberg (1998). This model is a multiple-cue integration model using a connectionist architecture, more specifically a simple recurrent network (SRN, see Section 2.2.2 for a description). The model is based on an earlier model presented by Allen and Christiansen (1996), and is used with modifications in later models (e.g., Christiansen, Conway and Curtin, 2005; Rytting, Brew and Fosler-Lussier, 2010). In the Christiansen et al. (1998) study, the input to the SRN was a set of 11 phonological features for each phoneme; an indication of whether the phoneme precedes an utterance boundary or not; and two additional features representing whether the phoneme belongs to a syllable with primary or secondary stress. Christiansen et al. (1998) tested the model on the Korman corpus (Korman, 1984), assigning canonical stress patterns found in the MRC psycholinguistic database. Since the MRC database does not mark stress on monosyllabic words, they marked all monosyllabic words as having primary stress. The overall performance of the model is rather low, 71% BF and 43% WF after an initial training period (the detailed scores are presented in Table 5.4 in comparison with the other segmentation studies in the literature). However, this study showed that adding stress as a cue improves the performance of the SRN. Crucially, similar to the use of stress in this section, Christiansen et al. (1998) emphasize that a learner may learn how to use lexical stress for segmentation from the boundaries found by the other cues.

Swingley (2005) presents a word discovery procedure based on mutual information and frequency. In this work, lexical stress is not used as a criterion for word discovery. However, a careful analysis of the stress patterns of the bisyllabic words found by the discovery procedure was presented. The results, in summary, show that the preference for the trochaic (strong–weak) stress pattern in bisyllabic words may emerge from a word-clustering method based on mutual information and frequency. An interesting finding is that, to learn the trochaic bias, one needs to pay attention to the stress pattern in the lexicon rather than the stress pattern found in the corpus. For English, Swingley (2005) found that only 14.2% of the bisyllabic word tokens (counted in the corpus) were trochaic, while 19.7% of them were iambic (weak–strong). Counting the pattern over word types (in the lexicon) reverses this balance: 21.2% trochaic, and 6.4% iambic. In both cases, the dominant stress pattern was strong–strong. Swingley also used the Korman corpus, but with a more careful stress assignment. Rather than assuming that all monosyllabic words have primary stress, they are assigned weak or strong stress depending on their primary use. A word is still assigned a single stress pattern for all of its occurrences in the corpus. However, this method of marking stress certainly provides an improvement over the stress marking of Christiansen et al. (1998).

In another segmentation model that uses stress, Yang (2004)² reports rather good performance using a rule-based system. The model essentially segments utterances before strong syllables. If there are one or more weak syllables between two strong syllables, it either inserts a boundary randomly (the random model), or ignores this part of the utterance (the agnostic model).
The words ignored in multi-word utterances are learned later if they are observed as single utterances. Gambell and Yang (2006) report 95.9% word precision and 93.4% word recall for their random model, and 85.9% precision and 89.9% recall for their agnostic model, on child-directed utterances from the Brown (1973) corpus in CHILDES.³ The stress information is obtained from the CMU pronunciation dictionary.
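A sketch of the rule just described, based only on the description above and not on the authors’ code (the input representation and function name are assumptions):

    import random

    def stress_segment(syllables, mode="agnostic"):
        """Posit a word boundary before every strong syllable.  Input is
        a list of (syllable, is_strong) pairs.  In the 'random' variant,
        one extra boundary is inserted at a random point inside a run of
        weak syllables; the 'agnostic' variant leaves the run alone."""
        chunks, current = [], []
        for syl, strong in syllables:
            if strong and current:      # a strong syllable opens a new word
                chunks.append(current)
                current = []
            current.append(syl)
        if current:
            chunks.append(current)
        if mode == "random":
            out = []
            for chunk in chunks:
                if len(chunk) > 2:      # strong head plus two or more weaks
                    cut = random.randint(2, len(chunk) - 1)
                    out += [chunk[:cut], chunk[cut:]]
                else:
                    out.append(chunk)
            chunks = out
        return ["".join(chunk) for chunk in chunks]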

²An extended version is presented by Gambell and Yang (2006, unpublished manuscript).
³Not to be confused with the well-known Brown corpus of written English (Francis and Kucera, 1979).

Word Form       Stress    Tokens   Types
Monosyllabic    2          28566     812
Bisyllabic      20          3437     417
                22           516       8
                02           347      52
                00            38       6
Trisyllabic     200          297      53
                020          149      20
                202           12       4
                102            4       2
Quadrisyllabic  2000           6       3
                2010           3       1
                0200           2       2
Sum             12 forms   33377    1380

Table 8.3: The stress distribution in the BR corpus. As in the MRC database, the stress patterns are coded so that ‘2’ indicates primary stress, ‘1’ indicates secondary stress, and ‘0’ indicates no stress.
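The token/type contrast that Swingley (2005) points to, and that Table 8.3 makes countable, can be tallied with a few lines (the input format is a hypothetical one):

    from collections import Counter

    def bisyllabic_stress(tokens):
        """Tally bisyllabic stress patterns by token and by word type.
        `tokens` is a list of (word, pattern) pairs, with patterns coded
        as in Table 8.3; '20' is trochaic, '02' iambic."""
        by_token = Counter(p for _, p in tokens if len(p) == 2)
        by_type = Counter(p for _, p in set(tokens) if len(p) == 2)
        return by_token, by_type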

Even though the results look impressive, this strategy makes a number of assumptions that are difficult to justify. First, it assumes that word boundaries always correspond to syllable boundaries (see Section 5.1 for a discussion of problems with this assumption). Second, since stress is neither universally available in all languages, nor is its location the same in all languages with lexical stress, it is not clear how children might come to decide that the onset of a strong syllable is the position at which to segment words.

8.3.2 An analysis of stress patterns in the BR corpus

Throughout this thesis the segmentation simulations are tested on the BR corpus, which has become the de facto standard for testing computational models of segmentation in the literature. This allows direct comparison of the models with each other and with the other models in the literature. Unfortunately, none of the previous studies that use stress for segmentation were tested on the BR corpus.

For the results reported in this section, the BR corpus was stress-marked semi-automatically following the procedure of Christiansen et al. (1998). The stress assignment is done according to the stress patterns of the MRC psycholinguistic database. All single-syllable words are coded as having primary stress, and the words that were not found, or did not have a stress assignment, in the MRC database were annotated manually. Further details on the annotation process can be found in Appendix A.

            Boundary    Word internal
Pattern                 Syllable   Phoneme
00              50          349      5830
01               0            3         3
02            2080          518       518
10               0            7         7
11               0            0         7
12               0            0         0
20             382         3911      3911
21               4            0         0
22           21071          516     52156
sum          23587         5295     62432

Table 8.4: Stress patterns at boundaries and word-internal positions for both phoneme- and syllable-transitions.

Table 8.3 presents the stress patterns of words in the resulting corpus. Although the corpora are different, these figures are similar to the ones reported by Christiansen et al. (1998) for the Korman corpus. The first thing to note is that the majority of the words are monosyllabic: 85.6% of the word tokens and 58.8% of the word types are monosyllabic. This means one can get 85.6% of the word tokens correct by inserting boundaries before and after every syllable. Furthermore, according to this stress assignment, 98.4% of the words start with a strong syllable.⁴

Another way to look at the stress data is to analyze the distribution of adjacent stress patterns over boundaries and word-internal locations. This gives an indication of the stress transitions that signal possible word boundaries. Table 8.4 presents this distribution for all transitions between the stress levels specified in the MRC database. For example, the last row of this table indicates that the stress pattern 22 (primary–primary) straddles word boundaries 21,071 times. Within words, this pattern occurs at syllable boundaries only 516 times, and it occurs 52,156 times at phoneme transitions. Since stress levels change only at syllable boundaries, the values for syllables and phonemes are the same whenever there is a transition between stress levels.

As expected, the majority of the transitions (or lack thereof) are from primary stress to primary stress. For boundary detection, the most promising pattern is the weak–strong transition, which is expected because of the trochaic bias in English. A trivial segmentation algorithm that inserts a boundary between all phoneme pairs with this pattern would guess 2080 boundaries correctly, which would give 80.0% precision, but it would cover only 8.8% of the boundaries. On the other hand, it is clear that inserting a boundary is not a good idea when the stress transition is strong–weak.

⁴This further explains the success of the rule-based model of Gambell and Yang (2006).

Inserting a boundary between two phonemes with this pattern would discover 382 boundaries correctly, with a precision of 8.9% and a recall of 1.6%. Since these transitions only occur at syllable boundaries, for these two cases the choice between the syllable and the phoneme as the basic unit does not make much difference. To achieve good segmentation performance from stress patterns, the key factor is the strong–strong transitions, which are the most frequent. Unfortunately, exploiting them is only possible under the assumption that word boundaries can occur only at syllable boundaries. If we insert a boundary at every strong–strong transition, we get 97.6% precision and 89.3% recall with syllables. However, the same decision results in a mediocre precision of 28.7% if we do not assume that syllable boundaries are known (note that, since the number of boundaries does not change with the change of units, recall is the same for both syllables and phonemes).

The analysis presented above does indicate that lexical stress can be useful in certain ways. However, there are a number of problems with its use in the type of models presented here. The first problem is the assumption that word boundaries coincide with syllable boundaries. Note that this is not the same as assuming knowledge of the syllable, which is well known to be a salient unit of language processing. The problem stems from the fact that, in fluent speech, syllables can often straddle word boundaries (see Section 5.1 for a discussion). Second, the use of stress to mark a prominent syllable in a word is not universal. And last, a more practical problem is that corpora with realistic stress marking are currently not available.
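As a check on the arithmetic in this analysis, the precision and recall figures above follow directly from the counts in Table 8.4:

    def precision_recall(hits, false_alarms, total_boundaries):
        return hits / (hits + false_alarms), hits / total_boundaries

    # weak-strong ('02') between phonemes: 2080 at boundaries vs 518 word-internal
    print(precision_recall(2080, 518, 23587))     # ~ (0.80, 0.09)
    # strong-weak ('20'): 382 vs 3911
    print(precision_recall(382, 3911, 23587))     # ~ (0.09, 0.02)
    # strong-strong ('22'), counting over syllables: 21071 vs 516
    print(precision_recall(21071, 516, 23587))    # ~ (0.98, 0.89)
    # strong-strong over phonemes: 21071 vs 52156; precision drops to ~0.29
    print(precision_recall(21071, 52156, 23587))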

8.3.3 Segmentation using lexical stress

Even though the analysis in Section 8.3.2 is somewhat discouraging for the segmentation strategy followed throughout this thesis, this section presents results from a stress-based segmentation model following that same strategy.

As briefly discussed in Chapter 7, a possible strategy for using stress for segmentation is to use the stress patterns at the utterance boundaries. This method does not provide much help for the learner because, as presented in Table 8.3, most of the syllables in the corpus have primary stress. As a result, the beginnings and ends of the utterances are also dominated by primary stress. In the BR corpus, 98.9% of the utterances begin with, and 82.5% of the utterances end with, strong syllables. Even though this indicates that it is more likely for words to end in weak syllables than to begin with them, it is a big leap of faith for a learner to choose to segment after weak syllables given that they occur only 17.5% of the time at the ends of utterances. As a result, if we train any learner on utterance boundaries, the likely outcome is not segmenting at all. However, as Swingley (2005) also pointed out, if we take the word types instead of the word tokens, we have a better chance of generalization. If we count the words that start with strong syllables in the gold standard lexicon, the result is similar to utterance beginnings: 94.1%. However, if we count the stress at the ends of the words, strong syllables are the last syllable of only 36.5% of the word types.

                 boundary             word              lexicon           error
model         P     R     F       P     R     F      P     R     F     Eo    Eu
SM         78.2   8.2  14.8    26.5   9.7  14.2    8.2  38.7  13.5    0.9  92.8
PUWM       83.7  91.2  87.3    74.1  78.8  76.4   43.9  67.7  53.3    6.7   8.8
PUWSM      92.8  75.7  83.4    78.3  68.1  72.9   26.8  62.7  37.5    2.2  24.3
RM         27.4  27.0  27.2    12.6  12.5  12.5    6.0  43.6  10.5   27.1  73.0
LM         84.1  82.7  83.4    72.0  71.2  71.6   50.6  61.0  55.3    5.9  17.3

Table 8.5: The performance measures for the stress-based segmentation model (SM). The third row (PUWSM) is the combination of the stress cue with the previously presented models. As well as the usual reference models, the performance scores for the combination of the previous models without stress (PUWM) are provided for comparison.

However, if we count the stress at the ends of the words, strong syllables are the last syllable of only 36.5% of the word types. The model presented in this section, the stress-based model, uses a strategy similar to the phonotactics models described before, except that instead of collecting statistics about phoneme n-grams, it collects statistics about the stress assignment of each phoneme. As before, the measures used are the probability of observing a certain stress pattern l at the end of words, P(l|wb), and the probability of observing a certain stress pattern r at the beginning of words, P(r|wb). The measures can be calculated using stress patterns l and r of variable length; for the results reported below, both l and r are varied between one and three. Using the peak-based boundary decision (Section 6.2.1) and the weighted majority voting algorithm (Section 6.2.3), Table 8.5 presents performance scores for the stress-based algorithm and its combination with the previously presented models. The row labeled 'SM' (short for 'stress-based model') presents the scores for calculating stress statistics from already discovered word types. As with the results presented previously in Table 8.1, the discovery of words is bootstrapped by utterance boundaries. As expected, the performance of the stress-based model alone is not impressive. However, even though it heavily undersegments, it makes very few oversegmentation errors. This is also reflected in the performance scores: high precision and low recall. Note that the low error rate is not only due to conservative boundary decisions. The stress-based model indeed makes very few positive boundary decisions; however, a random model constrained to insert the same number of boundaries (leading to 92.8% undersegmentation) is expected to make 7.2% oversegmentation errors, whereas the oversegmentation errors the stress-based model makes amount to less than one percent. The findings are also in line with the analysis provided in Section 8.3.2. As expected, the model learns to segment at weak–strong transitions, which is precise; however, since the majority of the stress transitions are strong–strong, this covers a rather small portion of the boundaries. As shown in Table 8.5, combining the stress-based model with the others (the row labeled 'PUWSM') increases the performance over the stress-based model alone. However, the contribution of the stress-based model to the performance of the previously discussed models (repeated in the row labeled 'PUWM') does not clearly pay off. Combining the stress-based model with the others adversely affects the performance scores calculated over the complete corpus, except boundary precision (BP) and Eo. However, if we plot the performance change as more input is provided, the benefit of the stress cue becomes visible. Figure 8.5 presents the changes of BF, WF and LF for the stress-based model alone (SM) and for its combination with the other models (PUWSM). For the last 290 utterances of the BR corpus, BF, WF and LF reach 91.4%, 83.9% and 78.8%, respectively.
The oversegmentation error rate drops to 4.5%, and the undersegmentation error rate to 6.7%. These results present an improvement over most of the final performance values of the combined model without stress (PUWM) presented in Section 8.2. Even though the stress-based model does not perform well alone, and even though it seems to have an adverse effect on the performance scores calculated over the complete corpus, its contribution to the final performance of the combined model is clearly positive. The stress cue seems to increase the errors at the beginning of the learning process. However, as more data becomes available, the stress cue contributes to an increase in precision, and this contribution is visible in the final performance of the PUWSM. The reason that lexical stress has not been very successful on its own in the simulations reported here may be due to a number of different problems. First, the stress marking used here is rather crude and unrealistic; a better, more realistic stress marking may improve the situation. Second, the weighted majority voting algorithm used for combining results is not flexible enough for deciding when and where a particular cue is useful; a more elaborate combination model may exploit the usefulness of the stress cue in better ways. Third, despite the problems it brings, a realistic inclusion of the syllable structure of the utterances may provide a major improvement.
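As an illustration of the stress-based measures P(l|wb) and P(r|wb) described above, the following sketch collects stress patterns of lengths one to three from already discovered word types and scores candidate boundary positions. The class and method names, and the exact normalization, are illustrative assumptions rather than the thesis implementation.

```python
from collections import Counter

class StressModel:
    """A minimal sketch of the stress-based cue: it collects the
    per-phoneme stress patterns observed at the beginnings and ends
    of discovered word types and scores boundary positions with
    crude estimates of P(l|wb) and P(r|wb)."""

    def __init__(self, max_n=3):
        self.max_n = max_n
        self.begin = Counter()   # patterns observed word-initially
        self.end = Counter()     # patterns observed word-finally
        self.n_types = 0         # number of word types seen so far

    def add_word_type(self, stresses):
        """`stresses` is the stress string of a word type, one
        character per phoneme (e.g., '2200')."""
        self.n_types += 1
        for n in range(1, self.max_n + 1):
            if len(stresses) >= n:
                self.begin[stresses[:n]] += 1
                self.end[stresses[-n:]] += 1

    def boundary_score(self, utt_stress, pos):
        """Average, over pattern lengths, of the estimated probability
        that the left context ends a word times the probability that
        the right context begins one (a crude relative-frequency
        estimate; the thesis formulation may differ)."""
        if self.n_types == 0:
            return 0.0
        scores = []
        for n in range(1, self.max_n + 1):
            left = utt_stress[max(0, pos - n):pos]
            right = utt_stress[pos:pos + n]
            scores.append((self.end[left] / self.n_types)
                          * (self.begin[right] / self.n_types))
        return sum(scores) / len(scores)
```

Because almost all English content words begin with a strong syllable, a model trained this way assigns high scores mainly to weak–strong transitions, which matches the conservative, high-precision behavior reported for the SM.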

8.4 Summary

This chapter investigated two ways of extending the segmentation models developed in Chapter 6 and Chapter 7 using information from already discovered words. The first strategy makes use of already known words and is based on the words themselves. Even a partial lexicon with some incorrect words is useful for segmentation, as demonstrated by the performance of the model presented in Section 8.2. Furthermore, the combination of this model with the previous models improved overall performance, resulting in the best performance values presented in this thesis. The model is still relatively simple and can be improved in many ways. For example, the phonotactic models used in this thesis pay attention only to the beginnings and ends of the words. However, the structure of words can be modeled more realistically.

[Figure 8.5 appears here: panels (a) SM performance, (b) SM error, (c) PUWSM performance, (d) PUWSM error, plotting the f-scores (BF, WF, LF) and the error rates (Eo, Eu) against the number of utterances.]

Figure 8.5: Boundary (BF), word-token (WF), and word-type (LF) f-scores and oversegmentation (Eo) and undersegmentation (Eu) error rates for the stress-based segmentation model, the SM, and its combination with previous strategies, the PUWSM. The scores are calculated for each 500-utterance block in the BR corpus during the learning process.

For example, the learner can learn certain constraints or tendencies (such as 'a word should have a vowel'), or the length of a typical word in phonemes or syllables. Similarly, using only the frequencies of words is not the best way to characterize a word's usage. A possible improvement that does not require additional resources would be to take into account the number of contexts in which a word is observed. This additional modeling component would match the intuition that words are not only frequent, but also used in varying contexts. The second strategy, the use of lexical stress, does not look as promising as the lexical information. Even though the model presented in Section 8.3 achieved good precision scores on the BR corpus, the recall was very low. At first sight, combining the stress-based model with the others affected the performance of the combined model adversely. However, this seems to be because of the initial mistakes it causes during learning. Once the combined model learns how to use the stress cue, i.e., in the advanced phases of learning, the results improve over the model without the stress cue. The negative results obtained with the stress-based model have to do with a number of factors that point to future directions for modeling the effect of stress. Most importantly, the stress annotations in currently available corpora of child-directed speech are not realistic enough for drawing any strong conclusions. Furthermore, the knowledge of the syllable (either built into the system or learned from the input) seems to be crucial for the success of a stress-based segmentation model. Even though the models presented in this chapter can be improved further in a number of ways, the results presented here already indicate that combining information at different levels leads to better performance. The next chapter will provide an overall view of the results presented in this thesis.

9 An Overview of the Segmentation Strategies

I don’t pretend we have all the answers. But the questions are certainly worth thinking about. Arthur C. Clarke

The last three chapters presented four different segmentation methods and their combinations. This chapter brings the results discussed in these chapters together under a unifying perspective. In the sections that follow, I will summarize the results presented in the preceding chapters, compare them to each other, and discuss some common issues whose discussion was deferred during the presentation of the individual strategies. Throughout this discussion I will point to possible improvements and future directions for the research.

9.1 The measures and the models

The first unsupervised model of discovering lexical units in a continuous speech stream was presented in Chapter 6. The first part of that chapter, Section 6.1, defined four measures of predictability (or uncertainty), namely transitional probability (TP), pointwise mutual information (MI), successor variety (SV) and boundary entropy (H). The analysis presented in that section indicated that these measures correlate with word boundaries. Furthermore, variations of these measures which condition the calculations right-to-left instead of left-to-right, and/or use multiple phoneme n-grams of varying sizes, were discussed. The analysis indicated that, despite the fact that they are not completely independent, all measures provide some additional information regarding boundaries. In addition, it was found that the measures form two groups: TP and MI in one, SV and H in the other. The measures within each group are more closely related to each other than to the measures in the other group. Turning to the use of these measures in segmentation, Section 6.2 defined an unsupervised strategy for deciding whether there is a boundary at a given position in the input utterance, given an indication of a word boundary. The strategy suggests boundaries at the positions where an increase in unpredictability is followed by a decrease. In other words, boundaries are suggested at the peaks of unpredictability. Next, a

simple algorithm, weighted majority voting, was described for combining multiple indications; it also allowed us to improve the peak-based boundary criterion. The final segmentation model described in Chapter 6, as a representative example of a predictability-based segmentation model (PM), was a model using the measures MI and H calculated on phoneme n-grams of sizes one to three. In Chapter 7, on using utterance boundaries, I presented a model that learned partial phonotactics from utterance beginnings and ends. The segmentation strategy based on utterance boundaries used the probability of observing a certain phoneme sequence at utterance beginnings and utterance ends as an indication of word beginnings and word ends, respectively. Again, a representative model for this strategy, the UM, was defined. The UM combines the boundary measures using phoneme sequences of varying sizes between one and three. Chapter 8 introduced two strategies based on learning from previously discovered words. First, a strategy that makes use of the words in the lexicon was defined. This strategy favors the use of frequent and well-formed words on both sides of a candidate boundary. The models following this strategy make use of two separate measures: the frequency of words, and the phonotactics of the words. The first measure is simply the frequency of the words discovered so far. The second measure is exactly the same phonotactics measure defined in Chapter 7, but it uses the beginnings and ends of already discovered word types rather than utterance boundaries. The combined model, the WM, uses n-gram sizes one to three for the phonotactics component, and the sum of the frequencies of already known words on both sides of the boundary position for the frequency component. The second strategy investigated in Chapter 8 was based on lexical stress. This strategy learns the stress patterns of already known words, and uses this knowledge in later boundary decisions. The model is similar to the phonotactics models described earlier, except that instead of phonemes, it uses the stress levels of the syllables that the phonemes are part of. I use the name SM for this model throughout this thesis.
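As a concrete illustration of the predictability measures and the peak-based boundary criterion recapitulated above, the sketch below computes boundary entropy and pointwise mutual information from phoneme bigram statistics. It is a minimal sketch under assumed representations (phonemes as single characters), not the thesis implementation, which also uses n-gram sizes up to three and right-to-left variants.

```python
import math
from collections import Counter, defaultdict

class PredictabilityModel:
    """Sketch of two predictability measures of Chapter 6, boundary
    entropy (H) and pointwise mutual information (MI), computed from
    phoneme bigram statistics (context size one only)."""

    def __init__(self):
        self.unigrams = Counter()
        self.followers = defaultdict(Counter)  # left phoneme -> next-phoneme counts
        self.n_bigrams = 0

    def update(self, phonemes):
        self.unigrams.update(phonemes)
        for a, b in zip(phonemes, phonemes[1:]):
            self.followers[a][b] += 1
            self.n_bigrams += 1

    def entropy(self, left):
        """H(next phoneme | left): high values signal unpredictability."""
        total = sum(self.followers[left].values())
        if total == 0:
            return 0.0
        return -sum((c / total) * math.log2(c / total)
                    for c in self.followers[left].values())

    def mi(self, left, right):
        """Pointwise MI of an adjacent pair: low values signal
        unpredictability, so its negation is fed to the peak detector."""
        joint = self.followers[left][right] / max(1, self.n_bigrams)
        n = sum(self.unigrams.values())
        p_l, p_r = self.unigrams[left] / n, self.unigrams[right] / n
        return math.log2(joint / (p_l * p_r)) if joint > 0 else float('-inf')

def peak_positions(unpredictability):
    """Peak-based decision (Section 6.2.1): suggest a boundary wherever
    unpredictability rises and then falls."""
    return [i for i in range(1, len(unpredictability) - 1)
            if unpredictability[i - 1] < unpredictability[i] > unpredictability[i + 1]]
```

For H the values are used directly, while for MI the negated values are scanned for peaks, since word boundaries correspond to dips in predictability.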

9.2 Performance

Along with their descriptions, a set of performance indicators was reported for all models described in the preceding chapters. These results, together with those of the two reference models, are repeated in Table 9.1. As in the previous presentations, the results in this table are the performance scores for the complete corpus; the performance of the learners towards the end of the learning phase will be discussed below. Figure 9.1 presents the f-scores for boundaries, word tokens and word types (BF, WF and LF) and the oversegmentation and undersegmentation errors (Eo and Eu) for each individual model presented in the previous chapters, including the two reference models: the reference model based on the language-modeling strategy (the LM) and the pseudo-random segmentation baseline (the RM).

                 boundary             word              lexicon           error
model         P     R     F       P     R     F      P     R     F     Eo    Eu
PM         69.6  92.5  79.5    56.9  70.2  62.9   36.7  49.8  42.3   15.3   7.5
UM         82.9  84.8  83.8    70.5  71.7  71.1   33.8  66.9  44.9    6.6  15.2
WM         77.5  71.3  74.3    60.6  57.2  58.9   18.3  47.7  26.4    7.8  28.7
SM         78.2   8.2  14.8    26.5   9.7  14.2    8.2  38.7  13.5    0.9  92.8
PUM        82.6  90.7  86.5    72.4  77.4  74.8   42.8  65.3  51.7    7.2   9.3
PUWM       83.7  91.2  87.3    74.1  78.8  76.4   43.9  67.7  53.3    6.7   8.8
PUWSM      92.8  75.7  83.4    78.3  68.1  72.9   26.8  62.7  37.5    2.2  24.3
RM         27.4  27.0  27.2    12.6  12.5  12.5    6.0  43.6  10.5   27.1  73.0
LM         84.1  82.7  83.4    72.0  71.2  71.6   50.6  61.0  55.3    5.9  17.3

Table 9.1: Performance scores for all the models discussed in the previous chapters. The values are repeated here in a single table for ease of comparison.

The reference models were described in Section 5.5. The results presented in Figure 9.1a–b were calculated over the complete corpus during incremental learning; the results in Figure 9.1c–d reflect the performance of the learner at the end of the learning phase. The first set of results is affected by the initial mistakes the learner makes during the learning process, while the second set reflects the performance of the learner in a more advanced phase of learning (see Section 5.4 for a detailed discussion). Except for the stress-based model (SM), all models perform better than the random baseline (RM) and only slightly worse than the state-of-the-art reference model LM; the model that learns phonotactics from utterance boundaries (UM) seems to perform slightly better than the others, but in general the performances of the individual models, SM aside, are comparable to each other. The stress-based model performs even worse than random segmentation for some scores. The reason for, and the nature of, this failure becomes clearer if we look at the error scores. The SM performs badly because it is conservative: it makes very few oversegmentation mistakes at the cost of missing over 90% of the boundaries. The way it is used here, lexical stress seems to be a precise cue, but its coverage is very low. Figure 9.2 presents the progression of the f-scores and error measures for the combination path taken in the preceding chapters. Figure 9.2a presents the performance of the models on the complete BR corpus; Figure 9.2b presents the performance values for the last 290 utterances. The benefit of combining predictability (PM) and the utterance boundaries (UM) is clear: all performance scores increase. The reason is apparent if we look at the asymmetry of the error scores of the PM and the UM in Figure 9.1. The PM makes more oversegmentation errors than the UM, but the

[Figure 9.1 appears here: four bar-chart panels, (a) f-scores for the complete corpus, (b) error rates for the complete corpus, (c) f-scores for the last 290 utterances, (d) error rates for the last 290 utterances, for the models PM, UM, WM, SM, RM and LM.]

Figure 9.1: Boundary (BF), word-token (WF) and word-type (LF) f-scores and oversegmentation (Eo) and undersegmentation (Eu) rates of all individual models, including the reference models, calculated during learning (a,b) for the complete BR corpus and (c,d) only for the last 290 utterances of the BR corpus.

difference is reversed for undersegmentation errors. The combined model reduces the oversegmentation errors of the PM, and achieves a better overall score. Figure 9.2 also indicates a slight increase in undersegmentation errors for the combined model PUM compared to the PM, but this does not seem to affect the overall performance adversely. The PUM, the combination of the predictability (PM) and utterance-boundary (UM) strategies, shows a clear improvement over its individual components. However, the effect of adding the word-based segmentation model, as can be seen in the difference between the performances of the PUM and the PUWM (the PM–UM–WM combination), is not as clear. Still, as can be seen more clearly in Table 9.1, there is a slight but consistent improvement on all overall measures. As expected, the usefulness

[Figure 9.2 appears here: two panels, (a) the complete corpus and (b) the last 290 utterances, plotting the f-scores BF, WF, LF and the error rates Eo, Eu for the models PM, PUM, PUWM and PUWSM.]

Figure 9.2: Progression of the f-score and error values for the combination of models presented in the last three chapters.

of the WM is particularly visible in the increase in the lexical scores. However, the effect of the WM on the final state of the learner is not that clear, and in some cases it causes poorer performance. It seems that the WM speeds up learning at the beginning, so the overall performance of the combined model is better with the WM included, but it does not have a clear effect on the final state of the learner. Admittedly, the use of lexical information for learning segmentation is one of the less deeply investigated areas in this study, and some ideas for future improvement were listed at the end of Chapter 8. Figure 9.2 reveals an interesting effect of combining the SM with the others (the model labeled PUWSM). The SM is very conservative, and it affects the overall performance by causing many undersegmentation errors. As a result, all of the performance values calculated on the overall corpus drop. However, its effect at the end of the learning phase is positive. This can be explained by the fact that the model itself learns which stress patterns are useful, while the weighted majority voting algorithm learns how to weigh the decisions of the stress-based model, improving all performance scores slightly. The differences in Figure 9.1 and Figure 9.2 between the scores calculated over the complete learning process and those based on the final state of the learner demonstrate that, for incremental learners, the way performance is evaluated changes the apparent success on the segmentation task. This seems to be particularly important for the lexical scores (LP, LR and LF), since the initial phases of learning introduce many incorrect word types, causing the overall performance to degrade.

The reason the inclusion of the last two models, the WM and the SM, did not increase the performance of the combined model substantially may be one of several. First, it may be that these two cues are utilized rather poorly by the models described in Chapter 8. As discussed in Section 8.4, these models can indeed be extended in a number of ways. Second, and relatedly, it may be that the information provided by these cues is insignificant. This is particularly plausible for the SM due to the poorly marked stress in the corpus; a corpus with more realistic stress annotation may allow better use of this cue. These two possible explanations for the lack of compelling improvement seem to be supported by the combinations of these models in other configurations. Compared to combining the PM and the UM, the SM in particular (but also the WM to some extent) does not increase the performance when combined with the other models (a complete tabulation of the performance values for all possible combinations is provided in Appendix B). A third possibility is that the combination of predictability and utterance boundaries already uses most of the information in the data usable for segmentation. In other words, we may be close to the upper bound of what can be achieved with the information at hand. This seems plausible because (1) the final combined model performs similarly to other state-of-the-art segmentation models in the literature, and (2) except for stress, all these segmentation strategies work because the input stream is formed by the concatenation of a limited set of lexical units. However, the numerous possibilities for improving the individual models and their combination listed in this chapter suggest that even without additional sources of information we may achieve better performance. Nevertheless, an interesting task for future research is to establish the upper bound that can be achieved using only the information available in the transcribed corpus. Fourth, the lack of substantial improvement from adding more cues may also be due to the way they are combined. The majority voting algorithm used for combining these results is not the best known algorithm for this purpose. This possible problem will be discussed at length in Section 9.6.

9.3 Making use of prior information

Section 9.2 demonstrated an expected effect: learners make mistakes at the beginning. The performance measures calculated over the complete learning process may set apart fast learners from slow learners. However, as long as learning is achieved in a reasonable time, what counts more is the final performance. The models presented earlier in this thesis learn two different aspects of the input language. The first is the statistical relationships between neighboring phoneme sequences, and the second is the optimal weights for the different indications of word boundaries. This allows us to investigate another plausible scenario relevant to the acquisition of lexical units. Children start to show some sensitivity to a limited set of words, such as their names, around 6 months of age (Bortfeld et al., 2005). However, they start to utter their first words later, around their first birthday. Typically, comprehension precedes production: with rather large variability, the number of words a one-year-old understands seems to be around 50 (Fenson et al., 1994). Lexical knowledge increases slowly until about 18 months of age, at which time a rapid increase in lexical knowledge, known as the vocabulary spurt, starts (Reznick and Goldfield, 1992, see also the discussion in Section 2.1.1). In the light of the research so far, it seems fair to assume that children start building a lexicon around one year of age. However, infants attend to speech, and start collecting statistics about sound sequences, long before they start building a lexicon. As a result, it is plausible that lexical segmentation starts with some prior knowledge of phoneme sequences. To test the usefulness of this type of prior knowledge, the learning algorithm described in Section 6.2.3 was modified to collect phoneme n-gram statistics from a different data set as a first step. After collecting the phoneme statistics, learning proceeds as before. The corpus used for the initial statistics was gathered from the CHILDES database. For all American English transcripts in CHILDES, the recording sessions where the target children were less than a year of age were selected, and all child-directed speech in these sessions was processed and converted to phonemic transcriptions following Brent (1996). The resulting corpus contained 53,770 child-directed utterances for 24 different children recorded in 171 sessions. The ages of the children were between 0;6 and 0;11.29 (mean = 9;11, sd = 48 days).1 Since the resulting corpus is larger than the BR corpus, and it includes many words not marked for stress in the MRC database, annotating this corpus with stress information was not practical.2 Table 9.2 presents the performance of the models discussed so far using the modified algorithm. First, the phoneme n-gram statistics were collected from the corpus described above. During this first step, no attempt was made to segment the corpus or to learn the weights of the boundary indicators. In the second step, learning proceeded as before: the phoneme statistics are updated on top of the already existing statistics, and the BR corpus is segmented using the methods indicated in Table 9.2, where the previously reported results without prior information are also provided for ease of comparison. Compared to the results without the prior statistics, all models, except the word-based model alone, show an increase in the performance scores.
The WM, as expected, shows no difference, since the lexicon is not populated during the prior data collection. Figure 9.3 presents these differences graphically for the combined model PUWM only. The differences are more visible in the performance scores calculated for the complete corpus, especially the increase in LF and the decrease in Eu. This is expected, since having prior data reduces the initial mistakes that the learner otherwise makes. As can be seen in Figure 9.3b, the use of prior data also increases the performance scores in the final state of the learner, albeit slightly.
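The two-phase procedure used in this experiment can be summarized in a few lines. The sketch below is illustrative only: the two `model` methods are assumed names, not the thesis API.

```python
def train_with_prior(model, prior_corpus, br_corpus):
    """Sketch of the modified algorithm described above. In the first
    phase only phoneme n-gram statistics are accumulated (no
    segmentation, no cue-weight updates); in the second phase the
    usual incremental segment-and-learn loop runs on the BR corpus,
    updating the statistics on top of the prior counts."""
    for utterance in prior_corpus:           # phase 1: statistics only
        model.update_ngram_statistics(utterance)
    segmentations = []
    for utterance in br_corpus:              # phase 2: incremental learning
        segmentations.append(model.segment_and_learn(utterance))
    return segmentations
```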

1As previously explained in Section 2.1.1, the age notation follows the standard notation in the language acquisition literature: 0;11.29 means 0 years, 11 months and 29 days.

2The decision is also related to the poor quality of the available stress information. If a more realistic source of stress information were available, the effort may well be justified as a future step.

[Figure 9.3 appears here: four panels comparing the 'no prior' and 'prior' conditions, (a) f-scores for the complete BR corpus, (b) f-scores for the last 290 utterances, (c) error rates for the complete BR corpus, (d) error rates for the last 290 utterances.]

Figure 9.3: Boundary (BF), word-token (WF) and word-type (LF) f-scores and oversegmentation (Eo) and undersegmentation (Eu) rates of the combined model PUWM, with and without prior phoneme statistics. The scores are calculated (a,c) for the complete BR corpus, and (b,d) only for the last 290 utterances of the BR corpus. Note the scale difference between the y-axes of the error-rate and f-score panels.

                        boundary             word              lexicon           error
model                P     R     F       P     R     F      P     R     F     Eo    Eu
PM    (no prior)  69.6  92.5  79.5    56.9  70.2  62.9   36.7  49.8  42.3   15.3   7.5
      (prior)     66.9  96.0  78.9    53.7  70.2  60.9   42.7  47.1  44.8   18.0   4.0
UM    (no prior)  82.9  84.8  83.8    70.5  71.7  71.1   33.8  66.9  44.9    6.6  15.2
      (prior)     78.6  94.0  85.6    68.5  78.0  72.9   54.3  64.4  58.9    9.7   6.0
WM    (no prior)  77.5  71.3  74.3    60.6  57.2  58.9   18.3  47.7  26.4    7.8  28.7
      (prior)     77.5  71.3  74.3    60.6  57.2  58.9   18.3  47.7  26.4    7.8  28.7
PUM   (no prior)  82.6  90.7  86.5    72.4  77.4  74.8   42.8  65.3  51.7    7.2   9.3
      (prior)     79.7  95.7  86.9    70.3  80.3  74.9   57.3  62.6  59.8    9.2   4.3
PUWM  (no prior)  83.7  91.2  87.3    74.1  78.8  76.4   43.9  67.7  53.3    6.7   8.8
      (prior)     79.5  96.3  87.1    70.6  81.1  75.5   59.5  62.9  61.1    9.4   3.7
RM                27.4  27.0  27.2    12.6  12.5  12.5    6.0  43.6  10.5   27.1  73.0
LM                84.1  82.7  83.4    72.0  71.2  71.6   50.6  61.0  55.3    5.9  17.3

Table 9.2: Performance scores without and with prior information for the complete BR corpus. For each model listed, the first row presents the previously reported scores without the prior statistics, and the second row presents the performance scores for the complete learning process on the BR corpus after the phoneme statistics were updated using the larger corpus of child-directed speech described in this section.

9.4 Variation in the input

A natural concern regarding the validity of the results presented in this thesis is whether they are representative for child-directed speech, or whether they are due to peculiarities found only, or almost only, in the BR corpus. Since the BR corpus is used rather often in segmentation research, the performance could even be a by-product of the research conducted using this corpus for over a decade. The larger child-directed speech corpus used as prior data in Section 9.3 already gives some hints that the model is not learning peculiarities of this data: the models are not specialized for the BR corpus. In this section I will present two more results to address this concern. First, I will present the performance results obtained using the larger corpus introduced above. Second, I will present the performance results obtained by randomizing the order of the utterances in the BR corpus. Table 9.3 presents the results obtained using the larger corpus described in Section 9.3 in comparison to the previously reported results for the BR corpus. Except for the WM, all models perform better on the larger corpus. And even though it performs poorly on its own, the WM's contribution to the combined model is positive: in particular, it decreases the undersegmentation errors and increases the LF. The order of natural language utterances is not arbitrary; there are certain regularities that can be found in the utterances of a natural conversation.

                        boundary             word              lexicon           error
model                P     R     F       P     R     F      P     R     F     Eo    Eu
PM    (BR)        69.6  92.5  79.5    56.9  70.2  62.9   36.7  49.8  42.3   15.3   7.5
      (larger)    73.1  98.2  83.8    62.8  77.7  69.5   35.4  56.2  43.4   13.8   1.8
UM    (BR)        82.9  84.8  83.8    70.5  71.7  71.1   33.8  66.9  44.9    6.6  15.2
      (larger)    83.2  94.1  88.3    74.6  81.4  77.9   31.6  76.1  44.7    7.3   5.9
WM    (BR)        77.5  71.3  74.3    60.6  57.2  58.9   18.3  47.7  26.4    7.8  28.7
      (larger)    51.5  86.8  64.7    33.8  49.9  40.3   15.7  32.7  21.2   31.3  13.2
PUM   (BR)        82.6  90.7  86.5    72.4  77.4  74.8   42.8  65.3  51.7    7.2   9.3
      (larger)    85.2  97.2  90.8    78.6  86.2  82.3   44.2  71.2  54.6    6.4   2.8
PUWM  (BR)        83.7  91.2  87.3    74.1  78.8  76.4   43.9  67.7  53.3    6.7   8.8
      (larger)    84.2  98.0  90.6    77.7  86.6  81.9   46.5  70.5  56.0    7.1   2.0
RM    (BR)        27.4  27.0  27.2    12.6  12.5  12.5    6.0  43.6  10.5   27.1  73.0
LM    (BR)        84.1  82.7  83.4    72.0  71.2  71.6   50.6  61.0  55.3    5.9  17.3
      (larger)    88.9  93.1  90.9    81.8  84.5  83.1   50.2  66.0  57.1    4.5   6.9

Table 9.3: The comparative performance results obtained on the BR corpus and on the larger child-directed speech corpus (explained in Section 9.3). For each model, the first row presents the previous results obtained on the BR corpus (summarized in Table 9.1), and the second row presents the results obtained on the larger corpus. All values are calculated for the complete corpus. As before, the performance scores of the reference models are provided for comparison.

For example, it is well known that a word used in an utterance is more likely to be repeated during the same conversation than its base frequency of occurrence in the language would suggest. The experiments presented next disregard this fact by re-ordering the utterances in the BR corpus randomly. The variance of the performance scores across different orderings of the input gives an indication of the robustness of the model with respect to the ordering of the utterances. In addition, the difference between the mean of multiple randomized runs and the results obtained using the natural ordering of the utterances may indicate the usefulness of the natural ordering. For this purpose, box plots of the f-scores and error values for 50 runs of the PUWM with randomized input are presented in Figure 9.4. The dotted horizontal lines in the graphs represent the scores obtained with the natural ordering of the utterances. The ordering of the utterances does not seem to cause large variation. Due to the smaller sample size, the scores calculated on the final 290 utterances are more varied; however, in both cases the standard deviations of the score distributions are rather low. Although it is difficult to draw a conclusive interpretation from these values, the natural ordering of the utterances does seem to be useful. The scores obtained with the natural ordering seem to be higher (and, for the error scores, lower) than the means of the random runs, and this is more visible for the performance scores calculated for the last stage of learning.

[Figure 9.4 appears here: two box-plot panels, (a) and (b), showing the distributions of BF, WF, LF, Eo and Eu over the randomized runs.]

Figure 9.4: Box plots of the performance scores obtained over 50 learning trials on the BR corpus with randomized orderings of the utterances. (a) presents the scores calculated for the complete corpus, while (b) presents the scores calculated for the last 290 utterances. Whiskers cover the complete range of scores (including outliers).
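The randomization experiment itself is simple to express. The sketch below is illustrative: `run_simulation` is an assumed callable that trains a fresh learner on a list of utterances and returns a dictionary of scores; it is not part of the thesis code.

```python
import random
import statistics

def order_robustness(corpus, run_simulation, n_runs=50, score='BF', seed=1):
    """Train a learner on `n_runs` random reorderings of the corpus
    and compare the spread of one score (e.g., the boundary f-score)
    with the score obtained on the natural ordering."""
    rng = random.Random(seed)
    natural = run_simulation(corpus)[score]
    randomized = []
    for _ in range(n_runs):
        shuffled = list(corpus)       # leave the natural ordering intact
        rng.shuffle(shuffled)
        randomized.append(run_simulation(shuffled)[score])
    return natural, statistics.mean(randomized), statistics.stdev(randomized)
```

Comparing the natural-ordering score against the mean and standard deviation of the randomized runs gives exactly the kind of evidence summarized in Figure 9.4.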

9.5 Qualitative evaluation

In the evaluation of the models so far, quantitative measures of success and failure have been presented. This is a reasonable approach, especially when dealing with a large amount of data. However, inspecting the output of a model qualitatively also helps identify the cases where the model succeeds and where it fails. In this section I will present some examples of utterances segmented, and words identified, by the combined model (PUWSM) and the reference model (LM), starting with a familiar example. Figure 4.1 on page 45 presented a sequence of utterances from the BR corpus in the form of a puzzle. The same sequence of utterances will serve here for evaluating the performance of the learners qualitatively. In this section, the results will be displayed using the phonemic transcriptions. To aid the reader in interpreting the results, Figure 9.5 presents the phonemic and orthographic forms of the sequence of utterances used in the puzzle. This particular sequence is formed by the 422nd through 429th utterances in the BR corpus. Besides the repeated use of the word kitty, the utterances in this sequence have another interesting property: they occur at an early position in the BR corpus, where all the models presented here are still actively learning. As can be observed from the

Orthographic Phonemic

a kitty 6 kIti do you think it’s a kitty du yu TINk Its 6 kIti you can think it’s a kitty yu k&n TINk Its 6 kIti is that a kitty Iz D&t 6 kIti here kitty kitty sleeping kitty h( kIti kIti slipIN kIti look you can stick your finger in there lUk yu k&n stIk y) fINgR In D* stick your finger in that hole right stIk y) fINgR In D&t hol r9t good girl gUd g3l now let’s see what’s the baby saying nQ lEts si WAts D6 bebi seIN

Figure 9.5: The gold-standard solution of the puzzle presented in Figure 4.1. The orthographic version and the phonemic transcriptions are displayed together to ease the interpretation of the results presented in this section.

graphs presenting the progress of the performance with increasing input (for example, Figure 8.5), almost all models reach their stable state around the 2000th utterance. After this point some slight increase and some fluctuation are observed, but the steepest increase in performance (and decrease in error) occurs before around the 2000th utterance. Figure 9.6a presents the segmentation solutions offered by the combined model, the PUWSM, and the reference model, the LM. The models make different mistakes, but both of them undersegment. This is expected, since they are still at the beginning of the learning process. It is also in agreement with the progress of Eu displayed in Figure 8.5 and Figure 5.2 for the PUWSM and the LM, respectively. This example also shows that the LM learns faster: it finds more boundaries than the PUWSM. Figure 9.6b presents the outputs of the models for the same sequence of utterances, but with the utterances moved to the end of the BR corpus. This reflects the performance of the learners towards the end of the corpus. As expected, both models perform better than in the earlier phase, and the performance differences are difficult to see very clearly. However, even though the LM also makes some oversegmentation mistakes (for example, segmenting /fINgR/ 'finger' as /fIN gR/), the PUWSM makes more oversegmentation errors. This can be seen in the first utterance, and in the utterances where the PUWSM identifies /k/ as a word. These oversegmentation errors indicate that the PUWSM could benefit from a better phonotactics component (e.g., one that can learn that /k/ is an unlikely word). This example also demonstrates that both models consistently segment the morpheme /IN/ '-ing'. Even though the comparison with the gold standard will mark this as an error, it is not clear what this means for child language acquisition. The fact that morphemes are learned and used productively at some point in the acquisition process suggests that identifying them is not necessarily bad for a learner.

(a) Utterances in their original position:

LM                                 PUWSM
6kIti                              6kIti
du yu TINk It s6kIti               du yu TINk Its6kIti
yu k&n TINk It s6kIti              yu k&nTINk Its6kIti
Iz D&t 6kIti                       Iz D&t 6kIti
h( kItikIti slipIN kIti            h(kIti kIti slipIN kIti
lUk yu k&n stIky)fINgRIn D*        lUk yuk&nstIky)fINgR In D*
stIky)fINgRIn D&t holr9t           stIky)fINgR In D&tholr9t
gUd g3l                            gUdg3l
nQ lEtssi WAt sD6 bebi se IN       nQ lEtssi WAtsD6bebi se IN

(b) Utterances moved to the end of the BR corpus:

LM                                 PUWSM
6kIti                              6k Iti
du yu TINk Its 6kIti               du yu TIN k Its6 kIti
yu k&n TINk Its 6kIti              yu k&n TIN k Its6 kIti
Iz D&t 6kIti                       Iz D&t 6kIti
h( kIti kIti slipIN kIti           h( kIti kIti slip IN k Iti
lUk yu k&n stIk y) fIN gR In D*    lUk yu k&n stIk y) fINgR In D*
stIk y) fIN gR In D&t hol r9t      stIk y) fINgR In D&t hol r9t
gUd g3l                            gUd g3l
nQ lEtssi WAt s D6 bebi se IN      nQ lEts si WAts D6 bebi se IN

Figure 9.6: Solutions offered to the puzzle presented in Figure 4.1 by the LM and the PUWSM. (a) presents the outputs of the models obtained with the utterances in their normal position. (b) presents the outputs obtained by moving the utterances that form the puzzle to the end of the BR corpus.

Another way to inspect the output of the models is to check the most frequent words they identify. Table 9.4 presents the ten most frequent words identified by the PUWSM and the LM, as well as the most frequent words in the gold-standard segmentation of the BR corpus. In general, the output of both models matches the gold-standard segmentation well. Both miss some occurrences of highly frequent words, but the number of times they find these words is also close to the gold standard. All the high-frequency words the PUWSM finds in this list are real words. Furthermore, except that it misses the word a many times, and that the word do gets a higher rank, the words in the list match the gold standard perfectly. The LM performs similarly, but here the problem noted above surfaces again: the LM segments the morpheme /z/ '-s'. The high-frequency words that the models find indicate that both models perform

       PUWSM                      LM                      Gold standard
freq.  word                freq.  word                freq.  word
1488   /yu/   'you'        1459   /yu/   'you'        1704   /yu/   'you'
 839   /D6/   'the'         931   /D&t/  'that'       1291   /D6/   'the'
 802   /D&t/  'that'        891   /WAt/  'what'        895   /6/    'a'
 778   /WAt/  'what'        855   /D6/   'the'         798   /D&t/  'that'
 610   /Iz/   'is'          749   /z/    '-s'          783   /WAt/  'what'
 572   /It/   'it'          647   /Iz/   'is'          653   /Iz/   'is'
 525   /DIs/  'this'        622   /It/   'it'          632   /It/   'it'
 504   /du/   'do'          521   /du/   'do'          588   /DIs/  'this'
 465   /WAts/ 'what's'      502   /tu/   'to'          569   /WAts/ 'what's'

Table 9.4: The most frequent words discovered by the PUWSM and the LM, in comparison to the most frequent words in the gold standard. Sequences that do not exist in the gold-standard lexicon are marked with boldface.

well in finding the frequent words. However, it gives limited indication as to what sort of mistakes the models make. To demonstrate the mistakes, Table 9.5 presents the most frequent sequences that the PUWSM and the LM suggest as words but which are not found in the gold-standard lexicon. Since this list is constructed by checking the words that are found regardless of their context, some errors (such as the ones caused by mistakenly identifying a rare word) are not visible in it. Nevertheless, the errors listed in Table 9.5 are useful indications of the mistakes the models make. In this table, there is a mix of undersegmentation and oversegmentation errors for both models. The first thing to note is that the errors listed for the LM are more frequent. This is likely because the LM learns faster and reaches a stable state quickly; once it starts making a particular mistake, it has more chances to repeat it. Another interesting difference between the PUWSM and the LM that can be deduced from these examples is that the oversegmentation mistakes the LM makes are typically morphemes. The PUWSM, on the other hand, makes rather bold oversegmentation mistakes, such as segmenting peekaboo as pee kaboo. Despite the fact that /pi/ 'the letter P'3 is a word in the gold-standard lexicon, it rarely occurs outside the word peekaboo, and /kabu/ is not a word. Similarly, the model oversegments the words Cindy, daddy and mommy in similar settings. This is because of a problem we noted in Section 8.4: the model requires words to be highly frequent, but does not require them to occur in varying contexts. This particular error type indicates that modeling the properties of words more carefully is another future direction to pursue for improving the model's performance.

3Surprisingly, the word pee, which would also match this sound sequence, does not occur in the BR corpus.

              PUWSM                                          LM
rank  freq.  sequence   example             rank  freq.  sequence    example
  15    340  /IN/       -ing                   5    749  /z/         -s
  79     67  /d&/       daddy                 10    477  /IN/        -ing
  82     63  /#yu/      are you               12    462  /s/         -s
  84     63  /lUk&t/    look at               29    263  /WAtsD&t/   what's that
  93     56  /nADR/     another               33    224  /k&nyu/     can you
 109     45  /Its6/     it's a                48    145  /s6/        it's a
 113     44  /ma/       mommy                 50    139  /WAtsDIs/   what's this
 117     44  /D6d%/     the door              74     97  /~t/        aren't
 119     43  /s/        -s                    81     84  /nADR/      another
 122     43  /9dont/    I don't               88     76  /anD6/      on the
 123     43  /6fon/     a phone              102     65  /sr9t/      it's right
 131     40  /#Doz/     are those            103     65  /pr/        pretty
 132     39  /hQmEni/   how many             109     61  /6dOgi/     a doggy
 136     38  /s6/       it's a               111     59  /W*zD/      where's the
 138     38  /k6bu/     peekaboo             117     57  /Iti/       pretty
 141     37  /9TINk/    I think              118     57  /b9b9/      bye bye
 152     33  /gEn/      again                120     55  /snat/      it's not
 172     28  /9si/      I see                122     55  /6dOg/      a dog
 174     27  /sIn/      Cindy                127     53  /WAtIzIt/   what is it
 175     27  /sAm/      somebody             136     48  /sD6/       what's the

Table 9.5: The most frequent errors made by the PUWSM and the LM. For each sequence mistakenly identified as a word, the columns give the rank of the sequence, the number of times the model suggested the sequence as a word (freq.), and, under 'example', the orthographic form of the word sequence for undersegmentation errors, or an indication of the morpheme or an example case where the error occurred for oversegmentation errors. For oversegmentation errors, a dash '-' at the beginning indicates a frequent morpheme, and boldface marking in the original indicates the approximate part-word identified by the model.

The qualitative analysis presented in this section supports some of the previous findings; in particular, the model presented in this thesis can be improved through better phonotactics and lexicon components. Before concluding the overview, the next section discusses some possible improvements to the learning method used in the models presented in this thesis.

9.6 The learning method

All the models in this thesis follow a simple unsupervised method for making boundary decisions, and another simple method for combining the indications obtained from multiple boundary-detection measures. For the first purpose, the local changes in a measure's values are used: if a measure provides a stronger indication for the current position than for the neighboring positions, it is taken as a boundary decision. In other words, a boundary decision is made at the positions where the indication of a boundary peaks. The measures are combined using majority voting. In a nutshell, for each position in an utterance, a number of boundary indications are collected, and if a majority of the indications is positive, the combined decision is also positive. The models assign weights to each boundary indicator, and these weights are updated based on their agreement with the majority after every positive or negative boundary decision. An indicator that agrees with the majority all the time gets a full vote, and the weight of an indicator that behaves randomly is set to zero, causing its further decisions to be ignored (a small sketch of this scheme is given below). As demonstrated by the performance scores presented so far, these simple mechanisms work well: the overall performance of the combined model is competitive with the state-of-the-art segmentation models in the literature. However, there are a number of points where the learning method can be improved. For this work, the attraction of the majority voting algorithm has been its simplicity. The method provides a simple but effective way of combining different quantities. When dealing with a large number of indications with varying values coming from different distributions, making yes-or-no decisions simplifies the combination process. However, it also levels the differences between strong and weak indications. This is a weakness for a model of cue combination for segmentation, since there may be cases where a single strong indication, such as a long pause, is enough for the decision. Two possibilities are considered as future extensions. One is a weighted linear, or log-linear, combination strategy. The log-linear models that are becoming increasingly common in the computational linguistics literature are in principle attractive here as well, because of the exponential-like distributions some of the measures follow (see Section 6.1). Another, possibly better, option is to use a Bayesian cue combination method (e.g., Kording et al., 2007). However, both of these methods are considerably more complex than majority voting, and they are typically trained using supervised systems and/or batch algorithms that require large amounts of input at once. Nevertheless, Bayesian cue combination especially may be a future improvement for the models presented here. Its attractiveness is two-fold. First, it allows the strengths of the indications to be modeled. Second, it gives a natural way to distinguish between the prior information the learner has about a cue (i.e., how well the cue indicates a word boundary) and the reliability of a particular measurement of the cue (i.e., how reliable the current observation is).
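As a concrete reference point for these comparisons, the following sketch shows one possible realization of the baseline weighted majority voting scheme described at the beginning of this section. The exact weight-update rule is an assumption; the sketch only needs to satisfy the two properties stated above (full vote for perfect agreement, zero weight for chance-level agreement).

```python
class MajorityVoter:
    """One possible realization of weighted majority voting over
    boundary indicators. Each indicator casts a yes/no vote per
    position; the decision follows the weighted majority, and an
    indicator's weight reflects how often it has agreed with past
    majority decisions: always agreeing yields a full vote (1.0),
    chance-level agreement yields a weight of zero."""

    def __init__(self, n_indicators):
        self.agreed = [0] * n_indicators
        self.decisions = 0

    def weight(self, i):
        if self.decisions == 0:
            return 1.0                    # start with full votes
        rate = self.agreed[i] / self.decisions
        return max(0.0, 2.0 * rate - 1.0)

    def decide(self, votes):
        """`votes` is a list of booleans, one per indicator."""
        yes = sum(self.weight(i) for i, v in enumerate(votes) if v)
        no = sum(self.weight(i) for i, v in enumerate(votes) if not v)
        decision = yes > no
        # Update the agreement statistics after each decision.
        self.decisions += 1
        for i, v in enumerate(votes):
            if v == decision:
                self.agreed[i] += 1
        return decision
```

Note how the yes/no votes discard the strength of each indication, which is exactly the limitation discussed above that linear, log-linear, or Bayesian combination would address.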
Another possible improvement to the learning system may come from a more structured model, for example a hierarchical model. Each cue described in the previous chapters is composed of multiple sub-measures. For example, the predictability cue uses both entropy and mutual information. The combined models combine these sub-measures in a flat manner, without paying attention to which cue a particular sub-measure comes from. A hierarchical model would get a single indication from a single cue, and combine the cues according to a set of cue-specific weights. A model structured this way may be able to capture environmental effects on a certain cue better. For example, if the stress cue is not reliable (as in the output of a poorly constructed speech synthesizer, or a non-native speaker), a combined model could reduce the weight of the cue once it is known to be unreliable given another variable (e.g., the speech synthesizer). Besides the method of combining multiple information sources, an important aspect of the computational models described here is that they are unsupervised. The overall language acquisition process might be considered somewhat supervised,4 in contrast to the assumptions consistently upheld in this thesis. Even though the learner does not get feedback on abstract concepts such as grammar rules or word boundaries, the learner's communication with his or her environment, and the efficiency of processing, depend on the learner's success in using the language correctly. Since word boundaries are one of these abstract concepts, and since we do not know what sort of feedback a learner can get from communication failures, an unsupervised method is a better match for modeling this task. The learning strategies presented in this thesis are unsupervised. In addition, they have no free parameters, except for the magical number three: in all cases where phoneme n-grams are combined, n-grams of sizes one to three are used. Section 6.2 investigated the effects of changing the length of the phoneme n-grams; in general, phoneme n-grams are most useful up to sizes three to four, after which their usefulness slowly diminishes. One possible explanation for using n-gram sizes up to three, as Cohen et al. (2007) note, is that this may be related to the optimal processing capabilities of humans (see also Miller, 1956). Another is that it can be learned from the input: since increasing the n-gram size further does not improve the results, even if the learner initially considered longer n-grams as well, an effective weight-update mechanism would eliminate the ones that are not useful. The aim of the computational simulations described in this thesis is to model human performance in learning segmentation. Consequently, good segmentation performance is not the only aim: the models developed in this study try to be faithful to what we know about human performance in segmentation. An important aspect of segmentation by humans is that they use a set of cues for segmenting input utterances. The segmentation strategy described here does the same: it combines a number of cues known to be used by children during segmentation, and the framework is easily extensible to include more cues. As discussed above, the possible methods for cue combination have not been fully explored in this study. However, we do not yet have enough data about human performance to prefer one

4Even though the term may upset many linguists, language acquisition may fit better into the learning framework called reinforcement learning in the machine learning literature. The main difference from supervised learning is that the learner is active: it gets feedback for its actions, but the feedback can be delayed.

particular method of combination over another. Human sentence processing is known to be incremental and predictive. Following human segmentation and sentence processing, the models segment the input utterances incrementally, without waiting for the end of the input. In a sense, the lexicon-based segmentation described in Section 8.1 adds a predictive component by favoring boundaries that start at known word beginnings. One last issue that requires attention here is the input. The input used in these simulations is realistic in the sense that it consists of samples of real child-directed speech. However, the simulations reported here take phonemically transcribed speech as input. Furthermore, the transcriptions represent a word in the same way wherever it appears. In real-world speech, tokens of the same word sound at least slightly different depending on many variables, such as the context of the word or noise in the environment. The transcriptions remove these differences, making the task of the learner easier. The same is also true for the stress marking used in this research. On the other hand, transcribed speech also removes a set of cues that help in discovering word boundaries. This relatively unrealistic input representation is used for practical reasons: we have neither corpora that encode all relevant aspects of speech nor a standardized way of encoding these aspects. There have been attempts to introduce variability in the input by randomly degrading the quality of the input utterances, but the representativeness of these approaches of the variation in actual speech is questionable (see Section 5.1 for a discussion).

9.7 Summary

In a number of incremental steps spread over the preceding three chapters, this thesis has described the first incremental, multiple-cue combination model of segmentation that performs competitively with state-of-the-art segmentation models that use a language-modeling strategy (such as the LM introduced in Section 5.5). In the present chapter I have provided a summary of what was presented before, brought the results obtained previously into a unified scheme, analyzed a number of issues that were deferred during the presentation of the individual models, and suggested possible future improvements to the models described here. Now it is time for the general summary of the thesis, and to conclude.

10 Conclusions

I would have written far less, but I did not have the time. Blaise Pascal

This study has mainly developed computational models of language acquisition and tested these models in computational simulations on realistic samples of the input children receive during language acquisition. The first two chapters of this thesis surveyed the broader field of language acquisition, focusing mainly on formal methods of studying it. The particular aspect of language acquisition at the focus of this study is segmentation. After an introduction to the segmentation problem, the remainder of the thesis described a general strategy for segmentation, and reported results from computational simulations of models following this strategy. Chapter 2 introduced a central debate in the field of language acquisition, the nature–nurture debate, and a number of influential theories of language acquisition in relation to this debate. The discussion in this chapter led to the conclusion that, based on the evidence available from language acquisition, we are far from settling this age-old debate. Furthermore, most of the positions on both sides of the debate seem to be fuzzy, and very difficult or impossible to settle conclusively. More importantly, the benefit of placing this debate at the center of the research agenda is questionable; doing so often seems unfruitful, or even counterproductive. In Chapter 3, the formal modeling practice in language acquisition research was introduced, and its strengths and weaknesses were identified. This chapter also provided a review of relevant studies from computational learning theory, and pointed out some common misconceptions in the language acquisition literature stemming from results of certain studies in computational learning theory. The formal, analytical approach typically used in the computational learning theory literature is one approach to studying computational models of learning. Another common method, often employed in the study of cognitive processes, is to use computational simulations. I argued that the approach taken in this thesis, computational simulation, sometimes provides easier modeling opportunities than the formal analysis methods typically used in the field of computational learning theory. The input to the language

learner is an example where computational simulations allow more realistic and more straightforward modeling options. Most studies outlined in these two chapters focus on learning syntax. Children's acquisition of the rules governing the formation of sentences is indeed an interesting subject. However, learning syntax, as Miller (1996, p. 238) puts it, is only 'slightly more amazing' than learning new words. The focus of this study is one of the first steps children need to take in learning words. The question of interest is: how do children extract lexical units, e.g., words, from a continuous stream of speech sounds without knowing which sound sequences are words? This problem, the segmentation problem, and what we know from developmental psycholinguistics about how children deal with it, was reviewed in Chapter 4. In a nutshell, children seem to use several cues that are partial, noisy, overlapping, and sometimes conflicting indications of word boundaries and words in continuous speech. These cues include predictability statistics, phonotactics, lexical knowledge, and lexical stress. After discussing the segmentation problem from a computational perspective, the related work in this subfield was reviewed. Chapter 5 presented issues regarding the evaluation of computational models of segmentation, pointing out a few confusions in interpreting the results of these models, and described two new error measures that may serve to assess the performance of segmentation models better. Next, a reference model was described. The reference model shares a common strategy, the 'language modeling' strategy, with the most successful computational models of segmentation; however, it does not model the child language acquisition process closely. The following three chapters presented computational models of segmentation that are compatible with what we know about child language acquisition. These chapters defined models that use the cues listed above, leading to a combined model which uses all the cues. Chapter 9 provided an overview and comparison of the models developed in the preceding three chapters. In the development of the framework shared by all these models, particular care has been taken that the work be compatible with what we know about the segmentation of speech by children from psycholinguistic research. First, as the psycholinguistic studies reviewed in Chapter 4 indicate, adults and children use multiple cues in the segmentation task. Following this fact about human language acquisition, the model integrates multiple cues. Furthermore, it starts segmenting input utterances with language-general cues, and learns to use language-specific cues only after learning words of the input language. Second, the segmentation decisions are incremental. Unlike computational models that require the complete utterance (or the complete corpus) to be presented before deciding on the best segmentation, the models presented here decide on boundaries while processing the utterance from left to right. In accordance with the results of some psycholinguistic studies, a certain amount of right context is used. However, the models presented here do not exhaustively search for the best segmentation. The incremental nature of the segmentation algorithm also means that the computational resources required by the models are modest. A third aspect, which can be improved further, is the input.
As in most psychologically motivated computational models of segmentation, all simulations are run on phonemically transcribed child-directed speech. The phonemic transcription necessarily introduces some idealizations, and it removes some of the cues available in the speech signal. However, these limitations are practical rather than inherent: the model can trivially be extended to make use of more realistic input.

The results of the simulations are encouraging. The segmentation performance, measured using the performance scores common in the related literature, indicates that the combined model is competitive with state-of-the-art segmentation models that do not show similar levels of fidelity to what is known about child language acquisition. As far as I can determine, the combined model presented here is the first model that follows the child acquisition process as faithfully as it does while reaching this level of segmentation performance. Besides the improvements in performance, another advantage of the model presented in this thesis in comparison to previous models, such as connectionist systems, is that it uses an explicitly specified statistical model of learning. As a result, it allows easier interpretation of what the model is learning, and easier extension where necessary.

The segmentation model presented in this thesis demonstrates a way to achieve good segmentation performance using more plausible segmentation strategies. However, this is only the beginning. As discussed in depth in Chapter 9, there are many ways to improve the model. Two of these improvements stand out: the first is the use of better cue combination mechanisms, and the second is the use of more realistic input.
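To make the shared strategy concrete, here is a minimal sketch of the general computation the conclusions describe: several cues each score a candidate boundary position, the scores are combined by weighted voting, and a decision is made immediately, left to right, without searching over whole-utterance segmentations. All cue definitions, weights and thresholds below are illustrative placeholders, not the actual estimators developed in Chapters 6 to 8.

    # A minimal sketch of incremental, weighted multiple-cue segmentation.
    # Cue definitions, weights and the threshold are illustrative
    # placeholders, not the estimators developed in this thesis.

    from collections import Counter

    def make_predictability_cue(corpus):
        """Build a cue: a low transitional probability between adjacent
        phonemes suggests a word boundary (score near 1)."""
        bigrams = Counter(u[i:i + 2] for u in corpus for i in range(len(u) - 1))
        unigrams = Counter(ch for u in corpus for ch in u)

        def cue(utterance, i):
            tp = bigrams[utterance[i - 1:i + 1]] / max(unigrams[utterance[i - 1]], 1)
            return 1.0 - tp  # higher score = stronger boundary signal

        return cue

    def segment(utterance, cues, weights, threshold=0.6):
        """Decide on each intra-utterance position from left to right,
        without searching over whole-utterance segmentations."""
        total = sum(weights)
        return [i for i in range(1, len(utterance))
                if sum(w * c(utterance, i) for c, w in zip(cues, weights)) / total
                > threshold]

    # Toy usage: utterances are strings of 'phonemes', boundaries unknown.
    corpus = ["isthatadoggy", "thatsadoggy", "isthatyours"]
    cue = make_predictability_cue(corpus)
    print(segment("isthatadoggy", [cue], [1.0]))

In the thesis itself, the individual cues are proper estimators learned from the input and the combination is an explicit statistical model; the sketch only shows the overall shape of the computation.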

Bibliography

Abney, Steven. 1996. Statistical Methods and Linguistics. In Judith Klavans and Philip Resnik (eds.), The Balancing Act: Combining Symbolic and Statistical Approaches to Language, Chapter 1, Cambridge, MA: The MIT Press.
Aitchison, Jean. 1994. Words in the mind: an introduction to the mental lexicon. Malden, MA: Blackwell.
Akmajian, Adrian, Demers, Richard A., Farmer, Ann K. and Harnish, Robert M. 2010. Linguistics: An Introduction to Language and Communication. Cambridge, MA: The MIT Press, sixth edition.
Al-Shalabi, Riyad, Kannan, Ghassan, Hilat, Iyad, Ababneh, Ahmad and Al-Zubi, Ahmad. 2005. Experiments with the Successor Variety Algorithm Using the Cutoff and Entropy Methods. Information Technology Journal 4(1).
Allen, Joe and Christiansen, Morten H. 1996. Integrating Multiple Cues in Word Segmentation: A Connectionist Model using Hints. In Proceedings of the Eighteenth Annual Cognitive Science Society Conference, pages 370–375.
Ando, Rie Kubota and Lee, Lillian. 2003. Mostly-Unsupervised Statistical Segmentation of Japanese Kanji Sequences. Natural Language Engineering 9(2), 127–149.
Angluin, Dana. 1982. Inference of Reversible Languages. Journal of the Association for Computing Machinery 29(3), 741–765.
Angluin, Dana. 1987. Learning regular sets from queries and counterexamples. Information and Computation 75(2), 87–106.
Angluin, Dana. 1988a. Identifying Languages From Stochastic Examples. Technical Report YALE/DCS/TR614, Yale University, Department of Computer Science.
Angluin, Dana. 1988b. Queries and Concept Learning. Machine Learning 2(4), 319–342.
Aslin, Richard N. 1993. Segmentation Of Fluent Speech Into Words: Learning Models And The Role Of Maternal Input. In B. De Boysson-Bardies, Scania de Schonen, Peter Jusczyk, Peter MacNeilage and John Morton (eds.), Developmental Neurocognition: Speech and Face Processing in the First Year of Life, pages 305–315, Kluwer Academic Publishers.
Aslin, Richard N., Saffran, Jenny R. and Newport, Elissa L. 1998. Computation of Conditional Probability Statistics by 8-month-old Infants. Psychological Science 9(4), 321–324.
Aslin, Richard N., Woodward, Julide Z., LaMendola, Nicholas P. and Bever, Thomas G. 1996. Models of Word Segmentation in Fluent Maternal Speech to Infants. In James L. Morgan and Katherine Demuth (eds.), Signal to Syntax: Bootstrapping From Speech to Grammar in Early Acquisition, Chapter 8, pages 117–134, Lawrence Erlbaum Associates.
Baker, Mark C. 2001. The Atoms of Language: The Mind's Hidden Rules of Grammar. New York: Basic Books.
Bannard, Colin and Matthews, Danielle. 2008. Stored Word Sequences in Language Learning. Psychological Science 19(3), 241–248.
Batchelder, Eleanor Olds. 2002. Bootstrapping the lexicon: A computational model of infant speech segmentation. Cognition 83, 167–206.
Bernstein Ratner, Nan. 1987. The phonology of parent-child speech. In K. Nelson and A. van Kleeck (eds.), Children's language, volume 6, pages 159–174, Hillsdale, NJ: Erlbaum.
Bijeljac-Babic, Ranka, Bertoncini, Josiane and Mehler, Jacques. 1993. How do 4-day-old infants categorize multisyllabic utterances? Developmental Psychology 29(4), 711–721.
Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Springer.
Blanchard, Daniel, Heinz, Jeffrey and Golinkoff, Roberta. 2010. Modeling the contribution of phonotactic cues to the problem of word segmentation. Journal of Child Language 37(Special Issue 03), 487–511.
Bloom, Lois. 1976. One Word at a Time: The Use of Single Word Utterances Before Syntax. Mouton de Gruyter.
Bloom, Paul. 2000. How Children Learn the Meanings of Words. Cambridge, MA: The MIT Press.
Bloom, Paul and Markson, Lori. 1998. Capacities underlying word learning. Trends in Cognitive Sciences 2(2), 67–73.
Bloomfield, Leonard. 1933. Language. New York: Henry Holt.
Boeckx, Cedric. 2009. Language in Cognition: Uncovering Mental Structures and the Rules Behind Them. Oxford: Wiley-Blackwell.
Boland, Philip J. 1989. Majority Systems and the Condorcet Jury Theorem. Journal of the Royal Statistical Society, Series D (The Statistician) 38(3), 181–189.
Bordag, Stefan. 2005. Unsupervised knowledge-free morpheme boundary detection. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP).
Bordag, Stefan. 2007. Unsupervised and Knowledge-free Morpheme Segmentation and Analysis. In Working Notes for the CLEF Workshop 2007.
Bortfeld, Heather, Morgan, James L., Golinkoff, Roberta Michnick and Rathbun, Karen. 2005. Mommy and Me: Familiar Names Help Launch Babies Into Speech-Stream Segmentation. Psychological Science 16, 298–304.
Box, George E. P. and Draper, Norman R. 1986. Empirical Model-Building and Response Surfaces. New York, USA: John Wiley & Sons, Inc.
Brent, Michael R. 1996. Advances in the computational study of language acquisition. Cognition 61, 1–38.
Brent, Michael R. 1999a. An Efficient, Probabilistically Sound Algorithm for Segmentation and Word Discovery. Machine Learning 34(1-3), 71–105.
Brent, Michael R. 1999b. Speech segmentation and word discovery: a computational perspective. Trends in Cognitive Sciences 3(8), 294–301.
Brent, Michael R. and Cartwright, Timothy A. 1996. Distributional regularity and phonotactic constraints are useful for segmentation. Cognition 61, 93–125.
Brent, Michael R. and Siskind, Jeffrey Mark. 2001. The role of exposure to isolated words in early vocabulary development. Cognition 81, B33–B44.
Bresnan, Joan. 2001. Lexical Functional Syntax. Blackwell.
Bresnan, Joan, Kaplan, Ronald M., Peters, Stanley and Zaenen, Annie. 1982. Cross-Serial Dependencies in Dutch. Linguistic Inquiry 13(4), 613–635.
Brown, Roger. 1973. A first language: The early stages. Cambridge, MA: Harvard University Press.
Cairns, Paul, Shillcock, Richard, Chater, Nick and Levy, Joe. 1994. Modelling the acquisition of lexical segmentation. In Proceedings of the 26th Child Language Research Forum, University of Chicago Press.
Carey, Susan. 1978. The child as word learner. In Morris Halle, Joan Bresnan and George A. Miller (eds.), Linguistic theory and psychological reality, Chapter 8, pages 264–293, Cambridge, MA: The MIT Press.
Carnegie Mellon University. 1998. The Carnegie Mellon University Pronouncing Dictionary. http://www.speech.cs.cmu.edu/cgi-bin/cmudict, version 0.6d.
Chomsky, Noam. 1957. Syntactic Structures. The Hague: Mouton.
Chomsky, Noam. 1959a. On certain formal properties of grammars. Information and Control 2(2), 137–167.
Chomsky, Noam. 1959b. A Review of B. F. Skinner's Verbal Behavior. Language 35(1), 26–58.
Chomsky, Noam. 1965. Aspects of the Theory of Syntax. Cambridge, MA: The MIT Press.
Chomsky, Noam. 1980. Rules and Representations. New York: Columbia University Press.
Chomsky, Noam. 1981. Lectures on government and binding. Dordrecht, NL: Foris Publications.
Chouinard, Michelle M. and Clark, Eve V. 2003. Adult reformulations of child errors as negative evidence. Journal of Child Language 30(3), 637–669.
Christiansen, Morten H. and Allen, Joseph. 1997. Coping with variation in speech segmentation. In A. Sorace, C. Heycock and R. Shillcock (eds.), Proceedings of GALA 1997: Language Acquisition: Knowledge Representation and Processing, pages 327–332, University of Edinburgh Press.
Christiansen, Morten H., Allen, Joseph and Seidenberg, Mark S. 1998. Learning to Segment Speech Using Multiple Cues: A Connectionist Model. Language and Cognitive Processes 13(2), 221–268.
Christiansen, Morten H., Conway, Christopher M. and Curtin, Suzanne. 2005. Multiple-cue integration in language acquisition: A connectionist model of speech segmentation and rule-like behavior. In J.W. Minett and W.S.-Y. Wang (eds.), Language acquisition, change and emergence: Essays in evolutionary linguistics, Chapter 5, pages 205–249, Hong Kong: City University of Hong Kong Press.
Christophe, Anne, Dupoux, Emmanuel, Bertoncini, Josiane and Mehler, Jacques. 1994. Do infants perceive word boundaries? An empirical study of the bootstrapping of lexical acquisition. The Journal of the Acoustical Society of America 95(3), 1570–1580.
Church, Kenneth W. 1987. Phonological parsing and lexical retrieval. Cognition 25(1-2), 53–69.
Clark, Alexander. 2010. Efficient, correct, unsupervised learning of context-sensitive languages. In Proceedings of CoNLL, Association for Computational Linguistics.
Clark, Alexander and Eyraud, Rémi. 2006. Learning Auxiliary Fronting with Grammatical Inference. In Proceedings of CoNLL, pages 125–132, New York.
Clark, Alexander and Eyraud, Rémi. 2007. Polynomial Identification in the Limit of Substitutable Context-free Languages. Journal of Machine Learning Research 8, 1725–1745.
Clark, Alexander and Lappin, Shalom. 2011a. Computational learning theory and language acquisition. In R. Kempson, N. Asher and T. Fernando (eds.), Handbook of Philosophy of Linguistics, Elsevier, forthcoming.
Clark, Alexander and Lappin, Shalom. 2011b. Linguistic Nativism and the Poverty of the Stimulus. Oxford: Wiley-Blackwell.
Clark, Alexander and Thollard, Franck. 2004. Partially Distribution-Free Learning of Regular Languages from Positive Samples. In Proceedings of COLING, pages 85–91.
Clark, Eve V. 1993. The Lexicon in Acquisition. Cambridge Studies in Linguistics, Cambridge, UK: Cambridge University Press.
Clark, Stephen and Curran, James R. 2003. Log-Linear Models for Wide-Coverage CCG Parsing. In Proceedings of the SIGDAT Conference on Empirical Methods in Natural Language Processing (EMNLP-03), pages 97–104.
Cohen, Paul, Adams, Niall and Heeringa, Brent. 2007. Voting experts: An unsupervised algorithm for segmenting sequences. Intelligent Data Analysis 11(6), 607–625.
Cole, Ronald A. 1973. Listening for mispronunciations: A measure of what we hear during speech. Perception and Psychophysics 13(1), 153–156.
Collins, Michael. 1999. Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania.
Çöltekin, Çağrı. 2010. Improving Successor Variety for Morphological Segmentation. In Proceedings of the 20th Meeting of Computational Linguistics in the Netherlands.
Coltheart, Max. 1981. The MRC psycholinguistic database. The Quarterly Journal of Experimental Psychology 33A(4), 497–505.
Costa Florêncio, Cristophe. 2003. Learning Categorial Grammars. Ph.D. thesis, Universiteit Utrecht.
Cowie, Fiona. 2010. Innateness and Language. In Edward N. Zalta (ed.), The Stanford Encyclopedia of Philosophy, Stanford University, summer 2010 edition.
Crain, Stephen and Nakayama, Mineharu. 1987. Structure Dependence in Grammar Formation. Language 63(3), 522–543.
Crain, Stephen and Pietroski, Paul. 2002. Why language acquisition is a snap. The Linguistic Review 19(1-2), 163–183.
Croft, William. 2001. Radical Construction Grammar: Syntactic Theory in Typological Perspective. New York: Oxford University Press.
Curtiss, Susan. 1977. Genie: A Psycholinguistic Study of a Modern-Day "Wild Child". Perspectives in neurolinguistics and psycholinguistics, Academic Press.
Cutler, Anne. 1996. Prosody and the word boundary problem. In James L. Morgan and Katherine Demuth (eds.), Signal to Syntax: Bootstrapping From Speech to Grammar in Early Acquisition, Chapter 6, pages 87–99, Lawrence Erlbaum Associates.
Cutler, Anne and Butterfield, Sally. 1990. Durational cues to word boundaries in clear speech. Speech Communication 9(5-6), 485–495.
Cutler, Anne and Butterfield, Sally. 1992. Rhythmic cues to speech segmentation: Evidence from juncture misperception. Journal of Memory and Language 31(2), 218–236.
Cutler, Anne and Carter, David M. 1987. The predominance of strong initial syllables in the English vocabulary. Computer Speech and Language 2(3-4), 133–142.
Cutler, Anne and Mehler, Jacques. 1993. The periodicity bias. Journal of Phonetics 21, 103–108.
Cutler, Anne, Mehler, Jacques, Norris, Dennis and Segui, Juan. 1986. The syllable's differing role in the segmentation of French and English. Journal of Memory and Language 25(4), 385–400.
Dahan, Delphine and Brent, Michael R. 1999. On the discovery of novel wordlike units from utterances: An artificial-language study with implications for native-language acquisition. Journal of Experimental Psychology: General 128(2), 165–185.
Dahan, Delphine and Magnuson, James S. 2006. Spoken Word Recognition. In Handbook of Psycholinguistics, Chapter 8, pages 249–283, Elsevier, second edition.
Dauer, Rebecca M. 1983. Stress-timing and syllable-timing reanalyzed. Journal of Phonetics 11, 51–62.
Davis, Martin D., Sigal, Ron and Weyuker, Elaine J. 1994. Computability, complexity, and languages: fundamentals of theoretical computer science. Boston: Academic Press, Harcourt, Brace.
Davis, Matthew Harold. 2006. Lexical Segmentation in Spoken Word Recognition. Ph.D. thesis, Birkbeck College, University of London.
de Marcken, Carl. 1996. Linguistic structure as composition and perturbation. In Proceedings of the 34th annual meeting on Association for Computational Linguistics, pages 335–341, Morristown, NJ, USA: Association for Computational Linguistics.
Déjean, Hervé. 1998. Morphemes as Necessary Concept for Structures Discovery from Untagged Corpora. In Workshop on Paradigms and Grounding in Natural Language Learning, pages 295–299.
Demberg, Vera. 2007. A Language-Independent Unsupervised Model for Morphological Segmentation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 920–927, Prague, Czech Republic: Association for Computational Linguistics.
Demuth, Katherine, Culbertson, Jennifer and Alter, Jennifer. 2006. Word-minimality, Epenthesis and Coda Licensing in the Early Acquisition of English. Language and Speech 49(2), 137–173.
Dickersin, Kay. 1990. The Existence of Publication Bias and Risk Factors for Its Occurrence. Journal of the American Medical Association 263(10), 1385–1389.
Dilley, Laura C. and McAuley, J. Devin. 2008. Distal prosodic context affects word segmentation and lexical processing. Journal of Memory and Language 59(3), 294–311.
Dittmar, Miriam, Abbot-Smith, Kirsten, Lieven, Elena and Tomasello, Michael. 2008. German Children's Comprehension of Word Order and Case Marking in Causative Sentences. Child Development 79(4), 1152–1167.
Elman, Jeffrey L. 1990. Finding Structure in Time. Cognitive Science 14, 179–211.
Elman, Jeffrey L. 1991. Distributed Representations, Simple Recurrent Networks, And Grammatical Structure. Machine Learning 7(2-3).
Elman, Jeffrey L. 2005. Connectionist models of cognitive development: where next? Trends in Cognitive Sciences 9(3), 111–117.
Elman, Jeffrey L., Bates, Elizabeth A., Johnson, Mark H., Karmiloff-Smith, Annette, Parisi, Domenico and Plunkett, Kim. 1996. Rethinking Innateness. Cambridge, MA: The MIT Press.
Fenson, Larry, Dale, Philip S., Reznick, J. Steven, Bates, Elizabeth, Thal, Donna J. and Pethick, Stephen J. 1994. Variability in Early Communicative Development. Monographs of the Society for Research in Child Development 59, with commentary by Michael Tomasello, Carolyn B. Mervis and Joan Stiles.
Fernald, Anne, Taeschner, Traute, Dunn, Judy, Papoušek, Mechthild, de Boysson-Bardies, Bénédicte and Fukui, Ikuko. 1989. A cross-language study of prosodic modifications in mothers' and fathers' speech to preverbal infants. Journal of Child Language 16(3), 477–501.
Fiser, József and Aslin, Richard N. 2002. Statistical learning of new visual feature combinations by infants. Proceedings of the National Academy of Sciences of the United States of America 99(24), 15822–15826.
Fleck, Margaret M. 2008. Lexicalized phonotactic word segmentation. In Proceedings of the Annual Meeting of the Association of Computational Linguistics (ACL-08), pages 130–138.
Fodor, Janet Dean. 1998. Unambiguous Triggers. Linguistic Inquiry 29(1), 1–36.
Fougeron, Cecile and Keating, Patricia A. 1997. Articulatory strengthening at edges of prosodic domains. The Journal of the Acoustical Society of America 101(6), 3728–3740.
Francis, W. Nelson and Kucera, Henry. 1979. Brown Corpus Manual. Technical Report, Department of Linguistics, Brown University, Providence, Rhode Island, US.
Frigg, Roman and Hartmann, Stephan. 2009. Models in Science. In Edward N. Zalta (ed.), The Stanford Encyclopedia of Philosophy, Stanford University, summer 2009 edition.
Gambell, Timothy and Yang, Charles. 2006. Word segmentation: Quick but not dirty. Unpublished manuscript, available at http://www.ling.upenn.edu/~ycharles/papers/quick.pdf.
Ganger, Jennifer and Brent, Michael R. 2004. Reexamining the Vocabulary Spurt. Developmental Psychology 40(4), 621–632.
Ganong, William F. 1980. Phonetic categorization in auditory word perception. Journal of Experimental Psychology: Human Perception and Performance 6(1), 110–125.
Gervain, Judit and Mehler, Jacques. 2010. Speech Perception and Language Acquisition in the First Year of Life. Annual Review of Psychology 61(1), 191–218.
Ghahramani, Zoubin, Wolpert, Daniel M. and Jordan, Michael I. 1997. Computational Models of Sensorimotor Integration. In Pietro Morasso and Vittorio Sanguineti (eds.), Self-Organization, Computational Maps and Motor Control, volume 119 of Advances in Psychology, pages 117–147, North-Holland.
Gibson, Edward and Wexler, Kenneth. 1994. Triggers. Linguistic Inquiry 25(3), 407–454.
Gold, E. Mark. 1967. Language identification in the limit. Information and Control 10(5), 447–474.
Goldberg, Adele. 2006. Constructions at Work: The Nature of Generalization in Language. New York: Oxford University Press.
Goldsmith, John. 2001. Unsupervised learning of the morphology of a natural language. Computational Linguistics 27(2), 153–198.
Goldsmith, John. 2006. An algorithm for the unsupervised learning of morphology. Natural Language Engineering 12(04), 353–371.
Goldwater, Sharon. 2006. Nonparametric Bayesian Models of Lexical Acquisition. Ph.D. thesis, Brown University.
Goldwater, Sharon, Griffiths, Thomas L. and Johnson, Mark. 2009. A Bayesian framework for word segmentation: Exploring the effects of context. Cognition 112, 21–54.
Graf Estes, Katharine, Evans, Julia L., Alibali, Martha W. and Saffran, Jenny R. 2007. Can Infants Map Meaning to Newly Segmented Words? Statistical Segmentation and Word Learning. Psychological Science 18(3), 254–260.
Greenberg, J. H. and Jenkins, J. J. 1964. Studies in the psychological correlates of the sound system of American English. Word 20, 157–177.
Guasti, Maria Teresa. 2002. Language Acquisition: The Growth of Grammar. Cambridge, MA: The MIT Press.
Hafer, Margaret A. and Weiss, Stephen F. 1974. Word Segmentation by Letter Successor Varieties. Information Storage and Retrieval 10(11-12), 371–385.
Halevy, Alon, Norvig, Peter and Pereira, Fernando. 2009. The Unreasonable Effectiveness of Data. IEEE Intelligent Systems 24(2), 8–12.
Hansen, Lars Kai and Salamon, Peter. 1990. Neural Network Ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence 12, 993–1001.
Harrington, Jonathan, Palethorpe, Sallyanne and Watson, Catherine I. 2000. Does the Queen speak the Queen's English? Nature 408(6815), 927.
Harris, John and Lindsey, Geoff. 1995. The elements of phonological representation. In Jacques Durand and Francis Katamba (eds.), Frontiers of phonology: atoms, structures, derivations, pages 34–79, Harlow, Essex: Longman.
Harris, Zellig. 1951. Methods in Structural Linguistics. University of Chicago Press.
Harris, Zellig S. 1955. From Phoneme to Morpheme. Language 31(2), 190–222.
Hauser, Marc D., Chomsky, Noam and Fitch, W. Tecumseh. 2002. The faculty of language: what is it, who has it, and how did it evolve? Science 298(5598), 1569–1579.
Hauser, Marc D., Newport, Elissa L. and Aslin, Richard N. 2001. Segmentation of the speech stream in a non-human primate: statistical learning in cotton-top tamarins. Cognition 78(3), B53–B64.
Hendriks, Petra, Siekman, Irene, Smits, Erik-Jan and Spenader, Jennifer. 2007. Pronouns in competition: Predicting acquisition delays cross-linguistically. In Dagmar Bittner and Natalia Gagarina (eds.), ZAS Papers in Linguistics, Volume 48 (Intersentential Pronominal Reference in Child and Adult Language. Proceedings of the Conference on Intersentential Pronominal Reference in Child and Adult Language), pages 75–101.
Higginson, Roy Patrick. 1985. Fixing-assimilation in language acquisition. Ph.D. thesis, Washington State University.
Hockema, Stephen A. 2006. Finding Words in Speech: An Investigation of American English. Language Learning and Development 2(2), 119–146.
Hollich, George, Newman, Rochelle S. and Jusczyk, Peter W. 2005. Infants' Use of Synchronized Visual Information to Separate Streams of Speech. Child Development 76(3), 598–613.
Hopcroft, John E., Motwani, Rajeev and Ullman, Jeffrey D. 2001. Introduction to automata theory, languages, and computation. Addison-Wesley, second edition.
Horning, James Jay. 1969. A study of grammatical inference. Ph.D. thesis, Computer Science Department, Stanford University.
Huang, Jin Hu and Powers, David. 2003. Chinese Word Segmentation based on Contextual Entropy. In Proceedings of the Pacific Asia Conference on Language, Information and Computation, pages 121–127.
Hubel, D. H. and Wiesel, T. N. 1970. The period of susceptibility to the physiological effects of unilateral eye closure in kittens. The Journal of Physiology 206, 419–436.
Jain, Sanjay, Osherson, Daniel, Royer, James S. and Sharma, Arun. 1999. Systems That Learn: an introduction to learning theory. Cambridge, MA: The MIT Press, second edition.
Johnson, Elizabeth K. and Jusczyk, Peter W. 2001. Word Segmentation by 8-Month-Olds: When Speech Cues Count More Than Statistics. Journal of Memory and Language 44(4), 548–567.
Johnson, Elizabeth K. and Seidl, Amanda H. 2009. At 11 months, prosody still outranks statistics. Developmental Science 12(1), 131–141.
Jones, Peter E. 1995. Contradictions and Unanswered Questions in the Genie Case: A Fresh Look at the Linguistic Evidence. Language and Communication 15(3), 261–280.
Jurafsky, Daniel and Martin, James H. 2008. Speech and Language Processing. Prentice Hall, second edition.
Jusczyk, Peter W. 1999. How infants begin to extract words from speech. Trends in Cognitive Sciences 3(9), 323–328.
Jusczyk, Peter W., Friederici, Angela D., Wessels, Jeanine M. I., Svenkerud, Vigdis Y. and Jusczyk, Ann Marie. 1993. Infants' sensitivity to the sound patterns of native language words. Journal of Memory and Language 32(3), 402–420.
Jusczyk, Peter W., Hohne, Elizabeth A. and Bauman, Angela. 1999a. Infants' sensitivity to allophonic cues for word segmentation. Perception and Psychophysics 61(8), 1465–1476.
Jusczyk, Peter W., Houston, Derek M. and Newsome, Mary. 1999b. The Beginnings of Word Segmentation in English-Learning Infants. Cognitive Psychology 39, 159–207.
Jusczyk, Peter W., Kennedy, Lori J. and Jusczyk, Ann Marie. 1995. Young infants' retention of information about syllables. Infant Behavior and Development 18(1), 27–41.
Jusczyk, Peter W., Luce, Paul A. and Charles-Luce, Jan. 1994. Infants' Sensitivity to Phonotactic Patterns in the Native Language. Journal of Memory and Language 33(5), 630–645.
Jusczyk, Peter W. and Thompson, Elizabeth. 1978. Perception of a phonetic contrast in multisyllabic utterances by 2-month-old infants. Perception and Psychophysics 23(2), 105–109.
Kanazawa, Makoto. 1996. Identification in the limit of categorial grammars. Journal of Logic, Language and Information 5(2), 115–155.
Kearns, Michael J. and Vazirani, Umesh V. 1994. An introduction to computational learning theory. Cambridge, MA: The MIT Press.
Kempe, André. 1999. Experiments in Unsupervised Entropy-Based Corpus Segmentation. In Proceedings of the Workshop on Computational Natural Language Learning (CoNLL'99), pages 7–13, Bergen, Norway.
Kirkham, Natasha Z., Slemmer, Jonathan A. and Johnson, Scott P. 2002. Visual statistical learning in infancy: evidence for a domain general learning mechanism. Cognition 83(2), B35–B42.
Klein, Dan and Manning, Christopher D. 2004. Corpus-based induction of syntactic structure: Models of dependency and constituency. In Proceedings of the 42nd Annual Meeting of the ACL, pages 479–486.
Klein, Dan and Manning, Christopher D. 2005. Natural language grammar induction with a generative constituent-context model. Pattern Recognition 38(9), 1407–1419.
Knuth, Donald. 1997. The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley, third edition.
Kording, Konrad P., Beierholm, Ulrik, Ma, Wei Ji, Quartz, Steven, Tenenbaum, Joshua B. and Shams, Ladan. 2007. Causal Inference in Multisensory Perception. PLoS ONE 2(9), e943.
Korman, Myron. 1984. Adaptive aspects of maternal vocalizations in differing contexts at ten weeks. First Language 5, 44–45.
Krueger, Kai A. and Dayan, Peter. 2009. Flexible shaping: How learning in small steps helps. Cognition 110(3), 380–394.
Landy, Michael S., Maloney, Laurence T., Johnston, Elizabeth B. and Young, Mark. 1995. Measurement and Modeling of Depth Cue Combination: in Defense of Weak Fusion. Vision Research 35(3), 389–412.
Lappin, Shalom and Shieber, Stuart M. 2007. Machine Learning Theory and Practice as a Source of Insight into Universal Grammar. Journal of Linguistics 43(2), 393–427.
Lenneberg, Eric H. 1967. Biological foundations of language. New York: John Wiley and Sons.
Littlestone, Nick and Warmuth, Manfred K. 1994. The Weighted Majority Algorithm. Information and Computation 108(2), 212–261.
MacWhinney, Brian. 1982. Basic syntactic processes. In S. Kuczaj (ed.), Language development (Vol. 1): Syntax and semantics, pages 73–136, Hillsdale, NJ: Lawrence Erlbaum.
MacWhinney, Brian. 2010. Computational models of child language learning: an introduction. Journal of Child Language 37(Special Issue 03), 477–485.
MacWhinney, Brian and Snow, Catherine. 1985. The child language data exchange system. Journal of Child Language 12(2).
Marcus, Gary F. 1993. Negative evidence in language acquisition. Cognition 46(1), 53–85.
Marcus, Gary F., Pinker, Steven, Ullman, Michael, Hollander, Michelle, Rosen, T. John and Xu, Fei. 1992. Overregularization in Language Acquisition. University of Chicago Press.
Markman, Ellen M. 1989. Categorization and naming in children. Cambridge, MA: The MIT Press.
Marr, David. 1982. Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. New York: Freeman.
Marslen-Wilson, William D. 1987. Functional parallelism in spoken word-recognition. Cognition 25(1-2), 71–102.
Marslen-Wilson, William D. and Welsh, Alan. 1978. Processing interactions and lexical access during word recognition in continuous speech. Cognitive Psychology 10(1), 29–63.
Mattys, Sven L. 2004. Stress Versus Coarticulation: Toward an Integrated Approach to Explicit Speech Segmentation. Journal of Experimental Psychology: Human Perception and Performance 30(2), 397–408.
Mattys, Sven L., Melhorn, James F. and White, Laurence. 2007. Effects of Syntactic Expectations on Speech Segmentation. Journal of Experimental Psychology: Human Perception and Performance 33(4), 960–977.
Mattys, Sven L., White, Laurence and Melhorn, James F. 2005. Integration of Multiple Speech Segmentation Cues: A Hierarchical Framework. Journal of Experimental Psychology: General 134(4), 477–500.
Mazuka, Reiko. 2007. The rhythm-based prosodic bootstrapping hypothesis of early language acquisition: Does it work for learning for all languages? Journal of the Linguistic Society of Japan 132, 1–13.
McClelland, James L. and Elman, Jeffrey L. 1986. The TRACE model of speech perception. Cognitive Psychology 18(1), 1–86.
McGilvray, James. 2006. On the Innateness of Language. In Robert J. Stainton (ed.), Contemporary Debates in Cognitive Science, Contemporary Debates in Philosophy, Chapter 5, pages 97–112, Oxford: Wiley-Blackwell.
Mehler, Jacques, Jusczyk, Peter, Lambertz, Ghislaine, Halsted, Nilofar, Bertoncini, Josiane and Amiel-Tison, Claudine. 1988. A precursor of language acquisition in young infants. Cognition 29(2), 143–178.
Miller, George A. 1956. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review 63(2), 81–97.
Miller, George A. 1996. The science of words. New York: Scientific American Library.
Miller, George A. 2003. The cognitive revolution: a historical perspective. Trends in Cognitive Sciences 7(3), 141–144.
Monaghan, Padraic and Christiansen, Morten H. 2010. Words in puddles of sound: modelling psycholinguistic effects in speech segmentation. Journal of Child Language 37(Special Issue 03), 545–564.
Moon, Christine, Cooper, Robin Panneton and Fifer, William P. 1993. Two-day-olds prefer their native language. Infant Behavior and Development 16(4), 495–500.
Narasimhamurthy, Anand. 2005. Theoretical Bounds of Majority Voting Performance for a Binary Classification Problem. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 1988–1995.
Nazzi, Thierry, Iakimova, Galina, Bertoncini, Josiane, Frédonie, Séverine and Alcantara, Carmela. 2006. Early segmentation of fluent speech by infants acquiring French: Emerging evidence for crosslinguistic differences. Journal of Memory and Language 54(3), 283–299.
Nespor, Marina and Vogel, Irene. 1986. Prosodic Phonology. Dordrecht, NL: Foris Publications.
Newmeyer, Frederick J. 2004. Against a parameter-setting approach to typological variation. Linguistic Variation Yearbook 4(1), 181–234.
Newmeyer, Frederick J. 2006. A rejoinder to 'On the role of parameters in Universal Grammar: A reply to Newmeyer' by Ian Roberts & Anders Holmberg, unpublished manuscript, lingBuzz/000248.
Newport, Elissa L. 1988. Constraints on learning and their role in language acquisition: Studies of the acquisition of American sign language. Language Sciences 10(1), 147–172.
Newport, Elissa L. 1990. Maturational constraints on language learning. Cognitive Science 14, 11–28.
Newport, Elissa L. 1993. Maturational constraints on language learning. In Paul Bloom (ed.), Language Acquisition, pages 543–560, Cambridge, MA: The MIT Press.
Newport, Elissa L. and Aslin, Richard N. 2004. Learning at a distance: I. Statistical learning of non-adjacent dependencies. Cognitive Psychology 48(2), 127–162.
Nickerson, Raymond S. 1997. Confirmation Bias: A Ubiquitous Phenomenon in Many Guises. Review of General Psychology 2(2), 175–220.
Niyogi, Partha. 2006. The Computational Nature of Language Learning and Evolution. Cambridge, MA: The MIT Press.
Norris, Dennis. 1994. Shortlist: a connectionist model of continuous speech recognition. Cognition 52(3), 189–234.
Nowak, Martin A. and Komarova, Natalia L. 2001. Towards an evolutionary theory of language. Trends in Cognitive Sciences 5(7), 288–295.
Nowak, Martin A., Komarova, Natalia L. and Niyogi, Partha. 2001. Evolution of Universal Grammar. Science 291, 114–118.
Nowak, Martin A., Komarova, Natalia L. and Niyogi, Partha. 2002. Computational and evolutionary aspects of language. Nature 417(6889), 611.
Nowak, Martin A., Plotkin, Joshua B. and Jansen, Vincent A. A. 2000. The evolution of syntactic communication. Nature 404, 495–498.
Olivier, Donald C. 1968. Stochastic grammars and language acquisition mechanisms. Ph.D. thesis, Harvard University.
Omar, Margaret K. 1973. The acquisition of Egyptian Arabic as a native language. The Hague, NL: Mouton.
Osherson, Daniel N., Stob, Michael and Weinstein, Scott. 1984. Learning theory and natural language. Cognition 17(1), 1–28.
Pelucchi, Bruna, Hay, Jessica F. and Saffran, Jenny R. 2009. Learning in reverse: Eight-month-old infants track backward transitional probabilities. Cognition 113(2), 244–247.
Perruchet, Pierre and Desaulty, Stéphane. 2008. A role for backward transitional probabilities in word segmentation? Memory and Cognition 36(7), 1299–1305.
Perruchet, Pierre and Vinter, Annie. 1998. PARSER: A Model for Word Segmentation. Journal of Memory and Language 39(2), 246–263.
Pike, Kenneth L. 1945. The intonation of American English. Ann Arbor: University of Michigan Press.
Pinker, Steven. 1989. Learnability and cognition: the acquisition of argument structure. Cambridge, MA: The MIT Press.
Pinker, Steven. 1994. The Language Instinct: the new science of language and mind. Penguin Books.
Pinker, Steven. 1995. Language Acquisition. In Lila R. Gleitman (ed.), An Invitation to Cognitive Science, Vol. 1: Language, Chapter 6, pages 135–182, Cambridge, MA: The MIT Press.
Polka, Linda and Sundara, Megha. 2003. Word segmentation in monolingual and bilingual infant learners of English and French. In M. J. Solé, D. Recasens and J. Romero (eds.), Proceedings of the 15th International Congress of Phonetic Sciences, pages 1021–1024.
Pollard, Carl and Sag, Ivan A. 1987. Information-based Syntax and Semantics, volume 1: Fundamentals. Stanford: CSLI Publications.
Pons, Ferran. 2006. The effects of distributional learning on rats' sensitivity to phonetic information. Journal of Experimental Psychology: Animal Behavior Processes 32(1), 97–101.
Prince, Alan and Smolensky, Paul. 1993/2004. Optimality Theory: Constraint Interaction in Generative Grammar. Blackwell.
Pullum, Geoffrey K. and Scholz, Barbara C. 2002. Empirical assessment of stimulus poverty arguments. Linguistic Review 19(1/2), 9.
Pullum, Geoffrey K. and Scholz, Barbara C. 2010. Recursion and the infinitude claim. In Harry van der Hulst (ed.), Recursion in Human Language, Studies in Generative Grammar, No. 104, pages 113–138, Berlin: Mouton de Gruyter.
Quine, Willard Van Orman. 1960. Word and object. Cambridge, MA: The MIT Press.
Ramus, Franck. 2002. Acoustic correlates of linguistic rhythm: Perspectives. In Proceedings of Speech Prosody 2002, pages 115–120.
Reznick, J. Steven and Goldfield, Beverly A. 1992. Rapid change in lexical development in comprehension and production. Developmental Psychology 28(3), 406–413.
Rissanen, J. 1978. Modeling by shortest data description. Automatica 14(5), 465–471.
Roach, Peter. 1982. On the distinction between "stress-timed" and "syllable-timed" languages. In David Crystal (ed.), Linguistic controversies: Essays in Honour of F. R. Palmer, pages 73–79, London: Arnold.
Rollins, Pamela R., Pan, Barbara A., Conti-Ramsden, Gina and Snow, Catherine E. 1994. Communicative skills in children with specific language impairments: A comparison with their language-matched siblings. Journal of Communication Disorders 27(2), 189–206.
Rubin, Philip, Turvey, M. T. and Van Gelder, Peter. 1976. Initial phonemes are detected faster in spoken words than in spoken nonwords. Perception and Psychophysics 19(5), 394–398.
Rumelhart, David E. and McClelland, James L. 1986. On learning the past tenses of English verbs. In James L. McClelland and David E. Rumelhart (eds.), Parallel Distributed Processing Vol 2, pages 216–271, Cambridge, MA: The MIT Press.
Rymer, Russ. 1994. Genie: A Scientific Tragedy. Harper Paperbacks.
Rytting, C. Anton, Brew, Chris and Fosler-Lussier, Eric. 2010. Segmenting words from natural speech: subsegmental variation in segmental cues. Journal of Child Language 37(Special Issue 03), 513–543.
Saffran, Jenny R., Aslin, Richard N. and Newport, Elissa L. 1996a. Statistical learning by 8-month old infants. Science 274(5294), 1926–1928.
Saffran, Jenny R., Johnson, Elizabeth K., Aslin, Richard N. and Newport, Elissa L. 1999. Statistical learning of tone sequences by human infants and adults. Cognition 70(1), 27–52.
Saffran, Jenny R., Newport, Elissa L. and Aslin, Richard N. 1996b. Word Segmentation: The Role of Distributional Cues. Journal of Memory and Language 35(4), 606–621.
Sampson, Geoffrey. 1999. Educating Eve: the 'language instinct' debate. Open linguistics series, Cassell.
Sanders, Lisa D. and Neville, Helen J. 2000. Lexical, Syntactic, and Stress-Pattern Cues for Speech Segmentation. Journal of Speech, Language and Hearing Research 43(6), 1301–1321.
Sapir, Edward. 1921. Language: An introduction to the study of speech. New York: Harcourt, Brace and company.
Saxton, Matthew. 2010. Child Language: Acquisition and Development. SAGE Publications.
Schmid, Monika S. 2009. On L1 attrition and the linguistic system. In Leah Roberts, Georges Daniel Véronique, Anna Nilsson and Marion Tellier (eds.), EUROSLA Yearbook, volume 9, pages 212–244, John Benjamins.
Scholz, Barbara C. and Pullum, Geoffrey K. 2006. Irrational nativist exuberance. In Robert Stainton (ed.), Contemporary Debates in Cognitive Science, pages 59–80, Oxford: Basil Blackwell.
Shannon, C. E. 1948. A Mathematical Theory of Communication. Bell System Technical Journal 27, 379–423, 623–656.
Shieber, Stuart M. 1985. Evidence against the context-freeness of natural language. Linguistics and Philosophy 8, 333–343.
Shillcock, Richard. 1995. Lexical Hypotheses in Continuous Speech. In Gerry T. M. Altmann (ed.), Cognitive Models of Speech Processing, Cambridge, MA: The MIT Press.
Shinohara, T. 1994. Rich Classes Inferrable from Positive Data: Length-Bounded Elementary Formal Systems. Information and Computation 108(2), 175–186.
Shukla, Mohinish, Nespor, Marina and Mehler, Jacques. 2007. An interaction between prosody and statistics in the segmentation of fluent speech. Cognitive Psychology 54(1), 1–32.
Siskind, Jeffrey Mark. 1996. A computational study of cross-situational techniques for learning word-to-meaning mappings. Cognition 61, 39–91.
Slis, I. H. 1970. Articulatory measurements on voiced, voiceless and nasal consonants. Phonetica 21, 193–210.
Snow, Catherine E. 1972. Mothers' Speech to Children Learning Language. Child Development 43(2), 549–565.
Soderstrom, Melanie, Blossom, Megan, Foygel, Rina and Morgan, James L. 2008. Acoustical cues and grammatical units in speech to two preverbal infants. Journal of Child Language 35, 869–902.
Soderstrom, Melanie, Seidl, Amanda, Nelson, Deborah G. Kemler and Jusczyk, Peter W. 2003. The prosodic bootstrapping of phrases: Evidence from prelinguistic infants. Journal of Memory and Language 49(2), 249–267.
Solan, Zach, Horn, David, Ruppin, Eytan and Edelman, Shimon. 2005. Unsupervised learning of natural languages. Proceedings of the National Academy of Sciences 102(33), 11629–11634.
Solovay, Robert M. and Strassen, Volker. 1977. A fast Monte-Carlo test for primality. SIAM Journal on Computing 6(1), 84–85.
Steedman, Mark. 2000. The Syntactic Process. Cambridge, MA: The MIT Press.
Stein, Benno and Potthast, Martin. 2008. Putting Successor Variety Stemming to Work. In Reinhold Decker and Hans J. Lenz (eds.), Advances in Data Analysis, pages 367–374, Springer.
Stoianov, Ivelin and Nerbonne, John. 2000. Exploring Phonotactics with Simple Recurrent Networks. In Frank van Eynde, Ineke Schuurman and Ness Schelkens (eds.), Proceedings of Computational Linguistics in the Netherlands 1999, pages 51–67.
Suomi, Kari, McQueen, James M. and Cutler, Anne. 1997. Vowel Harmony and Speech Segmentation in Finnish. Journal of Memory and Language 36(3), 422–444.
Swingley, Daniel. 2005. Statistical clustering and the contents of the infant vocabulary. Cognitive Psychology 50(1), 86–132.
Tanenhaus, Michael K., Spivey-Knowlton, Michael J., Eberhard, Kathleen M. and Sedivy, Julie C. 1995. Integration of Visual and Linguistic Information in Spoken Language Comprehension. Science 268(5217), 1632–1634.
Templin, Mildred C. 1957. Certain Language Skills in Children: Their Development and Interrelationships. University of Minnesota Press.
Thiessen, Erik D. and Saffran, Jenny R. 2003. When Cues Collide: Use of Stress and Statistical Cues to Word Boundaries by 7- to 9-Month-Old Infants. Developmental Psychology 39(4), 706–716.
Thiessen, Erik D. and Saffran, Jenny R. 2004. Infants' Acquisition of Stress-Based Word Segmentation Strategies. In BUCLD 28: Proceedings of the 28th annual Boston University Conference on Language Development, pages 608–619.
Thiessen, Erik D. and Saffran, Jenny R. 2007. Learning to Learn: Infants' Acquisition of Stress-Based Strategies for Word Segmentation. Language Learning and Development 3(1), 73–100.
Thompson, Susan P. and Newport, Elissa L. 2007. Statistical Learning of Syntax: The Role of Transitional Probability. Language Learning and Development 3(1), 1–42.
Tomasello, Michael. 2000. The item-based nature of children's early syntactic development. Trends in Cognitive Sciences 4(4), 156–163.
Tomasello, Michael. 2001. Perceiving intentions and learning words in the second year of life. In M. Bowerman and S. C. Levinson (eds.), Language acquisition and conceptual development, pages 132–158, Cambridge, UK: Cambridge University Press.
Tomasello, Michael. 2003. Constructing a Language: A Usage-Based Theory of Language Acquisition. Harvard University Press.
Tomasello, Michael. 2009. The usage-based theory of language acquisition. In Edith L. Bavin (ed.), The Cambridge Handbook of Child Language, Cambridge, UK: Cambridge University Press.
Trask, R. Larry. 2002. Review of The Atoms of Language: The Mind's Hidden Rules of Grammar by Mark C. Baker. The Human Nature Review 2, 77–81.
Tribus, M. and McIrvine, E. C. 1971. Energy and information. Scientific American 224, 178–184.
Valiant, Leslie G. 1984. A theory of the learnable. Communications of the ACM 27(11), 1134–1142.
van der Sandt, Rob. 2005. Gavagai with Peppers. Speculative Grammarian (SpecGram) CL(3). SpecGram is a 'parody science' journal of linguistics.
van Kampen, Anja, Parmaksiz, Güliz, van de Vijver, Ruben and Höhle, Barbara. 2008. Metrical and Statistical Cues for Word Segmentation: The Use of Vowel Harmony and Word Stress as Cues to Word Boundaries by 6- and 9-Month-old Turkish Learners. In Anna Gavarro and M. Joao Freitas (eds.), Language Acquisition and Development: Proceedings of GALA 2007, pages 313–324.
van Rij, Jacolien, van Rijn, Hedderik and Hendriks, Petra. 2010. Cognitive architectures and language acquisition: A case study in pronoun comprehension. Journal of Child Language 37(03), 731–766.
van Rijsbergen, C. J. 1979. Information Retrieval. Butterworth-Heinemann, second edition.
Vapnik, V. N. and Chervonenkis, A. Ya. 1971. On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities. Theory of Probability and its Applications 16(2), 264–280.
Venkataraman, Anand. 2001. A Statistical Model for Word Discovery in Transcribed Speech. Computational Linguistics 27(3), 351–372.
Vroomen, Jean, van Zon, Monique and de Gelder, Beatrice. 1996. Cues to speech segmentation: Evidence from juncture misperceptions and word spotting. Memory and Cognition 24(6), 744–755.
Warren, Richard M. 1970. Perceptual Restoration of Missing Speech Sounds. Science 167(3917), 392–393.
Werker, Janet F. and Tees, Richard C. 1984. Cross-language speech perception: Evidence for perceptual reorganization during the first year of life. Infant Behavior and Development 7(1), 49–63.
Wightman, Colin W., Shattuck-Hufnagel, Stefanie, Ostendorf, Mari and Price, Patti J. 1992. Segmental durations in the vicinity of prosodic phrase boundaries. The Journal of the Acoustical Society of America 91(3), 1707–1717.
Wilson, Michael. 1988. MRC Psycholinguistic Database: Machine-usable dictionary, version 2.00. Behavior Research Methods, Instruments, and Computers 20(1), 6–10.
Wolpert, David H. and Macready, William G. 1997. No Free Lunch Theorems for Optimization. IEEE Transactions on Evolutionary Computation 1(1), 67–82.
Xu, Fei and Tenenbaum, Joshua B. 2007. Word learning as Bayesian inference. Psychological Review 114(2), 245–272.
Yang, Charles. 2002. Knowledge and Learning in Natural Language. New York: Oxford University Press.
Yang, Charles D. 2004. Universal Grammar, statistics or both? Trends in Cognitive Sciences 8(10), 451–456.
Yao, Xuchen, Ma, Jianqiang, Duarte, Sergio and Çöltekin, Çağrı. 2009. An Inference-rules based Categorial Grammar Learner for Simulating Language Acquisition. In Proceedings of the 18th Annual Belgian-Dutch Conference on Machine Learning, Tilburg.
Zhikov, Valentin, Takamura, Hiroya and Okumura, Manabu. 2010. An efficient algorithm for unsupervised word segmentation with branching entropy and MDL. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP '10, pages 832–842, Stroudsburg, PA, USA: Association for Computational Linguistics.
Zipf, George K. 1935/1965. The Psychobiology of Language. Cambridge, MA: The MIT Press.

Short summary

Segmenting continuous speech into lexical units is one of the early tasks an infant needs to tackle during language acquisition. This thesis investigates this particular problem, segmentation, by means of computational modeling and simulations. The segmentation problem is more difficult than it may appear at first sight. Children need to find words in a continuous stream of speech, with no knowledge of words to start with. Fortunately, experimental studies reveal that children and adults use a number of cues in the input, and simple strategies that exploit these cues, in order to segment the speech. More interestingly, some of these cues are language independent, allowing a learner to segment the continuous input before knowing any words.

Two major aspects set the models presented in this thesis apart from other computational models in the literature. First, the models presented here use simple local strategies (as opposed to global optimization) that rely on cues known to be used by children, namely, predictability statistics, phonotactics and lexical stress. Second, these cues are combined using an explicit cue-combination model which can easily be extended to include more cues. The models are tested using real-world transcribed child-directed speech. The simulation results show that the performance of the individual strategies is comparable to the state-of-the-art computational models of segmentation. Furthermore, combinations of individual cues provide a consistent increase in performance. The combined model performs on a par with the reference state-of-the-art model, while employing only mechanisms more similar to those available to humans performing the same task.

The dissertation starts with a general introduction to the problem of language acquisition, and the difficulties and disagreements in the field. No work in language acquisition can be complete without mentioning the central debate of the field between nativism and empiricism. It is rare for a work in language acquisition not to take a clear side in this debate. The philosophical debate is intriguing. However, as I argue in Chapter 2 and Chapter 3, there seem to be no scientific criteria that would decide in favor of one or the other side, at least not until more physiological evidence becomes available. Furthermore, although it is common practice in the field to take one of these viewpoints a priori, it is not necessarily fruitful. The problem of language acquisition is a fascinating research topic in itself, without its implications for the nature–nurture debate. Like many other questions of the cognitive sciences, understanding language acquisition better may contribute to this bigger debate. However, forming research questions which assume one of the viewpoints of this unsettled debate runs the risk of committing the fallacy 'ignoratio elenchi', begging the question of whether nature or nurture is more important for language learning. This is unlikely to contribute to our scientific understanding of language acquisition.

As in many fields of science, modeling, particularly computational modeling, is one of the ways to study some of the questions regarding language acquisition. Modeling aids us in understanding natural phenomena better by (1) finding parallels between the natural phenomenon and the computational model, (2) testing hypotheses that are difficult or impossible to test directly, and (3) providing more insight into the problem by describing it in detail. Computational models can be studied either by mathematical analysis, or by simulations. Chapter 3 offers a closer look at these two methods. These methods are complementary. However, computational simulations allow easier modeling practices with regard to probabilistic aspects of the phenomenon to be modeled. In the case of language acquisition, for example, it is difficult to model the input to the child analytically. However, it is relatively easier to model the input with a sample of child-directed speech from adult–child conversations.

After surveying the related psycholinguistic research on segmentation in Chapter 4, the computational problem of segmentation, and the solutions offered in the literature, are discussed in Chapter 5. Besides providing a taxonomy of segmentation models and discussing the issues regarding evaluation of the models, Chapter 5 also defines a reference model similar to many state-of-the-art models. Some aspects of a computational model of segmentation that are discussed in Chapter 5 include (1) the way the input is modeled, e.g., whether syllables, phonemes, or phonetic features are assumed to be the basic units of the input stream, (2) whether the model guesses boundaries or lexical units, (3) whether the model requires a large amount of input at once (batch), or segments as the learning proceeds (incrementally), and (4) whether it discovers the lexical units by dividing larger chunks, e.g., utterances, or by combining basic units, e.g., phonemes. These aspects, together with the resource requirements of the models, are important considerations for their suitability as psychologically plausible models of segmentation.

The evaluation of any cognitive model is a rather difficult task. Ideally, we expect human-like performance from our models. However, depending on the modeling aims, we do not have enough data on human performance to come up with quantitative evaluation methods in most cases. As a result, all else being equal, we prefer the models that perform better, and we test the performance against a common gold standard. Fortunately, increasingly many recent models follow common quantitative measures of success (precision, recall and F-score) and use common reference corpora. This makes the comparison of different models relatively easy. However, comparing the models' performance is still not trivial. There are a few cases where the performance figures can easily be misinterpreted, and there are cases where using different measures may reveal more information. Chapter 5 points out common pitfalls in comparing different models' performance.
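To make these measures concrete, the sketch below computes boundary precision, recall and F-score for a predicted segmentation against a gold standard. Representing segmentations as sets of boundary positions is an illustrative choice, not the exact evaluation code used in the thesis.

    # A minimal sketch of boundary precision/recall/F-score, assuming
    # segmentations are given as sets of word-boundary positions within
    # an utterance (utterance edges excluded, as is common practice).

    def boundary_prf(predicted, gold):
        """Return (precision, recall, F-score) for two boundary sets."""
        true_pos = len(predicted & gold)
        precision = true_pos / len(predicted) if predicted else 1.0
        recall = true_pos / len(gold) if gold else 1.0
        f_score = (2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
        return precision, recall, f_score

    # 'isthatadoggy' -> 'is that a doggy' has gold boundaries after
    # positions 2, 6 and 7.
    print(boundary_prf(predicted={2, 4, 7}, gold={2, 6, 7}))
    # approximately (0.667, 0.667, 0.667)

Scores of this kind underlie the model comparisons discussed next, which is also where they can be misread.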
For example, it is easy to over-credit a batch model in comparison to an incremental model if we compare the results as they are typically presented in the literature.

Many state-of-the-art computational models of segmentation follow a strategy similar to the well-studied 'language models' of the computational linguistics literature. Instead of using one of these models of segmentation as a reference, Chapter 5 defines a simple model following the same strategy. The model (called LM in this thesis) shows performance similar to the state-of-the-art models in the literature, and it sets a high standard of performance to achieve. Besides the LM, a particular random segmentation model common in the computational segmentation literature is redefined here. Both models serve as reference models for the other models presented in the thesis.

The LM and the related models perform well on the segmentation task, and they are typically defined as explicit statistical models that are easy to reason with. However, this particular strategy does not follow what we know about human processing closely. A number of connectionist models follow human performance more closely; however, these systems tend to perform poorly, and interpreting what connectionist models learn is generally difficult. It seems we lack explicitly described segmentation models that follow what we know about the way humans segment continuous speech. This study tries to fill this gap by developing and testing explicit models that use the strategies believed to be used by humans in this task. Crucially, special emphasis is given to how these strategies combine to improve segmentation.

After setting the stage for the intended modeling exercise, the following three chapters investigate techniques based on three different strategies. The first strategy discussed in depth is segmentation using predictability statistics. In a nutshell, we know from the psycholinguistic literature that statistical regularities between consecutive phonemes or syllables are a cue used by infants in the segmentation task. This chapter starts by analyzing a set of measures commonly used for quantifying this notion, namely, transitional probability, mutual information, successor variety and entropy. The analysis reveals similarities and differences between these measures, but it also shows that all of these measures indicate something relevant to word boundaries. Furthermore, despite considerable overlap in what they measure, each seems to capture some relevant aspect of the input that is not covered by the others. This suggests that combining these measures may be more effective than picking the best of them. The rest of the chapter defines a simple algorithm to combine all of these measures, and shows (using simulations) that the combination indeed produces better results than using the individual measures alone. As well as investigating the usefulness of the predictability cue and different ways of quantifying predictability, the chapter also presents an example combination model that can be extended with different cues. The simulations with the model based on predictability statistics indicate that the model performs consistently better than the random results. Although the performance results are not as good as those of the reference model LM, they are not too far behind either.

Another language-independent strategy is the use of the information gathered from utterance boundaries for finding lexical unit boundaries.
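The following sketch illustrates one way such information could be exploited: phoneme n-grams observed at utterance beginnings and ends also mark likely word beginnings and ends inside utterances. The fixed n-gram length and the raw relative-frequency score are illustrative assumptions, not the estimator developed in Chapter 7.

    # A minimal sketch of the utterance-boundary cue, assuming utterances
    # are strings of phonemes: n-grams seen at utterance edges are taken
    # as evidence for word edges utterance-internally.

    from collections import Counter

    def train_edge_ngrams(utterances, n=2):
        """Count phoneme n-grams at utterance beginnings and ends."""
        starts = Counter(u[:n] for u in utterances if len(u) >= n)
        ends = Counter(u[-n:] for u in utterances if len(u) >= n)
        return starts, ends

    def boundary_score(utterance, i, starts, ends, n=2):
        """Score position i: does the left context look like a word end
        and the right context like a word beginning?"""
        left, right = utterance[max(0, i - n):i], utterance[i:i + n]
        total = sum(starts.values()) or 1
        return (ends[left] + starts[right]) / (2 * total)

    utterances = ["whatisthat", "thatisadoggy", "isthatyours"]
    starts, ends = train_edge_ngrams(utterances)
    scores = [boundary_score("whatisthat", i, starts, ends)
              for i in range(1, len("whatisthat"))]
    print(scores)  # peaks suggest likely word boundaries

Peaks in such scores can then enter the same cue-combination machinery as the predictability measures.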
The information from utterance boundaries is useful in two ways. First, utterance boundaries are also word boundaries, so it is possible to learn common word beginnings and endings from utterance beginnings and endings. Second, words in natural languages follow certain regularities, which may allow us to extract broader generalizations than we would if we only memorized the exact sequences that occur at utterance boundaries. Chapter 7 investigates the use of information extracted from utterance boundaries for segmentation. As demonstrated in this chapter, the analysis of real-world child-directed speech indicates that utterance boundaries are good predictors of word boundaries. The analysis in Chapter 7 also shows that paying attention to the sequences that occur at utterance boundaries not only allows one to detect words previously seen at utterance boundaries, but also some words that have never occurred at utterance boundaries. This indicates that a learner paying attention to utterance boundaries can learn generalizations about word structure as well as complete words that occur at utterance boundaries. These intuitions are also tested using computational simulations with a model similar to the one developed in Chapter 6. The model based on utterance boundaries alone performs even better than the model based on predictability. More importantly, the combination of both performs better than the individual models, getting closer to the performance of the reference model LM.

Once some lexical units are discovered using strategies similar to the ones discussed above, we can use the (language-specific) information contained in these units. Chapter 8 discusses segmentation strategies based on two such sources of information. First, a model that favors the use of already known words has been tested. The model, when combined with the previous two models, provides a consistent but small improvement in the overall performance. The second language-specific cue investigated is lexical stress. The results of the simulations with lexical stress are more difficult to interpret. Adding lexical stress as a cue seems to have an adverse effect on the overall performance. However, if the performance through time is examined more carefully, the effect of stress towards the end of the learning phase is positive. The difficulties in interpreting the contribution of stress are mainly due to the stress coding in the corpora used. Unfortunately, the available stress marking was rather crude, and unlikely to be representative of real-world data. Both language-specific cues are studied in relatively less depth in this study, which leaves quite some room for improvement. Nevertheless, the simulations in Chapter 8 demonstrate that once we start learning some lexical units, they may be helpful in finding other lexical units.

Chapter 9 gives a summary of the segmentation models presented in the thesis, provides further analysis and comparison of the models presented earlier, and discusses possible future directions.
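The lexical cue mentioned above can be illustrated in the same style: a boundary is more plausible where the material to its left matches a word learned earlier. The binary longest-match scoring below is an illustrative simplification, not the model tested in Chapter 8.

    # A minimal sketch of the lexical cue: a boundary is more plausible if
    # the phoneme sequence ending at that position matches an
    # already-learned word. The binary score is an illustrative choice.

    def lexicon_cue(utterance, i, lexicon, max_len=8):
        """Return 1.0 if some known word ends exactly at position i."""
        for k in range(1, min(i, max_len) + 1):
            if utterance[i - k:i] in lexicon:
                return 1.0
        return 0.0

    lexicon = {"is", "that", "a"}
    utt = "isthatadoggy"
    print([(i, lexicon_cue(utt, i, lexicon)) for i in range(1, len(utt))])
    # positions 2 ('is'), 6 ('that') and 7 ('a') receive positive scores

In the thesis, such a cue enters the combination alongside the language-general cues rather than deciding boundaries on its own.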

Samenvatting in het Nederlands

The segmentation of continuous speech into lexical units is one of the first skills a child must learn during language acquisition. This dissertation investigates segmentation by means of computational modeling and computational simulations. Segmentation is more difficult than it may seem at first sight: children have to find words in a continuous stream of speech without any knowledge of words. Fortunately, experimental studies show that children and adults use a number of cues from the input, together with simple strategies that exploit these cues, to segment speech. More interestingly, some of these cues are language-independent, so that a learner can segment continuous input before knowing a single word.

The models proposed in this dissertation differ from models in the literature in two important respects. First, they use local strategies (as opposed to global optimization) that exploit cues known to be used by children, namely predictability statistics, phonotactics and lexical stress. Second, these cues are combined by means of an explicit cue-combination model that can easily be extended with more cues.

The models were tested on real transcribed child-directed speech. The results of the simulations show that the individual strategies perform comparably to state-of-the-art computational models of segmentation. Moreover, combining the individual cues yields a consistent improvement in performance. The combined model performs as well as the state-of-the-art model used as a reference, while using only mechanisms that are more comparable to those available to humans performing the same task.

The dissertation begins with a general introduction to the problem of language acquisition and describes the difficulties and disputes in the field. It is rare for a work in this area not to take a position in the well-known nature–nurture debate. In Chapter 2 and Chapter 3, however, I argue that there are no scientific criteria that settle the debate in favor of either position, at least not until more physiological evidence becomes available. Even though it is customary in this field of research to choose one of these positions a priori, doing so is not necessarily useful.


As in many other fields of science, modeling, and computational modeling in particular, can be used to study questions concerning language acquisition. Computational models can be studied by means of mathematical analysis or computational simulations. Chapter 3 describes these complementary methods in more detail. Computational simulations, however, make it easier to model the probabilistic aspects of the phenomenon being modeled.

After a review of related psycholinguistic research on segmentation in Chapter 4, Chapter 5 discusses the problem of computational segmentation, together with the solutions proposed for it in the literature. Besides a taxonomy of segmentation models and a description of the evaluation methods, Chapter 5 presents a reference model that resembles many of the state-of-the-art models.

After this discussion of the modeling exercise, the following three chapters investigate the different strategies. Chapter 6 gives a thorough description of segmentation using predictability statistics. This strategy requires no initial language-specific knowledge, and humans have been shown to use it in this task. Another language-independent strategy uses information gathered at utterance boundaries to find the boundaries of lexical units. Chapter 7 investigates this strategy in depth, and shows that combining these two strategies yields better performance than either strategy on its own.

Once some lexical units have been discovered using the strategies described above, we can use the (language-specific) information contained in these units. Chapter 8 describes segmentation strategies based on two such sources of information: first, a model that favors the use of known words, and second, a model that makes use of lexical stress. Chapter 8 demonstrates that, once a learner has acquired some lexical units, these units are helpful in the further learning process.

Before concluding, Chapter 9 summarizes the segmentation models discussed in this dissertation, provides further analysis and comparison of these models, and speculates about possible future research.

A The Corpora

Two corpora of child-directed speech from the CHILDES database (MacWhinney and Snow, 1985) are used in this study. The first corpus, the BR corpus, was collected by Bernstein Ratner (1987) and phonemically transcribed and processed by Brent and Cartwright (1996) and Brent (1999a). The symbols used in the transcriptions are listed in Table A.1. Brent (1999a) describes the corpus as follows:

The speakers were nine mothers speaking freely to their children, whose ages averaged 18 months (range 13–21). In order to minimize the number of subjective judgments and the amount of labor required every word was transcribed the same way every time it occurred. Onomatopoeia (e.g., bang) and interjections (e.g., uh and oh) were removed for the following reasons: (1) They occur in isolation much more frequently than ordinary words, so they would have inflated performance scores; (2) their frequency is highly variable from speaker to speaker and transcriber to transcriber, so their presence would have increased the random variance in performance scores; and (3) there is no standard spelling or pronunciation for many of them, so we could not tell from the orthographic transcript what sound was actually uttered. The total corpus consisted of 9790 utterances, 33,387 words, and 95,809 phonemes. The average of 3.4 words per utterance is typical of spontaneous speech to young children. The average of 2.9 phonemes per word is not surprising for a transcription system like ours, where diphthongs, r-colored vowels (e.g. the “ar” of bar), and syllabic consonants (e.g., the second syllable of bottle) are each transcribed by a single symbol. These sounds are represented by two symbols in some transcription systems and sometimes more than two in English orthography.

This corpus has been used by many other computational studies of segmentation. The corpus is also distributed with the implementations of the models presented by Venkataraman (2001) and Goldwater et al. (2009). The copies of the corpus in these sources are identical, and the same copy has been used in this study without any modifications, except for 12 boundary mismatches between the segmentation of two words in the text version and the phonemic transcriptions. The phonemic transcriptions of 10 instances of the word /ebisi/ 'ABC' and two instances of the word /Enim%/ 'anymore' were modified to match the text version. In all cases, this meant removing the boundaries in the instances of /e bi si/ and /Eni m%/. The modifications were motivated by the need to match the words exactly with the stress patterns in the MRC database, for which the standard spellings were used as a key. Regardless of whether stress was used, the modified version of the BR corpus was used for all simulations reported in this thesis. The effect of this modification on the performance scores is negligible. For the experiments in Section 8.3, the BR corpus was annotated with stress information using a procedure similar to Christiansen et al. (1998). This process is described in detail in Section 8.3.2.

The second corpus used in this study was gathered from multiple sources in the CHILDES database. From all American English transcripts in the CHILDES database as of April 2011, the recording sessions in which the target children were less than a year old were combined. The resulting corpus was a partial combination of the following sections of CHILDES: Brent (Brent and Siskind, 2001), Higginson (Higginson, 1985), Providence (Demuth et al., 2006), Rollins (Rollins et al., 1994, the section of the corpus for normally developing children) and Soderstrom (Soderstrom et al., 2008). All child-directed utterances in these sessions were processed and converted to phonemic transcriptions following Brent (1999a). The resulting corpus contained 53,770 child-directed utterances for 24 different children recorded in 171 sessions. The ages of the children were between 0;6 and 0;11.29 (mean=9;11, sd=48 days). The ordering of the utterances within each session was kept intact, and the sessions were combined according to the age of the child, from younger to older.
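A hedged sketch of this assembly procedure is given below; the Session record and the phonemize helper are hypothetical stand-ins for the CHILDES processing pipeline, not an existing API.

```python
# Hypothetical sketch of the corpus assembly described above; Session and
# phonemize() are illustrative stand-ins, not part of any CHILDES tooling.
from dataclasses import dataclass
from typing import List

@dataclass
class Session:
    child_age_days: int    # target child's age at the time of recording
    utterances: List[str]  # child-directed utterances, in original order

def phonemize(utterance: str) -> str:
    # Stand-in for the orthography-to-phonemes conversion (after Brent, 1999a),
    # which in practice relies on a pronunciation dictionary.
    return utterance

def build_corpus(sessions: List[Session], max_age_days: int = 365) -> List[str]:
    eligible = [s for s in sessions if s.child_age_days < max_age_days]
    eligible.sort(key=lambda s: s.child_age_days)  # younger to older
    # Utterance order within a session is kept intact.
    return [phonemize(u) for s in eligible for u in s.utterances]
```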

(a) Consonants (symbol, example word):
D the    G jump   L bottle   M rhythm   N sing   S ship   T thin
W when   Z azure  b boy      c chip     d dog    f fox    g go
h hat    k cut    l lamp     m man      n net    p pipe   r run
s sit    t toy    v view     w we       y you    z zip    ~ button

(b) Vowels (symbol, example word):
& that   6 about  7 bOy   9 fly   A but   E bet   I bit   O law
Q bout   U put    a hot   e bay   i bee   o boat  u boot

(c) Vowels with 'r' (Rhotic Vowels) (symbol, example word):
# are   % for   ( here   ) lure   * hair   3 bird   R butter

Table A.1: The symbols used for phonemes in the BR corpus.
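For programmatic work with the corpus, the inventory in Table A.1 might be represented as a simple mapping; the fragment below is illustrative (the dictionary name is an assumption), showing a few entries from each class rather than the full table.

```python
# A fragment of Table A.1 as a Python mapping, useful when decoding BR
# corpus transcriptions; see the table above for the full inventory.
BR_PHONEMES = {
    # consonants
    "D": "the", "G": "jump", "N": "sing", "S": "ship", "T": "thin",
    "Z": "azure", "c": "chip", "~": "button",
    # vowels
    "&": "that", "6": "about", "9": "fly", "A": "but", "Q": "bout",
    # rhotic vowels
    "#": "are", "%": "for", "3": "bird", "R": "butter",
}
```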

B Detailed Performance Results

This appendix lists the detailed performance results, including true positive, false positive and false negative counts, for all relevant combinations of the models described in this thesis. Table B.1 presents performance scores for all models described in this thesis, and for all possible combinations of these models, on the BR corpus. The first part of the table presents the scores calculated over the complete learning process, and the second part presents the results for the last 290 utterances. Table B.2 presents the same scores for the setting in which the larger child-directed corpus described in Appendix A was used to collect prior statistics on phoneme n-grams. This table does not present results including stress, because the larger corpus has not been marked for stress. The details of using prior statistics are described in Section 9.3. The last set of results, in Table B.3, presents the scores for the larger collection of child-directed speech alone. As in Table B.2, the models requiring stress information could not be included in this table.
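For reference, the precision, recall and F-score columns in the tables follow directly from the listed true positive (TP), false positive (FP) and false negative (FN) counts; the following minimal sketch (not the thesis's evaluation code) makes the relation explicit.

```python
# How precision, recall and F-score relate to TP/FP/FN counts; the same
# formulas apply to boundary, word and lexicon scores alike.
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn) if tp + fn else 0.0

def f_score(tp: int, fp: int, fn: int) -> float:
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0
```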

Table B.1: Performance scores for all possible combinations of the models described in Chapters 6, 7 and 8 (P, U, W, S, PU, PW, PS, UW, US, WS, PUW, PUS, PWS, UWS, PUWS) and the two reference models (RM and LM) described in Chapter 5. The first block reports the performance or error measures for the complete BR corpus, and the second block reports the results for the last 290 utterances. The measures are described in Chapter 5.

Table B.2: The performance results with prior data: detailed performance measures for all possible combinations of the models described in Chapters 6, 7 and 8 (P, U, W, PU, PW, UW, PUW) and the two reference models (RM and LM) described in Chapter 5. The first block reports the performance or error measures for the complete BR corpus, and the second block reports the results for the last 290 utterances. The measures are described in Chapter 5.

Table B.3: The performance for the larger child-directed speech corpus (see Appendix A for a description of the corpus): detailed performance measures for all possible combinations of the models described in Chapters 6, 7 and 8 (P, U, W, PU, PW, UW, PUW) and the two reference models (RM and LM) described in Chapter 5. The first block reports the performance or error measures for the complete corpus, and the second block reports the results for the last 270 utterances. The measures are described in Chapter 5.

Groningen Dissertations in Linguistics (GRODIL)

1. Henriëtte de Swart (1991). Adverbs of Quantification: A Generalized Quantifier Approach.
2. Eric Hoekstra (1991). Licensing Conditions on Phrase Structure.
3. Dicky Gilbers (1992). Phonological Networks. A Theory of Segment Representation.
4. Helen de Hoop (1992). Case Configuration and Noun Phrase Interpretation.
5. Gosse Bouma (1993). Nonmonotonicity and Categorial Unification Grammar.
6. Peter Blok (1993). The Interpretation of Focus: an epistemic approach to pragmatics.
7. Roelien Bastiaanse (1993). Studies in Aphasia.
8. Bert Bos (1993). Rapid User Interface Development with the Script Language Gist.
9. Wim Kosmeijer (1993). Barriers and Licensing.
10. Jan-Wouter Zwart (1993). Dutch Syntax: A Minimalist Approach.
11. Mark Kas (1993). Essays on Boolean Functions and Negative Polarity.
12. Ton van der Wouden (1994). Negative Contexts.
13. Joop Houtman (1994). Coordination and Constituency: A Study in Categorial Grammar.
14. Petra Hendriks (1995). Comparatives and Categorial Grammar.
15. Maarten de Wind (1995). Inversion in French.
16. Jelly Julia de Jong (1996). The Case of Bound Pronouns in Peripheral Romance.
17. Sjoukje van der Wal (1996). Negative Polarity Items and Negation: Tandem Acquisition.
18. Anastasia Giannakidou (1997). The Landscape of Polarity Items.
19. Karen Lattewitz (1997). Adjacency in Dutch and German.
20. Edith Kaan (1997). Processing Subject-Object Ambiguities in Dutch.
21. Henny Klein (1997). Adverbs of Degree in Dutch.
22. Leonie Bosveld-de Smet (1998). On Mass and Plural Quantification: The Case of French ‘des’/‘du’-NPs.
23. Rita Landeweerd (1998). Discourse Semantics of Perspective and Temporal Structure.
24. Mettina Veenstra (1998). Formalizing the Minimalist Program.
25. Roel Jonkers (1998). Comprehension and Production of Verbs in Aphasic Speakers.
26. Erik F. Tjong Kim Sang (1998). Machine Learning of Phonotactics.
27. Paulien Rijkhoek (1998). On Degree Phrases and Result Clauses.
28. Jan de Jong (1999). Specific Language Impairment in Dutch: Inflectional Morphology and Argument Structure.
29. H. Wee (1999). Definite Focus.
30. Eun-Hee Lee (2000). Dynamic and Stative Information in Temporal Reasoning: Korean Tense and Aspect in Discourse.
31. Ivilin Stoianov (2001). Connectionist Lexical Processing.
32. Klarien van der Linde (2001). Sonority Substitutions.
33. Monique Lamers (2001). Sentence Processing: Using Syntactic, Semantic, and Thematic Information.
34. Shalom Zuckerman (2001). The Acquisition of “Optional” Movement.


35. Rob Koeling (2001). Dialogue-Based Disambiguation: Using Dialogue Status to Improve Speech Understanding.
36. Esther Ruigendijk (2002). Case Assignment in Agrammatism: a Cross-linguistic Study.
37. Tony Mullen (2002). An Investigation into Compositional Features and Feature Merging for Maximum Entropy-Based Parse Selection.
38. Nanette Bienfait (2002). Grammatica-onderwijs aan allochtone jongeren.
39. Dirk-Bart den Ouden (2002). Phonology in Aphasia: Syllables and Segments in Level-specific Deficits.
40. Rienk Withaar (2002). The Role of the Phonological Loop in Sentence Comprehension.
41. Kim Sauter (2002). Transfer and Access to Universal Grammar in Adult Second Language Acquisition.
42. Laura Sabourin (2003). Grammatical Gender and Second Language Processing: An ERP Study.
43. Hein van Schie (2003). Visual Semantics.
44. Lilia Schürcks-Grozeva (2003). Binding and Bulgarian.
45. Stasinos Konstantopoulos (2003). Using ILP to Learn Local Linguistic Structures.
46. Wilbert Heeringa (2004). Measuring Dialect Pronunciation Differences using Levenshtein Distance.
47. Wouter Jansen (2004). Laryngeal Contrast and Phonetic Voicing: A Laboratory Phonology Approach to English, Hungarian and Dutch.
48. Judith Rispens (2004). Syntactic and Phonological Processing in Developmental Dyslexia.
49. Danielle Bougaïré (2004). L’approche communicative des campagnes de sensibilisation en santé publique au Burkina Faso: les cas de la planification familiale, du sida et de l’excision.
50. Tanja Gaustad (2004). Linguistic Knowledge and Word Sense Disambiguation.
51. Susanne Schoof (2004). An HPSG Account of Nonfinite Verbal Complements in Latin.
52. M. Begoña Villada Moirón (2005). Data-driven identification of fixed expressions and their modifiability.
53. Robbert Prins (2005). Finite-State Pre-Processing for Natural Language Analysis.
54. Leonoor van der Beek (2005). Topics in Corpus-Based Dutch Syntax.
55. Keiko Yoshioka (2005). Linguistic and gestural introduction and tracking of referents in L1 and L2 discourse.
56. Sible Andringa (2005). Form-focused instruction and the development of second language proficiency.
57. Joanneke Prenger (2005). Taal telt! Een onderzoek naar de rol van taalvaardigheid en tekstbegrip in het realistisch wiskundeonderwijs.
58. Neslihan Kansu-Yetkiner (2006). Blood, Shame and Fear: Self-Presentation Strategies of Turkish Women’s Talk about their Health and Sexuality.
59. Mónika Z. Zempléni (2006). Functional imaging of the hemispheric contribution to language processing.
60. Maartje Schreuder (2006). Prosodic Processes in Language and Music.
61. Hidetoshi Shiraishi (2006). Topics in Nivkh Phonology.
62. Tamás Biró (2006). Finding the Right Words: Implementing Optimality Theory with Simulated Annealing.
63. Dieuwke de Goede (2006). Verbs in Spoken Sentence Processing: Unraveling the Activation Pattern of the Matrix Verb.
64. Eleonora Rossi (2007). Clitic production in Italian agrammatism.
65. Holger Hopp (2007). Ultimate Attainment at the Interfaces in Second Language Acquisition: Grammar and Processing.
66. Gerlof Bouma (2008). Starting a Sentence in Dutch: A corpus study of subject- and object-fronting.
67. Julia Klitsch (2008). Open your eyes and listen carefully. Auditory and audiovisual speech perception and the McGurk effect in Dutch speakers with and without aphasia.
68. Janneke ter Beek (2008). Restructuring and Infinitival Complements in Dutch.
69. Jori Mur (2008). Off-line Answer Extraction for Question Answering.

70. Lonneke van der Plas (2008). Automatic Lexico-Semantic Acquisition for Question Answering.
71. Arjen Versloot (2008). Mechanisms of Language Change: Vowel reduction in 15th century West Frisian.
72. Ismail Fahmi (2009). Automatic term and Relation Extraction for Medical Question Answering System.
73. Tuba Yarbay Duman (2009). Turkish Agrammatic Aphasia: Word Order, Time Reference and Case.
74. Maria Trofimova (2009). Case Assignment by Prepositions in Russian Aphasia.
75. Rasmus Steinkrauss (2009). Frequency and Function in WH Question Acquisition. A Usage-Based Case Study of German L1 Acquisition.
76. Marjolein Deunk (2009). Discourse Practices in Preschool. Young Children’s Participation in Everyday Classroom Activities.
77. Sake Jager (2009). Towards ICT-Integrated Language Learning: Developing an Implementation Framework in terms of Pedagogy, Technology and Environment.
78. Francisco Dellatorre Borges (2010). Parse Selection with Support Vector Machines.
79. Geoffrey Andogah (2010). Geographically Constrained Information Retrieval.
80. Jacqueline van Kruiningen (2010). Onderwijsontwerp als conversatie. Probleemoplossing in interprofessioneel overleg.
81. Robert G. Shackleton (2010). Quantitative Assessment of English-American Speech Relationships.
82. Tim Van de Cruys (2010). Mining for Meaning: The Extraction of Lexico-semantic Knowledge from Text.
83. Therese Leinonen (2010). An Acoustic Analysis of Vowel Pronunciation in Swedish Dialects.
84. Erik-Jan Smits (2010). Acquiring Quantification. How Children Use Semantics and Pragmatics to Constrain Meaning.
85. Tal Caspi (2010). A Dynamic Perspective on Second Language Development.
86. Teodora Mehotcheva (2010). After the fiesta is over. Foreign language attrition of Spanish in Dutch and German Erasmus Students.
87. Xiaoyan Xu (2010). English language attrition and retention in Chinese and Dutch university students.
88. Jelena Prokić (2010). Families and Resemblances.
89. Radek Šimík (2011). Modal existential wh-constructions.
90. Katrien Colman (2011). Behavioral and neuroimaging studies on language processing in Dutch speakers with Parkinson’s disease.
91. Siti Mina Tamah (2011). A Study on Student Interaction in the Implementation of the Jigsaw Technique in Language Teaching.
92. Aletta Kwant (2011). Geraakt door prentenboeken. Effecten van het gebruik van prentenboeken op de sociaal-emotionele ontwikkeling van kleuters.
93. Marlies Kluck (2011). Sentence amalgamation.
94. Anja Schüppert (2011). Origin of asymmetry: Mutual intelligibility of spoken Danish and Swedish.
95. Peter Nabende (2011). Applying Dynamic Bayesian Networks in Transliteration Detection and Generation.
96. Barbara Plank (2011). Domain Adaptation for Parsing.
97. Çağrı Çöltekin (2011). Catching Words in a Stream of Speech: Computational simulations of segmenting transcribed child-directed speech.
98. Dörte Hessler (2011). Audiovisual Processing in Aphasic and Non-Brain-Damaged Listeners: The Whole is More than the Sum of its Parts.

GRODIL
Secretary of the Department of General Linguistics
Postbus 716
9700 AS Groningen
The Netherlands