<<

cmp-lg/9407013   15 Jul 94

The Acquisition of a Lexicon from Paired Phoneme Sequences and Semantic Representations

Carl de Marcken
cgdemarc@ai.mit.edu
MIT Artificial Intelligence Laboratory
Technology Square, Cambridge, MA, USA

Abstract

This paper is concerned with the machine acquisition of a lexicon from utterances, each of which consists of an unsegmented phoneme sequence paired with a semantic representation. The problem is modeled after the environment a child learns in: the test database consists of mother-child interactions from the Childes corpus, with the speech signal predigested for the computer into a sequence of phonemes. The algorithm maintains from utterance to utterance only a single coherent lexicon, and learns in the presence of homonymy, synonymy, and noise such as inaccurate semantic representations and mislabelings.

Introduction

We present an algorithm that acquires a lexicon from a corpus of unsegmented phoneme sequences paired with semantic representations. For instance, the algorithm might witness a particular utterance of "ok heres the big ball" as the phoneme sequence khirzbigbl paired with the semantic representation { OK HERE BE THE BIG BALL }. (Children do in fact hear some pause information, and utterances in our test database are quite short on average, so restricting the algorithm to totally unsegmented sequences is somewhat unnatural; but short utterances are frequently pauseless. The phoneme sequences are generated for the computer by predigesting the speech stream with a text-to-phoneme converter, with little effect on the work presented here.)

From this and other utterances, our goal is to produce an algorithm that learns a lexicon containing

    hir { HERE }        z { BE }          k { OK }
    bg  { BIG }         bl { BALL }         { THE }

It is not sufficient for our algorithm to learn any mapping from phoneme sequences to semantic representations that explains the training data. Many mappings will do so, including the trivial one in which the final lexicon contains the training utterances themselves. We ask that the algorithm learn a lexicon that generalizes well, and does so without recourse to cognitively implausible mechanisms such as off-line algorithms that access large amounts of the corpus at once.

This problem is interesting for several reasons. Although considerable study has been devoted to the acquisition of formal and natural grammars, grammatical categories for words, and even phonological processes, the acquisition of the lexicon has been largely neglected by the computational and machine learning communities, despite growing agreement that most variation stems from the lexicon. Thus, from a cognitive-science and artificial-intelligence viewpoint, this problem is a fundamental but relatively unstudied part of the process of learning a natural language, and a prerequisite to the acquisition of grammar. It is also a task where model complexity must be traded against global coverage, in an environment where only a small proportion of the data is available to the algorithm at any one time.

Of course, the work presented here presumes a gross simplification of the real task faced by a computer or child that seeks to learn from sound pressure waveforms and sensory stimuli. Children may well not have the innate capability to segment a speech stream into discrete alphabetic sound units like phonemes (this is a matter of some debate in the field), and even if they do, the realization of those units is highly context-dependent. This is not necessarily an insurmountable problem: the methods used here can easily be extended to handle some sound variation in words, and a later section discusses more recent work that accepts input closer to what current technology could derive from sound waves. A bigger assumption is the uniqueness and simplicity of the semantic representation for each utterance: the simplicity of our semantic representation both complicates the learning process for our algorithm, by eliminating information that could constrain the search (in conjunction with grammar, for example), and simplifies the task, by reducing the complexity of the information to be learned. The unambiguous interpretation we implicitly assume the child can give an utterance in an environment may or may not be trivializing the task; given our limited knowledge of child psychology, it is difficult to tell. But some methods for increasing the ambiguity of semantic interpretation for an utterance are discussed in the next section and later in the paper.

Previous Work

Olivier and Cartwright and Brent present simple algorithms that segment text and phoneme sequences by learning from statistical irregularities. In particular, they place phoneme sequences in a dictionary if those phonemes occur consecutively more often than they would if phonemes were selected by a memoryless random process with an identical aggregate distribution. Olivier's algorithm is on-line, extremely efficient, and incorporates no priors; Cartwright and Brent's is a batch minimum-description-length formulation that uses the size of the dictionary as a prior. Their algorithms perform with minimal adequacy, unable to distinguish correlations due to the dictionary from correlations due to syntax and other sources.

Brown et al. present a statistical machine translation algorithm that makes use of estimated correspondences between words in English and French. From an aligned multilingual database they estimate, for every English word, the distribution of French words it might translate to, including the number of words it will translate to. So they estimate that not will translate to pas, ne, and jamais, and they estimate the probability with which it translates into the two words ne ... pas. If we were to segment our phonemic input into words, we could view our problem as acquiring translation data from the language of sound to the language of meaning, or vice versa. Brown et al., of course, assume segmentation of the source and target language and make multiple passes over the data. They also make use of the generally linear correspondences between utterances in the two languages.

Siskind has presented a series of algorithms that learn word-meaning associations when presented with paired sequences of tokens and semantic representations. Our model of our problem is based on his work. His work differs from ours in two principal ways: first, Siskind learns more complex semantic representations (Jackendoff-style semantic formulae rather than simple sets), in an environment where his algorithms are presented with many ambiguous semantic representations, only one of which is correct; and second, Siskind's work assumes presegmented tokens as input, so his algorithm receives ok here is the big ball rather than our khirzbigbl. Siskind's work has tended to rely on classical search methods, and maintains a dictionary that may contain a variety of concurrent hypotheses for any given word.

(In recent work, Siskind has separated the learning of the set of semantic primitives associated with a word from the learning of the relations between those primitives; borrowing from his work, it would not be difficult to extend our algorithm to learn semantic formulae rather than merely sets of semantic primitives. Similarly, the method Siskind uses to disambiguate between multiple ambiguous semantic interpretations for an utterance is equally applicable here: he essentially uses Bayes' theorem to calculate the probability of a meaning given a word sequence from the probability of the word sequence given the meaning, which can be very effective if the proper definitions of some of the words in the utterance are known.)

See the above papers for references to other related work and a further discussion of the motivation for this line of research.

The Algorithm

The algorithm presented here maintains a single dictionary: a list of words. Each word is a triple of a phoneme sequence, a set of semantic symbols, and a confidence factor called the temperature. The temperature affects the likelihood that the word will be spontaneously deleted from the dictionary, and also the ability of the word to participate in the creation of new words. Thus a high temperature (near 1) implies that a word is likely to be involved with new word creation and to be deleted, and a low temperature (near 0) implies that the word is static and unlikely to be deleted.
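As a concrete illustration of this data structure, the sketch below (in Python, with names of our own choosing; this is not the author's code) represents a word and the dictionary:

    from dataclasses import dataclass

    @dataclass
    class Word:
        """One dictionary entry: phoneme sequence, sememe set, and temperature."""
        phonemes: str                  # e.g. "hir"
        sememes: frozenset             # e.g. frozenset({"HERE"})
        temperature: float = 1.0       # 1.0 = new and unreliable, 0.0 = frozen

    # The dictionary is simply a list of such words.
    dictionary = [
        Word("hir", frozenset({"HERE"}), temperature=0.1),
        Word("z", frozenset({"BE"}), temperature=0.1),
    ]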


When the algorithm is presented with an utterance, it performs a local variation of the expectation-maximization (EM) procedure: it attempts to parse the utterance using the words in its dictionary, resulting in values for hidden word-activation variables. By parse we mean that the algorithm attempts to find a set of words that collectively cover all the phonemes and sememes of the utterance, without overlap or mismatched elements. After the parse is complete, the maximization step occurs, with modification of the dictionary to reduce the error of activated words: new warm words are added to the dictionary to account for unparsed portions of the utterance, and variations of words used in the parse are added to the dictionary to fix mismatched or overparsed portions of the utterance (by overparsed we mean portions of the utterance that are accounted for by more than one word). Periodically, words are deleted from the dictionary if they have not been cooled by being used, a brand of prior that favors a minimal-size dictionary.

    Process-Utterance(u, d) {
        let words       = Dictionary-Words(d)
        let matches     = Match-Words(u, words)
        let E1, a1, P1, S1 = Parse(u, matches)
        let new-words   = Create-New-Words(u, matches, a1, P1, S1)
        let new-matches = Match-Words(u, new-words)
        let E2, a2, P2, S2 = Parse(u, matches ∪ new-matches)
        Cool-Words(matches ∪ new-matches, E2, a2, P2)
        Add-Cooled-Words(new-matches, d)
        Garbage-Collect-Dictionary(d)
        Reduce-Dictionary(d)
    }

Figure 1: Pseudocode for the procedure the learning algorithm applies to each utterance. u is an utterance, d is the dictionary, E is an error scalar, a^w is a vector of word activations, P_i is a vector of phonemic deviances, and S_j is a vector of semantic deviances. The subroutines are described in the sections below.

Figure 1 presents pseudocode for the algorithm. The subroutines used by this algorithm are described in more detail below.

Matching

A word may occur at different points in a phonemic utterance. The match-words function of the algorithm finds places in an utterance where a word might occur, and generates an evaluation of how closely it matches there. It does this by creating a vector, the phonemic-match or PM vector, that describes in terms of numbers from 0 to 1 how well a word accounts for each phoneme in the utterance, and a similar SM vector for the sememes. (At present only 1 and 0 are used: 1 for a perfect match and 0 for anything else. This scheme anticipates using intermediate values to represent phonemes likely to be related by phonological processes, such as t and d.) It also computes two scalars, PM-bar and SM-bar, that represent mismatched phonemes and sememes, such as a sememe in the word but not in the utterance.

[Table: matches of the words the, them, and man against the utterance the men, listing each match's position, PM and SM vectors, and PM-bar / SM-bar mismatch scalars.]

Figure 2: Some of the data produced by match-words to evaluate how well the words the, them, and man match up with the utterance the men. The position reflects the offset of the phonemic word into the utterance; thus the word the matches well phonetically at one offset and poorly at another.

Figure 2 contains a sample of what the match-words function produces for the utterance the men, assuming the dictionary contains reasonable definitions for the words the, them, and man. The match with the word the offset further into the utterance has a sufficiently poor phonemic match that match-words would filter it out. The function returns a list of these matches and the associated vectors and scalars.
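To make the matching step concrete, the sketch below (Python, with invented helper names and the simple 1/0 scoring described above; not the author's implementation) computes the PM vector and the phonemic mismatch scalar for one candidate offset. The SM vector and SM-bar can be computed analogously from the word's and the utterance's sememe sets.

    def match_word_at(word_phonemes, utterance, offset):
        """Score one candidate placement of a word inside a phonemic utterance.
        Returns (pm, pm_bar): pm[i] is 1.0 where the word accounts for utterance
        phoneme i; pm_bar counts word phonemes with no matching utterance phoneme."""
        pm = [0.0] * len(utterance)
        pm_bar = 0.0
        for k, p in enumerate(word_phonemes):
            i = offset + k
            if i < len(utterance) and utterance[i] == p:
                pm[i] = 1.0      # perfect phoneme match
            else:
                pm_bar += 1.0    # mismatched phoneme contributed by the word
        return pm, pm_bar

    # Example: the word "man" matched against the utterance "themen" at offset 3.
    pm, pm_bar = match_word_at("man", "themen", 3)   # pm_bar == 1.0 (the vowel differs)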

Parsing

In order to evaluate how well the current dictionary accounts for an utterance, the algorithm attempts to fully parse the utterance with words from the dictionary, placing words in such a fashion that each phoneme of the utterance is covered by a phoneme from exactly one dictionary word (with no extra phonemes being contributed by any word), and each sememe from the utterance is covered by a sememe from exactly one dictionary word (with no extra sememes being contributed by any word).

These desiderata can be modeled by giving each word match w an activation coefficient a^w between 0 and 1: if a^w = 0 the word does not participate in the parse; if a^w = 1 the word participates fully. A perfect parse meets the following conditions, using non-fractional activations:

\[ \sum_w a^w\, PM^w_i = 1 \;\;\forall i \qquad\qquad \sum_w a^w\, SM^w_j = 1 \;\;\forall j \]
\[ \sum_w a^w\, \overline{PM}^w = 0 \qquad\qquad \sum_w a^w\, \overline{SM}^w = 0 \]

The first two conditions guarantee that every phoneme and sememe is covered exactly once by the words in the utterance. The second two guarantee that these words are not also contributing any extraneous phonemes or sememes. (Actually, the semantic target is not necessarily a uniform vector of ones, since some sememes may occur multiple times in the utterance; one might alternatively leave the target vector uniform and adjust the success requirement on \(\sum_w a^w SM^w_j\) accordingly. A pause can be represented with a phonemic target of 0.) Of course, it may not be possible to meet these conditions given the available dictionary, at least not without fractional activations. So we parse with the goal of minimizing a global error function. If we let \(P_i = \sum_w a^w PM^w_i - 1\), the distance between the current total activation of the i-th phoneme and its target of 1, and similarly for sememes \(S_j\), our global error function is

\[ E \;=\; c_1 \sum_i f(P_i) \;+\; c_2 \sum_j f(S_j) \;+\; c_3 \sum_w a^w\, \overline{PM}^w \;+\; c_4 \sum_w a^w\, \overline{SM}^w \]

Here i sums over the length of the phonemic utterance, j over the number of different sememes in the utterance, and w over the word matches. The f function must be carefully chosen to result in a concentration of error in single phonemes or sememes, rather than a distribution over the parse, and to penalize overparsing a phoneme or sememe. (A simple piecewise function performs adequately: it penalizes error, and makes it least expensive to concentrate error on some phonemes and sememes rather than to distribute error with partial activations.)

We can minimize E by varying the activation vector a^w. A simple gradient-descent search from a randomly placed starting vector performs adequately for the particular vectors that arise here. The end result of the parsing process is a tuple of the final minimized error E, the activation vector a^w, and the deviation vectors P_i and S_j. Thus at the end of the parsing process we know not only how much each word participates in the parse, but also which phonemes and sememes are under- or overparsed (P_i and S_j). In the terms of the EM framework, we now have an estimate of the hidden variables, the word activations.
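The following sketch (Python) shows one way the error function and the gradient-descent search could be realized; the penalty f here is an illustrative stand-in with the concentration property described above, not the paper's exact function, and the constants, step sizes, and helper names are our own assumptions.

    import random

    def f(x):
        # Illustrative penalty: concentrating error on one element is cheaper than
        # spreading it over two (f(1) < 2 * f(0.5)), and any nonzero deviation,
        # including overparsing, is penalized.
        return abs(x) if abs(x) >= 1.0 else abs(x) * (2.0 - abs(x))

    def parse_error(acts, matches, n_phonemes, utt_sememes, c=(1.0, 1.0, 1.0, 1.0)):
        """E for a given activation vector.  matches[w] = (pm, sm, pm_bar, sm_bar)."""
        P = [-1.0] * n_phonemes                  # deviation from target 1 per phoneme
        S = {j: -1.0 for j in utt_sememes}       # deviation from target 1 per sememe
        mismatch = 0.0
        for a, (pm, sm, pm_bar, sm_bar) in zip(acts, matches):
            for i, v in enumerate(pm):
                P[i] += a * v
            for j, v in sm.items():
                S[j] += a * v
            mismatch += c[2] * a * pm_bar + c[3] * a * sm_bar
        return (c[0] * sum(f(x) for x in P) +
                c[1] * sum(f(x) for x in S.values()) + mismatch)

    def parse(matches, n_phonemes, utt_sememes, steps=200, lr=0.1, eps=1e-4):
        """Numerical gradient descent on activations constrained to [0, 1]."""
        acts = [random.random() for _ in matches]
        for _ in range(steps):
            grads = []
            for w in range(len(acts)):
                orig = acts[w]
                acts[w] = orig + eps
                e_plus = parse_error(acts, matches, n_phonemes, utt_sememes)
                acts[w] = orig - eps
                e_minus = parse_error(acts, matches, n_phonemes, utt_sememes)
                acts[w] = orig
                grads.append((e_plus - e_minus) / (2 * eps))
            acts = [min(1.0, max(0.0, a - lr * g)) for a, g in zip(acts, grads)]
        return parse_error(acts, matches, n_phonemes, utt_sememes), acts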

Creating New Words

Once a parse of an utterance has been completed, the algorithm has some sense of what words participated in the utterance and what was misparsed; it can now perform the maximization step of modifying the dictionary. It creates new words using a variety of methods that have proven successful, but are not in any way the only ones that might work. Some of the methods used in this process are similar to those used by Siskind. We can divide the methods into two parts: fixing words that participated in the parse, and creating wholly new words. Fixes include deleting and adding phonemes and sememes from a definition.

Words participate in fixes with some probability. In the case of semantic fixes, that probability is proportional to the word's activation and its temperature; this prevents a cold word from participating in any semantic changes. In the case of phonemic fixes, the probability is proportional only to the word's activation. This makes it easy for a fully frozen word, say cucumbers, to create a new word cucumber with the same meaning, but difficult for cucumbers to change its meaning to { AVOCADO }. The fixes that a word can participate in are:

- Remove sememes from the word if they do not occur in the utterance sememe set or are overparsed (S_j above a threshold).


- Add underparsed sememes (S_j below a threshold) to the word, but only if there are no underparsed phoneme sequences in the utterance, as then the misparse would most likely be due to a missing word.

- Alter the word's phoneme sequence so as to eliminate phonemes that mismatch with the utterance, and to eliminate phonemes that are overparsed (P_i above a threshold).

- Extend the word's phoneme sequence so as to incorporate neighboring underparsed phonemes (P_i below a threshold), up to a certain maximal length of extension.

In all cases the original word remains in the dictionary, and a new word is created that incorporates the change.

Wholly new words are also created, to account for unparsed portions of the utterance. A set of sequences of consecutive underparsed phonemes from the original utterance is created; these sequences represent the phonemic components of potential new words. Similarly, a set of the underparsed sememes is created. If there are two or fewer underparsed phoneme sequences, and each is below a maximum new-word length, then each one is turned into a new word, using the set of underparsed sememes as the hypothesized semantic representation.

New words start out with a high temperature (near 1).
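A rough sketch of the word-creation step, continuing the representation of the earlier sketches (the thresholds, probabilities, and helper names are our own; only the sememe-removal fix and wholly-new-word creation are shown, since the remaining fixes follow the same copy-and-alter pattern):

    import random

    def create_new_words(utterance, utt_sememes, words, acts, P, S,
                         over=0.5, under=-0.5, max_new_len=6):
        """Propose new dictionary entries after a parse.  P[i] / S[j] are the
        phonemic / semantic deviations: positive = overparsed, negative = underparsed."""
        proposals = []

        # Fix: drop sememes that are absent from the utterance or overparsed.
        # Semantic fixes are gated by activation * temperature, so frozen words
        # cannot change their meanings.
        for w, a in zip(words, acts):
            if random.random() < a * w.temperature:
                kept = frozenset(j for j in w.sememes
                                 if j in utt_sememes and S.get(j, 0.0) <= over)
                if kept and kept != w.sememes:
                    proposals.append(Word(w.phonemes, kept))     # starts out hot

        # Wholly new words: runs of consecutive underparsed phonemes, paired with
        # the underparsed sememes, provided there are at most two short runs.
        runs, run = [], ""
        for phoneme, d in zip(utterance, P):
            if d < under:
                run += phoneme
            else:
                if run:
                    runs.append(run)
                run = ""
        if run:
            runs.append(run)
        unparsed_sememes = frozenset(j for j, d in S.items() if d < under)
        if len(runs) <= 2:
            proposals.extend(Word(r, unparsed_sememes)
                             for r in runs if len(r) <= max_new_len)
        return proposals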

Cooling Words

As can be seen from the pseudocode in Figure 1, new words are used in a reparse of the utterance. If the result is a good parse, and these words are highly activated, then confidence in the words is increased by cooling their temperature asymptotically towards 0. The cool-words subroutine of the algorithm cools a word from a parse if it meets each of several conditions:

- It has no phonemic mismatches (PM-bar^w = 0).

- It has no semantic mismatches (SM-bar^w = 0).

- Its neighboring phonemes are well parsed (P_l and P_r below a threshold, where l and r are the left and right phonemic boundaries of the word match).

- Its activation a^w is over a threshold.

Cooling is a function of the total parse error E: a low error implies more cooling. Words are therefore cooled when they are confidently used in a successful parse. A nearly frozen word has successfully taken part in a number of good parses; a warm word has not reliably demonstrated its necessity.

New words are not added to the dictionary unless they are cooled after the reparse of the utterance that caused their creation. This minimizes the number of potentially disruptive changes to the dictionary.
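A sketch of the cooling step under the same representation (the activation threshold and the cooling schedule below are illustrative assumptions; the boundary-phoneme check is noted but not implemented):

    def cool_words(words, acts, matches, parse_error_E,
                   act_threshold=0.75, rate=0.5):
        """Increase confidence in words that were reliably used in a low-error parse."""
        cooling = rate / (1.0 + parse_error_E)       # lower total error, more cooling
        for w, a, (pm, sm, pm_bar, sm_bar) in zip(words, acts, matches):
            if pm_bar == 0 and sm_bar == 0 and a >= act_threshold:
                # A fuller version would also require the phonemes just outside
                # the word's boundaries to be well parsed.
                w.temperature *= (1.0 - cooling)     # asymptotic approach to 0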

Removing Words from the Dictionary

As utterances are parsed, new words are created to explain and correct errors, and are added to the dictionary. Many of these new words are unsuccessful and do not participate in many parses; they represent failed branches of the search. If a word remains uncooled for some time period, it is a good indication that adding that word to the dictionary was a mistake. After a certain fixed-length trial number of utterances, a word becomes open for deletion from the dictionary. Periodically, words are garbage-collected from the dictionary, with the probability of deletion roughly proportional to the temperature. A fully frozen word (temperature 0) will never be deleted; a warm word is highly likely to be deleted. (This heuristic prevents the algorithm from learning words that occur only once or twice, a problem given Carey's evidence that children can and need to acquire some words from a very small number of exposures. One solution would be to speed the cooling process as the majority of the dictionary becomes stable. But the problem has deep roots and needs greater investigation: any on-line algorithm that maintains no explicit memory of previous data points will have a difficult time recovering from some of its mistakes. The usual solution of weight decay towards a prior, to improve generalization and allow an algorithm to recover from noisy or misinterpreted data, does not work well if the algorithm must maintain perfect memory.)

As the algorithm starts to learn with an empty dictionary, the first words it creates tend to be utterance-encompassing, such as atsrait { THAT BE RIGHT }. Later the algorithm learns the components of such words, i.e. at, s, and rait. Periodically the dictionary attempts to reduce its size by parsing each of its words; if it can successfully parse a word without recourse to the word itself, that word is eliminated from the dictionary.

The process of removing words from the dictionary is a means of implementing a prior preference for a small dictionary, one with no unnecessary words. The gradual cooling of words used in parses ensures that words remain in the dictionary only if the data necessitate their presence.
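The two pruning steps might look as follows (again a sketch; the trial-period bookkeeping is omitted and parses_without is an assumed hook into the parser):

    import random

    def garbage_collect(dictionary):
        """Probabilistically delete words, with deletion probability roughly
        proportional to temperature; frozen words (temperature 0) are never deleted."""
        dictionary[:] = [w for w in dictionary if random.random() >= w.temperature]

    def reduce_dictionary(dictionary, parses_without):
        """Drop any word the rest of the dictionary can already explain:
        parses_without(word) is assumed to parse word's own phonemes and sememes
        using every entry except word, returning True on a low-error parse."""
        dictionary[:] = [w for w in dictionary if not parses_without(w)]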

A Short Example

Here we present a short description of the algorithm's performance on a single example from the test suite: yukiktsk, paired with { YOU KICK OFF THE SOCK }. At the point that the utterance is encountered, the matching process finds acceptably close matches in the dictionary: yu { YOU }, the dictionary entry for the, and rsk { SOCK }, each at its own offset. Notice that you and the have no mismatches, but sock has an extra r. You and the are well cooled (temperature near 0) at this point, but, not surprisingly, sock is still quite warm.

Parsing with these three words leaves all with activation near 1. Two new words are then created. The first is a fix of the phonemic mismatch in rsk: it is sk { SOCK }, the old word with the one mismatched phoneme removed. The second word is completely new, created to account for the unparsed parts of the utterance: kikt { KICK OFF }. The sentence is then rematched and parsed. In the new parse rsk is given low activation, because the sentence can be parsed with less error using sk instead, and kikt is given activation near 1. The total error is quite low (it would be zero if the gradient-descent search had produced correct activations of exactly 0 or 1),


and the activated words are cooled. Thus sk and kikt are cooled, but rsk is not. Garbage collection will eventually remove rsk from the dictionary, since it is not likely to be cooled again given the new competition from sk.

Tests and Results

To test the algorithm, we are using utterances from the Childes database of mothers' speech to children. These text utterances were run through a publicly available text-to-phoneme engine, and were also used to create a semantic dictionary in which each root word from the utterances was mapped to a corresponding sememe; various forms of a root (see, saw, seeing) all map to the same sememe (e.g. SEE). Semantic representations for a given utterance are merely unordered bags of sememes, generated by taking the union of the sememe for each word in the utterance. Figure 3 contains the first utterances from the database.

    Sentence                       Phoneme Sequence   Sememe Set
    this is a book                 szebuk             { THIS BE A BOOK }
    what do you see in the book    wtduyusinbuk       { WHAT DO YOU SEE IN THE BOOK }
    how many rabbits               haVmnirabbts       { HOW MANY RABBIT }
    how many                       haVmni             { HOW MANY }
    one rabbit                     wnrabbt            { ONE RABBIT }
    what is the rabbit doing       wtzrabbtdu         { WHAT BE THE RABBIT DO }

Figure 3: The first utterances from the Childes database used to test the algorithm.
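For concreteness, the semantic side of this preprocessing could be sketched as follows (the root table and stemming here are toy stand-ins, not the actual tools used):

    # Toy root-word table; the real semantic dictionary maps every root in the corpus.
    SEMEME = {"this": "THIS", "is": "BE", "a": "A", "book": "BOOK",
              "see": "SEE", "saw": "SEE", "seeing": "SEE"}

    def sememe_bag(text_utterance):
        """Unordered bag (here a set) of sememes: the union over the words' roots."""
        return frozenset(SEMEME[w] for w in text_utterance.lower().split() if w in SEMEME)

    assert sememe_bag("this is a book") == {"THIS", "BE", "A", "BOOK"}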

We describe the results of a single run of the algorithm, trained on one exposure to each of the utterances; successive runs tend to result in nearly identical results. Some entries in the final dictionary are different forms of a common stem. Over the corpus the algorithm has been exposed to many different stems, and a portion of the words in the dictionary have never been used in a low-error parse; we eliminate these words, most of which have high temperatures. Figure 4 presents some entries in the final dictionary, and Figure 5 presents all of the entries that could be considered significant mistakes, so the remaining entries are correct.

The most obvious error visible in Figure 5 is the suffix ing (i), which should be semanticless (have an empty sememe set). Indeed, a semanticless word is properly hypothesized, but a special mechanism prevents semanticless words from being added to the dictionary. This mechanism is necessary because the error function overpromotes semanticless words, and results in poor learning of phonological words that happen to contain them as substrings. Without it, the system would chance upon a new word like ring (ri), use the semanticless i to account for most of the sound, and build a new word r { RING } to cover the rest (witness something in Figure 5). One solution to such a problem is to incorporate additional linguistic knowledge about word structure and about sound changes that occur at word boundaries (for instance, in English no stem may be less than a certain minimal size, and word boundaries can sometimes be distinguished with the knowledge that word-initial obstruents are aspirated: t is pronounced with an exhalation of air in top but not in stop), a solution discussed to some

    Phoneme Sequence   Sememe Set        Phoneme Sequence   Sememe Set
    yu                 { YOU }           bik                { BEAK }
                       { THE }           we                 { WAY }
    wt                 { WHAT }          bukkes             { BOOKCASE }
    tu                 { TO }            brik               { BREAK }
    du                 { DO }            fg                 { FINGER }
    e                  { A }             santkls            { SANTA CLAUS }
    t                  { IT }            tp                 { TOP }
    a                  { I }             kld                { CALL }
    n                  { IN }            gz                 { EGG }
    wi                 { WE }            S                  { THING }
    s                  { BE }            ks                 { KISS }
    n                  { ON }            hi                 { HEY }

Figure 4: Dictionary entries. The left columns are the words used most frequently in good parses; the right columns were selected randomly from the entries.

extent by Cartwright and Brent and by Church. Alternatively, as a more immediate workaround, we could provide the test corpus with an explicit semantic clue, such as the following association: i { PROGRESSIVE }.

Most other semanticless affixes (plural s, for instance) are also properly hypothesized and disallowed, but the dictionary learns multiple entries to account for them (g egg and gz eggs). The system learns is, was, and am, and homonyms (read / red, know / no) without difficulty.

    Phoneme Sequence   Sememe Set            Phoneme Sequence   Sememe Set
                       { BE }                nupis              { SNOOPY }
                       { YOU }               wo                 { WILL }
                       { DO }                zu                 { AT ZOO }
    Miz                { SHE BE }            don                { DO }
    wthappind          { WHAT HAPPEN }       r                  { BE }
    dont               { DO NOT }            smd                { MUD }
    smS                { SOMETHING }         nidlz              { NEEDLE BE }
    wtriz              { WHAT BE THESE }     drnwiz             { DROWN OTHERWISE }
    shappin            { HAPPEN }            slf                { YOU }
    t                  { NOT }                                  { BE }
    sktt               { BOB SCOTT }

Figure 5: All of the significant dictionary errors. Some of them, like Miz, are conglomerations that should have been divided. Others, like t, wo, and don, demonstrate how the system compensates for the morphological irregularity of English contractions. The i problem is discussed in the text; misanalysis of the role of i also manifests itself on something.

Shortcomings and Future Directions

The most obvious immediate shortcoming of the algorithm is the previously discussed difficulty with semanticless words and affixes. As mentioned in the introduction, a more important issue is the simplicity of the environment the current algorithm assumes. We are building a new and considerably more complex learning architecture to rectify some of the deficiencies. In particular:

- Instead of discrete phonemes, the new system accepts a time series of potentially noisy estimated vocal-articulator positions. It attempts to find phoneme sequences from its dictionary that provide the most complete and consistent account of the perceived positions.

- The system incorporates some linguistic knowledge, selected to enable it to infer some contextual sound-change rules and consequently use sound changes as evidence of word boundaries. This knowledge should help the system learn semanticless words and affixes.



- The system accepts a complex conditional probability distribution over semantic symbols as the child's interpretation of the environment, and uses a Bayesian model to determine which sememes are actually represented by the utterance.

If successful, the improved system will considerably expand the scope and performance of the preliminary work presented here.

Acknowledgements

This research is supported by NSF grant ASC and by ARPA under the HPCC program. David Baggett, Robert Berwick, Jeffrey Siskind, Oded Maron, Greg Galperin, and Marina Meila have made valuable suggestions and comments related to this work. A Common Lisp version of the algorithm and the test corpus can be found at ftp.ai.mit.edu in /pub/users/cgdemarc/papers/icgi.

References

J. Bachenko and E. Fitzpatrick. A Computational Grammar of Discourse-Neutral Prosodic Phrasing in English. Computational Linguistics.

P. Brown, J. Cocke, S. Della Pietra, V. Della Pietra, F. Jelinek, J. Lafferty, R. Mercer, and P. Roossin. A Statistical Approach to Machine Translation. Computational Linguistics.

S. Carey. The Child as a Word Learner. In M. Halle, J. Bresnan, and G. Miller, eds., Linguistic Theory and Psychological Reality. MIT Press, Cambridge, Massachusetts.

T. Cartwright and M. Brent. Segmenting Speech Without a Lexicon: Evidence for a Bootstrapping Model of Lexical Acquisition. In Proceedings of the Annual Conference of the Cognitive Science Society.

K. Church. Phonological parsing and lexical retrieval. Cognition.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society B.

R. Jackendoff. Semantic Structures. MIT Press, Cambridge, Massachusetts.

B. MacWhinney and C. Snow. The Child Language Data Exchange System. Journal of Child Language.

D. Olivier. Stochastic Grammars and Language Acquisition Mechanisms. PhD thesis, Harvard University, Cambridge, Massachusetts.

J. Rissanen. Modeling by shortest data description. Automatica.

J. Siskind. Naive Physics, Event Perception, and Language Acquisition. PhD thesis, Massachusetts Institute of Technology, Cambridge, Massachusetts.

J. Siskind. Lexical Acquisition as Constraint Satisfaction. Technical Report IRCS, University of Pennsylvania Institute for Research in Cognitive Science.

J. Siskind. Lexical Acquisition in the Presence of Noise and Homonymy. In Proceedings of the National Conference on Artificial Intelligence (AAAI).

P. Suppes. The semantics of children's language. American Psychologist.