
cmp-lg/9503019 (20 Mar 1995)

SATZ: An Adaptive Sentence Segmentation System

David D. Palmer

Technical Report No. UCB/CSD …, Computer Science Division (EECS), University of California, Berkeley, December …

Abstract

The segmentation of a text into sentences is a necessary prerequisite for many natural language processing tasks, including part-of-speech tagging and sentence alignment. This is a nontrivial task, however, since end-of-sentence punctuation marks are ambiguous. A period, for example, can denote a decimal point, an abbreviation, the end of a sentence, or even an abbreviation at the end of a sentence. To disambiguate punctuation marks, most systems use brittle, special-purpose grammars and exception rules. Such approaches are usually limited to the text genre for which they were developed and cannot be easily adapted to new text types, nor can they be easily adapted to other natural languages.

As an alternative, I present an efficient, trainable algorithm that can be easily adapted to new text genres and some range of natural languages. The algorithm uses a lexicon with part-of-speech probabilities and a feed-forward neural network for rapid training. The method described requires minimal storage overhead and a very small amount of training data. The algorithm overcomes the limitations of existing methods and produces a very high accuracy.

The results presented demonstrate the successful implementation of the algorithm on a …-sentence English corpus. Training time was less than one minute on a workstation, and the method correctly labeled over … of the sentence boundaries. The method was also successful in labeling texts containing no capital letters. The system has been successfully adapted to German and French; the training times were similarly low, and the resulting accuracy exceeded ….

Contents

Introduction
    The Problem
    Baseline
Previous Approaches
    Regular Expressions and Heuristic Rules
    Regression Trees
    Word endings and word lists
    Feedforward Neural Network
    System Desiderata
The SATZ System
    Tokenizer
    Part-of-speech Lookup
        Representing Context
        The Lexicon
        Heuristics for Unknown Words
    Descriptor Array Construction
    Classification by Neural Network
        Network architecture
        Training
    Implementation
Experiments and Results: English
    Context Size
    Hidden Units
    Sources of Errors
    Thresholds
    Single-case texts
    Lexicon size
Adaptation to Some Other Languages
    German
        German News Corpus
        Süddeutsche Zeitung Corpus
    French
Conclusions and Future Directions

Introduction

The Problem

The sentence is an important unit in many natural language processing tasks.[1] For example, the alignment of sentences in parallel multilingual corpora requires first that the individual sentence boundaries be clearly labeled (Gale and Church; Kay and Roscheinsen). Most part-of-speech taggers[2] also require the disambiguation of sentence boundaries in the input text (Church; Cutting et al.). This is usually accomplished by inserting a unique character sequence at the end of each sentence, such that the NLP tools analyzing the text can easily recognize the individual sentences.

Segmenting a text into sentences is a nontrivial task, however, since all end-of-sentence punctuation marks are ambiguous.[3] A period, for example, can denote a decimal point, an abbreviation, the end of a sentence, or even an abbreviation at the end of a sentence. An exclamation point and a question mark can occur within quotation marks or parentheses, as well as at the end of a sentence. The ambiguity of these punctuation marks is illustrated in the following difficult cases:

(1) The group included Dr. J.M. Freeman and T. Boone Pickens Jr.

(2) "This issue crosses party lines and crosses philosophical lines!" said Rep. John Rowland (R., Conn.).

The existence of punctuation in grammatical subsentences suggests the possibility of a further decomposition of the sentence boundary problem into distinct types of sentence boundaries, one of which would be the "embedded sentence boundary." Such a distinction might be useful for certain applications which analyze the grammatical structure of the sentence. However, in this work I will address the less-specific problem of determining sentence boundaries between sentences.

[1] Much of the work contained in this report has been reported in a similar form in Palmer and Hearst, and portions of this work were done in collaboration with Marti Hearst of Xerox PARC.

[2] The terms "sentence segmentation," "sentence boundary disambiguation," and "sentence boundary labeling" are interchangeable.

[3] In this report I will consider only the period, the exclamation point, and the question mark to be possible end-of-sentence punctuation marks, and all references to "punctuation marks" will refer to these three. Although the colon, the semicolon, and conceivably the comma can also delimit grammatical sentences, their usage is beyond the scope of this work.

In examples (1) and (2), the word immediately preceding and the word immediately following a punctuation mark provide important information about its role in the sentence. However, more context may be necessary, such as when punctuation occurs in a subsentence within quotation marks or parentheses, as seen in example (2), or when an abbreviation appears at the end of a sentence, as seen in (3a-b):

(3a) It was due Friday by 5 p.m. Saturday would be too late.

(3b) She has an appointment at 5 p.m. Saturday to get her car fixed.

The section on Representing Context below contains a discussion of methods of representing context.

Baseline

When evaluating a sentence segmentation algorithm, comparison with a baseline algorithm is an important measure of the algorithm's success. A baseline algorithm in this case is simply a very naive algorithm which would label each punctuation mark as a sentence boundary. Such a baseline algorithm would have an accuracy equal to the lower bound of the text: the percentage of possible sentence-ending punctuation marks in the text which indeed denote sentence boundaries. A good sentence segmentation algorithm will thus have an accuracy much greater than the lower bound.
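As a concrete illustration (my own, not part of the original system), the lower bound can be computed directly from a labeled text by counting the candidate punctuation marks and the subset that truly end sentences:

    #include <stdio.h>

    /* Minimal sketch: the baseline accuracy, or "lower bound," is the
     * fraction of candidate end-of-sentence punctuation marks (periods,
     * exclamation points, question marks) that truly end a sentence. */
    double lower_bound(long candidates, long true_boundaries)
    {
        return candidates > 0 ? (double)true_boundaries / candidates : 0.0;
    }

    int main(void)
    {
        long candidates = 1000, true_boundaries = 900;  /* hypothetical counts */
        printf("baseline accuracy = %.1f%%\n",
               100.0 * lower_bound(candidates, true_boundaries));
        return 0;
    }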

Since the use of abbreviations in a text depends on the particular text and text genre, the number of ambiguous punctuation marks, and therefore the performance of the baseline algorithm, will vary dramatically depending on text genre, and even within a single text genre. For example, Liberman and Church report that the Wall Street Journal corpus contains … periods per million tokens, whereas in the Tagged Brown corpus (Francis and Kucera) the figure is only … periods per million tokens. They also report that … of the periods in the WSJ corpus denote abbreviations (a lower bound of …), compared to only … in the Brown corpus (a lower bound of …) (Riley). In contrast, Muller reports lower bound statistics ranging from … to … within the same corpus of scientific abstracts. Such a range of lower bound figures might suggest the need for a robust approach that can adapt rapidly to different text requirements.

Previous Approaches

Although sentence boundary disambiguation is an essential preprocessing step of many natural language processing systems, it is a topic rarely addressed in the literature, and consequently there are few published references. There are also few public-domain systems for performing the segmentation task, and most current systems are specifically tailored to the particular corpus analyzed and are not designed for general use.

Regular Expressions and Heuristic Rules

The method currently widely used for determining sentence boundaries is a regular grammar, usually with limited lookahead. In the simplest implementation of this method, the grammar rules attempt to find patterns of characters, such as "period-space-capital letter," which usually occur at the end of a sentence. More robust implementations consider the entire word preceding and following the punctuation mark and include extensive word lists and exception lists to attempt to recognize abbreviations and proper nouns. There are several examples of rule-based and heuristic systems for which performance numbers are available.
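A minimal sketch of the simplest pattern-based approach (my illustration, not code from any of the systems cited) flags every "period, whitespace, capital letter" sequence as a candidate boundary using a POSIX regular expression:

    #include <regex.h>
    #include <stdio.h>

    /* Naive pattern matcher: flag "period, whitespace, capital letter" as
     * a likely sentence boundary. Real rule-based systems add word lists
     * and exception rules on top of patterns like this. */
    int main(void)
    {
        regex_t re;
        regmatch_t m;
        const char *text = "It was due Friday by 5 p.m. Saturday would be too late.";
        int offset = 0;

        regcomp(&re, "\\.[ \t]+[A-Z]", REG_EXTENDED);
        while (regexec(&re, text + offset, 1, &m, 0) == 0) {
            printf("candidate boundary at offset %d\n", offset + (int)m.rm_so);
            offset += m.rm_eo;  /* resume the search after this match */
        }
        regfree(&re);
        return 0;
    }

Note that the same pattern fires on "p.m. Saturday" in both sentences of example (3), although only one of the two occurrences is a true boundary; this is exactly the brittleness described in the following paragraphs.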

Christiane Hoffmann used a regular expression approach to classify punctuation marks in a corpus of the German newspaper die tageszeitung, with a lower bound of …. She used the UNIX tool lex (Lesk and Schmidt) and a large abbreviation list to classify occurrences of periods according to their likely function in the text. Tested on … periods from the corpus, her method correctly classified over … of the sentence boundaries. The method was developed specifically for the tageszeitung corpus, and Hoffmann reports that success in applying her method to other corpora would be dependent on the quality of the available abbreviation lists.

Gabriele Schicht[4] developed, over the course of four months, a method for segmenting sentences in a corpus of the German newspaper die Süddeutsche Zeitung. The method uses a program written in the text manipulation language perl (Wall and Schwartz) to analyze a context consisting of the word immediately preceding and the word immediately following each punctuation mark. In the case of a period following a number, the method considers more context: one word before the number and one word after the period. More context is also considered when attempting to recognize abbreviations containing several blank spaces, such as "v. i. S. d. P." (verantwortlich im Sinne des Presserechts). Using a NeXT workstation, the method requires … minutes to classify … cases, as each word in the limited context must be looked up in a …-word lexicon. Schicht reports over … accuracy using the method.[5]

[4] At the University of Munich, Germany.

Mark Wasson and colleagues[6] invested … staff months developing a system that recognizes special tokens (e.g., non-dictionary terms such as proper names, legal statute citations, etc.) as well as sentence boundaries. From this, Wasson built a stand-alone boundary recognizer in the form of a grammar converted into finite automata with … states and … transitions (excluding the lexicon). The resulting system, when tested on … megabytes of news and case law text, achieved an accuracy of … at speeds of … characters per CPU second on a mainframe computer. When tested against upper-case legal text the algorithm still performed very well, achieving accuracies of … and … on test data of … and … periods, respectively. It is not likely, however, that the results would be this strong on lower-case data.[7]

Although the regular grammar approach can be successful, it requires a large manual effort to compile the individual rules used to recognize the sentence boundaries. Such efforts are usually developed specifically for a text corpus (Liberman and Church; Hoffmann) and would probably not be portable to other text genres. Because of their reliance on special language-specific word lists, they are not portable to other natural languages without repeating the effort of compiling extensive lists and rewriting rules. In addition, heuristic approaches depend on having a well-behaved corpus with regular punctuation and few extraneous characters, and they would probably not be very successful with texts obtained via optical character recognition (OCR).

[5] All information about this system is courtesy of a personal communication with Gabriele Schicht.

[6] At Mead Data Central.

[7] All information about Mead's system is courtesy of a personal communication between Marti Hearst and Mark Wasson.

Regression Trees

Riley describes an approach that uses regression trees (Breiman et al.) to classify sentence boundaries according to the following features:

- Probability[word preceding "." occurs at end of sentence]
- Probability[word following "." occurs at beginning of sentence]
- Length of word preceding "."
- Length of word after "."
- Case of word preceding ".": Upper, Lower, Cap, Numbers
- Case of word following ".": Upper, Lower, Cap, Numbers
- Punctuation after "." (if any)
- Abbreviation class of word with "."

The method uses information about one word of context on either side of the punctuation mark and thus must record, for every word in the lexicon, the probability that it occurs next to a sentence boundary. Probabilities were compiled from … million words of pre-labeled training data from a corpus of AP newswire. The results were tested on the Brown corpus, achieving an accuracy of ….[8]

[8] Time for training was not reported, nor was the amount of the Brown corpus against which testing was performed; it is assumed the entire Brown corpus was used.
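For concreteness, this feature set can be pictured as one record per candidate period (a sketch of my own for illustration; the field names are hypothetical and do not come from Riley's implementation):

    /* One feature record per candidate period, following the list above. */
    typedef enum { CASE_UPPER, CASE_LOWER, CASE_CAP, CASE_NUMBERS } WordCase;

    typedef struct {
        double   p_end;         /* P(word preceding "." occurs at end of sentence)   */
        double   p_begin;       /* P(word following "." occurs at start of sentence) */
        int      len_before;    /* length of word preceding "."                      */
        int      len_after;     /* length of word after "."                          */
        WordCase case_before;   /* case of word preceding "."                        */
        WordCase case_after;    /* case of word following "."                        */
        char     punct_after;   /* punctuation after "." (if any), else '\0'         */
        int      abbrev_class;  /* abbreviation class of word with "."               */
    } BoundaryFeatures;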

Word endings and word lists

Muller provides an exhaustive analysis of sentence boundary disambiguation as it relates to lexical endings and the identification of abbreviations and words surrounding a punctuation mark, focusing on text written in English. This approach makes multiple passes through the data to find recognizable suffixes and thereby filters out words which aren't likely to be abbreviations. The morphological analysis makes it possible to identify words which are not otherwise present in the extensive word lists used to identify abbreviations. Accuracy rates of … are reported for this method, tested on over … scientific abstracts with a lower bound ranging from … to ….

Feedforward Neural Network

Humphrey and Zhou report using a feed-forward neural network to disambiguate periods and achieve an accuracy averaging …. They use a regular grammar to tokenize the text before training the neural nets, but no further details of their approach are available.[9]

[9] Accuracy results were obtained courtesy of a personal communication between Marti Hearst and Joe Zhou.

System Desiderata

Each of the approaches described above has disadvantages to overcome. A successful sentence-boundary disambiguation algorithm should have the following characteristics:

- The approach should be robust, and should not require a hand-built grammar or specialized rules that depend heavily on capitalization, multiple spaces between sentences, etc. Thus the approach should adapt easily to new text genres and some new languages.

- The approach should train quickly on a small training set and should not require excessive storage overhead.

- The approach's results should be very accurate, and it should be efficient enough that it does not noticeably slow down text preprocessing.

- The approach should be able to specify "no opinion" on cases that are too difficult to disambiguate, rather than making under-informed guesses.

In the following sections I present an approach that meets each of these criteria, produces a very low error rate, and behaves more robustly than solutions that require manually designed rules.

The SATZ System

This section describes the structure of my adaptive sentence segmentation system, known as SATZ.[10] My approach in the SATZ system is to represent the context surrounding a punctuation mark as a series of vectors of probabilities. The probabilities used for each word in the context are the prior part-of-speech probabilities, obtained from a lexicon containing part-of-speech frequency data. The context vectors, or descriptor arrays, are used as input to a neural network trained to disambiguate sentence boundaries. The output of the neural network is then used to determine the role of the punctuation mark in the sentence. The architecture of the system is shown in Figure 1, and the following sections describe the individual stages in the process.

Tokenizer

The first stage of the process is lexical analysis, which breaks the input text, a stream of characters, into tokens. The SATZ tokenizer is implemented using the UNIX tool lex (Lesk and Schmidt) and is a slightly-modified version of the tokenizer from the PARTS part-of-speech tagger (Church). The tokens returned by the lex program can be a sequence of alphabetic characters (i.e., words), a sequence of digits,[11] or a single non-alphanumeric character such as a period or quotation mark.
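The token classes just described can be sketched as follows (my illustration in plain C, not the actual lex-generated tokenizer), including the footnoted rule that a number containing a decimal point remains a single token:

    #include <ctype.h>

    typedef enum { TOK_WORD, TOK_NUMBER, TOK_PUNCT } TokenType;

    /* Classify a non-empty token string. A digit sequence with internal
     * periods, e.g. "3.14", is a single number token, which removes that
     * ambiguity of the period at the lexical stage. */
    TokenType classify_token(const char *tok)
    {
        size_t i;

        if (isdigit((unsigned char)tok[0])) {
            for (i = 0; tok[i] != '\0'; i++)
                if (!isdigit((unsigned char)tok[i]) && tok[i] != '.')
                    break;
            if (tok[i] == '\0')
                return TOK_NUMBER;   /* includes decimals such as "3.14" */
        }
        if (isalpha((unsigned char)tok[0]))
            return TOK_WORD;         /* sequence of alphabetic characters  */
        return TOK_PUNCT;            /* single non-alphanumeric character  */
    }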

[10] "Satz" is the German word for "sentence."

[11] Numbers containing periods acting as decimal points are considered a single token. This eliminates one possible ambiguity of the period at the lexical analysis stage.

[Figure 1: SATZ Architecture. Input Text → Tokenization → Part-of-speech Lookup → Descriptor array construction → Classification by neural network → Text with sentence boundaries disambiguated.]

Part-of-speech Lookup

Representing Context

The context surrounding a punctuation mark can be represented in various ways. The simplest and most straightforward is to use the individual words preceding and following the punctuation mark, as in this example using three words of context on either side of the punctuation mark:

    at the plant. He had thought

For each word in the language we would then determine how likely it is to come at the end or beginning of a sentence. However, compiling these figures for each word in a language is very time-consuming and requires large amounts of storage, and it is unlikely that such information will be useful to later stages of processing.

As an alternative, the context could be approximated by using a single part-of-speech for each word. The above context would then be represented by the following part-of-speech sequence:

    preposition article noun
    pronoun verb verb

Requiring a single part-of-speech for each word presents a processing circularity: most part-of-speech taggers require predetermined sentence boundaries, so sentence labeling must be done before tagging. But if sentence labeling is done before tagging, no part-of-speech assignments are available for the boundary-determination algorithm.

To avoid this processing circularity, and to avoid the need for a single part-of-speech for each word, the context can be further approximated by the prior probabilities of all parts-of-speech for each word. Each word in the context would thus be represented by a series of possible parts-of-speech, as well as the probability that the word occurs as each part-of-speech. Continuing the example, the context becomes:

    preposition article noun/verb
    pronoun verb noun/verb

This denotes that "at" and "the" have a probability of … of occurring as a preposition and an article respectively, "plant" has a probability of … of occurring as a noun and a probability of … of occurring as a verb, and so on. These probabilities are based on occurrences of the words in a pre-tagged corpus and are therefore corpus-dependent. Such part-of-speech information is often used by a part-of-speech tagger and would thus be readily available, and it would not require excessive storage overhead. For these reasons I chose to approximate the context in my system by using the prior part-of-speech probabilities.

The Lexicon

An important component of the SATZ system is the lexicon containing part-of-speech frequency data, from which the probabilities are calculated. Words in the lexicon are followed by a series of part-of-speech tags and associated frequencies, representing the possible parts-of-speech for that word and the frequency with which the word occurs as each part-of-speech. The lexical lookup stage of the SATZ system finds a word in the lexicon, if it is present, and returns the possible parts-of-speech. For the English word "well," for example, the lookup module might return the tags

    JJ NN QL RB UH VB

indicating that, in the corpus on which the lexicon is based,[12] the word "well" occurred … times as an adjective, … times as a singular noun, … times as a qualifier, … times as an adverb, … times as an interjection, and … times as a singular verb.

[12] In this example the frequencies are derived from the Brown corpus (Francis and Kucera).
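A minimal sketch of such a lexicon entry and its lookup (my illustration; the actual lookup lives in getpart.c, and its data structures may differ):

    #include <string.h>

    #define MAX_TAGS 16

    /* One lexicon entry: the word, its possible part-of-speech tags, and
     * the corpus frequency with which it occurs as each. */
    typedef struct {
        const char *word;
        const char *tag[MAX_TAGS];   /* e.g. "JJ", "NN", "RB", ...  */
        int         freq[MAX_TAGS];  /* occurrences with that tag   */
        int         ntags;
    } LexEntry;

    /* Linear search stands in for whatever indexing the real system uses. */
    const LexEntry *lex_lookup(const LexEntry *lex, int n, const char *w)
    {
        int i;
        for (i = 0; i < n; i++)
            if (strcmp(lex[i].word, w) == 0)
                return &lex[i];
        return NULL;   /* unknown word: handled by the heuristics below */
    }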

Heuristics for Unknown Words

If a word is not present in the lexicon, the system contains a set of heuristics which attempt to assign the most reasonable parts-of-speech to the word. A summary of these heuristics follows (a compressed code sketch appears after this section):

- Unknown tokens containing a digit are assumed to be numbers.

- Any token beginning with a period, exclamation point, or question mark is assigned a "possible end-of-sentence punctuation" tag. This catches common sequences like ….

- Common morphological endings are recognized, and the appropriate part-of-speech is assigned to the entire word.

- Words containing a hyphen are assigned a series of tags and frequencies denoting "unknown hyphenated word."

- Words containing an internal period are assumed to be abbreviations.

- Capitalized words are not always proper nouns, even when they appear somewhere other than in a sentence's initial position; e.g., the word "American" is often used as an adjective. Such words, when not present in the lexicon, are assigned a certain probability (… for English) of being a proper noun.

- Capitalized words appearing in the lexicon but not registered as proper nouns can nevertheless still be proper nouns. In addition to the part-of-speech frequencies present in the lexicon, these words are assigned a certain probability (… for English) of being a proper noun.

- As a last resort, the word is assigned a series of possible tags with a uniform frequency distribution.

These heuristics can be easily modified and adapted to the specific needs of a new language. For example, the probability of a capitalized word being a proper noun is higher in English than in German, where all nouns are also capitalized.
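The dispatch logic can be compressed into a short, self-contained sketch (hypothetical category slots and constants; the real values live in common.h, and the morphological-ending and hyphen cases are omitted here for brevity):

    #include <ctype.h>
    #include <string.h>

    #define NUM_CATEGORIES 18
    enum { CAT_NOUN = 0, CAT_VERB = 1, CAT_PROPER_NOUN = 7,
           CAT_NUMBER = 8, CAT_ABBREV = 15, CAT_EOS_PUNCT = 16 };

    /* Fill prob[] with a guessed part-of-speech distribution for an
     * unknown, non-empty token. */
    void guess_unknown(const char *tok, double prob[NUM_CATEGORIES])
    {
        int i;
        for (i = 0; i < NUM_CATEGORIES; i++) prob[i] = 0.0;

        if (strpbrk(tok, "0123456789")) { prob[CAT_NUMBER] = 1.0; return; }
        if (strchr(".!?", tok[0]))      { prob[CAT_EOS_PUNCT] = 1.0; return; }
        if (strchr(tok + 1, '.'))       { prob[CAT_ABBREV] = 1.0; return; }

        if (isupper((unsigned char)tok[0])) {
            /* Capitalized unknown word: assign some probability of being a
             * proper noun (language-specific; it would be lower for German,
             * where all nouns are capitalized). 0.9 is a made-up value. */
            prob[CAT_PROPER_NOUN] = 0.9;
            prob[CAT_NOUN] = prob[CAT_VERB] = 0.05;
            return;
        }
        /* Last resort: uniform distribution over all categories. */
        for (i = 0; i < NUM_CATEGORIES; i++) prob[i] = 1.0 / NUM_CATEGORIES;
    }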

Descriptor Array Construction

For each token in the input text, we need to construct a vector of probabilities to numerically describe the token. This vector is known as a descriptor array. The lexicon may contain as many as … very specific parts-of-speech, which we first need to map into more general categories. For example, the tags for present tense verb, past participle, and modal verb all map into the more general "verb" category. The parts-of-speech returned by the lookup module are thus mapped into the general categories given in Figure 2, and the frequencies for each category are summed. The category frequencies for the word are then converted to probabilities by dividing the frequency for each category by the total frequency over all categories. In addition to these probabilities, the descriptor array also contains two additional flags that indicate if the word begins with a capital letter and if it follows a punctuation mark, for a total of 20 items in each descriptor array.
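This construction is small enough to show in full (a sketch with hypothetical names; 18 general categories as in Figure 2, plus the two flags):

    #define NUM_CATEGORIES 18
    #define DESC_LEN (NUM_CATEGORIES + 2)  /* + capitalization and
                                              follows-punctuation flags */

    /* Build one 20-element descriptor array: summed general-category
     * frequencies normalized to probabilities, then the two flags. */
    void build_descriptor(const int cat_freq[NUM_CATEGORIES],
                          int first_cap, int follows_punct,
                          double desc[DESC_LEN])
    {
        int i, total = 0;
        for (i = 0; i < NUM_CATEGORIES; i++) total += cat_freq[i];
        for (i = 0; i < NUM_CATEGORIES; i++)
            desc[i] = (total > 0) ? (double)cat_freq[i] / total : 0.0;
        desc[NUM_CATEGORIES]     = first_cap ? 1.0 : 0.0;
        desc[NUM_CATEGORIES + 1] = follows_punct ? 1.0 : 0.0;
    }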

Classification by Neural Network

The descriptor arrays representing the tokens in the context are used as the input to a fully-connected, feed-forward neural network, shown in Figure 3.

[Figure 2: Elements of the descriptor array assigned to each incoming token: noun, verb, article, modifier, conjunction, pronoun, preposition, proper noun, number, comma or semicolon, left parentheses, right parentheses, non-punctuation character, possessive, colon or dash, abbreviation, sentence-ending punctuation, others.]

Network architecture

The network accepts as input k × 20 input units, where k is the number of words of context surrounding an instance of an end-of-sentence punctuation mark (referred to in this report as "k-context") and 20 is the number of elements in the descriptor array described in the previous section. The input layer is fully connected to a hidden layer consisting of j hidden units with a sigmoidal squashing activation function. The hidden units in turn feed into one output unit, which indicates the result of the function.[13]

The output of the network, a single value between 0 and 1, represents the strength of the evidence that a punctuation mark occurring in its context is indeed the end of the sentence. I define two adjustable sensitivity thresholds, t0 and t1, which are used to classify the results of the disambiguation. If the output is less than t0, the punctuation mark is not a sentence boundary; if the output is greater than or equal to t1, it is a sentence boundary. Outputs which fall between the thresholds cannot be disambiguated by the network and are marked accordingly, so they can be treated specially in later processing. When t0 = t1, no punctuation mark is left ambiguous. The Thresholds section below describes experiments which vary the sensitivity thresholds.
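The decision rule itself is a few lines (a sketch; evaluating the network is elided and the names are hypothetical):

    typedef enum { NOT_BOUNDARY, BOUNDARY, UNLABELED } Label;

    /* Two-threshold decision rule. output is the network's single
     * output value, 0 < output < 1. */
    Label classify_output(double output, double t0, double t1)
    {
        if (output < t0)  return NOT_BOUNDARY;  /* confidently not a boundary    */
        if (output >= t1) return BOUNDARY;      /* confidently a boundary        */
        return UNLABELED;                       /* ambiguous: between thresholds */
    }

With t0 = t1, every punctuation mark receives a label; widening the gap between the thresholds trades labeling coverage for error rate, as the experiments in the Thresholds section show.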

To disambiguate a punctuation mark in a k-context, a window of k+1 tokens and their descriptor arrays is maintained as the input text is read. The first k/2 and final k/2 tokens of this sequence represent the context in which the middle token appears. If the middle token is a potential end-of-sentence punctuation mark, the descriptor arrays for the context tokens are input to the network, and the output result indicates the appropriate label, subject to the thresholds t0 and t1.

[13] This network can be thought of roughly as a Time-Delay Neural Network (TDNN) (Hertz et al.), since it accepts a sequence of inputs and is sensitive to positional information within the sequence. However, since the input information is not really shifted with each time step, but rather only presented to the neural net when a punctuation mark is in the center of the input stream, this is not technically a TDNN.

[Figure 3: Neural Network Architecture. DA = descriptor array of 20 items; the descriptor arrays of the context tokens (20 units each) form the input layer, which feeds a hidden layer and a single output unit producing a value 0 < x < 1.]

Training

Training data consist of two texts in which all boundaries are already labeled. The first text, the training text, contains between … and … test cases, where a test case is an ambiguous punctuation mark. The weights of the neural network are trained on the training text using the standard back-propagation algorithm (Hertz et al.). The second text used in training is the cross-validation text (Bourland and Morgan), which contains between … and … test cases, separate from the training text. Training of the weights is not performed on this text; the cross-validation text is instead used to increase the generalization of the training, such that when the total training error over the cross-validation text reaches a minimum, training is halted. Testing is then performed on texts independent of the training and cross-validation texts. All training times reported in this report were obtained on a Hewlett Packard workstation, unless otherwise noted.
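In outline, the early-stopping scheme looks like this (a sketch only; the hypothetical function names stand in for the back-propagation code in train.c, and the stopping constant plays the role of the stability parameters in common.h):

    typedef struct Net  Net;    /* opaque network and data types */
    typedef struct Case Case;

    #define PATIENCE 50         /* hypothetical stopping constant */

    void   backprop_epoch(Net *net, const Case *data, int n);
    double total_error(const Net *net, const Case *data, int n);
    void   save_weights(const Net *net, const char *path);

    /* Train with back-propagation, halting when the error over the
     * cross-validation set stops improving. */
    void train_early_stopping(Net *net,
                              const Case *train_set, int n_train,
                              const Case *cv_set, int n_cv,
                              int max_epochs)
    {
        double best_cv = 1e30;
        int epoch, since_best = 0;

        for (epoch = 0; epoch < max_epochs; epoch++) {
            backprop_epoch(net, train_set, n_train);  /* one training pass */
            double cv = total_error(net, cv_set, n_cv);
            if (cv < best_cv) {                       /* new CV minimum    */
                best_cv = cv;
                save_weights(net, "weights.net");     /* keep best weights */
                since_best = 0;
            } else if (++since_best >= PATIENCE) {
                break;          /* CV error no longer improving: stop */
            }
        }
    }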

Implementation

I implemented the SATZ system as a series of C modules and UNIX shell scripts. The software is available via anonymous ftp to cstr.CS.Berkeley.EDU, in the directory pub/cstr, as the compressed tar file satz.tar.Z. Appendix A contains the README file I wrote to explain the structure of the software and to assist users in adapting it for their own purposes.

Experiments and Results: English

I tested the SATZ system for the English language using texts from the Wall Street Journal portion of the ACL/DCI collection (Church and Liberman).[14] I first constructed a training text of … test cases and a cross-validation text of … test cases from the WSJ corpus. I then constructed a separate test text consisting of … test cases, with a lower bound of …. The lexicon, and thus the frequency counts used to calculate the descriptor arrays, were taken from the PARTS tagger (Church), which derived the counts from the Brown corpus (Francis and Kucera).

Context Size

In order to determine how much context is necessary to accurately segment sentences in a text, I varied the size of the context and obtained the results in Table 1. The "Training Error" is the least mean squares (Hertz et al.) error: one-half the sum of the squares of all the errors[15] for all items in the training set. The "Cross Error" is the equivalent value for the cross-validation set. These two error figures give an indication of how well the network learned the training data before stopping. From these data I concluded that a …-token context (… preceding the punctuation mark and … following) produces the best results.

[14] Note that constructing a training, cross-validation, or test text simply involves manually inserting a unique character sequence at the end of each sentence.

[15] The error of a particular item is the difference between the desired output and the actual output of the neural net.

    Context    Training  Training  Cross  Testing  Testing
    Size       Epochs    Error     Error  Errors   Error
    …-context  …         …         …      …        …
    …-context  …         …         …      …        …
    …-context  …         …         …      …        …

Table 1: Results of comparing context sizes

    Hidden  Training  Training  Cross  Testing  Testing
    Units   Epochs    Error     Error  Errors   Error
    …       …         …         …      …        …

Table 2: Results of comparing hidden layer sizes (…-context)

Hidden Units

To determine the size of the hidden layer in the neural network which produced the highest output accuracy, I experimented with various hidden layer sizes and obtained the results in Table 2. From these data I concluded that the best accuracy in this case is possible using a neural network with two nodes in its hidden layer.

Sources of Errors

As described in the two preceding sections, the best results were obtained with a context size of … tokens and a hidden layer with two units. This configuration produced a total of … errors out of … test cases, for an accuracy of …. These errors fall into two major categories: (i) "false positive," i.e., a punctuation mark the method erroneously labeled as a sentence boundary, and (ii) "false negative," i.e., an actual sentence boundary which the method did not label as such. Table 3 contains a summary of these errors.

    … false positives
    … false negatives
    … total errors out of … items

Table 3: Results of testing on mixed-case text (… items, t0 = t1 = …, …-context, two hidden units)

These errors can be decomposed into the following groups:

- … false positives at an abbreviation within a title or name, usually because the word following the period exists in the lexicon with other parts-of-speech (Mr. Gray, Col. North, Mr. Major, Dr. Carpenter, Mr. Sharp).

- … false negatives due to an abbreviation at the end of a sentence, most frequently Inc., Co., Corp., or U.S., which all occur within sentences as well.

- … false positives or negatives due to a sequence of characters including a period and quotation marks, as this sequence can occur both within and at the end of sentences.

- … false negatives resulting from an abbreviation followed by quotation marks, related to the previous two types.

- … false positives or false negatives resulting from the presence of an ellipsis, which can occur at the end of or within a sentence.

- … miscellaneous errors, including extraneous characters (dashes, asterisks, etc.), ungrammatical sentences, misspellings, and parenthetical sentences.

The first two items indicate that the system is having difficulty recognizing the function of abbreviations. I attempted to counter this by dividing the abbreviations in the lexicon into two distinct categories: title abbreviations such as Mr. and Dr., which almost never occur at the end of a sentence, and all other abbreviations. This new classification, however, significantly increased the training time and eliminated only … of the errors.

The third and fourth items demonstrate the difficulty of distinguishing subsentences within a sentence. This problem may be addressed by creating a new classification for punctuation marks, the "embedded end-of-sentence," as discussed in the Introduction. The fifth class of error may similarly be addressed by creating a new classification for ellipses and then attempting to determine the role of the ellipses independent of the sentence boundaries.

    Lower   Upper   False  False  Not      Were     Not      Testing
    Thresh  Thresh  Pos    Neg    Labeled  Correct  Labeled  Error
    …       …       …      …      …        …        …        …

Table 4: Results of varying the sensitivity thresholds (… test cases, thresholds t0 and t1, …-context, two hidden units)

Thresholds

As described in the Network Architecture section, the output of the neural network is used to determine the function of a punctuation mark based on its value relative to two sensitivity thresholds, with outputs that fall between the thresholds denoting that the function of the punctuation mark is still ambiguous. These cases are shown in the "Not Labeled" column of Table 4, which gives the results of a systematic experiment with the sensitivity thresholds. As the thresholds were moved from the initial values of … and …, certain items which had been classified as "False Pos" or "False Neg" fell between the thresholds and became "Not Labeled." At the same time, however, items which had been correctly labeled also fell between the thresholds, and these are shown in the "Were Correct" column.[16] There is thus a tradeoff: decreasing the error percentage by adjusting the thresholds also decreases the percentage of cases correctly labeled and increases the percentage of items left ambiguous.

[16] Note that the number of items in the "Were Correct" column is a subset of those in the "Not Labeled" column.

Single-case texts

A major advantage of the SATZ approach to sentence segmentation is its robustness. In contrast to many existing systems, which depend on brittle parameters such as capitalization or spacing, SATZ is able to adapt to texts which are not well-formed, such as single-case texts. The two descriptor array flags for capitalization, discussed in the Descriptor Array Construction section, allow the system to include capitalization information when it is available. When this information is not available, the system is nevertheless able to adapt and produce a high accuracy. To demonstrate this robustness, I converted the training, cross-validation, and test texts used in previous testing to a lower-case-only format with no capital letters. After retraining the neural network with the lower-case-only texts, the SATZ system was able to correctly disambiguate … of the sentence boundaries. After converting the texts to an upper-case-only format with all capital letters, and retraining the network on the texts in this format, the system was able to correctly label ….[17] These results are summarized in Table 5.

    Text        Training    Training  Training  Cross  Testing
    Type        Time (sec)  Epochs    Error     Error  Error
    Lower-case  …           …         …         …      …
    Upper-case  …           …         …         …      …

Table 5: Results on single-case texts (… test cases, t0 = …, t1 = …, …-context, two hidden units)

[17] The difference in results with the upper-case and lower-case formats can probably be attributed to the capitalization flags in the descriptor arrays.

Lexicon size

The lexicon with which I obtained the results of the previous sections was the complete lexicon (over … words) from the PARTS tagger. Such a large lexicon with part-of-speech frequency data is not always available, so it is important to understand the impact a more limited lexicon would have on the accuracy of SATZ. I altered the size of the English lexicon used in training and testing[18] and obtained the results in Table 6. These data demonstrate that a larger lexicon provides faster training and a higher accuracy, although the performance with the smaller lexica was still almost as accurate as before.

[18] The abbreviations in the lexicon remained unchanged. Altering the list of abbreviations might be an interesting future experiment.

    Words in  Training  Training  Cross  Testing  Testing
    Lexicon   Epochs    Error     Error  Errors   Error
    …         …         …         …      …        …

Table 6: Results of comparing lexicon size (… test cases, t0 = …, t1 = …, …-context, two hidden units)

Adaptation to Some Other Languages

Since the disambiguation component of the sentence segmentation algorithm, the neural network, is language-independent, the SATZ system can be easily adapted to some natural languages other than English. Adaptation to other languages involves setting a few language-specific parameters and obtaining or building a small lexicon containing the necessary part-of-speech data. I successfully adapted the SATZ system to German and French, and the results are described below.

German

The German lexicon was built from a series of public-domain word lists obtained from the Consortium for Lexical Research. The lists of German adjectives, verbs, prepositions, articles, and abbreviations were converted to the format described in the section on the lexicon above. In the resulting lexicon of … words, each word was assigned only the parts-of-speech for the lists from which it came, with a frequency of 1 for each part-of-speech. The lexicon contained … German abbreviations. The part-of-speech tags used were identical to those from the English lexicon, and the descriptor array mapping also remained unchanged. This lexicon was used in testing with two separate corpora. The total time required to adapt SATZ to German, including building the lexicon and constructing training texts, was less than one day.

German News Corpus

The German News Corpus was constructed from a series of public-domain German articles distributed internationally by the University of Ulm. It contained over … test cases from the months July-October …, with a lower bound of …. I constructed a training text of … cases from the corpus, as well as a cross-validation text of … cases. The training was completed in … seconds and resulted in a rate of … correctly labeled sentence boundaries in the corpus of … cases. Repeating the training and testing with a lower-case-only format gave an accuracy rate of …. This higher accuracy for the lower-case text might be a result of the German capitalization rules, in which all nouns, not just proper nouns, are capitalized. I performed the training, both mixed-case and lower-case, without altering the heuristics described in the Heuristics for Unknown Words section. Fine-tuning the probabilities for unknown capitalized words in German may increase the mixed-case accuracy.

Süddeutsche Zeitung Corpus

The Süddeutsche Zeitung Corpus, compiled at the University of Munich, consists of several megabytes of online texts from the German newspaper.[19] I constructed a training text of over … items from the Süddeutsche Zeitung Corpus and a cross-validation text of over … items. Training was performed in less than … minutes on a NeXT workstation.[20] When tested on the September … portion of the SZ corpus, containing approximately … items,[21] the SATZ system produced an accuracy comparable to those obtained with Schicht's method, described in the section on regular expressions and heuristic rules above.

[19] All my work with the Süddeutsche Zeitung Corpus was performed in collaboration with Prof. Franz Guenthner and Gabriele Schicht of the Centrum für Informations- und Sprachverarbeitung at the University of Munich.

[20] The NeXT workstation is significantly slower than the Hewlett Packard workstation used in other tests, which accounts for the slower training time.

[21] Due to the large size of the corpus, it was impossible to obtain an exact accuracy percentage, as this would involve manually checking the results.

French

The French lexicon was compiled from the part-of-speech data obtained by running the PARC part-of-speech tagger (Cutting et al.) on a portion of the Canadian Hansards corpus.[22] The lexicon consisted of less than … words assigned parts-of-speech by the tagger, including … French abbreviations appended to the … English abbreviations available from the English lexicon. The part-of-speech tags in the lexicon were different from those used in the English implementation, so the descriptor array mapping had to be adjusted accordingly. Adapting SATZ to French was accomplished in … days. A training text of … test cases was constructed from the Hansards corpus, as well as a cross-validation text of … cases. The training was completed in … seconds, and the trained network was used to label the sentence boundaries in a separate portion of the Hansards corpus containing … punctuation marks, with a lower bound of …. The SATZ system produced an accuracy of … on this text. Repeating the training and testing with a lower-case-only format also gave an accuracy rate of ….

[22] The lexicon and all French texts (training, cross-validation, and test) were constructed by Marti Hearst at Xerox PARC.

Conclusions and Future Directions

The SATZ system offers a robust, rapidly trainable alternative to existing systems, which usually require extensive manual effort to develop and are specifically tailored to a text genre or natural language. By using the prior probabilities of a word's parts-of-speech to represent the context in which the word appears, the system offers significant savings in parameter estimation and training time. Although the systems of Wasson and Riley report slightly better error rates, the SATZ approach has the advantages of flexibility for application to new text genres, small training sets (and thereby fast training times), relatively small storage requirements, and little manual effort.

The boundary labeler was designed to be easily portable to new natural languages, assuming the accessibility of lexical part-of-speech frequency data, which can be obtained by running a part-of-speech tagger over a corpus of text if it is not already available in the tagger itself. The success of applying SATZ to German and French with limited lexica, and the experiments in English lexicon size described above, indicate that the lexicon itself need not be exhaustive. The heuristics used within the system to classify unknown words can compensate for inadequacies in the lexicon, and these heuristics can be easily adjusted to improve performance with a new language. I am currently working to adapt the SATZ system to additional languages, including Dutch, Italian, and Spanish. Since these languages are very similar to the three in which SATZ has already been implemented, it will also be interesting to investigate its effectiveness with languages having different punctuation systems, such as Chinese or Arabic.

While the results presented here indicate that the system in its current incarnation gives good results, many variations remain to be tested. It would be interesting to systematically investigate the effects of asymmetric context sizes, varied part-of-speech categorizations, abbreviation classes, and larger descriptor arrays. Although the neural network used in this system provides a simple, trainable tool for disambiguation, it would be instructive to compare its efficacy to that of a similar system which uses more conventional NLP tools, such as Hidden Markov Models or decision trees.

In the section on representing context, I discussed the representation of context and explained the processing circularity which makes it impossible to obtain the parts-of-speech from a tagger, since the tagger requires the sentence boundaries. It would be possible to use the sentence boundaries labeled by SATZ and then use a tagger to obtain a single part-of-speech for each word. It would then be interesting to see if this more exact part-of-speech data would improve the accuracy of SATZ on the same text.

As discussed in the Introduction, there are several issues in sentence boundary disambiguation I have not addressed, but the methods developed in the SATZ system could be extended to such tasks. For example, different types of sentence boundaries, such as the "embedded end-of-sentence," could be identified in much the same manner, by including the necessary information in the training text and by adding output nodes to the neural network. Another potential extension of the SATZ system, and a further test of its ability to adapt to new text types, would be to train and run it on OCRed text, perhaps to assist in distinguishing punctuation marks or letters.

Acknowledgements

This work would not have been possible without the assistance of Marti Hearst of Xerox PARC, who gave me invaluable guidance through each step of the process. I would also like to thank my research advisor, Prof. Robert Wilensky, and Prof. Jerome Feldman for reading drafts of this report and providing helpful suggestions for its improvement. Thanks also to the other members of the Berkeley Artificial Intelligence Research group for advice and technical support. In the course of this work I was supported by a GAANN Fellowship and by the Advanced Research Projects Agency under Grant No. MDA…J… with the Corporation for National Research Initiatives (CNRI).

References

Hervé Bourland and Nelson Morgan. Connectionist Speech Recognition: A Hybrid Approach. Kluwer Academic Publishers, Norwell, Mass.

Leo Breiman, Jerome H. Friedman, Richard Olshen, and Charles J. Stone. Classification and Regression Trees. Wadsworth International Group, Belmont, CA.

Kenneth W. Church and Mark Y. Liberman. A status report on the ACL/DCI. In Proceedings of the … Annual Conference of the UW Centre for the New OED and Text Research: Using Corpora, pages …, Oxford.

Kenneth W. Church. A stochastic parts program and noun phrase parser for unrestricted text. In Second Conference on Applied Natural Language Processing, pages …, Austin, TX.

Doug Cutting, Julian Kupiec, Jan Pedersen, and Penelope Sibun. A practical part-of-speech tagger. In The 3rd Conference on Applied Natural Language Processing, Trento, Italy.

W. Francis and H. Kucera. Frequency Analysis of English Usage. Houghton Mifflin Co., New York.

William A. Gale and Kenneth W. Church. A program for aligning sentences in bilingual corpora. Computational Linguistics.

John Hertz, Anders Krogh, and Richard G. Palmer. Introduction to the Theory of Neural Computation. Santa Fe Institute Studies in the Sciences of Complexity. Addison-Wesley Pub. Co., Redwood City, CA.

Christiane Hoffmann. Automatische Disambiguierung von Satzgrenzen in einem maschinenlesbaren deutschen Korpus. Unpublished work with Prof. R. Kohler at the University of Trier, Germany.

T.L. Humphrey and F.q. Zhou. Period disambiguation using a neural network. In IJCNN: International Joint Conference on Neural Networks, page …, Washington, DC.

Martin Kay and Martin Roscheinsen. Text-translation alignment. Computational Linguistics.

M.E. Lesk and E. Schmidt. Lex: a lexical analyzer generator. Computing Science Technical Report …, AT&T Bell Laboratories, Murray Hill, NJ.

Mark Y. Liberman and Kenneth W. Church. Text analysis and word pronunciation in text-to-speech synthesis. In Sadaoki Furui and Man Mohan Sondhi, editors, Advances in Speech Signal Processing, pages …. Marcel Dekker, Inc.

Hans Muller, V. Amerl, and G. Natalis. Worterkennungsverfahren als Grundlage einer Universalmethode zur automatischen Segmentierung von Texten in Sätze: Ein Verfahren zur maschinellen Satzgrenzenbestimmung im Englischen. Sprache und Datenverarbeitung.

David D. Palmer and Marti A. Hearst. Adaptive sentence boundary disambiguation. In Proceedings of the Fourth ACL Conference on Applied Natural Language Processing (October, Stuttgart), pages …. Morgan Kaufmann.

Michael D. Riley. Some applications of tree-based modelling to speech and language indexing. In Proceedings of the DARPA Speech and Natural Language Workshop, pages …. Morgan Kaufmann.

Larry Wall and Randal L. Schwartz. Programming perl. O'Reilly and Associates, Inc., Sebastopol, CA.

Appendix A: README to accompany software version of SATZ

This document gives information about the various parts of the SATZ program. Any questions should be directed to dpalmer@cs.berkeley.edu.

FILES (full descriptions in the files themselves):

getpart.c    looks up the token in the lexicon
lex.yy.c     tokenization (created by lex)
netinput.c   formats input to neural net
tagfile.c    labels sentence boundaries in input file
train.c      trains neural net
utilities.c  utilities used by above modules

common.h     all the defines which can be altered for the program
trans.h      mapping of part-of-speech tags to descriptor array slots
utilities.h  function prototypes for utilities.c

weights.net  trained weights for neural net (created by train.c)
tokenize.l   tokenization file for use by lex

UNIX scripts: these scripts provide all the functionality for the program. File names within the scripts can be changed at will, as long as you know what you are doing. Just make sure all the file names exist before running the script.

getfreqs   tokenizes a file and looks it up in the lexicon; this script is
           good for seeing if the lookup is working properly, but is not
           really a part of SATZ itself
           usage: getfreqs <input file>

trainnet   trains the neural net
           usage: trainnet <file with training text>

bound      labels boundaries in the input file
           usage: bound <file to label>

Dictionary files: all dictionaries (or, more accurately, word lists) must contain lines in the following format in order to be readable:

word<tab>TAG/freq<tab>TAG/freq<tab>...<tab>TAGn/freqn

example: fixed<tab>JJ/…<tab>VBD/…<tab>VBN/…

Note: <tab> is \t.

Important: do NOT leave a tab at the end of the line.
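A complete, hypothetical fragment in this format (the slash as tag/frequency separator and the frequency values are illustrative assumptions; \t stands for a literal tab):

    fixed\tJJ/12\tVBD/25\tVBN/30
    plant\tNN/80\tVB/20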

abbrev.dict    abbreviation dictionary
chars.dict     list of necessary characters or char strings (essential)
endings.dict   list of word endings used to guess plurals, gerunds, etc.
propnoun.dict  list of proper nouns (optional)
words.dict     main lexicon

So, to use SATZ properly, you need to do the following:

1. Prepare a training text, which should be an excerpt from the text to be labeled, of about … sentences. All boundaries in the training text must be labeled with the character sequence \s (or whatever sequence you use).

2. Prepare a cross-validation text in the same manner. It should be about … sentences, roughly half the size of the training text. Make sure the script trainnet has the name of this text.

3. Run trainnet on the training text. Training time varies, but should be between … seconds and … minutes. You should see the progress of the training on the screen. Net weights will be stored in weights.net for use by bound. It is always possible that the net won't behave nicely, as neural nets are sometimes prone to do with the backprop algorithm. If this happens, you can try several things: modify the training text and/or cross-validation text slightly; change the learning rate ETA in common.h if the learning is oscillating significantly; change BECOME_STABLE or STAY_STABLE in common.h, which determine the length of training based on the behavior of the cross-validation text; or simply yell lots of obscenities at the author.

4. Run bound on your files. The output is sent to the file name specified in the bound script, so you can change it each time if you want. Just be careful not to overwrite previously labeled text if you label more than one file.

For each new language SATZ is used with, you will need to adjust a few language-specific things in order to maximize performance:

1. Make sure the tagset properly maps into the descriptor array. This is specified in the file trans.h. Four of the slots in the descriptor array are reserved and must stay the same regardless of the mapping (assuming 20 elements in the array):

   element …: miscellaneous/other
   element …: first character capitalized
   element …: capital letter after possible sentence end
   element …: possible end-of-sentence punctuation mark

2. Change several #define declarations in common.h, including:

   DESPERATION_LABEL    a rough estimate of the unknown word distribution
   HYPHEN_LABEL         distribution for unknown hyphenated words
   PROPER_NOUN_FACTOR
   PROPER_NOUN_AFTER_DOT

3. Make sure you have a lexicon of words in that language with prior part-of-speech frequencies in the format above. A list of abbreviations and a list of proper nouns is optional; I use a very small (… items) abbreviation list and no proper noun list and get good results. Just make sure the abbreviation list includes the essential, important ones (in English: Mr., Dr., etc.). If you include these lists, make sure no members of the lists are duplicated in two lists with conflicting pos labels. Also, a list of the most frequently encountered word endings in the language will probably improve performance, as this list is used to guess unknown words based on their endings (plurals, gerunds, etc.). The endings file must contain at least one entry in order for the program to function properly. You can simply enter zzzzzzz<tab>Z/… if you don't want this part to be used, or you could modify the code to never access the endings file.

The file chars.dict contains lots of standard characters encountered, so you don't need to put them in the word dictionary. The token "end" is returned by the tokenizer as a flag for training and testing whenever it encounters the string \s in the text. I chose this string assuming it would never occur naturally in any text. Feel free to change it, but make sure the string you choose is included in the lexicon with the label ….