
cmp-lg/9503019 (20 Mar 1995)

SATZ: An Adaptive Sentence Segmentation System

David D. Palmer

Technical Report No. UCB/CSD …, Computer Science Division (EECS), University of California, Berkeley, December …

Abstract

The segmentation of a text into sentences is a necessary prerequisite for many natural language processing tasks, including part-of-speech tagging and sentence alignment. This is a nontrivial task, however, since end-of-sentence punctuation marks are ambiguous. A period, for example, can denote a decimal point, an abbreviation, the end of a sentence, or even an abbreviation at the end of a sentence. To disambiguate punctuation marks, most systems use brittle, special-purpose grammars and exception rules. Such approaches are usually limited to the text genre for which they were developed and cannot be easily adapted to new text types, nor can they be easily adapted to other natural languages.

As an alternative, I present an efficient, trainable algorithm that can be easily adapted to new text genres and some range of natural languages. The algorithm uses a lexicon with part-of-speech probabilities and a feed-forward neural network for rapid training. The method described requires minimal storage overhead and a very small amount of training data. The algorithm overcomes the limitations of existing methods and produces a very high accuracy.

The results presented demonstrate the successful implementation of the algorithm on a …-sentence English corpus. Training time was less than one minute on a workstation, and the method correctly labeled over … of the sentence boundaries. The method was also successful in labeling texts containing no capital letters. The system has been successfully adapted to German and French; the training times were similarly low, and the resulting accuracy exceeded ….

Contents

Introduction
    The Problem
    Baseline
Previous Approaches
    Regular Expressions and Heuristic Rules
    Regression Trees
    Word endings and word lists
    Feedforward Neural Network
    System Desiderata
The SATZ System
    Tokenizer
    Part-of-speech Lookup
        Representing Context
        The Lexicon
        Heuristics for Unknown Words
    Descriptor Array Construction
    Classification by Neural Network
        Network architecture
        Training
    Implementation
Experiments and Results: English
    Context Size
    Hidden Units
    Sources of Errors
    Thresholds
    Single-case texts
    Lexicon size
Adaptation to Some Other Languages
    German
        German News Corpus
        Süddeutsche Zeitung Corpus
    French
Conclusions and Future Directions

Introduction

The Problem

The sentence is an important unit in many natural language processing tasks.[1] For example, the alignment of sentences in parallel multilingual corpora requires first that the individual sentence boundaries be clearly labeled (Gale and Church; Kay and Roscheinsen). Most part-of-speech taggers[2] also require the disambiguation of sentence boundaries in the input text (Church; Cutting et al.). This is usually accomplished by inserting a unique character sequence at the end of each sentence, such that the NLP tools analyzing the text can easily recognize the individual sentences.

Segmenting a text into sentences is a nontrivial task, however, since all end-of-sentence punctuation marks are ambiguous.[3] A period, for example, can denote a decimal point, an abbreviation, the end of a sentence, or even an abbreviation at the end of a sentence. An exclamation point and a question mark can occur within quotation marks or parentheses, as well as at the end of a sentence. The ambiguity of these punctuation marks is illustrated in the following difficult cases:

(1) The group included Dr. J.M. Freeman and T. Boone Pickens Jr.

(2) "This issue crosses party lines and crosses philosophical lines!" said Rep. John Rowland (R., Conn.).

The existence of punctuation in grammatical subsentences suggests the possibility of a further decomposition of the sentence boundary problem into distinct types of sentence boundaries, one of which would be the "embedded sentence boundary." Such a distinction might be useful for certain applications which analyze the grammatical structure of the sentence. However, in this work I will address the less-specific problem of determining sentence boundaries between sentences.

[1] Much of the work contained in this report has been reported in a similar form in Palmer and Hearst, and portions of this work were done in collaboration with Marti Hearst of Xerox PARC.

[2] The terms "sentence segmentation," "sentence boundary disambiguation," and "sentence boundary labeling" are interchangeable.

[3] In this report I will consider only the period, the exclamation point, and the question mark to be possible end-of-sentence punctuation marks, and all references to "punctuation marks" will refer to these three. Although the colon, the semicolon, and conceivably the comma can also delimit grammatical sentences, their usage is beyond the scope of this work.

In examples (1) and (2), the word immediately preceding and the word immediately following a punctuation mark provide important information about its role in the sentence. However, more context may be necessary, such as when punctuation occurs in a subsentence within quotation marks or parentheses, as seen in example (2), or when an abbreviation appears at the end of a sentence, as seen in (3a-b):

(3a) It was due Friday by 5 p.m. Saturday would be too late.

(3b) She has an appointment at 5 p.m. Saturday to get her car fixed.

The section on Representing Context below contains a discussion of methods of representing context.

Baseline

When evaluating a sentence segmentation algorithm, comparison with a baseline algorithm is an important measure of the algorithm's success. A baseline algorithm in this case is simply a very naive algorithm which would label each punctuation mark as a sentence boundary. Such a baseline algorithm would have an accuracy equal to the lower bound of the text: the percentage of possible sentence-ending punctuation marks in the text which indeed denote sentence boundaries. A good sentence segmentation algorithm will thus have an accuracy much greater than the lower bound.
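As a concrete illustration (my own, not part of the original system), the lower bound can be computed directly from a labeled text by counting the candidate punctuation marks and the subset that truly end sentences:

    #include <stdio.h>

    /* Minimal sketch: the baseline accuracy, or "lower bound," is the
     * fraction of candidate end-of-sentence punctuation marks (periods,
     * exclamation points, question marks) that truly end a sentence. */
    double lower_bound(long candidates, long true_boundaries)
    {
        return candidates > 0 ? (double)true_boundaries / candidates : 0.0;
    }

    int main(void)
    {
        long candidates = 1000, true_boundaries = 900;  /* hypothetical counts */
        printf("baseline accuracy = %.1f%%\n",
               100.0 * lower_bound(candidates, true_boundaries));
        return 0;
    }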

Since the use of abbreviations in a text depends on the particular text and text genre, the number of ambiguous punctuation marks, and therefore the performance of the baseline algorithm, will vary dramatically depending on text genre, and even within a single text genre. For example, Liberman and Church report that the Wall Street Journal corpus contains … periods per million tokens, whereas in the Tagged Brown corpus (Francis and Kucera) the figure is only … periods per million tokens. They also report that … of the periods in the WSJ corpus denote abbreviations (a lower bound of …), compared to only … in the Brown corpus (a lower bound of …) (Riley). In contrast, Muller reports lower bound statistics ranging from … to … within the same corpus of scientific abstracts. Such a range of lower bound figures might suggest the need for a robust approach that can adapt rapidly to different text requirements.

Previous Approaches

Although sentence boundary disambiguation is an essential preprocessing step of many natural language processing systems, it is a topic rarely addressed in the literature, and consequently there are few published references. There are also few public-domain systems for performing the segmentation task, and most current systems are specifically tailored to the particular corpus analyzed and are not designed for general use.

Regular Expressions and Heuristic Rules

The method currently widely used for determining sentence boundaries is a regular grammar, usually with limited lookahead. In the simplest implementation of this method, the grammar rules attempt to find patterns of characters, such as "period-space-capital letter," which usually occur at the end of a sentence. More robust implementations consider the entire word preceding and following the punctuation mark and include extensive word lists and exception lists to attempt to recognize abbreviations and proper nouns. There are several examples of rule-based and heuristic systems for which performance numbers are available.
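A minimal sketch of the simplest pattern-based approach (my illustration, not code from any of the systems cited) flags every "period, whitespace, capital letter" sequence as a candidate boundary using a POSIX regular expression:

    #include <regex.h>
    #include <stdio.h>

    /* Naive pattern matcher: flag "period, whitespace, capital letter" as
     * a likely sentence boundary. Real rule-based systems add word lists
     * and exception rules on top of patterns like this. */
    int main(void)
    {
        regex_t re;
        regmatch_t m;
        const char *text = "It was due Friday by 5 p.m. Saturday would be too late.";
        int offset = 0;

        regcomp(&re, "\\.[ \t]+[A-Z]", REG_EXTENDED);
        while (regexec(&re, text + offset, 1, &m, 0) == 0) {
            printf("candidate boundary at offset %d\n", offset + (int)m.rm_so);
            offset += m.rm_eo;  /* resume the search after this match */
        }
        regfree(&re);
        return 0;
    }

Note that the same pattern fires on "p.m. Saturday" in both sentences of example (3), although only one of the two occurrences is a true boundary; this is exactly the brittleness described in the following paragraphs.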

Christiane Hoffmann used a regular expression approach to classify punctuation marks in a corpus of the German newspaper die tageszeitung, with a lower bound of …. She used the UNIX tool lex (Lesk and Schmidt) and a large abbreviation list to classify occurrences of periods according to their likely function in the text. Tested on … periods from the corpus, her method correctly classified over … of the sentence boundaries. The method was developed specifically for the tageszeitung corpus, and Hoffmann reports that success in applying her method to other corpora would be dependent on the quality of the available abbreviation lists.

Gabriele Schicht[4] developed, over the course of four months, a method for segmenting sentences in a corpus of the German newspaper die Süddeutsche Zeitung. The method uses a program written in the text manipulation language perl (Wall and Schwartz) to analyze a context consisting of the word immediately preceding and the word immediately following each punctuation mark. In the case of a period following a number, the method considers more context: one word before the number and one word after the period. More context is also considered when attempting to recognize abbreviations containing several blank spaces, such as "v. i. S. d. P." (verantwortlich im Sinne des Presserechts). Using a NeXT workstation, the method requires … minutes to classify … cases, as each word in the limited context must be looked up in a …-word lexicon. Schicht reports over … accuracy using the method.[5]

[4] At the University of Munich, Germany.

Mark Wasson and colleagues[6] invested … staff months developing a system that recognizes special tokens (e.g., non-dictionary terms such as proper names, legal statute citations, etc.) as well as sentence boundaries. From this, Wasson built a stand-alone boundary recognizer in the form of a grammar converted into finite automata with … states and … transitions (excluding the lexicon). The resulting system, when tested on … megabytes of news and case law text, achieved an accuracy of … at speeds of … characters per CPU second on a mainframe computer. When tested against upper-case legal text the algorithm still performed very well, achieving accuracies of … and … on test data of … and … periods, respectively. It is not likely, however, that the results would be this strong on lower-case data.[7]

Although the regular grammar approach can be successful, it requires a large manual effort to compile the individual rules used to recognize the sentence boundaries. Such efforts are usually developed specifically for a text corpus (Liberman and Church; Hoffmann) and would probably not be portable to other text genres. Because of their reliance on special language-specific word lists, they are not portable to other natural languages without repeating the effort of compiling extensive lists and rewriting rules. In addition, heuristic approaches depend on having a well-behaved corpus with regular punctuation and few extraneous characters, and they would probably not be very successful with texts obtained via optical character recognition (OCR).

[5] All information about this system is courtesy of a personal communication with Gabriele Schicht.

[6] At Mead Data Central.

[7] All information about Mead's system is courtesy of a personal communication between Marti Hearst and Mark Wasson.

Regression Trees

Riley describes an approach that uses regression trees (Breiman et al.) to classify sentence boundaries according to the following features:

- Probability[word preceding "." occurs at end of sentence]
- Probability[word following "." occurs at beginning of sentence]
- Length of word preceding "."
- Length of word after "."
- Case of word preceding ".": Upper, Lower, Cap, Numbers
- Case of word following ".": Upper, Lower, Cap, Numbers
- Punctuation after "." (if any)
- Abbreviation class of word with "."

The method uses information about one word of context on either side of the punctuation mark and thus must record, for every word in the lexicon, the probability that it occurs next to a sentence boundary. Probabilities were compiled from … million words of pre-labeled training data from a corpus of AP newswire. The results were tested on the Brown corpus, achieving an accuracy of ….[8]

[8] Time for training was not reported, nor was the amount of the Brown corpus against which testing was performed; it is assumed the entire Brown corpus was used.
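For concreteness, this feature set can be pictured as one record per candidate period (a sketch of my own for illustration; the field names are hypothetical and do not come from Riley's implementation):

    /* One feature record per candidate period, following the list above. */
    typedef enum { CASE_UPPER, CASE_LOWER, CASE_CAP, CASE_NUMBERS } WordCase;

    typedef struct {
        double   p_end;         /* P(word preceding "." occurs at end of sentence)   */
        double   p_begin;       /* P(word following "." occurs at start of sentence) */
        int      len_before;    /* length of word preceding "."                      */
        int      len_after;     /* length of word after "."                          */
        WordCase case_before;   /* case of word preceding "."                        */
        WordCase case_after;    /* case of word following "."                        */
        char     punct_after;   /* punctuation after "." (if any), else '\0'         */
        int      abbrev_class;  /* abbreviation class of word with "."               */
    } BoundaryFeatures;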

Word endings and word lists

Muller provides an exhaustive analysis of sentence boundary disambiguation as it relates to lexical endings and the identification of abbreviations and words surrounding a punctuation mark, focusing on text written in English. This approach makes multiple passes through the data to find recognizable suffixes and thereby filters out words which aren't likely to be abbreviations. The morphological analysis makes it possible to identify words which are not otherwise present in the extensive word lists used to identify abbreviations. Accuracy rates of … are reported for this method, tested on over … scientific abstracts with a lower bound ranging from … to ….

Feedforward Neural Network

Humphrey and Zhou report using a feed-forward neural network to disambiguate periods and achieve an accuracy averaging …. They use a regular grammar to tokenize the text before training the neural nets, but no further details of their approach are available.[9]

[9] Accuracy results were obtained courtesy of a personal communication between Marti Hearst and Joe Zhou.

System Desiderata

Each of the approaches described above has disadvantages to overcome. A successful sentence-boundary disambiguation algorithm should have the following characteristics:

- The approach should be robust, and should not require a hand-built grammar or specialized rules that depend heavily on capitalization, multiple spaces between sentences, etc. Thus the approach should adapt easily to new text genres and some new languages.

- The approach should train quickly on a small training set and should not require excessive storage overhead.

- The approach's results should be very accurate, and it should be efficient enough that it does not noticeably slow down text preprocessing.

- The approach should be able to specify "no opinion" on cases that are too difficult to disambiguate, rather than making under-informed guesses.

In the following sections I present an approach that meets each of these criteria, produces a very low error rate, and behaves more robustly than solutions that require manually designed rules.

The SATZ System

This section describes the structure of my adaptive sentence segmentation system, known as SATZ.[10] My approach in the SATZ system is to represent the context surrounding a punctuation mark as a series of vectors of probabilities. The probabilities used for each word in the context are the prior part-of-speech probabilities, obtained from a lexicon containing part-of-speech frequency data. The context vectors, or descriptor arrays, are used as input to a neural network trained to disambiguate sentence boundaries. The output of the neural network is then used to determine the role of the punctuation mark in the sentence. The architecture of the system is shown in Figure 1, and the following sections describe the individual stages in the process.

Tokenizer

The first stage of the process is lexical analysis, which breaks the input text, a stream of characters, into tokens. The SATZ tokenizer is implemented using the UNIX tool lex (Lesk and Schmidt) and is a slightly-modified version of the tokenizer from the PARTS part-of-speech tagger (Church). The tokens returned by the lex program can be a sequence of alphabetic characters (i.e., words), a sequence of digits,[11] or a single non-alphanumeric character such as a period or quotation mark.
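The token classes just described can be sketched as follows (my illustration in plain C, not the actual lex-generated tokenizer), including the footnoted rule that a number containing a decimal point remains a single token:

    #include <ctype.h>

    typedef enum { TOK_WORD, TOK_NUMBER, TOK_PUNCT } TokenType;

    /* Classify a non-empty token string. A digit sequence with internal
     * periods, e.g. "3.14", is a single number token, which removes that
     * ambiguity of the period at the lexical stage. */
    TokenType classify_token(const char *tok)
    {
        size_t i;

        if (isdigit((unsigned char)tok[0])) {
            for (i = 0; tok[i] != '\0'; i++)
                if (!isdigit((unsigned char)tok[i]) && tok[i] != '.')
                    break;
            if (tok[i] == '\0')
                return TOK_NUMBER;   /* includes decimals such as "3.14" */
        }
        if (isalpha((unsigned char)tok[0]))
            return TOK_WORD;         /* sequence of alphabetic characters  */
        return TOK_PUNCT;            /* single non-alphanumeric character  */
    }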

[10] "Satz" is the German word for "sentence."

[11] Numbers containing periods acting as decimal points are considered a single token. This eliminates one possible ambiguity of the period at the lexical analysis stage.

[Figure 1: SATZ Architecture. Input Text → Tokenization → Part-of-speech Lookup → Descriptor array construction → Classification by neural network → Text with sentence boundaries disambiguated.]

Part-of-speech Lookup

Representing Context

The context surrounding a punctuation mark can be represented in various ways. The simplest and most straightforward is to use the individual words preceding and following the punctuation mark, as in this example using three words of context on either side of the punctuation mark:

    at the plant. He had thought

For each word in the language we would then determine how likely it is to come at the end or beginning of a sentence. However, compiling these figures for each word in a language is very time-consuming and requires large amounts of storage, and it is unlikely that such information will be useful to later stages of processing.

As an alternative, the context could be approximated by using a single part-of-speech for each word. The above context would then be represented by the following part-of-speech sequence:

    preposition article noun
    pronoun verb verb

Requiring a single part-of-speech for each word presents a processing circularity: most part-of-speech taggers require predetermined sentence boundaries, so sentence labeling must be done before tagging. But if sentence labeling is done before tagging, no part-of-speech assignments are available for the boundary-determination algorithm.

To avoid this processing circularity, and to avoid the need for a single part-of-speech for each word, the context can be further approximated by the prior probabilities of all parts-of-speech for each word. Each word in the context would thus be represented by a series of possible parts-of-speech, as well as the probability that the word occurs as each part-of-speech. Continuing the example, the context becomes:

    preposition article noun/verb
    pronoun verb noun/verb

This denotes that "at" and "the" have a probability of … of occurring as a preposition and an article respectively, "plant" has a probability of … of occurring as a noun and a probability of … of occurring as a verb, and so on. These probabilities are based on occurrences of the words in a pre-tagged corpus and are therefore corpus-dependent. Such part-of-speech information is often used by a part-of-speech tagger and would thus be readily available, and it would not require excessive storage overhead. For these reasons I chose to approximate the context in my system by using the prior part-of-speech probabilities.

The Lexicon

An important component of the SATZ system is the lexicon containing part-of-speech frequency data, from which the probabilities are calculated. Words in the lexicon are followed by a series of part-of-speech tags and associated frequencies, representing the possible parts-of-speech for that word and the frequency with which the word occurs as each part-of-speech. The lexical lookup stage of the SATZ system finds a word in the lexicon, if it is present, and returns the possible parts-of-speech. For the English word "well," for example, the lookup module might return the tags

    JJ NN QL RB UH VB

indicating that, in the corpus on which the lexicon is based,[12] the word "well" occurred … times as an adjective, … times as a singular noun, … times as a qualifier, … times as an adverb, … times as an interjection, and … times as a singular verb.

[12] In this example the frequencies are derived from the Brown corpus (Francis and Kucera).
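A minimal sketch of such a lexicon entry and its lookup (my illustration; the actual lookup lives in getpart.c, and its data structures may differ):

    #include <string.h>

    #define MAX_TAGS 16

    /* One lexicon entry: the word, its possible part-of-speech tags, and
     * the corpus frequency with which it occurs as each. */
    typedef struct {
        const char *word;
        const char *tag[MAX_TAGS];   /* e.g. "JJ", "NN", "RB", ...  */
        int         freq[MAX_TAGS];  /* occurrences with that tag   */
        int         ntags;
    } LexEntry;

    /* Linear search stands in for whatever indexing the real system uses. */
    const LexEntry *lex_lookup(const LexEntry *lex, int n, const char *w)
    {
        int i;
        for (i = 0; i < n; i++)
            if (strcmp(lex[i].word, w) == 0)
                return &lex[i];
        return NULL;   /* unknown word: handled by the heuristics below */
    }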

Heuristics for Unknown Words

If a word is not present in the lexicon, the system contains a set of heuristics which attempt to assign the most reasonable parts-of-speech to the word. A summary of these heuristics follows (a compressed code sketch appears after this section):

- Unknown tokens containing a digit are assumed to be numbers.

- Any token beginning with a period, exclamation point, or question mark is assigned a "possible end-of-sentence punctuation" tag. This catches common sequences like ….

- Common morphological endings are recognized, and the appropriate part-of-speech is assigned to the entire word.

- Words containing a hyphen are assigned a series of tags and frequencies denoting "unknown hyphenated word."

- Words containing an internal period are assumed to be abbreviations.

- Capitalized words are not always proper nouns, even when they appear somewhere other than in a sentence's initial position; e.g., the word "American" is often used as an adjective. Such words, when not present in the lexicon, are assigned a certain probability (… for English) of being a proper noun.

- Capitalized words appearing in the lexicon but not registered as proper nouns can nevertheless still be proper nouns. In addition to the part-of-speech frequencies present in the lexicon, these words are assigned a certain probability (… for English) of being a proper noun.

- As a last resort, the word is assigned a series of possible tags with a uniform frequency distribution.

These heuristics can be easily modified and adapted to the specific needs of a new language. For example, the probability of a capitalized word being a proper noun is higher in English than in German, where all nouns are also capitalized.
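The dispatch logic can be compressed into a short, self-contained sketch (hypothetical category slots and constants; the real values live in common.h, and the morphological-ending and hyphen cases are omitted here for brevity):

    #include <ctype.h>
    #include <string.h>

    #define NUM_CATEGORIES 18
    enum { CAT_NOUN = 0, CAT_VERB = 1, CAT_PROPER_NOUN = 7,
           CAT_NUMBER = 8, CAT_ABBREV = 15, CAT_EOS_PUNCT = 16 };

    /* Fill prob[] with a guessed part-of-speech distribution for an
     * unknown, non-empty token. */
    void guess_unknown(const char *tok, double prob[NUM_CATEGORIES])
    {
        int i;
        for (i = 0; i < NUM_CATEGORIES; i++) prob[i] = 0.0;

        if (strpbrk(tok, "0123456789")) { prob[CAT_NUMBER] = 1.0; return; }
        if (strchr(".!?", tok[0]))      { prob[CAT_EOS_PUNCT] = 1.0; return; }
        if (strchr(tok + 1, '.'))       { prob[CAT_ABBREV] = 1.0; return; }

        if (isupper((unsigned char)tok[0])) {
            /* Capitalized unknown word: assign some probability of being a
             * proper noun (language-specific; it would be lower for German,
             * where all nouns are capitalized). 0.9 is a made-up value. */
            prob[CAT_PROPER_NOUN] = 0.9;
            prob[CAT_NOUN] = prob[CAT_VERB] = 0.05;
            return;
        }
        /* Last resort: uniform distribution over all categories. */
        for (i = 0; i < NUM_CATEGORIES; i++) prob[i] = 1.0 / NUM_CATEGORIES;
    }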

Descriptor Array Construction

For each token in the input text, we need to construct a vector of probabilities to numerically describe the token. This vector is known as a descriptor array. The lexicon may contain as many as … very specific parts-of-speech, which we first need to map into more general categories. For example, the tags for present tense verb, past participle, and modal verb all map into the more general "verb" category. The parts-of-speech returned by the lookup module are thus mapped into the general categories given in Figure 2, and the frequencies for each category are summed. The category frequencies for the word are then converted to probabilities by dividing the frequency for each category by the total frequency over all categories. In addition to these probabilities, the descriptor array also contains two additional flags that indicate if the word begins with a capital letter and if it follows a punctuation mark, for a total of 20 items in each descriptor array.
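This construction is small enough to show in full (a sketch with hypothetical names; 18 general categories as in Figure 2, plus the two flags):

    #define NUM_CATEGORIES 18
    #define DESC_LEN (NUM_CATEGORIES + 2)  /* + capitalization and
                                              follows-punctuation flags */

    /* Build one 20-element descriptor array: summed general-category
     * frequencies normalized to probabilities, then the two flags. */
    void build_descriptor(const int cat_freq[NUM_CATEGORIES],
                          int first_cap, int follows_punct,
                          double desc[DESC_LEN])
    {
        int i, total = 0;
        for (i = 0; i < NUM_CATEGORIES; i++) total += cat_freq[i];
        for (i = 0; i < NUM_CATEGORIES; i++)
            desc[i] = (total > 0) ? (double)cat_freq[i] / total : 0.0;
        desc[NUM_CATEGORIES]     = first_cap ? 1.0 : 0.0;
        desc[NUM_CATEGORIES + 1] = follows_punct ? 1.0 : 0.0;
    }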

Classification by Neural Network

The descriptor arrays representing the tokens in the context are used as the input to a fully-connected, feed-forward neural network, shown in Figure 3.

[Figure 2: Elements of the descriptor array assigned to each incoming token: noun, verb, article, modifier, conjunction, pronoun, preposition, proper noun, number, comma or semicolon, left parentheses, right parentheses, non-punctuation character, possessive, colon or dash, abbreviation, sentence-ending punctuation, others.]

Network architecture

The network accepts as input k × 20 input units, where k is the number of words of context surrounding an instance of an end-of-sentence punctuation mark (referred to in this report as "k-context") and 20 is the number of elements in the descriptor array described in the previous section. The input layer is fully connected to a hidden layer consisting of j hidden units with a sigmoidal squashing activation function. The hidden units in turn feed into one output unit, which indicates the result of the function.[13]

The output of the network, a single value between 0 and 1, represents the strength of the evidence that a punctuation mark occurring in its context is indeed the end of the sentence. I define two adjustable sensitivity thresholds, t0 and t1, which are used to classify the results of the disambiguation. If the output is less than t0, the punctuation mark is not a sentence boundary; if the output is greater than or equal to t1, it is a sentence boundary. Outputs which fall between the thresholds cannot be disambiguated by the network and are marked accordingly, so they can be treated specially in later processing. When t0 = t1, no punctuation mark is left ambiguous. The Thresholds section below describes experiments which vary the sensitivity thresholds.
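The decision rule itself is a few lines (a sketch; evaluating the network is elided and the names are hypothetical):

    typedef enum { NOT_BOUNDARY, BOUNDARY, UNLABELED } Label;

    /* Two-threshold decision rule. output is the network's single
     * output value, 0 < output < 1. */
    Label classify_output(double output, double t0, double t1)
    {
        if (output < t0)  return NOT_BOUNDARY;  /* confidently not a boundary    */
        if (output >= t1) return BOUNDARY;      /* confidently a boundary        */
        return UNLABELED;                       /* ambiguous: between thresholds */
    }

With t0 = t1, every punctuation mark receives a label; widening the gap between the thresholds trades labeling coverage for error rate, as the experiments in the Thresholds section show.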

To disambiguate a punctuation mark in a k-context, a window of k+1 tokens and their descriptor arrays is maintained as the input text is read. The first k/2 and final k/2 tokens of this sequence represent the context in which the middle token appears. If the middle token is a potential end-of-sentence punctuation mark, the descriptor arrays for the context tokens are input to the network, and the output result indicates the appropriate label, subject to the thresholds t0 and t1.

[13] This network can be thought of roughly as a Time-Delay Neural Network (TDNN) (Hertz et al.), since it accepts a sequence of inputs and is sensitive to positional information within the sequence. However, since the input information is not really shifted with each time step, but rather only presented to the neural net when a punctuation mark is in the center of the input stream, this is not technically a TDNN.

[Figure 3: Neural Network Architecture. DA = descriptor array of 20 items; the descriptor arrays of the context tokens (20 units each) form the input layer, which feeds a hidden layer and a single output unit producing a value 0 < x < 1.]

Training

Training data consist of two texts in which all boundaries are already labeled. The first text, the training text, contains between … and … test cases, where a test case is an ambiguous punctuation mark. The weights of the neural network are trained on the training text using the standard back-propagation algorithm (Hertz et al.). The second text used in training is the cross-validation text (Bourland and Morgan), which contains between … and … test cases, separate from the training text. Training of the weights is not performed on this text; the cross-validation text is instead used to increase the generalization of the training, such that when the total training error over the cross-validation text reaches a minimum, training is halted. Testing is then performed on texts independent of the training and cross-validation texts. All training times reported in this report were obtained on a Hewlett Packard workstation, unless otherwise noted.
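In outline, the early-stopping scheme looks like this (a sketch only; the hypothetical function names stand in for the back-propagation code in train.c, and the stopping constant plays the role of the stability parameters in common.h):

    typedef struct Net  Net;    /* opaque network and data types */
    typedef struct Case Case;

    #define PATIENCE 50         /* hypothetical stopping constant */

    void   backprop_epoch(Net *net, const Case *data, int n);
    double total_error(const Net *net, const Case *data, int n);
    void   save_weights(const Net *net, const char *path);

    /* Train with back-propagation, halting when the error over the
     * cross-validation set stops improving. */
    void train_early_stopping(Net *net,
                              const Case *train_set, int n_train,
                              const Case *cv_set, int n_cv,
                              int max_epochs)
    {
        double best_cv = 1e30;
        int epoch, since_best = 0;

        for (epoch = 0; epoch < max_epochs; epoch++) {
            backprop_epoch(net, train_set, n_train);  /* one training pass */
            double cv = total_error(net, cv_set, n_cv);
            if (cv < best_cv) {                       /* new CV minimum    */
                best_cv = cv;
                save_weights(net, "weights.net");     /* keep best weights */
                since_best = 0;
            } else if (++since_best >= PATIENCE) {
                break;          /* CV error no longer improving: stop */
            }
        }
    }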

Implementation

I implemented the SATZ system as a series of C modules and UNIX shell scripts. The software is available via anonymous ftp to cstr.CS.Berkeley.EDU, in the directory pub/cstr, as the compressed tar file satz.tar.Z. Appendix A contains the README file I wrote to explain the structure of the software and to assist users in adapting it for their own purposes.

Experiments and Results: English

I tested the SATZ system for the English language using texts from the Wall Street Journal portion of the ACL/DCI collection (Church and Liberman).[14] I first constructed a training text of … test cases and a cross-validation text of … test cases from the WSJ corpus. I then constructed a separate test text consisting of … test cases, with a lower bound of …. The lexicon, and thus the frequency counts used to calculate the descriptor arrays, were taken from the PARTS tagger (Church), which derived the counts from the Brown corpus (Francis and Kucera).

Context Size

In order to determine how much context is necessary to accurately segment sentences in a text, I varied the size of the context and obtained the results in Table 1. The "Training Error" is the least mean squares (Hertz et al.) error: one-half the sum of the squares of all the errors[15] for all items in the training set. The "Cross Error" is the equivalent value for the cross-validation set. These two error figures give an indication of how well the network learned the training data before stopping. From these data I concluded that a …-token context (… preceding the punctuation mark and … following) produces the best results.

[14] Note that constructing a training, cross-validation, or test text simply involves manually inserting a unique character sequence at the end of each sentence.

[15] The error of a particular item is the difference between the desired output and the actual output of the neural net.

    Context    Training  Training  Cross  Testing  Testing
    Size       Epochs    Error     Error  Errors   Error
    …-context  …         …         …      …        …
    …-context  …         …         …      …        …
    …-context  …         …         …      …        …

Table 1: Results of comparing context sizes

    Hidden  Training  Training  Cross  Testing  Testing
    Units   Epochs    Error     Error  Errors   Error
    …       …         …         …      …        …

Table 2: Results of comparing hidden layer sizes (…-context)

Hidden Units

To determine the size of the hidden layer in the neural network which produced the highest output accuracy, I experimented with various hidden layer sizes and obtained the results in Table 2. From these data I concluded that the best accuracy in this case is possible using a neural network with two nodes in its hidden layer.

Sources of Errors

As described in the two preceding sections, the best results were obtained with a context size of … tokens and a hidden layer with two units. This configuration produced a total of … errors out of … test cases, for an accuracy of …. These errors fall into two major categories: (i) "false positive," i.e., a punctuation mark the method erroneously labeled as a sentence boundary, and (ii) "false negative," i.e., an actual sentence boundary which the method did not label as such. Table 3 contains a summary of these errors.

    … false positives
    … false negatives
    … total errors out of … items

Table 3: Results of testing on mixed-case text (… items, t0 = t1 = …, …-context, two hidden units)

These errors can be decomposed into the following groups:

- … false positives at an abbreviation within a title or name, usually because the word following the period exists in the lexicon with other parts-of-speech (Mr. Gray, Col. North, Mr. Major, Dr. Carpenter, Mr. Sharp).

- … false negatives due to an abbreviation at the end of a sentence, most frequently Inc., Co., Corp., or U.S., which all occur within sentences as well.

- … false positives or negatives due to a sequence of characters including a period and quotation marks, as this sequence can occur both within and at the end of sentences.

- … false negatives resulting from an abbreviation followed by quotation marks, related to the previous two types.

- … false positives or false negatives resulting from the presence of an ellipsis, which can occur at the end of or within a sentence.

- … miscellaneous errors, including extraneous characters (dashes, asterisks, etc.), ungrammatical sentences, misspellings, and parenthetical sentences.

The first two items indicate that the system is having difficulty recognizing the function of abbreviations. I attempted to counter this by dividing the abbreviations in the lexicon into two distinct categories: title abbreviations such as Mr. and Dr., which almost never occur at the end of a sentence, and all other abbreviations. This new classification, however, significantly increased the training time and eliminated only … of the errors.

The third and fourth items demonstrate the difficulty of distinguishing subsentences within a sentence. This problem may be addressed by creating a new classification for punctuation marks, the "embedded end-of-sentence," as discussed in the Introduction. The fifth class of error may similarly be addressed by creating a new classification for ellipses and then attempting to determine the role of the ellipses independent of the sentence boundaries.

    Lower   Upper   False  False  Not      Were     Not      Testing
    Thresh  Thresh  Pos    Neg    Labeled  Correct  Labeled  Error
    …       …       …      …      …        …        …        …

Table 4: Results of varying the sensitivity thresholds (… test cases, thresholds t0 and t1, …-context, two hidden units)

Thresholds

As described in the Network Architecture section, the output of the neural network is used to determine the function of a punctuation mark based on its value relative to two sensitivity thresholds, with outputs that fall between the thresholds denoting that the function of the punctuation mark is still ambiguous. These cases are shown in the "Not Labeled" column of Table 4, which gives the results of a systematic experiment with the sensitivity thresholds. As the thresholds were moved from the initial values of … and …, certain items which had been classified as "False Pos" or "False Neg" fell between the thresholds and became "Not Labeled." At the same time, however, items which had been correctly labeled also fell between the thresholds, and these are shown in the "Were Correct" column.[16] There is thus a tradeoff: decreasing the error percentage by adjusting the thresholds also decreases the percentage of cases correctly labeled and increases the percentage of items left ambiguous.

[16] Note that the number of items in the "Were Correct" column is a subset of those in the "Not Labeled" column.

Single-case texts

A major advantage of the SATZ approach to sentence segmentation is its robustness. In contrast to many existing systems, which depend on brittle parameters such as capitalization or spacing, SATZ is able to adapt to texts which are not well-formed, such as single-case texts. The two descriptor array flags for capitalization, discussed in the Descriptor Array Construction section, allow the system to include capitalization information when it is available. When this information is not available, the system is nevertheless able to adapt and produce a high accuracy. To demonstrate this robustness, I converted the training, cross-validation, and test texts used in previous testing to a lower-case-only format with no capital letters. After retraining the neural network with the lower-case-only texts, the SATZ system was able to correctly disambiguate … of the sentence boundaries. After converting the texts to an upper-case-only format with all capital letters, and retraining the network on the texts in this format, the system was able to correctly label ….[17] These results are summarized in Table 5.

    Text        Training    Training  Training  Cross  Testing
    Type        Time (sec)  Epochs    Error     Error  Error
    Lower-case  …           …         …         …      …
    Upper-case  …           …         …         …      …

Table 5: Results on single-case texts (… test cases, t0 = …, t1 = …, …-context, two hidden units)

[17] The difference in results with the upper-case and lower-case formats can probably be attributed to the capitalization flags in the descriptor arrays.

Lexicon size

The lexicon with which I obtained the results of the previous sections was the complete lexicon (over … words) from the PARTS tagger. Such a large lexicon with part-of-speech frequency data is not always available, so it is important to understand the impact a more limited lexicon would have on the accuracy of SATZ. I altered the size of the English lexicon used in training and testing[18] and obtained the results in Table 6. These data demonstrate that a larger lexicon provides faster training and a higher accuracy, although the performance with the smaller lexica was still almost as accurate as before.

[18] The abbreviations in the lexicon remained unchanged. Altering the list of abbreviations might be an interesting future experiment.

    Words in  Training  Training  Cross  Testing  Testing
    Lexicon   Epochs    Error     Error  Errors   Error
    …         …         …         …      …        …

Table 6: Results of comparing lexicon size (… test cases, t0 = …, t1 = …, …-context, two hidden units)

Adaptation to Some Other Languages

Since the disambiguation component of the sentence segmentation algorithm, the neural network, is language-independent, the SATZ system can be easily adapted to some natural languages other than English. Adaptation to other languages involves setting a few language-specific parameters and obtaining or building a small lexicon containing the necessary part-of-speech data. I successfully adapted the SATZ system to German and French, and the results are described below.

German

The German lexicon was built from a series of public-domain word lists obtained from the Consortium for Lexical Research. The lists of German adjectives, verbs, prepositions, articles, and abbreviations were converted to the format described in the section on the lexicon above. In the resulting lexicon of … words, each word was assigned only the parts-of-speech for the lists from which it came, with a frequency of 1 for each part-of-speech. The lexicon contained … German abbreviations. The part-of-speech tags used were identical to those from the English lexicon, and the descriptor array mapping also remained unchanged. This lexicon was used in testing with two separate corpora. The total time required to adapt SATZ to German, including building the lexicon and constructing training texts, was less than one day.

German News Corpus

The German News Corpus was constructed from a series of public-domain German articles distributed internationally by the University of Ulm. It contained over … test cases from the months July-October …, with a lower bound of …. I constructed a training text of … cases from the corpus, as well as a cross-validation text of … cases. The training was completed in … seconds and resulted in a rate of … correctly labeled sentence boundaries in the corpus of … cases. Repeating the training and testing with a lower-case-only format gave an accuracy rate of …. This higher accuracy for the lower-case text might be a result of the German capitalization rules, in which all nouns, not just proper nouns, are capitalized. I performed the training, both mixed-case and lower-case, without altering the heuristics described in the Heuristics for Unknown Words section. Fine-tuning the probabilities for unknown capitalized words in German may increase the mixed-case accuracy.

Süddeutsche Zeitung Corpus

The Süddeutsche Zeitung Corpus, compiled at the University of Munich, consists of several megabytes of online texts from the German newspaper.[19] I constructed a training text of over … items from the Süddeutsche Zeitung Corpus and a cross-validation text of over … items. Training was performed in less than … minutes on a NeXT workstation.[20] When tested on the September … portion of the SZ corpus, containing approximately … items,[21] the SATZ system produced an accuracy comparable to those obtained with Schicht's method, described in the section on regular expressions and heuristic rules above.

[19] All my work with the Süddeutsche Zeitung Corpus was performed in collaboration with Prof. Franz Guenthner and Gabriele Schicht of the Centrum für Informations- und Sprachverarbeitung at the University of Munich.

[20] The NeXT workstation is significantly slower than the Hewlett Packard workstation used in other tests, which accounts for the slower training time.

[21] Due to the large size of the corpus, it was impossible to obtain an exact accuracy percentage, as this would involve manually checking the results.

French

The French lexicon was compiled from the part-of-speech data obtained by running the PARC part-of-speech tagger (Cutting et al.) on a portion of the Canadian Hansards corpus.[22] The lexicon consisted of less than … words assigned parts-of-speech by the tagger, including … French abbreviations appended to the … English abbreviations available from the English lexicon. The part-of-speech tags in the lexicon were different from those used in the English implementation, so the descriptor array mapping had to be adjusted accordingly. Adapting SATZ to French was accomplished in … days. A training text of … test cases was constructed from the Hansards corpus, as well as a cross-validation text of … cases. The training was completed in … seconds, and the trained network was used to label the sentence boundaries in a separate portion of the Hansards corpus containing … punctuation marks, with a lower bound of …. The SATZ system produced an accuracy of … on this text. Repeating the training and testing with a lower-case-only format also gave an accuracy rate of ….

[22] The lexicon and all French texts (training, cross-validation, and test) were constructed by Marti Hearst at Xerox PARC.

Conclusions and Future Directions

The SATZ system offers a robust, rapidly trainable alternative to existing systems, which usually require extensive manual effort to develop and are specifically tailored to a text genre or natural language. By using the prior probabilities of a word's parts-of-speech to represent the context in which the word appears, the system offers significant savings in parameter estimation and training time. Although the systems of Wasson and Riley report slightly better error rates, the SATZ approach has the advantages of flexibility for application to new text genres, small training sets (and thereby fast training times), relatively small storage requirements, and little manual effort.

The boundary labeler was designed to be easily portable to new natural languages, assuming the accessibility of lexical part-of-speech frequency data, which can be obtained by running a part-of-speech tagger over a corpus of text if it is not already available in the tagger itself. The success of applying SATZ to German and French with limited lexica, and the experiments in English lexicon size described above, indicate that the lexicon itself need not be exhaustive. The heuristics used within the system to classify unknown words can compensate for inadequacies in the lexicon, and these heuristics can be easily adjusted to improve performance with a new language. I am currently working to adapt the SATZ system to additional languages, including Dutch, Italian, and Spanish. Since these languages are very similar to the three in which SATZ has already been implemented, it will also be interesting to investigate its effectiveness with languages having different punctuation systems, such as Chinese or Arabic.

While the results presented here indicate that the system in its current incarnation gives good results, many variations remain to be tested. It would be interesting to systematically investigate the effects of asymmetric context sizes, varied part-of-speech categorizations, abbreviation classes, and larger descriptor arrays. Although the neural network used in this system provides a simple, trainable tool for disambiguation, it would be instructive to compare its efficacy to that of a similar system which uses more conventional NLP tools, such as Hidden Markov Models or decision trees.

In the section on representing context, I discussed the representation of context and explained the processing circularity which makes it impossible to obtain the parts-of-speech from a tagger, since the tagger requires the sentence boundaries. It would be possible to use the sentence boundaries labeled by SATZ and then use a tagger to obtain a single part-of-speech for each word. It would then be interesting to see if this more exact part-of-speech data would improve the accuracy of SATZ on the same text.

As discussed in the Introduction, there are several issues in sentence boundary disambiguation I have not addressed, but the methods developed in the SATZ system could be extended to such tasks. For example, different types of sentence boundaries, such as the "embedded end-of-sentence," could be identified in much the same manner, by including the necessary information in the training text and by adding output nodes to the neural network. Another potential extension of the SATZ system, and a further test of its ability to adapt to new text types, would be to train and run it on OCRed text, perhaps to assist in distinguishing punctuation marks or letters.

Acknowledgements

This work would not have been possible without the assistance of Marti Hearst of Xerox PARC, who gave me invaluable guidance through each step of the process. I would also like to thank my research advisor, Prof. Robert Wilensky, and Prof. Jerome Feldman for reading drafts of this report and providing helpful suggestions for its improvement. Thanks also to the other members of the Berkeley Artificial Intelligence Research group for advice and technical support. In the course of this work I was supported by a GAANN Fellowship and by the Advanced Research Projects Agency under Grant No. MDA…J… with the Corporation for National Research Initiatives (CNRI).

References

Hervé Bourland and Nelson Morgan. Connectionist Speech Recognition: A Hybrid Approach. Kluwer Academic Publishers, Norwell, Mass.

Leo Breiman, Jerome H. Friedman, Richard Olshen, and Charles J. Stone. Classification and Regression Trees. Wadsworth International Group, Belmont, CA.

Kenneth W. Church and Mark Y. Liberman. A status report on the ACL/DCI. In Proceedings of the … Annual Conference of the UW Centre for the New OED and Text Research: Using Corpora, pages …, Oxford.

Kenneth W. Church. A stochastic parts program and noun phrase parser for unrestricted text. In Second Conference on Applied Natural Language Processing, pages …, Austin, TX.

Doug Cutting, Julian Kupiec, Jan Pedersen, and Penelope Sibun. A practical part-of-speech tagger. In The 3rd Conference on Applied Natural Language Processing, Trento, Italy.

W. Francis and H. Kucera. Frequency Analysis of English Usage. Houghton Mifflin Co., New York.

William A. Gale and Kenneth W. Church. A program for aligning sentences in bilingual corpora. Computational Linguistics.

John Hertz, Anders Krogh, and Richard G. Palmer. Introduction to the Theory of Neural Computation. Santa Fe Institute Studies in the Sciences of Complexity. Addison-Wesley Pub. Co., Redwood City, CA.

Christiane Hoffmann. Automatische Disambiguierung von Satzgrenzen in einem maschinenlesbaren deutschen Korpus. Unpublished work with Prof. R. Kohler at the University of Trier, Germany.

T.L. Humphrey and F.q. Zhou. Period disambiguation using a neural network. In IJCNN: International Joint Conference on Neural Networks, page …, Washington, DC.

Martin Kay and Martin Roscheinsen. Text-translation alignment. Computational Linguistics.

M.E. Lesk and E. Schmidt. Lex: a lexical analyzer generator. Computing Science Technical Report …, AT&T Bell Laboratories, Murray Hill, NJ.

Mark Y. Liberman and Kenneth W. Church. Text analysis and word pronunciation in text-to-speech synthesis. In Sadaoki Furui and Man Mohan Sondhi, editors, Advances in Speech Signal Processing, pages …. Marcel Dekker, Inc.

Hans Muller, V. Amerl, and G. Natalis. Worterkennungsverfahren als Grundlage einer Universalmethode zur automatischen Segmentierung von Texten in Sätze: Ein Verfahren zur maschinellen Satzgrenzenbestimmung im Englischen. Sprache und Datenverarbeitung.

David D. Palmer and Marti A. Hearst. Adaptive sentence boundary disambiguation. In Proceedings of the Fourth ACL Conference on Applied Natural Language Processing (October, Stuttgart), pages …. Morgan Kaufmann.

Michael D. Riley. Some applications of tree-based modelling to speech and language indexing. In Proceedings of the DARPA Speech and Natural Language Workshop, pages …. Morgan Kaufmann.

Larry Wall and Randal L. Schwartz. Programming perl. O'Reilly and Associates, Inc., Sebastopol, CA.

Appendix A: README to accompany software version of SATZ

This document gives information about the various parts of the SATZ program. Any questions should be directed to dpalmer@cs.berkeley.edu.

FILES (full descriptions in the files themselves):

getpart.c    looks up the token in the lexicon
lex.yy.c     tokenization (created by lex)
netinput.c   formats input to neural net
tagfile.c    labels sentence boundaries in input file
train.c      trains neural net
utilities.c  utilities used by above modules

common.h     all the defines which can be altered for the program
trans.h      mapping of part-of-speech tags to descriptor array slots
utilities.h  function prototypes for utilities.c

weights.net  trained weights for neural net (created by train.c)
tokenize.l   tokenization file for use by lex

UNIX scripts: these scripts provide all the functionality for the program. File names within the scripts can be changed at will, as long as you know what you are doing. Just make sure all the file names exist before running the script.

getfreqs   tokenizes a file and looks it up in the lexicon; this script is
           good for seeing if the lookup is working properly, but is not
           really a part of SATZ itself
           usage: getfreqs <input file>

trainnet   trains the neural net
           usage: trainnet <file with training text>

bound      labels boundaries in the input file
           usage: bound <file to label>

Dictionary files: all dictionaries (or, more accurately, word lists) must contain lines in the following format in order to be readable:

word<tab>TAG/freq<tab>TAG/freq<tab>...<tab>TAGn/freqn

example: fixed<tab>JJ/…<tab>VBD/…<tab>VBN/…

Note: <tab> is \t.

Important: do NOT leave a tab at the end of the line.
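A complete, hypothetical fragment in this format (the slash as tag/frequency separator and the frequency values are illustrative assumptions; \t stands for a literal tab):

    fixed\tJJ/12\tVBD/25\tVBN/30
    plant\tNN/80\tVB/20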

abbrev.dict    abbreviation dictionary
chars.dict     list of necessary characters or char strings (essential)
endings.dict   list of word endings used to guess plurals, gerunds, etc.
propnoun.dict  list of proper nouns (optional)
words.dict     main lexicon

So, to use SATZ properly, you need to do the following:

1. Prepare a training text, which should be an excerpt from the text to be labeled, of about … sentences. All boundaries in the training text must be labeled with the character sequence \s (or whatever sequence you use).

2. Prepare a cross-validation text in the same manner. It should be about … sentences, roughly half the size of the training text. Make sure the script trainnet has the name of this text.

3. Run trainnet on the training text. Training time varies, but should be between … seconds and … minutes. You should see the progress of the training on the screen. Net weights will be stored in weights.net for use by bound. It is always possible that the net won't behave nicely, as neural nets are sometimes prone to do with the backprop algorithm. If this happens, you can try several things: modify the training text and/or cross-validation text slightly; change the learning rate ETA in common.h if the learning is oscillating significantly; change BECOME_STABLE or STAY_STABLE in common.h, which determine the length of training based on the behavior of the cross-validation text; or simply yell lots of obscenities at the author.

4. Run bound on your files. The output is sent to the file name specified in the bound script, so you can change it each time if you want. Just be careful not to overwrite previously labeled text if you label more than one file.

For each new language SATZ is used with, you will need to adjust a few language-specific things in order to maximize performance:

1. Make sure the tagset properly maps into the descriptor array. This is specified in the file trans.h. Four of the slots in the descriptor array are reserved and must stay the same regardless of the mapping (assuming 20 elements in the array):

   element …: miscellaneous/other
   element …: first character capitalized
   element …: capital letter after possible sentence end
   element …: possible end-of-sentence punctuation mark

2. Change several #define declarations in common.h, including:

   DESPERATION_LABEL    a rough estimate of the unknown word distribution
   HYPHEN_LABEL         distribution for unknown hyphenated words
   PROPER_NOUN_FACTOR
   PROPER_NOUN_AFTER_DOT

3. Make sure you have a lexicon of words in that language with prior part-of-speech frequencies in the format above. A list of abbreviations and a list of proper nouns is optional; I use a very small (… items) abbreviation list and no proper noun list and get good results. Just make sure the abbreviation list includes the essential, important ones (in English: Mr., Dr., etc.). If you include these lists, make sure no members of the lists are duplicated in two lists with conflicting pos labels. Also, a list of the most frequently encountered word endings in the language will probably improve performance, as this list is used to guess unknown words based on their endings (plurals, gerunds, etc.). The endings file must contain at least one entry in order for the program to function properly. You can simply enter zzzzzzz<tab>Z/… if you don't want this part to be used, or you could modify the code to never access the endings file.

The file chars.dict contains lots of standard characters encountered, so you don't need to put them in the word dictionary. The token "end" is returned by the tokenizer as a flag for training and testing whenever it encounters the string \s in the text. I chose this string assuming it would never occur naturally in any text. Feel free to change it, but make sure the string you choose is included in the lexicon with the label ….