Natural Language Engineering. © Cambridge University Press

Regular Expressions for Language Engineering

L. KARTTUNEN, J.-P. CHANOD, G. GREFENSTETTE, A. SCHILLER

Rank Xerox Research Centre (RXRC), Chemin de Maupertuis, Meylan, France

{Lauri.Karttunen, Jean-Pierre.Chanod, Gregory.Grefenstette, Anne.Schiller}@grenoble.rxrc.xerox.com

Received February

Abstract

Many of the processing steps in natural language engineering can be performed using
finite state transducers. An optimal way to create such transducers is to compile them
from regular expressions. This paper is an introduction to the regular expression calculus,
extended with certain operators that have proved very useful in natural language
applications ranging from tokenization to light parsing. The examples in the paper illustrate
in concrete detail some of these applications.

Introduction

The use of finite state transducers for morphological analysis and generation (Karttunen
et al.; Karttunen; Beesley and Karttunen) is by now well established. It is less well known
how far the techniques of finite state transformation of text can succeed in solving other
natural language engineering problems beyond morphology. In this paper we will present
a number of applications of finite state technology to other language engineering problems
and describe part of the regular expression calculus that we have developed that makes
these applications possible. Although large efforts have been made to build local grammars
(Silberztein) without the help of a finite state calculus, the expressive power of a
well-designed calculus makes it possible to create modular rule sets and lexical descriptions
that are easy to update and maintain, accelerating the production of diverse engineering
applications.

Our paper is structured in the following way. After a general introduction to the
finite state calculus, special attention will be given to three finite state operators:
restriction, replacement, and left-to-right, longest-match replacement. These conceptually
simple regular expression operators express in a concise way conditions whose explicit
formulation would otherwise be complicated and unwieldy. Their introduction into the
Xerox finite state calculus has spurred a number of language engineering applications,
some of which are described below. A graded set of examples using these operators will be
presented as a concrete illustration of their power and simplicity.

After this overview of the Xerox finite state calculus, we will present a number of
applications using this calculus over textual structures larger than individual words.
We first present tokenization applications: a deterministic tokenizer used in our
part-of-speech tagging suite, and then nondeterministic tokenization, which maintains
alternative tokenizations of the same input stream in a finite state structure to which
further linguistic treatments can be applied.

We then present two applications of light parsing over tagged text. Just as tokenization
applies finite state rules and constraints over strings involving more than one word, these
applications apply to units (here sentences) recognized by the part-of-speech tagger which
include more than one word. And just as transducers compiled from morphological rules
insert or delete characters in words, these light parsers introduce syntactic markings
within a tagged sentence around sequences that fulfill the conditions specified in the
finite state rules.

Introduction to the Finite-State Calculus

Finite state transducers that encode phonological and morphographical alternations
are generally not created directly but are compiled from rules of some sort. The two-level
rule formalism (Koskenniemi) as well as classical phonological rewrite rules can be viewed
as special kinds of regular expressions (Kaplan and Kay) that extend the basic language
with new operators and new types of expressions. These extensions make the formalism
more suitable for particular applications but they do not affect its generative power. In
this section we will discuss recent extensions to the regular expression calculus that
enable us to create finite state transducers for syntactic processing from high level rules.

We start with a brief review of the basic regular expression calculus and proceed
with the discussion of two special operators that were originally introduced for
phonological and morphographical alternations but which have since proved useful
in writing syntactic rules as well. These are the restriction operator, =>, and the
replace operator, ->. We will then describe in more detail a variant of the latter,
the left-to-right, longest-match replacement, @->, recently introduced in (Karttunen),
and illustrate its application to syntactic description.

Simple regular expressions

The language of regular expressions is a formal language similar to formulas of
Boolean logic. It has a simple syntax but the expressions can be arbitrarily complex.
Like formulas of Boolean logic, regular expressions denote sets. We need to distinguish
two kinds of sets: sets of strings and sets consisting of pairs of strings. We use the
term language to refer to a set of simple strings and the term relation in talking about
sets of string pairs. The terms regular language and regular relation refer to sets that
can be described by a regular expression.

Regular languages and relations may be encoded as finite state networks. Languages
are represented by simple automata, relations by transducers. Any regular expression
can be compiled into a network that represents the corresponding language or relation.
Because of the close connection between regular expressions and finite state networks,
we often use the terms regular and finite state interchangeably in this paper.

Regular expressions contain two kinds of symbols: unary symbols and symbol pairs.
Unary symbols (a, b, etc.) denote strings; symbol pairs (a:b, a:0, 0:b, etc.) represent
pairs of strings. The simplest kind of regular expression contains a single symbol. For
example, a denotes the set {"a"}. Similarly, the regular expression a:b denotes the
singleton relation {⟨"a", "b"⟩}. A regular relation may always be viewed as a mapping
between two regular languages. The a:b relation is simply the crossproduct of the
languages denoted by the expressions a and b.

In order to distinguish the two languages that are involved in a regular relation,
we can call the first one the upper and the second one the lower language of the
relation. Correspondingly, in the pair a:b, the first symbol, a, can be called the
upper symbol and the second one, b, the lower symbol. The two components of a symbol
pair are separated in our notation by a colon, :, without any whitespace before or
after. To make the notation less cumbersome, we systematically ignore the distinction
between the language A and the identity relation that maps every string of A to itself.
Therefore we also write a:a simply as a.

A transducer that encodes a regular relation can be applied in two ways: to map
a string in the upper language to the corresponding strings in the lower language,
or vice versa. There is no privileged direction of application. In the examples in
later sections, the intended input language is always on the upper side and the
output on the lower side of the relation.

Two regular expression symbols have a special interpretation: 0 (epsilon) and
? (any). The epsilon symbol denotes the empty string; ? stands for any symbol
that occurs in the same regular expression and for any unknown symbol. In a later
section we introduce one more special symbol, the boundary marker .#., to refer
to the beginning or the end of a string in certain expressions.

The special meaning of any symbol may be turned off by prefixing it with the
escape character, %, or by enclosing the symbol in double quotes: %0 and "0" are
interpreted as ordinary zero digits and not as the special character epsilon. On the
other hand, certain other strings receive a special interpretation within a doubly
quoted string. For example, "\n" is interpreted as the newline character and "\t" as
a tab, following the C programming language conventions. Because whitespace is
generally ignored, a space as a regular symbol must be prefixed with % or enclosed
in double quotes.

Complex regular expressions can be built up from simpler ones by means of
regular expression operators. Because both regular languages and regular relations
are closed under concatenation and union, the following basic operators can be
combined with any kind of regular expression:



See http://www.rxrc.xerox.com/research/mltt/fst/ for a demonstration of the Xerox
regular expression compiler and for a more comprehensive description of the syntax
and semantics of regular expressions.


    A | B    Union
    A B      Concatenation
    (A)      Optionality; union with the empty string
    A+       Iteration; one or more concatenations of A
    A*       Equivalent to (A+)

Square brackets, [ ], are used for grouping expressions. Thus [A] is equivalent to
A while (A) is not. Note the following simple expressions:

    []       The empty string language/identity relation
    ?*       The universal language/identity relation

Although regular languages are closed under complementation and intersection,
regular relations are not (see Kaplan and Kay); thus the following operators
can be combined only with expressions that denote a regular language:

    ~A       Complement (negation)
    A & B    Intersection
    A - B    Relative complement (minus)

Regular relations can be constructed by means of two basic operators:

    A .x. B    Crossproduct
    A .o. B    Composition

The crossproduct operator, .x., is used only with expressions that denote a
regular language; it constructs a relation between them. A .x. B designates the
relation that maps every string of A to every string of B. If A contains x and B
contains y, the pair ⟨x, y⟩ is included in the crossproduct.

Composition is an operation on relations that yields a new relation. A .o. B
maps strings that are in the upper language of A to strings that are in the lower
language of B. If A contains the pair ⟨x, y⟩ and B contains the pair ⟨y, z⟩, the pair
⟨x, z⟩ is in the composite relation.
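The two relation-building operators can be illustrated on small finite sets. The following Python sketch is only an illustration under the assumption that languages are finite sets of strings and relations are finite sets of string pairs; the function names are hypothetical and it is not how the Xerox compiler represents networks.

```python
from itertools import product

def crossproduct(upper, lower):
    """A .x. B for finite languages: pair every string of A with every string of B."""
    return set(product(upper, lower))

def compose(rel_a, rel_b):
    """A .o. B for finite relations: keep (x, z) whenever (x, y) is in A and (y, z) is in B."""
    return {(x, z) for (x, y1) in rel_a for (y2, z) in rel_b if y1 == y2}

# Illustrative data only.
A = crossproduct({"a", "aa"}, {"b"})   # {("a", "b"), ("aa", "b")}
B = {("b", "c"), ("d", "e")}
AB = compose(A, B)                      # {("a", "c"), ("aa", "c")}
```

Note that the pair ("d", "e") contributes nothing to the composition because no pair of A has "d" on its lower side.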

Defining New Operators

The syntax, though not the descriptive power, of regular expressions can be
extended by defining new operators that allow commonly used constructions to be
expressed more concisely. A simple example of a trivial but convenient extension is
the containment operator, $:

    $A  =def  [?* A ?*]

For example, $[a | b] denotes all strings that contain at least one "a" or "b"
somewhere.

The addition of new operators can be more than just a notational convenience. A
case in point is Koskenniemi's restriction operator, =>, originally introduced
for two-level phonological rules:

    A => B _ C    Restriction; A only in the context of B _ C


Here A, B, and C may denote any regular language. This expression designates
the language of strings that have the property that any string of A that occurs
in them is immediately preceded by some string from B and immediately followed
by some string from C. For example, a => b _ c includes all strings that contain
no occurrence of "a", strings like "bacbac" that completely satisfy the condition,
but no strings like "ab". Reductionist finite state parsers (Koskenniemi et al.;
Voutilainen and Tapanainen; Chanod and Tapanainen) make frequent use of such
constraints to exclude unwanted analyses.

The advantage of the restriction operator is that it encodes in a compact way a
useful condition that is difficult to express in terms of the more primitive operators.
The definition of A => B _ C is shown below:

    A => B _ C  =def  ~[ [~[?* B] A ?*] | [?* A ~[C ?*]] ]

Clearly, high level abstractions like A => B _ C are conceptually easier to
operate with than the logically equivalent but very complex primitive formulas, just
as it is easier to write complex computer programs in a high level language rather
than in a logically equivalent assembly language.
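The condition that the restriction operator encodes can also be checked by brute force on a single string. The sketch below is a hypothetical helper, not part of the calculus: it assumes that A, B, and C are given as Python `re` patterns and that A does not match the empty string, and it tests every substring rather than compiling a network.

```python
import re

def satisfies_restriction(s, a, b, c):
    """Return True if every substring of s that matches pattern a is
    immediately preceded by a match of b and immediately followed by a match of c."""
    for i in range(len(s) + 1):
        for j in range(i + 1, len(s) + 1):      # non-empty candidate substrings
            if re.fullmatch(a, s[i:j]):
                # the left side must END with some string of B
                left_ok = re.fullmatch(f".*(?:{b})", s[:i], re.DOTALL) is not None
                # the right side must BEGIN with some string of C
                right_ok = re.fullmatch(f"(?:{c}).*", s[j:], re.DOTALL) is not None
                if not (left_ok and right_ok):
                    return False
    return True
```

With the patterns "a", "b", and "c", the string "bacbac" passes the check, a string without any "a" passes trivially, and "ab" fails because its "a" has no "b" to its left.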

Note that the definition of the restriction operator given above contains three
negations. Because regular relations are not closed under complementation, all the
component expressions in A => B _ C must denote regular languages rather than
relations. In the general regular expression calculus we do not allow expressions such
as a:b => c:d _ e:f that are well-formed two-level rules in Koskenniemi's formalism.
In our two-level calculus, all pair symbols are treated as atomic symbols and
converted to real symbol pairs only after the compilation is finished (Karttunen and
Beesley).

Another example of a useful high level abstraction is the replace operator, ->,
that plays much the same role in the general regular expression calculus as the
rule operators do in two-level rules. The simplest type of replacement is
unconstrained by contexts:

    A -> B    Replacement of A by B

The component expressions, A and B, must denote regular languages but the
expression as a whole denotes a relation. The A -> B relation maps any string
to itself if the string contains no instance of A. Strings that contain instances of A
are paired with copies that are otherwise identical except that each A segment is
replaced by some B string. The exact definition of A -> B is shown below:

    A -> B  =def  [ ~$[A] [A .x. B] ]* ~$[A]

This relatively simple idea would also be rather cumbersome to express without
the explicit replace operator. The same is true of the other types of replace
expressions (Kaplan and Kay; Karttunen) that constrain the operation by left and
right contexts:

    A -> B || L _ R    Replacement of A by B in the context L _ R


The two vertical bars in the above contexted replacement indicate that both
contexts, L and R, pertain to the upper language of the relation. See (Karttunen;
Kempe and Karttunen) for other variations.

Another way of generalizing the replace operator is to make two or more
replacements in parallel:

    A1 -> B1 , A2 -> B2

This is defined like the single A -> B replacement above except that we replace
~$[A] by ~$[A1 | A2] and A .x. B by the union of multiple crossproducts,
[A1 .x. B1] | [A2 .x. B2].

Left-to-right, Longest-match Replacement

The transducers compiled from the simple replacement expression A -> B are in
general ambiguous in the sense that a string in the upper language of the relation
may become paired with more than one string. This can happen even if the B
language consists of a single string. For example, a b | b | b a | a b a -> x
maps the string "aba" into four different strings, as shown below.

    aba     aba      aba     aba
    --       -        --     ---
    x a     a x a    a x     x

The reason is that the simple replacement relation does not constrain the selection
of the alternate substrings for replacement. Here we get different results for "aba"
depending on whether the replacement starts at the beginning or in the middle of
the string. At both sites there are two alternative replacements to be made. Starting
at the beginning, we may replace either "ab" or "aba". Starting in the middle, we can
replace either "b" or "ba". The underlining shows the four alternate factorizations
of the input string.

For many applications it is useful to define another version of replacement that
in all such cases yields a unique outcome. The longest-match, left-to-right replace
operator, @->, defined in (Karttunen), imposes a unique factorization on every
input. The replacement sites are selected from left to right, not allowing any
overlaps. If there are alternate candidate strings starting at the same location, only
the longest one is replaced. Thus the @-> operator allows only the last factorization
in the figure above, mapping "aba" unambiguously to "x".
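The contrast between unconstrained replacement and its left-to-right, longest-match variant can be made concrete for the case where A is a finite set of non-empty strings and B is a single string. The following Python sketch rests on those assumptions; the function names are hypothetical, and the code enumerates factorizations directly instead of building a transducer.

```python
def contains_any(s, targets):
    """True if s contains an instance of some target string."""
    return any(t in s for t in targets)

def replace_all_ways(s, targets, repl):
    """All outputs of the unconstrained replacement: the input is factored into
    untouched stretches that contain no target, alternating with replaced targets."""
    results = set()

    def walk(i, out):
        for j in range(i, len(s) + 1):
            gap = s[i:j]
            if contains_any(gap, targets):
                break                      # longer gaps would also contain a target
            if j == len(s):
                results.add(out + gap)     # the rest of the string is left untouched
            for t in targets:
                if s.startswith(t, j):
                    walk(j + len(t), out + gap + repl)

    walk(0, "")
    return results

def longest_match_replace(s, targets, repl):
    """Left-to-right, longest-match replacement: scan from the left and, at each
    position, replace the longest target that starts there."""
    out, i = [], 0
    while i < len(s):
        best = max((t for t in targets if s.startswith(t, i)), key=len, default=None)
        if best is None:
            out.append(s[i])
            i += 1
        else:
            out.append(repl)
            i += len(best)
    return "".join(out)

TARGETS = {"ab", "b", "ba", "aba"}
all_outputs = replace_all_ways("aba", TARGETS, "x")      # four different strings
unique_output = longest_match_replace("aba", TARGETS, "x")
```

For the input "aba" the first function returns the four outputs of the figure (written without spaces), while the second returns only "x".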

The effect of the left-to-right, longest-match constraints is that every string in
the upper language of A @-> B is uniquely parsed into a sequence of substrings
that either belong or do not belong to A. We can take advantage of the unique
factorization in more than one way. Instead of replacing the instances of A by a
string from some other language, we may insert markers or brackets around the A
strings to mark them as such.

To implement this idea, (Karttunen) introduced a special symbol, ..., on
the right-hand side of the replacement expression to mark the place around which
the insertions are to be made. The general form of these marking expressions is
shown below. For the sake of generality, we allow B and C to denote any regular
language.

    A @-> B ... C    Left-to-right, longest-match markup

The corresponding transducer locates instances of A in the input string under the
left-to-right, longest-match regimen and copies the entire string unchanged, except that
the B and C strings are inserted around the selected A strings as markers. In effect,
this transducer can be viewed as a parser that picks out maximal instances of the
regular language A.

Just like simple replacement, the left-to-right, longest-match replacement can also
be constrained by context and generalized for parallel replacement. For example,

    A1 @-> B1 ... C1 , A2 @-> B2 ... C2

picks out the maximal instances of the union [A1 | A2] and marks them differently
depending on whether they belong to A1 or A2. We will make use of this specific
possibility below. The contexted version of @-> makes its appearance in a later
section.

To start with a simpler example, let us assume that noun phrases consist of an
optional determiner, (d), any number of adjectives, a*, and one or more nouns, n+.
The expression

    (d) a* n+ @-> %[ ... %]

compiles into a transducer that inserts brackets around maximal instances of the
noun phrase pattern. For instance, it maps "dannvaan" into "[dann]v[aan]":

    dann v aan
    [dann] v [aan]

Although the input string "dannvaan" contains many other instances of the noun
phrase pattern, "n", "an", "nn", etc., the left-to-right and longest-match constraints
pick out just the two maximal ones.
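The bracketing behaviour of this example can be imitated with an ordinary pattern matcher. The sketch below uses Python's `re` module, which is a backtracking engine and not a finite state transducer; it is offered only because, for this particular pattern, greedy matching from left to right happens to select the same maximal instances. The function name is hypothetical.

```python
import re

# (d) a* n+ : optional determiner, any number of adjectives, one or more nouns
NOUN_PHRASE = re.compile(r"d?a*n+")

def mark_noun_phrases(tags):
    """Insert brackets around non-overlapping, leftmost, greedy matches."""
    return NOUN_PHRASE.sub(lambda m: "[" + m.group(0) + "]", tags)

marked = mark_noun_phrases("dannvaan")
```

For patterns that contain alternations, Python's leftmost-first matching does not in general coincide with the longest-match regimen, so this shortcut should not be taken as a general substitute for the @-> operator.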

This simple example demonstrates that finite state parsers can be compiled
directly from regular expressions. In the next section we will present a more
sophisticated example illustrating this technique.

A Grammar and a Parser for Date Expressions

It is well known among linguists that the syntax of a natural language cannot in
general be described by a finite state or even a context free grammar. Nevertheless,
there are many subsets of natural language that can be correctly described by very
simple means, for example names and titles, addresses, prices, dates, etc. For some
of these kinds of expressions, a finite state grammar may be more appropriate and
easier to construct than an ordinary phrase structure or feature based grammar.
In this section we examine one such case in detail: a grammar for dates. As we
demonstrated in the previous section, from the regular expression that defines the
syntax of well-formed date strings we can directly derive a finite state transducer
that marks them in a text.


For the sake of illustration, let us consider here only one of several common date
formats, expressions that are of the type

Sunday

August

SundayAugust

August

SundayAugust

In the following we assume that a date expression consists of a day of the week,
a month and a date with or without a year, or a combination of the two. Note
that this description of the syntax of date expressions leads to the same problem
we encountered in the previous example. Long date expressions that begin with a
day of the week contain smaller well-formed date expressions, for example the month
and date alone, that should be ignored in the context of a larger date. In order to
simplify the presentation, we stipulate that date expressions are contiguous strings,
including the internal spaces and commas.

To facilitate the specification of the date language, we first define some auxiliary
terms and then use them to define larger phrases. The complete set of definitions
is shown below.

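    1To9  = [1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9]
    0To9  = [%0 | 1To9]
    SP    = [", "]
    Day   = [Monday | Tuesday | ... | Saturday | Sunday]
    Month = [January | February | ... | November | December]
    Date  = [1To9 | [1 | 2] 0To9 | 3 [%0 | 1]]
    Year  = 1To9 (0To9 (0To9 (0To9)))
    DateExpression = Day | (Day SP) Month " " Date (SP Year)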

From these definitions we can compile a small finite state automaton that describes
a language of millions of date expressions for the whole period covered by the Year
definition.
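The shape of this grammar can be approximated with a conventional regular expression. The Python sketch below is an illustration under stated assumptions rather than the grammar of the paper: it assumes that a date is a number from 1 to 31, that a year has one to four digits with no leading zero, and that the separator is a comma followed by a space. All identifiers are hypothetical.

```python
import re

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

SP = ", "                                  # separator: comma and space
DAY = "(?:" + "|".join(DAYS) + ")"
MONTH = "(?:" + "|".join(MONTHS) + ")"
DATE = "(?:3[01]|[12][0-9]|[1-9])"         # 1 to 31, assumed
YEAR = "[1-9][0-9]{0,3}"                   # 1 to 9999, assumed

# Day alone, or an optional day, a month, a date, and an optional year
DATE_EXPRESSION = re.compile(f"(?:{DAY}{SP})?{MONTH} {DATE}(?:{SP}{YEAR})?|{DAY}")

def is_date_expression(s):
    """True if the whole string belongs to the (overgenerating) date language."""
    return DATE_EXPRESSION.fullmatch(s) is not None

# Size of the language under the assumptions above:
# bare day names, plus (no day or one of 7 days) x 12 months x 31 dates x (no year or 9999 years)
language_size = len(DAYS) + (len(DAYS) + 1) * len(MONTHS) * 31 * (1 + 9999)
```

Under these assumptions the sketch accepts strings such as "Sunday, August 11, 1996" but also nonexistent dates such as "April 31, 1996", which is the overgeneration problem taken up in the text that follows.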

A parser for the language can be compiled from the following simple regular
expression:

    DateExpression @-> %[ ... %]

It yields a transducer that marks maximal date expressions in the manner
illustrated by the following text:

Today is Wednesday August because yesterday was Tuesday

and it was August so tomorrow must be ThursdayAugust and

not August as it says on the program

Because of the left-to-right, longest-match constraints associated with the @->
operator, the transducer brackets only the maximal date expressions.

However, as to correctness, this regular expression grammar suffers from a serious
problem of overgeneration. The actual number of days in the period covered is much
smaller than the number of expressions in the language. A large majority of
the date expressions generated by the grammar refer to dates that do not exist. For
example, expressions like April 31 are similar to examples like the present
king of France that do not refer to anything real. It is an interesting challenge for
finite state syntactic description to try to specify a sublanguage that contains all
and only the semantically valid date expressions.

Eliminating Invalid Date Expressions

In this section we eliminate, step by step, all the different types of invalid date
expressions using the operations of the regular expression calculus introduced earlier.

The easiest problem to correct is that the original date grammar allows
expressions like February 30 and April 31 that exceed the maximum number of days
for the month. Somewhat more challenging is the case of leap days: February 29
is a valid date in a leap year but does not exist in any other year. The hardest
problem is the dependency between the day of the week and the date. If a given date
is in fact a Monday, expressions that pair the same date with Tuesday, Wednesday,
etc. are invalid, even if they occasionally occur in real texts.

Our solution is to construct a suitable constraint for each of these three kinds
of invalid dates: MaxDaysInMonth, LeapDays, and WeekDayDates. Each of
these constraint expressions denotes a language that excludes a particular type of
invalid date but admits all other strings. We obtain the desired effect by intersecting
the constraint languages with the original language of date expressions. The
intersection of the four languages contains all and only the valid dates:

    ValidDate =
        DateExpression & MaxDaysInMonth & LeapDays & WeekDayDates

The MaxDaysInMonth constraint is a language that includes all strings except the
ones which contain a month name followed by an inappropriate number of days:

    MaxDaysInMonth =
        ~$[February " " 30] &
        ~$[[February | April | June | September | November] " " 31]

Note that ~$A denotes the complement (negation) of the set of strings that
contain at least one instance of A somewhere.

In order to restrict February 29 to leap years, we need to do a little more work.
Not all years divisible by four are leap years. Full centuries are not leap years unless
they are divisible by 400. Consequently, year 1900 is not a leap year but year 2000
is. We need the following definitions:

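    Even     = [%0 | 2 | 4 | 6 | 8]
    Odd      = [1 | 3 | 5 | 7 | 9]
    N        = 1To9 0To9*
    Div4     = [((N) Even) [%0 | 4 | 8]] | [(N) Odd [2 | 6]]
    LeapYear = Div4 - [[N - Div4] %0 %0]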

Here we first define Div4 as the infinite set of natural numbers that are divisible
by four. This set consists of two parts: numbers that end in 0, 4, or 8, possibly
preceded by an even number, and numbers that end in 2 or 6 preceded by an
odd number. Finally, we define LeapYear as the set of numbers divisible by 4,
subtracting centuries that are not multiples of 400. Note that the expression
[N - Div4] %0 %0 denotes numbers with two final zeros that are preceded by a number
that is not divisible by four. For example, it includes 1900 but not 2000. Because
LeapYear is defined as Div4 minus this set, it follows that 2000 is a leap year but
1900 is not.
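The digit-pattern reasoning above can be checked mechanically. The following Python sketch restates the two parts of the set of numbers divisible by four as a regular expression over decimal digits and subtracts the non-leap centuries by a separate test; the names are hypothetical, and the comparison against `calendar.isleap` covers only the years 1 to 9999.

```python
import calendar
import re

N = r"[1-9][0-9]*"        # a natural number without leading zeros
EVEN = r"[02468]"
ODD = r"[13579]"

# Ends in 0, 4 or 8, possibly preceded by an even number,
# or ends in 2 or 6 preceded by an odd number.
DIV4 = re.compile(rf"(?:(?:{N})?{EVEN})?[048]|(?:{N})?{ODD}[26]")

def is_div4(digits):
    """True if the digit string denotes a number divisible by four."""
    return DIV4.fullmatch(digits) is not None

def is_leap_year(digits):
    """Divisible by four, minus centuries whose leading part is not divisible by four."""
    if not is_div4(digits):
        return False
    if digits.endswith("00"):
        head = digits[:-2]
        if re.fullmatch(N, head) and not is_div4(head):
            return False      # e.g. 1900: the part before the two zeros, 19, is not divisible by 4
    return True

agrees_with_calendar = all(
    is_leap_year(str(y)) == calendar.isleap(y) for y in range(1, 10000)
)
```

The subtraction step is written as an ordinary conditional here because Python's `re` has no language difference operator; in the calculus it is a single minus.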

Once the set of leap years is defined, the distribution of February 29 in date
expressions can be constrained with the following simple restriction. As defined
above, SP is a separator consisting of a comma and a space.

    LeapDays = February " " 29 SP => _ LeapYear .#.

In other words, a date expression containing February 29 must terminate
with a leap year. Note that the boundary symbol, .#., is necessary here to mark
the end of the year string in order to rule out expressions whose year is not a leap
year but begins with the digits of one; such expressions would qualify if we were
allowed to take into account only the first three digits of the year.

The last problem, synchronization between the days of the week and the dates,
takes a little more work. In effect, we need to construct a complete calendar to
determine the validity of any arbitrary date expression, say Friday, October 15, 1582,
the day the Gregorian calendar was introduced in Catholic countries.

The task is easier than it first appears because the Gregorian calendar includes
a perfect 400 year double cycle: after each 400 years we start the year on the same
day and we are at the same point in the leap year sequence. Starting from that
observation, it is simple to partition years into equivalence classes so that the years
in each class begin on the same day of the week. Furthermore, we need to distinguish
ordinary years from leap years that have one extra day. Thus we can define fourteen
equivalence classes:

MonY

TueY

MonLY

TueLY

where MonY, TueY, etc. are ordinary years beginning with Monday, Tuesday, etc.,
respectively; MonLY, TueLY, etc. denote the corresponding classes for leap years.

The year classes can be constructed using the same sort of finite state arithmetic
that we employed above to define the infinite set of leap years, but we will skip the
mechanics here except for one interesting detail. In order to avoid listing years one
by one, it is convenient to derive subclasses from other subclasses by addition. This
can be achieved by an auxiliary transducer that maps any set of numbers to another
set in which a fixed amount has been added to each number. Applied in the opposite
direction, the same transducer subtracts.
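The fourteen year classes can also be tabulated with ordinary calendar arithmetic, which gives a convenient cross-check on any finite state construction. The Python sketch below uses the standard `datetime` module, whose proleptic Gregorian calendar is assumed here for all years from 1 to 9999; the class labels follow the names used above, and everything else is hypothetical.

```python
import calendar
from collections import defaultdict
from datetime import date

DAY_ABBR = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]   # date.weekday(): Monday is 0

def year_class(year):
    """Name of the equivalence class: weekday of January 1 plus Y or LY."""
    name = DAY_ABBR[date(year, 1, 1).weekday()]
    return name + ("LY" if calendar.isleap(year) else "Y")

year_classes = defaultdict(list)
for y in range(1, 10000):
    year_classes[year_class(y)].append(y)

# Adding 400 to a year never changes its class, which is the double cycle mentioned above.
cycle_holds = all(year_class(y) == year_class(y + 400) for y in range(1, 9600))
```

Because of the 400 year cycle, it is enough to classify one span of 400 consecutive years and derive every other member of a class by repeated addition of 400.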



A binary version of the adder-subtracter transducer is one of the examples discussed
in http://www.rxrc.xerox.com/research/mltt/fst/fsexamples.html


Another simple observation that greatly facilitates the calendar construction is
that the weekday of the first day of the year determines the weekday of all the other
days of the year. Again we need fourteen equivalence classes:

    D1  = January 1 | January 8 | ... | December 31
    D2  = January 2 | January 9 | ... | December 25
    ...
    DL1 = January 1 | January 8 | ... | December 30
    DL2 = January 2 | January 9 | ... | December 31
    ...

where D1 includes all the dates that fall on the same weekday as the first day of
an ordinary year, D2 contains all the dates that share the weekday with the second
day, and so on. DL1, DL2, etc. are the corresponding classes for leap years.
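The date classes can be enumerated in the same spirit. The sketch below is a hypothetical helper that walks through one ordinary year and one leap year with `datetime` and groups the dates by their position in the week relative to January 1; the two sample years, 1997 and 1996, are chosen only because one is ordinary and the other is a leap year.

```python
from datetime import date, timedelta

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

def date_classes(leap_year):
    """Map k = 1..7 to the dates that share a weekday with the k-th day of the year."""
    year = 1996 if leap_year else 1997      # any leap or ordinary year would do
    classes = {k: [] for k in range(1, 8)}
    day = date(year, 1, 1)
    while day.year == year:
        k = (day.timetuple().tm_yday - 1) % 7 + 1
        classes[k].append(f"{MONTHS[day.month - 1]} {day.day}")
        day += timedelta(days=1)
    return classes

D = date_classes(leap_year=False)    # D[1] ... D[7]
DL = date_classes(leap_year=True)    # DL[1] ... DL[7]
```

In this tabulation February 29 appears only in the leap year classes, so a date expression with February 29 can be licensed only by a leap year class.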

To construct the final filter for the perfect Gregorian calendar, we just need to
combine the information about what weekday begins what year with the equivalence
classes for the dates within a year. For example, for an ordinary Monday year,
MonY, Mondays fall on January 1, January 8, and so on up to December 31. But if the
year begins on Sunday, SunY, then Mondays fall on January 2, January 9, etc.
up to December 25. Similarly for all the fourteen equivalence classes for years.
The complete constraint for Mondays, MondayDates, is expressed by the following
regular expression:

    MondayDates =
        Monday => _ [ .#. | SP [ D1 (SP MonY) | DL1 SP MonLY |
                                 D2 (SP SunY) | DL2 SP SunLY |
                                 D3 (SP SatY) | DL3 SP SatLY |
                                 D4 (SP FriY) | DL4 SP FriLY |
                                 D5 (SP ThuY) | DL5 SP ThuLY |
                                 D6 (SP WedY) | DL6 SP WedLY |
                                 D7 (SP TueY) | DL7 SP TueLY ] ]

The language denoted by the expression above includes all strings that do not
contain any instance of Monday and all strings that contain Monday in an
environment that conforms to the restriction: the end of the string ("Monday"),
any month and a date without a year, or a month and a date followed by an
appropriate year. If a month and a date are followed by a year, the month and date
class (D1, DL1, D2, etc.) must correlate with the year class (MonY, MonLY, SunY, etc.).

The complete WeekDayDates constraint is simply the intersection of the similar
constraints for all the seven weekdays:

    WeekDayDates = MondayDates & TuesdayDates & WednesdayDates &
                   ThursdayDates & FridayDates & SaturdayDates &
                   SundayDates

We have now completed the task of extracting the language of valid dates from
the set of all date expressions: it is the intersection of the four languages we have
defined.

    ValidDate = DateExpression & MaxDaysInMonth &
                LeapDays & WeekDayDates


If we consider the finite state network that encodes the language, we can see
that the number of expressions is considerably reduced, since the number of actual
days in the period covered is much smaller. On the other hand, the network itself is
much larger than the unconstrained DateExpression automaton.

In any case, we can now easily derive a transducer that marks all and only the
valid dates:

    ValidDate @-> %[ ... %]

With the Xerox regular expression compiler, the time for the entire computation
creating the date-parsing finite state transducer is measured in seconds on a
powerful workstation (Sun Ultra).

Discriminating Date Parser

Although it is desirable to distinguish valid dates from invalid dates, it may not be
useful in practice to recognize only the valid dates. Because of errors and misprints,
real text corpora contain a fair number of invalid dates. Instead of ignoring all but
the valid ones, it may be more practical to accept all date expressions and simply
use different tags to mark the distinction between valid and invalid dates.

Once we have defined both the language of all date expressions, DateExpression,
and the set of valid ones, ValidDate, it is simple to pick out the set of invalid dates:
[DateExpression - ValidDate]. Using the notion of parallel replacement introduced
earlier, we can easily construct a parser that recognizes maximal instances
of the general class of date expressions but tags them in two ways depending on
which subclass the expression belongs to. The discriminating date parser is defined
by the regular expression below:

    [DateExpression - ValidDate] @-> "[ID " ... %] ,
    ValidDate @-> "[VD " ... %]

This parallel replacement expression compiles into a transducer on a Sun Ultra.
The compilation time, measured in seconds, includes the compilation of all
the auxiliary expressions and constraints discussed above. The following example
illustrates the effect of the transducer on a sample text:

The correct date for today is VD Monday September There

is an error in the program Today is not ID Tuesday September

In this example, the date that names Monday gets marked with the VD tag
because it is a valid date. Replacing Monday by Tuesday makes the date invalid,
but the parser still recognizes the expression as a date expression, albeit as an
invalid one. Because the longest match constraint is calculated with respect to the
language of all date expressions, the invalid date with the complete year gets
selected over a shorter prefix with a truncated year, which happens to be valid,
although the Gregorian calendar, of course, was not yet in use in that year.
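The behaviour of the discriminating parser can be approximated outside the calculus. The following Python sketch is a stand-in under explicit assumptions: it finds date expressions with a backtracking regular expression ordered so that the longest alternative is tried first, and it decides validity with `datetime` (proleptic Gregorian calendar, years 1 to 9999) instead of intersecting constraint languages. The tag format "[VD ...]" and "[ID ...]" and all identifiers are assumptions made for the illustration.

```python
import re
from datetime import date

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
MAX_DAYS = [31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]

DATE_EXPRESSION = re.compile(
    r"(?:(?P<day>" + "|".join(DAYS) + r"), )?"
    r"(?P<month>" + "|".join(MONTHS) + r") (?P<date>3[01]|[12][0-9]|[1-9])"
    r"(?:, (?P<year>[1-9][0-9]{0,3}))?"
    r"|(?P<dayonly>" + "|".join(DAYS) + ")"
)

def is_valid(m):
    """Check month length, leap days, and the weekday against the calendar."""
    if m.group("dayonly"):
        return True
    month = MONTHS.index(m.group("month")) + 1
    day = int(m.group("date"))
    if day > MAX_DAYS[month - 1]:
        return False
    if m.group("year") is None:
        return True                      # no year: any weekday is possible
    try:
        d = date(int(m.group("year")), month, day)
    except ValueError:                   # February 29 in a year that is not a leap year
        return False
    return m.group("day") is None or DAYS[d.weekday()] == m.group("day")

def tag_dates(text):
    """Bracket every maximal date expression and tag it VD (valid) or ID (invalid)."""
    def mark(m):
        return "[" + ("VD " if is_valid(m) else "ID ") + m.group(0) + "]"
    return DATE_EXPRESSION.sub(mark, text)
```

As in the transducer, the match is selected before validity is examined, so an invalid long expression is tagged ID as a whole instead of being cut down to a shorter expression that happens to be valid.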


To conclude this section, let us recapitulate the main points. The purpose of this
exercise was to demonstrate that

  • Finite state parsers can be constructed directly from regular expressions for
    nontrivial languages.
  • There are regular subsets of natural language, such as the language of dates,
    for which finite state description is not only feasible but more appropriate
    and easier to construct than the equivalent phrase structure grammar.
  • Regular languages and relations can be modified directly with the finite state
    calculus to obtain new languages and relations without rewriting the grammars
    that describe them.

It is of course possible to use phrase structure rules, attribute-value matrices,
type hierarchies, categorial grammar, and other powerful formalisms to represent
leap years, years starting on a Monday, and dates that fall on the same weekday as
January 1. However, it is not apparent that these powerful formalisms give us any
advantage over the simple finite state techniques we have used. On the contrary, it
seems that the manipulations required to accomplish such tasks with more powerful
formalisms would be at least as complicated and ad hoc as the simple finite state
techniques employed here. Note that a phrase structure grammar of valid dates
requires cross-classification, which leads to a large number of rules and nonterminal
categories.

If the language to be described is in fact regular, there may be a significant advantage in describing it by means of a regular expression instead of using a more powerful grammar formalism. In the example just discussed, we were able to obtain the language of invalid dates by subtracting one regular language from another one. There would be no straightforward method to achieve a similar result starting with a phrase structure grammar, since context free languages in general are not closed under complementation.
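The closure argument can be made concrete with a small sketch. The Python code below is only an illustration under stated assumptions: it uses two toy automata over the alphabet {a, b} rather than the date languages of this paper, and the dictionary encoding and the names `accepts` and `difference` are hypothetical. It builds an automaton for A - B by the standard product construction, which works here because both operands are complete deterministic automata.

```python
# Toy complete DFAs encoded as dicts (an assumed encoding, not the paper's).
ALPHABET = "ab"

# A: strings that end in "a".
A = {"start": 0, "finals": {1},
     "delta": {(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 0}}

# B: strings of even length.
B = {"start": 0, "finals": {0},
     "delta": {(0, "a"): 1, (0, "b"): 1, (1, "a"): 0, (1, "b"): 0}}


def accepts(dfa, string):
    """Run the DFA over the string and report whether it stops in a final state."""
    state = dfa["start"]
    for symbol in string:
        state = dfa["delta"][(state, symbol)]
    return state in dfa["finals"]


def difference(a, b, alphabet=ALPHABET):
    """Product construction for A - B: a pair is final iff it is final in a and not in b."""
    start = (a["start"], b["start"])
    delta, finals = {}, set()
    seen, todo = {start}, [start]
    while todo:
        state = todo.pop()
        p, q = state
        if p in a["finals"] and q not in b["finals"]:
            finals.add(state)
        for symbol in alphabet:
            target = (a["delta"][(p, symbol)], b["delta"][(q, symbol)])
            delta[(state, symbol)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return {"start": start, "finals": finals, "delta": delta}


A_MINUS_B = difference(A, B)
for s in ["a", "ba", "bba", "b"]:
    print(repr(s), accepts(A_MINUS_B, s))
```

In the date example the same operation would be applied to the language of all date expressions and the language of valid dates; no comparable general construction exists for context free grammars.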

The next sections will add more examples supporting this general line of argument and present other applications of the finite state calculus to natural language processing.

Applications of the finite state calculus

Tokenization

One of the very first steps in any natural language processing system is applying a tokenizer to the input text. A tokenizer is a device that segments an input stream into an ordered sequence of tokens, each token corresponding to an inflected word form, a number, a punctuation mark, or other kind of unit to be passed on to subsequent processing. If the output never contains alternative segmentations for any part of the input, the tokenizer is called deterministic. Deterministic tokenization is commonly seen as an independent preprocessing step, unambiguously producing items for subsequent morphological analysis.

In our approach, tokenization is an integral part of language processing which can be adapted to the needs of the subsequent analysis steps. Depending on the following steps, one might want to invoke different tokenization algorithms. In the section on deterministic tokenization below, we describe a deterministic tokenizer which is useful for stochastic part-of-speech disambiguation. Then, in the section on nondeterministic tokenization, we sketch a situation in which we might need nondeterministic tokenization and describe how that is achieved.

Deterministic Tokenization



The simplest kind of deterministic tokenizer is an unambiguous transducer that splits the input stream into a unique sequence of tokens, one per line, taking into account some general character classes but not using any language-specific information.

Letter = [A | B | C | ... | x | y | z]

WhiteSpace = [" " | "\t" | "\n"]

Other = \[Letter | WhiteSpace]

With these definitions, we can compile a simple tokenizing transducer that inserts newlines to mark token boundaries from the following regular expression. It represents the composition of three simple replace relations, as explained below.

[WhiteSpace+ @-> " "]
.o.
[Letter+ @-> ... "\n" , Other+ @-> ... "\n"]
.o.
[" " -> [] || [.#. | "\n"] _ ]

The first relation reduces strings of whitespace characters into a single blank, using longest match replacement. The second inserts a newline as a token boundary after longest matches of letter sequences and other non-whitespace sequences. The third formula, a contexted replace expression, denotes a relation that eliminates any initial space and all spaces that follow a token boundary. The composite single transducer consists of states and arcs. It is unambiguous: every sequence of input characters is mapped into a unique sequence of tokens.

Exactly the same tokenization can also be obtained by compiling the three replacements separately and applying the resulting transducers in a sequence, each modifying the output of the previous one. This step-by-step approach, a cascade of transductions, will be discussed in the section on tokenization by step-by-step transduction.

Footnote: The most efficient transducer is not only unambiguous but sequential as well. A sequential transducer produces a unique mapping without ever exploring any alternatives (Mohri). An unambiguous transducer can be sequentialized if all local ambiguities can be resolved with the help of a limited amount of lookahead. Tokenizing transducers are in general of this type.

Extending this simple tokenizer, we can create more sophisticated tokenizer transducers which are language-specific. In order to divide the input text into sentences and words recognized by subsequent processing steps, these tokenizers need to know about abbreviations, conjoined clitics, and multiword expressions that contain internal spaces, hyphens, and other special symbols. A lexicon compiled as a finite state language (TokenLexicon below) containing such exceptional tokens can be included in the middle line of the preceding formula:

[[Letter+ | TokenLexicon] @-> ... "\n" , Other+ @-> ... "\n"]

Because we use the longest match operator, any spaces and punctuation characters in the entries of TokenLexicon are mapped to themselves and do not count as token boundaries. For example, if the lexicon includes the strings "-tu", "l'", and "à côté de", a French tokenizer would split the string "Vois-tu l'arbre à côté de la maison ?" (Do you see the tree next to the house?) into the following tokens (token-bounding newlines are written as ‖):

Vois‖-tu‖l'‖arbre‖à côté de‖la‖maison‖?‖
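A minimal sketch of how a token lexicon interacts with longest match is given below. It is written in Python for illustration and scans the text directly instead of compiling a transducer; the function name `tokenize` and the three-entry lexicon are assumptions taken from the French example above.

```python
import re

LETTERS = re.compile(r"[^\W\d_]+")          # maximal run of letters
OTHER = re.compile(r"(?:[^\w\s]|[\d_])+")   # maximal run of other non-blank symbols


def tokenize(text, lexicon):
    """Left-to-right, longest-match tokenization with exceptional lexicon entries."""
    text = re.sub(r"\s+", " ", text).strip()
    tokens, i = [], 0
    while i < len(text):
        if text[i] == " ":
            i += 1
            continue
        # Candidates at position i: a letter run, a run of other symbols,
        # and every lexicon entry that starts here.
        candidates = [m.group() for m in (LETTERS.match(text, i), OTHER.match(text, i)) if m]
        candidates += [entry for entry in lexicon if text.startswith(entry, i)]
        best = max(candidates, key=len)      # the longest candidate wins
        tokens.append(best)
        i += len(best)
    return tokens


FRENCH_LEXICON = {"-tu", "l'", "à côté de"}
print(tokenize("Vois-tu l'arbre à côté de la maison ?", FRENCH_LEXICON))
```

Because the longest candidate always wins, the internal blanks of "à côté de" never become token boundaries, which is the behavior the text attributes to the longest match operator.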

Multiword tokens are of several types:

- adverbial expressions (all of a sudden, a priori, to and fro)
- prepositions (in front of, in spite of, with respect to)
- dates (January nd), time expressions (am)
- proper names (New York, Rank Xerox)
- and other units

A word of warning is in order. An unambiguous tokenizer analyzes every multiword expression consistently as a single token, even if the component words should be separated in a given context. For example, if "in general" is included in the multiword lexicon, then it will be tokenized as a unit in both of the examples "in general, he" and "in general meetings". Therefore, multiword lexicons used for unambiguous tokenization must be carefully and conservatively designed.

Tokenization by step by step transduction

If the multiword lexicons are very large, their compilation into a single transducer may lead to time and space problems. Moreover, different NLP applications may require different multiword lexicons. For these reasons, it may be advantageous to use another approach, based on multiple transducers that apply in a sequence to yield exactly the same end result as the single tokenizer just described. In general, such a sequence consists of

- a basic tokenizer, which segments any sequence of input characters into simple tokens (i.e. no multiword units), and
- one or several multiword staplers, which identify multiwords and group them together as single units.

The basic tokenizer is compiled as described in the previous section. The multiword staplers are built in the following way. We first define a multiword language, MWL, containing the units to be recognized. The definition assumes the basic tokenization has already been done: the internal word separator is a newline instead of a space. For example, if the multiword expressions consist of "ad hoc" and "and so on", we define the language as follows:


MWL = [a d "\n" h o c | a n d "\n" s o "\n" o n]

In order to define a relation that staples together the MWL expressions, it is useful to start with some auxiliary definitions:

BEG = "["
END = "]"
BND = [BEG | END]
LIM = [.#. | "\n"]

The BEG and END brackets are markers for the multiword string. The LIM expression is used to check the surrounding context, making sure that the beginning and the end of the candidate multiword expression are not part of some other token. The stapler is composed from the three auxiliary relations below:

Identify = ~$BND .o. [MWL @-> BEG ... END || LIM _ LIM]

Staple = "\n" -> " " || BEG ~$BND _ ~$BND END

Cleanup = BND -> []

The Identify relation wraps the multiword expressions in MWL inside a pair of auxiliary brackets, under the left-to-right, longest match regimen imposed by the @-> operator and under the constraint that the multiword string is properly delimited. The Staple relation converts every internal newline in a marked region into a space, leaving the final one unchanged. The Cleanup relation eliminates the auxiliary brackets.

The multiword stapler for the MWL expressions is the composition of the three relations defined above:

Stapler = Identify .o. Staple .o. Cleanup

The sequential application of the basic tokenizer and the multiword stapler is illustrated in the figure below. As before, we use ‖ to represent the newline symbol in order to save space.

one, two, and so on.
        (basic tokenizer)
one‖,‖two‖,‖and‖so‖on‖.‖
        (multiword stapler)
one‖,‖two‖,‖and so on‖.‖
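The Identify, Staple and Cleanup stages can be imitated with three small string functions. The Python sketch below is a rough analogue for illustration, not the compiled stapler: the bracket characters chosen for `BEG` and `END`, the function names, and the use of lookaround assertions to stand in for the LIM context are all assumptions.

```python
import re

BEG, END = "[", "]"                  # auxiliary brackets (assumed symbols)
MWL = ["ad\nhoc", "and\nso\non"]     # multiwords as they look after basic tokenization


def identify(text: str) -> str:
    """Bracket each properly delimited multiword, trying longer entries first."""
    assert BEG not in text and END not in text   # the input must be bracket-free
    entries = "|".join(re.escape(m) for m in sorted(MWL, key=len, reverse=True))
    # The candidate must start at the beginning or after a newline,
    # and must be followed by a newline or the end of the string.
    pattern = rf"(?:^|(?<=\n))(?:{entries})(?=\n|$)"
    return re.sub(pattern, lambda m: BEG + m.group() + END, text)


def staple(text: str) -> str:
    """Turn every newline inside a bracketed region into a blank."""
    return re.sub(r"\[[^\[\]]*\]", lambda m: m.group().replace("\n", " "), text)


def cleanup(text: str) -> str:
    """Remove the auxiliary brackets."""
    return text.replace(BEG, "").replace(END, "")


def stapler(text: str) -> str:
    return cleanup(staple(identify(text)))


print(repr(stapler("one\n,\ntwo\n,\nand\nso\non\n.\n")))
print(repr(stapler("bad\nhock\n")))
```

The second call shows the role of the delimiting context: "ad hoc" is not picked out of the middle of "bad hock" because the candidate neither starts after a token boundary nor ends before one.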

The basic tokenizer can of course be composed with the multiword staplers to form a single, larger transducer if increasing the speed of application is more important than the size of the network.

If the stapling of some multiword expressions is made optional, the tokenization becomes nondeterministic because the multiword interpretation of the MWL string is an alternative to the sequence of single word tokens produced by the basic tokenizer. The next section discusses some cases where it is advantageous not to use deterministic tokenization.



Footnote: The same logic could be encoded without the contexted version of the @-> operator, but the details would be quite a bit more complicated. The problem is to avoid errors such as picking out "ad hoc" as a multiword in the middle of "bad hock".

Nondeterministic Tokenization

The deterministic treatment of multiword expressions as single tokens is problematic because many such expressions have alternate analyses in different contexts. For instance, the string "de même" in French can be treated as a single token (meaning "similarly") or as a sequence of two independent tokens: the preposition "de" (of) followed by the adjective "même" (same). If the unambiguous tokenizer makes a wrong choice, it may lead to a parse failure or incorrect semantic interpretation. In such cases, a cautious tokenizer produces alternative segmentations, postponing the decision to a later processing stage.

With the techniques just introduced, it is easy to make a tokenizer that produces alternative segmentations for some strings. We start by creating a special multiword lexicon for strings such as "de même" that should be analyzed either as a single token or as a sequence of tokens. If we are using the step-by-step approach in the previous section, we introduce into the cascade a second, ambiguous stapler transducer that optionally adjusts the output of the basic tokenizer for these potential multiword items. This optional stapler is defined exactly like the unambiguous stapler, except that we include the universal identity relation, ?*, to allow for any string to be mapped to itself:

OptionalStapler = [Identify .o. Staple .o. Cleanup] | ?*

This optional stapler maps the output of the basic tokenizer, de‖même‖, into both de‖même‖ (identity) and de même‖ (stapled).
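The nondeterminism introduced by the identity relation can be pictured as a function that returns a set of outputs instead of a single string. The Python sketch below is a simplified stand-in for illustration; `stapler` and `optional_stapler` are hypothetical names, and the obligatory stapler is reduced to one regular expression substitution per multiword.

```python
import re


def stapler(text, multiwords):
    """Obligatory stapling: join the tokens of each delimited multiword with blanks."""
    for mw in sorted(multiwords, key=len, reverse=True):
        # After basic tokenization the words of a multiword are separated by newlines.
        tokens = re.escape(mw.replace(" ", "\n"))
        text = re.sub(rf"(?:^|(?<=\n)){tokens}(?=\n|$)", lambda m, mw=mw: mw, text)
    return text


def optional_stapler(text, multiwords):
    """Union with the identity relation: the input itself is always one output."""
    return {text, stapler(text, multiwords)}


print(sorted(optional_stapler("de\nmême\n", ["de même"])))
```

When the input contains no multiword candidate, the two outputs coincide and the set has a single member, so the optional stapler only adds ambiguity where the special lexicon applies.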

Making tokenization nondeterministic solves one problem but introduces another one. The subsequent stages of processing have to deal with the ambiguous representations of the input. This problem was first addressed in the context of a constraint-based finite state parser for French (Chanod and Tapanainen; Chanod and Tapanainen) that builds a finite state network for the input sentence that represents not only the alternative tokenizations but also all the additional ambiguities arising from the morphosyntactic analysis of the tokens. Each path through the network represents one possible tokenization and one possible morphosyntactic analysis for each token. Because of alternative tokenizations, the paths in general do not have the same number of tokens.

At this level, some of these paths can be quickly eliminated by syntactic constraints. Syntactic constraints are expressed as regular expressions, typically containing the restriction operator, and compiled to networks. These automata are intersected with the sentence network to prune out unwanted readings. In particular, they remove unacceptable tokenization schemes, unless the ambiguity is syntactically acceptable.

For instance, the sentence "De même, les boîtes de même format sont classées ensemble" (Similarly, the boxes of same format are classified together) is ambiguous at the tokenization level because of the ambiguous string "de même". This leads to four different paths in the input network, as far as tokenization is concerned:



Footnote: There are actually many more paths, as the input network represents not only tokenization ambiguity but also syntactic ambiguity.

De même                de même
           les boîtes             format sont classées ensemble
De même                de même
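The four tokenization paths of this lattice can be enumerated mechanically. The Python sketch below is an illustration only: it assumes that each occurrence of "de même" is either one multiword token or two separate tokens, it omits punctuation as the figure does, and the names `LATTICE` and `tokenization_paths` are hypothetical.

```python
from itertools import product

# Each lattice position lists its alternative token sequences.
LATTICE = [
    [["De même"], ["De", "même"]],
    [["les", "boîtes"]],
    [["de même"], ["de", "même"]],
    [["format", "sont", "classées", "ensemble"]],
]


def tokenization_paths(lattice):
    """Expand the lattice into one flat token list per path."""
    paths = []
    for choice in product(*lattice):
        paths.append([token for segment in choice for token in segment])
    return paths


PATHS = tokenization_paths(LATTICE)
for path in PATHS:
    print(len(path), path)
```

The printed lengths differ from path to path, which is the point made earlier: because of alternative tokenizations, the paths do not all have the same number of tokens.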

However, after syntactic analysis, there remains only one analysis, where the first "de même" is recognized as a multiword adverbial while the second one is decomposed into two independent tokens:

de même     Adv Cap MWE Adverbial
le          InvGen PL Def Det NounPrMod
boîte       Fem PL Noun Subject
de          Prep
même        InvGen SG Adj NounPrMod
format      Masc SG Noun PPObj
être        IndP PL P Verb Copl Auxi MainV
classer     PaPrt Fem PL Verb PastPart
ensemble    Adv Adverbial

This is due to the syntactic constraints that reject unwanted analyses, including incorrect tokenizations. For instance, the path where the two occurrences of "de même" are split into two tokens is rejected by several constraints, among which is the following constraint:

Prep =>
    _ Coord Prep ,                                         (a)
    _ ~$[NounHead | VerbHead | Prep] PPObj ,               (b)
    _ ~$[NounHead | VerbHead | Prep] [Inf | PresPart] ,    (c)
    _ Adverbial ,                                          (d)
    _ NounPsMod                                            (e)

This constraint allows the Prep symbol to appear only in the given five contexts that describe the possible continuations for a preposition in a French sentence:

(a) before a coordinated preposition
(b) before a PPObj without prior head nouns or head verb
(c) before an infinitival or participial verb
(d) before an adverbial
(e) before an adjective, on the condition that it is a noun postmodifier

None of these contextual constraints accepts the sequence Prep NounPrMod Det Subject on the path corresponding to De‖même‖les‖boîtes. This path is then eliminated from the input sentence network.

If, for a given sentence, two ambiguous tokenization paths are syntactically acceptable, they are both preserved after intersection with the constraint networks. This is what happens with the sentence "Je pense bien qu'il parle", where "bien que" is ambiguous, meaning "although" as a multiword token. The sentence can be read either as (lit.) "I think well that he speaks", i.e. "I do think that he speaks", or "I think, although he speaks", which leads to two remaining analyses (paths) in the output sentence network.

Light Parsing by Marking and Filtering Transducers

The preceding subsections have shown applications of the finite state calculus to parts of natural language processing such as tokenization and morphological analysis used by part-of-speech tagging. An earlier section showed that finite state parsers can be created for some subsets of natural language, such as correct and incorrect dates. Here we present a full scale light parser built from the finite state calculus described earlier in this paper.

For some large scale text applications such as terminology extraction, lexicography, or information retrieval, a parser must recognize recurring lexical syntactic patterns and their variants. The parser need be no more powerful than is necessary to recognize these patterns. Many such parsers (Joshi; Debili; Grefenstette; Abney; Appelt et al.) employ recognizers over part-of-speech tagged text by first marking contiguous patterns such as noun and verb groups, then marking heads within groups, and then extracting patterns between noncontiguous heads. Some of these parsers mix non-finite-state procedures with finite state recognizers, but we show here that the entire parser can be built within a finite state framework.

A finite state transducer that introduces extra symbols into an input string can be considered as a finite state marker. As seen earlier, the longest match operator can be used to introduce marks around nominal groups and verbal groups, and simple transducers can be used to mark head words within these groups. A finite state filter is a transducer which outputs only certain parts of the input string, setting all the other parts of the string to epsilon and possibly introducing a filter-indicating label. Schematically, a filtering transducer has the following structure:

0:RelationLabel  [LEFTCONTEXT .x. 0]
token            [MIDDLE .x. 0]
token            [RIGHTCONTEXT .x. 0]

Transducers implementing finite state markers can be composed with transducers implementing filters to create a finite state parser which can extract and label a wide variety of n-ary syntactic dependency relations between words.

We have created a finite state light parser in the following way: (1) using the longest match and replacement operators, transducers are created which identify contiguous noun group and verbal group boundaries; (2) labeling transducers are described which mark the nominal or verbal heads within each group; and (3) filtering transducers are defined which extract and label the syntactic relations between words within and across group boundaries.

Footnote: First described, in a different notation, in Debili. Recall that 0 represents the empty string. Therefore, 0:a introduces a into the lower-side string, and B .x. 0, the cross-product of the language B with the empty-string language, eliminates the B strings from the output.

Phrasal Markup

Rule-based descriptions of phrase contours can be done in a variety of ways; the light parser described here uses a finite state calculus representation of part-of-speech precedence matrix (Debili) phrasal descriptions. Here is a simple example: a regular expression describing a French nominal phrase, using the restriction operator to implement the rows and columns of such a precedence matrix.

NominalPhrase =
    [Art => _ Noun] &
    [Noun => _ [PAdj | Prep | .#.]] &
    [PAdj => _ [PAdj | Prep]] &
    [Prep => _ [Art | Noun]] &
    [[Art | Noun] [Art | Noun | PAdj | Prep]*]

The expression states that a nominal phrase can contain an article followed by a noun; a noun followed by a postposed adjective, or by a preposition, or nothing (.#. means the end of an expression); a postposed adjective followed by another adjective or by a preposition; etc. The nominal group must begin with an article or a noun. We can derive from such a definition a phrase marking transducer, using the directed replace operator (@->) and the three dots symbol (...) to mark the insertion points around the longest instances of these groups in each tagged sentence present in input:

MarkNGroup = NominalPhrase @-> NG ... NG
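The precedence-matrix idea and the longest match marking can be sketched over a plain list of tags. The Python code below is an illustrative approximation, not the compiled transducer: the table `FOLLOWERS` follows the prose description literally (only a noun may end the phrase), and the bracket tokens "[NG" and "NG]" as well as the function names are assumptions.

```python
END = "#"   # stands for the end of the candidate phrase

# One row per tag: the tags (or END) that may follow it inside a nominal phrase.
FOLLOWERS = {
    "Art": {"Noun"},
    "Noun": {"PAdj", "Prep", END},
    "PAdj": {"PAdj", "Prep"},
    "Prep": {"Art", "Noun"},
}
STARTERS = {"Art", "Noun"}


def is_nominal_phrase(tags):
    """Check a tag sequence against the precedence table."""
    if not tags or tags[0] not in STARTERS:
        return False
    for current, following in zip(tags, tags[1:] + [END]):
        if following not in FOLLOWERS.get(current, set()):
            return False
    return True


def mark_ngroups(tags):
    """Scan left to right and bracket the longest nominal phrase at each position."""
    out, i = [], 0
    while i < len(tags):
        for j in range(len(tags), i, -1):        # try the longest span first
            if is_nominal_phrase(tags[i:j]):
                out += ["[NG"] + tags[i:j] + ["NG]"]
                i = j
                break
        else:                                     # no phrase starts here
            out.append(tags[i])
            i += 1
    return out


print(mark_ngroups(["Art", "Noun", "PAdj", "Prep", "Noun", "Verb", "Art", "Noun"]))
```

Adapting the sketch to another tagset or another noun phrase definition only requires editing the table, which mirrors the remark that such phrasal patterns are easy to rewrite.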

Such phrasal patterns can easily be written for different languages, or different tagsets, or with different noun phrase definitions in a given language; e.g. for terminology extraction, the determiners might be omitted in the definition of the noun phrase pattern. Schiller describes a language-independent architecture for finite state noun phrase extraction built on our finite state tools and taggers. The phrasal markup, however, can also be used as a step towards further analysis leading to syntactic function extraction, as shown below.

Head Marking

Once the group markers are inserted by the transducers described above into the tagged text, another transducer places head labels before certain classes of words within the groups. These additional labels indicate that the words appear in specified contexts and make the task of writing syntactic function filters easier and shorter.

Supposing that NOUN matches words tagged as nouns, PREP prepositions, and CC conjunctions, nominal heads are marked as being modified by prepositions (P) or as independent noun phrase heads (N) by the transducers produced by the following regular expression:

N NG NOUN NOUN NG NG CC

o N P PREP PREP NG

These expressions state that the empty string is replaced with the symbol N in front of a noun that is not followed by another noun in the same noun group. This transducer is composed with another which replaces the N label with a P when the label is preceded by a preposition in the same noun group, with no intervening prepositions.

Verbal heads are similarly marked with aspect labels inside verbal groups. For example, PasV marks verb heads in passive constructions.

Applying the phrasal marking transducers and the head markers over tagged text inserts the following markings into an example sentence (hiding the part-of-speech tags for legibility):

NG Significant N correlations NG VG were PasV obtained VG NG

between the maternal and fetal glucose P levels and the maternal and

fetal ffa P levels NG

Syntactic Function Extraction

Having introduced nominal and verbal group delimiters and labeled heads within these nominal and verbal groups makes writing regular expression filters easier. For example, a filter that extracts passive subjects can be written as

PassiveSubj x N NOUN

NG VG x PasV VERB x

Applying this filter to the following examples extracts:

PassiveSubj correlations obtained
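The filtering step can be imitated on the marked-up example sentence with an ordinary regular expression. The Python sketch below is a loose analogue for illustration only: it works on the plain-token rendering of the markup shown above (NG and VG as group marks, N and PasV as head labels), and the pattern and the function name `passive_subjects` are assumptions rather than the filter used in the parser.

```python
import re

MARKED = ("NG Significant N correlations NG VG were PasV obtained VG NG "
          "between the maternal and fetal glucose P levels and the maternal "
          "and fetal ffa P levels NG")

# A noun head labeled N that closes a noun group, immediately followed by a
# verb group whose head is labeled PasV; everything else is discarded.
PASSIVE_SUBJ = re.compile(r"\bN (\w+) NG VG (?:\w+ )*?PasV (\w+)")


def passive_subjects(marked):
    """Return one labeled relation per match, dropping the surrounding context."""
    return [f"PassiveSubj {noun} {verb}" for noun, verb in PASSIVE_SUBJ.findall(marked)]


print(passive_subjects(MARKED))
```

Like the filtering transducer, the pattern keeps only the two head words and the relation label. Unlike it, this simplified pattern does not stop the optional material from running across further group marks, so a realistic filter would need tighter context expressions.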

A finite state light parser is constructed by aligning the phrase marking transducer, the head marker transducer, and a union of filtering transducers into one cascade. We constructed (Grefenstette) and ran such a light finite state parser, which included other filters describing other syntactic patterns, over the first megabyte of AP news from and randomly chose one hundred output sentences. On a SPARC, tagging the words took about seconds of real time; inserting noun and verb phrase boundaries via non-optimized marking transducers took about minutes; marking heads inside boundaries took about minutes seconds; and applying the union of different filters took about minutes. In all, this is about words per minute. Manual evaluation showed that of randomly chosen PassiveSubj relations from these sentences were correct. These numbers can be improved by introducing more complicated filters, but already they provide useful indications of subcategorizations; see, for example, the passive subjects extracted for the word "killed": people ( times), seaman, villagers, vendors, teenagers, soldiers, rebels, pilot, patient, etc.

Conclusion

We have presented numerous examples illustrating the application of the regular expression calculus to language engineering tasks ranging from tokenization to lightweight syntactic analysis. There are many other types of applications that we have not discussed because of space or because they are already well known, such as morphological analysis by lexical transducers. In particular, we would have liked to include a fuller discussion of rule-based disambiguation to illustrate the use of replace expressions in systems such as the Brill tagger (Brill; Roche and Schabes) and the constraint grammar parser (Karlsson et al.).

Although regular expressions and the algorithms for converting them into finite state automata have been part of elementary computer science for decades, the restriction and replacement expressions we have focused on are recent. They have turned out to be very useful for linguistic applications. Descriptions consisting of regular expressions can be efficiently compiled into finite state networks, which in turn can be determinized, minimized, sequentialized, compressed, and optimized in other ways to reduce the size of the network or to increase the application speed. Many years of engineering effort have produced efficient runtime algorithms for applying networks to strings.

Regular expressions have a clean declarative semantics. At the same time, they constitute a kind of high level programming language for manipulating strings, languages, and relations. Although regular grammars can cover only limited subsets of a natural language, there can be an important practical advantage in describing such sublanguages by means of regular expressions rather than by some more powerful formalism. Because regular languages and relations can be encoded as finite automata, they can be more easily manipulated than context free and more complex languages. With regular expression operators, new regular languages and relations can be derived directly, without rewriting the grammars for the sets that are being modified. This is a well-established practice in finite state morphology. Our examples in this paper provide ample evidence of its utility in other areas of language engineering.

Acknowledgements

We would like to thank Kenneth R. Beesley and Annie Zaenen for their editorial advice. We also thank Annie Zaenen for many discussions about the feasibility of a regular grammar for valid date expressions, and Pasi Tapanainen for his contributions to the sections on finite state disambiguation. We acknowledge Ronald M. Kaplan and Martin Kay, our Xerox colleagues, for the pioneering work on the approach we are taking in this paper.


References

Abney, Steven. Parsing by chunks. In Abney, Steven et al. (eds.), Principle-Based Parsing. Kluwer Academic Publishers, Dordrecht.

Appelt, Douglas E., Hobbs, Jerry R., Bear, John, Israel, David, and Tyson, Mabry. FASTUS: A finite-state processor for information extraction from real-world text. Proceedings of IJCAI, Chambery, France.

Beesley, Kenneth R. and Karttunen, Lauri. Finite-State Morphology. Technical Report (forthcoming). RXRC, Grenoble.

Brill, Eric. A simple rule-based part-of-speech tagger. Proceedings of ANLP, Trento, Italy.

Chanod, Jean-Pierre and Tapanainen, Pasi. A Non-Deterministic Tokenizer for Finite-State Parsing. ECAI workshop on Extended Finite State Models of Language, Budapest.

Chanod, Jean-Pierre and Tapanainen, Pasi. Finite-State Based Reductionist Parsing for French. In Kornai, Andras (ed.), Extended Finite State Models of Language. Cambridge University Press (forthcoming).

Debili, Fathi. Analyse Syntaxico-Sémantique Fondée sur une Acquisition Automatique de Relations Lexicales-Sémantiques. PhD dissertation, University of Paris XI, France.

Grefenstette, Gregory. Use of syntactic context to produce term association lists for text retrieval. Proceedings of SIGIR, Copenhagen, Denmark. ACM.

Grefenstette, Gregory. Light Parsing as Finite-State Filtering. In Kornai, Andras (ed.), Extended Finite State Models of Language. Cambridge University Press (forthcoming).

Grefenstette, Gregory and Tapanainen, Pasi. What is a word, what is a sentence? Problems of tokenization. Proceedings of COMPLEX, Budapest.

Joshi, Aravind. Computation of Syntactic Structure. In Advances in Documentation and Library Science, vol. III. Interscience Publishers.

Kaplan, Ronald M. and Kay, Martin. Regular Models of Phonological Rule Systems. Computational Linguistics.

Karlsson, Fred, Voutilainen, Atro, Heikkilä, Juha, and Anttila, Arto (eds.). Constraint Grammar: a language-independent system for parsing unrestricted text. Mouton de Gruyter, Berlin and New York.

Karttunen, Lauri and Beesley, Kenneth R. Two-Level Rule Compiler. Technical Report ISTL, October. Xerox Palo Alto Research Center, Palo Alto, California.

Karttunen, Lauri. Constructing Lexical Transducers. Proceedings of COLING, I, Kyoto, Japan.

Karttunen, Lauri. The Replace Operator. Proceedings of ACL, Boston, Massachusetts.

Karttunen, Lauri. Directed Replacement. Proceedings of ACL, Santa Cruz, California.

Karttunen, Lauri, Kaplan, Ronald M., and Zaenen, Annie. Two-level morphology with composition. Proceedings of COLING, I, Nantes, France.

Kempe, Andre and Karttunen, Lauri. Parallel Replacement in the Finite-State Calculus. Proceedings of COLING, Copenhagen, Denmark.

Koskenniemi, Kimmo. Two-level Morphology: A General Computational Model for Word-Form Recognition and Production. Department of General Linguistics, University of Helsinki, Finland.

Koskenniemi, Kimmo, Tapanainen, Pasi, and Voutilainen, Atro. Compiling and using finite-state syntactic rules. Proceedings of COLING, I, Nantes, France.

Mohri, Mehryar. On Some Applications of Finite-State Automata Theory to Natural Language Processing. Natural Language Engineering.


Palmer, David D. and Hearst, Marti A. Adaptive sentence boundary disambiguation. Proceedings of ANLP, Stuttgart, Germany.

Roche, Emmanuel and Schabes, Yves. Deterministic Part-of-Speech Tagging. Computational Linguistics.

Salton, Gerald, Zhao, Zhongnan, and Buckley, Chris. A simple syntactic approach for the generation of indexing phrases. Technical Report, Department of Computer Science, Cornell University.

Schiller, Anne. Multilingual Finite-State Noun Phrase Extraction. ECAI Workshop on Extended Finite State Models of Language, Budapest.

Silberztein, Max. Dictionnaires électroniques et analyse automatique de textes: Le système INTEX. Masson, Paris.

Voutilainen, Atro and Tapanainen, Pasi. Ambiguity resolution in a reductionistic parser. Proceedings of EACL, Utrecht.