University of Pennsylvania ScholarlyCommons

IRCS Technical Reports Series Institute for Research in Cognitive Science

April 1995

A Freely Available Syntactic Lexicon for English

Dania Egedi University of Pennsylvania

Paul Martin SRA

Follow this and additional works at: https://repository.upenn.edu/ircs_reports

Egedi, Dania and Martin, Paul, "A Freely Available Syntactic Lexicon for English" (1995). IRCS Technical Reports Series. 125. https://repository.upenn.edu/ircs_reports/125

University of Pennsylvania Institute for Research in Cognitive Science Technical Report No. IRCS-95-11.

This paper is posted at ScholarlyCommons. https://repository.upenn.edu/ircs_reports/125

Abstract This paper presents a syntactic lexicon for English that was originally derived from the Oxford Advanced Learner's Dictionary and the Oxford Dictionary of Current Idiomatic English, and then modified and augmented by hand. There are more than 37,000 syntactic entries from all 8 parts of speech. An X-windows based tool is available for maintaining the lexicon and performing searches. C and Lisp hooks are also available so that the lexicon can be easily utilized by parsers and other programs.


The Institute For Research In Cognitive Science

A Freely Available Syntactic Lexicon for English

by

Dania Egedi (IRCS) and Patrick Martin (SRA)

University of Pennsylvania, 3401 Walnut Street, Suite 400C, Philadelphia, PA 19104-6228

April 1995 (originally published in August 1994)

Site of the NSF Science and Technology Center for Research in Cognitive Science

University of Pennsylvania IRCS Report 95-11

Appears in the Proceedings of the International Workshop on Sharable Natural Language Resources, Nara, Japan, August 1994.

A Freely Available Syntactic Lexicon for English

Dania Egedi and Patrick Martin

Institute for Research in Cognitive Science

University of Pennsylvania

Philadelphia, PA, USA

{egedi,martin}@unagi.cis.upenn.edu

Abstract

This paper presents a syntactic lexicon for English that was originally derived from the Oxford Advanced Learner's Dictionary and the Oxford Dictionary of Current Idiomatic English, and then modified and augmented by hand. There are more than 37,000 syntactic entries from all 8 parts of speech. An X-windows based tool is available for maintaining the lexicon and performing searches. C and Lisp hooks are also available so that the lexicon can be easily utilized by parsers and other programs.

Introduction

One of the central needs of any wide-coverage parser is a large lexicon that contains the syntactic information for various lexical items. The creation of such a lexicon has traditionally been a very large and daunting task, and most universities have shied away from it, leaving the creation of wide-coverage parsers to commercial institutions that could afford the time and personnel to devote to the creation of such a lexicon. The release of several machine-readable dictionaries (MRDs) into the public domain has opened new possibilities to grammar developers at research institutions, but the task did not become trivial. The problem of creating large scale lexicons changed from the tiresome, painstaking task of trying to develop individual word lists for various syntactic phenomena to the task of "simply" extracting the information from the on-line dictionaries. This, however, has not turned out to be as simple or straightforward as researchers may have hoped. Machine-readable dictionaries present numerous problems in terms of errors and inconsistencies in the various components of the lexical entries, making extraction quite difficult. Many researchers abandon the extraction process altogether because it consumes too many scarce resources.

Although a number of researchers have extracted information out of the various dictionaries available, the resulting lexicons have not, in general, been made freely available to the NLP research community. In at least some cases (Carroll and Grover; Guthrie et al.), this is due to licensing restrictions on the source dictionaries. In response to the related problems of duplication of effort and non-availability of needed lexicons, there are currently several ongoing projects to create syntactic lexicons and make them generally available.

- The Proteus Project at New York University is developing the Comlex Syntactic Dictionary from scratch, for release as one of the lexical resources in COMLEX, available through the Linguistic Data Consortium (Macleod et al.).

- The IITLEX project at Illinois Institute of Technology has an ongoing project to extract and release the information in the Collins English Dictionary, along with information from various other word lists, that will include both syntactic and semantic information. That system is still under development, however, and currently uses an expensive relational database package, a drawback which they plan to correct (Conlon).

The syntactic lexicon described here contains approximately 37,000 entries extracted from the Oxford Advanced Learner's Dictionary of Current English (Hornby) and the Oxford Dictionary of Current Idiomatic English (Cowie and Mackin). It is available via FTP in both an ASCII and a database format. The database format uses a UNIX hash table facility (Seltzer and Yigit) that is freely distributed, and comes with an X-windows based interface for modifying the database and doing searches. C and Lisp hooks to allow other programs to use the database are also included.

(Patrick Martin is currently at SRA, Arlington, VA, USA; martinp@sra.com.)

Syntactic Lexicon

The syntactic lexicon has entries for 8 part-of-speech categories, among them Adjective, Adverb, Complementizer, Noun, Preposition, and Verb. Each entry consists of the following required and optional fields:

- index field (required): the uninflected form under which the lexical item is compiled in the database.
- entry field (required): contains all of the lexical items associated with the index. (For example, a verb particle construction would be indexed under the verb, but would contain both the verb and the verb particle in the entry field.)
- pos field (required): gives the part-of-speech for the lexical item(s) in the entry field.
- frame field (required): contains the syntactic information about that entry.
- fs field (optional): the Feature Structure field may provide additional information about the frame field.
- ex field (optional): may be used for any number of example sentences.

Note that lexical items may have more than one entry in the database (e.g. have), and that they may select the same frame field more than once, using the fs to capture lexical idiosyncrasies (e.g. map). Table 1 shows selected entries from the database.

INDEX:  have
ENTRY:  have
POS:    Verb
FRAME:  [...]
FS:     Goes on Infinitive
EX:     John has to go to the store.

INDEX:  have
ENTRY:  have
POS:    Verb
FRAME:  [...]
FS:     NonErgative
EX:     John has a problem.

INDEX:  map
ENTRY:  map out
POS:    Verb Verb Particle
FRAME:  Transitive Verb Particle

INDEX:  map
ENTRY:  map
POS:    Noun
FRAME:  Base Noun
        Noun Determiner required
        Noun Modifier
FS:     wh-, reflexive-

INDEX:  map
ENTRY:  map
POS:    Noun
FRAME:  Noun Determiner not required
FS:     wh-, reflexive-, plural

Table 1: Selected Syntactic Database Entries

Because the syntactic database is part of the XTAG project (Doran et al.), an ongoing project to develop a wide-coverage parser for English (see the Related Work section), some entries in the syntactic lexicon reflect specific XTAG analyses. In fact, the graphical interface for the syntactic lexicon, described in the Interface section, can run in two modes, xtag and verbose. The entry tables in this paper were all generated in verbose mode.

The vast majority of lexical items in the database fall into just three categories: Adjectives, Nouns, and Verbs. These three categories, plus Adverbs, are presented in more detail in the following subsections.
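The field layout described above could be modeled in a program roughly as follows. This is only a sketch: the paper does not give the exact syntax of the ASCII file, so the `NAME=value` segments separated by `|` used here are a hypothetical stand-in, and the names `LexEntry` and `parse_line` are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List

REQUIRED = ("INDEX", "ENTRY", "POS", "FRAME")


@dataclass
class LexEntry:
    index: str                 # uninflected form the entry is filed under
    entry: str                 # all lexical items, e.g. "map out"
    pos: str                   # part(s) of speech for the items in `entry`
    frames: List[str]          # one or more frame values
    fs: List[str] = field(default_factory=list)        # optional features
    examples: List[str] = field(default_factory=list)  # optional examples


def parse_line(line: str) -> LexEntry:
    """Parse one entry from a hypothetical 'NAME=value|NAME=value' line."""
    fields = {}
    for segment in line.strip().split("|"):
        name, _, value = segment.partition("=")
        fields.setdefault(name.strip().upper(), []).append(value.strip())
    missing = [name for name in REQUIRED if name not in fields]
    if missing:
        raise ValueError("missing required field(s): " + ", ".join(missing))
    return LexEntry(
        index=fields["INDEX"][0],
        entry=fields["ENTRY"][0],
        pos=fields["POS"][0],
        frames=fields["FRAME"],
        fs=fields.get("FS", []),
        examples=fields.get("EX", []),
    )


# The plural-restricted "map" entry from Table 1, in the assumed line format.
map_plural = parse_line(
    "INDEX=map|ENTRY=map|POS=Noun|FRAME=Noun Determiner not required"
    "|FS=wh-|FS=reflexive-|FS=plural"
)
print(map_plural.frames, map_plural.fs)
```

Repeating a field name (as with `FS` here) is how this sketch represents multi-valued fields; the real file may mark them differently.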

Adjectives

There are [...] lexical adjectives in the database, of which [...] are Proper Name adjectives, such as Chinese and American. Adjectives have [...] frames that they can select, which are listed below. Possible values for the fs field are wh+ and wh-.

- Base adjective: All adjectives.
- Modifying adjective: Adjectives that can occur in direct modification contexts. Ex: the Chinese man
- Predicative adjective: Adjectives that can occur as the complement of a predicative verb. Ex: John was happy.
- Predicative adjective w/ sentential complement: Adjectives that can occur as the complement of a predicative verb and that take a sentential complement. Ex: John was happy that Mary left Bill.
- Predicative adjective w/ sentential subject: Adjectives that can occur as the complement of a predicative verb and that take a sentential subject. Ex: That John loves Mary is great.

Nouns

Nouns are by far the largest category in the syntactic database, accounting for well over [...] of the entries. Proper nouns and common nouns both have the part-of-speech Noun. Proper names, such as Danielle and Nicholas, are not well-represented in the database, but geographic names, particularly places in England, generally are. (This reflects the origin of the dictionary from which the lexicon was originally extracted.) The frames for nouns are similar in many ways to the frames for adjectives, since nouns can modify other nouns and occur in predicative sentences. Other frames provide information about the use of the noun with determiners when forming noun phrases. The [...] frames for nouns are presented below.

- Base noun: All nouns.
- Noun Phrase with Determiner: Nouns that can take a determiner when forming a noun phrase. Ex: a man / *a jealousy
- Noun Phrase without Determiner: Nouns that can appear without a determiner when forming a noun phrase. Ex: envy / *plant
- Modifying noun: Nouns that can modify other nouns. Note that not all nouns can modify other nouns. Proper nouns in general cannot modify other nouns, and specific lexical items may be restricted as well. Ex: basketball game / *John car
- Noun with sentential complement: Nouns that take sentential complements. Ex: the fact that Mary loves John
- Predicative noun: Nouns that can occur as the complement of a predicative verb. Ex: John was a man.
- Predicative noun w/ sentential subject: Nouns that can occur as the complement of a predicative verb and that take a sentential subject. Ex: That John loves Mary is a crime.

Because this lexicon is used in the XTAG system, the lexicon often indicates precise syntactic behavior rather than simply placing a general label on a lexical item. For the class of nouns, this is seen in the specification of nouns with respect to their co-occurrence with determiners. Instead of assigning a general label such as common noun or [...], the noun frames explicitly indicate whether certain forms of the noun can appear with or without a determiner. However, since the syntactic database is indexed on root forms only, the morphology of the lexical item is not available. Instead, the FS field is used to indicate any restrictions on a particular use of a lexical item. For example, in Table 1, the noun map occurs twice. The first time that it appears, it selects the Noun Determiner required frame. The feature structures associated with it indicate only that the noun is not a wh-word and that it is not reflexive. No restrictions are made with respect to its morphology. In contrast, the second entry, which selects the Noun Determiner not required frame, has plural as part of its FS. This indicates that the noun for this frame is restricted to its plural form. Hence, map can only occur with a determiner, but maps is free to occur both with or without one. Nouns that belong to the class of so-called mass nouns would not have the plural restriction on the entry that selects the Noun Determiner not required frame, thereby indicating that the singular form is also allowed to occur without a determiner.
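The interaction between the determiner frames and the plural restriction in the FS field can be illustrated with a small check. This is a sketch under assumptions: the two map entries follow Table 1, the mass-noun entry for "advice" is made up for illustration, and the caller is assumed to have already obtained the root form and number of the word from a morphological lookup, since the lexicon itself is indexed on root forms only.

```python
# Entries for "map" follow Table 1; "advice" is an invented mass-noun entry.
ENTRIES = [
    {"index": "map", "frame": "Noun Determiner required",
     "fs": {"wh-", "reflexive-"}},
    {"index": "map", "frame": "Noun Determiner not required",
     "fs": {"wh-", "reflexive-", "plural"}},
    {"index": "advice", "frame": "Noun Determiner not required",
     "fs": {"wh-", "reflexive-"}},
]


def can_occur_bare(root: str, is_plural: bool) -> bool:
    """True if some entry lets this form of the noun appear with no determiner."""
    for e in ENTRIES:
        if e["index"] != root or e["frame"] != "Noun Determiner not required":
            continue
        if "plural" in e["fs"] and not is_plural:
            continue  # this entry is restricted to the plural form
        return True
    return False


print(can_occur_bare("map", is_plural=False))     # singular "map"
print(can_occur_bare("map", is_plural=True))      # plural "maps"
print(can_occur_bare("advice", is_plural=False))  # mass noun, singular
```

The singular map is rejected because its only determiner-free entry carries the plural restriction, while the mass noun is accepted because its entry carries none.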

Verbs

Verbs, with their varied subcategorization frames, are perhaps the most interesting lexical items in a syntactic lexicon. There are over [...] verbs, not including auxiliary verbs, that make up almost [...] entries in the database. There are [...] different frames that the verbs can select, including transitive, intransitive, sentential complement, sentential subject, verb particle constructions (transitive and intransitive), double objects with shifting, double objects without shifting, and light verb constructions.

As with the nouns, the FS field is used to provide a more concise format for specifying the frames for each lexical item. For the verbs, the FS field is used to specify the difference between ergative and non-ergative transitive verbs, as can be seen in the have entry in Table 1, and is also used heavily for further differentiating the frames for verbs that take sentential complements. There are two frames for sentential complements: Sentential Complement and NP and Sentential Complement. Either of these can occur with the feature structures Infinitive Complement, Indicative Complement, or Predicative Complement. This reduces the number of values for FRAME that are necessary to cover all of the possible lexical environments, and also allows for easier searches across categories. To find all the verbs that take infinitive complements, one can simply search on the Infinitive Complement feature structure, rather than having to specify each frame that could fill this role. Table 2 shows some values for various verbs that take sentential complements.

INDEX:  want
ENTRY:  want
POS:    Verb
FRAME:  Sentential Complement
FS:     Infinitive Complement
EX:     Dan wants to finish this paper.

INDEX:  want
ENTRY:  want
POS:    Verb
FRAME:  NP and Sentential Complement
FS:     Infinitive Complement
EX:     Dan wants Al to finish this paper.

INDEX:  think
ENTRY:  think
POS:    Verb
FRAME:  Sentential Complement
FS:     Indicative Complement
EX:     Dan thought that the paper was done.

INDEX:  think
ENTRY:  think
POS:    Verb
FRAME:  Sentential Complement
FS:     Infinitive Complement
EX:     Doug thought to clean the kitchen.

INDEX:  think
ENTRY:  think
POS:    Verb
FRAME:  Sentential Complement
FS:     Predicative Complement
EX:     Dan thought Carl a jerk.

Table 2: Verbs with Sentential Complements

Auxiliary verbs

The lexical entries for auxiliary verbs are very closely tied to the XTAG analysis, which orders the auxiliary verbs based on their morphological forms. (For a more detailed description of this and other XTAG analyses, please see the XTAG Technical Report (The XTAG Project).) Each entry in the lexicon is restricted via the FS field to only a certain form of the auxiliary verb (present, past, ppart, etc.), which also indicates what other forms it can go on. Table 3 shows the entries for the auxiliary verbs for the sentence John should have been waiting.

INDEX:  should
ENTRY:  should
POS:    Verb
FRAME:  Auxiliary Verb
FS:     Indicative Present, Goes on Base

INDEX:  have
ENTRY:  have
POS:    Verb
FRAME:  Auxiliary Verb
FS:     Base, Goes on Past Participle

INDEX:  be
ENTRY:  be
POS:    Verb
FRAME:  Auxiliary Verb
FS:     Past Participle, Goes on [...]

Table 3: Example Auxiliary Verb Entries
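The way the FS field on auxiliary entries restricts both the form of the auxiliary and the form it "goes on" can be sketched as a simple chain check. This is an illustration only, not the XTAG analysis itself: the table is keyed by surface form for brevity (the lexicon is indexed by root), the form label `Gerund` for what been goes on is an assumption, and the small `MAIN_VERB_FORMS` table is a hypothetical stand-in for a morphological lookup.

```python
# Form of each auxiliary and the form it must be followed by ("goes on").
AUX = {
    "should": {"form": "Indicative Present", "goes_on": "Base"},
    "have":   {"form": "Base", "goes_on": "Past Participle"},
    "been":   {"form": "Past Participle", "goes_on": "Gerund"},  # assumed label
}

# Hypothetical morphological information for a few main-verb forms.
MAIN_VERB_FORMS = {"wait": "Base", "waited": "Past Participle", "waiting": "Gerund"}


def form_of(word: str) -> str:
    """Return the morphological form label of an auxiliary or main verb."""
    if word in AUX:
        return AUX[word]["form"]
    return MAIN_VERB_FORMS[word]


def chain_is_licensed(words) -> bool:
    """Check that every auxiliary is followed by the form it goes on."""
    for left, right in zip(words, words[1:]):
        if left not in AUX or AUX[left]["goes_on"] != form_of(right):
            return False
    return True


print(chain_is_licensed(["should", "have", "been", "waiting"]))
print(chain_is_licensed(["should", "been", "waiting"]))
```

The first chain is accepted because each link matches; the second fails at should, which goes on a Base form but is followed by a Past Participle.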

Adverbs

A syntactic lexicon for adverbs is particularly useful because adverbs are so idiosyncratic as to where they can occur in a sentence. Although there are only [...] adverbs in the syntactic lexicon, there are [...] different frame values that they can select. These include basic adverb, pre and post verb phrases, pre and post sentences, pre and post adjective, pre-adverb, pre-preposition, pre-noun, etc. Table 4 shows some selected adverb entries.

INDEX:  ahead
ENTRY:  ahead
POS:    Adverb
FRAME:  Base Adverb
        PostVP
        PrePP

INDEX:  essentially
ENTRY:  essentially
POS:    Adverb
FRAME:  Base Adverb
        PreVP
        PreS
        PostS

INDEX:  even
ENTRY:  even
POS:    Adverb
FRAME:  Base Adverb
        PreVP
        PreAdj
        PreNoun
        PrePP

INDEX:  very
ENTRY:  very
POS:    Adverb
FRAME:  Base Adverb
        PreAdj
        PreAdv

Table 4: Some Adverb Lexical Entries

File Formats

The information in the syntactic database is available both in an ASCII flat file and a hashed database format. The ASCII file contains one entry per line, and each field is clearly marked. This format is easily usable by various UNIX(tm) utilities such as grep and awk, and it can be easily parsed by custom programs.

The hashed database format is very useful for programs that need quick access to the information in the database. Each entry is indexed under the index key, and a single call to the database for a particular index returns all of the entries that share that index. This makes it particularly useful for parsers. The database uses an encoding scheme for the pos, frame, and FS fields, which condenses the space required for the database and shortens the search time for non-index fields. All of the entries for a given lexical index can be retrieved in [...] msecs on average.
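The access pattern of the hashed format, where one call on an index returns every entry filed under it, can be mimicked with an in-memory hash table. This sketch uses a plain Python dictionary rather than the UNIX hashing package the database is built on, and the numeric codes in `POS_CODES` are hypothetical, standing in for the paper's unspecified encoding scheme.

```python
from collections import defaultdict

# Hypothetical compact codes for the pos field; the real encoding is not given.
POS_CODES = {"Noun": 1, "Verb": 2, "Adverb": 3}


def build_index(entries):
    """Group (index, pos, frame) triples under their index key."""
    table = defaultdict(list)
    for index, pos, frame in entries:
        table[index].append((POS_CODES[pos], frame))
    return dict(table)


def lookup(table, index):
    """Return all entries sharing `index` in a single call (empty if none)."""
    return table.get(index, [])


table = build_index([
    ("map", "Verb", "Transitive Verb Particle"),
    ("map", "Noun", "Noun Determiner required"),
    ("map", "Noun", "Noun Determiner not required"),
    ("want", "Verb", "Sentential Complement"),
])
for pos_code, frame in lookup(table, "map"):
    print(pos_code, frame)
```

A parser that has reduced an input word to its root form would need only this one lookup to obtain every candidate frame for the word.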

Interface

Although the format of the flat file is excellent for various file utilities programs, and the database format works well for retrieving entries quickly, neither is particularly well suited for human readability. The X-windows interface for the syntactic database allows users to easily look at the database. Searching is available not only on the index under which the lexical item is stored, but also on all other fields, with the exception of the ex field. Searches may also be done on combinations of fields. For instance, one could search on POS = Noun and FS = wh+ to find the set of all wh+ nouns (what, who, whom, which, when). Figure 1 shows the interface after a search has been done on the index need. All of the entries with that index are listed in a scroll window, which can be browsed through using the Next and Previous buttons, or specific entries can be clicked on and the entire record will show in the upper window. The results of searches can be saved to a file to create smaller custom lexicons. In addition to searching the database, users can also easily add, delete, and modify individual entries, tailoring the syntactic database to fit their needs. Users may also search and delete all entries found in a given search; we hope to add the capacity to modify an entire set of entries in the future.

Figure 1: Result of a search on the index need
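A search over a combination of fields, such as POS = Noun together with FS = wh+, amounts to keeping the entries that satisfy every criterion at once. The following sketch shows one way to express that; the function names and the tiny sample lexicon are invented for illustration, and the feature sets given to what and who are assumptions rather than the database's actual contents.

```python
def matches(entry, name, wanted):
    """A set-valued field matches if it contains `wanted`; others by equality."""
    value = entry.get(name)
    if isinstance(value, (set, frozenset, list, tuple)):
        return wanted in value
    return value == wanted


def search(entries, **criteria):
    """Return the entries that satisfy every field criterion."""
    return [e for e in entries
            if all(matches(e, name, wanted) for name, wanted in criteria.items())]


# Illustrative sample; the fs values for "what" and "who" are assumed.
LEXICON = [
    {"index": "what", "pos": "Noun", "fs": {"wh+"}},
    {"index": "who", "pos": "Noun", "fs": {"wh+"}},
    {"index": "map", "pos": "Noun", "fs": {"wh-", "reflexive-"}},
    {"index": "want", "pos": "Verb", "fs": {"Infinitive Complement"}},
]

wh_nouns = search(LEXICON, pos="Noun", fs="wh+")
print([e["index"] for e in wh_nouns])
```

Saving the returned list to a file would correspond to the custom sub-lexicons mentioned above.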

(The interface uses the MIT Athena Toolkit, which is distributed with the standard MIT X release. We hope to expand the search facility in the future to include full regular expression searches.)

Statistics

Statistics were gathered on the coverage of the syntactic lexicon on the IBM, ATIS, WSJ, and Brown corpora. These corpora were chosen because they have been tagged and hand corrected by the TreeBank project (Santorini). The data in Table 5 show the coverage of the lexicon on various corpora. A lexical item/part-of-speech pair is counted as a hit if the lexical item is in the syntactic lexicon with the indicated tag. No attempt was made to determine if the lexicon had the correct frame needed to parse the sentence. Because the syntactic lexicon contains only the root form of lexical entries, the inflected form was first looked up in the morphology database (Karp et al.) to retrieve the root form, and then that was used for the syntactic lexicon. Items that were not found in the morphological database were counted against the syntactic lexicon, as the morphology database is a superset of the syntactic database. (Because these databases are being used in an actual parser, an attempt was made some time ago to ensure that all words in the syntactic lexicon appear in the morphological database. Although the databases may have diverged slightly since then, it should not be statistically significant.) The statistics in Table 5 are over all word occurrences in the corpora, so words that occur frequently are given more weight. (Numbers and the genitive marker 's were taken out before the statistics were compiled.)

Corpus   Number of Hits   Total Words   Percent Hit
WSJ      [...]            [...]         [...]
Brown    [...]            [...]         [...]
IBM      [...]            [...]         [...]
ATIS     [...]            [...]         [...]

Table 5: Percentage of Hits for various corpora
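The counting procedure described above can be sketched in a few lines. This is a simplified illustration, not the authors' script: the morphology database is represented by a hypothetical dictionary from inflected form to root, the lexicon by a set of (root, part-of-speech) pairs, and the input is assumed to have had numbers and the genitive marker removed already.

```python
def coverage(tagged_tokens, morph, lexicon):
    """Count hits over all word occurrences.

    tagged_tokens: iterable of (word, pos) pairs from a tagged corpus
    morph:         dict mapping an inflected form to its root form
    lexicon:       set of (root, pos) pairs present in the syntactic lexicon
    """
    hits = total = 0
    for word, pos in tagged_tokens:
        total += 1
        root = morph.get(word.lower())
        # A word missing from the morphology database counts as a miss.
        if root is not None and (root, pos) in lexicon:
            hits += 1
    percent = 100.0 * hits / total if total else 0.0
    return hits, total, percent


tokens = [("maps", "Noun"), ("maps", "Noun"), ("wants", "Verb"), ("Zorblax", "Noun")]
morph = {"maps": "map", "wants": "want"}
lexicon = {("map", "Noun"), ("want", "Verb")}
print(coverage(tokens, morph, lexicon))  # (3, 4, 75.0)
```

Because every occurrence is counted, the repeated token maps contributes two hits, which is the frequency weighting the text describes.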

Not surprisingly, nouns and proper nouns comprise the largest category of words missed, followed by adjectives, adverbs, and verbs. (Although we do not distinguish nouns and proper nouns in the syntactic lexicon, the TreeBank tags do make this distinction, and it seemed useful to continue this distinction for this part of the analysis.) Table 6 shows the percentage of each of these categories in the list of items not found. Again, this is a percentage of word occurrences in the corpora.

Corpus   Number of Nonhits   Percent Proper N   Percent Nouns   Percent Adj   Percent Adv   Percent Verbs
WSJ      [...]               [...]              [...]           [...]         [...]         [...]
Brown    [...]               [...]              [...]           [...]         [...]         [...]
IBM      [...]               [...]              [...]           [...]         [...]         [...]
ATIS     [...]               [...]              [...]           [...]         [...]         [...]

Table 6: Percentage of missing words for various Parts of Speech

As Table 6 indicates, the majority of the missing items are either nouns or proper nouns. This is not surprising, nor particularly distressing, as nouns tend to be the easiest items to guess information about. Verbs, which tend to be the hardest, are reasonably well-covered in this lexicon. The number of adjectives not covered, however, seems fairly high, and we plan to add a number of those missing to the syntactic lexicon.

Future Work

The lexicon in its present form does not provide a mechanism to specify preferences of lexical items for certain syntactic structures. As part of future enhancements to the lexicon, we hope to associate probabilities with each entry. The probabilities will reflect the affinity of the lexical item for the syntactic structure associated with that entry. These probabilities will be computed from parsed corpora.

It has been observed quite conclusively in recent work in lexicography that certain combinations of words co-occur more often than would be expected if they corresponded to arbitrary usages of the individual words. Collocational information has been shown to be of immense use in pruning the search space for a parser. We hope to eventually extract collocational information from the corpora and make it a part of the syntactic lexicon.

Related Work

The syntactic lexicon was developed as part of the XTAG project (Doran et al.) at the University of Pennsylvania, under the direction of Dr. Aravind Joshi. The XTAG system is a wide-coverage parser and grammar for English based on the Tree Adjoining Grammar (TAG) formalism (Joshi et al.). The grammar consists of three sections: a morphology database, a syntactic database, and a tree grammar. Together with a parser and an X-windows interface, they comprise the XTAG system. Both the morphology (Karp et al.) and syntactic databases are available separately. The entire XTAG system is also freely available to the NLP research community. Information about the entire XTAG system and FTP instructions may be obtained by writing xtag-request@linc.cis.upenn.edu.

Computer Platform

The syntactic lexicon and accompanying interface were developed on the Sun SPARC station series, as were the other tools mentioned in the Related Work section. All of the XTAG tools, including the syntactic lexicon and interface, are freely available without limitation through anonymous FTP to ftp.cis.upenn.edu. The syntactic lexicon and accompanying programs together require about [...] MB of space for both the ASCII and DB versions of the lexicon. Please send mail to lex-request@linc.cis.upenn.edu for current FTP instructions or for more information.

References

[Carroll and Grover] Carroll, J. and Grover, C. The Derivation of a Large Computational Lexicon for English from LDOCE. In B. Boguraev and E. Briscoe (eds.), Computational Lexicography for Natural Language Processing. Harlow, UK: Longman.

[Conlon] Conlon, S. Pin-Ngern. The IIT Lexical Database: Dream and Reality. In A. Zampolli, N. Calzolari, M. Palmer (eds.), Current Issues in Computational Linguistics: In Honour of Don Walker. Giardini with Kluwer.

[Cowie and Mackin] Cowie, A.P. and Mackin, R. (eds.). Oxford Dictionary of Current Idiomatic English, Volume [...]. Oxford University Press, London.

[Doran et al.] Doran, C., Egedi, D., Hockey, B.A., Srinivas, B., Zaidel, M. XTAG System: A Wide Coverage Grammar for English. In Proceedings of the International Conference on Computational Linguistics (COLING), Kyoto, Japan, August.

[Guthrie et al.] Guthrie, L., Rauls, V., Luo, T., Bruce, R. LEXI-CADCAM. Technical Report MCCS-[...], Computing Research Laboratory, New Mexico State University.

[Hornby] Hornby, A. S. (ed.). Oxford Advanced Learner's Dictionary of Current English. Third Edition. Oxford University Press, London.

[Joshi et al.] Joshi, A., Levy, L., and Takahashi, M. Tree Adjunct Grammars. Journal of Computer and System Sciences.

[Karp et al.] Karp, D., Schabes, Y., Zaidel, M., Egedi, D. A Freely Available Wide Coverage Morphological Analyzer for English. In Proceedings of the International Conference on Computational Linguistics (COLING), Nantes, France, August.

[Macleod et al.] Macleod, C., Grishman, R., and Meyers, A. Creating a Common Syntactic Dictionary of English. In Proceedings of the International Workshop on Sharable Natural Language Resources, Nara, Japan, August.

[Santorini] Santorini, B. Part-of-Speech Tagging Guidelines for the Penn TreeBank Project. Technical report MS-CIS-[...], Department of Computer and Information Science, University of Pennsylvania.

[Seltzer and Yigit] Seltzer, M. and Yigit, O. A new hashing package for UNIX. In USENIX Winter [...].

[The XTAG Project] The XTAG Project. A Feature-Based Lexicalized Tree Adjoining Grammar (FB-LTAG) for English. Manuscript, University of Pennsylvania. In Progress.