Compact non-left-recursive grammars using the selective
left-corner transform and factoring
Mark Johnson Brian Roark
Cognitive and Linguistic Sciences, Box 1978
Brown University
Mark [email protected] Brian [email protected]
Abstract

The left-corner transform removes left-recursion from probabilistic context-free grammars and unification grammars, permitting simple top-down parsing techniques to be used. Unfortunately the grammars produced by the standard left-corner transform are usually much larger than the original. The selective left-corner transform described in this paper produces a transformed grammar which simulates left-corner recognition of a user-specified set of the original productions, and top-down recognition of the others. Combined with two factorizations, it produces non-left-recursive grammars that are not much larger than the original.

1 Introduction

Top-down parsing techniques are attractive because of their simplicity, and can often achieve good performance in practice (Roark and Johnson, 1999). However, with a left-recursive grammar such parsers typically fail to terminate. The left-corner grammar transform converts a left-recursive grammar into a non-left-recursive one: a top-down parser using a left-corner transformed grammar simulates a left-corner parser using the original grammar (Rosenkrantz and Lewis II, 1970; Aho and Ullman, 1972). However, the left-corner transformed grammar can be significantly larger than the original grammar, causing numerous problems. For example, we show below that a probabilistic context-free grammar (PCFG) estimated from left-corner transformed Penn WSJ tree-bank trees exhibits considerably greater sparse data problems than a PCFG estimated in the usual manner, simply because the left-corner transformed grammar contains approximately 20 times more productions. The transform described in this paper produces a grammar approximately the same size as the input grammar, which is not as adversely affected by sparse data.

Left-corner transforms are particularly useful because they can preserve annotations on productions (more on this below), and are therefore applicable to more complex grammar formalisms as well as CFGs; a property which other approaches to left-recursion elimination typically lack. For example, they apply to left-recursive unification-based grammars (Matsumoto et al., 1983; Pereira and Shieber, 1987; Johnson, 1998a). Because the emission probability of a PCFG production can be regarded as an annotation on a CFG production, the left-corner transform can produce a CFG with weighted productions which assigns the same probabilities to strings and transformed trees as the original grammar (Abney et al., 1999). However, the transformed grammars can be much larger than the original, which is unacceptable for many applications involving large grammars.

The selective left-corner transform reduces the transformed grammar size because only those productions which appear in a left-recursive cycle need be recognized left-corner in order to remove left-recursion. A top-down parser using a grammar produced by the selective left-corner transform simulates a generalized left-corner parser (Demers, 1977; Nijholt, 1980) which recognizes a user-specified subset of the original productions in a left-corner fashion, and the other productions top-down.

Although we do not investigate it in this paper, the selective left-corner transform should usually have a smaller search space relative to the standard left-corner transform, all else being equal. The partial parses produced during a top-down parse consist of a single connected tree fragment, while the partial parses produced during a left-corner parse generally consist of several disconnected tree fragments. Since these fragments are only weakly related via the "link" constraint described below, the search for each fragment is relatively independent. This may be responsible for the observation that exhaustive left-corner parsing is less efficient than top-down parsing (Covington, 1994). Informally, because the selective left-corner transform recognizes only a subset of productions in a left-corner fashion, its partial parses contain fewer discontiguous tree fragments and the search may be more efficient.

* This research was supported by NSF awards 9720368, 9870676 and 9812169. We would like to thank our colleagues in BLLIP (Brown Laboratory for Linguistic Information Processing) and Bob Moore for their helpful comments on this paper.
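To make the termination problem concrete: a nonterminal A is left-recursive iff A ⇒+ Aγ, and these are exactly the nonterminals on which a naive top-down parser loops. The following minimal sketch, with a toy grammar and function names of our own (not from the paper), identifies them by closing the direct left-corner relation:

```python
from itertools import product

# Toy grammar: each production is (lhs, rhs_tuple). "NP -> NP PP" makes NP
# left-recursive, so a naive top-down parser would expand NP forever.
PRODUCTIONS = [
    ("S",  ("NP", "VP")),
    ("NP", ("NP", "PP")),
    ("NP", ("det", "noun")),
    ("VP", ("verb", "NP")),
    ("PP", ("prep", "NP")),
]

def left_recursive_nonterminals(productions):
    """Return the nonterminals A such that A =>+ A gamma."""
    nonterminals = {lhs for lhs, _ in productions}
    # Direct left-corner relation: A -> B beta puts (A, B) in the relation.
    corner = {(lhs, rhs[0]) for lhs, rhs in productions
              if rhs and rhs[0] in nonterminals}
    # Transitive closure by iterating to a fixed point.
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(corner), repeat=2):
            if b == c and (a, d) not in corner:
                corner.add((a, d))
                changed = True
    # A is left-recursive iff A reaches itself along a left spine.
    return {a for (a, b) in corner if a == b}

print(left_recursive_nonterminals(PRODUCTIONS))  # {'NP'}
```

On this grammar only NP is left-recursive; the left-corner transform discussed below rewrites precisely such cycles into right recursion.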
While this paper focuses on reducing grammar size to minimize sparse data problems in PCFG estimation, the modified left-corner transforms described here are generally applicable wherever the original left-corner transform is. For example, the selective left-corner transform can be used in place of the standard left-corner transform in the construction of finite-state approximations (Johnson, 1998a), often reducing the size of the intermediate automata constructed. The selective left-corner transform can be generalized to head-corner parsing (van Noord, 1997), yielding a selective head-corner parser. This follows from generalizing the selective left-corner transform to Horn clauses.

After this paper was accepted for publication we learnt of Moore (2000), which addresses the issue of grammar size using very similar techniques to those proposed here. The goals of the two papers are slightly different: Moore's approach is designed to reduce the total grammar size (i.e., the sum of the lengths of the productions), while our approach minimizes the number of productions. Moore (2000) does not address left-corner tree-transforms, or questions of sparse data and parsing accuracy that are covered in section 3.

2 The selective left-corner and related transforms

This section introduces the selective left-corner transform and two additional factorization transforms which apply to its output. These transforms are used in the experiments described in the following section. As Moore (2000) observes, in general the transforms produce a non-left-recursive output grammar only if the input grammar G does not contain unary cycles, i.e., there is no nonterminal A such that A ⇒+_G A.

2.1 The selective left-corner transform

The selective left-corner transform takes as input a CFG G = (V, T, P, S) and a set of left-corner productions L ⊆ P, which contains no epsilon productions; the non-left-corner productions P − L are called top-down productions. The standard left-corner transform is obtained by setting L to the set of all non-epsilon productions in P. The selective left-corner transform of G with respect to L is the CFG LC_L(G) = (V_1, T, P_1, S), where:

  V_1 = V ∪ {D−X : D ∈ V, X ∈ V ∪ T}

and P_1 contains all instances of the schemata (1). In these schemata, D ∈ V, w ∈ T, and lower-case Greek letters range over (V ∪ T)*. The D−X are new nonterminals; informally they encode a parse state in which a D is predicted top-down and an X has been found left-corner, so D−X ⇒*_{LC_L(G)} γ only if D ⇒*_G Xγ.

  D → w D−w                                (1a)
  D → α D−A      where A → α ∈ P − L       (1b)
  D−B → β D−C    where C → Bβ ∈ L          (1c)
  D−D → ε                                  (1d)

The schemata function as follows. The productions introduced by schema (1a) start a left-corner parse of a predicted nonterminal D with its leftmost terminal w, while those introduced by schema (1b) start a left-corner parse of D with a left-corner A, which is itself found by the top-down recognition of production A → α ∈ P − L. Schema (1c) extends the current left-corner B up to a C with the left-corner recognition of production C → Bβ. Finally, schema (1d) matches the top-down prediction with the recognized left-corner category.

Figure 1 schematically depicts the relationship between a chain of left-corner productions in a parse tree generated by G and the chain of corresponding instances of schema (1c). The left-corner recognition of the chain starts with the recognition of α, the right-hand side of a top-down production A → α, using an instance of schema (1b). The left-branching chain of left-corner productions corresponds to a right-branching chain of instances of schema (1c); the left-corner transform in effect converts left recursion into right recursion. Notice that the top-down predicted category D is passed down this right-recursive chain, effectively multiplying each left-corner production by the possible top-down predicted categories. The right recursion terminates with an instance of schema (1d) when the left-corner and top-down categories match.

Figure 1: Schematic parse trees generated by the original grammar G and the selective left-corner transformed grammar LC_L(G). The shaded local trees in the original parse tree correspond to left-corner productions; the corresponding local trees generated by instances of schema (1c) in the selective left-corner transformed tree are also shown shaded. The local tree colored black is generated by an instance of schema (1b).

Figure 2 shows how top-down productions from G are recognized using LC_L(G). When the selective left-corner transform is followed by a one-step ε-removal transform, i.e., composition or partial evaluation of schema (1b) with respect to schema (1d) (Johnson, 1998a; Abney and Johnson, 1991; Resnik, 1992), each top-down production from G appears unchanged in the final grammar. Full ε-removal yields the grammar given by the schemata below.

  D → w D−w
  D → w          where D ⇒+_L w
  D → α D−A      where A → α ∈ P − L
  D → α          where D ⇒*_L A, A → α ∈ P − L
  D−B → β D−C    where C → Bβ ∈ L
  D−B → β        where D ⇒*_L C, C → Bβ ∈ L

Figure 2: The recognition of a top-down production A → α by LC_L(G) involves a left-corner category A−A, which immediately rewrites to α. One-step ε-removal applied to LC_L(G) produces a grammar in which each top-down production A → α corresponds to a production A → α in the transformed grammar.

Moore (2000) introduces a version of the left-corner transform called LC_LR, which applies only to productions with left-recursive parent and left-child categories. In the context of the other transforms that Moore introduces, it seems to have the same effect in his system as the selective left-corner transform does here.

2.2 Selective left-corner tree transforms

There is a 1-to-1 correspondence between the parse trees generated by G and LC_L(G). A tree t is generated by G iff there is a corresponding t′ generated by LC_L(G), where each occurrence of a top-down production in the derivation of t corresponds to exactly one local tree generated by an occurrence of the corresponding instance of schema (1b) in the derivation of t′, and each occurrence of a left-corner production in t corresponds to exactly one occurrence of the corresponding instance of schema (1c) in t′. It is straightforward to define a 1-to-1 tree transform T_L mapping parse trees of G into parse trees of LC_L(G) (Johnson, 1998a; Roark and Johnson, 1999). In the empirical evaluation below, we estimate a PCFG from the trees obtained by applying T_L to the trees in the Penn WSJ tree-bank, and compare it to the PCFG estimated from the original tree-bank trees. A stochastic top-down parser using the PCFG estimated from the trees produced by T_L simulates a stochastic generalized left-corner parser, which is a generalization of a standard stochastic left-corner parser that permits productions to be recognized top-down as well as left-corner (Manning and Carpenter, 1997). Thus investigating the properties of PCFGs estimated from trees transformed with T_L is an easy way of studying stochastic push-down automata performing generalized left-corner parses.

2.3 Pruning useless productions

We turn now to the problem of reducing the size of the grammars produced by left-corner transforms. Many of the productions generated by schemata (1) are useless, i.e., they never appear in any terminating derivation. While they can be removed by standard methods for deleting useless productions (Hopcroft and Ullman, 1979), the relationship between the parse trees of G and LC_L(G) depicted in Figure 1 shows how to determine ahead of time the new nonterminals D−X that can appear in useful productions of LC_L(G). This is known as a link constraint.

For PCFGs there is a particularly simple link constraint: D−X appears in useful productions of LC_L(G) only if ∃γ ∈ (V ∪ T)*: D ⇒*_L Xγ. If epsilon removal is applied to the resulting grammar, D−X appears in useful productions only if ∃γ ∈ (V ∪ T)+: D ⇒*_L Xγ. Thus one only need generate instances of the left-corner schemata which satisfy the corresponding link constraints.

Moore (2000) suggests an additional constraint on nonterminals D−X that can appear in useful productions of LC_L(G): D must either be the start symbol of G or else appear in a production A → βDγ of G, for any A ∈ V, β ∈ (V ∪ T)+ and γ ∈ (V ∪ T)*. It is easy to see that the productions that Moore's constraint prohibits are useless. There is one nonterminal in the tree-bank grammar investigated below that has this property, namely LST. However, in the tree-bank grammar none of the productions expanding LST are left-recursive (in fact, the first child is always a preterminal), so Moore's constraint does not affect the size of the transformed grammars investigated below.

While these constraints can dramatically reduce both the number of productions and the size of the parsing search space of the transformed grammar, in general the transformed grammar LC_L(G) can be quadratically larger than G. There are two causes for the explosion in grammar size. First, LC_L(G) contains an instance of schema (1b) for each top-down production A → α and each D such that ∃γ: D ⇒*_L Aγ. Second, LC_L(G) contains an instance of schema (1c) for each left-corner production C → Bβ and each D such that ∃γ: D ⇒*_L Cγ. In effect, LC_L(G) contains one copy of each production for each possible left-corner ancestor. Section 2.5 describes further factorizations of the productions of LC_L(G) which mitigate these causes.
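Schemata (1a)-(1d) can be prototyped directly. The sketch below is our own illustration, not the paper's code: grammars are lists of (lhs, rhs) pairs, the new nonterminals D−X are represented as Python pairs, and instances are pruned with the link constraint of section 2.3. All function names are assumptions of ours.

```python
def left_corner_closure(L, symbols):
    """Pairs (D, X) such that D =>*_L X gamma: X reachable from D along a
    left spine using only productions in L. This is the link constraint."""
    reach = {(s, s) for s in symbols}          # reflexive base case
    changed = True
    while changed:
        changed = False
        for lhs, rhs in L:
            for d, x in list(reach):
                if x == lhs and (d, rhs[0]) not in reach:
                    reach.add((d, rhs[0]))
                    changed = True
    return reach

def lc_transform(productions, L, nonterminals, terminals):
    """Selective left-corner transform LC_L(G), pruned by the link constraint."""
    top_down = [p for p in productions if p not in L]
    link = left_corner_closure(L, nonterminals | terminals)
    P1 = []
    for D in nonterminals:
        for w in terminals:                    # schema 1a: D -> w D-w
            if (D, w) in link:
                P1.append((D, (w, (D, w))))
        for A, alpha in top_down:              # schema 1b: D -> alpha D-A
            if (D, A) in link:
                P1.append((D, alpha + ((D, A),)))
        for C, rhs in L:                       # schema 1c: D-B -> beta D-C
            B, beta = rhs[0], rhs[1:]
            if (D, C) in link:
                P1.append(((D, B), beta + ((D, C),)))
        P1.append(((D, D), ()))                # schema 1d: D-D -> epsilon
    return P1

NONTERMS = {"S", "NP", "VP", "PP"}
TERMS = {"det", "n", "v", "p"}
PRODS = [("S", ("NP", "VP")), ("NP", ("NP", "PP")), ("NP", ("det", "n")),
         ("VP", ("v", "NP")), ("PP", ("p", "NP"))]
L = [("NP", ("NP", "PP"))]                     # only the left-recursive production

P1 = lc_transform(PRODS, L, NONTERMS, TERMS)
# The left recursion NP -> NP PP becomes right recursion on the new
# nonterminal ('NP','NP'):  ('NP','NP') -> PP ('NP','NP')  and  -> epsilon
```

On this toy grammar the four top-down productions pass through schema (1b) essentially unchanged, and, because L contains no production beginning with a preterminal, no instances of schema (1a) are generated at all, in line with the remark about L0 in section 2.4.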
2.4 Optimal choice of L

Because ⇒*_L increases monotonically with L, and hence so does the size of LC_L(G), we typically reduce the size of LC_L(G) by making the left-corner production set L as small as possible. This section shows how to find the unique minimal set of left-corner productions L0 such that LC_{L0}(G) is not left-recursive.

Assume G = (V, T, P, S) is pruned (i.e., P contains no useless productions) and that there is no A ∈ V such that A ⇒+_G A (i.e., G does not generate recursive unary branching chains). For reasons of space we also assume that P contains no ε-productions, but this approach can be extended to deal with them if desired. A production A → Bβ ∈ P is left-recursive iff ∃γ ∈ (V ∪ T)*: B ⇒*_P Aγ, i.e., P rewrites B into a string beginning with A. Let L0 be the set of left-recursive productions in G. Then we claim (1) that LC_{L0}(G) is not left-recursive, and (2) that for all L ⊉ L0, LC_L(G) is left-recursive.

Claim (1) follows from the fact that if A ⇒*_{L0} Bγ then A ⇒*_P Bγ, together with the constraints in section 2.3 on useful productions of LC_{L0}(G). Claim (2) follows from the fact that if L ⊉ L0 then there is a chain of left-recursive productions that includes a top-down production; a simple induction on the length of the chain shows that LC_L(G) is left-recursive.

This result justifies the common practice in natural-language left-corner parsing of taking the terminals to be the preterminal part-of-speech tags, rather than the lexical items themselves. We did not attempt to calculate the size of such a left-corner grammar in the empirical evaluation below, but it would be much larger than any of the grammars described there. In fact, if the preterminals are distinct from the other nonterminals (as they are in the tree-bank grammars investigated below) then L0 does not include any productions beginning with a preterminal, and LC_{L0}(G) contains no instances of schema (1a) at all. We now turn our attention to the other schemata of the selective left-corner grammar transform.

2.5 Factoring the output of LC_L

This section defines two factorizations of the output of the selective left-corner grammar transform that can dramatically reduce its size. These factorizations are most effective if the number of productions is much larger than the number of nonterminals, as is usually the case with tree-bank grammars.

The top-down factorization decomposes schema (1b) by introducing new nonterminals D′, where D ∈ V, that have the same expansions that D does in G. Using the same interpretation for variables as in schemata (1), if G = (V, T, P, S) then LC^td_L(G) = (V_td, T, P_td, S), where:

  V_td = V_1 ∪ {D′ : D ∈ V}

and P_td contains all instances of the schemata (1a), (3a), (3b), (1c) and (1d).

  D → A′ D−A    where A → α ∈ P − L    (3a)
  A′ → α        where A → α ∈ P − L    (3b)

Notice that the number of instances of schema (3a) is less than the square of the number of nonterminals and that the number of instances of schema (3b) is the number of top-down productions; the sum of these numbers is usually much less than the number of instances of schema (1b).

Top-down factoring plays approximately the same role as "non-left-recursion grouping" (NLRG) does in Moore's (2000) approach. The major difference is that NLRG applies to all productions A → Bβ in which B is not left-recursive, i.e., ¬∃γ: B ⇒+_P Bγ, while in our system top-down factorization applies to those productions for which ¬∃γ: B ⇒*_P Aγ, i.e., the productions not directly involved in left recursion.

The left-corner factorization decomposes schema (1c) in a similar way using new nonterminals D\X, where D ∈ V and X ∈ V ∪ T. LC^lc_L(G) = (V_lc, T, P_lc, S), where:

  V_lc = V_1 ∪ {D\X : D ∈ V, X ∈ V ∪ T}

and P_lc contains all instances of the schemata (1a), (1b), (4a), (4b) and (1d).

  D−B → C\B D−C    where C → Bβ ∈ L    (4a)
  C\B → β          where C → Bβ ∈ L    (4b)

The number of instances of schema (4a) is bounded by the number of instances of schema (1c) and is typically much smaller, while the number of instances of schema (4b) is precisely the number of left-corner productions |L|.

Left-corner factoring seems to correspond to one step of Moore's (2000) "left factor" (LF) operation. The left factor operation constructs new nonterminals corresponding to common prefixes of arbitrary length, while left-corner factoring effectively only factors the first nonterminal symbol on the right-hand side of left-corner productions. While we have not done experiments, Moore's left factor operation would seem to reduce the total number of symbols in the transformed grammar at the expense of possibly introducing additional productions, while our left-corner factoring reduces the number of productions.

These two factorizations can be used together in the obvious way to define a grammar transform LC^{td,lc}_L, whose productions are defined by schemata (1a), (3a), (3b), (4a), (4b) and (1d). There are corresponding tree transforms, which we refer to as T^td_L, etc., below. Of course, the pruning constraints described in section 2.3 are applicable with these factorizations, and corresponding invertible tree transforms can be constructed.
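The minimal left-corner set L0 of section 2.4 can be computed with a simple fixed-point calculation. The following sketch is our own (the encoding and function names are assumptions, not the paper's): a production A → Bβ goes into L0 exactly when B ⇒*_P Aγ.

```python
def left_spine_reachable(productions, nonterminals):
    """Pairs (B, A) such that B =>*_P A gamma: A is reachable from B
    along a left spine of the grammar."""
    reach = {(n, n) for n in nonterminals}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in productions:
            if rhs and rhs[0] in nonterminals:
                for b, a in list(reach):
                    if b == rhs[0] and (lhs, a) not in reach:
                        reach.add((lhs, a))
                        changed = True
    return reach

def minimal_left_corner_set(productions, nonterminals):
    """L0 = the left-recursive productions; LC_L0(G) is then not left-recursive."""
    reach = left_spine_reachable(productions, nonterminals)
    return {(lhs, rhs) for lhs, rhs in productions
            if rhs and rhs[0] in nonterminals and (rhs[0], lhs) in reach}

PRODS = [("S", ("NP", "VP")), ("NP", ("NP", "PP")), ("NP", ("det", "n")),
         ("VP", ("VP", "PP")), ("VP", ("v", "NP")), ("PP", ("p", "NP"))]
NONTERMS = {"S", "NP", "VP", "PP"}

print(sorted(minimal_left_corner_set(PRODS, NONTERMS)))
# [('NP', ('NP', 'PP')), ('VP', ('VP', 'PP'))]
```

Only the two directly left-recursive productions end up in L0; S → NP VP stays a top-down production even though NP is left-recursive, which is exactly why LC_{L0}(G) is so much smaller than the standard transform's output.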
3 Empirical Results

To examine the effect of the transforms outlined above, we experimented with various PCFGs induced from sections 2-21 of a modified Penn WSJ tree-bank as described in Johnson (1998b) (i.e., labels simplified to grammatical categories, root nodes added, empty nodes and vacuous unary branches deleted, and auxiliaries retagged as aux or auxg). We ignored lexical items, and treated the part-of-speech tags as terminals. As Bob Moore pointed out to us, the left-corner transform may produce left-recursive grammars if its input grammar contains unary cycles, so we removed them using a transform that Moore suggested. Given an initial set of non-epsilon productions P, the transformed grammar contains the following productions, where the Â are new non-terminals:

  A → α    where A → α ∈ P, A ⇏+_P A
  A → D̂    where A ⇒*_P D ⇒+_P A
  Â → α    where A → α ∈ P, A ⇒+_P A, α ⇏*_P A

This transform can be extended to one on PCFGs which preserves derivation probabilities. In this section, we fix P to be the productions that result after applying this unary cycle removal transformation to the tree-bank productions, and G to be the corresponding grammar.

               none       td        lc     td,lc
  G          15,040
  LC_P      346,344             30,716
  LC_N      345,272  113,616   254,067    22,411
  LC_{L0}   314,555  103,504   232,415    21,364
  T_P        20,087             17,146
  T_N        19,619   16,349    19,002    15,732
  T_{L0}     18,945   16,126    18,437    15,618

Table 1: Sizes of PCFGs inferred using various grammar and tree transforms after pruning with link constraints, without epsilon removal. Columns indicate factorization. In the grammar and tree transforms, P is the set of productions in G (i.e., the standard left-corner transform), N is the set of all productions in P which do not begin with a POS tag, and L0 is the set of left-recursive productions.

               none       td        lc     td,lc
  LC_P      564,430             38,489
  LC_N      563,295  176,644   411,986    25,335
  LC_{L0}   505,435  157,899   371,102    23,566
  T_P        22,035             17,398
  T_N        21,589   16,688    20,696    15,795
  T_{L0}     21,061   16,566    20,168    15,673

Table 2: Sizes of PCFGs inferred using various grammar and tree transforms after pruning with link constraints, with epsilon removal, using the same notation as Table 1.

Tables 1 and 2 give the sizes of selective left-corner grammar transforms of G for various values of the left-corner set L and factorizations, without and with epsilon-removal respectively. In the tables, L0 is the set of left-recursive productions in P, as defined in section 2.4. N is the set of productions in P which do not begin with a part-of-speech (POS) tag; because POS tags are distinct from other nonterminals in the tree-bank, N is an easily identified set of productions guaranteed to include L0. The tables also give the sizes of maximum-likelihood PCFGs estimated from the trees resulting from applying the selective left-corner tree transforms T_L to the tree-bank, breaking unary cycles as described above. For the parsing experiments below we always deleted empty nodes in the output of these tree transforms; this corresponds to epsilon removal in the grammar transform.

First, note that LC_P(G), the result of applying the standard left-corner grammar transform to G, has approximately 20 times the number of productions that G has. However LC^{td,lc}_{L0}(G), the result of applying the selective left-corner grammar transformation with factorization, has approximately 1.4 times the number of productions that G has. Thus the methods described in this paper can in fact dramatically reduce the size of left-corner transformed grammars. Second, note that LC^{td,lc}_N(G) is not much larger than LC^{td,lc}_{L0}(G). This is because N is not much larger than L0, which in turn is because most pairs of non-POS nonterminals A, B are mutually left-recursive.

Turning now to the PCFGs estimated after applying tree transforms, we notice that grammar size does not increase nearly so dramatically. These PCFGs encode a maximum-likelihood estimate of the state transition probabilities for various stochastic generalized left-corner parsers, since a top-down parser using these grammars simulates a generalized left-corner parser. The fact that LC_P(G) is 17 times larger than the PCFG inferred after applying T_P to the tree-bank means that most of the possible transitions of a standard stochastic left-corner parser are not observed in the tree-bank training data. The state of a left-corner parser does capture some linguistic generalizations (Manning and Carpenter, 1997; Roark and Johnson, 1999), but one might still expect sparse-data problems. Note that LC^{td,lc}_{L0} is only 1.4 times larger than T^{td,lc}_{L0}, so we expect less serious sparse data problems with the factored selective left-corner transform.

We quantify these sparse data problems in two ways using a held-out test corpus, viz., all sentences in section 23 of the tree-bank. First, Table 3 lists the number of sentences in the test corpus that fail to receive a parse with the various PCFGs mentioned above.
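The "maximum-likelihood estimate" used for all of these PCFGs is relative-frequency estimation over the (possibly transformed) trees. A minimal sketch of that estimator, with a nested-tuple tree encoding and names of our own choosing (not from the paper):

```python
from collections import Counter, defaultdict

def productions(tree):
    """Yield (lhs, rhs) for every local tree of a nested-tuple parse tree.
    A leaf (here a POS tag, since POS tags are the terminals) is a 1-tuple."""
    label, children = tree[0], tree[1:]
    if children and isinstance(children[0], tuple):
        yield (label, tuple(c[0] for c in children))
        for c in children:
            yield from productions(c)

def estimate_pcfg(trees):
    """Relative-frequency (maximum-likelihood) PCFG: count each production,
    then normalize by the total count of its left-hand side."""
    counts = Counter(p for t in trees for p in productions(t))
    lhs_totals = defaultdict(int)
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {p: n / lhs_totals[p[0]] for p, n in counts.items()}

# Two toy trees over POS-tag terminals; NP -> NP PP occurs once out of
# four NP expansions, so its estimated probability is 0.25.
t1 = ("S", ("NP", ("det",), ("n",)), ("VP", ("v",)))
t2 = ("S", ("NP", ("NP", ("det",), ("n",)),
            ("PP", ("p",), ("NP", ("det",), ("n",)))), ("VP", ("v",)))
model = estimate_pcfg([t1, t2])
print(model[("NP", ("NP", "PP"))])  # 0.25
```

Applying the same estimator to left-corner transformed trees yields the transition probabilities of the corresponding stochastic generalized left-corner parser; the sparse-data issue discussed above is that most productions of the large transformed grammars receive zero counts.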
  Transform   none   td   lc   td,lc
  none           0
  T_P            2         0
  T_N            2    0    2    0
  T_{L0}         0    0    0    0

Table 3: The number of sentences in section 23 that do not receive a parse using various grammars estimated from sections 2-21.

  Transform        none         td          lc       td,lc
  none        70.8,75.3
  T_P         75.8,77.7              74.8,76.9
  T_N         75.8,77.6  73.8,75.8   75.5,77.8   72.8,75.4
  T_{L0}      75.8,77.4  73.0,74.7   75.6,77.8   72.9,75.4

Table 5: Labelled recall and precision scores of PCFGs estimated using various tree-transforms in a transform-detransform framework using test data from section 23.

  Transform   none   td   lc   td,lc
  none         514
  T_P          665         535

the transform-detransform framework described in Johnson (1998b) to evaluate the parses, i.e., we ap-