
Automated Construction of Database Interfaces: Integrating Statistical and Relational Learning for Semantic Parsing

Lappoon R. Tang and Raymond J. Mooney
Department of Computer Sciences
University of Texas at Austin
Austin, TX 78712-1188
{rupert,mooney}@cs.utexas.edu

Abstract

The development of natural language interfaces (NLI's) for databases has been a challenging problem in natural language processing (NLP) since the 1970's. The need for NLI's has become more pronounced due to the widespread access to complex databases now available through the Internet. A challenging problem for empirical NLP is the automated acquisition of NLI's from training examples. We present a method for integrating statistical and relational learning techniques for this task which exploits the strength of both approaches. Experimental results from three different domains suggest that such an approach is more robust than a previous purely logic-based approach.

1 Introduction

We use the term semantic parsing to refer to the process of mapping a natural language sentence to a structured meaning representation. One interesting application of semantic parsing is building natural language interfaces for online databases. The need for such applications is growing since, when information is delivered through the Internet, most users do not know the underlying database access language. An example of such an interface that we have developed is shown in Figure 1.

Traditional (rationalist) approaches to constructing database interfaces require an expert to hand-craft an appropriate semantic parser (Woods, 1970; Hendrix et al., 1978). However, such hand-crafted parsers are time consuming to develop and suffer from problems with robustness and incompleteness even for domain specific applications. Nevertheless, very little research in empirical NLP has explored the task of automatically acquiring such interfaces from annotated training examples. The only exceptions of which we are aware are a statistical approach to mapping airline-information queries into SQL presented in (Miller et al., 1996), a probabilistic decision-tree method for the same task described in (Kuhn and De Mori, 1995), and an approach using relational learning (a.k.a. inductive logic programming, ILP) to learn a logic-based semantic parser described in (Zelle and Mooney, 1996).

The existing empirical systems for this task employ either a purely logical or a purely statistical approach. The former uses a deterministic parser, which can suffer from some of the same robustness problems as rationalist methods. The latter constructs a probabilistic grammar, which requires supplying a syntactic parse tree as well as a semantic representation for each training sentence, and requires hand-crafting a small set of contextual features on which to condition the parameters of the model. Combining relational and statistical approaches can overcome the need to supply parse trees and hand-crafted features while retaining the robustness of statistical parsing. The current work is based on the CHILL logic-based parser-acquisition framework (Zelle and Mooney, 1996), retaining access to the complete parse state for making decisions, but building a probabilistic relational model that allows for statistical parsing.

2 Overview of the Approach

This section reviews our overall approach using an interface developed for a U.S. Geography database (Geoquery) as a sample application (Zelle and Mooney, 1996), which is available on the Web (see http://www.cs.utexas.edu/users/ml/geo.html).

2.1 Semantic Representation

First-order logic is used as a semantic representation language. CHILL has also been applied to a restaurant database in which the logical form resembles SQL, and is translated automatically into SQL (see Figure 1).

Figure 1: Screenshots of a Learned Web-based NL Database Interface

We explain the features of the Geoquery representation language through a sample query:

Input: "What is the largest city in Texas?"
Query: answer(C,largest(C,(city(C),loc(C,S),const(S,stateid(texas))))).

Objects are represented as logical terms and are typed with a semantic category using logical functions applied to possibly ambiguous English constants (e.g. stateid(Mississippi), riverid(Mississippi)). Relationships between objects are expressed using predicates; for instance, loc(X,Y) states that X is located in Y.

We also need to handle quantifiers such as 'largest'. We represent these using meta-predicates for which at least one argument is a conjunction of literals. For example, largest(X,Goal) states that the object X satisfies Goal and is the largest object that does so, using the appropriate measure of size for objects of its type (e.g. area for states, population for cities). Finally, an unspecified object required as an argument to a predicate can appear elsewhere in the sentence, requiring the use of the predicate const(X,C) to bind the variable X to the constant C. Some other database queries (or training examples) for the U.S. Geography domain are shown below:

What is the capital of Texas?
answer(C,(capital(C,S),const(S,stateid(texas)))).

What state has the most rivers running through it?
answer(S,most(S,R,(state(S),river(R),traverse(R,S)))).

2.2 Parsing Actions

Our semantic parser employs a shift-reduce architecture that maintains a stack of previously built semantic constituents and a buffer of remaining words in the input. The parsing actions are automatically generated from templates given the training data. The templates are INTRODUCE, COREF_VARS, DROP_CONJ, LIFT_CONJ, and SHIFT. INTRODUCE pushes a predicate onto the stack based on a word appearing in the input and information about its possible meanings in the lexicon. COREF_VARS binds two arguments of two different predicates on the stack. DROP_CONJ (or LIFT_CONJ) takes a predicate on the stack and puts it into one of the arguments of a meta-predicate on the stack. SHIFT simply pushes a word from the input buffer onto the stack. The parsing actions are tried in exactly this order. The parser also requires a lexicon to map phrases in the input into specific predicates; this lexicon can also be learned automatically from the training data (Thompson and Mooney, 1999).
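The parser itself is implemented in Prolog, but the state representation is easy to picture. The following Python sketch is purely illustrative (the class names, helper functions, and toy lexicon are our own inventions, not part of CHILL); it mirrors the stack-plus-buffer parse state and two of the action templates, SHIFT and INTRODUCE, described above.

```python
# Illustrative sketch only: the actual parser is a Prolog shift-reduce parser.
# Names (ParseState, shift, introduce) and the toy LEXICON are our own.

class Predicate:
    """A stack entry: a predicate term plus the words shifted while it was on top."""
    def __init__(self, term):
        self.term = term          # e.g. "capital(_,_)"
        self.context = []         # attached buffer of context words

class ParseState:
    def __init__(self, sentence):
        # The parser starts with answer(_,_) on the stack and the sentence in the buffer.
        self.stack = [Predicate("answer(_,_)")]
        self.buffer = sentence[:]

# A toy lexicon mapping phrases to predicates (learned automatically in the real system).
LEXICON = {"capital": "capital(_,_)", "texas": "const(_,stateid(texas))"}

def shift(state):
    """SHIFT: move the next input word onto the context buffer of the top stack entry."""
    word = state.buffer.pop(0)
    state.stack[0].context.insert(0, word)

def introduce(state):
    """INTRODUCE: push a predicate for the word at the head of the input buffer."""
    word = state.buffer[0]
    state.stack.insert(0, Predicate(LEXICON[word]))

if __name__ == "__main__":
    s = ParseState("what is the capital of texas ?".split())
    for _ in range(3):            # 'what', 'is', 'the' map to no predicates
        shift(s)
    introduce(s)                  # 'capital' introduces capital(_,_)
    print([(p.term, p.context) for p in s.stack], s.buffer)
```

Running the sketch reproduces the first four steps of the trace that follows.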

Let's go through a simple trace of parsing the request "What is the capital of Texas?" A lexicon that maps 'capital' to 'capital(_,_)' and 'Texas' to 'const(_,stateid(texas))' suffices here. Interrogatives like "what" may be mapped to predicates in the lexicon if necessary. The parser begins with an initial stack and a buffer holding the input sentence. Each predicate on the parse stack has an attached buffer to hold the context in which it was introduced; words from the input sentence are shifted onto this buffer during parsing. The initial parse state is shown below:

Parse Stack: [answer(_,_):[]]
Input Buffer: [what,is,the,capital,of,texas,?]

Since the first three words in the input buffer do not map to any predicates, three SHIFT actions are performed. The next is an INTRODUCE as 'capital' is at the head of the input buffer:

Parse Stack: [capital(_,_):[], answer(_,_):[the,is,what]]
Input Buffer: [capital,of,texas,?]

The next action is a COREF_VARS that binds the first argument of capital(_,_) with the first argument of answer(_,_):

Parse Stack: [capital(C,_):[], answer(C,_):[the,is,what]]
Input Buffer: [capital,of,texas,?]

The next sequence of steps are two SHIFT's, an INTRODUCE, and then a COREF_VARS:

Parse Stack: [const(S,stateid(texas)):[], capital(C,S):[of,capital], answer(C,_):[the,is,what]]
Input Buffer: [texas,?]

The last four steps are two DROP_CONJ's followed by two SHIFT's:

Parse Stack: [answer(C,(capital(C,S),const(S,stateid(texas)))):[?,texas,of,capital,the,is,what]]
Input Buffer: []

This is the final state and the logical query is extracted from the stack.

2.3 Learning Control Rules

The initially constructed parser has no constraints on when to apply actions, and is therefore overly general and generates numerous spurious parses. Positive and negative examples are collected for each action by parsing each training example and recording the parse states encountered. Parse states to which an action should be applied (i.e. the action leads to building the correct semantic representation) are labeled positive examples for that action. Otherwise, a parse state is labeled a negative example for an action if it is a positive example for another action below the current one in the ordered list of actions. Control conditions which decide the correct action for a given parse state are learned for each action from these positive and negative examples.

The initial CHILL system used ILP (Lavrac and Dzeroski, 1994) to learn control rules and employed deterministic parsing, using the learned rules to decide the appropriate parse action for each state. The current approach learns a model for estimating the probability that each action should be applied to a given state, and employs statistical parsing (Manning and Schütze, 1999) to try to find the overall most probable parse, using beam search to control the complexity. The advantage of ILP is that it can perform induction over the logical description of the complete parse state without the need to pre-engineer a fixed set of features (which vary greatly from one domain to another) that are relevant to making decisions. We maintain this advantage by using ILP to learn a committee of hypotheses, and basing probability estimates on a weighted vote of them (Ali and Pazzani, 1996). We believe that using such a probabilistic relational model (Getoor and Jensen, 2000) combines the advantages of frameworks based on first-order logic and those based on standard statistical techniques.
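As a rough illustration of this labeling scheme, the following Python sketch (our own, not the authors' code) shows how positive and negative control examples could be gathered; the parse_trace helper, which is assumed to replay a correct derivation and yield (state, correct action) pairs, is hypothetical.

```python
# Illustrative sketch only: collecting positive and negative control examples
# for each action by replaying the parses of the training data.
# ACTIONS is the ordered list of learned action templates.

def collect_control_examples(training_pairs, actions, parse_trace):
    """training_pairs: list of (sentence, correct_query) pairs.
    parse_trace(sentence, query) is an assumed helper returning the sequence
    of (state, correct_action) pairs on a derivation of the correct query."""
    examples = {a: {"pos": [], "neg": []} for a in actions}
    for sentence, query in training_pairs:
        for state, correct_action in parse_trace(sentence, query):
            examples[correct_action]["pos"].append(state)
            # The state is a negative example for every action tried earlier
            # in the ordered list than the action that is correct here.
            for earlier in actions[: actions.index(correct_action)]:
                examples[earlier]["neg"].append(state)
    return examples
```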

3 The TABULATE ILP Method

This section discusses the ILP method used to build a committee of logical control hypotheses for each action.

3.1 The Basic TABULATE Algorithm

Most ILP methods use a set-covering method to learn one clause (rule) at a time and construct clauses using either a strictly top-down (general to specific) or bottom-up (specific to general) search through the space of possible rules (Lavrac and Dzeroski, 1994). TABULATE,¹ on the other hand, employs both bottom-up and top-down methods to construct potential clauses and searches through the hypothesis space of complete logic programs (i.e. sets of clauses called theories). It uses beam search to find a set of alternative hypotheses guided by a theory evaluation metric discussed below. The search starts with the most specific hypothesis (the set of positive examples each represented as a separate clause). Each iteration of the loop attempts to refine each of the hypotheses in the current search queue. There are two cases in each iteration: 1) an existing clause in a theory is refined or 2) a new clause is begun. Clauses are learned using both top-down specialization using a method similar to FOIL (Quinlan, 1990) and bottom-up generalization using Least General Generalizations (LGG's). Advantages of combining both ILP approaches were explored in CHILLIN (Zelle and Mooney, 1994), an ILP method which motivated the design of TABULATE. An outline of the TABULATE algorithm is given in Figure 2.

¹TABULATE stands for Top-down And Bottom-Up cLAuse construction with Theory Evaluation.

    Procedure Tabulate
    Input:
      t(X1,...,Xn): the target concept to learn
      E+: the positive examples
      E-: the negative examples
    Output:
      Q: a queue of learned theories

      Theory_0 := {E <- | E in E+}    /* the initial theory */
      T(N0) := Theory_0               /* theory of node N0 */
      C(N0) := empty                  /* the clause being built */
      Q := [N0]                       /* the search queue */
      Repeat
        CQ := {}
        For each search node Ni in Q Do
          If C(Ni) = empty or C(Ni) = fail Then
            Pairs := a sample of S pairs of clauses from T(Ni)
            Find the LGG G in Pairs with the greatest cover in E+
            Ri := Refine_Clause(t(X1,...,Xn) <-) U Refine_Clause(G <-)
          Else
            Ri := Refine_Clause(C(Ni))
          End If
          If Ri = {} Then
            CQi := {(T(Ni), fail)}
          Else
            CQi := {(Complete(T(Ni), Gj, E+), next_j) | for each Gj in Ri,
                    next_j = empty if Gj satisfies the noise criterion;
                    otherwise next_j = Gj}
          End If
          CQ := CQ U CQi
        End For
        Q := the B best nodes from Q U CQ ranked by metric M
      Until termination-criteria-satisfied
      Return Q
    End Procedure

Figure 2: The TABULATE algorithm

A noise-handling criterion is used to decide when an individual clause in a hypothesis is sufficiently accurate to be permanently retained. There are three possible outcomes in a refinement: 1) the current clause satisfies the noise-handling criterion and is simply returned (next_j is set to empty), 2) the current clause does not satisfy the noise-handling criterion and all possible refinements are returned (next_j is set to the refined clause), and 3) the current clause does not satisfy the noise-handling criterion but there are no further refinements (next_j is set to fail). If the refinement is a new clause, clauses in the current theory subsumed by it are removed. Otherwise, it is a specialization of an existing clause. Positive examples that are not covered by the resulting theory, due to specializing the clause, are added back into the theory as individual clauses. Hence, the theory is always maintained complete (i.e. covering all positive examples). These final steps are performed in the Complete procedure.

The termination criterion checks for two conditions. The first is satisfied if the next search queue does not improve the sum of the metric score over all hypotheses in the queue. The second is satisfied if there is no clause currently being built for any theory in the search queue and the last finished clause of each theory satisfies the noise-handling criterion. Finally, a committee of hypotheses found by the algorithm is returned.

3.2 Compression and Accuracy

The goal of the search is to find accurate and yet simple hypotheses. We measure accuracy using the m-estimate (Cestnik, 1990), a smoothed measure of accuracy on the training data which in the case of a two-class problem is defined as:

accuracy(H) = (s + m*p+) / (n + m)   (1)

where s is the number of positive examples covered by the hypothesis H, n is the total number of examples covered, p+ is the prior probability of the positive class, and m is a smoothing parameter.
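A minimal Python sketch of the m-estimate of equation (1) follows, assuming a simple two-class setting; the default value of m is only a placeholder, since the value used in the experiments is not stated here.

```python
# Minimal sketch of the m-estimate of equation (1); not the authors' code.
def m_estimate(s, n, prior_pos, m=2.0):
    """s: positives covered, n: total examples covered,
    prior_pos: prior probability of the positive class,
    m: smoothing parameter (the 2.0 default is just a placeholder)."""
    return (s + m * prior_pos) / (n + m)

# Example: a clause covering 8 positives and 2 negatives, with a 0.5 prior,
# gets accuracy (8 + 2*0.5) / (10 + 2) = 0.75 rather than the raw 0.8.
```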

We measure theory complexity using a metric similar to that introduced in (Muggleton and Buntine, 1988). The size of a Clause having a Head and a Body is defined as follows (ts = "term size" and ar = "arity"):

size(Clause) = 1 + ts(Head) + ts(Body)   (2)

ts(T) = 1                                    if T is a variable
      = 2                                    if T is a constant
      = 2 + sum_{i=1..ar(T)} ts(arg_i(T))    otherwise   (3)

The size of a clause is roughly the number of variables, constants, or predicate symbols it contains. The size of a theory is the sum of the sizes of its clauses. The metric M(H) used as the search heuristic is defined as:

M(H) = (accuracy(H) + C) / log2(size(H))   (4)

where C is a constant used to control the relative weight of accuracy vs. complexity. We assume that the most general hypothesis is as good as the most specific hypothesis; thus, C is determined to be:

C = (Eb*St - Et*Sb) / (Sb - St)   (5)

where Et, Eb are the accuracy estimates of the most general and most specific hypotheses respectively, and St, Sb are their sizes.

3.3 Noise Handling

A clause needs no further refinement when it meets the following criterion (as in RIPPER (Cohen, 1995)):

(p - n) / (p + n) > mu   (6)

where p is the number of positive examples covered by the clause, n is the number of negative examples covered, and mu (-1 < mu < 1) is a predefined noise parameter. Theories in the final search queue are returned even if some of them still have unfinished or failed clauses.
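The complexity metric and search heuristic of equations (2) to (6) could be computed along the following lines. This Python sketch is illustrative only; the tuple-based term representation, the convention that capitalized names denote variables, and the literal reading of equation (5) are our own assumptions.

```python
import math

# Illustrative sketch of equations (2)-(6); not the authors' code.
# Terms are nested tuples, e.g. ("capital", "C", "S") for capital(C,S);
# strings starting with an upper-case letter (or "_") are treated as variables.

def ts(term):
    """Term size, equation (3)."""
    if isinstance(term, tuple):                         # compound term
        return 2 + sum(ts(arg) for arg in term[1:])
    return 1 if (term[:1].isupper() or term.startswith("_")) else 2

def clause_size(head, body_literals):
    """Equation (2)."""
    return 1 + ts(head) + sum(ts(lit) for lit in body_literals)

def theory_size(clauses):
    """A theory's size is the sum of its clauses' sizes."""
    return sum(clause_size(h, b) for h, b in clauses)

def metric(accuracy, size, C):
    """Equation (4): the search heuristic M(H)."""
    return (accuracy + C) / math.log2(size)

def weight_constant(Et, St, Eb, Sb):
    """Equation (5): sets the relative weight of accuracy vs. complexity from
    the accuracies and sizes of the most general (t) and most specific (b)
    hypotheses."""
    return (Eb * St - Et * Sb) / (Sb - St)

def clause_finished(p, n, mu):
    """Equation (6): the RIPPER-style stopping test for an individual clause."""
    return (p - n) / (p + n) > mu
```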

4 Statistical Semantic Parsing

4.1 The Parsing Model

A parser is a relation Parser ⊆ Sentences x Queries, where Sentences and Queries are the sets of natural language sentences and database queries respectively. Given a sentence l ∈ Sentences, the set Q(l) = {q ∈ Queries | (l,q) ∈ Parser} is the set of queries that are correct interpretations of l.

A parse state consists of a stack of lexicalized predicates and a list of words from the input sentence. S is the set of states reachable by the parser. Suppose our learned parser has n different parsing actions; the ith action a_i is a function a_i(s) : IS_i -> OS_i, where IS_i ⊆ S is the set of states to which the action is applicable and OS_i ⊆ S is the set of states constructed by the action. The function a_0(l) : Sentences -> IniS maps each sentence l to a corresponding unique initial parse state in IniS ⊆ S. A state is called a final state if there exists no parsing action applicable to it. The partial function a_{n+1}(s) : FS -> Queries is defined as the action that retrieves the query from the final state s ∈ FS ⊆ S if one exists. Some final states may not "contain" a query (e.g. when the parse stack contains predicates with unbound variables) and therefore it is a partial function. When the parser meets such a final state, it reports a failure.

A path is a finite sequence of parsing actions. Given a sentence l, a good state s is one such that there exists a path from it to a query q ∈ Q(l). Otherwise, it is a bad state. The set of parse states can be uniquely divided into the set of good states and the set of bad states given l and Parser. S+ and S- are the sets of good and bad states respectively.

Given a sentence l, the goal is to construct the query q' such that

q' = argmax_q P(q ∈ Q(l) | l => q)   (7)

where l => q means that a path exists from l to q. Now, we need to estimate P(q ∈ Q(l) | l => q). First, we notice that:

P(q ∈ Q(l) | l => q) = P(s ∈ FS+ | l => s and a_{n+1}(s) = q)   (8)

We abbreviate these probabilities as P(q ∈ Q(l)) and P(s ∈ FS+) respectively, assuming these conditions in the following discussion. The equation states that the probability that a given query is a correct meaning for l is the same as the probability that the final state (reached by parsing l) is a good state. We need to determine in general the probability of having a good resulting parse state. Given any parse state s_j at the jth step of parsing and an action a_i such that s_{j+1} = a_i(s_j), we have:

P(s_{j+1} ∈ OS_i+) = P(s_{j+1} ∈ OS_i+ | s_j ∈ IS_i+) P(s_j ∈ IS_i+)
                   + P(s_{j+1} ∈ OS_i+ | s_j ∉ IS_i+) P(s_j ∉ IS_i+)   (9)

where IS_i+ = IS_i ∩ S+ and OS_i+ = OS_i ∩ S+. Since no parsing action can produce a good parse state from a bad one, the second term is zero. Now, we are ready to derive P(q ∈ Q(l)). Suppose q = a_{n+1}(s_m); we have:

P(q ∈ Q(l)) = P(s_m ∈ FS+)
            = P(s_m ∈ FS+ | s_{m-1} ∈ IS_{a_{m-1}}+) ...
              P(s_j ∈ OS_{a_{j-1}}+ | s_{j-1} ∈ IS_{a_{j-1}}+) ...
              P(s_2 ∈ OS_{a_1}+ | s_1 ∈ IS_{a_1}+) P(s_1 ∈ IS_{a_1}+)   (10)

where a_k denotes the index of the action applied at the kth step. We assume that gamma = P(s_1 ∈ IS_{a_1}+) is not zero (which may not be true in general). Now, we have

P(q ∈ Q(l)) = gamma * prod_{j=1..m-1} P(s_{j+1} ∈ OS_{a_j}+ | s_j ∈ IS_{a_j}+)   (11)

Next we describe how we estimate the probability of the goodness of each action in a given state (P(a_i(s) ∈ OS_i+ | s ∈ IS_i+)). We need not estimate gamma since its value does not affect the outcome of equation (7).

4.2 Estimating Probabilities for Parsing Actions

The committee of hypotheses learned by TABULATE is used to estimate the probability that a particular action is a good one to apply to a given parse state. Some hypotheses are more "important" than others in the sense that they carry more weight in the decision. A weighting parameter is also included to lower the probability estimate of actions that appear further down the decision list. For actions a_i where 1 <= i <= n-1:

P(a_i(s) ∈ OS_i+ | s ∈ IS_i+) = beta^(pos(i)-1) * sum_{hk ∈ H_i} lambda_k P(a_i(s) ∈ OS_i+ | hk)   (12)

where s is a given parse state, pos(i) is the position of the action a_i in the list of actions applicable to state s, lambda_k and 0 < beta < 1 are weighting parameters,² H_i is the set of hypotheses learned for the action a_i, and sum_k lambda_k = 1.

²beta is set to 0.95 for all the experiments performed.

To estimate the probability for the last action a_n, we devise a simple test that checks if the maximum of the set A(s) of probability estimates for the subset of the actions {a_1, ..., a_{n-1}} applicable to s is less than or equal to a threshold alpha. If A(s) is empty, we assume the maximum is zero. More precisely,

P(a_n(s) ∈ OS_n+ | s ∈ IS_n+) = c(a_n(s) ∈ OS_n+) / c(s ∈ IS_n+)   if max(A(s)) <= alpha
                              = 0                                   otherwise   (13)

where alpha is the threshold,³ c(a_n(s) ∈ OS_n+) is the count of the number of good states produced by the last action, and c(s ∈ IS_n+) is the count of the number of good states to which the last action is applicable.

³The threshold is set to 0.5 for all the experiments performed.

Now, let's discuss how P(a_i(s) ∈ OS_i+ | hk) and lambda_k are estimated. If hk covers s, we have

P(a_i(s) ∈ OS_i+ | hk) = (pc + theta*nc) / (pc + nc)   (14)

where pc and nc are the number of positive and negative examples covered by hk respectively. Otherwise, if hk does not cover s, we have

P(a_i(s) ∈ OS_i+ | hk) = (pu + theta*nu) / (pu + nu)   (15)

where pu and nu are the number of positive and negative examples rejected by hk respectively. theta is the probability that a negative example is mislabelled, and its value can be estimated given mu (in equation (6)) and the total number of positive and negative examples.

One could use a variety of linear combination methods to estimate the weights lambda_k (e.g. Bayesian combination (Buntine, 1990)). However, we have taken a simple approach and weighted hypotheses based on their relative simplicity:

lambda_k = size(hk)^(-1) / sum_{j=1..|H_i|} size(hj)^(-1)   (16)
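A small Python sketch of the committee-weighted estimates of equations (12) and (14) to (16) follows. It is our own illustration, not the authors' code; the hypothesis attributes (covers, size, and the covered and rejected example counts) are assumed interfaces.

```python
# Illustrative sketch of equations (12) and (14)-(16); not the authors' code.
# Each hypothesis h is assumed to provide: h.covers(state) -> bool, h.size,
# h.pos_covered, h.neg_covered, h.pos_rejected, h.neg_rejected.

def hypothesis_estimate(h, state, theta):
    """Equations (14)-(15): estimate that the action is good according to h."""
    if h.covers(state):
        p, n = h.pos_covered, h.neg_covered
    else:
        p, n = h.pos_rejected, h.neg_rejected
    return (p + theta * n) / (p + n)

def committee_weights(hypotheses):
    """Equation (16): weight each hypothesis by its relative simplicity."""
    total = sum(1.0 / h.size for h in hypotheses)
    return [(1.0 / h.size) / total for h in hypotheses]

def action_goodness(hypotheses, state, position, theta, beta=0.95):
    """Equation (12): weighted vote of the committee, discounted by the
    action's position in the ordered list of applicable actions."""
    weights = committee_weights(hypotheses)
    vote = sum(w * hypothesis_estimate(h, state, theta)
               for w, h in zip(weights, hypotheses))
    return (beta ** (position - 1)) * vote
```

A beam-search driver in the style of Section 4.3 would multiply these per-step estimates as in equation (11) and keep only the B most probable successor states.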

4.3 Searching for a Parse

To find the most probably correct parse, the parser employs a beam search. At each step, the parser finds all of the parsing actions applicable to each parse state on the queue and calculates the probability of goodness of each of them using equations (12) and (13). It then computes the probability that the resulting state of each possible action is a good state using equation (11), sorts the queue of possible next states accordingly, and keeps the best B options. The parser stops when a complete parse is found on the top of the parse queue or a failure is reported.

5 Experimental Results

5.1 The Domains

Three different domains are used to demonstrate the performance of the new approach. The first is the U.S. Geography domain. The database contains about 800 facts about U.S. states like population, area, capital city, neighboring states, major rivers, major cities, and so on. A hand-crafted parser, GEOBASE, was previously constructed for this domain as a demo product for Turbo Prolog. The second application is the restaurant query system illustrated in Figure 1. The database contains information about thousands of restaurants in Northern California, including the name of the restaurant, its location, its specialty, and a guide-book rating. The third domain consists of a set of 300 computer-related jobs automatically extracted from postings to the USENET newsgroup austin.jobs. The database contains the following information: the job title, the company, the recruiter, the location, the salary, the languages and platforms used, and required or desired years of experience and degrees.

5.2 Experimental Design

The geography corpus contains 560 questions. Approximately 100 of these were collected from a log of questions submitted to the web site and the rest were collected in studies involving students in undergraduate classes at our university. We also included results for the subset of 250 sentences originally used in the experiments reported in (Zelle and Mooney, 1996). The remaining questions were specifically collected to be more complex than the original 250, and generally require one or more meta-predicates. The restaurant corpus contains 250 questions automatically generated from a hand-built grammar constructed to reflect typical queries in this domain. The job corpus contains 400 questions automatically generated in a similar fashion. The beam width for TABULATE was set to five for all the domains. The deterministic parser used only the best hypothesis found. The experiments were conducted using 10-fold cross validation. For each domain, the average recall (a.k.a. accuracy) and precision of the parser on disjoint test data are reported where:

Recall = # of correct queries produced / # of test sentences
Precision = # of correct queries produced / # of complete parses produced

A complete parse is one which contains an executable query (which could be incorrect). A query is considered correct if it produces the same answer set as the gold-standard query supplied with the corpus.

5.3 Results

The results are presented in Table 1 and Figure 3. By switching from deterministic to probabilistic parsing, the system increased the number of correct queries it produced. Recall increases almost monotonically with parsing beam width in most of the domains. Improvement is most apparent in the Jobs domain, where probabilistic parsing significantly outperformed the deterministic system (80% vs. 68%). However, using a beam width of one (so that the probabilistic parser picks only the best action) results in worse performance than using the original purely logic-based deterministic parser. This suggests that the probability estimates could be improved, since overall they are not indicating the single best action as well as a non-probabilistic approach. Precision of the system decreased with beam width, but not significantly except for the larger Geography corpus. Since the system conducts a more extensive search for a complete parse, it risks increasing the number of incorrect as well as correct parses. The importance of recall vs. precision depends on the relative cost of providing an incorrect answer versus no answer at all. Individual applications may require emphasizing one or the other.

All of the experiments were run on a 167MHz UltraSparc workstation under Sicstus Prolog. Although results on the parsing time of the different systems are not formally reported here, it was noted that the difference between using a beam width of three and the original system was less than two seconds in all domains but increased to around twenty seconds when using a beam width of twelve. However, the current Prolog implementation is not highly optimized.

Parsers \ Corpora     Geo250          Geo560          Jobs400         Rest250
                      R      P        R      P        R      P        R      P
Prob-Parser(12)       80.40  88.16    71.61  78.94    80.50  86.56    99.20  99.60
Prob-Parser(8)        79.60  86.90    71.07  79.76    78.75  86.54    99.20  99.60
Prob-Parser(5)        78.40  87.11    70.00  79.51    74.25  86.59    99.20  99.60
Prob-Parser(3)        77.60  87.39    69.11  79.30    70.50  87.31    99.20  99.60
Prob-Parser(1)        67.60  90.37    62.86  82.05    34.25  85.63    99.20  99.60
TABULATE              75.60  92.65    69.29  89.81    68.50  87.54    99.20  99.60
Original CHILL        68.50  97.65
Hand-Built Parser     56.00  96.40

Table 1: Results For All Domains: R = % Recall and P = % Precision. Prob-Parser(B) is the probabilistic parser using a beam width of B. TABULATE is CHILL using the TABULATE induction algorithm with deterministic parsing.
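As a reminder of how the R and P columns are computed (Section 5.2), here is a trivial Python sketch; the counts used in the comment are made up for illustration.

```python
# Tiny illustration of the Recall and Precision definitions of Section 5.2;
# the example counts are hypothetical, not taken from the experiments.
def recall(correct_queries, test_sentences):
    return 100.0 * correct_queries / test_sentences

def precision(correct_queries, complete_parses):
    return 100.0 * correct_queries / complete_parses

# e.g. 40 correct queries out of 50 test sentences, 45 of which yielded a
# complete (executable, though possibly wrong) parse:
#   recall(40, 50) -> 80.0,  precision(40, 45) -> 88.9
```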

Figure 3: The recall and precision of the parser using various beam widths in the different domains

While there was an overall improvement in recall using the new approach, its performance varied significantly from domain to domain. As a result, the recall did not always improve dramatically by using a larger beam width. Domain factors possibly affecting the performance are the quality of the lexicon, the relative amount of data available for calculating probability estimates, and the problem of "parser incompleteness" with respect to the test data (i.e. there is not a path from a sentence to a correct query, which happens when gamma = 0). The performance of all systems was basically equivalent in the restaurant domain, where they were near perfect in both recall and precision. This is because this corpus is relatively easier given the restricted range of possible questions due to the limited information available about each restaurant. The systems achieved > 90% in recall and precision given only roughly 30% of the training data in this domain. Finally, GEOBASE performed the worst on the original geography queries, since it is difficult to hand-craft a parser that handles a sufficient variety of questions.

6 Conclusion

A probabilistic framework for semantic shift-reduce parsing was presented. A new ILP learning system was also introduced which learns multiple hypotheses. These two techniques were integrated to learn semantic parsers for building NLI's to online databases. Experimental results were presented that demonstrate that such an approach outperforms a purely logical approach in terms of the accuracy of the learned parser.

7 Acknowledgements

This research was supported by a grant from the Daimler-Chrysler Research and Technology Center and by the National Science Foundation under grant IRI-9704943.

References

K. Ali and M. Pazzani. 1996. Error reduction through learning multiple descriptions. Machine Learning Journal, 24(3):100-132.

W. Buntine. 1990. A theory of learning classification rules. Ph.D. thesis, University of Technology, Sydney, Australia.

B. Cestnik. 1990. Estimating probabilities: A crucial task in machine learning. In Proceedings of the Ninth European Conference on Artificial Intelligence, pages 147-149, Stockholm, Sweden.

W. W. Cohen. 1995. Fast effective rule induction. In Proceedings of the Twelfth International Conference on Machine Learning, pages 115-123.

L. Getoor and D. Jensen, editors. 2000. Papers from the AAAI Workshop on Learning Statistical Models from Relational Data, Austin, TX. AAAI Press.

G. G. Hendrix, E. Sacerdoti, D. Sagalowicz, and J. Slocum. 1978. Developing a natural language interface to complex data. ACM Transactions on Database Systems, 3(2):105-147.

R. Kuhn and R. De Mori. 1995. The application of semantic classification trees to natural language understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(5):449-460.

N. Lavrac and S. Dzeroski. 1994. Inductive Logic Programming: Techniques and Applications. Ellis Horwood.

C. D. Manning and H. Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA.

Scott Miller, David Stallard, Robert Bobrow, and Richard Schwartz. 1996. A fully statistical approach to natural language interfaces. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 55-61, Santa Cruz, CA.

S. Muggleton and W. Buntine. 1988. Machine invention of first-order predicates by inverting resolution. In Proceedings of the Fifth International Conference on Machine Learning, pages 339-352, Ann Arbor, MI, June.

J. R. Quinlan. 1990. Learning logical definitions from relations. Machine Learning, 5(3):239-266.

C. A. Thompson and R. J. Mooney. 1999. Automatic construction of semantic lexicons for learning natural language interfaces. In Proceedings of the Sixteenth National Conference on Artificial Intelligence, pages 487-493, Orlando, FL, July.

W. A. Woods. 1970. Transition network grammars for natural language analysis. Communications of the Association for Computing Machinery, 13:591-606.

J. M. Zelle and R. J. Mooney. 1994. Combining top-down and bottom-up methods in inductive logic programming. In Proceedings of the Eleventh International Conference on Machine Learning, pages 343-351, New Brunswick, NJ, July.

J. M. Zelle and R. J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 1050-1055, Portland, OR, August.
