12 Statistical Grammar Models and Lexicon Acquisition

S. Schulte im Walde, H. Schmid, M. Rooth, S. Riezler, D. Prescher

In: Linguistic Form and its Computation. Edited by Christian Rohrer, Antje Roßdeutscher and Hans Kamp. Copyright © CSLI Publications.

Introduction

This paper presents a framework for developing and training statistical grammar models for the acquisition of lexicon information. Using a robust parsing environment and mathematically well-defined unsupervised training methods, the framework enables us to induce lexicon information from text corpora. Particular strengths of the approach are (i) that no extensive manual work is required to set up the framework and (ii) that the framework is applicable to any desired language. It has already been applied to English and German (Carroll and Rooth; Beil et al.; Rooth et al.; Schulte im Walde a), Portuguese (de Lima), and Chinese (Hockenmaier).

Manual work within the framework is reduced to a minimum, since the necessary grammars need not go into detailed structures for the relevant grammar aspects to be trained sufficiently. The automatic training process uses a shallow parser embedded in the mathematically well-defined Expectation-Maximisation algorithm. The training approach drives the lexicalised parameters of the statistical grammar towards linguistically reliable values. The underlying assumption is that the linguistically correct analyses of a text are those analyses which maximise the probability of the data.

The linguistic value of the grammar models lies mainly in the lexicalised model parameters: they contain lexicalised rules, i.e. grammar rules referring to a specific lexical head, and lexical choice parameters, a measure of lexical coherence between lexical heads. For verbs, for example, the lexical rule parameters serve as the basis for probability distributions over subcategorisation frames, and the lexical choice parameters supply us with nominal heads of subcategorised noun phrases as the basis for selectional constraints. The information can be used directly as lexical description, as input for lexicon tools such as semantic clustering techniques (Rooth et al.; Schulte im Walde a), or as the basis for a variety of applications, e.g. parser improvement (Riezler et al.), chunking (Schmid and Schulte im Walde), or machine translation (Prescher et al.).

The reader might still wonder about the exact nature of the lexical information we gain. Consider this concrete example: our trained grammar model for German tells us that the verb essen ('eat') most probably occurs transitively, but might as well occur intransitively. In addition, we learn that the most frequent nominal heads in the direct object slot of the transitive frame include the German equivalents of the nouns bread, meat, banana, and ice-cream.

The first part of this chapter concerns grammar development and training. Section allows practical insights into the prerequisites for our statistical grammars and describes a characteristic grammar development process by means of the German grammar. Following, in section, the reader will find an introduction to the theoretical background of statistical grammars and their head-lexicalised refinements, as well as a description of their training facilities. Section then presents the application of the training procedure to the German grammar example.

The second part of this chapter illustrates various possibilities to exploit the lexicalised probability models. Section directly uses the model parameters to extract lexical parameters, mainly for verbs, and to apply specific parsing facilities such as Viterbi parsing or noun chunking. Section demonstrates the usage of lexical information, with specific reference to lexical coherence between verbs and subcategorised nouns, as input for semantic clustering techniques.

Grammar Development

Our statistical grammar models can be developed for arbitrary languages, presupposing

(i) a corpus as source for empirical input data,
(ii) a morphological analyser for analysing the corpus word forms and assigning lemmas where appropriate, and
(iii) a context-free grammar (CFG) for parsing the corpus data.

The grammar is supposed to cover a sufficient part of the corpus, since developing a statistical grammar model on the basis of the grammar (cf. sections and) requires a large number of structural relations within parses. The more corpus data is accessible for grammar training, the more reliable the probability model will be.

As mentioned in the introduction, manual work concerning the grammar is reduced to a minimum. The necessary grammars need not go into detailed structures for the relevant grammar aspects to be trained sufficiently. The complete framework can be set up within a few weeks' time and can easily be transferred to a different language. This property gives the grammar framework an advantage over, e.g., treebank grammars (Charniak), since it does not presuppose a treebank for the relevant language.

So far, we have worked on statistical grammar models for English (Carroll and Rooth), German (an earlier version is described in Beil et al.), Portuguese (de Lima), and Chinese (Hockenmaier).

The preparation of the relevant corpus data, the task definition of the morphological analyser, and a context-free grammar are described below. To illustrate the grammar development framework, we concentrate on the German model. We specifically describe the grammar development facilities and outline the grammar structure.

Corpus Preparation

We created two subcorpora from the million token newspaper corpus Huge German Corpus (HGC): (a) a subcorpus containing verb-final clauses with a total of million words, and (b) a subcorpus containing million relative clauses with a total of million words. Apart from non-finite clauses as verbal arguments, there are no further clausal embeddings, and the clauses do not contain any punctuation except for a terminal period. The average clause length is and words per clause, respectively.

Morphological Analyser

We used a finite-state morphology (Schiller and Stöckert) to assign multiple morphological features such as part-of-speech tag, case, gender and number to the corpus words, partly collapsed to reduce the number of analyses. For example, the word Bleibe, either the case-ambiguous feminine singular noun 'residence' or a person- and mood-ambiguous finite singular present tense verb form of 'stay', is analysed as follows:

    analyse Bleibe
    BleibeNNFemAkkSg
    BleibeNNFemDatSg
    BleibeNNFemGenSg
    BleibeNNFemNomSg
    bleibenVSgPresInd
    bleibenVSgPresKonj
    bleibenVSgPresKonj

Reducing the ambiguous categories leaves the two morphological analyses

    Bleibe  NNFemCasSg  VVFIN

Apart from assigning morphological analyses, the tool also serves as a lemmatiser (cf. Schulze).

The German Context-Free Grammar

The context-free grammar contains rules with their heads marked. With very few exceptions (rules for coordination, S-rule), the rules do not have more than two daughters. The terminal categories in the grammar correspond to the collapsed corpus tags assigned by the morphology.

Grammar development is facilitated by (a) a grammar development environment of the feature-based grammar formalism YAP (Schmid), and (b) a chart browser that permits a quick and efficient discovery of grammar bugs (Carroll). Figure shows that the ambiguity in the chart is quite considerable, even though grammar and corpus are restricted.

FIGURE: Chart Browser for Grammar Development

The grammar covers of the verb-final and of the relative clauses, i.e. the respective part of the corpora are assigned parses.

The following sections describe two essential parts of the grammar: the noun chunks and the definition of subcategorisation frames. For more details concerning the German grammar structure, see Schulte im Walde (b).

Noun Chunks

On nominal categories, in addition to the four cases Nom, Gen, Dat, and Akk, case features with a disjunctive interpretation (such as Dir for Nom or Akk) are used. The grammar is written in such a way that non-disjunctive features are introduced high up in the tree. Figures to illustrate the use of disjunctive features in the noun projections for the German noun phrase eine gute Gelegenheit ('a good opportunity') in all four cases:

- the terminal NN contains the four-way ambiguous Cas case feature;
- the N-bar (NN) and noun chunk (NC) projections disambiguate to the two-way ambiguous case features Dir and Obl;
- the weak/strong (Sw/St) feature of NN allows or prevents combination with a determiner, respectively;
- only at the noun phrase (NP) projection level does the case feature appear in disambiguated form.

The use of disjunctive case features results in some reduction in the size of the parse forest. Essentially the full range of agreement inside the noun phrase is enforced. Agreement between the subject NP and the tensed verb is not enforced by the grammar, in order to control the number of parameters and rules.

The noun chunk definition refers to Abney's chunk grammar organisation (Abney): the noun chunk (NC) is a projection that excludes post-head complements and (adverbial) adjuncts introduced higher than pre-head modifiers and determiners, but includes participial pre-modifiers with their complements.

FIGURE: Noun Projection (tree for eine gute Gelegenheit; nodes NPNom, NCDir, ARTE, NNFemDirSw, ARTIndefE, ADJE, NNFemCasSg)
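The collapsing of case-ambiguous analyses (as in the Bleibe example) and the disjunctive case features Dir, Obl and Cas can be illustrated with a small sketch. It is not the morphology tool or the YAP grammar itself: the tuple representation of an analysis and the function names case_label and collapse are hypothetical, and reading Obl as "Dat or Gen" is an assumption, since the text only spells out Dir as "Nom or Akk".

```python
from collections import defaultdict

# Disjunctive case labels. "Dir" (Nom or Akk) and the four-way "Cas" are taken
# from the text; reading "Obl" as Dat or Gen is an assumption of this sketch.
DISJUNCTIVE = {
    frozenset({"Nom", "Akk"}): "Dir",
    frozenset({"Dat", "Gen"}): "Obl",
    frozenset({"Nom", "Akk", "Dat", "Gen"}): "Cas",
}


def case_label(cases):
    """Return the label covering exactly this set of cases, or None."""
    cases = frozenset(cases)
    if len(cases) == 1:
        return next(iter(cases))
    return DISJUNCTIVE.get(cases)


def collapse(analyses):
    """Merge noun analyses that differ only in case.

    Each analysis is a hypothetical (pos, gender, case, number) tuple.
    """
    groups = defaultdict(set)
    for pos, gender, case, number in analyses:
        groups[(pos, gender, number)].add(case)

    collapsed = []
    for (pos, gender, number), cases in groups.items():
        label = case_label(cases)
        if label is not None:
            collapsed.append((pos, gender, label, number))
        else:
            # No single disjunctive label fits: keep the analyses separate.
            collapsed.extend((pos, gender, c, number) for c in sorted(cases))
    return collapsed


# The four noun readings of "Bleibe" from the example above.
bleibe_noun = [
    ("NN", "Fem", "Akk", "Sg"),
    ("NN", "Fem", "Dat", "Sg"),
    ("NN", "Fem", "Gen", "Sg"),
    ("NN", "Fem", "Nom", "Sg"),
]

print(collapse(bleibe_noun))  # [('NN', 'Fem', 'Cas', 'Sg')]
```

Under these assumptions the four noun readings shrink to a single terminal category with the four-way ambiguous Cas feature, which matches the collapsed tag given for Bleibe; a reading set that is ambiguous only between Nom and Akk would receive Dir instead.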
