Improving Verb Phrase Extraction from Historical Text by Use of Verb Valency Frames

Improving Verb Phrase Extraction from Historical Text by use of Verb Valency Frames Eva Pettersson and Joakim Nivre Department of Linguistics and Philology Uppsala University [email protected] Abstract ved (”chop wood”), salja¨ fisk (”sell fish”) or tjana¨ In this paper we explore the idea of using som piga (”serve as a maid”). Based on this ob- verb valency information to improve verb servation, Pettersson et al. (2012) developed a phrase extraction from historical text. As method for automatically extracting verb phrases a case study, we perform experiments on from historical documents by use of spelling nor- Early Modern Swedish data, but the ap- malisation succeeded by tagging and parsing. Us- proach could easily be transferred to other ing this approach, it is possible to correctly iden- languages and/or time periods as well. We tify a large proportion of the verbs in Early Mod- show that by using verb valency infor- ern Swedish text. Due to issues such as differences mation in a post-processing step to the in word order and significantly longer sentences verb phrase extraction system, it is pos- than in present-day Swedish texts (combined with sible to remove improbable complements sentence segmentation problems due to inconsis- extracted by the parser and insert probable tent use of punctuation), it is however still hard for complements not extracted by the parser, the parser to extract the correct complements as- leading to an increase in both precision sociated with each verb. and recall for the extracted complements. In this work we propose a method for improving verb phrase extraction results by providing verb 1 Introduction valency information to the extraction process. We Information extraction from historical text is a describe the effect of removing improbable com- challenging field of research that is of interest not plements from the extracted verb phrases, as well only to language technology researchers but also as adding probable complements based on verb to historians and other researchers within the hu- valency frames combined with words and phrases manities, where information extraction is still to a occurring in close context to the head verb. large extent performed more or less manually due 2 Related Work to a lack of NLP tools adapted to historical text and insufficient amounts of annotated data for training Syntactic analysis of historical text is a tricky task, such tools. due to differences in vocabulary, spelling, word or- In the Gender and Work project (GaW), his- der, and grammar. Sanchez-Marco´ (2011) trained torians are building a database with information a tagger for Old Spanish, based on a 20 million to- on what men and women did for a living in the ken corpus of texts from the 12th to the 16th cen- Early Modern Swedish society, i.e. approximately tury, by expanding the dictionary and modifying 1550–1800 (Agren˚ et al., 2011). This information tokenisation and affixation rules. An accuracy of is currently extracted by researchers manually go- 94.5% was reported for finding the right part-of- ing through large volumes of text from this time speech, and an accuracy of 89.9% for finding the period, searching for relevant text passages de- complete morphological tag. scribing working activities. In this process, it has In many cases, there is a lack of large corpora been noticed that working activities often are de- for training such tools, and alternative methods are scribed in the form of verb phrases, such as hugga called for. Schneider (2012) presented a method Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015) 153 for adapting the Pro3Gres dependency parser to For modern Swedish, Øvrelid and Nivre (2007) analyse historical English text. In this approach, experimented on ways to improve parsing accu- spelling normalisation is a key factor, transform- racy for core grammatical functions including for ing the historical spelling to a modern spelling by example object, subject predicative, and preposi- use of the VARD2 tool (Baron and Rayson, 2008) tional argument. They found that by providing the before parsing. In addition to spelling normali- parser with linguistically motivated features such sation, a set of handwritten grammar rules were as animacy, definiteness, pronoun type and case, added to capture unseen interpretations of specific a 50% error reduction could be achieved for the words, for relaxing word order constraints, and for syntactic functions targeted in the study. ignoring commas in a sentence. Schneider con- cluded that spelling normalisation had a large im- 3 Approach pact on parsing accuracy, whereas the grammar adaptations where easy to implement but lead to In this work, we adopt the verb phrase extrac- small improvements only. tion method presented in Pettersson et al. (2013), where verbs and complements are extracted from Pettersson et al. (2013) also presented an ap- historical text based on output from NLP tools de- proach to automatic annotation of historical text veloped for present-day Swedish. In addition to based on spelling normalisation. In this approach, their approach, we also include a post-processing the historical spelling is translated to a modern step, removing and/or inserting verbal comple- spelling employing character-based statistical ma- ments based on the valency frame of the head verb. chine translation (SMT) techniques, before tag- The full process is illustrated in Figure 1, where ging and parsing is performed by use of standard the first step is tokenisation of the source text by natural language processing tools developed for use of standard tools. The tokenised text is then to present-day language. The method was evaluated be linguistically annotated in the form of tagging on the basis of verb phrase extraction results from and parsing. To the best of our knowledge, there Early Modern Swedish text, where the amount of is however no tagger nor parser available trained correctly identified verb complements (including on Early Modern Swedish text. Since these tools partial matches) increased from 32.9% for the text are sensitive to spelling, the tokenised text is there- in its original spelling to 46.2% for the text in its fore normalised to a more modern spelling by use automatically modernised spelling. Earlier work of character-based SMT methods, before tagging by the same authors showed that using contempo- and parsing is performed using tools trained for rary valency dictionaries to remove extracted com- modern Swedish. For tagging, we use HunPOS plements not adhering to the valency frame of a (Halacsy´ et al., 2007) with a Swedish model based specific verb had a positive effect on verb phrase on the Stockholm-Umea˚ corpus, SUC version 2.0 extraction precision (Pettersson et al., 2012). (Ejerhed and Kallgren,¨ 1997). For parsing, we use In the context of valency-based parsing of mod- MaltParser version 1.7.2 (Nivre et al., 2006a) with ern language, Jakub´ıcekˇ and Kova´rˇ (2013) intro- a pre-trained model based on the Talbanken sec- duced a verb valency-based method for improving tion of the Swedish Treebank (Nivre et al., 2006b). Czech parsing. Their experiments were based on After tagging and parsing, the annotations given the Synt parser, which is a head-driven chart parser by the tagger and the parser are projected back with a hand-crafted meta-grammar for Czech, pro- to the text in its original spelling, resulting in a ducing a list of ranked phrase-structure trees as tagged and parsed version of the historical text, output. They used two different dictionaries with from which the verbs and their complements are valency information to rerank the suggested parses extracted. The complements included for extrac- in accordance with the valency frames suggested tion are the following: subject (for passive verbs for the verb in the dictionaries. Evaluation was only, where the subject normally corresponds to performed on the Brno Phrasal Treebank using the direct object in an active verb construction), the leaf-ancestor assessment metric, and an im- direct object, indirect object, prepositional com- provement from 86.4% to 87.7% was reported for plement, infinitive complement, subject predica- the highest-ranked tree when comparing the Synt tive, verb particle, and reflexive pronoun. As men- parser in its original setting to the inclusion of va- tioned, we also add a post-processing filter as a lency frames for reranking of the output parses. complementary step, using valency information to Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015) 154 Figure 1: Method overview. modify the complements suggested by the parser. approach to deletion of improbable complements, we therefore base the valency frames not only on 3.1 Deletion of Improbable Complements the contemporary valency dictionaries, but also on As discussed in Section 1, certain characteristics the verbal complements occurring in the training of historical text make it difficult for the parser part of the GaW corpus. to correctly extract the complements of a verb. Deletion experiments are performed for all Therefore, we add valency information in a post- complement types extracted from the parser ex- processing step, filtering away extracted comple- cept for subjects, since a verb is typically expected ments that do not conform to the valency frame of to have a subject. We present deletion experiments the verb. A similar idea was presented in Petters- for the following five settings: son et al. (2012), where filtering was based on valency frames given in two contemporary dictionar- 1. Lexin ies, i.e. Lexin1 and Parole2. However, some word For each extracted complement, if the head forms in historical text are not frequent enough in verb is present in the Lexin valency dictio- contemporary language to occur in modern dictionary and the valency frame in Lexin does naries.

Load more