<<

Annotating tense, mood and for English, French and German

Anita Ramm1,4 Sharid Loaiciga´ 2,3 Annemarie Friedrich4 Alexander Fraser4 1Institut fur¨ Maschinelle Sprachverarbeitung, Universitat¨ Stuttgart 2Departement´ de Linguistique, Universite´ de Geneve` 3Department of Linguistics and Philology, Uppsala University 4Centrum fur¨ Informations- und Sprachverarbeitung, Ludwig-Maximilians-Universitat¨ Munchen¨ [email protected] [email protected] anne,fraser @cis.uni-muenchen.de { }

Abstract features. They may, for instance, be used to clas- sify texts with respect to the epoch or region in We present the first open-source tool for which they have been produced, or for assigning annotating morphosyntactic tense, mood texts to a specific author. Moreover, in cross- and voice for English, French and Ger- lingual research, tense, mood, and voice have been man verbal complexes. The annotation is used to model the of tense between based on a set of language-specific rules, different language pairs (Santos, 2004; Loaiciga´ which are applied on dependency trees et al., 2014; Ramm and Fraser, 2016)). Identi- and leverage information about lemmas, fying the morphosyntactic tense is also a neces- morphological properties and POS-tags of sary prerequisite for identifying the semantic tense the . Our tool has an average accu- in synthetic languages such as English, French racy of about 76%. The tense, mood and or German (Reichart and Rappoport, 2010). The voice features are useful both as features extracted tense-mood-voice (TMV) features may in computational modeling and for corpus- also be useful for training models in computational linguistic research. linguistics, e.g., for modeling of temporal relations (Costa and Branco, 2012; UzZaman et al., 2013). 1 Introduction As illustrated by the examples in Figure1, rel- Natural language employs, among other devices evant information for determining TMV is given such as temporal adverbials, tense and aspect to by syntactic dependencies and partially by part- locate situations in time and to describe their tem- of-speech (POS) tags output by analyzers such as poral structure (Deo, 2012). The tool presented Mate (Bohnet and Nivre, 2012). However, the here addresses the automatic annotation of mor- parser’s output is not sufficient for determining phosyntactic tense, i.e., the tense-aspect combi- TMV features; morphological features and lexical nations, expressed in the morphology and information needs to be taken into account as well. of verbal complexes (VC). VCs are sequences of Learning TMV features from an annotated corpus verbal tokens within a verbal phrase. We address would be an alternative; however, to the best of German, French and English, in which the mor- our knowledge, no such large-scale corpora exist. phology and syntax also includes information on A sentence may contain more than one VC, and mood and voice. Morphosyntactic tenses do not the tokens belonging to a VC are not always con- always correspond to semantic tense (Deo, 2012). tiguous in the sentence (see VCs A and B in the For example, the morphosyntactic tense of the En- English sentence in Figure1). In a first step, our glish sentence “He is leaving at noon.” is present tool identifies the tokens that belong to a VC by progressive, while the semantic tense is future. In analysing their POS tags as well as the syntactic the remainder of this paper, we use the term tense dependency parse of the sentence. Next, TMV to refer to the morphological tense and aspect in- values are assigned according to language specific formation encoded in finite verbal complexes. hand-crafted sets of rules, which have been devel- Corpus-linguistic research, as well as automatic oped based on extensive data analysis. The system modeling of mono- and cross-lingual use of tense, contains approximately 32 rules for English and mood and voice will strongly profit from a reliable 26 rules for German and for French. The TMV automatic method for identifying these clausal values are output along with some additional in-

1 Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics-System Demonstrations, pages 1–6 Vancouver, Canada, July 30 - August 4, 2017. c 2017 Association for Computational Linguistics

https://doi.org/10.18653/v1/P17-4001 (1) Output of MATE parser: VC PMOD ROOT NMOD

SBJ SBJ VC VC LOC NMOD It will , I hope , be examined in a positive light . PRP MD PRP VBP VB VBN IN DT JJ NN

ROOT OC

OC OC

SB NK SB MO OC Er sagt , die Briefe seien schon beantwortet worden . PPER VVFIN ART NN VAFIN ADV VVPP VAPP sg 3 pres ind pl 3 pres subj | | | | | | ats root mod suj obj.p suj suj mod det mod Elle sera , je l’ espere` , examinee´ dans un esprit positif . CLS V CLS CLO V VPP P DET NC ADJ ind s 3 fut ind s 3 pst f part s past | | | | | | | | | | (2) Extraction of verbal complexes based on dependencies; (3) Assignment of TMV features based on POS sequences, morphological features and lexical rules: A will be examined MD[will] VB[be] VBN futureI indicative passive B hope VBP → present indicative active C sagt VFIN[pres/ind] → present indicative active D seien beantwortet worden VAFIN[pres/ind] VVPP VVPP[worden] → present indicative passive E sera examinee´ V[ind/fut] VPP[part/past] → futureI indicative passive F espere` V[ind/pst] → present indicative active →

Figure 1: Example for TMV extraction. formation about the VCs into a TSV file which can rules for annotating tense, mood and voice for En- easily be used for further processing. glish, French and German. Furthermore, the on- line demo3 version of the tool allows for fast text Related work. Loaiciga´ et al.(2014) use rules processing without installing the tool. to automatically annotate tense and voice informa- tion in English and French parallel texts. Ramm 2 Properties of the verbal complexes and Fraser(2016) use similar tense annotation rules for German. Friedrich and Pinkal(2015) In this section, we describe the morphosyntactic provide a tool which, among other syntactic- features that we extract for verbal complexes. semantic features, derives the tense of English ver- bal complexes. This tense annotation is based on 2.1 Finite and non-finite VCs the set of rules used by Loaiciga´ et al.(2014) We define a verbal complex (VC) as a sequence For English, PropBank (Palmer et al., 2005) of verbs within a verbal phrase, i.e. a sentence contains annotations for tense, aspect and voice, may include more than one VC. In addition to the but there are no annotations for subjunctive con- verbs, a VC can also contain verbal particles and structions including modals. The German TuBa-¨ negation words but not arguments. We distinguish D/Z corpus only contains morphological features.1 between finite VCs which need to have at least one finite (e.g. “sagt” in Figure1), and non-finite Contributions. To the best of our knowledge, VCs which do not; the latter consist of verb forms our system represents the first open-source2 sys- such as , or infinitives (e.g. “to tem which implements a reliable set of derivation support”). Infinitives in English and German have 1http://www.sfs.uni-tuebingen.de/ascl/ to occur with the particles to or zu, respectively, ressourcen/corpora/tueba-dz.html 2https://github.com/aniramm/ 3https://clarin09.ims.uni-stuttgart. tmv-annotator de/tmv/

2 while in French, infinitives may occur alone. We VC “will be examined”, “examined” is marked as do not assign the TMV features to non-finite VCs. the main verb. Verb particles are considered as a Our tool marks finiteness of a VC using a binary part of the main verb and are attached to the cor- “yes” (finite) and “no” (non-finite). responding main verb, e.g., the main verb of the non-finite English VC “to move up” is “move-up.” 2.2 Tense, mood, voice The identification of TMV features for a VC re- German In general, the main verbs in German quires the analysis of lexical and grammatical in- have specific POS-tags (VV*) (see, for example, formation, such as inflections, given by the combi- (Scheible et al., 2013)). In most German VCs, nation of verbs. For example, the English present there is only one verb with such a POS-tag. How- continuous requires the auxiliary be in present ever, there are a few exceptions. For exam- tense and the gerundive form of the main verb (e.g. ple, the recipient passive is built with full verbs “(I) am speaking”). bekommen, kriegen, as well as lernen, lassen, Mood refers to the distinction between indica- bleiben and an additional meaning-bearing full tive and subjunctive. Both of these values are verb. Thus, in such constructions, there are two expressed in the inflection of finite verbs in all verbs tagged as VV* (e.g. “Ich bekommeVVFIN the considered languages. For example, the En- das Buch geschenktVVPP .” (“I receive the book glish verb “shall” is indicative, while its subjunc- donated”)). Recipient verbs are not treated as main tive form is “should.” In English, tense forms used verbs if they occur with an additional full verb. In in are often called conditionals; case there are no verbs tagged with VV*, the last for German, they are referred to as Konjunktiv. verb in the chain is considered to be the main verb. Voice differentiates between active and passive constructions. In all three languages, the passive 3 Deriving tense, mood and voice voice can be recognized by searching for a specific verb. For example, the in English In this section, we give a short overview of the requires the auxiliary be in a tense-specific form, methods used to derive TMV information. e.g., “(I) am being seen” for present progressive or “(he) has been seen” for present . 3.1 Extraction of VCs Details on how our tool automatically identifies TMV features will be described in Section3. The tokens of a VC are not necessarily contiguous. They may be separated by a coordination, adver- 2.3 Negation bials, etc., or even include nested VCs as in Figure VCs may include negation. Our tool outputs a bi- 1. This makes it necessary to take syntactic de- nary negation value to VCs depending on whether pendencies into account. The extraction of VCs a negation word (identified by checking for a in our tool is based on dependency parse trees in language-specific POS-tag) is part of the verbal the CoNLL format.4 The first step is the identi- dependency chain. If a negation exists, the feature fication of all VC-beginning elements vb within value is “yes”, and “no” otherwise. a sentence, which include finite verbs (English, French and German) and infinitival particles (En- 2.4 Main verb glish, German). They are identified by searching Within a VC, the main verb bears the semantic for specific POS-tags. For each vb, the remain- meaning. For example, in the English VC “would ing elements of the VC are collected by following have read,” the “read” is considered to the dependency relations between verbs. Consider be the main verb. The main verb feature may con- for example the finite verb “will” in Figure1. It tain a single verb or a combination of a verb with is identified as a vb because of its POS tag MD. the verb particle. In the following, we describe the We now follow the dependency path from “will” detection of the main verbs for each of the three to “be” and from “be” to ”examined”. The result- languages under consideration. ing VC is thus “will be examined.”

English and French. In English and French 4In this work, we use the Mate parser for all three lan- VCs, the very last verb in the VC is considered guages. https://code.google.com/archive/p/ to be the main verb. For example, in the English mate-tools/wikis/ParserAndModels.wiki.

3 finite mood tense voice example () finite mood tense voice example (active voice) present (I) work present (je) travaille presProg (I) am working presPerf (je) viens de travailler presPerf (I) have worked perfect (j’)ai travaille´ presPerfProg (I) have been working (je) travaillais past (I) worked pastSimp (je) travaillai pastProg (I) was working ind ind pastPerf (I) had worked pastPerf (j’)eus travaille´ pastPerfProg act (I) have been working pluperfect act (j’)avais travaille´ yes futureI pass (I) will work yes futureI pass (je) travaillerai futureIProg (I) will be working futureII (j’)aurai travaille´ futureII (I) will have worked futureProc (je) vais travailler futureIIProg (I) will have been working present (je) travaille condI (I) would work subj past (j’)aie travaille´ condIProg (I) would be working subj imperfect (je) travaillasse condII (I) would have worked no - - - travailler condIIProg (I) would have been working no - - - to work Table 2: TMV Combinations for French. Table 1: TMV combinations for English.

finite mood tense voice example (active voice) present (ich) arbeite 3.2 TMV extraction rules perfect (ich) habe gearbeitet imperfect (ich) arbeitete ind pluperfect act (ich) hatte gearbeitet English. The rules for English make use of the yes futureI pass (ich) werde arbeiten combinations of the functions of the verbs within futureII (ich) werde gearbeitet haben present (er) arbeite/arbeitete konjI a given VC. Such functions are for instance finite past (er) habe/hatte¨ gearbeitet konjII verb or passive auxiliary. According to the POS futureI+II (er) wurde¨ arbeiten / gear- beitet haben combination of a VC and lexical information, first, no - - - zu arbeiten the function of each verb within the VC is deter- mined. Subsequently, the combination of the de- Table 3: TMV combinations for German. rived functions is mapped to TMV values. For ex- ample, the following functions will be assigned to the verbs of the VC “will be examined” in Fig- distinguish between the two constructions. Table ure1: “will” finite-modal, “be” passive- 2 shows the French TMV combinations. → → auxiliary, “examined” past-participle. This → particular combination of verb functions leads to The rules are based on POS tags, mor- the TMV combination futureI/indicative/passive. German. phological analysis of the finite verbs and the lem- Table1 contains the set of possible TMV combi- mas of the verbs. We group the rules by the num- nations that our tool extracts for English. ber of tokens contained in the VC, as we have ob- French. The rules for French are defined on the served that each combination of TMV features re- basis of the reduction of the verbs to their mor- quires a particular number of tokens in the VC. For phological features. The morphological features each length, we specify which tense and mood of of the verbs are derived from the morphologi- the finite verb lead to a specific TMV. Similarly to cal analysis of the verbs, as well as their POS- French, in some contexts, we need to use lexical tags. The rules specify TMV values for each of information to decide on TMV. the possible sequences of the morphological fea- Take for example the VC “seien beantwortet tures. For example, the VC “sera examinee”´ is worden” from Figure1. Its POS sequence is mapped to the morphological feature combination VAFIN-VVPP-VAPP, so we use rules defined for V-indfut-V-partpast which, according to our rule the POS length of 3. We first check the mood set, leads to the TMV futureI/indicative/passive. of the finite verb “seien” which is subj (subjunc- In some cases, the lexical information is used to tive). The combination of subj with the morpho- decide between ambiguous configurations. For ex- logical tense of the finite verb pres leads to the ample, some perfect/active forms are ambiguous mood value konjunktivI and the tense value past. with present/passive forms. For instance, “Jean est As the verb werden, which is used for passive con- parti” and “Jean est menace”´ are both composed of structions in German, occurs in the VC, we derive the verb “est” + past participle, but they have dif- the voice value passive. Thus, the resulting anno- ferent meaning: “Jean has left” vs. “Jean is threat- tation is past/konjunktivI/passive. Table3 shows ened.” Information about the finite verb helps to TMV value combinations for German.

4 3.3 Extraction of voice The tool outputs a TSV file with TMV annota- In all three languages, it is difficult to distin- tions. An example output is shown in Table4. The guish between stative passive and tenses in the columns are specified as follows: sentence num- active voice. For instance, the German VCs “ist ber, indices of the elements of a VC separated by geschrieben (is written)” and “ist gegangen (has a comma, elements of a VC separated by a comma, gone)” are both built with the auxiliary sein and finite, main verb (if more than one, separated by a a past participle. The combination of POS tags is comma), tense, mood, voice, progressive (only for same for both cases, and the morphological fea- English), coordination and negation. The German tures of the finite verb (pres/ind) correspond to the TSV output has an additional column with bound- 5 German perfect tense in active voice. This, how- aries of a in which a VC is placed. We ad- ever, holds only for verbs of movement and a few ditionally provide a script for the conversion of the other verbs. Verbs such as “schreiben (to write)” annotations into HTML format which allows for are in this specific context present/passive (sta- quick and easy examination of the annotations. tive passive in present tense) and not perfect/active 5 Evaluation which is the case for the VC “ist gegangen”. To disambiguate between these constructions, We manually evaluate annotations for 157 German we use a semi-automatically crafted list of VCs, 151 English Vcs and 137 French VCs ex- the German and that form per- tracted from a set of randomly chosen sentences fect/active with the auxiliary sein/etreˆ (be) instead from Europarl (Koehn, 2005). The results are of haben/avoir (have), which is used for the ma- shown in Table5. jority of the verbs. We extract these lists from dif- ferent corpora by counting how often verbs occur Language tense mood voice all with sein/haben and etre/avoirˆ , respectively. We EN 81.5 88.1 86.1 76.8 manually validate the resulting verb lists. DE 80.8 84.0 81.5 76.4 When a VC with a POS sequence that is am- FR 86.1 93.4 82.5 75.2 biguous in the above explained way is detected, we check whether the main verb is in the list of Table 5: Accuracy of TMV features according to “sein/etre”ˆ verbs. If that is the case, the corre- manual evaluation. sponding active tense is annotated. Otherwise, the VC is assigned the corresponding passive tense. For French, the overall acurracy is 75%, while In the case of English, the disambiguation is the accuracy of German and English annotations somewhat easier. To differentiate between “is is 76%. Based on the manually annotated sam- written” and “has written,” we use information ple, we estimate that 23/59/85% (for EN/DE/FR) about the finite verb within the VC. In the case of the erroneous annotations are due to parsing where we have be, we assume to have passive errors. For instance, in the case of English, the voice in combination with an appropriate tense. In VC extraction process sometimes adds gerunds to case of have, the voice is active. the VC and interprets them as a present participle. Similarly, for French, a past participle is added, 4 Annotation tool which erroneously causes the voice assignment to be passive. Contrary to German and English, The tool is implemented in Python. It takes as in- French has higher mood accuracy, since mood is put the parsed text file in the CoNLL format. For largely encoded unambiguously in the verb mor- the rule development, as well as evaluation, we phology. For German, false or missing morpho- used the Mate parser (Bohnet and Nivre, 2012), logical annotation of the finite verbs causes some which can be applied on all of the three languages errors, and there are cases not covered by our rules addressed here. For German and French, we use for identifying stative passive. the joint model for parsing, tagging and morpho- Our rule sets have been developed based on ex- logical analysis including lemmatization. For En- tensive data analysis. This evaluation presents a glish, only tagging and parsing is required. In gen- 5 eral, the TMV annotation tool is applicable on the The clause boundary identification is based on the sen- tence punctuation (e.g. comma, colon, semicolon, hyphen, output of arbitrary parsers as long as their models etc). For more sophisticated clause boundary identification use the same POS- and dependency tags as Mate. for German, please refer to (Sidarenka et al., 2015).

5 sent verb main num id(s) VC verb fin tense mood voice neg coord 1 6,7 has climbed climbed yes presPerf indicative active no no 2 4,5 has crossed crossed yes presPerf indicative active no no 2 13,14 can ’t increase increase yes present indicative active yes no

Table 4: TSV output of the annotation tool for two English sentences: “Since then, the index has climbed above 10,000. Now that gold has crossed the magic $1,000 barrier, why can’t it increase ten-fold, too?” snapshot of the tool’s performance. The findings Ashwini Deo. 2012. Morphology. In Robert I. Bin- of this analysis will lead to improvement of the nick, editor, The Oxford Handbook of Tense and As- rules’ precision in future development iterations. pect, OUP. Annemarie Friedrich and Manfred Pinkal. 2015. Auto- 6 Conclusion matic recognition of habituals: a three-way classifi- cation of clausal aspect. In Proceedings of the 2015 We have presented an automatic tool which an- Conference on EMNLP. Lisbon, Portugal. notates English, French and German verbal com- Philipp Koehn. 2005. Europarl: A parallel corpus for plexes with tense, mood and voice. Our tool com- statistical machine translation. In Conference Pro- pensates for the lack of annotated data on this sub- ceedings: the tenth Machine Translation Summit. ject. It allows for large-scale studies of verbal Phuket, Thailand. tenses and their use within and across the three Sharid Loaiciga,´ Thomas Meyer, and Andrei Popescu- languages. This includes for instance typological Belis. 2014. English-French verb phrase alignment studies of the temporal interpretation of tenses, or in Europarl for tense translation modeling. In Pro- ceedings of the 9th International Conference on discourse studies interested in the referential prop- LREC. Reykjavik, Iceland. erties of tense. Large-scale annotated data with reliable accuracy also creates the possibility to Martha Palmer, Daniel Gildea, and Paul Kingsbury. 2005. The Proposition Bank: An Annotated Cor- train classifiers, machine translation systems and pus of Semantic Roles. Computational Linguistics other NLP tools. The same approach for extracting 31(1):71–106. tense, aspect and mood could also be implemented Anita Ramm and Alexander Fraser. 2016. Modeling for other languages. verbal inflection for English to German SMT. In Proceedings of the the First Conference on Machine Acknowledgment Translation (WMT). Berlin, Germany.

This work has received funding from the DFG Roi Reichart and Ari Rappoport. 2010. Tense sense disambiguation: a new syntactic polysemy task. In grant Models of Morphosyntax for Statistical Ma- Proceedings of the 2010 Conference on EMNLP. chine Translation (Phase 2), the European Union’s Massachusetts, USA. Horizon 2020 research and innovation programme Diana Santos. 2004. Translation-based corpus studies under grant No. 644402 (HimL), and Contrasting English and Portuguese tense and as- from the European Research Council (ERC) under pect systems. Rodopi. grant agreement No. 640550. Silke Scheible, Sabine Schulte im Walde, Marion We thank Andre´ Blessing for developing the Weller, and Max Kisselew. 2013. A compact but demo version of the tool. linguistically detailed database for german verb sub- categorisation relying on dependency parses from a web corpus: Tool, guidelines and resource. In In References Proceedings of the WAC-8. Lancaster, UK. Uladzimir Sidarenka, Andreas Peldszus, and Manfred Bernd Bohnet and Joakim Nivre. 2012. A transition- Stede. 2015. Discourse segmentation of German based system for joint part-of-speech tagging and texts. Journal for Language Technology and Com- labeled non-projective dependency parsing. In Pro- putational Linguistics 30(1):71–98. ceedings of the 2012 Joint Conference on EMNLP. Jeju Island, Korea. Naushad UzZaman, Hector Llorens, Leon Derczyn- ski, James Allen, Marc Verhagen, and James Puste- Francisco Costa and Antonio´ Branco. 2012. Aspectual jovsky. 2013. SemEval-2013 Task 1: TempEval- type and temporal relation classification. In Pro- Events, and Temporal Relations. In Proceedings of ceedings of the 13th Conference of the EACL. Avi- the SemEval 2013. Atlanta, Georgia. gnon, France.

6