
Large Semantic Network Manual Annotation

Václav Novák
Institute of Formal and Applied Linguistics
Charles University, Prague
[email protected]

Abstract

This abstract describes a project aiming at manual annotation of the content of natural language utterances in a corpus. The formalism used in this project is MultiNet (Multilayered Extended Semantic Networks). The annotation should be incorporated into the Prague Dependency Treebank as a new annotation layer.

1 Introduction

A formal specification of semantic content is the aim of numerous semantic approaches such as TIL [6], DRT [9], MultiNet [4], and others. As far as we can tell, there is no large “real-life” text corpus manually annotated with such markup. Existing projects usually work only with automatically generated annotation, if any [1, 6, 3, 2]. We want to create a parallel Czech-English corpus of texts annotated with the corresponding semantic networks.

1.1 Prague Dependency Treebank

From the linguistic viewpoint, there are language resources such as the Prague Dependency Treebank (PDT) that contain a deep manual analysis of texts [8]. PDT contains annotations on three layers: morphological, analytical (shallow dependency syntax), and tectogrammatical (deep dependency syntax). The units of each annotation layer are linked with the corresponding units on the preceding layer. The morphological units are linked directly with the original text.

The theoretical basis of the treebank lies in the Functional Generative Description of the language system [7]. PDT 2.0 is based on the long-standing Praguian linguistic tradition, adapted to current computational-linguistics research needs. The corpus itself is embedded into the latest annotation technology. Software tools for corpus search, annotation, and language analysis are included, and extensive documentation (in English) is provided as well.
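The linking between layers can be pictured as a chain of references that leads from each unit down to the original text. The following is a minimal sketch under stated assumptions, not the PDT data format: the `Unit` class, the `surface_tokens` helper, the identifiers, and the toy sentence are all hypothetical and serve only to illustrate the idea of layer-to-layer links.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Unit:
    """One annotation unit (hypothetical structure, not the PDT schema)."""
    unit_id: str
    layer: str  # "w" = original text, "m", "a", "t" = annotation layers
    form: str
    # Identifiers of the linked units on the preceding (lower) layer.
    lower: List[str] = field(default_factory=list)


def surface_tokens(units: Dict[str, Unit], unit_id: str) -> List[str]:
    """Follow the links downwards and collect the original text tokens."""
    unit = units[unit_id]
    if not unit.lower:  # no lower links: this is a unit of the original text
        return [unit.form]
    tokens: List[str] = []
    for lower_id in unit.lower:
        tokens.extend(surface_tokens(units, lower_id))
    return tokens


def build_toy_sentence() -> Dict[str, Unit]:
    """Toy analysis of "He has slept" (illustrative only)."""
    units: Dict[str, Unit] = {}
    for i, word in enumerate(["He", "has", "slept"], start=1):
        units[f"w{i}"] = Unit(f"w{i}", "w", word)
        units[f"m{i}"] = Unit(f"m{i}", "m", word.lower(), [f"w{i}"])
        units[f"a{i}"] = Unit(f"a{i}", "a", word.lower(), [f"m{i}"])
    # On the deep layer, one node may correspond to several lower units.
    units["t1"] = Unit("t1", "t", "he", ["a1"])
    units["t2"] = Unit("t2", "t", "sleep", ["a2", "a3"])
    return units
```

Calling `surface_tokens(build_toy_sentence(), "t2")` walks from the deep node through the analytical and morphological units to the text tokens it covers.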

1.2 MultiNet

The representational means of Multilayered Extended Semantic Networks (MultiNet), which are described in [4], provide a universally applicable formalism for the treatment of semantic phenomena of natural language. To this end, they offer distinct advantages over the use of the classical predicate calculus and its derivatives. Moreover, semantic networks are convenient for manual annotation due to their cognitive adequacy.

The semantic representation of natural language expressions by means of MultiNet is largely independent of the language considered. In contrast, the syntactic constructs used in different languages to describe the same content are obviously not identical. To bridge the gap between different languages, we can employ the deep syntactico-semantic representation available in the Functional Generative Description framework.

2 Project Goals

2.1 Annotation Tool

A screenshot of the pilot version of the annotation tool is shown in Figure 1. Apart from the Java GUI application, the project also defines an XML schema in the Prague Markup Language framework [5]. The tool allows the annotator to create concept nodes and functional nodes and to connect the concepts with edges and metaedges. The layer attributes of nodes and edges are visible and editable at all times. The parallel data used for the pilot annotation come from The Wall Street Journal, an American newspaper, and its Czech translation.
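The objects the annotator manipulates can be sketched as a small data model. This is an illustrative sketch only, not the tool's Java implementation or its XML schema: the class names, the `tecto_ref` field, and the relation and attribute labels are hypothetical placeholders, and the sketch assumes that a metaedge connects two existing edges.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    node_id: str
    kind: str  # "concept" or "functional"
    label: str
    layers: Dict[str, str] = field(default_factory=dict)  # editable layer attributes
    tecto_ref: Optional[str] = None  # hypothetical link to a tectogrammatical node


@dataclass
class Edge:
    edge_id: str
    relation: str  # placeholder relation label
    source: str    # node_id of the start node
    target: str    # node_id of the end node
    layers: Dict[str, str] = field(default_factory=dict)


@dataclass
class MetaEdge:
    relation: str
    source_edge: str  # edge_id; assumption: metaedges relate edges to edges
    target_edge: str


class Network:
    """Container that refuses edges and metaedges with dangling endpoints."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}
        self.edges: Dict[str, Edge] = {}
        self.metaedges: List[MetaEdge] = []

    def add_node(self, node: Node) -> None:
        if node.kind not in ("concept", "functional"):
            raise ValueError(f"unknown node kind: {node.kind}")
        self.nodes[node.node_id] = node

    def add_edge(self, edge: Edge) -> None:
        for end in (edge.source, edge.target):
            if end not in self.nodes:
                raise KeyError(f"unknown node: {end}")
        self.edges[edge.edge_id] = edge

    def add_metaedge(self, meta: MetaEdge) -> None:
        for end in (meta.source_edge, meta.target_edge):
            if end not in self.edges:
                raise KeyError(f"unknown edge: {end}")
        self.metaedges.append(meta)
```

Keeping the layer attributes in a plain dictionary on each node and edge mirrors the requirement that they stay editable at all times; a real schema would presumably constrain the permitted attribute names and values.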

2.2 Annotation Guidelines

The guidelines are based on the MultiNet specification in [4] and will be continuously enriched by the annotators in cooperation with MultiNet specialists at the University of Hagen. Specifying the guidelines should, in turn, improve MultiNet and enrich its specification. There is a long tradition of parallel off-line annotation and of annotator training and coordination connected with PDT, which should also ease the process.

2.3 Annotation Evaluation

Inter-annotator agreement must be measured, and corresponding metrics must be developed. This requires equivalence rules to account for annotations that differ in form but should not be considered different with respect to the underlying content. Last but not least, inter-language agreement will be a useful measure of how language-independent this formalism really is and how far MultiNet is from a true interlingua.
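One possible shape for such a metric, given purely as a sketch and not as the project's actual measure, is to normalize each annotation with the equivalence rules and then compare the resulting edge sets. The example below assumes that the nodes of the two annotations have already been aligned (so that identifiers are comparable), uses a single hypothetical equivalence rule (the direction of a symmetric relation is ignored), and scores the overlap with F1. The relation labels and the `SYMMETRIC` set are illustrative placeholders.

```python
from typing import FrozenSet, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (source node, relation, target node)

# Hypothetical set of relations whose direction carries no meaning.
SYMMETRIC: FrozenSet[str] = frozenset({"EQU"})


def normalize(triple: Triple) -> Triple:
    """Apply the equivalence rule: order the endpoints of symmetric relations."""
    src, rel, tgt = triple
    if rel in SYMMETRIC and tgt < src:
        src, tgt = tgt, src
    return (src, rel, tgt)


def edge_f1(first: Iterable[Triple], second: Iterable[Triple]) -> float:
    """F1 overlap of two annotators' edge sets after normalization."""
    a: Set[Triple] = {normalize(t) for t in first}
    b: Set[Triple] = {normalize(t) for t in second}
    if not a and not b:
        return 1.0  # two empty annotations agree trivially
    return 2 * len(a & b) / (len(a) + len(b))
```

For instance, if one annotator records `("n2", "EQU", "n3")` and the other `("n3", "EQU", "n2")`, the two edges count as a match after normalization. The same function could be applied across languages once the Czech and English networks are aligned, although a chance-corrected measure may be preferable in practice.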

Figure 1: The annotation tool. Network concepts are connected to the tectogrammatical tree (left) and the list of available concepts is shown (right).

3 Acknowledgement

This work is supported by
• Czech Academy of Science grant 1ET201120505
• Czech Ministry of Education, Youth and Sports project LC536

The views expressed are not necessarily endorsed by the sponsors.

References

[1] Johan Bos. Towards Wide-Coverage Semantic Interpretation. In Proceedings of Sixth International Workshop on Computational Semantics IWCS-6, pages 42–53, 2005.

[2] Ulrich Callmeier, Andreas Eisele, Ulrich Schäfer, and Melanie Siegel. The DeepThought Core Architecture Framework. In Proceedings of LREC, May 2004.

[3] Ingo Glöckner, Sven Hartrumpf, and Rainer Osswald. From GermaNet glosses to formal meaning postulates. In Bernhard Fisseni, Hans-Christian Schmitz, Bernhard Schröder, and Petra Wagner, editors, Sprachtechnologie, mobile Kommunikation und linguistische Ressourcen – Beiträge zur GLDV-Tagung 2005 in Bonn, pages 394–407. Frankfurt am Main: Peter Lang, 2005.

[4] Hermann Helbig. Knowledge Representation and the Semantics of Natural Language. Springer-Verlag, Berlin Heidelberg, 2006.

[5] Petr Pajas and Jan Štěpánek. A Generic XML-Based Format for Structured Linguistic Annotation and Its Application to Prague Dependency Treebank 2.0. Technical Report 29, ÚFAL MFF UK, Praha, 2005.

[6] Aleš Horák. The Normal Translation Algorithm in Transparent Intensional Logic for Czech. PhD thesis, Faculty of Informatics, Masaryk University, Brno, Czech Republic, 2001.

[7] Petr Sgall, Eva Hajičová, and Jarmila Panevová. The Meaning of the Sentence in Its Semantic and Pragmatic Aspects. D. Reidel Publishing Company, Dordrecht, Boston, London, 1986.

[8] Petr Sgall, Jarmila Panevová, and Eva Hajičová. Deep Syntactic Annotation: Tectogrammatical Representation and Beyond. In A. Meyers, editor, Proceedings of the HLT-NAACL 2004 Workshop: Frontiers in Corpus Annotation, pages 32–38, Boston, Massachusetts, USA, 2004. Association for Computational Linguistics.

[9] J.F.A.K. Van Benthem and H. Kamp. Representing Discourse in Context. In V. Benthem and A.T. Meulen, editors, Handbook of Logic and Language. Elsevier, Amsterdam, 1997.
