Prague Markup Language Framework

Jirka Hana and Jan Stˇ epˇ anek´ Charles University in Prague, Faculty of Mathematics and Physics [email protected]

Abstract linguistic resources; follows a section on annotation tools, a query engine and programming libraries. Fi- In this paper we describe the Prague Markup Language (PML), a generic and open XML- nally, we discuss related work and future plans. based format intended to define format of linguistic resources, mainly annotated corpora. 2 Data Format We also provide an overview of existing tools The base data format selected for the described an- supporting PML, including annotation editors, a corpus query system, software libraries, etc. notation framework, both for data exchange and as a memory-model reference, is Prague Markup Lan- guage (PML, Pajas and Stˇ epˇ anek,´ 2006). While de- 1 Introduction signing PML, we have followed the following set of Constructing a linguistic resource is a compli- desiderata: cated process. Among other things it requires a good choice of tools, varying from elementary data • Stand-off annotation principles: Each layer of conversion scripts over annotation tools and tools the linguistic annotation should be cleanly sep- for consistency checking, to tools used for semi- arated from the other annotation layers as well automatic treebank building (POS taggers, syntactic as from the original data. This allows for mak- parsers). If no existing tool fits the needs, a new one ing changes only to a particular layer without has to be developed (or some existing tool adapted or affecting the other parts of the annotation and extended, which, however, seldom happens in prac- data. tice). The variety of tools that exist and emerged • Cross-referencing and linking: Both links to from various NLP projects shows that there is no external document and data resources and links simple solution that would fit all. It is sometimes a within a document should be represented co- small missing feature or an incompatible data format herently. Diverse flexible types of external that disqualifies certain otherwise well-established links are required by the stand-off approach. tools in the eyes of those who decide which tools Supposed that most data resources (data, tag- to use for their annotation project. sets, and dictionaries) use the same principles, This paper presents an annotation framework that they can be more tightly interconnected. was from its very beginning designed to be extensible and independent of any particular annotation • Linearity and structure: The data format ought schema. While reflecting the feedback from several to be able to capture both linear and structure annotation projects, it evolved into a set of generic types of annotation. tools that is open to all kinds of annotations. The first section describes the Prague Markup • Structured attributes: The representation Language and the way it is used to define format of should allow for associating the annotated

Proceedings of the 6th Linguistic Annotation Workshop, pages 12–21, Jeju, Republic of Korea, 12-13 July 2012. c 2012 Association for Computational Linguistics units with complex and descriptive data Attribute-value structures (AVS’s), i.e. structures structures, similar to feature-structures. consisting of attribute-value pairs. For each pair, the name of the attribute and the type of • Alternatives: The vague nature of language the value is specified. The type can be any often leads to more than one linguistic inter- PML type, including an AVS. A typical usage pretation and hence to alternative annotations. example of an AVS structure is, for example This phenomenon occurs on many levels, from a structure gathering the annotation of several atomic values to compound parts of the annota- independent morphological categories (lemma, tion, and should be treated in a unified manner. case, gender, number). A special type of AVS • Human-readability: The data format should be is a container, a structure with just one non- human-readable. This is very useful not only in attribute member. the first phases of the annotation process, when Lists allowing several values of the same type to be the tools are not yet mature enough to reflect aggregated, either in an ordered or unordered all evolving aspects of the annotation, but also manner. For example, a sentence can be repre- later, especially for emergency situations when sented as an ordered list of tokens, whereas a for example an unexpected data corruption oc- set of pointers to an ontology lexicon could be cur that breaks the tools and can only be re- captured as an unordered list. paired manually. It also helps the programmers Alternatives used for aggregating alternative anno- while creating and debugging new tools. tations, ambiguity, etc. For example, noun and • Extensibility: The format should be extensible verb can be alternative values of the part-of- to allow new data types, link types, and similar speech attribute in the morphological analysis properties to be added. The same should apply of the word flies: only one of them is the cor- to all specific annotation formats derived from rect value, but we do not know (yet) which one. the general one, so that one could incremen- Sequences representing sequences of values of dif- tally extend the vocabulary with markup for ad- ferent types. Unlike list members, the members ditional information. of a sequence do not need to be of the same • XML based: XML format is widely used for type and they may be further annotated using exchange and storing of information; it offers XML attributes. There is also a basic support a wide variety of tools and libraries for many for XML-like mixed content (a sequence can programming languages. contain both text and other elements). A simple regular expression might be used to spec- Thus PML is an abstract XML-based format in- ify the order and optionality of the members. tended to be generally applicable to all types of an- To give a typical usage example, consider the notation purposes, and especially suitable for multi- phrase structure tree: each node has a sequence layered treebank annotations following the stand-off of child nodes of two data types, terminals and principles. A notable feature that distinguishes PML non-terminals. The content of each node would from other encoding schemes existing at the time of typically be an AVS capturing the phrase type its creation (see Section 4) is its generic and open for non-terminals and morphological informa- nature. Rather than being targeted to one particu- tion for terminals. lar annotation schema or being a set of specifically Links providing a uniform method for cross- targeted encoding conventions, PML is an open sys- referencing within a PML instance, referencing tem, where a new type of annotation can be intro- among various PML instances (e.g. between duced easily by creating a simple XML file called layers of annotation), and linking to other ex- PML schema, which describes the annotation by ternal resources (lexicons, audio data, etc.). means of declaring the relevant data types (see Fig- ure 1 for an example). Enumerated types which are atomic data types The types used by PML include the following: whose values are literal strings from a fixed fi-

13 Example of constituency tree annotation S VP NP

Figure 1: A PML schema deﬁning a simple format for representation of phrase structure trees

John Smith Sun May 1 18:56:55 2005

Figure 2: A sample phrase structure encoded in the format deﬁned in Figure 1

14 nite set. A typical example is a boolean type dependency tree from the words of each sen- with only two possible values, 0 and 1. tence (morphologically analysed on the previ- CData type representing all character-based data ous layer); without internal structure or whose inter- • a tectogrammatical layer consisting of deep- nal structure is not expressed by means of syntactic dependency trees interlinked in a m:n XML. For improved validation and optimal in- manner with the analytical layer and a valency memory representation, the cdata type decla- lexicon and carrying further relational annota- ration can be accompanied by a simple format tion, such as coreference and quotation sets. specification (identifier, reference, and the standard W3C XML Schema simple types for num- Many other corpora were encoded in the for- bers, date, time, language, . . . ). mat, including the Prague English Dependency Treebank,1 the Prague Arabic Dependency Tree- A PML schema can also assign roles to particular bank,2 the Prague Dependency Treebank of Spoken annotation constructions. The roles are labels from Language,3 the Prague Czech-English Dependency a pre-defined set indicating the purpose of the dec- Treebank,4 Czesl (an error tagged corpus of Czech larations. For instance, the roles indicate which data as a second language, (Hana et al., 2010)), the Lat- structures represent the nodes of the trees, how the vian Treebank,5 a part of the National Corpus of Pol- node data structures are nested to form a tree, which ish,6 the Index Thomisticus Treebank,7 etc. field in a data structure carries its unique ID (if any), or which field carries a link to the annotated data or Moreover, several treebanks were converted into other layers of annotation, and so on. the PML format, mostly to be searchable in the A new PML schema can be derived from an exist- query tool (see Section 3.3); e.g. the Penn Tree- ing one by just mentioning the reference to the old bank 3, the TIGER Treebank 1.0, the Penn – CU one and listing the differences in special PML ele- Chinese Treebank 6.0, the Penn Arabic Treebank 2 ments. – version 2.0, the Hyderabad Treebank (ICON 2009 A PML schema can define all kinds of annota- version), the Sinica Treebank 3.0 (both constituency tions varying from linear annotations of morphol- and CoNLL dependency trees), the CoNLL 2009 ST ogy, through constituency or dependency trees, to data, etc. The conversion programs are usually dis- complex graph-oriented annotation systems (coref- tributed as “extensions” (plug-ins) of TrEd (see Sec- erence, valency, discourse relations). The schema tion 3.2), but they can be run without the editor as provides information for validating the annotation well. data as well as for creating a relevant data model for 3 Tools their in-memory or database representation. To give a complex example, the annotation of the A data format is worthless without tools to process Prague Dependency Treebank 2.0 (PDT 2.0, Hajicˇ it. PML comes with both low level tools (valida- et al., 2006) was published in the PML format. It tion, libraries to load and save data) and higher level consists of four annotation layers, each defined by tools like annotation editors or querying and con- its own PML schema: version tools. Since the last published description of the framework (Pajas and Stˇ epˇ anek,´ 2008), the • a lowest word-form layer consisting of tok- enized text segmented just into documents and 1http://ufal.mff.cuni.cz/pedt2.0/ paragraphs; 2http://ufal.mff.cuni.cz/padt/PADT_1.0/ docs/index.html 3 • a morphological layer segmenting the token http://ufal.mff.cuni.cz/pdtsl/ 4http://ufal.mff.cuni.cz/pcedt2.0/ stream of the previous layer into sentences and 5http://dspace.utlib.ee/dspace/ attaching the appropriate morphological form, bitstream/handle/10062/17359/Pretkalnina_ lemma, and tag to each token; Nespore_etal_74.pdf 6http://nkjp.pl/ • an analytical layer building a morpho-syntactic 7http://itreebank.marginalia.it/

15 Figure 3: Sample sentence in the TrEd tree annotation tool tools were further improved and several new ones support for additional instances is rather straightfor- emerged. ward, it must be done manually, as we have not yet implemented an automatic API generator as we did 3.1 Low Level Tools for Perl. PML documents can be easily validated against their 3.2 Tree Editor TrEd schemas. The validation is implemented by translat- ing the PML schema into a Relax NG schema (plus TrEd, a graphical tree editor, is probably the most some Schematron rules) and then validating the doc- frequently used tool from the PML framework. It is uments using existing validation tools for Relax NG. a highly extensible and configurable multi-platform For schemas themselves, there exists another Re- program (running on MS Windows, Max OS and 9 lax NG schema that can validate them. Linux). TrEd can work with any PML data whose Most PML-related tools are written in Perl. The PML schema correctly defines (via roles) at least Treex::PML package (available at CPAN8) pro- one sequence of trees. Besides the PML format, vides object-oriented API to PML schemas and doc- TrEd can work with many other data formats, either uments. The library first loads the schema and by means of the modular input/output interface of then generates API tailored to the instances of the the PML library or using its own input/output back- schema. ends. Applications written in Java can build on a li- The basic editing capabilities of TrEd allow the brary providing objects supporting basic PML types user to easily modify the tree structure with drag- and utilities for reading and writing them to streams, and-drop operations and to easily edit the associated etc. Moreover, additional libraries provide support data. Although this is sufficient for most annotation for several PML instances (e.g. the PDT corpus and 9TrEd can open data in other formats, too, because it is able the Czesl corpus (Hana et al., 2010)). While adding to convert the data to PML and back on the fly, the conversion can be implemented as an XSLT transformation, Perl code or 8http://www.cpan.org/ executable program.

16 tasks, the annotation process can be greatly acceler- components: ated by a set of custom extension functions, called • an expressive query language supporting cross- macros, written in Perl. Macros are usually created layer queries, arbitrary boolean combinations to simplify the most common tasks done by the an- of statements, able to query complex data struc- notators. For example, by pressing “(”, the annota- tures. It also includes a sub-language for gener- tor toggles the attribute is_parenthesis of the ating listings and non-trivial statistical reports, whole subtree of the current node. which goes far beyond statistical features of While macros provide means to extend, accelerate e.g. TigerSearch. and control the annotation capabilities of TrEd, the concept of style-sheets gives users full control over • client interfaces: a graphical user interface with the visual presentation of the annotated data. a graphical query builder, a customizable vi- So far, TrEd has been used as an annotation sualization of the results, web-client interface, tool for PDT 2.0 and several similarly structured and a command-line interface. treebanking projects like Slovene (Dzeroskiˇ et al., 2006), Croatian (Tadic,´ 2007), or Greek Depen- • two interchangeable engines that evaluate dency Treebanks (Prokopidis et al., 2005), but also queries: a very efficient engine that requires for Penn-style Alpino Treebank (van der Beek et the treebank to be converted into a relational al., 2002), the semantic annotation in the Dutch database, and a somewhat slower engine which language Corpus Initiative project (Trapman and operates directly on treebank files and is useful Monachesi, 2006), the annotation of French sen- especially for data in the process of annotation. tences with PropBank information (van der Plas et The PML-TQ language offers the following dis- al., 2010), as well as for annotation of morphol- tinctive features: ogy using so-called MorphoTrees (Smrzˇ and Pajas, • selecting all occurrences of one or more nodes 2004) in the Prague Arabic Dependency Treebank from the treebanks with given properties and in (where it was also used for annotation of the depen- given relations with respect to the tree topol- dency trees in the PDT 2.0 style). ogy, cross-referencing, surface ordering, etc. TrEd is also one of the client applications to the querying system, see Section 3.3. • support for bounded or unbounded iteration The editor can also be used without the GUI (i.e. transitive closure) of relations10 just to run macros over given files. This mode supports several types of parallelization (e.g. Sun • support for multi-layered or aligned treebanks Grid Engine) to speed up processing of larger tree- with structured attribute values banks. This inspired the Treex project (Popel and • quantified or negated subqueries (as in “find all ˇ Zabokrtsky,´ 2010), a modular NLP software sys- clauses with exactly three objects but no sub- tem implemented in Perl under Linux. It is primar- ject”) ily aimed at machine translation, making use of the ideas and technology created during the Prague De- • referencing among nodes (find parent and child pendency Treebank project. It also significantly fa- that have the same case and gender but different cilitates and accelerates development of software so- number) lutions of many other NLP tasks, especially due to • natural textual and graphical representation of re-usability of the numerous integrated processing the query (the structure of the query usually modules (called blocks), which are equipped with corresponds to the structure of the matched uniform object-oriented interfaces. subtree) 3.3 Tree Query 10For example, descendant{1,3} (iterating parent rela- coref gram.rf 1, Data in the PML format can be queried in a query tion) or { } (iterating coreference pointer in PDT), sibling{-1,1} (immediately preceding or follow- tool called PML-Tree Query (PML-TQ, Pajas and ing sibling), order-precedes{-1,1} (immediately pre- Stˇ epˇ anek,´ 2009). The system consists of three main ceding or following node in the ordering of the sentence)

17 • sublanguage for post-processing and generat- ing reports (extracting values from the matched nodes and applying one or more layers of ﬁlter- ing, grouping, aggregating, and sorting)

• support for regular expressions, basic arith- metic and string operations in the query and post-processing For example, to get a frequency table of functions in the Penn Treebank, one can use the following query: nonterminal $n := [] >> for $n.functions give $1, count() Figure 4: Sample sentence in the feat annotation tool sort by $2 desc Which means: select all non-terminals. Take their functions and count the number of occurrences of each of them, sort them by this number. The output 72106 NP -> DT NN starts like the following: 65508 S -> NP VP . 738953 45995 NP -> -NONE- SBJ 116577 36078 NP -> DT JJ NN TMP 27189 31916 VP -> TO VP LOC 19919 28796 NP -> NNP NNP PRD 19793 23272 SBAR -> IN S CLR 18345 ...... More elaborate examples can be found in (Pajas and To extract a grammar behind the tree annotation Stˇ epˇ anek,´ 2009; Stˇ epˇ anek´ and Pajas, 2010). of the Penn Treebank is a bit more complex task: nonterminal $p := [ * $ch := [ ] ] 3.4 Other Tools >> give $p, $p.cat, In addition to the versatile TrEd tree editor, there first_defined($ch.cat,$ch.pos), lbrothers($ch) are several tools intended for annotation of non-tree >> give $2 & " -> " structures or for specific purposes: & concat($3," " over $1 sort by $4) >> for $1 give count(),$1 MEd is an annotation tool in which linearly- sort by $1 desc structured annotations of text or audio data can Which means: search for all non-terminals with a be created and edited. The tool supports mul- child of any type. Return the identifier of the par- tiple stacked layers of annotations that can be ent, its category, the category or part-of-speech of interconnected by links. MEd can also be used the child, and the number of the child’s left brothers. for other purposes, such as word-to-word align- From this list, return the second column (the par- ment of parallel corpora. ent’s category), add an arrow, and concatenate the Law (Lexical Annotation Workbench) is an edi- third column (child’s category or part-of-speech) of tor for morphological annotation. It supports all the children with the same parent (first column) simple morphological annotation (assigning a sorted according to the original word order. In this lemma and tag to a word), integration and com- list, output number of occurrences of each line plus parison of different annotations of the same the line itself, sorted by the number of occurrences. Running it on the Penn Treebank produces the fol- text, searching for particular word, tag etc. It lowing output: natively supports PML but can import from and 189856 PP -> IN NP export to several additional formats. 128140 S -> NP VP 87402 NP -> NP PP 11http://ufal.mff.cuni.cz/˜hana/feat.html

18 Feat11 is an environment for layered error annota- similar to PML, they both support stand-off annotation of learners corpora (see Figure 4). It has tions, feature structures, alternatives, etc. The fol- been used in the Czesl project (e.g. Hana et al., lowing points are probably the main differences be- 2010; Hana et al., 2012) to correct and anno- tween the frameworks: tate texts produced by non-native speakers of Czech. The corpus and its annotation is en- • LAF is an abstract format, independent of its coded in several interconnected layers: scan serialization to XML, which is specified by the of the original document, its transcription, to- Graph Annotation Format (GrAF; Ide and Su- kenized text encoding author’s corrections and derman, 2007). PML is an XML based format, two layers of error correction and annotation. but in principle it could be encoded in other Tokens on the latter three layers are connected structured languages such as JSON. by hyper-edges. • While PML allows encoding general graphs in the same way as GrAF, for certain specific 12 Capek (e.g. Hana and Hladka,´ 2012) is an anno- graphs it is recommended to use encoding by tation editor tailored to school children to in- XML structures: simple paths by a sequence of volve them in text annotation. Using this edi- XML elements and trees by embedding. This tor, they practice morphology and dependency- greatly simplifies parsing and validation and based syntax in the same way as they nor- prevent lots of errors. (In theory, these errors mally do at (Czech) schools, without any spe- should be prevented by the use of appropriate cial training. applications, but in practice the data are often modified by hand or low level tools.) In addi- The last three of the above tools are written Java tion, many problems can be solved significantly on top of the Netbeans platform, they are open and faster for trees or sequences than for general can be extended via plugins. Moreover, the Capek graphs. editor also has an iOS version for iPad. • PML is supported with a rich set of tools (TrEd 4 Related Work and other tools described in this paper). We were not able to find a similar set of tools for TEI The Text Encoding Initiative (TEI) provides LAF. guidelines13 for representing a variety of literary and linguistic texts. The XML-based format is very rich Plain text There are many advantages of a struc- and among other provides means for encoding lin- tured format over a plain-text vertical format (e.g. guistic annotation as well as some generic markup popular CoNLL Shared Task format). The main for graphs, networks, trees, feature-structures, and drawbacks of the simpler plain-text format is that links. On the other hand, it lacks explicit support it does not support standard encoding of meta in- for stand-off annotation style and makes use of enti- formation, and that complex structures (e.g. lists of ties, an almost obsoleted feature of XML, that orig- lists) and relations in multi-layered annotation are inates in SGML. There are no tools supporting the encoded in an ad-hoc fashion which is prone to er- full specification. rors. For details, see (Stranˇak´ and Stˇ epˇ anek,´ 2010). EXMARaLDA We have also used PML to encode ISO LAF, MAF, SynAF, GrAF The Linguistic the Czesl learner corpus. As the corpus uses lay- Annotation Format (LAF; Ide and Romary, 2004; ered annotation, the only established alternative was Suderman and Ide, 2006) was developed roughly at the tabular format used by EXMARaLDA (Schmidt, the same time as PML. It encodes linguistics struc- 2009). However, the format has several disadvan- tures as directed graphs; both nodes and edges might tages (see, e.g. Hana et al., 2010; Hana et al., 2012). be annotated with feature structures. LAF is very Most importantly, the correspondences between the 12http://ufal.mff.cuni.cz/styx/ original word form and its corrected equivalents or 13http://www.tei-c.org/Guidelines/P5/ annotations at other levels may be lost, especially

19 for errors in discontinuous phrases. The feat editor Linguistic Data Consortium, Philadelphia. CD-ROM, supports import from and export to several formats, CAT: LDC2001T10. including EXMARaLDA. Jirka Hana and Barbora Hladka.´ 2012. Getting more data – Schoolkids as annotators. In Proceedings of the 5 Future Work Eighth Conference on Language Resources and Eval- uation (LREC 2012), Istanbul. accepted. In the current speciﬁcation, PML instances use a Jirka Hana, Alexandr Rosen, Svatava Skodovˇ a,´ and dedicated namespace. A better solution would be Barbora Stindlovˇ a.´ 2010. Error-tagged Learner Cor- to let the user specify his or her own namespace pus of Czech. In Proceedings of The Fourth Linguistic in a PML schema. Support for handling additional Annotation Workshop (LAW IV), Uppsala. ˇ namespaces would also be desirable (one might Jirka Hana, Alexandr Rosen, Barbora Stindlova,´ and Petr Proceed- use it e.g. to add documentation or comments to Jager.¨ 2012. Building a learner corpus. In ings of the Eighth Conference on Language Resources schemas and data), however, this feature need much and Evaluation (LREC 2012), Istanbul. accepted. more work: if some PML elements are moved or Nancy Ide and Laurent Romary. 2004. International deleted by an application, should it also move or standard for a linguistic annotation framework. Nat. delete the foreign namespace content? Lang. Eng., 10(3-4):211–225, September. List members and alternative members are always Nancy Ide and Keith Suderman. 2007. Graf: a graph- represented by , resp. XML elements. based format for linguistic annotations. In Proceed- Several users requested a possibility to deﬁne a dif- ings of the Linguistic Annotation Workshop, LAW ’07, ferent name in a PML schema. This change would pages 1–8. ˇ make the data more readable for human eyes, but it Petr Pajas and Jan Stepˇ anek.´ 2006. XML-based rep- might complicate the internal data representation. resentation of multi-layered annotation in the PDT 2.0. In Richard Erhard Hinrichs, Nancy Ide, Martha We would also like to extend support of PML in Palmer, and James Pustejovsky, editors, Proceedings Java and add support for additional languages. of the LREC Workshop on Merging and Layering Lin- Finally, we plan to perform a detailed comparison guistic Information (LREC 2006), pages 40–47. with the LAF-based formats and create conversion Petr Pajas and Jan Stˇ epˇ anek.´ 2008. Recent advances in tools between PML and LAF. a feature-rich framework for treebank annotation. In Donia Scott and Hans Uszkoreit, editors, The 22nd Acknowledgements International Conference on Computational Linguis- tics – Proceedings of the Conference, volume 2, pages This work has been using language resources de- 673–680, Manchester. veloped and/or stored and/or distributed by the Petr Pajas and Jan Stˇ epˇ anek.´ 2009. System for query- LINDAT-Clarin project of the Ministry of Education ing syntactically annotated corpora. In Gary Lee and of the Czech Republic (project LM2010013). This Sabine Schulte im Walde, editors, Proceedings of the paper and the development of the framework were ACL-IJCNLP 2009 Software Demonstrations, pages 33–36, Singapore. Association for Computational Lin- also supported by the Grant Agency of the Czech guistics. Republic (grants 406/10/P193 and 406/10/P328). Martin Popel and Zdenekˇ Zabokrtskˇ y.´ 2010. TectoMT: Modular NLP framework. In Proceedings of IceTAL, References 7th International Conference on Natural Language Processing, pages 293–304, Reykjavik. Sasoˇ Dzeroski,ˇ Tomazˇ Erjavec, Nina Ledinek, Petr Pa- Prokopis Prokopidis, Elina Desypri, Maria Koutsom- jas, Zdenekˇ Zabokrtskˇ y,´ and Andreja Zele.ˇ 2006. To- bogera, Haris Papageorgiou, and Stelios Piperidis. wards a Slovene Dependency Treebank. In Proceed- 2005. Theoretical and practical issues in the construc- ings of the 5th International Conference on Language tion of a Greek dependency treebank. In In Proc. of Resources and Evaluation (LREC 2006), pages 1388– the 4th Workshop on Treebanks and Linguistic Theo- 1391. ries (TLT), pages 149–160. Jan Hajic,ˇ Jarmila Panevova,´ Eva Hajicovˇ a,´ Petr Thomas Schmidt. 2009. Creating and working with spo- Sgall, Petr Pajas, Jan Stˇ epˇ anek,´ Jirˇ´ı Havelka, Marie ken language corpora in EXMARaLDA. In LULCL Mikulova,´ Zdenekˇ Zabokrtskˇ y,´ and Magda Sevˇ cˇ´ıkova-´ II: Lesser Used Languages & Computer Linguistics II, Raz´ımova.´ 2006. Prague Dependency Treebank 2.0. pages 151–164.

20 Otakar Smrzˇ and Petr Pajas. 2004. MorphoTrees of Arabic and their annotation in the TrEd environment. In Mahtab Nikkhou, editor, Proceedings of the NEM- LAR International Conference on Arabic Language Resources and Tools, pages 38–41, Cairo. ELDA. Jan Stˇ epˇ anek´ and Petr Pajas. 2010. Querying diverse treebanks in a uniform way. In Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC 2010), pages 1828–1835. European Language Resources Association. Pavel Stranˇak´ and Jan Stˇ epˇ anek.´ 2010. Representing layered and structured data in the CoNLL-ST format. In Alex Fang, Nancy Ide, and Jonathan Webster, editors, Proceedings of the Second International Conference on Global Interoperability for Language Resources, pages 143–152, Hong Kong. Keith Suderman and Nancy Ide. 2006. Layering and merging linguistic annotations. In Proceedings of the 5th Workshop on NLP and XML: Multi-Dimensional Markup in Natural Language Processing, NLPXML ’06, pages 89–92. Marko Tadic.´ 2007. Building the Croatian Dependency Treebank: the initial stages. In Contemporary Linguis- tics, volume 63, pages 85–92. Jantine Trapman and Paola Monachesi. 2006. Manual for the annotation of semantic roles in D-Coi. Techni- cal report, University of Utrecht. Leonoor van der Beek, Gosse Bouma, Robert Malouf, and Gertjan van Noord. 2002. The Alpino Depen- dency Treebank. In Computational Linguistics in the Netherlands CLIN 2001, Amsterdam. Lonneke van der Plas, Tanja Samardzic, and Paola Merlo. 2010. Cross-lingual validity of PropBank in the manual annotation of French. In Proceedings of The Fourth Linguistic Annotation Workshop (LAW IV), Up- psala.