Standards for language resources
Normenausschuss NA-105-00-06 AA
DIN – Deutsches Institut für Normung e.V., Burggrafenstraße 6, 10787 Berlin

General overview

• Work
– Type (methodology)
• Who?
– Experts from industry, academia and administrations
– Experts are nominated based on expertise and interest
– National mirror committees of ISO committee TC 37/SC 4, e.g. for Germany at DIN
• How?
– Stepwise procedures: Proposals → working drafts → (draft) international standards
– Consensus-based: drafting ↔ commenting/ballot
– National standards organizations provide infrastructure

Ongoing standardization work: German involvement

• CMDI:
– Definition of metadata for different language resource (LR) types
– Modules: General (author, contact, ...) ↔ specific, per LR
• CQLF:
– Corpus Query Lingua Franca
– Principles for the comparison of the (formal) properties of corpus query systems
• SynAF-2/ISO Tiger:
– XML-serialization for the SynAF model: <tiger2/> [Bosch et al. 2012]
– Based on TIGER-XML [König et al. 2003]
• Transcription of spoken language:
– Format for encoding of transcribed spoken text
– For the exchange of transcripts from different tools

Standards by level and resource type (metaschemata (methodology) and layer-wise/specific schemata)

• Level: General
– Identification of resources: PISA ISO 24619: Persistent identification and sustainable access
– Metadata: CMDI-1 ISO/CD 24622-1: The Component Metadata Model; layer-wise/specific: CMDI-2/3: Resource-specific metadata models
– Data Categories: DCR ISO 12620: Data Category Registry
– Feature structures: FSD ISO 24610-2: Feature system declaration; FSR ISO 24610-1: Feature structure representation
• Level: Corpora
– Metaschema: LAF ISO 24612: Linguistic annotation framework
– Layer-wise/specific: MAF ISO 24611: Morpho-syntactic annotation framework
– Layer-wise/specific: SynAF ISO 24615: Syntactic annotation framework
– Layer-wise/specific: SemAF ISO 24617: Semantic annotation framework
• Level: Lexical data
– Metaschema: LMF ISO 24613: Lexical markup framework
– Layer-wise/specific: The TEI guidelines offer a widely used platform for serializing a subset of LMF components.
• Level: Terminological data
– Metaschema: TMF ISO 16642: Terminological markup framework
– Layer-wise/specific: TBX ISO 30042: TermBase eXchange

Sample use cases

• LAF and its XML-serialization GrAF:
– Annotations in the American National Corpus (ANC) are represented in the LAF/GrAF stand-off format [http://www.anc.org/]
– The data structures of the relational database management system B3DB are based on LAF/GrAF [Eckart et al. 2010]
• LMF:
– There exists a version of WordNet in LMF format
– UBY-LMF is used to enable the combination of various lexical resources, e.g. FrameNet etc. [Eckle-Kohler et al. 2012]
– Dutch Cornetto project: large lexical resource in LMF (Referentiebestand Nederlands)
• CLARIN-D: [http://www.clarin-d.de]
– 3-layer-framework
– Internal processing format intended to be relatable to ISO standards

Architecture of corpus standards

• LAF:
– Graph-based representation of primary data and annotations
– General exchange format
• MAF:
– Tokenizing, word forms (single, fused, multi-word)
– Inflection
• SynAF:
– Grammatical features
– Phrase structures
– Dependency structures
• SemAF: several aspects of semantics, among others semantic roles, semantic features, time expressions
• Transcription of spoken language (upcoming): exchange of transcribed speech material, e.g. from CHAT, ELAN, Transcriber, Exmaralda, etc.
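
The graph-based, stand-off idea behind LAF (primary data kept separate from the annotations that point into it) can be illustrated with a small sketch. This is not the GrAF format or any ISO-defined API; the class names `Region`, `Node`, `Edge` and `AnnotationGraph`, the feature keys and the sample sentence are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Region:
    """Character offsets into the primary data (end is exclusive)."""
    start: int
    end: int


@dataclass
class Node:
    """An annotation node; leaf nodes are anchored to a region."""
    node_id: str
    region: Optional[Region] = None
    features: Dict[str, str] = field(default_factory=dict)


@dataclass
class Edge:
    """A directed edge from a parent node to a child node."""
    source: str
    target: str


class AnnotationGraph:
    """Toy stand-off annotation graph (illustrative, not GrAF itself)."""

    def __init__(self, primary_data: str) -> None:
        self.primary_data = primary_data  # never modified by annotations
        self.nodes: Dict[str, Node] = {}
        self.edges: List[Edge] = []

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, source: str, target: str) -> None:
        self.edges.append(Edge(source, target))

    def span_of(self, node_id: str) -> Tuple[int, int]:
        """Return the (start, end) offsets covered by a node."""
        node = self.nodes[node_id]
        if node.region is not None:
            return (node.region.start, node.region.end)
        # Non-anchored nodes cover whatever their children cover.
        child_spans = [self.span_of(e.target)
                       for e in self.edges if e.source == node_id]
        if not child_spans:
            raise ValueError(f"node {node_id!r} has no region and no children")
        return (min(s for s, _ in child_spans),
                max(e for _, e in child_spans))

    def text_of(self, node_id: str) -> str:
        start, end = self.span_of(node_id)
        return self.primary_data[start:end]


# Primary data plus two annotation layers (tokens, phrases) on top of it.
graph = AnnotationGraph("The cat sleeps")
graph.add_node(Node("t1", Region(0, 3), {"pos": "DET"}))
graph.add_node(Node("t2", Region(4, 7), {"pos": "NOUN"}))
graph.add_node(Node("t3", Region(8, 14), {"pos": "VERB"}))
graph.add_node(Node("np", features={"cat": "NP"}))
graph.add_node(Node("s", features={"cat": "S"}))
graph.add_edge("np", "t1")
graph.add_edge("np", "t2")
graph.add_edge("s", "np")
graph.add_edge("s", "t3")

print(graph.text_of("np"))
```

Because the annotations only reference offsets, further layers (for example morpho-syntactic or semantic ones) could be added as additional nodes and edges without touching the primary text, which is the property that makes such a representation usable as a general exchange format.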

Contact points & Information

DIN and ISO:
– Contact person: Gottfried Herzog (NA Terminologie, DIN)
– DIN: http://www.din.de/
– DIN-NA 105: http://www.nat.din.de/
– ISO: http://www.iso.org/
– ISOcat: http://www.isocat.org/

German institutions currently involved:
– Heinrich-Heine-Universität Düsseldorf
– Lingenio GmbH
– Universität Bielefeld
– Universität Stuttgart
– DFKI GmbH Saarbrücken
– Humboldt-Universität zu Berlin
– Ruhr-Universität Bochum
– Universität Erlangen-Nürnberg
– Universität Tübingen
– Fachhochschule Köln
– Institut für Deutsche Sprache, Mannheim
– Technische Universität Darmstadt
– Universität Hildesheim

[Bosch et al. 2012] Bosch, Sonja, Key-Sun Choi, Éric Villemonte de la Clergerie, Alex Chengyu Fang, Gertrud Faass, Kiyong Lee, Antonio Pareja-Lora, Laurent Romary, Andreas Witt, Amir Zeldes & Florian Zipser (2012) “<tiger2/> as a standardized serialisation for ISO 24615 - SynAF”. Proceedings of the 11th International Workshop on Treebanks and Linguistic Theories (TLT11). 37–60. Lisbon, Portugal.

[Eckart et al. 2010] Eckart, Kerstin, Kurt Eberle & Ulrich Heid (2010) “An Infrastructure for More Reliable Corpus Analysis”. Proceedings of the Workshop on Web Services and Processing Pipelines in HLT: Tool Evaluation, LR Production and Validation (LREC 2010). 8–14. Valletta, Malta.

[Eckle-Kohler et al. 2012] Eckle-Kohler, Judith, Iryna Gurevych, Silvana Hartmann, Michael Matuschek & Christian M. Meyer (2012) “UBY-LMF - A Uniform Model for Standardizing Heterogeneous Lexical-Semantic Resources in ISO-LMF”. Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012). 275–282. Istanbul, Turkey.

[König et al. 2003] König, Esther, Wolfgang Lezius & Holger Voormann (2003) “TIGERSearch 2.1 User’s Manual. Chapter V - The TIGER-XML encoding format”. IMS, Universität Stuttgart, Germany.

All valid standards issued by NA 105-00-06 AA

Document number   Publication date   Type             Title

ISO 24610-1       2006-04            Standard         Language resource management - Feature structures - Part 1: Feature structure representation
ISO 24610-2       2011-10            Standard         Language resource management - Feature structures - Part 2: Feature system declaration
ISO 24611         2012-11            Standard         Language resource management - Morpho-syntactic annotation framework (MAF)
ISO 24612         2012-06            Standard         Language resource management - Linguistic annotation framework (LAF)
ISO 24613         2008-11            Standard         Language resource management - Lexical markup framework (LMF)
ISO 24614-1       2010-11            Standard         Language resource management - Word segmentation of written texts - Part 1: Basic concepts and general principles
ISO 24614-2       2011-09            Standard         Language resource management - Word segmentation of written texts - Part 2: Word segmentation for Chinese, Japanese and Korean
ISO 24615         2010-10            Standard         Language resource management - Syntactic annotation framework (SynAF)
ISO 24616         2012-09            Standard         Language resources management - Multilingual information framework
ISO 24617-1       2012-01            Standard         Language resource management - Semantic annotation framework (SemAF) - Part 1: Time and events (SemAF-Time, ISO-TimeML)
ISO 24617-2       2012-09            Standard         Language resource management - Semantic annotation framework (SemAF) - Part 2: Dialogue acts
ISO/DIS 24617-4   2013-06            Draft standard   Language resource management - Semantic annotation framework (SemAF) - Part 4: Semantic roles (SemAF-SR)
ISO 24619         2011-05            Standard         Language resource management - Persistent identification and sustainable access (PISA)