Standards for language resources
Normenausschuss NA 105-00-06 AA
DIN – Deutsches Institut für Normung e.V., Burggrafenstraße 6, 10787 Berlin
General overview

Standards by level and resource type. Each level lists metaschemata (methodology) and layer-wise or resource-specific schemata.

• General
– Identification of resources: PISA (ISO 24619), Persistent identification and sustainable access
– Metadata: CMDI-1 (ISO/CD 24622-1), The Component Metadata Model; CMDI-2/3, resource-specific metadata models
– Data categories: DCR (ISO 12620), Data Category Registry
– Feature structures: FSR (ISO 24610-1), Feature structure representation; FSD (ISO 24610-2), Feature system declaration
• Corpora
– LAF (ISO 24612): Linguistic annotation framework
– MAF (ISO 24611): Morpho-syntactic annotation framework
– SynAF (ISO 24615): Syntactic annotation framework
– SemAF (ISO 24617): Semantic annotation framework
• Lexical data
– LMF (ISO 24613): Lexical markup framework
– The TEI guidelines offer a widely used platform for serializing a subset of LMF components.
• Terminological data
– TMF (ISO 16642): Terminological markup framework
– TBX (ISO 30042): TermBase eXchange

Standardization work

Who?
– Experts from industry, academia and administrations
– Experts are nominated based on expertise and interest
– National mirror committees of ISO committee TC 37/SC 4, e.g. for Germany at DIN

How?
– Stepwise procedures: proposals, working drafts, (draft) international standards
– Consensus-based: drafting ↔ commenting/ballot
– National standards organizations provide infrastructure

Ongoing standardization work: German involvement

• CMDI:
– Definition of metadata for different language resource (LR) types
– Modules: general (author, contact, ...) ↔ specific, per LR
• CQLF:
– Corpus Query Lingua Franca
– Principles for the comparison of the (formal) properties of corpus query systems
• SynAF-2/ISO Tiger:
– XML serialization for the SynAF model: <tiger2/> [Bosch et al. 2012]
– Based on TIGER-XML [König et al. 2003]
• Transcription of spoken language:
– Format for encoding of transcribed spoken text
– For the exchange of transcripts from different tools

Architecture of corpus standards

• LAF:
– Graph-based representation of primary data and annotations
– General exchange format
• MAF:
– Tokenizing, word forms (single, fused, multi-word)
– Inflection
• SynAF:
– Grammatical features
– Phrase structures
– Dependency structures
• SemAF: several aspects of lexical semantics, among others semantic roles, semantic features, time expressions

Sample use cases

• LAF and its XML serialization GrAF:
– Annotations in the American National Corpus (ANC) are represented in the LAF/GrAF stand-off format [http://www.anc.org/]
– The data structures of the relational database management system B3DB are based on LAF/GrAF [Eckart et al. 2010]
• LMF:
– There exists a version of WordNet in LMF format
– UBY-LMF is used to enable the combination of various lexical resources, e.g. FrameNet, Wiktionary, etc. [Eckle-Kohler et al. 2012]
– Dutch Cornetto project: large lexicon in LMF (Referentiebestand Nederlands)
• CLARIN-D: [http://www.clarin-d.de]
– 3-layer framework
– Internal processing format intended to be relatable with ISO standards
• Transcription of spoken language (upcoming): exchange of transcribed speech material, e.g. from CHAT, ELAN, Transcriber, Exmaralda, etc.
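The graph-based, stand-off representation that the poster attributes to LAF can be illustrated with a small sketch. It is not the GrAF schema or any official API: the class names (`Region`, `Node`, `AnnotationGraph`), the edge label `child`, and the feature keys are assumptions chosen for illustration only. The point it shows is that the primary data stays untouched, while annotations live in separate nodes that refer to it by character offsets and are connected by labelled edges.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Region:
    """A span of the primary data as character offsets [start, end)."""
    start: int
    end: int


@dataclass
class Node:
    """A hypothetical annotation node: anchored regions plus features."""
    node_id: str
    regions: List[Region]
    features: Dict[str, str] = field(default_factory=dict)


@dataclass
class AnnotationGraph:
    """Stand-off annotations over an unmodified primary text."""
    primary_data: str
    nodes: Dict[str, Node] = field(default_factory=dict)
    # Each edge is (source node id, target node id, label).
    edges: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, source: str, target: str, label: str) -> None:
        # Both endpoints must already exist in the graph.
        for node_id in (source, target):
            if node_id not in self.nodes:
                raise KeyError(node_id)
        self.edges.append((source, target, label))

    def text_of(self, node_id: str) -> str:
        """Return the primary-data text covered by a node's regions."""
        node = self.nodes[node_id]
        return " ".join(self.primary_data[r.start:r.end] for r in node.regions)


# Illustrative data: three tokens and one phrase over a short sentence.
graph = AnnotationGraph(primary_data="The cat sleeps.")
graph.add_node(Node("t1", [Region(0, 3)], {"pos": "DET"}))
graph.add_node(Node("t2", [Region(4, 7)], {"pos": "NOUN"}))
graph.add_node(Node("t3", [Region(8, 14)], {"pos": "VERB"}))
graph.add_node(Node("np1", [Region(0, 7)], {"cat": "NP"}))
graph.add_edge("np1", "t1", "child")
graph.add_edge("np1", "t2", "child")

print(graph.text_of("np1"))  # prints "The cat"
```

Because annotations only point into the text, several annotation layers (tokens, phrases, semantic labels) can be added or exchanged independently without rewriting the primary data.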
Contact points & information

DIN and ISO:
– Contact person: Gottfried Herzog (NA Terminologie, DIN)
– DIN: http://www.din.de/
– DIN-NA 105: http://www.nat.din.de/
– ISO: http://www.iso.org/
– ISOcat: http://www.isocat.org/

German institutions currently involved:
– Heinrich-Heine-Universität Düsseldorf
– Lingenio GmbH
– Universität Bielefeld
– Universität Stuttgart
– DFKI GmbH Saarbrücken
– Humboldt-Universität zu Berlin
– Ruhr-Universität Bochum
– Universität Erlangen-Nürnberg
– Universität Tübingen
– Fachhochschule Köln
– Institut für Deutsche Sprache, Mannheim
– Technische Universität Darmstadt
– Universität Hildesheim
References

[Bosch et al. 2012] Bosch, Sonja, Key-Sun Choi, Éric Villemonte De La Clergerie, Alex Chengyu Fang, Gertrud Faass, Kiyong Lee, Antonio Pareja-Lora, Laurent Romary, Andreas Witt, Amir Zeldes & Florian Zipser (2012) “
All valid standards issued by NA 105-00-06 AA
Document number   Publication date   Type             Title
ISO 24610-1       2006-04            Standard         Language resource management - Feature structures - Part 1: Feature structure representation
ISO 24610-2       2011-10            Standard         Language resource management - Feature structures - Part 2: Feature system declaration
ISO 24611         2012-11            Standard         Language resource management - Morpho-syntactic annotation framework (MAF)
ISO 24612         2012-06            Standard         Language resource management - Linguistic annotation framework (LAF)
ISO 24613         2008-11            Standard         Language resource management - Lexical markup framework (LMF)
ISO 24614-1       2010-11            Standard         Language resource management - Word segmentation of written texts - Part 1: Basic concepts and general principles
ISO 24614-2       2011-09            Standard         Language resource management - Word segmentation of written texts - Part 2: Word segmentation for Chinese, Japanese and Korean
ISO 24615         2010-10            Standard         Language resource management - Syntactic annotation framework (SynAF)
ISO 24616         2012-09            Standard         Language resources management - Multilingual information framework
ISO 24617-1       2012-01            Standard         Language resource management - Semantic annotation framework (SemAF) - Part 1: Time and events (SemAF-Time, ISO-TimeML)
ISO 24617-2       2012-09            Standard         Language resource management - Semantic annotation framework (SemAF) - Part 2: Dialogue acts
ISO/DIS 24617-4   2013-06            Draft standard   Language resource management - Semantic annotation framework (SemAF) - Part 4: Semantic roles (SemAF-SR)
ISO 24619         2011-05            Standard         Language resource management - Persistent identification and sustainable access (PISA)