ELAN Electronic Government and Applications

Feature Based Document Profiling - A Key for Document Interoperability?

Bibliographic information of the Deutsche Nationalbibliothek:

The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.d-nb.de.

1st edition, June 2012

All rights reserved © Fraunhofer-Institut für Offene Kommunikationssysteme FOKUS, June 2012

Fraunhofer-Institut für Offene Kommunikationssysteme FOKUS
Kaiserin-Augusta-Allee 31
10589 Berlin

Phone: +49-30-3436-7115
Fax: +49-30-3436-8000
[email protected]
www.fokus.fraunhofer.de

This work, including all of its parts, is protected by copyright. Any use beyond the narrow limits of copyright law without the written consent of the institute is prohibited and punishable by law. This applies in particular to reproduction, translation, microfilming and storage in electronic systems. The reproduction of product designations and trade names in this book does not justify the assumption that such names may be regarded as free within the meaning of trademark legislation and may therefore be used by anyone. Where this work refers directly or indirectly to laws, regulations or guidelines (e.g. DIN, VDI) or quotes from them, the institute cannot guarantee their correctness, completeness or currency.

ISBN 978-3-00-038675-6

Feature Based Document Profiling ‐ a Key For Document Interoperability?

Authors

Dr. Klaus‐Peter Eckert Fraunhofer Institut FOKUS eMail: klaus‐[email protected]

Kerstin Goluchowicz Technische Universität Berlin, Fachgebiet Innovationsökonomie eMail: kerstin.goluchowicz@tu‐berlin.de

Dr. Stephan Gauch Technische Universität Berlin, Fachgebiet Innovationsökonomie eMail: stephan.gauch@tu‐berlin.de

Björn Kirchhoff eGov Consulting and Development GmbH eMail: [email protected]


Management Summary

The working group WG5 of the ISO/IEC subcommittee SC34 “Document Description and Processing Languages” performs research about “Document Interoperability” considering open document standards such as “Open Document Format ‐ ODF” and “Office Open XML ‐ OOXML”. The TransDok project (validation and transformation of selected profiles of the document standards ISO/IEC 26300 and ISO/IEC 29500), sponsored by the German Federal Ministry of Economics and Technology, contributes to this research. It examines whether and how feature based document profiles can be defined and used as a means to identify interoperable subsets of both document standards, especially for typical documents used in the German public sector.

Utilizing the document features identified in ISO/IEC TR 29166 (1), XML schemas for the definition of document features and feature based profiles have been defined. A feature list generator has been implemented that creates a list of all features used within a document and, in addition, a list of all features including their relative and absolute occurrence in all documents contained in a given folder. The list can be used to identify those properties that are characteristic for the documents within the folder and to define an associated document profile.

The feasibility of feature based profiles to describe common properties of document types has been analysed using mathematical classification methods. These methods show that at least typical features for certain document types exist. These features can be used to define an interoperable profile or template for a document type. In case the set of features is restricted to those that are characteristic and necessary and that allow a unique translation between both document standards, an important step towards document interoperability and translation has been taken. The average accuracy of our classification algorithms reaches levels above 70%, making these approaches a viable complementary option to improve classification of documents.

If the ideas developed in the project are applied to typical document types in the German public sector, their interoperability and portability can be enhanced significantly. The integration of the feature list generator in archiving systems will enhance the likelihood of sustainable storage of documents and reduce interoperability problems significantly. The ideas developed in the project have been presented to ISO/IEC SC34 WG5 as well as to ODF plug fests. The results of the project are included in the current WG5 study period report and will probably influence the next work items in WG5.

The project underlying this report was funded by the German Federal Ministry of Economics and Technology under grant number 01FS10017. The responsibility for the content of this publication lies with the authors.


Contents

Management Summary
Contents
1 Introduction
1.1 Practical Relevance
2 State of the Art
2.1 Open Document Formats
2.1.1 Introduction to OOXML
2.1.2 Introduction to ODF
2.2 Conformity and Interoperability Definitions
2.2.1 Office Open XML
2.2.2 OpenDocument Format
2.2.3 Summary
2.3 Document Features
2.3.1 ISO/IEC TR 29166
2.4 Profiling and Document Interoperability
2.5 Tools and Languages
2.5.1 Document Packages
3 Methodology
3.1 Definition of Document Features
3.2 Feature Based Profile Definition
3.3 Profile Inspection
3.3.1 Binary Membership
3.3.2 Statistical Membership
4 Technical Details
4.1 Feature List Generator
4.1.1 Using the Feature List Generator
4.2 Profile Definition and Checking
4.2.1 Definition and Testing of a Profile
5 Profile Evaluation
5.1 Dataset and Pre-processing
5.2 Testbed Specifications
5.3 Classification Approaches
5.3.1 Fisher Exact Tests
5.3.2 Cluster Analysis and Heatmaps
5.3.3 Logistic Regression
5.3.4 Recursive Partitioning Trees
5.3.5 Neural Networks
5.3.6 Support Vector Machines
5.3.7 Discriminant Analysis
5.4 Synopsis
6 Summary
6.1 Summary of Project results
6.2 Practical Relevance
6.2.1 Technical Relevance
6.2.2 Economic Relevance
6.3 Open Issues
6.3.1 Scientific Challenges
7 References


1 Introduction

The working group WG5 of the ISO/IEC subcommittee SC34 “Document Description and Processing Languages” performs research about “Document Interoperability” considering open document standards such as “Open Document Format ‐ ODF” and “Office Open XML ‐ OOXML”. The first result of the working group is the publication of the ISO/IEC technical report TR 29166 on “Guidelines for translation between ISO/IEC 26300 and ISO/IEC 29500 document formats” in late 2011 (1). This report defines a taxonomy of document features and evaluates whether these features are supported by the two standards and whether the implementations of the features can be mapped between the standards.

The TransDok project (validation and transformation of selected profiles of the document standards ISO/IEC 26300 and ISO/IEC 29500), sponsored by the German Federal Ministry of Economics and Technology, goes one step further. The project examines whether and how feature based document profiles can be defined and used as a means to identify interoperable subsets of both document standards. After several interviews with representatives from the German public sector and a comprehensive Internet search, a set of typical document categories for the German public sector has been identified and associated documents have been gathered and analysed.

Utilizing the document features identified in ISO/IEC TR 29166, XML schemas for the definition of document features and for the definition of feature based profiles have been defined. The feature language has been applied to specify exemplary features of word processing documents utilizing XPath based detection functions for both document standards. As a next step, a feature list generator has been implemented. This generator creates a list of all features used within a document and, in addition, a list of all features including their relative and absolute occurrence in all documents contained in a given folder. The list can be used to identify those properties that are characteristic for the documents within the folder. For example, typical features for a German application form can be identified.

The feature list generator has two additional properties. First, a profile can be defined by assigning attributes like “may exist”, “must exist”, “must not exist” etc. to each feature. Second, a document can be checked for conformance to such a profile definition.

Following the idea of feature based profiles, several questions arise. Is it possible to define profiles in a way that profiles characterizing different document types are really different? Is, for example, a feature based profile for a letter different from a profile for an application, and what makes the difference? What is the likelihood that an arbitrary document conforms to a given profile? What is the likelihood that a document of a specific document type conforms to the associated profile or, the other way round, what is the likelihood that such a document does not conform to a given profile? Is my letter really a letter with respect to a letter profile? If the intersection of two profiles P and Q is not empty, is it possible to say that a document d belongs to P or to Q, and is it possible to say that the likelihood of d ∈ P is greater than the likelihood of d ∈ Q?

To answer such questions, mathematical methods have been applied to our feature based profile definitions. These methods show that at least typical features for certain document types exist. These typical features can be used to define an interoperable profile or template for a given document type. In case the set of features is restricted to those that are characteristic and necessary for the document class and that allow a unique translation between both document standards, an important step towards document interoperability and translation has been taken. Whether a feature can be translated between the two standards can be derived from the associated detection functions. If detection functions for each standard exist for a given feature, these functions can be used to define a feature translation between the standards.

This report starts with a summary of the state of the art concerning conformity and interoperability definitions for the open document standards ODF and OOXML. Section 3 explains the methodology and mathematical approaches used in the project, followed by a description of the technical details of the feature list generator in section 4. Section 5 evaluates the profile idea utilizing statistical and classification methods. An outlook concerning the practical importance of the work concludes the report.

1.1 Practical Relevance

The major goal of the TransDok project is to improve interoperability between documents implemented in ODF or OOXML, respectively, and to give guidelines on how document templates and office suites should be designed to enhance portability of documents. If the ideas developed in the project are applied to typical document types in the German public sector, the interoperability and portability of these documents can be enhanced significantly. The integration of the feature list generator in archiving systems will enhance the likelihood of sustainable storage of documents and reduce interoperability problems significantly.

The ideas developed in the project have been presented to ISO/IEC SC34 WG5 as well as to ODF plug fests. The results of the project are included in the current WG5 study period report and will probably influence the next work items in WG5. For this reason the relevance for standardisation bodies such as ISO/IEC SC34 can be considered as high.


2 State of the Art

This section gives an introduction to the history and main concepts of the Open Document Format (ODF) and Office Open XML (OOXML). It focusses on the definition of document features and the concepts for conformity, interoperability and profiling introduced in both standards.

2.1 Open Document Formats

OASIS Open Document Format ODF 1.0 (ISO/IEC 26300) and Office Open XML (ISO/IEC 29500) are both open document formats for saving and exchanging word processing documents, spreadsheets and presentations. Both formats are XML based but differ in design and scope.

OASIS ODF 1.0 (2) was published by OASIS as an OASIS standard in May 2005. The second edition of ODF 1.0 has been published by OASIS as a committee specification in July 2006 and accepted as an International Standard by ISO (ISO/IEC 26300) (3) in December 2006.

Figure 1: Evolution of ODF (February 2012)

ODF 1.1 (4) has been published as an OASIS standard in 2007 and will be published as Amendment 1 of ISO/IEC 26300:2006 (5) in 2012. ODF 1.2 has been published as an approved OASIS Standard early in 2012 (6) and will probably become a PAS (footnote 1) submission to ISO/IEC in the same year.

Office Open XML was first approved as a five‐part standard in December 2006 by the Ecma General Assembly as ECMA‐376. An updated version was published in November 2008 by ISO as ISO/IEC 29500:2008. The corresponding version, ECMA‐376 2nd edition (7), was published in December 2008. The consolidated version of OOXML including several corrigenda and amendments was published in 2011 as ISO/IEC 29500:2011 and ECMA‐376 3rd edition (8).

1 PAS ‐ Publicly Available Specification


[Figure 2 is a timeline diagram: ECMA-376 1st edition (2006), 2nd edition (2008) and 3rd edition (2011) alongside ISO/IEC 29500 Parts 1-4 in their 2008 and 2011 versions, together with the corrigenda (Cor 1) and amendments (Amd 1) issued for each part.]

Figure 2: Evolution of OOXML (February 2012)

2.1.1 Introduction to OOXML

OOXML is a four‐part standard consisting of:

1. Part 1 ‐ Fundamentals and Reference (9). This part contains the strict specification of OOXML. At the time of writing there exists no implementation of this part. It contains:
   • Conformance definitions
   • Textual descriptions of the document parts and of the document markup languages defined by the standard: WordprocessingML, PresentationML, SpreadsheetML and further supported markup languages
   • XML schemas for the document markup languages using XSD and (non‐normatively) RELAX NG
   • Several examples, tutorials and primers
   • A list of differences between this part and ECMA‐376 1st edition
2. Part 2 ‐ Open Packaging Conventions (10). This part contains:
   • A description of the Open Packaging Conventions, e.g. the package model and the physical package
   • Core properties, thumbnails and digital signatures
   • XML schemas for the OPC using XSD and (non‐normatively) RELAX NG
   • Several examples and guidelines
   • A list of differences between this part and ECMA‐376 1st edition
3. Part 3 ‐ Markup Compatibility and Extensibility (11). This part contains:
   • A description of extensions: elements and attributes which define mechanisms allowing applications to specify alternative content
   • Extensibility rules using NVDL (footnote 2)

2 NVDL ‐ Namespace‐based Validation Dispatching Language ‐ ISO/IEC 19757 (14)


4. Part 4 ‐ Transitional Migration Features (12). This part contains the transitional specification of OOXML. At the time of writing, most OOXML applications implement this part. It contains:
   • Legacy material such as compatibility settings and the graphics markup language VML
   • Textual descriptions of the document parts and of the document markup languages defined by the standard: WordprocessingML, PresentationML, SpreadsheetML and further supported markup languages, referring to Part 1 of the standard whenever appropriate
   • XML schemas for the document markup languages using XSD and (non‐normatively) RELAX NG
   • A list of differences between this part and ECMA‐376 1st edition

2.1.1.1 WordprocessingML

OOXML defines three major markup languages that have been developed rather independently. For this reason the amount of shared concepts is quite small. For example, Part 1 introduces the following WML concepts, from which a model for text documents and their features can be derived:

• Paragraphs and Rich Formatting
• Tables
• Custom Markup
• Sections
• Styles
• Fonts
• Numbering
• Headers and Footers
• Footnotes and Endnotes
• Glossary Documents
• Annotations
• Mail Merge
• Settings
• Fields and Hyperlinks

The following “smart art” diagram shows how a taxonomy for the properties of a text document can be defined using the feature definitions of OOXML Part 1.


[Figure 3 is a taxonomy diagram relating an OOXML WML document to features such as paragraphs and run formatting, run content, tables, custom markup, sections, styles (table, numbering, paragraph and run styles), style properties, fonts, numbering/lists, headers and footers, and footnotes and endnotes.]

Figure 3: Sample features of OOXML wordprocessing documents

2.1.2 Introduction to ODF

ODF 1.2 is a three‐part standard consisting of:

1. Part 1: OpenDocument Schema (13). This part defines the XML schema for office documents such as text documents, spreadsheets, charts and graphical documents like drawings or presentations. It specifies:
   • Document structure
   • Document metadata
   • Document content
   • Formatting elements
   • Data types and attributes (the major part of the specification)
   • Normative RelaxNG schema definitions
   • Guidelines
2. Part 2: Recalculated Formula (OpenFormula) Format (14). This part defines the formula language for OpenDocument documents, called OpenFormula. It specifies:
   • Evaluator types
   • Formula processing model
   • Data types to be used in formulas
   • Expression syntax
   • Standard operations and functions
3. Part 3: Packages (15). This part defines a package format for OpenDocument documents. It specifies:
   • Package types
   • Package content
   • Manifest file
   • Digital signatures
   • Metadata
   • ZIP file structure (non‐normative)

2.1.2.1 Text document

ODF defines one major markup language that covers all elements of OpenDocument documents and all attributes of these elements. For this reason a text document is not specified by a separate markup language but is a document with a body containing office text, as depicted in Figure 4.

[Figure 4 is a structure diagram: the office:body element contains one body-content element such as office:text, office:drawing, office:presentation, office:chart, office:image or office:database; the office:text element carries its attributes (office:text-attlist) together with a text-content prelude, main part and epilogue.]

Figure 4: OpenDocument text document

Typical content of a text document consists of:

• Text content such as headings, paragraphs, lists, or change tracking
• Paragraph element content such as basic text, bookmarks and references, or notes
• Text fields such as variable fields or metadata
• Text indices such as a table of contents


• Tables such as basic tables or spreadsheets
• Graphic content such as shapes, frames, animations
• Chart content
• Database front‐end content
• Form content
• Styles
• Formatting elements

From this list a taxonomy for ODF text documents can be derived. To compare and map ODF documents to similar OOXML documents and vice versa it is necessary to define a common super model of both taxonomies or to define subsets of both taxonomies whose elements can be mapped in an unambiguous way. The idea to define feature based document profiles follows the second approach.

2.2 Conformity and Interoperability Definitions

Due to the existence of the two open document formats ODF (OpenDocument Format) and OOXML (Office Open XML), many discussions have been started about

• interoperability between the standards,

• conformity of documents and

• conformity of applications such as office suites, document producers and consumers.

It is necessary to have a look at the precise definitions of these terms within the standards to be able to discuss these issues on a well‐defined basis and to come to common conclusions acceptable by the users of documents and office suites as well as by the developers of standards and office suites. The introduction of document profiles is impossible without a common understanding of these basic terms and the corresponding concepts.

The relevant definitions about standard conformity and interoperability can be retrieved from ISO/IEC 29500:2008/2011 (respectively ECMA‐376 2nd (7) and 3rd (8) editions), ODF 1.2 Approved OASIS Standard (6), the ODF 1.1 Interoperability Profile (16) and the ODF state of interoperability committee specification (17). The statements about conformity and interoperability in ISO/IEC 29500:2011 are mostly similar to those in the 2008 version.

The purpose of these sections is to provide an overview of the conformity and interoperability definitions for the two document formats. This overview helps to derive the definition of property based document profiles, which depends on the corresponding concepts in both standards.

2.2.1 Office Open XML

This section introduces excerpts from the ISO/IEC 29500:2008 and ISO/IEC 29500:2011 versions of the OOXML specifications, which have been officially published in fall 2008 and 2011, respectively.

2.2.1.1 Application Descriptions

OOXML currently does not explicitly define the term “profile”. Instead, an OOXML application can be defined as conforming to zero or more application descriptions in a particular conformance class.


The application descriptions defined within ISO/IEC 29500 are:

• Base ‐ An application conforming to this description has a semantic understanding of at least one feature within its conformance class. In addition, applications that include a user interface are strongly recommended to support all accessibility features appropriate to that user interface.
• Full ‐ An application conforming to this description has a semantic understanding of every feature within its conformance class.

2.2.1.2 Conformance Classes

The above mentioned application conformance classes must fulfil the following conditions:

• Existence of W3C XML schemas and an associated validation procedure for validating document syntax against those schemas.
• Existence of additional syntax constraints, given in written form, that could not feasibly be expressed in the schema language.
• Existence of descriptions of XML element semantics. The semantics of an XML element refers to its intended interpretation by a human being.

An application is of conformance class WML/SML/PML (footnote 3) strict/transitional if the application is a conforming application that is a consumer or producer of documents having conformance class WML/SML/PML strict/transitional. An application description should provide a machine‐processable schema, preferably using a member of the multipart standard ISO/IEC 19757 (18) that defines Document Schema Definition Languages (DSDL), such as RelaxNG and the Namespace‐based Validation Dispatching Language (NVDL).

A document conformance class refers to the appropriate W3C XML schemas and additional syntax constraints used to specify WML/SML/PML‐strict/transitional documents.

The standard assumes that additional application descriptions will be defined within the maintenance process for OOXML. It is also expected that third parties might define their own application descriptions. Application descriptions would promote interoperability between applications implementing OOXML. They would also promote interoperability between applications implementing OOXML and applications implementing other document formats such as ODF.

2.2.1.3 Summary

Summarizing, the standard states that applications can conform to application descriptions based on feature definitions and document conformance classes. The intention of an application description is to promote interoperability between different applications that share the same conformance class. Following this idea, an OOXML document profile can be defined as a set of features within a document conformance class.

3 WML ‐ Wordprocessing Markup Language; SML ‐ Spreadsheet Markup Language; PML ‐ Presentation Markup Language


It is worth mentioning that the document conformance statement has been technically refined considering OPC (footnote 4) and MCE in the first technical corrigendum (5) to ISO/IEC 29500‐1:2011 and considering VML (footnote 5) in the first technical corrigendum (19) to ISO/IEC 29500‐4:2011. Additionally, the interoperable generation and consumption of MCE extension lists has been specified in a precise way.

Part 1 of ISO/IEC 29500 defines interoperability guidelines. These guidelines state that software applications should be accompanied by documentation that describes which subset of ISO/IEC 29500 they support. The documentation should highlight any behaviour that may violate the semantics of the document’s XML elements. For all operations on XML elements defined in ISO/IEC 29500 that are implemented by an application, it has to be ensured that the semantics of those XML elements is consistent with ISO/IEC 29500. If the application moves, adds, modifies, or removes XML element instances with the effect of altering document semantics, it should declare this behaviour in its documentation.

2.2.2 OpenDocument Format

The OpenDocument specification ODF 1.2 (6) defines conformance for documents, consumers, and producers, with two conformance classes called conforming and extended conforming.

2.2.2.1 Conformance Classes

An ODF document of conformance class conforming shall be a conforming OpenDocument package and it shall conform to one of: OpenDocument Text Document, OpenDocument Spreadsheet Document, OpenDocument Drawing Document, OpenDocument Presentation Document, OpenDocument Chart Document, OpenDocument Image Document, OpenDocument Formula Document, OpenDocument Database Front End Document. Each of these document types is characterized by the existence of a corresponding child element (e.g. <office:text> for a text document) of the <office:body> element.

An ODF document of conformance class extended conforming shall be a conforming ODF extended package and may contain additional foreign elements and attributes as specified by the standard.

2.2.2.2 ODF Producer

An OpenDocument producer is a program that creates at least one conforming OpenDocument document, and that may produce conforming OpenDocument extended documents, but it shall have a mode of operation where all OpenDocument documents that are created are conforming OpenDocument documents. The program shall be accompanied by a document that defines all implementation‐defined values used by the OpenDocument producer.

An OpenDocument extended producer is a program that creates at least one conforming OpenDocument extended document, and that shall be accompanied by a document that

• defines all implementation‐defined values used by the OpenDocument extended producer and that
• defines all foreign elements and attributes used by the OpenDocument extended producer.

4 OPC ‐ Open Packaging Conventions (10); MCE ‐ Markup Compatibility and Extensibility (11)
5 VML ‐ Vector Markup Language


2.2.2.3 ODF Consumer

An OpenDocument consumer is a program that can parse and interpret OpenDocument documents according to the semantics defined by this standard that meets the following additional requirements:

• It shall be able to parse and interpret OpenDocument documents of one or more of the document types defined by the standard, but it need not interpret the semantics of all elements, attributes and attribute values.
• It shall interpret those elements and attributes it does interpret consistent with the semantics defined for the element or attribute by the standard.
• It should be able to parse and interpret conforming OpenDocument extended documents, but it need not interpret the semantics of all elements, attributes and attribute values.

2.2.2.4 Expressions and Evaluators

The ODF standard defines conformance for formula expressions and evaluators. An OpenDocument Formula Evaluator is a program that can parse and recalculate OpenDocument formula expressions. ODF distinguishes three groups of features that an evaluator may support: it shall conform to OpenDocument Formula Small Group Evaluator, OpenDocument Formula Medium Group Evaluator or OpenDocument Formula Large Group Evaluator. The three groups support formula expressions with different types and complexity together with a different number of mathematical functions. For example, the small group supports data of type text, integer and floating‐point number as well as logical values, together with a basic set of corresponding functions. The medium group supports more functions, and the large group supports complex numbers and corresponding functions. An evaluator may implement additional functions beyond those defined in ODF. It may further implement additional formula syntax, additional operations, or additional optional parameters for functions. Evaluators should clearly document their extensions in their user documentation, both online and on paper, in a manner such that users are likely to be aware when they are using a non‐standard extension.

2.2.2.5 ODF Conformance and Interoperability

The OASIS “Open Document Format Interoperability and Conformance (OIC) TC” states in its paper on ODF interoperability (17) “that conformance is the relationship between a product and a standard. A standard defines provisions that constrain the allowable attributes and behaviours of a conforming product. Some provisions define mandatory requirements, meaning requirements that all conforming products must satisfy, while other provisions define optional requirements, meaning that where applicable they must be satisfied. Conformance exists when the product meets all of the mandatory requirements defined by the standard, as well as those applicable optional requirements… A standard may define requirements for one or more conformance targets in one or more conformance classes.”

Since the capabilities of office applications extend beyond simple desktop editors and include other product categories such as web‐based editors, mobile device editors, document converters, content repositories, search and indexing engines, and other document‐aware applications, interoperability will mean different things to users of these different applications. However, focussing on office applications, interoperability consists of meeting user expectations regarding one or more of the following qualities (footnote 6) when transferring documents:

• visual appearance of the document at various levels
• structure of the document as revealed when the user attempts to edit the document
• behaviours and capabilities of internal and external links and references
• behaviours and capabilities of embedded images, media and other objects
• preservation of document metadata
• preservation of document extensions
• integrity of digital signatures and other protection mechanisms
• runtime behaviours manifest from scripts, macros and other forms of executable logic

The focus on the user’s expectations leads to the ODF interoperability model shown in Figure 5. This model, introduced in (17), defines document interoperability as the degree of analogy between the author’s intention and the reader’s perception.

• Author’s intentions
• Application A’s encoding (author)
• Document ‐ standardized storage format
• Application B’s decoding (reader)
• Reader’s perceptions

Figure 5: ODF interoperability model

The ODF 1.1 Interoperability Profile Committee Draft (16) clarifies and formalizes interpretations of the ODF 1.1 specification by creating an Interoperability Profile that adds conformance constraints to the specification. It is currently not intended by OASIS to specify profiles that restrict the ODF standard for specific application areas.

2.2.3 Summary

2.2.3.1 Conformity

Both document formats introduce conformance considering supported document types as shown in Table 1. OOXML distinguishes conformity with respect to strict and transitional markup languages. Such a distinction is not necessary for ODF. ODF conformity is based on schema validity, OOXML conformity is based on schema validity together with additional "written syntax constraints".

Table 1: Document types

ODF                   OOXML
Text                  Wordprocessing (WML)
Spreadsheet           Spreadsheet (SML)
Drawing               ‐
Presentation          Presentation (PML)
Chart                 ‐
Image                 ‐
Formula               ‐
Database front end    ‐

6 These qualities are an example for the document features discussed in section 2.3.

Application conformance is defined according to document conformance. Both formats distinguish document consumer and producer.

• A conforming consumer shall
  o OOXML: not reject any conforming documents of at least one document conformance class
  o ODF: be able to parse and interpret ODF documents of one or more of the document types
• A conforming producer shall be able to produce
  o OOXML: conforming documents of at least one document conformance class
  o ODF: at least one conforming document

Both definitions seem to be equivalent, even though they differ in wording.

2.2.3.2 Extended Conformity

In ODF, a document can contain content that is not schema valid with respect to the ODF schema definitions. An ODF document is an element of the conformance class extended conforming if the document is an element of the class conforming after removal of the non‐ODF parts. In addition, the ODF specification explicitly introduces conformance classes for formula expressions and evaluators, based on the OpenFormula specification.

In OOXML the extension mechanism MCE is described in Part 3. A document is an element of the conformance class MCE, if it satisfies the corresponding syntax constraints on elements and attributes. This definition is more restrictive than the ODF definition.

2.2.3.3 Package Conformance

In OOXML, a document is of conformance class OPC if it obeys all corresponding syntactic constraints. An ODF file conforms to part 3 of the ODF specification if it is a zip‐file satisfying the corresponding constraints.

2.2.3.4 Profiling and Interoperability

OOXML introduces the concept of an application description without elaborating it within the current version of the standard. Application descriptions should be used to refine the standard and to improve interoperability between different implementations as well as between OOXML and ODF.

ODF addresses interoperability issues in the OIC TC and publishes papers about interoperability (17) and an interoperability profile (16). The focus of ODF profiles is to improve the interoperability between ODF applications.


2.2.3.5 Conclusion

Both document standards include conformance statements for documents, document producers and document consumers. While ODF focuses mainly on syntax related criteria and schema validity, OOXML considers textual syntax constraints and semantic aspects, too. Unfortunately, these definitions are rather weak and don't allow a precise definition and validation of conformance properties. For this reason both standards require additional written documentation to be provided in case of implementation dependent solutions and any extension of the standard.

Interoperability and profiling are only tackled to a limited extent and have to be improved by both standards bodies OASIS and ISO in the future.

2.3 Document Features

There exist two major approaches for the identification of document features in ODF and OOXML. The Pentaformat (20) introduces the concept of “pattern based segmentation of structured content” to express the most used and meaningful structures of digital documents. An abstract document model has been developed using the following set of basic patterns together with five characteristics:

• A marker is an empty element, possibly enriched with attributes
• An atom is a markup unit of information
• A block contains text streams and unordered and repeated nested elements
• A record is a container of heterogeneous information, organized in a set of optional elements
• A table is an ordered list of homogeneous elements
• A container is an unordered sequence of repeatable and heterogeneous elements
• An additive context is a context where a few elements are added in depth to existing elements
• A subtractive context is a context where some elements that would normally be allowed make no sense

The five dimensions for documents are:

• Content: what conveys semantics to the document
• Presentation: what defines the visual aspect of the document
• Structure: what provides organization and links the content to all the rest
• Behaviour: what defines the dynamics of a document in an active environment
• Metadata: what describes the document independent of its content

Following the approach depicted in Figure 6 mapping strategies between different document formats have been defined.


Figure 6: The Pentaformat (footnote 7)

Although the Pentaformat allows separating different aspects, in a translation operation one of the following three situations can always occur:

1. Both formats support the same feature
2. The target format partially supports the feature
3. The target format does not support the feature

For a one way mapping either

• the feature can be translated by syntactical transformations, or
• a workaround solution has to be implemented ‐ the feature can be implemented using a combination of different features of the target format ‐ or
• the feature cannot be translated at all

For roundtrip translation the Pentaformat suggests to include information about the feature representation in the source format as hidden or metadata in the target document.

2.3.1 ISO/IEC TR 29166

The ISO/IEC technical report TR 29166:2011 (1) aims at analysing ISO/IEC 26300:2006 and ISO/IEC 29500:2008 and their underlying concepts in terms of interoperability issues for a selected set of features. It analyses the way these features are implemented in both International Standards and estimates the degree of translatability between them using a table based comparison of document features and functionalities. ISO/IEC TR 29166 starts by studying common use cases to identify how the most important features and the implementation of the corresponding functionality in one document format can be represented in the other format. This is followed by a thorough review of the concepts, architectures and various features of the two document formats in order to provide a good understanding of the commonalities and differences. It is shown that functionalities can be translated with different degrees of fidelity between the two formats. ISO/IEC TR 29166 provides, for an illustrative sample of this translatability, detailed information on the extent to which selected functionalities can be translated.

7 Excerpt from (21)

[Figure 7 depicts the document properties distinguished in ISO/IEC TR 29166: presentation instructions, document content, dynamic content, metadata, annotations and security, and document parts.]

Figure 7: Document properties in ISO/IEC TR 29166

The document properties introduced in the report and depicted in Figure 7 can be compared with the five dimensions of the Pentaformat. In detail they are defined as follows:

• Presentation instructions include all layout and presentation related information such as fonts, spacing, margins, colours, paper layout and settings, and animation in office documents.
• Document content covers all properties of content (such as text, graphics and formulas) defined directly by the author of a document.
• Dynamic content covers all aspects of automatically generated content including calculations or form functionalities such as fields, generated tables or dynamic references.
• Metadata cover all information apart from the core document content. Metadata are used to describe meta information about the document such as the generator, version, authors and to ensure the accessibility of documents, for instance by using certificates.
• Annotation and security covers all aspects of annotations used in a document including comments, change tracking, collaborative functions and security features such as encryption and access control.
• Document parts cover all aspects (editing semantics) of structural document properties such as paragraphs, headings, headers, footers, tables, lists, footnotes, indices and captions.


Figure 8: Document features and functionality in ISO/IEC TR 29166

The report identifies specific features that are used to implement the six document properties for the three document types: wordprocessing, presentation and spreadsheet documents. Selected features and functions identified for wordprocessing documents are assembled in Table 2 below. This assembly is used as a starting point for the definition of feature based profiles as described in section 3.1.

Table 2: Selected features and functionalities of wordprocessing documents

Feature: Text formatting
Functionalities: Bold text (font weight), Text borders, Whitespaces, Capitalization, Text colour, Complex script support, East-Asian text, Font selection, Font effects, Manual specification of run/span width, Italic text, Kerning, Text language, Enable/disable spell checking for run/span, Raised/lowered text, Strikethrough, Underline

Feature: Paragraph formatting
Functionalities: Line height, Text alignment (left/right/centered/justified), Keep paragraph on same page as following paragraph, Do not split paragraph into multiple pages, Tab stops, Hyphenation, Drop Caps, Register truth (same text line distance across multiple pages/columns), Margins, First line indent, Page/column break, Background colour, Background pattern, Background image, Embedded Images, Borders, Padding, Shadow, Line numbering, Vertical alignment (top, middle, bottom, baseline), Asian/complex text layout properties, Writing mode (lr/rl/tb)

Feature: Text frames

Feature: Lists

Feature: Header and Footer
Functionalities: Content type, Properties, Formatting

Feature: Tables
Functionalities: Table properties, Data alignment, Column settings, Row settings, Cell settings, Sub tables, Borders, Table headings

Feature: Itemization and numeration
Functionalities: Numbered Lists, Bullet lists, Nested lists

Feature: Captions

Feature: Indices
Functionalities: Table of contents, Table of figures, Table of tables, User defined indices, Bibliographies

Feature: Hyperlinks and references

Feature: Change tracking
Functionalities: Annotations, Text insertion, Text deletion, Formatting changes, Comments

Feature: Text highlighting

Feature: Metadata

Feature: Graphics
Functionalities: Embedded graphics, Vector graphics

Feature: Forms

Feature: Charts
Functionalities: Embedded data

Feature: Mail merge

2.4 Profiling and Document Interoperability

Profiling is a well‐known concept in standardisation. In case a standard is too complex or not intended to be implemented as a whole, self‐contained subsets of the standard are identified that guarantee interoperability between different implementations of the profile. How a subset of a standard that should be profiled is chosen depends on the given problem. The ISO concept database ‐ ISO 14772 (21) defines a profile (in the context of virtual reality modelling languages) as a “named collection of criteria for functionality and conformance that defines an implementable subset of the standard”.

As explained in section 2.2 neither OOXML nor ODF has introduced profiles within the standard. Nevertheless both standards refer to similar concepts that should improve interoperability of different implementations of one standard and between both standards. The definition of a common profile for two standards such as OOXML and ODF requires a common understanding of the standardized artefact, in our case of a “document”. As explained in section 2.3 such a common understanding can be achieved by the introduction of a document metamodel. The feature based approach introduced in this paper can be used to define such a metamodel.

Assuming the existence of a standardized, agreed metamodel with standardized mappings to OOXML and ODF, it is possible to define translation rules for common profiles between the two International Standards. Following the ideas of “Model Driven Architecture” (MDA), the metamodel corresponds to a “Platform Independent Model” (PIM) that can be mapped to two different “Platform Specific Models” (PSM). In case a reverse mapping from PSM to PIM exists, a PSM document can be analysed and re‐mapped to a PIM document. This PIM document can then be mapped to a PSM document defined in another “platform”. As shown in Figure 9, a feature translation can be implemented by concatenating the two operations “feature detection” and “feature implementation”.

[Figure 9 sketches feature based translation: a feature f in format A is located via feature detection (1) against the document metamodel and then realised in format B via feature implementation (2); the concatenation of both operations constitutes the feature translation.]

Figure 9: Feature based document translation

Section 3 shows how feature based document profiles can be defined and used to define translation rules utilizing feature detection and implementation functions. It has been out of the scope of the TransDok project to define a document metamodel or interoperable document profiles. The project focuses on the development of concepts and tools to specify such profiles and especially to identify the characteristic features of given documents. The specification of such profiles is a typical task of standardization bodies or application domain specific communities.

One lesson learned from MDA approaches is that reverse engineering from a PSM to a PIM is eased by the inclusion of trace information from the PIM→PSM mapping in the PSM artefact. The introduction of such “feature annotations” in document formats would help a lot to achieve interoperability with respect to a given PIM document metamodel. Unfortunately, neither OOXML nor ODF supports such annotations. Such feature annotations are not necessary in case a document format supports a feature in a native and unambiguous way. In case a feature has to be “implemented” in a document format, i.e. the feature has to be composed of other, native features, an associated annotation is necessary to detect the semantics of the implementation during the feature detection process. A typical interoperability profile for OOXML and ODF will probably consist of features that are natively supported by both document formats.

The introduction of an object based or even object oriented document metamodel would go one step beyond the introduction of a feature based metamodel. An object based metamodel allows defining an object (document) as a set of objects (parts) together with the operations used on these objects. Such a document can be stored following the taxonomy of the object model and easily be mapped to existing document formats such as OOXML or ODF. Again, these ideas go far beyond the scope of the TransDok project.

2.5 Tools and Languages

The project develops a prototypic implementation of the approach presented in this paper strictly utilizing standardized XML technologies such as ISO Schematron (18), XProc (22) and XSLT (23).

2.5.1 Document Packages

ODF and OOXML are both file formats that use ZIP archives as containers. These so-called packages contain sets of XML files. In order to check the profile conformance of a document, the content of the corresponding container has to be analysed. Common XML technologies were developed to operate on single XML instance documents. Consequently, a validation of full packages makes it necessary to validate multiple XML documents contained in the package. In order to create an overall, cumulated validation result, sequences of validation and transformation steps have to be performed. The XML pipeline language XProc (22), which has been a W3C Recommendation since May 2010, allows the composition of such processes.

Even though an in-place validation of ODF and OOXML is possible by using custom URI resolvers, e.g. ones supporting the non-standardized JAR URL syntax introduced by Oracle (24), the project used XProc to transform packages into a simple “envelope format”, very much like a suggestion made by Rick Jelliffe on the ISO/IEC JTC 1/SC 34 mailing list:

„I think that all that is needed is a simple vocabulary with zip:archive, zip:folder and zip:entry. Non‐XML files could have an empty file with their name, to allow validation that a link points to the appropriate media etc.“ (25)

This flat representation of a package, i.e. a single XML instance document, has the advantage that common XML transformation and validation technologies can easily be adopted.
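A minimal sketch of such an envelope is shown below; the element names zip:archive and zip:entry follow the quoted suggestion, while the namespace URI, the name attribute and the embedded part shown are illustrative assumptions:

  <!-- Hypothetical flattened package; namespace URI and attribute names are assumptions -->
  <zip:archive xmlns:zip="urn:example:zip-envelope" name="letter.odt">
    <zip:entry name="mimetype"/>
    <!-- non-XML entries stay empty, as suggested in the quote -->
    <zip:entry name="Pictures/logo.png"/>
    <!-- XML parts are embedded in place, so that XPath, XSLT and Schematron
         can address the whole package as one XML instance document -->
    <zip:entry name="content.xml">
      <office:document-content
          xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0">
        <!-- document content of the ODF package part -->
      </office:document-content>
    </zip:entry>
  </zip:archive>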


3 Methodology

The specific scope of the TransDok project is the application of document profiling ideas to typical documents that can be found in the German public sector. For this reason, interviews and a workshop with representatives from German municipalities, federal states and ministries have been performed in the first phase of the project. Typical document types identified in this phase are applications, minutes, offers, letters, invitations etc. Unfortunately, the number of documents submitted to the project was too small for statistical analyses. Surprisingly, the majority of the documents were stored in Microsoft’s old binary formats; only a few documents used OOXML or ODF. For this reason, an Internet search (crawling) for the identified document types in specific German domains has been performed to retrieve a sufficient number of documents.

These documents have been analysed with respect to a subset of important document features. The tools developed in the project support this analysis using associated feature lists. In addition they support the definition and inspection of profiles. One major result of the technical work in the project was the insight that a mathematical validation of the idea to use feature based profiles to support document conformance, portability and interoperability is of high importance. Feature based profiles can only be used if they allow separating different document types based on characteristic features. As a conclusion, it seems meaningful to express the membership of a document to a profile using a statistical likelihood instead of a binary decision. More details about this approach are given in section 5.

3.1 Definition of Document Features

The definition of document features in the TransDok project has been done in three steps. In a first step, the documents that have been identified and submitted in the interviews and workshop have been analysed to identify domain specific features. For example, official minutes have to support features such as headings containing text and graphics, tables, change tracking or digital signatures. In a second step, these features have been compared with the features identified in ISO/IEC TR 29166. As a result, the domain specific features have been mapped to associated document features that are supported by OOXML and/or ODF. In the third step, associated detection functions for these “feature candidates” have been defined using XML technologies. The set of feature candidates was used to define the feature list that itself was used as one input for the Feature List Generator. Details are explained in section 4.

3.2 Feature Based Profile Definition

In order to formalize the profile definition and validation of documents, some corresponding mathematical artefacts have to be defined. These definitions are based on the work presented in (26).

Let D_ODF and D_OOXML be defined as the sets of all conformant documents according to the conformance definitions in ODF and OOXML, respectively, and let D denote the set of all documents.

A standard validator is a function

  v: D → {0, 1}

that decides whether a given document d is conformant to ODF or OOXML, respectively (v(d) = 1), or not (v(d) = 0).

An interoperability subset IS for ODF and OOXML is a subset of interoperable documents for which a translation function t to a “similar” document satisfying the other standard exists:

  IS ⊆ D_ODF ∪ D_OOXML,  ∀ d ∈ IS ∩ D_ODF ∃ t: t(d) ∈ D_OOXML,  ∀ d ∈ IS ∩ D_OOXML ∃ t: t(d) ∈ D_ODF.

Analogous to the validation function, a profile validator is a function

  v_P: D → {0, 1}

that decides whether a given document d is in IS or not. All documents d ∈ IS are interoperable, i.e. translatable from one format into the other.

If the elements of an interoperability subset use similar features, these features and the corresponding functionalities can be used to define the associated feature profile P. The document features used to define the profile are derived from the feature list developed in ISO/IEC TR 29166, introduced in section 2.3.1.

Figure 10: Interoperability subset and sample feature profile

Assume F denotes the common set of all document features and associated functionalities for the document formats ODF and OOXML, as introduced in Table 2. The feature detection function

  φ: D → 2^F

returns the set of the feature names that are used by a document d, e.g. plain text, footnotes and headers. The profile definition function

  π: 2^D → 2^F

returns the feature set that is used by the documents contained in the interoperability subset IS and thereby defines the properties of the corresponding profile P:

  P := π(IS) = ⋃ {φ(d) : d ∈ IS}.

To check whether a given document d is an element of the profile P, it has to satisfy the following equation:

  φ(d) ⊆ π(IS) = P.

Using the concepts introduced above the following steps are necessary to define a document profile:

• Define a feature set F_P ⊆ F that can be used to characterize a document category like letter, invoice, application etc.
• This feature set defines an associated feature profile P.
• Provide a standard validator to check if a given document d conforms to ODF respectively OOXML: v(d) = 1?
• Provide a profile validator to check if the given document conforms to the profile definition: v_P(d) = 1?

Assume the set of all features F is defined by Table 2. To identify all features and functionalities that can be used to define the characteristic feature set of a document category, it is necessary to provide a feature detection function φ and a corresponding profile definition function π. The project has implemented exactly these kinds of functions and validators to be able to validate the concepts explained above.
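A minimal worked example may help to read these definitions; the feature names and the two documents d_1 and d_2 are illustrative assumptions, not taken from the project data:

\[
\begin{aligned}
F &= \{\text{plain text},\ \text{footnote},\ \text{image},\ \text{table}\},\\
P &= \pi(IS) = \{\text{plain text},\ \text{footnote},\ \text{table}\},\\
\varphi(d_1) &= \{\text{plain text},\ \text{table}\} \subseteq P
    \quad\Rightarrow\quad v_P(d_1) = 1,\\
\varphi(d_2) &= \{\text{plain text},\ \text{image}\} \not\subseteq P
    \quad\Rightarrow\quad v_P(d_2) = 0.
\end{aligned}
\]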

3.3 Profile Inspection

3.3.1 Binary Membership

The standards ISO/IEC 26300 as well as ISO/IEC 29500 use keywords to distinguish between different levels of obligations for normative clauses, as defined in Annex H of the ISO/IEC Directives Part 2 (27).

• may (may not)
• shall (shall not)
• should (should not)
• can (cannot)

In the definition of feature sets, this keyword concept has been applied. Table 3 illustrates the extension of a feature set to a 3-tuple F_P := (F_may, F_shall, F_shall_not), supporting a distinction between features that may, shall and shall not be used by documents of the profile P. This concept implements a kind of whitelist/blacklist approach.

Table 3: Features that shall, shall not or may be used by documents in P

Obligation                      Formalization
May use the feature(s)          f ∈ F_may imposes no constraint on φ(d)
Shall use the feature(s)        F_shall ⊆ φ(d) must hold for v_P(d) = 1
Shall not use the feature(s)    φ(d) ∩ F_shall_not = ∅ must hold for v_P(d) = 1

The enhanced profile validator function is able to consider these keywords. It is obvious that the profile definition function π has to be improved accordingly. The implementation of the definition function delegates this task to the user.
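The obligation keywords map naturally onto assertions in a rule language such as ISO Schematron, which is among the technologies used by the project (see section 2.5). The following is a minimal sketch of such a profile check over a (flattened) ODF document; the profile content ‐ a letter that shall contain paragraphs, may contain tables and shall not contain embedded images ‐ is an illustrative assumption:

  <sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron">
    <!-- namespaces of the ODF markup used in the detection expressions -->
    <sch:ns prefix="text" uri="urn:oasis:names:tc:opendocument:xmlns:text:1.0"/>
    <sch:ns prefix="draw" uri="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"/>
    <sch:pattern id="letter-profile">
      <sch:rule context="/">
        <!-- shall use: at least one paragraph must be present -->
        <sch:assert test="count(//text:p) &gt; 0">A letter shall contain at least one paragraph.</sch:assert>
        <!-- shall not use: embedded images are forbidden -->
        <sch:assert test="count(//draw:image) = 0">A letter shall not contain embedded images.</sch:assert>
        <!-- may use: tables need no assertion, they are simply permitted -->
      </sch:rule>
    </sch:pattern>
  </sch:schema>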


3.3.2 Statistical Membership

Considering a given document d and a given profile P, the question whether the document is an element of the profile seems reasonable. But what happens if the same document is also a member of a second profile Q? Is it possible that a document is a letter as well as a report? This raises several new questions:

• Is it possible to define a metric for the difference between two profiles? Profiles should be selective!
• Is it possible to estimate the difference between statistical noise and a given profile? Again, the profile should be selective.
• Is it possible to state that a document belongs to a profile or not? Or is it better to state that a document belongs to a profile with a likelihood of x%?

Because the entire idea of defining feature based profiles depends on the selectivity of profile definitions, the project has decided to spend considerable effort on the implementation of feature selection functions and to evaluate the resulting feature sets using mathematical methods. The results of this work are presented in section 5.


4 Technical Details

Profile validation consists of three steps. Since only standard conformant documents can be contained in an interoperability profile, the first step is the schema validation of a given document with an already existing ODF or OOXML validator. The second step is the generation of the list of features that are used in the document. In the profile validation step the feature list is compared to the profile and a validation result is returned.

It has to be noted that the result of the first step is not unambiguous. As stated in section 2.2, conformance consists of schema validity as well as of syntactical and semantic constraints that are given in a non-formal, textual representation. For this reason the set of standard conformant documents in Figure 10 is not well defined. Unfortunately the interoperability subset is also not well defined, because the definition of translation rules between ODF and OOXML depends on special assumptions and a given application context. There are no standardized transformations between both formats, and probably such rules will never exist.

In conclusion, it is of great importance to have a “good” definition of feature based profiles that allows checking:

• Schema validity
• Syntactical and semantic validity
• Interoperability
• Membership to a document category

4.1 Feature List Generator

A single XML document allows generating lists of features used in a specific document by applying a suitable XSLT style sheet. Since the creation of such a style sheet is not trivial, we introduce a simplified XML language for feature definitions. First, a feature has a unique name (e.g. “footnote”, “header” etc.). This name is associated with a number of format dependent detection functions. A detection function is expressed using XPath. The following table shows some examples of features and related detection functions that can be used to verify the usage of images, tables and footnotes in an ODF or OOXML document.

Table 4: Examples of features and related detection functions

Feature     Standard         Detection function
Image       ISO/IEC 26300    //odf-draw:image
            ISO/IEC 29500    //ooxml-a:blip
Table       ISO/IEC 26300    //odf-table:table
            ISO/IEC 29500    //ooxml-w:tbl
Footnote    ISO/IEC 26300    //odf-text:note[@odf-text:note-class='footnote']
            ISO/IEC 29500    //ooxml-w:footnoteReference
…           …                …

The introduction of a description language for document features and related detection functions necessitates the creation of a compiler or converter. In our implementation a feature description document is transformed by an XSLT processing scenario. In a next step the resulting XSLT style sheet (the output of the previous transformation) can be used to calculate the list of features detected within the flattened document.

Figure 11: An example architecture for a Feature List Generator

The component for the creation of feature lists can not only be helpful in the context of document profiling. Usually the analysis of document round-trips (a common method for interoperability tests) implies a lot of manual work. A feature list generator can automate some essential checks regarding the persistence of specific characteristics of a document (e.g. is a table of contents still contained in a document after saving it with another application?).

The feature definitions follow the XML‐schema shown in Figure 12.

Figure 12: Schema definition of document features

A feature definition consists of the name of the feature and, for every standard that implements the feature, a detection function. This function is an XPath expression that defines how the feature is implemented in the standard. Because ODF as well as OOXML consist of several XML documents using their specific namespaces, the feature definition schema allows defining these namespaces and using them in the XPath expressions. The following list shows the definition of the features footnote and endnote.
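As a hedged illustration of such definitions (the project stores them in its own XML feature-definition language and compiles them to XSLT; the Python/lxml form below, including the namespace bindings assumed for the odf-text and ooxml-w prefixes, is only a sketch, not the project's listing):

    # Assumed illustration: footnote/endnote detection functions per standard,
    # evaluated directly with lxml instead of the generated XSLT style sheet.
    from lxml import etree

    NAMESPACES = {
        "odf-text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
        "ooxml-w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
    }

    FEATURES = {
        "footnote": {
            "ISO/IEC 26300": "//odf-text:note[@odf-text:note-class='footnote']",
            "ISO/IEC 29500": "//ooxml-w:footnoteReference",
        },
        "endnote": {
            "ISO/IEC 26300": "//odf-text:note[@odf-text:note-class='endnote']",
            "ISO/IEC 29500": "//ooxml-w:endnoteReference",
        },
    }

    def count_features(flattened_xml_path, standard):
        """Count every defined feature in a flattened ODF or OOXML document."""
        tree = etree.parse(flattened_xml_path)
        return {name: len(tree.xpath(xpaths[standard], namespaces=NAMESPACES))
                for name, xpaths in FEATURES.items()}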

Due to the fact that the project has been executed in Germany, the names of feature categories and features have been defined using German terms. For this reason the examples shown in the remaining part of this section use these German terms. The feature list generator supports a German-English dictionary as an additional input to translate the names between both languages.

4.1.1 Using the Feature List Generator

The feature list generator supports two operating modes. As shown in Figure 13 the features of a single document can be detected, based on a given feature definition list.

Figure 13: Feature list generator analysing a single document

The generator returns some statistical information about the document’s metadata, see Figure 14.

28 29 Technical Details

Figure 14: Metadata information

The main output is a summary of all features detected in the document as shown in Figure 15.

Figure 15: Summary of document features

In the second operation mode the feature generator can be used to analyse all documents stored in a given directory. When the analysis has finished, the generator allows selecting a single document and showing its features as shown in Figure 16 and in Figure 18.

Figure 16: Selection of documents and display of their feature lists

The generated feature report for all documents can be shown within the feature generator, as shown in Figure 17, or it can be stored as a colon separated list. This list contains, for each document and each feature, the number of occurrences within the document, as shown in Figure 19.

29 30 Technical Details

Figure 17: Feature statistics for a set of documents

Figure 18: Detailed statistics about occurrence of document features for a single document

As shown in Figure 19, the feature list contains the summary of all features in all documents. In a colon separated list the following information is provided:

• Name of the feature
  o Name of the feature related functionality (refer to Figure 8)
• Absolute number of documents using this feature/functionality
• Relative number of documents using this feature

Figure 19: Feature summary of all documents

30 31 Technical Details

These values have been used for the mathematical analysis of feature based profile definitions presented in section 5.
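A small sketch of how such a folder-level summary can be derived from the per-document feature counts (illustrative code, not the generator's implementation):

    # Illustrative aggregation of per-document feature counts into the absolute and
    # relative number of documents that use each feature/functionality.
    from collections import Counter

    def summarize(per_document_counts):
        """per_document_counts: one dict {feature name: occurrences} per document."""
        total = len(per_document_counts)
        docs_using = Counter()
        for counts in per_document_counts:
            for feature, n in counts.items():
                if n > 0:
                    docs_using[feature] += 1
        return {feature: {"absolute": docs, "relative": docs / total}
                for feature, docs in docs_using.items()}

    print(summarize([{"Fußnoten": 7, "Tabelle": 0}, {"Fußnoten": 0, "Tabelle": 3}]))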

4.2 Profile Definition and Checking

The architecture of our profile validation component is similar to the one previously presented. Instead of directly creating a validation schema for each profile, we introduced a simple XML based language which can be used to express the characterizing features. It also allows the distinction between levels of obligation (may, shall, should etc.). Consequently, in our implementation a profile is a list of allowed and possibly disallowed feature names. This decouples the definition of profiles from the standard dependent definition of feature detection functions.

Table 5: Examples of a simple profile definition

Feature     Level of Obligation
Image       shall
Header      may
Footer      may
Table       shall
Footnote    may
…           …

Whereas the feature definitions were translated to an XSLT stylesheet, profiles are going to be mapped to ISO Schematron files. ISO Schematron is a rule based validation language that can easily be integrated into an XProc pipeline.

Running this validation schema against a feature list produces a validation result. Violations of rules with the obligation levels “should” and “should not” are reported as warnings.

Figure 20: An example architecture for a profile validator

It has to be considered that this way of profiling – even though it may not be obvious – is strongly exclusion oriented. Features that are not explicitly mentioned in a profile are inherently forbidden. This has to be taken into account when translating a profile to a Schematron file. Therefore the profile to Schematron translator needs to access the feature definition list in order to create a “shall not” rule for unmentioned entries.
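The closed-world expansion can be sketched as follows (an illustration of the rule described above, not the actual profile-to-Schematron translator):

    # Illustrative closed-world expansion: every feature from the full feature
    # definition list that the profile does not mention gets a "shall not" rule
    # before the profile is translated into Schematron.

    def expand_profile(profile, all_feature_names):
        """profile: dict mapping feature names to 'may', 'shall', 'should', ..."""
        expanded = dict(profile)
        for feature in all_feature_names:
            expanded.setdefault(feature, "shall not")
        return expanded

    profile = {"Image": "shall", "Header": "may", "Table": "shall", "Footnote": "may"}
    print(expand_profile(profile, ["Image", "Header", "Table", "Footnote", "Macro"]))
    # "Macro" is not mentioned in the profile and therefore becomes "shall not"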

4.2.1 Definition and Testing of a Profile

As shown in Figure 17, the feature generator provides an overview of all features used in a set of documents. The next step in the utilization of the generator is the generation of a profile using this statistical information. As shown in Figure 21, a user can define and save a profile utilizing the statistical information and defining the levels of obligation manually.

Figure 21: Profile definition

Following Figure 13, the generated profile can be used to check the affiliation of a document with the profile definition. The profile itself is defined following the simple XML schema depicted in Figure 22 that implements the profile attributes described above.

Figure 22: Schema definition of a document profile


5 Profile Evaluation

The project tests several profile definitions to assess the extent to which XML-based profiles can be utilized to classify documents into predefined types without using any further information. We find that the average accuracy of classification algorithms reaches levels above 70%, making these approaches a viable complementary option to improve classification of documents. Due to the comparably low computational cost of these methods compared to text-mining approaches, a test array using multiple methods could be used as a pre-selection mechanism or supply additional information for multi-method approaches to classification or profiling problems.

The recent move towards non-binary, XML-based document standards has opened up new opportunities for profiling document types according to the structure of descriptive XML tags used. The objective of the project is to use and identify such XML tags, which we in the remainder of this paper will refer to as “features”, that can help to differentiate documents according to the different purposes these documents are used for, regardless of their semantic content. This approach differs from the usual text-mining approaches, which would focus on such semantic content, as it provides a clear set of potential features, thereby being robust against what is sometimes referred to as the “linguistic problem”, which is due to the imprecision of language and the difference between message and medium. A simple example to illustrate such a linguistic problem is the difference in meaning of words in different contexts, such as the word “light”. The concept of “light” characterized as the physical phenomenon of wavelengths of a certain frequency is clearly different from such concepts as the “light of knowledge” or the even more complex meaning of “light my fire”, which a naïve linguistic algorithm might contextualize as a property of the light-emitting phenomenon we call fire. Linguistic models therefore rely heavily on supporting information such as cues or thesauri of similar meaning. The size and complexity of such thesauri can reach enormous levels, which makes them computationally “costly” in terms of processing of data. Using features derived from XML tags has a number of rather convenient implications.

• First, the number of features is limited, providing researchers with the opportunity to concentrate on the selection of features that differentiate between document types or document profiles.
• Second, the data is easily extractable using rule-based algorithms such as the method developed in the project.
• Third, comparing the number of potential features n(C) to the number of words, concepts or phrases n(L) and the fact that for large datasets n(C) < n(L), the combinatorial challenge posed in linguistic models is drastically reduced.

Just to provide a perspective on this fact: The number of individual words used in this paragraph up to this sentence, n(L)=197, already exceeds the maximum number of characteristics, max(C)=191, on which the analyses in the project are based.8 The goal now is to assess if using this limited number of features can produce a viable set of document vectors to improve the chance of classifying documents. For the sake of simplicity as well as to use a larger amount of different methods we will

8 It is fair to say that the use of stopwords, extraction of descriptive word classes such as nouns or identification of phrases would reduce this number. Still, the number of individual words after applying stop words to a corpus of documents grows very fast with corpus size according to Zipf's Law (33).

limit most of our analyses to a two type setting, i.e. we will use two corpora9, each representing one distinct document type. Also, the interpretation of the classification results is easily benchmarked against an intuitive benchmark: the “proverbial coin toss”. This is a 50:50 benchmark which would represent classification by pure chance into one of the corpora. This obviously requires that the profiles or document types are disjoint and no document can belong to more than one type. As we limit our analyses to approaches using XML characteristics, we are unable to assess the potential of these approaches to outperform linguistic analyses on the same documents but rather aim to exemplify if these approaches might be used in at least a complementary fashion.

In the remaining parts of this section we will give a brief description of the collection and pre-processing of the data as well as of a testbed we constructed to assess the overall accuracy of these classification methods. This testbed method will be used to assess a number of common classification approaches, ranging from simple methods such as tests based on individual features to elaborate modelling approaches such as Neural Networks, Support Vector Machines (SVM) or flavours of Discriminant Analysis. A concluding discussion summarizes the results, provides an overall assessment of such methods for classification of document types and sketches out some potential use cases.

5.1 Dataset and Pre-processing

The document sets used for the assessment were collected from the World Wide Web using simple descriptive queries to Google as described in section 3. The resulting data was then screened to assess the validity of the data with respect to the document classes, removing documents not corresponding to the document types from the data.

A total of seven document type corpora have been constructed:

• Reports from research projects (N=761) • Proposals for research projects (N=200) • Descriptions of research projects (N=626) • Letters (private and business) (N=844) • Meeting and event protocols (N=183) • Curriculum vitae (N=1146) • Invoices (N=50)

The features of the documents were then extracted using the method described in section 4.1, resulting in files holding the total number of features found for each document processed (see Table 6). The features can be used at different levels of abstraction as described in section 2.3. We used the second level of features (functionality), giving us a total of 119 characteristics to use in our analysis.

Table 6: Example10 of raw data produced by the feature generator

Feature                 Functionality                                       Counts
Text formatting         Enable/disable spell checking for run/span          2
Paragraph formatting    Text alignment (left/right/centered/justified)      34/0/234/169
                        Line height                                         150
Header and Footer       Footer with page numbers                            1
                        Properties                                          4
                        Formatting                                          2
Table                   Table properties                                    4
                        Text alignment                                      261
                        Row settings                                        67
                        Cell settings                                       377
                        Borders                                             3
                        Background                                          371
Footnotes                                                                   7
Metadata                Application name                                    1

9 In linguistics a corpus is defined as a large and structured set of texts.
10 Test file “File 096_Pascucci_www.aep.wur.nl.docx.xml”. See also section 4.1.1.

The resulting CSV files were further processed with simple regular expressions in Perl to make them accessible to statistical software.11

Before commencing with the statistical analysis of the data it is worthwhile to take some characteristics of the distribution of the features into account, assessing which features might be suitable for the analysis. The “sensitivity” of classification schemes can be negatively impacted by rare events, such as features that are only present in very few of the documents. An example for this is the feature “text format – blinking”, a feature which, apart from being an object of frequent mockery in other contexts such as HTML, is also an enormously rarely used feature in text formats. In fact, the expected chance of finding a document with this feature is one document in 5,000 or p = 0.02%, based on our data. Naive approaches as well as some recursive approaches, i.e. those that do not take the very low probability of such events into account, could attribute a high differentiating effect to this single variable. Vice versa, features that are omnipresent (p=1) in documents, such as the characteristic “text formatting”, will hardly be a good source of differentiation for any classification scheme. Figure 23 illustrates this by ranking the characteristics in decreasing order by the share of documents they appear in. The most striking result is that a large number of features are rather rare, with nearly half the features (N=15, 42.8%) only appearing in 5% of all the documents collected. This fact as well as the log-linear relationship between rank and share of the features calls for a multivariate approach, as we will see in the simple example of testing on a per feature basis (see 5.3.1).

11 We used the statistical environment R (r‐project.org) to perform all of the data analyses.


Figure 23: Relevance of features for the characterization of documents

Apart from the distribution of identified features, the absolute number of occurrences of a feature in each document can be both a source of information as well as a source of potential error for classification algorithms. Similar to the distribution of features found in documents, some features are also log-linearly distributed in terms of how often they can be found in a document. Even though this is not true for all features, some central features that can help differentiate types if used in a binary scaling are heavily skewed to the right as well as being heavily zero inflated when used as interval scaled variables. One example is “footnotes” for the document type “report” (see Figure 24).

Figure 24: Distribution of the number of footnotes in reports

On the one hand, 70% of the report documents feature no footnotes (N=539), while methods using measures based on means as a basis for classification will be heavily influenced by large numbers of footnotes in documents. Using measures such as quartiles, mean and median to describe such a distribution illustrates this even more drastically (see Table 7). The mean is heavily influenced by the outliers in the distribution. Also, commonly applied fixes such as using methods based on the median would provide no useful information as the median is 0. We are therefore presented with two challenges using the interval scaled representation of the data. The mean is clearly not robust due to outliers; models using the median would in this case “collapse” to a binary model (footnotes vs. no footnotes). We therefore expect that models that properly acknowledge binary data as a basis of analysis will outperform those using the more information rich interval scaled data.

5.2 Testbed Specifications

Some of the methods used operate on the level of individual documents rather than comparing proportions or centrality measures for the different document classes. To assess the robustness of the classification we constructed a testbed that performs a number of steps of shrinking the number of features according to specific criteria required by a majority of classification models and in a next step randomly selects a training data set from the total population of both corpora. The training data was then used to construct the model specific classifiers, which were then used to classify the test data. Training and test data were disjoint sets, i.e. none of the cases in the training data were used in the test data. To allow for replication as well as an unattended operation of the testbed, characteristics reduction was performed in seven steps based on the training data selected for each run. In total the following cases were excluded from the testbed:

• Features that are not present in the total training data, i.e. features that are present in none of the documents.
• Features that are present for every case of the training data, such as the characteristic “text”, which was found in all the cases and therefore will not produce a benefit for accuracy.
• Features which show a high level of collinearity (above a critical number of Pearson or Spearman correlations with a coefficient of .5 to other characteristics).
• Features below a threshold t (t=10) of occurrences in the training data set, i.e. all features that appear in fewer than t documents of the whole training data set.

This approach was done on a per case basis, i.e. for each individual model run. The overall performance of the classification methods was assessed by applying the classifying function derived from the training data to predict the document type for the documents in the test data. The accuracy is the share of documents classified into the “correct” class based on the ex-ante group membership attributed in the data collection process. Each classification method was run 100 times using different training and test data, resulting in a distribution of the accuracy reached in each run. This distribution was then analysed using measures of centrality (mean and median accuracy) as well as the spread of accuracy using the variance over the individual accuracy values.12 In the next section we will use different methods of classification to check how they perform on the task of identifying document types. In the cases where we used our testbed specification we will use the document types “report” vs.

12 All accuracy distributions were tested for symmetry using Kolmogorov-Smirnov tests.


“curriculum vitae” as basis for our analysis. We will use 700 documents of each type so as to minimize the influence of the prior distribution of cases for some of the methods.13
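The testbed can be sketched roughly as follows. The project performed this analysis in R; the scikit-learn based fragment below is only an assumed, simplified equivalent (the thresholds t=10 and |r| > .5 follow the text, the feature matrix X and the 0/1 document type labels y are placeholders):

    # Simplified testbed sketch (assumed scikit-learn substitute for the R setup):
    # reduce the features on the training split, fit a classifier, score the test
    # split, and repeat to obtain a distribution of accuracy values.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    def reduce_features(X_train, t=10, r_max=0.5):
        """Indices of features that are neither absent, omnipresent, rare nor collinear."""
        n_docs = X_train.shape[0]
        present = (X_train > 0).sum(axis=0)
        kept = []
        for j in range(X_train.shape[1]):
            if not (t <= present[j] < n_docs):
                continue
            if any(abs(np.corrcoef(X_train[:, j], X_train[:, k])[0, 1]) > r_max
                   for k in kept):
                continue
            kept.append(j)
        return kept

    def testbed(X, y, make_classifier, runs=100, test_size=0.3):
        accuracies = []
        for seed in range(runs):
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, test_size=test_size, stratify=y, random_state=seed)
            cols = reduce_features(X_tr)
            clf = make_classifier().fit(X_tr[:, cols], y_tr)
            accuracies.append(clf.score(X_te[:, cols], y_te))
        acc = np.array(accuracies)
        return acc.mean(), np.median(acc), acc.var()

    # e.g. with toy data standing in for the binary feature matrix:
    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(1400, 40)).astype(float)
    y = np.repeat([0, 1], 700)                      # 0 = report, 1 = CV
    print(testbed(X, y, lambda: LogisticRegression(max_iter=1000), runs=5))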

5.3 Classification Approaches

This section comprises the main results of our work, namely the actual use and performance of different methods to classify documents into predefined types. In some of the cases [Fisher's Exact Tests (5.3.1), Cluster Analysis (5.3.2), Logistic Regression (5.3.3) and Recursive Partitioning Trees (5.3.4)] we will not use the testbed as described above. Reasons for this are provided in the individual sections. Yet, for the sake of breadth we include the methods to inform about shortcomings and potentials of these approaches. Also we will not provide detailed mathematical accounts for each model but rather point to the relevant literature as well as the implementations we used in our tests. We will, though, provide a very brief description of how these methods are usually employed.

5.3.1 Fisher Exact Tests

The first and simplest methods used in this section do not per se aim at a classification based on a multivariate approach. Rather they aim at supporting the identification of proportions of features as they occur in different document types. Both Fisher's exact test as well as the proportion test operate on a per feature basis, i.e. the test only includes information about the relative occurrence of a feature in a certain document type, i.e. in how many of the documents of a certain type the feature occurs. The methods therefore imply a mind-set where we only know about the relative occurrence of one single feature for the two types in question, being ignorant about the distribution of the other features. Yet, this is relevant as we can see if there is any chance at all that the features will help us differentiate different document types. It can be considered an easy and quick pre-test that can be enriched by the multivariate analyses conducted in later sections.

In a nutshell this means: If we do not find any significant differences using these simple bivariate tests, subsequent analyses might not be worth further effort. Using the “report” type as a reference we can now assess if there are significant differences to other types using only one category at a time. The results are visualized in Figure 25. Black cells denote significant differences (p<0.05). The number of black cells implies that there is a large number of potential features that could be used for a classifier. It has to be noted, though, that there is a large amount of collinearity between the features, which would negatively impact a classifier. Moreover, these statistics are based on the aggregate level of the document types. They must not be confused with the subsequent models which operate on a document level. Still, the results show that on an aggregate level there are indeed substantial differences between the types which warrant further investigations. Yet, the robustness of this approach might be low due to the distribution of features as discussed in section 5.2.

13 Some algorithms such as the Discriminant Analysis take into account the prior distribution, i.e. the algorithm will be sensitive to the proportion of the documents in each class. A 70:30 distribution in the test data might load the deck by taking this ratio into account. In a real life situation this might be beneficial, for instance for a use case where this distribution might hold valuable information. As we want to have a fair benchmark we try to eliminate this effect. As the selection of the test data is random we might experience slightly skewed ratios, which should ameliorate due to the repeated testing using each method.


Figure 25: Differences of feature occurrences between reports and other document types

5.3.2 Cluster Analysis and Heatmaps

Another way of pretesting for the aggregate distribution of features in the different types incorporates cluster analysis and visualization techniques such as heatmaps. Similar to the techniques used in the previous section, cluster analysis is not a method that is geared towards being used as a classification approach.14 It can be used to check how strongly a set of vectors, which in our case are the features of a document, can be used to determine how “closely related” different document types are. This goes beyond the simple bivariate correlation or testing approach used in the previous section, as cluster analysis puts a strong focus on the interrelation between the vectors as well as between the vectors and document types. In our case cluster analysis can be used in two ways. First, we can assess how the document types cluster into different categories based on the distribution of features. Second, we can assess how the features are interrelated by how they are distributed between document types. In the first sense, we can qualitatively use this method to assess if the results of a clustering based on features make intuitive sense, i.e. if a prior assumption of relationships between document types is represented in the cluster data. In the second sense we get an impression of bundles of features that differentiate between our document types. As in the case of the Fisher Exact Tests we use a matrix of shares of features as they appear in the different document classes.

Figure 26: Heatmap of document properties for different document types

We visualized the results using a heatmap in Figure 26. The heatmap visualization has some advantages in our case as it summarizes both cluster analyses, i.e. clusters of document types as well as clusters of features, and also provides a visualization of the overall distribution of the data in terms of shares of features in document types. Noteworthy in this context is the distinct clustering of

14 Cluster analysis can be used to determine the extent to which individual documents that previously have been assigned to one document type are dispersed over a number of different clusters. Yet, as the cluster analysis does not include such information we will not go into great detail in this paper. This difference has a name: methods that require a specific response variable such as document types are generally referred to as “supervised”; methods such as cluster analysis or Multidimensional Scaling that do not use such response variables are called “unsupervised”. Also, it is worthwhile to note that this is not necessarily due to the k-Means estimation used. There are k-means based classification estimators such as k-nearest neighbour. Yet, we will not use this method in this paper.

the document types that refer to research projects, more precisely, the “report”, “proposal” and “project description” document types.

This can be seen as partial evidence for our implicit hypothesis, namely that features not only allow differentiation into different document types but also that certain classes of document types cluster together. Apart from the research project based cluster we find that the document types “letter”, “CV” and “protocol” form a cluster, with “invoice” being a separate and distinct category.

5.3.3 Logistic Regression

Regression models for binary data, such as the Logit or Probit models, commonly serve a different purpose than classification and have their main application in designing and testing explanatory models, i.e. the focus for these models usually is on hypothesis testing or explaining through statistical inference. Just as cluster analysis, logistic regression is a multivariate technique, i.e. it takes into account information on more than one feature. Also, in contrast to the models discussed before, we apply it to the document micro data, i.e. we no longer use aggregate information on the document type such as the share of a feature but rather use the document type as the response variable, which we aim to explain through the independent variables (features). The formula that is estimated by applying the Logit regression to the training data is then applied to the independent variables of the test data. The result is a “prediction” of the document type in the test data. The outcome of this prediction for each document can then be compared to the group membership we assigned in the coding data.15 Applying our testbed specification to the data we find the results summarized in Table 7.

Table 7: Logistic regression

              Logit binary   Logit interval
Minimum       0,7033         0,5744
1st Quartile  0,7258         0,6592
Median        0,7378         0,6683
Mean          0,7354         0,6681
3rd Quartile  0,7422         0,6831
Maximum       0,7611         0,7011

The accuracy value, or more precisely the distribution of the shares of correctly classified documents for the 100 distinct test beds, has a mean of 0.73. This value implies that when using our logistic regression function we would correctly classify about 73% of the “reports” and “CVs” into their relevant classes. Mean and median are rather close, which means that the centrality of the distribution is good and the mean is rather robust. The accuracy values range between .70 (worst) and .76 (best).

15 More precisely, the Logit regression performs in a 0 vs. 1 way. In our context we do not distinguish between two document types per se but rather distinguish documents coded with one against “not one” or zero, i.e. one type is distinguished against a reference, which in our special case are documents from another class. Extensions of this approach could include increasing the reference to all documents that do not belong to a certain class or using the multinomial version of the Logit regression to compare more than two types. For the sake of comparability between the approaches we limit ourselves to the case of two document types.


Apart from the prediction we can also gather information about which of the features have a significant effect in explaining the difference between reports and CVs. We therefore have a look at the coefficients as well as the results of the significance tests that are performed as part of the Logit model. Reports have been coded as 0, while CVs have been coded as 1. The coefficients of one of the Logit models can be found in Table 8, starting out from the significant values marked by stars and dots in the final column. The more stars, the higher the level of significance. The coefficients can be found in the column “Estimate”. Significant coefficients with a negative sign imply that these features are a good indicator for predicting a document of the type “report”. Among those features we find “footnotes” and “ToC”, but also features such as “Format: underlined text”.

Table 8: Coefficients of a Logit‐model

Estimate Std. Error z value Pr(>|z|)

(Intercept) 1.13050 1.01307 1.116 0.264458

`Absatzformat - Blocksatz` -1.08279 0.29368 -3.687 0.000227 ***

`Absatzformat - Einzug bei erster Zeile` -1.11860 0.59757 -1.872 0.061217 .

`Absatzformat - Hintergrund` 0.63494 0.38247 1.660 0.096891 .

`Absatzformat - Initialen` 17.06461 3032.68502 0.006 0.995510

`Absatzformat - Linksbündig` -0.32495 0.52639 -0.617 0.537021

`Absatzformat - Positionsrahmen` -0.26728 0.31843 -0.839 0.401264

`Absatzformat - Positionsrahmen ausgerichtet` 0.18623 0.64709 0.288 0.773498

`Absatzformat - Rahmen` -0.79616 0.32541 -2.447 0.014421 *

`Absatzformat - Rechtsbündig` 0.33703 0.30085 1.120 0.262606

`Absatzformat - Schatten` 1.72049 1.93823 0.888 0.374722

`Absatzformat - Seitliche Begrenzung` 1.51501 0.56407 2.686 0.007234 **

`Absatzformat - Zeilenabstand` -0.17323 0.42023 -0.412 0.680181

`Absatzformat - Zentriert` -0.06091 0.39306 -0.155 0.876858

Änderung -14.11070 2094.43000 -0.007 0.994624

`Automatische Silbentrennung ` -0.29328 0.51981 -0.564 0.572618

Beschriftung -1.82729 0.89481 -2.042 0.041140 *

`Definition ignorable namespace` -0.16452 0.37288 -0.441 0.659050

Endnoten -17.83751 2056.23747 -0.009 0.993079

Formeln -18.11642 1963.69434 -0.009 0.992639

Fußnoten -2.85425 0.57583 -4.957 7.17e-07 ***

`Fussnote unter dem Text` -0.09927 1.14587 -0.087 0.930966

`Fußzeile mit Tabelle` 1.98897 1.03595 1.920 0.054865 .

Inhaltsverzeichnis -3.60634 1.18017 -3.056 0.002245 **

`Kopfzeile mit Grafik` 0.17623 0.68797 0.256 0.797830

`Kopfzeile mit Seitenangabe` 0.26641 0.56187 0.474 0.635389


`Kopfzeile mit Tabelle` -1.75654 0.86215 -2.037 0.041610 *

`Liste - Nummerierung` -0.65533 0.31897 -2.054 0.039928 *

`Liste - Punkte` 0.58956 0.31872 1.850 0.064348 .

Punktdiagramm -28.07322 2687.38320 -0.010 0.991665

`Tabelle - Geschachtelt` 0.82862 0.88899 0.932 0.351288

`Tabelle - Schatten und Hintergrundfarbe` 1.99534 1.28287 1.555 0.119857

`Tabelle - Textrichtungen` -20.52730 3337.09615 -0.006 0.995092

`Tabelle - Verankert` -1.84920 0.71559 -2.584 0.009762 **

`Tabelle - Wiederholung der Zeilenüberschrift` -0.54734 0.80411 -0.681 0.496076

`Tabelle - Zusammengehalten` 2.20296 0.86696 2.541 0.011053 *

`Textformat - Einfach unterstrichen` 1.43704 0.38902 3.694 0.000221 ***

`Textformat - Farbig hervorgehoben` -1.04637 0.47166 -2.218 0.026523 *

`Textformat - Fett` -0.08613 0.98443 -0.087 0.930281

`Textformat - Hochgestellt ` -0.35887 0.28636 -1.253 0.210117

`Textformat - Kapitälchen ` 0.13389 0.41705 0.321 0.748177

`Textformat - Kursiv` 0.26783 0.41696 0.642 0.520653

`Textformat - Theme bei Textfarbe ` -1.07933 0.33856 -3.188 0.001433 **

`Textformat - Tiefgestellt ` -0.25563 0.47732 -0.536 0.592270

`Textformat - Umrandung ` -0.53300 0.92233 -0.578 0.563340

Titel -0.54989 0.43576 -1.262 0.206984

Vektorgrafik 16.09244 1680.05532 0.010 0.992358

Verweis 0.23232 0.30671 0.757 0.448766

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
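The coefficient and significance output of Table 8 corresponds to a standard Logit fit. The project produced it in R; a statsmodels based sketch (with toy data in place of the real training split) would look roughly as follows:

    # Assumed statsmodels substitute for the R Logit fit behind Table 8.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    X_train = rng.integers(0, 2, size=(400, 5)).astype(float)  # toy binary features
    y_train = rng.integers(0, 2, size=400)                      # 0 = report, 1 = CV

    model = sm.Logit(y_train, sm.add_constant(X_train)).fit(disp=False)
    print(model.summary())   # estimates, standard errors, z values and Pr(>|z|)

    # significant negative coefficients point towards "report" (coded 0),
    # significant positive coefficients towards "CV" (coded 1)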

5.3.4 Recursive Partitioning Trees

Recursive partitioning trees follow a different approach. Rather than taking into account all the variables at once, the variables are put into a tree structure not unlike the more commonly known decision trees. Starting at the root node we can follow the tree through to one of the end nodes, which are linked to a prediction of one of the two classes. The advantage of decision trees is that they may use far fewer variables than approaches like Discriminant Analysis or SVM. The response value at an end node is equal to the class that holds the majority share of documents at that node. The disadvantage of such classification trees is that they are rather sensitive to sampling. It is therefore not useful to use our testbed specification on the partition tree method.
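For illustration, a single partitioning tree can be fitted and inspected as follows (a scikit-learn sketch with placeholder data; the project's trees were computed in R):

    # Assumed scikit-learn sketch: one partitioning tree, fitted once and printed
    # as human-readable decision rules (no testbed, see the caveat above).
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, export_text

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(200, 4)).astype(float)
    y = (X[:, 0] + rng.random(200) > 1).astype(int)   # toy labels, 0 = report, 1 = CV
    feature_names = ["Fussnoten", "Inhaltsverzeichnis", "Tabelle", "Kopfzeile"]

    tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
    print(export_text(tree, feature_names=feature_names))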


Figure 27: Partitioning tree

5.3.5 Neural Networks

Neural networks operate on a different level. While the Logit model uses the complete set of variables in one modelling function and the partition trees use a hierarchical sequence of binary decisions to classify, and both have a clear cut functional approach, neural networks are conceptually oriented to mimic characteristics of the networked structure of neurons in human brains. Just like the partition tree approaches, neural networks can cope with non-linearity. Generally, neural networks are a viable option in case of binary data and perform less well with ordinal, multinomial or interval data. Just as with partition trees, the power of neural networks lies in the interaction between variables. In contrast to partition trees, though, these are not limited to a single tree structure but rather form a network in which “neurons” can also send information back to neurons that are upstream towards the input nodes. Neural networks quickly get computationally costly due to the multitude of possible connections relative to the starting points and endpoints. We therefore use a simple version of neural networks: single-hidden-layer neural networks. In this case we construct a neural network with a start node for each of the features and two output nodes representing the two document types. Between input and output layer one hidden layer provides feature weights that transform the inputs into outputs. The network structure, the extent of possible interaction terms between the variables and the non-linearity of the problem make finding optimal weights challenging, and the weights can presently only be approximated using optimization algorithms such as the BFGS optimizer. The fact that there is no single solution to this optimization problem can lead to differences in results each time the same data is processed by the same neural network. In contrast to Logit models and Linear Discriminant Analysis approaches, Neural Networks are non-parametric, i.e. there is no distinct functional form per se. As with the other approaches we applied the model in our testbed. The results are summarized below:


Table 9: Neural networks

              nnet binary   nnet interval
Minimum       0,6789        0,5744
1st Quartile  0,7019        0,6036
Median        0,7189        0,6267
Mean          0,7166        0,6261
3rd Quartile  0,7306        0,65
Maximum       0,7478        0,6922
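Within the testbed sketched in section 5.2, a single-hidden-layer network of this kind could, for example, be represented by scikit-learn's MLPClassifier with an L-BFGS solver (only an assumed substitute for the nnet-style model behind Table 9):

    # Assumed scikit-learn stand-in for a single-hidden-layer network trained with
    # an (L-)BFGS optimizer, plugged into the testbed() helper sketched above.
    from sklearn.neural_network import MLPClassifier

    def make_nnet():
        return MLPClassifier(hidden_layer_sizes=(10,), solver="lbfgs",
                             max_iter=500, random_state=0)

    # print(testbed(X, y, make_nnet))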

5.3.6 Support Vector Machines

Support Vector Machines (SVM) are a second class of non-parametric methods. Yet, in the case of two classes Support Vector Machines have a similar approach to classification as the parametric Logit models, the crucial difference being the way the differentiating border, an n-dimensional hyperplane with each feature representing one dimension, is constructed to differentiate between the cases of one group and a reference frame or points of another group. While in the logistic regression case this hyperplane is computed to simply divide between data points, the SVM method additionally takes the relative distance of the points from this hyperplane into account by maximizing the aggregated distance between the data points of each group and this hyperplane. Just as in the Logit case, the classification errors result from the fact that such a perfectly differentiating hyperplane does not exist. The SVM approach usually results in a better classifier compared to Logit models at the cost of more computational complexity. The result of our testbed approach is summarized in Table 10.

Table 10: Support vector machines

              SVM binary   SVM interval
Minimum       0,6756       0,5811
1st Quartile  0,7022       0,6064
Median        0,7122       0,6406
Mean          0,7126       0,632
3rd Quartile  0,7231       0,6528
Maximum       0,7711       0,6767
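In the same testbed, the SVM classifier could for instance be represented by scikit-learn's SVC; a linear kernel corresponds most directly to the single separating hyperplane described above (again an assumed substitute, not the project's R implementation):

    # Assumed scikit-learn stand-in for the maximum-margin classifier; the linear
    # kernel matches the hyperplane description, other kernels are possible.
    from sklearn.svm import SVC

    def make_svm():
        return SVC(kernel="linear", C=1.0)

    # print(testbed(X, y, make_svm))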

5.3.7 Discriminant Analysis

Finally, we want to test two classification techniques from the family of discriminant analysis: a parametric version (LDA) and a non-parametric version (FDA).

5.3.7.1 Linear Discriminant Analysis

The Linear Discriminant Analysis (LDA) constructs a decision boundary that is based on the pooled covariance matrix of predictors to determine membership for a response variable. The goal is to construct a linear discriminant function that produces probabilities for each of the types or classes used in the analysis. The individual case is assigned to the class for which the probability of membership is maximized. The discriminant function is estimated based on a multivariate regression similar to the Logit model case, except that in the case of LDA a different estimator is applied.

45 46 Profile Evaluation

Table 11: Linear discriminant analysis

              LDA binary   LDA interval
Minimum       0,7111       0,6178
1st Quartile  0,7314       0,6489
Median        0,7394       0,6583
Mean          0,7373       0,6572
3rd Quartile  0,7453       0,6675
Maximum       0,7556       0,6856
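An LDA classifier with the equal priors discussed in footnote 13 could be plugged into the testbed as follows (again an assumed scikit-learn substitute):

    # Assumed scikit-learn stand-in for LDA; equal priors neutralize the influence
    # of the class proportions in the training data (cf. footnote 13).
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def make_lda():
        return LinearDiscriminantAnalysis(priors=[0.5, 0.5])

    # print(testbed(X, y, make_lda))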

5.3.7.2 Flexible Discriminant Analysis

The Flexible Discriminant Analysis (FDA) is a non-parametric version of a discriminant analysis. Rather than attempting to fit a linear boundary function using a multivariate parametric regression, the FDA uses a multivariate non-parametric regression. Similar to the partitioning tree approaches, the FDA can account for non-linearity and is more flexible than its parametric counterpart.

Table 12: Flexible discriminant analysis

              FDA binary   FDA interval
Minimum       0,7633       0,6622
1st Quartile  0,7731       0,6836
Median        0,7811       0,6906
Mean          0,7796       0,6906
3rd Quartile  0,7856       0,6975
Maximum       0,7956       0,7133

5.4 Synopsis

Overall, the methods applied have an average predictive accuracy well above the 70% level. Yet, there seems to be no method that drastically outperforms the other approaches, with the exception of the FDA, which scored highest and has the smallest variance in accuracy (see Figure 28).

Generally, the non-parametric Neural Network and SVM approaches performed the worst. This is surprising as those methods are usually referred to as very appropriate classifiers. The Logit model as well as the LDA have roughly the same accuracy values, which is partly due to the similar estimation approach, differing only in the way the error terms are assessed. The FDA performs best both in terms of accuracy and reliability. Due to the test setting, i.e. our two-group approach, some methods such as the LDA were not applied to their full potential, namely multi-group classification. Still, in the case used here it is safe to say that some structural elements seem to exist that indeed allow the classification of documents based on XML tags, even though there is room for improvement in accuracy.

Table 13: Comparison of evaluation methods

        Minimum   1st Quartile   Median   Mean     3rd Quartile   Maximum
FDA     0,7633    0,7731         0,7811   0,7796   0,7856         0,7956
LDA     0,7111    0,7314         0,7394   0,7373   0,7453         0,7556
LOGIT   0,7033    0,7258         0,7378   0,7354   0,7422         0,7611
NNET    0,6789    0,7019         0,7189   0,7166   0,7306         0,7478
SVM     0,6756    0,7022         0,7122   0,7126   0,7231         0,7711


The approaches applied in these contexts did not take into account the structure of the XML tags, such as the order in which the XML tags appear in the document. Especially non-parametric recursive methods such as partitioning trees, and even more drastically neural network approaches, would benefit from this additional information. Also, the problem of collinearity cannot be addressed using the classification scheme as is. Attempts should be undertaken to either aggregate features using cluster analysis, factor analysis etc. or to cluster the features intellectually into coherent profiles.

Figure 28: Comparison of evaluation methods


6 Summary

6.1 Summary of Project results

The TransDok project has introduced a methodology to define feature based document profiles for the two International Standards OOXML and ODF. These profiles can be used to ensure document portability between different document producers and consumers, which is a typical situation between collaborating parties. Additionally the profiles can be used to enable document interoperability between both International Standards and the supporting office suites.

The project has introduced XML schema to define document features and profiles. It developed a feature list that can be used to analyse typical documents found in the German Public Sector. A feature list generator has been implemented that creates a list of all features used in a given document or in a set of documents. Using the characteristic features of a document set, an associated profile can be defined and the membership of a document in the profile can be inspected.

The suitability of feature based profiles to describe and separate document types has been evaluated and demonstrated using statistical evaluation techniques.

Intermediate results of the TransDok project have been discussed in ISO/IEC SC34 WG5 meetings, in the German mirror committee DIN NA 043-01-34, in different ODF plugfests, at the 7th IEEE conference on “Standardisation and Innovation in Information Technology” (26) and during the eChallenges 2011 conference in Florence. The final results of the project are published in the report at hand and will be discussed again in ISO/IEC SC34 WG5 in the context of the definition of a new work item.

6.2 Practical Relevance

6.2.1 Technical Relevance

The technical results of the TransDok project can be used as input for standardisation bodies to include profile concepts in both International Standards OOXML and ODF. The results of the project are included in the current WG5 study period report and will probably influence the next work items in WG5.

Additionally the developers of document templates and document producers/consumers can benefit from the work to enhance document portability and interoperability. The feature lists developed in TransDok can be used as a starting point for the definition of a common document metamodel and translation rules between OOXML and ODF documents.

The technical results can be used as an important input for the work in the working group “Office Interoperability” (28) of the “Open Source Business Alliance” (29) that is going to improve the OOXML interoperability in LibreOffice (30).

6.2.2 Economic Relevance

The need to develop tools for checking the conformity of office documents and to improve interoperability between office applications continues to have a high priority. For this reason, the project results will be demonstrated in the “Document Interoperability Lab” at the


“Fraunhofer Centre for Interoperability” (31) to partners from government and industry to achieve and improve solutions for document portability, interoperability and conformance in joint projects.

Assuming the support of this work by DIN and ISO and the presentation of project results at international conferences, a basis will be set for the acquisition of new projects in which the results can be applied and further developed.

If the ideas developed in the project are applied to typical document types in the German public sector, the interoperability and portability of these documents can be enhanced significantly. The integration of the feature list generator into archiving systems will enhance the likelihood of sustainable storage of documents and reduce interoperability problems significantly.

6.3 Open Issues

6.3.1 Scientific Challenges

Due to continuous improvements and modifications of the considered standards and the increasing proliferation of electronic documents, new requirements for the profiling of document formats will probably arise in the near future. The development of profiles for the use of mobile devices as consumers of documents or for exchangeable applications and administrative decisions may be considered as examples.

Further scientific challenges and new business and marketing opportunities will arise in the future. The current International Standards OOXML and ODF are too complex to be completely implemented by the various vendors of office applications. Instead, more and more dedicated document producers will create application specific documents. The definition of application specific profiles for such documents will improve their portability and interoperability significantly. The definition of a common metamodel for both International Standards, or even for one standard, is a scientific as well as a commercial challenge, because the work must be adequately funded and accepted internationally. In addition, the consideration of further document standards will increase the complexity of such a metamodel.

It has been shown that it is easier to create a profile conformant document, for example using associated templates, than to check the conformity of a given document. The reason for this is that profiles are not disjoint and the membership of a document in a profile is ambiguous. The development of a reliable profile checker that is able to detect the profile having the maximum likelihood of membership seems to be a non-trivial task.


7 References

1. ISO/IEC JTC 1/SC34. Guidelines for translation between ISO/IEC 26300 and ISO/IEC 29500 document formats. ISO/IEC TR 29166:2011. [Online] December 2011. http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=45245. PDTR 29166.

2. OASIS ‐ Organization for the Advancement of Structured Information Standards. OpenDocument v1.0 Specification. http://www.oasis‐open.org/committees/download.php/12572/OpenDocument‐ v1.0‐os.pdf : s.n., May 2005.

3. ISO/IEC JTC 1/SC34. Open Document Format for Office Applications (OpenDocument) v1.0. ISO/IEC 26300:2006. [Online] 30. November 2006. http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=43485. ISO/IEC 26300:2006.

4. OASIS ‐ Organization for the Advancement of Structured Information Standards. OpenDocument v1.1 Specification. http://docs.oasis‐open.org/office/v1.1/OS/OpenDocument‐v1.1.pdf : s.n., February 2007.

5. ISO/IEC JTC 1/SC34. Open Document Format for Office Applications (OpenDocument) v1.0 ‐ Amendment 1 (ODF 1.1). [Online] 2012. http://www.iso.org/iso/iso_catalogue/.

6. OASIS ‐ Organization for the Advancement of Structured Information Standards. Open Document Format for Office Applications (OpenDocument) Version 1.2. http://docs.oasis‐ open.org/office/v1.2/OpenDocument‐v1.2.pdf : s.n., September 2011.

7. Ecma International ‐ European association for standardizing information and communication systems. Standard ECMA‐376 Office Open XML File Formats ‐ Second edition. [Online] December 2008. http://www.ecma‐international.org/publications/standards/Ecma‐376.htm.

8. —. Standard ECMA‐376 Office Open XML File Formats ‐ Third edition. [Online] June 2011. http://www.ecma‐international.org/publications/standards/Ecma‐376.htm. ECMA‐376 3rd edition, ISO/IEC 29500:2011.

9. ISO/IEC JTC 1/SC34. Office Open XML File Formats ‐‐ Part 1: Fundamentals and Markup Language Reference. ISO/IEC 29500‐1:2011. [Online] 2011. http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=59575.

10. —. Office Open XML File Formats ‐‐ Part 2: Open Packaging Conventions . ISO/IEC 29500‐2:2011. [Online] 2011. http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=59576.

11. —. Office Open XML File Formats ‐‐ Part 3: Markup Compatibility and Extensibility. ISO/IEC 29500‐ 3:2011. [Online] 2011. http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=59577.


12. —. Office Open XML File Formats ‐‐ Part 4: Transitional Migration Features . ISO/IEC 29500‐ 4:2011. [Online] 2011. http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=59578.

13. OASIS ‐ Organization for the Advancement of Structured Information Standards. OpenDocument v1.2 Specification ‐ Part 1: OpenDocument Schema. [Online] September 2011. http://docs.oasis‐open.org/office/v1.2. OpenDocument v1.2 OASIS Standard.

14. —. OpenDocument v1.2 Specification ‐ Part 2: Recalculated Formula (OpenFormula) Format. [Online] September 2011. http://docs.oasis‐open.org/office/v1.2. OpenDocument v1.2 OASIS Standard.

15. —. OpenDocument v1.2 Specification ‐ Part 3: Packages. [Online] September 2011. http://docs.oasis‐open.org/office/v1.2. OpenDocument v1.2 OASIS Standard.

16. —. ODF 1.1 Interoperability Profile, Committee Draft 03. [Online] June 2010. http://docs.oasis‐ open.org/oic/odf1.1i/v1.0/CD03/ODF1.1‐InteropProfile‐v1.0‐cd03.pdf.

17. —. The State of Interoperability v1.0, Committee Specification 01. [Online] December 2010. http://docs.oasis‐open.org/oic/StateOfInterop/v1.0/StateOfInterop.pdf.

18. ISO/IEC JTC 1/SC34. Document Schema Definition Languages ‐ DSDL. ISO/IEC 19757 ‐ DSDL. [Online] http://dsdl.org/.

19. ISO/IEC JTC 1/SC34. Office Open XML File Formats ‐‐ Part 4: Transitional Migration Features ‐ Technical Corrigendum 1. 2012. Vol. ISO/IEC JTC 1/SC 34/WG 4 N 0213.

20. Di Iorio, Angelo. Pattern‐based Segmentation of Digital Documents: Model and Implementation. Bologna : s.n., 2007.

21. ISO/IEC. ISO concept database . [Online] ISO. http://www.iso.org/iso/iso_concept_database_cdb.

22. W3C. XProc: An XML Pipeline Language. [Online] May 2010. http://www.w3.org/TR/xproc/.

23. —. XSL Transformations (XSLT) Version 1.0. [Online] November 1999. http://www.w3.org/TR/xslt.

24. Sun Microsystems / Oracle. JavaTM 2 Platform, Standard Edition, v 1.4.2. [Online] http://download.oracle.com/javase/1.4.2/docs/api/java/net/JarURLConnection.html.

25. Jelliffe, Rick. A simple method for integrating XML‐in‐ZIP formats into DSDL. [Online] 2010. http://lists.dsdl.org/dsdl‐discuss/2010‐03/0008.html.

26. Feature Driven Profiling of Open Standards for Office Applications. Kirchhoff, Björn. [ed.] IEEE Xplore. Berlin : s.n., 2011. The 7th International Conference on Standardization and Innovation in Information Technology SIIT2011.

27. ISO/IEC. ISO/IEC Directives and complementary documents. [Online] 2006. http://isotc.iso.org/livelink/livelink/fetch/2000/2489/Ittf_Home/Directives.html.


28. OSB Alliance. Specification of "Layout‐true Representation of OOXML Documents in Open Source Office Applications". [Online] December 2011. http://osb‐ alliance.com/images/stories/PDF_Files/specificationooxmlimprovements_en_v06.pdf.

29. —. Open Source Business Alliance. Open Source Business Alliance. [Online] OSB Alliance. http://osb‐alliance.com/.

30. The Document Foundation. LibreOffice. LibreOffice. [Online] The Document Foundation. http://www.libreoffice.org/.

31. Fraunhofer FOKUS. Center for Interoperability. Zentrum für Interoperabilität. [Online] Fraunhofer FOKUS. http://www.interoperability‐center.com/en/.

32. Vitali, Fabio and Marinelli, Paolo. Interoperability across different ISO/IEC file formats: the pentaformat approach. Paris : University of Bologna, 2009.

33. Krowne, Aaron. Zipf's law. Zipf's law. [Online] PlanetMath.org. http://planetmath.org/encyclopedia/ZipfsLaw.html.
