ELAN Electronic Government and Applications
Feature Based Document Profiling - A Key for Document Interoperability? Bibliographic information of the Deutsche Nationalbibliothek:
The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.d-nb.de.
1st edition, June 2012
All rights reserved © Fraunhofer-Institut für Offene Kommunikationssysteme FOKUS, June 2012
Fraunhofer-Institut für Offene Kommunikationssysteme FOKUS, Kaiserin-Augusta-Allee 31, 10589 Berlin
Phone: +49-30-3436-7115 Fax: +49-30-3436-8000 [email protected] www.fokus.fraunhofer.de
This work, including all of its parts, is protected by copyright. Any use beyond the narrow limits of the German Copyright Act (Urheberrechtsgesetz) without the written consent of the institute is prohibited and punishable by law. This applies in particular to reproductions, translations, microfilming and storage in electronic systems. The reproduction of trade names and brand names in this book does not justify the assumption that such names are to be regarded as free within the meaning of trademark and brand protection legislation and may therefore be used by anyone. Where this work refers directly or indirectly to laws, regulations or guidelines (e.g. DIN, VDI) or quotes from them, the institute cannot guarantee their correctness, completeness or currency.
ISBN 978-3-00-038675-6
Feature Based Document Profiling ‐ a Key For Document Interoperability?
Authors
Dr. Klaus‐Peter Eckert Fraunhofer Institut FOKUS eMail: klaus‐[email protected]
Kerstin Goluchowicz Technische Universität Berlin, Fachgebiet Innovationsökonomie eMail: kerstin.goluchowicz@tu‐berlin.de
Dr. Stephan Gauch Technische Universität Berlin, Fachgebiet Innovationsökonomie eMail: stephan.gauch@tu‐berlin.de
Björn Kirchhoff eGov Consulting and Development GmbH eMail: [email protected]
Management Summary
The working group WG5 of the ISO/IEC subcommittee SC34 “Document Description and Processing Languages” conducts research on “Document Interoperability” considering open document standards such as “Open Document Format ‐ ODF” and “Office Open XML ‐ OOXML”. The TransDok project (validation and transformation of selected profiles of the document standards ISO/IEC 26300 and ISO/IEC 29500), sponsored by the German Federal Ministry of Economics and Technology, contributes to this research. It examines whether and how feature based document profiles can be defined and used as a means to identify interoperable subsets of both document standards, especially for typical documents used in the German public sector.
Utilizing the document features identified in ISO/IEC TR 29166 (1), XML schemas for the definition of document features and feature based profiles have been defined. A feature list generator has been implemented that creates a list of all features used within a document and, in addition, a list of all features including their relative and absolute occurrence in all documents contained in a given folder. The list can be used to identify those properties that are characteristic for the documents within the folder and to define an associated document profile.
The feasibility of feature based profiles to describe common properties of document types has been analysed using mathematical classification methods. These methods show that at least typical features for certain document types exist. These features can be used to define an interoperable profile or template for a document type. If the set of features is restricted to those that are characteristic and necessary and that allow a unique translation between both document standards, an important step towards document interoperability and translation has been taken. The average accuracy of our classification algorithms reaches levels above 70%, making these approaches a viable complementary option to improve classification of documents.
If the ideas developed in the project are applied to typical document types in the German public sector, their interoperability and portability can be enhanced significantly. The integration of the feature list generator in archiving systems will enhance the likelihood of sustainable storage of documents and reduce interoperability problems significantly. The ideas developed in the project have been presented to ISO/IEC SC34 WG5 as well as to ODF plug fests. The results of the project are included in the current WG5 study period report and will probably influence the next work items in WG5.
The project underlying this report was funded by the German Federal Ministry of Economics and Technology under grant number 01FS10017. The responsibility for the content of this publication lies with the authors.
Contents
Management Summary
Contents
1 Introduction
1.1 Practical Relevance
2 State of the Art
2.1 Open Document Formats
2.1.1 Introduction to OOXML
2.1.2 Introduction to ODF
2.2 Conformity and Interoperability Definitions
2.2.1 Office Open XML
2.2.2 OpenDocument Format
2.2.3 Summary
2.3 Document Features
2.3.1 ISO/IEC TR 29166
2.4 Profiling and Document Interoperability
2.5 Tools and Languages
2.5.1 Document Packages
3 Methodology
3.1 Definition of Document Features
3.2 Feature Based Profile Definition
3.3 Profile Inspection
3.3.1 Binary Membership
3.3.2 Statistical Membership
4 Technical Details
4.1 Feature List Generator
4.1.1 Using the Feature List Generator
4.2 Profile Definition and Checking
4.2.1 Definition and Testing of a Profile
5 Profile Evaluation
5.1 Dataset and Pre‐processing
5.2 Testbed Specifications
5.3 Classification Approaches
5.3.1 Fisher Exact Tests
5.3.2 Cluster Analysis and Heatmaps
5.3.3 Logistic Regression
5.3.4 Recursive Partitioning Trees
5.3.5 Neural Networks
5.3.6 Support Vector Machines
5.3.7 Discriminant Analysis
5.4 Synopsis
6 Summary
6.1 Summary of Project results
6.2 Practical Relevance
6.2.1 Technical Relevance
6.2.2 Economic Relevance
6.3 Open Issues
6.3.1 Scientific Challenges
7 References
1 Introduction
The working group WG5 of the ISO/IEC subcommittee SC34 “Document Description and Processing Languages” conducts research on “Document Interoperability” considering open document standards such as “Open Document Format ‐ ODF” and “Office Open XML ‐ OOXML”. The first result of the working group is the publication of the ISO/IEC technical report TR 29166 on “Guidelines for translation between ISO/IEC 26300 and ISO/IEC 29500 document formats” in late 2011 (1). This report defines a taxonomy of document features and evaluates whether these features are supported by the two standards and whether the implementations of the features can be mapped between the standards.
The TransDok project (validation and transformation of selected profiles of the document standards ISO/IEC 26300 and ISO/IEC 29500), sponsored by the German Federal Ministry of Economics and Technology, goes one step further. The project examines whether and how feature based document profiles can be defined and used as a means to identify interoperable subsets of both document standards. After several interviews with representatives from the German public sector and a comprehensive Internet search, a set of typical document categories for the German public sector has been identified and associated documents have been gathered and analysed.
Utilizing the document features identified in ISO/IEC TR 29166, XML schemas for the definition of document features and for the definition of feature based profiles have been defined. The feature language has been applied to specify exemplary features of word processing documents utilizing XPath based detection functions for both document standards. As a next step a feature list generator has been implemented. This generator creates a list of all features used within a document and, in addition, a list of all features including their relative and absolute occurrence in all documents contained in a given folder. The list can be used to identify those properties that are characteristic for the documents within the folder. For example, typical features for a German application form can be identified.
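The counting step described above can be illustrated with a minimal sketch. It is not the project's feature list generator: the feature names, the XPath expressions and the two sample documents are assumptions made for illustration, and the documents are passed as XML strings rather than read from a folder. Only the WordprocessingML namespace and element names are taken from OOXML itself.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# WordprocessingML namespace (transitional schema).
NS = {"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}

# Hypothetical feature definitions: feature name -> XPath detection expression.
# ElementTree supports only a limited XPath subset, which suffices here.
FEATURES = {
    "table": ".//w:tbl",
    "bold-run": ".//w:rPr/w:b",
    "footnote-reference": ".//w:footnoteReference",
}

def detect_features(xml_text):
    """Return the set of feature names whose XPath matches at least once."""
    root = ET.fromstring(xml_text)
    return {name for name, xpath in FEATURES.items()
            if root.find(xpath, NS) is not None}

def feature_list(documents):
    """Map each feature to (absolute, relative) occurrence over all documents."""
    counts = Counter()
    for xml_text in documents:
        counts.update(detect_features(xml_text))
    total = len(documents)
    return {name: (counts[name], counts[name] / total if total else 0.0)
            for name in FEATURES}

# Two tiny illustrative documents (assumed content, not project data).
W = NS["w"]
DOC_A = (f'<w:document xmlns:w="{W}"><w:body><w:tbl/>'
         '<w:p><w:r><w:rPr><w:b/></w:rPr></w:r></w:p></w:body></w:document>')
DOC_B = (f'<w:document xmlns:w="{W}"><w:body>'
         '<w:p><w:r><w:rPr><w:b/></w:rPr></w:r></w:p></w:body></w:document>')

if __name__ == "__main__":
    for name, (absolute, relative) in feature_list([DOC_A, DOC_B]).items():
        print(f"{name}: {absolute} document(s), {relative:.0%}")
```

Features with a high relative occurrence in a folder of similar documents are candidates for the characteristic properties of that document type.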
The feature list generator has two additional properties. First, a profile can be defined by assigning attributes like “may exist”, “must exist”, “must not exist” etc. to each feature. Second, a document can be checked for conformance to such a profile definition.
Following the idea of feature based profiles several questions arise. Is it possible to define profiles in a way that profiles characterizing different document types are really different? Is, for example, a feature based profile for a letter different from a profile for an application, and what makes the difference? What is the likelihood that an arbitrary document conforms to a given profile? What is the likelihood that a document of a specific document type conforms to the associated profile or, the other way round, what is the likelihood that such a document does not conform to a given profile? Is my letter really a letter with respect to a letter profile? If the intersection of two profiles P and Q is not empty, is it possible to say that a document d belongs to P or to Q? Is it possible to say that the likelihood of d ∈ P is greater than that of d ∈ Q?
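A profile check of this kind can be sketched as follows. The profile content and the feature names are hypothetical, and the sketch assumes that features not mentioned in the profile are simply ignored; the profile language defined in the project may treat them differently.

```python
# Hypothetical profile: feature name -> requirement attribute.
LETTER_PROFILE = {
    "paragraph": "must exist",
    "table": "may exist",
    "macro": "must not exist",
}

def check_profile(document_features, profile):
    """Return a list of violations; an empty list means the document conforms.

    Assumption: features that the profile does not mention are ignored.
    """
    violations = []
    for feature, rule in profile.items():
        present = feature in document_features
        if rule == "must exist" and not present:
            violations.append(f"missing required feature: {feature}")
        elif rule == "must not exist" and present:
            violations.append(f"forbidden feature present: {feature}")
        # "may exist" never causes a violation.
    return violations

if __name__ == "__main__":
    print(check_profile({"paragraph", "table"}, LETTER_PROFILE))
    print(check_profile({"table", "macro"}, LETTER_PROFILE))
```

The set of detected features passed to `check_profile` would come from a feature detection step such as the feature list generator.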
To answer such questions, mathematical methods have been applied to our feature based profile definitions. These methods show that at least typical features for certain document types exist. These typical features can be used to define an interoperable profile or template for a given document type. If the set of features is restricted to those that are characteristic and necessary for the document class and that allow a unique translation between both document standards, an important step towards document interoperability and translation has been taken. Whether a feature can be translated between the two standards can be derived from the associated detection functions. If detection functions for a given feature exist for each standard, these functions can be used to define a feature translation between the standards.
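The selection of translation candidates can be sketched as a simple filter over a feature catalogue. The catalogue below is hypothetical; the XPath strings only illustrate what a detection function might look like and are not evaluated. The existence of a detection function for each standard is treated here as a necessary condition only, since a unique translation needs further checks.

```python
# Hypothetical feature catalogue: each feature lists the detection functions
# (represented as XPath strings) that are available per standard.
DETECTORS = {
    "table": {
        "ODF": ".//table:table",
        "OOXML": ".//w:tbl",
    },
    "bold-text": {
        "ODF": ".//style:text-properties[@fo:font-weight='bold']",
        "OOXML": ".//w:rPr/w:b",
    },
    "custom-xml-markup": {
        "OOXML": ".//w:customXml",  # no ODF detection function in this sketch
    },
}

def translatable_features(detectors, standards=("ODF", "OOXML")):
    """Return the features that have a detection function for every standard."""
    return sorted(name for name, by_standard in detectors.items()
                  if all(std in by_standard for std in standards))

if __name__ == "__main__":
    print(translatable_features(DETECTORS))
```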
This report starts with a summary of the state of the art concerning conformity and interoperability definitions for the open document standards ODF and OOXML. Section 3 explains the methodology and mathematical approaches used in the project, followed by a description of the technical details of the feature list generator in section 4. Section 5 evaluates the profile idea utilizing statistical methods. An outlook concerning the practical importance of the work concludes the report.
1.1 Practical Relevance
The major goal of the TransDok project is to improve interoperability between documents implemented in ODF or OOXML respectively and to give guidelines on how document templates and office suites should be designed to enhance the portability of documents. If the ideas developed in the project are applied to typical document types in the German public sector, the interoperability and portability of these documents can be enhanced significantly. The integration of the feature list generator in archiving systems will enhance the likelihood of sustainable storage of documents and reduce interoperability problems significantly.
The ideas developed in the project have been presented to ISO/IEC SC34 WG5 as well as to ODF plug fests. The results of the project are included in the current WG5 study period report and will probably influence the next work items in WG5. For this reason the relevance for standardisation bodies such as ISO/IEC SC34 can be considered as high.
2 State of the Art
This section gives an introduction to the history and main concepts of Open Document Format (ODF) and Office Open XML (OOXML). It focusses on the definition of document features and the concepts for conformity, interoperability and profiling introduced in both standards.
2.1 Open Document Formats
OASIS Open Document Format ODF 1.0 (ISO/IEC 26300) and Office Open XML (ISO/IEC 29500) are both open document formats for saving and exchanging word processing documents, spreadsheets and presentations. Both formats are XML based but differ in design and scope.
OASIS ODF 1.0 (2) was published by OASIS as an OASIS standard in May 2005. The second edition of ODF 1.0 has been published by OASIS as a committee specification in July 2006 and accepted as an International Standard by ISO (ISO/IEC 26300) (3) in December 2006.
Figure 1: Evolution of ODF (February 2012)
ODF 1.1 (4) has been published as an OASIS standard in 2007 and will be published as Amendment 1 of ISO/IEC 26300:2006 (5) in 2012. ODF 1.2 has been published as an approved OASIS Standard early 2012 (6) and will probably become a PAS¹ submission in ISO/IEC in the same year.
Office Open XML was first approved as a five‐part standard in December 2006 by the ECMA International General Assembly as ECMA‐376. An updated version was published in November 2008 by ISO as ISO/IEC 29500:2008. The corresponding version, ECMA‐376 2nd edition (7), was published in December 2008. The consolidated version of OOXML including several corrigenda and amendments was published in 2011 as ISO/IEC 29500:2011 and ECMA‐376 3rd edition (8).
1 PAS ‐ Publicly Available Specification
[Figure 2 timeline: ECMA‐376 1st edition (2006), 2nd edition (2008) and 3rd edition (2011); ISO/IEC 29500 Parts 1 to 4 in their 2008 and 2011 editions, with associated corrigenda (Cor 1) and amendments (Amd 1).]
Figure 2: Evolution of OOXML (February 2012)
2.1.1 Introduction to OOXML
OOXML is a four‐part standard consisting of:
1. Part 1 ‐ Fundamentals and Markup Language Reference (9). This part contains the strict specification of OOXML. At the time of writing no implementation of this part exists. This part contains:
• Conformance definitions
• Textual descriptions of the document parts and of the document markup languages defined by the standard: WordprocessingML, PresentationML, SpreadsheetML and further supported MLs
• XML schemas for the document markup languages using XSD and (non‐normatively) RELAX NG
• Several examples, tutorials and primers
• A list of differences between this part and ECMA‐376 1st edition
2. Part 2 ‐ Open Packaging Conventions (10). This part contains:
• A description of the Open Packaging Conventions, e.g. package model and physical package
• Core properties, thumbnails and digital signatures
• XML schemas for the OPC using XSD and (non‐normatively) RELAX NG
• Several examples and guidelines
• A list of differences between this part and ECMA‐376 1st edition
3. Part 3 ‐ Markup Compatibility and Extensibility (11). This part contains:
• A description of extensions: elements and attributes which define mechanisms allowing applications to specify alternative content
• Extensibility rules using NVDL²
2 NVDL ‐ Namespace‐based Validation Dispatching Language ‐ ISO/IEC 19757 (14)
4. Part 4 ‐ Transitional Migration Features (12). This part contains the transitional specification of OOXML. At the time of writing most OOXML applications implement this part. This part contains:
• Legacy material such as compatibility settings and the graphics markup language VML
• Textual descriptions of the document parts and of the document markup languages defined by the standard: WordprocessingML, PresentationML, SpreadsheetML and further supported MLs, referring to Part 1 of the standard whenever appropriate
• XML schemas for the document markup languages using XSD and (non‐normatively) RELAX NG
• A list of differences between this part and ECMA‐376 1st edition
2.1.1.1 WordprocessingML
OOXML defines three major markup languages that have been developed rather independently. For this reason the number of shared concepts is quite small. For example, Part 1 introduces the following WordprocessingML (WML) concepts, from which a model for text documents and their features can be derived:
• Paragraphs and Rich Formatting
• Tables
• Custom Markup
• Sections
• Styles
• Fonts
• Numbering
• Headers and Footers
• Footnotes and Endnotes
• Glossary Documents
• Annotations
• Mail Merge
• Settings
• Fields and Hyperlinks
The following “smart art” shows how a taxonomy for the properties of a text document can be defined using the feature definitions of OOXML Part 1.
[Figure 3 diagram: an OOXML WML document with the feature groups Paragraphs and Run Formatting (Paragraph, Run Content), Tables, Custom Markup, Sections, Styles (Style Properties, Table Styles, Numbering Styles, Paragraph Styles, Run Styles), Fonts, Numbering / Lists, Headers and Footers, Footnotes and Endnotes, ...]
Figure 3: Sample features of OOXML wordprocessing documents
2.1.2 Introduction to ODF
ODF 1.2 is a three‐part standard consisting of:
1. Part 1: OpenDocument Schema (13). This part defines the XML schema for office documents such as text documents, spreadsheets, charts and graphical documents like drawings or presentations. It specifies:
• Document structure
• Document metadata
• Document content
• Formatting elements
• Data types and attributes (major part of the specification)
• Normative RelaxNG schema definitions
• Guidelines
2. Part 2: Recalculated Formula (OpenFormula) Format (14). This part defines the formula language for OpenDocument documents called OpenFormula. It specifies:
• Evaluator types
• Formula processing model
• Data types to be used in formulas
• Expression syntax
• Standard operations and functions
3. Part 3: Packages (15). This part defines a package format for OpenDocument documents. It specifies:
• Package types
• Package content
• Manifest file
• Digital signatures
• Metadata
• ZIP file structure (non‐normative)
2.1.2.1 Text document
ODF defines one major markup language that defines all elements of OpenDocument documents and all attributes of these elements. For this reason a text document is not specified by a separate markup language but is a document with a body containing office text as depicted in Figure 4.
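Because the document type is determined by the child element of the body, it can be read directly from the content XML. The following minimal sketch assumes that the content is available as an XML string with `office:document-content` as root element; the sample string is illustrative, and unpacking the ODF package is left out.

```python
import xml.etree.ElementTree as ET

# ODF "office" namespace.
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"

def odf_document_kind(content_xml):
    """Return the local name of the first child of office:body, or None."""
    root = ET.fromstring(content_xml)
    body = root.find(f"{{{OFFICE_NS}}}body")
    if body is None or len(body) == 0:
        return None
    tag = body[0].tag  # e.g. "{urn:...:office:1.0}text"
    return tag.split("}", 1)[-1]

# Illustrative minimal content (not a complete ODF document).
SAMPLE_CONTENT = (
    f'<office:document-content xmlns:office="{OFFICE_NS}">'
    '<office:body><office:text/></office:body>'
    '</office:document-content>'
)

if __name__ == "__main__":
    print(odf_document_kind(SAMPLE_CONTENT))
```

A result of `text` identifies a text document; `spreadsheet`, `presentation` and the other body children identify the remaining document types.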
[Figure 4 diagram: office:body contains office:body‐content, which is one of office:text, office:drawing, office:presentation, office:spreadsheet, office:chart, office:image or office‐database; office:text comprises office:text‐attlist, office:text‐content‐prelude, office:text‐content‐main and office:text‐content‐epilogue.]
Figure 4: OpenDocument text document
Typical content of a text document consists of:
• Text content such as headings, paragraphs, lists, or change tracking
• Paragraph element content such as basic text, bookmarks and references, or notes
• Text fields such as variable field or metadata
• Text indices such as table of contents
• Tables such as basic tables or spreadsheets
• Graphic content such as shapes, frames, animations
• Chart content
• Database front‐end content
• Form content
• Styles
• Formatting elements
From this list a taxonomy for ODF text documents can be derived. To compare and map ODF documents to similar OOXML documents and vice versa it is necessary to define a common super model of both taxonomies or to define subsets of both taxonomies whose elements can be mapped in an unambiguous way. The idea to define feature based document profiles follows the second approach.
2.2 Conformity and Interoperability Definitions
Due to the existence of two open document formats, ODF (OpenDocument Format) and OOXML (Office Open XML), many discussions have been started about
• interoperability between the standards,
• conformity of documents and
• conformity of applications such as office suites, document producers and consumers.
It is necessary to look at the precise definitions of these terms within the standards in order to discuss these issues on a well‐defined basis and to come to common conclusions acceptable to the users of documents and office suites as well as to the developers of standards and office suites. The introduction of document profiles is impossible without a common understanding of these basic terms and the corresponding concepts.
The relevant definitions about standard conformity and interoperability can be retrieved from ISO/IEC 29500:2008/2011 (respectively ECMA‐376 2nd (7) and 3rd (8) editions), ODF 1.2 Approved OASIS Standard (6), the ODF 1.1 Interoperability Profile (16) and the ODF state of interoperability committee specification (17). The statements about conformity and interoperability in ISO/IEC 29500:2011 are mostly similar to those in the 2008 version.
The purpose of these sections is to provide an overview of conformity and interoperability definitions for the two document formats. This overview helps to derive the definition of property based document profiles, which depends on the corresponding concepts in both standards.
2.2.1 Office Open XML
This section introduces excerpts from the ISO/IEC 29500:2008 and ISO/IEC 29500:2011 versions of the OOXML specifications, which were officially published in fall 2008 and 2011 respectively.
2.2.1.1 Application Descriptions
OOXML currently does not explicitly define the term “profile”. Instead an OOXML‐application can be defined as conforming to zero or more application descriptions in a particular conformance class.
The application descriptions defined within ISO/IEC 29500 are:
• Base ‐ An application conforming to this description has a semantic understanding of at least one feature within its conformance class. In addition, applications that include a user interface are strongly recommended to support all accessibility features appropriate to that user interface.
• Full ‐ An application conforming to this description has a semantic understanding of every feature within its conformance class.
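The two application descriptions can be read as set conditions on the features an application understands. The sketch below is an interpretation for illustration, not normative text from ISO/IEC 29500; the feature names are hypothetical and the accessibility recommendation of the Base description is not modelled.

```python
def application_descriptions(understood, conformance_class_features):
    """Return the application descriptions ("Base", "Full") that apply.

    understood: features the application has a semantic understanding of.
    conformance_class_features: all features of the conformance class.
    """
    class_features = set(conformance_class_features)
    understood_in_class = set(understood) & class_features
    descriptions = []
    if understood_in_class:
        descriptions.append("Base")  # at least one feature understood
    if class_features and understood_in_class == class_features:
        descriptions.append("Full")  # every feature understood
    return descriptions

if __name__ == "__main__":
    wml_features = {"tables", "styles", "fonts"}  # hypothetical feature names
    print(application_descriptions({"tables"}, wml_features))
    print(application_descriptions(wml_features, wml_features))
```

An application that understands no feature of the class conforms to zero application descriptions, which matches the "zero or more" wording above.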
2.2.1.2 Conformance Classes
The above mentioned application conformance classes must fulfil the following conditions:
• Existence of W3C XML schemas and an associated validation procedure for validating document syntax against those schemas.
• Existence of additional syntax constraints that could not feasibly be expressed in the schema language in written form.
• Existence of descriptions of XML element semantics. The semantics of an XML element refers to its intended interpretation by a human being.
An application is of conformance class WML/SML/PML³‐strict/transitional if it is a conforming application that is a consumer or producer of documents having conformance class WML/SML/PML‐strict/transitional. An application description should provide a machine‐processable schema, preferably using a member of the multipart standard ISO/IEC 19757 (18), which defines Document Schema Definition Languages (DSDL) such as RelaxNG, Schematron and the Namespace‐based Validation Dispatching Language (NVDL).
A document conformance class refers to the appropriate W3C XML schemas and additional syntax constraints used to specify WML/SML/PML‐strict/transitional documents.
The standard assumes that additional application descriptions will be defined within the maintenance process for OOXML. It is also expected that third parties might define their own application descriptions. Application descriptions would promote interoperability between applications implementing OOXML. They would also promote interoperability between applications implementing OOXML and applications implementing other document formats such as ODF.
2.2.1.3 Summary
Summarizing, the standard states that applications can conform to application descriptions based on feature definitions and document conformance classes. The intention of an application description is to promote interoperability between different applications that share the same conformance class. Following this idea, an OOXML document profile can be defined as a set of features within a document conformance class.
3 WML ‐ Wordprocessing Markup Language, SML ‐ Spreadsheet Markup Language, PML ‐ Presentation Markup Language
It is worth mentioning that the document conformance statement has been technically refined considering OPC⁴ and MCE in the first technical corrigendum (5) to ISO/IEC 29500‐1:2011 and considering VML⁵ in the first technical corrigendum (19) to ISO/IEC 29500‐4:2011. Additionally, the interoperable generation and consumption of MCE extension lists have been specified in a precise way.
Part 1 of ISO/IEC 29500 defines interoperability guidelines. These guidelines state that software applications should be accompanied by documentation that describes what subset of ISO/IEC 29500 they support. The documentation should highlight any behaviour that may violate the semantics of the document’s XML elements. It has to be ensured that for all operations on the XML elements defined in ISO/IEC 29500 that are implemented by the application the semantics for that XML element is consistent with ISO/IEC 29500. If the application moves, adds, modifies, or removes XML element instances with the effect of altering document semantics, it should declare the behaviour in its documentation.
2.2.2 OpenDocument Format
The OpenDocument specification ODF 1.2 (6) defines conformance for documents, consumers, and producers, with two conformance classes called conforming and extended conforming.
2.2.2.1 Conformance Classes
An ODF document of conformance class conforming shall be a conforming OpenDocument package and it shall conform to one of: OpenDocument Text Document, OpenDocument Spreadsheet Document, OpenDocument Drawing Document, OpenDocument Presentation Document, OpenDocument Chart Document, OpenDocument Image Document, OpenDocument Formula Document, OpenDocument Database Front End Document. An OpenDocument xyz Document is characterized by the existence of an
An ODF document of conformance class extended conforming shall be a conforming ODF extended package and may contain additional foreign elements and attributes as specified by the standard.
2.2.2.2 ODF Producer
An OpenDocument producer is a program that creates at least one conforming OpenDocument document, and that may produce conforming OpenDocument extended documents, but it shall have a mode of operation where all OpenDocument documents that are created are conforming OpenDocument documents. The program shall be accompanied by a document that defines all implementation‐defined values used by the OpenDocument producer.
An OpenDocument extended producer is a program that creates at least one conforming OpenDocument extended document, and that shall be accompanied by a document that
• defines all implementation‐defined values used by the OpenDocument extended producer and that
• defines all foreign elements and attributes used by the OpenDocument extended producer.
4 OPC ‐ Open Packaging Conventions (10), MCE ‐ Markup Compatibility and Extensibility (11)
5 VML ‐ Vector Markup Language
2.2.2.3 ODF Consumer
An OpenDocument consumer is a program that can parse and interpret OpenDocument documents according to the semantics defined by this standard and that meets the following additional requirements:
• It shall be able to parse and interpret OpenDocument documents of one or more of the document types defined by the standard, but it need not interpret the semantics of all elements, attributes and attribute values.
• It shall interpret those elements and attributes it does interpret consistent with the semantics defined for the element or attribute by the standard.
• It should be able to parse and interpret conforming OpenDocument extended documents, but it need not interpret the semantics of all elements, attributes and attribute values.
2.2.2.4 Expressions and Evaluators
The ODF standard defines conformance for formula expressions and evaluators. An OpenDocument Formula Evaluator is a program that can parse and recalculate OpenDocument formula expressions. ODF distinguishes three groups of features that an evaluator may support. It shall conform to OpenDocument Formula Small Group Evaluator, OpenDocument Formula Medium Group Evaluator or OpenDocument Formula Large Group Evaluator. The three groups support formula expression with different types and complexity together with a different amount of math functions. For example the small group supports data of type text, integer and float numbers, logical and a basic set of corresponding functions. The medium group supports more functions and the large group supports complex numbers and corresponding functions. An evaluator may implement additional functions beyond those defined in ODF. It may further implement additional formula syntax, additional operations, or additional optional parameters for functions. Evaluators should clearly document their extensions in their user documentation, both online and paper, in a manner so users would be likely to be aware when they are using a non‐standard extension.
2.2.2.5 ODF Conformance and Interoperability
The OASIS “Open Document Format Interoperability and Conformance (OIC) TC” states in its paper on ODF interoperability (17) “that conformance is the relationship between a product and a standard. A standard defines provisions that constrain the allowable attributes and behaviours of a conforming product. Some provisions define mandatory requirements, meaning requirements that all conforming products must satisfy, while other provisions define optional requirements, meaning that where applicable they must be satisfied. Conformance exists when the product meets all of the mandatory requirements defined by the standard, as well as those applicable optional requirements… A standard may define requirements for one or more conformance targets in one or more conformance classes.”
Since the capabilities of office applications extend beyond simple desktop editors and include other product categories such as web‐based editors, mobile device editors, document convertors, content repositories, search and indexing engines, and other document‐aware applications, interoperability will mean different things to users of these different applications. However, for office applications, interoperability consists of meeting user expectations regarding one or more of the following qualities⁶ when transferring documents:
• visual appearance of the document at various levels
• structure of the document as revealed when the user attempts to edit the document
• behaviours and capabilities of internal and external links and references
• behaviours and capabilities of embedded images, media and other objects
• preservation of document metadata
• preservation of document extensions
• integrity of digital signatures and other protection mechanisms
• runtime behaviours manifest from scripts, macros and other forms of executable logic
The focus on the user’s expectations leads to the ODF interoperability model shown in Figure 5. This model, introduced in (17), defines document interoperability as the degree of analogy between the author’s intention and the reader’s perception.
• Author’s intentions
• Application A’s encoding (author)
• Document – standardized storage format
• Application B’s decoding (reader)
• Reader’s perceptions
Figure 5: ODF interoperability model
The ODF 1.1 Interoperability Profile Committee Draft (16) clarifies and formalizes interpretations of the ODF 1.1 specification by creating an Interoperability Profile that adds conformance constraints to the specification. It is currently not intended by OASIS to specify profiles that restrict the ODF standard for specific application areas.
2.2.3 Summary
2.2.3.1 Conformity
Both document formats define conformance with respect to the supported document types shown in Table 1. OOXML distinguishes conformity with respect to the strict and the transitional markup language; such a distinction is not necessary for ODF. ODF conformity is based on schema validity, whereas OOXML conformity is based on schema validity together with additional "written syntax constraints".
Table 1: Document types

ODF                   OOXML
Text                  Wordprocessing (WML)
Spreadsheet           Spreadsheet (SML)
Drawing
Presentation          Presentation (PML)
Chart
Image
Formula
Database front end

6 These qualities are an example of the document features discussed in section 2.3.
Application conformance is defined in terms of document conformance. Both formats distinguish between document consumers and document producers.

• A conforming consumer shall
  o OOXML: not reject any conforming documents of at least one document conformance class
  o ODF: be able to parse and interpret ODF documents of one or more of the document types
• A conforming producer shall be able to produce:
  o OOXML: conforming documents of at least one document conformance class
  o ODF: at least one conforming document

Both definitions seem to be equivalent, even though they differ in wording.
2.2.3.2 Extended Conformity
In ODF, a document can contain content that is not schema valid with respect to the ODF schema definitions. An ODF document belongs to the conformance class "extended conforming" if it belongs to the class "conforming" after removal of the non‐ODF parts. In addition, the ODF specification explicitly introduces conformance classes for formula expressions and evaluators, based on the OpenFormula specification.
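The "conforming after removal of the non‐ODF parts" rule can be illustrated with a minimal Python sketch. It is not the normative ODF procedure: the list of accepted namespaces is a small, incomplete assumption, foreign elements are dropped together with their content, and `is_conforming` stands for any validator supplied by the caller (a hypothetical parameter, for example a schema check).

```python
import xml.etree.ElementTree as ET

# Assumption: a small, incomplete set of namespace prefixes treated as "ODF parts".
ALLOWED_NAMESPACE_PREFIXES = (
    "urn:oasis:names:tc:opendocument:xmlns:",
    "http://www.w3.org/1999/xlink",
    "http://purl.org/dc/elements/1.1/",
)


def namespace_of(name: str) -> str:
    """Return the namespace URI of a '{uri}local' name ('' if there is none)."""
    return name[1:].split("}", 1)[0] if name.startswith("{") else ""


def strip_foreign(elem: ET.Element) -> None:
    """Remove, in place, child elements and attributes from foreign namespaces."""
    for child in list(elem):
        if namespace_of(child.tag).startswith(ALLOWED_NAMESPACE_PREFIXES):
            strip_foreign(child)
        else:
            # Simplification: the whole subtree (and its tail text) is dropped.
            elem.remove(child)
    for name in list(elem.attrib):
        ns = namespace_of(name)
        # Attributes without a namespace belong to their element and are kept.
        if ns and not ns.startswith(ALLOWED_NAMESPACE_PREFIXES):
            del elem.attrib[name]


def is_extended_conforming(xml_text: str, is_conforming) -> bool:
    """Apply the caller's conformance check after removing the foreign parts."""
    root = ET.fromstring(xml_text)
    strip_foreign(root)
    return bool(is_conforming(root))
```

The design keeps the validator outside the function so that the same stripping step can be combined with whatever schema validation is available.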
In OOXML, the extension mechanism MCE is described in Part 3. A document belongs to the conformance class MCE if it satisfies the corresponding syntax constraints on elements and attributes. This definition is more restrictive than the ODF definition.
2.2.3.3 Package Conformance
In OOXML, a document is of conformance class OPC if it obeys all corresponding syntactic constraints. An ODF file conforms to part 3 of the ODF specification if it is a zip‐file satisfying the corresponding constraints.
2.2.3.4 Profiling and Interoperability
OOXML introduces the concept of an application description without elaborating it in the current version of the standard. Application descriptions should be used to refine the standard and to improve interoperability between different implementations as well as between OOXML and ODF.
ODF addresses interoperability issues in the OIC TC and publishes papers about interoperability (17) and an interoperability profile (16). The focus of ODF profiles is to improve the interoperability between ODF applications.
2.2.3.5 Conclusion
Both document standards include conformance statements for documents, document producers and document consumers. While ODF focuses mainly on syntax related criteria and schema validity, OOXML also considers textual syntax constraints and semantic aspects. Unfortunately, these definitions are rather weak and do not allow a precise definition and validation of conformance properties. For this reason, both standards require additional written documentation for implementation dependent solutions and for any extension of the standard.
Interoperability and profiling are only tackled to a limited extent and have to be improved by both standards bodies OASIS and ISO in the future.
2.3 Document Features
There exist two major approaches for the identification of document features in ODF and OOXML. The Pentaformat (20) introduces the concept of “pattern based segmentation of structured content” to express the most used and meaningful structures of digital documents. An abstract document model has been developed using the following set of basic patterns together with five characteristics:
• A marker is an empty element, possibly enriched with attributes
• An atom is a markup unit of information
• A block contains text streams and unordered and repeated nested elements
• A record is a container of heterogeneous information, organized in a set of optional elements
• A table is an ordered list of homogeneous elements
• A container is an unordered sequence of repeatable and heterogeneous elements
• An additive context is a context where a few elements are added in depth to existing elements
• A subtractive context is a context where some elements that would normally be allowed make no sense
The five dimensions for documents are:
• Content: what conveys semantics to the document
• Presentation: what defines the visual aspect of the document
• Structure: what provides organization and links the content to all the rest
• Behaviour: what defines the dynamics of a document in an active environment
• Metadata: what describes the document independent of its content
Following the approach depicted in Figure 6, mapping strategies between different document formats have been defined.

Figure 6: The Pentaformat (see footnote 7)

Although the Pentaformat allows different aspects to be separated, three situations can always occur in a translation operation:
1. Both formats support the same feature
2. The target format partially supports the feature
3. The target format does not support the feature

For a one way mapping either

• a feature can be translated by syntactical transformations, or
• a workaround solution has to be implemented (the feature is realized using a combination of different features of the target format), or
• the feature cannot be translated at all

For roundtrip translation the Pentaformat suggests including information about the feature representation in the source format as hidden data or metadata in the target document.
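The three situations and the one way mapping options can be put side by side in a short Python sketch. The decision rule (full support leads to a syntactic transformation, otherwise a workaround if one exists) is an assumption made for illustration, and all names (`Support`, `one_way_strategy`, `roundtrip_record`) are hypothetical.

```python
from enum import Enum


class Support(Enum):
    """How well the target format supports a feature (the three situations)."""
    FULL = 1      # both formats support the same feature
    PARTIAL = 2   # the target format partially supports the feature
    NONE = 3      # the target format does not support the feature


def one_way_strategy(support: Support, workaround_exists: bool) -> str:
    """Pick a one way mapping strategy (assumed rule, for illustration only)."""
    if support is Support.FULL:
        return "syntactic transformation"
    if workaround_exists:
        return "workaround"
    return "not translatable"


def roundtrip_record(feature: str, source_markup: str, target_markup: str) -> dict:
    """Bundle the translated markup with a hidden trace of the source form."""
    return {
        "target": target_markup,
        "hidden_metadata": {"feature": feature, "source": source_markup},
    }
```

Keeping the source representation next to the translated markup is what would allow a later reverse translation to restore the original feature instead of guessing it from the workaround.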
2.3.1 ISO/IEC TR 29166
The ISO/IEC technical report TR 29166:2011 (1) aims at analysing ISO/IEC 26300:2006 and ISO/IEC 29500:2008 and their underlying concepts in terms of interoperability issues for a selected set of features. It analyses the way these features are implemented in both International Standards and estimates the degree of translatability between them using a table based comparison of document features and functionalities. ISO/IEC 29166 starts by studying common use cases to identify how the most important features and the implementation of the corresponding functionality in one document format can be represented in the other format. This is followed by a thorough review of the concepts, architectures and various features of the two document formats in order to provide a good understanding of the commonalities and differences. It is shown that functionalities can be translated between the two formats with different degrees of fidelity. As an illustrative sample of this translatability, ISO/IEC 29166 provides detailed information on the extent to which selected functionalities can be translated.

7 Excerpt from (21)
Figure 7: Document properties in ISO/IEC TR 29166 (presentation instructions, document content, dynamic content, metadata, annotations and security, document parts)

The document properties introduced in the report and depicted in Figure 7 can be compared with the five dimensions of the Pentaformat. In detail they are defined as follows:
• Presentation instructions include all layout and presentation related information such as fonts, spacing, margins, colours, paper layout and settings, and animation in office documents.
• Document content covers all properties of content (such as text, graphics and formulas) defined directly by the author of a document.
• Dynamic content covers all aspects of automatically generated content including calculations or form functionalities such as fields, generated tables or dynamic references.
• Metadata cover all information apart from the core document content. Metadata are used to describe meta information about the document such as the generator, version, authors and to ensure the accessibility of documents, for instance by using certificates.
• Annotation and security covers all aspects of annotations used in a document including comments, change tracking, collaborative functions and security features such as encryption and access control.
• Document parts cover all aspects (editing semantics) of structural document properties such as paragraphs, headings, headers, footers, tables, lists, footnotes, indices and captions.
Figure 8: Document features and functionality in ISO/IEC TR 29166

The report identifies specific features that are used to implement the six document properties for the three document types: wordprocessing, presentation and spreadsheet documents. Selected features and functions identified for wordprocessing documents are assembled in Table 2 below. This assembly is used as a starting point for the definition of feature based profiles as described in section 3.1.
Table 2: Selected features and functionalities of wordprocessing documents
Feature Functionality

Text formatting

Bold text (font weight)

Text borders

Whitespaces

Capitalization

Text colour

Complex script support

East‐Asian text

Font selection

Font effects

Manual specification of run/span width

Italic text

Kerning

Text language

Enable/disable spell checking for run/span

Raised/lowered text

Strikethrough

Underline
Paragraph formatting
Line height
Text alignment (left/ right/ centered/ justified)
Keep paragraph on same page as following paragraph
Do not split paragraph into multiple pages
Tab stops
Hyphenation
Drop Caps
Register truth (same text line distance across multiple pages / columns)
Margins
First line indent
Page/ column break
Background colour
Background pattern
Background image
Embedded Images
Borders
Padding
Shadow
Line numbering
Vertical alignment (top, middle, bottom, baseline)
Asian / complex text layout properties
Writing mode (lr/rl/tb)
Text frames
Lists
Header and Footer
Content type
Properties
Formatting
Tables
Table properties
Data alignment
Column settings
Row settings
Cell settings
Sub tables
Borders
Table headings
Itemization and numeration
Numbered Lists
Bullet lists
Nested lists
Captions
Indices
Table of contents
Table of figures
Table of tables
User defined indices
Bibliographies
Hyperlinks and references
Change tracking
Annotations
Text insertion
Text deletion
Formatting changes
Comments
Text highlighting
Metadata
Graphics
Embedded graphics
Vector graphics
Forms
Charts
Embedded data
Mail merge
2.4 Profiling and Document Interoperability
Profiling is a well‐known concept in standardisation. If a standard is too complex or is not intended to be implemented as a whole, self‐contained subsets of the standard are identified that guarantee interoperability between different implementations of the profile. Which subset of a standard is profiled depends on the given problem. The ISO concept database ‐ ISO 14772 (21) defines a profile (in the context of virtual reality modelling languages) as a “named collection of criteria for functionality and conformance that defines an implementable subset of the standard”.
As explained in section 2.2 neither OOXML nor ODF has introduced profiles within the standard. Nevertheless both standards refer to similar concepts that should improve interoperability of different implementations of one standard and between both standards. The definition of a common profile for two standards such as OOXML and ODF requires a common understanding of the standardized artefact, in our case of a “document”. As explained in section 2.3 such a common understanding can be achieved by the introduction of a document metamodel. The feature based approach introduced in this paper can be used to define such a metamodel.
Assuming the existence of a standardized, agreed metamodel with standardized mappings to OOXML and ODF, it is possible to define translation rules for common profiles between the two International Standards. Following the ideas of “Model Driven Architectures” (MDA), the metamodel corresponds to a “Platform Independent Model” (PIM) that can be mapped to two different “Platform Specific Models” (PSM). If a reverse mapping from PSM to PIM exists, a PSM‐document can be analysed and re‐mapped to a PIM‐document. This PIM‐document can then be mapped to a PSM‐document defined in another “platform”. As shown in Figure 9, a feature translation can be implemented by concatenating the two operations “feature detection” and “feature implementation”.
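The concatenation of feature detection and feature implementation can be sketched in Python as a function composition. The two "formats" below are toy string notations invented for this illustration (they are neither OOXML nor ODF), and the dictionary used as the platform independent representation is likewise an assumption.

```python
import re
from typing import Callable, Optional

# PSM fragment -> PIM feature (None if the feature is not present)
Detector = Callable[[str], Optional[dict]]
# PIM feature -> PSM fragment
Implementer = Callable[[dict], str]


def make_translator(detect: Detector, implement: Implementer):
    """Feature translation = feature implementation after feature detection."""
    def translate(fragment: str) -> Optional[str]:
        feature = detect(fragment)
        return None if feature is None else implement(feature)
    return translate


def detect_bold_in_a(fragment: str) -> Optional[dict]:
    """Toy format A marks bold text as **text** (hypothetical notation)."""
    match = re.fullmatch(r"\*\*(.+)\*\*", fragment)
    return {"feature": "bold", "text": match.group(1)} if match else None


def implement_bold_in_b(feature: dict) -> str:
    """Toy format B marks bold text as <b>text</b> (hypothetical notation)."""
    return f"<b>{feature['text']}</b>"


translate_bold = make_translator(detect_bold_in_a, implement_bold_in_b)
print(translate_bold("**Hi**"))  # <b>Hi</b>
```

Because the intermediate value is format independent, a translator in the opposite direction only needs a detector for format B and an implementer for format A; the metamodel side stays unchanged.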
Figure 9: Feature based document translation (feature f in Format A is mapped to the document metamodel by 1. feature detection and from there to feature f in Format B by 2. feature implementation; both steps together form the feature translation)

Section 3 shows how feature based document profiles can be defined and used to define translation rules utilizing feature detection and implementation functions. Defining a document metamodel or interoperable document profiles has been out of scope of the TransDok project. The project focuses on the development of concepts and tools to specify such profiles and especially to identify the characteristic features of given documents. The specification of such profiles is a typical task of standardization bodies or application domain specific communities.
One lesson learned from MDA approaches is that reverse engineering from a PSM to a PIM is eased by the inclusion of trace information from the PIM→PSM mapping in the PSM artefact. The introduction of such “feature annotations” in document formats would help a lot to achieve interoperability with respect to a given PIM document metamodel. Unfortunately, neither OOXML nor ODF supports such annotations. Feature annotations are not necessary if a document format supports a feature in a native and unambiguous way. If a feature has to be “implemented” in a document format, i.e. composed of other, native features, an associated annotation is necessary to detect the semantics of the implementation during the feature detection process. A typical interoperability profile for OOXML and ODF will probably consist of features that are natively supported by both document formats.
The introduction of an object based or even object oriented document metamodel would go one step beyond the introduction of a feature based metamodel. An object based metamodel allows an object (document) to be defined as a set of objects (parts) together with the operations used on these objects. Such a document can be stored following the taxonomy of the object model and easily be mapped to existing document formats such as OOXML or ODF. Again, these ideas go far beyond the scope of the TransDok project.
2.5 Tools and Languages
The project develops a prototype implementation of the approach presented in this paper, strictly using standardized XML technologies such as ISO Schematron (18), XProc (22) and XSLT (23).
2.5.1 Document Packages
ODF and OOXML are both file formats that use ZIP‐archives as containers. These so‐called packages contain sets of XML files. In order to check the profile conformance of a document, the content of the corresponding container has to be analysed. Common XML technologies were developed to operate on single XML instance documents. Consequently, validating a full package requires validating the multiple XML documents contained in it. In order to create an overall cumulated validation result, sequences of validation and transformation steps have to be performed. The XML pipeline language XProc (22), which has been a W3C recommendation since May 2010, allows the composition of such processes.
Even though an in‐place validation of ODF and OOXML is possible by using custom URI resolvers, e.g. ones supporting the non‐standardized JAR URL syntax introduced by Oracle (24), the project used XProc to transform packages into a simple “envelope format”, very much like a suggestion made by Rick Jelliffe on the ISO/IEC JTC 1/SC 34 mailing list:
“I think that all that is needed is a simple vocabulary with zip:archive, zip:folder and zip:entry. Non‐XML files could have an empty file with their name, to allow validation that a link points to the appropriate media etc.” (25)
This flat representation of a package, i.e. a single XML instance document, has the advantage that common XML transformation and validation technologies can easily be adopted.
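The idea of such an envelope can be sketched with the Python standard library instead of XProc, which is a substitution made purely for illustration. The namespace URI bound to the `zip` prefix, the `name` attribute and the rule that decides which entries are parsed as XML are assumptions; only the element names zip:archive, zip:folder and zip:entry come from the quoted suggestion.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

ZIP_NS = "http://example.org/ns/zip"  # hypothetical namespace URI, not standardized
ET.register_namespace("zip", ZIP_NS)


def q(local: str) -> str:
    """Qualified name in the (assumed) zip namespace, in ElementTree notation."""
    return f"{{{ZIP_NS}}}{local}"


def find_or_create_folder(parent: ET.Element, name: str) -> ET.Element:
    """Return the zip:folder child with the given name, creating it if needed."""
    for child in parent.findall(q("folder")):
        if child.get("name") == name:
            return child
    return ET.SubElement(parent, q("folder"), name=name)


def package_to_envelope(data: bytes) -> ET.Element:
    """Flatten a ZIP package into one XML tree of folders and entries."""
    archive = ET.Element(q("archive"))
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            *folders, filename = info.filename.split("/")
            parent = archive
            for folder in folders:
                parent = find_or_create_folder(parent, folder)
            entry = ET.SubElement(parent, q("entry"), name=filename)
            # Assumption: entries are recognized as XML by their file extension.
            if filename.endswith((".xml", ".rels")):
                try:
                    entry.append(ET.fromstring(zf.read(info)))
                except ET.ParseError:
                    pass  # not well-formed: keep the entry empty
    return archive


def build_demo_package() -> bytes:
    """Create a tiny in-memory package with made-up content for the demo."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("mimetype", "application/vnd.oasis.opendocument.text")
        zf.writestr("content.xml", "<content/>")
        zf.writestr("META-INF/manifest.xml", "<manifest/>")
        zf.writestr("Pictures/logo.png", b"\x89PNG")
    return buffer.getvalue()


envelope = package_to_envelope(build_demo_package())
print(ET.tostring(envelope, encoding="unicode"))
```

Non‐XML entries such as the picture appear as empty zip:entry elements, so a rule can still check that a link inside the content points to an existing package member.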
3 Methodology
The specific scope of the TransDok project is the application of document profiling ideas to typical documents found in the German public sector. For this reason, interviews and a workshop with representatives of German municipalities, federal states and ministries were conducted in the first phase of the project. Typical document types identified in this phase are applications, minutes, offers, letters, invitations etc. Unfortunately, the number of documents submitted to the project was too small for statistical analyses. Surprisingly, the majority of the documents were stored in Microsoft’s old binary formats; only a few documents use OOXML or ODF. For this reason, an Internet search (crawling) for the identified document types in specific German domains was carried out to retrieve a sufficient number of documents.
These documents have been analysed with respect to a subset of important document features. The tools developed in the project support this analysis using associated feature lists. In addition, they support the definition and inspection of profiles. One major finding of the technical work in the project was that the mathematical validation of the idea of using feature based profiles to support document conformance, portability and interoperability is of high importance. Feature based profiles can only be used if they allow different document types to be separated based on characteristic features. Consequently, it seems meaningful to express the membership of a document in a profile as a statistical likelihood instead of a binary decision. More details about this approach are given in section 5.
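What a likelihood based membership could look like is sketched below in Python. This is not the project's method (that is the subject of section 5); it assumes a naive model in which every feature occurs independently with a profile specific probability, and the profile names and probabilities are made up for the example.

```python
import math


def log_likelihood(doc_features: set, profile: dict) -> float:
    """Log-likelihood of a feature set under independent per-feature probabilities.

    `profile` maps a feature name to the assumed probability (strictly between
    0 and 1) that a document of this profile contains the feature.
    """
    total = 0.0
    for feature, p in profile.items():
        total += math.log(p if feature in doc_features else 1.0 - p)
    return total


def membership(doc_features: set, profiles: dict) -> dict:
    """Normalize the likelihoods over all profiles into membership scores."""
    logs = {name: log_likelihood(doc_features, prof) for name, prof in profiles.items()}
    top = max(logs.values())  # subtract the maximum for numerical stability
    weights = {name: math.exp(value - top) for name, value in logs.items()}
    norm = sum(weights.values())
    return {name: weight / norm for name, weight in weights.items()}


# Made-up probabilities, for illustration only.
PROFILES = {
    "minutes": {"tables": 0.8, "change tracking": 0.7, "mail merge": 0.05},
    "letter": {"tables": 0.2, "change tracking": 0.1, "mail merge": 0.6},
}

scores = membership({"tables", "change tracking"}, PROFILES)
print(scores)
```

A graded score of this kind makes visible how well two document types are separated by the chosen features: if the scores of different profiles stay close for most documents, the feature list is not characteristic enough.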
3.1 Definition of Document Features
The definition of document features in the TransDok project has been done in three steps. In a first step, the documents that were identified and submitted in the interviews and the workshop were analysed to identify domain specific features. For example, official minutes have to support features such as headings containing text and graphics, tables, change tracking or digital signatures. In a second step, these features were compared with the features identified in ISO/IEC TR 29166. As a result, the domain specific features were mapped to associated document features that are supported by OOXML and/or ODF. In the third step, associated detection functions for these “feature candidates” were defined using XML technologies. The set of feature candidates was used to define the feature list, which in turn was used as one input for the Feature List Generator. Details are explained in section 4.
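A feature list with per‐format detection functions might be sketched as follows. The project defines its detection functions with XML technologies such as Schematron; the Python version below is a stand‐in for illustration, the feature names are taken from Table 2, and the detection rules are deliberately simplified (for example, bold text in ODF is only recognized through an explicit fo:font-weight="bold" style property in the inspected XML part).

```python
import xml.etree.ElementTree as ET

NS = {
    "table": "urn:oasis:names:tc:opendocument:xmlns:table:1.0",
    "style": "urn:oasis:names:tc:opendocument:xmlns:style:1.0",
    "fo": "urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0",
    "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
}


def clark(prefix: str, local: str) -> str:
    """Build a '{namespace}local' name as used by ElementTree."""
    return f"{{{NS[prefix]}}}{local}"


def odf_has_table(root: ET.Element) -> bool:
    return next(root.iter(clark("table", "table")), None) is not None


def ooxml_has_table(root: ET.Element) -> bool:
    return next(root.iter(clark("w", "tbl")), None) is not None


def odf_has_bold(root: ET.Element) -> bool:
    # Simplification: finds a bold style definition, not its actual use.
    return any(
        props.get(clark("fo", "font-weight")) == "bold"
        for props in root.iter(clark("style", "text-properties"))
    )


def ooxml_has_bold(root: ET.Element) -> bool:
    # Simplification: <w:b/> counts as bold unless w:val switches it off.
    return any(
        b.get(clark("w", "val"), "true") not in ("false", "0", "off")
        for b in root.iter(clark("w", "b"))
    )


# Hypothetical feature list: feature name -> detection function per format.
FEATURE_LIST = {
    "tables": {"odf": odf_has_table, "ooxml": ooxml_has_table},
    "bold text": {"odf": odf_has_bold, "ooxml": ooxml_has_bold},
}


def detect_features(root: ET.Element, fmt: str) -> set:
    """Return the names of all listed features detected in the XML tree."""
    return {name for name, detectors in FEATURE_LIST.items() if detectors[fmt](root)}
```

Applied to the main content part of a document, `detect_features` yields the feature set that a profile check would then compare against the features required or allowed by a profile.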
3.2 Feature Based Profile Definition
In order to formalize the profile definition and validation of documents, some corresponding mathematical artefacts have to be defined. These definitions are based on the work presented in (26).
Let ( ) and ( ) be defined as the sets of all conformant documents according to the conformance definitions in ODF respectively in OOXML.
A standard validator is a function