ELAN Electronic Government and Applications

Feature Based Document Profiling - A Key for Document Interoperability?

Bibliographic information of the Deutsche Nationalbibliothek:

The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.d-nb.de.

1st edition, June 2012

All rights reserved © Fraunhofer-Institut für Offene Kommunikationssysteme FOKUS, June 2012

Fraunhofer-Institut für Offene Kommunikationssysteme FOKUS
Kaiserin-Augusta-Allee 31
10589 Berlin

Phone: +49-30-3436-7115
Fax: +49-30-3436-8000
[email protected]
www.fokus.fraunhofer.de

This work, including all of its parts, is protected by copyright. Any use beyond the narrow limits of copyright law without the written consent of the institute is prohibited and punishable by law. This applies in particular to reproduction, translation, microfilming and storage in electronic systems. The reproduction of product designations and trade names in this book does not justify the assumption that such names may be regarded as free within the meaning of trademark legislation and may therefore be used by anyone. Where this work refers directly or indirectly to laws, regulations or guidelines (e.g. DIN, VDI) or quotes from them, the institute cannot guarantee their correctness, completeness or currency.

ISBN 978-3-00-038675-6

Feature Based Document Profiling ‐ a Key For Document Interoperability?

Authors

Dr. Klaus‐Peter Eckert Fraunhofer Institut FOKUS eMail: klaus‐[email protected]

Kerstin Goluchowicz Technische Universität Berlin, Fachgebiet Innovationsökonomie eMail: kerstin.goluchowicz@tu‐berlin.de

Dr. Stephan Gauch Technische Universität Berlin, Fachgebiet Innovationsökonomie eMail: stephan.gauch@tu‐berlin.de

Björn Kirchhoff eGov Consulting and Development GmbH eMail: [email protected]


Management Summary

The working group WG5 of the ISO/IEC subcommittee SC34 “Document Description and Processing Languages” performs research about “Document Interoperability” considering open document standards such as “Open Document Format ‐ ODF” and “Office Open XML ‐ OOXML”. The TransDok project (validation and transformation of selected profiles of the document standards ISO/IEC 26300 and ISO/IEC 29500), sponsored by the German Federal Ministry of Economics and Technology, contributes to this research. It examines whether and how feature based document profiles can be defined and used as a means to identify interoperable subsets of both document standards, especially for typical documents used in the German public sector.

Utilizing the document features identified in ISO/IEC TR 29166 (1), XML schemas for the definition of document features and feature based profiles have been defined. A feature list generator has been implemented that creates a list of all features used within a document and, in addition, a list of all features including their relative and absolute occurrence in all documents contained in a given folder. The list can be used to identify those properties that are characteristic for the documents within the folder and to define an associated document profile.

The feasibility of feature based profiles to describe common properties of document types has been analysed using mathematical classification methods. These methods show that at least typical features for certain document types exist. These features can be used to define an interoperable profile or template for a document type. In case the set of features is restricted to those that are characteristic and necessary and that allow a unique translation between both document standards, an important step towards document interoperability and translation has been taken. The average accuracy of our classification algorithms reaches levels above 70%, making these approaches a viable complementary option to improve classification of documents.

If the ideas developed in the project are applied to typical document types in the German public sector, their interoperability and portability can be enhanced significantly. The integration of the feature list generator in archiving systems will enhance the likelihood of sustainable storage of documents and reduce interoperability problems significantly. The ideas developed in the project have been presented to ISO/IEC SC34 WG5 as well as to ODF plug fests. The results of the project are included in the current WG5 study period report and will probably influence the next work items in WG5.

The project underlying this report was funded by the German Federal Ministry of Economics and Technology under grant number 01FS10017. The responsibility for the content of this publication lies with the authors.


Contents

Management Summary
Contents
1 Introduction
1.1 Practical Relevance
2 State of the Art
2.1 Open Document Formats
2.1.1 Introduction to OOXML
2.1.2 Introduction to ODF
2.2 Conformity and Interoperability Definitions
2.2.1 Office Open XML
2.2.2 OpenDocument Format
2.2.3 Summary
2.3 Document Features
2.3.1 ISO/IEC TR 29166
2.4 Profiling and Document Interoperability
2.5 Tools and Languages
2.5.1 Document Packages
3 Methodology
3.1 Definition of Document Features
3.2 Feature Based Profile Definition
3.3 Profile Inspection
3.3.1 Binary Membership
3.3.2 Statistical Membership
4 Technical Details
4.1 Feature List Generator
4.1.1 Using the Feature List Generator
4.2 Profile Definition and Checking
4.2.1 Definition and Testing of a Profile
5 Profile Evaluation
5.1 Dataset and Pre-processing
5.2 Testbed Specifications
5.3 Classification Approaches
5.3.1 Fisher Exact Tests
5.3.2 Cluster Analysis and Heatmaps
5.3.3 Logistic Regression
5.3.4 Recursive Partitioning Trees
5.3.5 Neural Networks
5.3.6 Support Vector Machines
5.3.7 Discriminant Analysis
5.4 Synopsis
6 Summary
6.1 Summary of Project results
6.2 Practical Relevance
6.2.1 Technical Relevance
6.2.2 Economic Relevance
6.3 Open Issues
6.3.1 Scientific Challenges
7 References


1 Introduction

The working group WG5 of the ISO/IEC subcommittee SC34 “Document Description and Processing Languages” performs research about “Document Interoperability” considering open document standards such as “Open Document Format ‐ ODF” and “Office Open XML ‐ OOXML”. The first result of the working group is the publication of the ISO/IEC technical report TR 29166 on “Guidelines for translation between ISO/IEC 26300 and ISO/IEC 29500 document formats” in late 2011 (1). This report defines a taxonomy of document features and evaluates whether these features are supported by the two standards and whether the implementations of the features can be mapped between the standards.

The TransDok project (validation and transformation of selected profiles of the document standards ISO/IEC 26300 and ISO/IEC 29500), sponsored by the German Federal Ministry of Economics and Technology, goes one step further. The project examines whether and how feature based document profiles can be defined and used as a means to identify interoperable subsets of both document standards. After several interviews with representatives from the German public sector and a comprehensive Internet search, a set of typical document categories for the German public sector has been identified and associated documents have been gathered and analysed.

Utilizing the document features identified in ISO/IEC TR 29166, XML schemas for the definition of document features and for the definition of feature based profiles have been defined. The feature language has been applied to specify exemplary features of word processing documents utilizing XPath based detection functions for both document standards. As a next step, a feature list generator has been implemented. This generator creates a list of all features used within a document and, in addition, a list of all features including their relative and absolute occurrence in all documents contained in a given folder. The list can be used to identify those properties that are characteristic for the documents within the folder. For example, typical features for a German application form can be identified.

The feature list generator has two additional properties. First, a profile can be defined by assigning attributes like “may exist”, “must exist”, “must not exist” etc. to each feature. Second, a document can be checked for conformance to such a profile definition.

Following the idea of feature based profiles, several questions arise. Is it possible to define profiles in a way that profiles characterizing different document types are really different? Is, for example, a feature based profile for a letter different from a profile for an application, and what makes the difference? What is the likelihood that an arbitrary document conforms to a given profile? What is the likelihood that a document of a specific document type conforms to the associated profile or, the other way round, what is the likelihood that such a document does not conform to a given profile? Is my letter really a letter with respect to a letter profile? If the intersection of two profiles P and Q is not empty, is it possible to say that a document d belongs to P or to Q, and is it possible to say that the likelihood of d ∈ P is greater than the likelihood of d ∈ Q?

To answer such questions, mathematical methods have been applied to our feature based profile definitions. These methods show that at least typical features for certain document types exist. These typical features can be used to define an interoperable profile or template for a given document type. In case the set of features is restricted to those that are characteristic and necessary for the document class and that allow a unique translation between both document standards, an important step towards document interoperability and translation has been taken. Whether a feature can be translated between the two standards can be derived from the associated detection functions. If detection functions for each standard exist for a given feature, these functions can be used to define a feature translation between the standards.

This report starts with a summary of the state of the art concerning conformity and interoperability definitions for the open document standards ODF and OOXML. Section 3 explains the methodology and mathematical approaches used in the project, followed by a description of the technical details of the feature list generator in section 4. Section 5 evaluates the profile idea utilizing statistical and classification methods. An outlook concerning the practical importance of the work concludes the report.

1.1 Practical Relevance

The major goal of the TransDok project is to improve interoperability between documents implemented in ODF or OOXML, respectively, and to give guidelines on how document templates and office suites should be designed to enhance portability of documents. If the ideas developed in the project are applied to typical document types in the German public sector, the interoperability and portability of these documents can be enhanced significantly. The integration of the feature list generator in archiving systems will enhance the likelihood of sustainable storage of documents and reduce interoperability problems significantly.

The ideas developed in the project have been presented to ISO/IEC SC34 WG5 as well as to ODF plug fests. The results of the project are included in the current WG5 study period report and will probably influence the next work items in WG5. For this reason the relevance for standardisation bodies such as ISO/IEC SC34 can be considered as high.


2 State of the Art

This section gives an introduction to the history and main concepts of the Open Document Format (ODF) and Office Open XML (OOXML). It focusses on the definition of document features and the concepts for conformity, interoperability and profiling introduced in both standards.

2.1 Open Document Formats

OASIS Open Document Format ODF 1.0 (ISO/IEC 26300) and Office Open XML (ISO/IEC 29500) are both open document formats for saving and exchanging word processing documents, spreadsheets and presentations. Both formats are XML based but differ in design and scope.

OASIS ODF 1.0 (2) was published by OASIS as an OASIS standard in May 2005. The second edition of ODF 1.0 has been published by OASIS as a committee specification in July 2006 and accepted as an International Standard by ISO (ISO/IEC 26300) (3) in December 2006.

Figure 1: Evolution of ODF (February 2012)

ODF 1.1 (4) has been published as an OASIS standard in 2007 and will be published as Amendment 1 of ISO/IEC 26300:2006 (5) in 2012. ODF 1.2 has been published as an approved OASIS Standard early in 2012 (6) and will probably become a PAS (footnote 1) submission to ISO/IEC in the same year.

Office Open XML was first approved as a five‐part standard in December 2006 by the Ecma General Assembly as ECMA‐376. An updated version was published in November 2008 by ISO as ISO/IEC 29500:2008. The corresponding version, ECMA‐376 2nd edition (7), was published in December 2008. The consolidated version of OOXML including several corrigenda and amendments was published in 2011 as ISO/IEC 29500:2011 and ECMA‐376 3rd edition (8).

1 PAS ‐ Publicly Available Specification


[Figure 2 is a timeline diagram: ECMA-376 1st edition (2006), 2nd edition (2008) and 3rd edition (2011) alongside ISO/IEC 29500 Parts 1-4 in their 2008 and 2011 versions, together with the corrigenda (Cor 1) and amendments (Amd 1) issued for each part.]

Figure 2: Evolution of OOXML (February 2012)

2.1.1 Introduction to OOXML

OOXML is a four‐part standard consisting of:

1. Part 1 ‐ Fundamentals and Reference (9). This part contains the strict specification of OOXML. At the time of writing there exists no implementation of this part. It contains:
   • Conformance definitions
   • Textual descriptions of the document parts and of the document markup languages defined by the standard: WordprocessingML, PresentationML, SpreadsheetML and further supported markup languages
   • XML schemas for the document markup languages using XSD and (non‐normatively) RELAX NG
   • Several examples, tutorials and primers
   • A list of differences between this part and ECMA‐376 1st edition
2. Part 2 ‐ Open Packaging Conventions (10). This part contains:
   • A description of the Open Packaging Conventions, e.g. the package model and the physical package
   • Core properties, thumbnails and digital signatures
   • XML schemas for the OPC using XSD and (non‐normatively) RELAX NG
   • Several examples and guidelines
   • A list of differences between this part and ECMA‐376 1st edition
3. Part 3 ‐ Markup Compatibility and Extensibility (11). This part contains:
   • A description of extensions: elements and attributes which define mechanisms allowing applications to specify alternative content
   • Extensibility rules using NVDL (footnote 2)

2 NVDL ‐ Namespace‐based Validation Dispatching Language ‐ ISO/IEC 19757 (14)


4. Part 4 ‐ Transitional Migration Features (12). This part contains the transitional specification of OOXML. At the time of writing, most OOXML applications implement this part. It contains:
   • Legacy material such as compatibility settings and the graphics markup language VML
   • Textual descriptions of the document parts and of the document markup languages defined by the standard: WordprocessingML, PresentationML, SpreadsheetML and further supported markup languages, referring to Part 1 of the standard whenever appropriate
   • XML schemas for the document markup languages using XSD and (non‐normatively) RELAX NG
   • A list of differences between this part and ECMA‐376 1st edition

2.1.1.1 WordprocessingML

OOXML defines three major markup languages that have been developed rather independently. For this reason the amount of shared concepts is quite small. For example, Part 1 introduces the following WML concepts, from which a model for text documents and their features can be derived:

• Paragraphs and Rich Formatting
• Tables
• Custom Markup
• Sections
• Styles
• Fonts
• Numbering
• Headers and Footers
• Footnotes and Endnotes
• Glossary Documents
• Annotations
• Mail Merge
• Settings
• Fields and Hyperlinks

The following “smart art” diagram shows how a taxonomy for the properties of a text document can be defined using the feature definitions of OOXML Part 1.


[Figure 3 is a taxonomy diagram relating an OOXML WML document to features such as paragraphs and run formatting, run content, tables, custom markup, sections, styles (table, numbering, paragraph and run styles), style properties, fonts, numbering/lists, headers and footers, and footnotes and endnotes.]

Figure 3: Sample features of OOXML wordprocessing documents

2.1.2 Introduction to ODF

ODF 1.2 is a three‐part standard consisting of:

1. Part 1: OpenDocument Schema (13). This part defines the XML schema for office documents such as text documents, spreadsheets, charts and graphical documents like drawings or presentations. It specifies:
   • Document structure
   • Document metadata
   • Document content
   • Formatting elements
   • Data types and attributes (the major part of the specification)
   • Normative RelaxNG schema definitions
   • Guidelines
2. Part 2: Recalculated Formula (OpenFormula) Format (14). This part defines the formula language for OpenDocument documents, called OpenFormula. It specifies:
   • Evaluator types
   • Formula processing model
   • Data types to be used in formulas
   • Expression syntax
   • Standard operations and functions
3. Part 3: Packages (15). This part defines a package format for OpenDocument documents. It specifies:
   • Package types
   • Package content
   • Manifest file
   • Digital signatures
   • Metadata
   • ZIP file structure (non‐normative)

2.1.2.1 Text document

ODF defines one major markup language that covers all elements of OpenDocument documents and all attributes of these elements. For this reason a text document is not specified by a separate markup language but is a document with a body containing office text, as depicted in Figure 4.

[Figure 4 is a structure diagram: the office:body element contains one body-content element such as office:text, office:drawing, office:presentation, office:chart, office:image or office:database; the office:text element carries its attributes (office:text-attlist) together with a text-content prelude, main part and epilogue.]

Figure 4: OpenDocument text document

Typical content of a text document consists of:

• Text content such as headings, paragraphs, lists, or change tracking
• Paragraph element content such as basic text, bookmarks and references, or notes
• Text fields such as variable fields or metadata
• Text indices such as a table of contents


• Tables such as basic tables or spreadsheets
• Graphic content such as shapes, frames, animations
• Chart content
• Database front‐end content
• Form content
• Styles
• Formatting elements

From this list a taxonomy for ODF text documents can be derived. To compare and map ODF documents to similar OOXML documents and vice versa it is necessary to define a common super model of both taxonomies or to define subsets of both taxonomies whose elements can be mapped in an unambiguous way. The idea to define feature based document profiles follows the second approach.

2.2 Conformity and Interoperability Definitions

Due to the existence of the two open document formats ODF (OpenDocument Format) and OOXML (Office Open XML), many discussions have been started about

• interoperability between the standards,

• conformity of documents and

• conformity of applications such as office suites, document producers and consumers.

It is necessary to have a look at the precise definitions of these terms within the standards to be able to discuss these issues on a well‐defined basis and to come to common conclusions acceptable by the users of documents and office suites as well as by the developers of standards and office suites. The introduction of document profiles is impossible without a common understanding of these basic terms and the corresponding concepts.

The relevant definitions about standard conformity and interoperability can be retrieved from ISO/IEC 29500:2008/2011 (respectively ECMA‐376 2nd (7) and 3rd (8) editions), ODF 1.2 Approved OASIS Standard (6), the ODF 1.1 Interoperability Profile (16) and the ODF state of interoperability committee specification (17). The statements about conformity and interoperability in ISO/IEC 29500:2011 are mostly similar to those in the 2008 version.

The purpose of these sections is to provide an overview of the conformity and interoperability definitions for the two document formats. This overview helps to derive the definition of property based document profiles, which depends on the corresponding concepts in both standards.

2.2.1 Office Open XML

This section introduces excerpts from the ISO/IEC 29500:2008 and ISO/IEC 29500:2011 versions of the OOXML specifications, which have been officially published in fall 2008 and 2011, respectively.

2.2.1.1 Application Descriptions

OOXML currently does not explicitly define the term “profile”. Instead, an OOXML application can be defined as conforming to zero or more application descriptions in a particular conformance class.


The application descriptions defined within ISO/IEC 29500 are:

• Base ‐ An application conforming to this description has a semantic understanding of at least one feature within its conformance class. In addition, applications that include a user interface are strongly recommended to support all accessibility features appropriate to that user interface.
• Full ‐ An application conforming to this description has a semantic understanding of every feature within its conformance class.

2.2.1.2 Conformance Classes

The above mentioned application conformance classes must fulfil the following conditions:

• Existence of W3C XML schemas and an associated validation procedure for validating document syntax against those schemas.
• Existence of additional syntax constraints, given in written form, that could not feasibly be expressed in the schema language.
• Existence of descriptions of XML element semantics. The semantics of an XML element refers to its intended interpretation by a human being.

An application is of conformance class WML/SML/PML (footnote 3) strict/transitional if the application is a conforming application that is a consumer or producer of documents having conformance class WML/SML/PML strict/transitional. An application description should provide a machine‐processable schema, preferably using a member of the multipart standard ISO/IEC 19757 (18) that defines Document Schema Definition Languages (DSDL), such as RelaxNG and the Namespace‐based Validation Dispatching Language (NVDL).

A document conformance class refers to the appropriate W3C XML schemas and additional syntax constraints used to specify WML/SML/PML‐strict/transitional documents.

The standard assumes that additional application descriptions will be defined within the maintenance process for OOXML. It is also expected that third parties might define their own application descriptions. Application descriptions would promote interoperability between applications implementing OOXML. They would also promote interoperability between applications implementing OOXML and applications implementing other document formats such as ODF.

2.2.1.3 Summary

Summarizing, the standard states that applications can conform to application descriptions based on feature definitions and document conformance classes. The intention of an application description is to promote interoperability between different applications that share the same conformance class. Following this idea, an OOXML document profile can be defined as a set of features within a document conformance class.

3 WML ‐ Wordprocessing Markup Language; SML ‐ Spreadsheet Markup Language; PML ‐ Presentation Markup Language


It is worth mentioning that the document conformance statement has been technically refined considering OPC (footnote 4) and MCE in the first technical corrigendum (5) to ISO/IEC 29500‐1:2011 and considering VML (footnote 5) in the first technical corrigendum (19) to ISO/IEC 29500‐4:2011. Additionally, the interoperable generation and consumption of MCE extension lists has been specified in a precise way.

Part 1 of ISO/IEC 29500 defines interoperability guidelines. These guidelines state that software applications should be accompanied by documentation that describes which subset of ISO/IEC 29500 they support. The documentation should highlight any behaviour that may violate the semantics of the document’s XML elements. For all operations on XML elements defined in ISO/IEC 29500 that are implemented by an application, it has to be ensured that the semantics of those XML elements is consistent with ISO/IEC 29500. If the application moves, adds, modifies, or removes XML element instances with the effect of altering document semantics, it should declare this behaviour in its documentation.

2.2.2 OpenDocument Format

The OpenDocument specification ODF 1.2 (6) defines conformance for documents, consumers, and producers, with two conformance classes called conforming and extended conforming.

2.2.2.1 Conformance Classes

An ODF document of conformance class conforming shall be a conforming OpenDocument package and it shall conform to one of: OpenDocument Text Document, OpenDocument Spreadsheet Document, OpenDocument Drawing Document, OpenDocument Presentation Document, OpenDocument Chart Document, OpenDocument Image Document, OpenDocument Formula Document, OpenDocument Database Front End Document. Each of these document types is characterized by the existence of a corresponding child element (e.g. <office:text> for a text document) of the <office:body> element.

An ODF document of conformance class extended conforming shall be a conforming ODF extended package and may contain additional foreign elements and attributes as specified by the standard.

2.2.2.2 ODF Producer

An OpenDocument producer is a program that creates at least one conforming OpenDocument document, and that may produce conforming OpenDocument extended documents, but it shall have a mode of operation where all OpenDocument documents that are created are conforming OpenDocument documents. The program shall be accompanied by a document that defines all implementation‐defined values used by the OpenDocument producer.

An OpenDocument extended producer is a program that creates at least one conforming OpenDocument extended document, and that shall be accompanied by a document that

• defines all implementation‐defined values used by the OpenDocument extended producer and that
• defines all foreign elements and attributes used by the OpenDocument extended producer.

4 OPC ‐ Open Packaging Conventions (10); MCE ‐ Markup Compatibility and Extensibility (11)
5 VML ‐ Vector Markup Language


2.2.2.3 ODF Consumer

An OpenDocument consumer is a program that can parse and interpret OpenDocument documents according to the semantics defined by this standard that meets the following additional requirements:

• It shall be able to parse and interpret OpenDocument documents of one or more of the document types defined by the standard, but it need not interpret the semantics of all elements, attributes and attribute values.
• It shall interpret those elements and attributes it does interpret consistent with the semantics defined for the element or attribute by the standard.
• It should be able to parse and interpret conforming OpenDocument extended documents, but it need not interpret the semantics of all elements, attributes and attribute values.

2.2.2.4 Expressions and Evaluators

The ODF standard defines conformance for formula expressions and evaluators. An OpenDocument Formula Evaluator is a program that can parse and recalculate OpenDocument formula expressions. ODF distinguishes three groups of features that an evaluator may support: it shall conform to OpenDocument Formula Small Group Evaluator, OpenDocument Formula Medium Group Evaluator or OpenDocument Formula Large Group Evaluator. The three groups support formula expressions with different types and complexity together with a different number of mathematical functions. For example, the small group supports data of type text, integer and floating‐point number as well as logical values, together with a basic set of corresponding functions. The medium group supports more functions, and the large group supports complex numbers and corresponding functions. An evaluator may implement additional functions beyond those defined in ODF. It may further implement additional formula syntax, additional operations, or additional optional parameters for functions. Evaluators should clearly document their extensions in their user documentation, both online and on paper, in a manner such that users are likely to be aware when they are using a non‐standard extension.

2.2.2.5 ODF Conformance and Interoperability

The OASIS “Open Document Format Interoperability and Conformance (OIC) TC” states in its paper on ODF interoperability (17) “that conformance is the relationship between a product and a standard. A standard defines provisions that constrain the allowable attributes and behaviours of a conforming product. Some provisions define mandatory requirements, meaning requirements that all conforming products must satisfy, while other provisions define optional requirements, meaning that where applicable they must be satisfied. Conformance exists when the product meets all of the mandatory requirements defined by the standard, as well as those applicable optional requirements… A standard may define requirements for one or more conformance targets in one or more conformance classes.”

Since the capabilities of office applications extend beyond simple desktop editors and include other product categories such as web‐based editors, mobile device editors, document converters, content repositories, search and indexing engines, and other document‐aware applications, interoperability will mean different things to users of these different applications. However, focussing on office applications, interoperability consists of meeting user expectations regarding one or more of the following qualities (footnote 6) when transferring documents:

• visual appearance of the document at various levels
• structure of the document as revealed when the user attempts to edit the document
• behaviours and capabilities of internal and external links and references
• behaviours and capabilities of embedded images, media and other objects
• preservation of document metadata
• preservation of document extensions
• integrity of digital signatures and other protection mechanisms
• runtime behaviours manifest from scripts, macros and other forms of executable logic

The focus on the user’s expectations leads to the ODF interoperability model shown in Figure 5. This model, introduced in (17), defines document interoperability as the degree of analogy between the author’s intention and the reader’s perception.

• Author’s intentions
• Application A’s encoding (author)
• Document ‐ standardized storage format
• Application B’s decoding (reader)
• Reader’s perceptions

Figure 5: ODF interoperability model

The ODF 1.1 Interoperability Profile Committee Draft (16) clarifies and formalizes interpretations of the ODF 1.1 specification by creating an Interoperability Profile that adds conformance constraints to the specification. It is currently not intended by OASIS to specify profiles that restrict the ODF standard for specific application areas.

2.2.3 Summary

2.2.3.1 Conformity

Both document formats introduce conformance considering supported document types as shown in Table 1. OOXML distinguishes conformity with respect to strict and transitional markup languages. Such a distinction is not necessary for ODF. ODF conformity is based on schema validity, OOXML conformity is based on schema validity together with additional "written syntax constraints".

Table 1: Document types

ODF                   OOXML
Text                  Wordprocessing (WML)
Spreadsheet           Spreadsheet (SML)
Drawing               ‐
Presentation          Presentation (PML)
Chart                 ‐
Image                 ‐
Formula               ‐
Database front end    ‐

6 These qualities are an example for the document features discussed in section 2.3.

Application conformance is defined according to document conformance. Both formats distinguish document consumer and producer.

• A conforming consumer shall
  o OOXML: not reject any conforming documents of at least one document conformance class
  o ODF: be able to parse and interpret ODF documents of one or more of the document types
• A conforming producer shall be able to produce
  o OOXML: conforming documents of at least one document conformance class
  o ODF: at least one conforming document

Both definitions seem to be equivalent, even though they differ in wording.

2.2.3.2 Extended Conformity

In ODF, a document can contain content that is not schema valid with respect to the ODF schema definitions. An ODF document is an element of the conformance class extended conforming if the document is an element of the class conforming after removal of the non‐ODF parts. In addition, the ODF specification explicitly introduces conformance classes for formula expressions and evaluators, based on the OpenFormula specification.

In OOXML the extension mechanism MCE is described in Part 3. A document is an element of the conformance class MCE, if it satisfies the corresponding syntax constraints on elements and attributes. This definition is more restrictive than the ODF definition.

2.2.3.3 Package Conformance

In OOXML, a document is of conformance class OPC if it obeys all corresponding syntactic constraints. An ODF file conforms to part 3 of the ODF specification if it is a zip‐file satisfying the corresponding constraints.

2.2.3.4 Profiling and Interoperability

OOXML introduces the concept of an application description without elaborating it within the current version of the standard. Application descriptions should be used to refine the standard and to improve interoperability between different implementations as well as between OOXML and ODF.

ODF addresses interoperability issues in the OIC TC and publishes papers about interoperability (17) and an interoperability profile (16). The focus of ODF profiles is to improve the interoperability between ODF applications.


2.2.3.5 Conclusion

Both document standards include conformance statements for documents, document producers and document consumers. While ODF focuses mainly on syntax related criteria and schema validity, OOXML considers textual syntax constraints and semantic aspects, too. Unfortunately, these definitions are rather weak and don't allow a precise definition and validation of conformance properties. For this reason both standards require additional written documentation to be provided in case of implementation dependent solutions and any extension of the standard.

Interoperability and profiling are only tackled to a limited extent and have to be improved by both standards bodies OASIS and ISO in the future.

2.3 Document Features

There exist two major approaches for the identification of document features in ODF and OOXML. The Pentaformat (20) introduces the concept of “pattern based segmentation of structured content” to express the most used and meaningful structures of digital documents. An abstract document model has been developed using the following set of basic patterns together with five characteristics:

• A marker is an empty element, possibly enriched with attributes
• An atom is a markup unit of information
• A block contains text streams and unordered and repeated nested elements
• A record is a container of heterogeneous information, organized in a set of optional elements
• A table is an ordered list of homogeneous elements
• A container is an unordered sequence of repeatable and heterogeneous elements
• An additive context is a context where a few elements are added in depth to existing elements
• A subtractive context is a context where some elements that would normally be allowed make no sense

The five dimensions for documents are:

• Content: what conveys semantics to the document
• Presentation: what defines the visual aspect of the document
• Structure: what provides organization and links the content to all the rest
• Behaviour: what defines the dynamics of a document in an active environment
• Metadata: what describes the document independent of its content

Following the approach depicted in Figure 6 mapping strategies between different document formats have been defined.


Figure 6: The Pentaformat (footnote 7)

Although the Pentaformat allows separating different aspects, in a translation operation one of the following three situations can always occur:

1. Both formats support the same feature
2. The target format partially supports the feature
3. The target format does not support the feature

For a one way mapping either

• the feature can be translated by syntactical transformations, or
• a workaround solution has to be implemented ‐ the feature can be implemented using a combination of different features of the target format ‐ or
• the feature cannot be translated at all

For roundtrip translation the Pentaformat suggests to include information about the feature representation in the source format as hidden or metadata in the target document.

2.3.1 ISO/IEC TR 29166

The ISO/IEC technical report TR 29166:2011 (1) aims at analysing ISO/IEC 26300:2006 and ISO/IEC 29500:2008 and their underlying concepts in terms of interoperability issues for a selected set of features. It analyses the way these features are implemented in both International Standards and estimates the degree of translatability between them using a table based comparison of document features and functionalities. ISO/IEC TR 29166 starts by studying common use cases to identify how the most important features and the implementation of the corresponding functionality in one document format can be represented in the other format. This is followed by a thorough review of the concepts, architectures and various features of the two document formats in order to provide a good understanding of the commonalities and differences. It is shown that functionalities can be translated with different degrees of fidelity between the two formats. ISO/IEC TR 29166 provides, for an illustrative sample of this translatability, detailed information on the extent to which selected functionalities can be translated.

7 Excerpt from (21)

[Figure 7 depicts the document properties distinguished in ISO/IEC TR 29166: presentation instructions, document content, dynamic content, metadata, annotations and security, and document parts.]

Figure 7: Document properties in ISO/IEC TR 29166

The document properties introduced in the report and depicted in Figure 7 can be compared with the five dimensions of the Pentaformat. In detail they are defined as follows:

• Presentation instructions include all layout and presentation related information such as fonts, spacing, margins, colours, paper layout and settings, and animation in office documents.
• Document content covers all properties of content (such as text, graphics and formulas) defined directly by the author of a document.
• Dynamic content covers all aspects of automatically generated content including calculations or form functionalities such as fields, generated tables or dynamic references.
• Metadata cover all information apart from the core document content. Metadata are used to describe meta information about the document such as the generator, version, authors and to ensure the accessibility of documents, for instance by using certificates.
• Annotation and security covers all aspects of annotations used in a document including comments, change tracking, collaborative functions and security features such as encryption and access control.
• Document parts cover all aspects (editing semantics) of structural document properties such as paragraphs, headings, headers, footers, tables, lists, footnotes, indices and captions.


Figure 8: Document features and functionality in ISO/IEC TR 29166

The report identifies specific features that are used to implement the six document properties for the three document types: wordprocessing, presentation and spreadsheet documents. Selected features and functions identified for wordprocessing documents are assembled in Table 2 below. This assembly is used as a starting point for the definition of feature based profiles as described in section 3.1.

Table 2: Selected features and functionalities of wordprocessing documents

Feature: Text formatting
Functionalities: Bold text (font weight), Text borders, Whitespaces, Capitalization, Text colour, Complex script support, East-Asian text, Font selection, Font effects, Manual specification of run/span width, Italic text, Kerning, Text language, Enable/disable spell checking for run/span, Raised/lowered text, Strikethrough, Underline

Feature: Paragraph formatting
Functionalities: Line height, Text alignment (left/right/centered/justified), Keep paragraph on same page as following paragraph, Do not split paragraph into multiple pages, Tab stops, Hyphenation, Drop Caps, Register truth (same text line distance across multiple pages/columns), Margins, First line indent, Page/column break, Background colour, Background pattern, Background image, Embedded Images, Borders, Padding, Shadow, Line numbering, Vertical alignment (top, middle, bottom, baseline), Asian/complex text layout properties, Writing mode (lr/rl/tb)

Feature: Text frames

Feature: Lists

Feature: Header and Footer
Functionalities: Content type, Properties, Formatting

Feature: Tables
Functionalities: Table properties, Data alignment, Column settings, Row settings, Cell settings, Sub tables, Borders, Table headings

Feature: Itemization and numeration
Functionalities: Numbered Lists, Bullet lists, Nested lists

Feature: Captions

Feature: Indices
Functionalities: Table of contents, Table of figures, Table of tables, User defined indices, Bibliographies

Feature: Hyperlinks and references

Feature: Change tracking
Functionalities: Annotations, Text insertion, Text deletion, Formatting changes, Comments

Feature: Text highlighting

Feature: Metadata

Feature: Graphics
Functionalities: Embedded graphics, Vector graphics

Feature: Forms

Feature: Charts
Functionalities: Embedded data

Feature: Mail merge

2.4 Profiling and Document Interoperability

Profiling is a well‐known concept in standardisation. In case a standard is too complex or not intended to be implemented as a whole, self‐contained subsets of the standard are identified that guarantee interoperability between different implementations of the profile. How a subset of a standard that should be profiled is chosen depends on the given problem. The ISO concept database ‐ ISO 14772 (21) defines a profile (in the context of virtual reality modelling languages) as a “named collection of criteria for functionality and conformance that defines an implementable subset of the standard”.

As explained in section 2.2 neither OOXML nor ODF has introduced profiles within the standard. Nevertheless both standards refer to similar concepts that should improve interoperability of different implementations of one standard and between both standards. The definition of a common profile for two standards such as OOXML and ODF requires a common understanding of the standardized artefact, in our case of a “document”. As explained in section 2.3 such a common understanding can be achieved by the introduction of a document metamodel. The feature based approach introduced in this paper can be used to define such a metamodel.

Assuming the existence of a standardized, agreed metamodel with standardized mappings to OOXML and ODF, it is possible to define translation rules for common profiles between the two International Standards. Following the ideas of “Model Driven Architecture” (MDA), the metamodel corresponds to a “Platform Independent Model” (PIM) that can be mapped to two different “Platform Specific Models” (PSM). In case a reverse mapping from PSM to PIM exists, a PSM document can be analysed and re‐mapped to a PIM document. This PIM document can then be mapped to a PSM document defined in another “platform”. As shown in Figure 9, a feature translation can be implemented by concatenating the two operations “feature detection” and “feature implementation”.

[Figure 9 sketches feature based translation: a feature f in format A is located via feature detection (1) against the document metamodel and then realised in format B via feature implementation (2); the concatenation of both operations constitutes the feature translation.]

Figure 9: Feature based document translation

Section 3 shows how feature based document profiles can be defined and used to define translation rules utilizing feature detection and implementation functions. It has been out of the scope of the TransDok project to define a document metamodel or interoperable document profiles. The project focuses on the development of concepts and tools to specify such profiles and especially to identify the characteristic features of given documents. The specification of such profiles is a typical task of standardization bodies or application domain specific communities.

One lesson learned from MDA approaches is that reverse engineering from a PSM to a PIM is eased by the inclusion of trace information from the PIM→PSM mapping in the PSM artefact. The introduction of such “feature annotations” in document formats would help a lot to achieve interoperability with respect to a given PIM document metamodel. Unfortunately, neither OOXML nor ODF supports such annotations. Such feature annotations are not necessary in case a document format supports a feature in a native and unambiguous way. In case a feature has to be “implemented” in a document format, i.e. the feature has to be composed of other, native features, an associated annotation is necessary to detect the semantics of the implementation during the feature detection process. A typical interoperability profile for OOXML and ODF will probably consist of features that are natively supported by both document formats.

The introduction of an object based or even object oriented document metamodel would go one step beyond the introduction of a feature based metamodel. An object based metamodel allows defining an object (document) as a set of objects (parts) together with the operations used on these objects. Such a document can be stored following the taxonomy of the object model and easily be mapped to existing document formats such as OOXML or ODF. Again, these ideas go far beyond the scope of the TransDok project.

2.5 Tools and Languages

The project develops a prototypic implementation of the approach presented in this paper strictly utilizing standardized XML technologies such as ISO Schematron (18), XProc (22) and XSLT (23).

2.5.1 Document Packages

ODF and OOXML are both file formats that use ZIP archives as containers. These so-called packages contain sets of XML files. In order to check the profile conformance of a document, the content of the corresponding container has to be analysed. Common XML technologies were developed to operate on single XML instance documents. Consequently, a validation of full packages makes it necessary to validate multiple XML documents contained in the package. In order to create an overall, cumulated validation result, sequences of validation and transformation steps have to be performed. The XML pipeline language XProc (22), which has been a W3C Recommendation since May 2010, allows the composition of such processes.

Even though an in-place validation of ODF and OOXML is possible by using custom URI resolvers, e.g. ones supporting the non-standardized JAR URL syntax introduced by Oracle (24), the project used XProc to transform packages into a simple “envelope format”, very much like a suggestion made by Rick Jelliffe on the ISO/IEC JTC 1/SC 34 mailing list:

„I think that all that is needed is a simple vocabulary with zip:archive, zip:folder and zip:entry. Non‐XML files could have an empty file with their name, to allow validation that a link points to the appropriate media etc.“ (25)

This flat representation of a package, i.e. a single XML instance document, has the advantage that common XML transformation and validation technologies can easily be adopted.
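A minimal sketch of such an envelope is shown below; the element names zip:archive and zip:entry follow the quoted suggestion, while the namespace URI, the name attribute and the embedded part shown are illustrative assumptions:

  <!-- Hypothetical flattened package; namespace URI and attribute names are assumptions -->
  <zip:archive xmlns:zip="urn:example:zip-envelope" name="letter.odt">
    <zip:entry name="mimetype"/>
    <!-- non-XML entries stay empty, as suggested in the quote -->
    <zip:entry name="Pictures/logo.png"/>
    <!-- XML parts are embedded in place, so that XPath, XSLT and Schematron
         can address the whole package as one XML instance document -->
    <zip:entry name="content.xml">
      <office:document-content
          xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0">
        <!-- document content of the ODF package part -->
      </office:document-content>
    </zip:entry>
  </zip:archive>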


3 Methodology

The specific scope of the TransDok project is the application of document profiling ideas to typical documents that can be found in the German public sector. For this reason, interviews and a workshop with representatives from German municipalities, federal states and ministries have been performed in the first phase of the project. Typical document types identified in this phase are applications, minutes, offers, letters, invitations etc. Unfortunately, the number of documents submitted to the project was too small for statistical analyses. Surprisingly, the majority of the documents were stored in Microsoft’s old binary formats; only a few documents used OOXML or ODF. For this reason, an Internet search (crawling) for the identified document types in specific German domains has been performed to retrieve a sufficient number of documents.

These documents have been analysed with respect to a subset of important document features. The tools developed in the project support this analysis using associated feature lists. In addition they support the definition and inspection of profiles. One major result of the technical work in the project was the insight that a mathematical validation of the idea to use feature based profiles to support document conformance, portability and interoperability is of high importance. Feature based profiles can only be used if they allow separating different document types based on characteristic features. As a conclusion, it seems meaningful to express the membership of a document to a profile using a statistical likelihood instead of a binary decision. More details about this approach are given in section 5.

3.1 Definition of Document Features

The definition of document features in the TransDok project has been done in three steps. In a first step, the documents that have been identified and submitted in the interviews and workshop have been analysed to identify domain specific features. For example, official minutes have to support features such as headings containing text and graphics, tables, change tracking or digital signatures. In a second step, these features have been compared with the features identified in ISO/IEC TR 29166. As a result, the domain specific features have been mapped to associated document features that are supported by OOXML and/or ODF. In the third step, associated detection functions for these “feature candidates” have been defined using XML technologies. The set of feature candidates was used to define the feature list that itself was used as one input for the Feature List Generator. Details are explained in section 4.

3.2 Feature Based Profile Definition

In order to formalize the profile definition and validation of documents, some corresponding mathematical artefacts have to be defined. These definitions are based on the work presented in (26).

Let D_ODF and D_OOXML be defined as the sets of all conformant documents according to the conformance definitions in ODF and OOXML, respectively, and let D denote the set of all documents.

A standard validator is a function

  v: D → {0, 1}

that decides whether a given document d is conformant to ODF or OOXML, respectively (v(d) = 1), or not (v(d) = 0).

An interoperability subset IS for ODF and OOXML is a subset of interoperable documents for which a translation function t to a “similar” document satisfying the other standard exists:

  IS ⊆ D_ODF ∪ D_OOXML,  ∀ d ∈ IS ∩ D_ODF ∃ t: t(d) ∈ D_OOXML,  ∀ d ∈ IS ∩ D_OOXML ∃ t: t(d) ∈ D_ODF.

Analogous to the validation function, a profile validator is a function

  v_P: D → {0, 1}

that decides whether a given document d is in IS or not. All documents d ∈ IS are interoperable, i.e. translatable from one format into the other.

If the elements of an interoperability subset use similar features, these features and the corresponding functionalities can be used to define the associated feature profile P. The document features used to define the profile are derived from the feature list developed in ISO/IEC TR 29166, introduced in section 2.3.1.

Figure 10: Interoperability subset and sample feature profile

Assume F denotes the common set of all document features and associated functionalities for the document formats ODF and OOXML, as introduced in Table 2. The feature detection function

  φ: D → 2^F

returns the set of the feature names that are used by a document d, e.g. plain text, footnotes and headers. The profile definition function

  π: 2^D → 2^F

returns the feature set that is used by the documents contained in the interoperability subset IS and thereby defines the properties of the corresponding profile P:

  P := π(IS) = ⋃ {φ(d) : d ∈ IS}.

To check whether a given document d is an element of the profile P, it has to satisfy the following equation:

  φ(d) ⊆ π(IS) = P.

Using the concepts introduced above the following steps are necessary to define a document profile:

• Define a feature set F_P ⊆ F that can be used to characterize a document category like letter, invoice, application etc.
• This feature set defines an associated feature profile P.
• Provide a standard validator to check if a given document d conforms to ODF respectively OOXML: v(d) = 1?
• Provide a profile validator to check if the given document conforms to the profile definition: v_P(d) = 1?

Assume the set of all features F is defined by Table 2. To identify all features and functionalities that can be used to define the characteristic feature set of a document category, it is necessary to provide a feature detection function φ and a corresponding profile definition function π. The project has implemented exactly these kinds of functions and validators to be able to validate the concepts explained above.
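A minimal worked example may help to read these definitions; the feature names and the two documents d_1 and d_2 are illustrative assumptions, not taken from the project data:

\[
\begin{aligned}
F &= \{\text{plain text},\ \text{footnote},\ \text{image},\ \text{table}\},\\
P &= \pi(IS) = \{\text{plain text},\ \text{footnote},\ \text{table}\},\\
\varphi(d_1) &= \{\text{plain text},\ \text{table}\} \subseteq P
    \quad\Rightarrow\quad v_P(d_1) = 1,\\
\varphi(d_2) &= \{\text{plain text},\ \text{image}\} \not\subseteq P
    \quad\Rightarrow\quad v_P(d_2) = 0.
\end{aligned}
\]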

3.3 Profile Inspection

3.3.1 Binary Membership

The standards ISO/IEC 26300 as well as ISO/IEC 29500 use keywords to distinguish between different levels of obligations for normative clauses, as defined in Annex H of the ISO/IEC Directives Part 2 (27).

• may (may not)
• shall (shall not)
• should (should not)
• can (cannot)

In the definition of feature sets, this keyword concept has been applied. Table 3 illustrates the extension of a feature set to a 3-tuple F_P := (F_may, F_shall, F_shall_not), supporting a distinction between features that may, shall and shall not be used by documents of the profile P. This concept implements a kind of whitelist/blacklist approach.

Table 3: Features that shall, shall not or may be used by documents in P

Obligation                      Formalization
May use the feature(s)          f ∈ F_may imposes no constraint on φ(d)
Shall use the feature(s)        F_shall ⊆ φ(d) must hold for v_P(d) = 1
Shall not use the feature(s)    φ(d) ∩ F_shall_not = ∅ must hold for v_P(d) = 1

The enhanced profile validator function is able to consider these keywords. It is obvious that the profile definition function π has to be improved accordingly. The implementation of the definition function delegates this task to the user.
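The obligation keywords map naturally onto assertions in a rule language such as ISO Schematron, which is among the technologies used by the project (see section 2.5). The following is a minimal sketch of such a profile check over a (flattened) ODF document; the profile content ‐ a letter that shall contain paragraphs, may contain tables and shall not contain embedded images ‐ is an illustrative assumption:

  <sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron">
    <!-- namespaces of the ODF markup used in the detection expressions -->
    <sch:ns prefix="text" uri="urn:oasis:names:tc:opendocument:xmlns:text:1.0"/>
    <sch:ns prefix="draw" uri="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"/>
    <sch:pattern id="letter-profile">
      <sch:rule context="/">
        <!-- shall use: at least one paragraph must be present -->
        <sch:assert test="count(//text:p) &gt; 0">A letter shall contain at least one paragraph.</sch:assert>
        <!-- shall not use: embedded images are forbidden -->
        <sch:assert test="count(//draw:image) = 0">A letter shall not contain embedded images.</sch:assert>
        <!-- may use: tables need no assertion, they are simply permitted -->
      </sch:rule>
    </sch:pattern>
  </sch:schema>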


3.3.2 Statistical Membership

Considering a given document d and a given profile P, the question whether the document is an element of the profile seems reasonable. But what happens if the same document is also a member of a second profile Q? Is it possible that a document is a letter as well as a report? This raises several new questions:

• Is it possible to define a metric for the difference between two profiles? Profiles should be selective!
• Is it possible to estimate the difference between statistical noise and a given profile? Again, the profile should be selective.
• Is it possible to state that a document belongs to a profile or not? Or is it better to state that a document belongs to a profile with a likelihood of x%?

Because the entire idea of defining feature based profiles depends on the selectivity of profile definitions, the project has decided to spend considerable effort on the implementation of feature selection functions and to evaluate the resulting feature sets using mathematical methods. The results of this work are presented in section 5.


4 Technical Details

Profile validation consists of three steps. Since only standard conformant documents can be contained in an interoperability profile, the first step is the schema validation of a given document with an already existing ODF or OOXML validator. The second step is the generation of the list of features that are used in the document. In the profile validation step the feature list is compared to the profile and a validation result is returned.

It has to be noted that the result of the first step is not unambiguous. As stated in section 2.2, conformance consists of schema validity as well as of syntactical and semantic constraints that are given in a non-formal, textual representation. For this reason the set of standard conformant documents in Figure 10 is not well defined. Unfortunately the interoperability subset is also not well defined, because the definition of translation rules between ODF and OOXML depends on special assumptions and a given application context. There are no standardized transformations between both formats, and probably such rules will never exist.

In conclusion, it is of great importance to have a “good” definition of feature based profiles that allows checking:

• Schema validity
• Syntactical and semantic validity
• Interoperability
• Membership to a document category

4.1 Feature List Generator

A single XML document allows generating lists of features used in a specific document by applying a suitable XSLT style sheet. Since the creation of such a style sheet is not trivial, we introduce a simplified XML language for feature definitions. First, a feature has a unique name (e.g. “footnote”, “header” etc.). This name is associated with a number of format dependent detection functions. A detection function is expressed using XPath. The following table shows some examples of features and related detection functions that can be used to verify the usage of images, tables and footnotes in an ODF or OOXML document.

Table 4: Examples of features and related detection functions

Feature     Standard         Detection function
Image       ISO/IEC 26300    //odf-draw:image
            ISO/IEC 29500    //ooxml-a:blip
Table       ISO/IEC 26300    //odf-table:table
            ISO/IEC 29500    //ooxml-w:tbl
Footnote    ISO/IEC 26300    //odf-text:note[@odf-text:note-class='footnote']
            ISO/IEC 29500    //ooxml-w:footnoteReference
…           …                …

The introduction of a description language for document features and related detection functions necessitates the creation of a compiler or converter. In our implementation a feature description document is transformed by an XSLT processing scenario. In a next step the resulting XSLT style sheet (the output of the previous transformation) can be used to calculate the list of features detected within the flattened document.

Figure 11: An example architecture for a Feature List Generator

The component for the creation of feature lists can not only be helpful in the context of document profiling. Usually the analysis of document round-trips (a common method for interoperability tests) implies a lot of manual work. A feature list generator can automate some essential checks regarding the persistence of specific characteristics of a document (e.g. is a table of contents still contained in a document after saving it with another application?).

The feature definitions follow the XML‐schema shown in Figure 12.

Figure 12: Schema definition of document features

A feature definition consists of the name of the feature and, for every standard that implements the feature, a detection function. This function is an XPath expression that defines how the feature is implemented in the standard. Because ODF as well as OOXML consist of several XML documents using their specific namespaces, the feature definition schema allows defining these namespaces and using them in the XPath expressions. The following list shows the definition of the features footnote and endnote.
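As a hedged illustration of such definitions (the project stores them in its own XML feature-definition language and compiles them to XSLT; the Python/lxml form below, including the namespace bindings assumed for the odf-text and ooxml-w prefixes, is only a sketch, not the project's listing):

    # Assumed illustration: footnote/endnote detection functions per standard,
    # evaluated directly with lxml instead of the generated XSLT style sheet.
    from lxml import etree

    NAMESPACES = {
        "odf-text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
        "ooxml-w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
    }

    FEATURES = {
        "footnote": {
            "ISO/IEC 26300": "//odf-text:note[@odf-text:note-class='footnote']",
            "ISO/IEC 29500": "//ooxml-w:footnoteReference",
        },
        "endnote": {
            "ISO/IEC 26300": "//odf-text:note[@odf-text:note-class='endnote']",
            "ISO/IEC 29500": "//ooxml-w:endnoteReference",
        },
    }

    def count_features(flattened_xml_path, standard):
        """Count every defined feature in a flattened ODF or OOXML document."""
        tree = etree.parse(flattened_xml_path)
        return {name: len(tree.xpath(xpaths[standard], namespaces=NAMESPACES))
                for name, xpaths in FEATURES.items()}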

Due to the fact that the project has been executed in Germany, the names of feature categories and features have been defined using German terms. For this reason the examples shown in the remaining part of this section use these German terms. The feature list generator supports a German-English dictionary as an additional input to translate the names between both languages.

4.1.1 Using the Feature List Generator

The feature list generator supports two operating modes. As shown in Figure 13 the features of a single document can be detected, based on a given feature definition list.

Figure 13: Feature list generator analysing a single document

The generator returns some statistical information about the document’s metadata, see Figure 14.

28 29 Technical Details

Figure 14: Metadata information

The main output is a summary of all features detected in the document as shown in Figure 15.

Figure 15: Summary of document features

In the second operation mode the feature generator can be used to analyse all documents stored in a given directory. When the analysis has finished, the generator allows selecting a single document and showing its features as shown in Figure 16 and in Figure 18.

Figure 16: Selection of documents and display of their feature lists

The generated feature report for all documents can be shown within the feature generator, as shown in Figure 17, or it can be stored as a colon separated list. This list contains, for each document and each feature, the number of occurrences within the document, as shown in Figure 19.

29 30 Technical Details

Figure 17: Feature statistics for a set of documents

Figure 18: Detailed statistics about occurrence of document features for a single document

As shown in Figure 19, the feature list contains the summary of all features in all documents. In a colon separated list the following information is provided:

• Name of the feature
  o Name of the feature related functionality (refer to Figure 8)
• Absolute number of documents using this feature/functionality
• Relative number of documents using this feature

Figure 19: Feature summary of all documents

30 31 Technical Details

These values have been used for the mathematical analysis of feature based profile definitions presented in section 5.
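A small sketch of how such a folder-level summary can be derived from the per-document feature counts (illustrative code, not the generator's implementation):

    # Illustrative aggregation of per-document feature counts into the absolute and
    # relative number of documents that use each feature/functionality.
    from collections import Counter

    def summarize(per_document_counts):
        """per_document_counts: one dict {feature name: occurrences} per document."""
        total = len(per_document_counts)
        docs_using = Counter()
        for counts in per_document_counts:
            for feature, n in counts.items():
                if n > 0:
                    docs_using[feature] += 1
        return {feature: {"absolute": docs, "relative": docs / total}
                for feature, docs in docs_using.items()}

    print(summarize([{"Fußnoten": 7, "Tabelle": 0}, {"Fußnoten": 0, "Tabelle": 3}]))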

4.2 Profile Definition and Checking

The architecture of our profile validation component is similar to the one previously presented. Instead of directly creating a validation schema for each profile, we introduced a simple XML based language which can be used to express the characterizing features. It also allows the distinction between levels of obligation (may, shall, should etc.). Consequently, in our implementation a profile is a list of allowed and possibly disallowed feature names. This decouples the definition of profiles from the standard dependent definition of feature detection functions.

Table 5: Examples of a simple profile definition

Feature     Level of Obligation
Image       shall
Header      may
Footer      may
Table       shall
Footnote    may
…           …

Whereas the feature definitions were translated to an XSLT stylesheet, profiles are going to be mapped to ISO Schematron files. ISO Schematron is a rule based validation language that can easily be integrated into an XProc pipeline.

Running this validation schema against a feature list produces a validation result. Violations of rules with the obligation levels “should” and “should not” are reported as warnings.

Figure 20: An example architecture for a profile validator

It has to be considered that this way of profiling – even though it may not be obvious – is strongly exclusion oriented. Features that are not explicitly mentioned in a profile are inherently forbidden. This has to be taken into account when translating a profile to a Schematron file. Therefore the profile to Schematron translator needs to access the feature definition list in order to create a “shall not” rule for unmentioned entries.
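The closed-world expansion can be sketched as follows (an illustration of the rule described above, not the actual profile-to-Schematron translator):

    # Illustrative closed-world expansion: every feature from the full feature
    # definition list that the profile does not mention gets a "shall not" rule
    # before the profile is translated into Schematron.

    def expand_profile(profile, all_feature_names):
        """profile: dict mapping feature names to 'may', 'shall', 'should', ..."""
        expanded = dict(profile)
        for feature in all_feature_names:
            expanded.setdefault(feature, "shall not")
        return expanded

    profile = {"Image": "shall", "Header": "may", "Table": "shall", "Footnote": "may"}
    print(expand_profile(profile, ["Image", "Header", "Table", "Footnote", "Macro"]))
    # "Macro" is not mentioned in the profile and therefore becomes "shall not"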

4.2.1 Definition and Testing of a Profile

As shown in Figure 17, the feature generator provides an overview of all features used in a set of documents. The next step in the utilization of the generator is the generation of a profile using this statistical information. As shown in Figure 21, a user can define and save a profile utilizing the statistical information and defining the levels of obligation manually.

Figure 21: Profile definition

Following Figure 13, the generated profile can be used to check the affiliation of a document with the profile definition. The profile itself is defined following the simple XML schema depicted in Figure 22 that implements the profile attributes described above.

Figure 22: Schema definition of a document profile


5 Profile Evaluation

The project tests several profile definitions to assess the extent to which XML-based profiles can be utilized to classify documents into predefined types without using any further information. We find that the average accuracy of classification algorithms reaches levels above 70%, making these approaches a viable complementary option to improve classification of documents. Due to the comparably low computational cost of these methods compared to text-mining approaches, a test array using multiple methods could be used as a pre-selection mechanism or supply additional information for multi-method approaches to classification or profiling problems.

The recent move towards non-binary, XML-based document standards has opened up new opportunities for profiling document types according to the structure of descriptive XML tags used. The objective of the project is to use and identify such XML tags, which we in the remainder of this paper will refer to as “features”, that can help to differentiate documents according to the different purposes these documents are used for, regardless of their semantic content. This approach differs from the usual text-mining approaches, which would focus on such semantic content, as it provides a clear set of potential features, thereby being robust against what is sometimes referred to as the “linguistic problem”, which is due to the imprecision of language and the difference between message and medium. A simple example to illustrate such a linguistic problem is the difference in meaning of words in different contexts, such as the word “light”. The concept of “light” characterized as the physical phenomenon of wavelengths of a certain frequency is clearly different from such concepts as the “light of knowledge” or the even more complex meaning of “light my fire”, which a naïve linguistic algorithm might contextualize as a property of the light-emitting phenomenon we call fire. Linguistic models therefore rely heavily on supporting information such as cues or thesauri of similar meaning. The size and complexity of such thesauri can reach enormous levels, which makes them computationally “costly” in terms of processing of data. Using features derived from XML tags has a number of rather convenient implications.

• First, the number of features is limited, providing researchers with the opportunity to concentrate on the selection of features that differentiate between document types or document profiles.
• Second, the data is easily extractable using rule-based algorithms such as the method developed in the project.
• Third, comparing the number of potential features n(C) to the number of words, concepts or phrases n(L) and the fact that for large datasets n(C) < n(L), the combinatorial challenge posed in linguistic models is drastically reduced.

Just to provide a perspective on this fact: The number of individual words used in this paragraph up to this sentence, n(L)=197, already exceeds the maximum number of characteristics, max(C)=191, on which the analyses in the project are based.8 The goal now is to assess if using this limited number of features can produce a viable set of document vectors to improve the chance of classifying documents. For the sake of simplicity as well as to use a larger amount of different methods we will

8 It is fair to say that the use of stopwords, extraction of descriptive word classes such as nouns or identification of phrases would reduce this number. Still, the number of individual words after applying stop words to a corpus of documents grows very fast with corpus size according to Zipf's Law (33).

limit most of our analyses to a two type setting, i.e. we will use two corpora9, each representing one distinct document type. Also, the interpretation of the classification results is easily benchmarked against an intuitive benchmark: the “proverbial coin toss”. This is a 50:50 benchmark which would represent classification by pure chance into one of the corpora. This obviously requires that the profiles or document types are disjoint and no document can belong to more than one type. As we limit our analyses to approaches using XML characteristics, we are unable to assess the potential of these approaches to outperform linguistic analyses on the same documents but rather aim to exemplify if these approaches might be used in at least a complementary fashion.

In the remaining parts of this section we will give a brief description of the collection and pre-processing of the data as well as of a testbed we constructed to assess the overall accuracy of these classification methods. This testbed method will be used to assess a number of common classification approaches, ranging from simple methods such as tests based on individual features to elaborate modelling approaches such as Neural Networks, Support Vector Machines (SVM) or flavours of Discriminant Analysis. A concluding discussion summarizes the results, provides an overall assessment of such methods for classification of document types and sketches out some potential use cases.

5.1 Dataset and Pre-processing

The document sets used for the assessment were collected from the World Wide Web using simple descriptive queries to Google as described in section 3. The resulting data was then screened to assess the validity of the data with respect to the document classes, removing documents not corresponding to the document types from the data.

A total of seven document type corpora have been constructed:

• Reports from research projects (N=761) • Proposals for research projects (N=200) • Descriptions of research projects (N=626) • Letters (private and business) (N=844) • Meeting and event protocols (N=183) • Curriculum vitae (N=1146) • Invoices (N=50)

The features of the documents were then extracted using the method described in section 4.1, resulting in files holding the total number of features found for each document processed (see Table 6). The features can be used at different levels of abstraction as described in section 2.3. We used the second level of features (functionality), giving us a total of 119 characteristics to use in our analysis.

Table 6: Example10 of raw data produced by the feature generator

Feature                 Functionality                                       Counts
Text formatting         Enable/disable spell checking for run/span          2
Paragraph formatting    Text alignment (left/right/centered/justified)      34/0/234/169
                        Line height                                         150
Header and Footer       Footer with page numbers                            1
                        Properties                                          4
                        Formatting                                          2
Table                   Table properties                                    4
                        Text alignment                                      261
                        Row settings                                        67
                        Cell settings                                       377
                        Borders                                             3
                        Background                                          371
Footnotes                                                                   7
Metadata                Application name                                    1

9 In linguistics a corpus is defined as a large and structured set of texts.
10 Test file “File 096_Pascucci_www.aep.wur.nl.docx.xml”. See also section 4.1.1.

The resulting CSV files were further processed with simple regular expressions in Perl to make them accessible to statistical software.11

Before commencing with the statistical analysis of the data it is worthwhile to take some characteristics of the distribution of the features into account, assessing which features might be suitable for the analysis. The “sensitivity” of classification schemes can be negatively impacted by rare events, such as features that are only present in very few of the documents. An example for this is the feature “text format – blinking”, a feature which, apart from being an object of frequent mockery in other contexts such as HTML, is also an enormously rarely used feature in text formats. In fact, the expected chance of finding a document with this feature is one document in 5,000 or p = 0.02%, based on our data. Naive approaches as well as some recursive approaches, i.e. those that do not take the very low probability of such events into account, could attribute a high differentiating effect to this single variable. Vice versa, features that are omnipresent (p=1) in documents, such as the characteristic “text formatting”, will hardly be a good source of differentiation for any classification scheme. Figure 23 illustrates this by ranking the characteristics in decreasing order by the share of documents they appear in. The most striking result is that a large number of features are rather rare, with nearly half the features (N=15, 42.8%) only appearing in 5% of all the documents collected. This fact as well as the log-linear relationship between rank and share of the features calls for a multivariate approach, as we will see in the simple example of testing on a per feature basis (see 5.3.1).

11 We used the statistical environment R (r‐project.org) to perform all of the data analyses.


Figure 23: Relevance of features for the characterization of documents

Apart from the distribution of identified features, the absolute number of occurrences of a feature in each document can be both a source of information as well as a source of potential error for classification algorithms. Similar to the distribution of features found in documents, some features are also log-linearly distributed in terms of how often they can be found in a document. Even though this is not true for all features, some central features that can help differentiate types if used in a binary scaling are heavily skewed to the right as well as being heavily zero inflated when used as interval scaled variables. One example is “footnotes” for the document type “report” (see Figure 24).

Figure 24: Distribution of the number of footnotes in reports

On the one hand, 70% of the report documents feature no footnotes (N=539), while methods using measures based on means as a basis for classification will be heavily influenced by large numbers of footnotes in documents. Using measures such as quartiles, mean and median to describe such a distribution illustrates this even more drastically (see Table 7). The mean is heavily influenced by the outliers in the distribution. Also, commonly applied fixes such as using methods based on the median would provide no useful information as the median is 0. We are therefore presented with two challenges using the interval scaled representation of the data. The mean is clearly not robust due to outliers; models using the median would in this case “collapse” to a binary model (footnotes vs. no footnotes). We therefore expect that models that properly acknowledge binary data as a basis of analysis will outperform those using the more information rich interval scaled data.

5.2 Testbed Specifications

Some of the methods used operate on the level of individual documents rather than comparing proportions or centrality measures for the different document classes. To assess the robustness of the classification we constructed a testbed that performs a number of steps of shrinking the number of features according to specific criteria required by a majority of classification models and in a next step randomly selects a training data set from the total population of both corpora. The training data was then used to construct the model specific classifiers, which were then used to classify the test data. Training and test data were disjoint sets, i.e. none of the cases in the training data were used in the test data. To allow for replication as well as an unattended operation of the testbed, characteristics reduction was performed in seven steps based on the training data selected for each run. In total the following cases were excluded from the testbed:

• Features that are not present in the total training data, i.e. features that are present in none of the documents.
• Features that are present for every case of the training data, such as the characteristic “text”, which was found in all the cases and therefore will not produce a benefit for accuracy.
• Features which show a high level of collinearity (above a critical number of Pearson or Spearman correlations with a coefficient of .5 to other characteristics).
• Features below a threshold t (t=10) of occurrences in the training data set, i.e. all features that appear in fewer than t documents of the whole training data set.

This approach was done on a per case basis, i.e. for each individual model run. The overall performance of the classification methods was assessed by applying the classifying function derived from the training data to predict the document type for the documents in the test data. The accuracy is the share of documents classified into the “correct” class based on the ex-ante group membership attributed in the data collection process. Each classification method was run 100 times using different training and test data, resulting in a distribution of the accuracy reached in each run. This distribution was then analysed using measures of centrality (mean and median accuracy) as well as the spread of accuracy using the variance over the individual accuracy values.12 In the next section we will use different methods of classification to check how they perform on the task of identifying document types. In the cases where we used our testbed specification we will use the document types “report” vs.

12 All accuracy distributions were tested for symmetry using Kolmogorov-Smirnov tests.


“curriculum vitae” as basis for our analysis. We will use 700 documents of each type so as to minimize the influence of the prior distribution of cases for some of the methods.13
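The testbed can be sketched roughly as follows. The project performed this analysis in R; the scikit-learn based fragment below is only an assumed, simplified equivalent (the thresholds t=10 and |r| > .5 follow the text, the feature matrix X and the 0/1 document type labels y are placeholders):

    # Simplified testbed sketch (assumed scikit-learn substitute for the R setup):
    # reduce the features on the training split, fit a classifier, score the test
    # split, and repeat to obtain a distribution of accuracy values.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    def reduce_features(X_train, t=10, r_max=0.5):
        """Indices of features that are neither absent, omnipresent, rare nor collinear."""
        n_docs = X_train.shape[0]
        present = (X_train > 0).sum(axis=0)
        kept = []
        for j in range(X_train.shape[1]):
            if not (t <= present[j] < n_docs):
                continue
            if any(abs(np.corrcoef(X_train[:, j], X_train[:, k])[0, 1]) > r_max
                   for k in kept):
                continue
            kept.append(j)
        return kept

    def testbed(X, y, make_classifier, runs=100, test_size=0.3):
        accuracies = []
        for seed in range(runs):
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, test_size=test_size, stratify=y, random_state=seed)
            cols = reduce_features(X_tr)
            clf = make_classifier().fit(X_tr[:, cols], y_tr)
            accuracies.append(clf.score(X_te[:, cols], y_te))
        acc = np.array(accuracies)
        return acc.mean(), np.median(acc), acc.var()

    # e.g. with toy data standing in for the binary feature matrix:
    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(1400, 40)).astype(float)
    y = np.repeat([0, 1], 700)                      # 0 = report, 1 = CV
    print(testbed(X, y, lambda: LogisticRegression(max_iter=1000), runs=5))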

5.3 Classification Approaches

This section comprises the main results of our work, namely the actual use and performance of different methods to classify documents into predefined types. In some of the cases [Fisher's Exact Tests (5.3.1), Cluster Analysis (5.3.2), Logistic Regression (5.3.3) and Recursive Partitioning Trees (5.3.4)] we will not use the testbed as described above. Reasons for this are provided in the individual sections. Yet, for the sake of breadth we include the methods to inform about shortcomings and potentials of these approaches. Also we will not provide detailed mathematical accounts for each model but rather point to the relevant literature as well as the implementations we used in our tests. We will, though, provide a very brief description of how these methods are usually employed.

5.3.1 Fisher Exact Tests

The first and simplest methods used in this section do not per se aim at a classification based on a multivariate approach. Rather they aim at supporting the identification of proportions of features as they occur in different document types. Both Fisher's exact test as well as the proportion test operate on a per feature basis, i.e. the test only includes information about the relative occurrence of a feature in a certain document type, i.e. in how many of the documents of a certain type the feature occurs. The methods therefore imply a mind-set where we only know about the relative occurrence of one single feature for the two types in question, being ignorant about the distribution of the other features. Yet, this is relevant as we can see if there is any chance at all that the features will help us differentiate different document types. It can be considered an easy and quick pre-test that can be enriched by the multivariate analyses conducted in later sections.

In a nutshell this means: If we do not find any significant differences using these simple bivariate tests, subsequent analyses might not be worth further effort. Using the “report” type as a reference we can now assess if there are significant differences to other types using only one category at a time. The results are visualized in Figure 25. Black cells denote significant differences (p<0.05). The number of black cells implies that there is a large number of potential features that could be used for a classifier. It has to be noted, though, that there is a large amount of collinearity between the features, which would negatively impact a classifier. Moreover, these statistics are based on the aggregate level of the document types. They must not be confused with the subsequent models which operate on a document level. Still, the results show that on an aggregate level there are indeed substantial differences between the types which warrant further investigations. Yet, the robustness of this approach might be low due to the distribution of features as discussed in section 5.2.

13 Some algorithms such as the Discriminant Analysis take into account the prior distribution, i.e. the algorithm will be sensitive to the proportion of the documents in each class. A 70:30 distribution in the test data might load the deck by taking this ratio into account. In a real life situation this might be beneficial, for instance for a use case where this distribution might hold valuable information. As we want to have a fair benchmark we try to eliminate this effect. As the selection of the test data is random we might experience slightly skewed ratios, which should ameliorate due to the repeated testing using each method.


Figure 25: Differences of feature occurrences between reports and other document types

5.3.2 Cluster Analysis and Heatmaps

Another way of pretesting for the aggregate distribution of features in the different types incorporates cluster analysis and visualization techniques such as heatmaps. Similar to the techniques used in the previous section, cluster analysis is not a method that is geared towards being used as a classification approach.14 It can be used to check how strongly a set of vectors, which in our case are the features of a document, can be used to determine how “closely related” different document types are. This goes beyond the simple bivariate correlation or testing approach used in the previous section, as cluster analysis puts a strong focus on the interrelation between the vectors as well as between the vectors and document types. In our case cluster analysis can be used in two ways. First, we can assess how the document types cluster into different categories based on the distribution of features. Second, we can assess how the features are interrelated by how they are distributed between document types. In the first sense, we can qualitatively use this method to assess if the results of a clustering based on features make intuitive sense, i.e. if a prior assumption of relationships between document types is represented in the cluster data. In the second sense we get an impression of bundles of features that differentiate between our document types. As in the case of the Fisher Exact Tests we use a matrix of shares of features as they appear in the different document classes.

Figure 26: Heatmap of document properties for different document types

We visualized the results using a heatmap in Figure 26. The heatmap visualization has some advantages in our case as it summarizes both cluster analyses, i.e. clusters of document types as well as clusters of features, and also provides a visualization of the overall distribution of the data in terms of shares of features in document types. Noteworthy in this context is the distinct clustering of

14 Cluster analysis can be used to determine the extent to which individual documents that previously have been assigned to one document type are dispersed over a number of different clusters. Yet, as the cluster analysis does not include such information we will not go into great detail in this paper. This difference has a name: methods that require a specific response variable such as document types are generally referred to as “supervised”; methods such as cluster analysis or Multidimensional Scaling that do not use such response variables are called “unsupervised”. Also, it is worthwhile to note that this is not necessarily due to the k-Means estimation used. There are k-means based classification estimators such as k-nearest neighbour. Yet, we will not use this method in this paper.

the document types that refer to research projects, more precisely, the “report”, “proposal” and “project description” document types.

This can be seen as partial evidence for our implicit hypothesis, namely that features not only allow differentiation into different document types but also that certain classes of document types cluster together. Apart from the research project based cluster we find that the document types “letter”, “CV” and “protocol” form a cluster, with “invoice” being a separate and distinct category.

5.3.3 Logistic Regression

Regression models for binary data, such as the Logit or Probit models, commonly serve a different purpose than classification and have their main application in designing and testing explanatory models, i.e. the focus for these models usually is on hypothesis testing or explaining through statistical inference. Just as cluster analysis, logistic regression is a multivariate technique, i.e. it takes into account information on more than one feature. Also, in contrast to the models discussed before, we apply it to the document micro data, i.e. we no longer use aggregate information on the document type such as the share of a feature but rather use the document type as the response variable, which we aim to explain through the independent variables (features). The formula that is estimated by applying the Logit regression to the training data is then applied to the independent variables of the test data. The result is a “prediction” of the document type in the test data. The outcome of this prediction for each document can then be compared to the group membership we assigned in the coding data.15 Applying our testbed specification to the data we find the results summarized in Table 7.

Table 7: Logistic regression

              Logit binary   Logit interval
Minimum       0,7033         0,5744
1st Quartile  0,7258         0,6592
Median        0,7378         0,6683
Mean          0,7354         0,6681
3rd Quartile  0,7422         0,6831
Maximum       0,7611         0,7011

The accuracy value, or more precisely the distribution of the shares of correctly classified documents for the 100 distinct test beds, has a mean of 0.73. This value implies that when using our logistic regression function we would correctly classify about 73% of the “reports” and “CVs” into their relevant classes. Mean and median are rather close, which means that the centrality of the distribution is good and the mean is rather robust. The accuracy values range between .70 (worst) and .76 (best).

15 More precisely, the Logit regression performs in a 0 vs. 1 way. In our context we do not distinguish between two document types per se but rather distinguish documents coded with one against “not one” or zero, i.e. one type is distinguished against a reference, which in our special case are documents from another class. Extensions of this approach could include increasing the reference to all documents that do not belong to a certain class or using the multinomial version of the Logit regression to compare more than two types. For the sake of comparability between the approaches we limit ourselves to the case of two document types.


Apart from the prediction we can also gather information about which of the features have a significant effect in explaining the difference between reports and CVs. We therefore have a look at the coefficients as well as the results of the significance tests that are performed as part of the Logit model. Reports have been coded as 0, while CVs have been coded as 1. The coefficients of one of the Logit models can be found in Table 8, starting out from the significant values marked by stars and dots in the final column. The more stars, the higher the level of significance. The coefficients can be found in the column “Estimate”. Significant coefficients with a negative sign imply that these features are a good indicator for predicting a document of the type “report”. Among those features we find “footnotes” and “ToC”, but also features such as “Format: underlined text”.

Table 8: Coefficients of a Logit‐model

Estimate Std. Error z value Pr(>|z|)

(Intercept) 1.13050 1.01307 1.116 0.264458

`Absatzformat - Blocksatz` -1.08279 0.29368 -3.687 0.000227 ***

`Absatzformat - Einzug bei erster Zeile` -1.11860 0.59757 -1.872 0.061217 .

`Absatzformat - Hintergrund` 0.63494 0.38247 1.660 0.096891 .

`Absatzformat - Initialen` 17.06461 3032.68502 0.006 0.995510

`Absatzformat - Linksbündig` -0.32495 0.52639 -0.617 0.537021

`Absatzformat - Positionsrahmen` -0.26728 0.31843 -0.839 0.401264

`Absatzformat - Positionsrahmen ausgerichtet` 0.18623 0.64709 0.288 0.773498

`Absatzformat - Rahmen` -0.79616 0.32541 -2.447 0.014421 *

`Absatzformat - Rechtsbündig` 0.33703 0.30085 1.120 0.262606

`Absatzformat - Schatten` 1.72049 1.93823 0.888 0.374722

`Absatzformat - Seitliche Begrenzung` 1.51501 0.56407 2.686 0.007234 **

`Absatzformat - Zeilenabstand` -0.17323 0.42023 -0.412 0.680181

`Absatzformat - Zentriert` -0.06091 0.39306 -0.155 0.876858

Änderung -14.11070 2094.43000 -0.007 0.994624

`Automatische Silbentrennung ` -0.29328 0.51981 -0.564 0.572618

Beschriftung -1.82729 0.89481 -2.042 0.041140 *

`Definition ignorable namespace` -0.16452 0.37288 -0.441 0.659050

Endnoten -17.83751 2056.23747 -0.009 0.993079

Formeln -18.11642 1963.69434 -0.009 0.992639

Fußnoten -2.85425 0.57583 -4.957 7.17e-07 ***

`Fussnote unter dem Text` -0.09927 1.14587 -0.087 0.930966

`Fußzeile mit Tabelle` 1.98897 1.03595 1.920 0.054865 .

Inhaltsverzeichnis -3.60634 1.18017 -3.056 0.002245 **

`Kopfzeile mit Grafik` 0.17623 0.68797 0.256 0.797830

`Kopfzeile mit Seitenangabe` 0.26641 0.56187 0.474 0.635389


`Kopfzeile mit Tabelle` -1.75654 0.86215 -2.037 0.041610 *

`Liste - Nummerierung` -0.65533 0.31897 -2.054 0.039928 *

`Liste - Punkte` 0.58956 0.31872 1.850 0.064348 .

Punktdiagramm -28.07322 2687.38320 -0.010 0.991665

`Tabelle - Geschachtelt` 0.82862 0.88899 0.932 0.351288

`Tabelle - Schatten und Hintergrundfarbe` 1.99534 1.28287 1.555 0.119857

`Tabelle - Textrichtungen` -20.52730 3337.09615 -0.006 0.995092

`Tabelle - Verankert` -1.84920 0.71559 -2.584 0.009762 **

`Tabelle - Wiederholung der Zeilenüberschrift` -0.54734 0.80411 -0.681 0.496076

`Tabelle - Zusammengehalten` 2.20296 0.86696 2.541 0.011053 *

`Textformat - Einfach unterstrichen` 1.43704 0.38902 3.694 0.000221 ***

`Textformat - Farbig hervorgehoben` -1.04637 0.47166 -2.218 0.026523 *

`Textformat - Fett` -0.08613 0.98443 -0.087 0.930281

`Textformat - Hochgestellt ` -0.35887 0.28636 -1.253 0.210117

`Textformat - Kapitälchen ` 0.13389 0.41705 0.321 0.748177

`Textformat - Kursiv` 0.26783 0.41696 0.642 0.520653

`Textformat - Theme bei Textfarbe ` -1.07933 0.33856 -3.188 0.001433 **

`Textformat - Tiefgestellt ` -0.25563 0.47732 -0.536 0.592270

`Textformat - Umrandung ` -0.53300 0.92233 -0.578 0.563340

Titel -0.54989 0.43576 -1.262 0.206984

Vektorgrafik 16.09244 1680.05532 0.010 0.992358

Verweis 0.23232 0.30671 0.757 0.448766

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
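The coefficient and significance output of Table 8 corresponds to a standard Logit fit. The project produced it in R; a statsmodels based sketch (with toy data in place of the real training split) would look roughly as follows:

    # Assumed statsmodels substitute for the R Logit fit behind Table 8.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    X_train = rng.integers(0, 2, size=(400, 5)).astype(float)  # toy binary features
    y_train = rng.integers(0, 2, size=400)                      # 0 = report, 1 = CV

    model = sm.Logit(y_train, sm.add_constant(X_train)).fit(disp=False)
    print(model.summary())   # estimates, standard errors, z values and Pr(>|z|)

    # significant negative coefficients point towards "report" (coded 0),
    # significant positive coefficients towards "CV" (coded 1)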

5.3.4 Recursive Partitioning Trees

Recursive partitioning trees follow a different approach. Rather than taking into account all the variables at once, the variables are put into a tree structure not unlike the more commonly known decision trees. Starting at the root node we can follow the tree through to one of the end nodes, which are linked to a prediction of one of the two classes. The advantage of decision trees is that they may use far fewer variables than approaches like Discriminant Analysis or SVM. The response value at an end node is equal to the class that holds the majority share of documents at that node. The disadvantage of such classification trees is that they are rather sensitive to sampling. It is therefore not useful to use our testbed specification on the partition tree method.
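For illustration, a single partitioning tree can be fitted and inspected as follows (a scikit-learn sketch with placeholder data; the project's trees were computed in R):

    # Assumed scikit-learn sketch: one partitioning tree, fitted once and printed
    # as human-readable decision rules (no testbed, see the caveat above).
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, export_text

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(200, 4)).astype(float)
    y = (X[:, 0] + rng.random(200) > 1).astype(int)   # toy labels, 0 = report, 1 = CV
    feature_names = ["Fussnoten", "Inhaltsverzeichnis", "Tabelle", "Kopfzeile"]

    tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
    print(export_text(tree, feature_names=feature_names))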


Figure 27: Partitioning tree

5.3.5 Neural Networks

Neural networks operate on a different level. While the Logit model uses the complete set of variables in one modelling function and the partition trees use a hierarchical sequence of binary decisions to classify, and both have a clear cut functional approach, neural networks are conceptually oriented to mimic characteristics of the networked structure of neurons in human brains. Just like the partition tree approaches, neural networks can cope with non-linearity. Generally, neural networks are a viable option in case of binary data and perform less well with ordinal, multinomial or interval data. Just as with partition trees, the power of neural networks lies in the interaction between variables. In contrast to partition trees, though, these are not limited to a single tree structure but rather form a network in which “neurons” can also send information back to neurons that are upstream towards the input nodes. Neural networks quickly get computationally costly due to the multitude of possible connections relative to the starting points and endpoints. We therefore use a simple version of neural networks: single-hidden-layer neural networks. In this case we construct a neural network with a start node for each of the features and two output nodes representing the two document types. Between input and output layer one hidden layer provides feature weights that transform the inputs into outputs. The network structure, the extent of possible interaction terms between the variables and the non-linearity of the problem make finding optimal weights challenging, and the weights can presently only be approximated using optimization algorithms such as the BFGS optimizer. The fact that there is no single solution to this optimization problem can lead to differences in results each time the same data is processed by the same neural network. In contrast to Logit models and Linear Discriminant Analysis approaches, Neural Networks are non-parametric, i.e. there is no distinct functional form per se. As with the other approaches we applied the model in our testbed. The results are summarized below:


Table 9: Neural networks

              nnet binary   nnet interval
Minimum       0,6789        0,5744
1st Quartile  0,7019        0,6036
Median        0,7189        0,6267
Mean          0,7166        0,6261
3rd Quartile  0,7306        0,65
Maximum       0,7478        0,6922
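Within the testbed sketched in section 5.2, a single-hidden-layer network of this kind could, for example, be represented by scikit-learn's MLPClassifier with an L-BFGS solver (only an assumed substitute for the nnet-style model behind Table 9):

    # Assumed scikit-learn stand-in for a single-hidden-layer network trained with
    # an (L-)BFGS optimizer, plugged into the testbed() helper sketched above.
    from sklearn.neural_network import MLPClassifier

    def make_nnet():
        return MLPClassifier(hidden_layer_sizes=(10,), solver="lbfgs",
                             max_iter=500, random_state=0)

    # print(testbed(X, y, make_nnet))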

5.3.6 Support Vector Machines

Support Vector Machines (SVM) are a second class of non-parametric methods. Yet, in the case of two classes Support Vector Machines have a similar approach to classification as the parametric Logit models, the crucial difference being the way the differentiating border, an n-dimensional hyperplane with each feature representing one dimension, is constructed to differentiate between the cases of one group and a reference frame or points of another group. While in the logistic regression case this hyperplane is computed to simply divide between data points, the SVM method additionally takes the relative distance of the points from this hyperplane into account by maximizing the aggregated distance between the data points of each group and this hyperplane. Just as in the Logit case, the classification errors result from the fact that such a perfectly differentiating hyperplane does not exist. The SVM approach usually results in a better classifier compared to Logit models at the cost of more computational complexity. The result of our testbed approach is summarized in Table 10.

Table 10: Support vector machines

              SVM binary   SVM interval
Minimum       0,6756       0,5811
1st Quartile  0,7022       0,6064
Median        0,7122       0,6406
Mean          0,7126       0,632
3rd Quartile  0,7231       0,6528
Maximum       0,7711       0,6767
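In the same testbed, the SVM classifier could for instance be represented by scikit-learn's SVC; a linear kernel corresponds most directly to the single separating hyperplane described above (again an assumed substitute, not the project's R implementation):

    # Assumed scikit-learn stand-in for the maximum-margin classifier; the linear
    # kernel matches the hyperplane description, other kernels are possible.
    from sklearn.svm import SVC

    def make_svm():
        return SVC(kernel="linear", C=1.0)

    # print(testbed(X, y, make_svm))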

5.3.7 Discriminant Analysis

Finally, we want to test two classification techniques from the family of discriminant analysis: a parametric version (LDA) and a non-parametric version (FDA).

5.3.7.1 Linear Discriminant Analysis

The Linear Discriminant Analysis (LDA) constructs a decision boundary that is based on the pooled covariance matrix of predictors to determine membership for a response variable. The goal is to construct a linear discriminant function that produces probabilities for each of the types or classes used in the analysis. The individual case is assigned to the class for which the probability of membership is maximized. The discriminant function is estimated based on a multivariate regression similar to the Logit model case, except that in the case of LDA a different estimator is applied.

45 46 Profile Evaluation

Table 11: Linear discriminant analysis

              LDA binary   LDA interval
Minimum       0,7111       0,6178
1st Quartile  0,7314       0,6489
Median        0,7394       0,6583
Mean          0,7373       0,6572
3rd Quartile  0,7453       0,6675
Maximum       0,7556       0,6856
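An LDA classifier with the equal priors discussed in footnote 13 could be plugged into the testbed as follows (again an assumed scikit-learn substitute):

    # Assumed scikit-learn stand-in for LDA; equal priors neutralize the influence
    # of the class proportions in the training data (cf. footnote 13).
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def make_lda():
        return LinearDiscriminantAnalysis(priors=[0.5, 0.5])

    # print(testbed(X, y, make_lda))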

5.3.7.2 Flexible Discriminant Analysis

The Flexible Discriminant Analysis (FDA) is a non-parametric version of a discriminant analysis. Rather than attempting to fit a linear boundary function using a multivariate parametric regression, the FDA uses a multivariate non-parametric regression. Similar to the partitioning tree approaches, the FDA can account for non-linearity and is more flexible than its parametric counterpart.

Table 12: Flexible discriminant analysis

              FDA binary   FDA interval
Minimum       0,7633       0,6622
1st Quartile  0,7731       0,6836
Median        0,7811       0,6906
Mean          0,7796       0,6906
3rd Quartile  0,7856       0,6975
Maximum       0,7956       0,7133

5.4 Synopsis

Overall, the methods applied have an average predictive accuracy well above the 70% level. Yet, there seems to be no method that drastically outperforms the other approaches, with the exception of the FDA, which scored highest and has the smallest variance in accuracy (see Figure 28).

Generally, the non-parametric Neural Network and SVM approaches performed the worst. This is surprising as those methods are usually referred to as very appropriate classifiers. The Logit model as well as the LDA have roughly the same accuracy values, which is partly due to the similar estimation approach, differing only in the way the error terms are assessed. The FDA performs best both in terms of accuracy and reliability. Due to the test setting, i.e. our two-group approach, some methods such as the LDA were not applied to their full potential, namely multi-group classification. Still, in the case used here it is safe to say that some structural elements seem to exist that indeed allow the classification of documents based on XML tags, even though there is room for improvement in accuracy.

Table 13: Comparison of evaluation methods

        Minimum   1st Quartile   Median   Mean     3rd Quartile   Maximum
FDA     0,7633    0,7731         0,7811   0,7796   0,7856         0,7956
LDA     0,7111    0,7314         0,7394   0,7373   0,7453         0,7556
LOGIT   0,7033    0,7258         0,7378   0,7354   0,7422         0,7611
NNET    0,6789    0,7019         0,7189   0,7166   0,7306         0,7478
SVM     0,6756    0,7022         0,7122   0,7126   0,7231         0,7711


The approaches applied in these contexts did not take into account the structure of the XML tags, such as the order in which the XML tags appear in the document. Especially non-parametric recursive methods such as partitioning trees, and even more drastically neural network approaches, would benefit from this additional information. Also, the problem of collinearity cannot be addressed using the classification scheme as is. Attempts should be undertaken to either aggregate features using cluster analysis, factor analysis etc. or to cluster the features intellectually into coherent profiles.

Figure 28: Comparison of evaluation methods


6 Summary

6.1 Summary of Project results

The TransDok project has introduced a methodology to define feature based document profiles for the two International Standards OOXML and ODF. These profiles can be used to ensure document portability between different document producers and consumers, which is a typical situation between collaborating parties. Additionally the profiles can be used to enable document interoperability between both International Standards and the supporting office suites.

The project has introduced XML schema to define document features and profiles. It developed a feature list that can be used to analyse typical documents found in the German Public Sector. A feature list generator has been implemented that creates a list of all features used in a given document or in a set of documents. Using the characteristic features of a document set, an associated profile can be defined and the membership of a document in the profile can be inspected.

The suitability of feature based profiles to describe and separate document types has been evaluated and demonstrated using statistical evaluation techniques.

Intermediate results of the TransDok project have been discussed in ISO/IEC SC34 WG5 meetings, in the German mirror committee DIN NA 043-01-34, in different ODF plugfests, at the 7th IEEE conference on “Standardisation and Innovation in Information Technology” (26) and during the eChallenges 2011 conference in Florence. The final results of the project are published in the report at hand and will be discussed again in ISO/IEC SC34 WG5 in the context of the definition of a new work item.

6.2 Practical Relevance

6.2.1 Technical Relevance

The technical results of the TransDok project can be used as input for standardisation bodies to include profile concepts in both International Standards OOXML and ODF. The results of the project are included in the current WG5 study period report and will probably influence the next work items in WG5.

Additionally the developers of document templates and document producers/consumers can benefit from the work to enhance document portability and interoperability. The feature lists developed in TransDok can be used as a starting point for the definition of a common document metamodel and translation rules between OOXML and ODF documents.

The technical results can be used as an important input for the work in the working group “Office Interoperability” (28) of the “Open Source Business Alliance” (29) that is going to improve the OOXML interoperability in LibreOffice (30).

6.2.2 Economic Relevance

The need to develop tools for checking the conformity of office documents and to improve interoperability between office applications continues to have a high priority. For this reason, the project results will be demonstrated in the “Document Interoperability Lab” at the


“Fraunhofer Centre for Interoperability” (31) to partners from government and industry to achieve and improve solutions for document portability, interoperability and conformance in joint projects.

Assuming the support of this work by DIN and ISO and the presentation of project results at international conferences, a basis will be set for the acquisition of new projects in which the results can be applied and further developed.

If the ideas developed in the project are applied to typical document types in the German public sector, the interoperability and portability of these documents can be enhanced significantly. The integration of the feature list generator into archiving systems will enhance the likelihood of sustainable storage of documents and reduce interoperability problems significantly.

6.3 Open Issues

6.3.1 Scientific Challenges

Due to continuous improvements and modifications of the considered standards and the increasing proliferation of electronic documents, new requirements for the profiling of document formats will probably arise in the near future. The development of profiles for the use of mobile devices as consumers of documents or for exchangeable applications and administrative decisions may be considered as examples.

Further scientific challenges and new business and marketing opportunities will arise in the future. The current International Standards OOXML and ODF are too complex to be completely implemented by the various vendors of office applications. Instead, more and more dedicated document producers will create application specific documents. The definition of application specific profiles for such documents will improve their portability and interoperability significantly. The definition of a common metamodel for both International Standards, or even for one standard, is a scientific as well as a commercial challenge, because the work must be adequately funded and accepted internationally. In addition, the consideration of further document standards will increase the complexity of such a metamodel.

It has been shown that it is easier to create a profile conformant document, for example using associated templates, than to check the conformity of a given document. The reason for this is that profiles are not disjoint and the membership of a document in a profile is ambiguous. The development of a reliable profile checker that is able to detect the profile having the maximum likelihood of membership seems to be a non-trivial task.


7 References

1. ISO/IEC JTC 1/SC34. Guidelines for translation between ISO/IEC 26300 and ISO/IEC 29500 document formats. ISO/IEC TR 29166:2011. [Online] December 2011. http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=45245. PDTR 29166.

2. OASIS ‐ Organization for the Advancement of Structured Information Standards. OpenDocument v1.0 Specification. http://www.oasis‐open.org/committees/download.php/12572/OpenDocument‐ v1.0‐os.pdf : s.n., May 2005.

3. ISO/IEC JTC 1/SC34. Open Document Format for Office Applications (OpenDocument) v1.0. ISO/IEC 26300:2006. [Online] 30. November 2006. http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=43485. ISO/IEC 26300:2006.

4. OASIS ‐ Organization for the Advancement of Structured Information Standards. OpenDocument v1.1 Specification. http://docs.oasis‐open.org/office/v1.1/OS/OpenDocument‐v1.1.pdf : s.n., February 2007.

5. ISO/IEC JTC 1/SC34. Open Document Format for Office Applications (OpenDocument) v1.0 ‐ Amendment 1 (ODF 1.1). [Online] 2012. http://www.iso.org/iso/iso_catalogue/.

6. OASIS ‐ Organization for the Advancement of Structured Information Standards. Open Document Format for Office Applications (OpenDocument) Version 1.2. http://docs.oasis‐ open.org/office/v1.2/OpenDocument‐v1.2.pdf : s.n., September 2011.

7. Ecma International ‐ European association for standardizing information and communication systems. Standard ECMA‐376 Office Open XML File Formats ‐ Second edition. [Online] December 2008. http://www.ecma‐international.org/publications/standards/Ecma‐376.htm.

8. —. Standard ECMA‐376 Office Open XML File Formats ‐ Third edition. [Online] June 2011. http://www.ecma‐international.org/publications/standards/Ecma‐376.htm. ECMA‐376 3rd edition, ISO/IEC 29500:2011.

9. ISO/IEC JTC 1/SC34. Office Open XML File Formats ‐‐ Part 1: Fundamentals and Markup Language Reference. ISO/IEC 29500‐1:2011. [Online] 2011. http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=59575.

10. —. Office Open XML File Formats ‐‐ Part 2: Open Packaging Conventions . ISO/IEC 29500‐2:2011. [Online] 2011. http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=59576.

11. —. Office Open XML File Formats ‐‐ Part 3: Markup Compatibility and Extensibility. ISO/IEC 29500‐ 3:2011. [Online] 2011. http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=59577.


12. —. Office Open XML File Formats ‐‐ Part 4: Transitional Migration Features . ISO/IEC 29500‐ 4:2011. [Online] 2011. http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=59578.

13. OASIS ‐ Organization for the Advancement of Structured Information Standards. OpenDocument v1.2 Specification ‐ Part 1: OpenDocument Schema. [Online] September 2011. http://docs.oasis‐open.org/office/v1.2. OpenDocument v1.2 OASIS Standard.

14. —. OpenDocument v1.2 Specification ‐ Part 2: Recalculated Formula (OpenFormula) Format. [Online] September 2011. http://docs.oasis‐open.org/office/v1.2. OpenDocument v1.2 OASIS Standard.

15. —. OpenDocument v1.2 Specification ‐ Part 3: Packages. [Online] September 2011. http://docs.oasis‐open.org/office/v1.2. OpenDocument v1.2 OASIS Standard.

16. —. ODF 1.1 Interoperability Profile, Committee Draft 03. [Online] June 2010. http://docs.oasis‐ open.org/oic/odf1.1i/v1.0/CD03/ODF1.1‐InteropProfile‐v1.0‐cd03.pdf.

17. —. The State of Interoperability v1.0, Committee Specification 01. [Online] December 2010. http://docs.oasis‐open.org/oic/StateOfInterop/v1.0/StateOfInterop.pdf.

18. ISO/IEC JTC 1/SC34. Document Schema Definition Languages ‐ DSDL. ISO/IEC 19757 ‐ DSDL. [Online] http://dsdl.org/.

19. ISO/IEC JTC 1/SC34. Office Open XML File Formats ‐‐ Part 4: Transitional Migration Features ‐ Technical Corrigendum 1. 2012. Vol. ISO/IEC JTC 1/SC 34/WG 4 N 0213.

20. Di Iorio, Angelo. Pattern‐based Segmentation of Digital Documents: Model and Implementation. Bologna : s.n., 2007.

21. ISO/IEC. ISO concept database . [Online] ISO. http://www.iso.org/iso/iso_concept_database_cdb.

22. W3C. XProc: An XML Pipeline Language. [Online] May 2010. http://www.w3.org/TR/xproc/.

23. —. XSL Transformations (XSLT) Version 1.0. [Online] November 1999. http://www.w3.org/TR/xslt.

24. Sun Microsystems / Oracle. JavaTM 2 Platform, Standard Edition, v 1.4.2. [Online] http://download.oracle.com/javase/1.4.2/docs/api/java/net/JarURLConnection.html.

25. Jelliffe, Rick. A simple method for integrating XML‐in‐ZIP formats into DSDL. [Online] 2010. http://lists.dsdl.org/dsdl‐discuss/2010‐03/0008.html.

26. Feature Driven Profiling of Open Standards for Office Applications. Kirchhoff, Björn. [ed.] IEEE Xplore. Berlin : s.n., 2011. The 7th International Conference on Standardization and Innovation in Information Technology SIIT2011.

27. ISO/IEC. ISO/IEC Directives and complementary documents. [Online] 2006. http://isotc.iso.org/livelink/livelink/fetch/2000/2489/Ittf_Home/Directives.html.


28. OSB Alliance. Specification of "Layout‐true Representation of OOXML Documents in Open Source Office Applications". [Online] December 2011. http://osb‐ alliance.com/images/stories/PDF_Files/specificationooxmlimprovements_en_v06.pdf.

29. —. Open Source Business Alliance. Open Source Business Alliance. [Online] OSB Alliance. http://osb‐alliance.com/.

30. The Document Foundation. LibreOffice. LibreOffice. [Online] The Document Foundation. http://www.libreoffice.org/.

31. Fraunhofer FOKUS. Center for Interoperability. Zentrum für Interoperabilität. [Online] Fraunhofer FOKUS. http://www.interoperability‐center.com/en/.

32. Vitali, Fabio and Marinelli, Paolo. Interoperability across different ISO/IEC file formats: the pentaformat approach. Paris : University of Bologna, 2009.

33. Krowne, Aaron. Zipf's law. Zipf's law. [Online] PlanetMath.org. http://planetmath.org/encyclopedia/ZipfsLaw.html.
