Electronic Document Technology Standards
Erik Wilde, UC Berkeley School of Information
E-Courts 2008, Las Vegas, December 9, 2008

Abstract: PDF, PDF/A, OOXML, OpenDocument: what is the alphabet soup? In recent years technologists have been attempting to make electronic documents more transportable across systems and displays, as well as improving their usability. This session will explain these various document formats and how your court can use the technology to improve data capture, display, and information security.

This work is licensed under a CC Attribution 3.0 Unported License

About this Presentation

Outline

1. About this Presentation
   1. About Me
   2. About ISD
2. Document Standards
   1. Application-Independent Formats
   2. Application-Specific Formats
3. Document Security
   1. One-Way Function
   2. Digital Signature

About Me

- Computer Science at Technical University of Berlin (TUB) (88-91)
- Ph.D. at ETH Zürich (92-97)
- Post-Doc at ICSI, Berkeley (97/98)
- Various activities back in Switzerland (98-06)
  - teaching at ETH Zürich and FHNW
  - working as independent consultant (training, courses, consulting)
  - research in various XML-related areas
- Professor at the School of Information (since Fall 2006)
- Technical Director of the Information and Service Design (ISD) program

About ISD

Information and Service Design (ISD)

- Part of UC Berkeley's School of Information
- Connecting our students with real-world scenarios and projects
  - "Building Stuff That Actually Works"
  - getting involved in project management and associated challenges
  - understanding the real-world challenges of information modeling
- Focus on open information systems and open information access
  - "usability" and "accessibility" should become terms beyond the UI realm
- Example areas of ISD interest
  - e-Books beyond "iTunes for books": open formats, flexible reuse
  - open data for field researchers: sharing information as simply as possible
  - location on the Web: how to turn the Web into a location-aware system

Information-Intensive Applications

- Traditional enterprise IT solutions have limits
  - built for long life-cycles of deployed system architectures
  - built for integration of existing systems into a unified landscape
- Many enterprise IT solutions cannot keep up very well
  - by definition, they never completely fail
  - they dictate the shape and direction of information flows
- The Web is by far the biggest information system that ever existed
  - built around an astonishingly primitive data model
  - the simplicity is not a deficiency, it is a feature
  - everybody can cooperate as long as there is minimal agreement
  - the Web's architectural principles can be reused for enterprise IT

Project: Environmental Data

- Government agencies collect and manage a lot of environmental data
  - some of it is accessible in historical or current archives
  - some of it is permanently produced by sensors
- Large-scale data aggregation presents various challenges
  - implementation issues of sensor deployment and management
  - organization issues of classifying and grouping sensors
  - access issues of being able to access subsets of the available data
  - policy issues of sensitive data and possible access restrictions
- Web architecture presents a proven path for large-scale systems
  - built on loose coupling and cooperation rather than integration
  - built on a different architecture than traditional enterprise IT

Project: Justice and the Criminal Record

- Criminal records are important for background checks
  - companies collect information and act as re-sellers
  - there is no expiration date for this information
- Criminal record information changes in important ways
  - a new entry is important for a background check (missing it is a false negative)
  - an expunged entry is not critical for a background check (keeping it is a false positive)
  - there is little business incentive for companies to properly delete entries
- Information accessibility can introduce new challenges
  - how to hold people accountable for providing outdated data
  - how to create incentives for properly updating data

Dream Project: Services, not Sites

- Government agencies should provide services, not sites
- Sites are hard to build and hard to maintain
  - often built with specific use cases in mind
  - technology evolves and sites must be maintained to keep up
- Sites get in the way of services
  - often service access is possible only through a site
- Services provide all the necessary information
  - exposing what the public has paid for
  - not spending public money on building interfaces
- Policy issues around service design and information usage
  - information licenses must be developed to avoid data rot
  - "eat your own dogfood" is a good start, but not sufficient (tastes differ)

Document Standards

REST

- The Web is built on Representational State Transfer (REST)
  - resources are the "units of interest" in any REST design
  - peers interact by exchanging representations of resources
  - interactions can only use a small number of predefined verbs (4 in HTTP)
  - state transitions use hypertext as the engine of application state
- Documents often are the core part of a RESTful system architecture
  - the only absolute core part of REST is identification (URIs)
  - communications are often based on HTTP(S)
  - representations often use HTML or some XML vocabulary
  - representations have primacy over functions or interactions
- Names for the debate: "REST vs. SOAP" or "REST vs. WS-*"
  - this is an ongoing debate and will not go away anytime soon

Document Exchange as Business

- Interactions in traditional enterprise IT are based on integration
  - model the complete system as one big distributed program
  - implement the system using some distributed programming environment
  - programming is based on the abstraction of building one big system
- Web architecture is based on cooperation
  - there is no overarching model, there are only local models
  - peers can interact by exchanging information about resources
  - cooperation is achieved by agreeing on representations of resources
  - there should be no assumptions about availability, links can always break
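The REST ideas above (resources identified by URIs, a small fixed set of verbs, representations rather than remote function calls) can be sketched in a few lines. This is a minimal illustration of my own, not code from the talk; the class and URI names are invented for the example.

```python
# A sketch of the REST interaction style: every resource has a URI, and all
# manipulation goes through four generic verbs instead of custom functions.
class ResourceStore:
    """In-memory resources, identified by URI, exchanged as representations."""

    def __init__(self):
        self.resources = {}   # URI -> representation (here: plain strings)
        self.counter = 0

    def get(self, uri):
        # GET: retrieve a representation of the resource
        return self.resources.get(uri)

    def put(self, uri, representation):
        # PUT: create or replace the resource at a client-known URI
        self.resources[uri] = representation

    def post(self, collection_uri, representation):
        # POST: let the server mint a new URI inside a collection
        self.counter += 1
        uri = f"{collection_uri}/{self.counter}"
        self.resources[uri] = representation
        return uri

    def delete(self, uri):
        # DELETE: remove the resource
        self.resources.pop(uri, None)


store = ResourceStore()
case_uri = store.post("/cases", "<case><title>Smith v. Jones</title></case>")
print(case_uri)             # server-assigned URI: /cases/1
print(store.get(case_uri))  # the representation travels, not the object
```

Note that the client never calls anything case-specific; agreeing on the representation format (here, a toy XML snippet) is the whole contract.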

Application-Independent Formats

History of Document Interchange

- Plain text and structured text
  - only needs agreement on a common character set (e.g., ASCII or Unicode)
  - the first data formats were comma-delimited or tab-delimited structures
- SGML (Standard Generalized Markup Language) was the first open document format
- XML (Extensible Markup Language) streamlined SGML to become usable on the Web
- Document formats vs. data formats
  - data formats represent database-like structures (e.g., UML or ER)
  - document formats represent narrative document structures
  - many existing document collections use something in the middle
  - many applications need something in the middle
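The document/data distinction can be made concrete: narrative documents mix text with markup ("mixed content"), while data formats look like flattened database records. The two toy XML snippets below are my own illustration, not examples from the talk.

```python
# Document formats vs. data formats: the same ruling as narrative markup
# (mixed content) and as a record-like structure (attributes only).
import xml.etree.ElementTree as ET

document = """<ruling>
  <p>The motion is <emphasis>denied</emphasis> without prejudice.</p>
</ruling>"""

data = """<ruling motion="123" outcome="denied" prejudice="false"/>"""

doc_root = ET.fromstring(document)
para = doc_root.find("p")
# mixed content: running text interleaved with child elements
print(para.text)                   # 'The motion is '
print(para.find("emphasis").text)  # 'denied'

data_root = ET.fromstring(data)
# record-like: every value sits in a named slot, no narrative at all
print(data_root.attrib["outcome"])  # 'denied'
```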

Structured Documents

- Most real-world data is semi-structured or unstructured
  - documents use titles, paragraphs, lists, and tables
  - documents do not mark up person names, place names, …
  - Natural Language Processing (NLP) tries to extract structures
- IT people want structured data, users often don't like forms
  - building good UIs is one of the core tasks for acceptance
  - badly designed data entry is sabotaged and produces garbage
  - provide feedback about the benefits of good data
- XML is a language for building languages, but don't do it
  - XML does not define any semantics (i.e., it only defines structures)
  - XML supports semi-structured data (supporting incremental refinement)
  - vocabularies define structure and semantics of XML document types
  - vocabularies may provide/use modules, thereby allowing flexible reuse

HTML

- HTML is the standard document format on the Web
- Microformats can be used to improve document semantics
  - earlier microformats were not based on a common syntax
  - RDFa (October 2008) provides a standardized syntax
- Why HTML often is not even considered as a document format
  - designed for logical structures, so rendering depends on clients
  - designed for continuous display, so paged content is not a natural fit
  - poor print support in regular browsers (a problem of CSS and bad browser support)
- Why HTML should be considered as a document format
  - focus on content structures rather than rendering
  - easy to adapt to a wide variety of clients
  - the printing problem can be solved with custom print processing
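To show how machine-readable semantics can ride along inside ordinary HTML, here is a sketch in the spirit of microformats: the `vcard`/`fn`/`org` class names follow the hCard convention, but the tiny extractor itself is my own illustration, not a real microformat parser.

```python
# Extracting microformat-style properties from HTML class attributes.
from html.parser import HTMLParser

html_doc = """<div class="vcard">
  <span class="fn">Erik Wilde</span>
  <span class="org">UC Berkeley School of Information</span>
</div>"""

class HCardExtractor(HTMLParser):
    """Collects the text of elements whose class is a known hCard property."""

    def __init__(self):
        super().__init__()
        self.current = None     # property name currently being read
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if cls in ("fn", "org"):
            self.current = cls

    def handle_data(self, data):
        if self.current:
            self.properties[self.current] = data
            self.current = None

parser = HCardExtractor()
parser.feed(html_doc)
print(parser.properties)
# {'fn': 'Erik Wilde', 'org': 'UC Berkeley School of Information'}
```

RDFa standardizes this idea with dedicated attributes (`property`, `about`, …) instead of overloading `class`, which is why it needed a common syntax across vocabularies.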

PDF

- Portable Document Format (PDF) evolved from a printer language
  - based on PostScript, a page description language for printers
  - removed some programming features, added a lot of file format features
- Acrobat Reader as a free product made PDF successful
  - the "give away the reader, charge for the writer" strategy
- PDF has become a complex and complicated specification
  - successful commercial products add features, which add data format complexity
  - backwards compatibility almost always means that no features will be removed
  - Microsoft wants a piece of the pie with its XML Paper Specification (XPS)
- PDF 1.7 is the latest version (implemented by Acrobat 9.0)
  - published by ISO as ISO 32000-1:2008

PDF Data

- PDF has evolved into a multimedia container format
  - support for various media types such as images, audio, and video
  - PDF forms allow interactive forms to be created and filled out
  - scripting can be used to further support interactive PDF
  - extensions allow 3D models to be embedded into PDF
- Text can appear in a variety of ways
  - embedded images from scanning processes may only show text images
  - Optical Character Recognition (OCR) may result in poorly recognized characters
  - formatting software might include rendered characters (e.g., the "fi" ligature vs. "fi")
  - formatted text might use non-embedded fonts
- Rendering PDF is a challenging task
- Searching PDF might be difficult or impossible

PDF Metadata

- Metadata (data about data) is essential for document management
  - it can be managed as an integral part of documents
  - it can be managed externally by having metadata records
- External metadata allows unified rules for metadata management
  - the same metadata can be captured for all resources
  - works for resource types with no metadata capabilities (e.g., books)
- Embedded metadata creates self-contained documents
  - packaging issues become easier
  - flexible embedded metadata formats support user-defined metadata models
- PDF supports various kinds of embedded metadata
  - earlier versions had a small set of hardcoded metadata fields
  - Extensible Metadata Platform (XMP) for extensible metadata (since PDF 1.4)

PDF/X

- ISO-standardized PDF profile for pre-print document exchange
  - focus on high-fidelity rendering of PDF documents
  - color spaces must be specified (important for printing)
  - all fonts must be embedded
  - various boxes must be defined for specifying the print area
- PDF/X is not a good choice for non-production workflows
  - often very specific to one publishing workflow
  - no constraints that focus on document management properties
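XMP packets are themselves XML (RDF/XML), which is why embedded PDF metadata can be processed with ordinary XML tools. The packet below is a hand-written minimal example of what such metadata can look like (real XMP wraps values like `dc:creator` in RDF containers); it is my sketch, not an excerpt from an actual PDF.

```python
# Parsing a minimal XMP-style metadata packet with namespace-aware lookups.
import xml.etree.ElementTree as ET

xmp_packet = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="">
    <dc:title>Electronic Document Technology Standards</dc:title>
    <dc:creator>Erik Wilde</dc:creator>
  </rdf:Description>
</rdf:RDF>"""

ns = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}
root = ET.fromstring(xmp_packet)
desc = root.find("rdf:Description", ns)
print(desc.find("dc:title", ns).text)    # the embedded document title
print(desc.find("dc:creator", ns).text)  # the embedded creator field
```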

PDF/A

- ISO-standardized PDF profile for archiving PDF documents
  - focus on long-term archiving of PDF documents
  - color spaces must be specified (important for printing)
  - all fonts must be embedded
  - audio/video content and scripting are not allowed
- PDF/A is a good choice for archiving workflows
  - documents should be verified before accepting them as PDF/A
  - a minimal amount of metadata must be embedded
- PDF/A-1b only focuses on the visual appearance of a document
  - scanned pages can be contained as images
- PDF/A-1a also focuses on the content of a document
  - tagged PDF supports searching and repurposing of document contents

OpenDocument (ODF)

- Developed as the native format for OpenOffice
- Standardized by ISO as ISO/IEC 26300:2006
- The main starting point was the need for an open office format
  - Microsoft's Office products used undocumented file formats
  - document management should be based on documents, not products
- ODF's success forced Microsoft to open the Office file formats
  - in 2005, Massachusetts stated that open formats should be used for public data
  - in 2007, Massachusetts added OOXML to the list of approved formats
- Disadvantages of ODF
  - not as widely supported (but getting there)
  - currently no standardized digital signature format (expected in ODF 1.2)

OOXML

- Microsoft started OOXML as a response to ODF's challenge
- OOXML was blessed by ECMA (XPS uses the same strategy)
  - ECMA is often used as a simple first step in standardization
  - ECMA-approved specs can be fast-tracked in ISO
  - Microsoft's tactics caused a lot of controversy among experts
- OOXML is a compressed package of various resources
  - the Open Packaging Conventions (OPC) create an archive of all resources
  - OOXML is a structured archive with conventions for its contents
- Disadvantages of OOXML
  - 6,500 pages of file format specification
  - many redundancies for historical reasons (e.g., three different table models)
  - the document XML format is not easy to process

Application-Specific Formats
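Because OOXML packages follow the Open Packaging Conventions, they are ordinary ZIP archives and can be inspected with nothing more than a ZIP library. The sketch below builds a minimal, hand-made package in memory; the part names follow the real OPC/WordprocessingML layout, but the part contents are placeholders of my own, not valid OOXML markup.

```python
# An OOXML file is a ZIP archive with conventions for its contents (OPC).
import io
import zipfile

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as pkg:
    pkg.writestr("[Content_Types].xml", "<Types/>")      # declares part types
    pkg.writestr("_rels/.rels", "<Relationships/>")      # package relationships
    pkg.writestr("word/document.xml", "<w:document/>")   # the document part

with zipfile.ZipFile(buffer) as pkg:
    for name in pkg.namelist():   # the same call works on a real .docx file
        print(name)
```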

Why Use XML?

- Because you want to share data
  - share it in a format which is widely used and easy to use
  - enable others to use it on various platforms with existing tools
- Because you want to share data cheaply
  - it is easier to use XML than to invent something new
  - it is even easier to use an existing XML schema than to invent a new one
- Because you want to share data openly
  - if you invent new formats, people must process them
  - avoid applying the "security through obscurity" principle inadvertently
  - application-specific processing should be deferred to higher layers

Is XML Self-Describing?

- XML is often said to be "self-describing"
  - many people think this is the same as "self-explanatory"
  - the catch is what exactly you refer to by "describing"
- Database data cannot live without a database
  - database data is simply content; the structure is provided by a DBMS
  - XML documents have their structure encoded within them
  - compared to database data, XML in fact is "self-describing"
- What is the gap between "self-describing" and "self-explanatory"?
  - it is impossible to find out how the document could be modified
  - there are no semantics associated with either structure or content
  - so "self-describing" means you can guess a lot, but you may be wrong
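The gap between "self-describing" and "self-explanatory" is easy to demonstrate: an unknown XML document reveals its complete structure to any parser, but nothing about its meaning or its editing rules. The vocabulary below is deliberately invented nonsense for illustration.

```python
# Structure is recoverable from any XML document without a schema;
# semantics are not.
import xml.etree.ElementTree as ET

unknown = "<flurb><znop id='1'>abc</znop><znop id='2'>def</znop></flurb>"
root = ET.fromstring(unknown)

print(root.tag)                       # flurb: the name is right there...
print([child.tag for child in root])  # ['znop', 'znop']: ...and so is the nesting
# ...but nothing in the document says what a 'znop' is, whether a third
# one may be added, or whether the 'id' values must stay unique.
```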

XML is Syntax

- XML is not a binary format, it is character-based
  - "binary structures" cannot (or rather should not) be described using XML
- Multimedia formats often are binary
  - image formats such as GIF, JPEG, and PNG
  - audio formats such as MP3 and AAC
  - video formats such as MPEG4 and H.264
- But: multimedia also uses many XML formats
  - vector graphics formats such as Scalable Vector Graphics (SVG)
  - Synchronized Multimedia Integration Language (SMIL) for describing presentations

XML is Character-Based

- XML documents can use a wide array of characters
  - characters are defined by Unicode, which currently (Version 5.0) defines more than 100,000 characters (the 100,000th was added in 2005)

[The slide showed a Japanese text sample (japanese1) whose characters did not survive extraction.]
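Since XML is defined in terms of Unicode characters, documents can freely use non-Latin scripts, even in element and attribute names. The original slides demonstrated this with Japanese examples (japanese1, japanese2.xml) whose text was garbled in extraction; the snippet below is a fresh minimal stand-in of my own, not the original example.

```python
# XML names and content may use the full range of Unicode name characters.
import xml.etree.ElementTree as ET

# a document whose element names, attribute names, and content are Japanese
xml_bytes = "<文書 言語='日本語'><題>サンプル</題></文書>".encode("utf-8")
root = ET.fromstring(xml_bytes)

print(root.tag)               # 文書: element names themselves can be non-ASCII
print(root.attrib["言語"])    # 日本語
print(root.find("題").text)   # サンプル
```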

[The slide showed a second Japanese XML example (japanese2.xml): a small document with Japanese element names, attributes, and content. The original characters were garbled in extraction and cannot be recovered.]

XML is a Syntax for Trees

- Not all data is easily represented by trees
  - overlapping markup (multiple "views" of the same content)
  - graph-like structures which are less constrained than trees
- What is it that you have in your tree?
  - XML encodes a structure purely on the syntactic level
  - what the structures mean is in no way described by XML
  - XML structures must be accompanied by semantic descriptions

XML Usages

- XML can be used in different ways
  - people should be able to use your XML directly using standard tools
  - if they absolutely need a set of special tools, something is wrong
- XML is hip, so everybody wants to use it
  - many things have been created ad hoc and without much planning
  - if you start something which is XML-based, use XML responsibly
  - if you have to use some "bad XML", complain about it
- Finding the balance can be hard
  - XML is great for prototyping and experiments
  - once you decide to redesign your XML, it may be too late
  - XML documents may be short-lived, XML schemas are definitely not

Document Security

Identity

- Identity is a central hub of any IT security
  - identity is established by associating digital identities with real entities
  - identities can be grouped and they can have assigned roles
  - authentication is the process of verifying an identity claim
  - access control can be based on identities, groups, or roles
  - authorization is the process of providing access to a controlled resource
- Authentication is one of the tough problems of IT security
  - usernames and passwords are commonly used
  - additional cues (smartcards, images, biometrics) may be used
  - security questions often are a bad idea for establishing identity
- Almost all IT security revolves around some "digital identity"
  - users find many ways around inconvenient security implementations

One-Way Function

- Hashes (or message digests) are a well-known principle in computer science
  - fast to compute (the goal is to make data handling more efficient)
  - few collisions (there are always collisions because of the smaller size)
  - checksums and Cyclic Redundancy Check (CRC) are popular hashes
- One-way functions are cryptographically safe hashes
  - not just for detecting errors, but also for preventing tampering
  - often referred to as a cryptographic hash or digital fingerprint
- One-way functions must satisfy some additional criteria
  - it must be very hard to find an input producing a given output
  - it must be very hard to find two inputs producing the same output (a "collision")

Essence of Data

Reducing Data

Digital Signature
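The one-way function criteria above can be seen in action with SHA-256 from Python's standard library: the digest is fast to compute and deterministic, any change to the input changes it, and recovering an input from a digest is computationally infeasible. The court-flavored sample strings are my own.

```python
# A one-way function in practice: SHA-256 as a digital fingerprint.
import hashlib

original = b"The motion is denied."
tampered = b"The motion is granted."

digest = hashlib.sha256(original).hexdigest()
print(len(digest))  # 64 hex characters: a fixed size, regardless of input length
print(digest == hashlib.sha256(original).hexdigest())  # True: deterministic
print(digest == hashlib.sha256(tampered).hexdigest())  # False: tampering detected
```

Fixed-size output is why collisions must exist in principle; the cryptographic requirement is only that nobody can find one on purpose.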


Encrypted Fingerprints

- Hashes are used to check data integrity
- One-way functions are used to check data integrity securely
  - it is not possible to reverse-engineer data for a given hash
- Signed hashes can be used to ensure data authenticity
  - if the hash sum is signed, it cannot be changed
  - if the data is changed, its hash will not match the signed hash
- Digital signatures work as long as the hash can be securely signed
  - there must be a trusted identity for verifying the hash signature

Certificate

- Certificates are digital signatures issued by a trusted party
  - most digital signatures are created with certified public keys
  - this means the digital signature is created based on a digitally signed key
- Who can you trust on the Web?
  - trust can only start to grow based on initial trust in something
  - many systems come with pre-installed trust (root certificates)
  - certificates from other issuers will cause browsers to complain
- Certificates (like domain names) are a very easy way to make money
  - in theory there are different levels of certificates with different levels of identity checking
  - in practice most sites choose the cheapest one that does not produce an error message

Creating a Digital Signature

Verifying a Digital Signature
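The "sign the fingerprint, not the document" pattern can be sketched as follows. Real digital signatures use public-key cryptography, which the Python standard library does not provide, so this sketch substitutes an HMAC over a shared secret as a stand-in for the signing step; the verification logic has the same shape. The key and document strings are invented for the example.

```python
# Signed fingerprints: hash the document, sign the hash, verify by recomputing.
import hashlib
import hmac

secret = b"court-signing-key"   # illustrative shared secret, not a real key
document = b"Judgment entered for the plaintiff."

# signer side: fingerprint the document, then sign the fingerprint
fingerprint = hashlib.sha256(document).digest()
signature = hmac.new(secret, fingerprint, hashlib.sha256).digest()

# verifier side: recompute the fingerprint and check it against the signature
def verify(doc, sig):
    fp = hashlib.sha256(doc).digest()
    expected = hmac.new(secret, fp, hashlib.sha256).digest()
    return hmac.compare_digest(expected, sig)

print(verify(document, signature))                                # True
print(verify(b"Judgment entered for the defendant.", signature))  # False
```

With a public-key scheme, `secret` splits into a private signing key and a public verification key, and a certificate is what vouches that the public key belongs to the claimed identity.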

Conclusions

- IT architecture has two major design phases
  1. modeling of information structures and business processes
  2. exposing required functionality through an interface/implementation
- Document formats are essential for information models
  - build your own model and use existing formats as guidance
  - provide implementations of the model by mapping it to (existing) formats
- Information models are the very core of many activities
  - "Getting the Job Done" requires good understanding of the job
  - short-term hacks are sufficient for activities with a short-term horizon
  - thorough analysis and understanding is required for longevity