Electronic Document Technology Standards and Signatures

E-Courts 2008, Las Vegas
Erik Wilde, UC Berkeley School of Information
December 9, 2008
This work is licensed under a CC Attribution 3.0 Unported License

Abstract
PDF, PDF/A, OOXML, OpenDocument. What is the alphabet soup? In recent years technologists have been attempting to make electronic documents more transportable across systems and displays as well as improving their usability. This session will explain these various document formats and how your court can use the technology to improve data capture, display, and information security.

About this Presentation

Outline
  1. About this Presentation [6]
    1. About Me [1]
    2. About ISD [5]
  2. Document Standards [18]
  3. Document Security [7]

About Me
  Computer Science at Technical University of Berlin (TUB) (88-91)
  Ph.D. at ETH Zürich (92-97)
  Post-Doc at ICSI, Berkeley (97/98)
  Various activities back in Switzerland (98-06)
    teaching at ETH Zürich and FHNW
    working as independent consultant (training, courses, consulting)
    research in various XML-related areas
  Professor at the School of Information (since Fall 2006)
  Technical Director of the Information and Service Design (ISD) program

About ISD
Information and Service Design (ISD)
  Part of UC Berkeley's School of Information
  Connecting our students with real-world scenarios and projects
    “Building Stuff That Actually Works”
    getting involved in project management and associated challenges
    understanding the real-world challenges of information modeling
  Focus on open information systems and open information access
    “usability” and “accessibility” should become terms beyond the UI realm
  Example areas of ISD interest
    e-Books beyond “iTunes for books”: open formats, flexible reuse
    open data for field researchers: sharing information as simply as possible
    location on the Web: how to turn the Web into a location-aware system

Information-Intensive Applications
  Traditional enterprise IT solutions have limits
    built for long life-cycles of deployed system architectures
    built for integration of existing systems into a unified landscape
  Many enterprise IT solutions cannot keep up very well
    by definition, they never completely fail
    they dictate the shape and direction of information flows
  The Web is by far the biggest information system that ever existed
    built around an astonishingly primitive data model
    the simplicity is not a deficiency, it is a feature
    everybody can cooperate as long as there is minimal agreement
    the Web's architectural principle can be reused for enterprise IT

Project: Environmental Data
  Government agencies collect and manage a lot of environmental data
    some of it is accessible in historical or current archives
    some of it is permanently produced by sensors
  Large-scale data aggregation presents various challenges
    implementation issues of sensor deployment and management
    organization issues of classifying and grouping sensors
    access issues of being able to access subsets of the available data
    policy issues of sensitive data and possible access restrictions
  Web architecture presents a proven path for large-scale systems
    built on loose coupling and cooperation rather than integration
    built on a different architecture than traditional enterprise IT

Project: Justice and the Criminal Record
  Criminal records are important for background checks
    companies collect information and are re-sellers
    there is no expiration date for this information
  Criminal record information changes in important ways
    new entry: important for background check (false negative)
    expunged entry: not critical for background check (false positive)
    little business incentive for companies to properly delete entries
  Information accessibility can introduce new challenges
    how to hold people accountable for providing outdated data
    how to create incentives for properly updating data

Dream Project: Services, not Sites
  Government agencies should provide services, not sites
  Sites are hard to build and hard to maintain
    often built with specific use cases in mind
    technology evolves and sites must be maintained to keep up
  Sites get in the way of services
    often service access is possible only through a site
  Services provide all the necessary information
    exposing what the public has paid for
    not spending public money for building interfaces
  Policy issues around service design and information usage
    information licenses must be developed to avoid data rot
    “eat your own dogfood” is a good start, but not sufficient (tastes differ)

Document Standards
  1. Application-Independent Formats [10]
  2. Application-Specific Formats [6]

REST
  The Web is built on Representational State Transfer (REST)
    resources are the “units of interest” in any REST design
    peers interact by exchanging representations of resources
    interactions can only use a small number of predefined verbs (the four main ones in HTTP are GET, PUT, POST, and DELETE)
    state transitions use hypertext as the engine of application state
  Documents often are the core part of a RESTful system architecture
    the only absolute core part of REST is identification (URIs)
    communications are often based on HTTP(S)
    representations often use HTML or some XML vocabulary
    representations have primacy over functions or interactions
    there should be no assumptions about availability, links can always break

Document Exchange as Business Interactions
  Traditional enterprise IT is based on integration
    model the complete system as one big distributed program
    implement the system using some distributed programming environment
    programming is based on the abstraction of building one big system
  Web architecture is based on cooperation
    there is no overarching model, there are only local models
    peers can interact by exchanging information about resources
    cooperation is achieved by agreeing on representations of resources
  Names for the debate: “REST vs. SOAP” or “REST vs. WS-*”
  This is an ongoing debate and will not go away anytime soon

Application-Independent Formats

History of Document Interchange
  Plain text and structured text
    plain text only needs agreement on a common character set (e.g., ASCII or Unicode)
    first data formats were comma-delimited or tab-delimited structures
    SGML (Standard Generalized Markup Language) was the first open document format
    XML (Extensible Markup Language) streamlined SGML to become usable on the Web
  Document formats vs. data formats
    data formats represent database-like structures (e.g., UML or ER)
    document formats represent narrative document structures
    many existing document collections use something in the middle
    many applications need something in the middle

Structured Documents
  Most real-world data is semi-structured or unstructured
    documents use titles, paragraphs, lists, and tables
    documents do not mark up person names, place names, …
    Natural Language Processing (NLP) tries to extract structures
  IT people want structured data, users often don't like forms
    building good UIs is one of the core tasks for acceptance
    badly designed data entry is sabotaged and produces garbage
    provide feedback about the benefits of good data
  XML is a language for building languages, but don't do it
    XML does not define any semantics (i.e., it only defines structures)
    XML supports semi-structured data (supporting incremental refinement)
    vocabularies define structure and semantics of XML document types
    vocabularies may provide/use modules, thereby allowing flexible reuse

HTML
  HTML is the standard document format on the Web
    Microformats can be used to improve document semantics
    earlier microformats were not based on a common syntax
    RDFa (October 2008) provides a standardized syntax
  Why HTML often is not even considered as a document format
    designed for logical structures, so rendering depends on clients
    designed for continuous display, so paged content is not a natural fit
    poor print support in regular browsers (problem of CSS and bad browser support)
  Why HTML should be considered as a document format
    focus on content structures rather than rendering
    easy to adapt to a wide variety of clients
    printing problem can be solved with custom print processing

PDF
  Portable Document Format (PDF) evolved from a printer language
    based on PostScript, a page description language for printers
    removed some programming features, added a lot of file format features
  Acrobat Reader as a free product made PDF successful
    the “give away the reader, charge for the writer” strategy
  PDF has become a complex and complicated specification
    successful commercial products add features, which add data format complexity
    backwards compatibility almost always means that no features will be removed

PDF Data
  PDF has evolved into a multimedia container format
    support for various media types such as images, audio, and video
    PDF forms allow interactive forms to be created and filled out
    scripting can be used to further support interactive PDF
    extensions allow 3D models to be embedded into PDF
  Text can also appear in a variety of ways
    embedded images from scanning processes may only show text images
    Optical Character Recognition (OCR) may result in poorly recognized characters
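
The point that a PDF may carry its text only as scanned images can be illustrated with a rough check. The sketch below is not part of the presentation: it is a crude byte-level heuristic using only the Python standard library, and the function names (`inspect_pdf_bytes`, `looks_image_only`) and the two sample fragments are hypothetical. It assumes the file's dictionaries are stored uncompressed, which is often not true for newer PDFs, so a negative result proves little; it also says nothing about OCR quality.

```python
import re


def inspect_pdf_bytes(data: bytes) -> dict:
    """Report the header version and whether font or image markers appear.

    Assumption: dictionaries are stored uncompressed. PDF 1.5+ files may pack
    them into compressed object streams, in which case both markers can be
    missing even though the pages contain text or images.
    """
    header = re.match(rb"%PDF-(\d\.\d)", data)
    return {
        "version": header.group(1).decode("ascii") if header else None,
        "has_fonts": b"/Font" in data,
        "has_images": re.search(rb"/Subtype\s*/Image", data) is not None,
    }


def looks_image_only(data: bytes) -> bool:
    """True when image markers are present but no font resources were found."""
    info = inspect_pdf_bytes(data)
    return info["has_images"] and not info["has_fonts"]


# Hand-written fragments for illustration only, not complete PDF files.
SCANNED_SAMPLE = (
    b"%PDF-1.4\n1 0 obj\n"
    b"<< /Type /XObject /Subtype /Image /Width 10 /Height 10 >>\nendobj\n"
)
TYPED_SAMPLE = (
    b"%PDF-1.4\n1 0 obj\n"
    b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>\nendobj\n"
)

print(looks_image_only(SCANNED_SAMPLE))  # True: image marker, no font marker
print(looks_image_only(TYPED_SAMPLE))    # False: a font resource is present
```

A file flagged this way is a candidate for OCR before it can be searched or indexed. A scan that has already been through OCR usually carries fonts and a hidden text layer, so it would not be flagged, even if the recognized characters are poor.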