A Characterization of Compound Documents on the Web

Total Page:16

File Type:pdf, Size:1020Kb

A Characterization of Compound Documents on the Web A Characterization of Compound Documents on the Web Eyal de Lara†, Dan S. Wallach‡, and Willy Zwaenepoel‡ † Department of Electrical and Computer Engineering ‡ Department of Computer Science Rice University Abstract 3. For small documents, XML format produces much larger documents than OLE. For large Recent developments in office productivity suites documents, there is little difference. make it easier for users to publish rich compound documents on the Web. Compound documents ap- 4. Compression considerably reduces the size of pear as a single unit of information but may con- documents in both formats. tain data generated by different applications, such as text, images, and spreadsheets. Given the popu- 1 Introduction larity enjoyed by these office suites and the perva- siveness of the Web as a publication medium, we Productivity tools, often part of “office suites,” are expect that in the near future these compound doc- the most popular applications for creating docu- uments will become an increasing proportion of the ments. Their popularity derives, to some extent, from Web’s content. As a result, the content handled by their capability to create compound documents that servers, proxies, and browsers may change consider- include data from more than one application. ably from what is currently observed. Furthermore, The documents produced by office suites have these compound documents are currently treated as been, until recently, very unwieldy for putting on the opaque byte streams, but future Web infrastructure Web. These documents can be quite large, some- may wish to understand their internal structure to times hundreds of kilobytes, and Web browsers do provide higher-quality service. not know how to display them without first com- In order to guide the design of this future Web in- pletely downloading the document. frastructure, we characterize compound documents Recent improvements in office suites have made it currently found on the Web. Previous studies of Web easier to publish compound documents on the Web content either ignored these document types alto- by exporting browser-compatible data types. These gether or did not consider their internal structure. We improvements, coupled with the popularity of pro- study compound documents originated by the three ductivity tools, will likely have a strong impact on most popular applications from the Microsoft Of- the content of the Web and the infrastructure that fice suite: Word, Excel, and PowerPoint. Our study supports this content. Previous studies have either encompasses over 12,500 documents retrieved from overlooked compound document altogether or have 935 different Web sites. Our main conclusions are: treated them as black boxes [2, 18, 16, 1, 17]. There- fore, it becomes important to study and characterize 1. Compound documents are in general much compound documents on the Web, as they are now, larger than current HTML documents. to predict where the Web might be going, and how Web infrastructure should support compound docu- 2. For large documents, embedded objects and im- ments in the future. ages make up a large part of the documents’ For this paper, we studied compound documents size. generated by the three most popular application of 1 the Microsoft Office suite: Word, Excel, and Power- tain elements created by different applications. A Point. We chose to focus on Microsoft Office appli- compound document could, for instance, consist of cations based on two factors. First, Microsoft Office a spreadsheet and several images embedded into a is the most widely-used productivity suite. More- text document. over, a significant number of Microsoft Office docu- In the general case, every data type in a compound ments are available on the Web, enabling us to gather document (spreadsheet, text, images, sound, etc) is the data for our experiments. Second, Microsoft Of- created and managed by a different application. The fice 2000 supports two native file formats: the propri- different applications used to create the document etary OLE-based binary format and a new XML for- can be thought of as software components that pro- mat. By using Office 2000 to convert old files to the vide services that are invoked to create, edit, and dis- new XML format, we can compare the tradeoffs of play the compound document. using a proprietary binary-based file format against a modern standards-based text format, both as inter- mediate formats suitable for document editing, and as publishing formats, suitable only for reading. 2.2 Enabling technologies We downloaded over 12500 documents, compris- ing over 4 GB of data, from 935 different sites. Our 2.2.1 Component standards main results are: Compound documents result from combining the 1. Compound documents are in general much data created by two or more disjoint software com- larger than current HTML documents. The tail ponents. As a result, there is a need for a standard to of the size distribution follows the same power- govern the interactions between components. Some log distribution observed with HTML docu- of the most visible standards are COM/OLE [3, 4], ments. SOM/OpenDoc [14], and JavaBeans [7]. Among other things, such standards define how components 2. For large documents, embedded objects and im- are uniquely identified, how they are stored on disk, ages comprise a large part of the document. and how they interact with one another and system 3. For small documents, XML format produces resources such as the screen. much larger documents than OLE. For large documents, there is little difference. 4. Compression considerably reduces the size of 2.2.2 Storage documents in both formats. Typical container applications keep in persistent stor- The rest of this document is organized as follows. age two versions of the components they embed. The Section 2 provides some background on compound first one consists of the embedded component’s na- documents and their enabling technology. We also tive data, which is used to initialize the component. discuss relevant characteristics of the three Microsoft This data is created and managed by the component Office applications that we use in this study. Sec- itself. The second representation is a cached image tion 3 describes the documents we used in our exper- of the state of the component the last time it was iments. Section 4 presents our experimental results. instantiated. This image, although created by the Finally, section 5 discusses our conclusions. component, is managed by the container application. This image serves two purposes. First, it allows the document to be rendered quickly, since the code that 2 Background understands the component’s specific type need not 2.1 Compound documents be executed until the user wishes to modify the com- ponent. Second, the cached image allow the doc- To its user, a compound document appears to be a ument to be rendered even on systems where some single unit of information, but in fact it can con- components are not available. 2 2.3 Microsoft Office parent container need not understand the information stored within it. In this section we explore the technologies used by MS-Office to enable compound documents. We start with an overview of COM and OLE. We focus on 2.3.3 Component taxonomy the parts of these technologies that are important Conceptually, MS-Office documents may have up for compound document, particularly the Structured to three classes of components: images, OLE-based Storage Interface. Finally we talk about the two na- embedded components, and virtual components. Im- tive file formats supported by MS-Office. ages are graphic data that are stored and manipu- lated directly by the application. This includes the 2.3.1 COM and OLE cached versions of any embedded components and any graphic data that the application choose to ma- The Component Object Model (COM) enables soft- nipulate directly. OLE-based embedded components ware components to export well-defined interfaces are data created using a separate application, as de- and interact with one another. In COM, software scribed above. Finally, virtual components are ob- components implement their services as one or more jects that are not implemented as OLE-based com- COM objects. Every object implement one or more ponents but that are perceived by the user as separate interfaces, each of which exports a number of meth- entities (i.e., pages in Word, slides in PowerPoint, ods. COM components communicate by invoking and sheets in Excel). these methods. The Object Linking and Embedding (OLE) speci- 2.3.4 File formats fication is a set of standard COM interfaces that en- able users to create compound document by linking Microsoft Office 2000 supports two native file for- and embedding objects (components) into container mats: the traditional OLE-based binary format (here- applications, hence the name OLE. OLE includes after, “OLE archive”) and a new XML-based for- other component-based technologies, like ActiveX mat. The OLE archives [11, 10, 8, 9] rely on the and OLE Automation. OLE SSI to provide a unified view of the compound document in a single file. However, the manner in 2.3.2 Structured storage which MS-Office applications use the OLE SSI to store embedded object varies. Word and Excel, for The OLE Structure Storage Interface (SSI) provides example, store every embedded object in a separate the means for multiple components to share a single storage, making the component structure of the docu- file. This is necessary because in the user’s percep- ment visible to the OLE SSI. In contrast, PowerPoint tion, a compound document is single unit of infor- compresses embedded object native data and stores mation and should appear as a single file. it in the main application stream. While this strategy The SSI implements an abstraction similar to a file increases document compression, it limits the ability system within a single file.
Recommended publications
  • Document-Centered User Interfaces & Object-Oriented Programming
    T HE Vol. 1, No. 6 January/February G ILBANE1994 ™ Publisher: Publishing Technology Management, Inc. Applelink: PTM REPOR (617) 643-8855 Editor: T Frank Gilbane on Open Information & Document Systems [email protected] (617) 643-8855 Subscriptions: It may not seem these two Carolyn Fine DOCUMENT-CENTERED USER INTERFACES & [email protected] topics have much to do (617) 643-8855 OBJECT-ORIENTED PROGRAMMING — HOW with each other at first, but Design & Production: WILL THEY AFFECT YOU? together they will have a Catherine Maccora profound affect on how we (617) 241-7816 use computers in everyday business environments. You can get a glimpse of where this is Associate Editor: Chip Canty all headed by looking at some existing document management software and operating [email protected] system features. In this issue Keith Dawson explains why these two trends are important, (617) 265-6263 how they relate to each other, and how we can expect their convergence to affect our Contributing Editor: Rebecca Hansen development and use of software for business applications. MCI Mail: 4724078 This issue also sets the stage for a future article analyzing the compound document (617) 859-9540 architectures that will be vying for dominance in the next generation document computing environments. NNIVERSARY SSUE It seems hard to believe, but with this issue we conclude our first A I year of publication. It has been an exciting year for us, and we look forward to another year of keeping you up-to-date on the issues and technology for managing documents and document information. In our next (anniversary) issue we will look at how Andersen Consulting helped the State of Wisconsin Legislature reengineer their entire document workflow and management process, and put together a document management application that involved integrating multiple document management products.
    [Show full text]
  • Universität Karlsruhe – Fakultät Für Informatik – Bibliothek – Postfach 6980 – 76128 Karlsruhe
    Universität Karlsruhe – Fakultät für Informatik – Bibliothek – Postfach 6980 – 76128 Karlsruhe Software-Komponentenmodelle – Seminar im Wintersemester 2006/2007 an der Universität Karlsruhe (TH), Institut für Programmstrukturen und Datenorganisation, Lehrstuhl für Software-Entwurf und -Qualität Herausgeber: Steffen Becker, Jens Happe, Heiko Koziolek, Klaus Krogmann, Michael Kuperberg, Ralf Reussner Interner Bericht 2007-10 Universität Karlsruhe Fakultät für Informatik ISSN 1432 - 7864 2 Inhaltsverzeichnis Open-Source Component Models: Bonobo and KParts 5 Sebastian Reichelt The Palladio Component Model 25 Erik Burger The SOFA Component Model 45 Igor Goussev Das Koala Komponentenmodell 67 Dimitar Hodzhev XPCOM Cross Platform Component Object Model 88 Fouad ben Nasr Omri 3 4 Open-Source Component Models: Bonobo and KParts Sebastian Reichelt Supervisor: Michael Kuperberg Karlsruhe University, Karlsruhe, Germany Abstract. Most modern desktop computers are controlled via graphical user interfaces. For different tasks, the user usually has to start separate programs with visually separated user interfaces. Complex tasks require either large monolithic programs or the possibility to reuse part of the functionality provided by one program in another. Such reuse can be achieved using component frameworks, which may be generic or special- ized for graphical applications. In Microsoft Windows, OLE is a widely known specialized component framework built on top of the generic COM framework. The open-source desktop systems GNOME and KDE support graphical components us- ing the Bonobo and KParts subsystems, respectively. This paper is an analysis of these two open-source component frameworks. Despite similar goals, Bonobo and KParts are designed and implemented very differently. In this paper, they are compared from a user’s, program- mer’s, and system architect’s point of view.
    [Show full text]
  • Locah"CONANERPAR SHARED O CONTENT PAR SERVER 112 Pitchpart EDITOR
    USOO6681371B1 (12) United States Patent (10) Patent No.: US 6,681,371 B1 Devanbu (45) Date of Patent: Jan. 20, 2004 (54) SYSTEMAND METHOD FOR USING (56) References Cited CONTAINER DOCUMENTS AS MULTI-USER DOMAIN CLIENTS U.S. PATENT DOCUMENTS (75) Inventor: Premkumar T. Devanbu, Davis, CA 4,933,880 A * 6/1990 Borgendale et al. ........ 71.5/515 5,781,732 A * 7/1998 Adams ....................... 709/205 (US) 5,870.764 A * 2/1999 Lo et al. ..................... 707/203 6,054,985 A * 4/2000 Morgan et al. ............. 345/804 (73) Assignee: AT&T Corp., New York, NY (US) 6,263.379 B1 * 7/2001 Atkinson et al. ........... 709/332 (*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 * cited by examiner U.S.C. 154(b) by 0 days. Primary Examiner Hoang-Vu. Antony Nguyen-Ba (21) Appl. No.: 09/469,169 ABSTRACT (22) Filed: Dec. 21, 1999 (57) A multi-user domain is implemented in a compound docu Related U.S. Application Data ment framework to collaboratively modify a compound (60) Provisional application No. 60/113,233, filed on Dec. 21, document in accordance with a concurrency model. A Server 1998. hosts a multi-user domain in which a plurality of clients (51) Int. Cl............................ G06F 15700; G06F 9/44; participate. The multi-user domain includes a compound G06F 9/00 document having a shared container part and at least one (52) U.S. Cl. ........................ 715/515; 717/123; 709/328 shared content part. Each participating client can use its own (58) Field of Search ................................
    [Show full text]
  • A Document-Centric Component Framework for Document Distributions
    A Document-centric Component Framework for Document Distributions Ichiro Satoh National Institute of Informatics 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan Abstract. This paper presents a framework for building and managing com- pound documents in distributed systems. It enables an enriched document to be dynamically and nestedly composed of software components corresponding to various types of content, e.g., text, images, and windows. It permits the content of each component and program code to access the content inseparable inside the components so that the components can be viewed or modified without the need for any applications. It enables each component or document to migrate over a network under its own control by using mobile agent technology. Moreover, it in- troduces components as carriers or forwarders because it enables them to carry or transmit other components as first class objects to other locations. It offers sev- eral basic operations for network processing, e.g., forwarding, duplication, and synchronization. Since these operations are still document components, they can be dynamically deployed and customized at local or remote computers through GUI manipulations. It therefore allows an end-user to easily and rapidly configure network processing in the same way as if he/she had edited the documents. This paper describes the framework and its implementation, which currently uses Java as the implementation language as well as a component development language, and then illustrates several interesting applications that demonstrate its utility and flexibility. 1 Introduction Document manipulation, such as editing, viewing, and distributing documents, is still playing a crucial role in modern information processing.
    [Show full text]
  • Overview of Document Management Technology
    International Federation of Library Associations and Institutions UNIVERSAL DATAFLOW AND TELECOMMUNICATIONS CORE PROGRAMME OCCASIONAL PAPER 2 OVERVIEW OF DOCUMENT MANAGEMENT TECHNOLOGY Gary Cleveland National Library of Canada June, 1995 International Federation of Library Associations and Institutions UNIVERSAL DATAFLOW AND TELECOMMUNICATIONS CORE PROGRAMME The IFLA Core Programme on Universal Dataflow and Telecommunications (UDT) seeks to facilitate the international and national exchange of electronic data by providing the library community with pragmatic approaches to resource sharing. The programme monitors and promotes the use of relevant standards, promotes the use of relevant technologies and monitors relevant policy issues in an effort to overcome barriers to the electronic transfer of data in library fields. CONTACT INFORMATION Mailing Address: IFLA International Office for UDT c/o National Library of Canada 395 Wellington Street Ottawa, CANADA K1A 0N4 UDT Staff Contacts: Leigh Swain, Director Email: [email protected] Phone: (819) 994-6833 or Louise Lantaigne, Administration Officer Email: [email protected] Phone: (819) 994-6963 Fax: (819) 994-6835 Email: [email protected] URL: http://www.ifla.org/udt/ Occasional papers are available electronically at: http://www.ifla.org/udt/op/ UDT Occasional Papers # 2 Universal Dataflow and Telecommunications Core Programme International Federation of Library Associations and Institutions Overview of Document Management Technology Gary Cleveland National Library of Canada [email protected] June, 1995 INTRODUCTION Traditionally, there have been two classes of document management: 1) management of fixed Current document management technology grows images of pages (the class that seems to be most out of the business community where some 80% of familiar to librarians); and 2) management of corporate information resides in documents.
    [Show full text]