A Characterization of Compound Documents on the Web

A Characterization of Compound Documents on the Web Eyal de Lara†, Dan S. Wallach‡, and Willy Zwaenepoel‡ † Department of Electrical and Computer Engineering ‡ Department of Computer Science Rice University Abstract 3. For small documents, XML format produces much larger documents than OLE. For large Recent developments in office productivity suites documents, there is little difference. make it easier for users to publish rich compound documents on the Web. Compound documents ap- 4. Compression considerably reduces the size of pear as a single unit of information but may con- documents in both formats. tain data generated by different applications, such as text, images, and spreadsheets. Given the popu- 1 Introduction larity enjoyed by these office suites and the perva- siveness of the Web as a publication medium, we Productivity tools, often part of “office suites,” are expect that in the near future these compound doc- the most popular applications for creating docu- uments will become an increasing proportion of the ments. Their popularity derives, to some extent, from Web’s content. As a result, the content handled by their capability to create compound documents that servers, proxies, and browsers may change consider- include data from more than one application. ably from what is currently observed. Furthermore, The documents produced by office suites have these compound documents are currently treated as been, until recently, very unwieldy for putting on the opaque byte streams, but future Web infrastructure Web. These documents can be quite large, some- may wish to understand their internal structure to times hundreds of kilobytes, and Web browsers do provide higher-quality service. not know how to display them without first com- In order to guide the design of this future Web in- pletely downloading the document. frastructure, we characterize compound documents Recent improvements in office suites have made it currently found on the Web. Previous studies of Web easier to publish compound documents on the Web content either ignored these document types alto- by exporting browser-compatible data types. These gether or did not consider their internal structure. We improvements, coupled with the popularity of pro- study compound documents originated by the three ductivity tools, will likely have a strong impact on most popular applications from the Microsoft Of- the content of the Web and the infrastructure that fice suite: Word, Excel, and PowerPoint. Our study supports this content. Previous studies have either encompasses over 12,500 documents retrieved from overlooked compound document altogether or have 935 different Web sites. Our main conclusions are: treated them as black boxes [2, 18, 16, 1, 17]. There- fore, it becomes important to study and characterize 1. Compound documents are in general much compound documents on the Web, as they are now, larger than current HTML documents. to predict where the Web might be going, and how Web infrastructure should support compound docu- 2. For large documents, embedded objects and im- ments in the future. ages make up a large part of the documents’ For this paper, we studied compound documents size. generated by the three most popular application of 1 the Microsoft Office suite: Word, Excel, and Power- tain elements created by different applications. A Point. We chose to focus on Microsoft Office appli- compound document could, for instance, consist of cations based on two factors. First, Microsoft Office a spreadsheet and several images embedded into a is the most widely-used productivity suite. More- text document. over, a significant number of Microsoft Office docu- In the general case, every data type in a compound ments are available on the Web, enabling us to gather document (spreadsheet, text, images, sound, etc) is the data for our experiments. Second, Microsoft Of- created and managed by a different application. The fice 2000 supports two native file formats: the propri- different applications used to create the document etary OLE-based binary format and a new XML for- can be thought of as software components that pro- mat. By using Office 2000 to convert old files to the vide services that are invoked to create, edit, and dis- new XML format, we can compare the tradeoffs of play the compound document. using a proprietary binary-based file format against a modern standards-based text format, both as inter- mediate formats suitable for document editing, and as publishing formats, suitable only for reading. 2.2 Enabling technologies We downloaded over 12500 documents, compris- ing over 4 GB of data, from 935 different sites. Our 2.2.1 Component standards main results are: Compound documents result from combining the 1. Compound documents are in general much data created by two or more disjoint software com- larger than current HTML documents. The tail ponents. As a result, there is a need for a standard to of the size distribution follows the same power- govern the interactions between components. Some log distribution observed with HTML docu- of the most visible standards are COM/OLE [3, 4], ments. SOM/OpenDoc [14], and JavaBeans [7]. Among other things, such standards define how components 2. For large documents, embedded objects and im- are uniquely identified, how they are stored on disk, ages comprise a large part of the document. and how they interact with one another and system 3. For small documents, XML format produces resources such as the screen. much larger documents than OLE. For large documents, there is little difference. 4. Compression considerably reduces the size of 2.2.2 Storage documents in both formats. Typical container applications keep in persistent stor- The rest of this document is organized as follows. age two versions of the components they embed. The Section 2 provides some background on compound first one consists of the embedded component’s na- documents and their enabling technology. We also tive data, which is used to initialize the component. discuss relevant characteristics of the three Microsoft This data is created and managed by the component Office applications that we use in this study. Sec- itself. The second representation is a cached image tion 3 describes the documents we used in our exper- of the state of the component the last time it was iments. Section 4 presents our experimental results. instantiated. This image, although created by the Finally, section 5 discusses our conclusions. component, is managed by the container application. This image serves two purposes. First, it allows the document to be rendered quickly, since the code that 2 Background understands the component’s specific type need not 2.1 Compound documents be executed until the user wishes to modify the component. Second, the cached image allow the doc- To its user, a compound document appears to be a ument to be rendered even on systems where some single unit of information, but in fact it can con- components are not available. 2 2.3 Microsoft Office parent container need not understand the information stored within it. In this section we explore the technologies used by MS-Office to enable compound documents. We start with an overview of COM and OLE. We focus on 2.3.3 Component taxonomy the parts of these technologies that are important Conceptually, MS-Office documents may have up for compound document, particularly the Structured to three classes of components: images, OLE-based Storage Interface. Finally we talk about the two na- embedded components, and virtual components. Im- tive file formats supported by MS-Office. ages are graphic data that are stored and manipu- lated directly by the application. This includes the 2.3.1 COM and OLE cached versions of any embedded components and any graphic data that the application choose to ma- The Component Object Model (COM) enables soft- nipulate directly. OLE-based embedded components ware components to export well-defined interfaces are data created using a separate application, as de- and interact with one another. In COM, software scribed above. Finally, virtual components are ob- components implement their services as one or more jects that are not implemented as OLE-based com- COM objects. Every object implement one or more ponents but that are perceived by the user as separate interfaces, each of which exports a number of meth- entities (i.e., pages in Word, slides in PowerPoint, ods. COM components communicate by invoking and sheets in Excel). these methods. The Object Linking and Embedding (OLE) speci- 2.3.4 File formats fication is a set of standard COM interfaces that enable users to create compound document by linking Microsoft Office 2000 supports two native file for- and embedding objects (components) into container mats: the traditional OLE-based binary format (here- applications, hence the name OLE. OLE includes after, “OLE archive”) and a new XML-based for- other component-based technologies, like ActiveX mat. The OLE archives [11, 10, 8, 9] rely on the and OLE Automation. OLE SSI to provide a unified view of the compound document in a single file. However, the manner in 2.3.2 Structured storage which MS-Office applications use the OLE SSI to store embedded object varies. Word and Excel, for The OLE Structure Storage Interface (SSI) provides example, store every embedded object in a separate the means for multiple components to share a single storage, making the component structure of the docu- file. This is necessary because in the user’s percep- ment visible to the OLE SSI. In contrast, PowerPoint tion, a compound document is single unit of infor- compresses embedded object native data and stores mation and should appear as a single file. it in the main application stream. While this strategy The SSI implements an abstraction similar to a file increases document compression, it limits the ability system within a single file.

Load more