A Characterization of Compound Documents on the Web
Eyal de Lara†, Dan S. Wallach‡, and Willy Zwaenepoel‡
† Department of Electrical and Computer Engineering
‡ Department of Computer Science
Rice University
Abstract

Recent developments in office productivity suites make it easier for users to publish rich compound documents on the Web. Compound documents appear as a single unit of information but may contain data generated by different applications, such as text, images, and spreadsheets. Given the popularity enjoyed by these office suites and the pervasiveness of the Web as a publication medium, we expect that in the near future these compound documents will become an increasing proportion of the Web’s content. As a result, the content handled by servers, proxies, and browsers may change considerably from what is currently observed. Furthermore, these compound documents are currently treated as opaque byte streams, but future Web infrastructure may wish to understand their internal structure to provide higher-quality service.

In order to guide the design of this future Web infrastructure, we characterize compound documents currently found on the Web. Previous studies of Web content either ignored these document types altogether or did not consider their internal structure. We study compound documents originated by the three most popular applications from the Microsoft Office suite: Word, Excel, and PowerPoint. Our study encompasses over 12,500 documents retrieved from 935 different Web sites. Our main conclusions are:

1. Compound documents are in general much larger than current HTML documents.
2. For large documents, embedded objects and images make up a large part of the documents’ size.
3. For small documents, XML format produces much larger documents than OLE. For large documents, there is little difference.
4. Compression considerably reduces the size of documents in both formats.

1 Introduction

Productivity tools, often part of “office suites,” are the most popular applications for creating documents. Their popularity derives, to some extent, from their capability to create compound documents that include data from more than one application.

The documents produced by office suites have been, until recently, very unwieldy for putting on the Web. These documents can be quite large, sometimes hundreds of kilobytes, and Web browsers do not know how to display them without first completely downloading the document.

Recent improvements in office suites have made it easier to publish compound documents on the Web by exporting browser-compatible data types. These improvements, coupled with the popularity of productivity tools, will likely have a strong impact on the content of the Web and the infrastructure that supports this content. Previous studies have either overlooked compound documents altogether or have treated them as black boxes [2, 18, 16, 1, 17]. Therefore, it becomes important to study and characterize compound documents on the Web, as they are now, to predict where the Web might be going, and how Web infrastructure should support compound documents in the future.

For this paper, we studied compound documents generated by the three most popular applications of the Microsoft Office suite: Word, Excel, and PowerPoint. We chose to focus on Microsoft Office applications based on two factors. First, Microsoft Office is the most widely-used productivity suite. Moreover, a significant number of Microsoft Office documents are available on the Web, enabling us to gather the data for our experiments. Second, Microsoft Office 2000 supports two native file formats: the proprietary OLE-based binary format and a new XML format. By using Office 2000 to convert old files to the new XML format, we can compare the tradeoffs of using a proprietary binary-based file format against a modern standards-based text format, both as intermediate formats suitable for document editing, and as publishing formats, suitable only for reading.

We downloaded over 12,500 documents, comprising over 4 GB of data, from 935 different sites. Our main results are:

1. Compound documents are in general much larger than current HTML documents. The tail of the size distribution follows the same power-law distribution observed with HTML documents.
2. For large documents, embedded objects and images comprise a large part of the document.
3. For small documents, XML format produces much larger documents than OLE. For large documents, there is little difference.
4. Compression considerably reduces the size of documents in both formats.

The rest of this document is organized as follows. Section 2 provides some background on compound documents and their enabling technology. We also discuss relevant characteristics of the three Microsoft Office applications that we use in this study. Section 3 describes the documents we used in our experiments. Section 4 presents our experimental results. Finally, Section 5 discusses our conclusions.

2 Background

2.1 Compound documents

To its user, a compound document appears to be a single unit of information, but in fact it can contain elements created by different applications. A compound document could, for instance, consist of a spreadsheet and several images embedded into a text document.

In the general case, every data type in a compound document (spreadsheet, text, images, sound, etc.) is created and managed by a different application. The different applications used to create the document can be thought of as software components that provide services that are invoked to create, edit, and display the compound document.

2.2 Enabling technologies

2.2.1 Component standards

Compound documents result from combining the data created by two or more disjoint software components. As a result, there is a need for a standard to govern the interactions between components. Some of the most visible standards are COM/OLE [3, 4], SOM/OpenDoc [14], and JavaBeans [7]. Among other things, such standards define how components are uniquely identified, how they are stored on disk, and how they interact with one another and with system resources such as the screen.

2.2.2 Storage

Typical container applications keep in persistent storage two versions of the components they embed. The first one consists of the embedded component’s native data, which is used to initialize the component. This data is created and managed by the component itself. The second representation is a cached image of the state of the component the last time it was instantiated. This image, although created by the component, is managed by the container application. This image serves two purposes. First, it allows the document to be rendered quickly, since the code that understands the component’s specific type need not be executed until the user wishes to modify the component. Second, the cached image allows the document to be rendered even on systems where some components are not available.
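The two persistent representations described in Section 2.2.2 can be illustrated with a minimal Python sketch. It is not taken from any real container application: the names `EmbeddedComponent`, `render`, `type_id`, and the `handlers` registry are all hypothetical, and the rendering logic is reduced to returning bytes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class EmbeddedComponent:
    """One embedded component as a container might persist it (illustrative)."""
    type_id: str          # hypothetical identifier for the owning application
    native_data: bytes    # created and managed by the component itself
    cached_image: bytes   # last rendered state, managed by the container


# Hypothetical signature for code that turns native data into something displayable.
Renderer = Callable[[bytes], bytes]


def render(component: EmbeddedComponent,
           handlers: Dict[str, Renderer],
           editing: bool = False) -> bytes:
    """Return the bytes to display for a component.

    The cached image is used unless the user is editing and a handler for the
    component type is installed; only then is the component code invoked.
    """
    handler: Optional[Renderer] = handlers.get(component.type_id)
    if editing and handler is not None:
        return handler(component.native_data)
    return component.cached_image


# Example: a spreadsheet embedded in a text document (made-up byte strings).
chart = EmbeddedComponent("spreadsheet", b"cells", b"picture-of-cells")
print(render(chart, handlers={}))
```

With an empty `handlers` registry the call falls back to the cached image, which mirrors the second purpose listed above: the document can still be displayed on a system where the component is not installed.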
2.3 Microsoft Office

In this section we explore the technologies used by MS-Office to enable compound documents. We start with an overview of COM and OLE. We focus on the parts of these technologies that are important for compound documents, particularly the Structured Storage Interface. Finally, we discuss the two native file formats supported by MS-Office.

2.3.1 COM and OLE

The Component Object Model (COM) enables software components to export well-defined interfaces and interact with one another. In COM, software components implement their services as one or more COM objects. Every object implements one or more interfaces, each of which exports a number of methods. COM components communicate by invoking these methods.

The Object Linking and Embedding (OLE) specification is a set of standard COM interfaces that enable users to create compound documents by linking and embedding objects (components) into container applications, hence the name OLE. OLE includes other component-based technologies, like ActiveX and OLE Automation.

2.3.2 Structured storage

The OLE Structured Storage Interface (SSI) provides the means for multiple components to share a single file. This is necessary because, in the user’s perception, a compound document is a single unit of information and should appear as a single file.

The SSI implements an abstraction similar to a file system within a single file. It supports two types of objects: storages and streams. Storages are analogous to directories and contain streams or more storages. Streams are analogous to files and contain the components’ data.

In compound documents, each embedded component is stored in a separate storage. When an embedded component is instantiated, its embedding container supplies it with a pointer to the storage that holds the component’s native data. The embedded component uses these data to initialize its state. An embedded component manages its own storage; the parent container need not understand the information stored within it.

2.3.3 Component taxonomy

Conceptually, MS-Office documents may have up to three classes of components: images, OLE-based embedded components, and virtual components. Images are graphic data that are stored and manipulated directly by the application. This includes the cached versions of any embedded components and any graphic data that the application chooses to manipulate directly. OLE-based embedded components are data created using a separate application, as described above. Finally, virtual components are objects that are not implemented as OLE-based components but that are perceived by the user as separate entities (i.e., pages in Word, slides in PowerPoint, and sheets in Excel).

2.3.4 File formats

Microsoft Office 2000 supports two native file formats: the traditional OLE-based binary format (hereafter, “OLE archive”) and a new XML-based format. The OLE archives [11, 10, 8, 9] rely on the OLE SSI to provide a unified view of the compound document in a single file. However, the manner in which MS-Office applications use the OLE SSI to store embedded objects varies. Word and Excel, for example, store every embedded object in a separate storage, making the component structure of the document visible to the OLE SSI. In contrast, PowerPoint compresses embedded object native data and stores it in the main application stream. While this strategy increases document compression, it limits the ability of third-party applications to manipulate components within a PowerPoint document.

The new XML format [12] provides a more browser-friendly option for storing MS-Office documents. Where an OLE archive appears as a single file, an XML document appears as an entire directory of XML files, approximately one per component, image, or slide. The current implementation of MS-Office supports two versions of XML output: a compact low-fidelity representation that can be read by browsers but cannot be edited by MS-Office tools, and a larger high-quality representation that supports editing. In this study we focus on the latter XML representation.

Aside from the number of files that they use, the two file formats differ mostly in their representation of text and formatting information. Images and embedded component native data have similar representations in both formats, with the caveat that component data in the XML-based format is stored in a compressed OLE archive.
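The storage and stream abstraction of Section 2.3.2, and the per-component layout that Section 2.3.4 attributes to Word and Excel, can be modeled with a short Python sketch. This is an in-memory toy, not the OLE SSI itself: the class names `Storage` and `Stream`, the helper `walk`, and the entry names `MainStream`, `Embedded1`, and `Native` are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, Tuple, Union


@dataclass
class Stream:
    """Analogous to a file: holds a component's data."""
    data: bytes = b""


@dataclass
class Storage:
    """Analogous to a directory: holds streams and nested storages."""
    children: Dict[str, Union["Storage", Stream]] = field(default_factory=dict)


def walk(storage: Storage, prefix: str = "") -> Iterator[Tuple[str, int]]:
    """Yield (path, size in bytes) for every stream below `storage`."""
    for name, child in sorted(storage.children.items()):
        path = f"{prefix}/{name}"
        if isinstance(child, Stream):
            yield path, len(child.data)
        else:
            yield from walk(child, path)


# Hypothetical layout: a main application stream plus one storage per
# embedded component, so the component structure is visible from outside.
document = Storage({
    "MainStream": Stream(b"text and formatting"),
    "Embedded1": Storage({"Native": Stream(b"spreadsheet cells")}),
})

for path, size in walk(document):
    print(path, size)
```

Because each embedded component sits in its own storage, a tool that knows nothing about the component's type can still enumerate it and measure its size. Under the PowerPoint strategy described above, the same data would be folded into the main stream and would not appear as a separate entry in such a walk.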
3 Data set

We collected Word, Excel, and PowerPoint documents from the Web. First, we used a commercial search engine to obtain an initial set of URLs. We searched for pages having links to files with suffixes we are interested in (ppt, xls, and doc). Then, we used GNU Wget [15] to recursively retrieve documents from our initial search results.

All downloaded documents were in the binary OLE archive format. Because Microsoft’s file formats vary from one version of their software to another, we first converted all our data to the Office 2000 formats. We removed documents that appeared to be corrupt or were not actually Office documents; the doc suffix, in particular, tends to be used by many applications other than Microsoft Word. We also eliminated duplicates, removing approximately 5% of our data set.

We converted all the data to Office 2000 formats and obtained the XML-based representation, using the MS-Office OLE Automation interfaces [13]. OLE Automation allows a simple Java application we wrote to remotely control the Office applications to perform the data conversions.

Table 1 shows a summary of the documents. For each application, it presents the Internet domains from which documents originated, the number of documents, and the number of sites. The local domain corresponds to documents obtained from our local file system. These documents were taken from our local NFS server rather than the Web.

Application   Domain     Documents   Sites
Word          com              412      28
              edu              813      73
              gov             1376      50
              org              362      58
              other            362      27
              local           3007       2
              Subtotal        6481     236
PowerPoint    com              515      95
              edu              669      73
              gov              333      45
              org              474     111
              other             51      10
              local            125       1
              Subtotal        2167     334
Excel         com              553     126
              edu             1520      76
              gov             1343      48
              org              448      93
              other             88      35
              local            104       1
              Subtotal        4056     378
Total                        12704     935

Table 1: Data set. This table presents the domains and sites from which our test documents originated. These numbers reflect the documents that remain after duplicates and corrupted files were removed.
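The text does not say how duplicates were detected during data collection. One plausible approach, shown here purely as an assumption, is to treat two files as duplicates when their bytes are identical and to compare SHA-256 digests. The function name `drop_duplicates` and the file names in the example are hypothetical.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple


def drop_duplicates(files: Iterable[Tuple[str, bytes]]) -> List[str]:
    """Keep the first name seen for each distinct byte content.

    Assumption: "duplicate" means byte-identical content; the study's
    actual criterion is not stated.
    """
    seen: Dict[str, str] = {}
    for name, content in files:
        digest = hashlib.sha256(content).hexdigest()
        seen.setdefault(digest, name)  # later copies of the same bytes are ignored
    return list(seen.values())


# Made-up example: two copies of the same bytes under different names.
sample = [("a.doc", b"report"), ("mirror/a.doc", b"report"), ("b.xls", b"budget")]
print(drop_duplicates(sample))
```

A content hash catches mirrored copies served from different URLs, but it would miss files that differ only in metadata, so the share of duplicates it finds could differ from the approximately 5% reported above.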
4 Experiments
This section presents statistics we have measured for MS-Office documents and the components within them. We compare our results to previously published statistics for HTML documents.
4.1 Document Size

Table 2 shows general statistics for Word, Excel, and PowerPoint documents¹. The most striking aspects of the data are the large average size of documents and the large standard deviations of our sample.

Figure 1 shows the size distribution of Word, Excel, and PowerPoint documents. The histogram plots documents with sizes up to 180 KB. We observe that the distributions have the same general shape: a cluster around a common small value with a fairly long tail.

[Figure 1: Size distribution of Word, PowerPoint, and Excel documents. Shown are documents with sizes up to 180 KB. Axes: Document Size (KB) versus Number of Documents; series: Word, PPT, Excel.]

Figure 2 characterizes the distributions’ tails by plotting document size frequencies for documents larger than 100 KB on a log-log scale. The linear fit of the transformed data ($y \propto x^{-1.7124}$) with $R^2 = 0.8938$ suggests that the tail of the size distribution closely follows a power-law distribution, which explains the large standard deviations in Table 2. The log-log scale histograms for the individual Word, PowerPoint, and Excel documents are not
shown here since they are all similar to the cumula- :