An Analysis of XML Compression Efficiency Christopher J. Augeri1 Barry E. Mullins1 Leemon C. Baird III Dursun A. Bulutoglu2 Rusty O. Baldwin1 1Department of Electrical and Computer Engineering Department of Computer Science 2Department of Mathematics and Statistics United States Air Force Academy (USAFA) Air Force Institute of Technology (AFIT) USAFA, Colorado Springs, CO Wright Patterson Air Force Base, Dayton, OH {chris.augeri, barry.mullins}@afit.edu
[email protected] {dursun.bulutoglu, rusty.baldwin}@afit.edu ABSTRACT We expand previous XML compression studies [9, 26, 34, 47] by XML simplifies data exchange among heterogeneous computers, proposing the XML file corpus and a combined efficiency metric. but it is notoriously verbose and has spawned the development of The corpus was assembled using guidelines given by developers many XML-specific compressors and binary formats. We present of the Canterbury corpus [3], files often used to assess compressor an XML test corpus and a combined efficiency metric integrating performance. The efficiency metric combines execution speed compression ratio and execution speed. We use this corpus and and compression ratio, enabling simultaneous assessment of these linear regression to assess 14 general-purpose and XML-specific metrics, versus prioritizing one metric over the other. We analyze compressors relative to the proposed metric. We also identify key collected metrics using linear regression models (ANOVA) versus factors when selecting a compressor. Our results show XMill or a simple comparison of means, e.g., X is 20% better than Y. WBXML may be useful in some instances, but a general-purpose compressor is often the best choice. 2. XML OVERVIEW XML has gained much acceptance since first proposed in 1998 by Categories and Subject Descriptors the World-Wide Web Consortium (W3C).