The Efficient Storage of Text Documents in Digital Libraries

Przemysław Skibiński and Jakub Swacha

Przemysław Skibiński ([email protected]) is Associate Professor, Institute of Computer Science, University of Wrocław, Poland. Jakub Swacha ([email protected]) is Associate Professor, Institute of Information Technology in Management, University of Szczecin, Poland.
In this paper we investigate the possibility of improving the efficiency of data compression, and thus reducing storage requirements, for seven widely used text document formats. We propose an open-source text compression software library, featuring an advanced word-substitution scheme with static and semidynamic word dictionaries. The empirical results show an average storage space reduction as high as 78 percent compared to uncompressed documents, and as high as 30 percent compared to documents compressed with the free compression software gzip.

It is hard to expect the continuing rapid growth of global information volume not to affect digital libraries.1 The growth of stored information volume means growth in storage requirements, which poses a problem in both technological and economic terms. Fortunately, the digital libraries' hunger for resources can be tamed with data compression.2

The primary motivation for our research was to limit the data storage requirements of the student thesis electronic archive in the Institute of Information Technology in Management at the University of Szczecin. The current regulations state that every thesis should be submitted in both printed and electronic form. The latter facilitates automated processing of the documents for purposes such as plagiarism detection or statistical language analysis. Considering the introduction of the three-cycle higher education system (bachelor/master/doctorate), there are several hundred theses added to the archive every year.

Although students are asked to submit Microsoft Word–compatible documents such as DOC, DOCX, and RTF, other popular formats such as TeX script (TEX), HTML, PS, and PDF are also accepted, both in the case of the main thesis document, containing the thesis and any appendixes that were included in the printed version, and the additional appendixes, comprising materials that were left out of the printed version (such as detailed data tables, the full source code of programs, program manuals, etc.). Some of the appendixes may be multimedia, in formats such as PNG, JPEG, or MPEG.3

Notice that this paper deals with text-document compression only. Although the size of individual text documents is often significantly smaller than the size of individual multimedia objects, their collective volume is large enough to make the compression effort worthwhile. The reason for focusing on text-document compression is that most multimedia formats have efficient compression schemes embedded, whereas text-document formats usually either are uncompressed or use schemes with efficiency far worse than the current state of the art in text compression.

Although the student thesis electronic archive was our motivation, we propose a solution that can be applied to any digital library containing text documents. As the recent survey by Kahl and Williams revealed, 57.5 percent of the 1,117 examined digital library projects consisted of text content, so there are numerous libraries that could benefit from implementation of the proposed scheme.4

In this paper, we describe a state-of-the-art approach to text-document compression and present an open-source software library implementing the scheme that can be freely used in digital library projects.

In the case of text documents, improvement in compression effectiveness may be obtained in two ways: with or without regard to their format. The more nontextual content in a document (e.g., formatting instructions, structure description, or embedded images), the more it requires format-specific processing to improve its compression ratio. This is because most document formats have their own ways of describing their formatting, structure, and nontextual inclusions (plain text files have no inclusions).

For this reason, we have developed a compound scheme that consists of several subschemes that can be turned on and off or run with different parameters. The most suitable solution for a given document format can be obtained by merely choosing the right subschemes and adequate parameter values. Experimentally, we have found the optimal subscheme combinations for the following formats used in digital libraries: plain text, TEX, RTF, text annotated with XML, HTML, as well as the device-independent rendering formats PS and PDF.5
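As a rough illustration of how such per-format subscheme selection could be organized, consider the Python sketch below; the profile table, flag names, and the compress_document wrapper are hypothetical illustrations, not the actual interface of the library described in this paper.

    import zlib

    # Hypothetical subscheme profiles (illustration only): each document
    # format switches subschemes on or off and tunes their parameters.
    FORMAT_PROFILES = {
        "plain": {"own_dictionary": True,  "max_dict_words": 4096},
        "html":  {"own_dictionary": True,  "max_dict_words": 8192},
        "pdf":   {"own_dictionary": False, "max_dict_words": 0},
    }

    def compress_document(data: bytes, fmt: str) -> bytes:
        """Look up the subscheme combination chosen for this format,
        apply it (omitted in this sketch), and run a backend compressor."""
        profile = FORMAT_PROFILES.get(fmt, FORMAT_PROFILES["plain"])
        if profile["own_dictionary"]:
            # ... a per-document word dictionary would be built here,
            # capped at profile["max_dict_words"] entries ...
            pass
        return zlib.compress(data, level=9)  # stand-in for the final coder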
First we discuss related work in text compression, then describe the basis of the proposed scheme and how it should be adapted for particular document formats. The section "Using the scheme in a digital library project" discusses how to use the free software library that implements the scheme. Then we cover the results of experiments involving the proposed scheme and a corpus of test files in each of the tested formats.

■ Text compression

There are two basic principles of general-purpose data compression. The first one works on the level of character sequences, the second one on the level of individual characters. In the first case, the idea is to look for matching character sequences in the past buffer of the file being compressed and replace such sequences with shorter code words; this principle underlies the algorithms derived from the concepts of Abraham Lempel and Jacob Ziv (LZ-type).6 In the second case, the idea is to gather frequency statistics for the characters in the file being compressed and then assign shorter code words to frequent characters and longer ones to rare characters; this is exactly how Huffman coding works (what arithmetic coding assigns are value ranges rather than individual code words).7
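As a concrete, self-contained illustration of the second principle, the following Python sketch (a textbook construction, not code from the library discussed in this paper) derives Huffman code lengths from the character frequencies of a sample string; frequent characters end up with shorter codes.

    import heapq
    from collections import Counter

    def huffman_code_lengths(text: str) -> dict:
        """Assign Huffman code lengths from character frequencies:
        frequent characters get short codes, rare ones long codes."""
        freq = Counter(text)
        # Heap entries: (subtree frequency, tie-breaker, {char: code length}).
        heap = [(f, i, {ch: 0}) for i, (ch, f) in enumerate(freq.items())]
        heapq.heapify(heap)
        tie = len(heap)
        while len(heap) > 1:
            f1, _, a = heapq.heappop(heap)
            f2, _, b = heapq.heappop(heap)
            # Merging two subtrees adds one bit to every code word inside them.
            merged = {ch: depth + 1 for ch, depth in {**a, **b}.items()}
            heapq.heappush(heap, (f1 + f2, tie, merged))
            tie += 1
        return heap[0][2]

    print(huffman_code_lengths("abracadabra"))
    # e.g. {'a': 1, 'c': 3, 'd': 3, 'b': 3, 'r': 3}; 'a' is the most frequent

Arithmetic coding would map the same frequency statistics to value ranges instead of whole-bit code words.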
As the characters form words, and words form phrases, there is high correlation between subsequent characters. To produce shorter code words, a compression algorithm either has to observe the context (understood as several preceding characters) in which a character appears and maintain separate frequency models for different contexts, or it has to first decorrelate the characters (by sorting them according to their contexts) and then use an adaptive frequency model when compressing the output (as the characters' dependence on context becomes dependence on position). Whereas the former solution is the foundation of Prediction by Partial Match (PPM) algorithms, Burrows-Wheeler Transform (BWT) compression algorithms are based on the latter.8

Witten et al., in their seminal work Managing Gigabytes, emphasize the role of data compression in text storage and retrieval systems, stating three requirements for the compression process: good compression, fast decoding, and feasibility of decoding individual documents with minimum overhead.9 The choice of compression algorithm should depend on what is more important for a specific application: better compression or faster decoding.

An early work of Jon Louis Bentley and others showed that a significant improvement in text compression can be achieved by treating a text document as a stream of space-delimited words rather than individual characters.10 This technique can be combined with any general-purpose compression method in two ways: by redesigning character-based algorithms as word-based ones, or by implementing a two-stage scheme whose first step is a transform replacing words with dictionary indices and whose second step is passing the transformed text through any general-purpose compressor.11 From the designer's point of view, although the first approach provides more control over how the text is modeled, the second approach is much easier to implement.

A key element of any word-based compression scheme is the dictionary of words, which lists the character sequences that should be treated as single entities. The dictionary can be dynamic (i.e., constructed on-line during the compression of every document),13 static (i.e., constructed off-line before the compression stage and once for every document of a given class; typically, the language of the document determines its class),14 or semidynamic (i.e., constructed off-line before the compression stage but individually for every document).15 Semidynamic dictionaries must be stored along with the compressed document. Dynamic dictionaries are reconstructed during decompression (which makes the decoding slower than in the other cases). When a static dictionary is used, it must be distributed with the decoder; since a single dictionary is used to compress multiple files, it usually attains the best compression ratios, but it is only effective with documents of the class it was originally prepared for.

■ The basic compression scheme

The basis of our approach is a word-based, lossless text compression scheme, dubbed Compression for Textual Digital Libraries (CTDL). The scheme consists of up to four stages:

1. document decompression
2. dictionary composition
3. text transform
4. compression

Stages 1–2 are optional. The first is for retrieving textual content from files compressed poorly with general-purpose methods, and it is only executed for compressed input documents. It uses an embedded decompressor for files compressed with the Deflate algorithm,16 while an external tool, Precomp, is used to decode natively compressed PDF documents.17

The second stage constructs a dictionary of the most frequent words in the processed document. Doing so is a good idea when the compressed documents have no common set of words. If there are many documents in the same language, a common dictionary fares better; it usually does not pay off to store an individual dictionary with each file, because they all contain similar lists of words.
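To make the two-stage word-based idea concrete, here is a minimal Python sketch of a semidynamic word-substitution transform followed by a general-purpose compressor (zlib as a stand-in backend); it illustrates the general technique only, and its index format and dictionary serialization are simplified assumptions, not the actual CTDL encoding.

    import re
    import zlib
    from collections import Counter

    def build_dictionary(text: str, max_words: int = 4096) -> list:
        """Semidynamic dictionary: the most frequent words of this document,
        to be stored along with the compressed output."""
        words = re.findall(r"[A-Za-z]+", text)
        return [w for w, _ in Counter(words).most_common(max_words)]

    def transform(text: str, dictionary: list) -> str:
        """Replace dictionary words with short indices so that the backend
        compressor sees a more regular, more compressible stream."""
        index = {w: i for i, w in enumerate(dictionary)}
        def repl(match):
            i = index.get(match.group(0))
            return "\x01%x\x02" % i if i is not None else match.group(0)
        return re.sub(r"[A-Za-z]+", repl, text)

    def compress(text: str) -> bytes:
        """Dictionary composition, text transform, then compression."""
        dictionary = build_dictionary(text)
        payload = "\x00".join(dictionary) + "\x03" + transform(text, dictionary)
        return zlib.compress(payload.encode("utf-8"), level=9)

A matching decoder would split the dictionary off at the first \x03 separator and map the indices back to words; a production implementation would use a compact binary encoding rather than the sentinel characters assumed here.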