Algorithmics :: Compressing Data


Compressing Data
Konstantin Tretyakov ([email protected])
MTAT.03.238 Advanced Algorithmics, April 26, 2012

Claude Elwood Shannon (1916-2001)
C. E. Shannon. A mathematical theory of communication. 1948.
C. E. Shannon. The mathematical theory of communication. 1949.
Shannon-Fano coding, the Nyquist-Shannon sampling theorem, the Shannon-Hartley theorem, Shannon's noisy channel coding theorem, Shannon's source coding theorem, rate-distortion theory; Ethernet, Wifi, GSM, CDMA, EDGE, CD, DVD, BD, ZIP, JPEG, MPEG, ...

Related courses: MTMS.02.040 Informatsiooniteooria (3-5 EAP), Jüri Lember; http://ocw.mit.edu/ 6.441 Information Theory; https://www.coursera.org/courses/

Basic terms: Information, Code
"Information". "Coding", "Code". Can you code the same information differently? Why would you? What properties can you require from a coding scheme? Are they contradictory?
Show 5 ways of coding the concept "number 42". What is the shortest way of coding this concept? How many bits are needed? Aha! Now define the term "code" once again.

Basic terms: Coding
Suppose we have a set of three concepts. Denote them as A, B and C. Propose a code for this set.
Consider the following code: A → 0, B → 1, C → 01. What do you think about it?
Define "variable-length code". Define "uniquely decodable code".

Basic terms: Prefix-free
If we want to code a series of messages, what would be a great property for a code to have? Define "prefix-free code". For historical reasons those are more often referred to as "prefix codes".
Find a prefix-free code for {A, B, C}. Is it uniquely decodable? Is prefix-free ⇒ uniquely decodable? Is uniquely decodable ⇒ prefix-free?

A prefix-free code can always be represented as a tree with symbols at the leaves.

Compression
Consider some previously derived code for {A, B, C}. Is it good for compression purposes? Define "expected code length".
Let event probabilities be as follows: A → 0.50, B → 0.25, C → 0.25. Find the shortest possible prefix-free code.

Compression & Prefix coding
Does the "prefix-free" property sacrifice code length? No: for each uniquely decodable code there exists a prefix code with the same codeword lengths.

Huffman code
Consider the following event probabilities: A → 0.50, B → 0.25, C → 0.125, D → 0.125, and some event sequence ADABAABACDABACBA...
Replace all events C and D with a new event "Z". Construct the optimal code for {A, B, Z}. Extend this code to a new code for {A, B, C, D}.

Huffman coding algorithm
Generalize the previous construction to obtain an optimal prefix-free code.
Use Huffman coding to encode "YAYBANANABANANA". Compare its efficiency to a straightforward 2-bit encoding.
D. Huffman. "A Method for the Construction of Minimum-Redundancy Codes", 1952.

Huffman coding in practice
Is just saving the result of Huffman coding to a file enough? What else should be done? How?
Straightforward approach: dump the tree using a preorder traversal. Smarter approach: save only the code lengths (see Wikipedia: Canonical Huffman Code, and RFC 1951: DEFLATE Compressed Data Format Specification version 1.3, Section 3.2.2).
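The greedy construction described above is easy to sketch in code. The following is a minimal Python illustration (my own, not the lecture's reference implementation): it counts symbol frequencies with collections.Counter, repeatedly merges the two lightest subtrees with heapq, and reads the codewords off the resulting tree. The function name and the tie-breaking order are arbitrary choices.

```python
import heapq
from collections import Counter

def huffman_code(freqs):
    """Build a prefix-free code by greedily merging the two least frequent subtrees."""
    # Heap entries are (weight, tie_breaker, tree); a tree is a symbol or a pair of subtrees.
    heap = [(w, i, sym) for i, (sym, w) in enumerate(freqs.items())]
    if len(heap) == 1:                       # degenerate case: a single distinct symbol
        return {heap[0][2]: "0"}
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, count, (t1, t2)))
        count += 1
    code = {}
    def walk(tree, prefix):                  # left branch gets "0", right branch gets "1"
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            code[tree] = prefix
    walk(heap[0][2], "")
    return code

text = "YAYBANANABANANA"
code = huffman_code(Counter(text))
encoded = "".join(code[c] for c in text)
print(code)                                   # e.g. {'A': '0', 'N': '10', 'Y': '110', 'B': '111'}
print(len(encoded), "bits vs", 2 * len(text), "bits for a fixed 2-bit code")
```

With the letter frequencies of "YAYBANANABANANA" (A: 7, N: 4, Y: 2, B: 2), the resulting code spends 27 bits, against 30 bits for the fixed 2-bit encoding.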
Huffman code optimality
Consider an alphabet, sorted by event (letter) probability, e.g.
x1 → 0.42, x2 → 0.25, …, x9 → 0.01, x10 → 0.01.
Is there just a single optimal code for it, or several of them?

Huffman code optimality
Show that each optimal code has l(x1) ≤ l(x2) ≤ … ≤ l(x10).
Show that there is at least one optimal code where x9 and x10 are siblings in the prefix tree.
Let L be the expected length of the optimal code. Merge x9 and x10, and let Ls be the expected length of the resulting smaller code. Express L in terms of Ls. Complete the proof.

Huffman code in real life
Which of those use Huffman coding? DEFLATE (ZIP, GZIP), JPEG, PNG, GIF, MP3, MPEG-2.
All of them do, as a post-processing step.

Shannon-Fano code
I randomly chose a letter from the following distribution: A → 0.45, B → 0.35, C → 0.125, D → 0.125. You need to guess it in the smallest expected number of yes/no questions. Devise an optimal strategy.

Shannon-Fano code
Constructs a prefix code in a top-down manner:
- Split the alphabet into two parts with as equal probability as possible.
- Construct a code for each part.
- Prepend 0 to the codes of the first part and 1 to the codes of the second part.
Is Shannon-Fano the same as Huffman?

Shannon-Fano & Huffman
Shannon-Fano is not always optimal. Show that it is optimal, though, for letter probabilities of the form 1/2^k.

log(p) as amount of information
Let letter probabilities all be of the form p = 1/2^k. Show that for the optimal prefix code, the length of the codeword for a letter with probability p_i = 1/2^k is exactly k = log2(1/p_i) = −log2(p_i).

Why logarithms?
Intuitively, we want a measure of information to be "additive": receiving N equivalent events must correspond to "N times" the information in the single event. However, probabilities are … Therefore, the most logical way to measure the information of an event is …

The thing to remember
log2(1/p) is the information content of a single random event with probability p. For p of the form 2^(−k) it is exactly the number of bits needed to code this event using an optimal binary prefix-free code.
For other values of p the information content is not an integer. Obviously, you cannot use something like "2.5 bits" to encode a symbol. However, for longer texts you can code multiple symbols at once, and in this case you can achieve an average coding rate of this number (e.g. 2.5) of bits per occurrence of the corresponding event.

Expected codeword length
Let letter probabilities all be of the form p = 1/2^k. What is the expected code length for the optimal binary prefix-free code?

The thing to remember
For a given discrete probability distribution, the function
H(p1, p2, …, pn) = p1·log2(1/p1) + … + pn·log2(1/pn)
is called the entropy of this distribution.

Meaning of entropy
The average codeword length L for both Huffman and Shannon-Fano codes satisfies: H(P) ≤ L < H(P) + 1.
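As a quick numeric check of this bound, here is a short Python sketch (my own illustration; the distribution and the prefix code are the A/B/C/D Huffman example from earlier in the slides):

```python
from math import log2

# The slides' example: A -> 0.50, B -> 0.25, C -> 0.125, D -> 0.125
P = {"A": 0.50, "B": 0.25, "C": 0.125, "D": 0.125}
code = {"A": "0", "B": "10", "C": "110", "D": "111"}   # an optimal prefix-free code

H = sum(p * log2(1 / p) for p in P.values())           # entropy of the distribution
L = sum(p * len(code[s]) for s, p in P.items())        # expected codeword length

print(H, L)   # both are 1.75, so H(P) <= L < H(P) + 1 holds with equality
```

Because every probability here is a power of 1/2, the expected length matches the entropy exactly; for other distributions L can exceed H(P) by up to one bit.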
Meaning of entropy
Shannon Source Coding Theorem. A sequence of N events from a probability distribution P can be losslessly represented as a sequence of N·H(P) bits for sufficiently large N. Conversely, it is impossible to losslessly represent the sequence using less than N·H(P) bits.

The things to remember
log2(1/p) is the information content of a single random event with probability p, measured in bits. That is, it is the expected number of bits necessary to optimally encode an event with such a probability.
H(P) is the expected information content of the distribution P, measured in bits. That is, it is the expected number of bits necessary to optimally encode a single random event from this distribution.

Demonstrate an N-element distribution with zero entropy. Demonstrate an N-element distribution with maximal entropy. Define entropy for a continuous distribution p(x).

Is Huffman code good for coding images? Music? Text?
None of them: Huffman coding assumes an i.i.d. sequence, yet all of those have a lot of structure. What is it good for? It is good for coding random-like sequences.

Say we need to encode the text
THREE SWITCHED WITCHES WATCH THREE SWISS SWATCH WATCH SWITCHES. WHICH SWITCHED WITCH WATCHES WHICH SWISS SWATCH WATCH SWITCH?
Can we code this better than Huffman? Of course, if we use a dictionary. Can we build the dictionary adaptively from the data itself?

Lempel-Ziv-Welch algorithm
Say we want to code the string "AABABBCAB".
1. Start with a dictionary {0 → ""}.
2. Scan the string from the beginning. Find the longest prefix present in the dictionary (0, "").
3. Read one more letter, "A". Output the prefix id and this letter: (0, "A").
4. Append <current prefix><current letter> to the dictionary. New dictionary: {0 → "", 1 → "A"}.
5. Finish the coding.
Terry Welch, "A Technique for High-Performance Data Compression," 1984.

LZW algorithm
Unpack the obtained code. Can we do smarter initialization? If we pack a long text, the dictionary may bloat. How do we handle it?
In practice LZW coding is followed by Huffman (or a similar) coding.

Theorem
LZW coding is asymptotically optimal: as the length of the string goes to infinity, the compression ratio approaches the best possible (given some conditions).
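To make the walkthrough above concrete, here is a small Python sketch of the encoder as described on the slides (my own illustration). Note that it follows the slides' pair-emitting variant, which is closer to LZ78; classical LZW pre-loads the dictionary with all single characters and emits indices only.

```python
def lzw_encode(s):
    """Encode a string following the slides' walkthrough: start with {"": 0},
    repeatedly emit (id of the longest known prefix, next letter) and add
    prefix + letter to the dictionary under the next free id."""
    dictionary = {"": 0}
    output = []
    prefix = ""
    for ch in s:
        if prefix + ch in dictionary:
            prefix += ch                                # keep extending the known prefix
        else:
            output.append((dictionary[prefix], ch))     # emit (prefix id, letter)
            dictionary[prefix + ch] = len(dictionary)   # learn the new phrase
            prefix = ""
    if prefix:                                          # flush a trailing prefix, if any
        output.append((dictionary[prefix[:-1]], prefix[-1]))
    return output, dictionary

codes, dictionary = lzw_encode("AABABBCAB")
print(codes)        # [(0, 'A'), (1, 'B'), (2, 'B'), (0, 'C'), (1, 'B')]
print(dictionary)   # {'': 0, 'A': 1, 'AB': 2, 'ABB': 3, 'C': 4}
```

Decoding replays the same construction: each pair (i, c) expands to dictionary[i] + c, and that string is added to the decoder's dictionary under the next free id, so the two sides stay in sync.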