Computer Mathematics, Week 5: Source Coding and Compression
Department of Mechanical and Electrical System Engineering

last week

binary representations of signed numbers
  – sign-magnitude, biased
  – one's complement, two's complement
signed binary arithmetic
  – negation
  – addition, subtraction
  – signed overflow detection
  – multiplication, division
width conversion
  – sign extension
floating-point numbers

this week

coding theory
  – source coding
information theory concept
  – information content
binary codes
  – numbers
  – text
variable-length codes
  – UTF-8
  – compression
  – Huffman's algorithm

coding theory

coding theory studies the encoding (representation) of information as numbers,
and how to make encodings more
  – efficient (source coding)
  – reliable (channel coding)
  – secure (cryptography)

binary codes

a binary code assigns one or more bits to represent some piece of information
  – a number
  – a digit, character, or other written symbol
  – a colour, pixel value
  – an audio sample, frequency, amplitude
  – etc.
codes can be arbitrary, or designed to have desirable properties
  – fixed length, or variable length
  – static, or generated dynamically for specific data

a code for numbers

why a special code for numbers? binary is ambiguous between values during movement
  – e.g., an electro-mechanical sensor moves from position 7 (0111) to position 8 (1000)
  – there is no guarantee that all the bits change simultaneously, so intermediate
    values can be read out
a better code would have only one bit of difference between adjacent positions
  – such codes are also known as minimum-change codes

[figure: a sensor misreading an encoded position mid-movement]

reflected binary code

begin with a single bit encoding two positions (0 and 1)
this has the property we need:
  – only one bit changes between adjacent positions
to double the size of the code...
  – repeat the bits by reflecting them
    (the two central positions now have the same code)
  – prefix the first half with 0, and the second half with 1
this still has the property we need:
  – only the newly-added first bit changes between the two central positions

  1-bit   2-bit   3-bit
  0       00      000
  1       01      001
          11      011
          10      010
                  110
                  111
                  101
                  100

this is Gray code, named after its inventor

Gray code and binary

conversion from binary to Gray code:
add (modulo 2) each input digit to the input digit on its left
(there is an implicit 0 on the far left)

  binary = 0 1 0 0 0 1 0 0 0
  Gray   = 0 1 1 0 0 1 1 0 0

conversion from Gray code to binary:
add (modulo 2) each input digit to the output digit on its left
(there is an implicit 0 on the far left)
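the homework at the end of these notes asks for a Python program that performs
these conversions; here is a minimal sketch (the function names binary_to_gray
and gray_to_binary and the bit-string representation are my own choices, not
from the slides):

def binary_to_gray(bits):
    """Add (mod 2) each input digit to the input digit on its left."""
    result = []
    prev = 0  # the implicit 0 on the far left
    for ch in bits:
        b = int(ch)
        result.append(str(b ^ prev))  # XOR is addition modulo 2
        prev = b                      # next digit pairs with this INPUT digit
    return "".join(result)

def gray_to_binary(bits):
    """Add (mod 2) each input digit to the OUTPUT digit on its left."""
    result = []
    prev = 0  # the implicit 0 on the far left
    for ch in bits:
        b = int(ch) ^ prev            # add to the previous output digit
        result.append(str(b))
        prev = b
    return "".join(result)

# the example from the slides: binary 010001000 <-> Gray 011001100
assert binary_to_gray("010001000") == "011001100"
assert gray_to_binary("011001100") == "010001000"

note that the two directions differ in exactly one place: encoding adds each
input digit to the previous input digit, while decoding adds it to the
previous output digit.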
[figures: binary and Gray code rotary encoders; a high-resolution
Gray code rotary encoder]

fixed length codes for text

  – Baudot code (5 bits)
  – International Telegraph Alphabet (ITA) No. 2 (5 bits)
  – ASCII (7+1 bits)

fixed length codes for text: ASCII

ASCII (American Standard Code for Information Interchange)
  – latin alphabet, arabic digits
  – common western punctuation and mathematical symbols

the ASCII codes, in hexadecimal (row = most significant digit,
column = least significant digit):

     0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F
 0  NUL SOH STX ETX EOT ENQ ACK BEL BS  HT  NL  VT  NP  CR  SO  SI
 1  DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM  SUB ESC FS  GS  RS  US
 2  SP  !   "   #   $   %   &   '   (   )   *   +   ,   -   .   /
 3  0   1   2   3   4   5   6   7   8   9   :   ;   <   =   >   ?
 4  @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O
 5  P   Q   R   S   T   U   V   W   X   Y   Z   [   \   ]   ^   _
 6  `   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o
 7  p   q   r   s   t   u   v   w   x   y   z   {   |   }   ~   DEL

  text:         H  e  l  l  o  ␣  t  h  e  r  e  !
  hex encoding: 48 65 6C 6C 6F 20 74 68 65 72 65 21

fixed length codes for text: UTF-32

Unicode is a mapping of all written symbols to numbers
  – modern languages, ancient languages, math/music/art symbols
    (and far too many useless emoticons)
  – 21-bit numbering, usually written 'U+hexadecimal-code'
  – the first 128 Unicode characters are the same as ASCII

  name           symbol   Unicode    ASCII (hexadecimal)
  asterisk       *        U+2A       2A
  star           ★        U+2605     n/a
  shooting star  🌠        U+1F320    n/a

UTF (Unicode Transformation Format) is an encoding of Unicode in binary
UTF-32 is a fixed length encoding
  – each character's number is represented directly as a 32-bit integer

variable length codes for text: UTF-8

UTF-32 is wasteful
  – most western text uses the 7-bit ASCII subset, U+20 – U+7F
  – most other modern text uses characters with numbers < 2^16
      – kana are in the range U+3000 – U+30FF
      – kanji are in the range U+4E00 – U+9FAF
UTF-8 is a variable length encoding of Unicode
  – characters are between 1 and 4 bytes long
  – the first byte encodes the length of the character
  – the most common Unicode characters are encoded as one, two, or three UTF-8 bytes

  number of bits    code points         byte 1     byte 2     byte 3     byte 4
  7                 U+00 – U+7F         0xxxxxxx
  11 (5+6)          U+0080 – U+07FF     110xxxxx   10xxxxxx
  16 (4+6+6)        U+0800 – U+FFFF     1110xxxx   10xxxxxx   10xxxxxx
  21 (3+6+6+6)      U+010000 – U+10FFFF 11110xxx   10xxxxxx   10xxxxxx   10xxxxxx

the number of leading '1's in the first byte encodes the length in bytes

example: U+2605 ('★') = 0010 011000 000101 in binary, split 4+6+6;
prefixing the three groups gives 11100010 10011000 10000101 = E2 98 85 in hex
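the homework also asks for a Python program that encodes and decodes UTF-8;
here is a minimal encoding sketch following the table above (the function name
utf8_encode is my own choice, and for brevity the sketch omits real-world
details such as rejecting the surrogate range U+D800 – U+DFFF):

def utf8_encode(code_point):
    """Encode one Unicode code point as UTF-8 bytes (see the table above)."""
    if code_point <= 0x7F:          # 7 bits:  0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:         # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:        # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:      # 21 bits: 11110xxx then three 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point outside the Unicode range")

# the example from the slides: U+2605 (the star) encodes as E2 98 85
assert utf8_encode(0x2605) == b"\xe2\x98\x85"
# and it agrees with Python's built-in encoder
assert utf8_encode(0x2605) == "\u2605".encode("utf-8")

decoding reverses the process: the number of leading '1's in the first byte
gives the length, and the payload bits are reassembled from the 'x' positions;
that half is left for the homework.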
information theory

how much information is carried by a message?
information is conveyed when you learn something new
  – if the message is predictable (e.g., a series of '1's), it has minimum
    information content: zero bits of information per bit transmitted
  – if the message is unpredictable (e.g., a series of random bits), it has
    maximum information content: one bit of information per bit transmitted
if the probability of receiving a given bit (or symbol, or message) x is Px, then

  information content of x = log2(1 / Px) bits

variable length encoding of known messages

messages often use only a few symbols from a code; e.g., in typical English text
  – 12.702% of letters are 'e'
  – 0.074% of letters are 'z'
considering their probability in English text
  – 'e' conveys few bits of information: log2(1 / 0.12702) = 2.98 bits
  – 'z' conveys many bits of information: log2(1 / 0.00074) = 10.4 bits
and yet they are both encoded using 8 bits of UTF-8 (or ASCII)

if we know the message in advance, we could construct an encoding optimised for it
  – make high-probability (low information) symbols use few bits
  – make low-probability (high information) symbols use more bits
this is the goal of source coding, or 'compression'

binary encodings as decision trees

follow the path from the root to the symbol in the decision tree:
a 0 means 'go left' and a 1 means 'go right'

[figure: the 8-bit ASCII/UTF-8 code drawn as a decision tree, with subtrees
for UTF-8 multibyte characters (80–F7), digits and punctuation (00–3F), and
capital letters (40–5F); the path to 'e' is 01100101 (65 hex) and the path
to 'z' is 01111010 (7A hex), both eight steps long]

for English, we really want 'e' to have a shorter path (be closer to the root)
than 'z'

Huffman coding

Huffman codes are variable-length codes that are optimal for a known message
the message is analysed to find the frequency (probability) of each symbol
a binary encoding is then constructed, from least to most probable symbols:
  – create a list of 'subtrees', each containing just one symbol labeled
    with its probability
  – repeatedly:
      – remove the two lowest-probability subtrees s, t
      – combine them into a new subtree p
      – label p with the sum of the probabilities of s and t
      – insert p into the list
    until only one subtree remains
the final remaining subtree in the list is an optimal binary encoding for the
message (a Python sketch of this construction appears at the end of these notes)

Huffman coding example

known message 'GOOGLING':

  frequency   symbol       P(s)
  1           L, I, N      1/8   (3.00 bits of information each)
  2           O            2/8   (2.00 bits)
  3           G            3/8   (1.42 bits)

repeatedly combine the two lowest-probability subtrees, then read the code
table off the final tree (0 = left, 1 = right):

  symbol   encoding
  N        00
  O        01
  L        100
  I        101
  G        11

encoded message:  G  O  O  G  L   I   N  G
                  11 01 01 11 100 101 00 11

message length: 64 bits (UTF-8) vs. 18 bits (encoded)
compression ratio: 64 / 18 = 3.6
actual information: 3 × 3.00 + 2 × 2.00 + 3 × 1.42 = 17.26 bits
encoding optimal? ⌈17.26⌉ = 18, so yes

common source coding algorithms

zip
  – Huffman coding with a dynamic encoding tree
  – the tree is updated to keep the most recent common byte sequences
    efficiently encoded
video
  – mp2, or MPEG-2 (Moving Picture Experts Group layer 2), for DVDs
  – mp4, or MPEG-4, for web video
  – avi (Audio Video Interleave), wmv (Windows Media Video), flv (Flash Video)
lossy audio (decoded audio has degraded quality; the loss is permanent)
  – mp3 (MPEG Audio Layer III)
  – AAC (Advanced Audio Coding), wma (Windows Media Audio)
lossless audio (decoded audio is identical to the original)
  – ALAC (Apple Lossless Audio Codec)
  – FLAC (Free Lossless Audio Codec)
(CODEC = coder-decoder, a program that both encodes and decodes data)

next week

channel coding
information theory concept
  – Hamming distance
error detection
  – motivation
  – parity
  – cyclic redundancy checks
error correction
  – motivation
  – block parity
  – Hamming codes

homework

practice converting between binary and Gray code
  – write a Python program to do it
practice encoding and decoding some UTF-8 characters
  – write a Python program to do it
practice constructing Huffman codes
  – write a Python program to do it
ask about anything you do not understand
  – it will be too late for you to try to catch up later!
  – I am always happy to explain things differently and to practice
    examples with you

glossary

ASCII — American Standard Code for Information Interchange.
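finally, as a starting point for the Huffman homework exercise, here is a
minimal sketch of the construction described in the Huffman coding section.
The function name huffman_code and the use of Python's heapq module to hold
the list of subtrees are my own choices; ties between equal-probability
subtrees may be broken differently from the worked example, so the individual
codes can differ, but the total encoded length is the same:

import heapq
from collections import Counter

def huffman_code(message):
    """Build a Huffman code table for a known message."""
    # frequency (probability) of each symbol in the message
    freq = Counter(message)
    # one subtree per symbol, labelled with its frequency; the unique
    # id breaks ties so subtrees themselves are never compared
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # remove the two lowest-probability subtrees s and t...
        f1, _, s = heapq.heappop(heap)
        f2, _, t = heapq.heappop(heap)
        # ...combine them, label with the summed probability, re-insert
        heapq.heappush(heap, (f1 + f2, next_id, (s, t)))
        next_id += 1
    # read the code table off the final tree: 0 = left, 1 = right
    table = {}
    def walk(node, path):
        if isinstance(node, tuple):        # internal node: (left, right)
            walk(node[0], path + "0")
            walk(node[1], path + "1")
        else:                              # leaf: a symbol
            table[node] = path or "0"      # lone-symbol message edge case
    walk(heap[0][2], "")
    return table

table = huffman_code("GOOGLING")
encoded = "".join(table[c] for c in "GOOGLING")
print(table)          # code assignments (may differ from the slides)
print(len(encoded))   # 18 bits, matching the worked example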