Would UTF-8 Be Able to Support the Inclusion of a Vast Alien Language with Millions of New Characters?
Software Engineering Stack Exchange is a question and answer site for professionals, academics, and students working within the systems development life cycle.

Would UTF-8 be able to support the inclusion of a vast alien language with millions of new characters?

In the event an alien invasion occurred and we were forced to support their languages in all of our existing computer systems, is UTF-8 designed in a way that allows for their possibly vast number of characters? (Of course, we do not know if aliens actually have languages, or if or how they communicate, but for the sake of the argument, please just imagine they do.)

For instance, if their language consisted of millions of newfound glyphs, symbols, and/or combining characters, could UTF-8 theoretically be expanded in a non-breaking way to include these new glyphs and still support all existing software? I'm more interested in what happens if the glyphs far outgrow the current size limitations and require more bytes to represent a single glyph. In the event UTF-8 could not be expanded, does that prove that its single advantage over UTF-32 is simply the smaller size of lower characters?

unicode utf-8 · edited May 23 '17 at 12:40 by Community ♦ · asked Nov 24 '15 at 12:18 by Qix

"support their languages" (my emphasis)... How many? Are we sure the languages can be broken down into characters? Maybe the language is based on spatial relations - see Ted Chiang, "Story of Your Life", in Stories of Your Life and Others. At best, this is simply a max-things-in-X-bytes question (off-topic). At worst, it's speculative nonsense. (Not clear what you're asking.) – Scant Roger Nov 24 '15 at 13:12

@ScantRoger The accepted answer does a fine job of answering the question as it was intended. – Qix Nov 24 '15 at 13:13

The accepted answer does a fine job of telling us the facts of UTF-8, UTF-16, and UTF-32. You could simply look this up on Wikipedia. As for "alien invasion", I don't see how the answer addresses it at all. – Scant Roger Nov 24 '15 at 13:17

Related (on Stack Overflow): Is UTF-8 enough for all common languages? – yannis ♦ Nov 24 '15 at 13:17

Unicode does not support languages, it supports characters - glyphs used to represent meaning in written form. Many human languages do not have a script and hence cannot be supported by Unicode. Not to mention that many animals communicate but don't have a written language. Communication by, say, illustrations or wordless comics cannot be supported by Unicode, since the set of glyphs is not finite. By definition we don't know how aliens communicate, so your question is impossible to answer. If you just want to know how many distinct characters Unicode can support, you should probably clarify :) – JacquesB Nov 24 '15 at 16:41

5 Answers

The Unicode standard has lots of space to spare. The Unicode codepoints are organized in “planes” and “blocks”. Of 17 total planes, there are 11 currently unassigned. Each plane holds 65,536 characters, so there's realistically half a million codepoints to spare for an alien language (unless we fill all of that up with more emoji before first contact). As of Unicode 8.0, only 120,737 code points have been assigned in total (roughly 10% of the total capacity), with roughly the same amount being unassigned but reserved for private, application-specific use. In total, 974,530 codepoints are unassigned.
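A quick back-of-the-envelope check of those figures, as a minimal Python sketch (the assigned-codepoint count is the Unicode 8.0 figure quoted above):

    # Unicode code space: 17 planes of 65,536 code points each
    PLANES = 17
    PER_PLANE = 0x10000

    total = PLANES * PER_PLANE      # U+0000 .. U+10FFFF
    assigned = 120_737              # assigned as of Unicode 8.0, per the answer
    empty_planes = 11               # planes with no assignments at all

    print(f"total code space: {total:,}")                     # 1,114,112
    print(f"assigned share:   {assigned / total:.1%}")        # ~10.8%, i.e. "roughly 10%"
    print(f"in empty planes:  {empty_planes * PER_PLANE:,}")  # 720,896 spare codepoints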
UTF-8 is a specific encoding of Unicode, and is currently limited to four octets (bytes) per code point, which matches the limitations of UTF-16. In particular, UTF-16 only supports 17 planes. Previously, UTF-8 supported six octets per codepoint, and was designed to support 32,768 planes. In principle this 4-byte limit could be lifted, but that would break the current organization structure of Unicode, and would require UTF-16 to be phased out – unlikely to happen in the near future considering how entrenched it is in certain operating systems and programming languages.

The only reason UTF-16 is still in common use is that it's an extension of the flawed UCS-2 encoding, which only supported a single Unicode plane. It otherwise inherits undesirable properties from both UTF-8 (not fixed-width) and UTF-32 (not ASCII compatible, waste of space for common data), and requires byte order marks to declare endianness. Given that despite these problems UTF-16 is still popular, I'm not too optimistic that this is going to change by itself very soon. Hopefully, our new Alien Overlords will see this impediment to Their rule, and in Their wisdom banish UTF-16 from the face of the earth.

edited Nov 24 '15 at 13:47 · answered Nov 24 '15 at 12:48 by amon

Actually, UTF-8 is limited to only a part of even the 4-byte limit, in order to match UTF-16. Specifically, to 17/32 of it, slightly more than half. – Deduplicator Nov 24 '15 at 15:19

Outside of Windows I know of no other OS where either the OS or the majority of programs on the OS use UTF-16. OS X programs are typically UTF-8, Android programs are typically UTF-8, Linux programs are typically UTF-8. So all we need is for Windows to die (it already is sort of dead in the mobile space). – slebetman Nov 25 '15 at 3:14

Unless we fill all of that up with more emoji before first contact... There you have it. The most significant threat to peaceful interaction with aliens is emoji. We're doomed. – rickster Nov 25 '15 at 6:03

@slebetman Not really. Anything JVM-based uses UTF-16 (Android as well, not sure why you say it doesn't), JavaScript uses UTF-16, and given that Java and JavaScript are the most popular languages, UTF-16 is not going anywhere anytime soon. – Malcolm Nov 25 '15 at 8:34

@Kaiserludi "Most linux code uses UTF32 for unicode", yeah, no. Seriously, where the hell did you get that idea? There is not even a wfopen syscall or anything else; it's UTF-8 all the way. Hell, even Python and Java - both of which define strings as UTF-16 for historical reasons - do not store strings as UTF-16 except when necessary: large memory benefits and no performance hits (and that despite the additional code to handle conversions - memory is expensive, CPU is cheap). Same goes for Android - the NDK's JString is UTF-8, mostly because Google engineers are not insane. – Voo Nov 25 '15 at 22:19

If UTF-8 is actually to be extended, we should look at the absolute maximum it could represent. UTF-8 is structured like this:

    Char. number range  |        UTF-8 octet sequence
       (hexadecimal)    |              (binary)
    --------------------+---------------------------------------------
    0000 0000-0000 007F | 0xxxxxxx
    0000 0080-0000 07FF | 110xxxxx 10xxxxxx
    0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
    0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

(Shamelessly copied from the RFC.) We see that the first byte always controls how many follow-up bytes make up the current character.
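To illustrate, here is a minimal Python sketch that encodes one arbitrary example code point from each of the four ranges and prints its bit pattern, showing the lead-byte prefixes (0, 110, 1110, 11110) and the 10-prefixed continuation bytes from the table:

    # Encode one example code point from each range above and show its bit pattern
    for cp in (0x41, 0x7FF, 0xFFFD, 0x10FFFF):
        encoded = chr(cp).encode("utf-8")
        bits = " ".join(f"{byte:08b}" for byte in encoded)
        print(f"U+{cp:06X}: {len(encoded)} byte(s) -> {bits}")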
If we extend it to allow up to 8 bytes, we get the additional non-Unicode representations

    111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    11111110 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    11111111 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

Calculating the maximum possible representations that this technique allows, we come to

      10000000₂
    + 00100000₂ * 01000000₂
    + 00010000₂ * 01000000₂^2
    + 00001000₂ * 01000000₂^3
    + 00000100₂ * 01000000₂^4
    + 00000010₂ * 01000000₂^5
    + 00000001₂ * 01000000₂^6
    + 00000001₂ * 01000000₂^7

or in base 10:

      128
    + 32 * 64
    + 16 * 64^2
    + 8 * 64^3
    + 4 * 64^4
    + 2 * 64^5
    + 1 * 64^6
    + 1 * 64^7

which gives us the maximum number of representations as 4,468,982,745,216.

So, if these 4 billion (or trillion, as you please) characters are enough to represent the alien languages, I am quite positive that we can, with minimal effort, extend the current UTF-8 to please our new alien overlords ;-)

edited Sep 20 '16 at 11:23 by Qix · answered Nov 24 '15 at 16:21 by Boldewyn

They shall be pleased. Thank you for this! – Qix Nov 24 '15 at 16:23

Currently UTF-8 is limited to code points only up to 0x10FFFF - but that is only for compatibility with UTF-16. If there was a need to extend it, there is no ambiguity about how to extend it with code points up to 0x7FFFFFFF (that's 2³¹-1). But beyond that I have seen conflicting definitions. One definition I have seen has 111111xx as a possible first byte followed by five extension bytes, for a maximum of 2³² code points. But that is only compatible with the definition you mention for the first 2³¹ code points. – kasperd Nov 24 '15 at 16:49

Yes, Wikipedia says something about UTF-16, when really they mean Unicode or ISO 10646 (depending on context). Actually, since RFC 3629, UTF-8 is undefined beyond U+10FFFF (or F4 8F BF BF in UTF-8 bytes). So, everything I mention here beyond that is pure speculation. Of course, someone could think of other extensions, where a high first byte signifies some other structure following (and hopefully not destroying self-sync in the process).
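As a quick check of the arithmetic in that answer, a minimal Python sketch that sums the payload bits available to each sequence length of the hypothetical 8-byte extension described above:

    # Representations per sequence length in the hypothetical extended UTF-8:
    # payload bits = free bits in the lead byte + 6 bits per continuation byte
    payload_bits = [7, 11, 16, 21, 26, 31, 36, 42]
    total = 0
    for length, bits in enumerate(payload_bits, start=1):
        count = 2 ** bits
        total += count
        print(f"{length}-byte sequences: {count:,}")
    print(f"total representations: {total:,}")   # 4,468,982,745,216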