Bérci Norbert: Number Representation and Character Encoding (Lecture Notes)


Number representation and character encoding (lecture notes)
Bérci Norbert
Material for the classes of 15-16 September 2014

Contents

1. Numeral systems
1.1. The base of a numeral system and its digits
1.2. Face value and place value
1.3. Writing integers
1.4. Writing non-integers
1.5. Conversion between numeral systems
1.6. Exercises
1.7. Precision of numeral systems
2. Units of measurement
3. Machine number representation
3.1. Representation of non-negative integers
3.2. Representation of negative integers
3.3. Comparison of integer data representations
3.4. Limits and precision of integer representations
3.5. Floating-point number representation
3.6. IEEE 754 floating-point number representation
3.7. Numerical mathematics
4. Characters and their encoding
4.1. Characters and character sets
4.2. Encoding characters
4.3. Classic code tables
4.4. Unicode
4.5. Text files
4.6. Exercises

1. Numeral systems

A numeral system (the English term is "numeral system", not "numeric system") is a method for representing a number, as a mathematical concept, in written form. This part covers numeral systems that are based on place value (position).
Non-positional numeral systems also exist, for example Roman numerals, which are based on the order of the symbols, but they are not discussed further in these notes.

Revision: 60 (Date: 2014-09-20 11:16:32 +0200 (Sat, 20 Sep 2014))

1.1. The base of a numeral system and its digits

The two most important parameters of a positional numeral system are its base (also called radix) and the digits that may be written in each position. The two are not independent: the base determines the largest digit that can appear in a position. If the base is $A$, the smallest usable digit is 0 and the largest is $A - 1$.

Example 1.1.1. In the decimal system the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 may appear; in the octal system we can choose from the digits 0, 1, 2, 3, 4, 5, 6, 7; and in the binary system 0 and 1 are the two possible digits.

For bases greater than ten, the set of digits is extended after 9 with the letters of the alphabet. Upper-case and lower-case letters are usually not distinguished, although some large-base systems may need the distinction.

Example 1.1.2. The "digits" usable in the hexadecimal system are 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f (or 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F).

If the base is not clear from the context, it can be indicated in square brackets as a subscript at the lower right of the number. For example: $5221_{[10]}$, $726_{[8]}$ or $80_{[16]}$.

Besides the familiar base-ten decimal system, the systems used most often in computing are the base-two binary, the base-eight octal and the base-sixteen hexadecimal systems. In addition to the subscript notation described above, binary numbers are marked with a b postfix, octal numbers with a leading 0, and hexadecimal numbers with a 0x or 0X prefix or an h postfix.
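As a minimal sketch of how a program could interpret these markers (Python is used purely for illustration; the function name `parse_literal` and the order in which the rules are checked are assumptions, not part of the notes):

```python
def parse_literal(text: str) -> int:
    """Interpret a non-negative literal using the prefix/postfix markers.

    Hypothetical helper: the checking order below is one possible choice.
    """
    s = text.strip().lower()
    if s.startswith("0x"):                 # 0x / 0X prefix: hexadecimal
        return int(s[2:], 16)
    if s.endswith("h"):                    # h postfix: hexadecimal
        return int(s[:-1], 16)
    if s.endswith("b"):                    # b postfix: binary
        return int(s[:-1], 2)
    if len(s) > 1 and s.startswith("0"):   # leading 0: octal
        return int(s[1:], 8)
    return int(s, 10)                      # no marker: decimal
```

With these rules, `parse_literal("100b")` gives 4, `parse_literal("065")` gives 53, `parse_literal("0x243")` gives 579 and `parse_literal("22h")` gives 34. The hexadecimal checks come before the binary one because a hexadecimal number may itself end in the digit b (for example 0x1b); a bare string such as 1b remains ambiguous under any ordering.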
These are the notations used most often in computing. For example: 100b (binary), 065 (octal), 0x243 (hexadecimal), 0X331 (hexadecimal), 22h (hexadecimal). If the base is indicated neither before the number, nor after it, nor in a subscript, the number is interpreted as decimal.

1.2. Face value and place value

For a number written in a given numeral system, the value of a digit equals the product of the digit's face value and its place value. The face value is the value that belongs to the digit symbol itself; the place value is the power of the base that corresponds to the digit's position. For 0, 1, ..., 9 the face value is obvious; for the letters that extend the digit set the face values are a = 10, b = 11, c = 12, d = 13, and so on.

Example 1.2.1. In the number 32 written in the decimal system, the place value of the 3 is $10^1 = 10$, because it stands in the second position from the right (and place values start from the zeroth power). In this example the value of the digit 3 is therefore $3 \cdot 10^1 = 3 \cdot 10 = 30$.

Example 1.2.2. In the number 32 written in the decimal system, the place value of the 2 is $10^0 = 1$, because it stands in the first position from the right (and place values start from the zeroth power). In this example the value of the digit 2 is therefore $2 \cdot 10^0 = 2 \cdot 1 = 2$.

1.3. Writing integers

In general, an integer can be written in the form $a_n a_{n-1} \ldots a_1 a_0$, and the value of a number written this way (assuming base $A$) is

$$(a_n \cdot A^n) + (a_{n-1} \cdot A^{n-1}) + \cdots + (a_1 \cdot A^1) + (a_0 \cdot A^0),$$

which is simply the sum of the values of the written digits, each computed as described in section 1.2.

Example 1.3.1. A trivial example: $405_{[10]} = 4 \cdot 10^2 + 0 \cdot 10^1 + 5 \cdot 10^0 = 400 + 5$

Example 1.3.2. $405_{[8]} = 4 \cdot 8^2 + 0 \cdot 8^1 + 5 \cdot 8^0 = 256 + 5 = 261$

Example 1.3.3. $1001101_{[2]} = 1 \cdot 2^6 + 0 \cdot 2^5 + 0 \cdot 2^4 + 1 \cdot 2^3 + 1 \cdot 2^2 + 0 \cdot 2^1 + 1 \cdot 2^0 = 64 + 8 + 4 + 1 = 77$

Example 1.3.4. $\mathrm{0xA3} = 10 \cdot 16^1 + 3 \cdot 16^0 = 10 \cdot 16 + 3 \cdot 1 = 163$

Negative integers are written by writing their absolute value in some numeral system as above and putting a minus sign in front of it (although in practice this notation is not used outside the decimal system).

1.4. Writing non-integers

The method introduced for integers can be extended: when assigning place values we do not stop at the zeroth power but continue with the negative powers, which makes it possible to write non-integers. The general form is therefore $a_n a_{n-1} \ldots a_1 a_0 a_{-1} \ldots a_{-k}$, and the value of a number written this way (assuming base $A$) is

$$a_n \cdot A^n + a_{n-1} \cdot A^{n-1} + \cdots + a_1 \cdot A^1 + a_0 \cdot A^0 + a_{-1} \cdot A^{-1} + \cdots + a_{-k} \cdot A^{-k}.$$

Because this notation can be extended at both ends (integer part and fractional part), the boundary between the two parts must be marked to keep it unambiguous; Hungarian orthography uses a decimal comma for this. Contrary to Hungarian orthography, and to make lists of non-integers easier to read, these notes use the decimal point from here on (for example 1.6, 2.4, 5.9 instead of 1,6, 2,4, 5,9).

Example 1.4.1. A trivial example: $405.23_{[10]} = 4 \cdot 10^2 + 0 \cdot 10^1 + 5 \cdot 10^0 + 2 \cdot 10^{-1} + 3 \cdot 10^{-2} = 4 \cdot 100 + 5 \cdot 1 + 2 \cdot \frac{1}{10} + 3 \cdot \frac{1}{100}$

Example 1.4.2. $405.23_{[8]} = 4 \cdot 8^2 + 0 \cdot 8^1 + 5 \cdot 8^0 + 2 \cdot 8^{-1} + 3 \cdot 8^{-2} = 4 \cdot 64 + 5 \cdot 1 + 2 \cdot \frac{1}{8} + 3 \cdot \frac{1}{8^2} = 256 + 5 + \frac{2}{8} + \frac{3}{64} = 261\frac{19}{64} = 261.296875$

Example 1.4.3. $1001101.01_{[2]} = 1 \cdot 2^6 + 0 \cdot 2^5 + 0 \cdot 2^4 + 1 \cdot 2^3 + 1 \cdot 2^2 + 0 \cdot 2^1 + 1 \cdot 2^0 + 0 \cdot 2^{-1} + 1 \cdot 2^{-2} = 64 + 8 + 4 + 1 + \frac{1}{4} = 77.25$

Negative non-integers are written like negative integers, by putting a minus sign in front of the number (which, again, is used only in the decimal system).

1.5. Conversion between numeral systems

Conversion from a given numeral system to decimal has already been shown implicitly in the examples of sections 1.3 and 1.4. The reverse conversion is not covered here (the method is easy to work out, see exercise 1.6.5).

Conversion becomes much simpler when going from binary to octal or hexadecimal: the binary digits are simply grouped in threes (for octal) or in fours (for hexadecimal), and each group formed this way is converted on its own:

Example 1.5.1.
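Independently of the numbered examples, the evaluation rule of sections 1.3 and 1.4 and the grouping rule of section 1.5 can be sketched in code. This is a minimal illustration in Python, not the author's method: the names `DIGITS`, `face_value`, `to_decimal` and `regroup_binary` are hypothetical, exact fractions are used to avoid rounding, and grouping from the right-hand end of an integer is an assumption that the text above does not spell out.

```python
from fractions import Fraction

# Digit symbols in order of face value: 0-9, then a=10, b=11, ...
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def face_value(digit: str, base: int) -> int:
    """Return the face value of one digit, checking it is allowed in this base."""
    value = DIGITS.index(digit.lower())   # raises ValueError for unknown symbols
    if value >= base:
        raise ValueError(f"digit {digit!r} is not valid in base {base}")
    return value


def to_decimal(text: str, base: int) -> Fraction:
    """Sum face value times place value over all digits of a non-negative number."""
    whole, _, frac = text.partition(".")
    total = Fraction(0)
    # Integer part: exponents 0, 1, ..., n counted from the right.
    for exponent, digit in enumerate(reversed(whole)):
        total += face_value(digit, base) * Fraction(base) ** exponent
    # Fractional part: exponents -1, -2, ..., -k counted from the left.
    for k, digit in enumerate(frac, start=1):
        total += face_value(digit, base) * Fraction(base) ** -k
    return total


def regroup_binary(bits: str, group: int) -> str:
    """Convert a binary integer to base 2**group by grouping digits from the right.

    group=3 gives octal, group=4 gives hexadecimal.
    """
    # Pad with leading zeros so the length is a multiple of the group size.
    width = -(-len(bits) // group) * group
    padded = bits.zfill(width)
    chunks = [padded[i:i + group] for i in range(0, width, group)]
    # Each group is converted on its own to a single digit of the target base.
    return "".join(DIGITS[int(chunk, 2)] for chunk in chunks)
```

For instance, `to_decimal("405.23", 8)` yields the fraction 16723/64, which is 261.296875 as in example 1.4.2, and `regroup_binary("1001101", 3)` yields "115" while `regroup_binary("1001101", 4)` yields "4d", both of which equal 77 in decimal.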