Internationalization in Ruby 2.4
http://www.sw.it.aoyama.ac.jp/2016/pub/IUC40-Ruby2.4/

40th Internationalization and Unicode Conference
Santa Clara, California, U.S.A., November 3, 2016

Martin J. DÜRST
[email protected]
Aoyama Gakuin University

© 2016 Martin J. Dürst, Aoyama Gakuin University

Abstract

Ruby is a purely object-oriented scripting language which is easy to learn for beginners and highly appreciated by experts for its productivity and depth. This presentation discusses the progress of adding internationalization functionality to Ruby for the version 2.4 release expected towards the end of 2016. One focus of the talk will be the currently ongoing implementation of locale-aware case conversion.

Since version 1.9, Ruby has had a pervasive if somewhat unique framework for character encoding, allowing different applications to choose different internationalization models. In practice, Ruby is most often and most conveniently used with UTF-8.

Support for internationalization facilities beyond character encoding has been available via various external libraries. As a result, applications may use conflicting and confusing ways to invoke internationalization functionality.

To use case conversion as an example, up to version 2.3, Ruby comes with built-in methods for upcasing and downcasing strings, but these only work on ASCII. Our implementation extends this to the whole Unicode range for version 2.4, and efficiently reuses data already available for case-insensitive matching in regular expressions.

We study the interface of internationalization functions/methods in a wide range of programming languages and Ruby libraries. Based on this study, we propose to extend the current built-in Ruby methods, e.g. for case conversion, with additional parameters to allow language-dependent, purpose-based, and explicitly specified functionality, in a true Ruby way. Both the design and the implementation of the new functionality for Ruby 2.4 will be described.
This presentation is intended for users and potential users of the programming language Ruby, and people interested in internationalization of programming languages and libraries in general.

For Best Viewing
  These slides have been created in HTML, for projection with Opera (≤12.17 Windows/Mac/Linux). Use F11 to switch to projection mode and back.
  Texts in gray, like this one, are comments/notes which do not appear on the slides.
  Please note that depending on the browser and OS you use, some rare characters or special character combinations may not display as intended, but e.g. as empty boxes, question marks, or apart rather than composed.

Introduction

Introductions
  Audience:
    Programming experience?
    Programming with Ruby/Rails?
    Internationalization/Globalization experience?
    Unicode knowledge?
  Speaker:
    From Switzerland, living in Japan
    Long-term Unicode/W3C/IUC involvement
    Ruby committer since 2007, mainly contributing
      Encoding conversion (String#encode, Ruby 1.9)
      Unicode normalization (String#unicode_normalize, Ruby 2.2)
      Non-ASCII case conversion (String#upcase,..., Ruby 2.4)
      Unicode version updates (Unicode 9.0 for Ruby 2.4)

Overview
  Introduction
  Ruby Basics
  New in Ruby 2.4: Non-ASCII Case Conversion
  Implementation Details
  Lessons Learned and Future Work

Ruby Basics

Ruby
  Created by Yukihiro Matsumoto (Matz; since 1993)
  Easy for beginners, deep for experts
  Object-oriented throughout, but not obtrusive
  Extremely flexible
  Particularly strong for (internal) DSLs and metaprogramming
  Used for Ruby on Rails Web Framework

Ruby Implementations
  MRI (Matz's Ruby Implementation), aka C-Ruby
    available on many platforms (download for Windows)
  JRuby: Ruby on the JVM
  RubyMotion: Ruby for iOS, Android, and macOS
  Opal: Ruby to JavaScript compiler
  Rubinius: Ruby (mostly) in Ruby
  A lot more ...
  This tutorial is about MRI/C-Ruby, the reference implementation

Basic Ruby
  3.times { puts 'Hello Ruby!' }
  Hello Ruby!
  Hello Ruby!
  Hello Ruby!
  Everything is an object
  Methods can take blocks ({ ... } or do ... end)
  Unobtrusive syntax (no need for semicolons, ...)

Conventions Used in This Talk
  Code is mostly green, monospace: puts 'Hello Ruby!'
  Variable parts are orange: puts "some string"
  Encoding is indicated with a subscript: 'Юに코δ'UTF-8, 'ユニコード'SJIS
  Results are indicated with "→": 1 + 1 → 2

Frequent Example: Юに코δ
  Ю: Cyrillic uppercase YU
  に: Hiragana NI
  코: Hangul KO
  δ: Greek delta

Up and Running
  Install Ruby
  Open a UTF-8 based console
    Easy on Mac and Linux
    On Windows: Cygwin Terminal, PuTTY, ..., or command prompt with chcp 65001
  Start irb (Interactive Ruby)
  Type in Ruby commands

String Basics
  Strings are sequences of characters (codepoints): "Юに코δ".length → 4
  We can get a byte count with: "Юに코δ".bytesize → 10
  They are instances of class String: "Юに코δ".class → String
  Characters are strings of length 1: "Юに코δ"[0] → "Ю"; "Юに코δ"[0].length → 1
  Using the same class for both strings and characters avoids the distinction between characters and strings of length 1. This matches Ruby's "big classes" policy. It also leaves the door open for 'characters' other than single codepoints.
  Strings are not Arrays, but where it makes sense, operations work the same for both classes. This is called duck typing.

Encoding Basics
  Each String has an encoding
  Strings with different encodings can't be mixed
    'Юに코δ'UTF-8 + 'Юに코δ'UTF-16 → Encoding::CompatibilityError
  Trying to combine strings with different encodings, as here with concatenation (+), leads to an exception. There are some exceptions (sic!) to this rule that we will look at later. The reasoning for the error here is that transcoding should not happen without the programmer being aware of it.
    'Dürst'ISO-8859-1 == 'Dürst'ISO-8859-2 → false
  Trying to compare two character-by-character identical strings in different encodings will produce false, even if these strings are, as in the above example, also byte-for-byte identical.
  Again, the reason for the result is that encoding mismatches should be detected early. In addition, a simple byte-for-byte comparison could produce false positives.
  except if their content is ASCII-only (bytes)
    'abc'ISO-8859-1 == 'abc'Shift_JIS → true
  Just use Unicode, just use UTF-8

Ruby Likes UTF-8
  Default for source encoding (since Ruby 2.0)
    (no need for a # encoding: UTF-8 pragma)
  Encoding of strings with \u escapes is always UTF-8
    "abc\u03B4" → 'abcδ'UTF-8
  Use the -U option if not in a UTF-8 context: ruby -U myscript.rb
  Processing of UTF-8 is optimized where possible
  Used out of the box by Ruby on Rails
  Transcoding available on input/output
  The only (internal) encoding in Ruby 3.0 or 4.0 (speculation!)

Ruby Versions
  Ruby ≤1.8: RIP (Strings as byte sequences)
  Ruby 1.9 and later (Strings as character sequences)
  Ruby 2.0: UTF-8 default source encoding
  Ruby 2.2: Unicode normalization added (2014)
  Ruby 2.3: Newest published version
  Ruby 2.4: Release planned for Christmas 2016, non-ASCII case conversion

Ruby Versions and Unicode Versions
  Year (y)   Ruby version (V_Ruby)        Unicode version (V_Unicode)
             published around Christmas   published in Summer
  2014       2.2                          7.0.0
  2015       2.3                          8.0.0
  2016       2.4                          9.0.0

  A note about Ruby versions and Unicode versions: The Ruby core team is very conservative (in my view too conservative) in introducing new Unicode versions as bug fixes. Updates to new Unicode versions therefore only happen with new Ruby versions.
    RbConfig::CONFIG["UNICODE_VERSION"] → '9.0.0'
  V_Unicode = y - 2007
  V_Ruby = 1.5 + V_Unicode · 0.1
  V_Unicode = V_Ruby · 10 - 15
  Don't extrapolate too far!
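
The string and encoding behaviour described in the String Basics and Encoding Basics slides can be checked with a short script. This is a minimal sketch rather than material from the slides: the constant SAMPLE and the helpers string_facts and concat_error are hypothetical names introduced for illustration, and UTF-16LE stands in for the generic "UTF-16" label used on the slide.

```ruby
# Sketch only: SAMPLE, string_facts and concat_error are hypothetical names.

# "Юに코δ" written with \u escapes, so the literal is always UTF-8.
SAMPLE = "\u042E\u306B\uCF54\u03B4"

# Returns [character count, byte count, encoding name] for a string.
def string_facts(str)
  [str.length, str.bytesize, str.encoding.name]
end

# Returns the class of the encoding error raised by a + b, or nil on success.
def concat_error(a, b)
  a + b
  nil
rescue EncodingError => e
  e.class
end

utf8  = SAMPLE
utf16 = SAMPLE.encode("UTF-16LE")   # explicit transcoding to another encoding

p string_facts(utf8)                # => [4, 10, "UTF-8"]
p concat_error(utf8, utf16)         # => Encoding::CompatibilityError

# ASCII-only content compares equal across ASCII-compatible encodings ...
p "abc".encode("ISO-8859-1") == "abc".encode("Shift_JIS")              # => true
# ... but non-ASCII content does not, even when the bytes are identical.
p "D\u00FCrst".encode("ISO-8859-1") == "D\u00FCrst".encode("ISO-8859-2")  # => false
```

The script uses String#encode instead of relabelling bytes, so every string involved is valid in its own encoding.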
New in Ruby 2.4: Non-ASCII Case Conversion

Case Conversion Functions in Ruby
  'Unicode Everywhere'.upcase → 'UNICODE EVERYWHERE'
  'Unicode Everywhere'.downcase → 'unicode everywhere'
  'Unicode Everywhere'.capitalize → 'Unicode everywhere'
  'Unicode Everywhere'.swapcase → 'uNICODE eVERYWHERE'

Case Conversion in Ruby 2.3
  'Résumé ĭñŧėřŋãţijňőńæłĩżàťïōņ'.upcase → 'RéSUMé ĭñŧėřŋãţijňőńæłĩżàťïōņ'

Case Conversions NOT in Ruby 2.3
  'Résumé ĭñŧėřŋãţijňőńæłĩżàťïōņ'.upcase → 'RÉSUMÉ ĬÑŦĖŘŊÃŢIJŇŐŃÆŁĨŻÀŤÏŌŅ'
  Case Conversion up to and including Ruby 2.3 is ASCII-only!
  But in Ruby 2.4!

Case Conversion Around the World
  Many more Latin letters than just A-Z
  Other scripts:
    Cyrillic, Greek
    Coptic, Armenian [, Georgian]
    Cherokee, Deseret, Osage
    Old Hungarian, Warang Citi, Glagolitic, Adlam
  More minority scripts may introduce case distinction from surrounding majority scripts

Case Distinction History
  Originally: Style difference, depending on medium
    Upper case for stone inscriptions (SPQR)
    Lower case for wax tablets,...?
  Functional distinction since ~15th century

Modern Case Usage (details vary by language)
  ALL UPPER CASE
    EMPHASIS
    Acronyms, abbreviations (DRY, SQL)
  First letter upper case
    Start of sentence
    Words in titles
    Proper nouns/adjectives (Kyoto, Japanese)
    Nouns
    Honorifics
  Lower case: everything else
  German: der Gefangene floh - the prisoner fled, but der gefangene Floh - the captive flea

Isn't ASCII-only Case Conversion Enough?
  Already in other languages (Python, Perl, Java, ...)
  Already in Ruby (Regexp: //i)
  Algorithms and data are available from the Unicode Consortium
  It's a good idea in general

But: Backwards Compatibility?
  Idea: Option for new functionality
    'Résumé'.upcase → 'RéSUMé'
    'Résumé'.upcase :unicode → 'RÉSUMÉ'
  Matz felt option was not necessary
    Lots of data is ASCII-only
    For non-ASCII data, you hopefully used a gem (which you can now eliminate)
  Check early: grep your code base for upcase and friends
  Test early (preview 2 of Ruby 2.4)

Backwards Compatibility Problems
  Explicit ASCII-only case conversion
    E.g. DNS servers
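
The difference between Unicode-wide and explicitly ASCII-only case conversion can be sketched as follows. This is an illustration under stated assumptions, not code from the talk: ascii_upcase is a hypothetical helper name, and the :ascii option is not mentioned on the slides above but is taken from the String#upcase documentation of Ruby 2.4 and later.

```ruby
# Sketch only: ascii_upcase is a hypothetical helper, not part of Ruby.

# Upcases only a-z and leaves every other character untouched.
# String#tr behaves this way in old and new Ruby versions alike.
def ascii_upcase(str)
  str.tr("a-z", "A-Z")
end

word = "R\u00E9sum\u00E9"   # "Résumé"

p ascii_upcase(word)        # => "RéSUMé"

# Plain upcase covers non-ASCII letters from Ruby 2.4 on;
# up to Ruby 2.3 it would leave the é characters unchanged.
p word.upcase

# Assumption: running on Ruby 2.4 or later, where upcase accepts :ascii
# to request the old ASCII-only behaviour explicitly.
p word.upcase(:ascii)
```

Code that depends on ASCII-only semantics, such as the DNS case mentioned above, may prefer a tr-based helper because its result does not change with the Ruby version.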