Internationalization in Ruby 2.4
http://www.sw.it.aoyama.ac.jp/2016/pub/IUC40-Ruby2.4/
40th Internationalization and Unicode Conference
Santa Clara, California, U.S.A., November 3, 2016 Martin J. DÜRST
Aoyama Gakuin University
© 2016 Martin J. Dürst, Aoyama Gakuin University
Abstract
Ruby is a purely object-oriented scripting language which is easy to learn for beginners and highly appreciated by experts for its productivity and depth. This presentation discusses the progress of adding internationalization functionality to Ruby for the version 2.4 release expected towards the end of 2016. One focus of the talk will be the currently ongoing implementation of locale-aware case conversion.
Since Ruby 1.9, Ruby has a pervasive if somewhat unique framework for character encoding, allowing different applications to choose different internationalization models. In practice, Ruby is most often and most conveniently used with UTF-8.
Support for internationalization facilities beyond character encoding has been available via various external libraries. As a result, applications may use conflicting and confusing ways to invoke internationalization functionality. To use case conversion as an example, up to version 2.3, Ruby comes with built-in methods for upcasing and downcasing strings, but these only work on ASCII. Our implementation extends this to the whole Unicode range for version 2.4, and efficiently reuses data already available for case-sensitive matching in regular expressions.
We study the interface of internationalization functions/methods in a wide range of programming languages and Ruby libraries. Based on this study, we propose to extend the current built-in Ruby methods, e.g. for case conversion, with additional parameters to allow language-dependent, purpose-based, and explicitly specified functionality, in a true Ruby way. Both the design as well as the implementation of the new functionality for Ruby 2.4 will be described.
This presentation is intended for users and potential users of the programming language Ruby, and people interested in internationalization of programming languages and libraries in general.
For Best Viewing
These slides have been created in HTML, for projection with Opera (≤12.17 Windows/Mac/Linux). Use F11 to switch to projection mode and back. Text in gray, like this, is a comment/note that does not appear on the slides. Please note that depending on the browser and OS you use, some rare characters or special character combinations may not display as intended, but e.g. as empty boxes, question marks, or apart rather than composed.
Introduction
Introductions
Audience:
- Programming experience?
- Programming with Ruby/Rails?
- Internationalization/Globalization experience?
- Unicode knowledge?
Speaker:
- From Switzerland, living in Japan
- Long-term Unicode/W3C/IUC involvement
- Ruby committer since 2007, mainly contributing
  - Encoding conversion (String#encode, Ruby 1.9)
  - Unicode normalization (String#unicode_normalize, Ruby 2.2)
  - Non-ASCII case conversion (String#upcase, ..., Ruby 2.4)
  - Unicode version updates (Unicode 9.0 for Ruby 2.4)
Overview
- Introduction
- Ruby Basics
- New in Ruby 2.4: Non-ASCII Case Conversion
- Implementation Details
- Lessons Learned and Future Work
Ruby Basics
Ruby
- Created by Yukihiro Matsumoto (Matz; since 1993)
- Easy for beginners, deep for experts
- Object-oriented throughout, but not obtrusive
- Extremely flexible
- Particularly strong for (internal) DSLs and metaprogramming
- Used for the Ruby on Rails Web framework
Ruby Implementations
- MRI (Matz's Ruby Implementation), aka C-Ruby: available on many platforms (download for Windows)
- JRuby: Ruby on the JVM
- RubyMotion: Ruby for iOS, Android, and macOS
- Opal: Ruby to JavaScript compiler
- Rubinius: Ruby (mostly) in Ruby
- A lot more ...
This tutorial is about MRI/C-Ruby, the reference implementation
Basic Ruby
3.times { puts 'Hello Ruby!' }
Hello Ruby!
Hello Ruby!
Hello Ruby!
- Everything is an object
- Methods can take blocks ({ ... } or do ... end)
- Unobtrusive syntax (no need for semicolons, ...)
Conventions Used in This Talk
Code is mostly green, monospace: puts 'Hello Ruby!'
Variable parts are orange: puts "some string"
Encoding is indicated with a subscript: 'Юに코δ'UTF-8, 'ユニコード'SJIS
Results are indicated with "⇒": 1 + 1 ⇒ 2
Frequent Example: Юに코δ
- Ю: Cyrillic uppercase YU
- に: Hiragana NI
- 코: Hangul KO
- δ: Greek delta
Up and Running
- Install Ruby
- Open a UTF-8 based console
  - Easy on Mac and Linux
  - On Windows: Cygwin Terminal, PuTTY, ..., or command prompt with chcp 65001
- Start irb (Interactive Ruby)
- Type in Ruby commands
String Basics
- Strings are sequences of characters (codepoints): "Юに코δ".length ⇒ 4
- We can get a byte count with: "Юに코δ".bytesize ⇒ 10
- They are instances of class String: "Юに코δ".class ⇒ String
- Characters are strings of length 1: "Юに코δ"[0] ⇒ "Ю"; "Юに코δ"[0].length ⇒ 1
Using the same class for both strings and characters avoids the distinction between characters and strings of length 1. This matches Ruby's "big classes" policy. It also leaves the door open for 'characters' other than single codepoints. Strings are not Arrays, but where it makes sense, operations work the same for both classes. This is called duck typing.
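The string basics above can be tried directly in irb or as a script. This is a small sketch; the constant name SAMPLE is only chosen for illustration.

```ruby
# The sample string used throughout: Cyrillic, Hiragana, Hangul, Greek.
SAMPLE = "Юに코δ"

puts SAMPLE.length      # characters (codepoints): 4
puts SAMPLE.bytesize    # bytes in UTF-8: 2 + 3 + 3 + 2 = 10
puts SAMPLE.class       # String
puts SAMPLE[0]          # first character, itself a String: Ю
puts SAMPLE[0].class    # String
puts SAMPLE[0].length   # 1
```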
Encoding Basics
- Each String has an encoding
- Strings with different encodings can't be mixed: 'Юに코δ'UTF-8 + 'Юに코δ'UTF-16 ⇒ Encoding::CompatibilityError
Trying to combine strings with different encodings, as here with concatenation (+), leads to an exception. There are some exceptions (sic!) to this rule that we will look at later. The reasoning for the error here is that transcoding should not happen without the programmer being aware of it.
'Dürst'ISO-8859-1 == 'Dürst'ISO-8859-2 ⇒ false
Trying to compare two character-by-character identical strings in different encodings will produce false, even if these strings are, as in the above example, also byte-for-byte identical. Again, the reason for the result is that encoding mismatches should be detected early. In addition, a simple byte-for-byte comparison could produce false positives.
Except if their content is ASCII-only (bytes):
'abc'ISO-8859-1 == 'abc'Shift_JIS ⇒ true
Just use Unicode, just use UTF-8
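The encoding rules above can be checked with a short script. This is a sketch: the helper concat_error and the constant names are made up for illustration, and UTF-16LE is used in place of the generic UTF-16 label on the slide.

```ruby
# Returns the exception class raised by concatenation, or nil if it works.
def concat_error(a, b)
  a + b
  nil
rescue Encoding::CompatibilityError => e
  e.class
end

UTF8_TEXT  = "Юに코δ"                       # source strings are UTF-8
UTF16_TEXT = UTF8_TEXT.encode("UTF-16LE")
puts concat_error(UTF8_TEXT, UTF16_TEXT)    # Encoding::CompatibilityError

LATIN1_NAME = "Dürst".encode("ISO-8859-1")
LATIN2_NAME = "Dürst".encode("ISO-8859-2")
puts LATIN1_NAME.bytes == LATIN2_NAME.bytes # true: same bytes
puts LATIN1_NAME == LATIN2_NAME             # false: different encodings

# ASCII-only content compares equal across ASCII-compatible encodings.
puts "abc".encode("ISO-8859-1") == "abc".encode("Shift_JIS")  # true
```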
Ruby Likes UTF-8
- Default for source encoding (since Ruby 2.0; no need for a # encoding: UTF-8 pragma)
- Encoding of strings with \u escapes is always UTF-8: "abc\u03B4" ⇒ 'abcδ'UTF-8
- Use the -U option if not in a UTF-8 context: ruby -U myscript.rb
- Processing of UTF-8 is optimized where possible
- Used out of the box by Ruby on Rails
- Transcoding available on input/output
- The only (internal) encoding in Ruby 3.0 or 4.0 (speculation!)
Ruby Versions
- Ruby ≤1.8: RIP (Strings as byte sequences)
- Ruby 1.9 and later (Strings as character sequences)
  - Ruby 2.0: UTF-8 default source encoding
  - Ruby 2.2: Unicode normalization added (2014)
  - Ruby 2.3: Newest published version
  - Ruby 2.4: Release planned for Christmas 2016, non-ASCII case conversion
Ruby Versions and Unicode Versions
Year (y)   Ruby version (VRuby)         Unicode version (VUnicode)
           published around Christmas   published in Summer
2014       2.2                          7.0.0
2015       2.3                          8.0.0
2016       2.4                          9.0.0
A note about Ruby versions and Unicode versions: The Ruby core team is very conservative (in my view too conservative) in introducing new Unicode versions as bug fixes. Update to new Unicode versions therefore only happens for new Ruby versions.
RbConfig::CONFIG["UNICODE_VERSION"] ⇒ '9.0.0'
VUnicode = y - 2007
VRuby = 1.5 + VUnicode · 0.1
VUnicode = VRuby · 10 - 15
Don't extrapolate too far!
New in Ruby 2.4: Non-ASCII Case Conversion
Case Conversion Functions in Ruby
'Unicode Everywhere'.upcase ⇒ 'UNICODE EVERYWHERE'
'Unicode Everywhere'.downcase ⇒ 'unicode everywhere'
'Unicode Everywhere'.capitalize ⇒ 'Unicode everywhere'
'Unicode Everywhere'.swapcase ⇒ 'uNICODE eVERYWHERE'
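The four case conversion methods can be tried as follows. This is a minimal sketch with ASCII-only input, so it behaves the same before and after Ruby 2.4; the constant name TITLE is only illustrative.

```ruby
TITLE = 'Unicode Everywhere'

puts TITLE.upcase      # UNICODE EVERYWHERE
puts TITLE.downcase    # unicode everywhere
puts TITLE.capitalize  # Unicode everywhere (first character up, rest down)
puts TITLE.swapcase    # uNICODE eVERYWHERE
```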
Case Conversion in Ruby 2.3
'Résumé ĭñŧėřŋãţijňőńæłĩżàťïōņ'.upcase ⇒ 'RéSUMé ĭñŧėřŋãţijňőńæłĩżàťïōņ'
Case Conversions NOT in Ruby 2.3
'Résumé ĭñŧėřŋãţijňőńæłĩżàťïōņ'.upcase ⇒ 'RÉSUMÉ ĬÑŦĖŘŊÃŢIJŇŐŃÆŁĨŻÀŤÏŌŅ'
Case Conversion up to and including Ruby 2.3 is ASCII-only!
But in Ruby 2.4!
Case Conversion Around the World
- Many more Latin letters than just A-Z
- Other scripts:
  - Cyrillic, Greek
  - Coptic, Armenian [, Georgian]
  - Cherokee, Deseret, Osage
  - Old Hungarian, Warang Citi, Glagolitic, Adlam
- More minority scripts may introduce case distinction from surrounding majority scripts
Case Distinction History
- Originally: Style difference, depending on medium
  - Upper case for stone inscriptions (SPQR)
  - Lower case for wax tablets, ...?
- Functional distinction since ~15th century
Modern Case Usage (details vary by language)
- ALL UPPER CASE
  - EMPHASIS
  - Acronyms, abbreviations (DRY, SQL)
- First letter upper case
  - Start of sentence
  - Words in titles
  - Proper nouns/adjectives (Kyoto, Japanese)
  - Nouns
  - Honorifics
- Lower case: everything else
German: der Gefangene floh - the prisoner fled, but der gefangene Floh - the captive flea
Isn't ASCII-only Case Conversion Enough?
Non-ASCII case conversion is:
- Already in other languages (Python, Perl, Java, ...)
- Already in Ruby (Regexp: //i)
- Algorithms and data are available from the Unicode Consortium
- It's a good idea in general
But: Backwards Compatibility?
- Idea: Option for new functionality
  'Résumé'.upcase ⇒ 'RéSUMé'
  'Résumé'.upcase :unicode ⇒ 'RÉSUMÉ'
- Matz felt the option was not necessary
  - Lots of data is ASCII-only
  - For non-ASCII data, you hopefully used a gem (which you can now eliminate)
- Check early: grep your code base for upcase and friends
- Test early (preview 2 of Ruby 2.4)
Backwards Compatibility Problems
- Explicit ASCII-only case conversion
  - E.g. DNS servers (but you used Encoding::ASCII_8BIT there anyway?!)
- Exact matches after conversion
  1. Allowed non-ASCII in userids (e.g. Соколов)
  2. downcased with Ruby 2.3 to help users (Соколов in DB)
  3. Used exact match
  4. In Ruby 2.4, соколов will not match Соколов anymore
- Localization: See Turkic, Lithuanian special cases
Backwards Compatibility: :ascii Option
Use if you find a case where you really don't want to convert non-ASCII characters
'Résumé ĭñŧėřŋãţijňőńæłĩżàťïōņ'.upcase ⇒ 'RÉSUMÉ ĬÑŦĖŘŊÃŢIJŇŐŃÆŁĨŻÀŤÏŌŅ'
'Résumé ĭñŧėřŋãţijňőńæłĩżàťïōņ'.upcase :ascii ⇒ 'RéSUMé ĭñŧėřŋãţijňőńæłĩżàťïōņ'
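The userid problem and the :ascii escape hatch can be sketched as follows, assuming Ruby 2.4 or later. The helper same_user? and the stored value are hypothetical; they only model the scenario from the backwards compatibility slide.

```ruby
puts 'Résumé'.upcase          # RÉSUMÉ (Ruby 2.4 and later)
puts 'Résumé'.upcase(:ascii)  # RéSUMé (behaviour of Ruby 2.3 and earlier)

# Hypothetical: the value an application running on Ruby 2.3 stored after
# calling downcase on a Cyrillic userid. It was left unchanged back then.
STORED_ID = 'Соколов'.downcase(:ascii)

# Hypothetical lookup helper: downcase the input, then match exactly.
def same_user?(input, stored, *options)
  input.downcase(*options) == stored
end

puts same_user?('Соколов', STORED_ID)          # false: now downcased to соколов
puts same_user?('Соколов', STORED_ID, :ascii)  # true: old ASCII-only behaviour
```

Whether to keep :ascii or to re-normalize the stored data is an application decision; the option only buys time.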
Implementation Choices
Use a library?
- Pure Ruby: UnicodeUtils, ActiveSupport::Multibyte, TwitterCLDR
- C extensions: ICU as a gem: icu, ffi-icu
- Different interface if used directly
- Not efficient if in pure Ruby
- Data duplication
Integrate ICU?
- ICU and Ruby both have their own low-level idea of strings
Write new code?
- That's what we ended up doing
Where to Get the Data From?
Data and other specifications available from the Unicode Consortium:
UnicodeData.txt
CaseFolding.txt
SpecialCasing.txt
Special Cases: Not 1-to-1
- Number of characters not preserved
  - 'ß'.upcase ⇒ 'SS' (German sz/sharp s)
  - 'ﬃ'.upcase ⇒ "FFI" (ffi ligature)
- Not necessarily reversible
  - 'ß'.upcase.downcase ⇒ 'ss' (not 'ß')
  - 'σ'.upcase ⇒ 'Σ' (Greek sigma)
  - 'ς'.upcase ⇒ 'Σ' (Greek final sigma)
  - 'ς'.upcase.downcase ⇒ 'σ' (not 'ς')
Implemented!
- 'Σ'.downcase should be context-dependent
Not yet implemented!
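The implemented special cases above can be observed with a few lines, assuming Ruby 2.4 or later. The helper round_trip is only introduced for this illustration.

```ruby
# Upcase followed by downcase; not always the identity.
def round_trip(string)
  string.upcase.downcase
end

puts 'ß'.upcase          # SS   (one character becomes two)
puts 'ﬃ'.upcase          # FFI  (one character becomes three)
puts 'ß'.upcase.length   # 2
puts round_trip('ß')     # ss   (not ß)
puts 'σ'.upcase          # Σ
puts 'ς'.upcase          # Σ    (final sigma has the same uppercase)
puts round_trip('ς')     # σ    (not ς)
```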
Special Case: Simple Case Mapping
- Defined by Unicode
- Excludes mappings that change string length
- Feels outdated
Not implemented!
Special Case: Turkic
- Usual:
  - 'i'.upcase ⇒ 'I'
  - 'I'.downcase ⇒ 'i'
- Turkish, Azerbaijani, and related languages when written in Latin script:
  - 'i'.upcase ⇒ 'İ' (uppercase I with dot)
  - 'İ'.downcase ⇒ 'i'
  - 'ı'.upcase ⇒ 'I' (i without dot)
  - 'I'.downcase ⇒ 'ı'
Implemented!
'Türkiye'.upcase :turkic ⇒ 'TÜRKİYE'
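A short check of the :turkic option, assuming Ruby 2.4 or later. The constant COUNTRY is only illustrative.

```ruby
COUNTRY = 'Türkiye'

puts COUNTRY.upcase            # TÜRKIYE (default: dotless uppercase I)
puts COUNTRY.upcase(:turkic)   # TÜRKİYE (Turkic: I with dot above)

puts 'i'.upcase(:turkic)       # İ
puts 'İ'.downcase(:turkic)     # i
puts 'I'.downcase(:turkic)     # ı
puts 'ı'.upcase                # I
```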
Special Case: Lithuanian
Usual: 'Í'.downcase ⇒ 'í' (accent replaces dot)
Lithuanian: 'Í'.downcase :lithuanian ⇒ 'i̇́' (U+0069 U+0307 U+0301: accent above visible dot; may not show because of technology limits)
Not yet implemented!
Special Case: Case Folding
- Case mapping: Change from one form to another
  - upcase/downcase/capitalize/swapcase
- Case folding: Eliminate case-related differences
  - For comparison, sorting
  - In general same as downcase
  - But: ß → ss, ﬃ → ffi, ς → σ
  - Upcase for Cherokee
Implemented! with :fold option on downcase
'ß'.downcase :fold ⇒ 'ss'
'ﬃ'.downcase :fold ⇒ 'ffi'
'ς'.downcase :fold ⇒ 'σ'
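Case folding is mostly useful for comparison. The following sketch, assuming Ruby 2.4 or later, contrasts plain downcase with downcase :fold; the helper same_ignoring_case? is hypothetical.

```ruby
# Compare two strings after eliminating case-related differences.
def same_ignoring_case?(a, b)
  a.downcase(:fold) == b.downcase(:fold)
end

puts 'ß'.downcase(:fold)   # ss
puts 'ﬃ'.downcase(:fold)   # ffi
puts 'ς'.downcase(:fold)   # σ

# Plain downcase keeps ß, so it is not enough for matching.
puts 'Straße'.downcase == 'STRASSE'.downcase      # false
puts same_ignoring_case?('Straße', 'STRASSE')     # true
```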
Special Case: Titlecase
- Some characters have three case forms:
  - Upper case: Ǆ (U+01C4; Croatian/Serbian)
  - Lower case: ǆ (U+01C6)
  - Title case: ǅ (U+01C5)
- Important for capitalize
  - 'ǆungla'.capitalize ⇒ 'ǅungla' (not 'Ǆungla')
Implemented!
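The three case forms can be inspected with escapes, which avoids any doubt about whether a digraph is one codepoint or two. A sketch assuming Ruby 2.4 or later; the constant names are illustrative.

```ruby
UPPER_DZ = "\u01C4"  # Ǆ  LATIN CAPITAL LETTER DZ WITH CARON
TITLE_DZ = "\u01C5"  # ǅ  titlecase form
LOWER_DZ = "\u01C6"  # ǆ  lowercase form

WORD = "#{LOWER_DZ}ungla"

puts WORD.capitalize                            # ǅungla
puts WORD.capitalize == "#{TITLE_DZ}ungla"      # true: titlecase, not uppercase
puts LOWER_DZ.upcase == UPPER_DZ                # true
puts TITLE_DZ.downcase == LOWER_DZ              # true
```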
More Special Cases
- Contextual processing, e.g. for i with combining dots (part of Unicode algorithm definition)
- German uppercase ß (not part of Unicode algorithm definition)
- others, ...
Not implemented (yet?)
Implementation
12 Methods to Implement
String (functional)   String (destructive)   Symbol
upcase                upcase!                upcase
downcase              downcase!              downcase
capitalize            capitalize!            capitalize
swapcase              swapcase!              swapcase
Not dealt with: String#casecmp. Why: Includes sorting
Internally, a Single Function
Flags to indicate operation needed (in file include/ruby/oniguruma.h):
#define ONIGENC_CASE_UPCASE    (1<<13) /* uppercase mapping */
#define ONIGENC_CASE_DOWNCASE  (1<<14) /* lowercase mapping */
#define ONIGENC_CASE_TITLECASE (1<<15) /* titlecase mapping */
Usage to indicate operation type:
- upcase: ONIGENC_CASE_UPCASE (upcasing needed)
- downcase: ONIGENC_CASE_DOWNCASE (downcasing needed)
- capitalize: ONIGENC_CASE_TITLECASE | ONIGENC_CASE_UPCASE (changed to ONIGENC_CASE_DOWNCASE after first character)
- swapcase: ONIGENC_CASE_UPCASE | ONIGENC_CASE_DOWNCASE (both upcasing and downcasing needed)
Option Handling
Flags also used for options:
- :fold (for case folding; only on downcase)
- :turkic
- :lithuanian (not yet implemented)
- :ascii
Corresponding flags:
#define ONIGENC_CASE_FOLD               (1<<19) /* has/needs case folding */
#define ONIGENC_CASE_FOLD_TURKISH_AZERI (1<<20) /* Turkic */
#define ONIGENC_CASE_FOLD_LITHUANIAN    (1<<21) /* Lithuanian */
#define ONIGENC_CASE_ASCII_ONLY         (1<<22) /* limited to ASCII */
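How one flag word can carry both the operation and the options may be easier to see in a small model. The following Ruby sketch is not the C implementation: the bit values are copied from the #define lines above, while the Ruby constant and method names (CASE_UPCASE, case_flags, after_first_character, ...) are invented for illustration.

```ruby
# Bit values as in include/ruby/oniguruma.h; Ruby names are illustrative.
CASE_UPCASE     = 1 << 13
CASE_DOWNCASE   = 1 << 14
CASE_TITLECASE  = 1 << 15
CASE_FOLD       = 1 << 19
CASE_TURKIC     = 1 << 20
CASE_LITHUANIAN = 1 << 21
CASE_ASCII_ONLY = 1 << 22

OPERATION_FLAGS = {
  upcase:     CASE_UPCASE,
  downcase:   CASE_DOWNCASE,
  capitalize: CASE_TITLECASE | CASE_UPCASE,
  swapcase:   CASE_UPCASE | CASE_DOWNCASE
}.freeze

OPTION_FLAGS = {
  fold: CASE_FOLD, turkic: CASE_TURKIC,
  lithuanian: CASE_LITHUANIAN, ascii: CASE_ASCII_ONLY
}.freeze

# Combine the operation's flags with the flags of all given options.
def case_flags(operation, *options)
  options.reduce(OPERATION_FLAGS.fetch(operation)) do |flags, option|
    if option == :fold && operation != :downcase
      raise ArgumentError, ':fold is only allowed on downcase'
    end
    flags | OPTION_FLAGS.fetch(option)
  end
end

# capitalize: after the first character, continue with downcasing.
def after_first_character(flags)
  return flags if (flags & CASE_TITLECASE).zero?
  (flags & ~(CASE_TITLECASE | CASE_UPCASE)) | CASE_DOWNCASE
end

puts format('%#x', case_flags(:downcase, :fold))                  # 0x84000
puts format('%#x', after_first_character(case_flags(:capitalize))) # 0x4000
```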
String Expansion
- Handles string expansion (e.g. 'ﬃ'.upcase ⇒ "FFI")
- Common to all casing operations
- Linked list of buffers (b1→b2→b3→...)
- Repeatedly calls encoding-specific primitive to fill as much as possible of next buffer
- For buffer bx, allocates bytes_to_still_be_converted · x + 20 bytes
  Example: We need a 3rd buffer, and need to convert 5 more bytes, so we allocate length(b3) = 5 · 3 + 20 = 35 bytes
- Until no new buffer is needed
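The buffer size rule above is simple enough to write down directly. This is only a Ruby restatement of the formula on the slide, not the C code; buffer_size and BUFFER_SLACK are illustrative names.

```ruby
BUFFER_SLACK = 20  # constant extra bytes per buffer, as on the slide

# Size of buffer number x when the given number of source bytes is left.
def buffer_size(bytes_still_to_convert, buffer_number)
  bytes_still_to_convert * buffer_number + BUFFER_SLACK
end

puts buffer_size(5, 3)    # 35, the example from the slide
puts buffer_size(100, 1)  # 120, a first buffer for a 100-byte string
```

Later buffers get a larger multiplier, so a string that keeps expanding needs fewer and fewer additional buffers.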
Handling Encodings: The Ruby Way
- Each encoding is implemented by a series of primitives
- Work like methods (polymorphism), but implemented in C
- Total of 13 primitives per encoding
- Example primitives:
  - Length of character at current byte position
  - Advance byte position by one character
  - Codepoint of character at current byte position
  - Insert codepoint x at current byte position
[1] 松本行弘, 縄手雅彦. スクリプト言語 Ruby の拡張可能な多言語テキスト処理の実装. 情報処理学会論文誌. 2005 Nov 15;46(11):2633-42. / Yukihiro Matsumoto and Masahiko Nawate: Multilingual Text Manipulation Method for Ruby Language. Journal of Information Processing (JIP); 2005 Nov 15; Vol. 46, No. 11, pp. 2633-42. (in Japanese)
Implementation Choice: UTF-8 only or Primitives
- Matz would have been fine with
  - Full Unicode case conversion for UTF-8
  - ASCII-only for all other encodings
- Actually used primitives to obtain
  - A more complete implementation
  - Experience about pros/cons of using primitives
Implementation Choice: New or Reused Primitive
- 3 primitives are used for case folding with regular expressions (//i)
  - mbc_case_fold
  - apply_all_case_fold
  - get_case_fold_codes_by_str
- Found no good way to reuse any of these
- New primitive
- But found a lot of reusable data
The case_map Primitive
- Input/output parameters:
  - OnigCaseFoldType flags
  - Start of source
- Input parameters:
  - End of source
  - Start of destination
  - End of destination
  - Encoding (to call other primitives)
- Output parameters:
  - Byte count of conversion result (negative for errors)
- Most complex 'primitive', although not by much
Implementations of case_map Primitive
Examples:
- "Résumé"UTF-8.upcase calls onigenc_unicode_case_map in enc/unicode.c (most complex case), as defined with OnigEncodingDefine in enc/utf_8.c
- "Résumé"UTF-16LE.upcase calls onigenc_unicode_case_map in enc/unicode.c, as defined with OnigEncodingDefine in enc/utf_16le.c
- "Résumé"ISO-8859-1.upcase calls case_map in enc/iso_8859_1.c (simple case, good starting point for a primitive for a new encoding), as defined with OnigEncodingDefine in the same file
The Primitive of Primitives: onigenc_unicode_case_map
- Works for UTF-8, UTF-16[BE|LE], UTF-32[BE|LE]
- 140 lines long 'monster function'
- Same structure as simpler primitives:
  - Big while loop, one source character at a time
  - Carefully updating ONIGENC_CASE_MODIFIED flag
- Deal with special cases 'by hand'
- Reuse existing data where possible
- ~30 if/else if/else
- Lots of |/& with flag bits
- 2 gotos
- gperf-created hash lookups: onigenc_unicode_fold_lookup, onigenc_unicode_unfold1_lookup
More case_map Primitives
Students (sophomores/juniors/seniors) at Aoyama Gakuin University
- ISO-8859-2: Yushiro Ishii (石井 優史朗)
- ISO-8859-3: Kanon Shindo (新藤 海音)
- ISO-8859-4: Kotaro Yoshida (吉田 孝太郎)
- ISO-8859-5: Masaru Onodera (小野寺 俊)
- ISO-8859-7: Kosuke Kurihara (栗原 光祐)
- ISO-8859-9: Kazuki Iijima (飯島 一貴)
- ISO-8859-10: Toya Hosokawa (細川 登陽)
- ISO-8859-13: Takuya Miyamoto (宮本 拓弥)
- ISO-8859-14: Yutaro Tada (多田 悠太朗)
- ISO-8859-15: Maho Harada (原田 真帆)
- ISO-8859-16: Satoshi Kayama (香山 智志)
- Windows-1250, -1257: Sho Koike (小池 翔)
- Windows-1251: Shunsuke Sato (佐藤 駿介)
- Windows-1252: Serina Tai (田井 芹奈)
- Windows-1253: Takumi Koyama (小山 拓美)
So What about Shift_JIS and Friends?
- For East Asian encodings (Shift_JIS, EUC-JP, GB2312, EUC-KR, Big-5, EUC-TW, ...)
  - data could be shared between //i and case mapping
  - but case folding for //i only works for ASCII
- None of the main Japanese committers thought this was needed anymore
Talk to me if you need it
Reusing Case Folding Data
Onig[uruma|gmo] has data for case folding Folding is very close to downcase There is also unfolding (why?), which is close to upcase That's almost all we need
Folding Data: Before and After
in enc/unicode/9.0.0/casefold.h
/* before */
{0x0041, {1, {0x0061}}},          /* A → a */
{0x00df, {2, {0x0073, 0x0073}}},  /* ß → ss */
{0x01c4, {1, {0x01c6}}},          /* DŽ → dž */
{0x01c5, {1, {0x01c6}}},          /* Dž → dž */
{0xab73, {1, {0x13a3}}},          /* U+AB73 → U+13A3 (Cherokee) */
/* after */
{0x0041, {1|F|D, {0x0061}}},                   /* A → a */
{0x00df, {2|F|ST|SU|I(1), {0x0073, 0x0073}}},  /* ß → ss */
{0x01c4, {1|F|D|ST|I(8), {0x01c6}}},           /* DŽ → dž */
{0x01c5, {1|F|D|IT|SU|I(9), {0x01c6}}},        /* Dž → dž */
{0xab73, {1|F|U, {0x13a3}}},                   /* U+AB73 → U+13A3 (Cherokee) */
Folding Data: Flags
(squeezed into an int where only 2 bits were used)
see enc/unicode.c
/* data is available here */
/* (flags are the same as for options) */
#define U ONIGENC_CASE_UPCASE
#define D ONIGENC_CASE_DOWNCASE
#define F ONIGENC_CASE_FOLD
/* data is in special additional array */
#define ST ONIGENC_CASE_TITLECASE
#define SU ONIGENC_CASE_UP_SPECIAL
#define SL ONIGENC_CASE_DOWN_SPECIAL
#define IT ONIGENC_CASE_IS_TITLECASE
/* index into special array (size: around 420 words only) */
#define I(n) OnigSpecialIndexEncode(n)
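The idea of squeezing a count, an index, and flags into one integer can be modelled in a few lines. In this Ruby sketch the flag values for U, D, ST, and F follow the #define lines shown earlier; the bit layout of the count (low 3 bits) and of the special index (10 bits starting at bit 3) is an assumption made for illustration, as are all Ruby names. The actual layout is defined in the C headers.

```ruby
FLAG_U  = 1 << 13   # ONIGENC_CASE_UPCASE
FLAG_D  = 1 << 14   # ONIGENC_CASE_DOWNCASE
FLAG_ST = 1 << 15   # ONIGENC_CASE_TITLECASE
FLAG_F  = 1 << 19   # ONIGENC_CASE_FOLD

COUNT_MASK  = 0b111          # assumed: codepoint count in the low bits
INDEX_SHIFT = 3              # assumed: special index sits above the count
INDEX_MASK  = (1 << 10) - 1  # assumed: 10 bits, ending just below bit 13
FLAGS_MASK  = ~((1 << 13) - 1)

# Model of I(n): place the special-array index in its bit field.
def special_index(n)
  n << INDEX_SHIFT
end

# Split a packed word into its three parts.
def unpack_entry(word)
  {
    count: word & COUNT_MASK,
    index: (word >> INDEX_SHIFT) & INDEX_MASK,
    flags: word & FLAGS_MASK
  }
end

# Model of the entry {1|F|D|ST|I(8), {0x01c6}} from the "after" table.
PACKED_DZ = 1 | FLAG_F | FLAG_D | FLAG_ST | special_index(8)
p unpack_entry(PACKED_DZ)
```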
Small Implementation Detail
(or my attempt at using the Takahashi method)
upcase: seems useful
downcase: seems useful
capitalize: seems useful
swapcase: Who would use swapcase?
Nobody?
Well, I did, when testing swapcase!
Why swapcase?
- Python has it ?! (Matz)
- To revert accidental Caps Lock output ?! (on Unicode list)
Implementing swapcase must be easy: UPPER ⇒ upper, lower ⇒ LOWER
But what about titlecase?
Dz, Dž, Lj, Nj ᾼ, ᾈ, ᾉ, ᾊ, ᾋ, ᾌ, ᾍ, ᾎ, ᾏ ῌ, ᾘ, ᾙ, ᾚ, ᾛ, ᾜ, ᾝ, ᾞ, ᾟ ῼ, ᾨ, ᾩ, ᾪ, ᾫ, ᾬ, ᾭ, ᾮ, ᾯ
Choice 1: "DžunGLA".swapcase, leave as is ⇒ "DžUNgla"
- preferred by Unicode Consortium (never ever need any new standardization)
- preserves reversibility (X.swapcase.swapcase == X)
Choice 2: "DžunGLA".swapcase, upcase ⇒ "DŽUNgla"
Choice 3: "DžunGLA".swapcase, downcase ⇒ "džUNgla"
Choice 4: "DžunGLA".swapcase, swap ⇒ "dŽUNgla"
- proposed by Nobuyoshi Nakada
Implemented: swap ⇒ "dŽUNgla"
- useless?, but 'correct'
- additional effort for implementation
- additional effort for testing
Commit date: April 1st, 2016 (April Fools' Day)
- Japan Time 20:58:33
- same date in most timezones
- please draw your own conclusions
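To see which choice a given Ruby actually made, it helps to print codepoints rather than glyphs, since the digraphs are hard to tell apart on screen. This is a small sketch; codepoints_hex is a hypothetical helper, and no expected output is given for the titlecase character. According to the slides, choice 4 should show up as two separate codepoints (d followed by Ž).

```ruby
# Format the codepoints of a string as U+XXXX for inspection.
def codepoints_hex(string)
  string.codepoints.map { |cp| format('U+%04X', cp) }
end

TITLECASE_DZ = "\u01C5"                 # single titlecase codepoint
SWAPPED_DZ   = TITLECASE_DZ.swapcase

puts codepoints_hex(TITLECASE_DZ).join(' ')
puts codepoints_hex(SWAPPED_DZ).join(' ')
# Reversibility check from choice 1; prints true or false.
puts SWAPPED_DZ.swapcase == TITLECASE_DZ
```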
Testing
Test-Driven Development
- Write small example test
- Verify that it doesn't work
- Implement
- Enjoy that it works
- Rinse and repeat
Files: test/ruby/enc/test_case_options.rb, test/ruby/enc/test_case_mapping.rb
Data-Driven Testing
- Test every character (except for ranges in UnicodeData.txt) of every encoding for all option combinations for (almost) all methods
- Data provided by Unicode
- Identical to data used for implementation ?!
Files: test/ruby/enc/test_case_comprehensive.rb
413 tests, 2'212'391 assertions, 0 failures, 0 errors, 0 skips
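The data-driven idea can be sketched in miniature. The following is not the code in test_case_comprehensive.rb: it checks only simple uppercase and lowercase mappings, on a four-line excerpt in the UnicodeData.txt format embedded in the script, and the names each_mapping and mapping_failures are made up. It assumes Ruby 2.4 or later.

```ruby
# Four lines in UnicodeData.txt format (fields 12 and 13 hold the
# simple uppercase and lowercase mappings).
SAMPLE_UNICODE_DATA = <<~DATA
  0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
  0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
  00C9;LATIN CAPITAL LETTER E WITH ACUTE;Lu;0;L;0045 0301;;;;N;LATIN CAPITAL LETTER E ACUTE;;;00E9;
  00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;LATIN SMALL LETTER E ACUTE;;00C9;;00C9
DATA

# Yields [character, method, expected result] for each mapping in the data.
def each_mapping(data)
  data.each_line(chomp: true) do |line|
    fields = line.split(';', -1)          # -1 keeps trailing empty fields
    char = [fields[0].hex].pack('U')
    { upcase: fields[12], downcase: fields[13] }.each do |method, hex|
      yield char, method, [hex.hex].pack('U') unless hex.empty?
    end
  end
end

# Returns all [character, method] pairs whose result differs from the data.
def mapping_failures(data)
  failures = []
  each_mapping(data) do |char, method, expected|
    failures << [char, method] unless char.public_send(method) == expected
  end
  failures
end

puts mapping_failures(SAMPLE_UNICODE_DATA).empty?   # true
```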
Continuous Integration
- Commit early, commit often
  - Advice (and scolding) from hardcore Ruby hackers
  - Keep code reasonably clean, and motivation high
  - More commits → higher chance to attend Ruby Kaigi for free
- But: Don't want to affect Ruby build or execution
- Solution:
  - Make use of new functionality dependent on special option
    Used :lithuanian (because last to be actually implemented)
  - Test with option protection
  - Remove option protection
Future:
Ideas, Problems, Questions
In No Particular Order
- Character properties
- Locale-aware formatting
- What to do with encodings?
Character Properties
- Unicode provides a wide range of character properties
- Most available in Regexp
  - Does this string contain a Hiragana character? 'Юに코δ' =~ /\p{Hiragana}/
  - What script is 'Ю'? Sorry, impossible!
- Currently looking at this with a student, hopefully for Ruby ~2.5
  - Use less memory
  - Faster
  - More properties
  - More ways to use
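Until properties can be queried directly, a script can only be found by trying candidates one by one through Regexp. The following workaround is a sketch assuming Ruby 2.4 or later; script_of and the short candidate list are made up for illustration and are nowhere near complete.

```ruby
# Position of the first Hiragana character, or nil if there is none.
puts "Юに코δ" =~ /\p{Hiragana}/   # 1

# Candidate scripts to try; only a small illustrative selection.
SCRIPT_NAMES = %w[Latin Cyrillic Greek Hiragana Katakana Han Hangul].freeze

# Returns the first candidate script whose property matches the character.
def script_of(char)
  SCRIPT_NAMES.find { |name| Regexp.new("\\p{#{name}}").match?(char) }
end

puts script_of('Ю')   # Cyrillic
puts script_of('코')  # Hangul
```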
Locale-Aware Formatting
What I want: loc = Locale.new 'de-CH' (German as used in Switzerland)
1.2345678E5.to_s ⇒ "123456.78"
1.2345678E5.to_s(loc) ⇒ "123'456,78"
Well, Just use a Library
Internationalization support in libraries:
- Pure Ruby: UnicodeUtils, ActiveSupport::Multibyte, TwitterCLDR
- C extensions: ICU as a gem: icu, ffi-icu
Example: Unicode Normalization
- UnicodeUtils: UnicodeUtils.nfkc string
- ActiveSupport::Multibyte: ActiveSupport::Multibyte::Chars.new(string).normalize :kc
- TwitterCLDR: TwitterCldr::Normalization::NFKC.normalize string
- Native (since Ruby 2.2): string.unicode_normalize :nfkc
Libraries avoid monkey patching
not Ruby-like (using a library does not feel like Ruby)
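For comparison, the native form reads like any other String method. A minimal sketch assuming Ruby 2.2 or later; only the built-in method is exercised, since the library calls listed above depend on the respective gems.

```ruby
LIGATURE   = "\uFB03"     # ﬃ, a compatibility character
DECOMPOSED = "e\u0301"    # e followed by combining acute accent

puts LIGATURE.unicode_normalize(:nfkc)                # ffi
puts DECOMPOSED.unicode_normalize(:nfc) == "\u00E9"   # true: composed é
puts DECOMPOSED.unicode_normalized?(:nfc)             # false
```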
Locales and Case Mappings
Possible solution:
loc = Locale.new 'tr'
'Türkiye'.upcase loc ⇒ 'TÜRKİYE'
Encodings: Less is More?
- We discovered flaky support for current encodings (//i case folding: all encodings not at end of test/ruby/enc/test_regex_casefold.rb)
- The world is moving to Unicode
- Matz wants to move to UTF-8, slowly but steadily
- Do we let other encodings die slowly?
- Or get rid of them in a single step (Ruby 3.0?)
Acknowledgments
- Kimihito Matsui (松井 仁人) and many other students for help with research and implementations
- Yui Naruse (成瀬 ゆい), Nobuyoshi Nakada (中田 伸悦) and many other Ruby committers for help and support
- Matz (まつもと ゆきひろ) for Ruby, a programmer's best friend
- Amaya, Opera 12.17, and coderay for slide production and display
- The IME Pad for easy character input
Conclusions
- Full Unicode case mapping (mostly) implemented
- Options for backward compatibility, special conventions, case folding
- Space-efficient implementation by reusing Regexp data
- Available in Ruby trunk now, please test!
- More internationalization work needed
- Tell me what you want most
References
More information about case conversion implementation internals: http://www.sw.it.aoyama.ac.jp/2016/pub/RubyKaigi/ (video at http://rubykaigi.org/2016/presentations/duerst.html)
Q & A
Send questions and comments to Martin Dürst (mailto:[email protected]) or open a bug report or feature request for Ruby
The latest version of this presentation is available at: http://www.sw.it.aoyama.ac.jp/2016/pub/IUC40-Ruby2.4/