Internationalization in Ruby 2.4

http://www.sw.it.aoyama.ac.jp/2016/pub/IUC40-Ruby2.4/

40th Internationalization and Unicode Conference

Santa Clara, California, U.S.A., November 3, 2016 Martin J. DÜRST

[email protected]

Aoyama Gakuin University

© 2016 Martin J. Dürst, Aoyama Gakuin University

Abstract

Ruby is a purely object-oriented scripting language which is easy to learn for beginners and highly appreciated by experts for its productivity and depth. This presentation discusses the progress of adding internationalization functionality to Ruby for the version 2.4 release expected towards the end of 2016. One focus of the talk will be the currently ongoing implementation of locale-aware case conversion.

Since Ruby 1.9, Ruby has a pervasive if somewhat unique framework for character encoding, allowing different applications to choose different internationalization models. In practice, Ruby is most often and most conveniently used with UTF-8.

Support for internationalization facilities beyond character encoding has been available via various external libraries. As a result, applications may use conflicting and confusing ways to invoke internationalization functionality. To use case conversion as an example, up to version 2.3, Ruby comes with built-in methods for upcasing and downcasing strings, but these only work on ASCII. Our implementation extends this to the whole Unicode range for version 2.4, and efficiently reuses data already available for case-sensitive matching in regular expressions.

We study the interface of internationalization functions/methods in a wide range of programming languages and Ruby libraries. Based on this study, we propose to extend the current built-in Ruby methods, e.g. for case conversion, with additional parameters to allow language-dependent, purpose-based, and explicitly specified functionality, in a true Ruby way. Both the design as well as the implementation of the new functionality for Ruby 2.4 will be described.

This presentation is intended for users and potential users of the programming language Ruby, and people interested in internationalization of programming languages and libraries in general.

For Best Viewing

These slides have been created in HTML, for projection with Opera (≤12.17 Windows/Mac/Linux). Use F11 to switch to projection mode and back. Texts in gray, like this one, are comments/notes which do not appear on the slides. Please note that depending on the browser and OS you use, some rare characters or special character combinations may not display as intended, but e.g. as empty boxes, question marks, or apart rather than composed.

Introduction

Introductions

Audience:
  Programming experience?
  Programming with Ruby/Rails?
  Internationalization/Globalization experience?
  Unicode knowledge?
Speaker:
  From Switzerland, living in Japan
  Long-term Unicode/W3C/IUC involvement
  Ruby committer since 2007, mainly contributing
    Encoding conversion (String#encode, Ruby 1.9)
    Unicode normalization (String#unicode_normalize, Ruby 2.2)
    Non-ASCII case conversion (String#upcase, ..., Ruby 2.4)
    Unicode version updates (Unicode 9.0 for Ruby 2.4)

Overview

Introduction
Ruby Basics
New in Ruby 2.4: Non-ASCII Case Conversion
Implementation Details
Lessons Learned and Future Work

Ruby Basics

Ruby

Created by Yukihiro Matsumoto (Matz; since 1993)
Easy for beginners, deep for experts
Object-oriented throughout, but not obtrusive
Extremely flexible
Particularly strong for (internal) DSLs and metaprogramming
Used for the Ruby on Rails Web framework

Ruby Implementations

MRI (Matz's Ruby Implementation), aka C-Ruby
  available on many platforms (download for Windows)
JRuby: Ruby on the JVM
RubyMotion: Ruby for iOS, Android, and MacOS
Opal: Ruby to JavaScript compiler
Rubinius: Ruby (mostly) in Ruby
A lot more ...

This tutorial is about MRI/C-Ruby, the reference implementation

Basic Ruby

3.times { puts 'Hello Ruby!' }

Hello Ruby!
Hello Ruby!
Hello Ruby!

Everything is an object
Methods can take blocks ({ ... } or do ... end)
Unobtrusive syntax (no need for semicolons, ...)

Conventions Used in This Talk

Code is mostly green, monospace: puts 'Hello Ruby!'

Variable parts are orange: puts "some string"

Encoding is indicated with a subscript: 'Юに코δ'UTF-8, 'ユニコード'SJIS

Results are indicated with "⇒": 1 + 1 ⇒ 2

Frequent Example: Юに코δ

Ю: Cyrillic uppercase YU
に: Hiragana NI
코: Hangul KO
δ: Greek delta

Up and Running

Install Ruby
Open a UTF-8 based console
  Easy on Mac and Linux
  On Windows: Cygwin Terminal, PuTTY, ..., or command prompt with chcp 65001
Start irb (Interactive Ruby)
Type in Ruby commands

String Basics

Strings are sequences of characters (codepoints):
  "Юに코δ".length ⇒ 4
We can get a byte count with:
  "Юに코δ".bytesize ⇒ 10
They are instances of class String:
  "Юに코δ".class ⇒ String
Characters are strings of length 1:
  "Юに코δ"[0] ⇒ "Ю"; "Юに코δ"[0].length ⇒ 1

Using the same class for both strings and characters avoids the distinction between characters and strings of length 1. This matches Ruby's "big classes" policy. It also leaves the door open for 'characters' other than single codepoints. Strings are not Arrays, but where it makes sense, operations work the same for both classes. This is called duck typing.
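The string basics above can be tried in irb or as a script. This is a minimal sketch; the constant name EXAMPLE is only chosen for illustration.

```ruby
# The talk's frequent example: Cyrillic, Hiragana, Hangul, and Greek.
EXAMPLE = "Юに코δ"

puts EXAMPLE.length    # number of characters (codepoints)
puts EXAMPLE.bytesize  # number of bytes in UTF-8
puts EXAMPLE.class     # String
puts EXAMPLE[0].class  # a single character is again a String

# Duck typing: slicing works the same way on a String and on an Array.
p EXAMPLE[1, 2]        # substring of two characters
p EXAMPLE.chars[1, 2]  # sub-array of two one-character strings
```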

Encoding Basics

Each String has an encoding
Strings with different encodings can't be mixed
  'Юに코δ'UTF-8 + 'Юに코δ'UTF-16 ⇒ Encoding::CompatibilityError

Trying to combine strings with different encodings, as here with concatenation (+), leads to an exception. There are some exceptions (sic!) to this rule that we will look at later. The reasoning for the error here is that transcoding should not happen without the programmer being aware of it.

'Dürst'ISO-8859-1 == 'Dürst'ISO-8859-2 ⇒ false

Trying to compare two character-by-character identical strings in different encodings will produce false, even if these strings are, as in the above example, also byte-for-byte identical. Again, the reason for the result is that encoding mismatches should be detected early. In addition, a simple byte-for-byte comparison could produce false positives.

Except if their content is ASCII-only (bytes):

'abc'ISO-8859-1 == 'abc'Shift_JIS ⇒ true
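The encoding rules above can be checked with String#encode, which produces strings in the encodings used on the slides. A minimal sketch; the constant names are only for illustration, and UTF-16LE stands in for the slide's "UTF-16".

```ruby
utf8  = "Юに코δ"
utf16 = utf8.encode("UTF-16LE")

# Concatenating non-ASCII strings in different encodings raises an exception.
MIX_ERROR = begin
  utf8 + utf16
  nil
rescue Encoding::CompatibilityError => e
  e.class
end

# Byte-for-byte identical, but different encodings: not equal.
LATIN1 = "Dürst".encode("ISO-8859-1")
LATIN2 = "Dürst".encode("ISO-8859-2")
SAME_BYTES  = LATIN1.bytes == LATIN2.bytes
SAME_STRING = LATIN1 == LATIN2

# ASCII-only content compares equal across ASCII-compatible encodings.
ASCII_EQUAL = "abc".encode("ISO-8859-1") == "abc".encode("Shift_JIS")

p MIX_ERROR, SAME_BYTES, SAME_STRING, ASCII_EQUAL
```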

Just use Unicode, just use UTF-8

Ruby Likes UTF-8

Default for source encoding (since Ruby 2.0)
  (no need for a # encoding: UTF-8 pragma)
Encoding of strings with \u escapes is always UTF-8
  "abc\u03B4" ⇒ 'abcδ'UTF-8
Use the -U option if not in a UTF-8 context: ruby -U myscript.rb
Processing of UTF-8 is optimized where possible
Used out of the box by Ruby on Rails
Transcoding available on input/output
The only (internal) encoding in Ruby 3.0 or 4.0 (speculation!)

Ruby Versions

Ruby ≤1.8: RIP (Strings as byte sequences)
Ruby 1.9 and later (Strings as character sequences)
Ruby 2.0: UTF-8 default source encoding
Ruby 2.2: Unicode normalization added (2014)
Ruby 2.3: Newest published version
Ruby 2.4: Release planned for Christmas 2016, non-ASCII case conversion

Ruby Versions and Unicode Versions

Year (y)   Ruby version (VRuby)          Unicode version (VUnicode)
           published around Christmas    published in summer
2014       2.2                           7.0.0
2015       2.3                           8.0.0
2016       2.4                           9.0.0

A note about Ruby versions and Unicode versions: The Ruby core team is very conservative (in my view too conservative) about introducing new Unicode versions as bug fixes. Updates to new Unicode versions therefore only happen with new Ruby versions.

RbConfig::CONFIG["UNICODE_VERSION"] ⇒ '9.0.0'

VUnicode = y - 2007

VRuby = 1.5 + VUnicode · 0.1

VUnicode = VRuby · 10 - 15

Don't extrapolate too far!
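The tongue-in-cheek formulas can be checked against the three rows of the table. This is a small sketch with hypothetical method names; it works in tenths of a version number to avoid floating-point rounding.

```ruby
require "rbconfig"

# VUnicode = y - 2007
def unicode_major_for_year(year)
  year - 2007
end

# VRuby = 1.5 + VUnicode * 0.1, computed in tenths
def ruby_version_for_unicode(unicode_major)
  tenths = 15 + unicode_major
  "#{tenths / 10}.#{tenths % 10}"
end

[2014, 2015, 2016].each do |year|
  unicode = unicode_major_for_year(year)
  puts "#{year}: Ruby #{ruby_version_for_unicode(unicode)}, Unicode #{unicode}.0.0"
end

# The Unicode version of the running Ruby, as on the slide.
puts RbConfig::CONFIG["UNICODE_VERSION"]
```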

New in Ruby 2.4: Non-ASCII Case Conversion

Case Conversion Functions in Ruby

'Unicode Everywhere'.upcase ⇒ 'UNICODE EVERYWHERE'
'Unicode Everywhere'.downcase ⇒ 'unicode everywhere'
'Unicode Everywhere'.capitalize ⇒ 'Unicode everywhere'
'Unicode Everywhere'.swapcase ⇒ 'uNICODE eVERYWHERE'

Case Conversion in Ruby 2.3

'Résumé ĭñŧėřŋãţijňőńæłĩżàťïōņ'.upcase ⇒ 'RéSUMé ĭñŧėřŋãţijňőńæłĩżàťïōņ'

Case Conversions NOT in Ruby 2.3

'Résumé ĭñŧėřŋãţijňőńæłĩżàťïōņ'.upcase ⇒ 'RÉSUMÉ ĬÑŦĖŘŊÃŢIJŇŐŃÆŁĨŻÀŤÏŌŅ'

Case Conversion up to and including Ruby 2.3 is ASCII-only! But this works in Ruby 2.4!

Case Conversion Around the World

Many more Latin letters than just A-Z
Other scripts:
  Cyrillic, Greek
  Coptic, Armenian [, Georgian]
  Cherokee, Deseret, Osage
  Old Hungarian, Warang Citi, Glagolitic, Adlam
More minority scripts may introduce case distinction from surrounding majority scripts

Case Distinction History

Originally: Style difference, depending on medium
  Upper case for stone inscriptions (SPQR)
  Lower case for wax tablets,...?
Functional distinction since ~15th century

Modern Case Usage (details vary by language)

ALL UPPER CASE
  EMPHASIS
  Acronyms, abbreviations (DRY, SQL)
First letter upper case
  Start of sentence
  Words in titles
  Proper nouns/adjectives (Kyoto, Japanese)
  Nouns
  Honorifics
Lower case: everything else

German: der Gefangene floh - the prisoner fled, but der gefangene Floh - the captive flea

Isn't ASCII-only Case Conversion Enough?

Already in other languages (Python, Perl, Java, ...)
Already in Ruby (Regexp: //i)
Algorithms and data are available from the Unicode Consortium
It's a good idea in general

But: Backwards Compatibility?

Idea: Option for new functionality
  'Résumé'.upcase ⇒ 'RéSUMé'
  'Résumé'.upcase :unicode ⇒ 'RÉSUMÉ'
Matz felt the option was not necessary
  Lots of data is ASCII-only
  For non-ASCII data, you hopefully used a gem (which you can now eliminate)
Check early: grep your code base for upcase and friends
Test early (preview 2 of Ruby 2.4)

Backwards Compatibility Problems

Explicit ASCII-only case conversion
  E.g. DNS servers (but you used Encoding::ASCII_8BIT there anyway?!)
Exact matches after conversion
  1. Allowed non-ASCII in userids (e.g. Соколов)
  2. Downcased with Ruby 2.3 to help users (Соколов stays unchanged in the DB)
  3. Used exact match
  4. In Ruby 2.4, the downcased соколов will not match the stored Соколов anymore
Localization: See Turkic, Lithuanian special cases

Backwards Compatibility: :ascii Option

Use if you find a case where you really don't want to convert non-ASCII characters

'Résumé ĭñŧėřŋãţijňőńæłĩżàťïōņ'.upcase ⇒ 'RÉSUMÉ ĬÑŦĖŘŊÃŢIJŇŐŃÆŁĨŻÀŤÏŌŅ'

'Résumé ĭñŧėřŋãţijňőńæłĩżàťïōņ'.upcase :ascii ⇒ 'RéSUMé ĭñŧėřŋãţijňőńæłĩżàťïōņ'
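The userid scenario from the backwards compatibility slide can be reproduced on Ruby 2.4 or later, using the :ascii option to stand in for the Ruby 2.3 behavior. A minimal sketch; STORED_USERIDS and the two lookup methods are hypothetical names, not part of any real application.

```ruby
# Userids as stored by an application that downcased them under Ruby 2.3.
# With ASCII-only conversion, the Cyrillic name was stored unchanged.
STORED_USERIDS = ["Соколов".downcase(:ascii)].freeze

# Lookup with the Ruby 2.4 default (full Unicode case conversion).
def known_user?(input)
  STORED_USERIDS.include?(input.downcase)
end

# Lookup that keeps the old ASCII-only behavior explicitly.
def known_user_ascii?(input)
  STORED_USERIDS.include?(input.downcase(:ascii))
end

p known_user?("Соколов")        # the stored value no longer matches
p known_user_ascii?("Соколов")  # matches as before
p "Résumé".upcase, "Résumé".upcase(:ascii)
```

Which fix is appropriate depends on the application: keeping :ascii preserves old data as is, while re-normalizing the stored values lets lookups use the new default.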

Implementation Choices

Use a library?
  Pure Ruby: UnicodeUtils, ActiveSupport::Multibyte, TwitterCLDR
  C extensions: ICU as a gem: icu, ffi-icu
  Different interface if used directly
  Not efficient if in pure Ruby
  Data duplication

Integrate ICU?
  ICU and Ruby both have their own low-level idea of strings

Write new code?
  That's what we ended up doing

Where to Get the Data From?

Data and other specifications available from the Unicode Consortium:

UnicodeData.txt

CaseFolding.txt

SpecialCasing.txt

Special Cases: Not 1-to-1

Number of characters not preserved
  'ß'.upcase ⇒ 'SS' (German sz/sharp s)
  'ﬃ'.upcase ⇒ "FFI" (ffi ligature)
Not necessarily reversible
  'ß'.upcase.downcase ⇒ 'ss' (not 'ß')
  'σ'.upcase ⇒ 'Σ' (Greek sigma)
  'ς'.upcase ⇒ 'Σ' (Greek final sigma)
  'ς'.upcase.downcase ⇒ 'σ' (not 'ς')
  Implemented!
'Σ'.downcase should be context-dependent
  Not yet implemented!
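The cases listed above as implemented can be run on Ruby 2.4 or later. A short sketch; the constant names are only for illustration, and the ligature is written as a \u escape so that it survives copy and paste.

```ruby
SHARP_S      = "ß"
FFI_LIGATURE = "\uFB03"  # LATIN SMALL LIGATURE FFI
FINAL_SIGMA  = "ς"

# The number of characters is not preserved.
p SHARP_S.upcase
p FFI_LIGATURE.upcase
p FFI_LIGATURE.length, FFI_LIGATURE.upcase.length

# Not reversible: the round trip does not give back the original.
p SHARP_S.upcase.downcase == SHARP_S
p FINAL_SIGMA.upcase.downcase == FINAL_SIGMA
```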

Special Case: Simple Case Mapping

Defined by Unicode
Excludes mappings that change string length
Feels outdated

Not implemented!

Special Case: Turkic

Usual:
  'i'.upcase ⇒ 'I'
  'I'.downcase ⇒ 'i'
Turkish, Azerbaijani, and related languages when written in Latin script:
  'i'.upcase ⇒ 'İ' (uppercase I with dot)
  'İ'.downcase ⇒ 'i'
  'ı'.upcase ⇒ 'I' (i without dot)
  'I'.downcase ⇒ 'ı'
Implemented!
  'Türkiye'.upcase :turkic ⇒ 'TÜRKİYE'
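The Turkic mappings above are selected with the :turkic option. A minimal sketch for Ruby 2.4 or later; the constant names are only for illustration.

```ruby
DOTTED_CAPITAL_I = "\u0130"  # İ, LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_SMALL_I  = "\u0131"  # ı, LATIN SMALL LETTER DOTLESS I

# Default mapping versus Turkic mapping for the same input.
p "i".upcase, "i".upcase(:turkic)
p "I".downcase, "I".downcase(:turkic)

# With the option, dotted and dotless forms pair up consistently.
p DOTTED_CAPITAL_I.downcase(:turkic)
p DOTLESS_SMALL_I.upcase(:turkic)
p "Türkiye".upcase(:turkic)
```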

Special Case: Lithuanian

Usual:
  'Í'.downcase ⇒ 'í' (accent replaces dot)

Lithuanian:
  'Í'.downcase :lithuanian ⇒ 'í' (accent above visible dot; may not show because of technology limits)

Not yet implemented!

Special Case: Case Folding

Case mapping: Change from one form to another
  upcase/downcase/capitalize/swapcase
Case folding: Eliminate case-related differences
  For comparison, sorting
  In general same as downcase
  But: ß → ss, ﬃ → ffi, ς → σ
  Upcase for Cherokee
Implemented! with :fold option on downcase

'ß'.downcase :fold ⇒ 'ss'
'ﬃ'.downcase :fold ⇒ 'ffi'
'ς'.downcase :fold ⇒ 'σ'
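Case folding is mainly useful for comparisons. The following sketch for Ruby 2.4 or later wraps the :fold option in a hypothetical helper, same_ignoring_case?, which is not a built-in method.

```ruby
# Compare two strings after eliminating case-related differences.
def same_ignoring_case?(a, b)
  a.downcase(:fold) == b.downcase(:fold)
end

p "ß".downcase(:fold)
p "\uFB03".downcase(:fold)  # ffi ligature
p "ς".downcase(:fold)

p same_ignoring_case?("Straße", "STRASSE")
p "Straße".downcase == "STRASSE".downcase  # plain downcase keeps the ß
```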

Special Case: Titlecase

Some characters have three case forms:
  Upper case: Ǆ (U+01C4; Croatian/Serbian)
  Lower case: ǆ (U+01C6)
  Title case: ǅ (U+01C5)
Important for capitalize
  'ǆungla'.capitalize ⇒ 'ǅungla' (not 'Ǆungla')
Implemented!
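The three case forms are single codepoints, which is easy to miss in print. The sketch below for Ruby 2.4 or later uses \u escapes to make that explicit; the constant names are only for illustration.

```ruby
UPPER_DZ = "\u01C4"  # Ǆ, LATIN CAPITAL LETTER DZ WITH CARON
TITLE_DZ = "\u01C5"  # ǅ, LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
LOWER_DZ = "\u01C6"  # ǆ, LATIN SMALL LETTER DZ WITH CARON

word = LOWER_DZ + "ungla"

# capitalize uses the titlecase form for the first character.
p word.capitalize == TITLE_DZ + "ungla"
# upcase uses the uppercase form throughout.
p word.upcase == UPPER_DZ + "UNGLA"
```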

More Special Cases

Contextual processing, e.g. for i with combining dots (part of Unicode algorithm definition)
German uppercase ß (not part of Unicode algorithm definition)
others,...
Not implemented (yet?)

Implementation

12 Methods to Implement

String (functional)   String (destructive)   Symbol
upcase                upcase!                upcase
downcase              downcase!              downcase
capitalize            capitalize!            capitalize
swapcase              swapcase!              swapcase

Not dealt with: String#casecmp
  Why: Includes sorting
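The difference between the three columns of the table can be seen in a few lines. A minimal sketch for Ruby 2.4 or later; CASE_METHODS is a name chosen here for illustration.

```ruby
CASE_METHODS = %i[upcase downcase capitalize swapcase].freeze

text = "résumé".dup      # dup, so the string can be modified in place

copy = text.upcase       # functional: returns a new string
p copy, text

changed = text.upcase!   # destructive: modifies the receiver
p changed, text
p text.upcase!           # nil when there was nothing to change

p :résumé.upcase         # Symbol: returns a new symbol

# All twelve methods exist.
CASE_METHODS.each do |name|
  p [name, "".respond_to?(name), "".respond_to?("#{name}!"), :a.respond_to?(name)]
end
```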

Internally, a Single Function

Flags to indicate operation needed (in file include/ruby/oniguruma.h):

#define ONIGENC_CASE_UPCASE    (1<<13) /* uppercase mapping */
#define ONIGENC_CASE_DOWNCASE  (1<<14) /* lowercase mapping */
#define ONIGENC_CASE_TITLECASE (1<<15) /* titlecase mapping */

Usage to indicate operation type:
  upcase: ONIGENC_CASE_UPCASE (upcasing needed)
  downcase: ONIGENC_CASE_DOWNCASE (downcasing needed)
  capitalize: ONIGENC_CASE_TITLECASE | ONIGENC_CASE_UPCASE (changed to ONIGENC_CASE_DOWNCASE after first character)
  swapcase: ONIGENC_CASE_UPCASE | ONIGENC_CASE_DOWNCASE (both upcasing and downcasing needed)
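How a single internal function can serve four operations is easier to see with the flags written out. The following is a Ruby model of the scheme described above, not the C code itself: the bit positions are taken from the slide, while the constant names, the OPERATION_FLAGS table, and flags_after_first_character are hypothetical.

```ruby
CASE_UPCASE    = 1 << 13
CASE_DOWNCASE  = 1 << 14
CASE_TITLECASE = 1 << 15

# Flags passed for each of the four operations.
OPERATION_FLAGS = {
  upcase:     CASE_UPCASE,
  downcase:   CASE_DOWNCASE,
  capitalize: CASE_TITLECASE | CASE_UPCASE,
  swapcase:   CASE_UPCASE | CASE_DOWNCASE
}.freeze

# capitalize switches to downcasing once the first character is done;
# the other operations keep their flags for the whole string.
def flags_after_first_character(flags)
  return flags if (flags & CASE_TITLECASE).zero?
  (flags & ~(CASE_TITLECASE | CASE_UPCASE)) | CASE_DOWNCASE
end

OPERATION_FLAGS.each do |name, flags|
  puts format("%-10s first: %#x  rest: %#x", name, flags, flags_after_first_character(flags))
end
```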

Option Handling

Flags also used for options:

  :fold (for case folding; only on downcase)
  :turkic
  :lithuanian (not yet implemented)
  :ascii

Corresponding flags:

#define ONIGENC_CASE_FOLD               (1<<19) /* has/needs case folding */
#define ONIGENC_CASE_FOLD_TURKISH_AZERI (1<<20) /* Turkic */
#define ONIGENC_CASE_FOLD_LITHUANIAN    (1<<21) /* Lithuanian */
#define ONIGENC_CASE_ASCII_ONLY         (1<<22) /* limited to ASCII */

String Expansion

Handles string expansion (e.g. 'ﬃ'.upcase ⇒ "FFI")
Common to all casing operations

Linked list of buffers (b1→b2→b3→...)
Repeatedly calls encoding-specific primitive to fill as much as possible of next buffer
For buffer bx, allocates bytes_to_still_be_converted · x + 20 bytes
  Example: We need a 3rd buffer, and need to convert 5 more bytes, so we allocate length(b3) = 5 · 3 + 20 = 35 bytes
Until no new buffer is needed
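The buffer sizing rule is simple arithmetic. The sketch below restates it in Ruby with hypothetical names (buffer_size, BUFFER_PADDING); it models only the size calculation, not the buffer list or the conversion itself.

```ruby
BUFFER_PADDING = 20

# Size of the x-th buffer when a given number of source bytes is still unconverted.
def buffer_size(buffer_index, bytes_remaining)
  bytes_remaining * buffer_index + BUFFER_PADDING
end

# The example from the slide: third buffer, 5 bytes left to convert.
puts buffer_size(3, 5)

# Later buffers get more room per remaining byte, so expansion cannot go on forever.
(1..4).each { |x| puts "b#{x}: #{buffer_size(x, 5)} bytes for 5 remaining bytes" }
```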

Handling Encodings: The Ruby Way

Each encoding is implemented by a series of primitives
Work like methods (polymorphism), but implemented in C
Total of 13 primitives per encoding
Example primitives:
  Length of character at current byte position
  Advance byte position by one character
  Codepoint of character at current byte position
  Insert codepoint x at current byte position

[1] 松本行弘, 縄手雅彦. スクリプト言語 Ruby の拡張可能な多言語テキスト処理の実装. 情報処理学会論文誌. 2005 Nov 15;46(11):2633-42. / Yukihiro Matsumoto and Masahiko Nawate: Multilingual Text Manipulation Method for Ruby Language. Journal of Information Processing (JIP); 2005 Nov 15; Vol. 46, No. 11, pp. 2633-42. (in Japanese)

Implementation Choice: UTF-8 only or Primitives

Matz would have been fine with
  Full Unicode case conversion for UTF-8
  ASCII-only for all other encodings
Actually used primitives to obtain
  A more complete implementation
  Experience about pros/cons of using primitives

Implementation Choice: New or Reused Primitive

3 primitives are used for case folding with regular expressions (//i)
  mbc_case_fold
  apply_all_case_fold
  get_case_fold_codes_by_str
Found no good way to reuse any of these
  New primitive
  But found a lot of reusable data

The case_map Primitive

Input/output parameters:
  OnigCaseFoldType flags
  Start of source
Input parameters:
  End of source
  Start of destination
  End of destination
  Encoding (to call other primitives)
Output parameters:
  Byte count of conversion result (negative for errors)
Most complex 'primitive', although not by much

Implementations of case_map Primitive

Examples:
  "Résumé"UTF-8.upcase
    calls onigenc_unicode_case_map in enc/unicode.c (most complex case)
    as defined with OnigEncodingDefine in enc/utf_8.c
  "Résumé"UTF-16LE.upcase
    calls onigenc_unicode_case_map in enc/unicode.c
    as defined with OnigEncodingDefine in enc/utf_16le.c
  "Résumé"ISO-8859-1.upcase
    calls case_map in enc/iso_8859_1.c (simple case, good starting point for primitive for new encoding)
    as defined with OnigEncodingDefine in the same file

The Primitive of Primitives: onigenc_unicode_case_map

Works for UTF-8, UTF-16[BE|LE], UTF-32[BE|LE]
140 lines long 'monster function'
Same structure as simpler primitives:
  Big while loop, one source character at a time
  Carefully updating ONIGENC_CASE_MODIFIED flag
Deal with special cases 'by hand'
Reuse existing data where possible
~30 if/else if/else
Lots of |/& with flag bits
2 gotos
gperf-created hash lookups:
  onigenc_unicode_fold_lookup
  onigenc_unicode_unfold1_lookup
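From the Ruby side, the per-encoding case_map primitives are invisible: the same method call works in each supported encoding. A minimal sketch for Ruby 2.4 or later; upcase_in and ENCODINGS are hypothetical names used for illustration.

```ruby
ENCODINGS = %w[UTF-8 UTF-16LE UTF-32BE ISO-8859-1].freeze

# Upcase the text in the given encoding, then convert back to UTF-8 for display.
def upcase_in(encoding_name, text)
  text.encode(encoding_name).upcase.encode("UTF-8")
end

ENCODINGS.each do |name|
  puts "#{name}: #{upcase_in(name, 'Résumé')}"
end

# The result stays in the encoding of the receiver.
p "Résumé".encode("ISO-8859-1").upcase.encoding
```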

More case_map Primitives

Students (sophomores/juniors/seniors) at Aoyama Gakuin University

ISO-8859-2: Yushiro Ishii (石井 優史朗)
ISO-8859-3: Kanon Shindo (新藤 海音)
ISO-8859-4: Kotaro Yoshida (吉田 孝太郎)
ISO-8859-5: Masaru Onodera (小野寺 俊)
ISO-8859-7: Kosuke Kurihara (栗原 光祐)
ISO-8859-9: Kazuki Iijima (飯島 一貴)
ISO-8859-10: Toya Hosokawa (細川 登陽)
ISO-8859-13: Takuya Miyamoto (宮本 拓弥)
ISO-8859-14: Yutaro Tada (多田 悠太朗)
ISO-8859-15: Maho Harada (原田 真帆)
ISO-8859-16: Satoshi Kayama (香山 智志)
Windows-1250, -1257: Sho Koike (小池 翔)
Windows-1251: Shunsuke Sato (佐藤 駿介)
Windows-1252: Serina Tai (田井 芹奈)
Windows-1253: Takumi Koyama (小山 拓美)

So What about Shift_JIS and Friends?

For East Asian encodings (Shift_JIS, EUC-JP, GB2312, EUC-KR, Big-5, EUC-TW,...)
  data could be shared between //i and case mapping
  but case folding for //i only works for ASCII
None of the main Japanese committers thought this was needed anymore

Talk to me if you need it

Reusing Case Folding Data

Onig[uruma|gmo] has data for case folding
Folding is very close to downcase
There is also unfolding (why?), which is close to upcase
That's almost all we need

Folding Data: Before and After

in enc/unicode/9.0.0/casefold.h

/* before */
{0x0041, {1, {0x0061}}},         /* A → a */
{0x00df, {2, {0x0073, 0x0073}}}, /* ß → ss */
{0x01c4, {1, {0x01c6}}},         /* Ǆ → ǆ */
{0x01c5, {1, {0x01c6}}},         /* ǅ → ǆ */
{0xab73, {1, {0x13a3}}},         /* ꭳ → Ꭳ (Cherokee) */

/* after */
{0x0041, {1|F|D, {0x0061}}},                   /* A → a */
{0x00df, {2|F|ST|SU|I(1), {0x0073, 0x0073}}},  /* ß → ss */
{0x01c4, {1|F|D|ST|I(8), {0x01c6}}},           /* Ǆ → ǆ */
{0x01c5, {1|F|D|IT|SU|I(9), {0x01c6}}},        /* ǅ → ǆ */
{0xab73, {1|F|U, {0x13a3}}},                   /* ꭳ → Ꭳ (Cherokee) */

Folding Data: Flags

(squeezed into an int where only 2 bits were used)
see enc/unicode.c

/* data is available here */
/* (flags are the same as for options) */
#define U ONIGENC_CASE_UPCASE
#define D ONIGENC_CASE_DOWNCASE
#define F ONIGENC_CASE_FOLD
/* data is in special additional array */
#define ST ONIGENC_CASE_TITLECASE
#define SU ONIGENC_CASE_UP_SPECIAL
#define SL ONIGENC_CASE_DOWN_SPECIAL
#define IT ONIGENC_CASE_IS_TITLECASE
/* index into special array (size: around 420 words only) */
#define I(n) OnigSpecialIndexEncode(n)

Small Implementation Detail

(or my attempt at using the Takahashi method)

upcase seems useful

downcase seems useful

capitalize seems useful

swapcase

Who would use swapcase?

Nobody?

Well, I did, when testing swapcase!

Why swapcase?

Python has it ?! (Matz)

To revert accidental Caps Lock output ?! (on Unicode list)

Implementing swapcase must be easy
  UPPER ⇒ upper
  lower ⇒ LOWER
But what about titlecase?

ǲ, ǅ, ǈ, ǋ
ᾼ, ᾈ, ᾉ, ᾊ, ᾋ, ᾌ, ᾍ, ᾎ, ᾏ
ῌ, ᾘ, ᾙ, ᾚ, ᾛ, ᾜ, ᾝ, ᾞ, ᾟ
ῼ, ᾨ, ᾩ, ᾪ, ᾫ, ᾬ, ᾭ, ᾮ, ᾯ

Choice 1: leave as is
  "ǅunGLA".swapcase ⇒ "ǅUNgla"
  preferred by Unicode Consortium (never ever need any new standardization)
  preserves reversibility (X.swapcase.swapcase == X)

Choice 2: upcase
  "ǅunGLA".swapcase ⇒ "ǄUNgla"

Choice 3: downcase
  "ǅunGLA".swapcase ⇒ "ǆUNgla"

Choice 4: swap
  "ǅunGLA".swapcase ⇒ "dŽUNgla"
  proposed by Nobuyoshi Nakada

Implemented: swap ("dŽUNgla")
  useless?, but 'correct'
  additional effort for implementation
  additional effort for testing
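The implemented choice can be observed directly on Ruby 2.4 or later. In this sketch the titlecase letter is written as a \u escape so that it is clearly one codepoint; the constant names are only for illustration.

```ruby
TITLECASE_DZ = "\u01C5"  # ǅ, a single titlecase codepoint
INPUT        = TITLECASE_DZ + "unGLA"

SWAPPED    = INPUT.swapcase
ROUND_TRIP = SWAPPED.swapcase

p INPUT.length, SWAPPED.length  # the titlecase letter becomes two characters
p SWAPPED
p ROUND_TRIP == INPUT           # swapping twice does not restore the titlecase letter
```

This also shows the price of choice 4: unlike "leave as is", it gives up reversibility for titlecase characters.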

Commit Date

April 1st, 2016 (April Fools' Day)
Japan Time 20:58:33
same date in most timezones
please draw your own conclusions

Testing

Test-Driven Development

Write small example test
Verify that it doesn't work
Implement
Enjoy that it works
Rinse and repeat

Files:
  test/ruby/enc/test_case_options.rb
  test/ruby/enc/test_case_mapping.rb

Data-Driven Testing

Test every character (except for ranges in UnicodeData.txt)
  of every encoding
  for all option combinations
  for (almost) all methods
Data provided by Unicode
Identical to data used for implementation ?!

Files: test/ruby/enc/test_case_comprehensive.rb
413 tests, 2'212'391 assertions, 0 failures, 0 errors, 0 skips
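The idea of data-driven testing can be shown on a tiny scale. The sketch below is not the code in test_case_comprehensive.rb; it assumes four hand-picked lines in UnicodeData.txt field layout (name and mapping fields kept, unused fields left empty) and checks the built-in methods against the simple uppercase (field 12) and lowercase (field 13) mappings. All names are hypothetical.

```ruby
SAMPLE_UNICODE_DATA = <<~DATA
  0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
  00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;;;00C9;;00C9
  03B4;GREEK SMALL LETTER DELTA;Ll;0;L;;;;;N;;;0394;;0394
  042E;CYRILLIC CAPITAL LETTER YU;Lu;0;L;;;;;N;;;;044E;
DATA

# Convert a hexadecimal codepoint such as "00E9" into a UTF-8 string.
def codepoint_to_string(hex)
  [hex.to_i(16)].pack("U")
end

# Return descriptions of all mappings where Ruby disagrees with the data.
def case_mapping_failures(data)
  data.each_line.flat_map do |line|
    fields = line.chomp.split(";", -1)
    char = codepoint_to_string(fields[0])
    checks = []
    checks << [:upcase, fields[12]] unless fields[12].empty?
    checks << [:downcase, fields[13]] unless fields[13].empty?
    checks.reject { |method, expected| char.public_send(method) == codepoint_to_string(expected) }
          .map { |method, _| "U+#{fields[0]} #{method}" }
  end
end

failures = case_mapping_failures(SAMPLE_UNICODE_DATA)
puts failures.empty? ? "all sample mappings agree" : failures
```

A full test would read the real data files and also cover SpecialCasing.txt, other encodings, and the options.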

Continuous Integration

Commit early, commit often
  Advice (and scolding) from hardcore Ruby hackers
  Keep code reasonably clean, and motivation high
  More commits → higher chance to attend Ruby Kaigi for free
But: Don't want to affect Ruby build or execution
Solution: Make use of new functionality dependent on special option
  Used :lithuanian (because last to be actually implemented)
  Test with option protection
  Remove option protection

Future:

Ideas, Problems, Questions

In No Particular Order

Character properties
Locale-aware formatting
What to do with encodings?

Character Properties

Unicode provides a wide range of character properties
Most available in Regexp
  Does this string contain a Hiragana character?
    'Юに코δ' =~ /\p{Hiragana}/
  What script is 'Ю'? Sorry, impossible!
Currently looking at this with a student, hopefully for Ruby ~2.5
  Use less memory
  Faster
  More properties
  More ways to use
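The Regexp-based property test works today, and it also allows a clumsy workaround for the missing "what script is this?" query: try a list of candidate scripts one by one. A sketch under that assumption; script_of and CANDIDATE_SCRIPTS are hypothetical names, and the list is far from complete.

```ruby
TEXT = "Юに코δ"

# Position of the first Hiragana character, or nil if there is none.
HIRAGANA_INDEX = TEXT =~ /\p{Hiragana}/
p HIRAGANA_INDEX

CANDIDATE_SCRIPTS = %w[Cyrillic Hiragana Hangul Greek Latin].freeze

# Workaround: test the character against each candidate script in turn.
def script_of(char)
  CANDIDATE_SCRIPTS.find { |script| Regexp.new("\\p{#{script}}").match?(char) }
end

TEXT.each_char { |char| puts "U+#{format('%04X', char.ord)}: #{script_of(char)}" }
```

The workaround needs one regular expression match per candidate script, which is one reason why direct access to character properties would be preferable.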

Locale-Aware Formatting

What I want:

loc = Locale.new 'de-CH' (German as used in Switzerland)

1.2345678E5.to_s ⇒ "123456.78"

1.2345678E5.to_s(loc) ⇒ "123'456,78"

Well, Just use a Library

Internationalization support in libraries:

Pure Ruby: UnicodeUtils, ActiveSupport::Multibyte, TwitterCLDR
C extensions: ICU as a gem: icu, ffi-icu

Example: Unicode Normalization

UnicodeUtils:
  UnicodeUtils.nfkc string
ActiveSupport::Multibyte:
  ActiveSupport::Multibyte::Chars.new(string).normalize :kc
TwitterCLDR:
  TwitterCldr::Normalization::NFKC.normalize string
Native (since Ruby 2.2):
  string.unicode_normalize :nfkc

Libraries avoid monkey patching, so using a library is not Ruby-like
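Of the four variants above, only the native one runs without installing a gem. A minimal sketch for Ruby 2.2 or later; the constant names are only for illustration.

```ruby
LIGATURE   = "\uFB03"   # ffi ligature
DECOMPOSED = "e\u0301"  # e followed by COMBINING ACUTE ACCENT

# NFKC replaces compatibility characters such as ligatures.
NFKC_RESULT = LIGATURE.unicode_normalize(:nfkc)
p NFKC_RESULT

# NFC (the default form) composes base character and accent.
COMPOSED = DECOMPOSED.unicode_normalize
p DECOMPOSED.length, COMPOSED.length
p DECOMPOSED.unicode_normalized?
```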

Locales and Case Mappings

Possible solution:
  loc = Locale.new 'tr'
  'Türkiye'.upcase loc ⇒ 'TÜRKİYE'

Encodings: Less is More?

We discovered flaky support for current encodings
  (//i case folding: all encodings not at end of test/ruby/enc/test_regex_casefold.rb)
The world is moving to Unicode
Matz wants to move to UTF-8, slowly but steadily
Do we let other encodings die slowly?
Or get rid of them in a single step (Ruby 3.0?)

Acknowledgments

Kimihito Matsui (松井 仁人) and many other students for help with research and implementations
Yui Naruse (成瀬 ゆい), Nobuyoshi Nakada (中田 伸悦) and many other Ruby committers for help and support
Matz (まつもと ゆきひろ) for Ruby, a programmer's best friend
Amaya, Opera 12.17, and coderay for slide production and display
The IME Pad for easy character input

Conclusions

Full Unicode case mapping (mostly) implemented
Options for backward compatibility, special conventions, case folding
Space efficient implementation by reusing Regexp data
Available in Ruby trunk now, please test!
More internationalization work needed
Tell me what you want most

References

More information about case conversion implementation internals: http://www.sw.it.aoyama.ac.jp/2016/pub/RubyKaigi/ (video at http://rubykaigi.org/2016/presentations/duerst.html)

Q & A

Send questions and comments to Martin Dürst (mailto:[email protected]) or open a bug report or feature request for Ruby

The latest version of this presentation is available at: http://www.sw.it.aoyama.ac.jp/2016/pub/IUC40-Ruby2.4/