Camomile: A Unicode Library for OCaml
Yoriyuki Yamagata
National Institute of Advanced Science and Technology (AIST)
ML Workshop, September 18, 2011

Outline

- Overview
- ASCII to Unicode : A challenge of multilingualization
- A brief tour of Camomile modules
- ulib
- Conclusion

Overview - functionality

Camomile - A Unicode library for OCaml

- Unicode character type
- UTF-8, UTF-16, UTF-32 strings
- Conversion to/from approx 200 encodings
- Case mapping
- Collation (sort and search)

Overview - feature

- Only support “logical” operations
- No support for rendering or formatting
- Purely written in OCaml

ASCII to Unicode : challenge of multilingualization

- Large number of characters
    - code range 0x0 - 0x10ffff
- Multiple representation of strings
    - UTF-8, UTF-16 and UTF-32
    - legacy encodings
- Combining characters
    - ä = a + ¨
    - Nguyễn = Nguyê + ˜ + n = Nguye + ˆ + ˜ + n
    - ậ = a + . + ˆ = a + ˆ + .
- Diverse cultural conventions
    - Case mapping: OΣOΣ → oσoς (Greek)
    - Sorting: ... < H < CH < I < ... (Slovak)

Camomile modules - Initialization

    module Camomile = CamomileLibrary.Make(Parameters)

Parameters:

    sig
      val datadir : string
      val charmapdir : string
      val unimapdir : string
      val localedir : string
    end

- datadir: location of compiled Unicode database
- charmapdir: location of compiled mapping tables for character encodings
- unimapdir: location of compiled mapping tables for East Asian encodings
- localedir: location of compiled locale data

Camomile modules - UChar

    type t
    exception Out_of_range
    val char_of : t -> char
    val of_char : char -> t
    val code : t -> int
    val chr : int -> t
    val eq : t -> t -> bool
    val compare : t -> t -> int

- type t, exception Out_of_range: Unicode type and exception
- char_of, of_char: conversion to/from char
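To make the UChar signature concrete, here is a minimal self-contained sketch of a module with the same interface. It is an illustration only, not Camomile's implementation: the name UCharSketch, the choice of int as the representation, and the decision to raise Out_of_range from chr for values outside 0x0 - 0x10ffff are all assumptions made for this example. The real library may represent characters differently and signal errors in other ways.

```ocaml
(* Hypothetical stand-in for the UChar signature shown on the slide.
   Not Camomile's code: representation and error behaviour are assumptions. *)
module UCharSketch : sig
  type t
  exception Out_of_range
  val char_of : t -> char
  val of_char : char -> t
  val code : t -> int
  val chr : int -> t
  val eq : t -> t -> bool
  val compare : t -> t -> int
end = struct
  (* Assumption: a character is stored as its integer code point. *)
  type t = int
  exception Out_of_range

  (* Only code points that fit in one byte can become an OCaml char. *)
  let char_of u = if u > 0xff then raise Out_of_range else Char.chr u
  let of_char c = Char.code c
  let code u = u

  (* Assumption: reject anything outside the range 0x0 - 0x10ffff. *)
  let chr n = if n < 0 || n > 0x10ffff then raise Out_of_range else n

  let eq (a : t) (b : t) = a = b
  (* Non-recursive let: the inner compare is the standard library's. *)
  let compare (a : t) (b : t) = compare a b
end

(* Usage: 'a' round-trips through char, Greek small sigma does not. *)
let fits_in_char u =
  try ignore (UCharSketch.char_of u); true
  with UCharSketch.Out_of_range -> false

let () =
  let a = UCharSketch.of_char 'a' in
  let sigma = UCharSketch.chr 0x03c3 in
  assert (UCharSketch.code a = 0x61);
  assert (fits_in_char a);
  assert (not (fits_in_char sigma));
  assert (UCharSketch.compare a sigma < 0)
```

Keeping t abstract in the signature means callers can only build characters through of_char and chr, so any range check lives in one place.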