The Use of Unicode™ in MARC 21 Records What Is MARC?

ALCTS "Library Catalogs and Non- ALA Annual Conference Orlando Rom an Scripts: Developm ent and Im plem entation of UnicodeTM for Cataloging and Public Access" Program The Use of Unicode™ in MARC 21 Records Joan M. Aliprand Senior Analyst, RLG What is MARC? • MAchine-Readable Cataloging • MARC is an exchange format • Focus on MARC 21 exchange format An implementation may disguise certain features of MARC – Parallel fields in RLIN are the regular field and its 880 equivalent in MARC www.ala.org/alcts 1 ALCTS "Library Catalogs and Non- ALA Annual Conference Orlando Rom an Scripts: Developm ent and Im plem entation of UnicodeTM for Cataloging and Public Access" Program MARC-8 The Individual Character Sets of MARC 21 Scholarship in all languages plus Greek, Superscripts, Subscripts, and Greek Symbols The Alternative! www.ala.org/alcts 2 ALCTS "Library Catalogs and Non- ALA Annual Conference Orlando Rom an Scripts: Developm ent and Im plem entation of UnicodeTM for Cataloging and Public Access" Program MARC-8 vs. Unicode 245 00 <HBR>qcx dbcd yl tgq<LAT> =‡b<CYR>aGA DA [EL pESAH<LAT> = Hebrew-Russian Hagga dah for Passover.<EOF> EOF is End Of Field first 245 00 =‡b = Hebrew- Russian Hag gadah for last Passover.<EOF> Encoding Forms of Unicode • UTF-8 – Characters encoded as ASCII-compatible sequences of 8-bit bytes – Chosen for MARC 21 record exchange • UTF-16 – Characters encoded in 16 bits – Surrogate technique for characters beyond Basic Multilingual Plane (BMP) • UTF-32 – 1-to-1 relationship between encoded character and code unit • ALL are Unicode! • ALL Unicode characters can be represented in every encoding form! www.ala.org/alcts 3 ALCTS "Library Catalogs and Non- ALA Annual Conference Orlando Rom an Scripts: Developm ent and Im plem entation of UnicodeTM for Cataloging and Public Access" Program UTF-8 Size in Coverage Octets 1 ASCII repertoire A-Z, a-z, 0-9, punctuation, # % & * + < = > @ … Latin script extensions, other European and Middle Eastern 2 scripts, IPA extensions, combining diacritical marks, … South, Southeast and East Asian 3 scripts, common symbols, … Historic scripts, symbols for 4 music and math, additional ideographs, … “octet” = 8-bits MARC 21 Alternatives MARC-8 – Leader 09 blank – Discrete 8-bit character sets – “Announcers” for character sets – Non-default sets listed in 066 field – 880 fields for “alternate graphic representation” – Linked / unlinked – Non-spacing marks before base character www.ala.org/alcts 4 ALCTS "Library Catalogs and Non- ALA Annual Conference Orlando Rom an Scripts: Developm ent and Im plem entation of UnicodeTM for Cataloging and Public Access" Program MARC 21 Alternatives MARC-8 Unicode – Leader 09 blank Leader 09 = “a” – Discrete 8-bit UTF-8 encoding character sets throughout – “Announcers” for No “announcers” character sets – Non-default sets No 066 field listed in 066 field – 880 fields for 880 fields for “alternate graphic “alternate graphic representation” representation” – Linked / unlinked ° Linked / unlinked – Non-spacing ‹ Non-spacing marks before base marks after base character character RLIN Differences RLIN® ITPS – Field pairs – Linked / unlinked – “Core fields” – Non-spacing marks before base character – Numeric data in “visual order” – Proprietary fonts and keyboards – See all scripts or only romanization – Abbreviations on displays for scripts in record www.ala.org/alcts 5 ALCTS "Library Catalogs and Non- ALA Annual Conference Orlando Rom an Scripts: Developm ent and Im plem entation of UnicodeTM for Cataloging and Public Access" Program RLIN Differences RLIN® ITPS RLIN21™ client – Field pairs – Field pairs – Linked / unlinked – Linked / unlinked • “Core fields” • “Core fields” – Non-spacing Non-spacing marks before base marks after base character character – Numeric data in Regular order “visual order” – Highest no. first – Proprietary fonts Regular fonts, and keyboards RLG keyboard – See true scripts or See all scripts romanization • No toggling – Abbreviations on “scripts” indicates displays for scripts non-Roman data in record in record Non-Spacing Marks • Combining characters come after the base character in Unicode – May affect non-filing indicator count – L’ e toile (3) versus L’e t oile (2) • Specified order in Unicode for combining characters – Affects multiple diacritics that interact – Do you ever include nekudot? www.ala.org/alcts 6 ALCTS "Library Catalogs and Non- ALA Annual Conference Orlando Rom an Scripts: Developm ent and Im plem entation of UnicodeTM for Cataloging and Public Access" Program Non-Filing Indicators • Affected by different order of base character and non-spacing marks – Known case of article in Greek • Romanized alif and ayn – Symbols representing letters – Treated as non-spacing diacritics • Different non-filing counts for romanized Hebrew or Yiddish and the same text in original script – OCLC and RLG objections to current treatment not accepted by LC – Currently disregarded in searching throughout text European Numbers “Arabic” numbers (= from the Arabs) RLIN Terminal for Windows • Basic Latin character set – Most significant digit first • 2, 0, 0, 4 • Hebrew character set – Least significant digit first • 4, 0, 0, 2 RLIN21 Editor & RLIN21 searching – Most significant digit first • 2, 0, 0, 4 www.ala.org/alcts 7 ALCTS "Library Catalogs and Non- ALA Annual Conference Orlando Rom an Scripts: Developm ent and Im plem entation of UnicodeTM for Cataloging and Public Access" Program Traditional Arabic Numbers “Hindi” numbers (= from the Indians) RLIN Terminal for Windows • Basic Arabic character set – Least significant digit first • , , , RLIN21 Editor & RLIN21 searching – Most significant digit first • , , , Date Spans • Multi-digit numbers entered most significant digit first • In RTL field, earlier date is on the right – RTL overall, with dates LTR • Example: 1967-1948 – Same with Latin hyphen or with Hebrew makef • For reverse order, special Unicode formatting characters required • Example: 1948-1967 – Not preserved in current RLG Union Catalog – Input record using RTFW if this order required – Will be supported post-migration www.ala.org/alcts 8 ALCTS "Library Catalogs and Non- ALA Annual Conference Orlando Rom an Scripts: Developm ent and Im plem entation of UnicodeTM for Cataloging and Public Access" Program Headings • Name headings established under AACR2 – Communally created file – Hebrew funnel • LCSH – External contributions via SACO • Names, subjects in Hebrew script languages – Uncontrolled – Locally controlled Authority Record Options • AACR2 record with cross- references from ALL forms of name in other scripts • An authority file for each language/source of authority with authorized headings for the same entity providing connections to other authority files – VIAF model – YIVO card catalog conceptually www.ala.org/alcts 9 ALCTS "Library Catalogs and Non- ALA Annual Conference Orlando Rom an Scripts: Developm ent and Im plem entation of UnicodeTM for Cataloging and Public Access" Program MARC Authority Format • Hospitable to either option • Alternate graphic representation – Specified – Not implemented in NAF – Hong Kong has authority file of Chinese names written in ideographs; uses MARC 21 • No linking of romanized and 880 fields!! Gazing into the Future www.ala.org/alcts 10 ALCTS "Library Catalogs and Non- ALA Annual Conference Orlando Rom an Scripts: Developm ent and Im plem entation of UnicodeTM for Cataloging and Public Access" Program Gazing into the Future • Will Alternate Graphic Representation persist? – DP 111 (1998) Alternate Graphics without 880 – Needed while Roman-only systems exist Gazing into the Future • Hebrew headings in bibliographic records? – Are Hebrew cross-references in NAF sufficient for access? (Only AACR2 headings in records) – If not, why not? www.ala.org/alcts 11 ALCTS "Library Catalogs and Non- ALA Annual Conference Orlando Rom an Scripts: Developm ent and Im plem entation of UnicodeTM for Cataloging and Public Access" Program Gazing into the Future • A Hebrew NAF? – Language, not script, is fundamental: Hebrew language file, Yiddish language file, … – What is the source of cataloging authority for each file? – Coordination with Israel and other centers of Hebraica for Hebrew language file? – Decision about alternative orthographies (Hebrew) or requirement for use of standard orthography (Yiddish) Gazing into the Future • There is a true script catalog in your future … – What you need to know – What you need to consider www.ala.org/alcts 12 ALCTS "Library Catalogs and Non- ALA Annual Conference Orlando Rom an Scripts: Developm ent and Im plem entation of UnicodeTM for Cataloging and Public Access" Program Planning for a Script • Adequate character repertoire? • Need for a rendering algorithm? – Unicode Bidirectional Algorithm • Word definition in languages? • Special requirements for indexing? • Defined sort order? • Adequate fonts? Making the Case • Why should management spend money on your script? • “Isn’t romanization good enough?” – Examples of problems www.ala.org/alcts 13 ALCTS "Library Catalogs and Non- ALA Annual Conference Orlando Rom an Scripts: Developm ent and Im plem entation of UnicodeTM for Cataloging and Public Access" Program “Conformance” • A very special meaning in the context of Unicode – Specified in Chapter 3 • Not a general synonym for “makes use of Unicode” • Pay attention to how it’s used Testing (The Skill of the Black Thumb) • Work from planned test scripts • Verify every character, not just some – Check treatment of spacing “marks” in conversion to Unicode (they are not diacritics!) • Don’t test just the obvious • Test negative conditions as well as positive ones • Stress the system with evil “corner cases” www.ala.org/alcts 14 ALCTS "Library Catalogs and Non- ALA Annual Conference Orlando Rom an Scripts: Developm ent and Im plem entation of UnicodeTM for Cataloging and Public Access" Program Reality Check • Unicode provides the potential for multiscript support • Unicode is not software – Underlying software comes from OEMs • Unicode is the foundation, not the whole building – Many parts to a system http://www.unicode.org www.ala.org/alcts 15.

The Use of Unicode™ in MARC 21 Records What Is MARC?

1 Introduction 1

Package Mathfont V. 1.6 User Guide Conrad Kosowsky December 2019 [email protected]

A Ruse Secluded Character Set for the Source

The Unicode Cookbook for Linguists: Managing Writing Systems Using Orthography Profiles

PCL PC-8, Code Page 437 Page 1 of 5 PCL PC-8, Code Page 437

Assessment of Options for Handling Full Unicode Character Encodings in MARC21 a Study for the Library of Congress

Legacy Character Sets & Encodings

Basis Technology Unicode対応ライブラリスペックシート文字コードその他の名称 Adobe-Standard-Encoding A

Implementing Cross-Locale CJKV Code Conversion

San José, October 2, 2000 Feel Free to Distribute This Text

IPA Extensions

Iso/Iec Jtc1/Sc2/Wg2 N5148 L2/20-266