True Scripts in Library Catalogs – the Way Forward
ALCTS "Library Catalogs and Non-Roman Scripts" Program, ALA Annual Conference, Orlando
www.ala.org/alcts

Joan M. Aliprand, Senior Analyst, RLG
© 2004 RLG

Why the current limitation?
• Coordination!
• As complex as Format Integration!

Script Capability and Data Exchange
Before Unicode
[Diagram: System A (JACKPHY capable) and System B (CJK capable) exchange Latin data (ASCII, ANSEL), CJK data (EACC), and HAPY data (Hebrew, Arabic). A system that is marked as unable to handle a script can only use the romanized data.]
• JACKPHY = East Asian scripts (CJK), Arabic, and Hebrew
• CJK = scripts for Chinese, Japanese, Korean
• HAPY = scripts for Hebrew, Arabic, Persian, Yiddish

Script Capability and Data Exchange
With unrestrained Unicode
[Diagram: System U+ (Unicode capable) and System A (JACKPHY capable) exchange Latin data, CJK data, HAPY data, and other non-Roman data. Latin data is usable; the CJK, HAPY, and other non-Roman data are first shown with question marks. In the final build, the CJK and HAPY data are flagged as containing data outside the scope of the individual MARC 21 sets, and the CJK, HAPY, and other non-Roman data are marked "can only use the romanized data in the record".]
• JACKPHY = East Asian scripts (CJK), Arabic, and Hebrew
• CJK = scripts for Chinese, Japanese, Korean
• HAPY = scripts for Hebrew, Arabic, Persian, Yiddish

CHARACTERS OF THE MARC 21 RECORD
"MARC-8"
• Latin character sets (default)
  – Basic Latin "ASCII" (US standard ANSI X3.4)
  – Extended Latin "ANSEL" (ANSI/NISO Z39.47)
  – Euro sign € and Eszett ß for UKMARC alignment
• Supplementary character sets
  – Superscripts
  – Subscripts
  – Greek symbols
• Non-Roman character sets
  – Arabic, Cyrillic, EACC (CJK), Greek, Hebrew

CHARACTERS OF THE MARC 21 RECORD
Use of Unicode
• UTF-8 encoding form
• Designation of Unicode as encoding
  – Leader position 09, Character coding scheme
• Single character set, so no 066 field (Character Sets Present)
• Initial limitation to MARC-8 character repertoire
  – MARBI Proposal 98-18
• Exception for Canadian Aboriginal Syllabics
  – MARBI Proposal 2002-11
  – Encoded only as Unicode characters
  – Addition to MARC-8 sets not requested

Character Sets in the MARC 21 Record
Record Component | Character Sets Allowed
Leader | ASCII graphic characters
Directory | ASCII numeric characters
Control Fields (00X) | ASCII
Data Fields (except 880) | Basic Latin, Extended Latin plus € and ß, Superscripts, Subscripts, Greek symbols
Alternate Graphic Representation (880) | At least one non-Roman character set required; ASCII

Unicode in the MARC 21 Record
Record Component | MARC-8 Character Sets | Unicode (UTF-8)
Leader | ASCII graphic characters | ASCII graphic characters
Directory | ASCII numeric characters | ASCII numeric characters
Control Fields (00X) | ASCII | ASCII graphic characters
Data Fields (except 880) | Basic Latin, Extended Latin plus € and ß, Superscripts, Subscripts, Greek symbols | Latin, Common, Inherited
Alternate Graphic Representation (880) | At least one Script other than Latin; ASCII characters; All [column assignment of this row is not recoverable from the source]

Accented Letters?
• Two alternatives:
  – Strictly decomposed only?
    • Base letter followed by combining characters
  – Both composite characters and combining character sequences permitted
    • Use Unicode-defined canonical equivalence and canonical ordering

The Order of Data
• Sorting; Filing order; Collation
• Not only for presentation of results
  – Essential part of query matching
• Sorting by character code?
  – A primitive approach
  – DOES NOT WORK FOR UNICODE

Unicode Collation Algorithm
• Default order for all characters of each script
• Four levels of differences
  – Diacritics significant at Secondary level
  – Case significant at Tertiary level
• "Tailoring" methodology provided
  – For alphabetical order of a particular language
  – For library practices
• Does not do everything
  – Separate character folding needed in some cases
    • Medial, final forms in Hebrew; CJK ideographic variants

True Scripts in Authority Records
One of the factors that continues to delay the implementation of original scripts in NACO authority data is that not all places that support copies of NAF for input/update of NACO records can yet cope with all of the original scripts that might be included in these records. In order to maintain virtually identical copies of NAF, users are not allowed to create headings with original scripts that might be invisible or deleted in other copies of the file.
– Ed Glazier, RLG

Script Capability and Authority Records
As of July 1987
[Table: script capability of the LSP participants LC, RLG, and OCLC. Latin is marked for all three participants, CJK for two, and Cyrillic for one; the source does not preserve which columns the CJK and Cyrillic marks belong to.]

MARC 21 Format for Authority Records
• Data elements for non-Roman scripts
  – Same as in Format for Bibliographic Records
    • Alternate graphic representation (880 field)
    • Character sets specified (066 field)
    • Subfield 6 for field linkage, directionality
  – Assumption of 1:1 field equivalence
  – Not yet implemented in NAF, SAF
• Linkage between established headings
  – 7XX Heading Linking Entry fields

Excerpt from NAF Record
110 20 ‡a United Nations   (English)
410 20 ‡a Nations Unies   (French)
410 20 ‡a Naciones Unidas   (Spanish)
410 20 ‡a Organizatsiia Ob˝edinennykh Natsi   (Russian, ALA-LC romanization)
410 20 ‡a Lien he guo   (Chinese in pinyin, ALA-LC romanization)
410 20 ‡a Lien ho kuo   (Chinese, Wade-Giles romanization)
410 20 ‡a Umam al-Muttaḥidah   (Arabic, ALA-LC romanization)

Excerpt with True Scripts
110 20 ‡a United Nations
410 20 ‡a Nations Unies
410 20 ‡a Naciones Unidas
410 20 ‡a Organizatsiia Ob˝edinennykh Natsi
410 20 ‡a Lien he guo
410 20 ‡a Lien ho kuo
410 20 ‡a Umam al-Muttaḥidah
880 20 ‡6 410-00 ‡a
880 20 ‡6 410-00 ‡a
880 20 ‡6 410-00/r ‡a

[Diagram: six separate authority records, each with its own 110 2_ heading (United Nations, Nations Unies, Naciones Unidas, and three headings whose text is not preserved in the source), linked to one another through 710 fields.]

Linkage within Authority Records
• Romanized/non-Roman field pairs?
  – Need to see all data in authority record
  – Only one source of name or subject authority for 1XX
  – 1:many, many:1 relationships in authorities
UNLINKED 880 FIELDS
• Any need to use 880 fields?
  – Substitution of data as for bibliographic display does not apply
  – To simplify checking of field contents?
  – To simplify implementation?

Characters in Authority Records
Rules for 13 Characters as of 01/06/2001
Character | Rule for Use in Authority Records
Spacing circumflex | Substitute %5E for now
Spacing underscore | Substitute %5F for now
Spacing grave | Should not occur
Curly brackets (open & close) | Use
Spacing tilde | Substitute %7E for now
Degree sign | Substitute superscript 0 for now
Lowercase script l | Pass through; do not actively supply
Phono copyright mark | Pass through; do not actively supply
Copyright mark | Pass through; do not actively supply
Sharp | Substitute number sign for now
Inverted question mark | Use when available
Inverted exclamation mark | Use when available

Date | MARC 21 Character Sets | Graphic Chars. | Cumulative Total | Unicode Graphic Characters
1968 | Default | 145 | 145 |
1968 | Additional | 31 | 176 |
198x | Greek | 73 | 249 |
198x | EACC | 15,738 | 15,987 |
1986 | Cyrillic | 135 | 16,122 |
1988 | Hebrew | 78 | 16,200 |
1991 | Arabic | 173 | 16,373 | Version 1.0: 28,302
1992 | | | | Version 1.1: 34,169
1994 | Latin | 13 | 16,386 |
1996 | | | | Version 2.0: 38,885
2002 | | | | Version 3.0: 49,194
2002 | CAS (UTF-8) | 632 | 17,018 |
2003 | Euro, Eszett | 2 | 17,020 | Version 4.0: 96,382

Reality Check
"Unicode" is not a magic word
• Shorthand for The Unicode Standard
• Potential for multiscript support
• Unicode is not software
  – Underlying software comes from OEMs
• Unicode is the foundation, not the whole edifice
  – Many parts to a system

Corporate Members
Adobe Systems; Agence Intergouvernementale de la Francophonie; Apple Computer; Basis Technology; Hewlett-Packard; IBM; India (Ministry of Information Technology); Justsystem; Microsoft; Oracle; Pakistan (National Language Authority); PeopleSoft; RLG; SAP; Sun; Sybase

Associate Members
33 Associate Members including: The Church of Jesus Christ of Latter-day Saints; Columbia University; Endeavor Information Systems; Ex Libris; Innovative Interfaces; The Library Corporation; OCLC; SIL International; SIRSI; VTLS
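
Illustration (not part of the original slides)
The "Accented Letters?" and "The Order of Data" slides make two claims that are easy to check in code: a composite character and its decomposed sequence are canonically equivalent, and sorting by character code gives a poor filing order. The following is a minimal Python sketch using only the standard `unicodedata` module. The helper names `canonically_equal` and `naive_key` are hypothetical, and `naive_key` is only a rough stand-in that ignores accents and case; it is not the Unicode Collation Algorithm and supports no tailoring.

```python
import unicodedata

# The same accented letter in its two canonically equivalent forms.
composed = "\u00e9"      # e with acute as one composite character
decomposed = "e\u0301"   # base letter e followed by a combining acute accent


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFD (fully decomposed)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def naive_key(s: str) -> str:
    """Hypothetical sort key: drop combining marks and fold case.

    This only approximates a primary-level comparison; it is NOT the
    Unicode Collation Algorithm.
    """
    nfd = unicodedata.normalize("NFD", s)
    return "".join(c for c in nfd if not unicodedata.combining(c)).casefold()


names = ["Zola", "\u00c9mile", "emile", "Eco"]

# Sorting by character code puts every capital before every lowercase
# letter and pushes the accented capital to the very end.
by_code_point = sorted(names)

# Sorting with the accent- and case-insensitive key keeps the E names together.
by_naive_key = sorted(names, key=naive_key)

print(composed == decomposed)                   # False: different code sequences
print(canonically_equal(composed, decomposed))  # True: canonically equivalent
print(by_code_point)
print(by_naive_key)
```

A production catalog would more likely rely on a full collation library with tailoring for language and library filing rules, as the "Unicode Collation Algorithm" slide suggests; the sketch only shows why raw code order is not enough.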