Implications of Punctuation Mark Normalization on Text Retrieval

IMPLICATIONS OF PUNCTUATION MARK NORMALIZATION ON TEXT RETRIEVAL

Eungi Kim, B.S., M.S.

Dissertation Prepared for the Degree of
DOCTOR OF PHILOSOPHY

UNIVERSITY OF NORTH TEXAS
August 2013

APPROVED:
William E. Moen, Major Professor
Brian C. O’Connor, Committee Member
Christina Wasson, Committee Member
Suliman Hawamdeh, Chair of the Department of Library and Information Sciences
Herman Totten, Dean of College of Information
Mark Wardell, Dean of the Toulouse Graduate School

Kim, Eungi. Implications of punctuation mark normalization on text retrieval. Doctor of Philosophy (Information Science), August 2013, 278 pp., 41 tables, 20 figures, references, 94 titles.

This research investigated issues related to normalizing punctuation marks from a text retrieval perspective. A punctuation-centric approach was undertaken by exploring changes in meaning, whitespace, word retrievability, and other issues related to normalizing punctuation marks. To investigate punctuation normalization issues, various frequency counts of punctuation marks and punctuation patterns were conducted using text drawn from the Gutenberg Project archive and the Usenet Newsgroup archive. A number of useful punctuation mark types that could aid in analyzing punctuation marks were discovered. This study identified two types of punctuation normalization procedures: (1) lexical independent (LI) punctuation normalization and (2) lexical oriented (LO) punctuation normalization. Using these two types of punctuation normalization procedures, this study discovered various effects of punctuation normalization in terms of different search query types. By analyzing the punctuation normalization problem in this manner, a wide range of issues was discovered, such as the need to define different types of searching, to disambiguate the role of punctuation marks, to normalize whitespace, and to index punctuated terms.
This study concluded that to achieve the most positive effect in a text retrieval environment, normalizing punctuation marks should be based on an extensive systematic analysis of punctuation marks and punctuation patterns and their related factors. The results of this study indicate that there were many challenges due to the complexity of language. Further, this study recommends avoiding a simplistic approach to punctuation normalization.

Copyright 2013 By Eungi Kim

ACKNOWLEDGEMENTS

First and foremost, I want to thank God for giving me the strength and perseverance to finish this journey. I wish to express my sincere gratitude and appreciation to everyone who contributed to this dissertation. In particular, I would like to thank my committee members for their patience and unyielding support. I’m grateful to my dissertation chair Dr. William Moen. He assisted with every aspect of completing this dissertation. We’ve had numerous detailed discussions, and his thoughtful comments were always useful. My other committee members, Dr. Brian O’Connor and Dr. Christina Wasson, whom I have known for all these years, contributed their time generously in serving on my proposal and dissertation defense. They offered me intellectually stimulating comments, and their valuable comments helped me to avoid making a number of potentially critical mistakes in this dissertation.

I am also indebted to Dr. Shawne Miksa, Ph.D. Program Coordinator at the University of North Texas. She always provided me with assistance and guidance whenever I needed it. I would also like to thank Dr. Suliman Hawamdeh, Dr. Linda Schamber, and Dr. Victor Prybutok for their roles in granting an extension to complete my Ph.D. dissertation. I also want to thank Malaina Mobley, other staff members, and the faculty of the Library and Information Sciences Department at the University of North Texas for their assistance in various matters.
Finally, this dissertation would not have been possible without the support of my family members. I would like to thank my wife Eunhee for her personal support and great patience. My two precious daughters, Minhae and Ester, have been patient with me for all these years. My sister Eunsil has been a reliable support in my life. I cannot thank my father Young Jong Kim and my mother Sook Ja Kim enough for their endless love and support. They helped me with everything I needed to continue my education and complete this dissertation.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES

CHAPTER 1 INTRODUCTION
    The Notion of Punctuation Normalization
    Research Motivation
    Theoretical Framework
        The Linguistic (Computational) Perspective
        Information Retrieval (Text Retrieval)
        Information Extraction (IE)
        Integrated Theoretical Framework
    Research Questions
    Research Design and Data Collection
    Limitation of the Study
    Organization

CHAPTER 2 BACKGROUND AND LITERATURE REVIEW
    Linguistic and Computational Analysis of Punctuation Marks
        Punctuation Mark Categorization
        Frequency Analysis of Punctuation Marks and Punctuation Pattern
    Punctuation Marks in Text Processing Related Procedures
        Tokenization
        Indexing and Punctuation Marks
        Effectiveness of Search Queries and Punctuation Normalization
    Existing Approaches to Punctuation Normalization Practices
        Google Search Engine
        LexisNexis
    Concepts Related to Punctuation Normalization
        Character Encoding and Character Normalization
        Normalization of Capitalized Words
        Punctuation Marks in Forms of Words and Expressions
        Abbreviations and Non-Standard Words
        Non-Standard Expressions
    Chapter Summary

CHAPTER 3 METHODOLOGY AND RESEARCH DESIGN
    Overview of Research Strategy
    The Analysis Setting
    Methodological Procedures
    Highlights of Methodological Procedures
        Tokenization (Segmentation)
        Frequency Counts and Punctuation Mark Types
        Additional Types of Punctuation
        Punctuation Normalization Types
        Types of Search Queries and Punctuation Normalization Effects
        The Normalization Demonstration
    Methodological Limitation
    Chapter Summary