Package 'Rebus'

Total Page:16

File Type:pdf, Size:1020Kb

Package 'Rebus' Package ‘rebus’ April 25, 2017 Type Package Title Build Regular Expressions in a Human Readable Way Version 0.1-3 Date 2017-04-25 Author Richard Cotton [aut, cre] Maintainer Richard Cotton <[email protected]> Description Build regular expressions piece by piece using human readable code. This package is designed for interactive use. For package development, use the rebus.* dependencies. Depends R (>= 3.1.0) Imports rebus.base (>= 0.0-3), rebus.datetimes, rebus.numbers, rebus.unicode (>= 0.0-2) Suggests testthat License Unlimited LazyLoad yes LazyData yes Acknowledgments Development of this package was partially funded by the Proteomics Core at Weill Cornell Medical College in Qatar <http://qatar-weill.cornell.edu>. The Core is supported by 'Biomedical Research Program' funds, a program funded by Qatar Foundation. RoxygenNote 6.0.1 Collate 'export-base.R' 'export-datetimes.R' 'export-numbers.R' 'export-unicode.R' 'imports.R' 'regex-package.R' NeedsCompilation no Repository CRAN Date/Publication 2017-04-25 21:42:46 UTC 1 2 Anchors R topics documented: Anchors . .2 as.regex . .3 Backreferences . .3 capture . .3 CharacterClasses . .3 char_class . .3 ClassGroups . .4 Concatenation . .4 DateTime . .4 escape_special . .4 exactly . .4 format.regex . .5 get_weekdays . .5 IsoClasses . .5 literal . .5 lookahead . .5 modify_mode . .6 number_range . .6 or ..............................................6 rebus.............................................6 recursive . .8 regex.............................................8 repeated . .8 ReplacementCase . .8 roman . .9 SpecialCharacters . .9 Unicode . .9 UnicodeGeneralCategory . .9 UnicodeOperators . .9 UnicodeProperty . 10 whole_word . 10 WordBoundaries . 10 Index 11 Anchors The start or end of a string Description See Anchors. as.regex 3 as.regex Convert or test for regex objects Description See as.regex. Backreferences Backreferences Description See Backreferences. capture Capture a token, or not Description See capture. CharacterClasses Class Constants Description See CharacterClasses. char_class A range or char_class of characters Description See char_class. 4 exactly ClassGroups Character classes Description See ClassGroups. Concatenation Combine strings together Description See Concatenation. DateTime Date-time regexes Description See DateTime. escape_special Escape special characters Description See escape_special. exactly Make a regex exact Description See exactly. format.regex 5 format.regex Print or format regex objects Description See format.regex. get_weekdays Get the days of the week or months of the year Description See get_weekdays. IsoClasses ISO 8601 date-time classes Description See IsoClasses. literal Treat part of a regular expression literally Description See literal. lookahead Lookaround Description See lookahead. 6 rebus modify_mode Apply mode modifiers Description See modify_mode. number_range Generate a regular expression for a number range Description See number_range. or Alternation Description See or. rebus rebus: Regular Expression Builder, Um, Something Description Build regular expressions in a human readable way. Details Regular expressions are a very powerful tool, but the syntax is terse enough to be difficult to read. This makes bugs easy to introduce, and hard to find. This package contains functions to make building regular expressions easier. Author(s) Richard Cotton <[email protected]> rebus 7 See Also regex and regexpr The ‘stringr‘ and ‘stringi‘ packages provide tools for matching regular expres- sions and nicely complement this package. http://www.regular-expressions.info has good advice on using regular expression in R. In particular, see http://www.regular-expressions. info/rlanguage.html and http://www.regular-expressions.info/examples.html https: //www.debuggex.com is a visual regex debugging and testing site. Examples ### Match a hex colour, like `"#99af01"` # This reads *Match a hash, followed by six hexadecimal values.* "#" %R% hex_digit(6) # To match only a hex colour and nothing else, you can add anchors to the # start and end of the expression. START %R% "#" %R% hex_digit(6) %R% END ### Simple email address matching. # This reads *Match one or more letters, numbers, dots, underscores, percents, # plusses or hyphens. Then match an 'at' symbol. Then match one or more letters, # numbers, dots, or hyphens. Then match a dot. Then match two to four letters.* one_or_more(char_class(ASCII_ALNUM %R% "._%+-")) %R% "@" %R% one_or_more(char_class(ASCII_ALNUM %R% ".-")) %R% DOT %R% ascii_alpha(2, 4) ### IP address matching. # First we need an expression to match numbers between 0 and 255. Both the # following syntaxes read *Match two then five then a number between zero and # five. Or match two then a number between zero and four then a digit. Or match # an optional zero or one followed by an optional digit folowed by a compulsory # digit. Make this a single token, but don't capture it.* # Using the %|% operator ip_element <- group( "25" %R% char_range(0, 5) %|% "2" %R% char_range(0, 4) %R% ascii_digit() %|% optional(char_class("01")) %R% optional(ascii_digit()) %R% ascii_digit() ) # The same again, this time using the or function ip_element <- or( "25" %R% char_range(0, 5), "2" %R% char_range(0, 4) %R% ascii_digit(), optional(char_class("01")) %R% optional(ascii_digit()) %R% ascii_digit() ) # It's easier to write using number_range, though it isn't quite as optimal 8 ReplacementCase # as handcrafted regexes. number_range(0, 255, allow_leading_zeroes = TRUE) # Now an IP address consists of 4 of these numbers separated by dots. This # reads *Match a word boundary. Then create a token from an `ip_element` # followed by a dot, and repeat it three times. Then match another `ip_element` # followed by a word boundary.* BOUNDARY %R% repeated(group(ip_element %R% DOT), 3) %R% ip_element %R% BOUNDARY recursive Make the regular expression recursive. Description See recursive. regex Create a regex Description See regex. repeated Repeat values Description See repeated. ReplacementCase Force the case of replacement values Description See ReplacementCase. roman 9 roman Roman numerals Description See roman. SpecialCharacters Special characters Description See SpecialCharacters. Unicode Unicode classes Description See Unicode. UnicodeGeneralCategory Unicode General Categories Description See UnicodeGeneralCategory. UnicodeOperators Unicode Operators Description See UnicodeOperators. 10 WordBoundaries UnicodeProperty Unicode Properties Description See UnicodeProperty. whole_word Match a whole word Description See whole_word. WordBoundaries Word boundaries Description See WordBoundaries. Index %R% (Concatenation),4 arabic_presentation_forms_b (Unicode),9 %c% (Concatenation),4 ARABIC_SUPPLEMENT (Unicode),9 arabic_supplement (Unicode),9 ADDITIONAL_ARROWS (Unicode),9 ARMENIAN (Unicode),9 additional_arrows (Unicode),9 armenian (Unicode),9 AEGEAN_NUMBERS (Unicode),9 ARMENIAN_LIGATURES (Unicode),9 aegean_numbers (Unicode),9 armenian_ligatures (Unicode),9 ALCHEMICAL_SYMBOLS (Unicode),9 as.regex, 3,3 alchemical_symbols (Unicode),9 as_lower (ReplacementCase),8 ALNUM (CharacterClasses),3 as_upper (ReplacementCase),8 alnum (ClassGroups),4 ASCII_ALNUM (CharacterClasses),3 ALPHA (CharacterClasses),3 ascii_alnum (ClassGroups),4 alpha (ClassGroups),4 ASCII_ALPHA (CharacterClasses),3 ALPHABETIC_PRESENTATION_FORMS ascii_alpha (ClassGroups),4 (Unicode),9 ASCII_DIGIT (CharacterClasses),3 alphabetic_presentation_forms ascii_digit (ClassGroups),4 (Unicode),9 ASCII_LOWER (CharacterClasses),3 AM_PM (DateTime),4 ascii_lower (ClassGroups),4 Anchors, 2,2 ASCII_UPPER (CharacterClasses),3 ANCIENT_GREEK_MUSICAL_NOTATION ascii_upper (ClassGroups),4 (Unicode),9 AVESTAN (Unicode),9 ancient_greek_musical_notation avestan (Unicode),9 (Unicode),9 ANCIENT_GREEK_NUMBERS (Unicode),9 Backreferences, 3,3 ancient_greek_numbers (Unicode),9 BACKSLASH (SpecialCharacters),9 ANCIENT_SYMBOLS (Unicode),9 BALINESE (Unicode),9 ancient_symbols (Unicode),9 balinese (Unicode),9 ANY_CHAR (CharacterClasses),3 BAMUN (Unicode),9 any_char (ClassGroups),4 bamun (Unicode),9 ARABIC (Unicode),9 BAMUN_SUPPLEMENT (Unicode),9 arabic (Unicode),9 bamun_supplement (Unicode),9 ARABIC_EXTENDED_A (Unicode),9 BASSA_VAH (Unicode),9 arabic_extended_a (Unicode),9 bassa_vah (Unicode),9 ARABIC_MATHEMATICAL_ALPHANUMERIC_SYMBOLS BATAK (Unicode),9 (Unicode),9 batak (Unicode),9 arabic_mathematical_alphanumeric_symbols BENGALI_AND_ASSAMESE (Unicode),9 (Unicode),9 bengali_and_assamese (Unicode),9 ARABIC_PRESENTATION_FORMS_A (Unicode),9 BLANK (CharacterClasses),3 arabic_presentation_forms_a (Unicode),9 blank (ClassGroups),4 ARABIC_PRESENTATION_FORMS_B (Unicode),9 BLOCK_ELEMENTS (Unicode),9 11 12 INDEX block_elements (Unicode),9 CJK_COMPATIBILITY_IDEOGRAPHS_SUPPLEMENT BOPOMOFO (Unicode),9 (Unicode),9 bopomofo (Unicode),9 cjk_compatibility_ideographs_supplement BOPOMOFO_EXTENDED (Unicode),9 (Unicode),9 bopomofo_extended (Unicode),9 CJK_IDEOGRAPHIC_DESCRIPTION_CHARACTERS BOUNDARY (WordBoundaries), 10 (Unicode),9 BOX_DRAWING (Unicode),9 cjk_ideographic_description_characters box_drawing (Unicode),9 (Unicode),9 BRAHMI (Unicode),9 CJK_STROKES (Unicode),9 brahmi (Unicode),9 cjk_strokes (Unicode),9 BRAILLE_PATTERNS (Unicode),9 CJK_SYMBOLS_AND_PUNCTUATION (Unicode),9 braille_patterns (Unicode),9 cjk_symbols_and_punctuation (Unicode),9 BUGINESE (Unicode),9 CJK_UNIFIED_IDEOGRAPHS (Unicode),9 buginese (Unicode),9 cjk_unified_ideographs (Unicode),9 BUHID (Unicode),9 CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A buhid (Unicode),9 (Unicode),9 BYZANTINE_MUSICAL_SYMBOLS
Recommended publications
  • Background I. Names
    Background I. Names 1. China It used to be thought that the name ‘China’ derived from the name of China’s early Qin dynasty (Chin or Ch’in in older transcriptions), whose rulers conquered all rivals and initiated the dynasty in 221 BC. But, as Wilkinson notes (Chinese History: A Manual: 753, and fn 7), the original pronunciation of the name ‘Qin’ was rather different, and would make it an unlikely source for the name China. Instead, China is thought to derive from a Persian root, and was, apparently, first used for porcelain, and only later applied to the country from which the finest examples of that material came. Another name, Cathay, now rather poetic in English, but surviving as the regular name for the country in languages such as Russian (Kitai), is said to derive from the name of the Khitan Tarters, who formed the Liao dynasty in north China in the 10th century. The Khitan dynasty was the first to make a capital on the site of Beijing. The Chinese now call their country Zhōngguó, often translated as ‘Middle Kingdom’. Originally, this name meant the central – or royal – state of the many that occupied the region prior to the unification of Qin. Other names were used before Zhōngguó became current. One of the earliest was Huá (or Huáxià, combining Huá with the name of the earliest dynasty, the Xià). Xià, in fact, combined with the Zhōng of Zhōngguó, appears in the modern official names of the country, as we see below. 2. Chinese places a) The People’s Republic of China (PRC) [Zhōnghuá Rénmín Gònghéguó] This is the political entity proclaimed by Máo Zédōng when he gave his speech (‘China has risen again’) at the Gate of Heavenly Peace [Tiān’ān Mén] in Beijing on October 1, 1949.
    [Show full text]
  • UAX #44: Unicode Character Database File:///D:/Uniweb-L2/Incoming/08249-Tr44-3D1.Html
    UAX #44: Unicode Character Database file:///D:/Uniweb-L2/Incoming/08249-tr44-3d1.html Technical Reports L2/08-249 Working Draft for Proposed Update Unicode Standard Annex #44 UNICODE CHARACTER DATABASE Version Unicode 5.2 draft 1 Authors Mark Davis ([email protected]) and Ken Whistler ([email protected]) Date 2008-7-03 This Version http://www.unicode.org/reports/tr44/tr44-3.html Previous http://www.unicode.org/reports/tr44/tr44-2.html Version Latest Version http://www.unicode.org/reports/tr44/ Revision 3 Summary This annex consolidates information documenting the Unicode Character Database. Status This is a draft document which may be updated, replaced, or superseded by other documents at any time. Publication does not imply endorsement by the Unicode Consortium. This is not a stable document; it is inappropriate to cite this document as other than a work in progress. A Unicode Standard Annex (UAX) forms an integral part of the Unicode Standard, but is published online as a separate document. The Unicode Standard may require conformance to normative content in a Unicode Standard Annex, if so specified in the Conformance chapter of that version of the Unicode Standard. The version number of a UAX document corresponds to the version of the Unicode Standard of which it forms a part. Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this annex is found in Unicode Standard Annex #41, “Common References for Unicode Standard Annexes.” For the latest version of the Unicode Standard, see [Unicode]. For a list of current Unicode Technical Reports, see [Reports].
    [Show full text]
  • Tungumál, Letur Og Einkenni Hópa
    Tungumál, letur og einkenni hópa Er letur ómissandi í baráttu hópa við ríkjandi öfl? Frá Tifinagh og rúnaristum til Pixação Þorleifur Kamban Þrastarson Lokaritgerð til BA-prófs Listaháskóli Íslands Hönnunar- og arkitektúrdeild Desember 2016 Í þessari ritgerð er reynt að rökstyðja þá fullyrðingu að letur sé mikilvægt og geti jafnvel undir vissum kringumstæðum verið eitt mikilvægasta vopnið í baráttu hópa fyrir tilveru sinni, sjálfsmynd og stað í samfélagi. Með því að líta á þrjú ólík dæmi, Tifinagh, rúnaletur og Pixação, hvert frá sínum stað, menningarheimi og tímabili er ætlunin að sýna hvernig saga leturs og týpógrafíu hefur samtvinnast og mótast af samfélagi manna og haldist í hendur við einkenni þjóða og hópa fólks sem samsama sig á einn eða annan hátt. Einkenni hópa myndast oft sem andsvar við ytri öflum sem ógna menningu, auði eða tilverurétti hópsins. Hópar nota mismunandi leturtýpur til þess að tjá sig, tengjast og miðla upplýsingum. Það skiptir ekki eingöngu máli hvað þú skrifar heldur hvernig, með hvaða aðferðum og á hvaða efni. Skilaboðin eru fólgin í letrinu sjálfu en ekki innihaldi letursins. Letur er útlit upplýsingakerfis okkar og hefur notkun ritmáls og leturs aldrei verið meiri í heiminum sem og læsi. Ritmál og letur eru algjörlega samofnir hlutir og ekki hægt að slíta annað frá öðru. Ekki er hægt að koma frá sér ritmáli nema í letri og þessi tvö hugtök flækjast því oft saman. Í ljósi athugana á þessum þremur dæmum í ritgerðinni dreg ég þá ályktun að letur spilar og hefur spilað mikilvægt hlutverk í einkennum þjóða og hópa. Letur getur, ásamt tungumálinu, stuðlað að því að viðhalda, skapa eða eyðileggja menningu og menningarlegar tenginga Tungumál, letur og einkenni hópa Er letur ómissandi í baráttu hópa við ríkjandi öfl? Frá Tifinagh og rúnaristum til Pixação Þorleifur Kamban Þrastarson Lokaritgerð til BA-prófs í Grafískri hönnun Leiðbeinandi: Óli Gneisti Sóleyjarson Grafísk hönnun Hönnunar- og arkitektúrdeild Desember 2016 Ritgerð þessi er 6 eininga lokaritgerð til BA-prófs í Grafískri hönnun.
    [Show full text]
  • The Ogham-Runes and El-Mushajjar
    c L ite atu e Vo l x a t n t r n o . o R So . u P R e i t ed m he T a s . 1 1 87 " p r f ro y f r r , , r , THE OGHAM - RUNES AND EL - MUSHAJJAR A D STU Y . BY RICH A R D B URTO N F . , e ad J an uar 22 (R y , PART I . The O ham-Run es g . e n u IN tr ating this first portio of my s bj ect, the - I of i Ogham Runes , have made free use the mater als r John collected by Dr . Cha les Graves , Prof. Rhys , and other students, ending it with my own work in the Orkney Islands . i The Ogham character, the fair wr ting of ' Babel - loth ancient Irish literature , is called the , ’ Bethluis Bethlm snion e or , from its initial lett rs, like “ ” Gree co- oe Al hab e t a an d the Ph nician p , the Arabo “ ” Ab ad fl d H ebrew j . It may brie y be describe as f b ormed y straight or curved strokes , of various lengths , disposed either perpendicularly or obliquely to an angle of the substa nce upon which the letters n . were i cised , punched, or rubbed In monuments supposed to be more modern , the letters were traced , b T - N E E - A HE OGHAM RU S AND L M USH JJ A R . n not on the edge , but upon the face of the recipie t f n l o t sur ace ; the latter was origi al y wo d , s aves and tablets ; then stone, rude or worked ; and , lastly, metal , Th .
    [Show full text]
  • Schwa Deletion: Investigating Improved Approach for Text-To-IPA System for Shiri Guru Granth Sahib
    ISSN (Online) 2278-1021 ISSN (Print) 2319-5940 International Journal of Advanced Research in Computer and Communication Engineering Vol. 4, Issue 4, April 2015 Schwa Deletion: Investigating Improved Approach for Text-to-IPA System for Shiri Guru Granth Sahib Sandeep Kaur1, Dr. Amitoj Singh2 Pursuing M.E, CSE, Chitkara University, India 1 Associate Director, CSE, Chitkara University , India 2 Abstract: Punjabi (Omniglot) is an interesting language for more than one reasons. This is the only living Indo- Europen language which is a fully tonal language. Punjabi language is an abugida writing system, with each consonant having an inherent vowel, SCHWA sound. This sound is modifiable using vowel symbols attached to consonant bearing the vowel. Shri Guru Granth Sahib is a voluminous text of 1430 pages with 511,874 words, 1,720,345 characters, and 28,534 lines and contains hymns of 36 composers written in twenty-two languages in Gurmukhi script (Lal). In addition to text being in form of hymns and coming from so many composers belonging to different languages, what makes the language of Shri Guru Granth Sahib even more different from contemporary Punjabi. The task of developing an accurate Letter-to-Sound system is made difficult due to two further reasons: 1. Punjabi being the only tonal language 2. Historical and Cultural circumstance/period of writings in terms of historical and religious nature of text and use of words from multiple languages and non-native phonemes. The handling of schwa deletion is of great concern for development of accurate/ near perfect system, the presented work intend to report the state-of-the-art in terms of schwa deletion for Indian languages, in general and for Gurmukhi Punjabi, in particular.
    [Show full text]
  • Localizing Into Chinese: the Two Most Common Questions White Paper Answered
    Localizing into Chinese: the two most common questions White Paper answered Different writing systems, a variety of languages and dialects, political and cultural sensitivities and, of course, the ever-evolving nature of language itself. ALPHA CRC LTD It’s no wonder that localizing in Chinese can seem complicated to the uninitiated. St Andrew’s House For a start, there is no single “Chinese” language to localize into. St Andrew’s Road Cambridge CB4 1DL United Kingdom Most Westerners referring to the Chinese language probably mean Mandarin; but @alpha_crc you should definitely not assume this as the de facto language for all audiences both within and outside mainland China. alphacrc.com To clear up any confusion, we talked to our regional language experts to find out the most definitive and useful answers to two of the most commonly asked questions when localizing into Chinese. 1. What’s the difference between Simplified Chinese and Traditional Chinese? 2. Does localizing into “Chinese” mean localizing into Mandarin, Cantonese or both? Actually, these are really pertinent questions because they get to the heart of some of the linguistic, political and cultural complexities that need to be taken into account when localizing for this region. Because of the important nature of these issues, we’ve gone a little more in depth than some of the articles on related themes elsewhere on the internet. We think you’ll find the answers a useful starting point for any considerations about localizing for the Chinese-language market. And, taking in linguistic nuances and cultural history, we hope you’ll find them an interesting read too.
    [Show full text]
  • A Translation of the Malia Altar Stone
    MATEC Web of Conferences 125, 05018 (2017) DOI: 10.1051/ matecconf/201712505018 CSCC 2017 A Translation of the Malia Altar Stone Peter Z. Revesz1,a 1 Department of Computer Science, University of Nebraska-Lincoln, Lincoln, NE, 68588, USA Abstract. This paper presents a translation of the Malia Altar Stone inscription (CHIC 328), which is one of the longest known Cretan Hieroglyph inscriptions. The translation uses a synoptic transliteration to several scripts that are related to the Malia Altar Stone script. The synoptic transliteration strengthens the derived phonetic values and allows avoiding certain errors that would result from reliance on just a single transliteration. The synoptic transliteration is similar to a multiple alignment of related genomes in bioinformatics in order to derive the genetic sequence of a putative common ancestor of all the aligned genomes. 1 Introduction symbols. These attempts so far were not successful in deciphering the later two scripts. Cretan Hieroglyph is a writing system that existed in Using ideas and methods from bioinformatics, eastern Crete c. 2100 – 1700 BC [13, 14, 25]. The full Revesz [20] analyzed the evolutionary relationships decipherment of Cretan Hieroglyphs requires a consistent within the Cretan script family, which includes the translation of all known Cretan Hieroglyph texts not just following scripts: Cretan Hieroglyph, Linear A, Linear B the translation of some examples. In particular, many [6], Cypriot, Greek, Phoenician, South Arabic, Old authors have suggested translations for the Phaistos Disk, Hungarian [9, 10], which is also called rovásírás in the most famous and longest Cretan Hieroglyph Hungarian and also written sometimes as Rovas in inscription, but in general they were unable to show that English language publications, and Tifinagh.
    [Show full text]
  • Assessment of Options for Handling Full Unicode Character Encodings in MARC21 a Study for the Library of Congress
    1 Assessment of Options for Handling Full Unicode Character Encodings in MARC21 A Study for the Library of Congress Part 1: New Scripts Jack Cain Senior Consultant Trylus Computing, Toronto 1 Purpose This assessment intends to study the issues and make recommendations on the possible expansion of the character set repertoire for bibliographic records in MARC21 format. 1.1 “Encoding Scheme” vs. “Repertoire” An encoding scheme contains codes by which characters are represented in computer memory. These codes are organized according to a certain methodology called an encoding scheme. The list of all characters so encoded is referred to as the “repertoire” of characters in the given encoding schemes. For example, ASCII is one encoding scheme, perhaps the one best known to the average non-technical person in North America. “A”, “B”, & “C” are three characters in the repertoire of this encoding scheme. These three characters are assigned encodings 41, 42 & 43 in ASCII (expressed here in hexadecimal). 1.2 MARC8 "MARC8" is the term commonly used to refer both to the encoding scheme and its repertoire as used in MARC records up to 1998. The ‘8’ refers to the fact that, unlike Unicode which is a multi-byte per character code set, the MARC8 encoding scheme is principally made up of multiple one byte tables in which each character is encoded using a single 8 bit byte. (It also includes the EACC set which actually uses fixed length 3 bytes per character.) (For details on MARC8 and its specifications see: http://www.loc.gov/marc/.) MARC8 was introduced around 1968 and was initially limited to essentially Latin script only.
    [Show full text]
  • Sinitic Language and Script in East Asia: Past and Present
    SINO-PLATONIC PAPERS Number 264 December, 2016 Sinitic Language and Script in East Asia: Past and Present edited by Victor H. Mair Victor H. Mair, Editor Sino-Platonic Papers Department of East Asian Languages and Civilizations University of Pennsylvania Philadelphia, PA 19104-6305 USA [email protected] www.sino-platonic.org SINO-PLATONIC PAPERS FOUNDED 1986 Editor-in-Chief VICTOR H. MAIR Associate Editors PAULA ROBERTS MARK SWOFFORD ISSN 2157-9679 (print) 2157-9687 (online) SINO-PLATONIC PAPERS is an occasional series dedicated to making available to specialists and the interested public the results of research that, because of its unconventional or controversial nature, might otherwise go unpublished. The editor-in-chief actively encourages younger, not yet well established, scholars and independent authors to submit manuscripts for consideration. Contributions in any of the major scholarly languages of the world, including romanized modern standard Mandarin (MSM) and Japanese, are acceptable. In special circumstances, papers written in one of the Sinitic topolects (fangyan) may be considered for publication. Although the chief focus of Sino-Platonic Papers is on the intercultural relations of China with other peoples, challenging and creative studies on a wide variety of philological subjects will be entertained. This series is not the place for safe, sober, and stodgy presentations. Sino- Platonic Papers prefers lively work that, while taking reasonable risks to advance the field, capitalizes on brilliant new insights into the development of civilization. Submissions are regularly sent out to be refereed, and extensive editorial suggestions for revision may be offered. Sino-Platonic Papers emphasizes substance over form.
    [Show full text]
  • Shahmukhi to Gurmukhi Transliteration System: a Corpus Based Approach
    Shahmukhi to Gurmukhi Transliteration System: A Corpus based Approach Tejinder Singh Saini1 and Gurpreet Singh Lehal2 1 Advanced Centre for Technical Development of Punjabi Language, Literature & Culture, Punjabi University, Patiala 147 002, Punjab, India [email protected] http://www.advancedcentrepunjabi.org 2 Department of Computer Science, Punjabi University, Patiala 147 002, Punjab, India [email protected] Abstract. This research paper describes a corpus based transliteration system for Punjabi language. The existence of two scripts for Punjabi language has created a script barrier between the Punjabi literature written in India and in Pakistan. This research project has developed a new system for the first time of its kind for Shahmukhi script of Punjabi language. The proposed system for Shahmukhi to Gurmukhi transliteration has been implemented with various research techniques based on language corpus. The corpus analysis program has been run on both Shahmukhi and Gurmukhi corpora for generating statistical data for different types like character, word and n-gram frequencies. This statistical analysis is used in different phases of transliteration. Potentially, all members of the substantial Punjabi community will benefit vastly from this transliteration system. 1 Introduction One of the great challenges before Information Technology is to overcome language barriers dividing the mankind so that everyone can communicate with everyone else on the planet in real time. South Asia is one of those unique parts of the world where a single language is written in different scripts. This is the case, for example, with Punjabi language spoken by tens of millions of people but written in Indian East Punjab (20 million) in Gurmukhi script (a left to right script based on Devanagari) and in Pakistani West Punjab (80 million), written in Shahmukhi script (a right to left script based on Arabic), and by a growing number of Punjabis (2 million) in the EU and the US in the Roman script.
    [Show full text]
  • Documenting Endangered Alphabets
    Documenting endangered alphabets Industry Focus Tim Brookes Three years ago, acting on a notion so whimsi- cal I assumed it was a kind of presenile monoma- nia, I began carving endangered alphabets. The Tdisclaimers start right away. I’m not a linguist, an anthropologist, a cultural historian or even a woodworker. I’m a writer — but I had recently started carving signs for friends and family, and I stumbled on Omniglot.com, an online encyclo- pedia of the world’s writing systems, and several things had struck me forcibly. For a start, even though the world has more than 6,000 Figure 1: Tifinagh. languages (some of which will be extinct even by the time this article goes to press), it has fewer than 100 scripts, and perhaps a passing them on as a series of items for consideration and dis- third of those are endangered. cussion. For example, what does a written language — any writ- Working with a set of gouges and a paintbrush, I started to ten language — look like? The Endangered Alphabets highlight document as many of these scripts as I could find, creating three this question in a number of interesting ways. As the forces of exhibitions and several dozen individual pieces that depicted globalism erode scripts such as these, the number of people who words, phrases, sentences or poems in Syriac, Bugis, Baybayin, can write them dwindles, and the range of examples of each Samaritan, Makassarese, Balinese, Javanese, Batak, Sui, Nom, script is reduced. My carvings may well be the only examples Cherokee, Inuktitut, Glagolitic, Vai, Bassa Vah, Tai Dam, Pahauh of, say, Samaritan script or Tifinagh that my visitors ever see.
    [Show full text]
  • Exhibition Checklist I. Creating Palmyra's Legacy
    EXHIBITION CHECKLIST 1. Caravan en route to Palmyra, anonymous artist after Louis-François Cassas, ca. 1799. Proof-plate etching. 15.5 x 27.3 in. (29.2 x 39.5 cm). The Getty Research Institute, 840011 I. CREATING PALMYRA'S LEGACY Louis-François Cassas Artist and Architect 2. Colonnade Street with Temple of Bel in background, Georges Malbeste and Robert Daudet after Louis-François Cassas. Etching. Plate mark: 16.9 x 36.6 in. (43 x 93 cm). From Voyage pittoresque de la Syrie, de la Phoénicie, de la Palestine, et de la Basse Egypte (Paris, ca. 1799), vol. 1, pl. 58. The Getty Research Institute, 840011 1 3. Architectural ornament from Palmyra tomb, Jean-Baptiste Réville and M. A. Benoist after Louis-François Cassas. Etching. Plate mark: 18.3 x 11.8 in. (28.5 x 45 cm). From Voyage pittoresque de la Syrie, de la Phoénicie, de la Palestine, et de la Basse Egypte (Paris, ca. 1799), vol. 1, pl. 137. The Getty Research Institute, 840011 4. Louis-François Cassas sketching outside of Homs before his journey to Palmyra (detail), Simon-Charles Miger after Louis-François Cassas. Etching. Plate mark: 8.4 x 16.1 in. (21.5 x 41cm). From Voyage pittoresque de la Syrie, de la Phoénicie, de la Palestine, et de la Basse Egypte (Paris, ca. 1799), vol. 1, pl. 20. The Getty Research Institute, 840011 5. Louis-François Cassas presenting gifts to Bedouin sheikhs, Simon Charles-Miger after Louis-François Cassas. Etching. Plate mark: 8.4 x 16.1 in. (21.5 x 41 cm).
    [Show full text]