Introduction to Japanese Computational Linguistics Francis Bond and Timothy Baldwin
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Hieroglyphs for the Information Age: Images As a Replacement for Characters for Languages Not Written in the Latin-1 Alphabet Akira Hasegawa
Rochester Institute of Technology RIT Scholar Works Theses Thesis/Dissertation Collections 5-1-1999 Hieroglyphs for the information age: Images as a replacement for characters for languages not written in the Latin-1 alphabet Akira Hasegawa Follow this and additional works at: http://scholarworks.rit.edu/theses Recommended Citation Hasegawa, Akira, "Hieroglyphs for the information age: Images as a replacement for characters for languages not written in the Latin-1 alphabet" (1999). Thesis. Rochester Institute of Technology. Accessed from This Thesis is brought to you for free and open access by the Thesis/Dissertation Collections at RIT Scholar Works. It has been accepted for inclusion in Theses by an authorized administrator of RIT Scholar Works. For more information, please contact [email protected]. Hieroglyphs for the Information Age: Images as a Replacement for Characters for Languages not Written in the Latin- 1 Alphabet by Akira Hasegawa A thesis project submitted in partial fulfillment of the requirements for the degree of Master of Science in the School of Printing Management and Sciences in the College of Imaging Arts and Sciences of the Rochester Institute ofTechnology May, 1999 Thesis Advisor: Professor Frank Romano School of Printing Management and Sciences Rochester Institute ofTechnology Rochester, New York Certificate ofApproval Master's Thesis This is to certify that the Master's Thesis of Akira Hasegawa With a major in Graphic Arts Publishing has been approved by the Thesis Committee as satisfactory for the thesis requirement for the Master ofScience degree at the convocation of May 1999 Thesis Committee: Frank Romano Thesis Advisor Marie Freckleton Gr:lduate Program Coordinator C. -
Morphological Clues to the Relationships of Japanese and Korean
Morphological clues to the relationships of Japanese and Korean Samuel E. Martin 0. The striking similarities in structure of the Turkic, Mongolian, and Tungusic languages have led scholars to embrace the perennially premature hypothesis of a genetic relationship as the "Altaic" family, and some have extended the hypothesis to include Korean (K) and Japanese (J). Many of the structural similarities that have been noticed, however, are widespread in languages of the world and characterize any well-behaved language of the agglutinative type in which object precedes verb and all modifiers precede what is mod- ified. Proof of the relationships, if any, among these languages is sought by comparing words which may exemplify putative phono- logical correspondences that point back to earlier systems through a series of well-motivated changes through time. The recent work of John Whitman on Korean and Japanese is an excellent example of productive research in this area. The derivative morphology, the means by which the stems of many verbs and nouns were created, appears to be largely a matter of developments in the individual languages, though certain formants have been proposed as putative cognates for two or more of these languages. Because of the relative shortness of the elements involved and the difficulty of pinning down their semantic functions (if any), we do well to approach the study of comparative morphology with caution, reconstructing in depth the earliest forms of the vocabulary of each language before indulging in freewheeling comparisons outside that domain. To a lesser extent, that is true also of the grammatical morphology, the affixes or particles that mark words as participants in the phrases, sentences (either overt clauses or obviously underlying proposi- tions), discourse blocks, and situational frames of reference that constitute the creative units of language use. -
Noun Group and Verb Group Identification for Hindi
Noun Group and Verb Group Identification for Hindi Smriti Singh1, Om P. Damani2, Vaijayanthi M. Sarma2 (1) Insideview Technologies (India) Pvt. Ltd., Hyderabad (2) Indian Institute of Technology Bombay, Mumbai, India [email protected], [email protected], [email protected] ABSTRACT We present algorithms for identifying Hindi Noun Groups and Verb Groups in a given text by using morphotactical constraints and sequencing that apply to the constituents of these groups. We provide a detailed repertoire of the grammatical categories and their markers and an account of their arrangement. The main motivation behind this work on word group identification is to improve the Hindi POS Tagger’s performance by including strictly contextual rules. Our experiments show that the introduction of group identification rules results in improved accuracy of the tagger and in the resolution of several POS ambiguities. The analysis and implementation methods discussed here can be applied straightforwardly to other Indian languages. The linguistic features exploited here are drawn from a range of well-understood grammatical features and are not peculiar to Hindi alone. KEYWORDS : POS tagging, chunking, noun group, verb group. Proceedings of COLING 2012: Technical Papers, pages 2491–2506, COLING 2012, Mumbai, December 2012. 2491 1 Introduction Chunking (local word grouping) is often employed to reduce the computational effort at the level of parsing by assigning partial structure to a sentence. A typical chunk, as defined by Abney (1994:257) consists of a single content word surrounded by a constellation of function words, matching a fixed template. Chunks, in computational terms are considered the truncated versions of typical phrase-structure grammar phrases that do not include arguments or adjuncts (Grover and Tobin 2006). -
SUPPORTING the CHINESE, JAPANESE, and KOREAN LANGUAGES in the OPENVMS OPERATING SYSTEM by Michael M. T. Yau ABSTRACT the Asian L
SUPPORTING THE CHINESE, JAPANESE, AND KOREAN LANGUAGES IN THE OPENVMS OPERATING SYSTEM By Michael M. T. Yau ABSTRACT The Asian language versions of the OpenVMS operating system allow Asian-speaking users to interact with the OpenVMS system in their native languages and provide a platform for developing Asian applications. Since the OpenVMS variants must be able to handle multibyte character sets, the requirements for the internal representation, input, and output differ considerably from those for the standard English version. A review of the Japanese, Chinese, and Korean writing systems and character set standards provides the context for a discussion of the features of the Asian OpenVMS variants. The localization approach adopted in developing these Asian variants was shaped by business and engineering constraints; issues related to this approach are presented. INTRODUCTION The OpenVMS operating system was designed in an era when English was the only language supported in computer systems. The Digital Command Language (DCL) commands and utilities, system help and message texts, run-time libraries and system services, and names of system objects such as file names and user names all assume English text encoded in the 7-bit American Standard Code for Information Interchange (ASCII) character set. As Digital's business began to expand into markets where common end users are non-English speaking, the requirement for the OpenVMS system to support languages other than English became inevitable. In contrast to the migration to support single-byte, 8-bit European characters, OpenVMS localization efforts to support the Asian languages, namely Japanese, Chinese, and Korean, must deal with a more complex issue, i.e., the handling of multibyte character sets. -
Title the Practice of Basic Informatics 2019 Author(S) Kita, Hajime
Title The Practice of Basic Informatics 2019 Kita, Hajime; Kitamura, Yumi; Hioki, Hirohisa; Sakai, Author(s) Hiroyuki; Lin, Donghui Citation (2020): 1-196 Issue Date 2020-03-08 URL http://hdl.handle.net/2433/246166 This book is licensed under CC-BY-NC-ND. For detail, access Right the following: https://creativecommons.org/licenses/by-nc- nd/4.0/deed.en Type Learning Material Textversion publisher Kyoto University The Practice of Basic Informatics 2019 Hajime Kita, Institute for Liberal Arts and Sciences, Yumi Kitamura, Kyoto University Library, Hirohisa Hioki, Graduate School of Human and Environmental Studies, Hiroyuki Sakai, Center for the Promotion of Excellence in Higher Education, Donghui Lin, Graduate School of Informatics Kyoto University Version 2020/03/08 0. Foreword Table of Contents 0. Foreword Kyoto University provides courses on ‘The Practice of Basic Informatics’ as part of its Liberal Arts and Sciences Program. The course is taught at many schools and departments, and course contents vary to meet the requirements of these schools and departments. This textbook is made open to the students of all schools that teach these courses. As stated in Chapter 1, this book is written with the aim of building ICT skills for study at university, that is, ICT skills for academic activities. Some topics may not be taught in class. However, the book is written for self-study by students. We include many exercises in this textbook so that instructors can select some of them for their classes, to accompany their teaching plans. The courses are given at the computer laboratories of the university, and the contents of this textbook assume that Windows 10 and Microsoft Office 2016 are available in these laboratories. -
An Analysis of the Content Words Used in a School Textbook, Team up English 3, Used for Grade 9 Students
[Pijarnsarid et. al., Vol.5 (Iss.3): March, 2017] ISSN- 2350-0530(O), ISSN- 2394-3629(P) ICV (Index Copernicus Value) 2015: 71.21 IF: 4.321 (CosmosImpactFactor), 2.532 (I2OR) InfoBase Index IBI Factor 3.86 Social AN ANALYSIS OF THE CONTENT WORDS USED IN A SCHOOL TEXTBOOK, TEAM UP ENGLISH 3, USED FOR GRADE 9 STUDENTS Sukontip Pijarnsarid*1, Prommintra Kongkaew2 *1, 2English Department, Graduate School, Ubon Ratchathani Rajabhat University, Thailand DOI: https://doi.org/10.29121/granthaalayah.v5.i3.2017.1761 Abstract The purpose of this study were to study the content words used in a school textbook, Team Up in English 3, used for Grade 9 students and to study the frequency of content words used in a school textbook, Team Up in English 3, used for Grade 9 students. The study found that nouns is used with the highest frequency (79), followed by verb (58), adjective (46), and adverb (24).With the nouns analyzed, it was found that the Modifiers + N used with the highest frequency (92.40%), the compound nouns were ranked in second (7.59 %). Considering the verbs used in the text, it was found that transitive verbs were most commonly used (77.58%), followed by intransitive verbs (12.06%), linking verbs (10.34%). As regards the adjectives used in the text, there were 46 adjectives in total, 30 adjectives were used as attributive (65.21 %) and 16 adjectives were used as predicative (34.78%). As for the adverbs, it was found that adverbs of times were used with the highest frequency (37.5 % ), followed by the adverbs of purpose and degree (33.33%) , the adverbs of frequency (12.5 %) , the adverbs of place ( 8.33% ) and the adverbs of manner ( 8.33 % ). -
2 Mora and Syllable
2 Mora and Syllable HARUO KUBOZONO 0 Introduction In his classical typological study Trubetzkoy (1969) proposed that natural languages fall into two groups, “mora languages” and “syllable languages,” according to the smallest prosodic unit that is used productively in that lan- guage. Tokyo Japanese is classified as a “mora” language whereas Modern English is supposed to be a “syllable” language. The mora is generally defined as a unit of duration in Japanese (Bloch 1950), where it is used to measure the length of words and utterances. The three words in (1), for example, are felt by native speakers of Tokyo Japanese as having the same length despite the different number of syllables involved. In (1) and the rest of this chapter, mora boundaries are marked by hyphens /-/, whereas syllable boundaries are marked by dots /./. All syllable boundaries are mora boundaries, too, although not vice versa.1 (1) a. to-o-kyo-o “Tokyo” b. a-ma-zo-n “Amazon” too.kyoo a.ma.zon c. a-me-ri-ka “America” a.me.ri.ka The notion of “mora” as defined in (1) is equivalent to what Pike (1947) called a “phonemic syllable,” whereas “syllable” in (1) corresponds to what he referred to as a “phonetic syllable.” Pike’s choice of terminology as well as Trubetzkoy’s classification may be taken as implying that only the mora is relevant in Japanese phonology and morphology. Indeed, the majority of the literature emphasizes the importance of the mora while essentially downplaying the syllable (e.g. Sugito 1989, Poser 1990, Tsujimura 1996b). This symbolizes the importance of the mora in Japanese, but it remains an open question whether the syllable is in fact irrelevant in Japanese and, if not, what role this second unit actually plays in the language. -
Handy Katakana Workbook.Pdf
First Edition HANDY KATAKANA WORKBOOK An Introduction to Japanese Writing: KANA THIS IS A SUPPLEMENT FOR BEGINNING LEVEL JAPANESE LANGUAGE INSTRUCTION. \ FrF!' '---~---- , - Y. M. Shimazu, Ed.D. -----~---- TABLE OF CONTENTS Page Introduction vi ACKNOWLEDGEMENlS vii STUDYSHEET#l 1 A,I,U,E, 0, KA,I<I, KU,KE, KO, GA,GI,GU,GE,GO, N WORKSHEET #1 2 PRACTICE: A, I,U, E, 0, KA,KI, KU,KE, KO, GA,GI,GU, GE,GO, N WORKSHEET #2 3 MORE PRACTICE: A, I, U, E,0, KA,KI,KU, KE, KO, GA,GI,GU,GE,GO, N WORKSHEET #~3 4 ADDmONAL PRACTICE: A,I,U, E,0, KA,KI, KU,KE, KO, GA,GI,GU,GE,GO, N STUDYSHEET #2 5 SA,SHI,SU,SE, SO, ZA,JI,ZU,ZE,ZO, TA, CHI, TSU, TE,TO, DA, DE,DO WORI<SHEEI' #4 6 PRACTICE: SA,SHI,SU,SE, SO, ZA,II, ZU,ZE,ZO, TA, CHI, 'lSU,TE,TO, OA, DE,DO WORI<SHEEI' #5 7 MORE PRACTICE: SA,SHI,SU,SE,SO, ZA,II, ZU,ZE, W, TA, CHI, TSU, TE,TO, DA, DE,DO WORKSHEET #6 8 ADDmONAL PRACI'ICE: SA,SHI,SU,SE, SO, ZA,JI, ZU,ZE,ZO, TA, CHI,TSU,TE,TO, DA, DE,DO STUDYSHEET #3 9 NA,NI, NU,NE,NO, HA, HI,FU,HE, HO, BA, BI,BU,BE,BO, PA, PI,PU,PE,PO WORKSHEET #7 10 PRACTICE: NA,NI, NU, NE,NO, HA, HI,FU,HE,HO, BA,BI, BU,BE, BO, PA, PI,PU,PE,PO WORKSHEET #8 11 MORE PRACTICE: NA,NI, NU,NE,NO, HA,HI, FU,HE, HO, BA,BI,BU,BE, BO, PA,PI,PU,PE,PO WORKSHEET #9 12 ADDmONAL PRACTICE: NA,NI, NU, NE,NO, HA, HI, FU,HE, HO, BA,BI,3U, BE, BO, PA, PI,PU,PE,PO STUDYSHEET #4 13 MA, MI,MU, ME, MO, YA, W, YO WORKSHEET#10 14 PRACTICE: MA,MI, MU,ME, MO, YA, W, YO WORKSHEET #11 15 MORE PRACTICE: MA, MI,MU,ME,MO, YA, W, YO WORKSHEET #12 16 ADDmONAL PRACTICE: MA,MI,MU, ME, MO, YA, W, YO STUDYSHEET #5 17 -
A Student Model of Katakana Reading Proficiency for a Japanese Language Intelligent Tutoring System
A Student Model of Katakana Reading Proficiency for a Japanese Language Intelligent Tutoring System Anthony A. Maciejewski Yun-Sun Kang School of Electrical Engineering Purdue University West Lafayette, Indiana 47907 Abstract--Thls work describes the development of a kanji. Due to the limited number of kana, their relatively low student model that Is used In a Japanese language visual complexity, and their systematic arrangement they do Intelligent tutoring system to assess a pupil's not represent a significant barrier to the student of Japanese. proficiency at reading one of the distinct In fact, the relatively small effort required to learn katakana orthographies of Japanese, known as k a t a k a n a , yields significant returns to readers of technical Japanese due to While the effort required to memorize the relatively the high incidence of terms derived from English and few k a t a k a n a symbols and their associated transliterated into katakana. pronunciations Is not prohibitive, a major difficulty In reading katakana Is associated with the phonetic This work describes the development of a system that is modifications which occur when English words used to automatically acquire knowledge about how English which are transliterated Into katakana are made to words are transliterated into katakana and then to use that conform to the more restrictive rules of Japanese information in developing a model of a student's proficiency in phonology. The algorithm described here Is able to reading katakana. This model is used to guide the instruction automatically acquire a knowledge base of these of the student using an intelligent tutoring system developed phonological transformation rules, use them to previously [6,8]. -
The Languages of Japan and Korea. London: Routledge 2 The
To appear in: Tranter, David N (ed.) The Languages of Japan and Korea. London: Routledge 2 The relationship between Japanese and Korean John Whitman 1. Introduction This chapter reviews the current state of Japanese-Ryukyuan and Korean internal reconstruction and applies the results of this research to the historical comparison of both families. Reconstruction within the families shows proto-Japanese-Ryukyuan (pJR) and proto-Korean (pK) to have had very similar phonological inventories, with no laryngeal contrast among consonants and a system of six or seven vowels. The main challenges for the comparativist are working through the consequences of major changes in root structure in both languages, revealed or hinted at by internal reconstruction. These include loss of coda consonants in Japanese, and processes of syncope and medial consonant lenition in Korean. The chapter then reviews a small number (50) of pJR/pK lexical comparisons in a number of lexical domains, including pronouns, numerals, and body parts. These expand on the lexical comparisons proposed by Martin (1966) and Whitman (1985), in some cases responding to the criticisms of Vovin (2010). It identifies a small set of cognates between pJR and pK, including approximately 13 items on the standard Swadesh 100 word list: „I‟, „we‟, „that‟, „one‟, „two‟, „big‟, „long‟, „bird‟, „tall/high‟, „belly‟, „moon‟, „fire‟, „white‟ (previous research identifies several more cognates on this list). The paper then concludes by introducing a set of cognate inflectional morphemes, including the root suffixes *-i „infinitive/converb‟, *-a „infinitive/irrealis‟, *-or „adnominal/nonpast‟, and *-ko „gerund.‟ In terms of numbers of speakers, Japanese-Ryukyuan and Korean are the largest language isolates in the world. -
Gemination As Fortition?
Gemination as fortition? Articulatory data from Hungarian Maida Percival, Tamás Gábor Csapó, Márton Bartók, Andrea Deme, Tekla Etelka Gráczi, Alexandra Markó Introduction: This study uses ultrasound imaging to investigate how tongue position differs across singleton and geminate consonants in Hungarian. In previous research, coronal geminates were found to be produced with greater lingual-palatal contact and a higher and flatter tongue (Kochetov & Kang, 2017; Payne, 2006). In Eastern Oromo, ultrasound imaging found similarly results that the tongue was more advanced in the mouth for coronal geminates, suggesting that it was more fully reaching its targeted place of articulation than singletons (Percival et al., 2019). These findings liken gemination with fortition. We follow up on these studies by examining whether there is evidence of differences in lingual articulation in geminates compared to singletons in Hungarian. As previous studies concentrated on coronal stops, we additionally ask if similar patterns of tongue raising or fronting can be found for geminates at other places of articulation as this could indicate how closely the pattern is tied to gemination in general versus a tongue pull mechanism limited to coronals. Methodology: Five native speakers of Hungarian (3 female, 2 male) were recorded with ultrasound and audio in Articulate Assistant Advanced (AAA). They read six repetitions of a word list containing geminate and singleton voiced and voiceless bilabial stops, alveolar stops, alveolar fricatives, and velar stops in intervocalic (post-tonic) position. Ultrasound frames at the point of maximum constriction were selected, and tongue contours were traced in AAA. The tongue contours were rotated to the speaker’s bite plane and divided into three regions (coronal, velar, and pharyngeal) (c.f. -
On the Theoretical Implications of Cypriot Greek Initial Geminates
<LINK "mul-n*">"mul-r16">"mul-r8">"mul-r19">"mul-r14">"mul-r27">"mul-r7">"mul-r6">"mul-r17">"mul-r2">"mul-r9">"mul-r24"> <TARGET "mul" DOCINFO AUTHOR "Jennifer S. Muller"TITLE "On the theoretical implications of Cypriot Greek initial geminates"SUBJECT "JGL, Volume 3"KEYWORDS "geminates, representation, phonology, Cypriot Greek"SIZE HEIGHT "220"WIDTH "150"VOFFSET "4"> On the theoretical implications of Cypriot Greek initial geminates* Jennifer S. Muller The Ohio State University Cypriot Greek contrasts singleton and geminate consonants in word-initial position. These segments are of particular interest to phonologists since two divergent representational frameworks, moraic theory (Hayes 1989) and timing-based frameworks, including CV or X-slot theory (Clements and Keyser 1983, Levin 1985), account for the behavior of initial geminates in substantially different ways. The investigation of geminates in Cypriot Greek allows these differences to be explored. As will be demonstrated in a formal analysis of the facts, the patterning of geminates in Cypriot is best accounted for by assuming that the segments are dominated by abstract timing units such as X- or C-slots, rather than by a unit of prosodic weight such as the mora. Keywords: geminates, representation, phonology, Cypriot Greek 1. Introduction Cypriot Greek is of particular interest, not only because it is one of the few varieties of Modern Greek maintaining a consonant length contrast, but more importantly because it exhibits this contrast in word-initial position: péfti ‘Thursday’ vs. ppéfti ‘he falls’.Although word-initial geminates are less common than their word-medial counterparts, they are attested in dozens of the world’s languages in addition to Cypriot Greek.