An Unsupervised Method for Identifying Loanwords in Korean

Total Page:16

File Type:pdf, Size:1020Kb

An Unsupervised Method for Identifying Loanwords in Korean An Unsupervised Method for Identifying Loanwords in Korean Hahn Koo San Jose State University [email protected] Manuscript to appear in Language Resources and Evaluation The final publication is available at Springer via http://dx.doi.org/10.1007/s10579-015-9296-5 Loanword Identification in Korean Abstract This paper presents an unsupervised method for developing a character-based n-gram classifier that identifies loanwords or transliterated foreign words in Korean text. The classifier is trained on an unlabeled corpus using the Ex- pectation Maximization algorithm, building on seed words extracted from the corpus. Words with high token frequency serve as native seed words. Words with seeming traces of vowel insertion to repair consonant clusters serve as foreign seed words. What counts as a trace of insertion is deter- mined using phoneme co-occurrence statistics in conjunction with ideas and findings in phonology. Experiments show that the method can produce an unsupervised classifier that performs at a level comparable to that of a su- pervised classifier. In a cross-validation experiment using a corpus of about 9.2 million words and a lexicon of about 71,000 words, mean F-scores of the best unsupervised classifier and the corresponding supervised classifier were 94.77% and 96.67%, respectively. Experiments also suggest that the method can be readily applied to other languages with similar phonotactics such as Japanese. Keywords: Loanwords; Transliteration; Detection; N-gram; EM algorithm; Korean 1 Loanword Identification in Korean 1 Introduction Loanwords are words whose meaning and pronunciation are borrowed from words in a foreign language. Their forms, both pronunciation and spelling, are often nativized. Their pronunciations adapt to conform to native sound patterns. Their spellings are transliterated using the native script and re- flect the adapted pronunciations. For example, flask [flæsk] in English be- comes 플라스크 [pʰɨl.ɾa.sɨ.kʰɨ] in Korean. The present paper is concerned with building a system that scans Korean text and identifies loanwords1 spelled in Hangul, the Korean alphabet. Such a system can be useful in many ways. First, one can use the system to collect data to study various aspects of loanwords (e.g. Haspelmath and Tadmor, 2009) or develop ma- chine transliteration systems (e.g. Knight and Graehl, 1998; Ravi and Knight, 2009). Loanwords or transliterations (e.g. 플라스크) can be extracted from monolingual corpora by running the system alone. Transliteration pairs (e.g. <flask, 플라스크>) can be extracted from parallel corpora by first identi- fying the output with the system and then matching input forms based on scoring heuristics such as phonetic similarity (e.g. Yoon et al., 2007). Second, the system allows one to use etymological origins of words as a feature and be more discrete in text processing. For example, grapheme-to-phoneme con- version in Korean (Yoon and Brew, 2006) and stemming in Arabic (Nwesri, 2008) can be improved by keeping separate rules for native words and loan- words. The system can be used to classify a given word into either category and apply the proper set of rules. The loanword identification system envisioned here is a binary, character- based n-gram classifier. Given a word (w) spelled in Hangul, the classifier decides whether the word is of native (N) or foreign (F ) origin by Bayesian classification, i.e. solving the following equation: c^(w) = arg max P (wjc) · P (c) (1) c2fN;F g The likelihood P (wjc) is calculated using a character n-gram model specific 1In this paper, loanwords in Korean refer to all words of foreign origin that are translit- erated in Hangul except Sino-Korean words, which are ancient borrowings from Chinese. Sino-Korean words are considered more native-like than other words of foreign origin due to their longer history and higher morphological productivity (Sohn, 1999). 2 Loanword Identification in Korean to that class. The classifier is trained on a corpus in an unsupervised manner, building on seed words extracted from the corpus. The native seed consists of words with high token frequency in the corpus. The idea is that frequent words are more likely to be native words than foreign words. The foreign seed consists of words that contain what appear to be traces of vowel inser- tion. Korean does not have words that begin or end with consonant clusters. Like many other languages with similar phonotactics (e.g. Japanese), for- eign words with consonant clusters are transliterated with vowels inserted to break the clusters. So presence of substrings that resemble traces of inser- tion suggests that a word may be of foreign origin. An obvious problem is deciding what those traces look like a priori. Here the problem is resolved by a heuristic based on phoneme co-occurrence statistics and rudimentary ideas and findings in phonology. The rest of the paper is organized as follows. In Section 2, I discuss pre- vious studies in foreign word identification as well as ideas and findings in phonology that the present study builds on. I describe the proposed method for developing the unsupervised classifier in detail in Section 3. I discuss ex- periments that evaluate the effectiveness of the method in Korean in Section 4 and pilot experiments in Japanese that explore its applicability to other languages in Section 5. I conclude the paper in Section 6. 2 Background This work is motivated by previous studies on identifying loanwords or for- eign words in monolingual data. Many of them rely on the assumption that distribution of strings of sublexical units such as phonemes, letters, and syl- lables differs between words of different origins. Some write explicit and categorical rules stating which substrings are characteristic of foreign words (e.g. Bali et al., 2007; Khaltar and Fujii, 2009). Some train letter or syllable n-gram models separately for native words and foreign words and compare the two. It has been shown that the n-gram approach can be very effective in Korean (e.g. Jeong et al., 1999; Oh and Choi, 2001). Training the n-gram models is straightforward with labeled data in which words are tagged either native or foreign. But creating labeled data can be expensive and tedious. In response, some have proposed methods for generat- 3 Loanword Identification in Korean ing pseudo-annotated data: Baker and Brew (2008) for Korean and Goldberg and Elhadad (2008) for Hebrew. In both studies, the authors suggest gener- ating pseudo-loanwords by applying transliteration rules to a foreign lexicon such as the CMU Pronouncing Dictionary. They suggest different methods for generating pseudo-native words. Baker and Brew extract words with high token frequencies in a Korean newswire corpus assuming that frequent words are more likely to be native than foreign. Goldberg and Elhadad extracted words from a collection of old Hebrew texts assuming that old texts are much less likely to contain foreign words than recent texts. The approach is effective and a classifier trained on the pseudo-labeled data can perform comparably to a classifier trained on manually labeled data. Baker and Brew trained a logistic regression classifier using letter trigrams on about 180,000 pseudo-words, half pseudo-Korean and half pseudo-English. Tested on a la- beled set of 10,000 native Korean words and 10,000 English loanwords, the classifier showed 92.4% classification accuracy. In comparison, the corre- sponding classifier trained on manually labeled data showed 96.2% accuracy in a 10-fold cross-validation experiment. The pseudo-annotation approach obviates the need to manually label data. But one has to write a separate set of transliteration rules for every pair of languages. In addition, the transliteration rules may not be available to begin with, if the very purpose of identifying loanwords is to collect training data for machine transliteration. The foreign seed extraction method proposed in the present study is an attempt to reduce the level of language-specificity and demand for additional natural language processing capabilities. The method essentially equips one with a subset of transliteration rules by presuppos- ing a generic pattern in pronunciation change, i.e. vowel insertion. The method should be applicable to many language pairs. The need to repair consonant clusters arises for many language pairs and vowel insertion is a re- pair strategy adopted in many languages. Foreign sound sequences that are phonotactically illegal in the native language are usually repaired rather than overlooked. A common source of phonotactic discrepancy involves consonant clusters: different languages allow consonant clusters of different complex- ity. Maddieson (2013) identifies 151 languages that allow a wide variety of consonant clusters, 274 languages that allow only a highly restricted set of clusters, and 61 languages that do not allow clusters at all. Illegal clusters are repaired by vowel insertion or consonant deletion, but vowel insertion appears to be cross-linguistically more common (Kang, 2011). 4 Loanword Identification in Korean The vowel insertion pattern is initially characterized only generically as ‘in- sert vowel X in position Y to repair consonant cluster Z’. The generic nature of the characterization ensures language-neutrality. But in order for the pat- tern to be of any use, one must eventually flesh out the details and provide instances of the pattern equivalent to specific transliteration rules: ‘insert [u] between the consonants to repair [sm]’ or [sm] ! [sum], for example. Here the language-specific details of vowel insertion are discovered from a corpus in a data-driven manner but the search process is guided by findings and ideas in phonology. As will be described in detail below, possible values of which vowel is inserted where are constrained based on typological studies of loanword adaptation (e.g. Kang, 2011) and vowel insertion (e.g.
Recommended publications
  • Man'yogana.Pdf (574.0Kb)
    Bulletin of the School of Oriental and African Studies http://journals.cambridge.org/BSO Additional services for Bulletin of the School of Oriental and African Studies: Email alerts: Click here Subscriptions: Click here Commercial reprints: Click here Terms of use : Click here The origin of man'yogana John R. BENTLEY Bulletin of the School of Oriental and African Studies / Volume 64 / Issue 01 / February 2001, pp 59 ­ 73 DOI: 10.1017/S0041977X01000040, Published online: 18 April 2001 Link to this article: http://journals.cambridge.org/abstract_S0041977X01000040 How to cite this article: John R. BENTLEY (2001). The origin of man'yogana. Bulletin of the School of Oriental and African Studies, 64, pp 59­73 doi:10.1017/S0041977X01000040 Request Permissions : Click here Downloaded from http://journals.cambridge.org/BSO, IP address: 131.156.159.213 on 05 Mar 2013 The origin of man'yo:gana1 . Northern Illinois University 1. Introduction2 The origin of man'yo:gana, the phonetic writing system used by the Japanese who originally had no script, is shrouded in mystery and myth. There is even a tradition that prior to the importation of Chinese script, the Japanese had a native script of their own, known as jindai moji ( , age of the gods script). Christopher Seeley (1991: 3) suggests that by the late thirteenth century, Shoku nihongi, a compilation of various earlier commentaries on Nihon shoki (Japan's first official historical record, 720 ..), circulated the idea that Yamato3 had written script from the age of the gods, a mythical period when the deity Susanoo was believed by the Japanese court to have composed Japan's first poem, and the Sun goddess declared her son would rule the land below.
    [Show full text]
  • Consonant Characters and Inherent Vowels
    Global Design: Characters, Language, and More Richard Ishida W3C Internationalization Activity Lead Copyright © 2005 W3C (MIT, ERCIM, Keio) slide 1 Getting more information W3C Internationalization Activity http://www.w3.org/International/ Copyright © 2005 W3C (MIT, ERCIM, Keio) slide 2 Outline Character encoding: What's that all about? Characters: What do I need to do? Characters: Using escapes Language: Two types of declaration Language: The new language tag values Text size Navigating to localized pages Copyright © 2005 W3C (MIT, ERCIM, Keio) slide 3 Character encoding Character encoding: What's that all about? Copyright © 2005 W3C (MIT, ERCIM, Keio) slide 4 Character encoding The Enigma Photo by David Blaikie Copyright © 2005 W3C (MIT, ERCIM, Keio) slide 5 Character encoding Berber 4,000 BC Copyright © 2005 W3C (MIT, ERCIM, Keio) slide 6 Character encoding Tifinagh http://www.dailymotion.com/video/x1rh6m_tifinagh_creation Copyright © 2005 W3C (MIT, ERCIM, Keio) slide 7 Character encoding Character set Character set ⴰ ⴱ ⴲ ⴳ ⴴ ⴵ ⴶ ⴷ ⴸ ⴹ ⴺ ⴻ ⴼ ⴽ ⴾ ⴿ ⵀ ⵁ ⵂ ⵃ ⵄ ⵅ ⵆ ⵇ ⵈ ⵉ ⵊ ⵋ ⵌ ⵍ ⵎ ⵏ ⵐ ⵑ ⵒ ⵓ ⵔ ⵕ ⵖ ⵗ ⵘ ⵙ ⵚ ⵛ ⵜ ⵝ ⵞ ⵟ ⵠ ⵢ ⵣ ⵤ ⵥ ⵯ Copyright © 2005 W3C (MIT, ERCIM, Keio) slide 8 Character encoding Coded character set 0 1 2 3 0 1 Coded character set 2 3 4 5 6 7 8 9 33 (hexadecimal) A B 52 (decimal) C D E F Copyright © 2005 W3C (MIT, ERCIM, Keio) slide 9 Character encoding Code pages ASCII Copyright © 2005 W3C (MIT, ERCIM, Keio) slide 10 Character encoding Code pages ISO 8859-1 (Latin 1) Western Europe ç (E7) Copyright © 2005 W3C (MIT, ERCIM, Keio) slide 11 Character encoding Code pages ISO 8859-7 Greek η (E7) Copyright © 2005 W3C (MIT, ERCIM, Keio) slide 12 Character encoding Double-byte characters Standard Country No.
    [Show full text]
  • SUPPORTING the CHINESE, JAPANESE, and KOREAN LANGUAGES in the OPENVMS OPERATING SYSTEM by Michael M. T. Yau ABSTRACT the Asian L
    SUPPORTING THE CHINESE, JAPANESE, AND KOREAN LANGUAGES IN THE OPENVMS OPERATING SYSTEM By Michael M. T. Yau ABSTRACT The Asian language versions of the OpenVMS operating system allow Asian-speaking users to interact with the OpenVMS system in their native languages and provide a platform for developing Asian applications. Since the OpenVMS variants must be able to handle multibyte character sets, the requirements for the internal representation, input, and output differ considerably from those for the standard English version. A review of the Japanese, Chinese, and Korean writing systems and character set standards provides the context for a discussion of the features of the Asian OpenVMS variants. The localization approach adopted in developing these Asian variants was shaped by business and engineering constraints; issues related to this approach are presented. INTRODUCTION The OpenVMS operating system was designed in an era when English was the only language supported in computer systems. The Digital Command Language (DCL) commands and utilities, system help and message texts, run-time libraries and system services, and names of system objects such as file names and user names all assume English text encoded in the 7-bit American Standard Code for Information Interchange (ASCII) character set. As Digital's business began to expand into markets where common end users are non-English speaking, the requirement for the OpenVMS system to support languages other than English became inevitable. In contrast to the migration to support single-byte, 8-bit European characters, OpenVMS localization efforts to support the Asian languages, namely Japanese, Chinese, and Korean, must deal with a more complex issue, i.e., the handling of multibyte character sets.
    [Show full text]
  • Problematika České Transkripce Japonštiny a Pravidla Jejího Užívání Nihongo 日本語 社会 Šakai Fudži 富士 Ivona Barešová, Monika Dytrtová Momidži Tókjó もみ じ 東京 Čotto ち ょっ と
    チェ コ語 翼 čekogo cubasa šin’jó 信用 Problematika české transkripce japonštiny a pravidla jejího užívání nihongo 日本語 社会 šakai Fudži 富士 Ivona Barešová, Monika Dytrtová momidži Tókjó もみ じ 東京 čotto ち ょっ と PROBLEMATIKA ČESKÉ TRANSKRIPCE JAPONŠTINY A PRAVIDLA JEJÍHO UŽÍVÁNÍ Ivona Barešová Monika Dytrtová Spolupracovala: Bc. Mária Ševčíková Oponenti: prof. Zdeňka Švarcová, Dr. Mgr. Jiří Matela, M.A. Tento výzkum byl umožněn díky účelové podpoře na specifi cký vysokoškolský výzkum udělené roku 2013 Univerzitě Palackého v Olomouci Ministerstvem školství, mládeže a tělovýchovy ČR (FF_2013_044). Neoprávněné užití tohoto díla je porušením autorských práv a může zakládat občanskoprávní, správněprávní, popř. trestněprávní odpovědnost. 1. vydání © Ivona Barešová, Monika Dytrtová, 2014 © Univerzita Palackého v Olomouci, 2014 ISBN 978-80-244-4017-0 Ediční poznámka V publikaci je pro přepis japonských slov užito české transkripce, s výjimkou jejich výskytu v anglic kém textu, kde se objevuje původní transkripce anglická. Japonská jména jsou uváděna v pořadí jméno – příjmení, s výjimkou jejich uvedení v bibliogra- fi ckých citacích, kde se jejich pořadí řídí citační normou. Pokud není uvedeno jinak, autorkami překladů citací a parafrází jsou autorky této práce. V textu jsou použity následující grafi cké prostředky. Fonologický zápis je uveden standardně v šikmých závorkách // a fonetický v hranatých []. Fonetické symboly dle IPA jsou použity v rámci kapi toly 5. V případě, že není nutné zaznamenávat jemné zvukové nuance dané hlásky nebo že není kladen důraz na zvukové rozdíly mezi jed- notlivými hláskami, je jinde v textu, kde je to možné, z důvodu usnadnění porozumění užito namísto těchto specifi ckých fonetických symbolů písmen české abecedy, např.
    [Show full text]
  • Hiragana Chart
    ひらがな Hiragana Chart W R Y M H N T S K VOWEL ん わ ら や ま は な た さ か あ A り み ひ に ち し き い I る ゆ む ふ ぬ つ す く う U れ め へ ね て せ け え E を ろ よ も ほ の と そ こ お O © 2010 Michael L. Kluemper et al. Beginning Japanese, Tuttle Publishing, an imprint of Periplus Editions (HK) Ltd. All rights reserved. www.TimeForJapanese.com. 1 Beginning Japanese 名前: ________________________ 1-1 Hiragana Activity Book 日付: ___月 ___日 一、 Practice: あいうえお かきくけこ がぎぐげご O E U I A お え う い あ あ お え う い あ お う あ え い あ お え う い お う い あ お え あ KO KE KU KI KA こ け く き か か こ け く き か こ け く く き か か こ き き か こ こ け か け く く き き こ け か © 2010 Michael L. Kluemper et al. Beginning Japanese, Tuttle Publishing, an imprint of Periplus Editions (HK) Ltd. All rights reserved. www.TimeForJapanese.com. 2 GO GE GU GI GA ご げ ぐ ぎ が が ご げ ぐ ぎ が ご ご げ ぐ ぐ ぎ ぎ が が ご げ ぎ が ご ご げ が げ ぐ ぐ ぎ ぎ ご げ が 二、 Fill in each blank with the correct HIRAGANA. SE N SE I KI A RA NA MA E 1.
    [Show full text]
  • Como Digitar Em Japonês 1
    Como digitar em japonês 1 Passo 1: Mudar para o modo de digitação em japonês Abra o Office Word, Word Pad ou Bloco de notas para testar a digitação em japonês. Com o cursor colocado em um novo documento em algum lugar em sua tela você vai notar uma barra de idiomas. Clique no botão "PT Português" e selecione "JP Japonês (Japão)". Isso vai mudar a aparência da barra de idiomas. * Se uma barra longa aparecer, como na figura abaixo, clique com o botão direito na parte mais à esquerda e desmarque a opção "Legendas". ficará assim → Além disso, você pode clicar no "_" no canto superior direito da barra de idiomas, que a janela se fechará no canto inferior direito da tela (minimizar). ficará assim → © 2017 Fundação Japão em São Paulo Passo 2: Alterar a barra de idiomas para exibir em japonês Se você não consegue ler em japonês, pode mudar a exibição da barra de idioma para inglês. Clique em ツール e depois na opção プロパティ. Opção: Alterar a barra de idiomas para exibir em inglês Esta janela é toda em japonês, mas não se preocupe, pois da próxima vez que abrí-la estará em Inglês. Haverá um menu de seleção de idiomas no menu de "全般", escolha "英語 " e clique em "OK". © 2017 Fundação Japão em São Paulo Passo 3: Digitando em japonês Certifique-se de que tenha selecionado japonês na barra de idiomas. Após isso, selecione “hiragana”, como indica a seta. Passo 4: Digitando em japonês com letras romanas Uma vez que estiver no modo de entrada correto no documento, vamos digitar uma palavra prática.
    [Show full text]
  • Handy Katakana Workbook.Pdf
    First Edition HANDY KATAKANA WORKBOOK An Introduction to Japanese Writing: KANA THIS IS A SUPPLEMENT FOR BEGINNING LEVEL JAPANESE LANGUAGE INSTRUCTION. \ FrF!' '---~---- , - Y. M. Shimazu, Ed.D. -----~---- TABLE OF CONTENTS Page Introduction vi ACKNOWLEDGEMENlS vii STUDYSHEET#l 1 A,I,U,E, 0, KA,I<I, KU,KE, KO, GA,GI,GU,GE,GO, N WORKSHEET #1 2 PRACTICE: A, I,U, E, 0, KA,KI, KU,KE, KO, GA,GI,GU, GE,GO, N WORKSHEET #2 3 MORE PRACTICE: A, I, U, E,0, KA,KI,KU, KE, KO, GA,GI,GU,GE,GO, N WORKSHEET #~3 4 ADDmONAL PRACTICE: A,I,U, E,0, KA,KI, KU,KE, KO, GA,GI,GU,GE,GO, N STUDYSHEET #2 5 SA,SHI,SU,SE, SO, ZA,JI,ZU,ZE,ZO, TA, CHI, TSU, TE,TO, DA, DE,DO WORI<SHEEI' #4 6 PRACTICE: SA,SHI,SU,SE, SO, ZA,II, ZU,ZE,ZO, TA, CHI, 'lSU,TE,TO, OA, DE,DO WORI<SHEEI' #5 7 MORE PRACTICE: SA,SHI,SU,SE,SO, ZA,II, ZU,ZE, W, TA, CHI, TSU, TE,TO, DA, DE,DO WORKSHEET #6 8 ADDmONAL PRACI'ICE: SA,SHI,SU,SE, SO, ZA,JI, ZU,ZE,ZO, TA, CHI,TSU,TE,TO, DA, DE,DO STUDYSHEET #3 9 NA,NI, NU,NE,NO, HA, HI,FU,HE, HO, BA, BI,BU,BE,BO, PA, PI,PU,PE,PO WORKSHEET #7 10 PRACTICE: NA,NI, NU, NE,NO, HA, HI,FU,HE,HO, BA,BI, BU,BE, BO, PA, PI,PU,PE,PO WORKSHEET #8 11 MORE PRACTICE: NA,NI, NU,NE,NO, HA,HI, FU,HE, HO, BA,BI,BU,BE, BO, PA,PI,PU,PE,PO WORKSHEET #9 12 ADDmONAL PRACTICE: NA,NI, NU, NE,NO, HA, HI, FU,HE, HO, BA,BI,3U, BE, BO, PA, PI,PU,PE,PO STUDYSHEET #4 13 MA, MI,MU, ME, MO, YA, W, YO WORKSHEET#10 14 PRACTICE: MA,MI, MU,ME, MO, YA, W, YO WORKSHEET #11 15 MORE PRACTICE: MA, MI,MU,ME,MO, YA, W, YO WORKSHEET #12 16 ADDmONAL PRACTICE: MA,MI,MU, ME, MO, YA, W, YO STUDYSHEET #5 17
    [Show full text]
  • Galio Uma, Maƙerin Azurfa Maƙeri (M): Ni Ina Da Shekara Yanzu…
    LANGUAGE PROGRAM, 232 BAY STATE ROAD, BOSTON, MA 02215 [email protected] www.bu.edu/Africa/alp 617.353.3673 Interview 2: Galio Uma, Maƙerin Azurfa Maƙeri (M): Ni ina da shekara yanzu…, ina da vers …alal misali vers mille neuf cent quatre vingt et quatre (1984) gare mu parce que wajenmu babu hopital lokacin da anka haihe ni ban san shekaruna ba, gaskiya. Ihehehe! Makaranta, ban yi makaranta ba sabo da an sa ni makaranta sai daga baya wajenmu babana bai da hali, mammana bai da hali sai mun biya. Akwai nisa tsakaninmu da inda muna karatu. Kilometir kama sha biyar kullum ina tahiya da ƙahwa. To, ha na gaji wani lokaci mu hau jaki, wani lokaci mu hau raƙumi, to wani lokaci ma babu hali, babu karatu. Amman na yi shekara shidda ina karatu ko biyat. Amman ban ji daɗi ba da ban yi karatu ba. Sabo da shi ne daɗin duniya. Mm. To mu Buzaye ne muna yin buzanci. Ana ce muna Tuareg. Mais Tuareg ɗin ma ba guda ba ne. Mu artisans ne zan ce miki. Can ainahinmu ma ba mu da wani aiki sai ƙira. Duka mutanena suna can Abala. Nan ga ni dai na nan ga. Ka gani ina da mata, ina da yaro. Akwai babana, akwai mammana. Ina da ƙanne huɗu, macce biyu da namiji biyu. Suna duka cikin gida. Duka kuma kama ni ne chef de famille. Duka ni ne ina aider ɗinsu. Sabo da babana yana da shekara saba’in da biyar. Vers. Mammana yana da shekara shi ma hamsin da biyar.
    [Show full text]
  • ALTEC Language Class: Japanese Beginning II
    ALTEC Language Class: Japanese Beginning II Class duration: 10 weeks, January 28–April 7, 2020 (no class March 24) Class meetings: Tuesdays at 5:30pm–7:30pm in Hellems 145 Instructor: Megan Husby, [email protected] Class session Resources before coming to Practice exercises after Communicative goals Grammar Vocabulary & topic class class Talking about things that you Verb Conjugation: Past tense Review of Hiragana Intro and あ column Fun Hiragana app for did in the past of long (polite) forms Japanese your Phone (~desu and ~masu verbs) Writing Hiragana か column Talking about your winter System: Hiragana song break Hiragana Hiragana さ column (Recognition) Hiragana Practice クリスマス・ハヌカー・お Hiragana た column Worksheet しょうがつ 正月はなにをしましたか。 Winter Sports どこにいきましたか。 Hiragana な column Grammar Review なにをたべましたか。 New Year’s (Listening) プレゼントをかいましたか/ Vocab Hiragana は column もらいましたか。 Genki I pg. 110 スポーツをしましたか。 Hiragana ま column だれにあいましたか。 Practice Quiz Week 1, えいがをみましたか。 Hiragana や column Jan. 28 ほんをよみましたか。 Omake (bonus): Kasajizō: うたをききましたか/ Hiragana ら column A Folk Tale うたいましたか。 Hiragana わ column Particle と Genki: An Integrated Course in Japanese New Year (Greetings, Elementary Japanese pgs. 24-31 Activities, Foods, Zodiac) (“Japanese Writing System”) Particle と Past Tense of desu (Affirmative) Past Tense of desu (Negative) Past Tense of Verbs Discussing family, pets, objects, Verbs for being (aru and iru) Review of Katakana Intro and ア column Katakana Practice possessions, etc. Japanese Worksheet Counters for people, animals, Writing Katakana カ column etc. System: Genki I pgs. 107-108 Katakana Katakana サ column (Recognition) Practice Quiz Katakana タ column Counters Katakana ナ column Furniture and common Katakana ハ column household items Katakana マ column Katakana ヤ column Katakana ラ column Week 2, Feb.
    [Show full text]
  • Android Apps for Learning Kana Recommended by Our Students
    Android Apps for learning Kana recommended by our students [Kana column: H = Hiragana, K = Katakana] Below are some recommendations for Kana learning apps, ranked in descending order by our students. Please try a few of these and find one that suits your needs. Enjoy learning Kana! Recommended Points App Name Kana Language Description Link Listening Writing Quizzes English: https://nihongo-e-na.com/android/jpn/id739.html English, Developed by the Japan Foundation and uses Hiragana Memory Hint H Indonesian, 〇 〇 picture mnemonics to help you memorize Indonesian: https://nihongo-e-na.com/android/eng/id746.html Thai Hiragana. Thai: https://nihongo-e-na.com/android/eng/id773.html English: https://nihongo-e-na.com/android/eng/id743.html English, Developed by the Japan Foundation and uses Katakana Memory Hint K Indonesian, 〇 〇 picture mnemonics to help you memorize Indonesian: https://nihongo-e-na.com/android/eng/id747.html Thai Katakana. Thai: https://nihongo-e-na.com/android/eng/id775.html A holistic app that can be used to master Kana Obenkyo H&K English 〇 〇 fully, and eventually also for other skills like https://nihongo-e-na.com/android/eng/id602.html Kanji and grammar. A very integrated quizzing system with five Kana (Hiragana and Katakana) H&K English 〇 〇 https://nihongo-e-na.com/android/jpn/id626.html varieties of tests available. Uses SRS (Spatial Repetition System) to help Kana Town H&K English 〇 〇 https://nihongo-e-na.com/android/eng/id845.html build memory. Although the app is entirely in Japanese, it only has Hiragana and Katakana so the interface Free Learn Japanese Hiragana H&K Japanese 〇 〇 〇 does not pose a problem as such.
    [Show full text]
  • Machine Transliteration (Knight & Graehl, ACL
    Machine Transliteration (Knight & Graehl, ACL 97) Kevin Duh UW Machine Translation Reading Group, 11/30/2005 Transliteration & Back-transliteration • Transliteration: • Translating proper names, technical terms, etc. based on phonetic equivalents • Complicated for language pairs with different alphabets & sound inventories • E.g. “computer” --> “konpyuutaa” 䜷䝷䝗䝩䜪䝃䞀 • Back-transliteration • E.g. “konpyuuta” --> “computer” • Inversion of a lossy process Japanese/English Examples • Some notes about Japanese: • Katakana phonetic system for foreign names/loan words • Syllabary writing: • e.g. one symbol for “ga”䚭䜰, one for “gi”䚭䜲 • Consonant-vowel (CV) structure • Less distinction of L/R and H/F sounds • Examples: • Golfbag --> goruhubaggu 䜸䝯䝙䝔䝇䜴 • New York Times --> nyuuyooku taimuzu䚭䝏䝩䞀䝬䞀䜳䚭䝃䜨䝤䜾 • Ice cream --> aisukuriimu 䜦䜨䜽䜳䝮䞀䝤 The Challenge of Machine Back-transliteration • Back-transliteration is an important component for MT systems • For J/E: Katakana phrases are the largest source of phrases that do not appear in bilingual dictionary or training corpora • Claims: • Back-transliteration is less forgiving than transliteration • Back-transliteration is harder than romanization • For J/E, not all katakana phrases can be “sounded out” by back-transliteration • word processing --> waapuro • personal computer --> pasokon Modular WSA and WFSTs • P(w) - generates English words • P(e|w) - English words to English pronounciation • P(j|e) - English to Japanese sound conversion • P(k|j) - Japanese sound to katakana • P(o|k) - katakana to OCR • Given a katana string observed by OCR, find the English word sequence w that maximizes !!!P(w)P(e | w)P( j | e)P(k | j)P(o | k) e j k Two Potential Solutions • Learn from bilingual dictionaries, then generalize • Pro: Simple supervised learning problem • Con: finding direct correspondence between English alphabets and Japanese katakana may be too tenuous • Build a generative model of transliteration, then invert (Knight & Graehl’s approach): 1.
    [Show full text]
  • KANA Response Live Organization Administration Tool Guide
    This is the most recent version of this document provided by KANA Software, Inc. to Genesys, for the version of the KANA software products licensed for use with the Genesys eServices (Multimedia) products. Click here to access this document. KANA Response Live Organization Administration KANA Response Live Version 10 R2 February 2008 KANA Response Live Organization Administration All contents of this documentation are the property of KANA Software, Inc. (“KANA”) (and if relevant its third party licensors) and protected by United States and international copyright laws. All Rights Reserved. © 2008 KANA Software, Inc. Terms of Use: This software and documentation are provided solely pursuant to the terms of a license agreement between the user and KANA (the “Agreement”) and any use in violation of, or not pursuant to any such Agreement shall be deemed copyright infringement and a violation of KANA's rights in the software and documentation and the user consents to KANA's obtaining of injunctive relief precluding any further such use. KANA assumes no responsibility for any damage that may occur either directly or indirectly, or any consequential damages that may result from the use of this documentation or any KANA software product except as expressly provided in the Agreement, any use hereunder is on an as-is basis, without warranty of any kind, including without limitation the warranties of merchantability, fitness for a particular purpose, and non-infringement. Use, duplication, or disclosure by licensee of any materials provided by KANA is subject to restrictions as set forth in the Agreement. Information contained in this document is subject to change without notice and does not represent a commitment on the part of KANA.
    [Show full text]