A Comparison of Entity Matching Methods between English and Japanese

Michiharu Yamashita, Hideki Awashima, Hidekazu Oiwa*
Recruit Co., Ltd. / Megagon Labs, Tokyo, Japan
{chewgen, awashima}@r.recruit.co.jp, [email protected]

Abstract

Japanese Katakana is one component of the Japanese writing system and is used to express English terms, loanwords, and onomatopoeia in Japanese characters based on the pronunciation. The main purpose of this research is to find the best entity matching methods between English and Katakana. We built two research questions to clarify which types of entity matching systems work better than others. The first question is what transliteration should be used for conversion. We need to transliterate English or Katakana terms into the same form in order to compute the string similarity. We consider five conversions that transliterate English to Katakana directly, Katakana to English directly, English to Katakana via phoneme, Katakana to English via phoneme, and both English and Katakana to phoneme. The second question is what should be used for the similarity measure at entity matching. To investigate the problem, we choose six methods, which are Overlap Coefficient, Cosine, Jaccard, Jaro-Winkler, Levenshtein, and the similarity of the phoneme probability predicted by an RNN. Our results show that 1) matching using phonemes and conversion of Katakana to English works better than other methods, and 2) the similarity of phonemes outperforms the other methods, while the other similarity scores change depending on the data and models.

* The author is now at Google Inc.

1 Introduction

Cleansing and preprocessing data is one of the essential tasks in data analysis such as natural language processing (Witten et al., 2016). In particular, finding the same entity in multiple datasets is an important task. For example, when the same entities are expressed in different languages, you need to convert them to the same writing format before entity matching.

The Japanese language has three kinds of character types, and they are used for different purposes (Nagata, 1998). One of the character types is Katakana, which is used to convert English words, foreign languages, and letters into Japanese characters (Martin, 2004). Katakana is often transliterated by phonemes unique to Japanese that are similar to but different from English pronunciation. In addition, whether terms are expressed in English or Katakana depends on the site. For example, on Japanese web pages, there are many restaurants written in English and Japanese even if they are the same stores, such as "Wendy's" and "ウェンディーズ". If it is the same type of character, it is easy to identify the entity simply by calculating the similarity of the string, but in the case of different writing systems like English and Katakana, it is difficult to identify the entity.

In this research, we clarify the problem by exploring the following two research questions.

(1) What transliteration should be used for conversion?

In order to convert terms to the same string form, the following methods can be considered.

Figure 1: Method to Convert the Entity Name.

The first and second methods are to convert English to Katakana or Katakana to English and then match the entities. The third and fourth methods are to use pronunciation information. Katakana is based on phonemes and is a syllabic system, where each

syllabogram corresponds to one sound in the Japanese language. Therefore, these methods match the entities after converting English or Katakana into phonemes and converting the transliterated phonemes to Katakana or English. The fifth method also uses phonemes. This method matches the entities based on the transliterated phonemes from both English and Katakana.

(2) What should be used for the similarity measure?

In order to calculate the similarity of a character string for entity matching, it is necessary to select measures from many similarity measures. In this research, as commonly used similarity measures, we use the similarity of Overlap Coefficient, Cosine, Jaccard, Jaro-Winkler, and Levenshtein (Cohen et al., 2003). Moreover, we propose a similarity method using the probability of the phonemes from a prediction model. We clarify which of the six similarity methods should be used by comparing their accuracy.

2 Related Work

Entity matching is a crucial task, and there is a lot of research on entity matching (Shen et al., 2015; Cai et al., 2013; Carmel et al., 2014; Mudgal et al., 2018). In these studies, the attribute information of an entity is used. In the case where there is no attribute and there is only the entity name, the character name information must be used. Different from general entity linking tasks, some works match entities only on entries in tables (Muñoz et al., 2014; Sekhavat et al., 2014). Although these studies match entities by collecting additional information on the entity, pronunciation information is not used.

In addition to studies of entity matching, transliteration is also studied. Transliteration is a task that converts a word in one language into the characters of a different language so that it is as close as possible to the native pronunciation. Many studies on transliteration have been conducted, such as those on Hindi and Myanmar (Pandey and Roy, 2017; Thu et al., 2016). Some studies consider pronunciation information in transliteration (Yao and Zweig, 2015; Toshniwal and Livescu, 2016; Rao et al., 2015). Transliteration differs from entity matching in the purpose of the task, but it is applicable to entity matching because transliteration can extend the information of the entity. Therefore, we use transliteration to solve the task of entity matching. There are some transliteration and entity matching studies, but there is little research that solves entity matching using transliteration information. Our motivation is to extend our database from external data by entity matching, because we have relations with many types of clients such as restaurants, beauty salons, and companies, and extension of data is essential for the discovery of new clients. Therefore, we need to transform the names of the entities and to find which methods are best for entity matching between English and Katakana.

3 Japanese Characters

Japanese characters are normally written in a combination of three character types. One type is ideographic characters, Kanji, from China, and the other two types, Hiragana and Katakana, are phonetic characters. Kanji is mainly used for nouns and the stems of adjectives and verbs, and Hiragana is used for helpful markings of adjectives, verbs, and Japanese words that are not expressed in Kanji. On the other hand, Katakana is used to write sound effects and transcribe foreign words (Martin, 2004). When we try entity matching with Japanese data, we usually face English expressed in Japanese Katakana in restaurants, companies, books, electrical items, and so on. We usually cannot find the two names where one is written in English and the other is written in Katakana within enormous data, because Japanese speakers use both English and Katakana to write foreign words.

Dictionaries already exist for English words with Japanese meanings, but few dictionaries exist for English with Katakana. The report (Benson et al., 2009) mentions that the Japanese language is based on morae rather than syllables. A mora is a unit of sound that contributes to a syllable's weight. Katakana is more accurately described as a way to write the set of Japanese morae rather than the set of Japanese syllables, as each symbol represents not a syllable but a unit of sound of Japanese speech. A mora-based writing system in Japanese represents a dimension of the language that has no corresponding representation in English. This challenges the transliteration task between English and Katakana. Therefore, it is not easy to convert English into Katakana.

The Japanese language also has a method to transliterate Katakana into alphabet characters, and this transliterated alphabet is called Romaji, which is a phonemic representation of Japanese characters (Smith, 1996). Romaji is used in any context for non-Japanese speakers who cannot read Japanese characters, such as for names, passports, and any Japanese entities. Romaji is the most common way to input Japanese into computers and to display Japanese on devices that do not support Japanese characters (DeFrancis, 1984), and almost all Japanese people learn Romaji and are able to read and write Japanese using Romaji. Therefore, generally speaking, Japanese people who do not write in English usually use Romaji to express Katakana or foreign terms without Japanese characters.

4 Methods

To solve the task of entity matching between English and Katakana, the entity name must somehow be transliterated. Figure 2 shows the frame of the task. For example, the word "angel" has four features, which are English, phonemes of English, Japanese Katakana, and Romaji. We state how to convert the entity name, how to calculate the similarity, and what the baseline is below.

Figure 2: Entity Matching Task between English and Japanese Katakana.

4.1 Baseline

As the baseline, we used Romaji transliteration, which is the characters used to transliterate Katakana to an alphabet sequence (DeFrancis, 1984). We converted Katakana to English using Romaji and then performed entity matching using both alphabets. There are many popular ways of romanizing Katakana, and we use the Hepburn romanization because the romanization of a Katakana word generally faithfully represents the pronunciation of that Japanese word, and many Japanese speakers use it to express Katakana and foreign terms without Japanese characters. We used the module romkan [1] for this method. For example, for "Japan", the Romaji transliteration is "ジャパン" and the actual Katakana is also "ジャパン". As another example, for "Orange", the Romaji transliteration is "オランゲ" but the actual Katakana is "オレンジ". In addition, we also used Soundex [2] as a benchmark of entity matching between phoneme pairs. Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English.

4.2 Predictive Model

We used a sequence-to-sequence model for transliteration. For example, the input is the sequence of English characters (x1, ..., xn), and the output is the sequence of phoneme characters (y1, ..., ym). In our model, we estimated the conditional probability of an output sequence (y1, ..., ym) given an input sequence (x1, ..., xn):

p(y1, ..., ym | x1, ..., xn)   (1)

Given an input sequence (x1, ..., xn), the LSTM computes a sequence of hidden states (h1, ..., hn). During decoding, it defines a distribution over the output sequence (y1, ..., ym) given the input sequence:

p(y1, ..., ym | x1, ..., xn) = ∏_{t=1}^{m} p(yt | h_{n+t-1}, y_{t-1})   (2)

We also used a bi-directional recurrent neural network (Schuster and Paliwal, 1997). In our architecture, one RNN processes the input from left to right, while another processes it from right to left. The outputs of the two subnetworks are then combined. This model has been applied to machine translation, and in this case, the phoneme prediction depends on the whole character sequence. Figure 3 shows the architecture of our models in this research. The combinations of input and output are English-Katakana, Katakana-English, English-Phoneme, Katakana-Phoneme, Phoneme-English, and Phoneme-Katakana. The input and output of Figure 3 show English-Phoneme as an example, and our other models were created in the same manner.

Figure 3: Architecture of the Predictive Model.

[1] https://pypi.python.org/pypi/romkan
[2] https://www.archives.gov/research/census/soundex.html

Proceedings of the 15th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 84–92, Brussels, Belgium, October 31, 2018. © 2018 The Special Interest Group on Computational Morphology and Phonology. https://doi.org/10.18653/v1/P17
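The Soundex baseline can be made concrete with a short sketch. This is the standard American Soundex algorithm (keep the first letter, map consonants to digits, let h and w be transparent between same-coded consonants, pad to four characters); the paper does not show its implementation, so treat this as an illustrative assumption rather than the authors' exact code.

```python
def soundex(name: str) -> str:
    """American Soundex: first letter plus up to three digits coding consonants."""
    codes = {}
    for group, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in group:
            codes[ch] = digit
    letters = [c for c in name.lower() if c.isalpha()]
    result = letters[0].upper()
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        if ch not in "hw":  # h and w do not separate same-coded consonants
            prev = digit
    return (result + "000")[:4]
```

For example, soundex("Robert") and soundex("Rupert") both yield "R163", so the two spellings would be treated as a match by this benchmark.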

4.3 Model Settings

We chose each hyperparameter of the model by grid search. The targets of the hyperparameter search were the word embedding size, the hidden layer size, and input reversal. The embedding and hidden layer sizes were set to 256, 512, or 1024, and input reversal was True or False. In the experiment, we chose the best model, which had the lowest loss value on the validation data, and applied that model to the validation data. We used categorical cross entropy as the loss, and the optimizer was Adam (Kingma and Ba, 2014). The hyperparameters used are shown in Table 1. Regarding the six models with RNN, we tried to match the entities of the transliterated words as shown in Figure 1. At last, we created all word combinations with the dataset and then calculated their similarity scores. Each model was trained on 80% of the dataset, and 20% was used as test data. The test data was also used as an experiment for this research.

4.4 Similarity Metric

We used the module py_stringmatching [3], which consists of a comprehensive and scalable set of string tokenizers, such as alphabetical tokenizers and whitespace tokenizers, and string similarity measures. We used this module to calculate the string similarity. We chose Overlap Coefficient, Cosine, Jaccard, Jaro-Winkler, and Levenshtein from the module. In addition to these similarities, we also proposed a new method that uses the probabilities of predicted phonemes from a term, and we compared each string similarity on the tasks of matching entity names. The definitions of each method are as follows.

The overlap coefficient measures the overlap between two sets, and is defined as the size of the intersection divided by the smaller of the sizes of the two sets. For two sets X and Y, the overlap coefficient is:

|X ∩ Y| / min(|X|, |Y|)   (3)

Cosine similarity measures the cosine of the angle between two non-zero vectors; we use the Ochiai coefficient as the angular cosine similarity and the normalized angle. This measure computes:

|X ∩ Y| / √(|X| |Y|)   (4)

Jaccard similarity measures the similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets. For two sets X and Y, the Jaccard similarity score is:

|X ∩ Y| / |X ∪ Y|   (5)

Jaro-Winkler similarity (Winkler, 1999) measures the edit distance between two sequences. It uses a prefix scale which gives weight to strings that match from the beginning for a set prefix length. For two strings s_X and s_Y, when |s_i| is the length of the string s_i, m is the number of matching characters, t is half the number of transpositions, and l is the length of the common prefix at the start of a string up to a maximum of four characters, the Jaro-Winkler similarity sim_w is:

sim_w = sim_j + l · 0.1 · (1 − sim_j)
sim_j = (1/3) (m/|s_X| + m/|s_Y| + (m − t)/m)   (6)

Levenshtein distance measures the edit distance between two sequences. It computes the minimum cost over all edit procedures to transform one string into the other, where transforming a string means deleting a character, inserting a character, or substituting one character for another.

The last similarity metric, which we proposed, is the similarity of the probabilities of the predicted phonemes for a term. The RNN model predicts phoneme probabilities, and we applied this metric to the English-Phoneme and Katakana-Phoneme models. The output layer is a sequence, and the output for a predicted phoneme at time t in our model is calculated as {"ei": 0.37, "e": 0.31, "æ": 0.23, ...} using the vector of the hidden layer of the decoder at time t. In this metric, we calculate the dynamic time warping distance (Müller, 2007) between the two temporal sequences. Dynamic time warping calculates an optimal match between two time-series sequences under certain restrictions. We regard each probability vector as a time point and use cosine similarity as the distance between two vectors. We computed all of the probability vectors from English and Katakana and measured the dynamic time warping distance.

[3] https://pypi.python.org/pypi/py_stringmatching
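The set-based measures and Levenshtein distance of Section 4.4 can be sketched directly. The paper computes them with py_stringmatching; here character sets stand in for the tokenizer output, which is an assumption for illustration only.

```python
import math

def overlap(x: set, y: set) -> float:
    # |X ∩ Y| / min(|X|, |Y|)   (Eq. 3)
    return len(x & y) / min(len(x), len(y))

def cosine(x: set, y: set) -> float:
    # Ochiai coefficient: |X ∩ Y| / sqrt(|X| |Y|)   (Eq. 4)
    return len(x & y) / math.sqrt(len(x) * len(y))

def jaccard(x: set, y: set) -> float:
    # |X ∩ Y| / |X ∪ Y|   (Eq. 5)
    return len(x & y) / len(x | y)

def levenshtein(a: str, b: str) -> int:
    # Minimum number of insertions, deletions, and substitutions,
    # computed row by row with dynamic programming.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]
```

For example, levenshtein("kitten", "sitting") is 3, and jaccard(set("abc"), set("abd")) is 0.5.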

Table 1: Hyper Parameters of the Best Model (Acc is accuracy on the validation split of the training data.)

Input    | Output   | Embedding | Hidden Layers | Input Reverse | Acc
English  | Katakana | 512       | 512           | False         | 0.765
Katakana | English  | 1024      | 512           | True          | 0.785
English  | Phoneme  | 1024      | 512           | True          | 0.826
Katakana | Phoneme  | 512       | 512           | True          | 0.811
Phoneme  | English  | 512       | 512           | True          | 0.815
Phoneme  | Katakana | 1024      | 512           | False         | 0.764
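The grid search behind Table 1 amounts to enumerating the candidate settings and keeping the model with the lowest validation loss. A minimal sketch, in which fake_validation_loss is a hypothetical stand-in for actually training a seq2seq model and returning its validation loss:

```python
from itertools import product

def grid_search(train_and_validate):
    """Enumerate embedding size, hidden size, and input reversal;
    keep the setting whose validation loss is lowest."""
    sizes = [256, 512, 1024]
    best_setting, best_loss = None, float("inf")
    for embedding, hidden, reverse in product(sizes, sizes, [True, False]):
        loss = train_and_validate(embedding=embedding, hidden=hidden, reverse=reverse)
        if loss < best_loss:
            best_setting, best_loss = (embedding, hidden, reverse), loss
    return best_setting, best_loss

# Hypothetical stub for illustration; a real run would train one model
# per setting and report its validation loss.
def fake_validation_loss(embedding, hidden, reverse):
    return abs(embedding - 512) / 1024 + abs(hidden - 512) / 1024 + (0.0 if reverse else 0.1)

best, loss = grid_search(fake_validation_loss)
```

With 3 × 3 × 2 = 18 combinations per translation direction, the enumeration itself is cheap; the cost is in training each candidate model.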

5 Experiments

5.1 Dictionary Data

We prepared corpora of English, Katakana, and phonemes for training and creating the models. We found a corpus of English and phonemes, and a corpus of English and Katakana. We created the data by merging each dictionary, and we made the original data ourselves as Japanese speakers.

At first, we used the dataset consisting of English, Katakana, and phonemes of English from Benson's dictionary [4], which is made from the JMDict dictionary (Breen, 1995); the total number of entries is 17798. However, this dataset contains a lot of mistakes, because the creators are not native Japanese speakers and they made it automatically. So, we removed the noise by hand as native Japanese speakers and succeeded in cleansing the data, of which 20% was occupied by noise. Second, we used the CMU Pronouncing Dictionary [5], which includes a large number of English terms and pronunciation signs but not Japanese Katakana. We combined them with Katakana via mecab-ipadic-NEologd [6]. mecab-ipadic-NEologd is a customized dictionary system for MeCab (Kudo, 2006), an open-source text segmentation library for text written in the Japanese language, and some of its terms have both English and Katakana expressions. We merged the CMU Pronouncing Dictionary and mecab-ipadic-NEologd where both dictionaries had terms. At last, we created an original dictionary by randomly extracting 3000 terms from the CMU Pronouncing Dictionary and attaching each Katakana term by hand. We eventually concatenated those dictionaries and deleted duplicates. The total number of entries is 26815.

5.2 Experimental Methodology

For the experiment, we prepared three validation datasets. One was test data that comprised 20% of the dictionary. Another was city names in the U.S. from the Google Maps API [7]. Google Maps provides a city name, and we were able to find the same U.S. city expressed in English and Japanese. We collected 1110 terms from Google Maps. The last was the restaurant store names in Japan from HOT PEPPER GOURMET [8], which is called HPG. HPG is one of the most famous search services that provides information and discount coupons for restaurants, cafes, bars, and any place to eat in Japan. We know some restaurants that have both names in alphabet characters and Katakana from HPG, and we built a validation dataset from them. Table 2 shows all of the data we used.

Table 2: Dataset for Experiment

Dataset                | Numbers of Entities                     | Data Source                                                                                                   | Example
Train and Test Data    | Total: 26815 (Train: 21452, Test: 5363) | Modified E. Benson's dictionary; CMU Pronouncing Dictionary and mecab-ipadic-NEologd; our original dictionary | English: artist / Katakana: アーティスト / Phoneme: AA R T IX S T AX
City Names in the U.S. | 1110                                    | Google Maps                                                                                                   | English: Phoenix / Katakana: フェニックス
Restaurant Names       | 2458                                    | HOT PEPPER GOURMET                                                                                            | English: ### Cafe / Katakana: ###カフェ

In terms of measuring accuracy, we used top-five precision. We calculated the similarity scores of all combinations of entities in the experiment data, evaluated the precision of the entities included in the top-five entities, and then compared the value of top-five precision for each model and similarity measure. The procedure of transliterating and measuring was as follows:

(1) En2Kana: We transliterated the sequence of alphabet characters into Katakana through the RNN model and computed the similarity between the transliterated Katakana and the Katakana terms.

(2) Kana2En: We transliterated the sequence of Katakana into English through the RNN model and computed the similarity between the transliterated English and the English terms.

(3) Both2Ph: We transliterated the sequence of alphabet characters into phonemes through the RNN model, transliterated the sequence of Katakana into phonemes through the RNN model, and computed the similarity between both of the transliterated phonemes, regarding one phoneme character as one index. In addition, we also used the probability vector similarity for this method, as stated at the end of subsection 4.4.

(4) En2Ph2Kana: We transliterated the sequence of alphabet characters into phonemes through the RNN model, transliterated the transliterated phonemes into Katakana through the RNN model, and computed the similarity between the transliterated Katakana and the Katakana terms.

(5) Kana2Ph2En: We transliterated the sequence of Katakana into phonemes through the RNN model, transliterated the transliterated phonemes into English through the RNN model, and computed the similarity between the transliterated English and the English terms.

(6) En2Romaji: We converted the sequence of alphabet characters into Katakana based on Romaji and computed the similarity between the Katakana of the Romaji and the Katakana terms.

(7) Kana2Romaji: We converted the sequence of Katakana into alphabets based on Romaji and computed the similarity between the alphabet characters of the Romaji and the English terms.

(8) Both2Soundex: We transliterated the sequence of alphabet characters into Soundex as pronunciation, transliterated the sequence of Katakana based on Romaji into Soundex, and computed the score of both of the transliterated phonemes. We regarded one Soundex character as one index.

[4] https://github.com/eob/englishjapanese-transliteration/blob/master/data/dictionary.txt
[5] http://www.speech.cs.cmu.edu/cgi-bin/cmudict
[6] https://github.com/neologd/mecab-ipadic-neologd
[7] https://developers.google.com/maps
[8] https://www.hotpepper-gourmet.com/en
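The top-five precision used above can be sketched as follows: for each source entity, score every candidate on the other side and count a hit when the true counterpart appears among the k highest-scoring candidates. The similarity function is a parameter, and the pairing of entities by index is an assumption made for illustration.

```python
def top_k_precision(sources, targets, similarity, k=5):
    """sources[i] and targets[i] are the same entity in two writing systems.
    Returns the fraction of sources whose true counterpart is ranked
    among the top-k candidates by the given similarity function."""
    hits = 0
    for i, src in enumerate(sources):
        ranked = sorted(range(len(targets)),
                        key=lambda j: similarity(src, targets[j]),
                        reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sources)
```

In the paper's setting, sources and targets would be the transliterated and original entity names from one of the eight procedures, and similarity one of the six metrics.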

5.3 Results

Figure 4 shows the top-five precision of all methods and similarity metrics for each validation dataset, and Table 3 shows the top-five precision of the predicted phoneme probability similarity.

Figure 4: Top-Five Precision for Each Method and Similarity Measure.
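The proposed phoneme-probability similarity of Section 4.4 aligns the two sequences of per-timestep probability vectors with dynamic time warping, using cosine distance between vectors as the local cost. A minimal sketch; the vectors in the usage example are toy values, not decoder outputs:

```python
import math

def cosine_distance(u, v):
    """1 minus cosine similarity between two probability vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def dtw(seq_a, seq_b, dist=cosine_distance):
    """Dynamic time warping distance between two sequences of vectors."""
    n, m = len(seq_a), len(seq_b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # step in seq_a only
                                 cost[i][j - 1],      # step in seq_b only
                                 cost[i - 1][j - 1])  # step in both
    return cost[n][m]

# Toy distributions over three phoneme candidates at each timestep.
english_probs  = [[0.37, 0.31, 0.23], [0.10, 0.80, 0.10]]
katakana_probs = [[0.40, 0.30, 0.21], [0.12, 0.75, 0.13]]
score = dtw(english_probs, katakana_probs)
```

A smaller DTW distance means the two predicted pronunciation sequences align closely, so entity pairs can be ranked by this score.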

First, we compared each of the eight methods with the similarity scores of Overlap Coefficient, Cosine, Jaccard, Jaro-Winkler, and Levenshtein. The tendency of the accuracy is similar across almost all methods in each dataset. In terms of the dictionary test data and the city name data, Kana2En and Both2Ph have the highest accuracy, at 0.83 and 0.81 in Levenshtein distance. En2Kana was the next highest, with an accuracy of 0.74 in Levenshtein. Regarding En2Ph2Kana, Kana2Ph2En, and Kana2Romaji, the accuracy was not high on any data. Likewise, even in the restaurant data, the accuracy of Kana2En and Both2Ph was relatively high, at 0.55 and 0.46 in Levenshtein. However, regarding En2Romaji, it became clear that the accuracy differed greatly depending on the dataset. In the dictionary test data, its accuracy was the lowest, at 0.36 in Levenshtein distance, and in the city name data, the accuracy of En2Romaji was relatively low. On the other hand, in the restaurant data, En2Romaji had the highest accuracy, at 0.57 in Levenshtein.

Second, focusing on the similarity measures, the accuracy trends are similar for almost all methods in each dataset, although the accuracy differs depending on the dataset. Levenshtein and Jaro-Winkler are the highest in almost all methods, and Overlap is the lowest. In other words, this shows that it is better to use the edit distance for word similarity in entity matching than other distances that compare the same character strings.

Lastly, we considered the similarity of the probability vectors. In fact, this similarity achieved the highest score. The accuracy on the dictionary test data and the city name data was over 0.9, which is about 10% higher than the second highest score. The restaurant data accuracy was 0.62, which is 5% higher than the second highest score. These results indicate that pronunciation is crucial for entity matching.

Table 3: Top-Five Precision of the Similarity of the Predicted Phoneme Probability Vector

Test Data of Dict | City Names | Restaurant Names
0.91              | 0.92       | 0.62

5.4 Error Analysis and Discussion

Regarding the dictionary test data and the city name data, the accuracy was over 80%, whereas the accuracy on the restaurant data was under 70%. Focusing on this phenomenon, we analyzed how words were converted in each dataset. Table 4 shows some examples of each conversion.

Table 4: Example Results of Transliteration.

While short words such as "switch" and "Benton" were transformed cleanly, long words such as "coconut milk" and "South San Francisco" could not be predicted accurately. This is caused by the lack of long words in the training data. As a solution, it would be beneficial to adapt the algorithm to long terms and to extend the dataset. However, in the case of Japanese, it is difficult to divide a long word into two or more words, because in contrast to English terms, Japanese terms are not separated by spaces.

Furthermore, in the restaurant data, we investigated why the accuracy was extremely low and found three considerable reasons in addition to word length. One reason is that there were many alphabetical representations of foreign languages other than English. For example, "Amore" is Italian, and "MAI-THAI" is Thai. Since our training data consisted only of pure English words, and the model was created for English, we can treat pure English terms such as the dictionary data or the American city name data, but terms of other foreign languages cannot be converted accurately. To solve the problem, it is essential to create a model for each language besides English, and to create a model that recognizes in what language a word is written. The second is that there are some shops written in Romaji characters, such as "AKICHI", which is not English but the Romaji representation of Katakana. Therefore, there should also be a model that determines whether a word is a foreign word or Romaji. The third is mistakes in the datasets. In this study, we extracted both alphabetical and Katakana stores automatically from the HPG database, but there was a pattern in which the English and Katakana combinations did not correspond completely. One of them is abbreviation, such as "Cafe X" and "X"; prefixes are sometimes omitted in the data.

Considering the similarities, the predicted phoneme probability had the highest score. Katakana is a segmental script, and each term is based on sounds, but some exceptions are changed somehow by implicit Japanese rules. Therefore, we cannot predict the pronunciation perfectly. However, because we can predict candidates for the pronunciation as a phoneme probability vector like {"ei": 0.37, "e": 0.31, "æ": 0.23, ...}, we could match entities using the vector similarity more accurately. This is why the probability vector metric outperforms all other metrics. We also computed other top-N precisions, and the result was a precision (N=10) of 67% and a precision (N=30) of 74%. This opens the way for narrowing down candidates by entity matching with the name alone.

For future work, in order to improve accuracy on the restaurant data, we could implement a few procedures. The first would be cleaning the datasets, because there are some pairs that do not correspond. This time we checked the non-matching pairs quickly, but because of the large volume of the datasets, some amount of noise could remain. This could be solved by crowdsourcing with many Japanese speakers. The second would be polishing the transliteration model. We tried to use an attention mechanism for the RNN model, but the accuracy was not good. We may polish the model by adjusting the best hyperparameters. Moreover, we could train pair data of English and Katakana at the same time instead of independently to create the model. In the entity matching task on the actual restaurant data, if we detect 60% through the top-five precision, entity matching would become easy by using additional data such as postal-code-level addresses.

6 Conclusion

In this paper we built two research questions to clarify how to solve heterogeneous entity matching. The first question was what should be used for conversion in entity matching between English and Japanese Katakana. We proposed an entity matching method that considers phoneme symbols, and compared models that convert a term into English, Katakana, and phonemes. The second question was what should be used for the similarity metric. We proposed a similarity method that uses a phoneme probability vector predicted by sequence-to-sequence models and compared six metrics, which are Overlap Coefficient, Cosine, Jaccard, Jaro-Winkler, Levenshtein, and the similarity of phonemes.

Our experiments on the three real datasets showed that 1) a phoneme matching system works better than other methods, and 2) the similarity of phonemes in the phoneme method and Levenshtein similarity in the other methods suit the entity matching problems. We maintain that phoneme information is crucial for heterogeneous entity matching.

Based on these results, we will build an entity matching system between English and Japanese Katakana and publish a part of it as open-source software. We can apply the system to the extension of a dataset from a vast amount of data on the web and to rare-query expansion on a search engine. Almost all of our services have a search engine, and in some services we are using a part of the entity matching system to reduce the number of zero-hit results. In the future, the model is expected to be polished and applied to other languages. In this paper it is tested on Japanese and English, but it could potentially be used for other languages. We believe that our method could apply to any language that has phonetic characters.

Acknowledgments

In this research, we would like to thank Wang-Chiew Tan and Akiko Ito for crucial advice and support of this work. We also thank the reviewers for their valuable feedback.

References

Edward Benson, Stephen Pueblo, Fuming Shih, and Robert Berwick. 2009. English-Japanese transliteration.

Jim Breen. 1995. Building an electronic Japanese-English dictionary. In Japanese Studies Association of Australia Conference. Citeseer.

Zhiyuan Cai, Kaiqi Zhao, Kenny Q. Zhu, and Haixun Wang. 2013. Wikification via link co-occurrence. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pages 1087–1096. ACM.

David Carmel, Ming-Wei Chang, Evgeniy Gabrilovich, Bo-June Paul Hsu, and Kuansan Wang. 2014. ERD'14: Entity recognition and disambiguation challenge. In ACM SIGIR Forum, volume 48, pages 63–77. ACM.

William Cohen, Pradeep Ravikumar, and Stephen Fienberg. 2003. A comparison of string metrics for matching names and records. In KDD Workshop on Data Cleaning and Object Consolidation, volume 3, pages 73–78.

John DeFrancis. 1984. Digraphia. Word, 35(1):59–66.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Taku Kudo. 2006. MeCab: Yet another part-of-speech and morphological analyzer. http://mecab.sourceforge.net.

Assunta Martin. 2004. The 'katakana effect' and teaching English in Japan. English Today, 20(1):50–55.

Sidharth Mudgal, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, and Vijay Raghavendra. 2018. Deep learning for entity matching: A design space exploration. In Proceedings of the 2018 International Conference on Management of Data, pages 19–34. ACM.

Meinard Müller. 2007. Dynamic time warping. Information Retrieval for Music and Motion, pages 69–84.

Emir Muñoz, Aidan Hogan, and Alessandra Mileo. 2014. Using linked data to mine RDF from Wikipedia's tables. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining, pages 533–542. ACM.

Noriko Nagata. 1998. Input vs. output practice in educational software for second language acquisition. Language Learning and Technology, 1:23–40.

Pramod Pandey and Somnath Roy. 2017. A generative model of a pronunciation lexicon for Hindi. arXiv preprint arXiv:1705.02452.

Kanishka Rao, Fuchun Peng, Haşim Sak, and Françoise Beaufays. 2015. Grapheme-to-phoneme conversion using long short-term memory recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pages 4225–4229. IEEE.

Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681.

Yoones A. Sekhavat, Francesco Di Paolo, Denilson Barbosa, and Paolo Merialdo. 2014. Knowledge base augmentation using tabular data. In LDOW.

Wei Shen, Jianyong Wang, and Jiawei Han. 2015. Entity linking with a knowledge base: Issues, techniques, and solutions. IEEE Transactions on Knowledge and Data Engineering, 27(2):443–460.

Janet S. Smith. 1996. Japanese writing. The World's Writing Systems, pages 209–217.

Ye Kyaw Thu, Win Pa Pa, Yoshinori Sagisaka, and Naoto Iwahashi. 2016. Comparison of grapheme-to-phoneme conversion methods on a Myanmar pronunciation dictionary. In Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016), pages 11–22.

Shubham Toshniwal and Karen Livescu. 2016. Jointly learning to align and convert graphemes to phonemes with neural attention models. In Spoken Language Technology Workshop (SLT), 2016 IEEE, pages 76–82. IEEE.

William E. Winkler. 1999. The state of record linkage and current research problems. In Statistical Research Division, US Census Bureau. Citeseer.

Ian H. Witten, Eibe Frank, Mark A. Hall, and Christopher J. Pal. 2016. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann.

Kaisheng Yao and Geoffrey Zweig. 2015. Sequence-to-sequence neural net models for grapheme-to-phoneme conversion. In INTERSPEECH 2015, pages 3330–3334. ISCA.
