
Correcting Keyboard Layout Errors and Homoglyphs in Queries

Derek Barnes, Mahesh Joshi, Hassan Sawaf
[email protected] [email protected] [email protected]

eBay Inc., 2065 Hamilton Ave, San Jose, CA 95125, USA

Abstract

Keyboard layout errors and homoglyphs in cross-language queries impact our ability to correctly interpret user information needs and offer relevant results. We present a machine learning approach to correcting these errors, based largely on character-level n-gram features. We demonstrate superior performance over rule-based methods, as well as a significant reduction in the number of queries that yield null search results.

1 Introduction

The success of an eCommerce site depends on how well users are connected with products and services of interest. Users typically communicate their desires through search queries; however, queries are often incomplete and contain errors, which impact the quantity and quality of search results.

New challenges arise for search engines in cross-border eCommerce. In this paper, we focus on two cross-linguistic phenomena that make interpreting queries difficult: (i) Homoglyphs (Miller, 2013): tokens such as "case" (underlined letters Cyrillic), in which users mix characters from different character sets that are visually similar or identical. For instance, English and Russian share homoglyphs such as c, a, e, o, and p. Although the letters are visually similar or in some cases identical, the underlying character codes are different. (ii) Keyboard Layout Errors (KLEs) (Baytin et al., 2013): when switching one's keyboard between language modes, users at times enter terms in the wrong character set. For instance, "чехол шзфв" may appear to be a Russian query. While "чехол" is the Russian word for "case", "шзфв" is actually the user's attempt to enter the characters "ipad" while leaving their keyboard in Russian mode. Queries containing KLEs or homoglyphs are unlikely to produce any search results, unless the intended ASCII sequences can be recovered. In a test set sampled from Russian/English queries with null (i.e. empty) search results (see Section 3.1), we found that approximately 7.8% contained at least one KLE or homoglyph.

In this paper, we present a machine learning approach to identifying and correcting query tokens containing homoglyphs and KLEs. We show that the proposed method offers superior accuracy over rule-based methods, as well as significant improvement in search recall. Although we focus our results on Russian/English queries, the techniques (particularly for KLEs) can be applied to other language pairs that use different character sets, such as Korean-English and Thai-English.

2 Methodology

In cross-border trade at eBay, multilingual queries are translated into the inventory's source language prior to search. A key application of this, and the focus of this paper, is the translation of Russian queries into English, in order to provide Russian users a more convenient interface to English-based inventory in North America. The presence of KLEs and homoglyphs in multilingual queries, however, leads to poor query translations, which in turn increases the incidence of null search results. We have found that null search results correlate with users exiting our site.

In this work, we seek to correct for KLEs and homoglyphs, thereby improving query translation, reducing the incidence of null search results, and increasing user engagement. Prior to translation and search, we preprocess multilingual queries by identifying and transforming KLEs and homoglyphs as follows (we use the query "чехол шзфв 2 new" as a running example):

(a) Tag Tokens: label each query token with one of the following semantically motivated classes, which identify the user's information need: (i) E: a token intended as an English search term; (ii) R: a Cyrillic token intended as a Russian search term; (iii) K: a KLE, e.g. "шзфв" for the term "ipad"; a token intended as an English search term, but at least partially entered in the Russian keyboard layout; (iv) H: a Russian homoglyph for an English term, e.g. "вмw" (underlined letters Cyrillic), which employs visually similar letters from the Cyrillic character set when spelling an intended English term; (v) A: ambiguous tokens, consisting of numbers and characters with equivalent codes that can be entered in both Russian and English keyboard layouts. Given the above classes, our example query "чехол шзфв 2 new" should be tagged as "R K A E".

(b) Transform Queries: apply a deterministic mapping (sketched in code below) to transform KLE and homoglyph tokens from Cyrillic to ASCII characters. For KLEs, the transformation maps between characters that share the same location in the Russian and English keyboard layouts (e.g. ф → a, ы → s). For homoglyphs, the transformation maps between a smaller set of visually similar characters (e.g. е → e, м → m). Our example query would be transformed into "чехол ipad 2 new".

(c) Translate and Search: translate the transformed query (into "case ipad 2 new" for our example), and dispatch it to the search engine.

In this paper, we formulate the token-level tagging task as a standard multiclass classification problem (each token is labeled independently), as well as a sequence labeling problem (a first-order conditional Markov model). In order to provide end-to-end results, we preprocess queries by deterministically transforming into ASCII the tokens tagged by our model as KLEs or homoglyphs. We conclude by presenting an evaluation of the impact of this transformation on search.
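The deterministic mappings in step (b) amount to two lookup tables. The following is a minimal Python sketch; the table contents are abbreviated illustrations (the paper does not list the full ЙЦУКЕН/QWERTY key pairing or homoglyph inventory), and the function name is ours.

```python
# Sketch of the deterministic Cyrillic-to-ASCII mappings in step (b).
# Both tables are abbreviated for illustration; a full implementation
# covers the complete ЙЦУКЕН/QWERTY key pairing and homoglyph inventory.

KLE_MAP = {  # Cyrillic letter -> ASCII character on the same physical key
    "ш": "i", "з": "p", "ф": "a", "в": "d",  # covers "шзфв" -> "ipad"
    "ы": "s", "й": "q", "ц": "w", "у": "e",
}

HOMOGLYPH_MAP = {  # visually similar Cyrillic -> ASCII letters
    "а": "a", "е": "e", "о": "o", "р": "p",
    "с": "c", "х": "x", "в": "b", "м": "m",
}

def transform(token: str, tag: str) -> str:
    """Map a token tagged K (KLE) or H (homoglyph) into ASCII."""
    if tag == "K":
        return "".join(KLE_MAP.get(ch, ch) for ch in token)
    if tag == "H":
        return "".join(HOMOGLYPH_MAP.get(ch, ch) for ch in token)
    return token  # E, R, and A tokens pass through unchanged

assert transform("шзфв", "K") == "ipad"
assert transform("вмw", "H") == "bmw"
```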
2.1 Features

Our classification and sequence models share a common set of features, grouped into the following categories.

2.1.1 Language Model Features

A series of 5-gram, character-level language models (LMs) capture the structure of different types of words. Intuitively, valid Russian terms will have high probability in Russian LMs. In contrast, KLE or homoglyph tokens, despite appearing on the surface to be Russian terms, will generally have low probability in the LMs trained on valid Russian words. Once mapped into ASCII (see Section 2 above), however, these tokens tend to have higher probability in the English LMs. LMs are trained on the following corpora:

English and Russian Vocabulary: based on a collection of open source, parallel English/Russian corpora (∼50M words in all).

English Brands: built from a curated list of 35K English brand names, which often have distinctive linguistic properties compared with common English words (Lowrey et al., 2013).

Russian Transliterations: built from a collection of Russian transliterations of proper names from Wikipedia (the Russian portion of guessed-names.ru-en, made available as part of WMT 2013¹).

For every input token, each of the above LMs fires a real-valued feature: the negated log-probability of the token under the given language model. Additionally, for tokens containing Cyrillic characters, we consider the token's KLE and homoglyph ASCII mappings, where available. For each mapping, a real-valued feature fires corresponding to the negated log-probability of the mapped token under the English and Brands LMs. Lastly, an equivalent set of LM features fires for the two preceding and following tokens around the current token, if applicable.

¹ www.statmt.org/wmt13/translation-task.html#download
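For concreteness, the sketch below derives the negated log-probability feature from a character-level 5-gram model with add-one smoothing. The paper does not specify a toolkit or smoothing scheme, so those details, and all names here, are assumptions.

```python
import math
from collections import defaultdict

class CharNgramLM:
    """Character-level n-gram LM with add-one smoothing (the smoothing
    scheme is our assumption; the paper specifies only 5-gram char LMs)."""

    def __init__(self, order: int = 5):
        self.order = order
        self.ngram_counts = defaultdict(int)
        self.context_counts = defaultdict(int)
        self.charset = set()

    def _ngrams(self, word: str):
        padded = "^" * (self.order - 1) + word + "$"
        for i in range(len(padded) - self.order + 1):
            yield padded[i:i + self.order]

    def train(self, words):
        for word in words:
            self.charset.update(word)
            for gram in self._ngrams(word):
                self.ngram_counts[gram] += 1
                self.context_counts[gram[:-1]] += 1

    def neg_logprob(self, word: str) -> float:
        """Feature value: negated log-probability of a token."""
        vocab_size = len(self.charset) + 2  # +2 for the padding symbols
        cost = 0.0
        for gram in self._ngrams(word):
            numerator = self.ngram_counts[gram] + 1
            denominator = self.context_counts[gram[:-1]] + vocab_size
            cost -= math.log(numerator / denominator)
        return cost

# One LM per corpus; each fires one real-valued feature per token, and
# additional features fire for the KLE/homoglyph ASCII mappings.
english_lm = CharNgramLM()
english_lm.train(["ipad", "case", "cover", "new"])  # toy training vocabulary
print(english_lm.neg_logprob("ipad"))  # low cost: English-like string
print(english_lm.neg_logprob("шзфв"))  # high cost before KLE mapping
```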

2.1.2 Token Features

We include several features commonly used in token-level tagging problems, such as case and shape features, token class (such as letters-only or digits-only), position of the token within the query, and token length. In addition, we include features indicating the presence of characters from the ASCII and/or Cyrillic character sets.

2.1.3 Dictionary Features

We incorporate a set of features that indicate whether a given lowercased query token is a member of one of the lexicons described below.

UNIX: The English dictionary shipped with CentOS, including ∼480K entries, used as a lexicon of common English words.

BRANDS: An expanded version of the curated list of brand names used for LM features. Includes ∼58K brands.

PRODUCT TITLES: A lexicon of over 1.6M entries extracted from a collection of 10M product titles from eBay's North American inventory.

QUERY LOGS: A larger, in-domain collection of approximately 5M entries extracted from ∼100M English search queries on eBay.

Dictionary features fire for Cyrillic tokens when the KLE and/or homoglyph-mapped version of the token appears in the above lexicons. Dictionary features are binary for the UNIX and BRANDS dictionaries, and weighted by the relative frequency of the entry for the PRODUCT TITLES and QUERY LOGS dictionaries.
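A sketch of how these dictionary features might be computed; the lexicon contents, frequencies, and feature names below are illustrative assumptions, not the production resources.

```python
# Illustrative dictionary-feature extraction; the lexicon contents and
# feature names here are toy stand-ins for the real resources.
UNIX = {"case", "cover", "new"}
BRANDS = {"ipad", "bmw"}
TITLE_FREQS = {"ipad": 0.012, "case": 0.034}  # relative frequencies

def dictionary_features(kle_mapped=None, homoglyph_mapped=None):
    """Features fire for a Cyrillic token's KLE/homoglyph ASCII mappings."""
    feats = {}
    for name, mapped in (("kle", kle_mapped), ("homoglyph", homoglyph_mapped)):
        if mapped is None:
            continue
        word = mapped.lower()
        feats[f"{name}_in_unix"] = float(word in UNIX)        # binary
        feats[f"{name}_in_brands"] = float(word in BRANDS)    # binary
        feats[f"{name}_title_freq"] = TITLE_FREQS.get(word, 0.0)  # weighted
    return feats

# The KLE mapping of "шзфв" is "ipad", which hits the brand lexicon:
print(dictionary_features(kle_mapped="ipad"))
```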
3 Experiments

3.1 Datasets

The following datasets were used for training and evaluating the baseline (see Section 3.2 below) and our proposed systems:

Training Set: A training set of 6472 human-labeled query examples (17,239 tokens).

In-Domain Query Test Set: A set of 2500 Russian/English queries (8,357 tokens) randomly selected from queries with null search results. By focusing on queries with null results, we emphasize the presence of KLEs and homoglyphs, which occur in 7.8% of queries in our test set.

Queries were labeled by a team of Russian language specialists. The test set was also independently reviewed, which resulted in the correction of labels for 8 out of the 8,357 query tokens.

Although our test set is representative of the types of problematic queries targeted by our model, our training data was not sampled using the same methodology. We expect that the differences in distributions between training and test sets, if anything, make the results reported in Section 3.3 somewhat pessimistic².

² As expected, cross-validation experiments on the training data (for parameter tuning) yielded results slightly higher than the results reported in Section 3.3, which use a held-out test set.

3.2 Dictionary Baseline

We implemented a rule-based baseline system employing the dictionaries described in Section 2.1.3. In this system, each token was assigned a class k ∈ {E, R, K, H, A} using a set of rules: a token among a list of 101 Russian stopwords³ is tagged as R. A token containing only ASCII characters is labeled as A if all characters are common to English and Russian keyboards (i.e. numbers and some punctuation), otherwise E. For tokens containing Cyrillic characters, KLE- and homoglyph-mapped versions are searched in our dictionaries. If found, K or H is assigned. If both mapped versions are found in the dictionaries, then either K or H is assigned probabilistically⁴. In cases where neither mapped version is found in the dictionary, the token is assigned either R or A, depending on whether it consists of purely Cyrillic characters, or a mix of Cyrillic and ASCII, respectively.

³ Taken from the Russian Analyzer packaged with Lucene — see lucene.apache.org.
⁴ We experimented with selecting K or H based on a prior computed from training data; however, results were lower than those reported, which use random selection.

Note that the above tagging rules allow tokens with classes E and A to be identified with perfect accuracy. As a result, we omit these classes from all results reported in this work. We also note that this simplification applies because we have restricted our attention to the Russian → English direction. In the bidirectional case, ASCII tokens could represent either English tokens or KLEs (i.e. a Russian term entered in the English keyboard layout). We leave the joint treatment of the bidirectional case to future work.

Tag   Prec   Recall   F1
K     .528   .924     .672
H     .347   .510     .413
R     .996   .967     .982

Table 1: Baseline results on the test set, using the UNIX, BRANDS, and PRODUCT TITLES dictionaries.

We experimented with different combinations of dictionaries, and found the best combination to be the UNIX, BRANDS, and PRODUCT TITLES dictionaries (see Table 1). We observed a sharp decrease in precision when incorporating the QUERY LOGS dictionary, likely due to noise in the user-generated content.

Error analysis suggests that shorter words are the most problematic for the baseline system⁵. Shorter Cyrillic tokens, when transformed from Cyrillic to ASCII using KLE or homoglyph mappings, have a higher probability of spuriously mapping to valid English acronyms, model IDs, or short words. For instance, the Russian car brand "ваз" maps across keyboard layouts to "dfp", an acronym commonly used in product titles for "Digital Flat Panel". The Russian words "муки" and "рук" similarly map by chance to the English words "verb" and "her".

⁵ Stopwords are particularly problematic, and hence excluded from consideration as KLEs or homoglyphs.
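The baseline's rule cascade can be written down compactly. A sketch, assuming hypothetical helper predicates for the character-set tests and an `in_dict` callback for the dictionary lookup (all names are ours):

```python
import random

RUSSIAN_STOPWORDS = {"и", "в", "на", "с", "для"}  # stand-in for the 101-word list

def is_ascii(text: str) -> bool:
    return all(ord(ch) < 128 for ch in text)

def is_shared(text: str) -> bool:
    # Characters typeable in both layouts: digits and some punctuation.
    return all(ch.isdigit() or ch in ".,;-+/" for ch in text)

def baseline_tag(token, kle_mapped=None, homoglyph_mapped=None,
                 in_dict=lambda w: False):
    """Rule cascade of Section 3.2; `in_dict` abstracts membership in the
    UNIX, BRANDS, and PRODUCT TITLES lexicons."""
    if token.lower() in RUSSIAN_STOPWORDS:
        return "R"  # stopwords are never treated as KLEs or homoglyphs
    if is_ascii(token):
        return "A" if is_shared(token) else "E"
    kle_hit = kle_mapped is not None and in_dict(kle_mapped)
    homo_hit = homoglyph_mapped is not None and in_dict(homoglyph_mapped)
    if kle_hit and homo_hit:
        return random.choice(["K", "H"])  # both mappings found: pick randomly
    if kle_hit:
        return "K"
    if homo_hit:
        return "H"
    # Neither mapping found: pure Cyrillic -> R, mixed Cyrillic/ASCII -> A
    return "A" if any(ord(ch) < 128 for ch in token) else "R"
```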

A related problem occurs with product model IDs, and highlights the limits of treating query tokens independently. Consider the Cyrillic query "БМВ e46". The first token is a Russian transliteration of the BMW brand. The second token, "e46", has three possible interpretations: i) as a Russian token; ii) as a homoglyph for ASCII "e46"; or iii) as a KLE for "t46". It is difficult to discriminate between these options without considering token context, and in this case having some prior knowledge that e46 is a BMW model.

3.3 Machine Learning Models

We trained linear classification models using logistic regression (LR)⁶, and non-linear models using random forests (RFs), using implementations from the Scikit-learn package (Pedregosa et al., 2011). Sequence models are implemented as first-order conditional Markov models by applying a beam search (k = 3) on top of the LR and RF classifiers (see the sketch at the end of this subsection). The LR and RF models were tuned using 5-fold cross-validation results, with models selected based on the mean F1 score across the R, K, and H tags.

⁶ Although CRFs are state-of-the-art for many tagging problems, in our experiments they yielded results slightly lower than the LR or RF models.

              Classification        Sequence
     Tag     P     R     F1       P     R     F1
LR   K     .925  .944  .935    .915  .934  .925
     H     .708  .667  .687    .686  .686  .686
     R     .996  .997  .996    .997  .996  .997
RF   K     .926  .949  .937    .935  .949  .942
     H     .732  .588  .652    .750  .588  .659
     R     .997  .997  .997    .996  .998  .997

Table 2: Classification and sequence tagging results on the test set.

Table 2 shows the token-level results on our in-domain test set. As with the baseline, we focus the model on disambiguating between classes R, K and H. Each of the reported models performs significantly better than the baseline (on each tag), with statistical significance evaluated using McNemar's test. The differences between LR and RF models, as well as between sequence and classification variants, however, are not statistically significant. Each of the machine learning models achieves a query-level accuracy score of roughly 98% (the LR sequence model achieved the lowest with 97.78%, the RF sequence model the highest with 97.90%).

Our feature ablation experiments show that the majority of the predictive power comes from the character-level LM features. Dropping LM features results in a significant reduction in performance (F1 scores of .878 and .638 for the RF sequence model on classes K and H). These results are still significantly above the baseline, suggesting that token and dictionary features are by themselves good predictors. However, we do not see a similar performance reduction when dropping these feature groups.

We experimented with lexical features, which are commonly used in token-level tagging problems. Results, however, were slightly lower than the results reported in this section. We suspect the issue is one of overfitting, due to the limited size of our training data, and the general sparsity associated with lexical features. Continuous word representations (Mikolov et al., 2013), noted as future work, may offer improved generalization.

Error analysis for our machine learning models suggests patterns similar to those reported in Section 3.2. Although errors are significantly less frequent than in our dictionary baseline, shorter words still present the most difficulty. We note as future work the use of word-level LM scores to target errors with shorter words.
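A minimal sketch of the beam-search decoding step, assuming a hypothetical `classifier_probs` callback that wraps the trained LR or RF classifier; exactly how the previous tag enters the feature set is our simplification of the first-order conditional Markov model.

```python
import math

def beam_search_tag(tokens, classifier_probs, k=3):
    """Width-k beam search over per-token class posteriors.
    `classifier_probs(token, prev_tag)` -> {tag: probability}; it stands
    in for the LR/RF classifier, with the previous tag acting as the
    Markov conditioning feature."""
    beam = [(0.0, [])]  # (cumulative log-probability, partial tag sequence)
    for token in tokens:
        candidates = []
        for logp, seq in beam:
            prev_tag = seq[-1] if seq else "<s>"
            for tag, p in classifier_probs(token, prev_tag).items():
                if p > 0.0:
                    candidates.append((logp + math.log(p), seq + [tag]))
        # Keep only the k highest-scoring partial hypotheses.
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
    return beam[0][1]  # tag sequence of the best hypothesis
```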

3.4 Search Results

Recall that we translate multilingual queries into English prior to search. KLEs and homoglyphs in queries result in poor query translations, often leading to null search results.

To evaluate the impact of KLE and homoglyph correction, we consider a set of 100k randomly selected Russian/English queries. We consider the subset of queries that the RF or baseline models predict as containing a KLE or homoglyph. Next, we translate into English both the original query, as well as a transformed version of it, with KLEs and homoglyphs replaced with their ASCII mappings. Lastly, we execute independent searches using the original and transformed query translations.

                     Baseline          RF model
#Transformed         12,661            7,364
Null → Non-null      3,078 (24.3%)     3,142 (42.7%)
Non-null → Null      2,651 (20.9%)     354 (4.81%)

Table 3: Impact of KLE and homoglyph correction on search results for 100k queries.

Table 3 provides details on search results for the original and transformed queries. The baseline model transforms over 12.6% of the 100k queries. Of those, 24.3% yield search results where the unmodified queries had null search results (i.e. Null → Non-null). In 20.9% of the cases, however, the transformations are destructive (i.e. Non-null → Null), and yield null results where the unmodified query produced results.

Compared with the baseline, the RF model transforms only 7.4% of the 100k queries; a fraction that is roughly in line with the 7.8% of queries in our test set that contain KLEs or homoglyphs. In over 42% of the cases (versus 24.3% for the baseline), the transformed query generates search results where the original query yields none. Only 4.81% of the transformations using the RF model are destructive; a fraction significantly lower than the baseline's.

Note that we distinguish here only between queries that produce null results and those that do not. We do not include queries for which the original and transformed queries both produce (potentially differing) search results. Evaluating these cases requires deeper insight into the relevance of search results, which is left as future work.
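The bucket counts in Table 3 reduce to pairing hit counts for the two translations of each query. A sketch, with `search_count` as a hypothetical stand-in for the search backend:

```python
def impact_buckets(query_pairs, search_count):
    """Tally the Null -> Non-null and Non-null -> Null transitions of
    Table 3. `search_count(query)` is a hypothetical stand-in for the
    search backend, returning the number of hits for a translated query."""
    null_to_nonnull = nonnull_to_null = 0
    for original, transformed in query_pairs:
        before, after = search_count(original), search_count(transformed)
        if before == 0 and after > 0:
            null_to_nonnull += 1   # transformation recovered results
        elif before > 0 and after == 0:
            nonnull_to_null += 1   # destructive transformation
        # Pairs with results on both sides are not scored (see text above).
    return null_to_nonnull, nonnull_to_null
```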
4 Related Work

Baytin et al. (2013) first refer to keyboard layout errors in their work. However, their focus is on predicting the performance of spell-correction, not on fixing the KLEs observed in their data. To our knowledge, our work is the first to introduce this problem and to propose a machine learning solution. Since our task is a token-level tagging problem, it is very similar to the part-of-speech (POS) tagging task (Ratnaparkhi, 1996), only with a very small set of candidate tags. We chose a supervised machine learning approach in order to achieve maximum precision. However, this problem can also be approached in an unsupervised setting, similar to the method Whitelaw et al. (2009) use for spelling correction. In that setup, the goal would be to directly choose the correct transformation for an ill-formed KLE or homoglyph, instead of a tagging step followed by a deterministic mapping to ASCII.

5 Conclusions and Future Work

We investigate two kinds of errors in search queries: keyboard layout errors (KLEs) and homoglyphs. Applying machine learning methods, we are able to accurately identify a user's intended query, in spite of the presence of KLEs and homoglyphs. The proposed models are based largely on compact, character-level language models. The proposed techniques, when applied to multilingual queries prior to translation and search, offer significant gains in search results.

In the future, we plan to focus on additional features to improve KLE and homoglyph discrimination for shorter words and acronyms. Although lexical features did not prove useful in this work, presumably due to data sparsity and overfitting issues, we intend to explore the application of continuous word representations (Mikolov et al., 2013). Compared with lexical features, we expect continuous representations to be less susceptible to overfitting, and to generalize better to unknown words. For instance, using continuous word representations, Turian et al. (2010) show significant gains for a named entity recognition task.

We also intend to explore the use of features from in-domain, word-level LMs. Word-level features are expected to be particularly useful in the case of spurious mappings (e.g. "ваз" vs. "dfp" from Section 3.2), where context from surrounding tokens in a query can often help in resolving ambiguity. Word-level features may also be useful in re-ranking translated queries prior to search, in order to reduce the incidence of erroneous query transformations generated through our methods. Finally, our future work will explore KLE and homoglyph correction bidirectionally, as opposed to the unidirectional approach explored in this work.

Acknowledgments

We would like to thank Jean-David Ruvini, Mike Dillinger, Saša Hasan, Irina Borisova and the anonymous reviewers for their valuable feedback. We also thank our Russian language specialists Tanya Badeka, Tatiana Kontsevich and Olga Pospelova for their support in labeling and reviewing datasets.

References

Alexey Baytin, Irina Galinskaya, Marina Panina, and Pavel Serdyukov. 2013. Speller performance prediction for query autocorrection. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pages 1821–1824.

Tina M. Lowrey, Larry J. Shrum, and Tony M. Dubitsky. 2013. The Relation Between Brand-name Linguistic Characteristics and Brand-name Memory. Journal of Advertising, 32(3):7–17.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

Tristan Miller. 2013. Russian–English Homoglyphs, Homographs, and Homographic Translations. Word Ways: The Journal of Recreational Linguistics, 46(3):165–168.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Adwait Ratnaparkhi. 1996. A Maximum Entropy Model for Part-of-Speech Tagging. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 133–142.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of ACL, pages 384–394.

Casey Whitelaw, Ben Hutchinson, Grace Y. Chung, and Ged Ellis. 2009. Using the Web for Language Independent Spellchecking and Autocorrection. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 890–899.
