
Correcting Keyboard Layout Errors and Homoglyphs in Queries

Derek Barnes, Mahesh Joshi, Hassan Sawaf
[email protected] [email protected] [email protected]

eBay Inc., 2065 Hamilton Ave, San Jose, CA 95125, USA

Abstract

Keyboard layout errors and homoglyphs in cross-language queries impact our ability to correctly interpret user information needs and offer relevant results. We present a machine learning approach to correcting these errors, based largely on character-level n-gram features. We demonstrate superior performance over rule-based methods, as well as a significant reduction in the number of queries that yield null search results.

1 Introduction

The success of an eCommerce site depends on how well users are connected with products and services of interest. Users typically communicate their desires through search queries; however, queries are often incomplete and contain errors, which impact the quantity and quality of search results.

New challenges arise for search engines in cross-border eCommerce. In this paper, we focus on two cross-linguistic phenomena that make interpreting queries difficult: (i) Homoglyphs (Miller, 2013): tokens such as "case" (underlined letters Cyrillic), in which users mix characters from different character sets that are visually similar or identical. For instance, English and Russian share homoglyphs such as c, a, e, o, and p. Although the letters are visually similar or in some cases identical, the underlying character codes are different. (ii) Keyboard Layout Errors (KLEs) (Baytin et al., 2013): when switching one's keyboard between language modes, users at times enter terms in the wrong character set. For instance, "чехол шзфв" may appear to be a Russian query. While "чехол" is the Russian word for "case", "шзфв" is actually the user's attempt to enter the characters "ipad" while leaving their keyboard in Russian mode. Queries containing KLEs or homoglyphs are unlikely to produce any search results, unless the intended ASCII sequences can be recovered. In a test set sampled from Russian/English queries with null (i.e. empty) search results (see Section 3.1), we found that approximately 7.8% contained at least one KLE or homoglyph.

In this paper, we present a machine learning approach to identifying and correcting query tokens containing homoglyphs and KLEs. We show that the proposed method offers superior accuracy over rule-based methods, as well as significant improvement in search recall. Although we focus our results on Russian/English queries, the techniques (particularly for KLEs) can be applied to other language pairs that use different character sets, such as Korean-English and Thai-English.

2 Methodology

In cross-border trade at eBay, multilingual queries are translated into the inventory's source language prior to search. A key application of this, and the focus of this paper, is the translation of Russian queries into English, in order to provide Russian users a more convenient interface to English-based inventory in North America. The presence of KLEs and homoglyphs in multilingual queries, however, leads to poor query translations, which in turn increases the incidence of null search results. We have found that null search results correlate with users exiting our site.

In this work, we seek to correct for KLEs and homoglyphs, thereby improving query translation, reducing the incidence of null search results, and increasing user engagement. Prior to translation and search, we preprocess multilingual queries by identifying and transforming KLEs and homoglyphs as follows (we use the query "чехол шзфв 2 new" as a running example):

(a) Tag Tokens: label each query token with one of the following semantically motivated classes, which identify the user's information need: (i) E: a token intended as an English search term; (ii) R: a Cyrillic token intended as a Russian search term; (iii) K: a KLE, e.g. "шзфв" for the term "ipad"; a token intended as an English search term, but at least partially entered in the Russian keyboard layout; (iv) H: a Russian homoglyph for an English term, e.g. "вмw" (underlined letters Cyrillic), which employs visually similar letters from the Cyrillic character set when spelling an intended English term; (v) A: ambiguous tokens, consisting of numbers and characters with equivalent codes that can be entered in both Russian and English keyboard layouts. Given the above classes, our example query "чехол шзфв 2 new" should be tagged as "R K A E".

(b) Transform Queries: apply a deterministic mapping (sketched in code below) to transform KLE and homoglyph tokens from Cyrillic to ASCII characters. For KLEs, the transformation maps between characters that share the same location in the Russian and English keyboard layouts (e.g. ф → a, ы → s). For homoglyphs, the transformation maps between a smaller set of visually similar characters (e.g. е → e, м → m). Our example query would be transformed into "чехол ipad 2 new".

(c) Translate and Search: translate the transformed query (into "case ipad 2 new" for our example), and dispatch it to the search engine.

In this paper, we formulate the token-level tagging task as a standard multiclass classification problem (each token is labeled independently), as well as a sequence labeling problem (a first-order conditional Markov model). In order to provide end-to-end results, we preprocess queries by deterministically transforming into ASCII the tokens tagged by our model as KLEs or homoglyphs. We conclude by presenting an evaluation of the impact of this transformation on search.
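The deterministic mappings in step (b) amount to two lookup tables. The following is a minimal Python sketch; the table contents are abbreviated illustrations (the paper does not list the full ЙЦУКЕН/QWERTY key pairing or homoglyph inventory), and the function name is ours.

```python
# Sketch of the deterministic Cyrillic-to-ASCII mappings in step (b).
# Both tables are abbreviated for illustration; a full implementation
# covers the complete ЙЦУКЕН/QWERTY key pairing and homoglyph inventory.

KLE_MAP = {  # Cyrillic letter -> ASCII character on the same physical key
    "ш": "i", "з": "p", "ф": "a", "в": "d",  # covers "шзфв" -> "ipad"
    "ы": "s", "й": "q", "ц": "w", "у": "e",
}

HOMOGLYPH_MAP = {  # visually similar Cyrillic -> ASCII letters
    "а": "a", "е": "e", "о": "o", "р": "p",
    "с": "c", "х": "x", "в": "b", "м": "m",
}

def transform(token: str, tag: str) -> str:
    """Map a token tagged K (KLE) or H (homoglyph) into ASCII."""
    if tag == "K":
        return "".join(KLE_MAP.get(ch, ch) for ch in token)
    if tag == "H":
        return "".join(HOMOGLYPH_MAP.get(ch, ch) for ch in token)
    return token  # E, R, and A tokens pass through unchanged

assert transform("шзфв", "K") == "ipad"
assert transform("вмw", "H") == "bmw"
```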
2.1 Features

Our classification and sequence models share a common set of features, grouped into the following categories.

2.1.1 Language Model Features

A series of 5-gram, character-level language models (LMs) capture the structure of different types of words. Intuitively, valid Russian terms will have high probability in Russian LMs. In contrast, KLE or homoglyph tokens, despite appearing on the surface to be Russian terms, will generally have low probability in the LMs trained on valid Russian words. Once mapped into ASCII (see Section 2 above), however, these tokens tend to have higher probability in the English LMs. LMs are trained on the following corpora:

English and Russian Vocabulary: based on a collection of open source, parallel English/Russian corpora (∼50M words in all).

English Brands: built from a curated list of 35K English brand names, which often have distinctive linguistic properties compared with common English words (Lowrey et al., 2013).

Russian Transliterations: built from a collection of Russian transliterations of proper names from Wikipedia (the Russian portion of guessed-names.ru-en, made available as part of WMT 2013¹).

For every input token, each of the above LMs fires a real-valued feature: the negated log-probability of the token under the given language model. Additionally, for tokens containing Cyrillic characters, we consider the token's KLE and homoglyph ASCII mappings, where available. For each mapping, a real-valued feature fires corresponding to the negated log-probability of the mapped token under the English and Brands LMs. Lastly, an equivalent set of LM features fires for the two preceding and following tokens around the current token, if applicable.

¹ www.statmt.org/wmt13/translation-task.html#download
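For concreteness, the sketch below derives the negated log-probability feature from a character-level 5-gram model with add-one smoothing. The paper does not specify a toolkit or smoothing scheme, so those details, and all names here, are assumptions.

```python
import math
from collections import defaultdict

class CharNgramLM:
    """Character-level n-gram LM with add-one smoothing (the smoothing
    scheme is our assumption; the paper specifies only 5-gram char LMs)."""

    def __init__(self, order: int = 5):
        self.order = order
        self.ngram_counts = defaultdict(int)
        self.context_counts = defaultdict(int)
        self.charset = set()

    def _ngrams(self, word: str):
        padded = "^" * (self.order - 1) + word + "$"
        for i in range(len(padded) - self.order + 1):
            yield padded[i:i + self.order]

    def train(self, words):
        for word in words:
            self.charset.update(word)
            for gram in self._ngrams(word):
                self.ngram_counts[gram] += 1
                self.context_counts[gram[:-1]] += 1

    def neg_logprob(self, word: str) -> float:
        """Feature value: negated log-probability of a token."""
        vocab_size = len(self.charset) + 2  # +2 for the padding symbols
        cost = 0.0
        for gram in self._ngrams(word):
            numerator = self.ngram_counts[gram] + 1
            denominator = self.context_counts[gram[:-1]] + vocab_size
            cost -= math.log(numerator / denominator)
        return cost

# One LM per corpus; each fires one real-valued feature per token, and
# additional features fire for the KLE/homoglyph ASCII mappings.
english_lm = CharNgramLM()
english_lm.train(["ipad", "case", "cover", "new"])  # toy training vocabulary
print(english_lm.neg_logprob("ipad"))  # low cost: English-like string
print(english_lm.neg_logprob("шзфв"))  # high cost before KLE mapping
```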

2.1.2 Token Features

We include several features commonly used in token-level tagging problems, such as case and shape features, token class (such as letters-only or digits-only), position of the token within the query, and token length. In addition, we include features indicating the presence of characters from the ASCII and/or Cyrillic character sets.

2.1.3 Dictionary Features

We incorporate a set of features that indicate whether a given lowercased query token is a member of one of the lexicons described below.

UNIX: The English dictionary shipped with CentOS, including ∼480K entries, used as a lexicon of common English words.

BRANDS: An expanded version of the curated list of brand names used for LM features. Includes ∼58K brands.

PRODUCT TITLES: A lexicon of over 1.6M entries extracted from a collection of 10M product titles from eBay's North American inventory.

QUERY LOGS: A larger, in-domain collection of approximately 5M entries extracted from ∼100M English search queries on eBay.

Dictionary features fire for Cyrillic tokens when the KLE and/or homoglyph-mapped version of the token appears in the above lexicons. Dictionary features are binary for the UNIX and BRANDS dictionaries, and weighted by the relative frequency of the entry for the PRODUCT TITLES and QUERY LOGS dictionaries.
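A sketch of how these dictionary features might be computed; the lexicon contents, frequencies, and feature names below are illustrative assumptions, not the production resources.

```python
# Illustrative dictionary-feature extraction; the lexicon contents and
# feature names here are toy stand-ins for the real resources.
UNIX = {"case", "cover", "new"}
BRANDS = {"ipad", "bmw"}
TITLE_FREQS = {"ipad": 0.012, "case": 0.034}  # relative frequencies

def dictionary_features(kle_mapped=None, homoglyph_mapped=None):
    """Features fire for a Cyrillic token's KLE/homoglyph ASCII mappings."""
    feats = {}
    for name, mapped in (("kle", kle_mapped), ("homoglyph", homoglyph_mapped)):
        if mapped is None:
            continue
        word = mapped.lower()
        feats[f"{name}_in_unix"] = float(word in UNIX)        # binary
        feats[f"{name}_in_brands"] = float(word in BRANDS)    # binary
        feats[f"{name}_title_freq"] = TITLE_FREQS.get(word, 0.0)  # weighted
    return feats

# The KLE mapping of "шзфв" is "ipad", which hits the brand lexicon:
print(dictionary_features(kle_mapped="ipad"))
```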
3 Experiments

3.1 Datasets

The following datasets were used for training and evaluating the baseline (see Section 3.2 below) and our proposed systems:

Training Set: A training set of 6472 human-labeled query examples (17,239 tokens).

In-Domain Query Test Set: A set of 2500 Russian/English queries (8,357 tokens) randomly selected from queries with null search results. By focusing on queries with null results, we emphasize the presence of KLEs and homoglyphs, which occur in 7.8% of queries in our test set.

Queries were labeled by a team of Russian language specialists. The test set was also independently reviewed, which resulted in the correction of labels for 8 out of the 8,357 query tokens.

Although our test set is representative of the types of problematic queries targeted by our model, our training data was not sampled using the same methodology. We expect that the differences in distributions between training and test sets, if anything, make the results reported in Section 3.3 somewhat pessimistic².

² As expected, cross-validation experiments on the training data (for parameter tuning) yielded results slightly higher than the results reported in Section 3.3, which use a held-out test set.

3.2 Dictionary Baseline

We implemented a rule-based baseline system employing the dictionaries described in Section 2.1.3. In this system, each token was assigned a class k ∈ {E, R, K, H, A} using a set of rules: a token among a list of 101 Russian stopwords³ is tagged as R. A token containing only ASCII characters is labeled as A if all characters are common to English and Russian keyboards (i.e. numbers and some punctuation), otherwise E. For tokens containing Cyrillic characters, KLE- and homoglyph-mapped versions are searched in our dictionaries. If found, K or H is assigned. If both mapped versions are found in the dictionaries, then either K or H is assigned probabilistically⁴. In cases where neither mapped version is found in the dictionary, the token is assigned either R or A, depending on whether it consists of purely Cyrillic characters, or a mix of Cyrillic and ASCII, respectively.

³ Taken from the Russian Analyzer packaged with Lucene — see lucene.apache.org.
⁴ We experimented with selecting K or H based on a prior computed from training data; however, results were lower than those reported, which use random selection.

Note that the above tagging rules allow tokens with classes E and A to be identified with perfect accuracy. As a result, we omit these classes from all results reported in this work. We also note that this simplification applies because we have restricted our attention to the Russian → English direction. In the bidirectional case, ASCII tokens could represent either English tokens or KLEs (i.e. a Russian term entered in the English keyboard layout). We leave the joint treatment of the bidirectional case to future work.

Tag   Prec   Recall   F1
K     .528   .924     .672
H     .347   .510     .413
R     .996   .967     .982

Table 1: Baseline results on the test set, using the UNIX, BRANDS, and PRODUCT TITLES dictionaries.

We experimented with different combinations of dictionaries, and found the best combination to be the UNIX, BRANDS, and PRODUCT TITLES dictionaries (see Table 1). We observed a sharp decrease in precision when incorporating the QUERY LOGS dictionary, likely due to noise in the user-generated content.

Error analysis suggests that shorter words are the most problematic for the baseline system⁵. Shorter Cyrillic tokens, when transformed from Cyrillic to ASCII using KLE or homoglyph mappings, have a higher probability of spuriously mapping to valid English acronyms, model IDs, or short words. For instance, the Russian car brand "ваз" maps across keyboard layouts to "dfp", an acronym commonly used in product titles for "Digital Flat Panel". The Russian words "муки" and "рук" similarly map by chance to the English words "verb" and "her".

⁵ Stopwords are particularly problematic, and hence excluded from consideration as KLEs or homoglyphs.
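The baseline's rule cascade can be written down compactly. A sketch, assuming hypothetical helper predicates for the character-set tests and an `in_dict` callback for the dictionary lookup (all names are ours):

```python
import random

RUSSIAN_STOPWORDS = {"и", "в", "на", "с", "для"}  # stand-in for the 101-word list

def is_ascii(text: str) -> bool:
    return all(ord(ch) < 128 for ch in text)

def is_shared(text: str) -> bool:
    # Characters typeable in both layouts: digits and some punctuation.
    return all(ch.isdigit() or ch in ".,;-+/" for ch in text)

def baseline_tag(token, kle_mapped=None, homoglyph_mapped=None,
                 in_dict=lambda w: False):
    """Rule cascade of Section 3.2; `in_dict` abstracts membership in the
    UNIX, BRANDS, and PRODUCT TITLES lexicons."""
    if token.lower() in RUSSIAN_STOPWORDS:
        return "R"  # stopwords are never treated as KLEs or homoglyphs
    if is_ascii(token):
        return "A" if is_shared(token) else "E"
    kle_hit = kle_mapped is not None and in_dict(kle_mapped)
    homo_hit = homoglyph_mapped is not None and in_dict(homoglyph_mapped)
    if kle_hit and homo_hit:
        return random.choice(["K", "H"])  # both mappings found: pick randomly
    if kle_hit:
        return "K"
    if homo_hit:
        return "H"
    # Neither mapping found: pure Cyrillic -> R, mixed Cyrillic/ASCII -> A
    return "A" if any(ord(ch) < 128 for ch in token) else "R"
```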

A related problem occurs with product model IDs, and highlights the limits of treating query tokens independently. Consider the Cyrillic query "БМВ e46". The first token is a Russian transliteration of the BMW brand. The second token, "e46", has three possible interpretations: i) as a Russian token; ii) as a homoglyph for ASCII "e46"; or iii) as a KLE for "t46". It is difficult to discriminate between these options without considering token context, and in this case having some prior knowledge that e46 is a BMW model.

3.3 Machine Learning Models

We trained linear classification models using logistic regression (LR)⁶, and non-linear models using random forests (RFs), using implementations from the Scikit-learn package (Pedregosa et al., 2011). Sequence models are implemented as first-order conditional Markov models by applying a beam search (k = 3) on top of the LR and RF classifiers (see the sketch at the end of this subsection). The LR and RF models were tuned using 5-fold cross-validation results, with models selected based on the mean F1 score across the R, K, and H tags.

⁶ Although CRFs are state-of-the-art for many tagging problems, in our experiments they yielded results slightly lower than the LR or RF models.

              Classification        Sequence
     Tag     P     R     F1       P     R     F1
LR   K     .925  .944  .935    .915  .934  .925
     H     .708  .667  .687    .686  .686  .686
     R     .996  .997  .996    .997  .996  .997
RF   K     .926  .949  .937    .935  .949  .942
     H     .732  .588  .652    .750  .588  .659
     R     .997  .997  .997    .996  .998  .997

Table 2: Classification and sequence tagging results on the test set.

Table 2 shows the token-level results on our in-domain test set. As with the baseline, we focus the model on disambiguating between classes R, K and H. Each of the reported models performs significantly better than the baseline (on each tag), with statistical significance evaluated using McNemar's test. The differences between LR and RF models, as well as between sequence and classification variants, however, are not statistically significant. Each of the machine learning models achieves a query-level accuracy score of roughly 98% (the LR sequence model achieved the lowest with 97.78%, the RF sequence model the highest with 97.90%).

Our feature ablation experiments show that the majority of the predictive power comes from the character-level LM features. Dropping LM features results in a significant reduction in performance (F1 scores of .878 and .638 for the RF sequence model on classes K and H). These results are still significantly above the baseline, suggesting that token and dictionary features are by themselves good predictors. However, we do not see a similar performance reduction when dropping these feature groups.

We experimented with lexical features, which are commonly used in token-level tagging problems. Results, however, were slightly lower than the results reported in this section. We suspect the issue is one of overfitting, due to the limited size of our training data, and the general sparsity associated with lexical features. Continuous word representations (Mikolov et al., 2013), noted as future work, may offer improved generalization.

Error analysis for our machine learning models suggests patterns similar to those reported in Section 3.2. Although errors are significantly less frequent than in our dictionary baseline, shorter words still present the most difficulty. We note as future work the use of word-level LM scores to target errors with shorter words.
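A minimal sketch of the beam-search decoding step, assuming a hypothetical `classifier_probs` callback that wraps the trained LR or RF classifier; exactly how the previous tag enters the feature set is our simplification of the first-order conditional Markov model.

```python
import math

def beam_search_tag(tokens, classifier_probs, k=3):
    """Width-k beam search over per-token class posteriors.
    `classifier_probs(token, prev_tag)` -> {tag: probability}; it stands
    in for the LR/RF classifier, with the previous tag acting as the
    Markov conditioning feature."""
    beam = [(0.0, [])]  # (cumulative log-probability, partial tag sequence)
    for token in tokens:
        candidates = []
        for logp, seq in beam:
            prev_tag = seq[-1] if seq else "<s>"
            for tag, p in classifier_probs(token, prev_tag).items():
                if p > 0.0:
                    candidates.append((logp + math.log(p), seq + [tag]))
        # Keep only the k highest-scoring partial hypotheses.
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
    return beam[0][1]  # tag sequence of the best hypothesis
```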

3.4 Search Results

Recall that we translate multilingual queries into English prior to search. KLEs and homoglyphs in queries result in poor query translations, often leading to null search results.

To evaluate the impact of KLE and homoglyph correction, we consider a set of 100k randomly selected Russian/English queries. We consider the subset of queries that the RF or baseline models predict as containing a KLE or homoglyph. Next, we translate into English both the original query, as well as a transformed version of it, with KLEs and homoglyphs replaced with their ASCII mappings. Lastly, we execute independent searches using the original and transformed query translations.

                     Baseline          RF model
#Transformed         12,661            7,364
Null → Non-null      3,078 (24.3%)     3,142 (42.7%)
Non-null → Null      2,651 (20.9%)     354 (4.81%)

Table 3: Impact of KLE and homoglyph correction on search results for 100k queries.

Table 3 provides details on search results for the original and transformed queries. The baseline model transforms over 12.6% of the 100k queries. Of those, 24.3% yield search results where the unmodified queries had null search results (i.e. Null → Non-null). In 20.9% of the cases, however, the transformations are destructive (i.e. Non-null → Null), and yield null results where the unmodified query produced results.

Compared with the baseline, the RF model transforms only 7.4% of the 100k queries; a fraction that is roughly in line with the 7.8% of queries in our test set that contain KLEs or homoglyphs. In over 42% of the cases (versus 24.3% for the baseline), the transformed query generates search results where the original query yields none. Only 4.81% of the transformations using the RF model are destructive; a fraction significantly lower than the baseline's.

Note that we distinguish here only between queries that produce null results and those that do not. We do not include queries for which the original and transformed queries both produce (potentially differing) search results. Evaluating these cases requires deeper insight into the relevance of search results, which is left as future work.
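The bucket counts in Table 3 reduce to pairing hit counts for the two translations of each query. A sketch, with `search_count` as a hypothetical stand-in for the search backend:

```python
def impact_buckets(query_pairs, search_count):
    """Tally the Null -> Non-null and Non-null -> Null transitions of
    Table 3. `search_count(query)` is a hypothetical stand-in for the
    search backend, returning the number of hits for a translated query."""
    null_to_nonnull = nonnull_to_null = 0
    for original, transformed in query_pairs:
        before, after = search_count(original), search_count(transformed)
        if before == 0 and after > 0:
            null_to_nonnull += 1   # transformation recovered results
        elif before > 0 and after == 0:
            nonnull_to_null += 1   # destructive transformation
        # Pairs with results on both sides are not scored (see text above).
    return null_to_nonnull, nonnull_to_null
```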
4 Related Work

Baytin et al. (2013) first refer to keyboard layout errors in their work. However, their focus is on predicting the performance of spell-correction, not on fixing the KLEs observed in their data. To our knowledge, our work is the first to introduce this problem and to propose a machine learning solution. Since our task is a token-level tagging problem, it is very similar to the part-of-speech (POS) tagging task (Ratnaparkhi, 1996), only with a very small set of candidate tags. We chose a supervised machine learning approach in order to achieve maximum precision. However, this problem can also be approached in an unsupervised setting, similar to the method Whitelaw et al. (2009) use for spelling correction. In that setup, the goal would be to directly choose the correct transformation for an ill-formed KLE or homoglyph, instead of a tagging step followed by a deterministic mapping to ASCII.

5 Conclusions and Future Work

We investigate two kinds of errors in search queries: keyboard layout errors (KLEs) and homoglyphs. Applying machine learning methods, we are able to accurately identify a user's intended query, in spite of the presence of KLEs and homoglyphs. The proposed models are based largely on compact, character-level language models. The proposed techniques, when applied to multilingual queries prior to translation and search, offer significant gains in search results.

In the future, we plan to focus on additional features to improve KLE and homoglyph discrimination for shorter words and acronyms. Although lexical features did not prove useful in this work, presumably due to data sparsity and overfitting issues, we intend to explore the application of continuous word representations (Mikolov et al., 2013). Compared with lexical features, we expect continuous representations to be less susceptible to overfitting, and to generalize better to unknown words. For instance, using continuous word representations, Turian et al. (2010) show significant gains for a named entity recognition task.

We also intend to explore the use of features from in-domain, word-level LMs. Word-level features are expected to be particularly useful in the case of spurious mappings (e.g. "ваз" vs. "dfp" from Section 3.2), where context from surrounding tokens in a query can often help in resolving ambiguity. Word-level features may also be useful in re-ranking translated queries prior to search, in order to reduce the incidence of erroneous query transformations generated through our methods. Finally, our future work will explore KLE and homoglyph correction bidirectionally, as opposed to the unidirectional approach explored in this work.

Acknowledgments

We would like to thank Jean-David Ruvini, Mike Dillinger, Saša Hasan, Irina Borisova and the anonymous reviewers for their valuable feedback. We also thank our Russian language specialists Tanya Badeka, Tatiana Kontsevich and Olga Pospelova for their support in labeling and reviewing datasets.

References

Alexey Baytin, Irina Galinskaya, Marina Panina, and Pavel Serdyukov. 2013. Speller performance prediction for query autocorrection. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pages 1821–1824.

Tina M. Lowrey, Larry J. Shrum, and Tony M. Dubitsky. 2013. The Relation Between Brand-name Linguistic Characteristics and Brand-name Memory. Journal of Advertising, 32(3):7–17.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

Tristan Miller. 2013. Russian–English Homoglyphs, Homographs, and Homographic Translations. Word Ways: The Journal of Recreational Linguistics, 46(3):165–168.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Adwait Ratnaparkhi. 1996. A Maximum Entropy Model for Part-of-Speech Tagging. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 133–142.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of ACL, pages 384–394.

Casey Whitelaw, Ben Hutchinson, Grace Y. Chung, and Ged Ellis. 2009. Using the Web for Language Independent Spellchecking and Autocorrection. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 890–899.
