Spelling Correction As a Foreign Language

Spelling Correction as a Foreign Language Yingbo Zhou∗ Utkarsh Porwal Roberto Konow eBay Inc eBay Inc eBay Inc San Jose, California San Jose, California San Jose, California [email protected] [email protected] [email protected] ABSTRACT sources for spelling mistakes, e.g. typing too fast, unintentional key In this paper, we reformulated the spelling correction problem as stroke, phonetic ambiguity among others. Lastly, in certain context a machine translation task under the encoder-decoder framework. (e.g. in a search engine) it is not easy to obtain clean training data This reformulation enabled us to use a single model for solving for language model as the input does not follow what is typical in this problem that is traditionally formulated as learning a language natural language. x x model and an error model. This model employs multi-layer recur- Since the goal is to get text that maximize P( | ˜), can we directly rent neural networks as an encoder and a decoder. We demonstrate model this conditional distribution instead? In this work, we ex- the efectiveness of this model using an internal dataset, where the plore this route, which by passes the need to have multiple models training data is automatically obtained from user logs. The model and avoid getting errors from multiple sources. We achieve this ofers competitive performance as compared to the state of the art by applying the sequence to sequence learning framework using methods but does not require any feature engineering nor hand recurrent neural networks [16] and reformulate the spelling cor- tuning between models. rection problem as a neural machine translation problem, where the misspelled input is treated as a foreign language. KEYWORDS Spell Correction; Encoder Decoder; Noisy Channel 2 RELATED WORK ACM Reference Format: Yingbo Zhou, Utkarsh Porwal, and Roberto Konow. 2019. Spelling Correction Spelling correction is used in wide range of applications other than as a Foreign Language. In Proceedings of ACM SIGIR Workshop on eCommerce Web search [4] and e-commerce search such as personal search (SIGIR 2019 eCom). ACM, New York, NY, USA, 4 pages. in email [7] to improve healthcare inquiries [13]. However, noisy channel model or its extensions remain a popular choice for design- 1 INTRODUCTION ing large scale spelling correction system. Gao et al. [6] proposed an extension of the noisy channel model where the language model Having an automatic spelling correction service is crucial for any was scaled to Web scale and a distributed infrastructure to facilitate e-commerce search engine as users often make spelling mistakes such scaling was proposed. They also proposed a phrase based while issuing queries. A correct spelling correction not only re- error model. Similarly, Whitelaw et al. [17] also designed a large duces the user’s mental load for the task, but also improves the scale spelling correction and autocorrection system that did require quality of the search engine as it attempts to predict user’s inten- any manually annotated training data. They also designed their tion. From a probabilistic perspective, let x˜ be the misspelled text large scale system following the noisy channel model where they that we observe, spelling correction seeks to uncover the true text extended one of the earliest error models proposed by Brill et al. [3]. x∗ = arg maxx P(x|x˜). Traditionally, spelling correction problem Spelling correction problem has been formulated in several difer- has been mostly approached by using the noisy channel model ent novel ways. Li et al. [12] used Hidden Markov Models to model [11]. The model consists of two parts: 1) a language model (or spelling errors in a uniied framework. Likewise, Raaijmakers et source model, i.e. P(x)) that represent the prior probability of the al. [14] used deep graphical model for spelling correction. They intended correct input text; and 2) an error model (or channel formulated spelling correction as a document retrieval problem model, i.e. P(x˜ |x)) that represent the process, in which the correct where words are documents and for a misspelled query one has input text got corrupted to an incorrect misspelled text. The i- to retrieve the appropriate document. Eger et al. [5] formulated nal correction is therefore obtained by using the Bayes rule, i.e. spelling correction problem as a subproblem of the more general x∗ = arg maxx P(x)P(x˜ |x). There are several problem with this string-to-string translation problem. Their work is similar to ours approach. First, we need two separate models and the error in esti- in spirit but difers signiicantly in implementation detail. We for- mating one model would afect the performance of the inal output. mulate the spelling correction as a machine translation task and Second, it is not easy to model the channel since there is a lot of to the best of our knowledge no other study has been conducted ∗This work was done when author worked at eBay doing the same. PermissionCopyright © 2019 to make by the digital paper’s orauthors. hard copiesCopying ofpermitted part or for all private of this and work academic for personal purposes. or classroomIn: J. Degenhardt, use is S. granted Kallumadi, without U. Porwal, fee provided A. Trotman that (eds.): copies are not made or distributed forProceedings proit or of commercial the SIGIR 2019 advantage eCom workshop, and that July copies 2019, bear Paris, this France, notice published and the at full citation 3 BACKGROUND AND PRELIMINARIES onhttp://ceur-ws.org the irst page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). The recurrent neural network (RNN) is a natural extension to feed- SIGIR 2019 eCom, July 2019, Paris, France forward neural network for modeling sequential data. More for- d © 2019 Copyright held by the owner/author(s). mally, let (x1, x2,..., xT ), xt ∈ R be the input, an RNN update its SIGIR 2019 eCom, July 2019, Paris, France Yingbo Zhou, Utkarsh Porwal, and Roberto Konow internal recurrent hidden states by doing the following computa- where the input is the misspelled text and the output is the cor- tion: responding correct spellings. One challenge for this formulation ht = ψ (ht−1, xt ) (1) is that unlike in machine translation problem, the vocabulary is large but still limited2. However, in spelling correction, the input where ψ is a nonlinear function. Traditionally, in a standard RNN vocabulary is potentially unbounded, which rules out the possibility the ψ is implemented as an aine transformation followed by a of applying word based encoding for this problem. In addition, the pointwise nonlinearity, such as large output vocabulary is a general challenge in neural network = = + + ht ψ (ht−1, xt ) tanh(W xt Uht−1 bh) based machine translation models because of the large Softmax output matrix. In addition, the RNN may also have outputs (y ,y ,...,y ),y ∈ 1 2 T t The input/output vocabulary problem can be solved by using a Ro that can be calculated by using another nonlinear function ϕ character based encoding scheme. Although it seems appropriate yt = ϕ(ht , xt ) for encoding the input, this scheme puts unnecessary burden on the decoder, since for a correction the decoder need to learn the From this recursion, the recurrent neural network naturally models correct spelling of the word, word boundaries, and etc. We choose the conditional probability P(y |x ,..., x ). t 1 t the byte pair encoding (BPE) scheme [15] that strikes the balance One problem with standard RNN is that it is diicult for them to between too large output vocabulary and too much learning burden learn long term dependencies [2, 9], and therefore in practice more for decoders. In this scheme, the vocabulary is built by recursively sophisticated function ψ are often used to alleviate this problem. merging most frequent pairs of strings starting from character, For example the long short term memory (LSTM) [10] is one widely and the vocabulary size is controlled by the number of merging used recursive unit that is designed to learn long term dependencies. iterations. A layer LSTM consists of three gates and one memory cell, the As shown in papers [1], encoding the whole input string to a computation of LSTM is as following1: single ixed length vector is not optimal, since it may not reserve all it = σ(Wixt + Uiht−1 + bi ) (2) the information that is required for a successful decoding. Therefore, ot = σ(Woxt + Uoht−1 + bo) (3) we introduce the attention mechanism from Bahdanau et al.[1] into = + + this model. Formally, the attention model calculates a context vector ft σ(Wf xt Uf ht−1 bf ) (4) ci from the encoding states h1,...,hT and decoder state si−1by ct = ft ⊙ ct−1 + (1 − ft ) ⊙ tanh(Wcxt + Ucht−1 + bc ) (5) T ht = ot ⊙ tanh(ct ) (6) = ci Õ λijhj (9) where W , U , and b represents the corresponding input-to-hidden, j=1 hidden-to-hidden weights and biases respectively. σ(·) denotes the exp{aij } λ = (10) sigmoid function, and ⊙ is the elementwise product. ij T = exp{aik } Another problem when using RNN to solve sequence to sequence Ík 1 a = tanh(W s + W h + b) (11) learning problem is that it is not clear what strategy to apply when ij s i−1 h j the input and output sequence does not share the same length where Ws , Wh are the weight vector for alignment model, and b (i.e. for outputs we have T ′ time steps, which may not equal to T ), denotes the bias. which is the typical setting for this type of tasks. Sutskever et al.

Load more