
Spelling Alteration for Web Search Workshop

City Center – Bellevue, WA July 19, 2011

Table of Contents

Acknowledgements ............................................................ 2

Description, Organizers ..................................................... 3

Schedule ..................................................................... 4

A Data-Driven Approach for Correcting Search Queries
Gord Lueck ................................................................... 6

CloudSpeller: Spelling Correction for Search Queries by Using a Unified Hidden Markov Model with Web-scale Resources
Yanen Li, Huizhong Duan, ChengXiang Zhai .................................... 10

qSpell: Spelling Correction of Web Search Queries using Ranking Models and Iterative Correction
Yasser Ganjisaffar, Andrea Zilio, Sara Javanmardi, Inci Cetindil, Manik Sikka, Sandeep Katumalla, Narges Khatib, Chen Li, Cristina Lopes ............ 15

TiradeAI: An Ensemble of Spellcheckers
Dan Ştefănescu, Radu Ion, Tiberiu Boroş ..................................... 20

Spelling Generation based on Edit Distance
Yoh Okuno .................................................................... 25

A REST-based Online English Spelling Checker "Pythia"
Peter Nalyvayko .............................................................. 27

Acknowledgements

The organizing committee would like to thank all the speakers and participants as well as everyone who submitted a paper to the Spelling Alteration for Web Search Workshop. Special thanks to the Speller Challenge winners for helping build an exciting and interactive workshop.

Last but not least, thank you to the Bing team for bringing in their experience working on real-world, large-scale web scenarios.

Organizing committee

Description

The Spelling Alteration for Web Search workshop, co-hosted by Microsoft Research and Microsoft Bing, addresses the challenges of web scale Natural Language Processing with a focus on search query spelling correction.

The goals of the workshop are the following:

• provide a forum for participants in the Speller Challenge (details at http://www.spellerchallenge.com, with a submission deadline for the award competition on June 2, 2011) to exchange ideas and share experience

• hold the official award ceremony for the prize winners of the Speller Challenge

• engage the community on future research directions in spelling alteration for web search

Organizers

Evelyne Viegas, Jianfeng Gao, Kuansan Wang - Microsoft Research, One Microsoft Way, Redmond, WA 98052

Jan Pedersen - Microsoft, One Microsoft Way, Redmond, WA 98052

Schedule

Time  | Event/Topic | Presenter | Chair
08:30 | Breakfast | |
09:00 | Opening | Evelyne Viegas - Microsoft Research | Jianfeng Gao
09:10 | Award presentation | Harry Shum - Corporate Vice President, Microsoft |
09:30 | Report | Kuansan Wang - Microsoft Research |
10:00 | Break - snacks | |
10:30 | A Data-Driven Approach for Correcting Search Queries | Gord Lueck - Independent Developer, Canada | Jianfeng Gao
10:50 | CloudSpeller: Spelling Correction for Search Queries by Using a Unified Hidden Markov Model with Web-scale Resources | Yanen Li, Huizhong Duan, ChengXiang Zhai - UIUC, Illinois, USA |
11:10 | qSpell: Spelling Correction of Web Search Queries using Ranking Models and Iterative Correction | Yasser Ganjisaffar, Andrea Zilio, Sara Javanmardi, Inci Cetindil, Manik Sikka, Sandeep Katumalla, Narges Khatib, Chen Li, Cristina Lopes - University of California, Irvine, USA |
11:30 | TiradeAI: An Ensemble of Spellcheckers | Dan Ştefănescu, Radu Ion, Tiberiu Boroş - Research Institute for Artificial Intelligence, Romania |
11:50 | Spelling Generation based on Edit Distance | Yoh Okuno - Yahoo Corp., Japan |
12:00 | Lunch | |
13:00 | A REST-based Online English Spelling Checker "Pythia" | Peter Nalyvayko - Analytical Graphics Inc., USA | Kuansan Wang
13:15 | Why vs. HowTo: Maintaining the right balance | Dan Ştefănescu, Radu Ion - Research Institute for Artificial Intelligence, Romania |
13:30 | Panel Discussion | Ankur Gupta, Li-wei He - Microsoft; Gord Lueck, Yanen Li, Yasser Ganjisaffar, Dan Ştefănescu - Speller Challenge Winners |
14:30 | Wrap up | |

A Data-Driven Approach for Correcting Search Queries

A Submission to the Speller Challenge

Gord Lueck* ([email protected])

ABSTRACT

Search phrase correction is the challenging problem of proposing alternative versions of search queries typed into a web search engine. Any number of different approaches can be taken to solve this problem, each having different strengths. Recently, the availability of large datasets that include web corpora has increased the interest in purely data-driven spellers. In this paper, a hybrid approach that uses traditional dictionary-based spelling tools in combination with data-driven probabilistic techniques is described. The performance of this approach was sufficient to win Microsoft's Speller Challenge in June 2011.

1. INTRODUCTION

Spelling correctors for search engines have the difficult task of accepting error-prone user input and deciphering if an error was made, and if so suggesting plausible alternatives to the phrase. It is a natural extension to the existing spell checkers that are common in document editors today, with the added expectation that suggestions leverage additional context provided by other terms in the query. In addition, it is common for search phrases to legitimately include proper nouns, slang, multiple languages, punctuation, and in some cases complete sentences. The goal of a good search phrase corrector would be to take all factors into consideration, evaluate the probability of the submitted query, and to suggest alternative queries when it is probable that the corrected version has a higher likelihood than the input phrase.

A good spelling corrector should only act when it is clear that the user made an error. The speller described herein also errs on the side of not acting when it is unclear if a suggested correction is highly probable. Common speller implementations tend not to suggest phrases that have similar probability if the input query is sane.

Another useful metric to measure input queries would be to determine the number of relevant search results the query has. However, without the luxury of a large search engine from which to measure the number of results for candidate phrases, this implementation relies upon probability data solely from Microsoft via their web-ngram service [6].

This paper gives an overview of the methods used to create such a search phrase correction service. An overview of the problem as formulated by the challenge administrators is given, including an outline of some of the assumptions and models that were used in the creation of this algorithm. A key component of the entry is an error model that takes into consideration probabilities as obtained through a historical Bing search query dataset while estimating frequencies of errors within search queries. The widely available hunspell dictionary speller [1] is used for checking for the existence of a dictionary word matching each word in the input query, and for suggesting potential corrections to individual words. Any algorithmic parameters present in the final algorithm are also described, along with the methods used to fix those parameters. Some specific optimizations for the contest are described, and recommendations are made for improvement of the expected recall metric of the contest evaluator.

We describe our approach to the problem, some assumptions that we made, and describe an error model that we formulated to approximate frequencies and locations of errors within search queries. Our algorithm is described, as well as methods for calculating parameters that were used in the final submission to the challenge [4].

* Corresponding Author

2. PROBLEM FORMULATION

The algorithm is evaluated according to the published EF1 metric. Suppose the spelling algorithm returns a set of variations C(q), each having posterior probabilities P(c|q). S(q) is the set of plausible spelling variations as determined by a human, and Q is the set of queries in the evaluation set. The expected precision, EP, is

    EP = (1/|Q|) Σ_{q∈Q} Σ_{c∈C(q)} I_P(c, q) · P(c|q)

and the expected recall is

    ER = (1/|Q|) Σ_{q∈Q} Σ_{a∈S(q)} I_R(C(q), a) / |S(q)|

The expected F1 is the harmonic mean of the two:

    1/EF1 = 0.5 · (1/EP + 1/ER)

where the utility functions are:

    I_R(C(q), a) = 1 if a ∈ C(q) for a ∈ S(q), 0 otherwise
    I_P(c, q) = 1 if c ∈ S(q), 0 otherwise
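To make the metric concrete, here is a minimal Python sketch of how EP, ER and EF1 could be computed offline for a speller's output. The data structures (a dict mapping each query to its returned (correction, probability) pairs, and a dict of human-judged variations S(q)) are illustrative assumptions, not the contest's actual evaluation harness.

```python
def expected_precision(candidates, judged):
    """EP: average over queries of sum over returned corrections c of I_P(c, q) * P(c|q)."""
    total = 0.0
    for q, cands in candidates.items():
        total += sum(p for c, p in cands if c in judged[q])
    return total / len(candidates)

def expected_recall(candidates, judged):
    """ER: average over queries of the fraction of judged variations found in C(q)."""
    total = 0.0
    for q, cands in candidates.items():
        returned = {c for c, _ in cands}
        total += sum(1 for a in judged[q] if a in returned) / len(judged[q])
    return total / len(candidates)

def expected_f1(candidates, judged):
    """Harmonic mean of EP and ER, i.e. 1/EF1 = 0.5 * (1/EP + 1/ER)."""
    ep, er = expected_precision(candidates, judged), expected_recall(candidates, judged)
    return 2 * ep * er / (ep + er) if ep and er else 0.0

# Illustrative usage with made-up data:
candidates = {"speling test": [("spelling test", 0.9), ("speling test", 0.1)]}
judged = {"speling test": {"spelling test"}}
print(expected_f1(candidates, judged))
```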

For a given search query q, the algorithm will return a set of possible corrections, each with an assigned probability, such that the probabilities P(c|q) add to unity for c ∈ C(q). A successful spelling algorithm will return a set of possible corrections and probabilities that maximizes EF1.

2.1 Assumptions

A number of assumptions guided our submission:

• Binary operators (AND, OR, NOT) are not supported in the search engine. That is, every word or phrase in the query is by default an 'AND' operation.

• Punctuation and its effect on the submitted query would be negligible.

• Most queries would have spaces correctly located within the printable characters.

• The test dataset would be similar to the evaluation dataset in terms of error type, error frequency, phrase length, and frequency of English words vs. proper nouns.

• Dictionary-based text corrections would be able to provide the basis for the most obvious spelling errors in the query string.

These assumptions may not be valid for all spelling domains, but seemed to be true based on inspection of the TREC [5] dataset that was to be used for evaluation. These assumptions were not validated, either, but were generated through some iterative testing on the evaluation dataset. It was determined that a sufficient speller could be formulated using basic dictionary-based software tools, the joint probability n-gram service provided by Microsoft, and some balancing of probabilities.

3. ALGORITHM

There are two main sections of the algorithm: a generation step wherein a set of alternative spellings are postulated, and second, a step where these alternatives are trimmed and sorted via a probability calculation.

3.1 Generation of Alternatives

The speller recommends corrections for each individual term resulting from the input query having been split on whitespace. Each term is compared against a well known English dictionary. If the term does not exist in the dictionary, the original term plus the N_t most likely spelling suggestions were obtained and added to the set of possible terms. These individual term corrections were generated from the well known hunspell library [1], using the United States English dictionary that comes with the Ubuntu Linux distribution [2]. This implementation only considered an English language dictionary.

3.1.1 Trimming the Search Space

The generation of alternative terms was repeated for each term t in the search phrase. In doing this, the algorithm generates a potentially large number of candidate phrases, as the corrected phrase set consists of the set of all possible combinations of candidate terms. That is a maximum of (N_t + 1)^t possible corrections. For a typical query of 5 terms, each having 4 suggestions, that means checking 3125 different combinations of words for probabilities. If more than 100 potential corrections were found, the search space was reduced by halving the number of individual term queries until the number fell below 100. In practice, this trimming step was only rarely performed, but was left in to meet basic latency requirements for a web service. In practice there was no penalty in the contest for having a long running service, and the algorithm's accuracy was likely not affected by this trimming step.

3.2 Probabilistic Error Model

The algorithm is required to return the term P(c|q), the probability that the requester meant the phrase c when q was input. By Bayes' theorem,

    P(c|q) = P(q|c) · P(c) / P(q)    (1)

We can reduce this problem for the generation of a single query correction set: for a single query, P(q) is constant and can be ignored.

    P(c|q) ∝ P(q|c) · P(c)    (2)

Microsoft's n-gram service provides the ability to query various datasets for the relative probability of an arbitrary n-gram c. Our algorithm made little attempt to modify or tune these data for P(c) by altering the Microsoft Bing source or date. The crux of the problem, then, is to generate a model for P(q|c), the probability that the search query q was input given a possible correction c.

It seems reasonable that P(q|c) decreases exponentially with increasing edit distance between q and c, but that the length of the query is also important. It was postulated that there is an average error rate for typing search queries. The more a user types, the more errors they are likely to make. It was postulated that the probability of an error occurring in a string is a function of its length and the Levenshtein distance from the query to the corrected query.

We modelled our error term P(q|c) based on this. The logarithmic probability as received from the web-ngram service, log(P(c)), was modified as follows to calculate P(c|q):

    log(P(c|q)) = log(P(c)) − e_R · lev(q, c) / |q|    (3)

where lev is the well known Levenshtein distance between the two strings, |q| is the length of the query string q, and e_R is an unknown constant that represents a kind of error rate for search queries.
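A minimal sketch of the scoring implied by Equations (2) and (3): each candidate's base-10 log probability from the n-gram service is penalized in proportion to its edit distance from the query. The `ngram_log10_prob` callable is a hypothetical stand-in for a call to Microsoft's web-ngram service; the Levenshtein routine is included only to keep the sketch self-contained.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

E_R = 36.0  # error penalization constant reported in Section 3.4

def candidate_log10_score(query, candidate, ngram_log10_prob):
    """Equation (3): log P(c|q) = log P(c) - e_R * lev(q, c) / |q| (base-10 logs)."""
    return ngram_log10_prob(candidate) - E_R * levenshtein(query, candidate) / len(query)
```

These per-candidate scores are subsequently normalized so that the returned probabilities sum to one, as the next section describes.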

Combining Equations 3 and 2, we have a simple means of calculating the probability of a correction given the input query, and for any individual correction out of the generated candidates,

    P(c_i|q) = P(c_i|q) / Σ_{c_j∈C} P(c_j|q)    (4)

is the returned probability, such that all probabilities add to one.

3.3 Phrase Splitting

Sometimes no plausible alternate spelling candidates are generated as described above. This can occur if all terms existed in the dictionary or the original query was simply the most optimal according to the above metric. For these queries a second method of correction was attempted, using Microsoft's word breaker service. The search phrase was submitted, without whitespace, to the service advertised as part of the n-gram service [3]. Any alternate versions of the phrase returned were also evaluated according to the approach described in the Probabilistic Error Model section.

3.4 Determination of Parameters

Two parameters were tuned to the publicly available evaluation dataset before the test was run on the speller: N_t, the number of term suggestions per input term, and e_R, the error penalization term. Both were tuned using a greedy search over the input space. It was determined that the optimal settings were N_t = 2 and e_R = 36.

Interestingly enough, only two suggestions per term were requested from the hunspell library for words that did not exist in the dictionary; there was little change in the final score with N_t > 2.

The other interesting fact was that probabilities were penalized by a factor of 10 according to the error rate of the input query. With e_R = 36, the algorithm reduces the probability by a factor of 10 for each edit occurring in a string of length 36. Longer strings are penalized less for having an error, shorter strings are penalized more.

4. SCORE OPTIMIZATION

4.1 Thresholding

A few optimization steps were completed before returning a final result. If the original query was still the most likely phrase of all of the candidates examined, our algorithm returns the search query unchanged, with a probability of 1.0. From a user's standpoint, it would not be desirable in this case to suggest a correction to the input phrase.

The evaluation metric for the contest is the harmonic mean of expected precision (EP) and expected recall (ER). There is a slight weakness in the recall metric in that it is possible for a speller to return many incorrect modifications to a search phrase with very low probability in an effort to gain a high recall score. A perfect recall score can be obtained if a speller returns all of the corrections in the evaluation dataset, regardless of the probability calculated by the speller.

If, upon completion of the algorithm, there are any number of candidate strings with a score less than 10x less likely than the most likely result, these phrases are returned, but their probabilities are minimized to a very small number (1×10^−15). The higher probability remaining phrases are increased accordingly, such that the total of all corrections remains 1.0.

4.1.1 Common Errors

A number of very common transformations in the English language were applied to the input query, and returned with this same very low probability. These include adding and removing the most commonly misplaced characters from the input query. In practice, this is not a good strategy but one that simply optimizes for the flawed Expected F1 (EF1) metric for the purposes of the contest only.

To determine the common transformations, the evaluation dataset was examined in terms of the types of errors that occurred, and the characters that were most likely in error. To generate this set of characters, the character differences were calculated in the evaluation dataset for cases where the input phrase had at least one correction as determined by an expert. A simple script generated and counted these characters when they were involved in an edit operation mapping the input query to a corrected version. These characters included, in order, ' ', 's', 'e', 't', 'a', 'r', 'i', 'u', 'l', 's', 'n' and so on. These characters were used to calculate by brute force a very large set of possible corrections, which were assigned the same minuscule probability used above, adjusting the algorithmically calculated corrections accordingly to maintain a sum of one across all corrections.
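The following sketch illustrates one way the thresholding and re-normalization described in Sections 4.1 and 4.1.1 could be implemented, assuming the candidate scores are base-10 log probabilities; the exact bookkeeping used in the winning entry may differ.

```python
def finalize(query, scored):
    """scored: list of (candidate, log10_score) pairs for one input query."""
    scored = sorted(scored, key=lambda x: x[1], reverse=True)
    best, best_score = scored[0]
    if best == query:
        # Original query wins: return it unchanged with probability 1.0.
        return [(query, 1.0)]
    floor = 1e-15
    kept, floored = [], []
    for cand, s in scored:
        # Candidates more than a factor of 10 (one log10 unit) below the best
        # are still returned, but only with a tiny fixed probability.
        (kept if best_score - s <= 1.0 else floored).append((cand, s))
    rel = [10 ** (s - best_score) for _, s in kept]
    scale = (1.0 - floor * len(floored)) / sum(rel)
    result = [(c, r * scale) for (c, _), r in zip(kept, rel)]
    result += [(c, floor) for c, _ in floored]
    return result  # probabilities sum to 1.0
```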

5. CONCLUSIONS

The use of already established algorithms for English language spell checking and word suggestion, as well as very large datasets that supply probability information, has proved to be a successful approach to writing a speller. This submission made no attempt to tag parts of speech or named entities, but used only a basic dictionary corrector and the Microsoft n-gram web service to approximate probabilities of input strings.

The lack of good quality test data in this space motivated the author to tune towards error statistics in an evaluation dataset. In practice, the extensibility of this approach to a wider audience would need to be tested in a production environment with a wider variety of data.

There is certainly a risk of overfitting any algorithm to a narrow set of training data; only very basic properties of the evaluation dataset were used to tune this algorithm. The error rate, the approach of penalizing corrections with high error rates, and the general method of generating suggestion candidates seem valid.

It is recommended that any future contest based on the expected recall metric be altered to consider the posterior probabilities of the returned corrections.

The algorithm as presented does have its shortfalls. This particular implementation performs very poorly when presented with misspelled proper nouns, slang, and punctuation variations. No effort was made to accommodate different languages. More work is needed in these areas before a worthy spelling corrector based on this technique could be useful in a production setting.

6. REFERENCES

[1] Hunspell: open source spell checking, stemming, morphological analysis and generation, 2011. http://hunspell.sourceforge.net/.
[2] Ubuntu - Details of package hunspell-en-us in maverick, 2011. http://packages.ubuntu.com/maverick/hunspell-en-us.
[3] Word breaking is a cinch with data - Microsoft Web N-Gram, 2011. http://blogs.msdn.com/b/webngram/archive/2010/11/22/wordbreakingisacinchwithdata.aspx.
[4] Microsoft Research - Speller Challenge, 2011. http://web-ngram.research.microsoft.com/spellerchallenge/Default.aspx.
[5] Text REtrieval Conference (TREC), 2011. http://trec.nist.gov/.
[6] Microsoft Web N-gram Services - Microsoft Research, 2011. http://web-ngram.research.microsoft.com/.

CloudSpeller: Spelling Correction for Search Queries by Using a Unified Hidden Markov Model with Web-scale Resources

Yanen Li, Huizhong Duan, ChengXiang Zhai

Department of , University of Illinois at Urbana-Champaign, Urbana, IL 61801 {yanenli2, duan9, czhai}@illinois.edu

ABSTRACT

Query spelling correction is a crucial component of modern search engines that can help users to express an information need more accurately and thus improve search quality. In participation in the Microsoft Speller Challenge, we proposed and implemented an efficient end-to-end speller correction system, namely CloudSpeller. The CloudSpeller system uses a Hidden Markov Model to effectively model major types of spelling errors in a unified framework, in which we integrate a large-scale lexicon constructed using Wikipedia, an error model trained from high confidence correction pairs, and the Microsoft Web N-gram service. Our system achieves excellent performance on two search query spelling correction datasets, reaching 0.970 and 0.940 F1 scores on the TREC dataset and the MSN dataset respectively.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Query Alteration

General Terms
Algorithms, Performance, Experimentation

Keywords
CloudSpeller, Spelling Correction, Hidden Markov Model

1. INTRODUCTION

The Text Information Management group at the University of Illinois at Urbana-Champaign has participated in the Microsoft Speller Challenge. The task of the challenge is to develop a speller that generates the most plausible spelling alternatives for a search query. This paper is a report of our proposed methods, experiments and findings about the problem based on the results and analysis on two spelling datasets.

Spelling correction has a long history [10]. Traditional spellers focused on dealing with non-word errors caused by misspelling a known word as an invalid word form. They typically rely on manually created lexicons and simple distance measures like Levenshtein distance. Later, statistical models were introduced for spelling correction, in which the error model and n-gram language model are identified as two critical components [3]. Whitelaw et al. alleviated the cost of building the error model by leveraging the Web to automatically discover misspelled/corrected word pairs [15].

With the advent of the Web, the research on spelling correction has received much more attention, particularly on the correction of search engine queries. Compared with traditional spelling tasks, it is more difficult as more types of misspelling exist on the Web. Research in this direction includes utilizing large web corpora and query logs [4, 5, 2], employing large-scale n-gram models [Brants et al. 2007], training phrase-based error models from clickthrough data [13] and developing additional features [7].

To address the challenges of spelling correction for search queries, we propose several novel ideas which are implemented in an efficient end-to-end speller system (called CloudSpeller) of high precision and recall. First, we construct a large and reliable lexicon from Wikipedia. The large size (over 1.2 million words) of the lexicon overcomes the limitation of using a small traditional lexicon, such as identifying human names and neologisms. A clean dictionary is also critical for handling non-word spelling errors, which is the most frequent error type in search queries. Second, we propose a Hidden Markov Model to model all major types of spelling errors in a unified framework. A Top-K paths searching algorithm is designed to efficiently maintain a small number of highly confident correction candidates. Third, the candidate paths are finally re-ranked by a ranker which takes into account two major types of features, one from the error model, the other from the n-gram language model. We train the error model with a set of query correction pairs from the web. A Web-scale language model is obtained by leveraging the Microsoft Web N-gram service. We demonstrate that these two types of features are sufficient for building a highly accurate speller for web queries. Such a system also has the advantage in efficiency compared to systems employing tens of features [7].

2. PROBLEM SETUP AND CHALLENGES

Given a search query, the task of search query spelling correction is to find the most effective spelling variant of the query that would retrieve better results than the original query. Formally, let Σ be the alphabet of a language and L ⊂ Σ* be a large lexicon of the language. We define a general search query correction problem as:

    Given query q ∈ Σ*, find the top-K q′ ∈ L* such that P(q′|q) ≥ max^K_{t∈L*} P(t|q),

where P(q′|q) is the conditional probability of q′ given q, following the general noisy channel framework.

The problem of search query spelling correction is significantly harder than traditional spelling correction. Previous research shows that approximately 10-15% of search queries contain spelling errors [5]. We have identified four major types of errors in search queries. (1) Non-word substitution, e.g. insertion, deletion, misspelling of characters. This type of error is most frequent in web queries, and it is not uncommon that up to 3 or 4 letters are misspelled. This type of error can be corrected accurately using the noisy channel model with a trusted lexicon. (2) Confusable valid word substitution, e.g. "persian golf" → "persian gulf". This type of error can be effectively addressed by context sensitive spellers [9]. (3) Concatenation of multiple words, e.g. "unitedstatesofamerica" → "united states of america". This problem can be tackled by an n-gram model based word breaker [14]. (4) Splitting a word into parts, e.g. "power point slides" → "powerpoint slides". For each type of error in search query spelling correction, there are effective solutions. However, predicting the error type given a query is difficult, and it is quite common that more than one type of error co-occurs in a query. Therefore, a successful speller for web queries requires a solution addressing all types of spelling errors with high accuracy.

3. THE CLOUDSPELLER ARCHITECTURE

The CloudSpeller system accepts a search query as input. Then a unified HMM model generates a small list of most likely candidate corrections (paths). After that, a ranker will score and re-rank those candidate paths, and finally generate the top-K corrections as the output. In this section we describe the unified HMM model and the ranker. We will also describe the critical components our HMM model and ranker rely on, namely the large-scale lexicon, the error model and the n-gram model.

3.1 An HMM Model for Query Correction

We adopt a generative model for spelling correction. More specifically, we employ a Hidden Markov Model (HMM) for this purpose. The generative process follows a word-by-word process. At the beginning, the user has a word in its correct form in mind; it is then transformed through a noisy channel and becomes potentially misspelled. In this process, it is not only possible to misspell the given word into another word, but also sometimes possible to split the word into several words, or even combine the two types of misspellings. When the user has a second word in mind, he or she may have similar misspellings as the previous word, but may also incorrectly attach the word (or part of it) to the previous word. Note that this HMM is more general than the existing HMMs used for spelling correction [5] because it can model many different kinds of spelling errors.

Formally, let θ = {A, B, π} be the model parameters of the HMM, including the transition probability, emission probability and initial state probability. Given a list of query words (obtained by splitting on empty spaces), the states in a state sequence are one-to-one corresponding to the query words except for the merging state. Each state is represented by a phrase in Σ*. Theoretically the phrase in a state can be chosen arbitrarily, however for the sake of efficiency we reduce the state space by only choosing a phrase in the lexicon L* such that dist(s, t) ≤ δ, where dist(s, t) is the edit distance between the state phrase s and the word t in the query. Each state also has type information, indicating whether the state is a substitution, merging, splitting or NULL state. In order to accommodate a merging state we introduce the NULL state. The NULL state doesn't emit any phrase, and its state transition probability is always equal to 1. For the model parameters A, B, we employ the bigram model probability as the state transition probabilities A and use the error model probabilities as the emission probabilities B. After this treatment, the probability of generating the original query from a state sequence (path) is the product of all phrase-to-phrase error model probabilities and the total language modeling probability of the path.

Figure 1 illustrates our HMM model and a generative example. In this example, there are three potential errors with different error types, e.g. "goverment" → "government" (substitution), "home page" → "homepage" (splitting), "illinoisstate" → "illinois state" (concatenation). The state path shown in Figure 1 is one of the state sequences that can generate the query. Take state s2 for example: s2 is represented by the phrase homepage. Since s2 is a merging state, it emits the phrase home page with probability P(home page|homepage) according to the error model. Subsequently s2 transits to state s3 with probability P(s3|s2) according to the bigram language model.

[Figure 1: HMM model for query spelling correction. Query: "goverment home page of illinoisstate"; emission: goverment / home page / of / illinoisstate; state path: government (s1, substitution), homepage (s2, merging), NULL (s3), of (s4, substitution), illinois state (s5, splitting).]

With this model, we are able to come up with arbitrary corrections instead of limiting ourselves to an incomprehensive set of queries from a query log. By simultaneously modeling the misspellings on word boundaries, we are able to correct the query in a more integrated manner. Moreover, to handle the large number of potential corrections, we have designed and implemented a dynamic programming algorithm to compute the Top-K corrections efficiently. If there are n states in a state path, and the maximum number of candidate words for each query term is M, the computational complexity of our algorithm is O(M² · n · K). Experiment results show that the recall of the Top-40 corrections obtained by this algorithm is about 98.4% on the TREC dataset and 96.9% on the MSN dataset.
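As a rough illustration of the Top-K path computation (and not the authors' implementation), the sketch below runs a beam search that keeps the K best partial paths while extending one query word at a time. It ignores the merging, splitting and NULL states of the full model, and `candidate_states`, `emission_logp` and `bigram_logp` are assumed stand-ins for the lexicon lookup, the trained error model and the bigram language model.

```python
import heapq

def top_k_corrections(query_words, candidate_states, emission_logp, bigram_logp, k=40):
    """Keep the k highest-scoring partial paths while consuming one query word at a time."""
    beams = [(0.0, ("<s>",))]  # (log score, path); "<s>" is a sentence-start token
    for word in query_words:
        expanded = []
        for score, path in beams:
            for state in candidate_states(word):  # phrases within edit distance delta of the word
                s = score + bigram_logp(path[-1], state) + emission_logp(word, state)
                expanded.append((s, path + (state,)))
        beams = heapq.nlargest(k, expanded)  # prune to the k most probable partial paths
    return [(" ".join(path[1:]), score) for score, path in sorted(beams, reverse=True)]
```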

3.2 Candidate Paths Re-ranking

After generating a small list of candidate paths, we propose a ranker to re-rank these paths based on the weighted sum of error model probability and n-gram model probability. We find that probabilities from these two components are the most important factors for web query correction. However, a successful speller requires a careful combination of these probabilities. Specifically, for a candidate path q′ with n state nodes, we score q′ based on the following 7-parameter interpolated model:

    S_p = [ Σ_n w_n · log P(q_n|q′_n) ] + w_7 · log(P′(q′))    (1)

where P(q_n|q′_n) is the error model probability of transforming the phrase q_n from q′_n, P′(q′) is the n-gram probability of the phrase in path q′, and w_n ∈ {w_1, ..., w_6} is determined by the type of the state node s_n according to the following rule:

    if q′_n ∈ L:
        if q′_n is transformed to q_n by concatenation: w_n = w_1
        else if q′_n is transformed to q_n by splitting: w_n = w_2
        else: w_n = w_3
    else if q′_n ∉ L:
        if q′_n is transformed to q_n by concatenation: w_n = w_4
        else if q′_n is transformed to q_n by splitting: w_n = w_5
        else: w_n = w_6

The model parameters {w_1, ..., w_7} are trained by the Powell search algorithm [12] on the development dataset.

3.3 A Large-scale Trusted Lexicon

We find that a clean vocabulary significantly improves the performance of spelling correction. However, obtaining such a clean vocabulary is usually difficult in practice. To do this, we make use of the Wikipedia data. Particularly, we select the top 2 million words from Wikipedia by their word frequencies, and automatically curate the obtained words by removing those frequent but illegitimate words from the vocabulary. This curation process involves checking if the word appears in the title of a Wikipedia article, comparing the bigram probability of other words, etc. Finally we obtained 1.2 million highly reliable words in the vocabulary.

3.4 Error Model Training

The error model intends to model the probability that one word is misspelled into another (either valid or invalid). Previous studies have shown that a weighted edit distance model trained with a sufficiently large set of correction pairs can achieve comparable performance with a sophisticated n-gram model [6]. Meanwhile, a higher order model has more tendency to overfit if the training data is not large enough. Given these considerations, we adopt the weighted edit distance model as the error model in our system. More specifically, we follow the study of Duan and Hsu [6] to model the joint probability of character transformations as the weighted edit distance. In this model, the basic edit operation is defined as a pair of characters from the source and destination of the correction, respectively. A null character is included in the vocabulary to model the insertion and deletion operations. The misspelled word and its correction are viewed as generated from a sequence of edit operations. The parameters in this model are trained with an EM algorithm which iteratively maximizes the likelihood of the training set of correction pairs. To obtain a proper set of correction pairs for training, we leverage the existing spelling services of major search engines (Google and Bing). We submit the queries to the spelling services and record the corrections once consensus is reached.

3.5 Use of Web N-gram Model

Another important factor in selecting and ranking the correction candidates is the prior probability of a correction phrase. It represents our prior belief about how likely a query will be chosen by the user without seeing any input from the user. In this work we make use of the Web n-gram service provided by Microsoft [1]. The Web n-gram model intends to model the n-gram probability of English phrases with parameters estimated from the entire Web data. It also differentiates the sources of the data to build different language models from the title, anchor text and body of Web pages, as well as the queries from the query log. In our study, we find the title model is the most effective for query spelling correction. We hypothesize that this may be because the training data for the query model is much noisier. Particularly, misspelled queries may be included in the training data, which makes it less effective for the task of spelling correction. Despite being trained with Web data, the Web n-gram model may also suffer from data sparseness in higher order models. To avoid this issue, we make use of the bigram model in building our spelling system.

4. EXPERIMENTS AND DISCUSSION

In order to evaluate the performance of CloudSpeller, we have tested it on two query spelling correction datasets. One is the TREC dataset based on the publicly available TREC queries (2008 Million Query Track). This dataset contains 5892 queries and corrections annotated by the Speller Challenge organizers. There can be more than one plausible correction for a query. In this dataset only 5.3% of queries are judged as misspelled. We also annotated another dataset that contains 4926 queries from the MSN queries; for each query there is at most one correction. About 13% of queries are judged as misspelled in this dataset, which is close to the error rate of real web queries. We divide the TREC and MSN datasets into training and test sets evenly. CloudSpeller is trained on the training sets and finally evaluated on the TREC test set containing 2947 queries and the MSN test set containing 2421 queries.

4.1 Results

We follow the same evaluation metrics as the Speller Challenge, and report results on the TREC and MSN datasets in Table 1. Top 40 corrections are used in the default setting of CloudSpeller. The results indicate that CloudSpeller achieves very high precision and recall on the TREC dataset. On the MSN dataset, which is considered harder since it has more misspelled queries, CloudSpeller also achieves a high precision of 0.912 and recall of 0.969. This suggests CloudSpeller is very effective for handling spelling errors in search queries overall. We also break down the results by error type so that we can see more clearly the distribution of types of spelling errors and how well our system addresses each type of error. We present the results of this analysis in Table 2. The breakdown results show that most queries are in the group of "no error", which are much easier to correct than the other three types.
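Referring back to Equation (1) in Section 3.2, the sketch below shows how the 7-parameter interpolated score could be assembled for a single candidate path. The per-node records and the `ngram_logp` callable are simplified assumptions standing in for the trained error model and the Web n-gram service.

```python
def path_score(nodes, weights, ngram_logp):
    """nodes: list of dicts with keys 'q' (query phrase), 'c' (state phrase),
    'in_lexicon' (bool), 'op' ('concat', 'split' or 'subst'), 'error_logp' (log P(q|c)).
    weights: [w1, ..., w7]. Implements S_p = sum_n w_n * log P(q_n|c_n) + w7 * log P'(path)."""
    s = 0.0
    for node in nodes:
        base = 0 if node["in_lexicon"] else 3            # selects w1-w3 vs w4-w6
        offset = {"concat": 0, "split": 1, "subst": 2}[node["op"]]
        s += weights[base + offset] * node["error_logp"]
    correction = " ".join(node["c"] for node in nodes)
    return s + weights[6] * ngram_logp(correction)       # w7 term: language model prior
```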

As a result, the overall excellent performance was mostly because the system performed extremely well on the "no error" group. Indeed, the system has substantially lower precision on the queries with the other three types of errors. The splitting errors seem to be the hardest to correct, followed by the concatenation errors, and the substitution errors seem to be relatively easier.

Table 1: Results on the TREC and MSN datasets
dataset | #queries | precision | recall | f1
TREC    | 2947     | 0.955     | 0.984  | 0.970
MSN     | 2421     | 0.912     | 0.969  | 0.940

Table 2: Results by spelling error type
dataset | error type    | % queries | precision | recall | f1
TREC    | no error      | 94.7      | 0.983     | 0.986  | 0.984
TREC    | substitution  | 3.9       | 0.391     | 0.970  | 0.557
TREC    | concatenation | 0.8       | 0.352     | 0.929  | 0.510
TREC    | splitting     | 0.6       | 0.301     | 0.945  | 0.457
MSN     | no error      | 87.0      | 0.971     | 0.973  | 0.972
MSN     | substitution  | 10.1      | 0.475     | 0.904  | 0.623
MSN     | concatenation | 1.6       | 0.328     | 0.886  | 0.479
MSN     | splitting     | 1.3       | 0.304     | 0.866  | 0.450

4.2 Number of Spelling Corrections

The Speller Challenge encourages participants to generate all plausible query corrections. However, it is unknown how many spelling corrections are enough. In this section we investigate the effect of the number of spelling corrections on the results. We have carried out experiments with five correction sizes (1, 5, 10, 20, 40) on both datasets. Figure 2 summarizes the results. It is clear that a bigger number of corrections leads to higher recall, and the most drastic increase of recall lies between 1 correction and 5 corrections. But the correction size has no effect on precision on either dataset, which suggests that the correction size doesn't affect the top-ranked spelling correction.

[Figure 2: Results by number of corrections]

4.3 Effect of the Lexicon Size

The size of the trusted lexicon is an important factor influencing the speller correction result. In order to investigate the effect of lexicon size we conducted a set of experiments on both datasets with different lexicon sizes (ranging from 100,000 to 900,000). Results in Figure 3 show that the effect of lexicon size is significant on precision: the precision increases as the lexicon size increases. However, the recall is not sensitive to the lexicon size.

[Figure 3: Correction results by lexicon size]

4.4 Clean vs. Noisy Lexicon

In Table 3 we show the effects of using a clean lexicon on the precision and recall of the spelling system. We can see that for both test datasets, there is a noticeable improvement in precision. By removing the noise in the automatically constructed lexicon, our system is able to find matches for candidate queries more precisely. It is interesting that we also observe a small improvement in recall for the second test dataset. This is reasonable as we have to limit the number of output candidates in our system due to performance considerations. By reducing the number of matches against the noisy terms, we are able to include more promising results in our ranker, and hence able to improve the overall recall.

Table 3: Clean vs. noisy lexicon
dataset | lexicon type  | precision | recall | f1
TREC    | clean lexicon | 0.955     | 0.984  | 0.970
TREC    | noisy lexicon | 0.950     | 0.984  | 0.966
MSN     | clean lexicon | 0.912     | 0.969  | 0.940
MSN     | noisy lexicon | 0.896     | 0.967  | 0.930

5. CONCLUSIONS

The key novelty of our system lies in the unified Hidden Markov Model that successfully models all major types of spelling errors in search queries, which is underaddressed by previous works. The large and clean lexicon, advanced error model and n-gram model are also critical to our system. In the future, we want to directly train the HMM model with examples, removing the need to re-rank the candidate paths. We are also investigating a better way to cache n-gram probabilities, which is crucial to the speed of our system.

6. ACKNOWLEDGEMENTS

This work is supported in part by MIAS, the Multimodal Information Access and Synthesis center at UIUC, part of CCICADA, a DHS Center of Excellence, by the National Science Foundation under grants IIS-0713581 and CNS-0834709, and by a gift grant from Microsoft.

7. REFERENCES

[1] http://research.microsoft.com/en-us/collaboration/focus/cs/web-ngram.aspx.
[2] F. Ahmad and G. Kondrak. Learning a spelling error model from search query logs. In HLT/EMNLP. The Association for Computational Linguistics, 2005.
[3] E. Brill and R. Moore. An improved error model for noisy channel spelling correction. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, Hong Kong, 2000.
[4] Q. Chen, M. Li, and M. Zhou. Improving query spelling correction using web search results. In EMNLP-CoNLL, pages 181-189. ACL, 2007.
[5] S. Cucerzan and E. Brill. Spelling correction as an iterative process that exploits the collective knowledge of web users. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2004.
[6] H. Duan and B.-J. P. Hsu. Online spelling correction for query completion. In Proceedings of the 20th international conference on World Wide Web, WWW '11, pages 117-126, New York, NY, USA, 2011. ACM.
[7] J. Gao, X. Li, D. Micol, C. Quirk, and X. Sun. A large scale ranker-based system for search query spelling correction. In C.-R. Huang and D. Jurafsky, editors, COLING, pages 358-366. Tsinghua University Press, 2010.
[8] A. R. Golding. A bayesian hybrid method for context-sensitive spelling correction. In Proceedings of the Third Workshop on Very Large Corpora, pages 39-53, Boston, MA, 1997. ACL.
[9] A. R. Golding and D. Roth. Applying winnow to context-sensitive spelling correction. CoRR, cmp-lg/9607024, 1996. Informal publication.
[10] K. Kukich. Techniques for automatically correcting words in text. ACM Computing Surveys, 24(4), 1992.
[11] L. Mangu and E. Brill. Automatic rule acquisition for spelling correction. In Proceedings of the 14th International Conference on Machine Learning (ICML-97), pages 187-194, Nashville, TN, 1997. Morgan Kaufmann.
[12] M. J. D. Powell. An efficient method for finding the minimum of a function of several variables without calculating derivatives. The Computer Journal, 7(2):155-162, 1964.
[13] X. Sun, J. Gao, D. Micol, and C. Quirk. Learning phrase-based spelling error models from clickthrough data. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL '10, pages 266-274, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.
[14] K. Wang, C. Thrasher, and B.-J. P. Hsu. Web scale NLP: a case study on URL word breaking. In Proceedings of the 20th international conference on World Wide Web, WWW '11, pages 357-366, New York, NY, USA, 2011. ACM.
[15] C. Whitelaw, B. Hutchinson, G. Chung, and G. Ellis. Using the web for language independent spellchecking and autocorrection. In EMNLP, pages 890-899. ACL, 2009.

qSpell: Spelling Correction of Web Search Queries using Ranking Models and Iterative Correction

Yasser Ganjisaffar, Andrea Zilio, Sara Javanmardi, Inci Cetindil, Manik Sikka, Sandeep Katumalla, Narges Khatib, Chen Li, Cristina Lopes
School of Information & Computer Sciences, University of California, Irvine, Irvine, CA, USA
[email protected]

ABSTRACT

In this work we address the challenging problem of spelling correction for Web search queries in order to build a speller that proposes the most plausible spelling alternatives for each query. First we generate a large set of candidates using diverse approaches including enumerating all possible candidates at edit distance of one, fuzzy search on known data sets, and word breaking. Then, we extract about 150 features for each query-candidate pair and train a ranking model to order the candidates such that the best candidates are ranked at the top of the list. We show that re-ranking top results, iterative correction, and post-processing of the results can significantly increase the precision of the spell checker. The final spell checker, named qSpell¹, achieves a precision of 0.9482 on a test data set randomly collected from real search queries. qSpell won the 3rd prize in the recent Microsoft Speller Challenge.

¹ http://flamingo.ics.uci.edu/spellchecker/

1. INTRODUCTION

The spelling correction problem is typically formulated using the noisy channel model [9]. Given an input query q, we want to find the best correction c* among all the candidate corrections:

    c* = argmax_c P(q|c) · P(c),

where P(q|c) is the error model probability, which shows the transformation probability from c to q, and P(c) is the language model probability, which shows the likelihood that c is a correctly spelled query.

The noisy channel model only considers error model and language model probabilities. However, a Web-scale spelling correction system needs to consider signals from diverse sources to come up with the best suggestions. We treat the spelling correction problem as a ranking problem where we first generate a set of candidates from the query and then rank them by learning a ranking model on the features extracted from them. The ranking process is expected to bring the best candidates to the top of the ranked list.

We extract several features from query-candidate pairs and use a machine learning based ranking algorithm for training a ranking model on these features. The top-k results of this ranking model are then re-ranked using another ranker. This step further increases the quality of the ranked list.

Given that it is hard to generate the best candidates for queries that require several corrections, we use an iterative approach which applies one fix at each iteration. In the final step, we use a rule-based system to process the final ranked list and decide how many candidates from the top of the list should be suggested as possible good candidates for the query. In the next sections, we describe more details of our spell correction system.

2. ERROR MODEL

For computing error model probabilities we used the approach proposed by Brill in [9]. To train this model, we needed a training set of ⟨s_i, w_i⟩ string pairs, where s_i represents a spelling error and w_i is the corresponding corrected word. We extracted these pairs from the query reformulation sessions that we extracted from the AOL query log [1]. We processed this query log and extracted pairs of queries (q1, q2) which belonged to the same user and were issued within a time window of 5 minutes. The queries are further limited to cases where the edit distance between q1 and q2 is less than 4 and the user has not clicked on any result for query q1 and has clicked on at least one result for query q2. We processed about 20M queries in this query log and extracted about 634K training pairs. Following the approach proposed in [9], for each training pair (q1, q2), we counted the frequencies of edit operations α → β. These frequencies are then used for computing P(α → β), which shows the probability that when users intended to type the string α they typed β instead.
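A sketch of the session-based filter described above for harvesting (misspelling, correction) pairs from a query log. The record layout (user id, timestamp in seconds, query, clicked flag) is an assumption made for illustration; the thresholds follow the text.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance (the paper aligns pairs with Damerau-Levenshtein)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def extract_training_pairs(records, max_gap_seconds=300):
    """records: (user_id, timestamp_seconds, query, clicked) tuples sorted by user and time.
    Keeps (q1, q2) where q1 got no click, q2 followed within 5 minutes and got a click,
    and the two queries are within edit distance 3 of each other."""
    pairs = []
    for (u1, t1, q1, c1), (u2, t2, q2, c2) in zip(records, records[1:]):
        if (u1 == u2 and not c1 and c2 and 0 <= t2 - t1 <= max_gap_seconds
                and 0 < edit_distance(q1, q2) < 4):
            pairs.append((q1, q2))
    return pairs
```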

To extract edit operations from the training set, we first aligned the characters of q1 and q2 based on their Damerau-Levenshtein distance [13]. Then for each mismatch in the alignment, we found all possible edit operations within a sliding window of a specific size. As an example, we extract the following edit operations from the training pair ⟨satellite, satillite⟩:

• Window size 1: e → i;
• Window size 2: te → ti, el → il;
• Window size 3: tel → til, ate → ati, ell → ill.

We counted the frequency of each of these edit operations in the training set and used these frequencies for computing the probability of each edit operation and picking the most probable edits when computing the error model probabilities for different candidates.

3. LANGUAGE MODELS

We implemented three different smoothed language models in our system: Stupid Backoff [8], Absolute Discounting [10], and Kneser-Ney [10] smoothing. Moreover, each of these language models is computed on three different datasets: Google ngrams [7], Yahoo ngrams [3], and Wikipedia body ngrams (Table 1). Each of these data sets has different properties and we were expecting this diversity to improve the accuracy of our speller. The Google ngrams dataset is collected from public Web pages and therefore also includes many misspellings. On the other hand, the Yahoo ngrams dataset is collected from news Websites and is therefore cleaner. However, in terms of number of tokens it is much smaller than the Google ngrams dataset. Both of these datasets are based on crawls of the Web in 2006 and therefore do not cover new topics. For this reason, we also created an ngram dataset from Wikipedia articles by processing the dump of Wikipedia article content released in Jan 2011 [2]. Table 1 shows the properties of these datasets.

Table 1: Language Model Datasets
Dataset        | No. of tokens | No. of ngrams
Google         | 1T            | 615M (up to trigrams)
Yahoo          | 3B            | 166M (up to trigrams)
Wikipedia Body | 1.4B          | 134M (up to trigrams)

We used a MapReduce cluster to normalize the ngrams in these datasets. For example "The", "the" and "THE" are all normalized to "the" and their frequencies are aggregated.

Each of these language models has parameters that need to be tuned. We used parallel grid search on a MapReduce cluster to tune the parameters of these language models at the unigram, bigram and trigram levels. To train these language models, we used a training set of (q, c*) pairs, where q is a query and c* is the optimal candidate for this query. The goal of the training process was to find optimal values for the language model parameters such that the conditional probability P(c*|q) is maximized:

    P(c*|q) = P(q|c*) · P(c*) / Σ_i P(q|c_i) · P(c_i)

As for the training set, we extracted 634K pairs of (q, c*) from the AOL query log as explained in Section 2.

In addition to the above-mentioned word based language models, we also used a character based language model. We used a Hadoop job to extract the frequencies of 1-5 character grams from the Google data set and used it with Stupid Backoff smoothing as an additional language model. Note that this language model is not sparse and therefore we rarely need to back off to lower-order ngrams. Therefore the choice of the smoothing method does not have any significant effect on the features that are extracted from this language model.

4. NGRAMS SERVICE

Given that we needed to query the ngram datasets very frequently, it was important to have a service which can return the frequency of each given ngram in a short amount of time. For this purpose, we used the "compress, hash and displace" algorithm as described in [6] to create a minimal perfect hash (MPH) function for each dataset. This hash function is constructed on a set of N grams and assigns a unique number between 1 and N to each of them. This unique number can then be used to read the frequency of the ngram from an array.

While the hash function is guaranteed to return a unique number for each of the ngrams in the corpus, it still returns some unique number between 1 and N for any other string that has not been in the corpus. To reduce the occurrence of these false positives, we followed the approach suggested in [12]. In this approach, in addition to storing the frequencies of ngrams, we stored a fingerprint for each ngram as well. These fingerprints help in detecting a significant portion of the false positives.

Figure 1 demonstrates this process. The hash function returns a unique number for each ngram. This unique number is then used as an index to read the fingerprint and frequency of the ngram from an array. We first check the fingerprint of the ngram and verify that it matches the expected fingerprint. If the fingerprints do not match, we return zero frequency. Otherwise the frequency value which is stored in the array is returned.

We store a 20-bit fingerprint for each ngram. The largest frequency in the Google data sets is 23,142,960,758, which can be stored in 35 bits. However, there are only 528,917 unique frequency numbers in this data set. Therefore, to save memory space, we sorted the unique frequency numbers and stored them in an array with 528,917 elements. Then for each ngram, instead of storing the raw frequency of the ngram, we stored the index into this unique frequencies array (Figure 1). Using this approach we would be saving 15 bits per ngram.

[Figure 1: A memory-efficient data structure for fast look up of ngram frequencies]

We were able to fit the normalized Google ngrams dataset (unigrams to trigrams) in about 3.2GB of memory and the look up time was about 3 microseconds. We measured the false positive rate by submitting the four-grams and five-grams in this data set and the rate was 9.6 × 10⁻⁷, which was small enough for our purpose.

5. CANDIDATE GENERATION

We implemented three different candidate generators in our system, which together generate about 3,000 candidates for each query. First, a character-based candidate generator generates all possible candidates within an edit distance of 1. It considers replacing each character with every character in the alphabet, transposing each pair of adjacent characters, deleting each character, etc.

A quick analysis of the AOL query logs showed that 16% of the query corrections that we had extracted from this query log differ from the original query only in adding/removing spaces. The following are some examples:
• ebayauction → ebay auction
• broccoliandcheesebake → broccoli and cheese bake
• i cons → icons

In order to handle this class of queries, we implemented the word segmentation algorithm described in [15] with some minor changes. For each possible segmentation, we query a language model to compute the probability of the different segmentations of the query. The most probable segmentations are then added to the candidate list.

In order to find out which data set and language model smoothing is more appropriate for the word segmentation task, we filtered the (q, c*) pairs that were extracted from the AOL query logs and extracted 40,000 pairs where the difference between the query q and the expected candidate c* is only in spaces. We then used this data set to see which choice of ngrams data set and language model smoothing technique maximized the probability of generating c* after applying the word segmentation algorithm to the query. Table 2 shows the results. As this table shows, the Stupid Backoff language model on the Google ngrams data set outperformed the others and therefore we used this combination in our word segmentation module.

Table 2: Word segmentation with different data sets and language model smoothing techniques
Dataset      Language Model Smoothing    Recall
Google       Stupid Backoff              97.06%
Google       Absolute Discounting        96.03%
Google       Kneser-Ney                  90.86%
Yahoo        Stupid Backoff              95.11%
Yahoo        Absolute Discounting        94.39%
Yahoo        Kneser-Ney                  93.44%
Wikipedia    Stupid Backoff              94.59%
Wikipedia    Absolute Discounting        93.40%
Wikipedia    Kneser-Ney                  91.85%

Neither of the above two candidate generators can generate candidates whose corrections are at an edit distance of more than 1 and not only in adding or removing spaces. For example, for the query "draigs list irvine" we want to have "craigslist irvine" as a candidate. We used the Flamingo package2 to perform fuzzy search and quickly find known unigrams within a small edit distance of the unigrams of the query. We extracted the 790K most popular unigrams from the Google ngrams dataset and the 947K most popular unigrams from the Wikipedia dataset. This step resulted in a set of 1.3M unigrams which were indexed by the Flamingo package.

6. LEARNING TO RANK CANDIDATES

In this section, we describe the details of the machine learning approach that we took for training a ranking model that orders candidates. We first needed a training data set. During the Speller Challenge competition, participants were provided with a data set of publicly available TREC queries from the 2008 Million Query Track [4]. The data set also includes human judgments for correct spellings of each query as suggested by several human assessors. However, queries in this data set are sampled from queries which have at least one click in the .gov domain [5]. Therefore this data set was highly biased towards a special class of queries and was not a good representative of general search queries.

6.1 Training Data Set

To have an unbiased data set for training and testing the ranking models, we randomly sampled 11,134 queries from the publicly available AOL and 2009 Million Query Track query sets and asked 8 human assessors to provide spelling corrections for these queries. Each query was judged by at least one human assessor. Queries for which the human assessors had provided at least one suggestion different from the original query were reviewed by at least another human assessor. In order to assist the human assessors in providing the most plausible suggestions for each query, we had designed an interface that showed Google and Bing search results for each query.

A total of 12,042 spelling suggestions were proposed for the queries. Of these suggestions, 2,001 are different from the original query, and in 905 of the cases the difference is only in adding or removing spaces. Out of the remaining 1,096 cases, 73% of the suggestions are at an edit distance of 1 from the original query and 23% of the suggestions are at an edit distance of 2 from the original query.

We used 6,000 of the queries in our data set for 5-fold cross validation and for determining which features are helpful to include in the feature set. The remaining 5,134 queries are kept untouched for evaluation purposes. We refer to this dataset as JDB2011 in the rest of this paper and it is publicly available3. A spell checker which always returns the original query without any change will get a baseline precision of 0.8971 on the test queries.

In order to train a ranking model on this data set, we need to assign a label to each query-candidate pair to show the preference for different candidates. The ranking model is then trained on these samples to learn to rank better candidates on top of the others. As mentioned in Section 5, we generated about 3,000 candidates for each query. We want the candidates that match suggestions provided by human assessors to be ranked at the top of the list. Next, we prefer candidates which are within an edit distance of 1 from any of these suggestions, and so on. Therefore, we labeled the candidates according to this preference logic and used these labels for training the ranking model.
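A minimal sketch of the preference-based labeling just described. The label values (2 for an exact match with an assessor suggestion, 1 for a candidate within edit distance 1 of a suggestion, 0 otherwise) are an assumed encoding for illustration; the paper only states the preference order, not the exact label scheme.

def edit_distance(a, b):
    # standard Levenshtein distance via dynamic programming
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def label_candidate(candidate, assessor_suggestions):
    # assumed 3-level preference encoding (see note above)
    if candidate in assessor_suggestions:
        return 2
    if any(edit_distance(candidate, s) <= 1 for s in assessor_suggestions):
        return 1
    return 0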

2 http://flamingo.ics.uci.edu/
3 http://flamingo.ics.uci.edu/spellchecker/

6.2 Ranking Features

We extracted 89 raw features for each query-candidate pair. These features include: error model features, candidate language model features, query language model features, surface features capturing differences between the query and the candidate, and the frequency of the query, the candidate and their substrings in different lists such as Wikipedia titles, dmoz and IMDB. Table 3 shows the number of entities in the different lists that we used for this purpose.

Table 3: Entity Datasets
Source             No. of Entities
Wikipedia titles   3.1 M
dmoz domains       2.1 M
dmoz titles        6.5 M
IMDB               3.3 M
Porno domains      448 K

In addition to these raw features, we also extracted 49 meta features for each query-candidate pair. Meta features are computed by applying some basic operations on the original ranking features. For example, we expect a good candidate to be highly probable based on the error model probability, P(q|c). We also expect the ratio P(c)/P(q) to be high. Therefore we used the following meta-feature, which combines these three ranking features:

    log P(q|c) + log P(c) - log P(q)

6.3 Ranking Model

We used an ensemble of gradient boosted trees [16] for learning a ranking model from the extracted features. Given that only about 15% of the queries in our data set needed spell correction, the training data set was imbalanced and therefore the ranker preferred the candidate equal to the query (as this option is correct in 85% of the cases). To handle this problem we randomly selected several balanced datasets from the original training set and trained different ensembles on each of them. The final ranker is a bagged ensemble that uses the average of these ensembles to determine the ordering of candidates. As mentioned in [11], the bagged ensemble also results in a more accurate model that has lower variance. The precision at position 1 for this ranker on the JDB2011 test set is 0.9055.

7. RE-RANKING TOP CANDIDATES

Re-ranking of top search results has been shown to improve the quality of rankings in search problems [14]. Given that in spelling correction the main focus is on the top of the ranked list of candidates, we added a re-ranker module on top of the ranker module. This significantly improved the results of the original ranker. There are two main reasons for the success of the re-ranker: 1) It focuses only on the top-k results and therefore solves a much easier problem compared to the original ranker, which needs to consider about 3,000 candidates for each query. 2) In the re-ranking phase we added some more features which are extracted from the top k results. These features include the score computed in the ranking phase and the average, max, min, and standard deviation of the scores of the top-k candidates, etc. In contrast to the original ranking features, which are only extracted from query-candidate pairs, these new features also consider the top candidates of the query and therefore include valuable information that may help in the ranking process. The precision at position 1 on the JDB2011 test set after re-ranking increases to 0.9412, which is significantly better than the precision of the ranking module.

8. ITERATIVE CORRECTION

The spelling correction approach described in the previous sections does not handle more complex cases where multiple terms in the query need to be corrected, or a single term needs multiple corrections and the correct term is not among the originally generated candidates. For handling these cases, we use iterative correction to apply one correction at each iteration. For example, for the query "satilite whether maps" we have the following iterations:

    satilite whether maps
    (1) → satilite weather maps
    (2) → satellite weather maps
    (3) → satellite weather maps

As this example shows, we stop when there is no change in the query in the last iteration. The precision at position 1 on the JDB2011 test set after applying iterative correction increases to 0.9453.

9. FINAL POST-PROCESSING

In the final phase, after iterative correction has stopped, we need to pick one or more candidates from the top of the ranked list of candidates. For this purpose, we used post-processing rules that process the final ranked list of candidates and decide which ones should be picked as suggested corrections. For example, most of the time we want to suggest the top-ranked candidate as well as other candidates near the top of the list whose score is within a threshold of the score of the top-ranked candidate and which differ from this candidate only in adding or removing spaces or in singular/plural cases.

In addition, we used some other rules that target different classes of queries. For example, given that a large portion of search queries are navigational queries, we use different lists compiled from dmoz and porno blacklists (Table 3) to detect domain names and treat them specially. In the case of domain names, we show the original query and the most probable segmentation of the query. For example, for the query "annualcreditreport" we suggest both "annualcreditreport" (which is a domain name) and "annual credit report", which is its most probable segmentation. After applying the post-processing rules, the precision at position 1 on the JDB2011 test set increases to 0.9482.

Acknowledgments

The authors would like to thank Amazon.com for a research grant that allowed us to use their MapReduce cluster, and Alexander Behm for his advice on using the Flamingo package. This work has also been partially supported by NIH grant 1R21LM010143-01A1 and NSF grants OCI-074806 and IIS-1030002.

10. REFERENCES
[1] AOL query log. http://www.gregsadetsky.com/aol-data/.
[2] English Wikipedia dumps. http://dumps.wikimedia.org/enwiki/latest/.
[3] Yahoo! Webscope dataset, N-Grams, version 2.0. http://research.yahoo.com/Academic_Relations.
[4] Microsoft Speller Challenge TREC data, 2011. http://tinyurl.com/6xry3or.
[5] J. Allan, B. Carterette, J. A. Aslam, V. Pavlu, and E. Kanoulas. Million query track 2008 overview. In E. M. Voorhees and L. P. Buckland, editors, The Sixteenth Text REtrieval Conference Proceedings (TREC 2008). National Institute of Standards and Technology, December 2009.
[6] D. Belazzougui, F. Botelho, and M. Dietzfelbinger. Hash, displace, and compress. In A. Fiat and P. Sanders, editors, Algorithms - ESA 2009, volume 5757 of Lecture Notes in Computer Science, pages 682–693. Springer Berlin / Heidelberg, 2009.
[7] T. Brants and A. Franz. Web 1T 5-gram Version 1, 2006.
[8] T. Brants, A. C. Popat, P. Xu, F. J. Och, and J. Dean. Large language models in machine translation. In EMNLP, pages 858–867, 2007.
[9] E. Brill and R. C. Moore. An improved error model for noisy channel spelling correction. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, ACL '00, pages 286–293, 2000.
[10] S. F. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. In Proceedings of the 34th Annual Meeting on Association for Computational Linguistics, ACL '96, pages 310–318, 1996.
[11] Y. Ganjisaffar, R. Caruana, and C. Lopes. Bagging gradient-boosted trees for high precision, low variance ranking models. In SIGIR 2011: Proceedings of the 34th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2011.
[12] D. Guthrie and M. Hepple. Storing the web in memory: space efficient language models with constant time retrieval. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP '10, pages 262–272, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.
[13] V. I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Technical Report 8, 1966.
[14] I. Matveeva, A. Laucius, C. Burges, L. Wong, and T. Burkard. High accuracy retrieval with multiple nested ranker. In SIGIR, pages 437–444, 2006.
[15] P. Norvig. Natural language corpus data. In Beautiful Data, pages 219–242. O'Reilly, 2009.
[16] Q. Wu, C. Burges, K. Svore, and J. Gao. Ranking, boosting and model adaptation. Technical Report MSR-TR-2008-109, Microsoft, 2008.

TiradeAI: An Ensemble of Spellcheckers

Dan Ştefănescu, Radu Ion, Tiberiu Boroş
Institute for Artificial Intelligence, Romanian Academy
Calea 13 Septembrie nr. 13, Bucharest, 050711, Romania
0040213188103
[email protected], [email protected], [email protected]

ABSTRACT
This paper describes three different offline query spell checkers that were entered in the Microsoft Speller Challenge competition, and a method of combining their output in order to obtain better results. The combination is justified by the fact that each of the spellers has been developed with a different design and that, on the Speller Challenge TREC Data, their errors are complementary, i.e. they do not make the same mistakes.

Categories and Subject Descriptors
G.3 [Probability and Statistics]: statistical computing; H.1.2 [User/Machine Systems]: human information processing; H.3.3 [Information Search and Retrieval]: query formulation, relevance feedback; I.2.6 [Learning]: parameter learning, induction; I.2.7 [Natural Language Processing]: language models.

General Terms
Algorithms, Experimentation, Human Factors, Languages.

Keywords
Spellchecking, Maximum Entropy classifiers, Viterbi search, language models, WordNet, ensemble classifiers.

1. INTRODUCTION
Approximately 30.2% of the world population uses the Internet to search for diverse information [9]. Whether it is the results of their favorite sports team or the weather forecast for the next day, these millions of people take their queries to a search engine in the hope of finding useful information. Queries are often typed in haste, sometimes on devices with small keyboards, and often by non-native English speakers searching for information in English. This is why misspelling, due to typographic or cognitive errors [13], is an ordinary fact among search engine queries. Consequently, users' preference for one search engine over another is highly influenced by its capability to deal with such errors, and the performance of a web searcher will ultimately depend on how well it manages to detect what the users are really looking for.

Studies show that 10 to 12 percent of all query terms entered into Web search engines are misspelled [10, 4] due to a variety of reasons [6], such as accidentally hitting an adjacent key on the keyboard (movius – mobius), typing quickly (teh – the), inconsistent spelling rules (goverment – government), ambiguous word breaking (health care – healthcare) or just new words (exenatide, baggravation, funkinetics).

Search engines have two ways of helping their users express and retrieve the desired information: offline spelling correction or online spelling correction [6]. The latter has some advantages over the former, first of all because it can eliminate the mistakes in spelling and the ambiguity of the query in real time, while the user is still typing – something very similar to auto-completion. It is nowadays employed by well-known search engines like Bing, Google or Altavista and requires as much training data as possible (both correct and misspelled), which can be automatically acquired by gathering users' input and analyzing it.

This paper briefly presents three different offline query spell checkers enrolled in the Microsoft Research Speller Challenge competition. The individual experiments revealed that they behave differently, often suggesting variants that are ranked differently and usually making different mistakes. This put us in the position of considering combining their results, as it is well known that, when certain conditions are met, more decision makers (classifiers) working together are likely to find a better solution than when working alone [5, 14]. Such conditions require similar performance for the decision makers and, in addition, that they should not make similar errors.

The paper describes each of the spellers and the method of combining their results.

2. THE FIRST ALGORITHM
The dataset used as the primary resource for this algorithm is based on the 2006 Google Web 1T 5-gram Corpus (GC)1 (an alternative taken into consideration was the Books N-gram Corpus [11], but it did not have a good impact on the accuracy of the algorithm). To build the speller's lexicon, only occurrences more frequent than 300,000 were taken into consideration (in order to avoid misspellings), together with all words obtained from the Princeton WordNet 3.0 (PWN) [7], ignoring their frequencies. Using several English inflection rules, such as plural forming for regular nouns, present participles for verbs, and superlatives and comparatives for adjectives, the lexicon was enriched with more than 30,000 words.

Another way to enrich the lexicon was by adding co-occurrences existing in PWN, which are considered strong n-grams: william jefferson clinton, september 11, etc. They contain up to nine terms and have a high weight in computing the final score.

Strong n-grams are also used for finding missing words between terms, if it happens that a query contains a strong n-gram minus one or more terms.
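A rough sketch of how such a strong-n-gram lookup could work: a query whose terms appear, in order, as a subsequence of a known strong n-gram triggers the full n-gram as a suggestion. The subsequence test and the data layout are assumptions for illustration; the paper does not spell out its matching procedure.

def is_subsequence(short, long):
    # True if all tokens of `short` occur in `long` in the same order
    it = iter(long)
    return all(tok in it for tok in short)

def complete_from_strong_ngrams(query_tokens, strong_ngrams):
    # suggest strong n-grams that contain the query terms with words missing
    suggestions = []
    for ngram in strong_ngrams:                 # e.g. ("william", "jefferson", "clinton")
        if len(query_tokens) < len(ngram) and is_subsequence(query_tokens, ngram):
            suggestions.append(" ".join(ngram))
    return suggestions

# complete_from_strong_ngrams(["william", "clinton"], [("william", "jefferson", "clinton")])
# -> ["william jefferson clinton"]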

Copyright is held by the author/owner(s).

1 http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html

Other additions are proper names of people, countries and cities.

Table 1 - Strong n-grams from PWN
2-grams   58125
3-grams   9400
4-grams   1726
5-grams   302
6-grams   76
7-grams   22
8-grams   21

The proposed algorithm is a basic two-step approach. The first step is statistical ranking based on the GC lexicon, bigrams and trigrams, and the PWN. It uses a modified Levenshtein distance (LD) [12] algorithm and the weighted frequency of terms in the GC (see the equation below). We also split the existing candidates into smaller words and create concatenation variations.2 The second step takes the first-ranked alternative and compares its score to the original query score. If the difference does not exceed a certain threshold (see below), it returns the original query as the first spelling option. The scores of the spelling variants are then normalized, with the first result being assigned a higher probability for better precision. Thus, we obtain an overall precision of about 95% on the Speller Challenge TREC dataset (4 percentage points over the baseline).

The problem with the classical LD is that it does not make distinctions between candidates that have nothing to do with the original query and those that are corrections of common mistakes. Producing good suggestions requires that all the words in the query be processed and variations created to accommodate both mistypes and inappropriate use of terms. Classifying alternatives that are not misspellings is very difficult. The difference between out-of-vocabulary terms and typos is insignificant in many cases. The modified LD that was used to create term variations takes into consideration common mistakes such as misspelling of two consecutive letters, transposition, etc. (e.g. "Carribbean" → "Caribbean", "porgramming" → "programming"). Word length is a factor that indicates a higher probability of mistake on longer terms. We do not apply this method to words with fewer than 4 letters. Instead, we use a list of common mistakes for these words.

The resulting query alternatives are ranked using unigrams, bigrams and trigrams (see the next equation). The score is multiplied by a factor dependent on the above-mentioned LD between the original query and the alternative query:

    score(q, q1) = lev(q, q1) x (w1 Sum_i f(ti) + w2 Sum_i f(ti ti+1) + w3 Sum_i f(ti ti+1 ti+2))

where:
• ti is the i-th term of the spelling variant q1;
• w1 = 0.0001, w2 = 0.005 and w3 = 0.9949 are n-gram weights tuned on the Speller Challenge TREC dataset;
• f(·) are frequency counts in the GC;
• lev(q, q1) = 10^(-d) is a factor dependent on the Levenshtein distance d between the original query q and the spelling variant q1.

When calculating the score for an alternative query, we discard extremely high-frequency words. They are not used in unigrams and bigrams and cannot start or end a trigram. For instance, when referring to "best of luck", the score will not be influenced by the unigram "of" or the bigrams "best of" and "of luck". On the other hand, "of" can be in the middle of a trigram, so the score will be given by the following elements: unigram "best", unigram "luck" and trigram "best of luck".

The GC is outdated (2006) and this makes it unreliable when computing the score for choosing the best spelling suggestion. Because of this, and based on the statistical fact that about 85% of the queries in the Speller Challenge TREC dataset were correct (not misspelled at all), we decided not to replace the original query if its probability is 50% less than that of the best alternative. Also, in situations where the highest-ranked alternative is the same as the original query except for differently inflected variants of the original query terms (plural, superlative, comparative), the original query is preferred.

3. THE SECOND ALGORITHM
The second algorithm is very simple and has three main steps: (i) compacting the given query, (ii) selecting correct suggestions for each word in the obtained query, according to a table of unigram frequencies, and (iii) choosing the combination of suggestions which has the maximum probability according to the available language model. The main resources employed by the speller are:
• a unigram table containing Google Web 1T 5-gram Corpus (GC) unigrams with frequencies above 10,000, Princeton WordNet 3.0 (PWN) word-forms, various named entities (names of persons, cities, minerals, cyclones and dinosaurs extracted from public sites like Wikipedia3 or NOAA4), and various bigrams obtained by splitting all unigrams in the GC into two words that make sense. This table is kept in memory for fast operations;
• GC bigrams and trigrams kept in a fast Berkeley DB interrogated through a REST Web Service.

In the first step, the speller compacts the queries, which means that terms in the initial queries are glued together if they can form words that make sense (according to the unigram table) with a reasonably high frequency (above 1,000,000). The idea is to have the same consistent approach to correcting a query, whether it is compact or has spaces inserted. Here, "health care" becomes "healthcare" and "web service" becomes "webservice". While iterating from the leftmost to the rightmost term in the query, two things happen: first, consecutive mono-character terms are glued ("n o a a" becomes "noaa" and "u s" becomes "us"); second, two terms t1 and t2 are glued into t1t2 if the constraints listed below are simultaneously satisfied.

The constraints are: (i) t1 is not a term composed of other mono-character terms; (ii) t2 is not a mono-character term; (iii) t1 and t2 are not stop words, unless they form certain words like byproduct or online; and (iv) t1t2 can be found in the unigram table with a frequency larger than 1,000,000.

2 It is highly important not to split terms needlessly. For instance, instead of breaking "improving" into "imp" (mythological being), "rov" (remotely operated vehicle) and "ing" (bank), or into "imp" and "roving" (a long and narrow bundle of fibers), we leave it in place based on its frequency (17,855,962 for "improving") and the length of the parts.
3 http://en.wikipedia.org/wiki/Main_Page
4 http://www.nhc.noaa.gov/aboutnames.shtml
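A minimal sketch of the compaction step under the constraints above, assuming a unigram_freq dictionary and a stop_words set. The thresholds and the exception list follow the description, but the exact iteration details are assumptions for illustration.

GLUE_THRESHOLD = 1_000_000
GLUE_EXCEPTIONS = {"byproduct", "online"}   # stop-word pairs that may still be glued

def compact_query(tokens, unigram_freq, stop_words):
    # pass 1: join runs of single-character terms ("n o a a" -> "noaa")
    merged, run, from_chars = [], [], set()
    for tok in tokens + [""]:
        if len(tok) == 1:
            run.append(tok)
        else:
            if run:
                merged.append("".join(run))
                from_chars.add(len(merged) - 1)   # built from mono-character terms
                run = []
            if tok:
                merged.append(tok)
    # pass 2: glue adjacent t1, t2 when constraints (i)-(iv) are satisfied
    out, i = [], 0
    while i < len(merged):
        if i + 1 < len(merged):
            t1, t2 = merged[i], merged[i + 1]
            glued = t1 + t2
            satisfies = (
                i not in from_chars                                        # (i)
                and len(t2) > 1                                            # (ii)
                and ((t1 not in stop_words and t2 not in stop_words)
                     or glued in GLUE_EXCEPTIONS)                          # (iii)
                and unigram_freq.get(glued, 0) > GLUE_THRESHOLD            # (iv)
            )
            if satisfies:
                out.append(glued)
                i += 2
                continue
        out.append(merged[i])
        i += 1
    return out

# compact_query(["health", "care"], {"healthcare": 2_000_000}, set()) -> ["healthcare"]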

In the second step, selecting the best correct suggestions for each term in the query is done by employing a complex lexical similarity measure (between words) which is mainly based on the Levenshtein distance (LD), the length difference in characters, unigram frequencies and the longest common substring. In this measure, t is a query term, s is a possible suggestion for t, k is a constant (0.66), and α and β are variables depending on the above parameters. tm and sm are modified versions of t and s obtained by removing duplicate letters and by replacing some character sequences as follows: a space is replaced by the empty string, ph by f, y by i, hn by n, and ha by a. Adding 1 to the LD score values ensures a non-zero denominator. For example, if t = corect and s = correct, then α = 1, β = 0.2, and score(t,s) = 0.87. If t = corect and s = corect, α = 0.53, β = 0.33 and score(t,s) = 0.56.

Using this measure, the algorithm computes scores against all the unigrams in our table, using multiple threads for improved speed. Most of the variants are discarded from the start based on length differences, so this process is fast, a complete search for a query term being completed in approximately half a second. In the end, the top ten suggestions are kept for every term.

In step three, the algorithm validates and scores the different combinations of suggestions according to the GC language model. This is done by using a modified version of the Viterbi algorithm [15] that allows multiple results to be output. For smoothing, the speller uses an interpolation function that allows all the parts of an n-gram to contribute to the final score:

    S(abc) = if (constraints are satisfied) λ*0.5 + 0.25*S(ab) + 0.25*S(bc)
    S(ab)  = if (constraints are satisfied) λ*0.5 + 0.25*S(a) + 0.25*S(b)5

These equations show that this model gives high importance to the terms (which can now be considered words) in the middle of the queries. In order for an n-gram to be taken into account, the following constraints must be satisfied:
• at least one of the words must be the best suggested unigram variant;
• the words in the n-gram must not be part of a stop-words list;
• the log-likelihood-like score (LL) must be above a threshold.

This approach successfully brings "capitol hill massacre" to first place for the input query "capital hill massacre", "commonwealth of virginia" for "commonwealth of virgina", and "jedi knight" for "jedy night".

In the end, different alternatives of the best returned spelling variant are added. Spaces are inserted between words containing only consonants (as they are probably abbreviations), compressed words like "us" turn into "u s", indefinite articles are added in front of long words, as well as the genitive particle "s", and plurals like "veterans" turn into "veteran s" to also allow for the genitive. Finally, the speller outputs the most probable 50 suggestions plus the original query if it is not already there. Using a so-called decay function similar to the one used by the first algorithm, it assigns more than 99% of the probability mass to the best suggestion, the remaining mass being split between the others. The reason for using the decay function is that the speller had a very high precision on the TREC training data (around 95%). The large number of suggestions allows it to aim for a high recall too.

4. THE THIRD ALGORITHM
The third algorithm divides the query alteration process into error detection and error correction, based on the fact that the second process necessarily depends on the first, but the first may well stand on its own.

The error detection mechanism is based on a classifier that, given an input term, chooses between two classes: "leave it" (the term is considered to be correctly spelled and consequently should be left unchanged) and "change it" (the term is not recognized and spelling variants should be generated). We observed that, in accordance with the Speller Challenge TREC Data – a gold standard collection of correct and misspelled queries along with the correct variants – detecting when a query term is correctly or incorrectly spelled depended on a variety of features such as:
• a rare, frequent or very frequent threshold; word frequencies were extracted from the unigrams of the Google Web 1T 5-gram Corpus (2006 release, version 1). We observed that while the majority of rare words needed to be changed and the majority of very frequent words needed to be left alone, a part of the frequent words needed to be changed (e.g. "capital" → "capitol" in a query like "capital hill history");
• English morphology clues: inflection (plural for nouns, past tense for verbs, etc.), words with a repeating consonant (e.g. "carribean" → "caribbean"), proper nouns (e.g. "acey byrd" → "acey byrd", not "ace bird"), whether the word is a functional word (preposition, conjunction, pronoun, etc.) or not functional (noun, adjective, verb, adverb).

In order to decide whether or not to propose spelling variants for a given query term, we employed a Maximum Entropy (ME) [1] classifier that makes this decision based on our feature-encoded knowledge of that term. Specifically, we have used the following feature functions: isRare, isFrequent, isVeryFrequent, isInflected, isProperNoun, isFunctionWord, isNoun, isVerb, isAdjective, isAdverb, hasDoubleConsonant.

We have trained the ME classifier on a collection of spellchecking data gathered from multiple sources6, among which the aspell test data and the Wikipedia collection of editing spelling errors. The best accuracy of this classifier is 90.74% for the "change it" class, 93.04% for the "leave it" class and 91.89% overall.

The error correction algorithm seeks to optimally and simultaneously do two different things: select the most probable spelling variant for each query term while, at the same time, selecting the most probable spelling alternative for the given query. In order to cope with split and joined terms, we first recursively generate all possible split and join variants for the given input query, such that each query term is joined with the neighboring terms and/or split into at most three parts.

For each query term that the classifier deems "change it", we generate the 30 most frequent spelling variants that are at most two string edit operations away from the misspelled term. The frequencies are extracted from the GC and the threshold was experimentally set to 5x10^4 (if the frequency of the alternative is below this threshold, the alternative is discarded).
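A rough sketch of the split-and-join variant generation mentioned above. It applies one join or one split (into at most three parts) at a time; the paper generates these variants recursively over the whole query, so this is a simplified illustration with assumed helper names.

def split_variants(term, max_parts=3):
    # all ways to split a term into at most max_parts non-empty parts
    results = {(term,)}
    if max_parts > 1:
        for i in range(1, len(term)):
            for rest in split_variants(term[i:], max_parts - 1):
                results.add((term[:i],) + rest)
    return results

def split_join_variants(tokens, max_parts=3):
    # candidate re-tokenizations: join a term with its right neighbour, or split a term
    variants = {tuple(tokens)}
    for i in range(len(tokens) - 1):
        variants.add(tuple(tokens[:i]) + (tokens[i] + tokens[i + 1],) + tuple(tokens[i + 2:]))
    for i, tok in enumerate(tokens):
        for parts in split_variants(tok, max_parts):
            variants.add(tuple(tokens[:i]) + parts + tuple(tokens[i + 1:]))
    return variants

# split_join_variants(["health", "care"]) includes ("healthcare",) and ("health", "care")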

5 We write S and not P since these are not probabilities.
6 http://www.dcs.bbk.ac.uk/~roger/corpora.html

If the term is deemed not changeable, we only generate the 15 most frequent variants at one string edit operation away that are more frequent than 5x10^4 and that form a frequent (threshold of 10^2) bigram or trigram with the surrounding words (the frequencies of bigrams and trigrams are also determined from the GC). We added this generation step in order to compensate for the approx. 7% of classifier failures in the case of the "leave it" class, and also because, for the current version of the speller, we were not able to program the classifier to take the context of the input word into account.

For each query term we also generate variants that are semantically and/or morphologically related to the term, if and only if the term is deemed non-changeable by the Maximum Entropy classifier. Thus, we want to generate plurals for nouns or present participles for verbs, since these forms are often used when formulating queries. Also, we would like to generate different correct alternative spellings such as "tumor" and "tumour". To do this, we used Princeton WordNet 3.0 (PWN) and extracted an associative list of all literals that are two string edit operations away and belong to semantically related synsets (synonyms, hypernyms, meronyms, etc.).

The algorithm for searching for the best spelling alternative for the given input query uses the Viterbi algorithm [15] with a beam of 5 (it remembers the most probable 5 solutions at each point in the search path). We output the first 20 spelling alternatives plus the original query (21 spelling alternatives).

The transition probabilities are given by the inverse of the perplexity of the alternative query up to that point. The language model used by the Viterbi procedure is also based on the GC. It is a trigram language model that uses Jelinek-Mercer interpolation [3] with fixed, experimentally determined weights: 0.8 for the trigram probability, 0.19 for the bigram probability and 0.01 for the unigram probability.

In order to be able to rank relevant queries higher (queries that are more likely to produce results), we have trained the MITLM [8] language modeling toolkit (with ModKN smoothing) over a corpus consisting of:
• the "AOL query log" corpus [2], aiming at obtaining low perplexities (and top ranking) for queries that have been asked by other users. This is our way of enforcing both correct English syntax (or, at least, OK-for-searching English syntax) and similarity to users' interests;
• a collection of PWN expressions with at least 2 words per expression, each of them repeated 100 times in order to give it statistical significance;
• the correct part of the Speller Challenge TREC Data;
• an existing selection of 4,000 English Wikipedia documents, aiming at enriching the language model with named entities and terminological data.

5. THE COMBINER
The combining method we employed is a simple voting system. Since the spellers assign probabilities to the spelling variants they return, their values are comparable. The experiments performed on the training data revealed that the three algorithms frequently suggest differently ranked variants and usually make different mistakes. We considered that the conditions for combining the algorithms were met.

We decided that each alternative spelling should receive a score computed as the sum of the probabilities given by each of the spellers. For this to happen, we switched off (for all spellers) the decay function that assigns over 99% of the probability mass to the first top suggestion. The next step was to normalize the scores and reorder the entire list according to the new values. If the obtained top suggestion was different from the original one and the difference between its score and that assigned to the original was lower than 0.1, we gave credit to the original and swapped the suggestions. The table below shows the performance of the spellers and the combiner on both the Bing Test Data and the TREC training data.

Table 2: Official results for the three spellers and the combiner
Speller                                                              F1 on Bing Test Data   F1 on TREC Training Data
1st algorithm: NemoRitor-WrapperSpeller (author: Tiberiu Boroş)      76.07%                 ~96%
2nd algorithm: TNGS-tgr (author: Dan Ştefănescu)                     71.19%                 ~95%
3rd algorithm: DR. Speller-langmodrank (author: Radu Ion)            65.66%                 ~94%
Combiner (4th place at the Microsoft Speller Challenge)              79.50%                 ~97%

6. CONCLUSIONS
After the Microsoft Speller Challenge ended, we had a chance to study the speller's httpd logs and collect the actual Bing Test Data, which comprises 1,325 queries. Comparing these queries with those from the Speller Challenge TREC Data, we noticed how much more "noisy" the Bing Test Data is by comparison: there are a lot of queries composed of "cryptic" terms such as chemicals, proper names, names of web sites, units of measure, different acronyms, hard-to-split terms, etc., most of which our spellers tried to correct as they did not recognize them. This is one of the main reasons explaining the large difference between the performance obtained on the Training Data and that obtained on the Test Data (about 18%). Other reasons might be:
• the test data has a completely different ratio of short vs. long queries than the training data;
• the test data has a completely different ratio of correct vs. incorrect queries than the training data.

Looking at the results, it seems obvious that a system returning the given queries unchanged (as the baseline) would surely have been one of the top winners.

One should also notice that the Microsoft recall measure favors systems returning a large number of spelling alternatives, in the hope that all the correct ones would be covered. This could be exploited by systems returning thousands of spelling alternatives, luckily obtaining a recall close to 100%. We consider such a high number of alternatives to be utterly irrelevant for the practical purposes of spell checking.

7. ACKNOWLEDGMENTS
This work has been supported by the Romanian Ministry of Education and Research through the STAR project (no. 742/19.01.2009).

8. REFERENCES
[1] Berger, A., Della Pietra, S., and Della Pietra, V. 1996. A maximum entropy approach to natural language processing. Computational Linguistics 22(1), 39–71.
[2] Brenes, D.J., and Gayo-Avello, D. 2009. Stratified analysis of AOL query log. Information Sciences 179, 1844–1858.
[3] Chen, S.F., and Goodman, J. 1999. An empirical study of smoothing techniques for language modeling. Computer Speech and Language 13, 359–394.
[4] Cucerzan, S., and Brill, E. 2004. Spelling correction as an iterative process that exploits the collective knowledge of web users. In EMNLP.
[5] Dietterich, T.G. 1998. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation 10(7), 1895–1924.
[6] Duan, H., and Hsu, B.-J. 2011. Online spelling correction for query completion. In Proceedings of the International World Wide Web Conference, WWW 2011, March 28–April 1, Hyderabad, India.
[7] Fellbaum, C. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.
[8] Hsu, B.-J., and Glass, J. 2008. Iterative language model estimation: efficient data structure & algorithms. In Proceedings of the 9th Annual Conference of the International Speech Communication Association (Interspeech), Brisbane, Australia, September 22–26.
[9] Internet usage statistics, the internet big picture, world internet users and population stats. http://www.internetworldstats.com/stats.htm, 23.06.2011.
[10] Martins, B., and Silva, M.J. Spelling correction for search engine queries. http://xldb.di.fc.ul.pt/xldb/publications/spellcheck.pdf.
[11] Michel, J.-B., Shen, Y.K., Aiden, A.P., Veres, A., Gray, M.K., The Google Books Team, Pickett, J.P., Hoiberg, D., Clancy, D., Norvig, P., Orwant, J., Pinker, S., Nowak, M.A., and Lieberman Aiden, E. 2010. Quantitative analysis of culture using millions of digitized books.
[12] Levenshtein, V.I. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10, 707.
[13] Naseem, T., and Hussain, S. 2007. Spelling error trends in Urdu. Language Resources and Evaluation 41(2), 117–128.
[14] Tufiş, D., Ion, R., Ceauşu, A., and Ştefănescu, D. 2006. Improved lexical alignment by combining multiple reified alignments. In Proceedings of the 11th EACL 2006, pp. 153–160, Trento, Italy.
[15] Viterbi, A.J. 1967. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory 13, 260–269.

Spelling Generation based on Edit Distance

Yoh Okuno
Yahoo Japan Corporation
9-7-1 Akasaka, Minato-ku, Tokyo, Japan
[email protected]

ABSTRACT
We propose a spelling generation algorithm based on edit distances. Our idea is to get high recall without reducing expected precision. Whereas general precision has a trade-off relation with general recall, there is theoretically no trade-off between expected precision and recall, the evaluation measures of the Microsoft Speller Challenge. We used a somewhat tricky approach: output candidates with very small or zero probability hurt expected precision very little. The candidates are generated based on edit distance from the input query, so that they can increase the recall. In this case, edit distance is the number of edit operations on a character: insert, delete, replace, and swap. Output size is limited to some number of bytes, because the bottleneck of latency in this strategy is the network.

1. INTRODUCTION
The Microsoft Speller Challenge is the first contest of spelling correction for search engine queries. We had to read the rules thoroughly at first. We noticed that there is some fuzziness around the rules that allows utilizing external APIs or even datasets. Expected F1 (EF1) is adopted as the evaluation measure1. It has two components: one is Expected Precision (EP) and the other is Expected Recall (ER). EP is defined as below:

    EP = (1/|Q|) Sum_{q in Q} Sum_{c in S(q)} IP(c, q) P(c|q)        (1)

where q is an input query in the test dataset Q, c is a candidate in the speller output set S(q) with posterior probability P(c|q), and IP(c, q) is the indicator function of whether c is in the correct set C(q) of query q. ER is defined as below:

    ER = (1/|Q|) Sum_{q in Q} Sum_{a in C(q)} IR(S(q), a) / |C(q)|   (2)

where a is an answer in the correct set C(q), and IR(S(q), a) is the indicator function of whether a is in the speller output S(q). Finally, EF1 is defined as a combination of EP and ER:

    EF1 = 2 / (1/EP + 1/ER)                                          (3)

Note that ER contains no information about probability, so it should simply be called recall. Now we can set probabilities to enhance EP without hurting ER.

1 http://spellerchallenge.com/Rules.aspx
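The paper mentions an in-house evaluation script; the sketch below computes EP, ER and EF1 as in equations (1)-(3), assuming the speller output is a list of (candidate, probability) pairs per query and the gold data is a set of correct variants per query. The data layout is an assumption for illustration.

def evaluate(test_queries, speller_output, gold):
    # speller_output[q] -> list of (candidate, probability) pairs S(q)
    # gold[q] -> set of correct spellings C(q)
    ep = er = 0.0
    for q in test_queries:
        out = speller_output[q]
        correct = gold[q]
        returned = {c for c, _ in out}
        ep += sum(p for c, p in out if c in correct)                    # expected precision term
        er += sum(1 for a in correct if a in returned) / len(correct)   # recall term
    ep /= len(test_queries)
    er /= len(test_queries)
    ef1 = 2.0 / (1.0 / ep + 1.0 / er) if ep > 0 and er > 0 else 0.0
    return ep, er, ef1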

2. TRIALS
At first, we submitted our speller to the contest's official evaluation system. But soon we noticed that the system doesn't show the actual components of EF1: EP and ER. We developed our own evaluation script, allowing us to know which of EP or ER is the bottleneck. We then tried the spellers below; Table 1 shows the results of the trials.

• Overfit speller: This speller achieves the top of the current leader board (May 26, 2011) with 1.0 Expected F1 and 275 latency. Of course, this works well only on the TREC dataset and doesn't work on the final Bing dataset, because the output candidates are exactly the same as the annotation of the TREC dataset.

• Noop speller: This speller actually does nothing and returns exactly the same query as the input. This speller achieved 0.91 EF1, 0.94 EP and 0.87 ER. This result implies that recall might be the bottleneck.

• Bing speller: We tried the Bing speller API2. Although this was expected to achieve a high score, its score was actually 0.91 EF1: lower than the Noop speller. This result reminded us that the annotation policy of the TREC dataset is quite different from that of a commercial search engine query speller. The TREC annotation policy prefers a variety of spellings, rather than one exact spelling.

2 http://www.bing.com/developers/

Table 1: Result of Trials
Speller             EP       ER       EF1
Overfit speller     1.0      1.0      1.0
Noop speller        0.9472   0.8773   0.9109
Bing speller        0.9463   0.8765   0.9101
Bing+Noop speller   0.9467   0.8788   0.9115
Final               0.9472   0.9018   0.9239

• Bing+Noop speller: This speller is a combination of the Noop speller and the Bing speller. The output is the input query itself and a query corrected by the Bing speller. Each candidate has the same probability, 0.5.

• Edit1 speller: Our first edit-distance-based speller with one edit operation.

• Edit2 speller: Our second edit-distance-based speller allowing two edit operations.

• Edit2+anagram speller: Two edit operations or one swap (anagram) operation is allowed in this speller.

• Time limit speller: This speller limits the CPU time in spelling generation.

• Line limit speller: This speller limits the number of output lines.

• Byte limit speller: This speller limits the number of bytes of output size.

3. FINAL SUBMISSION
We adopted an edit-distance-based approach to generate speller candidates. Edit distance means the number of edit operations as below:

• insert: insert a character at any place in the input query. Example: microsoft -> micronsoft
• delete: delete a character in the input query. Example: microsoft -> micosoft
• replace: replace a character in the input query with a different character. Example: microsoft -> microzoft
• swap: swap two characters in the input query between different positions. Example: microsoft -> micorsoft

We expanded the edit operations with the additional swap operation.
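A minimal sketch of generating all candidates at edit distance 1 under the four operations above. The alphabet (lowercase letters plus space) is an assumption; the submitted C++ implementation is not shown in the paper.

import string

def edit1_candidates(query, alphabet=string.ascii_lowercase + " "):
    # all candidates reachable by one insert, delete, replace, or swap
    splits = [(query[:i], query[i:]) for i in range(len(query) + 1)]
    inserts = {a + c + b for a, b in splits for c in alphabet}
    deletes = {a + b[1:] for a, b in splits if b}
    replaces = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    swaps = set()
    for i in range(len(query)):
        for j in range(i + 1, len(query)):       # swap any two positions, not just adjacent ones
            w = list(query)
            w[i], w[j] = w[j], w[i]
            swaps.add("".join(w))
    return inserts | deletes | replaces | swaps

# "microsoft" yields candidates such as "micronsoft" (insert), "micosoft" (delete),
# "microzoft" (replace) and "micorsoft" (swap), matching the examples above.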

Our final submission adopts the following strategy:

1. Output the input query itself with probability 1.
2. Output candidates which are at edit distance 1 from the input query.
3. Output candidates that are like an anagram of the input query, swapping two characters.
4. Output candidates which are at edit distance 2 from the input query.
5. If the total output size in bytes exceeds some constant, exit the program.

All but the first candidate can have very low or even zero probability. They might raise recall without reducing expected precision. The EP for the TREC dataset is 0.9472, equal to the Noop speller, when those probabilities are zero. Our speller might generate exactly the same candidate more than once, because two edit operations might restore the original query, or a swap might exchange identical characters.

We developed this algorithm as a PHP script at first, but the latency was too large. We then switched to C++ CGI scripts to improve the latency. Note that we don't need any data management. After some parameter tuning, we decided to adopt 50MB as the value of the size limit. Our speller achieved 0.9239 EF1 and 43,912 latency on the TREC dataset.

4. CONCLUSION
We developed a very simple spelling generation algorithm based on edit distance. This approach achieves high recall if latency is not limited. We apologize that our speller might have caused large responses, but we didn't intend to overflow the speller evaluation system of Microsoft. Although this approach is a bit tricky, it does not cheat the rules or the evaluation measures.

We considered a standard noisy channel model and the utilization of external APIs, but these might drop expected precision. Our simple approach depends on no dataset or external API, leading to its concise implementation. Although we have a trie implementation3 which supports fuzzy search, it needs sufficient query logs to train a language model.

Thank you for reading.

3 https://github.com/nokuno/stakk

A REST-based Online English Spelling Checker "Pythia"

Peter Nalyvayko
Analytical Graphics, Inc.
Exton, PA 19341, USA
(347) 387-3074
[email protected]

ABSTRACT
The paper describes the implementation of an online English spell checker that was developed for the Microsoft Spelling Challenge Contest. The spelling correction program proposes a list of plausible corrections ordered by probability. The probabilities are computed using the noisy channel model [6] trained using a list of common misspellings. The misspelling detection phase involves generating a key for the misspelled word and locating words in the dictionary whose keys collate closely with the key of the misspelled word [4]. The correction phase involves choosing plausible candidates from the list of candidates selected during the detection phase. The algorithm also tries to correct spelling errors caused by word boundary infractions. A modified Levenshtein distance is used to compute how similar the proposed correction and the misspelled words are. In addition to the above, the algorithm uses the Princeton WordNet synsets database to look up words missing from the similarity key dictionaries and for alternative spellings lookup. Finally, the algorithm relies on the Microsoft N-Gram service to evaluate the posterior and joint probabilities used to rank and choose the best spelling correction candidates.

Categories and Subject Descriptors
D.3.3 English Spelling Correction

General Terms
Algorithms, Experimentation

Keywords
Automatic Spelling Correction, Similarity Key, N-grams, WordNet, Lucene, Apache, LingPipe, Tomcat, NLP.

1. INTRODUCTION
The spell checker is a REST-based Web Service (further referred to as the program) implemented as a Java Servlet which runs in the Tomcat environment behind the Apache Web Server. The APIs used include LingPipe [12], Apache Lucene [13] and the MIT WordNet API [11]. The following is a sample of the output produced by the program:

Query                   Output
Strontium 90            Strontium 90              1.000000
californiastate         california state          1.00000
satelite photos earth   satellite photos earth    0.9601
coch retirement home    coach retirement home     0.71114322
                        couch retirement home     0.16817989
                        coch retirement home      0.12067688

The output produced by the program may contain multiple spelling suggestions, each one corresponding to a plausible spelling correction candidate chosen and ranked highly enough to be included in the result set. The corrections are listed in non-increasing order of their probability. The posterior probability of the correction is computed and returned along with the spelling correction candidate.

The overarching criterion at the core of the design of the spelling correction program was to avoid, as much as possible, using any language-specific information, in order to remain as flexible as possible. In this paper I describe an online spell checking service which attempts to address spell checking challenges by combining several different strategies to solve various spelling problems that arise due to a variety of possible types of misspellings.

2. Three Problems of Spelling Correction
Kukich, in her survey of existing spelling correction techniques [1], identified three distinctive problems: 1) non-word error detection; 2) isolated non-word error correction; 3) real-word, context-dependent word correction. While the first and second problems have been extensively studied and a number of various techniques have been developed (dictionary lookup, n-grams), the third problem poses a significant challenge [1].

3. Why spelling correction is difficult
The three problems of automatic spelling correction become even more exaggerated and complex when trying to auto-correct short input queries that carry almost no or very little contextual information, frequently do not follow proper grammar, and potentially contain a large number of acronyms or proper nouns. The input queries can also belong to a wide range of different topic categories, thus presenting a significant challenge to dictionary-based lookup techniques that expect the dictionaries to be topic specific.

1 Copyright is held by the author/owner(s).

4. Training Data and Similarity Key Dictionaries
The program was trained using publicly available corpora from the Gutenberg online project, which were used to generate spelling correction dictionaries. The program uses dictionaries whose entries are ordered alphabetically based on a similarity key generated for each dictionary entry, in order to choose a sub-set of plausible spelling correction candidates. I discuss the similarity key concept later in greater detail. The original concept of using a similarity key was first described by Joseph J. Pollock and Antonio Zamora in [4]. Along with each entry in the dictionary is also stored a frequency of the word associated with the entry. The frequencies are used to evaluate the prior probabilities of the spelling correction.

The process of training the spell checker consists of generating two dictionaries whose entries are ordered based on the omission and skeleton keys [4]. The spell checker uses both dictionaries during the non-word spelling correction phase to look up and select a range of plausible candidate corrections.

To further strengthen the detection and correction performance, the program also uses the Princeton WordNet synset database to look up rare words and find alternative spellings. The WordNet database is publicly available for download at http://wordnet.princeton.edu/wordnet/.

The noisy channel model was trained using a publicly available list of common misspellings. The misspellings were automatically parsed to generate four confusion matrices used to evaluate the likelihood of a misspelling [6].

5. Correcting Spelling Errors

5.1 Query Language Detection
In order to avoid spell checking queries written in a language other than the language that was used to train the spell checker (English in this case), the program uses a simple vector-space based classifier which, given an input query, returns a language class. Queries classified as non-English are not spell checked. The classifier maintains two sets of test vectors with the same dimensionality: we for English and ws for Spanish. The classifier maps the input query q = {t1, ..., tk} into a binary vector v = (v1, ..., vn) using the following rule: vi = 1 if the i-th element of the test vector occurs among the query tokens, and vi = 0 otherwise, where n is the number of elements in the test vector we and k is the number of tokens in the input query. The program computes the angular distance between the input vector and the test vectors:

    cos(theta) = (v . w) / (||v|| ||w||)

where v . w denotes the dot product of the input and the test vectors and ||v|| ||w|| is the product of the lengths of the input vector and the test vector. The equation is used to evaluate the angular distances between the input vector and the vectors representing Spanish and English. The language of the input query is chosen to be English if the angular distance between the input vector and the test vector representing English is smaller than the angular distance between the input vector and the vector representing Spanish.

5.2 Isolated Non-word Detection and Correction

5.2.1 Isolated non-word error detection
The first phase of the spelling correction program is to identify potentially misspelled words and suggest plausible spelling correction candidates. The program relies on the similarity key dictionaries, sorted by the omission and skeleton keys [4], to look up a term and return a list of possible corrections.

A few words need to be said about the omission and skeleton keys. The skeleton key is constructed by leaving the first letter of the word in its place, followed by the unique consonants in order of occurrence and then by the unique vowels in order of occurrence. The omission key is constructed as follows: the unique consonants come in the reverse order of their omission frequency (hence the name), followed by the unique vowels in the original order. The consonant omission frequency order is a result of the work done by Joseph J. Pollock and Antonio Zamora [4] and is as follows:

    RSTNLCHDPGMFBYWVZXQKJ

which basically means that "R" is omitted more often than "S", and so forth. So, for example, the misspelling "carribean" and its correct form "caribbean" will have the following skeleton and omission keys:

Word        Skeleton Key   Omission Key
carribean   CRBNAIE        BCNRAIE
caribbean   CRBNAIE        BCNRAIE

As is apparent from the above table, both the misspelled and the correct word have the same skeleton and omission keys. This is not always the case, as in the following example:

Word     Skeleton Key   Omission Key
adust    ADSTU          DTSAU
adjust   ADJSTU         JDTSAU
adult    ADLTU          DLTAU

Both "adjust" and "adult" are plausible corrections; however, a set of plausible candidates returned by the dictionary lookup using the omission key is unlikely to contain "adjust". Likewise, the results returned by the dictionary lookup using the skeleton key are also unlikely to contain the desired correct word, unless the search window has been made very large.

Based on the experimental findings, to overcome some of the shortcomings associated with the similarity key based dictionary lookups, the spelling error detection used by the program includes a trie-based lookup that utilizes a trie structure and the minimum edit distance algorithm to search for all possible corrections in a trie tree [10], whenever the similarity key dictionary lookup was unable to detect any plausible candidate corrections.
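A small sketch of the two key constructions described above. It follows the textual description (first letter, unique consonants, unique vowels, and the Pollock-Zamora omission-frequency order) and is an illustration rather than the program's Java implementation.

# consonant omission frequency order from Pollock and Zamora [4]
OMISSION_ORDER = "RSTNLCHDPGMFBYWVZXQKJ"

def _unique(chars):
    seen, out = set(), []
    for c in chars:
        if c not in seen:
            seen.add(c)
            out.append(c)
    return out

def skeleton_key(word):
    w = word.upper()
    first = w[0]
    consonants = [c for c in _unique(w) if c.isalpha() and c not in "AEIOU" and c != first]
    vowels = [c for c in _unique(w) if c in "AEIOU" and c != first]
    return first + "".join(consonants) + "".join(vowels)

def omission_key(word):
    w = word.upper()
    present = {c for c in w if c.isalpha() and c not in "AEIOU"}
    consonants = [c for c in reversed(OMISSION_ORDER) if c in present]
    vowels = _unique(c for c in w if c in "AEIOU")
    return "".join(consonants) + "".join(vowels)

# skeleton_key("carribean") == skeleton_key("caribbean") == "CRBNAIE" and
# omission_key("adjust") == "JDTSAU", matching the tables above.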

5.2.2 Isolated non-word error correction
Each plausible correction candidate discovered during the detection phase is compared against the misspelling using the Levenshtein distance with the following edit costs for each of the four possible corrections:

    Deletion        1.2
    Insertion       1.2
    Substitution    2.0
    Transposition   1.0

In addition, insertions that result in the same character being inserted more than once are discounted and cost 1/5 of the regular insertion cost. In other words, the distance between the misspelling "cost" and "cooost" is only 0.48.

Candidate corrections whose distance from the misspelled word exceeds some pre-defined threshold are discarded. The rest of the plausible candidate corrections are added to the list of plausible candidate corrections.

5.3 Scoring
During the isolated non-word error correction phase, the program evaluates the posterior probability of each plausible candidate correction using Bayes' equation:

    P(c|w) = P(c) P(w|c) / P(w)

where P(c) is the prior probability of the candidate, P(w|c) is the likelihood that the word w was mistyped given the intended word c, and P(w) is a normalizing constant.

The prior probability P(c) of the potential spelling correction is the word frequency counter divided by the total number of tokens in the training corpora.

The likelihood P(w|c) of a misspelling given a correct word is estimated based on the confusion matrices compiled during the training phase [6]:

    P(w|c) = del[x, y] / chars[x, y]   for a deletion,
             add[x, y] / chars[x]      for an insertion,
             sub[x, y] / chars[y]      for a substitution,
             rev[x, y] / chars[x, y]   for a transposition,

where del[x, y] is the number of times the characters xy were typed as x; add[x, y] is the number of times the character x was typed as xy; sub[x, y] is the number of times y was typed as x; and rev[x, y] is the number of times that xy was typed as yx. chars[x, y] and chars[x] are the number of times xy or x appeared in the training corpora.

The product of the posterior probabilities of each misspelling in the query string is the total posterior probability of the corrected sentence given the input query.

5.4 Word boundary infraction misspellings
When tokenizing the query, the assumption is that the word boundaries are defined by white space such as blanks, tabs, etc. However, based on the test data set there is a significant number of spelling errors caused by either split or run-on words. Kukich [1] found that a total of 15% of all non-word errors were due to either split or run-on words. The program attempts to correct the word boundary infractions and proposes a set of spelling candidates, which are then ranked based on the joint probability of the correction times the distance between the misspelling and the candidate correction. The Microsoft N-gram service is used to evaluate the correction candidates against the online n-gram corpus.

5.4.1 Spelling errors resulting in split words
To simplify the correction of spelling errors caused by the insertion of a blank space, the algorithm assumes that the split is limited to just two adjacent words. The split word error correction consists of two phases. During the first phase, the input sentence is analyzed for any potential split word errors. The algorithm iteratively removes blank spaces between adjacent words and adds the newly formed words to a list of potential correction candidates. Once all candidates are discovered, the algorithm moves to the next phase, during which the conditional probability of each candidate is evaluated. The candidates are then sorted in non-descending order based on their probabilities. The candidate with the highest posterior probability is chosen.

5.4.2 Spelling errors caused by run-on words
The program explicitly handles spelling errors caused by run-on words. First, a sub-set of all possible run-on word corrections is selected. To avoid combinatorial explosion, the sub-set is limited to the combinations consisting of no more than a pre-determined maximum number of blank space deletions. The total number of possible corrections is:

    f(n, k) = Sum_{i=0}^{k-1} C(n-1, i)

where C(n-1, i) is the binomial coefficient, n is the length of the incorrectly spelled word, k is the maximum number of blank space deletions, and f(n, k) is a function that returns the number of possible candidate corrections. For example, given the misspelled word "californiastate" and k = 4 (the maximum number of deleted blank spaces), evaluating the above equation gives f(n, k) = 470; in other words, there are 470 distinct spelling candidates that will be generated by the run-on word correction algorithm. The table below lists some of the candidates generated by the run-on correction algorithm:

    "c aliforniastate"
    "ca liforniastate"
    "ca li forniastate"
    "ca li fornia state"
    "c ali fornias tate"

To enumerate all possible run-on permutations, the program uses a memoization optimization technique from dynamic programming to improve the running time, decreasing the computational complexity by "memorizing" the results of evaluating sub-sequences.
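A compact sketch of the run-on enumeration with memoization described above: it inserts at most k-1 spaces into the run-on word, caching the expansions of suffixes so repeated sub-sequences are evaluated only once. The recursion shape and the use of functools.lru_cache are illustrative assumptions.

from functools import lru_cache

def runon_candidates(word, k):
    # all ways of restoring up to k-1 deleted blank spaces in `word`
    @lru_cache(maxsize=None)
    def expand(suffix, spaces_left):
        # candidate strings for this suffix; results are "memorized" by lru_cache
        results = [suffix]
        if spaces_left > 0:
            for i in range(1, len(suffix)):
                for rest in expand(suffix[i:], spaces_left - 1):
                    results.append(suffix[:i] + " " + rest)
        return tuple(results)
    return expand(word, k - 1)

# len(runon_candidates("californiastate", 4)) == 470, matching the worked example above.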

5.5 Alternative Spellings
There is a category of spelling corrections that does not really require a correction. Rather, during the last phase the spelling correction program tries to suggest alternative spellings by looking up words in the WordNet database. WordNet offers a rich, albeit limited to English, lexical network of meaningfully related words and concepts, and is a useful tool for natural language processing and computational linguistics [11].

For words that are present in the WordNet database, the program extracts the synsets and analyzes the lemmas that have the same meaning. The program computes the minimum edit distance between each alternative spelling and the original input and discards alternative spellings whose distance exceeds some pre-defined threshold. The remaining alternative spellings are added to the list of plausible corrections and ranked according to their conditional probabilities. For example, the program will detect that the word "center" has the alternative spelling "centre" and propose both as plausible spelling candidates.
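The paper does not say which WordNet client is used; the following sketch uses NLTK's WordNet interface purely as an illustration, with nltk.edit_distance standing in for the weighted distance and an arbitrary threshold.

import nltk
from nltk.corpus import wordnet as wn   # assumes the NLTK WordNet corpus data is installed

def alternative_spellings(word, max_distance=2):
    # Collect lemmas that share a synset with `word` and stay within a small edit distance of it.
    alternatives = set()
    for synset in wn.synsets(word):
        for lemma in synset.lemma_names():
            lemma = lemma.replace("_", " ")
            if lemma != word and nltk.edit_distance(word, lemma) <= max_distance:
                alternatives.add(lemma)
    return alternatives

print(alternative_spellings("center"))   # expected to include 'centre'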
5.6 Final Ranking
Once the error detection and spelling correction phases are done, the program generates all possible spelling correction combinations of the entire input sequence using the lists of plausible candidates produced by the previous phases of the algorithm. Each generated sequence is re-evaluated using the Microsoft N-Gram service by computing its joint probability. Each sentence is then re-ranked using its computed joint probability together with the scores computed during the error correction phases. The combined ranking is used to sort the plausible corrections in non-decreasing order. Corrections scoring below some pre-defined threshold are discarded, and the remaining corrections are normalized so that their probabilities sum to 1.
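A compact sketch of this step (illustrative, not the author's code): joint_probability stands in for the call to the Microsoft N-Gram service, and phase_score for the scores carried over from the earlier correction phases; both are hypothetical helpers, and multiplying the two scores is one possible way to combine them.

from itertools import product

def rank_corrections(candidate_lists, joint_probability, phase_score, threshold=0.0):
    scored = []
    for words in product(*candidate_lists):          # every combination of per-word candidates
        sentence = " ".join(words)
        score = joint_probability(sentence) * phase_score(words)
        if score >= threshold:                       # low-scoring corrections are discarded
            scored.append((score, sentence))
    total = sum(s for s, _ in scored) or 1.0
    scored = [(s / total, sent) for s, sent in scored]   # normalize so scores sum to 1
    scored.sort()                                        # non-decreasing order; best correction last
    return scored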

5.7 Evaluation
The accuracy of the spelling corrections was evaluated using the F-measure, the harmonic mean of precision and recall, F = 2PR / (P + R). The evaluation results below are based on sub-sets of the TREC dataset manually annotated by human experts:

Test Set        F-Measure       Precision       Recall
207 queries     0.9319834906    0.9439781203    0.9202898550
409 queries     0.9162777685    0.9317057259    0.9013524264
1507 queries    0.9185730419    0.9352770727    0.9024552090

5.8 Known Issues
5.8.1 Input Query Language Classification
The technique used to determine the language of the input query can be improved by mapping the query into a weighted vector whose elements represent some metric associated with each word, for example its overall frequency.

5.8.2 Training corpora
It is necessary to clean the training corpora to get rid of the intentionally misspelled words that some books contain (for example, "War and Peace" by Leo Tolstoy contains a relatively large amount of foreign and intentionally distorted text).

5.8.3 Word boundary infractions
When correcting run-on words, the program assumes that the correct and misspelled sequences differ only in the number of blank spaces; in other words, no corrections other than restoring deleted blank spaces are required. This constraint was imposed as a trade-off between precision and running time.

5.8.4 Real word error detection and correction
The real-word correction phase is not currently implemented, although there is an ongoing effort to add real-word correction to the algorithm. There are several possible directions, among which are text collocations [9] and neural nets [7].

5.8.5 Detection of Proper Nouns
The detection and correction of words that represent proper nouns remains an elusive task due to the ever-growing body of proper nouns, which includes not only names of places and people but also names of web sites, companies, etc. No dictionary lookup technique can possibly account for every variety out there. Therefore, the current version of the program uses a rule-based, pattern-matching approach that tries to detect whether the input query represents the name of a web site using known patterns.

6. ACKNOWLEDGMENTS
Many thanks to the Microsoft Research team for making the N-Gram service available to facilitate the Speller Challenge contest.

7. REFERENCES
[1] Kukich, K. 1992. Techniques for Automatically Correcting Words in Text. ACM Computing Surveys, Vol. 24, No. 4 (Dec. 1992). Bellcore, 445 South Street, Morristown, NJ 07962-1910.
[2] Jurafsky, D., Martin, J. 2009. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. 2nd edition, Prentice Hall.
[3] Damerau, F. 1964. A Technique for Computer Detection and Correction of Spelling Errors. Communications of the ACM, Vol. 7, No. 3 (Mar. 1964). IBM Corporation, Yorktown Heights, NY.
[4] Pollock, J., Zamora, A. 1984. Automatic Spelling Correction in Scientific and Scholarly Text. Communications of the ACM, Vol. 27, No. 4 (Apr. 1984).
[5] Mitton, R. 1996. English Spelling and the Computer. Harlow, Essex: Longman Group.
[6] Kernighan, M., Church, K., Gale, W. 1990. A Spelling Correction Program Based on a Noisy Channel Model. AT&T Laboratories, 600 Mountain Ave., Murray Hill, NJ.
[7] Hodge, V. J., Austin, J. 2002. A comparison of a novel neural spell checker and standard spell checking algorithms. Pattern Recognition, 35 (11), pp. 2571-2580.
[8] Cavnar, W. B., Trenkle, J. M. 1994. N-Gram-Based Text Categorization.
[9] Smadja, F. 1993. Retrieving Collocations from Text: Xtract. Computational Linguistics, Vol. 19, No. 1.
[10] Fredkin, E. 1960. Trie Memory. Communications of the ACM, 3, 9 (Sept.), 490-500.
[11] WordNet, A Lexical Database for English. Princeton University. http://wordnet.princeton.edu/
[12] LingPipe Toolkit, http://alias-i.com/lingpipe/
[13] Apache Lucene Full-Text Search Engine Library, http://lucene.apache.org/
