Metasoundex Phonetic Matching for English and Spanish
Total Page:16
File Type:pdf, Size:1020Kb
Global Journal of Enterprise Information System DOI: 10.18311/gjeis/2018/19822 MetaSoundex Phonetic Matching for English and Spanish K. Koneru* and C. Varol Department of Computer Science, Sam Houston State University, TX 77341, USA; [email protected], [email protected] Abstract Researchers confront major problems while searching for various kinds of data in large imprecise databases, as they are not spelled - gle, pronunciation of words was considered as one of the practices to solve the problem effectively. The technique used to acquire correctly or in the way they were expected to be spelled. As a result, they cannot find the word they sought. Over the years of strug Metaphone, Caverphone, DMetaphone, Phonex etc., are also used for information retrieval in different environments. The main con- words based on sounds is known as “Phonetic Matching”. Soundex was the first algorithm developed and other algorithms such as and Spanish languages. The newly developed MetaSoundex algorithm addresses the limitations of well-known phonetic matching tribution of this paper is to analyze and implement the newly proposed MetaSoundex algorithm for fixing ill-defined data in English Soundex and Metaphone algorithms and has higher precision compared to Soundex, thus reducing the noise in the considered arena. techniques, Metaphone and Soundex. Specifically, the new algorithm provided results that are more accurate compared to both Keywords: Information Retrieval, Metaphone, Metasoundex, Misspelled Words, Phonetic Matching, Soundex Paper Code (DOI): 19822; Originality Test Ratio: 19%; Submission Online: 23-Jan-2018; Manuscript Accepted: 30-Jan-2018; Originality Check: 03-Feb-2018; Peer Reviewers Comment: 13-Feb-2018; Double Blind Reviewers Comment: 20-Mar-2018; Author Revert: 24-Mar-2018; Camera-Ready-Copy: 27-Mar-2018; Editorial Board Excerpt: 30-Mar-2018. Editorial Board Excerpt: Initially at the Time of Submission (ToS) submitted paper had a 19% plagiarism which is an accepted percent- age for publication. The editorial board is of an observation that paper had a subsequent surveillance by the blind reviewer’s which at a later stages had been rectified and amended by an authors ( Koneru and Varol ) in various phases as and when required to do so. The reviewer’s had in an initial stages comment with minor revision with a following remark which at a short span restructured by an authors. The comments related to this manuscript is extremely noticeable both subject-wise and research wise by the reviewers during evaluation and further at blind review process too. All the comments had been shared at a variety of dates by the authors’ in due course of time and same had been integrated by the author in addition. By and large all the editorial and reviewer’s comments had been incorporated in this very paper at the end and further the paper had been earmarked and decided under “Empirical Research Paper” category as its highlights and emphasize the pure and first hand information in relation to MetaSoundex Phonetic Matching for English and Spanish. 1. Introduction Soundex was the naive algorithm proposed and other algo- rithms like Metaphone, Caverphone, DMetaphone, Phonex etc., Information deterioration is an intensive problem for organi- are also used for retrieving nearest matches for misspelled data. zations in the present era. With the increase in the amount of As per the research15, it was clearly observed that there is no con- information saved day by day, there is a desperate need for locat- crete technique for retrieving nearest matches. Soundex has high ing the mistyped data. Organizations are facing great challenge accuracy than other algorithms but has huge overhead due to its to maintain the quality of data in information systems with vari- high false positives. Metaphone has high efficiency in spite of its ous sources of data damage. Whenever the data is assimilated low accuracy, due to its low overhead. Hence, this paper mainly from multiple sources, it is a challenge to recognize the duplicate involves the proposal, implementation, and analysis of a hybrid information due to the existence of misspelled data for the same algorithm, MetaSoundex. It is observed that MetaSoundex has an record. As a result, the information of organization always ends accuracy of 84.5% for a real-life dataset, which is improved over up at risk. To address these challenges, techniques such as string Soundex (80%) and Metaphone (58%). matching, phonetic matching, and data linkage have been used. The rest of this paper is structured as follows. Differences Apart from other techniques that depend on variations in letters, between string matching and phonetic matching are discussed phonetic matching is mainly contingent on variations in sound in next section. Section three describes background of pho- to identify the misspelled data. As a result, the misspelled data netic matching algorithms. Section four explains in detail about from multilingual sources can also be identified using phonetic Soundex, Metaphone, and MetaSoundex, which is the initial matching. contribution of this paper. Section five defines the evaluation *Author for correspondence MetaSoundex Phonetic Matching for English and Spanish Table 1. String matching vs phonetic matching26,29 String Matching Phonetic Matching Matching Matches data based on patterns of substrings Matches data based on the similar pronunciations Involves Addition, Deletion or Substitution of Letters Conversion of data to phonetic patterns Applications Applied in Search Engines, Bio-Informatics, Used in name retrieval in enquiry lines, record linkage and fraud spell checkers, digital forensics etc. detection. Gaining its importance in spell checkers and; search engines. Prominence Mainly used for matching names and nouns Can be used in multi-lingual environment, where diversities in from English Language pronunciation or writing styles may be present. The study in this paper particularly focuses on phonetic matching because: a. It is not explored as much as string matching and still relies on old techniques. b. Of increase in the voice-to-text translation applications, where phonetic matching plays a crucial role. metrics and describes the experimental setup used in this study. 3.1 Evolution of Phonetic Matching The main contribution of this paper is presented in section six in Algorithms which the results of newly proposed MetaSoundex are analyzed and compared with the existing algorithms. Finally, this paper is The evolution of phonetic matching has come into frame when 5 concluded and future work is pointed out in section seven. there is a hardship in the retrieval of information . The main goal of phonetic matching algorithms is to encode homophones to the same representation so that they can be matched despite of minor 2. String Matching vs Phonetic differences in spelling1,10. The technique of obtaining words using Matching sounds was used in the US census since the late 1890’s, but a con- crete solution to this was first proposed and patented by Robert Phonetic matching is one of the important techniques that C. Russell in 1912 as Soundex algorithm27. The background of plays a major role in variety of fields such as digital investiga- various phonetic matching algorithms is discussed. tion involving voice memos and voice mails, transcriptions, and voice apps such as ‘Siri’, ‘Google’ voice etc.3 to provide suggestions 3.1.1 Soundex for misspelled words. Phonetic comparison meticulously identi- The earliest algorithm in the literature is Soundex developed by fies the words that are most likely to sound similar. It obtains Odell and Robert C. Russell in 1912, which generates a four-digit the quantitative analysis of pronunciations32 between speech code retaining its first letter. The authors patented the algorithm forms and spellings of words, whereas, string matching mainly in 191822. The process mainly encodes consonants while a vowel involves insertion, deletion, and substitution of letters to find the is not encoded unless it is the first letter. Arguably, Soundex is near matches26,29. Table 1 describes the most common differences one of the most widely known of all phonetic algorithms. It is between string matching and phonetic matching. used as a standard feature in applications like mySQL, oracle, etc. Because of the few disadvantages like dependency on the 3. Phonetic Matching Algorithms first letter, failure of detection of silent consonants, limit to the four characters of encoding, and high overhead in the retrieved Information retrieval is one of the major viewpoints of data mining matches, Soundex can only be used in applications where high 28 application areas . However, the information may not be consist- false positives and false negatives can be tolerated27. ent over the considered arena due to various causes. The different sources of variations can be spelling variations (typographical 3.1.2 Beider-Morse Phonetic Matching (BMPM) errors, substituted letters or by addition or omission of letters), phonetic variations (discrepancies in phonetic structure of words), Beider and Morse implemented an improvement to Soundex to double names or double first names (names having more than one reduce the number of false positives and false negatives, known as word), change of name27 (individual undergoes change of name). Beider-Morse Phonetic Matching (BMPM). Beider, et al, has also Of the different criteria mentioned above, the research in mentioned that the algorithm is extended to languages other than phonetic variations led to the development of phonetic matching English, with the application of some generic rules to obtain the 5 algorithms, which obtains worthwhile approximate matches to phonetic codes . Varol, et al, discussed BMPM as a hybrid tech- the misspelled words. nique with a 6-letter encoded code in which the percentage of 2 Vol 10 | Issue 1 | January-March 2018 | www.informaticsjournals.com/index.php/gjeis GJEIS | Print ISSN: 0975-153X | Online ISSN: 0975-1432 K. Koneru and C. Varol Empirical Research Paper irrelevant matches can be abated by 70%33. A set of tables repre- of the word.