Multi-Cultural Name Recognition and Data Quality
MIT Information Quality Industry Symposium, July 16-17, 2008
IBM Information On Demand

What's In a Name? Multi-cultural Name Recognition and Data Quality
Presenter: Mala Narasimharajan, Product Marketing Manager, IBM Corporation
© 2008 IBM Corporation

Objectives
• What's in a name
• Why names are complex
• Why name recognition is needed
• The role of name-based information in data quality

What's In a Name?
• Names are everywhere.
• There are multiple variations of Andrew, Manuel, John, and Juergen: different ways of spelling the same name.
• Every time someone applies for a checking account or savings account, transfers money, cashes a check, boards a flight, or applies for credit, these names must be looked up.
Types of variation:
• Nicknames: Drew
• Name order (Rev): Hussein, Mohammed Abu Ali
• Shortened names: Andy
• Prefixes: Abdul, Fitz, O', De La
• Titles: Dr, Haj, Sri, Col
• Phonetics: Worchester, Wooster, "Worcester"

Spellings of Andrew: Andraz, Andrewes, Andrews, Andrey, Andrezj, Andrian, Andriel, Andros, Andru, Andruw, Andrzej, Andy, Drew, Drue, Ohndrae, Ohndre, Ondre, Ondrei, Ondrej

Chaikovsky with a "T"
Names of people, places, and businesses:
– There are no dictionaries for them.
– There is no way to look up a name and say that it is wrong or right.
– We run into very, very intractable problems [in transliteration from] other writing systems.
So if somebody is looking for a name that comes from Cyrillic, for example Tchaikovsky with a "T" in front of it, matching will be very difficult, because this is a French, not Russian, transcription of his name.
And that's just the surname, never mind the multiple variations of the first name:
• Surname: Tchaikovsky, Chaikovsky
• First names: Piotr Illich, Pyotr Ilich, Peter Ilich

The Same Name across the Arabic World
[Figure: the same name as rendered across the Arabic world]

Example: The Same Name across SE Asia
[Figure: map of China, Taiwan, and Southeast Asia (Myanmar (Burma), Laos, Hong Kong, Macau, Philippines, Thailand, Cambodia, Vietnam, Malaysia, Singapore, Indonesia) labeled with regional renderings of one name: Zhang Qiusu; Chang Ch'iu-Su; Chiusu Sae Chang; Cheung Yau So]

Simple Name Recognition - A Critical Challenge For Australia
"In the Australians' case they have welcomed 123,424 new immigrants in 2004-2005, the highest number in more than 15 years"
The Problem:
• Many have no proof of identity
• Many have disposed of all personal papers en route to Australia
• It is not uncommon for them to 'change identity' either during the journey or processing, in the hope that it may be easier to stay if they claim a different nationality
The Question:
• This raises legitimate questions about their intentions

Sure You Can Standardize an Address or Phone Number, But Names Have Always Been a Challenge
Address and phone number standardization:
• "4737 Simeron Drive, Easton, MA 02334" is corrected to "4737 Cimarron Drive, Easton, MA 02334"
• ")978)36 5-5312" is corrected to "(978) 365-5312"
Name standardization?
• Teddd Kennedy, Edl Kennedy, xEd Kennedy, Ed Kennedy: Theodore? Edward?
• Kim June Joe, Kimie Spacek: Kimberly?
Why names are harder:
• No single "dictionary" of "right" spellings
• No one-to-one correspondence between nicknames and names
• Poor data quality is common
• Cultural syntax variations are problematic

Data Collection Forms Create Name Parsing Errors
Name formats:
• Last Name _____________ First Name _____________ Title _____________
• Name _____________
• Middle Name _____________ Last Name _____________ First Name _____________
• Family Name _____________ Given Name _____________ Middle Initial _____________

Common Problems with Names
Database problems: the same name is stored as Name, Name Variant 1, Name Variant 2, and Name Variant 3.
Standard matching approaches:
• Exact match
• Soundex (1918)
• NYSIIS (1963)
• "Home-grown"
The same name is represented differently across corporate databases, and standard approaches to matching them are often inadequate.
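
To make the limits of the standard approaches concrete, the following is a minimal sketch of Soundex, assuming the commonly documented American Soundex rules (first letter kept, consonants mapped to digits, H and W ignored between like-coded letters). It is an illustration only and is not taken from the slides or from any IBM product. The example names come from the slides above.

```python
def soundex(name: str) -> str:
    """Return the 4-character American Soundex code for a name."""
    # Map each coded consonant to its Soundex digit.
    codes = {}
    for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in group:
            codes[ch] = digit

    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""

    result = letters[0]                 # the first letter is always kept
    prev = codes.get(letters[0], "")    # code of the last significant letter
    for ch in letters[1:]:
        if ch in "HW":
            continue                    # H and W do not separate equal codes
        digit = codes.get(ch, "")       # vowels (and Y) have no code
        if digit and digit != prev:
            result += digit
        prev = digit                    # a vowel resets prev to ""
    return (result + "000")[:4]         # pad or truncate to 4 characters


if __name__ == "__main__":
    for n in ("Andrew", "Andrzej", "Ondrej", "Tchaikovsky", "Chaikovsky"):
        print(n, soundex(n))
```

Because Soundex always keeps the first letter, Andrew and Andrzej share a code while Ondrej does not, and Tchaikovsky and Chaikovsky receive different codes even though they are transcriptions of the same surname.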
Why Do You Need Name Recognition?
• Ability to recognize multi-cultural names from around the world, and to provide insight, analysis, and matching capabilities
• For organizations where names, personal or business, constitute a vital data element
• Dealing with names is an essential part of data quality
• Applications and solutions that rely on identity rely on names
• Names are important to data quality because they enhance the identification, merging, standardization, and enrichment steps
• Many organizations must be able to recognize and match names (e.g., banks, insurance companies, law enforcement, airlines)
• Global business interactions demand the ability to process multi-cultural names with greater accuracy

Costs of Dirty Data
• Scrap and rework
• Increased costs
• Lack of consumer confidence
• Lost opportunities
• 83% of data integration projects either overrun or fail
• Inaccurate or incomplete data is a leading cause of failure in business-intelligence and CRM projects
• Low data quality costs companies $611 billion annually
• 25% of time is spent clarifying bad data
• Undetected defects will cost 10 to 100 times as much to fix upstream

Typical Data Quality Process
• Standardization
• Verification (Reference Data)
• Identify Matches & Duplicates
• Manage Matches
• Enrich Data

Typical Data Quality Process – enhanced by Name Recognition
• Name Standardization
• Verification (Reference Data)
• Identify Matches & Duplicates
• Manage Name Equivalencies
• Enrichment based on Name Data
Capabilities of Global Name Recognition
How does this apply to Data Quality?
• Transliterate: incoming name in non-Roman native script
• Parse: analysis and remediation of name data
• Classify: identifies culture and the proper search techniques
• Genderize: identifies the most likely gender of a name
• Name Variations: ranked name variations used as query terms for search
• Search: ranked list of potential matches from one or more data sources
Data quality steps supported: Standardization, Enrichment

Conclusions
• Data quality and name recognition are complementary to each other
• Today there are strong integration synergies between Global Name Recognition and QualityStage
• Global Name Recognition abilities significantly enhance data quality efforts by:
– Providing enhanced insight, matching, and standardization around names from around the world
– Providing better match rates and higher precision and recall
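
For readers less familiar with the terms precision and recall used in the conclusions, the sketch below shows how they are conventionally computed for a name-matching run. The function name and the record-ID pairs are hypothetical and chosen only for illustration; they do not come from the presentation.

```python
def precision_recall(predicted_pairs, true_pairs):
    """Compute (precision, recall) for a set of predicted record matches.

    precision = correct predicted matches / all predicted matches
    recall    = correct predicted matches / all true matches
    """
    predicted = set(predicted_pairs)
    actual = set(true_pairs)
    correct = len(predicted & actual)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(actual) if actual else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical record-ID pairs: what a matcher reported vs. the known truth.
    reported = {(1, 2), (3, 4), (5, 6), (7, 8)}
    truth = {(1, 2), (3, 4), (9, 10)}
    p, r = precision_recall(reported, truth)
    print(f"precision={p:.2f} recall={r:.2f}")
```

In this made-up run, two of the four reported matches are correct (precision 0.50) and two of the three true matches are found (recall about 0.67). A matcher that is too strict tends to lose recall; one that is too loose tends to lose precision.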