MIT Information Quality Industry Symposium, July 16-17, 2008
Information On Demand IBM The MIT 2008 Information Quality Industry Symposium
What’s In a Name? Multi-cultural Name Recognition and Data Quality
Presenter: Mala Narasimharajan, Product Marketing Manager, IBM Corporation
© 2008 IBM Corporation
Objectives
• What’s in a name
• Why names are complex
• Why name recognition is needed
• Role of name-based information in data quality
What’s In A Name?
• Names are everywhere
• There are multiple variations of Andrew, Manuel, John, Juergen: different ways of spelling the same name
• And every time someone applies for a checking account, savings account, transfers money, cashes a check, boards a flight, or applies for credit, these names must be looked up!
Variants of “Andrew” shown on the slide: Andraz, Andrewes, Andrews, Andrey, Andrezj, Andrzej, Andrian, Andriel, Andros, Andru, Andruw, Andy, Drew, Drue, Ohndrae, Ohndre, Ondre, Ondrei, Ondrej
Sources of name variation:
– Nicknames: Drew, Drue
– Shortened names: Andy
– Name (rev.) order: Hussein, Mohammed; Abu Ali
– Prefixes: Abdul, Fitz, O', De La
– Titles: Dr, Haj, Sri., Col
– Phonetics: Worchester, Wooster, “Worcester”
Chaikovsky with a “T”
Names of people, places, and businesses:
– There are no dictionaries for them
– There’s no way to look up a name and say this is wrong or right
– We run into very, very intractable problems [in transliteration from] other writing systems
So if somebody’s looking for a name coming from the Cyrillic (for example, Tchaikovsky with a ‘T’ in front of it), matching will be very difficult, as this is a French, not Russian, transcription of his name.
And that’s just the surname, never mind the multiple variations of the first name:
– Surname: Tchaikovsky, Chaikovsky
– First name and patronymic: Piotr Illich, Pyotr Ilich, Peter Ilich
The Same Name across the Arabic World
Example: The Same Name across SE Asia
[Figure: map of East and Southeast Asia (China, Taiwan, Hong Kong, Macau, Myanmar (Burma), Laos, Thailand, Cambodia, Vietnam, Philippines, Malaysia, Singapore, Indonesia) showing regional spellings of one name: Zhang Qiusu; Chang Ch’iu-Su; Chiusu Sae Chang; Cheung Yau So]
Simple Name Recognition - A Critical Challenge For Australia
“In the Australians’ case they have welcomed 123,424 new immigrants in 2004-2005, the highest number in more than 15 years”
The Problem:
• Many have no proof of identity
• Many have disposed of all personal papers en route to Australia
• It is not uncommon for them to 'change identity' either during the journey or processing, in the hope that it may be easier to stay if they claim a different nationality
The Question:
• This raises legitimate questions about their intentions
Sure, You Can Standardize An Address Or Phone Number, But Names Have Always Been A Challenge
Address & Phone Number Standardization:
– Input: 4737 Simeron Drive, Easton, MA 02334, )978)36 5-5312
– Standardized: 4737 Cimarron Drive, Easton, MA 02334, (978) 365-5312
Name Standardization?
– Input: Teddd Kennedy, Edl Kennedy, xEd Kennedy, Ed Kennedy. Standardized: Theodore? Edward?
– Input: Kim June Joe, Kimie Spacek. Standardized: Kimberly?
• No single “dictionary” of “right” spellings
• No one-to-one correspondence between nicknames and names
• Poor data quality is common
• Cultural syntax variations are problematic
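The point about nicknames having no one-to-one correspondence with formal names can be illustrated with a small sketch. The lookup table below is hypothetical and tiny (it is not taken from any IBM product or reference list), and the function names are assumptions made for illustration only.

```python
# Hypothetical nickname table: one nickname can map to several formal names.
NICKNAMES = {
    "ted": {"theodore", "edward"},
    "ed": {"edward", "edwin", "edmund"},
    "kim": {"kimberly"},
}

def candidate_formal_names(given_name):
    """Return every formal name the input could stand for, plus the input itself."""
    key = given_name.strip().lower()
    return NICKNAMES.get(key, set()) | {key}

def may_match(name_a, name_b):
    """Two given names may refer to the same person if their candidate sets overlap."""
    return bool(candidate_formal_names(name_a) & candidate_formal_names(name_b))

print(may_match("Ted", "Ed"))        # both could be Edward
print(may_match("Ted", "Theodore"))  # Ted could be Theodore
print(may_match("Ed", "Theodore"))   # no shared candidate
```

Because "Ted" overlaps with both "Ed" and "Theodore" while "Ed" and "Theodore" do not overlap with each other, rewriting every nickname to a single "standard" form would either lose matches or create false ones. Keeping a set of candidates is one way to avoid committing too early.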
Data Collection Forms Create Name Parsing Errors
Name Formats
• Last Name ______ First Name ______ Title ______
• Name ______
• Middle Name ______ Last Name ______ First Name ______
• Family Name ______ Given Name ______ Middle Initial ______
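A minimal sketch of why a single free-text "Name" field invites parsing errors: the same two-token string yields different family names depending on which ordering convention the parser assumes. The `ParsedName` type, the `parse_two_part_name` function, and the `order` labels are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ParsedName:
    given: str
    family: str

def parse_two_part_name(raw, order):
    """Split a two-token name using an assumed field order.

    order is "given_family" (e.g. "Ed Kennedy") or "family_given"
    (e.g. "Zhang Qiusu"). Both labels are illustrative assumptions.
    """
    tokens = raw.split()
    if len(tokens) != 2:
        raise ValueError("this sketch only handles two-token names")
    first, second = tokens
    if order == "given_family":
        return ParsedName(given=first, family=second)
    if order == "family_given":
        return ParsedName(given=second, family=first)
    raise ValueError(f"unknown order: {order}")

# The same input, parsed under two different conventions.
western_guess = parse_two_part_name("Zhang Qiusu", "given_family")
chinese_guess = parse_two_part_name("Zhang Qiusu", "family_given")
print(western_guess.family, chinese_guess.family)
```

A form that labels its fields explicitly (Family Name, Given Name) removes this ambiguity at capture time. A form with one unlabeled field pushes the guess onto whoever parses the data later.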
Common Problems with Names
Database problems: the same name appears as Name Variant 1, Name Variant 2, and Name Variant 3
Standard matching approaches:
• Exact Match
• Soundex (1918)
• NYSIIS (1963)
• “Home-Grown”
The same name is represented differently across corporate databases, and standard approaches to matching them are often inadequate.
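To make the inadequacy concrete, here is a minimal sketch of American Soundex, assuming the common variant in which H and W do not break a run of identical codes. It is an illustration, not the implementation used by any particular database product. Because Soundex always keeps the first letter, spelling variants that differ at the start of the name cannot receive the same code.

```python
# Letter-to-digit table for American Soundex.
CODES = {}
for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                       "4": "L", "5": "MN", "6": "R"}.items():
    for ch in letters:
        CODES[ch] = digit

def soundex(name):
    """Return the 4-character Soundex code of an ASCII name."""
    letters = [c for c in name.upper() if c.isalpha() and c.isascii()]
    if not letters:
        return ""
    result = letters[0]                 # the first letter is kept as-is
    prev = CODES.get(letters[0], "")    # its code still suppresses a repeat
    for ch in letters[1:]:
        if ch in "HW":
            continue                    # H and W do not separate equal codes
        code = CODES.get(ch, "")        # vowels and Y have no code
        if code and code != prev:
            result += code
            if len(result) == 4:
                break
        prev = code
    return result.ljust(4, "0")

print(soundex("Tchaikovsky"), soundex("Chaikovsky"))  # T221 C212
print(soundex("Worcester"), soundex("Wooster"))       # W622 W236
```

Both pairs are the "same" name to a human reader, yet neither pair shares a Soundex code, so a Soundex-keyed lookup would miss them.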
Why do you need Name Recognition?
• Ability to recognize multi-cultural names from around the world, and to provide insight, analysis and matching capabilities
• For organizations where names, personal or business, constitute a vital data element
• Dealing with names is an essential part of data quality
• Applications and solutions that rely on identity rely on names
• Names are important to data quality as they enhance the identification, merging, standardization and enrichment steps
• Many organizations must be able to recognize and match names (e.g., banks, insurance companies, law enforcement, airlines)
• Global business interactions demand the ability to process multi-cultural names with greater accuracy
Costs of Dirty Data
• Scrap and rework
• Increased costs
• Lack of consumer confidence
• Lost opportunities
• 83% of data integration projects either overrun or fail
• Inaccurate or incomplete data is a leading cause of failure in business-intelligence and CRM projects
• Low data quality costs companies $611 billion annually
• 25% of time is spent clarifying bad data
• Undetected defects will cost 10 to 100 times as much to fix upstream
Typical Data Quality Process
• Standardization
• Verification (against reference data)
• Identify Matches & Duplicates
• Manage Matches
• Enrich Data
Typical Data Quality Process – enhanced by Name Recognition
• Name Standardization
• Verification (against reference data)
• Identify Matches & Duplicates
• Manage Name Equivalencies
• Enrichment based on Name Data
Capabilities of Global Name Recognition
How does this apply to Data Quality?
• Transliterate: incoming name in non-Roman native script
• Parse: analysis and remediation of name data
• Classify: identifies culture and the proper search techniques
• Genderize: identifies the most likely gender of a name
• Name Variations: ranked name variations used as query terms for search
• Search: ranked list of potential matches from one or more data sources
Data quality steps supported: Standardization, Enrichment
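The last two capabilities (name variations used as query terms, and a ranked list of potential matches) can be sketched generically. This is not how Global Name Recognition works internally. It is a stand-in that scores candidates with Python's standard-library `difflib.SequenceMatcher`, and the function names, threshold, and sample records are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive similarity in [0, 1] from difflib's ratio()."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def ranked_matches(query_variants, records, threshold=0.8):
    """Score each record by its best similarity to any query variant.

    Returns (record, score) pairs at or above the threshold, best first.
    The 0.8 threshold is an arbitrary illustrative choice.
    """
    scored = []
    for record in records:
        best = max(similarity(variant, record) for variant in query_variants)
        if best >= threshold:
            scored.append((best, record))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(record, round(score, 2)) for score, record in scored]

# Hypothetical query variants and data-source records.
variants = ["Tchaikovsky", "Chaikovsky"]
records = ["Chaikovskiy", "Tchaikovsky", "Kennedy"]
results = ranked_matches(variants, records)
for record, score in results:
    print(record, score)
```

Searching with several variants at once lets a record match whichever spelling it is closest to, which is the role the slide assigns to the Name Variations step. A plain edit-based score knows nothing about culture or transliteration rules, so it is only a rough proxy for the Classify and Transliterate steps.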
Conclusions
• Data quality and name recognition are complementary
• Today, there are strong integration synergies between Global Name Recognition and QualityStage
• Global Name Recognition abilities significantly enhance data quality efforts by:
– Providing enhanced insight, matching and standardization for names from around the world
– Providing better match rates and higher precision and recall