MIT Information Quality Industry Symposium, July 16-17, 2008

Information On Demand IBM The MIT 2008 Information Quality Industry Symposium

What’s In a Name? Multi-cultural Name Recognition and Data Quality

Presenter: Mala Narasimharajan, Product Marketing Manager, IBM Corporation

© 2008 IBM Corporation

Objectives

• What’s in a name
• Why names are complex
• Why name recognition is needed
• Role of name-based information in data quality

What’s In A Name?

• Names are everywhere
• There are multiple variations of names such as Manual, John, Jeurgen: different ways of spelling the same name
• And every time someone applies for a checking account or savings account, transfers money, cashes a check, boards a flight, or applies for credit, these names must be looked up!

• Nicknames: Drew
• Name order (Rev.): Hussein, Mohammed
• Shortened names: Andy, Abu Ali
• Prefixes: Abdul, Fitz, O', De La
• Titles: Dr, Haj, Sri., Col
• Phonetics: Worchester, Wooster, “Worcester”

Variants of one name: Andraz, Andrewes, Andrews, Andrey, Andrezj, Andrian, Andriel, Andros, Andru, Andruw, Andy, Drew, Drue, Ohndrae, Ohndre, Ondre, Ondrei, Ondrej

Chaikovsky with a “T”

Names of people, places, and businesses
– There are no dictionaries for them
– There’s no way to look up a name and say this is wrong or right
– We run into very, very intractable problems [in transliteration from] other writing systems

So if somebody’s looking for a name coming from the Cyrillic, for example Tchaikovsky with a ‘T’ in front of it, matching will be very difficult, as this is a French, not Russian, transcription of his name.

And that’s just the surname, never mind the multiple variations of the first name:
• Surname: Tchaikovsky, Chaikovsky
• First name: Piotr Illich, Pyotr Ilich, Peter Ilich

The Same Name across the Arabic World

Example: The Same Name across SE Asia

[Map: one name as written across the region. Variants shown: Zhang Qiusu; Chang Ch’iu-Su; Chiusu Sae Chang; Cheung Yau So. Places labeled: China, Taiwan, Hong Kong, Macau, Myanmar (Burma), Laos, Thailand, Cambodia, Vietnam, Philippines, Malaysia, Singapore, Indonesia]

Simple Name Recognition - A Critical Challenge For Australia

“In the Australians’ case they have welcomed 123,424 new immigrants in 2004-2005, the highest number in more than 15 years”

The Problem:
• Many have no proof of identity
• Many have disposed of all personal papers en route to Australia
• It is not uncommon for them to 'change identity' either during the journey or processing, in the hope that it may be easier to stay if they claim a different nationality

The Question:
• This raises legitimate questions about their intentions

Sure You Can Standardize An Address Or Phone Number, But Names Have Always Been A Challenge

Address & Phone Number Standardization
• As entered: 4737 Simeron Drive, Easton, MA 02334, )978)36 5-5312
• Standardized: 4737 Cimarron Drive, Easton, MA 02334, (978) 365-5312

Name Standardization?
• As entered: Teddd Kennedy, Edl Kennedy, xEd Kennedy, Ed Kennedy; Kim June Joe Kimie Spacek
• Standardized: Theodore? Edward? Kimberly?

• No single “dictionary” of “right” spellings
• No one-to-one correspondence between nicknames and names
• Poor data quality is common
• Cultural syntax variations are problematic
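
The contrast on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a description of any product: the function names and the tiny nickname table are hypothetical. A phone number has one correct standardized form, so a simple rule can produce it, while a nickname can map to several formal names, so a lookup can only return candidates.

```python
import re
from typing import Dict, List, Optional


def standardize_us_phone(raw: str) -> Optional[str]:
    """Return a 10-digit US number as '(AAA) PPP-NNNN', or None otherwise."""
    digits = re.sub(r"\D", "", raw)  # drop everything that is not a digit
    if len(digits) != 10:
        return None
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"


# Hypothetical, deliberately tiny nickname table (an assumption for this
# sketch). One nickname can point to several formal names.
NICKNAME_CANDIDATES: Dict[str, List[str]] = {
    "ed": ["Edward", "Edwin", "Edmund"],
    "ted": ["Theodore", "Edward"],
    "kim": ["Kimberly", "Kim"],
}


def formal_name_candidates(given: str) -> List[str]:
    """Return possible formal names; fall back to the input if unknown."""
    cleaned = given.strip()
    return NICKNAME_CANDIDATES.get(cleaned.lower(), [cleaned])
```

Here standardize_us_phone(")978)36 5-5312") yields the single form "(978) 365-5312", whereas formal_name_candidates("Ted") returns two candidates and cannot say which one is right without more context.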

Data Collection Forms Create Name Parsing Errors

Name Formats
• Last Name ______ First Name ______ Title ______
• Name ______
• Middle Name ______ Last Name ______ First Name ______
• Family Name ______ Given Name ______ Middle Initial ______
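
The form layouts above put the same name parts under different labels and in different orders. A minimal sketch of bringing such records into one structure before matching follows. PersonName, FIELD_MAP and from_form are hypothetical names introduced for illustration, and the label mapping is an assumption.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class PersonName:
    given: str = ""
    middle: str = ""
    family: str = ""
    title: str = ""
    unparsed: str = ""  # a single free-text "Name" field, kept verbatim


# Hypothetical mapping from form labels to canonical attributes.
FIELD_MAP: Dict[str, str] = {
    "first name": "given",
    "given name": "given",
    "middle name": "middle",
    "middle initial": "middle",
    "last name": "family",
    "family name": "family",
    "title": "title",
    "name": "unparsed",
}


def from_form(fields: Dict[str, str]) -> PersonName:
    """Build a PersonName from one form submission, whatever its layout."""
    name = PersonName()
    for label, value in fields.items():
        attr = FIELD_MAP.get(label.strip().lower())
        if attr is None:
            raise KeyError(f"unknown form field: {label!r}")
        setattr(name, attr, value.strip())
    return name
```

The single "Name" field is kept unparsed on purpose: splitting it correctly depends on the cultural conventions of the name, which is the parsing problem this slide points at.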

Common Problems with Names

Database problems: Name Variant 1, Name Variant 2, Name Variant 3

Standard matching approaches: Exact Match, Soundex (1918), NYSIIS (1963), “Home-Grown”

The same name is represented differently across corporate databases, and standard approaches to matching them are often inadequate.
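
To make the limitation concrete, here is a minimal sketch of the commonly documented American Soundex rules, applied to names that appear elsewhere in this deck. It is an illustration written for this section, not the implementation used by any product mentioned here.

```python
# Letter groups and their Soundex digits (common American Soundex rules).
_GROUPS = (
    ("BFPV", "1"),
    ("CGJKQSXZ", "2"),
    ("DT", "3"),
    ("L", "4"),
    ("MN", "5"),
    ("R", "6"),
)
CODES = {ch: digit for letters, digit in _GROUPS for ch in letters}


def soundex(name: str) -> str:
    """Return the 4-character Soundex code for an ASCII-letter name."""
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    code = letters[0]                  # the first letter is kept as-is
    prev = CODES.get(letters[0], "")   # its digit still suppresses a repeat
    for ch in letters[1:]:
        if ch in "HW":
            continue                   # H and W do not separate equal codes
        digit = CODES.get(ch, "")      # vowels and Y have no digit
        if digit and digit != prev:
            code += digit
        prev = digit                   # a vowel resets the previous digit
    return (code + "000")[:4]          # pad or truncate to 4 characters
```

With this sketch, Worchester and Worcester share the code W622 but Wooster gets W236, and Tchaikovsky (T221) and Chaikovsky (C212) do not match at all, because the code always keeps the first letter.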

Why do you need Name Recognition?

• Ability to recognize multi-cultural names from around the world, and provide insight, analysis and matching capabilities
• For organizations where names, personal or business, constitute a vital data element
• Dealing with names is an essential part of data quality
• Applications and solutions that rely on identity rely on names
• Names are important to data quality as they enhance identification, merging, standardization and enrichment steps
• Many organizations must be able to recognize and match names (e.g., banks, insurance companies, law enforcement, airlines)
• Global business interactions demand the ability to process multi-cultural names with greater accuracy

Costs of Dirty Data

• Scrap and rework
• Increased costs
• Lack of consumer confidence
• Lost opportunities

• 83% of data integration projects either overrun or fail
• Inaccurate or incomplete data is a leading cause of failure in business-intelligence and CRM projects
• Low data quality costs companies $611 billion annually
• 25% of time is spent clarifying bad data
• Undetected defects will cost 10 to 100 times as much to fix upstream

Typical Data Quality Process

• Standardization
• Verification (Reference Data)
• Identify Matches & Duplicates
• Manage Matches
• Enrich Data

Typical Data Quality Process – enhanced by Name Recognition

• Name Standardization
• Verification (Reference Data)
• Identify Matches & Duplicates
• Manage Name Equivalencies
• Enrichment based on Name Data

Capabilities of Global Name Recognition
How does this apply to Data Quality?

• Transliterate: Incoming name in non-Roman native script
• Parse: Analysis and remediation of name data
• Classify: Identifies culture and the proper search techniques
• Genderize: Identifies the most likely gender of a name
• Name Variations: Ranked name variations used as query terms for search
• Search: Ranked list of potential matches from one or more data sources

Standardization Enrichment
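
The idea of a ranked list of potential matches can be illustrated with a generic string-similarity score. The sketch below uses Python's standard difflib as a stand-in. It is not how Global Name Recognition scores names, it does not use cultural knowledge, and the function name and threshold are assumptions chosen for the example.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def rank_matches(
    query: str, candidates: List[str], threshold: float = 0.6
) -> List[Tuple[str, float]]:
    """Return (candidate, score) pairs at or above threshold, best first."""
    q = query.casefold()
    scored = [
        (c, SequenceMatcher(None, q, c.casefold()).ratio()) for c in candidates
    ]
    kept = [(c, s) for c, s in scored if s >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, score in rank_matches(
        "Tchaikovsky", ["Chaikovsky", "Kennedy", "Tchaikovsky"]
    ):
        print(f"{name}: {score:.2f}")
```

Unlike an exact match or a phonetic key, a ranked score lets a reviewer or a downstream rule decide where to cut off. A character-level score like this one still knows nothing about transliteration conventions or name order.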

Conclusions

• Data quality and name recognition are complementary
• Today, there are strong integration synergies between Global Name Recognition and QualityStage
• Global Name Recognition abilities significantly enhance data quality efforts by:
– Providing enhanced insight, matching and standardization around names from around the world
– Providing better match rates and higher precision and recall
