Steven Minton, InferLink Corporation
Sofus Macskassy, Fetch Technologies
Peter LaMonica, Air Force Research Laboratories
Kane See, InferLink Corporation
Craig Knoblock, USC/Information Sciences Inst.
Greg Barish, Fetch Technologies
Matthew Michelson, Fetch Technologies
Ray Liuzzi, Raymond Technologies

Steve Minton, InferLink
Steven Minton, Stanford University
Stephen Minton, Brain Surgeon
Steve Minton, Fetch Technologies
Steven Minton, convicted felon
Steven Minton, Jonosboro High School
Steven Minton, JAIR

• Application domain: Arms trafficking
• Entity Intelligence Portal (ENTEL)
• Entity resolution process
• Mistakes: Maintaining referential integrity

Web Monitoring System: AIJ, JAIR, ICML, AAAI, IEEE Intelligent Systems, Grants.gov
Web Monitoring System: NASA, US Forest Service, Twitter, National Interagency Fire Center, InciWeb.org

Web Monitoring System: Airliners.net, Banned Airlines, Aviation Week, Twitter, ATWonline, Air Cargo News, Aviation Safety Network
Charged with conspiracy to support a terrorist organization, money laundering, ….

Omega Aircompany, Irbis Air, Ishtar Airlines, Mega Airlines, Aerocom, WING AIR, Norse Air Charter, Air Cess, Galaxy Air, Air Foyle, Centrafricain Airlines, Click Airways, Air Bas, Air Pass, Anikay (Anikai) Airlines, Pietersberg Aviation Services Systems, Santa Cruz Imperial, Great Lake Business Company, Balkh Airlines, Phoenix Aviation, Dolphin Air, Air Zory, JetLine International, Flying Dolphin, Sitrat Air, MaxAvia, San Air General Trading, Air Mero, African Express, Air Leone, Inter Transavia
Registration  Aircraft Type  Construction Nbr  Previous Reg.  Sighting     Markings
UN-75002      Ilyushin 18E   185008603         3C-KKR         SHJ 11May03  no markings
                                                              SHJ 04Nov03  a/w, n/t
UN-75003      Ilyushin 18V   184006903         3C-KKJ         SHJ 12Oct03  blue tail, no m/s
UN-75004      Ilyushin 18D   186009202         3C-KKK         SHJ 14Sep02  green cheatline and blue tail
                                                              SHJ 04Nov03  No t/t, blue tail
                                                              SHJ 28Dec03  all white
UN-75005      Ilyushin 18D   187010204         3C-KKL         SHJ 04Nov03
UN-11007      Antonov 12B    9346509           3C-OOZ         SHJ Oct02    No m/s
                                                              SHJ 11May03  all white c/s
                                                              DXB 12Oct03  no titles
[From Ruudleeuw.com]
[Figure: system architecture, with steps numbered 1 to 5. Components shown: Web sources A to D; Fetch Agent Platform™ (web harvesting); Fact Extraction (entities, facts, relations from unstructured text; OpenCalais, Semantex); EntityBase™ (entity resolution); Content Store; Analytics Engine (Social Network, WatchLists); GUI. Data-flow labels: Facts, Text, Entity IDs, Entity-Resolved Text, Facts.]
• Entity resolution: Link incoming records describing the same entity from multiple sources
    R. Landis, President, Fetch Technologies
    Robert Landes, CEO, Fetch Software
    R. Land, CEO, French Alliance Technologies
• Many “common sense” issues, for instance:
  ▪ Multiple formats for names, addresses, etc.: R.L. Landes vs. Robert Landes
  ▪ Noisy, incorrect values: Landes vs. Landis
  ▪ Multi-valued attributes: Landes can be both President and CEO
  ▪ Aliases and Deception
A cluster is a single entity, composed of multiple data records.

[Figure: a new record is compared against existing clusters E1 to E7, subject to a confidence threshold.]
New record: Robert Landes, CEO, Fetch Tech

E1: R. Landis, President, Fetch Tech
  Transformations
    Initial: Robert → R.
    Spelling: Landes → Landis
    Title alias: CEO → President

E2: R. Land, President, French Tech
  Transformations
    Spelling: Land → Landis
    Spelling: French → Fetch
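The transformation lists above can be read as an explanation of why two records might describe the same entity. The following is a minimal sketch of that idea, not the EntityBase implementation: the lookup tables, the `(first, last, title, org)` record layout, and the function names are all assumptions made for illustration.

```python
# Hypothetical lookup tables; the slides name the transformation types
# but do not say how they are stored or applied.
SPELLING_VARIANTS = {("landes", "landis"), ("land", "landis"), ("french", "fetch")}
TITLE_ALIASES = {("ceo", "president")}


def listed(a, b, pairs):
    """True if (a, b) appears in the table in either order, ignoring case."""
    a, b = a.lower(), b.lower()
    return (a, b) in pairs or (b, a) in pairs


def is_initial_of(a, b):
    """True if one token is the initial of the other, e.g. 'R.' and 'Robert'."""
    short, long_ = sorted((a.lower().rstrip("."), b.lower().rstrip(".")), key=len)
    return len(short) == 1 and long_.startswith(short)


def link_tokens(a, b, kind, test, used):
    """Link two tokens directly or via one transformation; record which one."""
    if a.lower() == b.lower():
        return True
    if test(a, b):
        used.append(f"{kind}: {a} -> {b}")
        return True
    return False


def explain_match(rec_a, rec_b):
    """Return the transformations linking two (first, last, title, org)
    records, or None if some field cannot be linked."""
    used = []
    first_a, last_a, title_a, org_a = rec_a
    first_b, last_b, title_b, org_b = rec_b
    spelling = lambda a, b: listed(a, b, SPELLING_VARIANTS)
    alias = lambda a, b: listed(a, b, TITLE_ALIASES)

    ok = (link_tokens(first_a, first_b, "Initial", is_initial_of, used)
          and link_tokens(last_a, last_b, "Spelling", spelling, used)
          and link_tokens(title_a, title_b, "Title alias", alias, used))
    org_a, org_b = org_a.split(), org_b.split()
    ok = ok and len(org_a) == len(org_b) and all(
        link_tokens(a, b, "Spelling", spelling, used) for a, b in zip(org_a, org_b))
    return used if ok else None
```

Calling `explain_match(("Robert", "Landes", "CEO", "Fetch Tech"), ("R.", "Landis", "President", "Fetch Tech"))` returns the three transformations listed for E1. A real system would presumably attach a cost or probability to each transformation rather than treating them as all-or-nothing.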
New record: Robert Landes, CEO, Fetch Tech

E1: R. Landis, President, Fetch Tech
\[ P(E_1 \mid D) = \frac{P(E_1)\, P(D \mid E_1)}{P(D)} \]

E2: R. Land, President, French Tech
\[ P(E_2 \mid D) = \frac{P(E_2)\, P(D \mid E_2)}{P(D)} \]

Enew: a new entity?
\[ P(E_{new} \mid D) = \frac{P(E_{new})\, P(D \mid E_{new})}{P(D)} \]
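The posterior comparison above can be sketched numerically. This is an illustrative example only: the priors and likelihoods are made-up numbers, the candidates are assumed to be mutually exclusive and exhaustive so that P(D) can be computed by summing over them, and the `threshold` parameter is a hypothetical stand-in for the confidence threshold mentioned earlier.

```python
def posteriors(priors, likelihoods):
    """Apply Bayes' rule to each candidate: P(E|D) = P(E) P(D|E) / P(D).

    Assumes the candidates are mutually exclusive and exhaustive, so
    P(D) is the sum of P(E) P(D|E) over all candidates.
    """
    evidence = sum(priors[e] * likelihoods[e] for e in priors)
    return {e: priors[e] * likelihoods[e] / evidence for e in priors}


def assign(priors, likelihoods, threshold=0.8):
    """Return (best candidate or None, all posteriors).

    None means no candidate reached the (hypothetical) confidence threshold.
    """
    post = posteriors(priors, likelihoods)
    best = max(post, key=post.get)
    return (best if post[best] >= threshold else None), post


# Made-up numbers for the new record D = "Robert Landes, CEO, Fetch Tech".
priors = {"E1": 0.4, "E2": 0.4, "Enew": 0.2}
likelihoods = {"E1": 0.30, "E2": 0.02, "Enew": 0.01}
best, post = assign(priors, likelihoods)
```

With these numbers the record is assigned to E1. If `best` were `"Enew"`, the record would start a new cluster instead of joining an existing one.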
[Figure: new records arrive; one may become a new entity, Enew.]

• Merge example:
  ◦ Air Cess and Air Bas aircraft
• Split example:
  ◦ George H. W. Bush and George W. Bush
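Merges and splits can be pictured as operations on a mapping from entity IDs to sets of record IDs. The sketch below is an assumption-laden illustration, not the EntityBase design: the `ClusterStore` class, its ID scheme, the convention that a merge keeps one ID and retires the other, and the event log are all hypothetical.

```python
import itertools


class ClusterStore:
    """Toy cluster store: entity id -> set of record ids (hypothetical)."""

    def __init__(self):
        self.clusters = {}
        self.events = []  # log of merges and splits, oldest first
        self._next = itertools.count(1)

    def new_entity(self, records):
        eid = f"E{next(self._next)}"
        self.clusters[eid] = set(records)
        return eid

    def merge(self, keep, retire):
        """Fold cluster `retire` into `keep`; `retire` stops existing."""
        self.clusters[keep] |= self.clusters.pop(retire)
        self.events.append(("merge", retire, keep))
        return keep

    def split(self, eid, records):
        """Move `records` out of `eid` into a newly created entity."""
        records = set(records)
        if not records or not records < self.clusters[eid]:
            raise ValueError("records must be a non-empty proper subset of the cluster")
        self.clusters[eid] -= records
        new_id = self.new_entity(records)
        self.events.append(("split", eid, new_id))
        return new_id


store = ClusterStore()
air_cess = store.new_entity({"D1", "D2"})  # record ids are made up
air_bas = store.new_entity({"D3"})
store.merge(air_cess, air_bas)             # both are judged to be one entity
split_off = store.split(air_cess, {"D3"})  # later evidence reverses that
```

The event log is what a client would need in order to notice that an ID it holds has been retired or that a cluster it cached has changed.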
[Figure sequence: EntityBase clusters (E1 to E10) built from data records (D1 to D13); clusters change as entities are merged and split.]

[Figure: EntityBase feeding Analytics. The WatchList contains Kartiga Air (D9), Merpati Airlines (D11), Air Cess (D138), ….]

[Figure: EntityBase publishes merges/splits to Analytics, which maintains a “Social” Network over entities (E1, E2, E9, E15, E34, E91, E200, and others).]
• Two approaches:
  ◦ Refer-by-Description
    ▪ Indirect reference: Point to a cluster member
    ▪ Advantage: Easy, no synchronization necessary
    ▪ …But limits information that client can cache
  ◦ Refer-by-Identifier
    ▪ Direct reference: Cluster ID
    ▪ Advantage: Client can cache arbitrary information
    ▪ …But client must synch with EntityBase and maintain consistency
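The difference between the two approaches can be illustrated with a toy client for each. This is a sketch under stated assumptions: the `EntityBase` class here is a hypothetical stand-in with an invented interface (`lookup_by_record`, `current_id`, `merge`), and only merges are handled. A split has no single successor cluster, so a real Refer-by-Identifier client would need more than a redirect table.

```python
class EntityBase:
    """Hypothetical stand-in: entity id -> set of record ids."""

    def __init__(self, clusters):
        self.clusters = {eid: set(recs) for eid, recs in clusters.items()}
        self.redirects = {}  # retired id -> surviving id

    def lookup_by_record(self, record_id):
        for eid, recs in self.clusters.items():
            if record_id in recs:
                return eid
        return None

    def current_id(self, eid):
        while eid in self.redirects:  # follow chains of merges
            eid = self.redirects[eid]
        return eid

    def merge(self, keep, retire):
        self.clusters[keep] |= self.clusters.pop(retire)
        self.redirects[retire] = keep


class DescriptionClient:
    """Refer-by-Description: hold a cluster member, resolve on every use."""

    def __init__(self, eb, record_id):
        self.eb, self.record_id = eb, record_id

    def entity(self):
        return self.eb.lookup_by_record(self.record_id)


class IdentifierClient:
    """Refer-by-Identifier: cache the cluster id and data; must sync."""

    def __init__(self, eb, eid):
        self.eb, self.eid = eb, eid
        self.cache = {"records": set(eb.clusters[eid])}

    def sync(self):
        self.eid = self.eb.current_id(self.eid)
        self.cache["records"] = set(self.eb.clusters[self.eid])


eb = EntityBase({"E1": {"D1"}, "E2": {"D2"}})
by_desc = DescriptionClient(eb, "D2")
by_id = IdentifierClient(eb, "E2")
eb.merge("E1", "E2")
stale_id = by_id.eid  # still "E2" until the client syncs
by_id.sync()
```

After the merge, the description client resolves to the surviving cluster with no extra work, while the identifier client holds a stale ID and stale cache until `sync()` is called.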
[Figure: multiple clients and data sources connected to an Entity Resolution Service.]
• Vision: Entity Resolution in a decentralized world
• E.g., the Semantic Web (Glaser, Jaffri & Millard, 2009)
• Entity resolution can be hard: “AI Complete”
  ◦ Arms trafficking domain
• Entity merges and splits will occur
• Entity resolution clients must be designed to deal with this
• Two strategies: Refer-by-Description and Refer-by-Identifier
• System status: Being evaluated by AF personnel