<<

Gregory Grefenstette Use of Semantics in Chief Science Officer Real Life Applications Exalead 29 October 2010

Confidential Information What is semantics

Semantics is many things to many people My definition of semantics (general public) Semantics in Search Engines Successful semantics Entity extraction Topic segmentation Affect analysis, domain analysis Semantics in Products Voxalead News Urbanizer Future LOD

10/29/2010 2 Confidential Information Semantic Entities

D O I :

Unified Access to Heterogeneous Audiovisual Archives Y. Avrithis, G. Stamou, M. Wallace et al.

10.3217/jucs-009-06-0510

10/29/2010 3 Confidential Information Semantic store

The Multidimensional Learning Model: A Novel Cognitive Psychology-Based Model for Computer Assisted Instruction in Order to Improve Learning in Medical Students Tarek M. Abdelhamid, M.D., Formerly from the Medical Education Development Office, The University of Auckland School of Medicine and Health Science Figure 3 According to the level of processing approach the semantic (meaningful) coding of information provides the deepest form of processing.

10/29/2010 4 Confidential Information Major levels of linguistic structure

Source National Visualization and Analytics Center: Illuminating the Path: The Research and Development Agenda for Visual Analytics p.110., 2005

Author James J. Thomas and Kristin A. Cook (Ed.) 10/29/2010 5 Confidential Information Syntactic and semantic level description of the surface text

10/29/2010 6 Confidential Information Semantic Level

Announcing the Release of the August 2009 CTP for the Open XML SDK BrianJones 27 Aug 2009 9:21 AM

10/29/2010 7 Confidential Information Semantic Map

LearningTip #50: Strategies Help Reluctant Silent Readers Read to Learn By Joyce Melton Pagés, Ed.D. Educator, President of KidBibs

10/29/2010 8 Confidential Information Semantic Structure

Foundations of language: brain, meaning, grammar, evolution By Ray Jackendoff

http://www.cse.iitk.ac.in/users/amit/books/jackendoff-2002-foundations-of-language.html/

10/29/2010 9 Confidential Information Semantic Levels and Computer Evidence

http://www.anguas.com Mar 18 2010 10/29/2010 10 Confidential Information Semantic Gap

http://mklab.iti.gr/svmcf

10/29/2010 11 Confidential Information Semantic Filtering

M. Naphade, I. Kozintsev and T. Huang, “A Factor Graph Framework for Semantic Video Indexing”, accepted for publication in IEEE Transactions on Circuits and Systems for Video Technology http://domino.research.ibm.com 10/29/2010 12 Confidential Information Semantic Vocabulary

Jingen Liu, Yang Yang and Mubarak Shah, , Learning Semantic Visual Vocabularies Using Diffusion Distance IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

10/29/2010 13 Confidential Information Semantic Needs

http://blog.webgenomeproject.org/internet- hierarchy-of-needs-elaborated-part-4/ 10/29/2010 14 Confidential Information Unified Semantic Ecosystem

http://dita.xml.org/wiki/level-six-universal- semantic-ecosystem 10/29/2010 15 Confidential Information Swoogle Semantic Web indexer

10/29/2010 16 Confidential Information Semantic Web

The element of the Semantic Web Can be encoded in XML Simplicity and mathematical consistency This is called Resource Description Framework (RDF)

http://www.w3.org/2007/Talks/1211-whit-tbl/ Tim Berners-Lee MIT Computer Science & Artificial Intelligence Laboratory (CSAIL) Southampton University School of Electronics and Computer Science World Wide Web Consortium

10/29/2010 17 Confidential Information Semantic Web

RDF data...

Subject and object node using same URIs http://www.w3.org/2007/Talks/1211-whit-tbl/ Tim Berners-Lee MIT Computer Science & Artificial Intelligence Laboratory (CSAIL) Southampton University School of Electronics and Computer Science World Wide Web Consortium

10/29/2010 18 Confidential Information Semantics Upper level classes Relations between concepts Predicates extracted from text Restrictions on content Labels on visual data « meaning » Shared master data space RDF triples

10/29/2010 19 Confidential Information MY EXPERIENCE WITH SEMANTICS

Confidential Information My thinking in 2007

Language identification tokenization

Word segmentation

Morphological analysis

Part of speech tagging

Lemmatisation Spelling correction, synonym expansion Phrase chunking

Dependency Entity recognition Event analysis Document analysis Summarization clustering Topic identification Classification

10/29/2010 21 Confidential Information In 2008, I joined Exalead

16 milliards de pages web 2 milliards d’images 50 millions de videos

www.exalead.com/search

22 Confidential Information Searching for known items with facets ASIDE ABOUT

Confidential Information 10/29/2010 24 Confidential Information Semantic Dimensions

Facets

Hunting Gathering Shopping

10/29/2010 25 Confidential Information Confidential Information END OF ASIDE BACK TO CLIENTS DESIRE FOR SEMANTICS

Confidential Information I learned that this is what Clients think

Anything added to but not in the original text is semantics

Anything fancier than indexing original strings

Anything that has "language"

10/29/2010 28 Confidential Information Semantics is three things to the ordinary person

Is this an example of that?

Are these two things the same?

What is the relation between these two things?

10/29/2010 29 Confidential Information Semantics is three things to the ordinary person

Is this an example of that? Language identification, entity extraction, sense disambiguation, affect analysis

Are these two things the same? Tokenization, spelling correction, morphological analysis, summarization, synonym expansion, paraphrase, anaphor

What is the relation between these two things? Parsing, event extraction, relation extraction, classification, clustering

10/29/2010 30 Confidential Information Semantics in a Search Engine

Text fields « … the certification test is documented in the report … » Numerical fields 3.14159265 Date 16/11/1957 GPS coordinates, real world coordinates 48.451065619, 1.4392089 Categories Top/Animals/Pets/Dog Value separated fields Color: outside red ; interior: white ; trimming: silver Metadata (attribute:value) Source : dailymotion

10/29/2010 31 Confidential Information Exalead Fields capture most types, documents capture rows

Exalead Field types Text Dates Numbers Categories Exalead Documents Each doc an entity, database rows, or Each doc a text, with entities added in fields

32 10/29/2010 32 Confidential Information Semantics Processes in a Search Engine

autosuggestion language identification lemmatisation related terms related queries synonyms phonetic cross language faceted search Ontology Matcher local search phonetic lemmatisation synonyms stopwords sentiment analysis INDEX named entity

QueryMatcher FastRules HTMLRelevantContextExtractor Categorizer Clusterer 33 10/29/2010 33 Confidential Information AutoSuggestion Example

Auto completion of user query as they type

10/29/2010 34 Confidential Information Auto Suggestion Resource

.... Scores > administrator-set threshold are displayed

... Format for URL suggestions

10/29/2010 35 Confidential Information Synonyms, Example

DB administrator

is defined as synonym of

Database Administrator

This synonymy can be in one direction or both ways

10/29/2010 36 Confidential Information Synonym Resource

Example 1

In this example the query "db administrator" generates query ("db administrator" or "database administrator" or "data base administrator") but "database administrator" and "data base administrator" queries do not perform expansion.

Example 2

In this example any of the queries "db administrator", "database administrator" or"data base administrator" will generate the combined query ("db administrator" or "database administrator" or "data base administrator")

10/29/2010 37 Confidential Information Phonetic Example

10/29/2010 38 Confidential Information Phonetic Resource contextualizable character sequence Phonetic form

comment for linguist developer

# names and old spelling "(oy)", "O.Y" # hoi (hi) For each language, "(oy)", "O.Y" # Plockhoy, Roy, Van Royen "(ooy)", "O.Y" # Verlooy phonetic "(oij)", "O.Y" # rules "(ooij)", "O.Y" # Van Ooijen are "(oeu)", "@" # oeuvre developed ## OU in house. # exceptions for foreign "r(ou)tine", "OU" "g(ou)verneur", "OU" Currently "

# !! OU in names and old spelling CA CS DA DE "(ouw)", "A.OU" # bouw, sjouwer, mouw, stouwen EN ES ET FI

"(ou)", "A.OU" # zout (sel), vrouw (femme), koud (froid), nou (eh bien) FR IT NL NO PL PT RO RU SK SL SV

10/29/2010 39 Confidential Information Language Identification

54 languages are currently detected

10/29/2010 40 Confidential Information Lemmatisation

...

10/29/2010 41 Confidential Information Cross Language Information Retrieval

Afghanistan

10/29/2010 42 Confidential Information Spell Checker

10/29/2010 43 Confidential Information Samples from Google N-gram corpus

Sample of the 3-gram data in this Sample of the 4-gram data in corpus: this corpus:

ceramics collectables collectibles 55 serve as the incoming 92 ceramics collectables fine 130 serve as the incubator 99 ceramics collected by 52 serve as the independent 794 ceramics collectible pottery 50 serve as the index 223 ceramics collectibles cooking 45 ceramics collection , 144 serve as the indication 72 ceramics collection . 247 serve as the indicator 120 ceramics collection 120 serve as the indicators 45 ceramics collection and 43 serve as the indispensable 111 ceramics collection at 52 serve as the indispensible 40 ceramics collection is 68 serve as the individual 234 ceramics collection of 76 serve as the industrial 52 ceramics collection | 59 serve as the industry 607 ceramics collections , 66 serve as the info 42 ceramics collections . 60 serve as the informal 102 ceramics combined with 46 serve as the information 838 serve as the informational 41

Confidential Information FastRules

add a category to a document matching a Jones, G. et al. Non-hierarchic Willett Introduction Cluster analysis, or automatic classification, is a multivariate statistical compiled regular technique ... henceforth a GA, for document clustering. 100 000 queries with 04 nov. 2003 - 20k boolean large multi word expression expression matching MyCategory: MachineLearning 1 Mb / sec / server Source : Vie société : Documentation : Press : Exalead EN-FR Press FastRulesAdvantages excerpts : Old - News Articles : Articles de Very fast rules categorization recherche : NeuralNetworks :clustering Support huge set of rules (> 100k rules) Regular expression on chunk (for example a regular Personnalité : expression on an url, title or text) Gareth Jones, Peter Willett, Edward Arnold, Van Nostrand Reinhold Available as a standalone component (Exa and Java) Lieu : Drawbacks Berlin, Duran, Sheffield, New York, London Limited syntax Organisation : (only AND/OR/NOT operators are supported) British Library, Wellcome, Everitt, American Society, ACM, Springer Rules must be compiled before usage

10/29/2010 45 Confidential Information RulesMatcher, Named Entities

image of use?

10/29/2010 47 Confidential Information SentimentAnalyzer

10/29/2010 48 Confidential Information Categorizer

Business

Consumer Shopping Services Documents D’apprentissage Customer Inqueries Pets Service

Business Signature de la classe

Consumer Shopping Services Nouveau Customer Inqueries Pets document Service

10/29/2010 49 Confidential Information Semantics Processes in a Search Engine

autosuggestion language identification lemmatisation related terms related queries synonyms phonetic spell checker cross language faceted search Ontology Matcher local search phonetic lemmatisation synonyms stopwords sentiment analysis INDEX named entity

QueryMatcher FastRules HTMLRelevantContextExtractor Categorizer Clusterer 50 10/29/2010 50 Confidential Information With documents as containers With rich semantic markup With database offloading database queries generating documents Search engines can become part of a process

Search Based Applications

10/29/2010 51 Confidential Information Search-Based Applications

Search Engines can provide a powerful search-based infrastructure...

… for a new generation of applications

With Offloading Without Offloading Low complexity/costs, High performance/reusability High complexity/costs, Low performance/reusability

10/29/2010 52 Confidential Information Affordance of Search Engines

Web Search Technology has evolved over the past decade to perfect distribution of information

Handle 100s of Offer 99,999% millions of availability database records

Follow a 2-month Match low revenue innovation model roadmap ($0,0001/ session)

Be usable without Support impossible training to forecast traffic

Provide Present sub-second up-to-date response times information10/29/2010 53 Confidential Information 53 Search Search Databases Based engines Applications

10/29/2010 54 Confidential Information Transfer from research to industry SEMANTICS SUCCESS STORIES

Confidential Information Named Entities "Named entities are phrases that contain the names of persons, organizations, locations, times, and quantities." (CoNLL 2002) MUC 1 (1987) thru MUC 6 (1995) Jim bought 300 shares of Acme Corp. in 2006.

http://cs.nyu.edu/cs/faculty/grishman/

10/29/2010 56 Confidential Information Voxalead

10/29/2010 57 Confidential Information

10/29/2010 58 Confidential Information VoxaleadNews Methodology

RSS feed

Video file Meta data Title Date Channel Stream splitter …

Audio track Extract Video re .wav vignettes .jpg encoding .mp4

Speech-to-Text

Text with timetracking .xml

Metadata Push API Named Entities Index Cross Lingual

59

10/29/2010 59 Confidential Information

10/29/2010 60 Confidential Information

10/29/2010 61 Confidential Information

10/29/2010 62 Confidential Information

10/29/2010 63 Confidential Information

10/29/2010 64 Confidential Information Confidential Information Transfer from research to industry SEMANTICS SUCCESS STORY 2

Confidential Information Topic Segmentation

http://people.ischool.berkeley.edu/~hearst/papers/acl94.pdf

10/29/2010 68 Confidential Information

Professor Marti Hearst Topic Segmentation

http://people.ischool.berkeley.edu/~hearst/papers/acl94.pdf

Professor Marti Hearst

10/29/2010 69 Confidential Information in practice

10/29/2010 70 Confidential Information

10/29/2010 71 Confidential Information

10/29/2010 72 Confidential Information

10/29/2010 73 Confidential Information

10/29/2010 74 Confidential Information Transfer from research to industry SEMANTICS SUCCESS STORY 3

Confidential Information Brief History of Semantic Fields and Affect Analysis

active . spry young slow Deese (1965) semantic axes fast free association experiments old “big-small”, “hot-cold”, etc. passive

Berlin & Kay (1969) Lehrer (1974) semantic fields Levinson (1983) linguistic scale set of alternate or contrastive expressions arranged on an axis by degree of semantic strength red orange yellow green blue red amber tan yellow blue

10/29/2010 76 Confidential Information Affect Lexicons Lasswell Value Dictionary (1969) Eight dimensions: WEALTH, POWER, RECTITUDE, RESPECT, ENLIGHTENMENT, SKILL, AFFECTION, AND WELLBEING with positive or negative orientation e.g., admire: RESPECT (positive) General Inquirer dictionary (Stone, et al. 1965) 9051 headwords 1,915 positive and 2,291 negative words (Pos/Neg) also labels: Active, Passive, ... , Pleasure, Pain, …Human, Animate, …, Region, Route,…, Fetch, Stay, .. http://www.wjh.harvard.edu/~inquirer/inqdict.txt

10/29/2010 77 Confidential Information Clairvoyance Affect Lexicon

"arrogance" sn "superiority" 0.7 0.9 .. "gleeful" adj “happiness” 0.7 0.6 "gleeful" adj “excitement” 0.3 0.6 … 84 pair affect classes (positive/negative)

Confidential Information Finding Emotive Words using Patterns

appear almost feel extremely is so X look very seem too

“appearing so …” “feeling extremely…” “was too…” “looks almost….”

105 patterns with conjugated word forms

10/29/2010 79 Confidential Information Most Common Words from All Patterns

Using snippets from all 105 patterns: 8957 good 2906 important 2506 happy 2455 small 2024 bad 1976 easy 1951 far 1745 difficult 1697 hard 1563 pleased … in all 15111 different, inflected word

10/29/2010 80 Confidential Information SO-PMI of emotive pattern words 37.5 knowlegeable 33.6 tailormade 32.9 eyecatching 29.0 huggable 26.2 surefooted 24.6 timesaving 22.9 personable 21.9 welldone …. -17.1 underdressed -17.5 uncreative -18.1 disapproving -18.5 meanspirited -18.6 unwatchable -22.7 discombobulated

Confidential Information Mood of a city

URBANIZER.COM

Confidential Information http://labs.exalead.com/

10/29/2010 83 Confidential Information

10/29/2010 84 Confidential Information

10/29/2010 85 Confidential Information

10/29/2010 86 Confidential Information

10/29/2010 87 Confidential Information

10/29/2010 88 Confidential Information

10/29/2010 89 Confidential Information

10/29/2010 90 Confidential + Abstraction + Abstraction

"…decor is not particularly elegant…" ambiance ↓ 70 Business Lunch ≈ service ε [0,40] V cuisine ε [45,100] V ambiance ε [45,100]

10/29/2010 91 Confidential Information URBANIZER Domain modeling

Service Domain vocabulary

Cuisine Sentiment for domain

Syntactic patterns

Ambiance

Confidential Information Domain analysis, same as Affect Analysis

Deese (1965) semantic axes free association experiments quick “big-small”, “hot-cold”, etc. homey upscale

active

young slow casual

fast refined old passive leisurely

…spent {two, three} hours…

10/29/2010 93 Confidential Information

10/29/2010 94 Confidential Information

10/29/2010 95 Confidential Information

10/29/2010 96 Confidential Information Another example of domain modeling

10/29/2010 97 Confidential Information

10/29/2010 98 Confidential Information

10/29/2010 99 Confidential Information Foreseeable, near term FUTURE SUCCESSES

Confidential Information Dbpedia, RDF triples

10/29/2010 101 Confidential Information Wikipedia InfoBoxes

10/29/2010 102 Confidential Information Implicit tables

103 10/29/2010 103 Confidential Information Linked Open Data Cloud

2008 2008 2007 2008 2009 2008 2009

10/29/2010 104 Confidential Information LOD2 turns linked data into knowledge

interlink

SILK DXX Engine

create poolparty merge

OntoWiki SemMF

2007 2008 Sigma 20082008 WiQA2008 2009

ORE Virtouso verify classify DL-Learner MonetDB

Sindice

enrich

105 Confidential Information Challenge : searching ad hoc, idiosyncratic spaces

http://www.imrc.kist.re.kr/wiki/LifeLog

10/29/2010 106 Confidential Information Microsoft SenseCam

http://research.microsoft.com/en- us/um/cambridge/projects/sensecam/

http://justgetthere.us/blog/plugin/tag/cameras

10/29/2010 107 Confidential Information http://gadgets.infoniac.com/head-worn- camera-for-special-events.html

10/29/2010 108 Confidential Information Bike My Way

10/29/2010 109 Confidential Information Insta Mapper

10/29/2010 110 Confidential Information Challenge : searching ad hoc, idiosyncratic spaces

Hui Yang and Jamie Callan. A Metric-based http://www.imrc.kist.re.kr/wiki/LifeLog Framework for Automatic Taxonomy Induction. In Proceedings of the 47th Annual Meeting of the Association for Computational (ACL2009), Singapore. Aug 2-7, 2009

10/29/2010 111 Confidential Information Conclusions

Semantics What is this? Are these the same? Who does what to whom? Search Engines now handle rich semantics Search based applications, between DB and IR Semantic Research Successful transfer to industry Near Future Integration crowd source knowledge Wikipedia Linked open data

10/29/2010 112 Confidential Information Thank you

10/29/2010 113 Confidential Information