Gregory Grefenstette Use of Semantics in Chief Science Officer Real Life Applications Exalead 29 October 2010
Confidential Information What is semantics
Semantics is many things to many people My definition of semantics (general public) Semantics in Search Engines Successful semantics Entity extraction Topic segmentation Affect analysis, domain analysis Semantics in Products Voxalead News Urbanizer Future LOD
10/29/2010 2 Confidential Information Semantic Entities
D O I :
Unified Access to Heterogeneous Audiovisual Archives Y. Avrithis, G. Stamou, M. Wallace et al.
10.3217/jucs-009-06-0510
10/29/2010 3 Confidential Information Semantic store
The Multidimensional Learning Model: A Novel Cognitive Psychology-Based Model for Computer Assisted Instruction in Order to Improve Learning in Medical Students Tarek M. Abdelhamid, M.D., Formerly from the Medical Education Development Office, The University of Auckland School of Medicine and Health Science Figure 3 According to the level of processing approach the semantic (meaningful) coding of information provides the deepest form of processing.
10/29/2010 4 Confidential Information Major levels of linguistic structure
Source National Visualization and Analytics Center: Illuminating the Path: The Research and Development Agenda for Visual Analytics p.110., 2005
Author James J. Thomas and Kristin A. Cook (Ed.) 10/29/2010 5 Confidential Information Syntactic and semantic level description of the surface text
10/29/2010 6 Confidential Information Semantic Level
Announcing the Release of the August 2009 CTP for the Open XML SDK BrianJones 27 Aug 2009 9:21 AM
10/29/2010 7 Confidential Information Semantic Map
LearningTip #50: Strategies Help Reluctant Silent Readers Read to Learn By Joyce Melton Pagés, Ed.D. Educator, President of KidBibs
10/29/2010 8 Confidential Information Semantic Structure
Foundations of language: brain, meaning, grammar, evolution By Ray Jackendoff
http://www.cse.iitk.ac.in/users/amit/books/jackendoff-2002-foundations-of-language.html/
10/29/2010 9 Confidential Information Semantic Levels and Computer Evidence
http://www.anguas.com Mar 18 2010 10/29/2010 10 Confidential Information Semantic Gap
http://mklab.iti.gr/svmcf
10/29/2010 11 Confidential Information Semantic Filtering
M. Naphade, I. Kozintsev and T. Huang, “A Factor Graph Framework for Semantic Video Indexing”, accepted for publication in IEEE Transactions on Circuits and Systems for Video Technology http://domino.research.ibm.com 10/29/2010 12 Confidential Information Semantic Vocabulary
Jingen Liu, Yang Yang and Mubarak Shah, , Learning Semantic Visual Vocabularies Using Diffusion Distance IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
10/29/2010 13 Confidential Information Semantic Needs
http://blog.webgenomeproject.org/internet- hierarchy-of-needs-elaborated-part-4/ 10/29/2010 14 Confidential Information Unified Semantic Ecosystem
http://dita.xml.org/wiki/level-six-universal- semantic-ecosystem 10/29/2010 15 Confidential Information Swoogle Semantic Web indexer
10/29/2010 16 Confidential Information Semantic Web
The element of the Semantic Web Can be encoded in XML Simplicity and mathematical consistency This is called Resource Description Framework (RDF)
http://www.w3.org/2007/Talks/1211-whit-tbl/ Tim Berners-Lee MIT Computer Science & Artificial Intelligence Laboratory (CSAIL) Southampton University School of Electronics and Computer Science World Wide Web Consortium
10/29/2010 17 Confidential Information Semantic Web
RDF data...
Subject and object node using same URIs http://www.w3.org/2007/Talks/1211-whit-tbl/ Tim Berners-Lee MIT Computer Science & Artificial Intelligence Laboratory (CSAIL) Southampton University School of Electronics and Computer Science World Wide Web Consortium
10/29/2010 18 Confidential Information Semantics Upper level classes Relations between concepts Predicates extracted from text Restrictions on content Labels on visual data « meaning » Shared master data space RDF triples
10/29/2010 19 Confidential Information MY EXPERIENCE WITH SEMANTICS
Confidential Information My thinking in 2007
Language identification tokenization
Word segmentation
Morphological analysis
Part of speech tagging
Lemmatisation Spelling correction, synonym expansion Phrase chunking
Dependency parsing Entity recognition Sentiment analysis Event analysis Document analysis Summarization clustering Topic identification Classification
10/29/2010 21 Confidential Information In 2008, I joined Exalead
16 milliards de pages web 2 milliards d’images 50 millions de videos
www.exalead.com/search
22 Confidential Information Searching for known items with facets ASIDE ABOUT ENTERPRISE SEARCH
Confidential Information 10/29/2010 24 Confidential Information Semantic Dimensions
Facets
Hunting Gathering Shopping
10/29/2010 25 Confidential Information Confidential Information END OF ASIDE BACK TO CLIENTS DESIRE FOR SEMANTICS
Confidential Information I learned that this is what Clients think
Anything added to but not in the original text is semantics
Anything fancier than indexing original strings
Anything that has "language"
10/29/2010 28 Confidential Information Semantics is three things to the ordinary person
Is this an example of that?
Are these two things the same?
What is the relation between these two things?
10/29/2010 29 Confidential Information Semantics is three things to the ordinary person
Is this an example of that? Language identification, entity extraction, word sense disambiguation, affect analysis
Are these two things the same? Tokenization, spelling correction, morphological analysis, summarization, synonym expansion, paraphrase, anaphor
What is the relation between these two things? Parsing, event extraction, relation extraction, classification, clustering
10/29/2010 30 Confidential Information Semantics in a Search Engine
Text fields « … the certification test is documented in the report … » Numerical fields 3.14159265 Date 16/11/1957 GPS coordinates, real world coordinates 48.451065619, 1.4392089 Categories Top/Animals/Pets/Dog Value separated fields Color: outside red ; interior: white ; trimming: silver Metadata (attribute:value) Source : dailymotion
10/29/2010 31 Confidential Information Exalead Fields capture most types, documents capture rows
Exalead Field types Text Dates Numbers Categories Exalead Documents Each doc an entity, database rows, or Each doc a text, with entities added in fields
32 10/29/2010 32 Confidential Information Semantics Processes in a Search Engine
autosuggestion language identification lemmatisation related terms related queries synonyms phonetic spell checker cross language faceted search Ontology Matcher local search phonetic lemmatisation synonyms stopwords sentiment analysis INDEX named entity
QueryMatcher FastRules HTMLRelevantContextExtractor Categorizer Clusterer 33 10/29/2010 33 Confidential Information AutoSuggestion Example
Auto completion of user query as they type
10/29/2010 34 Confidential Information Auto Suggestion Resource
10/29/2010 35 Confidential Information Synonyms, Example
DB administrator
is defined as synonym of
Database Administrator
This synonymy can be in one direction or both ways
10/29/2010 36 Confidential Information Synonym Resource
Example 1
In this example the query "db administrator" generates query ("db administrator" or "database administrator" or "data base administrator") but "database administrator" and "data base administrator" queries do not perform expansion.
Example 2
In this example any of the queries "db administrator", "database administrator" or"data base administrator" will generate the combined query ("db administrator" or "database administrator" or "data base administrator")
10/29/2010 37 Confidential Information Phonetic Example
10/29/2010 38 Confidential Information Phonetic Resource contextualizable character sequence Phonetic form
comment for linguist developer
# names and old spelling "(oy)", "O.Y" # hoi (hi) For each language, "(oy)", "O.Y" # Plockhoy, Roy, Van Royen "(ooy)", "O.Y" # Verlooy phonetic "(oij)", "O.Y" # rules "(ooij)", "O.Y" # Van Ooijen are "(oeu)", "@" # oeuvre developed ## OU in house. # exceptions for foreign words "r(ou)tine", "OU" "g(ou)verneur", "OU" Currently "
# !! OU in names and old spelling CA CS DA DE "(ouw)", "A.OU" # bouw, sjouwer, mouw, stouwen EN ES ET FI
"(ou)", "A.OU" # zout (sel), vrouw (femme), koud (froid), nou (eh bien) FR IT NL NO PL PT RO RU SK SL SV
10/29/2010 39 Confidential Information Language Identification
54 languages are currently detected
10/29/2010 40 Confidential Information Lemmatisation
...
10/29/2010 41 Confidential Information Cross Language Information Retrieval
Afghanistan
10/29/2010 42 Confidential Information Spell Checker
10/29/2010 43 Confidential Information Samples from Google N-gram corpus
Sample of the 3-gram data in this Sample of the 4-gram data in corpus: this corpus:
ceramics collectables collectibles 55 serve as the incoming 92 ceramics collectables fine 130 serve as the incubator 99 ceramics collected by 52 serve as the independent 794 ceramics collectible pottery 50 serve as the index 223 ceramics collectibles cooking 45 ceramics collection , 144 serve as the indication 72 ceramics collection . 247 serve as the indicator 120 ceramics collection 120 serve as the indicators 45 ceramics collection and 43 serve as the indispensable 111 ceramics collection at 52 serve as the indispensible 40 ceramics collection is 68 serve as the individual 234 ceramics collection of 76 serve as the industrial 52 ceramics collection | 59 serve as the industry 607 ceramics collections , 66 serve as the info 42 ceramics collections . 60 serve as the informal 102 ceramics combined with 46 serve as the information 838 serve as the informational 41
Confidential Information FastRules
10/29/2010 45 Confidential Information RulesMatcher, Named Entities
image of use?
10/29/2010 47 Confidential Information SentimentAnalyzer
10/29/2010 48 Confidential Information Categorizer
Business
Consumer Shopping Services Documents D’apprentissage Customer Inqueries Pets Service
Business Signature de la classe
Consumer Shopping Services Nouveau Customer Inqueries Pets document Service
10/29/2010 49 Confidential Information Semantics Processes in a Search Engine
autosuggestion language identification lemmatisation related terms related queries synonyms phonetic spell checker cross language faceted search Ontology Matcher local search phonetic lemmatisation synonyms stopwords sentiment analysis INDEX named entity
QueryMatcher FastRules HTMLRelevantContextExtractor Categorizer Clusterer 50 10/29/2010 50 Confidential Information With documents as containers With rich semantic markup With database offloading database queries generating documents Search engines can become part of a process
Search Based Applications
10/29/2010 51 Confidential Information Search-Based Applications
Search Engines can provide a powerful search-based infrastructure...
… for a new generation of applications
With Offloading Without Offloading Low complexity/costs, High performance/reusability High complexity/costs, Low performance/reusability
10/29/2010 52 Confidential Information Affordance of Search Engines
Web Search Technology has evolved over the past decade to perfect distribution of information
Handle 100s of Offer 99,999% millions of availability database records
Follow a 2-month Match low revenue innovation model roadmap ($0,0001/ session)
Be usable without Support impossible training to forecast traffic
Provide Present sub-second up-to-date response times information10/29/2010 53 Confidential Information 53 Search Search Databases Based engines Applications
10/29/2010 54 Confidential Information Transfer from research to industry SEMANTICS SUCCESS STORIES
Confidential Information Named Entities "Named entities are phrases that contain the names of persons, organizations, locations, times, and quantities." (CoNLL 2002) MUC 1 (1987) thru MUC 6 (1995)
http://cs.nyu.edu/cs/faculty/grishman/
10/29/2010 56 Confidential Information Voxalead
10/29/2010 57 Confidential Information
10/29/2010 58 Confidential Information VoxaleadNews Methodology
RSS feed
Video file Meta data Title Date Channel Stream splitter …
Audio track Extract Video re .wav vignettes .jpg encoding .mp4
Speech-to-Text
Text with timetracking .xml
Metadata Push API Named Entities Index Cross Lingual
59
10/29/2010 59 Confidential Information
10/29/2010 60 Confidential Information
10/29/2010 61 Confidential Information
10/29/2010 62 Confidential Information
10/29/2010 63 Confidential Information
10/29/2010 64 Confidential Information Confidential Information Transfer from research to industry SEMANTICS SUCCESS STORY 2
Confidential Information Topic Segmentation
http://people.ischool.berkeley.edu/~hearst/papers/acl94.pdf
10/29/2010 68 Confidential Information
Professor Marti Hearst Topic Segmentation
http://people.ischool.berkeley.edu/~hearst/papers/acl94.pdf
Professor Marti Hearst
10/29/2010 69 Confidential Information Text segmentation in practice
10/29/2010 70 Confidential Information
10/29/2010 71 Confidential Information
10/29/2010 72 Confidential Information
10/29/2010 73 Confidential Information
10/29/2010 74 Confidential Information Transfer from research to industry SEMANTICS SUCCESS STORY 3
Confidential Information Brief History of Semantic Fields and Affect Analysis
active . spry young slow Deese (1965) semantic axes fast free association experiments old “big-small”, “hot-cold”, etc. passive
Berlin & Kay (1969) Lehrer (1974) semantic fields Levinson (1983) linguistic scale set of alternate or contrastive expressions arranged on an axis by degree of semantic strength red orange yellow green blue red amber tan yellow blue
10/29/2010 76 Confidential Information Affect Lexicons Lasswell Value Dictionary (1969) Eight dimensions: WEALTH, POWER, RECTITUDE, RESPECT, ENLIGHTENMENT, SKILL, AFFECTION, AND WELLBEING with positive or negative orientation e.g., admire: RESPECT (positive) General Inquirer dictionary (Stone, et al. 1965) 9051 headwords 1,915 positive and 2,291 negative words (Pos/Neg) also labels: Active, Passive, ... , Pleasure, Pain, …Human, Animate, …, Region, Route,…, Fetch, Stay, .. http://www.wjh.harvard.edu/~inquirer/inqdict.txt
10/29/2010 77 Confidential Information Clairvoyance Affect Lexicon
"arrogance" sn "superiority" 0.7 0.9 .. "gleeful" adj “happiness” 0.7 0.6 "gleeful" adj “excitement” 0.3 0.6 … 84 pair affect classes (positive/negative)
Confidential Information Finding Emotive Words using Patterns
appear almost feel extremely is so X look very seem too
“appearing so …” “feeling extremely…” “was too…” “looks almost….”
105 patterns with conjugated word forms
10/29/2010 79 Confidential Information Most Common Words from All Patterns
Using snippets from all 105 patterns: 8957 good 2906 important 2506 happy 2455 small 2024 bad 1976 easy 1951 far 1745 difficult 1697 hard 1563 pleased … in all 15111 different, inflected word
10/29/2010 80 Confidential Information SO-PMI of emotive pattern words 37.5 knowlegeable 33.6 tailormade 32.9 eyecatching 29.0 huggable 26.2 surefooted 24.6 timesaving 22.9 personable 21.9 welldone …. -17.1 underdressed -17.5 uncreative -18.1 disapproving -18.5 meanspirited -18.6 unwatchable -22.7 discombobulated
Confidential Information Mood of a city
URBANIZER.COM
Confidential Information http://labs.exalead.com/
10/29/2010 83 Confidential Information
10/29/2010 84 Confidential Information
10/29/2010 85 Confidential Information
10/29/2010 86 Confidential Information
10/29/2010 87 Confidential Information
10/29/2010 88 Confidential Information
10/29/2010 89 Confidential Information
10/29/2010 90 Confidential Information Extraction + Abstraction + Abstraction
"…decor is not particularly elegant…" ambiance ↓ 70 Business Lunch ≈ service ε [0,40] V cuisine ε [45,100] V ambiance ε [45,100]
10/29/2010 91 Confidential Information URBANIZER Domain modeling
Service Domain vocabulary
Cuisine Sentiment for domain
Syntactic patterns
Ambiance
Confidential Information Domain analysis, same as Affect Analysis
Deese (1965) semantic axes free association experiments quick “big-small”, “hot-cold”, etc. homey upscale
active
young slow casual
fast refined old passive leisurely
…spent {two, three} hours…
10/29/2010 93 Confidential Information
10/29/2010 94 Confidential Information
10/29/2010 95 Confidential Information
10/29/2010 96 Confidential Information Another example of domain modeling
10/29/2010 97 Confidential Information
10/29/2010 98 Confidential Information
10/29/2010 99 Confidential Information Foreseeable, near term FUTURE SUCCESSES
Confidential Information Dbpedia, RDF triples
10/29/2010 101 Confidential Information Wikipedia InfoBoxes
10/29/2010 102 Confidential Information Implicit tables
103 10/29/2010 103 Confidential Information Linked Open Data Cloud
2008 2008 2007 2008 2009 2008 2009
10/29/2010 104 Confidential Information LOD2 turns linked data into knowledge
interlink
SILK DXX Engine
create poolparty merge
OntoWiki SemMF
2007 2008 Sigma 20082008 WiQA2008 2009
ORE Virtouso verify classify DL-Learner MonetDB
Sindice
enrich
105 Confidential Information Challenge : searching ad hoc, idiosyncratic spaces
http://www.imrc.kist.re.kr/wiki/LifeLog
10/29/2010 106 Confidential Information Microsoft SenseCam
http://research.microsoft.com/en- us/um/cambridge/projects/sensecam/
http://justgetthere.us/blog/plugin/tag/cameras
10/29/2010 107 Confidential Information http://gadgets.infoniac.com/head-worn- camera-for-special-events.html
10/29/2010 108 Confidential Information Bike My Way
10/29/2010 109 Confidential Information Insta Mapper
10/29/2010 110 Confidential Information Challenge : searching ad hoc, idiosyncratic spaces
Hui Yang and Jamie Callan. A Metric-based http://www.imrc.kist.re.kr/wiki/LifeLog Framework for Automatic Taxonomy Induction. In Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics (ACL2009), Singapore. Aug 2-7, 2009
10/29/2010 111 Confidential Information Conclusions
Semantics What is this? Are these the same? Who does what to whom? Search Engines now handle rich semantics Search based applications, between DB and IR Semantic Research Successful transfer to industry Near Future Integration crowd source knowledge Wikipedia Linked open data
10/29/2010 112 Confidential Information Thank you
10/29/2010 113 Confidential Information