Text Mining: Creating Semantics in the Real World
Prof. Dr. Stefan Wrobel, Dr. Gerd Paass, Andreas Schäfer, Dr. Stefan Eickeler
Fraunhofer IAIS: Intelligent Analysis and Information Systems
• 250 people: scientists, project engineers, technical and administrative staff, students
• Located on Fraunhofer Campus Schloss Birlinghoven/Bonn
• Joint research groups and cooperation with
Core research areas:
• Machine learning / data mining
• Multimedia pattern recognition
• Visual analytics
• Process intelligence
• Adaptive robotics
• Cooperating objects
Directors: T. Christaller, S. Wrobel (exec.)
Brainyquote.com
Where is all the knowledge we lost with information?
T. S. Eliot
Thomas Stearns Eliot, OM (September 26, 1888 – January 4, 1965) US-born British poet, dramatist and literary critic
Outline
Why is Text Mining cool?
• Drowning in Data: The Challenge of Meaning
• Text Mining: Creating Meaning from Large Collections
• Text Mining Markets
What can we do with Text Mining in the Real World? Some case studies
• Document classification: eBay, antiPhish
• Retrieval and Relation Extraction: NZZ, THESEUS CONTENTUS, Fraunhofer Web
• Structuring and Monitoring: EmotionRadar
Conclusion
Internet Trends
Ubiquitous intelligent systems
Convergence
Users as producers
Users as producers
• Web 2.0, Social Web, Crowdsourcing
• Exploding growth of content
• Media providers transform from content to confidence providers, competing with social communities
• Users expect full interactivity and control
Quality control, confidence, choice and searching are becoming central
Drowning in Data …
[Figure: data volumes growing from megabytes through gigabytes, terabytes and petabytes to exabytes. Size of the digital universe: 161 exabytes in 2007, 998 exabytes in 2010 [IDC]]
The data iceberg
• 20%: database tables, Excel spreadsheets, other data with fixed structure
• 80%: email, notes, Word documents, PDF, PowerPoint, other text, images, video, audio
Drowning in Unstructured Data … and need meaning!
[Figure: data volumes growing from megabytes through gigabytes, terabytes and petabytes to exabytes]
Semantics: The need for meaning
• Knowledge will be the driving force of business excellence
• Quality of services is increasingly distinguished by the amount of knowledge they can use
• Enormous savings if existing unstructured documents could be used without needing to structure them first (cf. failures of knowledge management!)
The challenge of semantics
[Figure: a very large set of (electronic) documents is turned into an intelligent service, either by manual structuring or by intelligent data and text mining technologies]
Text Mining is cool, since … the entire world works for us!
• 215,675,903 websites (Netcraft, March 2009)
• 19,200,000,000 webpages (Yahoo, Aug 2005)
• 29,700,000,000 webpages (boutell.com, Jan 2007)
• Google index: 26 million (1998), 1 billion (2000), 1 trillion (1,000,000,000,000) unique URLs (July 25, 2008)
=> perhaps quadrillions of words (images, videos) …
And most of them put together meaningfully (somewhat)!
=> smart algorithms can build on that.
The basic idea
If two words occur frequently in the same context (page, paragraph, sentence, part-of-speech), then there must be some semantic relation between them.
Add in a lot of statistics, algorithms, intelligence … AND YOU CAN DO A LOT!
Raw material (web, documents, …) + correlations and statistics + intelligent data mining algorithms: you can create (a bit of) semantics!
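The co-occurrence idea can be made concrete with a small sketch. It is not the method used in the projects described here; it assumes that a "context" is one short text, that words are split on whitespace, and that pointwise mutual information (PMI) is an acceptable stand-in for "occur frequently together". The function name and the toy contexts are invented for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_pmi(contexts):
    """Return PMI (log base 2) for every word pair sharing at least one context."""
    word_counts = Counter()   # number of contexts containing each word
    pair_counts = Counter()   # number of contexts containing both words
    for context in contexts:
        words = set(context.lower().split())
        word_counts.update(words)
        pair_counts.update(frozenset(p) for p in combinations(sorted(words), 2))

    n = len(contexts)
    pmi = {}
    for pair, joint in pair_counts.items():
        a, b = tuple(pair)
        # PMI = log2( P(a, b) / (P(a) * P(b)) ), with probabilities as context fractions
        pmi[pair] = math.log2(joint * n / (word_counts[a] * word_counts[b]))
    return pmi


# Toy contexts (invented), loosely inspired by sports news terms
contexts = [
    "bayern wins bundesliga match",
    "bundesliga trainer praises bayern",
    "kiel wins handball match",
    "handball fans cheer kiel",
]
scores = cooccurrence_pmi(contexts)
print(scores[frozenset({"bayern", "bundesliga"})])
```

Pairs that never share a context get no score at all, while pairs that co-occur more often than chance predicts get a positive one. On real collections the counts would come from millions of contexts rather than four.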
Automated
Unsupervised hierarchical term clustering: dpa data
[Figure: cluster tree over German news terms. Top cluster "Team sport": Spiel, Bundesliga, Team, Trainer, Sieg, Mannschaft, Niederlage, Samstag, Platz, Saison, Erfolg, Punkte, Pokal, Nationalspieler, … Sub-clusters: Football, Not football, Basketball + Boxing, Handball, German League, European League]
Text Mining Market Size
“The text mining market has roughly $50-100 million annual product revenue, and is growing at roughly 40-60% annually.” [Monash 06, texttechnologies.com]
Sounds small … But then …
• Several research sites devoted to the technology …
So the real market must be somewhere else …
The Text Mining Market … is called “Text Analytics”
Primary areas:
• Web search, site search
• Knowledge management, enterprise portals
• Information collection, extraction, harvesting
• Email handling, security, spam and phishing filtering
• Market research
• Online advertising
• Specialized markets: litigation, juridical; patent search
[cf. Monash/2008]
Application Field Market Research: Germany 1.6 billion, growing
Both ad-hoc studies and panels can benefit from text mining
http://www.adm-ev.de/zahlen.html
Enterprise Search as a text mining market
More than 1.2 billion $ in 2010
Year                           2006  2007  2008  2009  2010
Software revenue (million $)    717   860   989  1108  1219
[Gartner 2008]
Companies
Outline
Why is Text Mining cool?
• Drowning in Data: The Challenge of Meaning
• Text Mining: Creating Meaning from Large Collections
• Text Mining Markets
What can we do with Text Mining in the Real World? Some case studies
• Document classification: eBay, antiPhish
• Retrieval and Relation Extraction: NZZ, THESEUS CONTENTUS, Wikinger, Fraunhofer Web
• Structuring and Monitoring: Semantic Map, EmotionRadar
Conclusion
Text Mining Tasks
Document classification, scoring and/or ranking, isolated retrieval
• Assign a class, score or rank to an entire document
In-collection, linked retrieval and organization
• Find documents in a collection
• Link results to other results
Information and relation extraction
• Extract pieces of information, fill particular relations
Overview and monitoring of collections
• Give a summary impression of the information in a collection or source
Motivation: Spotting Faked Offers at Internet Auctions
Techniques to sell fakes
• Put faked products on an internet auction platform, e.g. eBay
• Describe the product as forged or falsified, e.g. “very similar to XXX”
Aspects
• Infringement of registered trade marks
• Violation of patents
• Enormous sales volume
Counter Measures
Use trainable classifiers
• Compile a training set of genuine and faked internet auction offers
• Train classifiers to detect these classes; use text, format information, etc. as features
• Use different classifiers for different brands / products
• Apply to new internet auction offers
• Ban faked offers from the auction
• Update classifiers to new techniques
[Figure: fakes and originals as points in a feature space (x1, x2), separated by a hyperplane]
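As an illustration of a trainable classifier that separates fakes from originals with a hyperplane, here is a minimal perceptron over bag-of-words features. The slides do not say which classifier was used, so the perceptron is a swapped-in technique chosen for brevity. The training offers, brand names and function names are invented.

```python
from collections import defaultdict


def featurize(text):
    """Bag-of-words features: token -> count."""
    feats = defaultdict(float)
    for token in text.lower().split():
        feats[token] += 1.0
    return feats


def train_perceptron(examples, epochs=10):
    """examples: list of (text, label), label +1 for fake and -1 for genuine."""
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for text, label in examples:
            feats = featurize(text)
            score = bias + sum(weights[f] * v for f, v in feats.items())
            if label * score <= 0:  # misclassified: shift the hyperplane
                for f, v in feats.items():
                    weights[f] += label * v
                bias += label
    return weights, bias


def predict(weights, bias, text):
    """Return +1 (fake) or -1 (genuine) for a new offer text."""
    feats = featurize(text)
    score = bias + sum(weights.get(f, 0.0) * v for f, v in feats.items())
    return 1 if score > 0 else -1


# Invented training offers; "brandx" and "brandy" are placeholder brand names
training = [
    ("watch very similar to brandx", 1),
    ("replica bag similar to brandy", 1),
    ("original brandx watch with receipt", -1),
    ("genuine brandy bag original box", -1),
]
weights, bias = train_perceptron(training)
print(predict(weights, bias, "shoes very similar to brandz"))
```

A production filter would add format features, far more training data and regular retraining, as the list above suggests.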
Results
• A classifier was developed and tested
• Similar techniques as for spam detection
• Good results: F-value well above 90%
The German Federal Court of Justice: internet auction providers have to filter the auctions using appropriate methods to detect faked offers.
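For readers unfamiliar with the F-value, the sketch below computes it from precision and recall. It assumes the reported figure is the usual F-measure (harmonic mean of precision and recall for beta = 1), which the slide does not spell out. The label names and the sample predictions are invented.

```python
def f_measure(true_labels, predicted_labels, positive="fake", beta=1.0):
    """F-measure of the positive class; beta=1 gives the harmonic mean (F1)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Invented evaluation data: 3 fakes and 2 genuine offers
truth = ["fake", "fake", "fake", "genuine", "genuine"]
guess = ["fake", "fake", "genuine", "fake", "genuine"]
f1 = f_measure(truth, guess)
print(round(f1, 3))
```

Here precision and recall are both 2/3, so the F-value is 2/3 as well. A value above 90% requires both to be high at the same time.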
Motivation: Phishing
E-mail fraud
• Send an official-looking email
• Include a web link or form
• Ask for confidential information, e.g. password, account details
• Attacker uses the information to withdraw money, enter computer systems, etc.
AntiPhish
Project AntiPhish
• Develop content-based phishing filters
• Include other clues like whitelists
• Trainable and adaptive filters
  → adapt to new phishing attacks
  → anticipate attacks
Consortium
• Fraunhofer IAIS (DE)
• Symantec (GB, IRL)
• Tiscali (IT)
• Nortel (FR)
• K.U. Leuven (BE)
Phishing: Defense Techniques
Workflow
• Obtain training data from the email stream
• Extract features
• Estimate and update classifiers and filters
• Integrate new filters into the email filtering framework
• Deploy at internet service provider
• Deploy at central wireless packet switch
Approach: Multiple feature sets
Basic Features
Dynamic Markov Chains
DMC Details
Latent Topic Models
Class-Specific Topic Models
Feature Processing and Selection
Test Corpora
Overall Result
The THESEUS research program in Germany [theseus-program.com]
Partners:
• Deutsche Nationalbibliothek
• Deutsche Thomson OHG (DTO)
• Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI GmbH)
• empolis GmbH
• Festo AG
• Fraunhofer-Gesellschaft (7 institutes)
• Friedrich-Alexander-Universität Erlangen
• FZI Forschungszentrum Informatik
• Institut für Rundfunktechnik GmbH (IRT)
• intelligent views gmbh
• Ludwig-Maximilians-Universität (LMU)
• moresophy GmbH
• LYCOS Europe
• mufin GmbH
• ontoprise GmbH
• SAP AG
• Siemens AG
• Technische Universität Darmstadt
• Technische Universität Dresden
• Technische Universität München
• Universität Karlsruhe (TH)
• Verband Deutscher Maschinen- und Anlagenbau e.V. (VDMA)
[Figure: two axes, from syntactic to semantic and from single author to multiple authors; source: Wess/07]
The THESEUS use cases
• ALEXANDRIA: The Internet Knowledge Platform
• CONTENTUS: Next Generation Digital Libraries for saving our cultural heritage
• MEDICO: Semantic Image Search in Medicine
• ORDO: Personal Ordered Knowledge Management
• PROCESSUS: Semantic Business Processes
• TEXO: Business Webs in the Internet of Things
CONTENTUS: Next Generation Digital Libraries for saving our cultural heritage
Publishers, libraries, broadcasters, etc. are interested in using, distributing and selling their archive content.
In analog form, archives are threatened by deterioration, are not linked, are difficult to use, and are huge.
Goals:
• Digitization, optimization of quality, availability
• Indexing, semantic and social linking and intelligent search, communities
• Rescue of cultural heritage, preventing losses from deterioration
Project duration: until 2012
Showcases: Semantic Digital Libraries
• 225 years of Neue Zürcher Zeitung (NZZ)
• GDR music archive, German National Library
CONTENTUS Workflow
Shell
• Data generation: registered users / communities, algorithms with acceptable quality
• Quality control: self control (see Wikipedia)
Mantle: controlled quality
• Data generation: automatically generated through high-quality algorithms
• Quality control: training and improvement of algorithms
Core: guaranteed quality
• Data generation and correction: libraries, museums, universities, experts, etc.
• Quality control: schooling, rules, advisory boards
• Highest stability, highest persistence
Workflow steps:
1 Digitization
2 Automated optimization of quality
3 Automated generation of metadata
4 Semantic linking of metadata
5 Open knowledge networks – user augmentation
6 Semantic access to knowledge and content
1 Digitization
High-throughput methods
• Modern book scanners: thousands of pages per day
• Almost fully automatic
Data volumes: 70 TB (NZZ), petabytes to exabytes (DNB)
2 Automated optimization of quality
• Development of intelligent algorithms for optimizing print, images, sound and movies
• Automated generation of presentation formats
Examples: margin removal; sharpening, straightening; denoising, declicking; scratch removal
3 Automated generation of metadata
• Structural and content-related metadata
• OCR, speech, music, video recognition
• Structure analysis and type recognition
• Linking with current norms and standards
4 Semantic linking of contents
• Link-up with related media
• Incorporation of external knowledge sources (metadata systems, Wikipedia, …)
• Disambiguation, classification, relation extraction
Determining meaning
The words of natural language are often ambiguous.
Example: „Über Kohl höhnte Strauß: ‚Er wird nie Kanzler werden.‘“ (Die Zeit, 18.7.08). In English: Strauß sneered about Kohl: "He will never become chancellor." As common nouns, "Kohl" also means cabbage and "Strauß" also means ostrich or bouquet.
» For each word / term, find a meaning
» Subproblems:
  » Part-of-speech recognition: noun, verb, adjective, …
  » Named entity recognition: people, places, organizations, …
  » Assignment of concepts: plant, bird, politician, …
Named entity recognition
Example: „Über Kohl höhnte Strauß: ‚Er wird nie Kanzler werden.‘“
» Analyze the surroundings of words
  » “Kohl” in a sentence with “Kanzler” (chancellor) → probably “person”
  » “Kohl” in a sentence with “kochen” (to cook) → probably “vegetable”
» Statistical model for person names
  » Word + surroundings → word is a person
  » Training using annotated sentences
» Automatic recognition of words / phrases that represent people
Conditional Random Field Model
» Observed words X_1, …, X_n
» Category of words Y_1, …, Y_n
\[
p(Y_1, \ldots, Y_n \mid X) = \frac{1}{Z(X, \lambda, \mu)} \exp\left( \sum_{t=1}^{n} \left[ \sum_{k=1}^{N_2} \lambda_k \, f_{k,C}(Y_{t-1}, Y_t, X) \right] \right)
\]
» Properties f may depend on two subsequent states and on all observed words
Example
» Property f_10293 has value 1 if Y_{t-1} = “PER” and Y_t = “PER” and X_t has value “Müller”. Otherwise its value is 0.
[Lafferty, McCallum, Pereira 01]
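To make the CRF formula tangible, the following sketch scores label sequences with weighted binary features and normalizes by brute force over all sequences. It is a toy under stated assumptions: the first feature mirrors f_10293 from the example, while the other two features, all weights, the two-label set and the "START" state for Y_0 are invented. Real CRF implementations compute Z with dynamic programming instead of enumeration.

```python
import math
from itertools import product

LABELS = ["PER", "O"]  # assumed label set: person / other


def f_per_per_mueller(y_prev, y_t, x, t):
    """Feature in the spirit of f_10293: PER followed by PER on the word 'Müller'."""
    return 1.0 if y_prev == "PER" and y_t == "PER" and x[t] == "Müller" else 0.0


def f_capitalized_per(y_prev, y_t, x, t):
    """Hypothetical feature: a capitalized word labeled PER."""
    return 1.0 if y_t == "PER" and x[t][:1].isupper() else 0.0


def f_lowercase_other(y_prev, y_t, x, t):
    """Hypothetical feature: a lowercase word labeled O."""
    return 1.0 if y_t == "O" and x[t][:1].islower() else 0.0


# (lambda_k, f_k) pairs; the weights are made up, not trained
FEATURES = [(1.5, f_per_per_mueller), (0.8, f_capitalized_per), (0.5, f_lowercase_other)]


def score(y, x):
    """Inner double sum of the CRF formula: sum over t and k of lambda_k * f_k."""
    total = 0.0
    for t in range(len(x)):
        y_prev = y[t - 1] if t > 0 else "START"
        for lam, f in FEATURES:
            total += lam * f(y_prev, y[t], x, t)
    return total


def probability(y, x):
    """p(Y | X) = exp(score) / Z, with Z summed over all label sequences."""
    z = sum(math.exp(score(c, x)) for c in product(LABELS, repeat=len(x)))
    return math.exp(score(y, x)) / z


x = ["Gerd", "Müller", "spielt"]
candidates = list(product(LABELS, repeat=len(x)))
best = max(candidates, key=lambda y: probability(y, x))
print(best, round(probability(best, x), 3))
```

With these weights the sequence (PER, PER, O) gets the highest probability, because the transition feature rewards labeling "Müller" as a person when the preceding word is also a person.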
Modeling of names: features for a CRF model
» Title FirstName Connective LastName
» Properties, recorded for the words x_{t-2}, x_{t-1}, x_t, x_{t+1}, x_{t+2}
  » Word, stem, part of speech
  » Prefix, suffix (3 letters)
  » Shape properties: capital character at the beginning, only numbers, contains numbers, mix of capital / non-capital characters, contains hyphens
  » LDA topic model class
  » Contained in list of first names, contained in list of last names
(Work in progress)
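A sketch of how such window features could be extracted for one token position is shown below. It covers only the properties that need no external tools (word, prefix, suffix, shape, name lists). Stem, part of speech and the LDA topic class are left out because they require a stemmer, a tagger and a trained topic model. The function names, feature keys and the reading of "mix capital / no capital" are assumptions.

```python
def word_shape_features(word):
    """Shape properties of a single word."""
    return {
        "init_cap": word[:1].isupper(),
        "all_digits": word.isdigit(),
        "has_digit": any(c.isdigit() for c in word),
        # one possible reading of "mix capital / no capital"
        "mixed_case": any(c.isupper() for c in word[1:]) and any(c.islower() for c in word),
        "has_hyphen": "-" in word,
    }


def token_features(tokens, t, first_names=frozenset(), last_names=frozenset(), window=2):
    """Features for position t, recorded for the words at t-2 .. t+2."""
    feats = {}
    for offset in range(-window, window + 1):
        i = t + offset
        if not 0 <= i < len(tokens):
            continue  # positions outside the sentence contribute nothing
        word = tokens[i]
        prefix = f"{offset:+d}:"
        feats[prefix + "word"] = word.lower()
        feats[prefix + "prefix3"] = word[:3]
        feats[prefix + "suffix3"] = word[-3:]
        for name, value in word_shape_features(word).items():
            feats[prefix + name] = value
        feats[prefix + "in_first_names"] = word in first_names
        feats[prefix + "in_last_names"] = word in last_names
    return feats


tokens = ["Dr.", "Helmut", "Kohl", "sagte"]
feats = token_features(tokens, 2, first_names={"Helmut"}, last_names={"Kohl"})
print(feats["+0:word"], feats["-1:in_first_names"])
```

Each key carries the relative position as a prefix, so the same property at x_{t-1} and at x_t becomes two distinct features, which is what lets the model learn patterns such as "first name followed by last name".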
Identity of names
» There are several people named “Helmut Kohl”
  » Helmut Kohl, born 1930, chancellor
  » Helmut Kohl, born 1943, referee
  » Helmut Kohl, textile merchant
  » … 99 further hits in the telephone book
» Identification in Wikipedia
  » Compare the words of the Wikipedia article with the text in which “Helmut Kohl” was found
  » Similar words → similar person
» Automated assignment: Person name -> Wikipedia article
Assignment of similarity from the environment
Simple algorithm for assigning people to Wikipedia articles
» Occurrence in text: Helmut Kohl
  » Description using characteristic terms → u
» Wikipedia article on Kohl
  » Description by characteristic terms → w
» Comparison using a distance metric, for example cosine distance d(w,u)
» Implemented in a prototype
» Further approach: assignment as a classification task f(w,u) = 0 or 1 (master's thesis)
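The comparison step can be sketched as follows, with u built from the text around the mention and w from each candidate article. The sketch assumes raw term counts as the "characteristic terms" (a real system would more likely use weighted terms such as TF-IDF). The candidate article snippets are invented placeholders, not Wikipedia text. Cosine distance is 1 minus the similarity computed here.

```python
import math
from collections import Counter


def term_vector(text):
    """Describe a text by its term counts."""
    return Counter(text.lower().split())


def cosine_similarity(u, w):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(u[t] * w[t] for t in u.keys() & w.keys())
    norm_u = math.sqrt(sum(v * v for v in u.values()))
    norm_w = math.sqrt(sum(v * v for v in w.values()))
    if norm_u == 0 or norm_w == 0:
        return 0.0
    return dot / (norm_u * norm_w)


def best_article(mention_context, candidates):
    """Pick the candidate article whose terms are closest to the mention context."""
    u = term_vector(mention_context)
    return max(candidates, key=lambda name: cosine_similarity(u, term_vector(candidates[name])))


# Invented stand-ins for article texts
candidates = {
    "Helmut Kohl (politician)": "chancellor cdu politician germany government",
    "Helmut Kohl (referee)": "referee football match austria whistle",
}
context = "the chancellor met the cdu party in germany"
print(best_article(context, candidates))
```

The classification variant f(w,u) mentioned above would replace the `max` over similarities with a trained decision per (mention, article) pair.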
Semantic Interpretation
Semantic categories currently assigned in the CONTENTUS prototype:
» Names: People, Organizations, points in time, places, …
» Assignment to Wikipedia articles
Under development:
» Hypernyms in ontology (GermaNet): nouns, verbs → supersenses
» Cluster of words with similar meaning: Topics
» Relations between names / concepts: “Bertolt Brecht” studied in “München”
» Classes of documents: Politics, Economy, …
Knowledge store
» Further information for entities that were found in the text
  » Dates, publications
  » Number of inhabitants, topological relationships
» Social networks
  » Who knows whom?
  » Who was at the same place at the same time?
  » Who influenced whom?
Example entries:
Helmut Kohl
  Date of birth: 30.4.1930
  Place of birth: Ludwigshafen
  Spouse: Hannelore K.
  Education: historian
  Religion: Catholic
  Party: CDU
Berlin
  Area: 891 km²
  Population: 3,420,786
  GDP: €83.6 billion
  Elevation: 34–115 m
  Latitude: 52° 31′ N
  Longitude: 13° 25′ E
Knowledge store: Format
» Factual knowledge as logical expressions: »
Buchmesse Frankfurt, 10 October 2008
Linking of knowledge
» Semantic integration of data and information from different sources
  » DBpedia: an interpreted form of Wikipedia
  » GeoNames Ontology: all the places in the world
  » Catalogue of the German National Library: books and publications
→ Triplestore
» Based on open standards
  » W3C Semantic Web Stack
  » RDF, RDFS, OWL, SPARQL
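The triplestore idea, facts stored as (subject, predicate, object) and queried by pattern, can be illustrated with a few lines of plain Python. This is only a conceptual toy: it is neither RDF nor SPARQL, and the class, the identifiers and the predicate names are invented rather than taken from the DBpedia or GeoNames vocabularies. The facts themselves come from the knowledge store examples in this talk.

```python
class TripleStore:
    """Minimal in-memory store of (subject, predicate, object) string triples."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return all triples matching the pattern; None acts as a wildcard."""
        return sorted(
            t for t in self.triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )


store = TripleStore()
# Hypothetical predicate names; values as shown in the knowledge store examples
store.add("Helmut_Kohl", "birthPlace", "Ludwigshafen")
store.add("Helmut_Kohl", "party", "CDU")
store.add("Berlin", "areaKm2", "891")

print(store.match(subject="Helmut_Kohl", predicate="party"))
```

Because every source is reduced to the same three-part shape, facts from different sources about the same entity can be merged simply by adding them to one store, which is the point of the semantic integration described above.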
Knowledge sources
» DBpedia (www.dbpedia.org)
» GeoNames Ontology
  » Already in RDF/OWL format
» Person reference database PND
» Topic reference database SWD
» Online catalogue OPAC
  » Partial export to RDF
» Found entities in the text
  » Identification using Wikipedia
  » Linking with DBpedia data per link
5 Open knowledge networks – user augmentation
Further annotations from experts and users
• Completions, corrections
• Cooperation with the ALEXANDRIA project in THESEUS
• Suitable measures to assure high quality of data
The Multiple Shell Model (cf. Wikinger [Bröcker et al. 08])
Outer shell: open knowledge network
• Data generation: registered users / communities, algorithms
• Quality control: self control (cf. Wikipedia)
Mantle: controlled quality
• Data generation: algorithms of high quality
• Quality control: training and improvement of algorithms
Core: assured quality
• Data generation and correction: libraries, universities, museums, groups of experts, etc.
• Quality control: fixed rules, committees, maximal stability and persistence
6 Semantic access to knowledge and content
The knowledge network
• Digital, multimedia data
• Content is semantically linked
• Is enriched from external sources and user groups
Access
• Structure by ontology
• Content relationships become clear
• “Knowledge exploration” is possible
The Contentus Demonstrator
Example of a classified webpage
Match with the document model = 80%
Classified as = Projekte (projects)
Workflow for semantic processing of documents
[Figure: workflow with the stages Crawl, Documents, Pre-processing, Categorization, Entity Recognition, Extracted Metadata, Extracted regions, Search index, Knowledge Store and Search, annotated "Using the document model" and "Using the structure model". Stamp on the slide: "Watch out for the relaunch here in summer 2009!!"]
Emotion Radar
Which issues are important to people, where are the emotional discussions in blogs and discussion forums?
Goal: market research, …
Emotion Radar example: Looking at two large automobile companies A and B*
Selection and crawling of discussion forums
• Criteria: search engine ranking, size, activity
• Period used: January 2008 to January 2009
• Storage: 2 GB of data, crawled over a period of 7 days
Structure analysis of the discussion forums:
• Manufacturer A:
– Number of postings: 188,487
– Monthly number of new postings: approx. 1,500
– Number of threads: 21,613
– Number of authors: 15,445
• Manufacturer B:
– Number of postings: 406,814
– Monthly number of new postings: approx. 2,700
– Number of threads: 38,758
– Number of authors: 21,919
* anonymized
Case study: Internet postings related to the introduction of a new car model in Germany 2008
Cars are delivered
Manufacturer publishes further product features/start of sales
Manufacturer publishes first pictures
Partially automated emotion analysis shows a mood swing from positive to negative (“in love” → “angry”)
[Figure: timeline with the events "Manufacturer publishes first pictures", "Manufacturer publishes further product features / start of sales" and "Cars are delivered". Emotion labels along the timeline: “in love”, “hoping”, “interested”, “turned off”, “surprised”, “proud”, “angry”]
Topic recognition shows a change of product features that are discussed, from design to gasoline consumption
[Figure: the same timeline (first pictures, deliveries) annotated with discussed topics: vehicle length, design; gearshift, steering wheel, on-board computer, audio system; sunroof, efficiency, technology; "giant fish mouth"; chrome trim, perceived value; neighbors; prices, car configurator and price list; consumption, folding key, audio; test drives; consumption. Sample posting: „... kaum zu glauben, was dieses kleine Auto an Benzin verbraucht!“ (Kalle83), in English: "... hard to believe how much gasoline this little car uses!"]
How can manufacturers use these text mining results?
• Short term: early recognition of relevant topics (e.g. consumption) and preparation of an appropriate response (fuel-saving training, fuel-efficient tires, proactive communication)
• Long term: continuous monitoring of emotional topics
Summary
Text Mining is cool!
• Drowning in Data: The Challenge of Meaning
• Text Mining: Creating Meaning from Large Collections
• Text Mining Markets
We can do a lot with Text Mining in the Real World!
• Document classification: eBay, antiPhish
• Retrieval and Relation Extraction: NZZ, THESEUS CONTENTUS, Fraunhofer Web
• Structuring and Monitoring: EmotionRadar