Text Mining: Creating Semantics in the Real World

Prof. Dr. Stefan Wrobel, Dr. Gerd Paass, Andreas Schäfer, Dr. Stefan Eickeler

Fraunhofer IAIS: Intelligent Analysis and Information Systems

• 250 people: scientists, project engineers, technical and administrative staff, students
• Located on Fraunhofer Campus Schloss Birlinghoven/Bonn
• Joint groups and cooperation with

Core research areas:
• /
• Multimedia
• Visual
• Process Intelligence
• Adaptive robotics
• Cooperating objects

Directors: T. Christaller, S. Wrobel (exec.)

Brainyquote.com

Where is all the knowledge we lost with information?

T. S. Eliot

Thomas Stearns Eliot, OM (September 26, 1888 – January 4, 1965) US-born British poet, dramatist and literary critic

Outline

• Why is Text Mining cool?
  - Drowning in Data: The Challenge of Meaning
  - Text Mining: Creating Meaning from Large Collections
  - Text Mining Markets
• What can we do with Text Mining in the Real World? Some case studies
  - Document classification: eBay, antiPhish
  - Retrieval and Relation Extraction: NZZ, THESEUS CONTENTUS, Fraunhofer Web
  - Structuring and Monitoring: EmotionRadar
• Conclusion

Internet Trends

Ubiquitous intelligent systems

Convergence

Users as producers

Users as producers

• Web 2.0, Social Web, Crowdsourcing
• Exploding growth of content
• Media providers transform from content to confidence providers, competing with social communities
• Users expect full interactivity and control
• Quality control, confidence, choice and searching are becoming central

Drowning in Data …

Megabytes, Gigabytes, Terabytes, Petabytes, Exabytes

Size of digital universe: 2007: 161 Exabyte; 2010: 998 Exabyte [IDC]

The data iceberg

20%:
• tables
• Excel spreadsheets
• Other data with fixed structure

80%:
• , Notes
• documents
• PDF, PowerPoint
• Other text
• Images
• Video, audio

Drowning in Data … and need meaning!

Megabytes, Gigabytes, Terabytes, Petabytes, Exabytes

Semantics: The need for meaning

• Knowledge will be the driving force of business excellence
• Quality of services is increasingly distinguished by the amount of knowledge they can use
• Enormous savings if existing unstructured documents could be used
• Without needing to structure them first (cf. failures of knowledge management!)

The challenge of semantics

[Diagram: Very large set of (electronic) documents; Manual Structuring; Intelligent Service; intelligent data and text mining technologies]

Text Mining is cool, since … the entire world works for us!

• 215,675,903 websites (Netcraft, March 2009)
• 19 200 000 000 webpages (Yahoo, Aug 2005)
• 29 700 000 000 webpages (boutell.com, Jan 2007)
• Google index: 26 million (1998), 1 billion (2000), 1 trillion (1,000,000,000,000) unique URLs (25. 7. 2008)
• => perhaps quadrillions of (images, videos) …
• And most of them put together meaningfully (somewhat)!
• => smart can build on that.

The basic idea

• If two words occur frequently in the same context (page, paragraph, sentence, part-of-speech)
• Then there must be some semantic relation between them

Add in a lot of , algorithms, intelligence… AND YOU CAN DO A LOT!

raw material (web, documents, …)
+ correlations and statistics
+ intelligent data mining algorithms
=> You can create (a bit of) semantics!
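The co-occurrence idea above can be sketched in a few lines. This is only an illustration, assuming sentences as the context unit and pointwise mutual information (PMI) as the association score; the slides name neither, and the toy corpus and the function name `cooccurrence_pmi` are invented for the example.

```python
import math
from collections import Counter
from itertools import combinations

def cooccurrence_pmi(sentences):
    """Return PMI scores for word pairs that share at least one sentence."""
    word_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        words = set(sentence.lower().split())   # count each word once per sentence
        word_counts.update(words)
        pair_counts.update(combinations(sorted(words), 2))
    n = len(sentences)
    scores = {}
    for (a, b), joint in pair_counts.items():
        # PMI = log( P(a,b) / (P(a) * P(b)) ), probabilities estimated per sentence
        scores[(a, b)] = math.log((joint / n) / ((word_counts[a] / n) * (word_counts[b] / n)))
    return scores

# Toy corpus, invented for illustration
corpus = [
    "bayern wins the match",
    "bayern loses the match",
    "the chancellor gives a speech",
    "the chancellor wins the election",
]
pmi = cooccurrence_pmi(corpus)
print(pmi[("bayern", "match")])   # positive: the two words tend to appear together
```

A positive score suggests that two words share contexts more often than chance would predict; a word such as "the", which occurs everywhere, scores zero against "match" in this toy corpus.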

Automated Clustering [Paass 07]

100 000 documents, for example a dpa news report (in German): "Bayern München verlor Tabellenführung und Elber beim 1:1 in Wolfsburg" (Bayern München lost the league lead, and Elber, in the 1:1 draw at Wolfsburg) …

Unsupervised hierarchical term Clustering: dpa data

[Figure: hierarchy of term clusters; German terms from the dpa corpus]

Team sport: Spiel, Bundesliga, Team, Trainer, Sieg, Mannschaft, Niederlage, Samstag, Platz, Saison, Erfolg, Punkte, Pokal, Nationalspieler, …
  - Not football
    - Basketball + Boxing
    - Handball
  - Football
    - German League
    - European League

Text Mining Market Size

• “The text mining market has roughly $50-100 million annual product revenue, and is growing at roughly 40-60% annually.” [Monash 06, texttechnologies.com]
• Sounds small …
• But then …
  - Several research sites devoted to the technology …
• So the real market must be somewhere else …

The Text Mining Market … is called “Text Analytics”

Primary areas:
• Web search, site search
• Knowledge management, enterprise portals
• Information collection, extraction, harvesting
• Email handling, security, spam and phishing filtering
• Market research
• Online advertising
• Specialized markets
  - Litigation, juridical
  - Patent search

[cf. Monash/2008]

Application Field Market Research: Germany 1.6 billion, growing

Both ad-hoc studies and panels can benefit from text mining

http://www.adm-ev.de/zahlen.html

as a text mining market

• More than 1.2 billion $ in 2010

Year                          2006  2007  2008  2009  2010
Software revenue (million $)   717   860   989  1108  1219

[Gartner 2008]

Companies

Outline

• Why is Text Mining cool?
  - Drowning in Data: The Challenge of Meaning
  - Text Mining: Creating Meaning from Large Collections
  - Text Mining Markets
• What can we do with Text Mining in the Real World? Some case studies
  - Document classification: eBay, antiPhish
  - Retrieval and Relation Extraction: NZZ, THESEUS CONTENTUS, Wikinger, Fraunhofer Web
  - Structuring and Monitoring: Semantic Map, EmotionRadar
• Conclusion

Text Mining Tasks

• Document classification, scoring and/or ranking, isolated retrieval
  - Assign a class, score or rank to an entire document
• In-collection, linked retrieval and organization
  - Find documents in a collection
  - Link results to other results
• Information and relation extraction
  - Extract pieces of information, fill particular relations
• Overview and monitoring of collections
  - Give summary impression of information in a collection or source

Motivation: Spotting Faked Offers at Internet Auctions

Techniques to sell fakes
• Put faked products on an internet auction platform, e.g. eBay
• Describe product as forged, falsified, e.g. “very similar to XXX”

Aspects
• Infringement of registered trade marks
• Violation of patents
• Enormous sales volume

Counter Measures

Use trainable classifiers
• Compile training set of genuine and faked internet auction offers
• Train classifiers to detect these classes; use text, format information, etc. as features
• Use different classifiers for different brands / products
• Apply to new internet auction offers
• Ban faked offers from auction
• Update classifiers to new techniques

[Figure: fakes and originals in a feature space (x1, x2), separated by a hyperplane]
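The steps above (compile a labeled training set, train a classifier on text features, apply it to new offers) can be illustrated with a deliberately small sketch. It assumes a bag-of-words representation and a perceptron as the hyperplane classifier; the slides do not say which classifier was actually used, and the four training offers are invented.

```python
from collections import defaultdict

def features(text):
    """Bag-of-words features: token -> count."""
    counts = defaultdict(int)
    for token in text.lower().split():
        counts[token] += 1
    return counts

def train_perceptron(examples, epochs=10):
    """Learn weights and bias so that the sign of (w.x + b) separates the classes."""
    weights, bias = defaultdict(float), 0.0
    for _ in range(epochs):
        for text, label in examples:           # label: +1 = fake, -1 = original
            x = features(text)
            score = bias + sum(weights[t] * v for t, v in x.items())
            if label * score <= 0:             # misclassified: move the hyperplane
                for t, v in x.items():
                    weights[t] += label * v
                bias += label
    return weights, bias

def predict(weights, bias, text):
    score = bias + sum(weights.get(t, 0.0) * v for t, v in features(text).items())
    return 1 if score > 0 else -1

# Invented training offers (hypothetical, not project data)
training = [
    ("very similar to brand watch cheap replica", 1),
    ("replica handbag looks like original", 1),
    ("original brand watch with certificate and receipt", -1),
    ("genuine handbag bought in official store", -1),
]
weights, bias = train_perceptron(training)
print(predict(weights, bias, "cheap replica watch"))   # 1, i.e. flagged as fake
```

In practice one such model would be trained per brand or product and retrained as sellers change their wording, which corresponds to the last bullet above.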

Results

• A classifier was developed and tested
• Similar techniques as for spam detection
• Good results: F-value >> 90%
• The German Federal Court of Justice: Internet auction providers have to filter the auctions using appropriate methods to detect faked offers.
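For readers unfamiliar with the F-value, the following sketch shows how it is commonly computed from precision and recall. The counts are invented for illustration and are not results from the project.

```python
def f_measure(true_positives, false_positives, false_negatives, beta=1.0):
    """F-measure: weighted harmonic mean of precision and recall (beta=1 gives F1)."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Invented counts: 95 fakes found, 5 originals wrongly flagged, 5 fakes missed
score = f_measure(true_positives=95, false_positives=5, false_negatives=5)
print(round(score, 2))
```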

Motivation: Phishing

E-mail fraud
• Send official-looking email
• Include web link or form
• Ask for confidential information, e.g. password, account details
• Attacker uses information to withdraw money, enter computer system, etc.

AntiPhish

Project AntiPhish
• Develop content-based phishing filters
• Include other clues like whitelists
• Trainable and adaptive filters
  → adapt to new phishing attacks
  → anticipate attacks

Consortium
• Fraunhofer IAIS (DE)
• Symantec (GB, IRL)
• Tiscali (IT)
• Nortel (FR)
• K.U. Leuven (BE)

Phishing: Defense Techniques

Workflow
• Obtain training data from email stream
• Extract features
• Estimate and update classifiers and filters
• Integrate new filters into email filtering framework
• Deploy at internet service provider
• Deploy at central wireless packet switch

Approach: Multiple feature sets
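As a rough illustration of the "extract features" step in the workflow, the sketch below derives a few simple content features from an email body. The feature names, the keyword list and the sample message are assumptions made for this example; they are not the feature sets used in AntiPhish.

```python
import re

# Captures the host part of http(s) links
URL_PATTERN = re.compile(r"""https?://([^/\s"'>]+)""", re.IGNORECASE)
# Hosts given as a raw IPv4 address (optionally with a port)
IP_HOST = re.compile(r"^\d{1,3}(\.\d{1,3}){3}(:\d+)?$")
SUSPICIOUS_WORDS = ("password", "account", "verify", "confirm")  # hypothetical list

def email_features(body):
    """Map an email body to a small dictionary of numeric features."""
    hosts = URL_PATTERN.findall(body)
    text = body.lower()
    return {
        "num_links": len(hosts),
        "num_ip_links": sum(1 for h in hosts if IP_HOST.match(h)),
        "has_form": int("<form" in text),
        "num_suspicious_words": sum(text.count(w) for w in SUSPICIOUS_WORDS),
    }

sample = ('Dear customer, please verify your account: '
          '<a href="http://192.0.2.7/login">click here</a> and enter your password.')
feats = email_features(sample)
print(feats)
```

Feature dictionaries of this kind can then be fed to any trainable classifier and recomputed as new phishing variants appear.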

Basic Features

Dynamic Markov Chains

DMC Details

Latent Topic Models

Class-Specific Topic Models

Feature Processing and Selection

Test Corpora

Overall Result


The THESEUS research program in Germany [theseus-program.com]

Partners:
• Deutsche Nationalbibliothek
• Deutsche Thomson OHG (DTO)
• Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI GmbH)
• empolis GmbH
• Festo AG
• Fraunhofer-Gesellschaft (7 Institutes)
• Friedrich-Alexander-Universität Erlangen
• FZI Forschungszentrum Informatik
• Institut für Rundfunktechnik GmbH (IRT)
• intelligent views gmbh
• Ludwig-Maximilians-Universität (LMU)
• moresophy GmbH
• LYCOS Europe
• mufin GmbH
• ontoprise GmbH
• SAP AG
• Siemens AG
• Technische Universität Darmstadt
• Technische Universität Dresden
• Technische Universität München
• Universität Karlsruhe (TH)
• Verband Deutscher Maschinen- und Anlagenbau e.V. (VDMA)

[Figure axes: syntactic to semantic; single author to multiple authors; source: Wess/07]

The THESEUS use cases

• ALEXANDRIA: The Internet Knowledge Platform
• CONTENTUS: Next Generation Digital Libraries for saving our cultural heritage
• MEDICO: Semantic Image Search in Medicine
• ORDO: Personal Ordered Knowledge Management
• PROCESSUS: Semantic Business Processes
• TEXO: Business Webs in the Internet of Things

CONTENTUS - Next Generation Digital Libraries for saving our cultural heritage

• Publishers, libraries, broadcasters, etc. are interested in using, distributing and selling their archive content
• In analog form, archives are threatened by deterioration, are not linked, are difficult to use, and are huge

Goals:
• Digitization, optimization of quality, availability
• Indexing, semantic and social linking and intelligent search, communities
• Rescue of cultural heritage, preventing losses from deterioration

Duration: until 2012

Showcases Semantic Digital Libraries

• 225 years of Neue Zürcher Zeitung (NZZ)
• GDR music archive (German National Library)

CONTENTUS Workflow

Shell
• Data generation: registered users / communities, algorithms with acceptable quality
• Quality control: self control (see Wikipedia)

Mantle: Controlled quality
• Data generation: automatically generated through high-quality algorithms
• Quality control: training and improvement of algorithms

Core: Guaranteed quality
• Data generation and correction: libraries, museums, universities, experts, etc.
• Quality control: schooling, rules, advisory boards
• Highest stability, highest persistence

Processing steps:
1 Digitization
2 Automated optimization of quality
3 Automated generation of metadata
4 Semantic linking of metadata
5 Open knowledge networks – user augmentation
6 Semantic access to knowledge and content

1 Digitization

• High-throughput methods
• Modern book scanners: thousands of pages per day
• Almost fully automatic

Data volumes: 70 TB (NZZ), Peta-Exabytes (DNB)

2 Automated optimization of quality

• Development of intelligent algorithms for optimizing print, images, sound & movies
• Automated generation of presentation formats

Examples shown: margin removal; sharpening, straightening; denoising, declicking; scratch removal

3 Automated generation of metadata

• Structural and content-related metadata
• OCR, speech, music, video recognition
• Structure analysis and type recognition
• Linking with current norms & standards

4 Semantic linking of contents

• Link-up with related media
• Incorporation of external knowledge sources (metadata systems, Wikipedia, …)
• Disambiguation, classification, relation extraction

Determining meaning

The words of natural language are often ambiguous

Example: Über Kohl höhnte Strauß: „Er wird nie Kanzler werden.“ (Strauß sneered about Kohl: "He will never become chancellor.") Die Zeit, 18.7.08

» For each word / term, find a meaning
» Subproblems:
  » Part of speech: Noun, Verb, Adjective, …
  » Named entity recognition: People, Places, Organizations, …
  » Assignment of concepts: Plant, Bird, Politician, …

Named entity recognition

» Analyze the surroundings of words. Example: Über Kohl höhnte Strauß: „Er wird nie Kanzler werden.“
  » “Kohl” in a sentence with “Kanzler” (chancellor) → probably “person”
  » “Kohl” in a sentence with “kochen” (to cook) → probably “vegetable”

» Statistical model for person names
  » Word + surroundings -> word is a person
  » Training using annotated sentences

» Automatic recognition of words / phrases that represent people

Conditional Random Field Model

» Observed words X1,…,Xn

» Category of words Y1,…,Yn

$$p(Y_1,\dots,Y_n \mid X) = \frac{1}{Z(X,\lambda,\mu)} \exp\left( \sum_{t=1}^{N} \left[ \sum_{k=1}^{N_2} \lambda_k f_{k,C}(Y_{t-1}, Y_t, X) \right] \right)$$

» Properties f may depend on two subsequent states and on all observed words

Example

» Property f10293 has value 1 if Yt-1 = “PER” and Yt = “PER” and Xt has value “Müller”. Otherwise its value is 0. [Lafferty, McCallum, Pereira 01]
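To make the role of such properties concrete, here is a toy sketch of a linear-chain model with three binary feature functions, one of which mirrors the "Müller" example above. It assumes only two labels and invented weights, passes the position t as an explicit argument, and normalizes by enumerating every label sequence, which is feasible only for very short sentences; real CRF implementations use dynamic programming instead.

```python
import math
from itertools import product

LABELS = ("PER", "O")

def f_per_per_mueller(y_prev, y_cur, words, t):
    """1 if two consecutive PER labels occur and the current word is "Müller"."""
    return 1.0 if y_prev == "PER" and y_cur == "PER" and words[t] == "Müller" else 0.0

def f_capitalized_per(y_prev, y_cur, words, t):
    """1 if the current word is capitalized and labelled PER."""
    return 1.0 if y_cur == "PER" and words[t][:1].isupper() else 0.0

def f_lowercase_other(y_prev, y_cur, words, t):
    """1 if the current word starts in lower case and is labelled O."""
    return 1.0 if y_cur == "O" and words[t][:1].islower() else 0.0

# (lambda_k, f_k) pairs; the weights are invented, not trained
FEATURES = [(2.0, f_per_per_mueller), (0.5, f_capitalized_per), (0.5, f_lowercase_other)]

def score(labels, words):
    """Sum over positions t and features k of lambda_k * f_k(Y_{t-1}, Y_t, X)."""
    total = 0.0
    for t in range(len(words)):
        y_prev = labels[t - 1] if t > 0 else "START"
        total += sum(lam * f(y_prev, labels[t], words, t) for lam, f in FEATURES)
    return total

def probability(labels, words):
    """p(Y | X) = exp(score) / Z, with Z computed by brute-force enumeration."""
    z = sum(math.exp(score(ys, words)) for ys in product(LABELS, repeat=len(words)))
    return math.exp(score(labels, words)) / z

words = ["Herr", "Müller", "lacht"]
best = max(product(LABELS, repeat=len(words)), key=lambda ys: probability(ys, words))
print(best)   # ('PER', 'PER', 'O')
```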

Modeling of names: features for a CRF model

» Title FirstName Connective LastName

» Properties, recorded for the words xt-2, xt-1, xt, xt+1, xt+2
  » Word, stem, part of speech
  » Prefix, suffix (3 letters)
  » Shape properties: capital character at the beginning, only numbers, contains numbers, mix of capital / non-capital letters, contains hyphens
  » LDA class
  » Contained in list of first names, contained in list of last names

(Work in progress)
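The window-based properties listed above could be computed along the following lines. This is a sketch under simplifying assumptions: stem, part of speech and LDA class are omitted because they need external tools, and the name lists are tiny invented stand-ins.

```python
def shape_features(word):
    """Shape properties of a single word."""
    return {
        "initial_capital": word[:1].isupper(),
        "only_digits": word.isdigit(),
        "contains_digit": any(c.isdigit() for c in word),
        "mixed_case": any(c.isupper() for c in word[1:]) and any(c.islower() for c in word),
        "contains_hyphen": "-" in word,
    }

def token_features(words, t, first_names=frozenset(), last_names=frozenset()):
    """Features of the words x_{t-2} ... x_{t+2} around position t."""
    feats = {}
    for offset in range(-2, 3):
        i = t + offset
        if not 0 <= i < len(words):
            continue                      # window reaches past the sentence boundary
        w = words[i]
        feats[f"word[{offset}]"] = w.lower()
        feats[f"prefix3[{offset}]"] = w[:3]
        feats[f"suffix3[{offset}]"] = w[-3:]
        for name, value in shape_features(w).items():
            feats[f"{name}[{offset}]"] = value
        feats[f"in_first_names[{offset}]"] = w in first_names
        feats[f"in_last_names[{offset}]"] = w in last_names
    return feats

sentence = ["Dr.", "Helmut", "Kohl", "sprach"]
feats = token_features(sentence, 2, first_names={"Helmut"}, last_names={"Kohl"})
print(feats["word[0]"], feats["in_first_names[-1]"], feats["in_last_names[0]"])
```

Each key carries its offset, so the model can learn, for instance, that a title two positions to the left makes a person label more likely.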

Identity of names

» There are several people named “Helmut Kohl”
  » Helmut Kohl, born 1930, Chancellor
  » Helmut Kohl, born 1943, Referee
  » Helmut Kohl, textile merchant
  » … 99 further hits in the telephone book

» Identification in Wikipedia
  » Compare words of the Wikipedia article with the text in which “Helmut Kohl” was found
  » Similar words -> similar person

» Automated assignment: Person name -> Wikipedia article

Assignment of similarity from the environment

Simple for assigning people to Wikipedia articles
» Occurrence in text: Helmut Kohl
  » Description using characteristic terms -> u
» Wikipedia article on Kohl
  » Description by characteristic terms -> w
» Comparison using a distance metric, for example cosine distance d(w,u)

» Implemented in a prototype

» Further approach: assignment as a classification task f(w,u) = 0 or 1 (Master Thesis)
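A minimal sketch of the comparison step, assuming plain term counts as the "characteristic terms" and cosine similarity as the measure (the cosine distance is one minus this value). The article texts are invented snippets, not real Wikipedia content.

```python
import math
from collections import Counter

def term_vector(text):
    """Describe a text by its term counts."""
    return Counter(text.lower().split())

def cosine_similarity(u, w):
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(u[t] * w[t] for t in u.keys() & w.keys())
    norm = math.sqrt(sum(v * v for v in u.values())) * math.sqrt(sum(v * v for v in w.values()))
    return dot / norm if norm else 0.0

def best_article(context, articles):
    """Pick the article whose term vector w is closest to the context vector u."""
    u = term_vector(context)
    return max(articles, key=lambda title: cosine_similarity(u, term_vector(articles[title])))

# Invented article snippets for illustration
articles = {
    "Helmut Kohl (politician)": "chancellor cdu politician germany government",
    "Helmut Kohl (referee)": "referee football match austria",
}
context = "the chancellor and cdu politician helmut kohl spoke to the government"
match = best_article(context, articles)
print(match)   # Helmut Kohl (politician)
```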

Semantic Interpretation

Semantic categories currently assigned in the Contentus prototype:

» Names: people, organizations, points in time, places, …

» Assignment to Wikipedia articles

Under development:

» Hypernyms in ontology (GermaNet): nouns, verbs → supersenses

» Clusters of words with similar meaning: topics

» Relations between names / concepts: “Berthold Brecht” studied in “München”

» Classes of documents: Politics, Economy, …

Knowledge store

» Further information for entities that were found in the text
  » Dates, publications
  » Number of inhabitants, topological relationships
» Social networks
  » Who knows whom?
  » Who was at the same place at the same time?
  » Who influenced whom?

Helmut Kohl
  Date of birth    30.4.1930
  Place of birth   Ludwigshafen
  Spouse           Hannelore K.
  Education        Historian
  Religion         Catholic
  Party            CDU

Berlin
  Area         891 km²
  Population   3,420,786
  GDP          €83.6 billion
  Elevation    34–115 m
  Latitude     52° 31′ N
  Longitude    13° 25′ E

Knowledge store: Format

» Factual knowledge as logical expressions
» Semantic Web standards
  » RDF
  » RDFS
  » OWL
» Technical basis
  » Database: MySQL
  » Triple store: Jena + Joseki
» Query language
  » SPARQL
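The triple-based format can be illustrated without any Semantic Web tooling. The sketch below is a simplified stand-in for a triple store, not Jena or SPARQL itself: facts are plain (subject, predicate, object) tuples, the predicate names are hypothetical, and `match` evaluates a single triple pattern with `?`-prefixed variables.

```python
# Facts as (subject, predicate, object) triples, the data model behind RDF.
TRIPLES = [
    ("Helmut_Kohl", "birthPlace", "Ludwigshafen"),
    ("Helmut_Kohl", "party", "CDU"),
    ("Helmut_Kohl", "spouse", "Hannelore_K."),
    ("Berlin", "areaKm2", "891"),
]

def match(pattern, triples=TRIPLES):
    """Return variable bindings for one triple pattern; terms starting with '?' are variables."""
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                if binding.setdefault(term, value) != value:
                    break                 # same variable bound to two different values
            elif term != value:
                break                     # constant does not match
        else:
            results.append(binding)       # no break: the whole pattern matched
    return results

# Comparable in spirit to: SELECT ?p ?o WHERE { :Helmut_Kohl ?p ?o }
facts = match(("Helmut_Kohl", "?p", "?o"))
print(facts)
```

A real deployment would express the same question in SPARQL against the triple store; the point here is only that every fact is a three-part statement that patterns can be matched against.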

Buchmesse Frankfurt, 10 October 2008

Linking of knowledge

» Semantic integration of data and information from different sources
  » DBPedia: an interpreted form of Wikipedia
  » Geonames Ontology: all the places in the world
  » Catalogue of the German National Library: books and publications

→ Triplestore

» Based on open standards
  » W3C stack
  » RDF, RDFS, OWL, SPARQL

Knowledge sources

» DBPedia (www..org)
» GeoNames Ontology
  » Already in RDF/OWL format
» Person reference database PND
» Topic reference database SWD
» Online catalogue OPAC
  » Partial export to RDF
» Found entities in the text
  » Identification using Wikipedia
  » Linking with DBPedia data per link

5 Open knowledge networks – user augmentation

• Further annotations from experts and users
  - Completions, corrections
  - Cooperation with the ALEXANDRIA project in Theseus
  - Suitable measures to assure high quality of data

The Multiple Shell Model (cf. Wikinger [Bröcker et al. 08])

Outer: Open knowledge network
• Data generation: registered users / communities, algorithms
• Quality control: self control (cf. Wikipedia)

Mantle: Controlled quality
• Data generation: algorithms of high quality
• Quality control: training and improvement of algorithms

Core: Assured quality
• Data generation and correction: libraries, universities, museums, groups of experts, etc.
• Quality control: fixed rules, committees, maximal stability and persistence

6 Semantic access to knowledge and content

• The knowledge network
  - Digital, multimedia data
  - Content is semantically linked
  - Is enriched from external sources and user groups
• Access
  - Structure by ontology
  - Content relationships become clear
  - “Knowledge exploration” is possible

The Contentus Demonstrator

Example of a classified webpage

Match with the document model = 80%
Classified as = Projects

Workflow for semantic processing of documents

[Workflow diagram: Crawl; Documents; Pre-processing; Categorization (using the document model); Entity Recognition; Extracted regions (using the structure model); Extracted Metadata; Search index; Knowledge Store; Search]

Watch out for the launch in summer 2009!!

Emotion Radar

Which issues are important to people, where are the emotional discussions in and discussion forums?

Goal: market research, …

Emotion Radar example: Looking at two large automobile companies A and B*

ƒ Selection and Crawling of discussion forums

• Criteria: ranking, size, activity

• Period used: January 2008 to January 2009

• Storage: 2 GB of data, crawled over a period of 7 days

ƒ Structure analysis of the discussion forums:

• Manufacturer A:

– Number of postings: 188,487
– Monthly number of new postings: approx. 1,500
– Number of threads: 21,613
– Number of authors: 15,445

• Manufacturer B:

– Number of postings: 406,814
– Monthly number of new postings: approx. 2,700
– Number of threads: 38,758
– Number of authors: 21,919

* anonymized

Case study: Internet postings related to the introduction of a new car model in Germany, 2008

Cars are delivered

Manufacturer publishes further product features/start of sales

Manufacturer publishes first pictures

Partially automated emotion analysis shows a mood swing from positive to negative ("in love" -> "angry")

[Chart: emotion labels plotted along the timeline of the model introduction. Event markers: manufacturer publishes first pictures; manufacturer publishes further product features/start of sales; cars are delivered. Emotion labels shown: "in love", "hoping", "interested", "turned off", "surprised", "proud", "angry" (twice).]
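A mood swing like the one on this slide could be tracked with a small cue-word lexicon, as in the sketch below. This is an assumption for illustration only: the slides do not describe the analysis method, the cue words, their polarity values and the sample postings are all hypothetical, and the manual review implied by "partially automated" is not modeled.

```python
import re
from collections import defaultdict

# Hypothetical lexicon: cue word -> (emotion label, polarity).
EMOTION_CUES = {
    "love": ("in love", 1),
    "hope": ("hoping", 1),
    "curious": ("interested", 1),
    "proud": ("proud", 1),
    "surprised": ("surprised", 0),
    "ugly": ("turned off", -1),
    "annoyed": ("angry", -1),
    "furious": ("angry", -1),
}


def tag_emotions(text):
    """Return (label, polarity) for every cue word found in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [EMOTION_CUES[t] for t in tokens if t in EMOTION_CUES]


def mood_by_phase(postings):
    """Average polarity of all emotion cues per phase."""
    polarities = defaultdict(list)
    for phase, text in postings:
        for _label, polarity in tag_emotions(text):
            polarities[phase].append(polarity)
    return {phase: sum(v) / len(v) for phase, v in polarities.items()}


# Hypothetical postings, keyed by the phase in which they were written.
POSTINGS = [
    ("first pictures", "I love the new shape and hope it comes as a diesel"),
    ("first pictures", "Curious about the interior, but the grille is ugly"),
    ("deliveries", "Proud owner since Friday"),
    ("deliveries", "Furious about the fuel consumption"),
    ("deliveries", "Still annoyed, the dealer has no answer"),
]

mood = mood_by_phase(POSTINGS)
for phase, value in mood.items():
    print(f"{phase}: {value:+.2f}")
```

With these sample postings the average polarity drops from positive in the "first pictures" phase to negative after deliveries, which is the pattern the slide describes.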

Topic recognition shows that the product features under discussion shift from design to gasoline consumption

[Chart: discussed topics and emotions along the timeline of the model introduction; labels translated from German.]

ƒ Event markers: first photos; deliveries
ƒ Topic labels: vehicle length, design • consumption • sunroof, efficiency, technology • gear shift, steering wheel, on-board computer, audio system • "giant fish mouth" • chrome trim, perceived quality • neighbors • prices, car configurator and list • consumption, folding key, audio • test drives
ƒ Emotion labels: "in love", "hoping", "interested", "turned off", "surprised", "proud", "angry" (twice)
ƒ Sample posting (Kalle83): "...hard to believe how much gasoline this little car uses!"
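One simple way to surface a topic shift such as "design" giving way to "consumption" is to compare term frequencies between an earlier and a later set of postings. The sketch below ranks terms by a smoothed frequency ratio. This is an assumed stand-in, not the topic recognition method behind the slide, and the stopword list and postings are hypothetical.

```python
import re
from collections import Counter

# Tiny hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"the", "is", "and", "too", "but", "way", "a"}


def term_counts(posts):
    """Count non-stopword tokens over a list of postings."""
    counts = Counter()
    for post in posts:
        counts.update(
            t for t in re.findall(r"[a-z]+", post.lower()) if t not in STOPWORDS
        )
    return counts


def emerging_terms(earlier, later, top_n=3):
    """Rank terms of the later period by add-one smoothed frequency ratio."""
    before, after = term_counts(earlier), term_counts(later)
    n_before, n_after = sum(before.values()), sum(after.values())

    def lift(term):
        p_after = (after[term] + 1) / (n_after + 1)
        p_before = (before[term] + 1) / (n_before + 1)
        return p_after / p_before

    # Sort by descending lift; break ties alphabetically for stable output.
    return sorted(after, key=lambda t: (-lift(t), t))[:top_n]


# Hypothetical postings from two phases of the model introduction.
EARLIER = [
    "The design looks great",
    "Love the design and the chrome trim",
]
LATER = [
    "Fuel consumption is too high",
    "Consumption is way above the brochure",
    "Design is fine but consumption hurts",
]

top_terms = emerging_terms(EARLIER, LATER)
print(top_terms)
```

Terms that were already frequent earlier, such as "design" in this sample, get a ratio below 1 and drop out of the ranking.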

How can manufacturers use these text mining results?

ƒ Short term: e.g. early recognition of relevant topics (consumption) and preparation of an appropriate response (fuel-saving training, fuel-efficient tires, proactive communication)
ƒ Long term: continuous monitoring of emotional topics

Summary

ƒ Text Mining is cool! • Drowning in Data: The Challenge of Meaning • Text Mining: Creating Meaning from Large Collections • Text Mining Markets

ƒ We can do a lot with Text Mining in the Real World! • Document classification: eBay, antiPhish • Retrieval and Relation Extraction: NZZ, THESEUS CONTENTUS, Fraunhofer Web • Structuring and Monitoring: EmotionRadar

