
9. Text and Document Analytics

Unstructured Data

Data Mining / Knowledge Discovery applies to structured data as well as to multimedia, free text, and hypertext.

Example – extracting a structured record from free text:
• Free text: "Frank Rizzo bought his home from Lake View Real Estate in 1992. He paid $200,000 under a 15-year loan from MW Financial."
• Structured record: HomeLoan(Loanee: Frank Rizzo, Lender: MWF, Agency: Lake View, Amount: $200,000, Term: 15 years)

...

Unstructured Data

• There are many examples of text-based documents (all in ‘electronic’ format…) – e-mails, corporate Web pages, customer surveys, résumés, medical records, DNA sequences, technical papers, incident reports, news stories, etc.

• Not enough time or patience to read – Can we extract the most vital kernels of information?

• Find a way to gain knowledge (in summarised form) from all that text, without reading or examining the documents fully first

Tasks

9.1 Text Mining

Text Mining

• Typically falls into one of two categories:
  – Analysis of text: I have a bunch of text I am interested in, tell me something about it
    • E.g., "buzz" searches
  – Retrieval: There is a large corpus of text documents, and I want the one closest to a specified query
    • E.g., web search, library catalogs, legal and medical precedent studies

Text Mining: Analysis

• Which words are most frequent
• Which words are most surprising
• Which words help define the document
• What are the interesting text phrases

Text Mining: Retrieval

• Find the k objects in the corpus of documents which are most similar to my query.
• Can be viewed as "interactive" data mining – the query is not specified a priori.
• Main problems of text retrieval:
  – What does "similar" mean?
  – How do I know if I have the right documents?
  – How can I incorporate user feedback?

9.2 Text Retrieval

Text Retrieval: Challenges

• Calculating similarity is not obvious – what is the distance between two sentences or queries?
• Evaluating retrieval is hard: what is the "right" answer? (no ground truth)
• Users can query things you have not seen before, e.g. misspelled, foreign, or new terms.
• The goal (score function) is different than in classification/regression: we are not trying to model all of the data, just to get the best results for a given user.
• Words can hide semantic content
  – Synonymy: A keyword T does not appear anywhere in the document, even though the document is closely related to T, e.g., data mining
  – Polysemy: The same keyword may mean different things in different contexts, e.g., mining

Basic Measures for Text Retrieval

• Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., "correct" responses)

  precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|

• Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved

  recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
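A minimal sketch of computing both measures from sets of document IDs (the sets are illustrative):

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall from sets of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant                       # relevant documents that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 10 documents retrieved, 4 of them relevant, 8 relevant documents overall
print(precision_recall(range(1, 11), {2, 4, 6, 8, 12, 14, 16, 18}))  # (0.4, 0.5)
```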

Precision vs. Recall

                          Truth: Relevant    Truth: Not Relevant
Algorithm: Relevant             TP                  FP
Algorithm: Not Relevant         FN                  TN

– Precision = TP/(TP+FP)
– Recall = TP/(TP+FN)
– Trade-off:
  • If the algorithm is 'picky': precision high, recall low
  • If the algorithm is 'relaxed': precision low, recall high

– BUT: recall often hard if not impossible to calculate

Precision Recall Curves

• If we have a labeled training set, we can calculate recall.
• For any given number of returned documents, we can plot a point for precision vs. recall.
• Different retrieval algorithms might have very different curves – it is hard to tell which is "best".
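A minimal sketch of tracing out such a curve for a ranked result list, assuming binary relevance judgements are available:

```python
def precision_recall_curve(ranked_doc_ids, relevant):
    """One (precision, recall) point per cut-off k in a ranked result list."""
    relevant = set(relevant)
    points, hits = [], 0
    for k, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
        points.append((hits / k, hits / len(relevant)))   # (precision@k, recall@k)
    return points

print(precision_recall_curve(["d3", "d7", "d1", "d9"], {"d3", "d1", "d5"}))
```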

9.3 Frequency Analysis

Bag-of-Tokens Approaches

Documents → Feature Extraction → Token Sets

Example (Gettysburg Address): "Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or …"

Token counts: nation – 5, civil – 1, war – 2, men – 2, died – 4, people – 5, Liberty – 1, God – 1, …

Loses all order-specific information! Severely limits context!
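A minimal sketch of this bag-of-tokens extraction using only the Python standard library:

```python
import re
from collections import Counter

def bag_of_tokens(document):
    """Lower-case the text, split it into word tokens, and count them."""
    tokens = re.findall(r"[a-z]+", document.lower())
    return Counter(tokens)

text = "Four score and seven years ago our fathers brought forth on this continent, a new nation"
print(bag_of_tokens(text).most_common(3))
```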

Term-document Matrix

• The most common form of representation in text mining is the term-document matrix
  – Term: typically a single word, but could be a phrase like "data mining"
  – Document: a generic term meaning a collection of text to be retrieved
  – Can be large – terms are often 50k or more, documents can be in the billions (WWW)
  – Entries can be binary, or use counts

Term-document Matrix

Example: 10 documents, 6 terms

        Database   SQL   Index   Regression   Likelihood   Linear
D1         24       21      9         0            0           3
D2         32       10      5         0            3           0
D3         12       16      5         0            0           0
D4          6        7      2         0            0           0
D5         43       31     20         0            3           0
D6          2        0      0        18            7           6
D7          0        0      1        32           12           0
D8          3        0      0        22            4           4
D9          1        0      0        34           27          25
D10         6        0      0        17            4          23

Each document is now just a vector of term counts (sometimes Boolean)
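A minimal sketch of building such a matrix from raw documents (whitespace tokenization, no stop word removal or stemming yet):

```python
from collections import Counter

def term_document_matrix(documents):
    """Return (terms, matrix) where matrix[i][j] counts term j in document i."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    terms = sorted(set(t for c in counts for t in c))
    matrix = [[c[t] for t in terms] for c in counts]
    return terms, matrix

docs = ["database index sql sql", "regression likelihood linear regression"]
terms, td = term_document_matrix(docs)
print(terms)   # ['database', 'index', 'likelihood', 'linear', 'regression', 'sql']
print(td)      # [[1, 1, 0, 0, 0, 2], [0, 0, 1, 1, 2, 0]]
```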

Term-document Matrix

• We have lost all semantic content
• Be careful constructing your term list!
  – Not all words are equal
  – Words that are the same should be treated the same
    • Stop words
    • Stemming

Stop Words

• Many of the most frequently used words in English are worthless in retrieval and text mining – these words are called stop words.
  – the, of, and, to, …
  – Typically about 400 to 500 such words
  – For an application, an additional domain-specific stop word list may be constructed
• Why do we need to remove stop words?
  – Reduce indexing (or data) file size
    • Stop words account for 20-30% of total word counts
  – Improve effectiveness
    • Stop words are not useful for searching or text mining
    • Stop words always have a large number of hits
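A minimal sketch of stop word filtering; the tiny stop list is illustrative, real systems use lists of several hundred words:

```python
STOP_WORDS = {"the", "of", "and", "to", "a", "in", "is", "it", "that"}  # tiny illustrative list

def remove_stop_words(tokens):
    """Keep only tokens that are not in the stop list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("the mining of text is a task in data analytics".split()))
# ['mining', 'text', 'task', 'data', 'analytics']
```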

Stemming

• Techniques used to find the root/stem of a word:
  – E.g., for the words user, users, used, using, engineering, engineered, engineer
  – stems: use, engineer
• Usefulness:
  – Improving effectiveness of retrieval and text mining
    • matching similar words
  – Reducing indexing size
    • combining words with the same root may reduce indexing size by as much as 40-50%.

Basic stemming methods

• Remove endings:
  – if a word ends with a consonant other than s, followed by an s, then delete the s.
  – if a word ends in es, drop the s.
  – if a word ends in ing, delete the ing unless the remaining word consists only of one letter or of th.
  – if a word ends with ed, preceded by a consonant, delete the ed unless this leaves only a single letter.
  – …
• Transform words:
  – if a word ends with "ies" but not "eies" or "aies", then replace "ies" with "y".
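A minimal sketch that implements only the rules listed above; a real system would use a full stemmer such as Porter's:

```python
def simple_stem(word):
    """Apply the ad-hoc suffix rules from the slide, in order."""
    w = word.lower()
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"
    if w.endswith("es"):
        return w[:-1]
    if w.endswith("ing") and len(w) > 4 and w[:-3] != "th":
        return w[:-3]
    if w.endswith("ed") and len(w) > 3 and w[-3] not in "aeiou":
        return w[:-2]
    if w.endswith("s") and len(w) > 1 and w[-2] not in "aeious":
        return w[:-1]
    return w

print([simple_stem(w) for w in ["queries", "engines", "using", "engineered", "users"]])
# ['query', 'engine', 'us', 'engineer', 'user']
```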

Feature Selection

• Performance of text classification algorithms can be optimized by selecting only a subset of the discriminative terms – Even after stemming and stopword removal.

• Greedy search
  – Start from the full set and delete one term at a time
  – Find the least important variable

• Often performance does not degrade even with orders of magnitude reductions
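A minimal sketch of greedy backward elimination; `evaluate` is a hypothetical scoring function (e.g. cross-validated accuracy of the classifier on the reduced term set):

```python
def greedy_backward_elimination(terms, evaluate, target_size):
    """Repeatedly drop the term whose removal hurts the score the least."""
    selected = list(terms)
    while len(selected) > target_size:
        # score every candidate subset obtained by dropping a single term
        scored = [(evaluate([t for t in selected if t != drop]), drop) for drop in selected]
        best_score, least_important = max(scored)
        selected.remove(least_important)
    return selected
```

In practice `evaluate` would retrain and score the text classifier each time, which is why this search is expensive but often barely degrades performance even after large reductions.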

9.4 Comparisons

Distances of documents

• Interpret documents as strings.
• Then, we can compute how closely two documents match each other using the Levenshtein distance (edit distance).
• The Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.
• Formally, lev(|a|,|b|) is computed by the recurrence

  lev_{a,b}(i, j) = max(i, j)                                              if min(i, j) = 0
  lev_{a,b}(i, j) = min( lev_{a,b}(i−1, j) + 1,
                         lev_{a,b}(i, j−1) + 1,
                         lev_{a,b}(i−1, j−1) + [a_i ≠ b_j] )               otherwise
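A minimal sketch of the corresponding dynamic program:

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))              # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution (or match)
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```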

• However, documents we are looking into are typically not that close and the Levenshtein distances would not produce meaningful results when comparing a larger set of documents.

Distances in TD matrices

• Given a term-document matrix representation, we can define distances between documents
• Elements of the matrix can be 0/1 or term frequencies (sometimes normalized)
• Can use Euclidean or cosine distance
• Cosine distance is based on the angle between the two vectors:

  d_c(x, y) = (x · y) / (‖x‖ ‖y‖)

• If two documents are the same, d_c = 1; if they have nothing in common, d_c = 0
• Maybe not so intuitive, but it has been proven to work well for sparse data sets
• Term-document matrices are typically sparse
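A minimal sketch of the cosine measure between two term vectors, using rows of the example matrix:

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two term vectors (1 = same direction, 0 = orthogonal)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0

d1 = [24, 21, 9, 0, 0, 3]    # D1 from the example matrix
d7 = [0, 0, 1, 32, 12, 0]    # D7 from the example matrix
print(cosine_similarity(d1, d1))  # ≈ 1.0 (identical documents)
print(cosine_similarity(d1, d7))  # close to 0 – hardly any terms in common
```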

Distances in TD matrices

• We can calculate cosine and Euclidean distance for this matrix • What would you want the distances to look like?

(Term-document matrix of the 10-document, 6-term example shown above.)

Document distance

• Pairwise distances between documents • Image plots of cosine distance, Euclidean, and scaled Euclidean:

Weighting in TD space

• Not all phrases are of equal importance
  – E.g. "David" is less important than "Beckham"
  – If a term occurs frequently in many documents, it has less discriminatory power
  – One way to correct for this is the inverse document frequency (IDF):

    IDF(term j) = log(N / n_j)

  – n_j = # of docs containing the term
  – N = total # of docs
  – Term importance = Term Frequency (TF) × IDF
  – A term is "important" if it has a high TF and/or a high IDF
  – TF × IDF is a common measure of term importance
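A minimal sketch of TF × IDF weighting with the natural logarithm, which roughly reproduces the weighted example table that follows:

```python
import math

def tf_idf(td_matrix):
    """Weight a term-document count matrix by TF x IDF, with IDF_j = ln(N / n_j)."""
    n_docs = len(td_matrix)
    n_terms = len(td_matrix[0])
    docs_with_term = [sum(1 for row in td_matrix if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / n_j) if n_j else 0.0 for n_j in docs_with_term]
    return [[tf * idf[j] for j, tf in enumerate(row)] for row in td_matrix]

# The 10 x 6 count matrix from the example (terms: Database, SQL, Index, Regression, Likelihood, Linear)
td = [[24, 21, 9, 0, 0, 3], [32, 10, 5, 0, 3, 0], [12, 16, 5, 0, 0, 0],
      [6, 7, 2, 0, 0, 0],   [43, 31, 20, 0, 3, 0], [2, 0, 0, 18, 7, 6],
      [0, 0, 1, 32, 12, 0], [3, 0, 0, 22, 4, 4],   [1, 0, 0, 34, 27, 25],
      [6, 0, 0, 17, 4, 23]]

print([round(v, 1) for v in tf_idf(td)[0]])  # D1 weighted, roughly [2.5, 14.6, 4.6, 0.0, 0.0, 2.1]
```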

Weighting in TD space

Raw counts:
        Database   SQL   Index   Regression   Likelihood   Linear
D1         24       21      9         0            0           3
D2         32       10      5         0            3           0
D3         12       16      5         0            0           0
D4          6        7      2         0            0           0
D5         43       31     20         0            3           0
D6          2        0      0        18            7           6
D7          0        0      1        32           12           0
D8          3        0      0        22            4           4
D9          1        0      0        34           27          25
D10         6        0      0        17            4          23

TF × IDF weighted:
        Database   SQL   Index   Regression   Likelihood   Linear
D1        2.53     14.6    4.6        0            0          2.1
D2        3.3       6.7    2.6        0            1.0        0
D3        1.3      11.1    2.6        0            0          0
D4        0.7       4.9    1.0        0            0          0
D5        4.5      21.5   10.2        0            1.0        0
D6        0.2       0      0         12.5          2.5       11.1
D7        0         0      0.5       22.2          4.3        0
D8        0.3       0      0         15.2          1.4        1.4
D9        0.1       0      0         23.56         9.6       17.3
D10       0.6       0      0         11.8          1.4       16.0

Queries

• A query is a representation of the user's information needs
  – Normally a list of words
• Once we have a TD matrix, queries can be represented as a vector in the same space
  – "Database Index" = (1, 0, 1, 0, 0, 0)
• A query can also be a simple question in natural language

• Calculate the cosine distance between the query and the TF × IDF version of the TD matrix
• Returns a ranked vector of documents
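A minimal sketch of ranking documents against a query vector, reusing `tf_idf`, `cosine_similarity`, and the matrix `td` from the earlier sketches:

```python
def rank_documents(query_vector, td_matrix):
    """Return document indices sorted by cosine similarity to the query (best first)."""
    weighted = tf_idf(td_matrix)
    scores = [(cosine_similarity(query_vector, row), i) for i, row in enumerate(weighted)]
    return [i for score, i in sorted(scores, reverse=True)]

# "Database Index" as a query over (Database, SQL, Index, Regression, Likelihood, Linear)
query = [1, 0, 1, 0, 0, 0]
print(rank_documents(query, td))   # e.g. the database/SQL documents (D1-D5) come first
```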

9.5 Latent Semantic Indexing

Latent Semantic Indexing

• Criticism: queries can be posed in many ways, but still mean the same
  – Data mining and knowledge discovery
  – Car and automobile
  – Beet and beetroot
• Semantically, these are the same, and documents with either term are relevant.
• Using synonym lists or thesauri is a solution, but messy and difficult.
• Latent Semantic Indexing (LSI) tries to extract hidden semantic structure in the documents.
• Search what I meant, not what I said!

LSI

• Approximate the T-dimensional term space using principal components calculated from the TD matrix
• The first k PC directions provide the best set of k orthogonal basis vectors – these explain the most variance in the data.
  – The data is reduced to an N × k matrix, without much loss of information
• Each "direction" is a linear combination of the input terms, and defines a clustering of "topics" in the data.
• What does this mean for our toy example?

LSI

(Raw and TF × IDF weighted term-document matrices of the example, as shown above.)

LSI

• Typically done using Singular Value Decomposition (SVD) to find principal components:

The 10 × 6 TD matrix M is factored as M = U S Vᵀ, where the columns of V form a new orthogonal basis for the data (the PC directions), S is a diagonal matrix of singular values (the weighting), and U gives the document coordinates.

For our example: S = (77.4, 69.5, 22.9, 13.5, 12.1, 4.8)
Fraction of the variance explained by PC1 & PC2 = (77.4² + 69.5²) / Σ sᵢ² = 92.5%
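A minimal sketch of the decomposition with NumPy (assuming NumPy is available; the counts are the example matrix as transcribed above):

```python
import numpy as np

td = np.array([[24, 21, 9, 0, 0, 3], [32, 10, 5, 0, 3, 0], [12, 16, 5, 0, 0, 0],
               [6, 7, 2, 0, 0, 0],   [43, 31, 20, 0, 3, 0], [2, 0, 0, 18, 7, 6],
               [0, 0, 1, 32, 12, 0], [3, 0, 0, 22, 4, 4],   [1, 0, 0, 34, 27, 25],
               [6, 0, 0, 17, 4, 23]], dtype=float)

U, s, Vt = np.linalg.svd(td, full_matrices=False)     # td = U @ diag(s) @ Vt
print(np.round(s, 1))                                 # compare with S = (77.4, 69.5, ...) on the slide
print(round((s[:2] ** 2).sum() / (s ** 2).sum(), 3))  # fraction of variance in PC1 & PC2, ≈ 0.925

docs_2d = U[:, :2] * s[:2]   # each document as a point in the 2-dimensional LSI space
terms_2d = Vt[:2].T          # each term's loading on the two latent "topics"
```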

LSI

The top 2 PCs make new pseudo-terms that define the documents…

Also, we can look at the first two principal component directions:
  (0.74, 0.49, 0.27, 0.28, 0.18, 0.19) → emphasizes the first two terms
  (−0.28, −0.24, −0.12, 0.74, 0.37, 0.31) → separates the two clusters

LSI

• Here we show the same plot, but with two new documents, one with the term “SQL” 50 times, another with the term “Databases” 50 times. • Even though they have no phrases in common, they are close in LSI space

9.6 Textual Analysis

Textual analysis

• Once we have the data in a nice matrix representation (TD, TD×IDF, or LSI), we can throw the data mining toolbox at it:
  – Classification of documents
    • if we have training data for classes
  – Clustering of documents
    • unsupervised
  – Visual analytics methods
    • MDS

Automatic Document Classification

• Motivation
  – Automatic classification for the tremendous number of on-line text documents (Web pages, e-mails, etc.)
  – Customer comments: requests for info, complaints, inquiries
• A classification problem
  – Training set: human experts generate a training data set
  – Classification: the computer system discovers the classification rules
  – Application: the discovered rules can be applied to classify new/unknown documents
• Techniques
  – Linear/logistic regression, naïve Bayes
  – Trees are not so good here due to the massive dimension and few interactions
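A minimal sketch of such a classifier with scikit-learn (assuming it is installed); the tiny training set is purely illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set: customer comments labelled as complaint or inquiry
train_texts = ["the product arrived broken and support never answered",
               "how do I reset my password",
               "terrible service, I want a refund",
               "which plan includes more storage"]
train_labels = ["complaint", "inquiry", "complaint", "inquiry"]

model = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
model.fit(train_texts, train_labels)
print(model.predict(["my order was damaged and nobody replies"]))  # likely ['complaint']
```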

Document Clustering

• We can also do clustering (unsupervised learning) of documents.
• Automatically group related documents based on their content.
• Requires no training sets or predetermined taxonomies.
• Major steps
  – Preprocessing
    • remove stop words, stem, feature extraction, lexical analysis, …
  – Hierarchical clustering
    • compute similarities, apply clustering algorithms, …
• Like all clustering examples, success is relative
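A minimal sketch of hierarchical clustering of document vectors with SciPy (assuming it is installed); the four vectors are rows of the example matrix:

```python
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

doc_vectors = [[24, 21, 9, 0, 0, 3], [32, 10, 5, 0, 3, 0],   # database-flavoured documents
               [0, 0, 1, 32, 12, 0], [1, 0, 0, 34, 27, 25]]  # regression-flavoured documents

distances = pdist(doc_vectors, metric="cosine")     # pairwise cosine distances
tree = linkage(distances, method="average")         # agglomerative clustering
print(fcluster(tree, t=2, criterion="maxclust"))    # e.g. [1 1 2 2] – two topic clusters
```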

Document Clustering

• Latent Dirichlet Allocation (LDA) is a generative probabilistic model of a corpus. Documents are represented as random mixtures over latent topics, where a topic is characterized by a distribution over words.
• Three concepts: words, topics, and documents
• Documents are a collection of words and have a probability distribution over topics
• Topics have a probability distribution over words
• Fully Bayesian model
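A minimal sketch of fitting an LDA topic model with scikit-learn (assuming it is installed); the corpus and the number of topics are illustrative:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["database index sql query table join",
          "sql database transaction index storage",
          "regression likelihood linear model estimate",
          "linear regression residual likelihood fit"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)                  # term-document count matrix
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):                # per-topic word weights
    top = [terms[i] for i in topic.argsort()[-3:][::-1]]
    print(f"topic {k}:", top)                              # e.g. database/sql vs. regression terms
print(lda.transform(counts[:1]).round(2))                  # document 0's topic mixture
```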

LDA example (figures)

9.7 Visualization

Visualizing one document

• Word clouds (tag clouds): Size encodes frequency (word counts)
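A minimal sketch using the third-party `wordcloud` package (an assumption, any tag-cloud library would do; the file names are illustrative):

```python
from wordcloud import WordCloud  # pip install wordcloud (third-party package)

text = open("speech.txt", encoding="utf-8").read()   # any document as plain text (illustrative file)
cloud = WordCloud(width=800, height=400, background_color="white").generate(text)
cloud.to_file("word_cloud.png")                       # word size encodes frequency
```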

Visualizing one document

• Different layouts and styles possible:

Visualizing Derived Properties

• Sentiment Analysis:
  – Automatically determine the tone of a text: positive, negative or neutral
  – Typically uses collections of good and bad words (cf. http://neuro.imm.dtu.dk/wiki/Text_sentiment_analysis)
  – (Example: an annotated news passage about McCain, not reproduced here)
  – Often fit using Naïve Bayes
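A minimal sketch of such a word-list based scorer; the two word lists are illustrative, not a real sentiment lexicon:

```python
import re

POSITIVE = {"good", "great", "excellent", "love", "perfect"}
NEGATIVE = {"bad", "terrible", "lie", "bizarre", "hate"}

def sentiment(text):
    """Score = #positive words - #negative words; the sign gives the tone."""
    words = re.findall(r"[a-z]+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("a great and perfect example"))  # positive
print(sentiment("a bizarre, terrible lie"))      # negative
```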

Visualizing Derived Properties

• Sentiment Analysis on Social Media Data:

Visualizing Document Collections

• CBR comprises 680 documents in four different topics, with the number of documents unbalanced between labels.
• A bag-of-words representation has been created with 1,423 terms.
• Similarly, the KDViz documents have been collected from an Internet repository, again addressing four topics, with 1,624 objects, 520 dimensions and four highly unbalanced labels.

Visualizing Document Collections

• CBR: 680 documents. 1,423 terms. • KDViz: 1,624 documents. 520 terms.

Visualizing Document Collections

• Classification leads to labeled data:

Visualizing Document Collections

• Visualization-based interactive clustering:

9.8 Natural Language Processing

Natural Language Processing

Example sentence: "A dog is chasing a boy on the playground"

• Lexical analysis (part-of-speech tagging): Det Noun Aux Verb Det Noun Prep Det Noun
• Syntactic analysis: noun phrases, verb phrases, and a prepositional phrase are combined into a parse tree for the sentence
• Semantic analysis: Dog(d1). Boy(b1). Playground(p1). Chasing(d1, b1, p1).
• Inference: Scared(x) if Chasing(_, x, _). ⇒ Scared(b1)
• Pragmatic analysis (speech act): a person saying this may be reminding another person to get the dog back

General NLP—Too Difficult!

• Word-level ambiguity
  – "design" can be a noun or a verb (ambiguous POS)
  – "root" has multiple meanings (ambiguous sense)
• Syntactic ambiguity
  – "natural language processing" (modification)
  – "A man saw a boy with a telescope." (PP attachment)
• Anaphora resolution
  – "John persuaded Bill to buy a TV for himself." (himself = John or Bill?)
• Presupposition
  – "He has quit smoking." implies that he smoked before.

Humans rely on context to interpret (when possible). This context may extend beyond a given document!

Shallow Linguistics

Progress on useful sub-goals:
• English Lexicon
• Part-of-Speech Tagging
• Word Sense Disambiguation
• Phrase Detection / Parsing

WordNet

An extensive lexical network for the English language
• Contains over 138,838 words.
• Several graphs, one for each part-of-speech.
• Synsets (synonym sets), each defining a semantic sense.
• Relationship information (antonym, hyponym, meronym, …)
• Downloadable for free (UNIX, Windows)

Example: the synset around "wet" (watery, moist, damp) is linked by an antonym relation to the synset around "dry" (parched, arid, anhydrous).

Part-of-Speech Tagging

Training data (annotated text):
  "This sentence serves as an example of annotated text…"  →  Det N V1 P Det N P V2 N

New sentence: "This is a new sentence."  →  POS Tagger  →  Det Aux Det Adj N

Pick the most likely tag sequence t1, …, tk for the words w1, …, wk:

  p(w1, …, wk, t1, …, tk) = p(t1|w1) … p(tk|wk) · p(w1) … p(wk)        (independent assignment, most common tag)

or

  p(w1, …, wk, t1, …, tk) = ∏_{i=1}^{k} p(wi|ti) · p(ti|ti−1)          (partial dependency, HMM)
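A minimal sketch using NLTK's pre-trained tagger (an assumption; the resource names may differ slightly between NLTK versions):

```python
import nltk

nltk.download("punkt", quiet=True)                        # tokenizer model
nltk.download("averaged_perceptron_tagger", quiet=True)   # pre-trained POS tagger

tokens = nltk.word_tokenize("A dog is chasing a boy on the playground")
print(nltk.pos_tag(tokens))
# e.g. [('A', 'DT'), ('dog', 'NN'), ('is', 'VBZ'), ('chasing', 'VBG'), ...]
```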

Data Analytics 371 Word Sense Disambiguation ? “The difficulties of computational linguistics are rooted in ambiguity.” N Aux V P N Supervised Learning Features: • Neighboring POS tags (N Aux V P N) • Neighboring words (linguistics are rooted in ambiguity) • Stemmed form (root) • Dictionary/Thesaurus entries of neighboring words • High co-occurrence words (plant, tree, origin,…) • Other senses of word within discourse Algorithms: •Rule-based Learning • Statistical Learning (i.e. Naïve Bayes) • Unsupervised Learning (i.e. Nearest Neighbor) Jacobs University Visualization and Computer Graphics Lab

Data Analytics 372 Parsing: Choose most likely tree

Probabilistic CFG for "the dog is chasing a boy on the playground":

Grammar:                      Lexicon:
  S → NP VP        1.0          V → chasing      0.01
  NP → Det BNP     0.3          Aux → is
  NP → BNP         0.4          N → dog          0.003
  NP → NP PP       0.3          N → boy
  BNP → N                       N → playground
  VP → V                        Det → the
  VP → Aux V NP                 Det → a
  VP → VP PP                    P → on
  PP → P NP        1.0

The grammar admits two parse trees for the sentence, with probabilities 0.000015 and 0.000011 – the parser chooses the more likely one.

9.9 Assignment

Assignment 7

• Download the Jacobs University news archives:
  http://www.jacobs-university.de/news-archive-2000-2008
  http://www.jacobs-university.de/news-archive-2009-2015
• Convert the HTML to text files and separate the individual news items. The individual news items serve as documents.
• Remove stop words and perform stemming.
• Perform a frequency analysis to compute the term-document (TD) matrix. What are the most common terms?
• Compute inverse document frequency (IDF) and term importance (TI). What are now the most common terms?
• Compute pairwise cosine and Euclidean distances between all documents.
• Apply a multi-dimensional scaling approach to the distance matrix and render a 2D scatterplot. Compare the two distance metrics.
• Bonus: Capture the year of release during parsing and color-code the scatterplot by time. Produce a word cloud for each year.
