9. Text and Document Analytics
Jacobs University, Visualization and Computer Graphics Lab

Unstructured Data
• Data Mining / Knowledge Discovery operates on data ranging from structured data and multimedia to free text and hypertext. The same information can be represented in each form:
– Free text: "Frank Rizzo bought his home from Lake View Real Estate in 1992. He paid $200,000 under a 15-year loan from MW Financial."
– Hypertext: <a href>Frank Rizzo</a> bought <a href>this home</a> from <a href>Lake View Real Estate</a> in <b>1992</b>. <p>...
– Structured data: HomeLoan ( Loanee: Frank Rizzo, Lender: MWF, Agency: Lake View, Amount: $200,000, Term: 15 years ), Loans($200K, [map], ...)

Unstructured Data
• There are many examples of text-based documents (all in 'electronic' format…)
– e-mails, corporate Web pages, customer surveys, résumés, medical records, DNA sequences, technical papers, incident reports, news stories, etc.
• Not enough time or patience to read them all
– Can we extract the most vital kernels of information?
• Find a way to gain knowledge (in summarised form) from all that text, without reading or examining it fully first

Tasks

9.1 Text Mining

Text Mining
• Typically falls into one of two categories:
– Analysis of text: I have a bunch of text I am interested in; tell me something about it
• E.g. sentiment analysis, "buzz" searches
– Retrieval: there is a large corpus of text documents, and I want the one closest to a specified query
• E.g.
web search, library catalogs, legal and medical precedent studies

Text Mining: Analysis
• Which words are most present?
• Which words are most surprising?
• Which words help define the document?
• What are the interesting text phrases?

Text Mining: Retrieval
• Find the k objects in the corpus of documents which are most similar to my query.
• Can be viewed as "interactive" data mining: the query is not specified a priori.
• Main problems of text retrieval:
– What does "similar" mean?
– How do I know if I have the right documents?
– How can I incorporate user feedback?

9.2 Text Retrieval

Text Retrieval: Challenges
• Calculating similarity is not obvious: what is the distance between two sentences or queries?
• Evaluating retrieval is hard: what is the "right" answer? (no ground truth)
• Users can query things you have not seen before, e.g. misspelled, foreign, or new terms.
• The goal (score function) is different than in classification/regression: we are not looking to model all of the data, just to get the best results for a given user.
• Words can hide semantic content
– Synonymy: a keyword T does not appear anywhere in the document, even though the document is closely related to T, e.g. "data mining"
– Polysemy: the same keyword may mean different things in different contexts, e.g. "mining"

Basic Measures for Text Retrieval
• Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., "correct" responses)

  precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|

• Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved

  recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|

Precision vs. Recall

                            Truth: Relevant    Truth: Not Relevant
  Algorithm: Relevant             TP                   FP
  Algorithm: Not Relevant         FN                   TN

– Precision = TP / (TP + FP)
– Recall = TP / (TP + FN)
– Trade-off:
• If the algorithm is 'picky': precision high, recall low
• If the algorithm is 'relaxed': precision low, recall high
– BUT: recall is often hard, if not impossible, to calculate

Precision-Recall Curves
• If we have a labeled training set, we can calculate recall.
• For any given number of returned documents, we can plot a point for precision vs. recall.
• Different retrieval algorithms might have very different curves; it is hard to tell which is "best".

9.3 Frequency Analysis

Bag-of-Tokens Approaches
Document:
"Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or …"
Feature extraction yields a token set:
nation – 5, civil – 1, war – 2, men – 2, died – 4, people – 5, Liberty – 1, God – 1, …
Loses all order-specific information! Severely limits context!

Term-document Matrix
• The most common form of representation in text mining is the term-document matrix
– Term: typically a single word, but could be a word phrase like "data mining"
– Document: a generic term meaning a collection of text to be retrieved
– Can be large: term lists are often 50k entries or larger, and there can be billions of documents (www)
– Can be binary, or use counts

Term-document Matrix
Example: 10 documents, 6 terms

        Database  SQL  Index  Regression  Likelihood  linear
  D1        24     21      9           0           0       3
  D2        32     10      5           0           3       0
  D3        12     16      5           0           0       0
  D4         6      7      2           0           0       0
  D5        43     31     20           0           3       0
  D6         2      0      0          18           7       6
  D7         0      0      1          32          12       0
  D8         3      0      0          22           4       4
  D9         1      0      0          34          27      25
  D10        6      0      0          17           4      23

Each document now is just a vector of terms, sometimes Boolean.

Term-document Matrix
• We have lost all semantic content
• Be careful constructing your term list!
– Not all words are equal
– Words that are the same should be treated the same
• Stop words
• Stemming

Stop Words
• Many of the most frequently used words in English are worthless in retrieval and text mining; these words are called stop words.
– the, of, and, to, …
– Typically about 400 to 500 such words
– For an application, an additional domain-specific stop word list may be constructed
• Why do we need to remove stop words?
– Reduce indexing (or data) file size
• stop words account for 20-30% of total word counts
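As a sketch of this pipeline, the following Python snippet tokenizes documents, removes stop words, and builds a term-document count matrix. The tiny stop list and toy corpus are illustrative assumptions; real systems use lists of several hundred stop words.

```python
from collections import Counter

# Illustrative stop list; real applications use lists of ~400-500 words.
STOP_WORDS = {"the", "of", "and", "to", "a", "in", "is"}

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = (w.strip(".,!?\"'()").lower() for w in text.split())
    return [t for t in tokens if t and t not in STOP_WORDS]

def term_document_matrix(docs):
    """Return (terms, matrix) where matrix[i][j] counts terms[j] in docs[i]."""
    counts = [Counter(tokenize(d)) for d in docs]
    terms = sorted(set().union(*counts))        # sorted vocabulary
    matrix = [[c[t] for t in terms] for c in counts]
    return terms, matrix

docs = ["The cat sat on the mat.", "The dog sat."]
terms, tdm = term_document_matrix(docs)
```

Each row of the resulting matrix is a document vector of term counts, exactly the representation used in the 10-document example above; replacing each count with 0/1 gives the Boolean variant.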
– Improve effectiveness
• stop words are not useful for searching or text mining
• stop words always produce a large number of hits

Stemming
• Techniques used to find the root/stem of a word:
– E.g. user, users, used, using → stem: use; engineering, engineered, engineer → stem: engineer
• Usefulness:
– improving the effectiveness of retrieval and text mining
• matching similar words
– reducing indexing size
• combining words with the same root may reduce indexing size by as much as 40-50%

Basic stemming methods
• remove endings
– if a word ends with a consonant other than s, followed by an s, then delete the s.
– if a word ends in es, drop the s.
– if a word ends in ing, delete the ing unless the remaining word consists of only one letter or of "th".
– if a word ends with ed, preceded by a consonant, delete the ed unless this leaves only a single letter.
– …
• transform words
– if a word ends with "ies" but not "eies" or "aies", then replace "ies" with "y".

Feature Selection
• The performance of text classification algorithms can be optimized by selecting only a subset of the discriminative terms
– even after stemming and stop-word removal.
• Greedy search
– Start from the full set and delete one term at a time
– Find the least important variable
• Often performance does not degrade even with order-of-magnitude reductions

9.4 Comparisons

Distances of documents
• Interpret documents as strings.
• Then we can compute how closely two documents match each other using the Levenshtein distance.
• The Levenshtein distance between two words is the minimum number of single-character edits (i.e.
insertions, deletions or substitutions) required to change one word into the other.
• Formally, lev(|a|,|b|) is computed by the recurrence

  lev(i, j) = max(i, j)                          if min(i, j) = 0
  lev(i, j) = min( lev(i-1, j) + 1,
                   lev(i, j-1) + 1,
                   lev(i-1, j-1) + [a_i ≠ b_j] ) otherwise

  where [a_i ≠ b_j] is 1 if the characters differ and 0 otherwise.
• However, the documents we are looking into are typically not that close, and the Levenshtein distance would not produce meaningful results when comparing a larger set of documents.

Distances in TD matrices
• Given a term-document matrix representation, we can define distances between documents
• Elements of the matrix can be 0/1 or term frequencies (sometimes normalized)
• Can use Euclidean or cosine distance
• Cosine distance is derived from the angle between the two vectors: d_c(x, y) = (x · y) / (‖x‖ ‖y‖)
• If the docs are the same, d_c = 1; if they have nothing in common, d_c = 0
• Maybe not so intuitive, but it has been proven to work well for sparse data sets
• Term-document matrices are typically sparse

Distances in TD matrices
• We can calculate cosine and Euclidean
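The distance measures of this section can be sketched in a few lines of plain Python; this is a minimal, illustrative implementation in which document vectors are ordinary lists of term counts:

```python
import math

def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # row for the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[len(b)]

def cosine(x, y):
    """Cosine measure between term vectors x and y:
    1.0 for identical directions, 0.0 for no terms in common."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def euclidean(x, y):
    """Euclidean distance between term vectors x and y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
```

For example, levenshtein("kitten", "sitting") is 3, while cosine applied to two rows of a term-document matrix returns 1 for proportional documents and 0 for documents sharing no terms, matching the d_c behavior described above.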