
Text Analytics World Current Applications and Future Directions of Text Analytics

Tom Reamy
Chief Knowledge Architect, KAPS Group
Program Chair – Text Analytics World
Knowledge Architecture Professional Services
http://www.kapsgroup.com

Agenda

§ Introduction:
– Current State of Text Analytics
– Survey / Discussion Themes
§ Enterprise Text Analytics
– Search – still fundamental
– Shift from information to business
§ Social Media – Next Generation
– Text Analytics and CRM
§ Integration – Text and Data, Enterprise and Social
§ Future of Text Analytics – Roadblocks, Deep Vision
§ Questions

Introduction: KAPS Group

§ Knowledge Architecture Professional Services – Network of Consultants
§ Applied Theory – Faceted taxonomies, complexity theory, natural categories, emotion taxonomies
§ Services:
– Strategy – IM & KM - Text Analytics, Social Media, Integration
– Taxonomy/Text Analytics development, consulting, customization
– Text Analytics Quick Start – Audit, Evaluation, Pilot
– Social Media: Text based applications – design & development
§ Partners – SAS, Smart Logic, Expert Systems, SAP, IBM, FAST, Concept Searching, Attensity, Clarabridge, Lexalytics
§ Projects – Portals, taxonomy, Text analytics – news, expertise location, information strategy, text analytics evaluation, Quick Start in Text A.
§ Clients: Genentech, Novartis, Northwestern Mutual Life, Financial Times, Hyatt, Home Depot, Harvard Business Library, British Parliament, Battelle, Amdocs, FDA, GAO, World Bank, etc.
§ Presentations, Articles, White Papers – www.kapsgroup.com

Current State of Text Analytics

§ History – academic research, focus on NLP
§ Inxight – out of Xerox PARC
– Moved TA from academic and NLP to auto-categorization, entity extraction, and Search-Meta Data
§ Explosion of companies – many based on Inxight extraction with some analytical-visualization front ends
– Half from 2008 are gone - Lucky ones got bought
§ Early applications – News aggregation and Enterprise Search
§ Second Wave = shift to sentiment analysis
§ Enterprise search – 30-50% of market ($1Bil)
§ Text Analytics is growing 20% a year, 10% of analytics
§ Fragmented market – no clear leader

Current State of Text Analytics: Vendor Space

§ Taxonomy Management – SchemaLogic, Pool Party
§ From Taxonomy to Text Analytics – Data Harmony, Multi-Tes
§ Extraction and Analytics – Linguamatics (Pharma), Temis, whole range of companies
§ Business Intelligence – Clear Forest, Inxight
§ Sentiment Analysis – Attensity, Lexalytics, Clarabridge
§ Open Source – GATE
§ Stand alone text analytics platforms – IBM, SAS, SAP, Smart Logic, Expert System, Basis, Open Text, Megaputer, Temis, Concept Searching
§ Embedded in Content Management, Search – Autonomy, FAST, Endeca, Exalead, etc.

Interviews with Leading Vendors, Analysts: Current Trends

§ From Mundane to Advanced – reducing manual labor to “Cognitive Computing”
§ Enterprise – Shift from Information to Business – cost cutting rather than productivity gains
§ Integration – data and text, text analytics and analytics
– Social Media – explosion of wild text, combine with data – customer browsing behavior, web analytics
§ – more focus on extraction (where it began) but categorization adds depth and sophistication
§ Shift away from IT – compliance, legal, advertising, CRM
§ US market different than Europe/Asia – project oriented


Enterprise Text Analytics

§ Search is still #1 = 30-50% of applications
§ New Standard Search – facets (more and more metadata), auto-categorization built on taxonomies, clustering
– Issue – consistent metadata, multiple content sources
§ Trend = Text Analytics/Search as Semantic Infrastructure
– Platform for Info Apps (Search-based applications)
§ SharePoint – Major focus of TA companies – fix problems with taxonomy/folksonomy
– Hybrid workflow – Publish document -> TA analysis -> suggestions for categorization, entities, metadata -> present to author
§ External information = more automation, extraction – precision more important
§ Use of predictive facets, enhanced relevance (FAST)

Enterprise Text Analytics Adding Structure to Unstructured Content

§ Beyond Documents – categorization by corpus, by page, sections or even sentence or phrase
§ Documents are not unstructured – variety of structures
– Sections – Specific - “Abstract” to Function “Evidence”
– Corpus – document types/purpose
– Textual complexity, level of generality
§ Need to develop flexible categorization and taxonomy – tweets to 200 page PDF
§ Applications require sophisticated rules, not just categorization by similarity

Enterprise Text Analytics Document Type Rules

§ (START_2000, (AND, (OR, _/article:"[Abstract]", _/article:"[Methods]"), (OR, _/article:"clinical trial*", _/article:"humans",
§ (NOT, (DIST_5, (OR, _/article:"approved", _/article:"safe", _/article:"use", _/article:"animals"),
§ If the article has sections like Abstract or Methods
§ AND has phrases around “clinical trials / Humans” and not words like “animals” within 5 words of “clinical trial” words – count it and add up a relevancy score
§ Primary issue – major mentions, not every mention
– Combination of noun phrase extraction and categorization
– Results – virtually 100%
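
The rule on this slide is only partially shown, so the following Python sketch illustrates the logic described in the bullets rather than the vendor's rule engine. It assumes START_2000 restricts matching to the first 2000 tokens and DIST_5 means "within five tokens"; the word sets, function names, and the simple count-based score are illustrative assumptions.

```python
import re

# Hypothetical stand-ins for the concepts named in the rule.
SECTIONS = {"abstract", "methods"}
EXCLUDE = {"approved", "safe", "use", "animals"}

def tokens(text, limit=2000):
    """Lowercased word tokens, cut to the first `limit` (assumed START window)."""
    return re.findall(r"[a-z]+", text.lower())[:limit]

def trial_hits(toks):
    """Positions of 'clinical trial*' phrases and of the word 'humans'."""
    hits = []
    for i, tok in enumerate(toks):
        if tok == "humans":
            hits.append(i)
        elif tok == "clinical" and i + 1 < len(toks) and toks[i + 1].startswith("trial"):
            hits.append(i)
    return hits

def score_document(text, start=2000, dist=5):
    """Count qualifying mentions; 0 means the rule does not fire."""
    toks = tokens(text, start)
    if not SECTIONS & set(toks):
        return 0  # no Abstract/Methods marker in the window
    score = 0
    for i in trial_hits(toks):
        window = toks[max(0, i - dist): i + dist + 1]
        if not EXCLUDE & set(window):
            score += 1  # a mention with no excluded word nearby
    return score
```

A higher count suggests a major mention rather than a passing one; a real deployment would tune the windows and word lists against labeled documents.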

Enterprise Text Analytics Building on the Foundation: Applications

§ Focus on business value, cost cutting
§ Enhancing information access is means, not an end
– Governance, Records Management, Doc duplication, Compliance
– Applications – Business Intelligence, CI, Behavior Prediction
– eDiscovery, litigation support
– Risk Management
– Productivity / Portals – spider and categorize, extract
– KM communities & knowledge bases
• New sources – field notes into expertise,
– capture real time, own language-concepts


Enterprise Text Analytics: Applications Pronoun Analysis: Fraud Detection; Enron Emails

§ Patterns of “Function” words reveal wide range of insights
§ Function words = pronouns, articles, prepositions, conjunctions, etc.
– Used at a high rate, short and hard to detect, very social, processed in the brain differently than content words
§ Areas: sex, age, power-status, personality – individuals and groups
§ Lying / Fraud detection: Documents with lies have:
– Fewer, shorter words, fewer conjunctions, more positive emotion words
– More use of “if, any, those, he, she, they, you”, less “I”
§ Current research – 76% accuracy in some contexts
– Italian – stylometry – linguistic hedges
§ Text Analytics can improve accuracy and utilize new sources
§ Data analytics (standard AML) can improve accuracy
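
As a rough illustration of the function-word signals listed above, the Python sketch below computes a few per-document rates. The word lists are taken from the slide where given and are otherwise assumptions, as are the function and key names; the output is a set of features for a downstream model, not a lie detector.

```python
import re
from collections import Counter

# First-person singular forms (assumed list; the slide only names "I").
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}
# Words the slide reports as more frequent in deceptive documents.
OTHER_FOCUS = {"if", "any", "those", "he", "she", "they", "you"}

def function_word_profile(text):
    """Return simple rates that a classifier could use as input features."""
    toks = re.findall(r"[a-z']+", text.lower())
    total = len(toks) or 1  # avoid division by zero on empty text
    counts = Counter(toks)
    return {
        "tokens": len(toks),
        "first_person_rate": sum(counts[w] for w in FIRST_PERSON) / total,
        "other_focus_rate": sum(counts[w] for w in OTHER_FOCUS) / total,
        "mean_word_length": sum(len(t) for t in toks) / total,
    }
```

Rates like these only become meaningful when compared against a baseline for the same author or corpus, which is why they are returned as raw features rather than a verdict.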

Social Media: Next Generation Beyond Simple Sentiment

§ Beyond Good and Evil (positive and negative)
– Degrees of intensity, complexity of emotions and documents
§ Importance of Context – around positive and negative words
– Rhetorical reversals – “I was expecting to love it”
– Issues of sarcasm, (“Really Great Product”), slanguage
§ Essential – need full categorization and concept extraction
§ New Taxonomies – Appraisal Groups – “not very good”
– Supports more subtle distinctions than positive or negative
§ Emotion taxonomies - Joy, Sadness, Fear, Anger, Surprise, Disgust
– New Complex – pride, shame, confusion, skepticism
§ New conceptual models, models of users, communities
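
To make the appraisal-group idea concrete, here is a minimal Python sketch that scores a short phrase from a head word plus its negators and intensifiers. The lexicons, weights, and the rule that a negated intensifier softens the reversal are all illustrative assumptions, not a published appraisal model.

```python
# Illustrative lexicons; every weight here is an assumption.
POLARITY = {"good": 1.0, "great": 1.5, "bad": -1.0, "terrible": -1.5}
INTENSIFIERS = {"very": 1.5, "really": 1.5}
NEGATORS = {"not", "never"}

def appraisal_score(phrase):
    """Score the first polarity word, adjusted by the modifiers before it."""
    negated = False
    strength = 1.0
    for word in phrase.lower().split():
        if word in NEGATORS:
            negated = not negated
        elif word in INTENSIFIERS:
            strength *= INTENSIFIERS[word]
        elif word in POLARITY:
            base = POLARITY[word]
            if negated:
                # "not very good" is treated as a milder reversal
                # than the bare "not good".
                return -base / strength
            return base * strength
    return 0.0  # no polarity word found
```

Under these assumptions "not very good" lands between neutral and "not good", a distinction a plain positive/negative label loses. The sketch still scores a sarcastic “Really Great Product” as strongly positive, which is the slide's point about needing context and full categorization.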


Social Media: Next Generation Behavior Prediction – Telecom Customer Service

§ Problem – distinguish customers likely to cancel from mere threats
§ Basic Rule
– (START_20, (AND, (DIST_7,"[cancel]", "[cancel-what-cust]"),
– (NOT,(DIST_10, "[cancel]", (OR, "[one-line]", "[restore]", "[if]")))))
§ Examples:
– customer called to say he will cancell his account if the does not stop receiving a call from the ad agency.
– cci and is upset that he has the asl charge and wants it off or her is going to cancel his act
§ More sophisticated analysis of text and context in text
§ Combine text analytics with Predictive Analytics and traditional behavior monitoring for new applications
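
The basic rule above can be approximated in a few lines of Python. This is a sketch under stated assumptions: START_20 is read as "first 20 tokens", DIST_n as "within n tokens", and the bracketed concepts are replaced by small hypothetical word sets (the real concept definitions are not shown on the slide).

```python
import re

# Hypothetical stand-ins for the bracketed concepts in the rule.
CANCEL = {"cancel", "cancell", "canceling", "cancelling"}
CANCEL_WHAT = {"account", "acct", "act", "service"}
EXCLUDE = {"if", "restore", "line"}  # rough proxies for [if], [restore], [one-line]

def within(toks, a_terms, b_terms, dist):
    """True if any a-term occurs within `dist` tokens of any b-term."""
    a_pos = [i for i, t in enumerate(toks) if t in a_terms]
    b_pos = [i for i, t in enumerate(toks) if t in b_terms]
    return any(abs(i - j) <= dist for i in a_pos for j in b_pos)

def likely_to_cancel(note, start=20):
    """Apply the AND / NOT structure of the rule to one call-center note."""
    toks = re.findall(r"[a-z]+", note.lower())[:start]
    return (within(toks, CANCEL, CANCEL_WHAT, 7)
            and not within(toks, CANCEL, EXCLUDE, 10))
```

With these assumptions, the first example note on the slide does not fire because "if" appears close to "cancell", which marks it as a conditional threat; an unconditional note such as "customer wants to cancel his account today" does fire.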

Social Media: Next Generation Variety of New Applications

§ Crowd Sourcing Technical Support
– User Forums – find problem area, nearby text for solution
– Automatic or Human mediated
§ Legal Review
– Significant trend – computer-assisted review (manual = too many)
– TA – categorize and filter to smaller, more relevant set
– Payoff is big – One firm with 1.6 M docs – saved $2M
§ Financial Services
– Trend – using text analytics with predictive analytics – risk and fraud
– Combine unstructured text (why) and transaction data (what)
– Customer Relationship Management, Fraud Detection
– Stock Market Prediction – Twitter, impact articles


Text Analytics: New Directions Integration

§ Text and Data, Internal and External, Enterprise and Social
§ Focus - multiple approaches are needed and multiple ways to combine
– Death to the Dichotomies – All of the Above
§ Massive parallelism or deeply integrated solution
– Example of Watson - fast filtering to get to best 100 answers, then deep analysis of 100
§ Role of automatic / human
§ CRM – struggle to connect to enterprise
– Have to learn to speak “enterprise”
§ Imply – Sentiment analysis focus for companies not enough
§ Enterprise and Social Media (Delve)
– Social Media analysis and news aggregation

Delve for the Web: The Front Page of

Social media data from Twitter powers recommendation algorithms.

Users follow topics, people, and companies selected from Delve taxonomies.

Text Analytics: New Directions - Integration Thinking Fast and Slow – Daniel Kahneman

§ System 1 – fast and automatic – little conscious control
§ Represents categories as prototypes – stereotypes
– Norms for immediate detection of anomalies – distinguish the surprising from the normal
– fast detection of simple differences, detect hostility in a voice, find best chess move (if a master)
– Priming / Anchoring – susceptible to systematic errors
– Biased to believe and confirm
– Focuses on existing evidence (ignores missing – WYSIATI)


Text Analytics: New Directions - Integration Thinking Fast and Slow

§ System 2 – Complex, effortful judgments and calculations
– System 2 is the only one that can follow rules, compare objects on several attributes, and make deliberate choices
– Understand complex sentences, validity of logical argument
– Focus attention – can make people blind to all else – Invisible Gorilla
§ Similar to traditional dichotomies – Tacit – Explicit, etc
§ Basic Design – System 1 is basic to most experiences, and System 2 takes over when things get difficult – conscious control
§ Text Analysis and Text Mining / Auto-Cat and TA Cat


Text Analytics: New Directions - Integration System 1 & 2 – and Text Analytics Approaches

§ “Automatic Categorization” – System 1 prototypes
– Limited value -- only works in simple environments
– Shallow categories with large differences
– Not open to conscious control
§ System 2 – categories – complex, minute differences, deep categories
§ Together:
– Choose one or other for some contexts
– Combine both – need to develop new kinds of categories and/or new ways to combine?


Text Analytics: New Directions - Integration Text Mining and Text Analytics

§ Text Analytics and Big Data enrich each other
– Data tells you what people did, TA tells you why
§ Text Analytics – pre-processing for TM
– Discover additional structure in unstructured text
– New variables for Predictive Analytics, Social Media Analytics
– New dimensions – 90% of information, 50% using Twitter analysis
§ Text Mining for TA
– Semi-automated taxonomy development
– Apply data methods, predictive analytics to unstructured text
– New Models – Watson ensemble methods, reasoning apps
§ Extraction – smarter extraction – sections of documents, Boolean, advanced rules – drug names, adverse events – major mention
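
The "new variables for Predictive Analytics" point can be sketched as follows: text-derived flags are appended to each structured record so a downstream model sees both what the customer did and hints of why. The field names, topic word sets, and function names below are hypothetical; a production system would use full categorization rules rather than keyword sets.

```python
import re

# Hypothetical topic vocabularies standing in for real categorization rules.
TOPIC_TERMS = {
    "mentions_cancel": {"cancel", "canceling", "cancelling"},
    "mentions_billing": {"bill", "billing", "charge", "charges"},
}

def text_features(note):
    """Turn a free-text note into 0/1 variables, one per topic."""
    toks = set(re.findall(r"[a-z]+", note.lower()))
    return {name: int(bool(toks & terms)) for name, terms in TOPIC_TERMS.items()}

def enrich(record):
    """Merge the structured fields with features derived from the 'note' field."""
    row = {k: v for k, v in record.items() if k != "note"}
    row.update(text_features(record.get("note", "")))
    return row
```

For example, a record with `monthly_spend` and a note reading "Upset about billing, wants to cancel" would come back with both flags set to 1 alongside the original numeric fields, ready to feed into whatever predictive model is in use.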


Text Analytics: New Directions - Integration Integration – Text Analytics and CRM

§ Overall – growing demand for natural language processing, TA
– Identify when a customer is angry or at risk of closing an account
– Growth of regulatory compliance requirements is driving
– Used to understand why people call and whether they were satisfied with the quality of the experience, diagnose issues and address them
– Combine with Web analytics – need an integrated system
§ Contact Center Search – searching and analyzing customer data across multiple channels
– Integration – Salesforce, Coveo, eGain, InQuira
§ Enterprise Feedback Management – want to track satisfaction and loyalty – issue of unstructured content social media, multimedia channels
§ Contact Center Infrastructure – Importance of Cloud based – Services and Infrastructure
– Need Semantic Infrastructure
– Cisco – Packaged Contact Center Enterprise
§ Web Support – virtual agents – deliver one answer to a customer’s question, not search results list
– Missing – integrated knowledge management system

Future of Text Analytics Obstacles - Survey Results

§ What factors are holding back adoption of TA?
– Lack of clarity about TA and business value - 47%
– Lack of senior management buy-in - 8.5%
§ Need articulated strategic vision and immediate practical win
§ Issue – TA is strategic, US wants short term projects
– Sneak Project in, then build infrastructure – difficulty of speaking enterprise
§ Integration Issue – who owns infrastructure? IT, Library, ?
– IT understands infrastructure, but not text
– Need interdisciplinary collaboration – Stanford is offering English-Computer Science Degree – close, but really need a library-computer science degree


Future of Text Analytics Primary Obstacle: Complexity

§ Usability of software is one element
§ More important is difficulty of conceptual-document models
– Language is easy to learn, hard to understand and model
§ Need to add more intelligence (semantic networks) and ways for the system to learn – social feedback
§ Customization – Text Analytics – heavily context dependent
– Content, Questions, Taxonomy-Ontology
– Level of specificity – Telecommunications
– Specialized vocabularies, acronyms

New Directions in Text Analytics Conclusions

§ Text Analytics is growing out (20%) and up – more mature applications and techniques
§ Find the right balance of infrastructure and application focus
§ Essential theme – integration – text and data, enterprise and social
§ Big obstacles remain
– Strategic Vision of text analytics in the enterprise
– Concrete and quick application to drive acceptance
§ Future – Women, Fire, and Dangerous Things
– Text Analytics and = Metaphor Analysis, deep language understanding, common sense?

Questions?

Tom Reamy
[email protected]
KAPS Group
http://www.kapsgroup.com

Upcoming:
– Text Analytics World SF - 2015
– Workshop on Text Analytics: Enterprise Search Summit – New York, May 12-14
– Taxonomy Boot Camp, ESS, KMWorld - DC, Nov 4-7
Fall Announcement!