
September 4, 2014 UNSTRUCTURED EXTRACTION VIA NATURAL LANGUAGE PROCESSING (NLP)

Presented by Alex Wu, Partner, Sagence, Inc.

2nd Annual INFORMS Midwest Practice of Conference

University of Chicago’s Gleacher Center

CONFIDENTIAL

Agenda

Intro ‒ About Sagence

Why NLP?

What is NLP?

Use cases & patterns

Q&A

ABOUT SAGENCE

Sagence is a specialized firm designed to help organizations drive their businesses with information and insight through better data and analytics practices

Our Value

. Improving Decision Making & Innovation: Helping clients create better business outcomes by using their data more effectively
. Asking the Right Questions: Identifying the high impact analytics for a client’s combination of industry, marketplace, and strategy
. Applying Decision Science: Generating hypotheses, collecting data, analyzing, testing, and informing new decisions
. Developing Information Assets: Identifying, unleashing, delivering and maintaining the hidden value in data
. Delivering Innovative Solutions: Helping clients with pragmatic innovation that acknowledges business, technical, economic, and cultural constraints

Our People

. Multi-disciplinary teams
. Led by professionals with 25+ years of experience
. Providing objective thinking combined with an interdisciplinary perspective
. Working collaboratively with clients
. Focused on business results

Our Position

[Chart: Strategic Advice / Objectivity (Low to High) plotted against Services (Generalist to Specialist)]

. Data Specialists
. Thinkers and Doers
. Strategic and Objective

We assist companies with everything from broad strategic challenges to tackling point problems with expert resources and solutions

INTRO

Quick Survey on Natural Language Processing (NLP)

Where would you rate your NLP knowledge?

. NEW TO NLP: Start with the fundamentals
. FAMILIAR WITH NLP: Experience with tools and techniques
. SUBJECT MATTER EXPERT: In-depth discussions

INTRO

This presentation describes a solution to a commercial problem

WHAT WE WILL COVER

. Commercial problem and context
. Survey of frameworks/tools
. Select challenges encountered along with implemented solutions

WHAT WE WON’T

. Discussion on pros/cons of theory
. Detailed analysis of solutions
. In-depth discussion on topics in Natural Language Processing

Agenda

Intro

Why NLP?

‒ Growth of unstructured data
‒ Existing solutions & inefficiencies
‒ Value proposition

What is NLP?

Use cases & patterns

Q&A

WHY NLP?

The rapid growth of unstructured data magnifies the importance of finding the business value in that unstructured information

[Chart: Growth of Unstructured Data, 2007 to 2014, comparing Structured Data and Unstructured Data. CAGR for unstructured data: 61.8%. CAGR for structured data: 23.7%.]

WHY NLP?

Utilizing Natural Language Processing (NLP) as an approach to distilling business value from unstructured data yields many benefits

BIG DATA

. Volume
. Variety
. Velocity

NLP

. Consistency
. Accuracy
. Reuse
. Flexibility
. Knowledgebase (train once, use many times)

WHY NLP?

Existing solutions largely fail at addressing the challenge of distilling business value from unstructured data

Unstructured Landscape

. Operational inefficiencies & inability to respond to emerging business needs
. Fragmented data management solutions & issues
. Complex tool landscape

VS

Continuum of Techniques (from Manual Entry to Highly Automated)

. Manual Data Entry
. Template Driven
. Advanced Search
. Natural Language Processing (NLP)

WHY NLP?

In one client example, utilizing NLP realized large efficiency gains compared to existing manual processes

[Charts: NLP versus Manual for Processing Speed (minutes), Accuracy - Extraction (%), Accuracy - Classification (%), and Consistency (%). Bar labels as extracted: 240 and 2 (minutes); 100%, 100%, 90%, 90%, 80%, and LOW.]

Agenda

Intro

Why NLP?

What is NLP?

‒ Definition
‒ Approach
‒ Tools & Landscape

Use cases & patterns

Q&A

WHAT IS NLP?

Some Definitions

“By speech and language processing, we have in mind those computational techniques that process spoken and written human language, as language.” ~ Daniel Jurafsky and James H. Martin [1]

. Phonetics and Phonology – The study of linguistic sounds.
. Morphology – The study of the meaningful components of words.
. Syntax – The study of the structural relationships between words.
. Semantics – The study of meaning.
. Pragmatics – The study of how language is used to accomplish goals.
. Discourse – The study of linguistic units larger than a single utterance.

[Diagram: Speech, Phonology, Morphology, Syntax, Semantics, Logic, with Lexicon and Pragmatics]

[1] Speech and Language Processing - An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition

WHAT IS NLP?

NLP Steps

Pre-processing

. Convert content to images and use OCR to convert images to text
. Utilize proprietary algorithms to adjust for artifacts and discrepancies
. Differentiate discourse from tabular structures and identify financial table boundaries

Syntax

. Tokenize text, assign X,Y coordinate structure to each token and assemble as in-memory data grid as reference
. Perform Part-of-Speech (POS) tagging and assemble Treebank (grammatical relationships of tokens within a sentence)
. Traverse data grid and assemble ‘sentence’ structure for each data point

Semantics

. Normalize extracted terms to key concepts
. Apply Named Entity Recognition (NER) to ‘sentence’ to classify tokens
. Use Semantic Vector rules (SV rules) to identify key mnemonics

Pragmatics

. Assign semantic Predicate/Subject/Object (PSO) relationship to relate data point (Money) with surrounding context (Legal Entity, Statement Type, Statement Date, Line Item, Fund…etc.)
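
The Semantics and Pragmatics steps can be illustrated with a short Python sketch. It is not the engine described in this deck: the `KEY_CONCEPTS` table, the `Triple` class, the `has<Field>` predicate names, and the sample values are all hypothetical and chosen only to show term normalization followed by Predicate/Subject/Object assignment.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical synonym table mapping surface forms to key concepts.
KEY_CONCEPTS: Dict[str, str] = {
    "total revenues": "REVENUE",
    "net sales": "REVENUE",
    "cash and cash equivalents": "CASH",
}


@dataclass(frozen=True)
class Triple:
    predicate: str  # relationship name, e.g. "hasLineItem"
    subject: str    # the data point (Money)
    obj: str        # one piece of surrounding context


def normalize(term: str) -> str:
    """Map an extracted term to its key concept, or UNKNOWN if unmapped."""
    cleaned = " ".join(term.lower().split())  # lowercase, collapse whitespace
    return KEY_CONCEPTS.get(cleaned, "UNKNOWN")


def relate(value: str, context: Dict[str, str]) -> List[Triple]:
    """Build one triple per context field for a single money value."""
    return [Triple(f"has{name}", value, ctx) for name, ctx in sorted(context.items())]


triples = relate(
    "1,250",
    {"LineItem": normalize("Net  Sales"), "StatementDate": "2013-12-31"},
)
```

Each triple ties the same money value to one piece of context, so a downstream consumer can look up a value by line item, statement date, or any other field without re-reading the document.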

WHAT IS NLP?

NLP tools and landscape

[Chart: NLP tools plotted by Comprehensiveness against Maturity. Only the NLTK label survives in the extracted text.]

The landscape consists of both open source and commercial solutions. No single solution addresses the full data lifecycle.

Agenda

Intro

Why NLP?

What is NLP?

Use cases & patterns

‒ Use cases
‒ Solution

Q&A

USE CASES & PATTERNS

The solution serves as the enterprise data collection and management pipeline into the ratings analytical process

. Global credit ratings agency
. 450 person organization dedicated to data collection and governance
. Automation of data collections results in ~$20 million in savings over five years

[Diagram: Data Collection, Data Governance, Ratings Process, Publish, Surveillance]

Data Lifecycle for Financial

USE CASES & PATTERNS

Document artifacts require preprocessing prior to NLP processing

. Table spans across multiple pages
. Poorly scanned documents
. Complex nested line items and subsections

USE CASES & PATTERNS

The solution must detect tables, identify boundaries, parse tokens and assign coordinates

[Figure: annotated financial table with the following callouts]

. Table Boundaries: denote as table origin; denote as table end
. Text: assign x, y coordinate
. Numerical Values: assign x, y coordinate; identify null values; differentiate from table end
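
The callouts on this slide can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the proprietary algorithm: tokens are assumed to arrive with integer x, y coordinates already assigned, and the `Token` type, the `NUMERIC` pattern, and the `NULL_MARKERS` set are hypothetical. The table is taken to be the first contiguous run of rows that hold numeric values or null markers, so a row of null markers stays inside the table rather than ending it.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class Token(NamedTuple):
    text: str
    x: int  # column position (assumed already derived from page layout)
    y: int  # row position


# Hypothetical patterns: amounts such as 1,250 / $980 / (300), and null markers.
NUMERIC = re.compile(r"^\(?\$?\d[\d,]*(\.\d+)?\)?$")
NULL_MARKERS = {"-", "--", "n/a"}


def is_value(tok: Token) -> bool:
    """True for numeric cells and for cells holding a null marker."""
    return bool(NUMERIC.match(tok.text)) or tok.text.lower() in NULL_MARKERS


def parse_value(text: str) -> Optional[float]:
    """Convert a cell to a number; null markers become None, (300) becomes -300."""
    if text.lower() in NULL_MARKERS:
        return None
    negative = text.startswith("(") and text.endswith(")")
    number = float(text.strip("()").lstrip("$").replace(",", ""))
    return -number if negative else number


def table_bounds(tokens: List[Token]) -> Optional[Tuple[int, int]]:
    """Return (origin_y, end_y) of the first contiguous run of rows holding values."""
    rows_with_values = sorted({t.y for t in tokens if is_value(t)})
    if not rows_with_values:
        return None
    origin = end = rows_with_values[0]
    for y in rows_with_values[1:]:
        if y != end + 1:  # a gap in row numbers ends the table
            break
        end = y
    return origin, end


tokens = [
    Token("Balance", 0, 0), Token("Sheet", 1, 0),
    Token("Cash", 0, 1), Token("1,250", 1, 1), Token("980", 2, 1),
    Token("Goodwill", 0, 2), Token("-", 1, 2), Token("(300)", 2, 2),
    Token("See", 0, 3), Token("notes", 1, 3),
]
bounds = table_bounds(tokens)
```

For the sample tokens, `table_bounds` returns `(1, 2)`: the title row and the trailing note row fall outside the table, while the row containing the null marker `-` stays inside it. A real document would also need to handle header rows of years, which this pattern would treat as values.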

USE CASES & PATTERNS

Targeted extraction of data points within table structures requires preservation of data lineage and context

[Figure: financial table with numbered callouts (1 to 8) marking the Row Context, the Column Context, and the Targeted Value]

By combining the Row Context + Value + Column Context, a ‘sentence’ is formed for NLP
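
As a minimal sketch of this idea in Python, assume the table has already been loaded into a grid keyed by (x, y) coordinates, with a single header row and a single label column. The `Grid` type, the `build_sentence` function, and the sample cells are hypothetical. The slide shows several nested row and column contexts, which this simplified version does not attempt to handle.

```python
from typing import Dict, Tuple

Grid = Dict[Tuple[int, int], str]  # (x, y) -> token text


def build_sentence(grid: Grid, x: int, y: int,
                   header_row: int = 0, label_col: int = 0) -> str:
    """Join row context, targeted value and column context into one 'sentence'."""
    row_context = grid.get((label_col, y), "")      # label at the start of the row
    column_context = grid.get((x, header_row), "")  # header above the value
    value = grid[(x, y)]
    return " ".join(part for part in (row_context, value, column_context) if part)


grid: Grid = {
    (1, 0): "FY2013", (2, 0): "FY2012",
    (0, 1): "Total revenue", (1, 1): "1,250", (2, 1): "980",
}
sentence = build_sentence(grid, x=2, y=1)
```

Here `sentence` is `"Total revenue 980 FY2012"`, a flat string that carries the value together with its lineage and can be passed to sentence-level NLP steps such as entity recognition.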

USE CASES & PATTERNS

Financial data extraction requires precise targeting of content with related context and data lineage

. Automatically crawl and retrieve documents
. Extract and classify key mnemonics with preservation of context and lineage

USE CASES & PATTERNS

Other use cases include legal discovery and document data mining

. ~$500 trillion in notional amounts outstanding [1]
. Extract and classify key clauses based on NLP

[1] http://www.isda.org/statistics/historical.html

USE CASES & PATTERNS

Other use cases include legal discovery and document data mining

. Automatically crawl and retrieve online postings
. Extract and classify demand based on

[Figure: Extraction Results]

USE CASES & PATTERNS

A solution that addresses the full data lifecycle consists of four primary components

Sourcing

. Crawl
. Upload
. Feeds

Extraction

. Preprocessing
. NLP Processing
. Post Processing

Workflow & QA

. Assignment
. Review

Distribution

. File
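
One way to picture how the four components chain together is a list of stage functions applied in order. The Python sketch below is illustrative only: the `Document` record, the stage bodies, and the sample text are hypothetical stand-ins, not the architecture of the solution itself.

```python
from typing import Callable, Dict, List

Document = Dict[str, object]
Stage = Callable[[Document], Document]


def source(doc: Document) -> Document:
    # Stand-in for crawl / upload / feeds: attach raw text to the record.
    return {**doc, "text": "Total revenue 980 FY2012"}


def extract(doc: Document) -> Document:
    # Stand-in for preprocessing, NLP processing and post processing.
    tokens = str(doc["text"]).split()
    values = [t for t in tokens if t.replace(",", "").isdigit()]
    return {**doc, "values": values}


def review(doc: Document) -> Document:
    # Stand-in for assignment and review: flag records with nothing extracted.
    return {**doc, "needs_review": not doc["values"]}


def distribute(doc: Document) -> Document:
    # Stand-in for file distribution: publish only records that passed review.
    return {**doc, "published": not doc["needs_review"]}


PIPELINE: List[Stage] = [source, extract, review, distribute]


def run(doc: Document) -> Document:
    """Pass the record through every stage in order."""
    for stage in PIPELINE:
        doc = stage(doc)
    return doc


result = run({"id": "doc-1"})
```

Keeping each stage as a function with the same signature means a stage can be replaced or reordered without touching the others.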

USE CASES & PATTERNS

The Sagence solution offers end-to-end integration with enterprise systems to allow data usability and exchange

[Architecture diagram. Components: Document; File Mount; Web/FTP Sourcing; Front End UI; Backend Engine (JAVA + Other Open Source Components); NLP Engine; Data Services Layer; XML Distribution; FTP; Target System; Entity Referential Data Services; Document Repository Services; Enterprise System; Entity Data Store; Document Repository; Relational Database. Legend: NLP Core Product.]

BUSINESS CAPABILITIES:

. End-to-end workflow
. Handle multiple extraction use cases within a single document
. Configurable user dictionary
. Capture document context using metadata tagging
. User configurable rules and knowledge base
. Multiple data sourcing options (Web, FTP, internal feeds, manual upload)
. Multiple document types (PDF, scanned documents)
. Pre-extraction document data cleanup and adjustment
. User to manage workflow and ensure quality assurance
. Data distribution to key systems and consumers
. Stores documents within content management system

TECHNICAL FEATURES:

. Modular architecture
. Open source components
. Extensibility to additional analytical tools
. Integrated data sourcing and retrieval (Web, FTP)
. Integration with reference data systems and custom feeds
. Intelligent algorithms for data preprocessing
. Distribution via XML or data services
. Customizable load balancing and work routing
. Conforms with enterprise standards
. Deployable via custom environments, virtual machines, or cloud-based platforms

THANK YOU

Alex Wu | Partner | Sagence
mobile: +1.860.680.7913
[email protected]

Roger Moore | Senior Principal | Sagence
mobile: +1.312.543.1319
[email protected]

http://info.sagenceconsulting.com/natural-language-processing-isda

APPENDIX

The NLP engine utilizes out-of-the-box as well as user defined lexicons and rules to identify, classify, and extract targeted values

[Diagram: layers of the NLP engine]

. User defined rules: RuleSet.xml, IntraTokenRules.xml, MultiTokenRules.xml
. User defined dictionaries: User.xml
. Out-of-the-box dictionaries: Lexicon(s) (Override)
. Entities, Part-Of-Speech and Pragmatics definitions: TokenDefs.xml
. Rosoka Core Engine*

* http://rosoka.com/home/products/rosoka/rosoka-nlp/
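
The layering of user defined dictionaries over out-of-the-box lexicons can be illustrated with a short Python sketch. It does not use or represent the Rosoka API. It assumes that user entries take precedence over the defaults, and the dictionary contents and the `classify` function are hypothetical.

```python
from collections import ChainMap

# Hypothetical entries; in practice these would be loaded from lexicon files.
out_of_the_box = {"apple": "FRUIT", "chicago": "PLACE"}
user_defined = {"apple": "ORGANIZATION"}

# ChainMap searches its maps left to right, so user entries override defaults.
lexicon = ChainMap(user_defined, out_of_the_box)


def classify(token: str) -> str:
    """Look a token up in the layered lexicon, falling back to UNKNOWN."""
    return lexicon.get(token.lower(), "UNKNOWN")
```

With these entries, `classify("Apple")` returns `"ORGANIZATION"` from the user dictionary, while `classify("Chicago")` falls through to the default `"PLACE"`. The out-of-the-box lexicon itself is never modified.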