September 4, 2014 UNSTRUCTURED DATA EXTRACTION VIA NATURAL LANGUAGE PROCESSING (NLP)
Presented by Alex Wu, Partner, Sagence, Inc.
2nd Annual INFORMS Midwest Practice of Analytics Conference University of Chicago’s Gleacher Center
CONFIDENTIAL Agenda
Intro ‒ About Sagence
Why NLP?
What is NLP?
Use cases & patterns
Q&A
ABOUT SAGENCE
Sagence is a specialized firm designed to help organizations drive their businesses with information and insight through better data and analytics practices

Our Value
. Improving Decision Making & Innovation: Helping clients create better business outcomes by using their data more effectively
. Asking the Right Questions: Identifying the high impact analytics for a client’s combination of industry, marketplace, and strategy
. Applying Decision Science: Generating hypotheses, collecting data, analyzing, testing, and informing new decisions
. Developing Information Assets: Identifying, unleashing, delivering and maintaining the hidden value in data
. Delivering Innovative Solutions: Helping clients with pragmatic innovation that acknowledges business, technical, economic, and cultural constraints

Our People
Data Specialists | Thinkers and Doers | Strategic and Objective
. Multi-disciplinary teams
. Led by professionals with 25+ years of experience
. Providing objective thinking combined with an interdisciplinary perspective
. Working collaboratively with clients
. Focused on business results

Our Position
[Chart: Objectivity / Strategic Advice (Low to High) plotted against Services (Generalist to Specialist).]
We assist companies with everything from broad strategic challenges to tackling point problems with expert resources and solutions
INTRO
Quick Survey on Natural Language Processing (NLP)
Where would you rate your NLP knowledge?
. NEW TO NLP: Start with the fundamentals
. FAMILIAR WITH NLP: Experience with tools and techniques
. SUBJECT MATTER EXPERT: In-depth discussions
INTRO
This presentation describes a solution to a commercial problem

WHAT WE WILL COVER
. Commercial problem and context
. Survey of frameworks/tools
. Select challenges encountered along with implemented solutions

WHAT WE WON’T
. Discussion on pros/cons of theory
. Detailed analysis of solutions
. In-depth discussion on topics in Natural Language Processing
Agenda
Intro
Why NLP? ‒ Growth of unstructured data ‒ Existing solutions & inefficiencies ‒ Value proposition
What is NLP?
Use cases & patterns
Q&A
WHY NLP?
The rapid growth of unstructured data magnifies the importance of finding the business value in that unstructured information

Growth of Unstructured Data
[Chart: structured data vs. unstructured data, 2007 to 2014. CAGR for unstructured data: 61.8%. CAGR for structured data: 23.7%.]
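To make the gap between the two growth rates in the chart concrete (61.8% for unstructured data, 23.7% for structured data), the sketch below compounds each rate over the seven years from 2007 to 2014. The starting volume of 1.0 is an arbitrary unit assumed for illustration; it is not a figure from the chart.

```python
def compound_growth(start: float, cagr: float, years: int) -> float:
    """Return the value after compounding `start` at rate `cagr` for `years` years."""
    return start * (1.0 + cagr) ** years


# The CAGR values come from the chart; the base volume of 1.0 is an
# arbitrary unit chosen for illustration only.
YEARS = 2014 - 2007
unstructured_multiple = compound_growth(1.0, 0.618, YEARS)
structured_multiple = compound_growth(1.0, 0.237, YEARS)

print(f"Unstructured data: about {unstructured_multiple:.1f}x over {YEARS} years")
print(f"Structured data: about {structured_multiple:.1f}x over {YEARS} years")
```

Under these assumptions, the same starting volume ends up several times larger on the unstructured side, which is the point the chart makes visually.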
WHY NLP?
Utilizing Natural Language Processing (NLP) as an approach to distilling business value from unstructured data yields many benefits

BIG DATA
. Volume
. Variety
. Velocity

NLP
. Consistency
. Accuracy
. Reuse
. Flexibility
. Knowledgebase (train once, use many times)
WHY NLP?
Existing solutions largely fail at addressing the challenge of distilling business value from unstructured data

Unstructured Data Management Landscape
. Operational inefficiencies & inability to respond to emerging business needs
. Fragmented data management solutions
. Complex tool landscape & data quality issues

VS

Continuum of Techniques (Manual Entry to Highly Automated)
. Manual Data Entry
. Template Driven
. Advanced Search
. Natural Language Processing (NLP)
WHY NLP?
In one client example, utilizing NLP realized large efficiency gains compared to existing manual processes

[Chart: NLP vs. Manual. Panels: Processing Speed (minutes), Accuracy - Extraction (%), Accuracy - Classification (%), Consistency (%). Values shown: 2 and 240 minutes; 100%, 100%, 90%, 90%, 80%, and LOW.]
Agenda
Intro
Why NLP?
What is NLP? ‒ Definition ‒ Approach ‒ Tools & Landscape
Use cases & patterns
Q&A
WHAT IS NLP?
Some Definitions

“By speech and language processing, we have in mind those computational techniques that process spoken and written human language, as language.” ~ Daniel Jurafsky and James H. Martin [1]

. Phonetics and Phonology – The study of linguistic sounds.
. Morphology – The study of the meaningful components of words.
. Syntax – The study of the structural relationships between words.
. Semantics – The study of meaning.
. Pragmatics – The study of how language is used to accomplish goals.
. Discourse – The study of linguistic units larger than a single utterance.

[Diagram: Speech, Phonology, Morphology, Syntax, Semantics, and Logic in sequence, with Lexicon and Pragmatics alongside.]

[1] Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition
WHAT IS NLP?
NLP Steps

Pre-processing
. Convert content to images and use OCR to convert images to text
. Utilize proprietary algorithms to adjust for artifacts and discrepancies
. Differentiate discourse from tabular structures and identify financial table boundaries

Syntax
. Tokenize text, assign X,Y coordinate structure to each token and assemble as in-memory data grid as reference
. Perform Part-of-Speech (POS) tagging and assemble Treebank (grammatical relationships of tokens within a sentence)
. Traverse data grid and assemble ‘sentence’ structure for each data point

Semantics
. Normalize extracted terms to key concepts
. Apply Named Entity Recognition (NER) to ‘sentence’ to classify tokens
. Use Semantic Vector rules (SV rules) to identify key mnemonics

Pragmatics
. Assign semantic Predicate/Subject/Object (PSO) relationship to relate data point (Money) with surrounding context (Legal Entity, Statement Type, Statement Date, Line Item, Fund…etc.)
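The Semantics and Pragmatics steps can be pictured with a small sketch. It is not the engine described in this deck: the synonym table, the regex patterns standing in for NER, the `Triple` class, and the `has<Role>` predicate naming are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical synonym table mapping raw line-item text to key concepts.
CONCEPTS = {
    "total revenues": "Revenue",
    "net sales": "Revenue",
    "net income": "NetIncome",
    "net earnings": "NetIncome",
}

# Pattern-based stand-ins for NER; a real system would use a trained model.
YEAR = re.compile(r"^(19|20)\d{2}$")
MONEY = re.compile(r"^\(?\$?(\d{1,3}(,\d{3})+|\d+)(\.\d+)?\)?$")


def normalize(term: str) -> str:
    """Map a raw term to its key concept, or return it stripped if unknown."""
    return CONCEPTS.get(term.strip().lower(), term.strip())


def classify(token: str) -> str:
    """Label a token as StatementDate, Money, or Text by pattern."""
    if YEAR.match(token):
        return "StatementDate"
    if MONEY.match(token):
        return "Money"
    return "Text"


@dataclass(frozen=True)
class Triple:
    predicate: str
    subject: str
    obj: str


def relate(value: str, context: dict[str, str]) -> list[Triple]:
    """Relate a Money data point to each piece of surrounding context."""
    if classify(value) != "Money":
        return []
    return [
        Triple(f"has{role}", value, normalize(text))
        for role, text in context.items()
    ]


triples = relate("1,250", {"LineItem": "Net Sales", "StatementDate": "2013"})
for triple in triples:
    print(triple)
```

Here the extracted value is the subject, each context role becomes a predicate, and the normalized context text is the object, so the same value can be queried later by line item or by statement date.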
WHAT IS NLP?
NLP tools and landscape
[Chart: NLP tools positioned by Comprehensiveness and Maturity; labeled tools include NLTK.]
The landscape consists of both open source and commercial solutions. No single solution addresses the full data lifecycle.
Agenda
Intro
Why NLP?
What is NLP?
Use cases & patterns ‒ Use cases ‒ Solution
Q&A
USE CASES & PATTERNS
The solution serves as the enterprise data collection and management pipeline into the ratings analytical process

. Global credit ratings agency
. 450 person organization dedicated to data collection and governance
. Automation of data collection results in ~$20 million in savings over five years

Data Lifecycle for Financial Data Extraction
[Diagram: Data Collection, Data Governance, Ratings Process, Publish, Surveillance.]
USE CASES & PATTERNS
Document artifacts require preprocessing prior to NLP processing

. Table spans across multiple pages
. Poorly scanned documents
. Complex nested line items and subsections
USE CASES & PATTERNS
The solution must detect tables, identify boundaries, parse tokens and assign coordinate metadata

[Figure: annotated financial table with the following callouts.]
. Table origin: denote as table origin, assign x, y coordinate
. Text
. Table Boundaries
. Numerical Values
. Null values: identify null values, differentiate from table end
. Table end: denote as table end, assign x, y coordinate
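The callouts above (assign coordinates, find the table origin and end, and keep null cells from being mistaken for the end of the table) can be sketched in a few lines. This is a simplified illustration, not the proprietary algorithm: the `Token` class, the null markers, and the rule that any row holding a number or a null marker is table-like are assumptions.

```python
import re
from dataclasses import dataclass
from typing import Optional

NUMERIC = re.compile(r"^\(?\$?\d[\d,]*(\.\d+)?\)?%?$")
NULL_MARKERS = {"-", "--", "n/a", "na"}  # assumed placeholders for empty cells


@dataclass(frozen=True)
class Token:
    text: str
    x: int  # character offset of the token within its line
    y: int  # zero-based line number

    @property
    def is_null(self) -> bool:
        return self.text.lower() in NULL_MARKERS

    @property
    def is_numeric(self) -> bool:
        return bool(NUMERIC.match(self.text))


def tokenize(lines: list[str]) -> list[Token]:
    """Split each line on whitespace, recording x, y for every token."""
    return [
        Token(match.group(), match.start(), y)
        for y, line in enumerate(lines)
        for match in re.finditer(r"\S+", line)
    ]


def table_bounds(tokens: list[Token]) -> Optional[tuple[int, int]]:
    """Return (origin_y, end_y) of the first run of table-like rows.

    A row counts as table-like if it holds a numeric value or a null
    marker, so a row of empty cells does not end the table prematurely.
    """
    rows = sorted({t.y for t in tokens if t.is_numeric or t.is_null})
    if not rows:
        return None
    origin = end = rows[0]
    for y in rows[1:]:
        if y != end + 1:  # gap in row numbers: the table has ended
            break
        end = y
    return origin, end


lines = [
    "Consolidated Statement of Income",
    "Line item        2013     2012",
    "Net sales        1,250    1,100",
    "Other income     -        35",
    "",
    "See accompanying notes.",
]
tokens = tokenize(lines)
print(table_bounds(tokens))
```

A production system would work from OCR bounding boxes rather than character offsets, and would need stronger evidence than a single number to call a row tabular; the sketch only shows how coordinates and null handling fit together.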
USE CASES & PATTERNS
Targeted extraction of data points within table structures requires preservation of data lineage and context

[Figure: table excerpt with numbered callouts 1 to 8 marking the Row Context, Column Context, and Targeted Value.]
By combining the Row Context + Value + Column Context, a ‘sentence’ is formed for NLP
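A minimal sketch of this idea follows, assuming a simple rectangular table whose first row holds column headers and whose first column holds row labels. The `Sentence` class and `build_sentences` function are hypothetical names; nested line items and multi-level headers, which the real documents contain, would need additional context.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sentence:
    row_context: str
    value: str
    column_context: str

    def text(self) -> str:
        """Join the three parts in the order Row Context + Value + Column Context."""
        return f"{self.row_context} {self.value} {self.column_context}"


def build_sentences(grid: list[list[str]]) -> list[Sentence]:
    """Form one 'sentence' per non-empty data cell of a rectangular table."""
    if not grid:
        return []
    header, *body = grid
    sentences = []
    for row in body:
        label, *cells = row
        for column_header, cell in zip(header[1:], cells):
            if cell.strip():  # skip empty cells
                sentences.append(Sentence(label, cell, column_header))
    return sentences


grid = [
    ["Line item", "FY2013", "FY2012"],
    ["Net sales", "1,250", "1,100"],
    ["Net income", "210", ""],
]
for sentence in build_sentences(grid):
    print(sentence.text())
```

Keeping the three parts as separate fields, rather than only the joined string, preserves the lineage of each value back to its row and column.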
USE CASES & PATTERNS
Financial data extraction requires precise targeting of content with related context and data lineage

. Automatically crawl and retrieve documents
. Extract and classify key mnemonics with preservation of context and lineage
USE CASES & PATTERNS
Other use cases include legal discovery and document data mining

. ~$500 trillion in notional amounts outstanding [1]
. Extract and classify key clauses based on NLP

[1] http://www.isda.org/statistics/historical.html
USE CASES & PATTERNS
Other use cases include legal discovery and document data mining

. Automatically crawl and retrieve online postings
. Extract and classify demand based on machine learning
Extraction Results
USE CASES & PATTERNS
A solution that addresses the full data lifecycle consists of four primary components

Sourcing
. Crawl
. Upload
. Feeds

Extraction
. Preprocessing
. NLP Processing
. Post Processing

Workflow & QA
. Assignment
. Review

Distribution
. File
. Database
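One way to picture how the four components hand work to each other is a chain of stage functions over a shared document record. The sketch below is an illustration only: the `Document` fields, the stage functions, and the toy extraction and approval rules are assumptions, not the product's design.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Document:
    name: str
    text: str
    values: dict[str, str] = field(default_factory=dict)
    approved: bool = False
    log: list[str] = field(default_factory=list)


Stage = Callable[[Document], Document]


def extraction(doc: Document) -> Document:
    # Placeholder for preprocessing, NLP processing, and post processing:
    # treat the last word of each line as a value if it looks numeric.
    for line in doc.text.splitlines():
        label, _, value = line.rpartition(" ")
        if label and value.replace(",", "").isdigit():
            doc.values[label.strip()] = value
    doc.log.append("extraction")
    return doc


def workflow_qa(doc: Document) -> Document:
    # Placeholder review rule: approve only documents with extracted values.
    doc.approved = bool(doc.values)
    doc.log.append("workflow_qa")
    return doc


def distribution(doc: Document) -> Document:
    # Placeholder for writing to a file or database; only approved
    # documents are distributed.
    if doc.approved:
        doc.log.append("distribution")
    return doc


def run_pipeline(doc: Document, stages: list[Stage]) -> Document:
    """Pass the document through each stage in order."""
    for stage in stages:
        doc = stage(doc)
    return doc


# Sourcing is represented here by a manually "uploaded" document.
uploaded = Document("statement.txt", "Net sales 1,250\nNet income 210")
result = run_pipeline(uploaded, [extraction, workflow_qa, distribution])
print(result.values, result.log)
```

Because each stage has the same signature, a stage can be replaced or an extra one inserted without touching the others, which is the practical benefit of keeping the components separate.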
USE CASES & PATTERNS
The Sagence solution offers end-to-end integration with enterprise systems to allow data usability and exchange

[Architecture diagram. Document Sourcing: Web/FTP, File Mount, Front End UI. Backend Engine (Java + Other Open Source Components): NLP Engine, Data Services Layer. Distribution: FTP, XML, Target System. Services: Entity Referential Data Services, Document Repository Services. Stores: Enterprise System, Entity Data Store, Document Repository, Relational Database. Legend: NLP Core Product.]

BUSINESS CAPABILITIES:
. End-to-end workflow
. Handle multiple extraction use cases within a single document
. Configurable user dictionary
. Capture document context using metadata tagging
. User configurable rules and knowledge base
. Multiple data sourcing options (Web, FTP, internal feeds, manual upload)
. Multiple document types (PDF, scanned documents)
. Pre-extraction document data cleanup and adjustment
. User dashboard to manage workflow and ensure quality assurance
. Data distribution to key systems and consumers
. Stores documents within content management system

TECHNICAL FEATURES:
. Modular architecture
. Open source components
. Extensibility to additional analytical tools
. Integrated data sourcing and retrieval (Web, FTP)
. Integration with reference data systems and custom feeds
. Intelligent algorithms for data preprocessing
. Distribution via XML or data services
. Customizable load balancing and work routing
. Conforms with enterprise standards
. Deployable via custom environments, virtual machines, or cloud-based platforms
THANK YOU
Alex Wu | Partner | Sagence | mobile: +1.860.680.7913
Roger Moore | Senior Principal | Sagence | mobile: +1.312.543.1319
http://info.sagenceconsulting.com/natural-language-processing-isda

APPENDIX
The NLP engine utilizes out-of-the-box as well as user defined lexicons and rules to identify, classify, and extract targeted values

[Diagram: configuration layers on top of the Rosoka Core Engine*]
. User defined rules: RuleSet.xml, IntraTokenRules.xml, MultiRokenRules.xml
. User defined dictionaries: User.xml
. Out-of-the-box dictionaries: Lexicon(s), Override
. Entities, Part-Of-Speech and Pragmatics definitions: TokenDefs.xml
. Rosoka Core Engine*

* http://rosoka.com/home/products/rosoka/rosoka-nlp/
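The layering of user defined dictionaries over out-of-the-box ones can be illustrated with a generic lookup in which user entries take precedence. This sketch does not use or reproduce the Rosoka API or its XML formats; the lexicon contents and the `classify` function are hypothetical.

```python
from collections import ChainMap

# Hypothetical out-of-the-box lexicon: term -> entity type.
DEFAULT_LEXICON = {"revenue": "LineItem", "fund": "Fund", "apple": "Fruit"}

# Hypothetical user defined entries; these take precedence over the defaults.
USER_LEXICON = {"apple": "LegalEntity", "ebitda": "LineItem"}

# ChainMap searches its maps left to right, so user entries win on conflict.
LEXICON = ChainMap(USER_LEXICON, DEFAULT_LEXICON)


def classify(token: str) -> str:
    """Look a token up in the layered lexicon; unknown tokens are 'Unknown'."""
    return LEXICON.get(token.lower(), "Unknown")


for word in ["Apple", "Revenue", "EBITDA", "Widget"]:
    print(word, "->", classify(word))
```

Keeping the user layer separate from the defaults means a domain-specific override never modifies the shipped dictionary and can be removed without side effects.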