DATA ENGINEERING & DATA MINING
Exploring Trends and Approaches from Inside and Outside of Our Industry
Data Science Series, Part 3 of 4
21 August 2019
#ILTACON19 Data Science Series Schedule
Session 1 | Monday @ 3:30 PM | LAW in the ERA of LEGAL DATA SCIENCE | Ed Walters, Dazza Greenwood
Session 2 | Tuesday @ 3:30 PM | The RISE of the LEGAL DATA SCIENCE TEAM | Carmin Ballou, Eric J. Felsberg, Mike Klastava
Session 3 | Wednesday @ 3:30 PM | DATA ENGINEERING and DATA MINING | Lisa Mayo, Shree Bharadwaj
Session 4 | Thursday @ 11:30 AM | DATA VISUALIZATION in the LEGAL INDUSTRY | Melaina Fireman, Andrew P. Medeiros, Mark Thorogood
SPEAKERS/MODERATOR
• Shreenidhi Bharadwaj: VP, Data & Analytics; Adjunct Professor, University of Chicago
• Lisa Mayo: Director of Data Management, Ballard Spahr LLP
• ANDREW BAKER (Moderator): Senior Director, Digital Services + Analytics, HBR Consulting
SESSION NEED
Data Management: Data Engineering + Data Mining
• Data Engineering: How we store, stage, prep and ready data for consumption
• Data Mining: How we explore data and begin to derive meaning, insights and value from those assets
DATA ENGINEERING
THE DATA REVOLUTION
• Data is changing the world. Data is to this century what oil was to the last: a driver of growth, change, and success.
David Parkins
https://www.youtube.com/watch?v=4ycC0DJqrpc
https://leewardcapitalmgt.com/the-economist-the-worlds-most-valuable-resource-is-no-longer-oil-but-data/
BIG DATA
• Facebook: stores 400 PB of data, with an incoming daily rate of about 600 TB (as of 2017)
• YouTube: 1000 PB video storage, 100 M views/day
• Google: 4M searches/minute, stores 10 EB of data (estimated)
• AT&T: 1.9 T phone call records, 70,000 calls/second
• US Credit cards: 1.4 B cards, 20 B transactions/year
• Your Law Firm: 1 Bazillion Documents
DATA-DRIVEN STUDY IN LEGAL INDUSTRY
• Critical inputs are being overlooked, which suggests that many law firms may be missing data-oriented opportunities for growth
  – Expanding client base
  – Billing more hours, etc.
• Are firms missing opportunities to improve the practice of law itself?
https://www.clio.com/resources/legal-trends/2018-report/
DATA LIFECYCLE: ENABLING BUSINESS GROWTH
• Data Lifecycle Management (DLM) is a process that helps organizations manage the flow of data throughout its lifecycle: from creation, to use, to sharing, archive and deletion.
[Figure: data lifecycle stages shown on the slide: Create, Capture, Curate, Enrich, Store, Secure, Aggregate, Analyze, Share, Iterate, Archive]
ENTERPRISE DATA MANAGEMENT
• A holistic framework of people, processes and technology that optimizes data from a variety of sources, then makes it available when and where it’s needed (harmonization)
https://www.firstsanfranciscopartners.com/data-management/
ALIGNING BUSINESS STRATEGY & DATA STRATEGY
• A Successful Data Strategy links Business Goals with Technology Solutions
https://globaldatastrategy.com/our-services/enterprise-data-strategy/
DATA ENGINEERING
• Aspect of data science that focuses on automation of practical applications of data collection, curation, analysis and delivery in batch as well as in near real time.
[Figure: cycle of Ingest/Extract, Prepare/Clean, Store/Organize, Analyze/Deliver]
DATA FORMATS
• Numerical
• Text
• Media (audio, video)
• Geospatial
POPULAR DATABASES
[Table: Database Type vs. Database Names; types listed: Relational, Key-Value, Column, Document, Graph]
OPERATIONAL VERSUS HISTORICAL DATA
• Operational (OLTP) databases: turn the wheels of the organization
• Analytical (OLAP) databases: watch the wheels of the organization
DATA PIPELINE
[Figure: data pipeline from Operational Databases through ETL into the Data Warehouse and Data Marts, then on to Reporting/BI for Users/Analysts]
DATA WAREHOUSE
• The data warehouse is a structured repository of integrated, subject-oriented, enterprise-wide, historical, and time-variant data. The purpose of the data warehouse is the retrieval of analytical information. A data warehouse can store detailed and/or summarized data.
[Figure: Enterprise Data Warehouse (EDW) with the labels Informational, Analytical, Data Mining]
DATA MARTS
• Subset of data from a data warehouse
• Confined to data specific to a single line of business or department, e.g. Finance or Marketing
• Features:
  – Subject oriented
  – Small in size (few tables)
  – Customized by department
  – Source is a departmentally structured data warehouse
EXTRACT, TRANSFORM, LOAD (ETL)
• Creating ETL infrastructure is often the most time- and resource-consuming part of the data warehouse development process
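The three ETL steps can be sketched in a few lines. This is a minimal illustration only, using Python's built-in sqlite3 module; the tables `time_entries` and `fact_hours`, their columns, the sample rows and the "drop non-positive hours" rule are all hypothetical and do not come from the slides.

```python
import sqlite3


def extract(source):
    """Extract: read raw rows from the (hypothetical) operational table."""
    return source.execute(
        "SELECT matter_id, timekeeper, hours FROM time_entries"
    ).fetchall()


def transform(rows):
    """Transform: drop invalid hours, then total the hours per matter."""
    totals = {}
    for matter_id, _timekeeper, hours in rows:
        if hours is None or hours <= 0:
            continue  # assumed data-quality rule, for illustration only
        totals[matter_id] = totals.get(matter_id, 0.0) + float(hours)
    return sorted(totals.items())


def load(warehouse, rows):
    """Load: write the cleaned, aggregated rows into the warehouse table."""
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS fact_hours "
        "(matter_id TEXT PRIMARY KEY, total_hours REAL)"
    )
    warehouse.executemany("INSERT OR REPLACE INTO fact_hours VALUES (?, ?)", rows)
    warehouse.commit()


def run_etl_demo():
    """Build a toy operational database, run E, T and L, return the result."""
    source = sqlite3.connect(":memory:")
    source.execute(
        "CREATE TABLE time_entries (matter_id TEXT, timekeeper TEXT, hours REAL)"
    )
    source.executemany(
        "INSERT INTO time_entries VALUES (?, ?, ?)",
        [
            ("M-100", "A. Smith", 2.5),
            ("M-100", "B. Jones", 1.5),
            ("M-200", "A. Smith", 4.0),
            ("M-200", "B. Jones", -1.0),  # invalid entry, dropped in transform
            ("M-300", "C. Lee", None),    # missing hours, dropped in transform
        ],
    )
    warehouse = sqlite3.connect(":memory:")
    load(warehouse, transform(extract(source)))
    return warehouse.execute(
        "SELECT matter_id, total_hours FROM fact_hours ORDER BY matter_id"
    ).fetchall()


result = run_etl_demo()
```

In a real warehouse build most of the effort goes into the transform step (cleansing and conforming data), which is one reason ETL tends to dominate project time.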
https://dzone.com/
DATA LAKE
• “A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed.”
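"Not defined until the data is needed" is often called schema on read. A minimal sketch, assuming a hypothetical lake of raw JSON lines (the record types, field names and values are invented for illustration): the records are stored as-is, and a schema is applied only when a consumer reads them.

```python
import json

# Raw records kept in their native format; nothing is validated at write time.
RAW_LAKE = [
    '{"type": "matter", "id": "M-100", "client": "Acme", "opened": "2019-01-15"}',
    '{"type": "email", "from": "a@example.com", "body": "Draft NDA attached"}',
    '{"type": "matter", "id": "M-200", "client": "Globex"}',
]


def read_with_schema(raw_records, record_type, fields):
    """Apply a schema at read time: keep one record type, project the
    requested fields, and fill fields a record lacks with None."""
    rows = []
    for line in raw_records:
        record = json.loads(line)
        if record.get("type") != record_type:
            continue
        rows.append({field: record.get(field) for field in fields})
    return rows


matters = read_with_schema(RAW_LAKE, "matter", ["id", "client", "opened"])
```

A different consumer could read the same raw lines with a different field list, which is the flexibility the definition describes.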
https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/
DATA LAKE ARCHITECTURE
https://dzone.com/
DATA LAKE VS DATA WAREHOUSE
Attribute | Data Warehouse | Data Lake
Schema | Schema on write (predefined schemas) | Schema on read (no predefined schemas)
Scale | Scales to large volumes at moderate cost - limited storage and number of compute nodes | Scales to huge volumes at low cost - tens of thousands of server nodes
Access methods | Accessed through standardized SQL and BI tools | Accessed through SQL-like systems, programs, and other methods
Workloads | Batch processing, concurrent users performing interactive analytics | Batch processing, stream processing, predictive analytics, improved capability over EDWs for interactive queries
New data | Time consuming to introduce new content | Fast ingestion of new data/content
Cost/efficiency | Efficiently uses CPU/IO | Efficiently uses storage and processing capabilities at very low cost
Data Retention | Limited - driven by retention policies | Potential to retain all data (subject to retention policies)
Users | Reporting, Business Intelligence users | Analytics, Data Scientists, Data Engineers
Key Benefits | Provides a single enterprise wide view of data from multiple sources | Allows usage of raw structured and unstructured data from a centralized low-cost store
EXTRACT, LOAD, TRANSFORM (ELT)
• Loads the extracted data into a single, centralized data repository, enabling unlimited access to all of the data at any time
https://dzone.com/
DOCUMENT DATABASES
• Data stored as documents (multiple key-value pairs)
• Inherently a subclass of the key-value store
• Stores all information for a given object in a single instance in the database
DATA MODELS
• Tabular (Relational) Data Model: Related data split across multiple records and tables
• Document Data Model: Related data contained in a single, rich document
DISCRETE TO CONNECTED DATA
[Figure: spectrum from discrete to connected data: RDBMS & EDW/Columnar RDBMS; Hadoop / Aggregate-Oriented NoSQL; Graph Database & Graph Compute Engine (Graph Transactions & Analytics)]
Illustration by David Somerville based on the original by Hugh McLeod (@gapingvoid)
GRAPH DATABASES
• A graph is just a collection of vertices and edges, or, in less intimidating language, a set of nodes and the relationships that connect them. Graphs represent entities as nodes and the ways in which those entities relate to the world as relationships.
[Figure: a small social graph (small network of Twitter users)]
• Five graphs in the world of business: social, intent, consumption, interest, and mobile. The ability to leverage these graphs provides a “sustainable competitive advantage.” (Gartner)
Graph Databases, 2nd Edition, by Ian Robinson, Jim Webber, and Emil Eifrém
THE LABELED PROPERTY GRAPH MODEL
• A labeled property graph has the following characteristics:
  – Contains nodes and relationships.
  – Nodes contain properties (key-value pairs).
  – Nodes can be labeled with one or more labels.
  – Relationships are named and directed, and always have a start and end node.
  – Relationships can also contain properties.
RELATIONAL VERSUS GRAPH MODELS
• No more tables, no more foreign keys, no more joins
[Figure: Graph Model: a Person node (SHREE) connected by RATED relationships (Ratings) to Movie nodes (Terminator, Titanic, Toy Story)]
GRAPH DATABASE (NEO4J)
• Database management system with Create, Read, Update and Delete (CRUD) operations working on a graph data model.
• Generally built for use with online transaction processing (OLTP) systems.
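To make the labeled property graph model and its CRUD operations concrete, here is a tiny in-memory sketch. It is not Neo4j's API; the `PropertyGraph` class, its method names and the star ratings are hypothetical, and only the Person / RATED / Movie shape follows the slide's example.

```python
class PropertyGraph:
    """Tiny in-memory labeled property graph (illustrative only)."""

    def __init__(self):
        self.nodes = {}          # node_id -> {"labels": set, "props": dict}
        self.relationships = []  # (start_id, name, end_id, props) tuples

    def create_node(self, node_id, labels, **props):
        """Create: nodes carry one or more labels plus key-value properties."""
        self.nodes[node_id] = {"labels": set(labels), "props": dict(props)}

    def create_relationship(self, start_id, name, end_id, **props):
        """Create: relationships are named, directed, and need both end points."""
        if start_id not in self.nodes or end_id not in self.nodes:
            raise KeyError("both start and end node must exist")
        self.relationships.append((start_id, name, end_id, dict(props)))

    def neighbors(self, start_id, name):
        """Read: follow outgoing relationships with a given name from a node."""
        return [
            (end, props)
            for start, rel, end, props in self.relationships
            if start == start_id and rel == name
        ]

    def update_node(self, node_id, **props):
        """Update: merge new property values into an existing node."""
        self.nodes[node_id]["props"].update(props)

    def delete_node(self, node_id):
        """Delete: remove a node together with every relationship touching it."""
        del self.nodes[node_id]
        self.relationships = [
            r for r in self.relationships if r[0] != node_id and r[2] != node_id
        ]


graph = PropertyGraph()
graph.create_node("shree", ["Person"], name="Shree")
graph.create_node("titanic", ["Movie"], title="Titanic")
graph.create_node("toy_story", ["Movie"], title="Toy Story")
graph.create_relationship("shree", "RATED", "titanic", stars=4)    # made-up rating
graph.create_relationship("shree", "RATED", "toy_story", stars=5)  # made-up rating

rated = graph.neighbors("shree", "RATED")
```

Note that the "join" between a person and the movies they rated is just a walk along stored relationships, with no foreign keys or join tables involved.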
Relationships are first-class citizens of the graph data model
NOW, LET’S EXPLORE THESE CONCEPTS AS APPLIED TO A LAW FIRM
ADVANCED DATA MANAGEMENT IN PRACTICE
ENTERPRISE DATA WAREHOUSE PROJECT: WHAT… WHY… HOW…
DRIVERS FOR CHANGE/INVESTMENT
Issues:
• Loads of data from disparate systems, each with its own metadata
• No firm-wide tool to connect the dots
• We needed greater agility for system conversions
Realization: How we manage, analyze, and leverage data to drive quality and add value will:
• Differentiate us in a demanding legal services market
• Improve profitability
GOALS
• Better manage slowly-changing dimensions
• Solidify systems of ownership
• Optimize EDW for build performance and query performance
• Point-in-time reporting with flexible level of granularity (people changes, client/matter changes)
• Determine a mechanism for handling effective-dated and future-dated data
• Ensure all data in NBI system has a home in the EDW
• Enable a more efficient view into GL actuals and budgets
• Establish governance processes for new key entities, data elements, and data marts
METHODOLOGY
• Data Workshop
• Create a centralized framework and include data across all core firm systems
• Start with People, Clients, Matters
  – Current view
  – Effective Dated
  – Point in Time
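The three views named above (current, effective dated, point in time) can be illustrated with a small effective-dated history. This is a sketch under assumptions: the `PERSON_HISTORY` rows, the field names `valid_from`/`valid_to`, and the half-open date range convention are hypothetical, not the firm's actual design.

```python
from datetime import date

# Hypothetical effective-dated history for one person.
# valid_to=None marks the open-ended row; ranges are [valid_from, valid_to).
PERSON_HISTORY = [
    {"person": "P-1", "title": "Associate",
     "valid_from": date(2015, 9, 1), "valid_to": date(2019, 1, 1)},
    {"person": "P-1", "title": "Partner",
     "valid_from": date(2019, 1, 1), "valid_to": None},
]


def as_of(history, person, when):
    """Point-in-time view: return the row in effect on the date `when`."""
    for row in history:
        if row["person"] != person:
            continue
        starts_in_time = row["valid_from"] <= when
        not_yet_ended = row["valid_to"] is None or when < row["valid_to"]
        if starts_in_time and not_yet_ended:
            return row
    return None


def current_view(history, person, today=None):
    """Current view: the point-in-time view evaluated for today.
    Future-dated rows are ignored until their valid_from date arrives."""
    return as_of(history, person, today or date.today())


title_mid_2018 = as_of(PERSON_HISTORY, "P-1", date(2018, 6, 30))["title"]
title_on_change = as_of(PERSON_HISTORY, "P-1", date(2019, 1, 1))["title"]
```

Treating the current view as "point in time, evaluated today" is one design choice that also handles future-dated rows without a separate mechanism.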
• Incorporate processes that scrub the data to ensure completeness and accuracy
• Partner with Microsoft for best practices around:
  – Security
  – Performance
• Document our data architecture
EXTENDED PROPERTIES EXAMPLE
METHODOLOGY: SECURITY
• Row level security
• Transparent Data Encryption
• Multiple schemas and data marts
• Auditing capabilities
• Dynamic Data Masking
METHODOLOGY: PERFORMANCE
• SQL Merge
• In-Memory Clustered Columnstore Indexing
• Optimized for ad-hoc workloads and star join optimization
• Partitions are an option moving forward
RESULTS
BUSINESS IMPACT
• Consolidation of numerous firm databases
• We have a suite of business intelligence tools that we can customize for clients and the firm
• Enhanced forecasting capabilities
• Data to support every strategic decision
• Ability to use predictive analytics, advanced data mining and machine learning
ClientInsight facilitates:
• Data discovery, an iterative process of discovering patterns and outliers in data
• Data literacy to derive increasingly meaningful information
• Predictive analytics to improve forecasting so the firm and its clients have a 360-degree view of our engagements, which promotes better strategic decision making
LOOKING AHEAD
FUTURE PLANS
• Continued client outreach
• Incorporate Big Data into ClientInsight for added value
• Machine Learning and Artificial Intelligence
• Data Storytelling
• Continue to evolve as data management evolves
TIPS TO BOLSTER BUSINESS AND IMPROVE PROFITABILITY USING DATA SCIENCE
TIPS AND ADVICE - METHODOLOGY
• Create a multi-disciplinary team
• Data Workshop
  – Define success criteria
  – Define goals
  – Identify a starting point
• Focus on your client needs and increased profitability will follow
TIPS AND ADVICE - TOOLS
• Redgate SQL Tools for source control, searches, documentation (Extended Properties)
• Data visualization tools
• Invest in predictive analytics capabilities
DATA MINING
DATA MINING IN THE LEGAL PROFESSION
Finding items that meet human-defined criteria and detecting patterns in data
• IF this document is a non-disclosure agreement, THEN send it to the legal department for review
• IF this NDA meets the following criteria, THEN approve it for signature
• FIND all my contracts with automatic renewal clauses and NOTIFY ME four weeks before they renew
• TELL ME which patents in this portfolio will expire in the next six months
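Rules like these translate almost directly into filters over document metadata. A minimal sketch, assuming hypothetical records and field names (`kind`, `auto_renews`, `renewal_date`, `expires`) and approximating "six months" as 183 days; in practice the hard part is extracting such fields from the documents in the first place.

```python
from datetime import date, timedelta

# Hypothetical document metadata; all field names and values are assumptions.
DOCUMENTS = [
    {"id": "D1", "kind": "nda"},
    {"id": "D2", "kind": "contract", "auto_renews": True,
     "renewal_date": date(2019, 9, 10)},
    {"id": "D3", "kind": "contract", "auto_renews": True,
     "renewal_date": date(2020, 3, 1)},
    {"id": "D4", "kind": "patent", "expires": date(2019, 12, 1)},
    {"id": "D5", "kind": "patent", "expires": date(2021, 5, 1)},
]


def route(document):
    """IF the document is an NDA THEN send it to the legal department."""
    return "legal_review" if document["kind"] == "nda" else "no_action"


def renewals_to_notify(documents, today, notice=timedelta(weeks=4)):
    """FIND auto-renewing contracts that renew within the notice window."""
    return [
        d["id"] for d in documents
        if d["kind"] == "contract" and d.get("auto_renews")
        and today <= d["renewal_date"] <= today + notice
    ]


def patents_expiring(documents, today, horizon_days=183):
    """TELL ME which patents expire within roughly the next six months."""
    horizon = today + timedelta(days=horizon_days)
    return [
        d["id"] for d in documents
        if d["kind"] == "patent" and today <= d["expires"] <= horizon
    ]


today = date(2019, 8, 21)
notify = renewals_to_notify(DOCUMENTS, today)
expiring = patents_expiring(DOCUMENTS, today)
```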
https://jolt.law.harvard.edu/digest/a-primer-on-using-artificial-intelligence-in-the-legal-profession
USE CASES
• Document Review
  – Predictive coding: reduces the volume of irrelevant documents attorneys must wade through
• Analyzing Contracts
  – Clients need to analyze contracts both in bulk and on an individual basis.
https://jolt.law.harvard.edu/digest/a-primer-on-using-artificial-intelligence-in-the-legal-profession
DATA MINING PYRAMID
[Figure: data mining pyramid, with processing level running from High (top) to Low (bottom): Decision making and strategic planning; Data analysis; Data retrieval and aggregation; Data management and storage]
DATA SCIENCE LIFECYCLE
1. Business understanding
2. Data acquisition and understanding
3. Modeling
4. Deployment
5. Customer acceptance
https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/lifecycle
DATA MINING & MACHINE LEARNING
• Classification
• Regression
• Clustering
• Generation
ALL business areas can harness the power of big data and data mining to gain insight and knowledge
CLASSIFICATION
• Classification predicts, for each individual in a population, which class/category that individual belongs to. Examples:
  – Among all the customers, which are likely to respond to a given offer?
  – Is this transaction fraudulent?
  – What type of document is this?
  – Is this email spam?
  – Is the student likely to get an A in this course?
  – Is the image a cat, a dog, or a horse? (multi-class)
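A linear classifier separates two classes with a straight line (or plane) in feature space. The sketch below uses one of the simplest such rules, a nearest-centroid boundary; this choice is an assumption, since the slides do not name a specific algorithm, and the labeled height/weight samples are made up for illustration.

```python
def fit_linear_classifier(samples, labels, positive):
    """Fit a nearest-centroid linear classifier.
    The decision boundary is the line halfway between the two class means:
    score(x) = w . x + b, with w = mean(positive) - mean(other)."""
    pos = [s for s, lab in zip(samples, labels) if lab == positive]
    neg = [s for s, lab in zip(samples, labels) if lab != positive]

    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    mu_pos, mu_neg = mean(pos), mean(neg)
    weights = [p - n for p, n in zip(mu_pos, mu_neg)]
    midpoint = [(p + n) / 2 for p, n in zip(mu_pos, mu_neg)]
    bias = -sum(w * m for w, m in zip(weights, midpoint))
    return weights, bias


def predict(model, sample, positive, negative):
    """Classify by which side of the linear boundary the sample falls on."""
    weights, bias = model
    score = sum(w * x for w, x in zip(weights, sample)) + bias
    return positive if score > 0 else negative


# Hypothetical training data with class labels: (height in cm, weight in kg).
samples = [(180, 80), (175, 75), (185, 90), (160, 55), (165, 60), (155, 50)]
labels = ["Male", "Male", "Male", "Female", "Female", "Female"]

model = fit_linear_classifier(samples, labels, positive="Male")
prediction = predict(model, (200, 100), "Male", "Female")
```

With this toy data, the new point (height = 200cm, weight = 100kg) falls on the "Male" side of the boundary.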
[Figure: Linear Classification Algorithm: trained on data with class labels; new data (height = 200cm, weight = 100kg) is classified as Male]
CLASSIFICATION: USE CASE
Application Reason:
• I am getting married and need a cash advance
• I am starting a new business…
REGRESSION
• Regression attempts to estimate or predict, for each individual, the numerical value of some variable for that individual.
• For example:
  – How much will a given customer use the service?
  – How likely is a given voter to support a candidate?
  – How much will the product demand change if the price increases by 3%?
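As a minimal worked example, ordinary least squares fits a straight line through numeric data and uses it to predict a value for a new individual. The training data below is invented, and predicting weight from height is an assumption made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_value(model, x):
    """Predict the numeric target for a new observation."""
    intercept, slope = model
    return intercept + slope * x


# Hypothetical training data: height in cm, weight in kg.
heights = [150, 160, 170, 180, 190]
weights = [50, 60, 70, 80, 90]

model = fit_line(heights, weights)
predicted_weight = predict_value(model, 200)
```

Unlike classification, the output here is a number on a continuous scale rather than a class label.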
[Figure: Regression Algorithm; new data: weight = 100kg, height = 200cm]
UNSTRUCTURED TEXT IS EVERYWHERE!
• Physicians (1 million): 1 million physicians X 10 pages of content/day X 251 days = 5.02 TB per year
• Nurses (4 million): 4 million nurses X 5 pages of content/day X 251 days = 10 TB per year
• In Patient (36 million): 36 million patient stays/yr X 4 pages of content/day X 4.5 days avg length of stay = 1.3 TB per year
16.3 terabytes of unstructured narratives are created annually. 0.5 petabyte of unstructured narratives sits unutilized in the US Electronic Medical Systems
DEEP LEARNING FOR UNREVEALING FACES
Image Source: University of Florida Health (no copyright infringement is intended)
[Figure labels: Tobacco Usage, Biomedical measures, Anxiety Prediction, Risk of Suicidal]
DEEP LEARNING AS A DECISION MAKER
Prior Condition: (Patient Clinical History: 34 year old female with history of sickle cell disease status post right total shoulder)
Train: (Findings: Patient used prophylactic antibiotics before and after the procedure as prescribed. The patient received antibiotic prophylaxis with oral Ciprofloxacin 500mg and Keflex 500 mg. The patient took 1 tablet of Ciprofloxacin 500mg the night before and morning of the procedure. The patient also took 2 tablets of Keflex 500 mg the morning of the procedure. After the procedures, the patient will take 2 tablets of Ciprofloxacin 500mg and 2 tablets of Keflex 500 mg. Following the discussion of the risks, benefits, and alternatives of the procedure, informed consent was obtained. Dynatrim was used to localize the left mid gland peripheral zone of the prostate at the site of the T2 hypointense lesion. Using the Dynatrim guidance system an 18-gauge MR-compatible needle was inserted into the left side of the prostate. Upon confirmation of the tip of the needles close to the suspected lesion, MR guided biopsy using an 18 gauge biopsy needle was performed and 3 cores were obtained. The cores were hand delivered to the pathology lab. The patient tolerated the procedure well and was discharged without any immediate complications. ESTIMATED BLOOD LOSS: Less than 5 cc)
Predict: (Impressions: Successful core-biopsies of the left mid gland peripheral zone lesion)
HEALTHCARE LANGUAGE IS DIFFERENT BUT NOT IMPOSSIBLE…
• http://annotations.inferenceanalytics.com/ (viewer/inaViewer18)
KNOWLEDGE GENERATION ON TEXT, IMAGE, VIDEO, SPEECH
• New trend of deep learning based predictions: https://affinelayer.com/pixsrv/
• Can we predict the next tweet of President Trump?
LEARNING TRENDS IN 2019
• Transfer Learning
• Multi-Task Learning (e.g. Autonomous Vehicles)
• Federated Learning / Edge Computing
• Integration with Emerging Tech (e.g. IoT, Blockchain)
DEMO
• Connect to Graph Databases (Neo4J)
• Analyze Paradise Papers Dataset
Q&A