DATA ENGINEERING & MINING

Exploring Trends and Approaches from Inside and Outside of Our Industry Series, Part 3 of 4
21 August 2019

#ILTACON19 Data Science Series Schedule

Session 1, Monday @ 3:30 PM: LAW in the ERA of LEGAL DATA SCIENCE (Ed Walters, Dazza Greenwood)
Session 2, Tuesday @ 3:30 PM: The RISE of the LEGAL DATA SCIENCE TEAM (Carmin Ballou, Eric J. Felsberg, Mike Klastava)
Session 3, Wednesday @ 3:30 PM: DATA ENGINEERING and MINING (Lisa Mayo, Shree Bharadwaj)
Session 4, Thursday @ 11:30 AM: DATA VISUALIZATION in the LEGAL INDUSTRY (Melaina Fireman, Andrew P. Medeiros, Mark Thorogood)

SPEAKERS/MODERATOR

• Shreenidhi Bharadwaj: VP, Data & Analytics; Adjunct Professor, University of Chicago
• Lisa Mayo: Director of Data Management, Ballard Spahr LLP
• Andrew Baker (Moderator): Senior Director, Digital Services + Analytics, HBR Consulting

SESSION NEED

Data Management: Data Engineering + Data Mining
• Data Engineering: how we store, stage, prep and ready data for consumption
• Data Mining: how we explore data and begin to derive meaning, insights and value from those assets

DATA ENGINEERING

THE DATA REVOLUTION

• Data is changing the world. Data is to this century what oil was to the last century: a driver of growth, change, and success.

David Parkins
://www.youtube.com/watch?v=4ycC0DJqrpc
https://leewardcapitalmgt.com/the-economist-the-worlds-most-valuable-resource-is-no-longer-oil-but-data/

BIG DATA

• Facebook: stores 400 PB of data, with an incoming daily rate of about 600 TB (as of 2017)
• YouTube: 1000 PB video storage, 100 M views/day
• Google: 4 M searches/minute, stores 10 EB of data (estimated)
• AT&T: 1.9 T phone call records, 70,000 calls/second
• US credit cards: 1.4 B cards, 20 B transactions/year
• Your Law Firm: 1 Bazillion Documents

DATA-DRIVEN STUDY IN LEGAL INDUSTRY

• Critical inputs are overlooked, which suggests that many law firms may be missing data-oriented opportunities for growth
  – Expanding client base
  – Billing more hours, etc.
• Are firms missing opportunities to improve the practice of law itself?

https://www.clio.com/resources/legal-trends/2018-report/

DATA LIFECYCLE: ENABLING BUSINESS GROWTH

• Data Lifecycle Management (DLM) is a process that helps organizations manage the flow of data throughout its lifecycle, from creation to use, sharing, archiving, and deletion.

Lifecycle diagram labels: Create, Capture, Curate, Enrich, Store, Secure, Aggregate, Analyze, Share, Iterate, Archive

ENTERPRISE DATA MANAGEMENT

• Holistic framework comprising the people, processes and technology that optimize data from a variety of different sources, then make it available when and where it’s needed (harmonization)

https://www.firstsanfranciscopartners.com/data-management/

ALIGNING BUSINESS STRATEGY & DATA STRATEGY

• A successful data strategy links business goals with technology solutions

https://globaldatastrategy.com/our-services/enterprise-data-strategy/

DATA ENGINEERING

• Aspect of data science that focuses on automation of practical applications of data collection, curation, analysis and delivery, in batch as well as in near real time.

Cycle diagram labels: Ingest/Extract, Prepare/Clean, Store/Organize, Analyze/Deliver

DATA FORMATS

• Numerical
• Text
• Media – audio, video
• Geospatial

POPULAR DATABASES

Table columns: Database Type, Database Names
Database types listed: Relational, Key-Value, Column, Document, Graph

OPERATIONAL VERSUS HISTORICAL DATA

Databases
• Operational (OLTP): turns the wheels of the organization
• Analytical (OLAP): watches the wheels of the organization

DATA PIPELINE
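As a rough illustration of the operational versus analytical distinction, the sketch below uses Python's built-in sqlite3 module as a stand-in for both kinds of system. The `time_entries` table, its columns, and the sample rows are hypothetical and are not taken from the presentation.

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE time_entries ("
    " id INTEGER PRIMARY KEY, matter TEXT, timekeeper TEXT, hours REAL)"
)

# OLTP-style work: many small writes, each touching a single row.
entries = [
    ("M-100", "Associate A", 2.5),
    ("M-100", "Partner B", 1.0),
    ("M-200", "Associate A", 4.0),
]
with conn:  # commits on success, rolls back on error
    conn.executemany(
        "INSERT INTO time_entries (matter, timekeeper, hours) VALUES (?, ?, ?)",
        entries,
    )

# OLAP-style work: a read-only query that scans and aggregates many rows.
hours_by_matter = dict(
    conn.execute(
        "SELECT matter, SUM(hours) FROM time_entries GROUP BY matter"
    ).fetchall()
)
print(hours_by_matter)
```

In practice the two workloads usually live in separate systems; a single database is used here only to keep the sketch short.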

Pipeline diagram labels: Operational Databases, ETL, Data Marts, Reporting/BI, Users/Analysts

DATA WAREHOUSE

• The data warehouse is a structured repository of integrated, subject-oriented, enterprise-wide, historical, and time-variant data. The purpose of the data warehouse is the retrieval of analytical information. A data warehouse can store detailed and/or summarized data.

Diagram labels: Informational, Enterprise Data Warehouse (EDW), Analytical, Data Mining

DATA MARTS

• Subset of data from a data warehouse
• Confined to data specific to a single line of business or department, e.g. Finance or Marketing
• Features:
  – Subject oriented
  – Small in size (few tables)
  – Customized by department
  – Source is a departmentally structured data warehouse

EXTRACT, TRANSFORM, LOAD (ETL)

• Creating ETL infrastructure is often the most time- and resource-consuming part of the data warehouse development process
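A minimal sketch of the three ETL steps, assuming a CSV export as the source and SQLite as a stand-in warehouse. The file contents, the `fact_billing` table, and the cleaning rules are all hypothetical; real ETL tooling is considerably more involved.

```python
import csv
import io
import sqlite3

# Hypothetical export from an operational system (input to the extract step).
RAW_CSV = """client,matter,hours,rate
Acme Corp ,M-100,2.5,400
acme corp,M-101,1.0,400
Globex,M-200,,350
"""

def extract(text):
    """Extract: read raw rows from the source export."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: clean rows BEFORE loading (normalize names, drop incomplete rows)."""
    cleaned = []
    for row in rows:
        if not row["hours"]:
            continue  # incomplete record, skipped in this sketch
        client = row["client"].strip().title()
        amount = float(row["hours"]) * float(row["rate"])
        cleaned.append((client, row["matter"], amount))
    return cleaned

def load(conn, rows):
    """Load: write the cleaned rows into the warehouse table."""
    conn.execute("CREATE TABLE fact_billing (client TEXT, matter TEXT, amount REAL)")
    with conn:
        conn.executemany("INSERT INTO fact_billing VALUES (?, ?, ?)", rows)

warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(RAW_CSV)))
totals = warehouse.execute(
    "SELECT client, SUM(amount) FROM fact_billing GROUP BY client"
).fetchall()
print(totals)
```

Only cleaned, conformed rows reach the warehouse table, which is the defining trait of transforming before loading.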

https://dzone.com/

DATA LAKE

• “A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed.”
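The idea that structure is not defined until the data is needed (schema on read) can be sketched as follows. The raw records and the `read_time_entries` helper are hypothetical; a Python list stands in for the lake's storage.

```python
import json

# Raw records land in the "lake" exactly as they arrived; nothing is enforced on write.
raw_lake = [
    '{"type": "email", "from": "a@example.com", "subject": "NDA draft"}',
    '{"type": "time_entry", "matter": "M-100", "hours": "2.5"}',
    '{"type": "time_entry", "matter": "M-200", "hours": 4}',
    "free-text note that is not JSON at all",
]

def read_time_entries(lines):
    """Apply a schema only at read time: keep time entries, coerce hours to float."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # unstructured content is ignored by this particular reader
        if isinstance(record, dict) and record.get("type") == "time_entry":
            yield {"matter": record["matter"], "hours": float(record["hours"])}

time_entries = list(read_time_entries(raw_lake))
print(time_entries)
```

A different consumer could read the same raw records with a different schema, for example to pull out only the emails.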

https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/

DATA LAKE ARCHITECTURE

https://dzone.com/

DATA LAKE VS DATA WAREHOUSE

Attribute | Data Warehouse | Data Lake
Schema | Schema on write (predefined schemas) | Schema on read (no predefined schemas)
Scale | Scales to large volumes at moderate cost - limited storage and number of compute nodes | Scales to huge volumes at low cost - tens of thousands of server nodes
Access methods | Accessed through standardized SQL and BI tools | Accessed through SQL-like systems, programs, and other methods
Workloads | Batch processing, concurrent users performing interactive analytics | Batch processing, stream processing, predictive analytics, improved capability over EDWs for interactive queries
New data | Time consuming to introduce new content | Fast ingestion of new data/content
Cost/efficiency | Efficiently uses CPU/IO | Efficiently uses storage and processing capabilities at very low cost
Data retention | Limited - driven by retention policies | Potential to retain all data (subject to retention policies)
Users | Reporting, Business Intelligence users | Analytics, Data Scientists, Data Engineers
Key benefits | Provides a single enterprise-wide view of data from multiple sources | Allows usage of raw structured and unstructured data from a centralized low-cost store

EXTRACT, LOAD, TRANSFORM (ELT)

• The extracted data is loaded, before it is transformed, into a single, centralized data repository, enabling unlimited access to all of the data at any time
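For contrast with ETL, the sketch below loads raw rows first and transforms them afterwards inside the repository. SQLite stands in for the central repository, and the `raw_billing` table, the `billing_clean` view, and the sample rows are hypothetical.

```python
import sqlite3

repo = sqlite3.connect(":memory:")

# Extract + Load: raw rows go into a staging table untouched (everything as text).
repo.execute("CREATE TABLE raw_billing (client TEXT, hours TEXT, rate TEXT)")
raw_rows = [
    ("Acme Corp ", "2.5", "400"),
    ("acme corp", "1.0", "400"),
    ("Globex", "", "350"),
]
with repo:
    repo.executemany("INSERT INTO raw_billing VALUES (?, ?, ?)", raw_rows)

# Transform: performed later, inside the repository, using SQL.
repo.execute(
    """
    CREATE VIEW billing_clean AS
    SELECT UPPER(TRIM(client)) AS client,
           CAST(hours AS REAL) * CAST(rate AS REAL) AS amount
    FROM raw_billing
    WHERE hours <> ''
    """
)
elt_totals = repo.execute(
    "SELECT client, SUM(amount) FROM billing_clean GROUP BY client"
).fetchall()
raw_count = repo.execute("SELECT COUNT(*) FROM raw_billing").fetchone()[0]
print(elt_totals, raw_count)
```

The raw rows, including the incomplete one, remain available for other transformations later.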

https://dzone.com/

DOCUMENT DATABASES

• Data stored as documents (multiple key-value pairs)
• Inherently a subclass of the key-value store
• Stores all information for a given object in a single instance in the database

DATA MODELS
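The difference between keeping related data in several tables and keeping it in one document can be made concrete with a small sketch. Plain Python lists and dicts stand in for tables and documents; the matter, client, and time-entry records are hypothetical.

```python
import json

# Tabular (relational) shape: related data split across tables, linked by keys.
matters = [{"matter_id": 1, "client_id": 10, "name": "Lease review"}]
clients = [{"client_id": 10, "name": "Acme Corp"}]
time_entries = [
    {"matter_id": 1, "timekeeper": "Associate A", "hours": 2.5},
    {"matter_id": 1, "timekeeper": "Partner B", "hours": 1.0},
]

def to_document(matter):
    """Assemble one self-contained document for a matter (a manual 'join')."""
    client = next(c for c in clients if c["client_id"] == matter["client_id"])
    return {
        "matter_id": matter["matter_id"],
        "name": matter["name"],
        "client": {"name": client["name"]},
        "time_entries": [
            {"timekeeper": t["timekeeper"], "hours": t["hours"]}
            for t in time_entries
            if t["matter_id"] == matter["matter_id"]
        ],
    }

# Document shape: everything about the matter lives in a single nested object.
matter_doc = to_document(matters[0])
print(json.dumps(matter_doc, indent=2))
```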

• Tabular (Relational) Data Model: related data split across multiple records and tables
• Document Data Model: related data contained in a single, rich document

DISCRETE TO CONNECTED DATA

Spectrum diagram labels, from discrete to connected data:
• RDBMS & Aggregate-Oriented NoSQL
• Hadoop / EDW / Columnar RDBMS
• Graph Database & Graph Compute Engine (Graph Transactions & Analytics)

Illustration by David Somerville based on the original by Hugh MacLeod (@gapingvoid)

GRAPH DATABASES

• A graph is just a collection of vertices and edges, or, in less intimidating language, a set of nodes and the relationships that connect them. Graphs represent entities as nodes and the ways in which those entities relate to the world as relationships.
• Five graphs in the world of business (social, intent, consumption, interest, and mobile): the ability to leverage these graphs provides a “sustainable competitive advantage.” (Gartner)

Figure: a small social graph (small network of Twitter users)

Graph Databases, 2nd Edition, by Ian Robinson, Jim Webber, and Emil Eifrém

THE LABELED PROPERTY GRAPH MODEL

• A labeled property graph has the following characteristics:
  – Contains nodes and relationships.
  – Nodes contain properties (key-value pairs).
  – Nodes can be labeled with one or more labels.
  – Relationships are named and directed, and always have a start and end node.
  – Relationships can also contain properties.

RELATIONAL VERSUS GRAPH MODELS
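The characteristics of a labeled property graph can be sketched in plain Python, without any graph database. The `Node` and `Relationship` classes are hypothetical illustrations, the names SHREE, RATED, Titanic, and Toy Story echo the slide's example, and the `stars` values are made up.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    labels: set                                       # one or more labels
    properties: dict = field(default_factory=dict)    # key-value pairs

@dataclass
class Relationship:
    name: str                                         # relationships are named
    start: Node                                       # directed: always a start node
    end: Node                                         # and always an end node
    properties: dict = field(default_factory=dict)    # may carry properties too

shree = Node({"Person"}, {"name": "SHREE"})
titanic = Node({"Movie"}, {"title": "Titanic"})
toy_story = Node({"Movie"}, {"title": "Toy Story"})

# The rating values below are invented for the example.
relationships = [
    Relationship("RATED", shree, titanic, {"stars": 4}),
    Relationship("RATED", shree, toy_story, {"stars": 5}),
]

def rated_movies(person):
    """Follow outgoing RATED relationships from a person; no join tables involved."""
    return [
        r.end.properties["title"]
        for r in relationships
        if r.name == "RATED" and r.start is person
    ]

titles = rated_movies(shree)
print(titles)
```

A graph database stores and indexes these structures natively; the list scan here is only for illustration.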

• No more tables, no more foreign keys, no more joins

Figure labels: relational tables (Person, Ratings, Movie); Graph Model (SHREE, RATED, Terminator, Titanic, Toy Story)

GRAPH DATABASE (NEO4J)

• Database management system with Create, Read, Update and Delete (CRUD) operations working on a graph data model.
• Generally built for use with online transaction processing (OLTP) systems.
• Relationships are first-class citizens of the graph data model

NOW, LET’S EXPLORE THESE CONCEPTS AS APPLIED TO A LAW FIRM

ADVANCED DATA MANAGEMENT IN PRACTICE

ENTERPRISE DATA WAREHOUSE PROJECT: WHAT… WHY… HOW…

DRIVERS FOR CHANGE/INVESTMENT

Issues:
• Loads of data from disparate systems, each with its own metadata
• No firm-wide tool to connect the dots
• We needed greater agility for system conversions

Realization: How we manage, analyze, and leverage data to drive quality and add value will:
• Differentiate us in a demanding legal services market
• Improve profitability

GOALS

• Better manage slowly-changing dimensions
• Solidify systems of ownership
• Optimize EDW for build performance and query performance
• Point-in-time reporting with flexible level of granularity (people changes, client/matter changes)
• Determine a mechanism for handling effective-dated and future-dated data
• Ensure all data in NBI system has a home in the EDW
• Enable a more efficient view into GL actuals and budgets
• Establish governance processes for new key entities, data elements, and data marts

METHODOLOGY

• Data Workshop
• Create a centralized framework and include data across all core firm systems
• Start with People, Clients, Matters
  – Current view
  – Effective Dated
  – Point in Time
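One common way to support current, effective-dated, and point-in-time views is to store each version of a record with a validity range. The sketch below is an assumption about how that could look, not a description of the firm's EDW; the `person_history` table, its columns, the sample person, and the `9999-12-31` open-ended marker are all hypothetical, and SQLite stands in for the warehouse.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE person_history ("
    " person TEXT, title TEXT, effective_from TEXT, effective_to TEXT)"
)
rows = [
    # ISO dates compare correctly as text; 9999-12-31 marks the open-ended row.
    ("J. Doe", "Associate", "2015-09-01", "2018-12-31"),
    ("J. Doe", "Partner", "2019-01-01", "9999-12-31"),
]
with db:
    db.executemany("INSERT INTO person_history VALUES (?, ?, ?, ?)", rows)

def title_as_of(person, as_of):
    """Point-in-time lookup: return the version whose range contains as_of."""
    row = db.execute(
        "SELECT title FROM person_history"
        " WHERE person = ? AND effective_from <= ? AND ? <= effective_to",
        (person, as_of, as_of),
    ).fetchone()
    return row[0] if row else None

print(title_as_of("J. Doe", "2017-06-30"))
print(title_as_of("J. Doe", "2019-08-21"))
```

Under this layout a future-dated change is simply a row whose `effective_from` lies after today, and the current view is the lookup with today's date.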


• Incorporate processes that scrub the data to ensure completeness and accuracy
• Partner with Microsoft for best practices around:
  – Security
  – Performance
• Document our data architecture

EXTENDED PROPERTIES EXAMPLE

METHODOLOGY: SECURITY

• Row level security
• Transparent Data
• Multiple schemas and data marts
• Auditing capabilities
• Dynamic Data Masking

METHODOLOGY: PERFORMANCE

• SQL Merge
• In-Memory Clustered Columnstore Indexing
• Optimized for ad-hoc workloads and star join optimization
• Partitions are an option moving forward

RESULTS

BUSINESS IMPACT

• Consolidation of numerous firm databases
• We have a suite of business intelligence tools that we can customize for clients and the firm
• Enhanced forecasting capabilities
• Data to support every strategic decision
• Ability to use predictive analytics, advanced data mining and machine learning

ClientInsight facilitates:
• Data discovery, an iterative process of discovering patterns and outliers in data
• Data literacy to derive increasingly meaningful information
• Predictive analytics to improve forecasting so the firm and its clients have a 360-degree view of our engagements, which promotes better strategic decision making

LOOKING AHEAD

FUTURE PLANS

• Continued client outreach
• Incorporate Big Data into ClientInsight for added value
• Machine Learning and Artificial Intelligence
• Data Storytelling
• Continue to evolve as data management evolves

TIPS TO BOLSTER BUSINESS AND IMPROVE PROFITABILITY USING DATA SCIENCE

TIPS AND ADVICE - METHODOLOGY

• Create a multi-disciplinary team
• Data Workshop
  – Define success criteria
  – Define goals
  – Identify a starting point
• Focus on your client needs and increased profitability will follow

TIPS AND ADVICE – TOOLS

• Redgate SQL Tools for source control, searches, documentation (Extended Properties)
• Data visualization tools
• Invest in predictive analytics capabilities

DATA MINING

DATA MINING IN THE LEGAL PROFESSION

Finding items that meet human-defined criteria and detecting patterns in data
• IF this document is a non-disclosure agreement, THEN send it to the legal department for review
• IF this NDA meets the following criteria, THEN approve it for signature
• FIND all my contracts with automatic renewal clauses and NOTIFY ME four weeks before they renew
• TELL ME which patents in this portfolio will expire in the next six months
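Two of the rules above (routing NDAs, and flagging auto-renewing contracts four weeks ahead) can be sketched as simple predicates over document metadata. The `Document` fields, the sample documents, and the dates are hypothetical; in practice the document type and renewal terms would first have to be extracted from the text.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

@dataclass
class Document:
    name: str
    doc_type: str
    auto_renews: bool = False
    renewal_date: Optional[date] = None

def route(doc):
    """IF the document is an NDA THEN send it to the legal department for review."""
    return "legal-review" if doc.doc_type == "nda" else "no-action"

def renewals_to_notify(docs, today, notice=timedelta(weeks=4)):
    """FIND auto-renewing contracts whose renewal falls within the notice window."""
    return [
        d.name
        for d in docs
        if d.auto_renews
        and d.renewal_date is not None
        and today <= d.renewal_date <= today + notice
    ]

docs = [
    Document("Vendor NDA", "nda"),
    Document("SaaS agreement", "contract", auto_renews=True, renewal_date=date(2019, 9, 10)),
    Document("Office lease", "contract", auto_renews=True, renewal_date=date(2020, 3, 1)),
]
today = date(2019, 8, 21)

routing = route(docs[0])
due = renewals_to_notify(docs, today)
print(routing, due)
```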

https://jolt.law.harvard.edu/digest/a-primer-on-using-artificial-intelligence-in-the-legal-profession

USE CASES

• Document Review
  – Predictive coding: reduces the volume of irrelevant documents attorneys must wade through
• Analyzing Contracts
  – Clients need to analyze contracts both in bulk and on an individual basis.

https://jolt.law.harvard.edu/digest/a-primer-on-using-artificial-intelligence-in-the-legal-profession

DATA MINING PYRAMID

Pyramid levels, ordered by processing from High to Low:
• Decision making and strategic planning (High)
• Data analysis
• Data retrieval and aggregation
• Data management and storage (Low)

DATA SCIENCE LIFECYCLE

1. Business understanding
2. Data acquisition and understanding
3. Modeling
4. Deployment
5. Customer acceptance

https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/lifecycle

DATA MINING & MACHINE LEARNING

• Classification
• Regression
• Clustering
• Generation

ALL business areas can harness the power of big data and data mining to gain insight and knowledge

CLASSIFICATION

• Classification predicts, for each individual in a population, which class or category that individual belongs to. Examples:
  – Among all the customers, which are likely to respond to a given offer?
  – Is this transaction fraudulent?
  – What type of document is this?
  – Is this message spam?
  – Is the student likely to get an A in this course?
  – Is the image a cat, a dog, or a horse? (multi-class)
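A minimal classification sketch, assuming a nearest-centroid rule, which is one simple kind of linear classifier for two classes (the presentation does not say which algorithm its figure uses). The training points are invented toy data in (height cm, weight kg); only the new data point (200 cm, 100 kg) comes from the slides.

```python
from math import dist

# Toy labeled data (hypothetical): (height_cm, weight_kg) -> class label.
training = [
    ((180, 80), "Male"),
    ((175, 85), "Male"),
    ((185, 90), "Male"),
    ((160, 55), "Female"),
    ((165, 60), "Female"),
    ((155, 50), "Female"),
]

def fit_centroids(samples):
    """Training: compute the mean point (centroid) of each class."""
    sums = {}
    for (height, weight), label in samples:
        total_h, total_w, count = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (total_h + height, total_w + weight, count + 1)
    return {label: (h / n, w / n) for label, (h, w, n) in sums.items()}

centroids = fit_centroids(training)

def predict(point):
    """Prediction: assign the class whose centroid is closest to the point."""
    return min(centroids, key=lambda label: dist(point, centroids[label]))

new_point = (200, 100)  # the new data point from the slide
print(predict(new_point))
```

With two classes, the set of points equidistant from both centroids is a straight line, which is why this counts as a linear decision boundary.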

Figure: linear classification algorithm trained on data with class labels; new data (height = 200 cm, weight = 100 kg) is classified as Male.

CLASSIFICATION : USE CASE

Application Reason:
• I am getting married and need a cash advance
• I am starting a new business…

REGRESSION

• Regression attempts to estimate or predict, for each individual, the numerical value of some variable for that individual.
• For example:
  – How much will a given customer use the service?
  – How likely is a given voter to support a candidate?
  – How much will the product demand change if the price increases by 3%?
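A minimal regression sketch using ordinary least squares with a single predictor. The training data is an invented, exactly linear toy set, chosen so that the prediction lines up with the new-data values shown in the slides (weight = 100 kg, height = 200 cm); the presentation does not specify which regression algorithm its figure uses.

```python
# Toy data (hypothetical): weight in kg paired with height in cm.
weights = [50, 60, 70, 80, 90]
heights = [150, 160, 170, 180, 190]

def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(weights, heights)

def predict_height(weight_kg):
    """Predict a numerical value (height) for a new individual."""
    return intercept + slope * weight_kg

print(predict_height(100))  # 200.0 for this toy data
```

Unlike classification, the output is a number on a continuous scale rather than a class label.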

Figure: regression algorithm; new data: weight = 100 kg, height = 200 cm.

UNSTRUCTURED TEXT IS EVERYWHERE!

• Physicians (1 million): 1 million physicians × 10 pages of content/day × 251 days = 5.02 TB per year
• Nurses (4 million): 4 million nurses × 5 pages of content/day × 251 days = 10 TB per year
• Inpatient (36 million): 36 million patient stays/yr × 4 pages of content/day × 4.5 days avg length of stay = 1.3 TB per year

16.3 terabytes of unstructured narratives are created annually. 0.5 petabyte of unstructured narratives sits unutilized in US electronic medical systems.

DEEP LEARNING FOR UNREVEALING FACES

Image Source: University of Florida Health (no copyright infringement is intended)

Figure labels: Tobacco Usage, Biomedical measures, Anxiety Prediction, Risk of Suicidal

DEEP LEARNING AS A DECISION MAKER

Prior Condition: (Patient Clinical History: 34 year old female with history of sickle cell disease status post right total shoulder)

Train: (Findings: Patient used prophylactic antibiotics before and after the procedure as prescribed. The patient received antibiotic prophylaxis with oral Ciprofloxacin 500 mg and Keflex 500 mg. The patient took 1 tablet of Ciprofloxacin 500 mg the night before and morning of the procedure. The patient also took 2 tablets of Keflex 500 mg the morning of the procedure. After the procedures, the patient will take 2 tablets of Ciprofloxacin 500 mg and 2 tablets of Keflex 500 mg. Following the discussion of the risks, benefits, and alternatives of the procedure, informed consent was obtained. Dynatrim was used to localize the left mid gland peripheral zone of the prostate at the site of the T2 hypointense lesion. Using the Dynatrim guidance system an 18-gauge MR-compatible needle was inserted into the left side of the prostate. Upon confirmation of the tip of the needles close to the suspected lesion, MR guided biopsy using an 18 gauge biopsy needle was performed and 3 cores were obtained. The cores were hand delivered to the pathology lab. The patient tolerated the procedure well and was discharged without any immediate complications. ESTIMATED BLOOD LOSS: Less than 5 cc)

Predict: (Impressions: Successful core-biopsies of the left mid gland peripheral zone lesion)

HEALTHCARE LANGUAGE IS DIFFERENT BUT NOT IMPOSSIBLE…

• http://annotations.inferenceanalytics.com/ (viewer/ inaViewer18 )

KNOWLEDGE GENERATION ON TEXT, IMAGE, VIDEO, SPEECH

• New trend of deep learning based predictions: https://affinelayer.com/pixsrv/
• Can we predict the next tweet of President Trump?

LEARNING TRENDS IN 2019

• Transfer Learning
• Multi-Task Learning (e.g. Autonomous Vehicles)
• Federated Learning / Edge Computing
• Integration with Emerging Tech (e.g. IoT, Blockchain)

DEMO

• Connect to Graph Databases (Neo4J)
• Analyze Paradise Papers Dataset

Q&A