
DATA ENGINEERING & DATA MINING
Exploring Trends and Approaches from Inside and Outside of Our Industry
Data Science Series, Part 3 of 4
21 August 2019
#ILTACON19

DATA SCIENCE SERIES SCHEDULE
• Session 1, Monday @ 3:30 PM: LAW in the ERA of DATA SCIENCE (Ed Walters, Dazza Greenwood)
• Session 2, Tuesday @ 3:30 PM: The RISE of the LEGAL DATA SCIENCE TEAM (Carmin Ballou, Eric J. Felsberg, Mike Klastava)
• Session 3, Wednesday @ 3:30 PM: DATA ENGINEERING and DATA MINING (Lisa Mayo, Shree Bharadwaj)
• Session 4, Thursday @ 11:30 AM: DATA VISUALIZATION in the LEGAL INDUSTRY (Melaina Fireman, Andrew P. Medeiros, Mark Thorogood)

SPEAKERS/MODERATOR
• Shreenidhi Bharadwaj: VP, Data & Analytics; Adjunct Professor, University of Chicago; [email protected]
• Lisa Mayo: Director of Data Management, Ballard Spahr LLP; [email protected]
• ANDREW BAKER (Moderator): Senior Director, Digital Services + Analytics, HBR Consulting

SESSION NEED
Data Engineering + Data Mining
• Data Engineering + Data Management: How we store, stage, prep and ready data for consumption
• Data Mining: How we explore data and begin to derive meaning, insights and value from those assets

DATA ENGINEERING

THE DATA REVOLUTION
• Data is changing the world. Data to this century is what oil was for the last century - a driver for growth, change, and success.
David Parkins
https://www.youtube.com/watch?v=4ycC0DJqrpc
https://leewardcapitalmgt.com/the-economist-the-worlds-most-valuable-resource-is-no-longer-oil-but-data/

BIG DATA
• Facebook: stores 400 PB of data, with an incoming daily rate of about 600 TB.
(as of 2017)
• YouTube: 1000 PB video storage, 100 M views/day
• Google: 4M searches/minute, stores 10 EB data (estimation)
• AT&T: 1.9 T phone call records, 70,000 calls/second
• US Credit cards: 1.4 B cards, 20 B transactions/year
• Your Law Firm: 1 Bazillion Documents

DATA-DRIVEN STUDY IN LEGAL INDUSTRY
• Critical inputs are overlooked, which suggests that many law firms may be missing data-oriented opportunities for growth
– Expanding client base
– Billing more hours, etc.
• Are firms missing opportunities to improve the practice of law itself?
https://www.clio.com/resources/legal-trends/2018-report/

DATA LIFECYCLE: ENABLING BUSINESS GROWTH
• Data Lifecycle Management (DLM) is a process that helps organizations manage the flow of data throughout its lifecycle: from creation, to use, to sharing, archive and deletion.
[Diagram labels: Analyze, Share, Capture, Curate, Store, Aggregate, Iterate, Archive, Create, Enrich, Secure]

ENTERPRISE DATA MANAGEMENT
• Holistic framework comprising the people, processes and technology that optimizes data from a variety of different sources, then makes it available when and where it’s needed (harmonization)
https://www.firstsanfranciscopartners.com/data-management/

ALIGNING BUSINESS STRATEGY & DATA STRATEGY
• A successful data strategy links business goals with technology solutions
https://globaldatastrategy.com/our-services/enterprise-data-strategy/

DATA ENGINEERING
• Aspect of data science that focuses on automation of practical applications of data collection, curation, analysis and delivery in batch as well as in near real time.
[Diagram (cycle): Ingest/Extract, Prepare/Clean, Store/Organize, Analyze/Deliver]

DATA FORMATS
• Numerical
• Text
• Media - audio, video
• Geospatial

POPULAR DATABASES
[Table: Database Type / Database Names. Types listed: Relational, Key-Value, Column, Document, Graph]

OPERATIONAL VERSUS HISTORICAL DATA
• Operational (OLTP) databases: turn the wheels of the organization
• Analytical (OLAP) databases: watch the wheels of the organization

DATA PIPELINE
[Diagram labels: Operational Databases, ETL, Data Warehouse, Data Marts, Reporting/BI, Users/Analysts]

DATA WAREHOUSE
• The data warehouse is a structured repository of integrated, subject-oriented, enterprise-wide, historical, and time-variant data. The purpose of the data warehouse is the retrieval of analytical information. A data warehouse can store detailed and/or summarized data.
[Diagram labels: Informational, Enterprise Data Warehouse (EDW), Analytical, Data Mining]

DATA MARTS
• Subset of data from a data warehouse
• Confined to data specific to a single line of business or department, e.g. Finance or Marketing
• Features:
– Subject oriented
– Small in size (few tables)
– Customized by department
– Source is a departmentally structured data warehouse

EXTRACT, TRANSFORM, LOAD (ETL)
• Creating ETL infrastructure is often the most time- and resource-consuming part of the data warehouse development process
https://dzone.com/

DATA LAKE
• “A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data.
The data structure and requirements are not defined until the data is needed.”
https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/

DATA LAKE ARCHITECTURE
https://dzone.com/

DATA LAKE VS DATA WAREHOUSE
• Schema
– Data Warehouse: Schema on write (predefined schemas)
– Data Lake: Schema on read (no predefined schemas)
• Scale
– Data Warehouse: Scales to large volumes at moderate cost - limited storage and number of server nodes
– Data Lake: Scales to huge volumes at low cost - tens of thousands of compute nodes
• Access methods
– Data Warehouse: Accessed through standardized SQL and BI tools
– Data Lake: Accessed through SQL-like systems, programs, and other methods
• Workloads
– Data Warehouse: Batch processing, concurrent users performing interactive analytics
– Data Lake: Batch processing, stream processing, predictive analytics, improved capability over EDWs for interactive queries
• New data
– Data Warehouse: Time consuming to introduce new content
– Data Lake: Fast ingestion of new data/content
• Cost/efficiency
– Data Warehouse: Efficiently uses CPU/IO.
– Data Lake: Efficiently uses storage and processing capabilities at very low cost.
• Data Retention
– Data Warehouse: Limited - driven by retention policies
– Data Lake: Potential to retain all data (subject to retention policies)
• Users
– Data Warehouse: Reporting, Business Intelligence users
– Data Lake: Analytics, Data Scientists, Data Engineers
• Key Benefits
– Data Warehouse: Provides a single enterprise-wide view of data from multiple sources
– Data Lake: Allows usage of raw structured and unstructured data from a centralized low-cost store

EXTRACT, LOAD, TRANSFORM (ELT)
• Loading of the extracted data into a single, centralized data repository, enabling unlimited access to all of the data at any time
https://dzone.com/

DOCUMENT DATABASES
• Data stored as documents (multiple key-value pairs)
• Inherently a subclass of the key-value store
• Stores all information for a given object in a single instance in the database

DATA MODELS
• Tabular (Relational) Data Model: Related data split across multiple records and tables
• Document Data Model: Related data contained in a single, rich document

DISCRETE TO CONNECTED DATA
[Diagram: spectrum from discrete to connected data: RDBMS & Aggregate-Oriented NoSQL; Hadoop / EDW / Columnar RDBMS; Graph Database & Graph Compute Engine (Graph Transactions & Analytics)]
Illustration by David Somerville based on the original by Hugh McLeod (@gapingvoid)

GRAPH DATABASES
• A graph is just a collection of vertices and edges - or, in less intimidating language, a set of nodes and the relationships that connect them. Graphs represent entities as nodes and the ways in which those entities relate to the world as relationships.
[Figure: Small network of Twitter users - a small social graph]
• Five graphs in the world of business - social, intent, consumption, interest, and mobile. Ability to leverage these graphs provides a “sustainable competitive advantage.” - Gartner
Graph Databases, 2nd Edition, by Ian Robinson, Jim Webber, and Emil Eifrém

THE LABELED PROPERTY GRAPH MODEL
• A labeled property graph has the following characteristics:
– Contains nodes and relationships.
– Nodes contain properties (key-value pairs).
– Nodes can be labeled with one or more labels.
– Relationships are named and directed, and always have a start and end node.
– Relationships can also contain properties.

RELATIONAL VERSUS GRAPH MODELS
• No more tables, no more foreign keys, no more joins
[Diagram labels: relational tables Person, Ratings, Movie; graph model with SHREE, RATED, and the movies Terminator, Titanic, Toy Story]

GRAPH DATABASE (NEO4J)
• Database management system with Create, Read, Update and Delete (CRUD) operations working on a graph data model.
• Generally built for use with online transaction processing (OLTP) systems.
• Relationships are first-class citizens of the graph data model

NOW, LET’S EXPLORE THESE CONCEPTS AS APPLIED TO A LAW FIRM

ADVANCED DATA MANAGEMENT IN PRACTICE

ENTERPRISE DATA WAREHOUSE PROJECT: WHAT… WHY… HOW…

DRIVERS FOR CHANGE/INVESTMENT
Issues:
• Loads of data from disparate systems, each with its own metadata
• No firm-wide tool to connect the dots
• We needed greater agility for system conversions
Realization: How we manage, analyze, and leverage data to drive quality and add value will:
• Differentiate us in a demanding legal services market
• Improve profitability

GOALS
• Better manage slowly-changing dimensions
• Solidify systems of ownership
• Optimize EDW for build performance and query performance
• Point-in-time reporting with flexible level of granularity (people changes, client/matter changes)
• Determine a mechanism for handling effective-dated and future-dated data
• Ensure all data in NBI system has a home in the EDW
• Enable a more efficient view into GL actuals and budgets
• Establish governance processes for new key entities, data elements, and data marts

METHODOLOGY
• Data Workshop
• Create a centralized framework and include data across all core firm systems
• Start with People, Clients, Matters
– Current view
– Effective Dated
– Point in Time
• Incorporate processes that scrub the data to ensure completeness and accuracy
• Partner with Microsoft for best practices around:
– Security
–
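
The goals and methodology above mention slowly-changing dimensions, effective-dated and future-dated data, and point-in-time reporting. As a minimal sketch of how those ideas fit together, the Python below keeps one row per version of a person with a validity range and looks up the version in force on a given date. This is an illustration under stated assumptions, not the firm's actual EDW design: the `PersonVersion` fields, the `as_of` helper, and the sample history are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class PersonVersion:
    """One effective-dated row of a hypothetical People dimension."""
    person_id: int
    title: str
    effective_from: date          # inclusive start of validity
    effective_to: Optional[date]  # exclusive end; None means open-ended


def as_of(versions: List[PersonVersion], person_id: int, on: date) -> Optional[PersonVersion]:
    """Return the version of a person that was (or will be) in force on `on`."""
    for v in versions:
        if v.person_id != person_id:
            continue
        started = v.effective_from <= on
        not_ended = v.effective_to is None or on < v.effective_to
        if started and not_ended:
            return v
    return None  # no row covers that date


# Hypothetical history: a promotion entered ahead of time (future-dated row).
history = [
    PersonVersion(1, "Associate", date(2015, 9, 1), date(2020, 1, 1)),
    PersonVersion(1, "Partner", date(2020, 1, 1), None),
]

current = as_of(history, 1, date(2019, 8, 21))   # point-in-time view for a past or present date
upcoming = as_of(history, 1, date(2020, 1, 1))   # the future-dated row takes over on its start date
print(current.title, upcoming.title)
```

Using an inclusive start and an exclusive end means adjacent versions never overlap, so any date resolves to at most one row. The "current view" is then just the same lookup with today's date.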