Governing Your Cloud-Based Enterprise Data Lake

Selwyn Collaco, Chief Data Officer, TMX
Ben Sharma, Founder and CEO, Zaloni

TMX Data Strategy
Unlocking The Business Value Of DATA, An Enterprise Asset

Enterprise Data Strategy
“Our strategy is that we use DATA as an Enterprise (TMX) Asset to unlock Business Value and enable Growth”
1. Consolidated Data Asset
2. Secure & Accessible
3. Advanced Analytics & Visualization

Foundational Enablers for Transforming TMX Data Assets
[Diagram: five numbered enablers; the bullet text is not recoverable from the source.]

Update On Our Journey
“Unlocking business value through our DATA”

Client C360: Holistic view of our customers through TMX
▪ Enable cross sell & up sell
▪ Enterprise-Wide Account management
▪ Enterprise Client Relationship view

Data Governance: Governance process framework for managing data
▪ Data ownership
▪ Data classification
▪ Data quality

Enterprise BI: Enterprise wide BI and Data Visualization
▪ Reduce time and effort for Reporting & Analysis
▪ Self-serve BI and Empowered business users
▪ Improved Quality and Better insights

Analytics as a Service: Use case driven search based analytics
▪ Ad-hoc analytical capability in milliseconds
▪ Real-time calculations
▪ White label for monetization opportunities

Enterprise Data Platform: Enterprise Data Lake, an Enterprise Wide data repository
▪ Data Platform supporting Business Insights
▪ Technology enabling new Business Capabilities
  • Research & Product Development
  • Improved data fulfilment
  • Advanced Analytics

Enterprise Data & Analytics Platform
ENABLING NEW CAPABILITIES
SINGLE SOURCE OF DATA | DATA MARKET PLACE | 360° VIEW | ADVANCED ANALYTICS & VISUALIZATION

[Diagram: data flows from TMX Data Assets through the TMX Enterprise Data Repository (Data Lake) to Data Consumption.]

TMX Data Assets
• Trading Data (Equities & Derivatives)
• Market Data
• Order book
• Security Related Data
• Clearing Data
• Listing Information
• Billing / Finance Data
• Customer Data (Salesforce)
• Alternate Data

TMX Enterprise Data Repository (Data Lake)
• Ingest: Batch/Streaming; File/Relational Data; Internal/Externally Hosted/Third Party
• Govern: Master and Metadata Management; Data Lineage; Data Profiling & Data Preparation; Data Security & Access controls; Data Catalog; Data Quality
• Distribute: Data Consumption

Data Consumption
• Analyst Workbench
• Advanced Analytics
• Data Discovery and Analytics
• Self-Service Visualization
• Data Prep
• Plug & Play Applications
• ETF
• Broker
• GRAPEVINE
• Custom Solutions
• Analytics As a Service
• Self-Service BI & Reporting
• Pre-defined Analytic Templates
Powered By:

GOVERNANCE: DATA POLICIES & OVERSIGHT | DATA DEFINITION | DATA MANAGEMENT | DATA CLASSIFICATION

Non-Invasive Data Governance Approach
To develop a Data Governance structure that is right-sized for TMX, the proposed Enterprise Data Governance structure is based on key observations from stakeholder interviews, which provided insights into key opportunities and challenges, the data domain structure (1), and in-flight data initiatives.

TMX Group stakeholders
• Global Solutions, Insights and Analytics (GSIA)
• Global Equity Capital Markets (includes Equity Trading & Listings)
• Montreal Exchange (MX) & Regulatory (MXR)
• CDS / CDCC
• TSX Trust
• Shorcan
• Enterprise Strategy
• Global Technology Solutions (GTS)
• Finance
• Marketing
• Legal & ERM
• Human Resources
Note: Working sessions did not include Trayport as they were a recent acquisition.

Key Opportunities and Challenges Identified
• Lack of data ownership
• Data accessibility and security
• Duplicative Data
• Third Party Data Management
• Data Sharing Capabilities
• Manual Processes
• Data retention / access to historical data

Data Domains
• Trading: Deriv.
• Trading: Equities
• Trading: FI
• Listings
• Clearing: Deriv.
• Clearing: Equities
• Regulatory: Deriv.
• Finance
• Marketing
• Customer
• Legal
• Human Resources
• Risk
• Tech. Operations
• External/Alternative

In-flight Initiatives
• Advanced Analytics
• Client 360 Platform
• Issuer Services
• Workday
• Excel.
• Tableau
• Atlas

Enterprise Data Governance Structure
• Based on the identified key opportunities and challenges, as well as TMX’s data domain structure, a two-tier structure is created.
• An ongoing responsibility of the Governance bodies will be to prioritize and address data opportunities and challenges.
• Data Owners for each domain have been identified and will serve as the foundation for the EDGC; an in-flight project will be selected for preliminary implementation.

(1) Data domains and data domain owners are further detailed in the TMX Data Asset Catalogue.

Enterprise Data Platform: Key Technology Options
➢ There are a number of technology solutions available in the marketplace for building the Enterprise Data Platform.
➢ The technologies required for the Data Platform fall into THREE categories.
➢ The chart describes the categories and the key options within each category.

Technology Platform
• On-Premise Servers
• Cloud: AWS*, Azure

Big Data Hadoop Distribution
• Cloudera (On-prem, Cloud)
• Hortonworks (On-prem, Cloud)
• EMR (AWS)

Big Data Management Tool
• Informatica (Enterprise ETL)
• IBM Data Stage (Enterprise ETL)
• Zaloni (DataLake)

Legend: Technology Selected

Transform your business with an integrated self-service data platform
September 12, 2018

Today’s data confusion
Zaloni Confidential and Proprietary - Provided under NDA

The key is an integrated self-service data platform
1. Manage data across on-prem and cloud environments
2. Metadata management to find useful data and enable governance
3. Automation of data pipelines for scale and efficiency
4. Improve time to insight with self-service for data consumers

[Diagram, top to bottom: Advanced Analytics, BI, Enterprise Reporting, Applications; Zaloni integrated self-service data platform (Data Management, Governance, Self-service); Compute; Storage.]
© Zaloni Inc. 2018, All rights reserved | Zaloni proprietary

No matter where you are on your data lake journey...

Modernize your platform
● Need more agility and flexibility
● Need advanced analytics to reduce time to insight
● Do not want the complexity of building and integrating the platform

Improve governance and adoption
● Early Hadoop adopters
● Not sure what data is in the swamp
● Need to right-size governance
● Data duplicated across ponds
● Enterprise adoption
● Need to demonstrate ROI

Self-service data platform
● Acquire useful data from across enterprise
● Improved visibility and understanding via metadata management
● Ensure security/privacy of sensitive data
● Scalable production data lake for new and improved business insights

[Diagram labels: Traditional Architecture (Data Warehouse); Scale Out Architecture (Data Swamp/Data Ponds); Govern; Governed Data Lake.]

Typical data lake reference architecture

Data sources
• Relational Data Stores (OLTP/ODS/DW)
• Logs (or other unstructured data)
• Sensors (or other time series data)
• Social and shared data

Data Lake zones

Transient Landing Zone
• Temporary store of source data
• Consumers are IT, Data Stewards
• Implemented in highly regulated industries

Raw Zone
• Original source data ready for consumption
• Consumers are developers, data stewards, some data scientists
• Single source of truth with history

Trusted Zone
• Standardized on corporate governance/quality policies
• Consumers are anyone with appropriate role-based access

Refined Zone
• Data required for LOB specific views, transformed from existing certified data
• Consumers are anyone with appropriate role-based access

Sandbox
• Ad-hoc exploratory analysis
• Data scientists, data consumers with proper privileges

Zaloni’s integrated self-service data platform (ZDP) accelerates time to insight
Provides the foundation for your data initiatives

Enable
• Batch Ingestion
• Streaming Ingestion
• Metadata Capture
• Auto Discovery

Govern
• Data Quality
• Data Lineage
• Data Mastering
• Data Privacy/Security
• Data Lifecycle Management

Engage
• Data Marketplace
• Self-Service Operations
• Data Enrichment

Cloud, On-Premises, Hybrid
Infrastructure and cloud-platform agnostic
Zaloni proprietary: do not duplicate without permission

Enable the Data Lake with ZDP

Scale out ingestion of data
- Batch and streaming ingestion
- DB ingestion with support for full/incremental updates
- Metadata capture and integration with Data Pipeline
- Ingestion Wizard simplifies ingestion and creates reusable workflows

Automated data inventory
- Crawls the data store to catalog datasets
- Automatically detects metadata in several cases

Workflow and orchestration
- Robust workflow management feature with scalable and extensible architecture

Monitor the health of the data lake
- Single unified log repository for all data lake operations
- Dashboards for operational health of the data lake

Govern the Data Lake with ZDP

Metadata management
- Business, technical and operational metadata
- Captures lineage from ingestion to consumption
- Integration with Enterprise Metadata repositories

Security and access control
- Masking and Tokenization of sensitive data
- RBAC for data lake artifacts (Entities, DQ rules, ingestion), with push down to underlying security framework
- Supports kerberized environment

Data quality
- Rules Engine for DQ
- DQ reporting and analysis
- Automation of DQ remediation process

Data enrichment
- Drag and drop enrichment of the data
- Support for data operations: JOIN, FILTER, UNION, etc.
- Out of the box format conversion, parsers with ability to extend with custom implementations

Data lifecycle management
- Policy based DLM
- Leverages HDFS storage tiers and supports S3 interface

Engage the Business with ZDP

Data marketplace
- Rich
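
As a rough illustration of how the Raw-to-Trusted promotion in the reference architecture could combine with the DQ rules engine and tokenization features listed above, here is a minimal Python sketch. It is not ZDP's API. The `Rule` class, the `tokenize` and `promote_to_trusted` functions, the salt value, and the record fields (`symbol`, `price`, `account_id`) are all hypothetical names chosen for this example, and the salted hash is only a stand-in for a real tokenization service.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]


@dataclass
class Rule:
    """A named data quality check (hypothetical, for illustration)."""
    name: str
    check: Callable[[Record], bool]


def tokenize(value: str, salt: str = "demo-salt") -> str:
    """Replace a sensitive value with a deterministic 16-character token.

    Assumption: a salted SHA-256 digest stands in for a real tokenization
    service; a production system would use a managed vault or key service.
    """
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:16]


def promote_to_trusted(
    raw_records: List[Record],
    rules: List[Rule],
    sensitive_fields: List[str],
) -> Tuple[List[Record], List[Tuple[Record, List[str]]]]:
    """Split raw records into trusted records and rejects.

    A record is promoted only if it passes every rule. Promoted records are
    copies with sensitive fields tokenized; the raw records are not modified,
    so the Raw Zone keeps the original source data.
    """
    trusted: List[Record] = []
    rejected: List[Tuple[Record, List[str]]] = []
    for record in raw_records:
        failed = [rule.name for rule in rules if not rule.check(record)]
        if failed:
            rejected.append((record, failed))
            continue
        clean = dict(record)
        for field_name in sensitive_fields:
            if field_name in clean:
                clean[field_name] = tokenize(str(clean[field_name]))
        trusted.append(clean)
    return trusted, rejected


# Hypothetical sample data standing in for the Raw Zone.
raw_zone = [
    {"symbol": "ABC", "price": 10.5, "account_id": "A-1001"},
    {"symbol": "XYZ", "price": -1.0, "account_id": "A-1002"},
]

rules = [
    Rule("symbol_present", lambda r: bool(r.get("symbol"))),
    Rule(
        "price_positive",
        lambda r: isinstance(r.get("price"), (int, float)) and r["price"] > 0,
    ),
]

trusted, rejected = promote_to_trusted(raw_zone, rules, ["account_id"])
print("trusted:", trusted)
print("rejected:", [(rec["symbol"], names) for rec, names in rejected])
```

Keeping rejects alongside the names of the rules they failed is one way to feed the DQ reporting and remediation steps; whether rejects are quarantined, repaired, or dropped is a policy decision the slides do not specify.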
