Governing Your Cloud-Based Enterprise Data Lake


Governing Your Cloud-Based Enterprise Data Lake

Selwyn Collaco, Chief Data Officer, TMX
Ben Sharma, Founder and CEO, Zaloni

TMX Data Strategy: Unlocking the Business Value of Data, an Enterprise Asset

"Our strategy is that we use data as an enterprise (TMX) asset to unlock business value and enable growth." The strategy rests on three pillars:

1. Consolidated Data Asset
2. Secure & Accessible
3. Advanced Analytics & Visualization

3 Foundational Enablers for Transforming TMX Data Assets

Update on Our Journey: "Unlocking business value through our data"

• Cloud
• Client C360 – a holistic view of our customers across TMX
  ▪ Enable cross-sell and up-sell
  ▪ Enterprise-wide account management
  ▪ Enterprise client relationship view
• Data Governance – a governance process framework for managing data
  ▪ Data ownership, data classification, and data quality
• Enterprise BI – enterprise-wide BI and data visualization
  ▪ Reduced time and effort for reporting and analysis
  ▪ Self-serve BI and empowered business users
  ▪ Improved quality and better insights
• Analytics as a Service – use-case-driven, search-based analytics
  ▪ Ad-hoc analytical capability in milliseconds
  ▪ Real-time calculations
  ▪ White-label offerings for monetization opportunities
• Enterprise Data Platform – the enterprise-wide data lake repository
  ▪ A data platform supporting business insights
  ▪ Technology enabling new business capabilities: research & product development, improved data fulfilment, and advanced analytics

Enterprise Data & Analytics Platform: Enabling New Capabilities

Single source of data | Data marketplace | 360° view | Advanced analytics & visualization

• TMX Data Assets: trading data (equities & derivatives), ETF and broker market data, order book, clearing data, billing and finance information, customer data (Salesforce), and alternate data – arriving batch or streaming, as files or relational data, internally or externally hosted, or from third parties
• TMX Enterprise Data Repository (the data lake): ingest, govern, and distribute – with master and metadata management, security and access controls, data lineage, data profiling, a data catalog, and data quality
• Data Consumption: analyst workbench, advanced analytics, data discovery and analytics (GRAPEVINE), self-service visualization, plug-and-play applications, data preparation, listing information, analytics as a service, self-service BI and reporting, and pre-defined analytic templates

Governance spans the platform: data policies & oversight | data definition | data management | data classification

Non-Invasive Data Governance Approach

To develop a data governance structure that is right-sized for TMX, the proposed enterprise data governance structure was built on key observations from stakeholder interviews, which provided insight into key opportunities and challenges, the data domain structure, and in-flight data initiatives. The working sessions covered the TMX Group business units: Global Solutions, Insights and Analytics (GSIA); Global Equity Capital Markets (including Equity Trading & Listings); Montreal Exchange (MX) & Regulatory (MXR); CDS/CDCC; TSX Trust; Shorcan; Enterprise Strategy; Global Technology Solutions (GTS); Finance; Marketing; Legal & ERM; and Human Resources. (Working sessions did not include Trayport, a recent acquisition.)

Key opportunities and challenges identified:
- Lack of data ownership
- Data accessibility and security
- Duplicative data
- Third-party data management
- Data-sharing capabilities
- Manual processes
- Data retention and access to historical data

Data domains: Trading (Derivatives, Equities, FI), Clearing (Derivatives, Equities), Regulatory (Derivatives), Customer, Marketing, Legal, Human Resources, Listings, Risk, Technology Operations, External/Alternative, and Finance.

In-flight initiatives: Advanced Analytics, the Client 360 Platform, Issuer Services, Workday, and Excel.
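The governance approach above pairs each data domain with an accountable owner and a classification. As a minimal sketch of the idea (the domain names, fields, and owners below are hypothetical, not TMX's actual catalogue), a domain registry might look like:

```python
from dataclasses import dataclass

@dataclass
class DataDomain:
    """One governed data domain and the facts a governance body tracks for it."""
    name: str
    owner: str            # accountable data owner ("" = unowned)
    classification: str   # e.g. "public", "internal", "confidential"

class DomainRegistry:
    """Minimal catalogue mapping data domains to owners."""
    def __init__(self):
        self._domains = {}

    def register(self, domain):
        self._domains[domain.name] = domain

    def owner_of(self, name):
        return self._domains[name].owner

    def unowned(self):
        # The "lack of data ownership" gap surfaced in the interviews:
        # domains registered without an accountable owner.
        return [d.name for d in self._domains.values() if not d.owner]

registry = DomainRegistry()
registry.register(DataDomain("trading-equities", "Equity Trading", "confidential"))
registry.register(DataDomain("clearing-derivatives", "", "confidential"))
print(registry.unowned())  # -> ['clearing-derivatives']
```

Even a toy registry like this makes the "non-invasive" point: governance starts as a catalogue of what already exists and who answers for it, not a rework of the systems themselves.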
Tableau and Atlas are also among the in-flight initiatives.

Based on the identified key opportunities and challenges, as well as TMX's data domain structure, a two-tier governance structure was created. An ongoing responsibility of the governance bodies will be to prioritize and address data opportunities and challenges. Data owners for each domain have been identified and will serve as the foundation for the EDGC; an in-flight project will be selected for preliminary implementation. Data domains and data domain owners are further detailed in the TMX Data Asset Catalogue.

Enterprise Data Platform – Key Technology Options

➢ A number of technology solutions are available in the marketplace for building the enterprise data platform.
➢ The technologies required for the data platform fall into three categories.
➢ The chart below describes those categories and the key options within each (* = technology selected).

• Technology Platform: on-premise servers; cloud – AWS*, Azure
• Big Data Hadoop Distribution: Cloudera (on-prem, cloud); Hortonworks (on-prem, cloud); EMR (AWS)
• Big Data Management Tool: Informatica (enterprise ETL); IBM DataStage (enterprise ETL); Zaloni (data lake)

Transform Your Business with an Integrated Self-Service Data Platform
September 12, 2018
Zaloni Confidential and Proprietary – Provided under NDA
© Zaloni Inc. 2018, All rights reserved

Today's data confusion

The key is an integrated self-service data platform:
1. Manage data across on-prem and cloud environments
2. Metadata management to find useful data and enable governance
3. Automation of data pipelines for scale and efficiency
4. Improved time to insight with self-service for data consumers
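The pipeline-automation point is usually achieved by driving ingestion from declarative specifications rather than hand-written jobs: adding a feed then means adding a spec, not new code. A minimal sketch, with hypothetical source names and a stubbed-out runner in place of a real ingestion engine:

```python
# Each pipeline is declared as data; the platform turns the declarations
# into runnable jobs, so scaling to hundreds of feeds stays manageable.
PIPELINES = [
    {"source": "orders.csv",   "mode": "batch",     "target_zone": "raw"},
    {"source": "trades-topic", "mode": "streaming", "target_zone": "raw"},
]

def build_job(spec):
    """Turn one declarative spec into a callable ingestion job."""
    def run():
        # A real platform would launch a batch load or attach a stream
        # consumer here; this stub just reports the planned work.
        return f"{spec['mode']} ingest: {spec['source']} -> {spec['target_zone']}"
    return run

jobs = [build_job(spec) for spec in PIPELINES]
for job in jobs:
    print(job())
# batch ingest: orders.csv -> raw
# streaming ingest: trades-topic -> raw
```

Because the specs are plain data, they can also carry the metadata (owner, schema, target zone) that the governance layer needs — one declaration feeds both automation and the catalog.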
No matter where you are on your data lake journey...

Modernize your platform (traditional architecture: the data warehouse)
● Need more agility and flexibility
● Need advanced analytics to reduce time to insight
● Do not want the complexity of building and integrating the platform
● Need to demonstrate ROI and improved business insights

Improve governance (scale-out architecture: the data swamp and data ponds)
● Early Hadoop adopters
● Not sure what data is in the swamp
● Need to right-size governance
● Data duplicated across ponds
● Scalable production data lake for new…

Self-service data platform adoption (the governed data lake)
● Acquire useful data from across the enterprise
● Improved visibility and understanding via metadata management
● Ensure security/privacy of sensitive data
● Enterprise adoption

Typical data lake reference architecture

Source systems feeding the lake: relational data stores (OLTP/ODS/DW), logs and other unstructured data, sensors and other time-series data, and social and shared data.

• Transient Landing Zone – a temporary store of source data; consumers are IT and data stewards; implemented in highly regulated industries
• Raw Zone – original source data ready for consumption; the single source of truth, with history; consumers are developers, data stewards, and some data scientists
• Trusted Zone – data standardized on corporate governance and quality policies; consumers are anyone with appropriate role-based access
• Refined Zone – data for LOB-specific views, transformed from existing certified data; consumers are anyone with appropriate role-based access
• Sandbox – ad-hoc exploratory analysis; consumers are data scientists and data consumers with the proper privileges
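The zone model can be illustrated with a toy promotion flow: data lands, is copied unmodified into the raw zone, and only reaches the trusted zone after passing a standardization gate. The `standardize` policy and record fields below are invented for the example, not part of any vendor's API:

```python
def standardize(rec):
    """Trusted-zone gate: apply a corporate standardization policy."""
    rec = dict(rec)
    rec["symbol"] = rec["symbol"].strip().upper()
    return rec

lake = {"landing": [], "raw": [], "trusted": []}

def ingest(rec):
    lake["landing"].append(rec)    # transient landing zone
    lake["raw"].append(dict(rec))  # immutable original-source copy

def promote_to_trusted():
    # Only records that pass the quality gate reach broad consumption.
    lake["trusted"] = [standardize(r) for r in lake["raw"]
                       if r.get("price") is not None]

ingest({"symbol": " tsx ", "price": 10.5})
ingest({"symbol": "ABC", "price": None})
promote_to_trusted()
print([r["symbol"] for r in lake["trusted"]])  # -> ['TSX']
```

The key property the zones buy you is that the raw copy is never mutated: a bad standardization rule can be fixed and re-run without re-acquiring source data.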
Zaloni's integrated self-service data platform (ZDP) accelerates time to insight, providing the foundation for your data initiatives. It is infrastructure- and cloud-platform agnostic, running in the cloud, on premises, or in hybrid deployments.

• Enable: batch ingestion, streaming ingestion, metadata capture, auto discovery
• Govern: data quality, data lineage, data mastering, data privacy/security, data enrichment, data lifecycle management
• Engage: data marketplace, self-service operations

Enable the Data Lake with ZDP

Scale-out ingestion of data
- Batch and streaming ingestion
- DB ingestion with support for full/incremental updates
- Metadata capture and integration with the data pipeline
- An Ingestion Wizard simplifies ingestion and creates reusable workflows

Automated data inventory
- Crawls the data store to catalog datasets
- Automatically detects metadata in several cases

Workflow and orchestration
- Robust workflow management with a scalable and extensible architecture

Monitor the health of the data lake
- A single unified log repository for all data lake operations
- Dashboards for the operational health of the data lake

Govern the Data Lake with ZDP

Metadata management
- Business, technical, and operational metadata
- Captures lineage from ingestion to consumption
- Integration with enterprise metadata repositories

Security and access control
- Masking and tokenization of sensitive data
- RBAC for data lake artifacts (entities, DQ rules, ingestion) with push-down to the underlying security framework
- Supports kerberized environments

Data quality
- A rules engine for DQ
- DQ reporting and analysis
- Automation of the DQ remediation process

Data enrichment
- Drag-and-drop enrichment of the data
- Support for data operations (JOIN, FILTER, UNION, etc.)
- Out-of-the-box format conversion and parsers, with the ability to extend with custom implementations

Data lifecycle management
- Policy-based DLM
- Leverages HDFS storage tiers and supports an S3 interface

Engage the Business with ZDP

Data marketplace
- Rich
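A rules engine for DQ, as described above, boils down to named predicates evaluated against each record, with the failures feeding reporting and remediation. ZDP's actual engine is proprietary; this is only an illustrative sketch with made-up rule names and record fields:

```python
import re

# Hypothetical rule set: each rule maps a name to a predicate on a record.
RULES = {
    "symbol_format":  lambda r: bool(re.fullmatch(r"[A-Z]{1,5}", r.get("symbol", ""))),
    "price_positive": lambda r: isinstance(r.get("price"), (int, float)) and r["price"] > 0,
}

def evaluate(record):
    """Return the names of every rule the record fails."""
    return [name for name, check in RULES.items() if not check(record)]

print(evaluate({"symbol": "TSX", "price": 10.5}))  # -> []
print(evaluate({"symbol": "tsx!", "price": -1}))   # -> ['symbol_format', 'price_positive']
```

Keeping rules as data rather than code is what makes DQ reporting and automated remediation tractable: the engine can count failures per rule, route records to quarantine, and let stewards add rules without redeploying the pipeline.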