Annex 1 Technology Architecture

The Technology Architecture is the combined set of software, hardware and networks used to develop and support IT services. It is a high-level map or plan of the information assets in an organization, including the physical design of the building that holds the hardware.

This annex gives an overview of software packages, either available on the market or developed on request in NSIs, that could meet NSI needs, implement the S-DWH concept and provide the necessary functionality for each S-DWH level.

1 Source layer

The Source layer is the level in which we locate all the activities related to storing and managing internal or external data sources. Internal data come from direct data capture carried out by CAWI, CAPI or CATI, while external data come from administrative archives, for example those of Customs Agencies, Revenue Agencies, Chambers of Commerce and Social Security Institutes.

Generally, data from direct surveys are well structured, so they can flow directly into the integration layer. This is because NSIs have full control of their own applications. By contrast, data from other institutions’ archives must come into the S-DWH together with their metadata in order to be read correctly.

In the early days, extracting data from source systems, transforming it and loading it into the target data warehouse was done by writing complex code. With the advent of efficient tools, this became an inefficient way to process large volumes of complex data in a timely manner. Nowadays ETL (Extract, Transform and Load) is an essential component used to load data into data warehouses from external sources. ETL processes are also widely used in data integration and data migration.

The objective of an ETL process is to facilitate data movement and transformation. ETL is the technology that performs three distinct functions of data movement:
o The extraction of data from one or more sources.
o The transformation of the data, e.g. cleansing, reformatting, standardisation, aggregation.
o The loading of the resulting data set into specified target systems or file formats.

ETL processes are reusable components that can be scheduled to perform data movement jobs on a regular basis. ETL supports massively parallel processing for large data volumes.

ETL tools were created to improve and facilitate data warehousing. Depending on the needs of customers, there are several types of tools. Some perform and supervise only selected stages of the ETL process, like data migration tools (EtL Tools, “small t” tools) and data transformation tools (eTl Tools, “capital T” tools). Others are complete (ETL Tools) and have many functions intended for processing large amounts of data or more complicated ETL projects. Some of them (like server engine tools) execute many ETL steps at the same time from more than one developer, while others, like client engine tools, are simpler and execute ETL routines on the same machine on which they are developed.

There are two more types. The first, code-based tools, is a family of programming tools which allow you to work with many operating systems and programming languages. The second, GUI-based tools, removes the coding layer and allows you to work without any knowledge (in theory) of coding languages.

The first task is data extraction from internal or external sources. After queries are sent to the source system, data may go indirectly to the database. However, there is usually a need to monitor or gather more information and then go to the staging area. Some tools automatically extract only new or changed information, so we do not have to update it ourselves.
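The extraction of only new or changed information into a staging area can be illustrated with a small sketch. It is not taken from any of the tools discussed in this annex; it assumes a hypothetical source table `survey_responses` carrying an `updated_at` timestamp in ISO 8601 text form, a hypothetical staging table `stg_survey_responses`, and a "watermark" (the latest timestamp already extracted) that the caller keeps between runs. Python's built-in `sqlite3` module stands in for both the source system and the staging area.

```python
import sqlite3

COLUMNS = "(id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)"


def make_source():
    """Build a hypothetical in-memory source system with three records."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE survey_responses " + COLUMNS)
    con.executemany(
        "INSERT INTO survey_responses VALUES (?, ?, ?)",
        [
            (1, "alpha", "2024-01-01T10:00:00"),
            (2, "beta", "2024-01-02T10:00:00"),
            (3, "gamma", "2024-01-03T10:00:00"),
        ],
    )
    con.commit()
    return con


def make_staging():
    """Build an empty hypothetical staging area."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE stg_survey_responses " + COLUMNS)
    con.commit()
    return con


def extract_changed_rows(source, staging, watermark):
    """Copy rows changed after `watermark` into staging.

    Returns the new watermark and the number of rows extracted.
    """
    rows = source.execute(
        "SELECT id, name, updated_at FROM survey_responses "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    # INSERT OR REPLACE overwrites a staged row that has the same primary key.
    staging.executemany(
        "INSERT OR REPLACE INTO stg_survey_responses VALUES (?, ?, ?)", rows
    )
    staging.commit()
    new_watermark = rows[-1][2] if rows else watermark
    return new_watermark, len(rows)


source = make_source()
staging = make_staging()

# First run: an empty watermark means every row counts as new.
watermark, first_count = extract_changed_rows(source, staging, "")

# One record changes in the source system between runs.
source.execute(
    "UPDATE survey_responses "
    "SET name = 'beta-revised', updated_at = '2024-01-04T09:00:00' "
    "WHERE id = 2"
)
source.commit()

# Second run: only the changed record is extracted.
watermark, second_count = extract_changed_rows(source, staging, watermark)
print(first_count, second_count, watermark)
```

The timestamp comparison relies on all values sharing the same ISO 8601 text layout, so that text ordering matches time ordering. A source without a reliable change timestamp would need a different change-detection approach, such as comparing row checksums.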
The second task is transformation, which is a broad category:
o transforming data into the structure required to continue the operation (extracted data usually has a structure typical of the source);
o sorting data;
o connecting or separating;
o cleansing;
o checking quality.

The third task is loading into a data warehouse.

ETL Tools have many other capabilities besides the main three (extraction, transformation and loading), for instance sorting, filtering, data profiling, quality control, cleansing, monitoring, synchronization and consolidation.

The most popular commercial ETL Tools are:

IBM Infosphere DataStage

IBM Infosphere DataStage integrates data on demand with a high-performance parallel framework, extended metadata management, and enterprise connectivity. It supports the collection, integration and transformation of large volumes of data, with data structures ranging from simple to highly complex. It also provides support for big data and Hadoop, enabling customers to directly access big data on a distributed file system and thereby address the most challenging data volumes in their systems. In addition, it offers a scalable platform that enables customers to solve large-scale business problems through high-performance processing of massive data volumes, and it supports real-time data integration and complete connectivity between any data source and any application.

Informatica PowerCenter

Informatica PowerCenter is a widely used extraction, transformation and loading (ETL) tool for building enterprise data warehouses. PowerCenter enables its customers to implement a single approach to accessing, transforming, and delivering data without having to resort to hand coding. The software scales to support large data volumes and meets customers’ demands for security and performance.
PowerCenter serves as the data integration foundation for all enterprise integration initiatives, including data warehousing, data governance, data migration, service-oriented architecture (SOA), B2B data exchange, and master data management (MDM). Informatica PowerCenter also enables teams of developers, analysts, and administrators to work faster and better together, sharing and reusing work, to accelerate project delivery.

Oracle Warehouse Builder (OWB)

Oracle Warehouse Builder (OWB) is a tool for designing a custom Business Intelligence application. It provides dimensional ETL process design, extraction from heterogeneous source systems, and metadata reporting functions. Oracle Warehouse Builder allows the creation of both dimensional and relational models, as well as star schema data warehouse architectures. Besides being an ETL (Extract, Transform, Load) tool, Oracle Warehouse Builder also enables users to design and build ETL processes, target data warehouses, intermediate data storages and user access layers. It allows metadata to be read in a wizard-driven form from a data dictionary or Oracle Designer, and it also supports over 40 metadata files from other vendors.

SAS Data Integration Studio

SAS Data Integration Studio is a powerful visual design tool for building, implementing and managing data integration processes regardless of data sources, applications, or platforms. An easy-to-manage, multiple-user environment enables collaboration on large enterprise projects with repeatable processes that are easily shared. The creation and management of data and metadata are improved with extensive impact analysis of potential changes made across all data integration processes. SAS Data Integration Studio enables users to quickly build and edit data integration processes, to automatically capture and manage standardized metadata from any source, and to easily display, visualize, and understand enterprise metadata and their data integration processes.
SAS Data Integration Studio is part of the SAS software offering, SAS Enterprise Data Integration Server.

SAP Business Objects Data Services (SAP BODS)

SAP Business Objects Data Services (SAP BODS) provides ETL as one of its fundamental capabilities. It is used for extracting, transforming, and loading (ETL) data from heterogeneous sources into a target database or data warehouse. Customers can create applications (jobs) that specify data mappings and transformations by using the Designer. It also lets users work with any type of data, structured or unstructured, from databases or flat files, in order to process and cleanse it and remove duplicate entries. Data Services RealTime interfaces provide additional support for real-time data movement and access. Data Services RealTime reacts immediately to messages as they are sent, performing predefined operations with message content. Data Services RealTime components provide services to web applications and other client applications. The Data Services product consists of several components, including Designer, Job server, Engine and Repository.

Microsoft SQL Server Integration Services (SSIS)

Microsoft SQL Server Integration Services (SSIS) is a platform for building enterprise-level data integration and data transformation solutions. Integration Services is used to solve complex business problems by copying or downloading files, sending e-mail messages in response to events, updating data warehouses, cleaning and mining data, and managing SQL Server objects and data. Integration Services packages can work alone or together with other packages to address complex business needs. Integration Services can extract and transform data from a wide variety of sources such as XML data files, flat files, and relational data sources, and then load the data into one or more destinations.
Integration Services includes a rich set of built-in tasks and transformations, tools for constructing packages, and the Integration Services service for running and managing packages. You can use the graphical Integration