Study Material for B.Sc.Cs Dataware Housing and Mining Semester - Vi, Academic Year 2020-21
UNIT CONTENT
I DATA WAREHOUSING
II BUSINESS ANALYSIS
III DATA MINING
IV ASSOCIATION RULE MINING AND CLASSIFICATION
V CLUSTER ANALYSIS

UNIT I: DATA WAREHOUSING

Data warehousing Components:

->Overall Architecture
The data warehouse architecture is based on a relational database management system (RDBMS) server that functions as the central repository (a central location in which data is stored and managed) for informational data.
In this architecture, operational data and processing are kept separate from data warehouse processing.
The central information repository is surrounded by a number of key components. These components are designed to make the entire environment (i) functional, (ii) manageable and (iii) accessible, both by the operational systems that source data into the warehouse and by end-user query and analysis tools.
The source data for the warehouse comes from the operational applications.
As data enters the data warehouse, it is transformed into an integrated structure and format.
The transformation process may involve conversion, summarization, filtering, and condensation of data.
Because data within the data warehouse contains a large historical component, the data warehouse must be capable of holding and managing large volumes of data, as well as different data structures for the same database over time.

->Data Warehouse Database
The central data warehouse database is the foundation of the data warehousing environment.
This database is marked as (2) in the figure.
It is implemented on relational database management system (RDBMS) technology.
However, a warehouse implementation based on traditional RDBMS technology is often constrained by the fact that traditional RDBMS implementations are optimized for transactional database processing.
Certain data warehouse attributes, such as i) very large database size, ii) ad hoc query processing, and iii) the need for flexible user view creation including aggregates, multitable joins, and drill-downs, have become drivers for different technological approaches to the data warehouse database.
These approaches include:
1) Parallel relational database designs that run on i) symmetric multiprocessors (SMPs) and ii) massively parallel processors (MPPs)
2) Speeding up a traditional RDBMS by using new index structures to bypass relational table scans
3) Multidimensional databases (MDDBs) that are based on proprietary database technology or implemented using an RDBMS. They are designed to overcome the limitations that the relational data model places on the warehouse.
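To make the idea of a multidimensional store concrete, the following is a minimal sketch in Python. It is not how any particular MDDB product works. It assumes a hypothetical set of sales facts with three dimensions (region, product, quarter), and the names `build_cube`, `roll_up` and `drill_down` are illustrative only. Cells are keyed by their dimension values, so aggregates and drill-downs become operations over keys rather than multitable joins.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product, quarter, sales_amount)
FACTS = [
    ("North", "Laptop", "Q1", 120),
    ("North", "Laptop", "Q2", 150),
    ("North", "Phone", "Q1", 80),
    ("South", "Laptop", "Q1", 200),
    ("South", "Phone", "Q2", 60),
]

# Order of the dimensions inside each cell key (an assumption of this sketch)
DIMENSIONS = ("region", "product", "quarter")


def build_cube(facts):
    """Store one summed cell per (region, product, quarter) combination."""
    cube = defaultdict(int)
    for region, product, quarter, amount in facts:
        cube[(region, product, quarter)] += amount
    return dict(cube)


def roll_up(cube, keep):
    """Aggregate the cube, keeping only the dimensions named in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for key, amount in cube.items():
        totals[tuple(key[p] for p in positions)] += amount
    return dict(totals)


def drill_down(cube, **fixed):
    """Return the detailed cells that match the fixed dimension values."""
    return {
        key: amount
        for key, amount in cube.items()
        if all(key[DIMENSIONS.index(name)] == value for name, value in fixed.items())
    }


cube = build_cube(FACTS)
by_region = roll_up(cube, keep=("region",))
north_detail = drill_down(cube, region="North")
print(by_region)
print(north_detail)
```

Rolling up by region here gives one total per region, while drilling down on North returns its three detailed cells.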
An MDDB is paired with on-line analytical processing (OLAP) tools.

->Sourcing, Acquisition, Cleanup, and Transformation Tools
These tools, marked as (1) in the figure, extract data from operational systems and put it in a suitable format.
They perform all the tasks required to transform disparate data into information that can be used by the decision support tools.
They produce the programs and control statements needed to move data into the data warehouse from multiple operational systems.
They maintain the metadata.
Functionalities-
1) Removing unwanted data from operational databases
2) Converting to common data names and definitions
3) Calculating summaries and derived data
4) Establishing defaults for missing data
5) Accommodating source data definition changes
Issues-
1) Database heterogeneity- DBMSs differ greatly from one another
2) Data heterogeneity- data is defined and used differently in different models

->Metadata
Metadata is data about data that describes the data warehouse.
It is used for building, maintaining, managing, and using the data warehouse.
Technical metadata- information for warehouse designers and administrators
1) Information about data sources
2) Transformation descriptions
3) Warehouse object and data structure definitions
4) Rules used to perform data cleanup and data enhancement
5) Data mapping operations
6) Access authorization, backup history, archive history, information delivery history, data acquisition history, data access, etc.
Business metadata- information that helps users understand the warehouse easily
1) Subject areas and information object types
2) Internet home pages
3) Other information to support all data warehousing components
4) Data warehouse operational information
Metadata management is provided via a metadata repository and accompanying software.
Metadata repository management software can be used to map the source data to the target database, generate code for data transformations, integrate and transform
the data, and control moving data to the warehouse.
One important functional component of the metadata repository is the information directory.
From a technical requirements point of view, the information directory should
1) Be a gateway to the data warehouse
2) Support easy distribution and replication of its content
3) Be searchable by business-oriented keywords
4) Be a platform for end-user data access and analysis tools
5) Support the sharing of information objects
6) Support scheduling options for requests
7) Support the distribution of query results
8) Support and provide interfaces to other applications
9) Support end-user monitoring of the status of the data warehouse

->Access Tools
The principal purpose of data warehousing is to provide information to business users for strategic decision making.
The users interact with the data warehouse using front-end tools.
The end-user tools area spans a number of components.
The main groups of tools are-
1) Data query and reporting tools
These can be further divided into (i) reporting tools and (ii) managed query tools.
(i) Reporting tools can be divided into (a) production reporting tools and (b) desktop report writers.
(a) Production reporting tools generate regular operational reports or support high-volume batch jobs, such as calculating and printing pay checks.
(b) Report writers, on the other hand, are inexpensive desktop tools designed for end users.
(ii) Managed query tools insert a metalayer between the user and the database.
The metalayer is the software that provides subject-oriented views of a database and supports point-and-click creation of SQL.
2) Application development tools
The application development platforms integrate well with popular OLAP tools.
They can access all major database systems, including Oracle, Sybase, and Informix.
Examples are PowerBuilder from PowerSoft, Visual Basic from Microsoft,
Forté from Forté Software
3) Executive information system (EIS) tools
4) On-line analytical processing (OLAP) tools
OLAP is based on the concept of multidimensional databases.
It allows users to analyze the data using elaborate, multidimensional, complex views.
It can also be supported by a relational database designed to provide multidimensional capabilities (MRDB).
5) Data mining tools
Strategic use of data can result from opportunities presented by discovering hidden, previously undetected, and frequently extremely valuable facts about consumers, retailers and suppliers, business trends, and direction and significant factors.
An organization can formulate effective business, marketing, and sales strategies; precisely target promotional activity; discover and penetrate new markets; and successfully compete in the market place from a position of informed strength.
A new and promising technology aimed at achieving this strategic advantage is known as data mining.
Data mining has huge potential to deliver significant benefits in the market place.
Most organizations engage in data mining to-
i. Discover knowledge. The goal of knowledge discovery is to determine explicit hidden relationships, patterns, or correlations from data stored in an enterprise’s database. Specifically, data mining can be used to perform segmentation, classification, association, and preferencing.
ii. Visualize data. Prior to analysis, the goal is to humanize the mass of data to be dealt with and find a clever way to display it.
iii. Correct data.
While consolidating massive databases, many enterprises find that the data is not complete and invariably contains erroneous and contradictory information.
6) Data visualization
It is the method of presenting the output of all the previously mentioned tools in such a way that the entire problem and/or the solution is clearly visible to domain experts and casual observers.

->Data Marts
A rigorous definition of a data mart is a data store that is subsidiary to a data warehouse of integrated data.
The data mart is directed at a partition of data (often called a subject area) that is created for the use of a dedicated group of users.
A data mart might, in fact, be a set of denormalized, summarized or aggregated data.
The data warehouse architecture may incorporate data mining tools that extract sets of data for a particular type of analysis.
Data marts whose data content is sourced from the data warehouse are called dependent data marts.
Independent data marts represent fragmented point solutions to a range of business problems.
This type of implementation should rarely be deployed.
A data mart is not necessarily bad.
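A dependent data mart can be pictured as a filtered, summarized extract of warehouse data for one subject area. The following is a minimal Python sketch under that reading. The row layout, the `build_dependent_mart` helper, and the sample figures are hypothetical and are not taken from the course material.

```python
from collections import defaultdict

# Hypothetical integrated warehouse rows: (subject_area, month, store, amount)
WAREHOUSE_ROWS = [
    ("sales", "2021-01", "Store A", 1000),
    ("sales", "2021-01", "Store B", 1500),
    ("sales", "2021-02", "Store A", 1200),
    ("inventory", "2021-01", "Store A", 300),
]


def build_dependent_mart(rows, subject_area):
    """Source one subject area from the warehouse and summarize it by month."""
    mart = defaultdict(int)
    for area, month, _store, amount in rows:
        if area == subject_area:   # keep only this group's partition of data
            mart[month] += amount  # summarize: store-level detail is dropped
    return dict(mart)


sales_mart = build_dependent_mart(WAREHOUSE_ROWS, "sales")
print(sales_mart)
```

Because the mart is derived from the warehouse rows rather than loaded separately from operational systems, it stays consistent with the integrated data, which is the property that distinguishes a dependent data mart from an independent one.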