STUDY MATERIAL FOR B.SC. CS DATA WAREHOUSING AND MINING, SEMESTER - VI, ACADEMIC YEAR 2020-21

UNIT	CONTENT

I	DATA WAREHOUSING

II	BUSINESS ANALYSIS

III	DATA MINING

IV	ASSOCIATION RULE MINING AND CLASSIFICATION

V	CLUSTER ANALYSIS


UNIT I: DATA WAREHOUSING

Data Warehousing Components:
->Overall Architecture
 The data warehouse architecture is based on a relational database management system server that functions as the central repository (a central location in which data is stored and managed) for informational data.
 In the data warehouse architecture, operational data and its processing are completely separate from data warehouse processing.
 The central information repository is surrounded by a number of key components.
 These key components are designed to make the entire environment (i) functional, (ii) manageable, and (iii) accessible both by the operational systems that source data into the warehouse and by end-user query and analysis tools.


 The source data for the warehouse comes from the operational applications.
 As data enters the data warehouse, it is transformed into an integrated structure and format.
 The transformation process may involve conversion, summarization, filtering, and condensation of data.
 Because data within the data warehouse contains a large historical component, the data warehouse must be capable of holding and managing large volumes of data and different data structures for the same database over time.

->Data Warehouse Database
 The central data warehouse database is the foundation for the data warehousing environment.
 It is marked as (2) in the figure.
 This database is implemented on relational database management system (RDBMS) technology.
 However, a warehouse implementation based on traditional RDBMS technology is often limited by the fact that traditional RDBMS implementations are optimized for transactional database processing.
 Certain data warehouse attributes, such as (i) very large database size, (ii) ad hoc query processing, and (iii) the need for flexible user view creation including aggregates, multi-table joins, and drill-downs, have become drivers for different technological approaches to the data warehouse database.
 These approaches include:
   1) Parallel relational database designs such as (i) symmetric multiprocessors (SMPs) and (ii) massively parallel processors (MPPs).
   2) Speeding up a traditional RDBMS by using new index structures to bypass relational table scans.
   3) Multidimensional databases (MDDBs) that are based on proprietary database technology or implemented using RDBMSs. They are designed to overcome these limitations and are paired with on-line analytical processing (OLAP) tools.

->Sourcing, Acquisition, Cleanup, and Transformation Tools
 These tools extract data from operational systems and put it into a suitable format.
 They are marked as (1) in the figure.
 They perform all tasks required to transform disparate data into information that can be used by the decision support tool.
 They produce the programs and control statements needed to move data into the data warehouse from multiple operational systems.
 They maintain the metadata.


 Functionalities:
   1) Removing unwanted data from operational databases
   2) Converting to common data names and definitions
   3) Calculating summaries and derived data
   4) Establishing defaults for missing data
   5) Accommodating source data definition changes

 Issues:
   1) Database heterogeneity - the DBMSs involved are very different.
   2) Data heterogeneity - data differ in definition and are used in different models.

->Metadata
 Metadata is data about data that describes the data warehouse.
 It is used for building, maintaining, managing, and using the data warehouse.
 Technical metadata - information for warehouse designers and administrators:
   1) Information about data sources
   2) Transformation descriptions
   3) Warehouse objects and data structures
   4) Rules used to perform data cleanup and data enhancement
   5) Data mapping operations
   6) Access authorization, history, archive history, information delivery history, data acquisition history, data access, etc.
 Business metadata - information that helps users understand the warehouse easily:
   1) Subject areas and information object types
   2) Internet home pages
   3) Other information to support all data warehousing components
   4) Data warehouse operational information
 Metadata management is provided via a metadata repository and accompanying software.
 Metadata repository management software can be used to map the source data to the target database, generate code for data transformations, integrate and transform the data, and control moving data to the warehouse.
 One important functional component of the metadata repository is the information directory.
 From a technical requirements point of view, the information directory should:
   1) Be a gateway to the data warehouse
   2) Support easy distribution and replication of its content


   3) Be searchable by business-oriented key words
   4) Be a platform for end-user data access and analysis tools
   5) Support sharing of information objects
   6) Support scheduling options for requests
   7) Support distribution of query results
   8) Support and provide interfaces to other applications
   9) Support end-user monitoring of the status of the data warehouse

->Access Tools
 The principal purpose of data warehousing is to provide information to business users for strategic decision making.
 The users interact with the data warehouse using front-end tools.
 The end-user tools area spans a number of components.
 The five main groups of tools are:
   1) Data query and reporting tools
       Can be further divided into (i) reporting tools and (ii) managed query tools.
       (i) Reporting tools can be divided into (a) production reporting tools and (b) desktop report writers.
       (a) Production reporting tools generate regular operational reports or support high-volume batch jobs such as calculating and printing paychecks.
       (b) Report writers, on the other hand, are inexpensive desktop tools designed for end users.
       (ii) Managed query tools insert a metalayer between the user and the database.
       The metalayer is the software that provides subject-oriented views of a database and supports point-and-click creation of SQL.
   2) Application development tools
       The application development platforms integrate well with popular OLAP tools and can access all major database systems, including Oracle, Sybase, and Informix.
       Examples are PowerBuilder from Powersoft, Visual Basic from Microsoft, and Forté from Forté Software.
   3) Executive information system (EIS) tools
   4) On-line analytical processing (OLAP) tools
       OLAP is based on the concept of multidimensional databases.
       It allows users to analyze the data using elaborate, multidimensional, complex views.
       It is also supported by relational databases designed to enable multidimensional properties (multirelational databases, MRDBs).
   5) Data mining tools


 Strategic use of data can result from opportunities presented by discovering hidden, previously undetected, and frequently extremely valuable facts about consumers, retailers and suppliers, business trends and directions, and significant factors.
 An organization can formulate effective business, marketing, and sales strategies; precisely target promotional activity; discover and penetrate new markets; and successfully compete in the marketplace from a position of informed strength.
 A new and promising technology aimed at achieving this strategic advantage is known as data mining.
 Data mining has a huge potential to gain significant benefits in the marketplace.
 Most organizations engage in data mining to:
   i. Discover knowledge. The goal of knowledge discovery is to determine explicit hidden relationships, patterns, or correlations from data stored in an enterprise's database. Specifically, data mining can be used to perform segmentation, classification, association, and preferencing.
   ii. Visualize data. Prior to analysis, the goal is to humanize the mass of data to be dealt with and find a clever way to display the data.
   iii. Correct data. While consolidating massive databases, many enterprises find that the data is not complete and invariably contains erroneous and contradictory information.
 6) Data visualization
     It is the method of presenting the output of all previously mentioned tools in such a way that the entire problem and/or the solution is clearly visible to domain experts and even casual observers.

->Data Marts
 A rigorous definition of this term is a data store that is subsidiary to a data warehouse of integrated data.
 The data mart is directed at a partition of data (often called a subject area) that is created for the use of a dedicated group of users.
 A data mart might, in fact, be a set of denormalized, summarized, or aggregated data.
 The data warehouse architecture may incorporate data mining tools that extract sets of data for a particular type of analysis.
 Data marts whose data content is sourced from the data warehouse are called dependent data marts.
 Independent data marts represent fragmented point solutions to a range of business problems.
 This type of implementation should rarely be deployed.
 A data mart is not necessarily bad, however, and is often necessary in cases such as:
   1) Extremely urgent user requirements


   2) No budget for a full data warehouse strategy
   3) No sponsor for an enterprise-wide decision support strategy
   4) Decentralization of business units
   5) Attraction of easy-to-use tools and mind-sized projects
 Data integration issues are associated with data marts: for any two data marts in an enterprise, the common dimensions must conform to the equality and roll-up rule.
 The rule states that these dimensions are either the same or that one is a strict roll-up of the other.
 Data marts present two problems: (i) scalability, in situations where an initial small data mart grows quickly in multiple dimensions, and (ii) data integration.
 The key to a successful data mart strategy is the development of an overall data warehouse architecture.
 The key step in that architecture is identifying and implementing the common dimensions.

->Data Warehouse Administration and Management
 Security and priority management
 Monitoring updates from multiple sources
 Data quality checks
 Managing and updating metadata

 Auditing and reporting data warehouse usage and status
 Purging data
 Replicating, subsetting, and distributing data
 Backup and recovery
 Data warehouse storage management

->Information Delivery System
 The information delivery component is used to enable the process of subscribing for data warehouse information and having it delivered to one or more destinations.
 Delivery of information may be based on time of day or on the completion of an external event.
 Once the data warehouse is installed and operational, its users don't have to be aware of its location and maintenance.
 A Web-enabled information delivery system allows users dispersed across continents to perform sophisticated business-critical analysis and to engage in collective decision making based on timely and valid information.

Building a Data Warehouse: Business Considerations: Return on Investment


 The data warehouse design varies with the business requirements, business priorities, and even the magnitude of the problem.
 A warehouse for a specific subject area is expressly designed to solve business problems related to that area (for example, personnel).
 Such individual warehouses are known as data marts.
 Two approaches are:
   1) The top-down approach, meaning an organization has decided to build an enterprise data warehouse with subset data marts.
   2) The bottom-up approach, implying that the business priorities resulted in developing individual data marts, which are then integrated into the enterprise warehouse.
 The bottom-up approach is more realistic.
 Organizational issues: an organization will need to employ different development practices than the ones it uses for operational applications.

-> Design Considerations
 A data warehouse designer must adopt a holistic approach: consider all data warehouse components as parts of a single complex system and take into account all possible data sources and all known usage requirements.
 A data warehouse's design point is to consolidate data from multiple, often heterogeneous, sources into a query database.

 The main factors include:
   i) Heterogeneity of data sources
   ii) Use of historical data
   iii) Tendency of databases to grow very large
 The data warehouse design is different from traditional OLTP design.
 A data warehouse is business-driven, requires continuous interaction with end users, and is never finished, since both requirements and data sources change.
 Data content:
   . The content and structure of the data warehouse are reflected in its data model.
   . The data model is the template that describes how information will be organized within the integrated warehouse framework.


   . A designer should remember that decision support queries, because of their broad scope and analytical intensity, require data models to be optimized to improve query performance.
   . The key point is that in a dependent data mart environment, the data mart data is cleaned up, is transformed, and is consistent with the data warehouse and other data marts sourced from the same warehouse.
 Metadata:
   . Metadata defines the content and location of data in the warehouse, the relationship between the operational databases and the data warehouse, and the business views of the warehouse data that are accessible by end-user tools.
   . Metadata is searched by users to find data definitions or subject areas.
   . All access paths to the data warehouse have metadata as an entry point.
 Data distribution:
   . As data volumes continue to grow, the database size may rapidly outgrow a single server.
   . It becomes necessary to decide how data should be divided across multiple servers and which users should get access to which types of data.
   . The data placement and distribution design should consider several options, including data distribution by subject area, location, or time.
 Tools:
   . Tools provide facilities for defining the transformation and cleanup rules, data movement, end-user query, reporting, and data analysis.
   . Each tool takes a slightly different approach to data warehousing and often maintains its own version of the metadata, placed in a tool-specific, proprietary metadata repository.
   . The tools should be able to source the metadata from the warehouse data dictionary or from a CASE tool used to design the warehouse database.

 Performance considerations:
   . Actual performance levels are business-dependent and vary widely from one environment to another.
   . It is relatively difficult to predict the performance of a typical data warehouse; one reason is the unpredictable usage patterns against the data.
   . One design technique is to populate the warehouse with a number of denormalized views containing summarized, derived, and aggregated data.


   . If done correctly, end-user queries may execute directly against these views, thus maintaining overall performance levels.
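A minimal sketch of this technique, using Python's built-in sqlite3 module and hypothetical table and column names: a denormalized summary view is precomputed at load time so that end-user queries read the aggregate instead of scanning the detail rows.

import sqlite3

# Hypothetical detail (fact) data loaded into an in-memory warehouse.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (product TEXT, region TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("PC", "North", 1200.0), ("PC", "South", 900.0), ("Camera", "North", 300.0),
])

# At load time, populate a summarized, derived view once.
con.execute("""
    CREATE TABLE sales_summary AS
    SELECT product, region, SUM(amount) AS total_amount, COUNT(*) AS n_rows
    FROM sales
    GROUP BY product, region
""")

# End-user queries then execute directly against the summary view.
for row in con.execute("SELECT * FROM sales_summary ORDER BY product"):
    print(row)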

-> Benefits of Data Warehousing
 Locating the right information
 Presentation of information
 Testing of hypotheses
 Discovery of information
 Sharing the analysis

 Tangible benefits:
   i) Product inventory turnover is improved.
   ii) Costs of product introduction are decreased.
   iii) More cost-effective decision making is enabled.
   iv) Better business intelligence is enabled by increased quality and flexibility of market analysis available through multilevel data structures, which may range from detailed to highly summarized.
   v) Enhanced asset and liability management.
 Intangible benefits:
   i) Improved productivity, by keeping all required data in a single location and eliminating the rekeying of data.
   ii) Reduced redundant processing, support, and software needed to support overlapping decision support applications.
   iii) Enhanced customer relations through improved knowledge of individual requirements and trends, customization, communication, and products.
   iv) Enabling business process reengineering - the data warehouse can provide insights into work processes, resulting in ideas for reengineering those processes.


UNIT II: BUSINESS ANALYSIS

->Tool categories
There are five categories of decision support tools:
 Reporting
 Managed query
 Executive information systems
 On-line analytical processing
 Data mining

- Reporting tools
 Reporting tools can be divided into production reporting tools and desktop report writers.
 Production reporting tools let companies generate regular operational reports or support high-volume batch jobs.
 Production reporting tools include third-generation languages, specialized fourth-generation languages, and high-end client/server tools.
 Report writers are inexpensive desktop tools designed for end users.
 Seagate Software's Crystal Reports lets users design and run reports without having to rely on the IS department.
 Report writers include Crystal Reports, Actuate Software Corp.'s Actuate Reporting System, IQ Software Corp.'s IQ Objects, and Platinum Technology, Inc.'s InfoReports.

- Managed query tools
 Managed query tools shield end users from the complexities of SQL and database structures by inserting a metalayer between users and the database.
 Metalayer: software that provides subject-oriented views of a database and supports point-and-click creation of SQL.
 Business Objects, Inc., calls this layer a "universe".

- Executive information system tools
 Executive information system (EIS) tools predate report writers and managed query tools.
 They were first deployed on mainframe systems.
 They are used to build customized, graphical decision support applications or briefing books.
 Popular EIS tools include Pilot Software, Inc.'s Lightship, Platinum Technology's Forest & Trees, Comshare, Inc.'s Commander Decision, Oracle's Express Analyzer, and SAS Institute, Inc.'s SAS/EIS.

- OLAP tools


 OLAP tools provide an intuitive way to view corporate data.
 They provide navigation through the hierarchies and dimensions with a single click.
 They aggregate data along common business subjects or dimensions.
 Users can drill down, across, or up levels in each dimension, or pivot and swap out dimensions to change their view of the data.
 Tools such as Arbor Software Corp.'s Essbase and Oracle's Express pre-aggregate data in special multidimensional databases.

- Data mining tools
 Provide insights into corporate data that are not easily discerned with managed query or OLAP tools.
 Use a variety of statistical and artificial intelligence (AI) algorithms to analyze the correlation of variables in the data and uncover interesting patterns and relationships to investigate.
 Tools such as IBM's Intelligent Miner are expensive and require statisticians to implement and manage.
 These tools include DataMind Corp.'s DataMind, Pilot's Discovery Server, and tools from Business Objects and SAS Institute.

->The Need for Applications
 Some tools and applications can format the retrieved data into easy-to-read reports, while others concentrate on the on-screen presentation.
 As the complexity of questions grows, these tools may rapidly become inefficient.
 Consider the various types of access to the data stored in a data warehouse:
   Simple tabular form reporting
   Ad hoc user-specified queries
   Predefined repeatable queries
   Complex queries with multi-table joins, multilevel subqueries, and sophisticated search criteria
   Ranking
   Multivariable analysis
   Time series analysis
   Data visualization, graphing, charting, and pivoting
   Complex textual search
   Statistical analysis
   Interactive drill-down reporting and analysis
   AI techniques for testing of hypotheses
   Information mapping
 The first four types of access are covered by the combined category of tools called query and reporting tools.
 There are three types of reporting:
   Creation and viewing of standard reports
   Definition and creation of ad hoc reports
   Data exploration

->Need for OLAP
 The key driver of OLAP is the multidimensional nature of the business problem.
 These problems are characterized by retrieving a very large number of records (that can reach gigabytes and terabytes) and summarizing this data into a form of information that can be used by business analysts.
 One limitation of SQL is that it cannot represent these complex problems.
 A query will be translated into several SQL statements. These SQL statements will involve multiple joins, intermediate tables, sorting, aggregations, and a huge amount of temporary memory to store intermediate tables.
 These procedures require a lot of computation and therefore a long execution time.
 The second limitation of SQL is its inability to use mathematical models in these SQL statements. Even if an analyst could create these complex statements using SQL, a large number of computations and a huge amount of memory would still be needed.
 Therefore the use of OLAP is preferable to solve this kind of problem.

->Multidimensional Data Model
 The multidimensional data model is an integral part of On-Line Analytical Processing, or OLAP.
 A natural way to view multidimensional data is as a cube. The cube at the left contains detailed sales data by product, market, and time. The cube on the right associates the sales number (units sold) with the dimensions product type, market, and time, with the unit variables organized as cells in an array.
 This cube can be expanded to include another array, price, which can be associated with all or only some dimensions. As the number of dimensions increases, the number of cube cells increases exponentially.
 Dimensions are hierarchical in nature; for example, the time dimension may contain hierarchies for years, quarters, months, weeks, and days, and a geography dimension may contain country, state, city, etc.


Figure 2.1 multidimensional data cube

Figure 2.2 expanded data cube
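The cube idea can be made concrete with a small sketch (illustrative only, with hypothetical values): each cell is keyed by (product, market, time) and holds units sold, and a roll-up along the time hierarchy aggregates quarters into years.

from collections import defaultdict

# Hypothetical cube cells: (product, market, time) -> units sold.
cells = {
    ("PC",     "Asia",   "2020-Q1"): 150,
    ("PC",     "Asia",   "2020-Q2"): 180,
    ("PC",     "Europe", "2020-Q1"): 120,
    ("Camera", "Asia",   "2020-Q1"): 90,
}

# Roll-up along the time hierarchy (quarter -> year): cells that share
# product and market are summed after truncating the time key.
rollup = defaultdict(int)
for (product, market, quarter), units in cells.items():
    year = quarter.split("-")[0]
    rollup[(product, market, year)] += units

print(dict(rollup))   # e.g. ('PC', 'Asia', '2020') -> 330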

->OLAP Guidelines
 Dr. E. F. Codd, the father of the relational model, created a list of rules to deal with OLAP systems. Users should prioritize these rules according to their needs to match their business requirements. These rules are:
   1) Multidimensional conceptual view: The OLAP tool should provide an appropriate multidimensional business model that suits the business problems and requirements.


   2) Transparency: The OLAP tool should provide transparency to the input data for the users.
   3) Accessibility: The OLAP tool should access only the data required for the analysis needed.
   4) Consistent reporting performance: The size of the database should not affect performance in any way.
   5) Client/server architecture: The OLAP tool should use a client/server architecture to ensure better performance and flexibility.
   6) Generic dimensionality: Data entered should be equivalent to the structure and operation requirements.
   7) Dynamic sparse matrix handling: The OLAP tool should be able to manage the sparse matrix and so maintain the level of performance.
   8) Multi-user support: The OLAP tool should allow several users working concurrently to work together.
   9) Unrestricted cross-dimensional operations: The OLAP tool should be able to perform operations across the dimensions of the cube.
   10) Intuitive data manipulation: Consolidation path re-orientation, drilling down, roll-up, and other manipulations should be accomplished via direct action upon the cells of the cube.
   11) Flexible reporting: The ability of the tool to present rows and columns in a manner suitable for analysis.
   12) Unlimited dimensions and aggregation levels: This depends on the kind of business; multiple dimensions and defined hierarchies should be supported.
 In addition to these guidelines, an OLAP system should also support:
   Comprehensive database management tools: This gives database management control over distributed businesses.
   The ability to drill down to detail (source record) level: This requires that the OLAP tool allow smooth transitions in the multidimensional database.
   Incremental database refresh: The OLAP tool should provide partial refresh.
   SQL interface: The OLAP system should be able to integrate effectively into the surrounding enterprise environment.

->Multidimensional versus Multirelational OLAP
 Relational implementations of multidimensional database systems are referred to as multirelational database systems.


 To achieve the required speed, these products use star or snowflake schemas - specially optimized and denormalized data models that involve data restructuring and aggregation.
 The snowflake schema is an extension of the star schema in which the dimension tables are further normalized into additional related tables.
 One benefit of the star schema approach is reduced complexity in the data model, which increases data "legibility", making it easier for users to pose business questions of an OLAP nature.
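A minimal star-schema sketch (using Python's sqlite3 module; the table names, columns, and values are hypothetical) shows one fact table joined to denormalized dimension tables to answer an OLAP-style business question.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_time    (time_id INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
    CREATE TABLE fact_sales  (product_id INTEGER, time_id INTEGER, units INTEGER, revenue REAL);
""")
con.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                [(1, "PC", "Hardware"), (2, "Camera", "Electronics")])
con.executemany("INSERT INTO dim_time VALUES (?, ?, ?)",
                [(1, 2020, "Q1"), (2, 2020, "Q2")])
con.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                [(1, 1, 150, 90000.0), (1, 2, 180, 108000.0), (2, 1, 90, 27000.0)])

# Revenue by category and quarter: the fact table joins out to each dimension.
query = """
    SELECT p.category, t.quarter, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    JOIN dim_time    t ON f.time_id    = t.time_id
    GROUP BY p.category, t.quarter
"""
for row in con.execute(query):
    print(row)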

->Categorization of OLAP Tools
 OLAP tools are used to analyze data using elaborate, multidimensional, complex views.
 The two classes of OLAP tools are compared in the figure, which positions MOLAP and ROLAP along the axes of application complexity and application performance.

Figure 2.3 MOLAP versus ROLAP by application complexity and application performance

 MOLAP
   This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats; that is, data is stored in array-based structures.
   Advantages:
    Excellent performance: MOLAP cubes are built for fast data retrieval and are optimal for slicing and dicing operations.
    Can perform complex calculations: All calculations have been pre-generated when the cube is created. Hence, complex calculations are not only doable, but they return quickly.
   Disadvantages:
    Limited in the amount of data it can handle: Because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself. This is not to say that the data in the cube cannot be derived from a large amount of data; indeed, this is possible. But in this case, only summary-level information will be included in the cube itself.
    Requires additional investment: Cube technology is often proprietary and may not already exist in the organization. Therefore, to adopt MOLAP technology, chances are additional investments in human and capital resources are needed.

Figure 2.4 MOLAP architecture
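A rough sketch of the MOLAP idea (illustrative only; the records and dimensions are hypothetical): every combination of dimension values is pre-aggregated when the "cube" is built, so later queries are simple lookups rather than scans.

from itertools import combinations
from collections import defaultdict

dims = ("product", "market", "time")
records = [
    {"product": "PC", "market": "Asia",   "time": "2020-Q1", "units": 150},
    {"product": "PC", "market": "Europe", "time": "2020-Q1", "units": 120},
    {"product": "Camera", "market": "Asia", "time": "2020-Q2", "units": 90},
]

# Cube build: aggregate units over every subset of the dimensions.
cube = defaultdict(int)
for rec in records:
    for r in range(len(dims) + 1):
        for subset in combinations(dims, r):
            key = tuple((d, rec[d]) for d in subset)
            cube[key] += rec["units"]

# Query time: slicing is a dictionary lookup, not a table scan.
print(cube[(("product", "PC"),)])                        # all PC sales: 270
print(cube[(("market", "Asia"), ("time", "2020-Q2"))])   # Asia, 2020-Q2: 90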

 ROLAP
   This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause to the SQL statement. Data is stored in relational tables.
   Advantages:
    Can handle large amounts of data: The data size limitation of ROLAP technology is the limitation on data size of the underlying relational database. In other words, ROLAP itself places no limitation on data amount.
    Can leverage functionalities inherent in the relational database: Often, the relational database already comes with a host of functionalities. ROLAP technologies, since they sit on top of the relational database, can therefore leverage these functionalities.
   Disadvantages:
    Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL queries) in the relational database, the query time can be long if the underlying data size is large.


    Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations using SQL), ROLAP technologies are traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building out-of-the-box complex functions into the tool, as well as the ability to allow users to define their own functions.

Figure 2.5 ROLAP architecture
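The "each slice or dice adds a WHERE clause" point can be sketched as follows (Python's sqlite3 module; the table and values are hypothetical): the data stays in a relational table and every user action is translated into SQL.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (product TEXT, market TEXT, quarter TEXT, units INTEGER)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("PC", "Asia", "2020-Q1", 150), ("PC", "Europe", "2020-Q1", 120),
    ("Camera", "Asia", "2020-Q2", 90),
])

def slice_units(**filters):
    # Each requested dimension value becomes one condition in the WHERE clause.
    where = " AND ".join(f"{col} = ?" for col in filters)
    sql = "SELECT SUM(units) FROM sales"
    if where:
        sql += " WHERE " + where
    return con.execute(sql, tuple(filters.values())).fetchone()[0]

print(slice_units(product="PC"))                       # slice on one dimension
print(slice_units(market="Asia", quarter="2020-Q2"))   # dice on two dimensions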

 HOLAP HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary type information, HOLAP leverages cube technology for faster performance. It stores only the indexes and aggregations in the multidimensional form while the rest of the data is stored in the relational database.

->OLAP Tools and the Internet
 Two of the most pervasive themes in computing have been the Internet and data warehousing; thus the integration of these two giant technologies is a necessity. The advantages of using the Web for access are evident. These advantages are:
 The Internet provides connectivity between countries, acting as a free resource.
 The Web eases the administrative tasks of managing scattered locations.
 The Web allows users to store and manage data and applications on servers that can be managed, maintained, and updated centrally.

 These reasons indicate the importance of the Web in data storage and manipulation. Web-enabled data access has many significant features, such as:
   The first
   The second


   The emerging third
   HTML publishing
   Helper applications
   Plug-ins
   Server-centric components
   Java and ActiveX applications

 The primary key in the decision-making process is the amount of data collected and how well this data is interpreted. Nowadays, managers are not satisfied with getting direct answers to their direct questions; instead, due to market growth and the increase in clients, their questions have become more complicated. A question might be: how much profit comes from selling our products at our different centers per month? A complicated question like this is not simple to answer directly; it needs analysis along three dimensions (product, center, and month) to obtain an answer.


UNIT III: DATA MINING

Introduction
 Data mining refers to extracting or "mining" knowledge from large amounts of data.
 Terms with a similar meaning to data mining include knowledge mining from data, knowledge extraction, and data/pattern analysis; another popularly used term is Knowledge Discovery from Data, or KDD.

Why Data Mining?
 Moving toward the Information Age
   In the information age (or data age), data volumes reach terabytes or petabytes.
   Business data sets include sales transactions, stock trading records, sales promotions, company profiles, and customer feedback.
   Communities and social media have become important data sources.
 Data mining as the evolution of information technology

 1960s: Data collection, database creation, IMS and network DBMS
 1970s: Relational data model, relational DBMS implementation
 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.), application-oriented DBMS (spatial, scientific, engineering, etc.)
 1990s: Data mining, data warehousing, multimedia databases, and Web databases
 2000s: Stream data management and mining, data mining and its applications, Web technology (XML, data integration) and global information systems

What is Data Mining?
 Data mining is an essential step in the process of knowledge discovery.


Figure: Data mining as a process of knowledge discovery.

 Knowledge discovery as a process is depicted in the figure and consists of an iterative sequence of the following steps:
   Data cleaning: to remove noise and inconsistent data
   Data integration: where multiple data sources may be combined
   Data selection: where data relevant to the analysis task are retrieved from the database
   Data transformation: where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations
   Data mining: an essential process where intelligent methods are applied in order to extract data patterns
   Pattern evaluation: to identify the truly interesting patterns representing knowledge based on some interestingness measures
   Knowledge presentation: where visualization and knowledge representation techniques are used to present the mined knowledge to the user
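As a rough illustration, the steps above can be mapped onto small Python functions over hypothetical customer records; each function body stands in for a much richer real-world step.

# Hypothetical raw records, including a duplicate and missing values.
raw = [
    {"id": 1, "age": 34,   "income": 52000, "segment": None},
    {"id": 2, "age": None, "income": 48000, "segment": "retail"},
    {"id": 2, "age": None, "income": 48000, "segment": "retail"},
]

def clean(records):        # data cleaning: drop duplicate / inconsistent rows
    seen, out = set(), []
    for r in records:
        if r["id"] not in seen:
            seen.add(r["id"])
            out.append(r)
    return out

def select(records):       # data selection: keep only task-relevant attributes
    return [{"age": r["age"], "income": r["income"]} for r in records]

def transform(records):    # data transformation: consolidate into mining form
    return [r["income"] for r in records if r["income"] is not None]

def mine(incomes):         # data mining: extract a (trivial) summary pattern
    return {"mean_income": sum(incomes) / len(incomes)}

pattern = mine(transform(select(clean(raw))))
print(pattern)             # pattern evaluation and presentation would follow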

 The architecture of a typical data mining system may have the following major components: (i) a database, (ii) a data warehouse, (iii) the World Wide Web, or (iv) another information repository.
 Data cleaning and data integration techniques may be performed on the data.
 Database or data warehouse server: The database or data warehouse server is responsible for fetching the relevant data, based on the user's data mining request.
 Knowledge base: This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns.
   Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction.
   Other examples of domain knowledge are additional interestingness constraints or thresholds, and metadata (e.g., describing data from multiple heterogeneous sources).

Types of Data (Kinds of Data)
 Database Data:
   A database system, also called a database management system (DBMS), consists of a collection of interrelated data, known as a database, and a set of software programs to manage and access the data.
 A relational database:
   It is a collection of tables, each of which is assigned a unique name.
   Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows).
   Each tuple in a relational table represents an object identified by a unique key and described by a set of attribute values.
   A semantic data model, such as an entity-relationship (ER) data model, is often constructed for relational databases.
   An ER data model represents the database as a set of entities and their relationships.
 Data Warehouses:
   A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site.
   Data warehouses are constructed via a process of data cleaning, data integration, data transformation, data loading, and periodic data refreshing.


   The data are stored to provide information from a historical perspective (such as from the past 5-10 years) and are typically summarized.
   A data warehouse is usually modeled by a multidimensional database structure, where each dimension corresponds to an attribute or a set of attributes in the schema, and each cell stores the value of some aggregate measure, such as count or sales amount.
   The actual physical structure of a data warehouse may be a relational data store or a multidimensional data cube.
   A data cube provides a multidimensional view of data and allows the precomputation and fast accessing of summarized data.
   Data warehouse systems are well suited for on-line analytical processing, or OLAP.
   OLAP operations use background knowledge regarding the domain of the data being studied in order to allow the presentation of data at different levels of abstraction.


 Examples of OLAP operations include drill-down and roll-up, which allow the user to view the data at differing degrees of summarization.

 Transactional Data:
   A transactional database consists of a file where each record represents a transaction.
   A transaction typically includes a unique transaction identity number (trans ID) and a list of the items making up the transaction.
   The transactional database may have additional tables associated with it, which contain other information.

 Advanced Data and Information Systems and Advanced Applications
   New database applications include handling spatial data, engineering design data, hypertext and multimedia data, time-related data, stream data, and the World Wide Web.
   These applications require efficient data structures and scalable methods for handling complex object structures; variable-length records; semi-structured or unstructured data; text, spatiotemporal, and multimedia data; and database schemas with complex structures and dynamic changes.
 Object-Relational Databases:
   Object-relational databases are constructed based on an object-relational data model.
   This model extends the relational model by providing a rich data type for handling complex objects and object orientation; object-relational databases are becoming increasingly popular in industry and applications.
   The object-relational data model inherits the essential concepts of object-oriented databases.
   Each object has associated with it the following:
     1) A set of variables that describe the object.
     2) A set of messages that the object can use to communicate with other objects, or with the rest of the database system.
     3) A set of methods, where each method holds the code to implement a message.
 Temporal Databases, Sequence Databases, and Time-Series Databases:
   A temporal database typically stores relational data that include time-related attributes.
   These attributes may involve several timestamps, each having different semantics.
   A sequence database stores sequences of ordered events, with or without a concrete notion of time.
   Examples include customer shopping sequences, Web click streams, and biological sequences.
   A time-series database stores sequences of values or events obtained over repeated measurements of time.
   Examples include data collected from the stock exchange, inventory control, and the observation of natural phenomena.
 Spatial Databases and Spatiotemporal Databases:
   Spatial databases contain spatial-related information.
   Examples include geographic (map) databases, very large-scale integration (VLSI) or computer-aided design databases, and medical and satellite image databases.
   Spatial data may be represented in raster format, consisting of n-dimensional bit maps or pixel maps.
   "What kind of data mining can be performed on spatial databases?" Data mining may uncover patterns describing the characteristics of houses located near a specified kind of location, such as a park, for instance.
   A spatial database that stores spatial objects that change with time is called a spatiotemporal database, from which interesting information can be mined.
 Text Databases and Multimedia Databases:
   Text databases are databases that contain word descriptions for objects.
   These word descriptions are usually not simple keywords but rather long sentences or paragraphs, such as product specifications, error or bug reports, warning messages, summary reports, notes, or other documents.
   Text databases may be highly unstructured; some are somewhat structured (semi-structured), whereas others are relatively well structured.
   Text databases with highly regular structures typically can be implemented using relational database systems.
   "What can data mining on text databases uncover?" By mining text data, one may uncover general and concise descriptions of the text documents, keyword or content associations, as well as the clustering behavior of text objects.
   Multimedia databases store image, audio, and video data. They are used in applications such as picture content-based retrieval, voice-mail systems, video-on-demand systems, the World Wide Web, and speech-based user interfaces that recognize spoken commands.
   Multimedia databases must support large objects, because data objects such as video can require gigabytes of storage.
   Specialized storage and search techniques are also required.
   Because video and audio data require real-time retrieval at a steady and predetermined rate in order to avoid picture or sound gaps and system buffer overflows, such data are referred to as continuous-media data.
 Heterogeneous Databases and Legacy Databases:
   A heterogeneous database consists of a set of interconnected, autonomous component databases.
   The components communicate in order to exchange information and answer queries.
   Objects in one component database may differ greatly from objects in other component databases, making it difficult to assimilate their semantics into the overall heterogeneous database.
 Data Streams:
   Many applications involve the generation and analysis of a new kind of data, called stream data, where data flow in and out of an observation platform dynamically.
   Such data streams have the following unique features: huge or possibly infinite volume, dynamically changing, flowing in and out in a fixed order, allowing only one or a small number of scans, and demanding fast (often real-time) response time.
   Typical examples of data streams include various kinds of scientific and engineering data, time-series data, and data produced in other dynamic environments, such as power supply, network traffic, stock exchange, telecommunications, Web click streams, video surveillance, and weather or environment monitoring.
   Mining data streams involves the efficient discovery of general patterns and dynamic changes within stream data.
 The World Wide Web:
   The World Wide Web and its associated distributed information services, such as Yahoo!, Google, America Online, and AltaVista, provide rich, worldwide, on-line information services, where data objects are linked together to facilitate interactive access.
   Users seeking information of interest traverse from one object via links to another. Such systems provide ample opportunities and challenges for data mining.
   For example, understanding user access patterns will not only help improve system design, but also lead to better marketing decisions.
   Capturing user access patterns in such distributed information environments is called Web usage mining (or Weblog mining).

 Data Mining Functionalities
   Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks.
   Data mining tasks can be classified into two categories: descriptive and predictive.
   Descriptive mining tasks characterize the general properties of the data in the database.
   Predictive mining tasks perform inference on the current data in order to make predictions.

 Concept/Class Description: Characterization and Discrimination
   Data can be associated with classes or concepts.
   It can be useful to describe individual classes and concepts in summarized, concise, and yet precise terms.
   Such descriptions of a class or a concept are called class/concept descriptions.
   These descriptions can be derived via
     1) data characterization, by summarizing the data of the class under study (often called the target class) in general terms,
     2) data discrimination, by comparison of the target class with one or a set of comparative classes (often called the contrasting classes), or
     3) both data characterization and discrimination.
   Data characterization is a summarization of the general characteristics or features of a target class of data.
   The data corresponding to the user-specified class are typically collected by a database query; the output of data characterization can be presented in various forms.
   Examples include pie charts, bar charts, curves, multidimensional data cubes, and multidimensional tables, including crosstabs.
   Data discrimination is a comparison of the general features of target class data objects with the general features of objects from one or a set of contrasting classes.
   The target and contrasting classes can be specified by the user, and the corresponding data objects retrieved through database queries.
   Discrimination descriptions expressed in rule form are referred to as discriminant rules.

 Mining Frequent Patterns, Associations, and Correlations
   Frequent patterns, as the name suggests, are patterns that occur frequently in data. There are many kinds of frequent patterns, including itemsets, subsequences, and substructures.
   A frequent itemset typically refers to a set of items that frequently appear together in a transactional data set, such as computer and software.
   A frequently occurring subsequence, such as the pattern that customers tend to purchase first a PC, followed by a digital camera, and then a memory card, is a sequential pattern.
 Classification and Prediction
   Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown.
   The derived model is based on the analysis of a set of training data (i.e., data objects whose class label is known).
   A decision tree is a flow-chart-like tree structure, where each node denotes a test on an attribute value, each branch represents an outcome of the test, and tree leaves represent classes or class distributions.
   Decision trees can easily be converted to classification rules.
   A neural network, when used for classification, is typically a collection of neuron-like processing units with weighted connections between the units.


   There are many other methods for constructing classification models, such as naïve Bayesian classification, support vector machines, and k-nearest-neighbor classification.
   Whereas classification predicts categorical (discrete, unordered) labels, prediction models continuous-valued functions.
   That is, prediction is used to estimate missing or unavailable numerical data values rather than class labels.
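As an illustration of one of the methods named above, here is a minimal k-nearest-neighbor classifier in Python; the training tuples (feature vector, credit-risk label) are hypothetical, and real use would normalize the features first.

from collections import Counter

training = [((25, 30000), "high_risk"), ((45, 90000), "low_risk"),
            ((35, 60000), "low_risk"),  ((22, 18000), "high_risk")]

def classify(x, k=3):
    # Rank training tuples by squared Euclidean distance to x,
    # then take a majority vote among the k closest labels.
    ranked = sorted(training, key=lambda t: sum((a - b) ** 2 for a, b in zip(t[0], x)))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

print(classify((30, 40000)))   # predicted class label for an unseen tuple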

 Cluster Analysis
   Classification and prediction analyze class-labeled data objects, whereas clustering analyzes data objects without consulting a known class label.

 Outlier Analysis
   A database may contain data objects that do not comply with the general behavior or model of the data.
   These data objects are outliers.
   Most data mining methods discard outliers as noise or exceptions.
   However, in some applications, such as fraud detection, the rare events can be more interesting than the more regularly occurring ones.
   The analysis of outlier data is referred to as outlier mining.

 Interestingness of Patterns
   A data mining system has the potential to generate thousands or even millions of patterns, or rules.
   Are all of the patterns interesting? Typically not; only a small fraction of the patterns potentially generated would actually be of interest to any given user.
   A pattern is interesting if it is 1) easily understood by humans, 2) valid on new or test data with some degree of certainty, 3) potentially useful, and 4) novel. An interesting pattern represents knowledge.
   The objective measures of pattern interestingness are support and confidence.
   Support: support(X => Y) = P(X U Y)
     This is taken to be the probability P(X U Y), where X U Y indicates that a transaction contains both X and Y, that is, the union of itemsets X and Y.
   Another objective measure for association rules is confidence, which assesses the degree of certainty of the detected association.
   This is taken to be the conditional probability P(Y | X), that is, the probability that a transaction containing X also contains Y.
   Confidence: confidence(X => Y) = P(Y | X)
   In general, each interestingness measure is associated with a threshold, which may be controlled by the user.
   The subjective measures of pattern interestingness are based on user beliefs in the data.
   Patterns are interesting if they are unexpected or actionable.
   Patterns are expected if they confirm a hypothesis that the user wished to validate.
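A small sketch (with hypothetical market-basket transactions) of computing the two objective measures above for the rule X => Y:

transactions = [
    {"computer", "software", "printer"},
    {"computer", "software"},
    {"computer", "printer"},
    {"software"},
]

X, Y = {"computer"}, {"software"}
n = len(transactions)

both   = sum(1 for t in transactions if X | Y <= t)   # transactions containing X and Y
only_x = sum(1 for t in transactions if X <= t)       # transactions containing X

support    = both / n          # P(X U Y): fraction of transactions with both X and Y
confidence = both / only_x     # P(Y | X): of those containing X, how many also have Y

print(support, confidence)     # 0.5 and 0.666...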

 Classification of Data Mining Systems
   Data mining is an interdisciplinary field, the confluence of a set of disciplines including database systems, statistics, machine learning, visualization, and information science.
   Depending on the data mining approach used, techniques from other disciplines may be applied, such as neural networks, fuzzy and/or rough set theory, knowledge representation, inductive logic programming, or high-performance computing.
   Depending on the kinds of data to be mined or on the given data mining application, the data mining system may also integrate techniques from spatial data analysis, information retrieval, pattern recognition, image analysis, signal processing, computer graphics, Web technology, economics, business, bioinformatics, or psychology.

 Data Mining Task Primitives


Statistics
   Statistics is the study of the collection, analysis, interpretation, and presentation of data using statistical models. For example, statistics can be used to model noise and missing data, and this model can then be applied to a large data set to identify the noise and missing values in the data.
Machine Learning
   Machine learning (ML) is used to improve performance based on data. The main research area is for computer programs to automatically learn to recognize complex patterns and make intelligent decisions based on the data. Machine learning focuses on accuracy, while data mining focuses on the efficiency and scalability of mining methods on large, complex data sets.
   Types:
   1. Supervised learning: The target data set is known and the machine is trained according to the target values.
   2. Unsupervised learning: The target values are not known and the machines learn by themselves.
   3. Semi-supervised learning: It uses both supervised and unsupervised learning techniques.
   4. Active learning: A machine learning approach that lets users play an active role in the learning process.

Database Systems and Data Warehouses
   Database systems research focuses on the creation, maintenance, and use of databases for organizations and end-users. Data mining tasks handle large data sets of real-time, fast-streaming data. A data warehouse integrates data originating from multiple sources and various timeframes. The data cube model facilitates OLAP in multidimensional databases.
Information Retrieval (IR)
   Information retrieval is the science of searching for documents or for information in documents. It uses two principles:
    The data to be searched is unstructured.
    The queries are formed mainly by keywords.
   By using data analysis and IR, the major topics in a collection of documents, and also the major topics involved in each document, can be discovered.

 A data mining query is defined in terms of the following primitives:
   1) Task-relevant data: This is the database portion to be investigated.
   2) The kinds of knowledge to be mined: This specifies the data mining functions to be performed, such as characterization, discrimination, association, classification, clustering, or evolution analysis.
   3) Background knowledge: Users can specify background knowledge, or knowledge about the domain to be mined.
   4) Interestingness measures: These functions are used to separate uninteresting patterns from knowledge.
   5) Presentation and visualization of discovered patterns: This refers to the form in which discovered patterns are to be displayed.

 Integration of a Data Mining System with a Data Warehouse
   For DB and DW systems, possible integration schemes include no coupling, loose coupling, semi-tight coupling, and tight coupling.
   The data mining subsystem is treated as one functional component of the information system.
   Data mining queries and functions are optimized based on mining query analysis, data structures, indexing schemes, and query processing methods of a DB or DW system.

 Major Issues in Data Mining
Mining Methodology
   As there are diverse applications, new mining tasks continue to emerge. These tasks can use the same database in different ways and require the development of new data mining techniques.
   While searching for knowledge in large data sets, we need to explore a multidimensional space. To find interesting patterns, various combinations of dimensions need to be applied.
   Uncertain, noisy, and incomplete data can sometimes lead to erroneous derivations.
User Interaction
   The data analysis process should be highly interactive; it is important for the mining process to facilitate user interaction.
   Domain knowledge, background knowledge, constraints, etc., should all be incorporated in the data mining process.
   The knowledge discovered by mining the data should be usable by humans. The system should adopt an expressive representation of knowledge, user-friendly visualization techniques, etc.


Efficiency and Scalability
   Data mining algorithms should be efficient and scalable to effectively extract interesting data from a huge amount of data in the data repositories.
   The wide distribution of data and the complexity of computation motivate the development of parallel and distributed data-intensive algorithms.
Diversity of Database Types
   The construction of effective and efficient data analysis tools for diverse applications and a wide spectrum of data types (unstructured data, temporal data, hypertext, multimedia data, and software program code) remains a challenging and active area of research.
   Handling complex types of data
   Mining dynamic, networked, and global data repositories
Data Mining and Society
   The disclosure of data use and the potential violation of individual privacy and protection of rights are areas of concern that need to be addressed.
   Social impacts of data mining
   Privacy-preserving data mining
   Invisible data mining

Data Preprocessing
 Why Preprocess the Data?
   Measures for data quality (a multidimensional view):
   . Accuracy: correct or wrong, accurate or not
   . Completeness: not recorded, unavailable
   . Consistency: some modified but some not, dangling
   . Timeliness: timely update?
   . Believability: how trustworthy is the data?
   . Interpretability: how easily can the data be understood?

 Data Cleaning
   . Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
   . Data cleaning routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.
(i) Missing values
1. Ignore the tuple: This is usually done when the class label is missing (assuming the mining task involves classification or description). This method is not very effective unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably.
2. Fill in the missing value manually: In general, this approach is time-consuming and may not be feasible given a large data set with many missing values.


3. Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like "Unknown". If missing values are replaced by, say, "Unknown", then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common, that of "Unknown". Hence, although this method is simple, it is not recommended.
4. Use the attribute mean to fill in the missing value: For example, suppose that the average income of All Electronics customers is $28,000. Use this value to replace the missing value for income.
5. Use the attribute mean for all samples belonging to the same class as the given tuple: For example, if classifying customers according to credit risk, replace the missing value with the average income value for customers in the same credit risk category as that of the given tuple.
6. Use the most probable value to fill in the missing value: This may be determined with inference-based tools using a Bayesian formalism or decision tree induction. For example, using the other customer attributes in your data set, you may construct a decision tree to predict the missing values for income.
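A small Python sketch of strategies 4 and 5 in spirit, assuming pandas is available; the income values and credit_risk classes are illustrative, not All Electronics data.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "income":      [28000, 31000, np.nan, 52000, np.nan, 40000],
    "credit_risk": ["low", "low", "low", "high", "high", "high"],
})

# strategy 4: fill with the overall attribute mean
overall = df["income"].fillna(df["income"].mean())

# strategy 5: fill with the mean of the tuple's own class (credit_risk)
by_class = df.groupby("credit_risk")["income"].transform(lambda s: s.fillna(s.mean()))

print(overall.tolist())
print(by_class.tolist())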

(ii). Noisy data
Noise is a random error or variance in a measured variable.
1. Binning methods:
I. Binning methods smooth a sorted data value by consulting the "neighborhood", or values around it.
II. The sorted values are distributed into a number of 'buckets', or bins.
III. Because binning methods consult the neighborhood of values, they perform local smoothing.
IV. In the example below, the data for price are first sorted and partitioned into equi-depth bins of depth 3. In smoothing by bin means, each value in a bin is replaced by the mean value of the bin.
V. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original value in this bin is replaced by the value 9. Similarly, smoothing by bin medians can be employed, in which each bin value is replaced by the bin median. In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value.

(i) Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
(ii) Partition into equi-depth bins (of depth 3):
• Bin 1: 4, 8, 15
• Bin 2: 21, 21, 24
• Bin 3: 25, 28, 34


(iii) Smoothing by bin means:
• Bin 1: 9, 9, 9
• Bin 2: 22, 22, 22
• Bin 3: 29, 29, 29
(iv) Smoothing by bin boundaries:
• Bin 1: 4, 4, 15
• Bin 2: 21, 21, 24
• Bin 3: 25, 25, 34
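A small pure-Python sketch that reproduces the smoothing-by-bin-means and bin-boundaries results above for the sorted price data, using equi-depth bins of depth 3.

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
depth = 3

bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]
by_boundaries = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(by_means)        # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_boundaries)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]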

2. Clustering:
I. Outliers may be detected by clustering, where similar values are organized into groups or clusters.
II. Intuitively, values which fall outside of the set of clusters may be considered outliers.
3. Combined computer and human inspection:
I. Outliers may be identified through a combination of computer and human inspection.
II. In one application, for example, an information-theoretic measure was used to help identify outlier patterns in a handwritten character database for classification.
III. The measure's value reflected the "surprise" content of the predicted character label with respect to the known label. Outlier patterns may be informative or "garbage". Patterns whose surprise content is above a threshold are output to a list. A human can then sort through the patterns in the list to identify the actual garbage ones.
4. Regression:
I. Data can be smoothed by fitting the data to a function, such as with regression.
II. Linear regression involves finding the "best" line to fit two variables, so that one variable can be used to predict the other.
III. Multiple linear regression is an extension of linear regression, where more than two variables are involved and the data are fit to a multidimensional surface.

Data Integration
. Integration of multiple databases, data cubes, or files
. Combines data from multiple sources into a coherent store
. Schema integration: e.g., A.cust-id ≡ B.cust-#
o Integrate metadata from different sources
. Entity identification problem:
o Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton


. Detecting and resolving data value conflicts
o For the same real-world entity, attribute values from different sources are different
o Possible reasons: different representations, different scales, e.g., metric vs. British units
Handling Redundancy in Data Integration
 Redundant data often occur when multiple databases are integrated
 Object identification: the same attribute or object may have different names in different databases
 Derivable data: one attribute may be a "derived" attribute in another table, e.g., annual revenue
 Redundant attributes may be detected by correlation analysis and covariance analysis
 Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
Correlation Analysis (Nominal Data)
 Χ2 (chi-square) test
 The larger the Χ2 value, the more likely the variables are related
 The cells that contribute the most to the Χ2 value are those whose actual count is very different from the expected count
 Correlation does not imply causality
 E.g., the # of hospitals and the # of car thefts in a city are correlated
 Both are causally linked to a third variable: population

Chi-Square Calculation: An Example

                           Play chess   Not play chess   Sum (row)
Like science fiction         250 (90)       200 (360)         450
Not like science fiction      50 (210)     1000 (840)        1050
Sum (col.)                        300            1200        1500

 Χ2 (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories):
Χ2 = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93

 It shows that like_science_fiction and play_chess are correlated in the group
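A quick check of the Χ2 value above; the study material does not prescribe a tool, so this sketch assumes SciPy is available (correction=False disables the Yates continuity correction so the result matches the hand calculation).

from scipy.stats import chi2_contingency

observed = [[250, 200],    # like science fiction: play chess / not play chess
            [50, 1000]]    # not like science fiction
chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(expected)            # [[90, 360], [210, 840]]
print(round(chi2, 2))      # ~507.93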

Correlation Analysis (Numeric Data)


 Correlation coefficient (also called Pearson's product moment coefficient):
r(A,B) = Σ(a_i − Ā)(b_i − B̄) / (n σA σB) = ( Σ(a_i b_i) − n Ā B̄ ) / (n σA σB)
where n is the number of tuples, Ā and B̄ are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σ(a_i b_i) is the sum of the AB cross-product.
 If rA,B > 0, A and B are positively correlated (A's values increase as B's do). The higher the value, the stronger the correlation.
 rA,B = 0: independent; rA,B < 0: negatively correlated
 Co-Variance: An Example

 Covariance: Cov(A, B) = E[(A − Ā)(B − B̄)] = Σ(a_i − Ā)(b_i − B̄) / n
 It can be simplified in computation as Cov(A, B) = E(A·B) − Ā·B̄

 Suppose two stocks A and B have the following values over one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14). If the stocks are affected by the same industry trends, will their prices rise or fall together?
 E(A) = (2 + 3 + 5 + 4 + 6)/5 = 20/5 = 4
 E(B) = (5 + 8 + 10 + 11 + 14)/5 = 48/5 = 9.6
 Cov(A, B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4 × 9.6 = 4
Thus, A and B rise together since Cov(A, B) > 0 (a small verification sketch follows the transformation-methods list below).
Data Reduction
. Dimensionality reduction
. Numerosity reduction
. Data compression
Data Transformation
A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values.
. Normalization
. Concept hierarchy generation
 Methods
 Smoothing: remove noise from data
 Attribute/feature construction: new attributes constructed from the given ones
 Aggregation: summarization, data cube construction
 Normalization: scaled to fall within a smaller, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
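The verification sketch mentioned above, assuming NumPy: it recomputes the stock covariance; note that np.cov uses the sample (n−1) denominator by default, so bias=True is passed to match the E(A·B) − Ā·B̄ form used here.

import numpy as np

A = np.array([2, 3, 5, 4, 6])
B = np.array([5, 8, 10, 11, 14])
cov_AB = np.mean(A * B) - A.mean() * B.mean()
print(cov_AB)                          # 4.0 -> prices tend to rise together
print(np.cov(A, B, bias=True)[0, 1])   # same value via numpy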


 Discretization: Concept hierarchy climbing  Normalization, where the attribute data are scaled so as to fall within a small specified range, such as -1.0 to 1.0, or 0 to 1.0. There are three main methods for data normalization : min-max normalization, z-score normalization, and normalization by decimal scaling.

(i). Min-max normalization performs a linear transformation on the original data. Suppose that minA and maxA are the minimum and maximum values of an attribute A. Min-max normalization maps a value v of A to v' in the range [new_minA, new_maxA] by computing
v' = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA

(ii). z-score normalization (or zero-mean normalization): the values for an attribute A are normalized based on the mean and standard deviation of A. A value v of A is normalized to v' by computing
v' = (v − meanA) / std_devA
where meanA and std_devA are the mean and standard deviation, respectively, of attribute A. This method of normalization is useful when the actual minimum and maximum of attribute A are unknown, or when there are outliers that dominate the min-max normalization.

(iii). Normalization by decimal scaling normalizes by moving the decimal point of values of attribute A. The number of decimal places moved depends on the maximum absolute value of A. A value v of A is normalized to v' by computing
v' = v / 10^j
where j is the smallest integer such that max(|v'|) < 1.
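A minimal sketch of the three normalization methods, assuming NumPy; the income values and the target range [0.0, 1.0] are illustrative assumptions.

import numpy as np

v = np.array([12000.0, 28000.0, 54000.0, 73600.0, 98000.0])

# (i) min-max normalization to [new_min, new_max] = [0.0, 1.0]
new_min, new_max = 0.0, 1.0
minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# (ii) z-score normalization
zscore = (v - v.mean()) / v.std()

# (iii) decimal scaling: divide by 10^j, with j the smallest integer so that max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(v).max())))
decimal = v / 10 ** j

print(minmax, zscore, decimal, sep="\n")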

UNIT IV ASSOCIATION RULE MINING AND CLASSIFICATION

4.1. Mining Frequent Patterns, Associations and Correlations: 1) Basic Concepts: Frequent Pattern Analysis  Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set  First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining  Motivation: Finding inherent regularities in data  What products were often purchased together? — Beer and diapers?!  What are the subsequent purchases after buying a PC?  What kinds of DNA are sensitive to this new drug?


 Can we automatically classify web documents?  Applications  Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.

 Basic Concepts: Frequent Patterns  itemset: A set of one or more items

 k-itemset: X = {x1, …, xk}
 (absolute) support, or support count, of X: frequency or number of occurrences of an itemset X
 (relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
 An itemset X is frequent if X's support is no less than a minsup threshold

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk


 Basic Concepts: Association Rules
 Find all the rules X → Y with minimum support and confidence
. support, s: probability that a transaction contains X ∪ Y
. confidence, c: conditional probability that a transaction having X also contains Y
Let minsup = 50%, minconf = 50%
Frequent patterns: Beer: 3, Nuts: 3, Diaper: 4, Eggs: 3, {Beer, Diaper}: 3
 Association rules (many more exist):
. Beer → Diaper (support 60%, confidence 100%)
. Diaper → Beer (support 60%, confidence 75%)
 Frequent Itemset Mining Methods
 Apriori: A Candidate Generation-and-Test Approach
 Improving the Efficiency of Apriori
 FPGrowth: A Frequent Pattern-Growth Approach
 ECLAT: Frequent Pattern Mining with Vertical Data Format
The Downward Closure Property and Scalable Mining Methods
 The downward closure property of frequent patterns
. Any subset of a frequent itemset must be frequent
. If {beer, diaper, nuts} is frequent, so is {beer, diaper}
. i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
 Scalable mining methods: three major approaches
. Apriori
. Frequent pattern growth
. Vertical data format approach
2) The Apriori Algorithm
 Apriori pruning principle: If there is any itemset that is infrequent, its supersets should not be generated or tested.
 Method:
 Initially, scan the DB once to get the frequent 1-itemsets
 Generate length (k+1) candidate itemsets from length k frequent itemsets
 Test the candidates against the DB
 Terminate when no frequent or candidate set can be generated
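Before turning to the Apriori pseudocode, a quick pure-Python check of the support and confidence figures in the example above, using the five transactions from the table.

transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs):
    return support(lhs | rhs) / support(lhs)

print(support({"Beer", "Diaper"}))        # 0.6  -> 60%
print(confidence({"Beer"}, {"Diaper"}))   # 1.0  -> 100%
print(confidence({"Diaper"}, {"Beer"}))   # 0.75 -> 75%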


Ck: candidate itemsets of size k
Lk: frequent itemsets of size k
L1 = {frequent 1-itemsets};
for (k = 1; Lk ≠ ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;

Implementation of Apriori
 Example of candidate generation:
. L3 = {abc, abd, acd, ace, bcd}
. Self-joining: L3 * L3
1) abcd from abc and abd
2) acde from acd and ace
. Pruning:
1) acde is removed because ade is not in L3
. C4 = {abcd}
 Counting the supports of candidates is a problem:
. The total number of candidates can be very large
. One transaction may contain many candidates
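A minimal pure-Python sketch of the self-join and prune steps just illustrated; itemsets are encoded as sorted tuples of item names.

from itertools import combinations

L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]

def apriori_gen(Lk):
    k = len(Lk[0])
    Lk_set = set(Lk)
    candidates = set()
    for p in Lk:
        for q in Lk:
            # join step: first k-1 items equal, last item of p smaller than last item of q
            if p[:k-1] == q[:k-1] and p[k-1] < q[k-1]:
                c = p + (q[k-1],)
                # prune step: every k-subset of c must itself be frequent
                if all(sub in Lk_set for sub in combinations(c, k)):
                    candidates.add(c)
    return sorted(candidates)

print(apriori_gen(L3))   # [('a', 'b', 'c', 'd')] -- acde is pruned because ade is not in L3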

 Method: . Candidate itemsets are stored in a hash-tree . Leaf node of hash-tree contains a list of itemsets and counts . Interior node contains a hash table . Subset function: finds all the candidates contained in a transaction  Major computational challenges . Multiple scans of transaction database . Huge number of candidates . Tedious workload of support counting for candidates  Improving Apriori: general ideas . Reduce passes of transaction database scans


. Shrink the number of candidates
. Facilitate support counting of candidates

3) Mining Various Kinds of Association Rules
 Mining Frequent Patterns Using FP-tree
 Mining Multilevel Association Rules
 Mining Multidimensional Association Rules from Relational Databases and Data Warehouses
Mining Frequent Patterns Using FP-tree
• In general, FP-growth uses a divide-and-conquer strategy
– Recursively grow frequent pattern paths using the FP-tree
– The basic operations are counting and FP-tree building
• Method
– For each item, construct its conditional pattern base, and then its conditional FP-tree
– Repeat the process on each newly created conditional FP-tree
– Until the resulting FP-tree is empty, or it contains only one path (a single path will generate all the combinations of its sub-paths, each of which is a frequent pattern)

Major Steps to Mine FP-tree 1) Construct conditional pattern base for each node in the FP-tree 2) Construct conditional FP-tree from each conditional pattern-base 3) Recursively mine conditional FP-trees and grow frequent patterns obtained so far . If the conditional FP-tree contains a single path, simply enumerate all the patterns


Principles of Frequent Pattern Growth
• Pattern growth property
– Let α be a frequent itemset in DB, B be α's conditional pattern base, and β be an itemset in B. Then α ∪ β is a frequent itemset in DB iff β is frequent in B.
• "abcdef" is a frequent pattern, if and only if
– "abcde" is a frequent pattern, and
– "f" is frequent in the set of transactions containing "abcde"
• The performance of FP-growth is an order of magnitude faster than Apriori, and it is also faster than tree-projection, because of
– No candidate generation, no candidate test
– Use of a compact data structure
– Elimination of repeated database scans

Mining Multilevel Association Rules
Suppose we are given the task-relevant set of transactional data shown in the table for sales at an All Electronics store, listing the items purchased in each transaction, and that the concept hierarchy for the items is shown in the figure.
 A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts.
 Data can be generalized by replacing low-level concepts within the data by their higher-level concepts, or ancestors, from a concept hierarchy.

Figure: concept hierarchy for All Electronics computer items.

 Association rules generated from mining data at multiple levels of abstraction are called multiple-level or multilevel association rules.


 Multilevel association rules can be mined efficiently using concept hierarchies under a support-confidence framework. In general, a top-down strategy is employed; at each level, any algorithm, such as Apriori or one of its variations, can be used.
 Using uniform minimum support for all levels (referred to as uniform support): The same minimum support threshold is used when mining at each level of abstraction. For example, a minimum support threshold of 5% may be used throughout (e.g., for mining from "computer" down to "laptop computer"). Both "computer" and "laptop computer" are found to be frequent, while "desktop computer" is not.
 When a uniform minimum support threshold is used, the search procedure is simplified.
 The method is also simple in that users are required to specify only one minimum support threshold.
 An Apriori-like optimization technique can be adopted, based on the knowledge that an ancestor is a superset of its descendants: this approach avoids examining itemsets containing any item whose ancestors do not have minimum support.

Mining Multidimensional Association Rules from Relational Databases and Data Warehouses
 So far we have considered association rules that involve a single predicate, namely the predicate buys.
 For instance, in mining our All Electronics database, we may discover a Boolean association rule involving only the buys predicate.

 Following the terminology used in multidimensional databases, we refer to each distinct predicate in a rule as a dimension.
 Hence, the rule above is a single-dimensional or intra-dimensional association rule because it contains a single distinct predicate (e.g., buys) with multiple occurrences (i.e., the predicate occurs more than once within the rule).
 Considering each database attribute or warehouse dimension as a predicate, we can therefore mine association rules containing multiple predicates (for example, rules involving age, occupation, and buys).

Mining Multidimensional Association Rules Using Static Discretization of Quantitative Attributes  The database attributes can be categorical or quantitative.


 Categorical attributes have a finite number of possible values, with no ordering among the values.
 Categorical attributes are also called nominal attributes, because their values are "names of things".
 Quantitative attributes are numeric and have an implicit ordering among values.
 Techniques for mining multidimensional association rules can be categorized into two basic approaches regarding the treatment of quantitative attributes.
 In the first approach, quantitative attributes are discretized before mining using predefined concept hierarchies or data discretization techniques, where numeric values are replaced by interval labels.
 Categorical attributes may also be generalized to higher conceptual levels if desired. If the resulting task-relevant data are stored in a relational table, then many of the frequent itemset mining algorithms can be applied.
 Instead of searching on only one attribute like buys, we need to search through all of the relevant attributes, treating each attribute-value pair as an itemset.

In the second approach, we mine quantitative association rules having two quantitative attributes on the left-hand side of the rule and one categorical attribute on the right-hand side. That is,
Aquan1 ∧ Aquan2 ⇒ Acat

where Aquan1 and Aquan2 are tests on quantitative attribute intervals (where the intervals are dynamically determined), and Acat tests a categorical attribute from the task-relevant data. Such rules have been referred to as two-dimensional quantitative association rules. They capture the association between pairs of quantitative attributes, like customer age and income, and the type of television (such as high-definition TV, i.e., HDTV) that customers like to buy. An example of such a 2-D quantitative association rule relates an age interval and an income interval to the purchase of an HDTV.


4.2. Classification and Prediction 4) What Is Classification? What Is Prediction?  Classification: – predicts categorical class labels – classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data  Prediction – models continuous-valued functions, i.e., predicts unknown or missing values

 Typical applications – Credit approval – Target marketing – Medical diagnosis – Fraud detection

Classification: Basic Concepts
 Supervised learning (classification)
. Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations; new data are classified based on the training set
 Unsupervised learning (clustering)
. The class labels of the training data are unknown; given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Classification vs. Numeric Prediction  Classification . predicts categorical class labels (discrete or nominal) . classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data  Numeric Prediction . models continuous-valued functions, i.e., predicts unknown or missing values

Classification—A Two-Step Process  Model construction: describing a set of predetermined classes – Each tuple/ sample is assumed to belong to a predefined class, as


determined by the class label attribute – The set of tuples used for model construction: training set – The model is represented as classification rules, decision trees, or mathematical formulae

 Model usage: for classifying future or unknown objects – Estimate accuracy of the model • The known label of test sample is compared with the classified result from the model • Accuracy rate is the percentage of test set samples that are correctly classified by the model • Test set is independent of training set, otherwise over-fitting will occur

Process (1): Model Construction

Process (2): Using the Model in Prediction


5) Issues Regarding Classification and Prediction  Data cleaning: This refers to the preprocessing of data in order to remove or reduce noise (by applying smoothing techniques, for example) and the treatment of missing values(e.g., by replacing a missing value with the most commonly occurring value for that attribute, or with the most probable value based on statistics).  Relevance analysis: Many of the attributes in the data may be redundant. Correlation analysis can be used to identify whether any two given attributes are statistically related.

 Data transformation and reduction: The data may be transformed by normalization, particularly when neural networks or methods involving distance measurements are used in the learning step. Normalization involves scaling all values for a given attribute so that they fall within a small specified range, such as -1.0 to 1.0, or 0.0 to 1.0; this prevents attributes with large ranges from outweighing attributes with smaller ranges in methods that use distance.
Classification and prediction methods can be compared and evaluated according to the following criteria:
 Accuracy
 Speed
 Robustness
 Scalability
 Interpretability


6) Classification by Decision Tree Induction: An Example
 Training data set: buys_computer

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31...40  high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31...40  low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31...40  medium   no        excellent       yes
31...40  high     yes       fair            yes
>40      medium   no        excellent       no

 The data set follows an example of Quinlan’s ID3 (Playing Tennis)  Resulting tree:

Algorithm for Decision Tree Induction  Basic algorithm (a greedy algorithm) . Tree is constructed in a top-down recursive divide-and-conquer manner . At start, all the training examples are at the root . Attributes are categorical (if continuous-valued, they are discretized in advance)

. Examples are partitioned recursively based on selected attributes . Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

 Conditions for stopping partitioning . All samples for a given node belong to the same class . There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf


. There are no samples left
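A hedged sketch, assuming scikit-learn and pandas are available (the study material does not prescribe a library): fit an entropy-based tree to the buys_computer table above and print the induced tree. Note that scikit-learn's CART algorithm differs from ID3 in details (binary splits on one-hot encoded features).

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

rows = [
    ("<=30","high","no","fair","no"), ("<=30","high","no","excellent","no"),
    ("31...40","high","no","fair","yes"), (">40","medium","no","fair","yes"),
    (">40","low","yes","fair","yes"), (">40","low","yes","excellent","no"),
    ("31...40","low","yes","excellent","yes"), ("<=30","medium","no","fair","no"),
    ("<=30","low","yes","fair","yes"), (">40","medium","yes","fair","yes"),
    ("<=30","medium","yes","excellent","yes"), ("31...40","medium","no","excellent","yes"),
    ("31...40","high","yes","fair","yes"), (">40","medium","no","excellent","no"),
]
df = pd.DataFrame(rows, columns=["age","income","student","credit_rating","buys_computer"])
X = pd.get_dummies(df.drop(columns="buys_computer"))   # one-hot encode the categorical attributes
y = df["buys_computer"]

clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(clf, feature_names=list(X.columns)))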

7) Bayesian Classification: Bayes' Theorem
 A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
 Foundation: based on Bayes' Theorem
 Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable with decision tree and selected neural network classifiers
 Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
 Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
 Total probability theorem: P(B) = Σ_{i=1..M} P(B | A_i) P(A_i)
 Bayes' Theorem:
. Let X be a data sample ("evidence") whose class label is unknown
. Let H be the hypothesis that X belongs to class C
. Classification is to determine P(H|X), the posterior probability that the hypothesis holds given the observed data sample X
. P(H) (prior probability): the initial probability, e.g., that X will buy a computer, regardless of age, income, etc.
. P(X): the probability that the sample data is observed
. P(X|H) (likelihood): the probability of observing the sample X given that the hypothesis holds, e.g., given that X will buy a computer, the probability that X is 31...40 with medium income

Prediction Based on Bayes' Theorem
 Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:
P(H|X) = P(X|H) P(H) / P(X)
 Informally, this can be viewed as: posterior = likelihood × prior / evidence

 Predicts X belongs to Ci if the probability P(Ci|X) is the highest among all the P(Ck|X) for all the k classes  Practical difficulty: It requires initial knowledge of many probabilities, involving significant computational cost


8) Naïve Bayesian Classification
 A simplifying assumption: attributes are conditionally independent (i.e., there is no dependence relation between attributes):
P(X | Ci) = Π_{k=1..n} P(x_k | Ci) = P(x_1 | Ci) × P(x_2 | Ci) × ... × P(x_n | Ci)
 This greatly reduces the computation cost: only the class distribution needs to be counted

 If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk for Ak divided by |Ci, D| (# of tuples of Ci in D)

 If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:
g(x, μ, σ) = (1 / (√(2π) σ)) · e^( −(x − μ)² / (2σ²) )
and P(xk|Ci) = g(xk, μ_Ci, σ_Ci)
Example (using the buys_computer training data above):
Classes: C1: buys_computer = 'yes'; C2: buys_computer = 'no'
Data to be classified: X = (age <= 30, income = medium, student = yes, credit_rating = fair)
 P(Ci): P(buys_computer = "yes") = 9/14 = 0.643; P(buys_computer = "no") = 5/14 = 0.357

 Compute P(X|Ci) for each class P (age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222 P (age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6 P (income = “medium” | buys_computer = “yes”) = 4/9 = 0.444 P (income = “medium” | buys_computer = “no”) = 2/5 = 0.4 P (student = “yes” | buys_computer = “yes) = 6/9 = 0.667 P (student = “yes” | buys_computer = “no”) = 1/5 = 0.2 P (credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667 P (credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4  X = (age <= 30 , income = medium, student = yes, credit_rating = fair)

P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044 P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019

P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028 P (X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007

Therefore, X belongs to class (“buys_computer = yes”)
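A minimal pure-Python sketch that reproduces the calculation above by counting conditional frequencies; the 14 tuples are the buys_computer training table from this unit.

from collections import Counter

data = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30","high","no","fair","no"), ("<=30","high","no","excellent","no"),
    ("31...40","high","no","fair","yes"), (">40","medium","no","fair","yes"),
    (">40","low","yes","fair","yes"), (">40","low","yes","excellent","no"),
    ("31...40","low","yes","excellent","yes"), ("<=30","medium","no","fair","no"),
    ("<=30","low","yes","fair","yes"), (">40","medium","yes","fair","yes"),
    ("<=30","medium","yes","excellent","yes"), ("31...40","medium","no","excellent","yes"),
    ("31...40","high","yes","fair","yes"), (">40","medium","no","excellent","no"),
]
X = ("<=30", "medium", "yes", "fair")           # tuple to classify

class_counts = Counter(row[-1] for row in data)
scores = {}
for c, n_c in class_counts.items():
    p = n_c / len(data)                          # prior P(Ci)
    for k, value in enumerate(X):                # naive conditional-independence assumption
        n_match = sum(1 for row in data if row[-1] == c and row[k] == value)
        p *= n_match / n_c                       # P(xk | Ci)
    scores[c] = p

print(scores)                                    # yes ~ 0.028, no ~ 0.007
print("predicted class:", max(scores, key=scores.get))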


 Advantages . Easy to implement . Good results obtained in most of the cases  Disadvantages . Assumption: class conditional independence, therefore loss of accuracy . Practically, dependencies exist among variables E.g., hospitals: patients: Profile: age, family history, etc. Symptoms: fever, cough etc., Disease: lung cancer, diabetes, etc. Dependencies among these cannot be modeled by Naïve Bayes Classifier

9) Rule-Based Classification: Using IF-THEN Rules for Classification
 Represent the knowledge in the form of IF-THEN rules
R: IF age = youth AND student = yes THEN buys_computer = yes
. Rule antecedent/precondition vs. rule consequent
 Assessment of a rule: coverage and accuracy
. ncovers = # of tuples covered by R
. ncorrect = # of tuples correctly classified by R
coverage(R) = ncovers / |D|   /* D: training data set */
accuracy(R) = ncorrect / ncovers
 If more than one rule is triggered, conflict resolution is needed
. Size ordering: assign the highest priority to the triggering rule that has the "toughest" requirement (i.e., with the most attribute tests)
. Class-based ordering: decreasing order of prevalence or misclassification cost per class
. Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts
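A small sketch of rule assessment for R: IF age = youth AND student = yes THEN buys_computer = yes, using a reduced (age, student, buys_computer) view of the same training table (age groups written as youth/middle/senior for readability).

D = [("youth","no","no"), ("youth","no","no"), ("middle","no","yes"),
     ("senior","no","yes"), ("senior","yes","yes"), ("senior","yes","no"),
     ("middle","yes","yes"), ("youth","no","no"), ("youth","yes","yes"),
     ("senior","yes","yes"), ("youth","yes","yes"), ("middle","no","yes"),
     ("middle","yes","yes"), ("senior","no","no")]

covers  = [t for t in D if t[0] == "youth" and t[1] == "yes"]   # tuples matching the antecedent
correct = [t for t in covers if t[2] == "yes"]                  # of those, correctly classified
print("coverage =", len(covers) / len(D))        # n_covers / |D| = 2/14
print("accuracy =", len(correct) / len(covers))  # n_correct / n_covers = 2/2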

10) Rule Extraction from a Decision Tree  Rules are easier to understand than large trees  One rule is created for each path from the root to a leaf  Each attribute-value pair along a path forms a conjunction: the leaf holds the class prediction  Rules are mutually exclusive and exhaustive


Example: Rule extraction from our buys_computer decision-tree IF age = young AND student = no THEN buys_computer = no IF age = young AND student = yes THEN buys_computer = yes IF age = mid-age THEN buys_computer = yes IF age = old AND credit_rating = excellent THEN buys_computer = no IF age = old AND credit_rating = fair THEN buys_computer = yes

4.3 Classification by Backpropagation: 11) A Multilayer Feed-Forward Neural Network

 The inputs to the network correspond to the attributes measured for each training tuple  Inputs are fed simultaneously into the units making up the input layer  They are then weighted and fed simultaneously to a hidden layer  The number of hidden layers is arbitrary, although usually only one  The weighted outputs of the last hidden layer are input to units making up the output layer, which emits the network's prediction  The network is feed-forward: None of the weights cycles back to an input unit or to an output unit of a previous layer


 From a statistical point of view, networks perform nonlinear regression: given enough hidden units and enough training samples, they can closely approximate any function
12) Defining a Network Topology
 Decide the network topology: specify the # of units in the input layer, the # of hidden layers (if > 1), the # of units in each hidden layer, and the # of units in the output layer
 Normalize the input values for each attribute measured in the training tuples to [0.0, 1.0]
 One input unit per domain value, each initialized to 0
 Output: for classification with more than two classes, one output unit per class is used
 If a network has been trained and its accuracy is unacceptable, repeat the training process with a different network topology or a different set of initial weights
13) Backpropagation
 Iteratively process a set of training tuples and compare the network's prediction with the actual known target value
 For each training tuple, the weights are modified to minimize the mean squared error between the network's prediction and the actual target value
 Modifications are made in the "backwards" direction: from the output layer, through each hidden layer, down to the first hidden layer, hence "backpropagation"
 Steps:
. Initialize weights to small random numbers, along with associated biases
. Propagate the inputs forward (by applying the activation function)
. Backpropagate the error (by updating weights and biases)
. Check the terminating condition (when the error is very small, etc.)
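A minimal sketch of one backpropagation update, assuming NumPy; the 2-3-1 topology, sigmoid activations, squared error, learning rate, and data values are illustrative assumptions, not values from the material.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([0.4, 0.7])        # one training tuple (already normalized to [0, 1])
t = np.array([1.0])             # known target value
W1 = rng.normal(scale=0.1, size=(3, 2)); b1 = np.zeros(3)   # input -> hidden
W2 = rng.normal(scale=0.1, size=(1, 3)); b2 = np.zeros(1)   # hidden -> output
lr = 0.5                        # learning rate

# propagate the inputs forward
h = sigmoid(W1 @ x + b1)
o = sigmoid(W2 @ h + b2)

# backpropagate the error (delta rule for sigmoid units)
delta_o = (t - o) * o * (1 - o)               # output-layer error term
delta_h = h * (1 - h) * (W2.T @ delta_o)      # hidden-layer error term

# update weights and biases in the "backwards" direction
W2 += lr * np.outer(delta_o, h); b2 += lr * delta_o
W1 += lr * np.outer(delta_h, x); b1 += lr * delta_h

print("prediction before this update:", o)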

14) Prediction: Linear Regression Prediction  (Numerical) prediction is similar to classification . construct a model


. use the model to predict a continuous or ordered value for a given input
 Prediction is different from classification
. Classification predicts a categorical class label
. Prediction models continuous-valued functions
 Major method for prediction: regression
. model the relationship between one or more independent or predictor variables and a dependent or response variable
 Regression analysis
. Linear and multiple regression
. Non-linear regression
. Other regression methods: generalized linear model, Poisson regression, log-linear models, regression trees
 Linear regression: involves a response variable y and a single predictor variable x
y = w0 + w1 x
where w0 (y-intercept) and w1 (slope) are regression coefficients
 Method of least squares: estimates the best-fitting straight line
. Multiple linear regression: involves more than one predictor variable
. Training data is of the form (X1, y1), (X2, y2), …, (X|D|, y|D|)
. E.g., for 2-D data, we may have: y = w0 + w1 x1 + w2 x2
. Solvable by an extension of the least squares method or using SAS, S-Plus
. Many nonlinear functions can be transformed into the above
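A minimal least-squares sketch for y = w0 + w1 x, assuming NumPy; the x and y values are illustrative, not from the text.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # predictor (e.g., years of experience)
y = np.array([30.0, 35.0, 42.0, 48.0, 55.0])  # response (e.g., salary in $1000s)

w1, w0 = np.polyfit(x, y, deg=1)              # slope and intercept minimizing squared error
print(w0, w1)
print("prediction for x = 6:", w0 + w1 * 6)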

Support Vector Machines (SVM)
Let the data D be (X1, y1), …, (X|D|, y|D|), where Xi is the set of training tuples with associated class labels yi.
There are infinitely many lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes classification error on unseen data).
SVM searches for the hyperplane with the largest margin, i.e., the maximum marginal hyperplane (MMH).
SVM—When Data Is Linearly Separable
 A separating hyperplane can be written as


W ● X + b = 0 Where W={w1, w2, …, wn} is a weight vector and b a scalar (bias)  For 2-D it can be written as w0 + w1 x1 + w2 x2 = 0  The hyperplane defining the sides of the margin: H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and H2: w0 + w1 x1 + w2 x2 ≤ – 1 for yi = –1  Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are support vectors  This becomes a constrained (convex) quadratic optimization problem: Quadratic objective function and linear constraints  Quadratic Programming (QP)  Lagrangian multipliers SVM is Effective on High Dimensional Data  The complexity of trained classifier is characterized by the # of support vectors rather than the dimensionality of the data  The support vectors are the essential or critical training examples —they lie closest to the decision boundary (MMH)  If all other training examples are removed and the training is repeated, the same separating hyperplane would be found  The number of support vectors found can be used to compute an (upper) bound on the expected error rate of the SVM classifier, which is independent of the data dimensionality  Thus, an SVM with a small number of support vectors can have good generalization, even when the dimensionality of the data is high
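A hedged sketch, assuming scikit-learn is available: a linear SVC on separable toy data. The learned coef_ and intercept_ correspond to W and b in W ● X + b = 0, and support_vectors_ are the training tuples lying closest to the decision boundary.

import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [1, 2],      # class -1
              [5, 5], [6, 5], [5, 6]])     # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C approximates a hard margin
print(clf.coef_, clf.intercept_)              # W and b of the separating hyperplane
print(clf.support_vectors_)                   # tuples defining the margin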

15) Nonlinear Regression
 Some nonlinear models can be modeled by a polynomial function
 A polynomial regression model can be transformed into a linear regression model. For example,
y = w0 + w1 x + w2 x² + w3 x³
is convertible to linear form with the new variables x2 = x², x3 = x³:
y = w0 + w1 x + w2 x2 + w3 x3


 Other functions, such as the power function, can also be transformed into a linear model
 Some models are intractably nonlinear (e.g., a sum of exponential terms); for these it is still possible to obtain least-squares estimates through extensive calculation on more complex formulae.
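A minimal sketch of the transformation just described, assuming NumPy: a cubic model fit by ordinary least squares after introducing x2 = x² and x3 = x³ as new variables (the data are synthetic, generated from a known cubic so the recovered coefficients can be checked).

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 + 1.5 * x - 0.8 * x**2 + 0.3 * x**3                 # known cubic, for checking

X = np.column_stack([np.ones_like(x), x, x**2, x**3])     # design matrix [1, x, x2, x3]
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(w, 3))                                     # ~[2.0, 1.5, -0.8, 0.3]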

UNIT V CLUSTER ANALYSIS

1) What Is Cluster Analysis?  Cluster: A collection of data objects  similar (or related) to one another within the same group  dissimilar (or unrelated) to the objects in other groups  Cluster analysis (or clustering, data segmentation, …)  Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters  Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning by examples: supervised)  Typical applications  As a stand-alone tool to get insight into data distribution  As a preprocessing step for other algorithms


Clustering as a Preprocessing Tool (Utility) for  Summarization:  Preprocessing for regression, PCA, classification, and association analysis  Compression:  Image processing: vector quantization  Finding K-nearest Neighbours  Localizing search to one or a small number of clusters  Outlier detection  Outliers are often viewed as those “far away” from any cluster

2) Categorization of Major Clustering Methods
The major clustering methods are:
 Partitioning approach:
 Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
 Typical methods: k-means, k-medoids, CLARANS
 Hierarchical approach:
 Create a hierarchical decomposition of the set of data (or objects) using some criterion
 Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
 Density-based approach:
 Based on connectivity and density functions
 Typical methods: DBSCAN, OPTICS, DenClue
 Grid-based approach:
 Based on a multiple-level granularity structure
 Typical methods: STING, WaveCluster, CLIQUE
 Model-based:
 A model is hypothesized for each of the clusters and the method tries to find the best fit of that model to the data
 Typical methods: EM, SOM, COBWEB
 Frequent pattern-based:
 Based on the analysis of frequent patterns
 Typical methods: p-Cluster
 User-guided or constraint-based:
 Clustering by considering user-specified or application-specific constraints
 Typical methods: COD (obstacles), constrained clustering
 Link-based clustering:
 Objects are often linked together in various ways
 Massive links can be used to cluster objects: SimRank, LinkClus


3) Partitioning Methods : K-Means Clustering Method

 Given k, the k-means algorithm is implemented in four steps:  Partition objects into k nonempty subsets  Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the centre, i.e., mean point, of the cluster)  Assign each object to the cluster with the nearest seed point  Go back to Step 2, stop when the assignment does not change
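A minimal sketch of the four steps above on 1-D toy data with k = 2, assuming NumPy; production implementations (e.g., scikit-learn's KMeans) add better seeding and handle empty clusters.

import numpy as np

X = np.array([1.0, 2.0, 3.0, 8.0, 9.0, 10.0])
k = 2
centroids = X[:k].copy()                     # step 1: initial seeds (one trivial partition)

while True:
    # step 3: assign each object to the cluster with the nearest seed point
    labels = np.array([np.argmin(np.abs(centroids - x)) for x in X])
    # step 2: recompute each centroid as the mean point of its cluster
    new_centroids = np.array([X[labels == j].mean() for j in range(k)])
    if np.allclose(new_centroids, centroids):  # step 4: stop when the assignment is stable
        break
    centroids = new_centroids

print(labels, centroids)                      # [0 0 0 1 1 1] [2. 9.]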


 In general, k-means often terminates at a local optimum.
 Advantage: efficient: O(tkn), where n is the # of objects, k is the # of clusters, and t is the # of iterations. Normally, k, t << n.
 For comparison: PAM: O(k(n−k)²), CLARA: O(ks² + k(n−k))
 Limitations:
 Applicable only to objects in a continuous n-dimensional space
 Use the k-modes method for categorical data
 In comparison, k-medoids can be applied to a wider range of data
 Need to specify k, the number of clusters, in advance (there are ways to automatically determine the best k; see Hastie et al., 2009)
 Sensitive to noisy data and outliers
 Not suitable for discovering clusters with non-convex shapes

4) Hierarchical Methods: Agglomerative and Divisive Hierarchical Clustering  Use distance matrix as clustering criteria. This method does not require the number of clusters k as an input, but needs a termination condition


AGNES (Agglomerative Nesting):  Introduced in Kaufmann and Rousseeuw (1990)  Implemented in statistical packages, e.g., Splus  Use the single-link method and the dissimilarity matrix  Merge nodes that have the least dissimilarity  Go on in a non-descending fashion  Eventually all nodes belong to the same cluster

Dendrogram: . Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram . A clustering of the data objects is obtained by cutting the dendrogram at the desired level, then each connected component forms a cluster

DIANA (Divisive Analysis)  Introduced in Kaufmann and Rousseeuw (1990)  Implemented in statistical analysis packages, e.g., Splus  Inverse order of AGNES  Eventually each node forms a cluster on its own
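A hedged sketch, assuming SciPy is available: agglomerative (AGNES-style) clustering of toy 2-D points with the single-link method, then cutting the dendrogram into a desired number of clusters.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 1], [1.5, 1], [5, 5], [5, 5.5], [9, 9]])

Z = linkage(X, method="single")                  # bottom-up merging of least-dissimilar nodes
print(Z)                                         # each row: the two clusters merged and their distance
print(fcluster(Z, t=2, criterion="maxclust"))    # cut the dendrogram into 2 clusters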

Distance between Clusters  Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = min(tip, tjq)  Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = max(tip, tjq)  Average: avg distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = avg(tip, tjq)  Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)


 Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj), where a medoid is a chosen, centrally located object in the cluster
Centroid, Radius and Diameter of a Cluster (for numerical data sets)
 Centroid: the "middle" of a cluster
Cm = ( Σ_{i=1..N} t_ip ) / N
 Radius: square root of the average distance from any point of the cluster to its centroid
Rm = sqrt( Σ_{i=1..N} (t_ip − c_m)² / N )

 Diameter: square root of the average mean squared distance between all pairs of points in the cluster
Dm = sqrt( Σ_{i=1..N} Σ_{j=1..N} (t_ip − t_jq)² / (N(N − 1)) )

5) Density-Based Methods-DBSCAN  Clustering based on density (local cluster criterion), such as density-connected points

 Major features:  Discover clusters of arbitrary shape  Handle noise  One scan  Need density parameters as termination condition  Several interesting studies:  DBSCAN: Ester, et al. (KDD’96)  OPTICS: Ankerst, et al (SIGMOD’99).  DENCLUE: Hinneburg & D. Keim (KDD’98)  CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based)

Density-Based Clustering:  Two parameters:  Eps: Maximum radius of the neighbourhood


 MinPts: Minimum number of points in an Eps-neighbourhood of that point  NEps(p): {q belongs to D | dist(p,q) ≤ Eps}  Directly density-reachable: A point p is directly density-reachable from a point q w.r.t. Eps, MinPts if

 p belongs to NEps(q)  core point condition: |NEps (q)| ≥ MinPts

 Density-reachable:  A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn, p1 = q, pn = p such that pi+1 is directly density- reachable from pi  Density-connected  A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both, p and q are density-reachable from o w.r.t. Eps and MinPts DBSCAN: Density-Based Spatial Clustering  Relies on a density-based notion of cluster: A cluster is defined as a maximal set of density-connected points  Discovers clusters of arbitrary shape in spatial databases with noise


Algorithm:  Arbitrarily select a point p  Retrieve all points density-reachable from p w.r.t. Eps and MinPts  If p is a core point, a cluster is formed  If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database  Continue the process until all of the points have been processed
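A hedged sketch, assuming scikit-learn is available: DBSCAN on toy 2-D points, where eps and min_samples play the roles of the Eps and MinPts parameters above; points labelled -1 are treated as noise.

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 1], [1.2, 1.1], [0.9, 1.0],      # dense group 1
              [8, 8], [8.1, 8.2], [7.9, 8.0],      # dense group 2
              [4, 15]])                             # an isolated point

labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)        # e.g. [0 0 0 1 1 1 -1]; -1 marks noise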

6) Data Mining Applications

 Data mining: A young discipline with broad and diverse applications  There still exists a nontrivial gap between generic data mining methods and effective and scalable data mining tools for domain-specific applications  Some application domains (briefly discussed here)  Data Mining for Financial data analysis  Data Mining for Retail and Telecommunication Industries  Data Mining in Science and Engineering  Data Mining for Intrusion Detection and Prevention  Data Mining and Recommender Systems

Data Mining for Financial Data Analysis  Financial data collected in banks and financial institutions are often relatively complete, reliable, and of high quality  Design and construction of data warehouses for multidimensional data analysis and data mining  View the debt and revenue changes by month, by region, by sector, and by other factors  Access statistical information such as max, min, total, average, trend, etc.  Loan payment prediction/consumer credit policy analysis  feature selection and attribute relevance ranking  Loan payment performance  Consumer credit rating  Classification and clustering of customers for targeted marketing  multidimensional segmentation by nearest-neighbor, classification, decision trees, etc. to identify customer groups or associate a new customer with an appropriate customer group  Detection of money laundering and other financial crimes  integration of data from multiple DBs (e.g., bank transactions, federal/state crime history DBs)


 Tools: data visualization, linkage analysis, classification, clustering tools, outlier analysis, and sequential pattern analysis tools (find unusual access sequences)

Data Mining for Retail and Telecommunication Industries  Retail industry: huge amounts of data on sales, customer shopping history, e-commerce, etc.  Applications of retail data mining  Identify customer buying behaviours  Discover customer shopping patterns and trends  Improve the quality of customer service  Achieve better customer retention and satisfaction  Enhance goods consumption ratios  Design more effective goods transportation and distribution policies  Telecommunication and many other industries: share many similar goals and expectations of retail data mining  Data Mining Practice for Retail Industry  Design and construction of data warehouses  Multidimensional analysis of sales, customers, products, time, and region  Analysis of the effectiveness of sales campaigns  Customer retention: analysis of customer loyalty  Use customer loyalty card information to register sequences of purchases of particular customers  Use sequential pattern mining to investigate changes in customer consumption or loyalty  Suggest adjustments on the pricing and variety of goods  Product recommendation and cross-referencing of items  Fraud analysis and the identification of unusual patterns  Use of visualization tools in data analysis

Data Mining in Science and Engineering  Data warehouses and data preprocessing  Resolving inconsistencies or incompatible data collected in diverse environments and different periods (e.g. eco-system studies)  Mining complex data types  Spatiotemporal, biological, diverse semantics and relationships  Graph-based and network-based mining  Links, relationships, data flow, etc.  Visualization tools and domain-specific knowledge  Other issues  Data mining in social sciences and social studies: text and social media


 Data mining in computer science: monitoring systems, software bugs, network intrusion

Data Mining for Intrusion Detection and Prevention  The majority of intrusion detection and prevention systems use  Signature-based detection: use signatures, attack patterns that are preconfigured and predetermined by domain experts  Anomaly-based detection: build profiles (models of normal behavior) and detect those that deviate substantially from the profiles  Where data mining can help  New data mining algorithms for intrusion detection  Association, correlation, and discriminative pattern analysis help select and build discriminative classifiers  Analysis of stream data: outlier detection, clustering, model shifting  Distributed data mining  Visualization and querying tools

Data Mining and Recommender Systems  Recommender systems: personalization, making product recommendations that are likely to be of interest to a user  Approaches: content-based, collaborative, or their hybrid  Content-based: recommends items that are similar to items the user preferred or queried in the past  Collaborative filtering: considers a user's social environment, i.e., the opinions of other customers who have similar tastes or preferences  Data mining and recommender systems  Users C × items S: use the known ratings to predict unknown ratings for user-item combinations  Memory-based methods often use a k-nearest-neighbour approach  Model-based methods use a collection of ratings to learn a model (e.g., probabilistic models, clustering, Bayesian networks, etc.)  Hybrid approaches integrate both to improve performance (e.g., using an ensemble)


Figure: Data mining application domains: financial data analysis, retail and telecommunication industries, science and engineering, intrusion detection and prevention, and recommender systems.
